Comparing API
Once records are blocked together into pairs, we can compare them pairwise.
All of the APIs revolve around the PComparer protocol.
This is simply a function which takes a table of record pairs
(eg with columns suffixed with _l and _r) and returns a modified
version of this table. For example, it could add a column with match scores,
add rows that were missed during the initial blocking, or remove rows
that we no longer want to consider as matched.
mismo.compare.PComparer
Bases: Protocol
A Callable that adds column(s) of features to a table of record pairs.
mismo.compare.PComparer.__call__
__call__(pairs: Table, **kwargs) -> Table
Add column(s) of features to a table of record pairs.
For example, add a match score to each record pair, modify a score from a previous PComparer, or similar.
Implementers must expect to be called with a table of record pairs. Columns suffixed with "_l" come from the left table, columns suffixed with "_r" come from the right table, and columns with neither suffix are features of the pair itself (eg from a different PComparer).
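The contract above can be illustrated with a minimal sketch. It is not mismo code: it assumes a plain list of dicts as a stand-in for an ibis Table, and the names `Pairs`, `PairComparer`, `add_name_score`, and `drop_non_matches` are hypothetical. It only shows the shape of the protocol: each comparer takes the pairs and returns a modified version, so comparers can be chained.

```python
from typing import Any, Protocol

# Stand-in for a table of record pairs: one dict per pair.
# Assumption: real PComparers receive and return ibis Tables instead.
Pairs = list[dict[str, Any]]


class PairComparer(Protocol):
    """Hypothetical plain-Python analogue of the PComparer protocol."""

    def __call__(self, pairs: Pairs, **kwargs: Any) -> Pairs: ...


def add_name_score(pairs: Pairs, **kwargs: Any) -> Pairs:
    """Add a `name_score` feature: 1.0 if the names agree exactly, else 0.0."""
    return [
        {**pair, "name_score": 1.0 if pair["name_l"] == pair["name_r"] else 0.0}
        for pair in pairs
    ]


def drop_non_matches(pairs: Pairs, *, threshold: float = 0.5, **kwargs: Any) -> Pairs:
    """Remove pairs whose score from a previous comparer is below the threshold."""
    return [pair for pair in pairs if pair["name_score"] >= threshold]


pairs = [
    {"record_id_l": 1, "record_id_r": 7, "name_l": "Ann Lee", "name_r": "Ann Lee"},
    {"record_id_l": 2, "record_id_r": 9, "name_l": "Bob Roy", "name_r": "Rob Boy"},
]

# Comparers compose: each one consumes the output of the previous one.
comparers: list[PairComparer] = [add_name_score, drop_non_matches]
compared = pairs
for comparer in comparers:
    compared = comparer(compared)
```

Note that `name_score` has neither suffix, which marks it as a feature of the pair itself rather than of the left or right record.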
Level-Based Comparers
Bin record pairs into discrete levels, based on levels of agreement.
Each EnumComparer represents a dimension, such as name, location, price, date, etc. Each one uses an IbisEnum whose members are the levels of agreement, such as exact, misspelling, within_1_km, etc.
mismo.compare.EnumComparer
Bases: Generic[IbisEnumT]
Assigns an IbisEnum-backed level to record pairs based on one dimension.
mismo.compare.EnumComparer.cases
instance-attribute
The cases to check for each level.
mismo.compare.EnumComparer.levels
instance-attribute
levels: type[IbisEnumT]
The levels of agreement.
mismo.compare.EnumComparer.name
instance-attribute
name: str
The name of the comparer, eg "date", "address", "latlon", "price".
mismo.compare.EnumComparer.representation
class-attribute
instance-attribute
representation: Literal['string', 'integer'] = 'integer'
The native representation of the levels in ibis expressions.
Integers are more performant, but strings are more human-readable.
mismo.compare.EnumComparer.__call__
__call__(
pairs: Table,
*,
representation: Literal["string", "integer"]
| None = None,
) -> Table
Label each record pair with the level that it matches.
Go through the levels in order. If a record pair matches a level, label it. If none of the levels match a pair, it is labeled as "else".
| PARAMETER | DESCRIPTION |
|---|---|
| pairs | A table of record pairs. TYPE: Table |

| RETURNS | DESCRIPTION |
|---|---|
| labels | The input table with an additional column named … TYPE: Table |
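The "first matching level wins, otherwise else" rule can be sketched in plain Python. This is an illustration under stated assumptions, not the EnumComparer implementation: `NameLevel`, `NAME_CASES`, and `label_pairs` are hypothetical names, a standard-library `IntEnum` stands in for an IbisEnum, dicts stand in for table rows, and naming the output column after the comparer is an assumption as well.

```python
from enum import IntEnum
from typing import Any, Callable

Pair = dict[str, Any]


class NameLevel(IntEnum):
    """Hypothetical levels of agreement for a "name" dimension."""

    EXACT = 0
    SAME_IGNORING_CASE = 1
    ELSE = 2


# Cases are checked in order; the first one that matches wins.
NAME_CASES: list[tuple[NameLevel, Callable[[Pair], bool]]] = [
    (NameLevel.EXACT, lambda p: p["name_l"] == p["name_r"]),
    (
        NameLevel.SAME_IGNORING_CASE,
        lambda p: p["name_l"].lower() == p["name_r"].lower(),
    ),
]


def label_pairs(pairs, *, name="name", cases=NAME_CASES, representation="integer"):
    """Return the pairs with an extra column holding the matched level."""
    labeled = []
    for pair in pairs:
        # Fall back to ELSE when no case matches.
        level = next((lvl for lvl, matches in cases if matches(pair)), NameLevel.ELSE)
        value = int(level) if representation == "integer" else level.name.lower()
        labeled.append({**pair, name: value})
    return labeled


pairs = [
    {"name_l": "Ann Lee", "name_r": "Ann Lee"},
    {"name_l": "ann lee", "name_r": "Ann Lee"},
    {"name_l": "Bob Roy", "name_r": "Ann Lee"},
]
as_integers = label_pairs(pairs)
as_strings = label_pairs(pairs, representation="string")
```

The two calls differ only in `representation`: the integer form is compact, while the string form is easier to read when inspecting results by hand.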
Plotting
mismo.compare.compared_dashboard
compared_dashboard(
compared: Table,
comparers: Iterable[EnumComparer],
weights: Weights | None = None,
*,
width: int = 500,
) -> VBox
A dashboard for debugging compared record pairs.
Used to see which match levels are common, which are rare, and which Comparers are related to each other. For example, exact matches should appear together across all Comparers; such pairs probably represent true matches.
| PARAMETER | DESCRIPTION |
|---|---|
| compared | The result of running the blocked table through the supplied … TYPE: Table |
| comparers | The EnumComparers that were used to compare … TYPE: Iterable[EnumComparer] |
| weights | The Weights used to score the comparers. If provided, the chart will be colored by the odds found from the Weights. TYPE: Weights \| None |
| width | The width of the chart. TYPE: int |

| RETURNS | DESCRIPTION |
|---|---|
| VBox | The dashboard. |
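To make concrete the kind of question the dashboard helps answer, here is a rough sketch that tallies how often each combination of levels occurs across comparers. It does not reproduce how compared_dashboard works internally; the sample rows, the `comparer_names` list, and the string levels are made up for illustration.

```python
from collections import Counter

# Hypothetical compared pairs: one level per comparer, in string representation.
compared = [
    {"name": "exact", "address": "exact"},
    {"name": "exact", "address": "exact"},
    {"name": "exact", "address": "exact"},
    {"name": "else", "address": "else"},
    {"name": "else", "address": "else"},
    {"name": "exact", "address": "else"},
]
comparer_names = ["name", "address"]

# Count each combination of levels; frequent co-occurrence of "exact"
# across comparers hints at true matches.
combo_counts = Counter(
    tuple(pair[name] for name in comparer_names) for pair in compared
)
most_common_combo, most_common_count = combo_counts.most_common(1)[0]
```

Rare combinations, such as an exact name with a non-matching address, are often the ones worth inspecting by hand.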