Comparing API
Once records are blocked together into pairs, we can run pairwise comparisons on them.
All of the APIs revolve around the PComparer protocol.
This is simply a function which takes a table of record pairs
(eg with columns suffixed with _l and _r), and returns a modified
version of this table. For example, it could add a column with match scores,
add rows that were missed during the initial blocking, or remove rows
that we no longer want to consider as matched.
mismo.compare.PComparer
Bases: Protocol
A Callable that adds column(s) of features to a table of record pairs.
mismo.compare.PComparer.__call__(pairs: ir.Table, **kwargs) -> ir.Table
Add column(s) of features to a table of record pairs.
For example, add a match score to each record pair, modify a score from a previous PComparer, or similar.
Implementers must expect to be called with a table of record pairs. Columns suffixed with "_l" come from the left table, columns suffixed with "_r" come from the right table, and columns with neither suffix are features of the pair itself (eg from a different PComparer).
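To make the calling convention concrete, here is a minimal sketch of a callable with the PComparer shape. It is not mismo code: plain Python lists of dicts stand in for the ibis table of record pairs, and the names `Pairs`, `PComparerLike`, and `add_name_exact` are hypothetical and exist only for this illustration.

```python
from typing import Protocol

# Assumption: a list of dicts stands in for an ibis table of record pairs.
Pairs = list[dict[str, object]]


class PComparerLike(Protocol):
    """Hypothetical stand-in mirroring the shape of PComparer.__call__."""

    def __call__(self, pairs: Pairs, **kwargs) -> Pairs: ...


def add_name_exact(pairs: Pairs, **kwargs) -> Pairs:
    """Add a pair-level feature column: do the left and right names agree?"""
    return [{**row, "name_exact": row["name_l"] == row["name_r"]} for row in pairs]


pairs: Pairs = [
    {"record_id_l": 1, "record_id_r": 2, "name_l": "Ann", "name_r": "Ann"},
    {"record_id_l": 1, "record_id_r": 3, "name_l": "Ann", "name_r": "Bob"},
]

# A plain function with the right signature satisfies the protocol.
comparer: PComparerLike = add_name_exact
compared = comparer(pairs)
print([row["name_exact"] for row in compared])
```

The new column has neither suffix, so under the convention above it reads as a feature of the pair itself, and the input rows are left unmodified.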
Level-Based Comparers
Bin record pairs into discrete levels, based on levels of agreement.
Each LevelComparer represents a dimension, such as name, location, price, date, etc. Each one contains many MatchLevels, each of which is a level of agreement, such as exact, misspelling, within_1_km, etc.
mismo.compare.MatchLevel
An enum-like class for match levels.
This class is used to define the levels of agreement between two records.
Examples:
>>> import ibis
>>> from mismo.compare import MatchLevel
>>> class NameMatchLevel(MatchLevel):
... EXACT = 0
... NEAR = 1
... ELSE = 2
...
The class acts as a container:
>>> len(NameMatchLevel)
3
>>> 2 in NameMatchLevel
True
>>> list(NameMatchLevel)
['EXACT', 'NEAR', 'ELSE']
You can access the hardcoded values:
>>> str(NameMatchLevel.EXACT)
'EXACT'
>>> int(NameMatchLevel.EXACT)
0
You can use indexing semantics to translate between strings and ints:
>>> NameMatchLevel[1]
'NEAR'
>>> NameMatchLevel["NEAR"]
1
>>> NameMatchLevel[ibis.literal(1)].execute()
'NEAR'
>>> NameMatchLevel[ibis.literal("NEAR")].execute()
np.int8(1)
You can construct your own values, both from python literals...
>>> NameMatchLevel("NEAR").as_integer()
1
>>> NameMatchLevel(2).as_string()
'ELSE'
>>> NameMatchLevel(3)
Traceback (most recent call last):
...
ValueError: Invalid value: 3. Must be one of {0, 1, 2}
...and from Ibis expressions:
>>> import ibis
>>> levels_raw = ibis.array([0, 2, 1, 99]).unnest()
>>> levels = NameMatchLevel(levels_raw)
>>> levels.as_string().execute()
0 EXACT
1 ELSE
2 NEAR
3 None
Name: NameMatchLevel, dtype: object
>>> levels.as_integer().name("levels").execute()
0 0
1 2
2 1
3 99
Name: levels, dtype: int8
Comparisons work as you expect:
>>> NameMatchLevel.NEAR == 1
True
>>> NameMatchLevel(1) == "NEAR"
True
>>> (levels_raw == NameMatchLevel.NEAR).name("eq").execute()
0 False
1 False
2 True
3 False
Name: eq, dtype: bool
However, implicit ordering is not supported (file an issue if you think it should be):
>>> NameMatchLevel.NEAR > 0
Traceback (most recent call last):
...
TypeError: '>' not supported between instances of 'NameMatchLevel' and 'int'
mismo.compare.MatchLevel.__eq__(other: int | str | ir.NumericValue | ir.StringValue | MatchLevel) -> bool | ir.BooleanValue
mismo.compare.MatchLevel.__init__(value: MatchLevel | int | str | ir.StringValue | ir.IntegerValue)
Create a new match level value.
If the given value is a python int or str, it is checked against the valid values for this class. If it is an ibis expression, we do no such check.
PARAMETER | DESCRIPTION |
---|---|
value | The value of the match level. |
mismo.compare.MatchLevel.as_integer() -> int | ir.IntegerValue
Convert to a python int or ibis integer, depending on the original type.
mismo.compare.MatchLevel.as_string() -> str | ir.StringValue
Convert to a python str or ibis string, depending on the original type.
mismo.compare.LevelComparer
Assigns a MatchLevel to record pairs based on one dimension, e.g. name.
mismo.compare.LevelComparer.cases: tuple[tuple[ir.BooleanColumn, MatchLevelT], ...]
instance-attribute
The cases to check for each level.
mismo.compare.LevelComparer.levels: Type[MatchLevelT]
instance-attribute
The levels of agreement.
mismo.compare.LevelComparer.name: str
instance-attribute
The name of the comparer, eg "date", "address", "latlon", "price".
mismo.compare.LevelComparer.representation: Literal['string', 'integer'] = 'integer'
class-attribute
instance-attribute
The native representation of the levels in ibis expressions.
Integers are more performant, but strings are more human-readable.
mismo.compare.LevelComparer.__call__(pairs: ir.Table, *, representation: Literal['string', 'integer'] | None = None) -> ir.StringColumn | ir.IntegerColumn
Label each record pair with the level that it matches.
Go through the levels in order. If a record pair matches a level, label it with that level. If none of the levels match a pair, it is labeled as "else".
PARAMETER | DESCRIPTION |
---|---|
pairs | A table of record pairs. |
RETURNS | DESCRIPTION |
---|---|
labels | The labels for each record pair. |
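The first-match-wins rule described above can be sketched in plain Python. This is an illustration of the ordering logic only, not LevelComparer's implementation (which operates on ibis expressions); `label_pair` and `name_cases` are hypothetical names, and dicts stand in for rows of the pairs table.

```python
from typing import Callable

Row = dict[str, str]
# Each case pairs a condition with the level to assign when it holds.
Case = tuple[Callable[[Row], bool], str]


def label_pair(row: Row, cases: list[Case], default: str = "else") -> str:
    """Return the level of the first case whose condition matches the row."""
    for condition, level in cases:
        if condition(row):
            return level
    return default


# Order matters: the strictest condition comes first.
name_cases: list[Case] = [
    (lambda r: r["name_l"] == r["name_r"], "exact"),
    (lambda r: r["name_l"].lower() == r["name_r"].lower(), "near"),
]

rows = [
    {"name_l": "Ann", "name_r": "Ann"},
    {"name_l": "Ann", "name_r": "ANN"},
    {"name_l": "Ann", "name_r": "Bob"},
]
labels = [label_pair(r, name_cases) for r in rows]
print(labels)
```

Because an exact match also satisfies the case-insensitive condition, swapping the two cases would label every exact match as "near".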
Plotting
mismo.compare.compared_dashboard(compared: ir.Table, comparers: Iterable[LevelComparer], weights: Weights | None = None, *, width: int = 500) -> ipywidgets.VBox
A dashboard for debugging compared record pairs.
Used to see which match levels are common, which are rare, and which Comparers are related to each other. For example, exact matches should appear together across all Comparers; such pairs probably represent true matches.
PARAMETER | DESCRIPTION |
---|---|
compared | The result of running the blocked table through the supplied comparers. |
comparers | The LevelComparers that were used in the comparison. |
weights | The Weights used to score the comparers. If provided, the chart will be colored by the odds found from the Weights. |
width | The width of the chart. |
RETURNS | DESCRIPTION |
---|---|
VBox | The dashboard. |
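The kind of question the dashboard helps answer, namely which combinations of levels are common and which are rare, can be approximated with a plain tally. The sketch below is not how compared_dashboard is implemented; it only counts level combinations over hypothetical rows, where each key is assumed to be the name of a comparer and each value the level it assigned.

```python
from collections import Counter

# Hypothetical compared pairs: one level per comparer ("name", "address").
compared_rows = [
    {"name": "exact", "address": "exact"},
    {"name": "exact", "address": "exact"},
    {"name": "near", "address": "else"},
    {"name": "else", "address": "else"},
    {"name": "else", "address": "else"},
    {"name": "else", "address": "else"},
]
comparer_names = ["name", "address"]

# Count how often each combination of levels occurs across the comparers.
combo_counts = Counter(
    tuple(row[name] for name in comparer_names) for row in compared_rows
)

# Print combinations from most to least frequent.
for combo, count in combo_counts.most_common():
    print(dict(zip(comparer_names, combo)), count)
```

In this toy data, exact levels co-occur on both comparers, which is the pattern the description above associates with likely true matches.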