Comparing API

Once records are blocked together into pairs, we can run pairwise comparisons on them.

All of the APIs revolve around the PComparer protocol. A PComparer is simply a function that takes a table of record pairs (eg with columns suffixed with _l and _r) and returns a modified version of this table. For example, it could add a column with match scores, add rows that were missed during the initial blocking, or remove rows that we no longer want to consider as matched.

mismo.compare.PComparer

Bases: Protocol

A Callable that adds column(s) of features to a table of record pairs.

mismo.compare.PComparer.__call__(pairs: ir.Table, **kwargs) -> ir.Table

Add column(s) of features to a table of record pairs.

For example, add a match score to each record pair, modify a score from a previous PComparer, or similar.

Implementers must expect to be called with a table of record pairs. Columns suffixed with "_l" come from the left table, columns suffixed with "_r" come from the right table, and columns with neither suffix are features of the pair itself (eg from a different PComparer).
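The shape of this protocol can be sketched in plain Python. This is only a toy model: a list of dicts stands in for ir.Table, and the names Pairs, PairComparer, add_name_exact, add_score, and run_comparers are hypothetical, not part of mismo.

```python
from typing import Protocol

# Assumption: a list of dict rows stands in for an ibis table of record pairs.
Pairs = list[dict[str, object]]


class PairComparer(Protocol):
    """Toy analogue of the protocol: pairs in, modified pairs out."""

    def __call__(self, pairs: Pairs, **kwargs: object) -> Pairs: ...


def add_name_exact(pairs: Pairs, **kwargs: object) -> Pairs:
    # Reads the "_l" and "_r" columns and adds a feature of the pair itself.
    return [{**row, "name_exact": row["name_l"] == row["name_r"]} for row in pairs]


def add_score(pairs: Pairs, **kwargs: object) -> Pairs:
    # Builds on a column (no suffix) that an earlier comparer added.
    return [{**row, "score": 1.0 if row["name_exact"] else 0.0} for row in pairs]


def run_comparers(pairs: Pairs, comparers: list[PairComparer]) -> Pairs:
    # Each comparer receives the output of the previous one.
    for comparer in comparers:
        pairs = comparer(pairs)
    return pairs


pairs = [
    {"name_l": "Ann", "name_r": "Ann"},
    {"name_l": "Ann", "name_r": "Anne"},
]
compared = run_comparers(pairs, [add_name_exact, add_score])
```

Because every comparer has the same signature, they can be chained in any order in which each one's input columns already exist.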

Level-Based Comparers

Bin record pairs into discrete levels, based on levels of agreement.

Each LevelComparer represents a dimension, such as name, location, price, date, etc. Each one contains many MatchLevels, each of which is a level of agreement, such as exact, misspelling, within_1_km, etc.

mismo.compare.MatchLevel

An enum-like class for match levels.

This class is used to define the levels of agreement between two records.

Examples:

>>> from mismo.compare import MatchLevel
>>> class NameMatchLevel(MatchLevel):
...     EXACT = 0
...     NEAR = 1
...     ELSE = 2
...

The class acts as a container:

>>> len(NameMatchLevel)
3
>>> 2 in NameMatchLevel
True
>>> list(NameMatchLevel)
['EXACT', 'NEAR', 'ELSE']

You can access the hardcoded values:

>>> str(NameMatchLevel.EXACT)
'EXACT'
>>> int(NameMatchLevel.EXACT)
0

You can use indexing semantics to translate between strings and ints:

>>> NameMatchLevel[1]
'NEAR'
>>> NameMatchLevel["NEAR"]
1
>>> import ibis
>>> NameMatchLevel[ibis.literal(1)].execute()
'NEAR'
>>> NameMatchLevel[ibis.literal("NEAR")].execute()
np.int8(1)

You can construct your own values, both from python literals...

>>> NameMatchLevel("NEAR").as_integer()
1
>>> NameMatchLevel(2).as_string()
'ELSE'
>>> NameMatchLevel(3)
Traceback (most recent call last):
...
ValueError: Invalid value: 3. Must be one of {0, 1, 2}

...and from Ibis expressions:

>>> import ibis
>>> levels_raw = ibis.array([0, 2, 1, 99]).unnest()
>>> levels = NameMatchLevel(levels_raw)
>>> levels.as_string().execute()
0    EXACT
1     ELSE
2     NEAR
3     None
Name: NameMatchLevel, dtype: object
>>> levels.as_integer().name("levels").execute()
0     0
1     2
2     1
3    99
Name: levels, dtype: int8

Comparisons work as you expect:

>>> NameMatchLevel.NEAR == 1
True
>>> NameMatchLevel(1) == "NEAR"
True
>>> (levels_raw == NameMatchLevel.NEAR).name("eq").execute()
0    False
1    False
2     True
3    False
Name: eq, dtype: bool

However, implicit ordering is not supported (file an issue if you think it should be):

>>> NameMatchLevel.NEAR > 0
Traceback (most recent call last):
...
TypeError: '>' not supported between instances of 'NameMatchLevel' and 'int'

mismo.compare.MatchLevel.__eq__(other: int | str | ir.NumericValue | ir.StringValue | MatchLevel) -> bool | ir.BooleanValue

mismo.compare.MatchLevel.__init__(value: MatchLevel | int | str | ir.StringValue | ir.IntegerValue)

Create a new match level value.

If the given value is a python int or str, it is checked against the valid values for this class. If it is an ibis expression, we do no such check.

PARAMETER DESCRIPTION
value

The value of the match level.

TYPE: MatchLevel | int | str | StringValue | IntegerValue
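To make the int/str duality and the validation of Python values concrete, here is a toy stand-in in plain Python. It is not mismo's implementation: ToyLevels and its check method are hypothetical, and it ignores Ibis expressions entirely (which, per the description above, are not validated).

```python
class ToyLevels:
    """Hypothetical stand-in mapping level names to ints and back."""

    def __init__(self, **levels: int) -> None:
        self._str_to_int = dict(levels)
        self._int_to_str = {value: name for name, value in levels.items()}

    def __getitem__(self, key):
        # A str key translates to its int; an int key translates to its str.
        if isinstance(key, str):
            return self._str_to_int[key]
        return self._int_to_str[key]

    def check(self, value):
        # Python literals are validated against the known levels.
        valid = self._str_to_int if isinstance(value, str) else self._int_to_str
        if value not in valid:
            raise ValueError(f"Invalid value: {value!r}. Must be one of {set(valid)}")
        return value


name_levels = ToyLevels(EXACT=0, NEAR=1, ELSE=2)
near_as_int = name_levels["NEAR"]
near_as_str = name_levels[1]
```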

mismo.compare.MatchLevel.as_integer() -> int | ir.IntegerValue

Convert to a python int or ibis integer, depending on the original type.

mismo.compare.MatchLevel.as_string() -> str | ir.StringValue

Convert to a python str or ibis string, depending on the original type.

mismo.compare.LevelComparer

Assigns a MatchLevel to record pairs based on one dimension, e.g. name.

mismo.compare.LevelComparer.cases: tuple[tuple[ir.BooleanColumn, MatchLevelT], ...] instance-attribute

The cases to check for each level.

mismo.compare.LevelComparer.levels: Type[MatchLevelT] instance-attribute

The levels of agreement.

mismo.compare.LevelComparer.name: str instance-attribute

The name of the comparer, eg "date", "address", "latlon", "price".

mismo.compare.LevelComparer.representation: Literal['string', 'integer'] = 'integer' class-attribute instance-attribute

The native representation of the levels in ibis expressions.

Integers are more performant, but strings are more human-readable.

mismo.compare.LevelComparer.__call__(pairs: ir.Table, *, representation: Literal['string', 'integer'] | None = None) -> ir.StringColumn | ir.IntegerColumn

Label each record pair with the level that it matches.

Go through the levels in order. If a record pair matches a level, label it with that level. If none of the levels match a pair, it is labeled as "else".

PARAMETER DESCRIPTION
pairs

A table of record pairs.

TYPE: Table

RETURNS DESCRIPTION
labels

The labels for each record pair.

TYPE: StringColumn | IntegerColumn
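The "first matching level wins, otherwise else" rule can be sketched in plain Python. This is a minimal illustration under assumptions: dict rows stand in for record pairs, plain callables stand in for the boolean conditions in cases, and label_pair and name_cases are hypothetical names, not mismo API.

```python
from typing import Callable

Row = dict[str, str]
# Each case pairs a condition with the level to assign when it holds.
Case = tuple[Callable[[Row], bool], str]


def label_pair(row: Row, cases: list[Case], default: str = "ELSE") -> str:
    # Check the cases in order; the first condition that holds decides the label.
    for condition, level in cases:
        if condition(row):
            return level
    # No case matched, so fall back to the default level.
    return default


name_cases: list[Case] = [
    (lambda r: r["name_l"] == r["name_r"], "EXACT"),
    (lambda r: r["name_l"].lower() == r["name_r"].lower(), "NEAR"),
]

pairs = [
    {"name_l": "Ann", "name_r": "Ann"},
    {"name_l": "ann", "name_r": "Ann"},
    {"name_l": "Bob", "name_r": "Ann"},
]
labels = [label_pair(row, name_cases) for row in pairs]
```

Order matters here: the first pair also satisfies the case-insensitive condition, but it is labeled by the stricter case because that case comes first.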

Plotting

mismo.compare.compared_dashboard(compared: ir.Table, comparers: Iterable[LevelComparer], weights: Weights | None = None, *, width: int = 500) -> ipywidgets.VBox

A dashboard for debugging compared record pairs.

Used to see which match levels are common, which are rare, and which Comparers are related to each other. For example, exact matches should appear together across all Comparers; such pairs probably represent true matches.

PARAMETER DESCRIPTION
compared

The result of running the blocked table through the supplied comparers.

TYPE: Table

comparers

The LevelComparers that were used to produce compared.

TYPE: Iterable[LevelComparer]

weights

The Weights used to score the comparers. If provided, the chart will be colored by the odds found from the Weights.

TYPE: Weights | None DEFAULT: None

width

The width of the chart.

TYPE: int DEFAULT: 500

RETURNS DESCRIPTION
VBox

The dashboard.
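The kind of question the dashboard helps answer (which combinations of levels are common or rare) can be approximated by counting level combinations. The following plain-Python sketch is not how compared_dashboard works internally; the data and the names compared, comparer_names, and combo_counts are made up for illustration.

```python
from collections import Counter

# Assumption: each row holds the level that each comparer assigned to one pair.
compared = [
    {"name": "EXACT", "address": "EXACT"},
    {"name": "EXACT", "address": "EXACT"},
    {"name": "NEAR", "address": "ELSE"},
    {"name": "ELSE", "address": "ELSE"},
    {"name": "ELSE", "address": "ELSE"},
    {"name": "ELSE", "address": "ELSE"},
]
comparer_names = ["name", "address"]

# Count how often each combination of levels occurs across the comparers.
combo_counts = Counter(
    tuple(row[name] for name in comparer_names) for row in compared
)

# List combinations from most to least common.
for combo, count in combo_counts.most_common():
    print(dict(zip(comparer_names, combo)), count)
```

In this toy data, pairs that are EXACT on name are also EXACT on address, which is the co-occurrence pattern the description above associates with likely true matches.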