Comparing API

Once records have been blocked together into pairs, we can run pairwise comparisons on them.

All of the comparing APIs revolve around the PComparer protocol. A PComparer is simply a function that takes a table of record pairs (eg with columns suffixed with _l and _r) and returns a modified version of that table. For example, it could add a column with match scores, add rows that were missed during the initial blocking, or remove rows that we no longer want to consider as matched.

`mismo.compare.PComparer`

Bases: Protocol

A Callable that adds column(s) of features to a table of record pairs.

`mismo.compare.PComparer.__call__(pairs: ir.Table, **kwargs) -> ir.Table`

Add column(s) of features to a table of record pairs.

For example, add a match score to each record pair, modify a score from a previous PComparer, or similar.

Implementers must expect to be called with a table of record pairs. Columns suffixed with "_l" come from the left table, columns suffixed with "_r" come from the right table, and columns with neither suffix are features of the pair itself (eg from a different PComparer).

Level-Based Comparers

Bin record pairs into discrete levels, based on levels of agreement.

Each LevelComparer represents a dimension, such as name, location, price, date, etc. Each one contains many MatchLevels, each of which is a level of agreement, such as exact, misspelling, within_1_km, etc.

`mismo.compare.MatchLevel`

An enum-like class for match levels.

This class is used to define the levels of agreement between two records.

Examples:

>>> import ibis
>>> from mismo.compare import MatchLevel
>>> class NameMatchLevel(MatchLevel):
...     EXACT = 0
...     NEAR = 1
...     ELSE = 2
...

The class acts as a container:

>>> len(NameMatchLevel)
3
>>> 2 in NameMatchLevel
True
>>> list(NameMatchLevel)
['EXACT', 'NEAR', 'ELSE']

You can access the hardcoded values:

>>> str(NameMatchLevel.EXACT)
'EXACT'
>>> int(NameMatchLevel.EXACT)
0

You can use indexing semantics to translate between strings and ints:

>>> NameMatchLevel[1]
'NEAR'
>>> NameMatchLevel["NEAR"]
1
>>> NameMatchLevel[ibis.literal(1)].execute()
'NEAR'
>>> NameMatchLevel[ibis.literal("NEAR")].execute()
np.int8(1)

You can construct your own values, both from python literals...

>>> NameMatchLevel("NEAR").as_integer()
1
>>> NameMatchLevel(2).as_string()
'ELSE'
>>> NameMatchLevel(3)
Traceback (most recent call last):
...
ValueError: Invalid value: 3. Must be one of {0, 1, 2}

...and from Ibis expressions:

>>> import ibis
>>> levels_raw = ibis.array([0, 2, 1, 99]).unnest()
>>> levels = NameMatchLevel(levels_raw)
>>> levels.as_string().execute()
0    EXACT
1     ELSE
2     NEAR
3     None
Name: NameMatchLevel, dtype: object
>>> levels.as_integer().name("levels").execute()
0     0
1     2
2     1
3    99
Name: levels, dtype: int8

Comparisons work as you expect:

>>> NameMatchLevel.NEAR == 1
True
>>> NameMatchLevel(1) == "NEAR"
True
>>> (levels_raw == NameMatchLevel.NEAR).name("eq").execute()
0    False
1    False
2     True
3    False
Name: eq, dtype: bool

However, implicit ordering is not supported (file an issue if you think it should be):

>>> NameMatchLevel.NEAR > 0
Traceback (most recent call last):
...
TypeError: '>' not supported between instances of 'NameMatchLevel' and 'int'

`mismo.compare.MatchLevel.__eq__(other: int | str | ir.NumericValue | ir.StringValue | MatchLevel) -> bool | ir.BooleanValue`

`mismo.compare.MatchLevel.__init__(value: MatchLevel | int | str | ir.StringValue | ir.IntegerValue)`

Create a new match level value.

If the given value is a python int or str, it is checked against the valid values for this class. If it is an ibis expression, we do no such check.

PARAMETER	DESCRIPTION
`value`	The value of the match level. TYPE: `MatchLevel \| int \| str \| StringValue \| IntegerValue`

`mismo.compare.MatchLevel.as_integer() -> int | ir.IntegerValue`

Convert to a python int or ibis integer, depending on the original type.

`mismo.compare.MatchLevel.as_string() -> str | ir.StringValue`

Convert to a python str or ibis string, depending on the original type.

`mismo.compare.LevelComparer`

Assigns a MatchLevel to record pairs based on one dimension, e.g. name.

`mismo.compare.LevelComparer.cases: tuple[tuple[ir.BooleanColumn, MatchLevelT], ...]` `instance-attribute`

The cases to check for each level.

`mismo.compare.LevelComparer.levels: Type[MatchLevelT]` `instance-attribute`

The levels of agreement.

`mismo.compare.LevelComparer.name: str` `instance-attribute`

The name of the comparer, eg "date", "address", "latlon", "price".

`mismo.compare.LevelComparer.representation: Literal['string', 'integer'] = 'integer'` `class-attribute` `instance-attribute`

The native representation of the levels in ibis expressions.

Integers are more performant, but strings are more human-readable.

`mismo.compare.LevelComparer.__call__(pairs: ir.Table, *, representation: Literal['string', 'integer'] | None = None) -> ir.StringColumn | ir.IntegerColumn`

Label each record pair with the level that it matches.

Go through the levels in order. If a record pair matches a level, label it with that level. If none of the levels match a pair, it is labeled as "else".

PARAMETER	DESCRIPTION
`pairs`	A table of record pairs. TYPE: `Table`

RETURNS	DESCRIPTION
`labels`	The labels for each record pair. TYPE: `StringColumn \| IntegerColumn`
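The first-match-wins rule described above can be illustrated with a plain-Python sketch. This is not how LevelComparer is implemented (it works on Ibis expressions, not dicts); the `label_pair` helper, the lambda conditions, and the sample pairs are hypothetical and only show the ordering semantics.

```python
from typing import Callable

# Each case is a (condition, level) tuple, checked in order.
Case = tuple[Callable[[dict], bool], str]


def label_pair(pair: dict, cases: list[Case]) -> str:
    """Return the level of the first case whose condition holds."""
    for condition, level in cases:
        if condition(pair):
            return level
    # No case matched, so fall back to the catch-all level.
    return "else"


cases: list[Case] = [
    (lambda p: p["name_l"] == p["name_r"], "exact"),
    (lambda p: p["name_l"].lower() == p["name_r"].lower(), "near"),
]

pairs = [
    {"name_l": "Ann", "name_r": "Ann"},
    {"name_l": "Ann", "name_r": "ann"},
    {"name_l": "Ann", "name_r": "Bob"},
]

labels = [label_pair(p, cases) for p in pairs]
print(labels)  # ['exact', 'near', 'else']
```

Note that the first pair also satisfies the "near" condition, but it is labeled "exact" because that case comes first. This is why cases should be listed from strictest to loosest.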

Plotting

`mismo.compare.compared_dashboard(compared: ir.Table, comparers: Iterable[LevelComparer], weights: Weights | None = None, *, width: int = 500) -> ipywidgets.VBox`

A dashboard for debugging compared record pairs.

Used to see which match levels are common, which are rare, and which Comparers are related to each other. For example, exact matches should tend to appear together across all Comparers, and pairs where they do are probably true matches.

PARAMETER	DESCRIPTION
`compared`	The result of running the blocked table through the supplied `comparers`. TYPE: `Table`
`comparers`	The LevelComparers that were used to compare `compared`. TYPE: `Iterable[LevelComparer]`
`weights`	The Weights used to score the comparers. If provided, the chart will be colored by the odds found from the Weights. TYPE: `Weights \| None` DEFAULT: `None`
`width`	The width of the chart. TYPE: `int` DEFAULT: `500`

RETURNS	DESCRIPTION
`VBox`	The dashboard.
