Comparing API

Once records are blocked together into pairs, we can run pairwise comparisons on them.

All of the APIs revolve around the PComparer protocol. This is simply a function that takes a table of record pairs (e.g. with columns suffixed with _l and _r) and returns a modified version of this table. For example, it could add a column with match scores, add rows that were missed during the initial blocking, or remove rows that we no longer want to consider as matched.

mismo.compare.PComparer

Bases: Protocol

A Callable that adds column(s) of features to a table of record pairs.

mismo.compare.PComparer.__call__

__call__(pairs: Table, **kwargs) -> Table

Add column(s) of features to a table of record pairs.

For example, add a match score to each record pair, modify a score from a previous PComparer, or similar.

Implementers must expect to be called with a table of record pairs. Columns suffixed with "_l" come from the left table, columns suffixed with "_r" come from the right table, and columns with neither suffix are features of the pair itself (e.g. from a different PComparer).
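The call shape described above can be sketched in plain Python. This is only an illustration of the idea, not mismo's implementation: to stay runnable without ibis, a table of record pairs is modeled as a list of dicts, and `Pairs`, `PairComparer`, `add_name_match`, and `drop_non_matches` are hypothetical names invented for this sketch. With mismo, the argument and return value would be ibis Tables.

```python
from typing import Protocol

# Stand-in for an ibis Table: one dict per record pair (assumption for this sketch).
Pairs = list[dict[str, object]]


class PairComparer(Protocol):
    """Hypothetical plain-Python mirror of the PComparer call shape."""

    def __call__(self, pairs: Pairs, **kwargs) -> Pairs: ...


def add_name_match(pairs: Pairs, **kwargs) -> Pairs:
    """Add a pair-level feature column; the _l and _r columns are left as-is."""
    return [{**p, "name_match": p["name_l"] == p["name_r"]} for p in pairs]


def drop_non_matches(pairs: Pairs, **kwargs) -> Pairs:
    """Remove pairs that an earlier comparer marked as non-matching."""
    return [p for p in pairs if p["name_match"]]


pairs = [
    {"name_l": "alice", "name_r": "alice"},
    {"name_l": "bob", "name_r": "rob"},
]

# Comparers compose: each one receives the output of the previous one.
comparers: list[PairComparer] = [add_name_match, drop_non_matches]
result = pairs
for comparer in comparers:
    result = comparer(result)
```

Because every comparer takes pairs in and returns pairs out, a later comparer can read the columns that an earlier one added, as `drop_non_matches` does with `name_match` here.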

Level-Based Comparers

Bin record pairs into discrete levels of agreement.

Each EnumComparer represents a dimension, such as name, location, price, or date. Each one uses an IbisEnum whose members are the levels of agreement, such as exact, misspelling, or within_1_km.

mismo.compare.EnumComparer

Bases: Generic[IbisEnumT]

Assigns an IbisEnum-backed level to record pairs based on one dimension.

mismo.compare.EnumComparer.cases `instance-attribute`

cases: tuple[tuple[BooleanValue | bool, IbisEnumT], ...]

The cases to check for each level.

mismo.compare.EnumComparer.levels `instance-attribute`

levels: type[IbisEnumT]

The levels of agreement.

mismo.compare.EnumComparer.name `instance-attribute`

name: str

The name of the comparer, e.g. "date", "address", "latlon", "price".

mismo.compare.EnumComparer.representation `class-attribute` `instance-attribute`

representation: Literal['string', 'integer'] = 'integer'

The native representation of the levels in ibis expressions.

Integers are more performant, but strings are more human-readable.

mismo.compare.EnumComparer.__call__

__call__(
    pairs: Table,
    *,
    representation: Literal["string", "integer"]
    | None = None,
) -> Table

Label each record pair with the level that it matches.

Go through the levels in order. If a record pair matches a level, label it with that level. If none of the levels match a pair, it is labeled as "else".

PARAMETER	DESCRIPTION
`pairs`	A table of record pairs. TYPE: `Table`

RETURNS	DESCRIPTION
`labels`	The input table with an additional column named `self.name` that contains the level that each record pair matches. TYPE: `Table`
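The "first matching level wins, otherwise else" behavior can be sketched in plain Python. This is an analogue under stated assumptions, not mismo's code: record pairs are modeled as dicts rather than an ibis Table, the levels use the standard-library `IntEnum` rather than an IbisEnum, and `NameLevel`, `CASES`, `one_substitution`, and `label_pairs` are hypothetical names for illustration.

```python
from enum import IntEnum


class NameLevel(IntEnum):
    """Hypothetical levels of agreement for a "name" dimension."""

    EXACT = 0
    MISSPELLING = 1
    ELSE = 2


def one_substitution(a: str, b: str) -> bool:
    """True if the strings have equal length and differ in at most one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


# (condition, level) cases, checked in order. An exact pair also satisfies
# one_substitution, so the ordering is what makes it land in EXACT.
CASES = [
    (lambda p: p["name_l"] == p["name_r"], NameLevel.EXACT),
    (lambda p: one_substitution(p["name_l"], p["name_r"]), NameLevel.MISSPELLING),
]


def label_pairs(pairs, name="name", representation="integer"):
    """Return copies of the pairs with an extra column `name` holding the level."""
    labeled = []
    for pair in pairs:
        level = NameLevel.ELSE  # fallback when no case matches
        for condition, candidate in CASES:
            if condition(pair):
                level = candidate
                break  # first matching level wins
        value = int(level) if representation == "integer" else level.name
        labeled.append({**pair, name: value})
    return labeled


pairs = [
    {"name_l": "alice", "name_r": "alice"},
    {"name_l": "alice", "name_r": "alicw"},
    {"name_l": "alice", "name_r": "bob"},
]
int_labels = [p["name"] for p in label_pairs(pairs)]
str_labels = [p["name"] for p in label_pairs(pairs, representation="string")]
```

The `representation` argument in this sketch mirrors the trade-off noted for the attribute of the same name: the integer form is compact, while the string form is easier to read when inspecting results.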

Plotting

mismo.compare.compared_dashboard

compared_dashboard(
    compared: Table,
    comparers: Iterable[EnumComparer],
    weights: Weights | None = None,
    *,
    width: int = 500,
) -> VBox

A dashboard for debugging compared record pairs.

Used to see which match levels are common, which are rare, and which Comparers are related to each other. For example, exact matches should appear together across all Comparers; such pairs probably represent true matches.

PARAMETER	DESCRIPTION
`compared`	The result of running the blocked table through the supplied `comparers`. TYPE: `Table`
`comparers`	The EnumComparers that were used to compare `compared`. TYPE: `Iterable[EnumComparer]`
`weights`	The Weights used to score the comparers. If provided, the chart will be colored by the odds found from the Weights. TYPE: `Weights | None` DEFAULT: `None`
`width`	The width of the chart. TYPE: `int` DEFAULT: `500`

RETURNS	DESCRIPTION
`VBox`	The dashboard.
