Comparing API

Once records are blocked together into pairs, we can do pairwise comparisons on them.

All of the APIs revolve around the PComparer protocol. This is simply a function that takes a table of record pairs (e.g. with columns suffixed with _l and _r) and returns a modified version of this table. For example, it could add a column with match scores, add rows that were missed during the initial blocking, or remove rows that we no longer want to consider as matched.
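As a rough illustration of the "table in, modified table out" idea, here is a minimal sketch in plain Python. It is not mismo code: a list of dicts stands in for the table of record pairs (the real API works on ibis tables), and the function and column names (add_name_score, drop_non_matches, name_score) are hypothetical.

```python
# Stand-in for a table of record pairs: one dict per pair.
# Assumption: real mismo code would use an ibis Table instead.
Pairs = list[dict[str, object]]


def add_name_score(pairs: Pairs) -> Pairs:
    """Return a copy of the pairs with a hypothetical `name_score` column."""
    return [
        {**p, "name_score": 1.0 if p["name_l"] == p["name_r"] else 0.0}
        for p in pairs
    ]


def drop_non_matches(pairs: Pairs) -> Pairs:
    """Remove pairs that we no longer want to consider as matched."""
    return [p for p in pairs if p["name_score"] > 0.0]


pairs = [
    {"id_l": 1, "id_r": 7, "name_l": "Ann Lee", "name_r": "Ann Lee"},
    {"id_l": 2, "id_r": 8, "name_l": "Bob Roy", "name_r": "Rob Boy"},
]

scored = add_name_score(pairs)   # adds a column
kept = drop_non_matches(scored)  # removes rows
```

Both functions leave their input untouched and return a new collection, which mirrors how a function over immutable table expressions would behave.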

mismo.compare.PComparer

Bases: Protocol

A Callable that adds column(s) of features to a table of record pairs.

mismo.compare.PComparer.__call__

__call__(pairs: Table, **kwargs) -> Table

Add column(s) of features to a table of record pairs.

For example, add a match score to each record pair, modify a score from a previous PComparer, or similar.

Implementers must expect to be called with a table of record pairs. Columns suffixed with "_l" come from the left table, columns suffixed with "_r" come from the right table, and columns with neither suffix are features of the pair itself (eg from a different PComparer).
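The column convention above can be sketched with typing.Protocol. This is an assumption-laden analogue, not the library's definition: PairComparer, exact_email, boost_score, and run are hypothetical names, and a list of dicts again stands in for the table. The second comparer reads a column with neither suffix, which is a feature produced by the first.

```python
from typing import Any, Protocol

Row = dict[str, Any]
Pairs = list[Row]  # stand-in for a table of record pairs


class PairComparer(Protocol):
    """Hypothetical plain-Python analogue of the PComparer protocol."""

    def __call__(self, pairs: Pairs, **kwargs: Any) -> Pairs: ...


def exact_email(pairs: Pairs, **kwargs: Any) -> Pairs:
    """Add an `email_exact` feature computed from the _l and _r columns."""
    return [{**p, "email_exact": p["email_l"] == p["email_r"]} for p in pairs]


def boost_score(pairs: Pairs, **kwargs: Any) -> Pairs:
    """Modify `score` using the `email_exact` feature from an earlier comparer."""
    bonus = kwargs.get("bonus", 0.5)
    return [
        {**p, "score": p.get("score", 0.0) + (bonus if p["email_exact"] else 0.0)}
        for p in pairs
    ]


def run(pairs: Pairs, comparers: list[PairComparer]) -> Pairs:
    """Apply each comparer in order, feeding its output to the next."""
    for compare in comparers:
        pairs = compare(pairs)
    return pairs


pairs = [
    {"email_l": "a@x.org", "email_r": "a@x.org", "score": 0.25},
    {"email_l": "b@x.org", "email_r": "c@x.org", "score": 0.25},
]
result = run(pairs, [exact_email, boost_score])
```

Because each comparer only adds or updates columns, they can be chained in any order that respects their dependencies.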

Level-Based Comparers

Bin record pairs into discrete levels, based on levels of agreement.

Each EnumComparer represents a dimension, such as name, location, price, date, etc. Each one uses an IbisEnum whose members are the levels of agreement, such as exact, misspelling, within_1_km, etc.

mismo.compare.EnumComparer

Bases: Generic[IbisEnumT]

Assigns an IbisEnum-backed level to record pairs based on one dimension.

mismo.compare.EnumComparer.cases instance-attribute

cases: tuple[tuple[BooleanValue | bool, IbisEnumT], ...]

The cases to check for each level.

mismo.compare.EnumComparer.levels instance-attribute

levels: type[IbisEnumT]

The levels of agreement.

mismo.compare.EnumComparer.name instance-attribute

name: str

The name of the comparer, eg "date", "address", "latlon", "price".

mismo.compare.EnumComparer.representation class-attribute instance-attribute

representation: Literal['string', 'integer'] = 'integer'

The native representation of the levels in ibis expressions.

Integers are more performant, but strings are more human-readable.

mismo.compare.EnumComparer.__call__

__call__(
    pairs: Table,
    *,
    representation: Literal["string", "integer"]
    | None = None,
) -> Table

Label each record pair with the level that it matches.

Go through the levels in order. If a record pair matches a level, label it. If none of the levels match a pair, it is labeled as "else".

PARAMETER DESCRIPTION
pairs

A table of record pairs.

TYPE: Table

RETURNS DESCRIPTION
labels

The input table with an additional column named self.name that contains the level that each record pair matches.

TYPE: Table
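The "first matching level wins, otherwise else" behavior can be sketched in plain Python as follows. This is only an illustration of the logic, not the EnumComparer implementation: a standard IntEnum stands in for an IbisEnum, callables stand in for the boolean expressions in cases, and NameLevel, differs_by_one_char, and label_pairs are hypothetical names.

```python
from enum import IntEnum


class NameLevel(IntEnum):
    """Hypothetical levels of agreement for a "name" dimension."""

    EXACT = 0
    MISSPELLING = 1
    ELSE = 2


def differs_by_one_char(a: str, b: str) -> bool:
    """Crude misspelling check: same length, exactly one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


# (condition, level) cases, checked in order.
CASES = (
    (lambda p: p["name_l"] == p["name_r"], NameLevel.EXACT),
    (lambda p: differs_by_one_char(p["name_l"], p["name_r"]), NameLevel.MISSPELLING),
)


def label_pairs(pairs, name="name", representation="integer"):
    """Add a column called `name` holding the first level each pair matches."""
    labeled = []
    for p in pairs:
        level = NameLevel.ELSE  # used when no case matches
        for condition, candidate in CASES:
            if condition(p):
                level = candidate
                break  # first match wins
        value = int(level) if representation == "integer" else level.name.lower()
        labeled.append({**p, name: value})
    return labeled


pairs = [
    {"name_l": "smith", "name_r": "smith"},
    {"name_l": "smith", "name_r": "smyth"},
    {"name_l": "smith", "name_r": "jones"},
]
as_ints = [p["name"] for p in label_pairs(pairs)]
as_strs = [p["name"] for p in label_pairs(pairs, representation="string")]
```

The representation argument here only switches between the integer value and the lowercase member name, to mirror the trade-off between performance and readability noted for the representation attribute.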

Plotting

mismo.compare.compared_dashboard

compared_dashboard(
    compared: Table,
    comparers: Iterable[EnumComparer],
    weights: Weights | None = None,
    *,
    width: int = 500,
) -> VBox

A dashboard for debugging compared record pairs.

Used to see which match levels are common, which are rare, and which Comparers are related to each other. For example, exact matches should appear together across all Comparers; this pattern probably represents true matches.

PARAMETER DESCRIPTION
compared

The result of running the blocked table through the supplied comparers.

TYPE: Table

comparers

The EnumComparers that were used to produce compared.

TYPE: Iterable[EnumComparer]

weights

The Weights used to score the comparers. If provided, the chart will be colored by the odds found from the Weights.

TYPE: Weights | None DEFAULT: None

width

The width of the chart.

TYPE: int DEFAULT: 500

RETURNS DESCRIPTION
VBox

The dashboard.
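The kind of question the dashboard helps answer, namely which combinations of levels are common and which are rare, can be approximated without any plotting. The sketch below is not part of mismo and does not reproduce the dashboard; it simply counts level combinations over a list of dicts standing in for the compared table, using the hypothetical helper count_level_combinations.

```python
from collections import Counter


def count_level_combinations(compared, comparer_names):
    """Count how often each combination of levels occurs across comparers."""
    return Counter(tuple(row[n] for n in comparer_names) for row in compared)


# Stand-in for the compared table: one level column per comparer name.
compared = [
    {"name": "exact", "address": "exact"},
    {"name": "exact", "address": "exact"},
    {"name": "misspelling", "address": "exact"},
    {"name": "else", "address": "else"},
]

counts = count_level_combinations(compared, ["name", "address"])
most_common = counts.most_common(1)[0]
```

In this toy data the most frequent combination is exact on both dimensions, which is the pattern that would suggest true matches.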