Blocking API

Utilities and classes for the blocking phase of record linkage, where we choose pairs of records to compare.

Without blocking, we would have to compare N*M records, which becomes intractable for datasets much larger than a few thousand.

Blockers

`mismo.block.PBlocker`

Bases: Protocol

A Callable that takes two tables and returns candidate pairs.

`mismo.block.PBlocker.call(left: ir.Table, right: ir.Table, **kwargs) -> ir.Table`

Return a table of candidate pairs.

Implementers must expect to be called with two tables, left and right. Each one will have a column record_id that uniquely identifies each record. Implementers must accept a variable number of keyword arguments, even if they don't use them for anything.

The returned table must follow the Mismo convention of blocked tables:

The left table must have all columns suffixed with "_l".
The right table must have all columns suffixed with "_r".
There must be no duplicate pairs. ie there must be no two rows with the same record_id_l and record_id_r.
There may be additional columns that could get used by later steps, such as a column of rule names used to generate the pair, or some kind of initial score.

`mismo.block.CrossBlocker`

A Blocker that cross joins two tables, yielding all possible pairs.

`mismo.block.EmptyBlocker`

A Blocker that yields no pairs.

`mismo.block.ConditionBlocker`

Blocks based on a join predicate.

`mismo.block.ConditionBlocker.init(*conditions, name: str | None = None)`

Create a Blocker that blocks based on a join condition.

PARAMETER	DESCRIPTION
`condition`	A callable that takes two tables and returns a boolean condition.
`name`	The name of the rule, if any. TYPE: `str \| None` DEFAULT: `None`

Utils

`mismo.block.n_naive_comparisons(left: ir.Table | Sized | int, right: ir.Table | Sized | int | None = None) -> int`

The number of comparisons if we compared every record to every other record.

PARAMETER	DESCRIPTION
`left`	The number of records in the left dataset, or the left dataset itself. TYPE: `Table \| Sized \| int`
`right`	The number of records in the right dataset, or the right dataset itself. For dedupe tasks leave this as None. TYPE: `Table \| Sized \| int \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`int`	The number of comparisons.

`mismo.block.sample_all_pairs(left: ir.Table, right: ir.Table, *, max_pairs: int | None = None) -> ir.Table`

Samples up to max_pairs from all possible pairs of records.

PARAMETER	DESCRIPTION
`left`	The left table. TYPE: `Table`
`right`	The right table. TYPE: `Table`
`max_pairs`	The maximum number of pairs to sample. If None, all possible pairs are sampled. TYPE: `int \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Table`	A blocked table of pairs, in the same schema as from other blocking functions. All pairs will be unique.

Examples:

>>> import ibis
>>> from mismo import playdata, block
>>> ibis.options.interactive = True
>>> t, _labels = playdata.load_febrl1()
>>> t.head(5)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id    ┃ given_name ┃ surname    ┃ street_number ┃ address_1         ┃ … ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string       │ string     │ string     │ string        │ string            │ … │
├──────────────┼────────────┼────────────┼───────────────┼───────────────────┼───┤
│ rec-0-dup-0  │ thomas     │ rokobaro   │ 12            │ herschell circuit │ … │
│ rec-0-org    │ flynn      │ rokobaro   │ 12            │ herschell circuit │ … │
│ rec-1-dup-0  │ karli      │ alderson   │ 144           │ nulsen circuit    │ … │
│ rec-1-org    │ karli      │ alderson   │ 144           │ nulsen circuit    │ … │
│ rec-10-dup-0 │ kayla      │ harrington │ NULL          │ maltby circuit    │ … │
└──────────────┴────────────┴────────────┴───────────────┴───────────────────┴───┘
>>> block.sample_all_pairs(t, t, max_pairs = 7)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id_l   ┃ record_id_r  ┃ address_1_l      ┃ address_1_r       ┃ … ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string        │ string       │ string           │ string            │ … │
├───────────────┼──────────────┼──────────────────┼───────────────────┼───┤
│ rec-138-org   │ rec-63-org   │ strachan place   │ charleston street │ … │
│ rec-2-org     │ rec-34-org   │ finniss crescent │ messenger street  │ … │
│ rec-325-dup-0 │ rec-422-org  │ becker uplace    │ booth crescent    │ … │
│ rec-338-org   │ rec-367-org  │ crease place     │ pickles street    │ … │
│ rec-367-org   │ rec-431-org  │ pickles street   │ kaveneys road     │ … │
│ rec-448-org   │ rec-69-org   │ fenwick place    │ torrens street    │ … │
│ rec-72-org    │ rec-58-dup-0 │ madigan street   │ clark close       │ … │
└───────────────┴──────────────┴──────────────────┴───────────────────┴───┘

`mismo.block.join(left: ir.Table, right: ir.Table, *conditions: _IntoCondition, on_slow: Literal['error', 'warn', 'ignore'] = 'error', task: Literal['dedupe', 'link'] | None = None, **kwargs) -> ir.Table`

Join two tables together using the given conditions.

This is a wrapper around ibis.join that

Allows for slightly more flexible join conditions.
Renames the columns in the resulting table to have "_l" and "_r" suffixes, to be consistent with Mismo's conventions.
Sorts the output columns into a nicer order.
Checks if the join condition is likely to cause a slow O(n*m) join algorithm.
If task is "dedupe", adds an additional restriction that record_id_l < record_id_r to remove duplicates.

PARAMETER	DESCRIPTION
`left`	The left table to block TYPE: `Table`
`right`	The right table to block TYPE: `Table`
`conditions`	The conditions that determine if two records should be blocked together. All conditions are ANDed together. Each condition can be any of the following: A string, which is interpreted as the name of a column in both tables. eg "price" is equivalent to `left.price == right.price` A Deferred, which is used to reference a column in a table eg "_.price.fill_null(0)" is equivalent to `left.price.fill_null(0) == right.price.fill_null(0)` A Callable of signature (left: Table, right: Table, args) -> one of the above A 2-tuple tuple of one of the above. The first element is for the left table, the second is for the right table. This is useful when the column names are different in the two tables, or require some transformation/normalization. For example `("last_name", _.surname.upper()")` is equivalent to `left.last_name == right.surname.upper()` A literal `True`, which results in a cross join. A literal `False`, which results in an empty table. A Table in the expected output schema, which is assumed to be the result of blocking, and will be used as-is. A callable with the signature def block( left: Table, right: Table, , on_slow: Literal["error", "warn", "ignore"] = "error", dedupe: bool \| None = None, kwargs, ) -> BooleanColumn of the join condition, or one of the above. Note You can't reference the input tables directly in the conditions. eg `block_one(left, right, left.name == right.name)` will raise an error. This is because mismo might be modifying the tables before the actual join takes place, which would lead to the condition referencing stale tables that don't exist anymore. Instead, use a lambda or Deferreds. TYPE:* `_IntoCondition` DEFAULT: `()`
`on_slow`	What to do if the join condition causes a slow O(nm) join algorithm. If "error", raise a SlowJoinError. If "warn", issue a SlowJoinWarning. If "ignore", do nothing. See check_join_algorithm() for more information. TYPE:* `Literal['error', 'warn', 'ignore']` DEFAULT: `'error'`
`task`	If "dedupe", the resulting pairs will have the additional restriction that `record_id_l < record_id_r`. If "link", no additional restriction is added. If None, will be inferred by looking at left and right: - If `id(left) == id(right) or left.equals(right)`, then "dedupe" is assumed. - Otherwise, "link" is assumed. TYPE: `Literal['dedupe', 'link'] \| None` DEFAULT: `None`

Key-Based Blockers

Generate pairs where records share a single key, eg "first_name"

`mismo.block.KeyBlocker`

Blocks records together wherever they share a key, eg "emails match."

This is one of the most basic blocking rules, used very often in record linkage. This is what is used in splink.

Examples:

>>> import ibis
>>> from ibis import _
>>> import mismo
>>> ibis.options.interactive = True
>>> t = mismo.playdata.load_patents()["record_id", "name", "latitude"]
>>> t.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ name                         ┃ latitude ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ int64     │ string                       │ float64  │
├───────────┼──────────────────────────────┼──────────┤
│      2909 │ * AGILENT TECHNOLOGIES, INC. │     0.00 │
│      3574 │ * AKZO NOBEL N.V.            │     0.00 │
│      3575 │ * AKZO NOBEL NV              │     0.00 │
│      3779 │ * ALCATEL N.V.               │    52.35 │
│      3780 │ * ALCATEL N.V.               │    52.35 │
└───────────┴──────────────────────────────┴──────────┘

Block the table with itself wherever the names match:

>>> blocker = mismo.block.KeyBlocker("name")
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l         ┃ name_r         ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string         │ string         │
├─────────────┼─────────────┼────────────┼────────────┼────────────────┼────────────────┤
│        3779 │        3780 │      52.35 │  52.350000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│        3779 │        3782 │      52.35 │   0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│        3780 │        3782 │      52.35 │   0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│       25388 │     7651559 │       0.00 │  50.966667 │ DSM N.V.       │ DSM N.V.       │
│       25388 │     7651560 │       0.00 │  52.500000 │ DSM N.V.       │ DSM N.V.       │
└─────────────┴─────────────┴────────────┴────────────┴────────────────┴────────────────┘

Arbitrary blocking keys are supported. For example, block the table wherever

the first 5 characters of the name in uppercase, are the same AND
the latitudes, rounded to 1 decimal place, are the same

>>> blocker = mismo.block.KeyBlocker(_["name"][:5].upper(), _.latitude.round(1))
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l              ┃ name_r              ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string              │ string              │
├─────────────┼─────────────┼────────────┼────────────┼─────────────────────┼─────────────────────┤
│        3574 │        3575 │       0.00 │       0.00 │ * AKZO NOBEL N.V.   │ * AKZO NOBEL NV     │
│        3779 │        3780 │      52.35 │      52.35 │ * ALCATEL N.V.      │ * ALCATEL N.V.      │
│       15041 │       15042 │       0.00 │       0.00 │ * CANON EUROPA N.V  │ * CANON EUROPA N.V. │
│       15041 │       15043 │       0.00 │       0.00 │ * CANON EUROPA N.V  │ * CANON EUROPA NV   │
│       15042 │       15043 │       0.00 │       0.00 │ * CANON EUROPA N.V. │ * CANON EUROPA NV   │
└─────────────┴─────────────┴────────────┴────────────┴─────────────────────┴─────────────────────┘

We can even block on arrays! For example, first let's split each name into significant tokens:

>>> tokens = _.name.upper().split(" ").filter(lambda x: x.length() > 4)
>>> t.select(tokens.name("tokens"))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ tokens                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string>                │
├──────────────────────────────┤
│ ['AGILENT', 'TECHNOLOGIES,'] │
│ ['NOBEL']                    │
│ ['NOBEL']                    │
│ ['ALCATEL']                  │
│ ['ALCATEL']                  │
│ ['ALCATEL']                  │
│ ['CANON', 'EUROPA']          │
│ ['CANON', 'EUROPA']          │
│ ['CANON', 'EUROPA']          │
│ []                           │
│ …                            │
└──────────────────────────────┘

Now, block the tables together wherever two records share a token. Note that this blocked * SCHLUMBERGER LIMITED with * SCHLUMBERGER TECHNOLOGY BV. because they both share the SCHLUMBERGER token.

>>> blocker = mismo.block.KeyBlocker(tokens.unnest())
>>> blocker(t, t).filter(_.name_l != _.name_r).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l                                                     ┃ name_r                                                     ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string                                                     │ string                                                     │
├─────────────┼─────────────┼────────────┼────────────┼────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│        2909 │    13390969 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Inc. Netherlands B.V  │
│        2909 │    13390970 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Inc. Netherlands B.V. │
│        2909 │    13391015 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherland B.V.       │
│        2909 │    13391055 │        0.0 │      52.50 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherlands, B.V.     │
│        2909 │    13391056 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherlands, B.V.     │
└─────────────┴─────────────┴────────────┴────────────┴────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘

`mismo.block.KeyBlocker.name: str` `property`

The name of the KeyBlocker.

`mismo.block.KeyBlocker.init(*keys: str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred] | tuple[str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred], str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred]] | Callable[[ibis.Table, ibis.Table], tuple[ir.Column, ir.Column]], name: str | None = None) -> None`

Create a KeyBlocker.

PARAMETER DESCRIPTION

keys

The keys to block on. The tables will be blocked together wherever they share ALL the keys. Each key can be any of the following:

A string, which is interpreted as the name of a column in both tables. eg "price" is equivalent to left.price == right.price
A Deferred, which is used to reference a column in a table. eg _.price.fill_null(0) is equivalent to left.price.fill_null(0) == right.price.fill_null(0)
A Callable that takes a table and returns a Column.
A 2-tuple of the above, where the first element describes the key in the left table and the second element describes the key in the right table. eg ("first_name", _.GivenName.upper()") is equivalent to left.first_name == right.GivenName.upper() This is useful when the keys have different names in the two tables.
A callable that takes the left and right tables and returns a tuple of columns. Left and right will be joined where the columns are equal.

`mismo.block.KeyBlocker.key_counts(t: ir.Table) -> KeyCountsTable` `cached`

Count the join keys in a table.

This is very similar to t.group_by(key).count(), except that counts NULLs, whereas this method does not count NULLs since they are not counted as a match during joins.

This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.

On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.

PARAMETER	DESCRIPTION
`t`	The table of records. TYPE: `Table`

RETURNS	DESCRIPTION
Will have column(s) for each `key` and a column `n` with the count.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
...     ("a", 1),
...     ("b", 1),
...     ("b", 1),
...     ("c", 3),
...     ("b", 2),
...     ("c", 3),
...     (None, 4),
...     ("c", 3),
... ]
>>> letters, nums = zip(*records)
>>> t = ibis.memtable({"letter": letters, "num": nums})
>>> blocker = mismo.block.KeyBlocker("letter", "num")

Note how the (None, 4) record is not counted, since NULLs are not counted as a match during a join.

>>> counts = blocker.key_counts(t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     1 │
│ b      │     1 │     2 │
│ b      │     2 │     1 │
│ c      │     3 │     3 │
└────────┴───────┴───────┘

The returned CountsTable is a subclass of an Ibis Table with a special n_total property for convenience:

>>> counts.n_total
7
>>> isinstance(counts, ibis.Table)
True

`mismo.block.KeyBlocker.pair_counts(left: ir.Table, right: ir.Table, *, task: Literal['dedupe', 'link'] | None = None) -> PairCountsTable` `cached`

Count the number of pairs that would be generated by each key.

If you were to use this blocker to join left with right, how many pairs would be generated for each key?

This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.

On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.

PARAMETER	DESCRIPTION
`left`	The left table. TYPE: `Table`
`right`	The right table. TYPE: `Table`
`task`	The task to count pairs for. "link": each key results in n_left * n_right pairs "dedupe": each key results in n_left * (n_right - 1) / 2 pairs since we will only generate pair (A, B), not also (B, A). None: inferred from the input tables: if `left is right`, then "dedupe", otherwise "link". TYPE: `Literal['dedupe', 'link'] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
Will have column(s) for each `key` and a column `n` with the count.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
...     ("a", 1),
...     ("b", 1),
...     ("b", 1),
...     ("c", 3),
...     ("b", 2),
...     ("c", 3),
...     (None, 4),
...     ("c", 3),
... ]
>>> letters, nums = zip(*records)
>>> t = ibis.memtable({"letter": letters, "num": nums})
>>> blocker = mismo.block.KeyBlocker("letter", "num")

If we joined t with itself using this blocker in a link task, we would end up with

9 pairs in the (c, 3) block due to pairs (3,3), (3, 6), (3, 8), (6, 3), (6, 6), (6, 8), (8, 3), (8, 6), and (8, 8)
4 pairs in the (b, 1) block due to pairs (1, 1), (1, 2), (2, 1), and (2, 2)
1 pairs in the (a, 1) block due to pair (0, 0)
1 pairs in the (b, 2) block due to pair (4, 4)

>>> counts = blocker.pair_counts(t, t, task="link").order_by("letter", "num")
>>> counts
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     1 │
│ b      │     1 │     4 │
│ b      │     2 │     1 │
│ c      │     3 │     9 │
└────────┴───────┴───────┘

If we joined t with itself using this blocker in a dedupe task, we would end up with

3 pairs in the (c, 3) block due to pairs (3, 6), (3, 8), and (6, 8)
1 pairs in the (b, 1) block due to pairs (1, 2)
0 pairs in the (a, 1) block due to record 0 not getting blocked with itself
0 pairs in the (b, 2) block due to record 4 not getting blocked with itself

>>> counts = blocker.pair_counts(t, t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     0 │
│ b      │     1 │     1 │
│ b      │     2 │     0 │
│ c      │     3 │     3 │
└────────┴───────┴───────┘

The returned CountsTable is a subclass of an Ibis Table with a special n_total property for convenience:

>>> counts.n_total
4
>>> isinstance(counts, ibis.Table)
True

`mismo.block.CountsTable`

Bases: TableWrapper

A table with at least an Integer column named n.

There will also be variable number of other columns that act as identifiers.

You won't create this directly, it will be returned to you from eg KeyBlocker.key_counts() or KeyBlocker.pair_counts().

`mismo.block.CountsTable.n: ir.IntegerColumn` `instance-attribute`

The column containing the count.

`mismo.block.CountsTable.n_total: int` `cached` `property`

n.sum(), just here for convenience.

Ensemble Blockers

Blockers that use other Blockers

`mismo.block.UnionBlocker`

A Blocker that takes the union of the results of other Blockers.

`mismo.block.UnionBlocker.init(*blockers: PBlocker, labels: bool = False)`

Create a UnionBlocker from the given Blockers.

PARAMETER	DESCRIPTION
`blockers`	The Blockers to use. TYPE: `PBlocker` DEFAULT: `()`
`labels`	If True, a column of type `array<string>` will be added to the resulting table indicating which rules caused each record pair to be blocked. If False, the resulting table will only contain the columns of left and right. TYPE: `bool` DEFAULT: `False`

`mismo.block.upset_chart(blocked: ir.Table) -> alt.Chart`

Generate an Altair-based UpSet plot for a blocked table.

An UpSet plot is useful to visualize the overlap between blocking rules. For example, how many pairs are blocked by the "given name" rule, how many pairs are blocked by the "surname name" rule, and how many pairs are blocked by both rules together.

For example, if there is one group that generates a lot of pairs, that could be an opportunity to make those rules more restrictive, so that they generate fewer pairs.

PARAMETER	DESCRIPTION
`blocked`	The blocked table to plot. TYPE: `Table`

RETURNS	DESCRIPTION
`Chart`	An Altair chart.

Analyze: Join Algorithm

Analyze the actual algorithm that the SQL engine will use when pairing up records during the join. In particular, check for slow O(n*m) nested loop joins.

See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of how SQL engines (or at least duckdb) chooses a join algorithm.

`mismo.block.get_join_algorithm(left: ir.Table, right: ir.Table, condition) -> str`

Return one of the JOIN_ALGORITHMS for the outermost join in the expression.

If there are multiple joins in the query, this will return the top (outermost) one. This only works with expressions bound to a DuckDB backend. Other kinds of expressions will raise NotImplementedError.

`mismo.block.check_join_algorithm(left: ir.Table, right: ir.Table, condition, *, on_slow: Literal['error', 'warn', 'ignore'] = 'error') -> None`

Error or warn if the join in the expression is likely to be slow.

Issues a SlowJoinWarning or raises a SlowJoinError.

This is only implemented for the duckdb backend. All other backends will issue a warning and skip the check.

By "slow", we mean that the join algorithm is one of:

"NESTED_LOOP_JOIN" O(n*m)
"BLOCKWISE_NL_JOIN" O(n*m)
"CROSS_PRODUCT" O(n*m)

and not one of the fast join algorithms:

"EMPTY_RESULT" O(1)
"POSITIONAL_JOIN" O(n)
"HASH_JOIN" O(n)
"PIECEWISE_MERGE_JOIN" O(m*log(n))
"IE_JOIN" O(n*log(n))
"ASOF_JOIN" O(n*log(n))

This is done by using the EXPLAIN command to generate the query plan, and checking the join algorithm.

See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of these join algorithms.

`mismo.block.JOIN_ALGORITHMS = frozenset({'EMPTY_RESULT', 'LEFT_DELIM_JOIN', 'RIGHT_DELIM_JOIN', 'BLOCKWISE_NL_JOIN', 'NESTED_LOOP_JOIN', 'HASH_JOIN', 'PIECEWISE_MERGE_JOIN', 'IE_JOIN', 'ASOF_JOIN', 'CROSS_PRODUCT', 'POSITIONAL_JOIN'})` `module-attribute`

Based on all the JOIN operators in https://github.com/duckdb/duckdb/blob/b0b1562e293718ee9279c9621cefe4cb5dc01ef9/src/common/enums/physical_operator_type.cpp#L56 (very good) explanation of these at https://duckdb.org/2022/05/27/iejoin.html

`mismo.block.SLOW_JOIN_ALGORITHMS = frozenset({'NESTED_LOOP_JOIN', 'BLOCKWISE_NL_JOIN', 'CROSS_PRODUCT'})` `module-attribute`

`mismo.block.SlowJoinError`

Bases: _SlowJoinMixin, ValueError

Error for slow join algorithms.

`mismo.block.SlowJoinWarning`

Bases: _SlowJoinMixin, UserWarning

Warning for slow join algorithms.

Blocking API

Blockers

mismo.block.PBlocker

mismo.block.PBlocker.__call__(left: ir.Table, right: ir.Table, **kwargs) -> ir.Table

mismo.block.CrossBlocker

mismo.block.EmptyBlocker

mismo.block.ConditionBlocker

mismo.block.ConditionBlocker.__init__(*conditions, name: str | None = None)

Utils

mismo.block.n_naive_comparisons(left: ir.Table | Sized | int, right: ir.Table | Sized | int | None = None) -> int

mismo.block.sample_all_pairs(left: ir.Table, right: ir.Table, *, max_pairs: int | None = None) -> ir.Table

mismo.block.join(left: ir.Table, right: ir.Table, *conditions: _IntoCondition, on_slow: Literal['error', 'warn', 'ignore'] = 'error', task: Literal['dedupe', 'link'] | None = None, **kwargs) -> ir.Table

Key-Based Blockers

mismo.block.KeyBlocker

mismo.block.KeyBlocker.name: str property

mismo.block.KeyBlocker.key_counts(t: ir.Table) -> KeyCountsTable cached

mismo.block.KeyBlocker.pair_counts(left: ir.Table, right: ir.Table, *, task: Literal['dedupe', 'link'] | None = None) -> PairCountsTable cached

mismo.block.CountsTable

mismo.block.CountsTable.n: ir.IntegerColumn instance-attribute

mismo.block.CountsTable.n_total: int cached property

Ensemble Blockers

mismo.block.UnionBlocker

mismo.block.UnionBlocker.__init__(*blockers: PBlocker, labels: bool = False)

mismo.block.upset_chart(blocked: ir.Table) -> alt.Chart

Analyze: Join Algorithm

mismo.block.get_join_algorithm(left: ir.Table, right: ir.Table, condition) -> str

mismo.block.check_join_algorithm(left: ir.Table, right: ir.Table, condition, *, on_slow: Literal['error', 'warn', 'ignore'] = 'error') -> None

mismo.block.JOIN_ALGORITHMS = frozenset({'EMPTY_RESULT', 'LEFT_DELIM_JOIN', 'RIGHT_DELIM_JOIN', 'BLOCKWISE_NL_JOIN', 'NESTED_LOOP_JOIN', 'HASH_JOIN', 'PIECEWISE_MERGE_JOIN', 'IE_JOIN', 'ASOF_JOIN', 'CROSS_PRODUCT', 'POSITIONAL_JOIN'}) module-attribute

mismo.block.SLOW_JOIN_ALGORITHMS = frozenset({'NESTED_LOOP_JOIN', 'BLOCKWISE_NL_JOIN', 'CROSS_PRODUCT'}) module-attribute

mismo.block.SlowJoinError

mismo.block.SlowJoinWarning

`mismo.block.PBlocker`

`mismo.block.PBlocker.call(left: ir.Table, right: ir.Table, **kwargs) -> ir.Table`

`mismo.block.CrossBlocker`

`mismo.block.EmptyBlocker`

`mismo.block.ConditionBlocker`

`mismo.block.ConditionBlocker.init(*conditions, name: str | None = None)`

`mismo.block.n_naive_comparisons(left: ir.Table | Sized | int, right: ir.Table | Sized | int | None = None) -> int`

`mismo.block.sample_all_pairs(left: ir.Table, right: ir.Table, *, max_pairs: int | None = None) -> ir.Table`

`mismo.block.join(left: ir.Table, right: ir.Table, *conditions: _IntoCondition, on_slow: Literal['error', 'warn', 'ignore'] = 'error', task: Literal['dedupe', 'link'] | None = None, **kwargs) -> ir.Table`

`mismo.block.KeyBlocker`

`mismo.block.KeyBlocker.name: str` `property`

`mismo.block.KeyBlocker.key_counts(t: ir.Table) -> KeyCountsTable` `cached`

`mismo.block.KeyBlocker.pair_counts(left: ir.Table, right: ir.Table, *, task: Literal['dedupe', 'link'] | None = None) -> PairCountsTable` `cached`

`mismo.block.CountsTable`

`mismo.block.CountsTable.n: ir.IntegerColumn` `instance-attribute`

`mismo.block.CountsTable.n_total: int` `cached` `property`

`mismo.block.UnionBlocker`

`mismo.block.UnionBlocker.init(*blockers: PBlocker, labels: bool = False)`

`mismo.block.upset_chart(blocked: ir.Table) -> alt.Chart`

`mismo.block.get_join_algorithm(left: ir.Table, right: ir.Table, condition) -> str`

`mismo.block.check_join_algorithm(left: ir.Table, right: ir.Table, condition, *, on_slow: Literal['error', 'warn', 'ignore'] = 'error') -> None`

`mismo.block.JOIN_ALGORITHMS = frozenset({'EMPTY_RESULT', 'LEFT_DELIM_JOIN', 'RIGHT_DELIM_JOIN', 'BLOCKWISE_NL_JOIN', 'NESTED_LOOP_JOIN', 'HASH_JOIN', 'PIECEWISE_MERGE_JOIN', 'IE_JOIN', 'ASOF_JOIN', 'CROSS_PRODUCT', 'POSITIONAL_JOIN'})` `module-attribute`

`mismo.block.SLOW_JOIN_ALGORITHMS = frozenset({'NESTED_LOOP_JOIN', 'BLOCKWISE_NL_JOIN', 'CROSS_PRODUCT'})` `module-attribute`

`mismo.block.SlowJoinError`

`mismo.block.SlowJoinWarning`