Skip to content

Blocking API

Utilities and classes for the blocking phase of record linkage, where we choose pairs of records to compare.

Without blocking, we would have to compare N*M records, which becomes intractable for datasets much larger than a few thousand.

Blockers

mismo.block.PBlocker

Bases: Protocol

A Callable that takes two tables and returns candidate pairs.

mismo.block.PBlocker.__call__(left: ir.Table, right: ir.Table, **kwargs) -> ir.Table

Return a table of candidate pairs.

Implementers must expect to be called with two tables, left and right. Each one will have a column record_id that uniquely identifies each record. Implementers must accept a variable number of keyword arguments, even if they don't use them for anything.

The returned table must follow the Mismo convention of blocked tables:

  • The left table must have all columns suffixed with "_l".
  • The right table must have all columns suffixed with "_r".
  • There must be no duplicate pairs. ie there must be no two rows with the same record_id_l and record_id_r.
  • There may be additional columns that could get used by later steps, such as a column of rule names used to generate the pair, or some kind of initial score.

mismo.block.CrossBlocker

A Blocker that cross joins two tables, yielding all possible pairs.

mismo.block.EmptyBlocker

A Blocker that yields no pairs.

mismo.block.ConditionBlocker

Blocks based on a join predicate.

mismo.block.ConditionBlocker.__init__(condition: Callable[[ir.Table, ir.Table], ir.BooleanValue], *, name: str | None = None)

Create a Blocker that blocks based on a join condition.

PARAMETER DESCRIPTION
condition

A callable that takes two tables and returns a boolean condition.

TYPE: Callable[[Table, Table], BooleanValue]

name

The name of the rule, if any.

TYPE: str | None DEFAULT: None

Utils

mismo.block.n_naive_comparisons(left: ir.Table | Sized | int, right: ir.Table | Sized | int | None = None) -> int

The number of comparisons if we compared every record to every other record.

PARAMETER DESCRIPTION
left

The number of records in the left dataset, or the left dataset itself.

TYPE: Table | Sized | int

right

The number of records in the right dataset, or the right dataset itself. For dedupe tasks leave this as None.

TYPE: Table | Sized | int | None DEFAULT: None

RETURNS DESCRIPTION
int

The number of comparisons.

mismo.block.sample_all_pairs(left: ir.Table, right: ir.Table, *, max_pairs: int | None = None) -> ir.Table

Samples up to max_pairs from all possible pairs of records.

PARAMETER DESCRIPTION
left

The left table.

TYPE: Table

right

The right table.

TYPE: Table

max_pairs

The maximum number of pairs to sample. If None, all possible pairs are sampled.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
Table

A blocked table of pairs, in the same schema as from other blocking functions. All pairs will be unique.

Examples:

>>> import ibis
>>> from mismo import playdata, block
>>> ibis.options.interactive = True
>>> t, _labels = playdata.load_febrl1()
>>> t.head(5)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id    ┃ given_name ┃ surname    ┃ street_number ┃ address_1         ┃ … ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string       │ string     │ string     │ string        │ string            │ … │
├──────────────┼────────────┼────────────┼───────────────┼───────────────────┼───┤
│ rec-0-dup-0  │ thomas     │ rokobaro   │ 12            │ herschell circuit │ … │
│ rec-0-org    │ flynn      │ rokobaro   │ 12            │ herschell circuit │ … │
│ rec-1-dup-0  │ karli      │ alderson   │ 144           │ nulsen circuit    │ … │
│ rec-1-org    │ karli      │ alderson   │ 144           │ nulsen circuit    │ … │
│ rec-10-dup-0 │ kayla      │ harrington │ NULL          │ maltby circuit    │ … │
└──────────────┴────────────┴────────────┴───────────────┴───────────────────┴───┘
>>> block.sample_all_pairs(t, t, max_pairs = 7)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id_l   ┃ record_id_r  ┃ address_1_l      ┃ address_1_r       ┃ … ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string        │ string       │ string           │ string            │ … │
├───────────────┼──────────────┼──────────────────┼───────────────────┼───┤
│ rec-138-org   │ rec-63-org   │ strachan place   │ charleston street │ … │
│ rec-2-org     │ rec-34-org   │ finniss crescent │ messenger street  │ … │
│ rec-325-dup-0 │ rec-422-org  │ becker uplace    │ booth crescent    │ … │
│ rec-338-org   │ rec-367-org  │ crease place     │ pickles street    │ … │
│ rec-367-org   │ rec-431-org  │ pickles street   │ kaveneys road     │ … │
│ rec-448-org   │ rec-69-org   │ fenwick place    │ torrens street    │ … │
│ rec-72-org    │ rec-58-dup-0 │ madigan street   │ clark close       │ … │
└───────────────┴──────────────┴──────────────────┴───────────────────┴───┘

mismo.block.join(left: ir.Table, right: ir.Table, *conditions: _Condition, on_slow: Literal['error', 'warn', 'ignore'] = 'error', task: Literal['dedupe', 'link'] | None = None, **kwargs) -> ir.Table

Join two tables together using the given conditions.

This is a wrapper around ibis.join that

  • Allows for slightly more flexible join conditions.
  • Renames the columns in the resulting table to have "_l" and "_r" suffixes, to be consistent with Mismo's conventions.
  • Sorts the output columns into a nicer order.
  • Checks if the join condition is likely to cause a slow O(n*m) join algorithm.
  • If task is "dedupe", adds an additional restriction that record_id_l < record_id_r to remove duplicates.
PARAMETER DESCRIPTION
left

The left table to block

TYPE: Table

right

The right table to block

TYPE: Table

conditions

The conditions that determine if two records should be blocked together. All conditions are ANDed together. Each condition can be any of the following:

  • A string, which is interpreted as the name of a column in both tables. eg "price" is equivalent to left.price == right.price
  • A Deferred, which is used to reference a column in a table eg "_.price.fill_null(0)" is equivalent to left.price.fill_null(0) == right.price.fill_null(0)
  • A Callable of signature (left: Table, right: Table, args) -> one of the above
  • A 2-tuple tuple of one of the above. The first element is for the left table, the second is for the right table. This is useful when the column names are different in the two tables, or require some transformation/normalization. For example ("last_name", _.surname.upper()") is equivalent to left.last_name == right.surname.upper()
  • A literal True, which results in a cross join.
  • A literal False, which results in an empty table.
  • A Table in the expected output schema, which is assumed to be the result of blocking, and will be used as-is.
  • A callable with the signature def block( left: Table, right: Table, , on_slow: Literal["error", "warn", "ignore"] = "error", dedupe: bool | None = None, *kwargs, ) -> BooleanColumn of the join condition, or one of the above.

Note

You can't reference the input tables directly in the conditions. eg block_one(left, right, left.name == right.name) will raise an error. This is because mismo might be modifying the tables before the actual join takes place, which would lead to the condition referencing stale tables that don't exist anymore. Instead, use a lambda or Deferreds.

TYPE: _Condition DEFAULT: ()

on_slow

What to do if the join condition causes a slow O(n*m) join algorithm. If "error", raise a SlowJoinError. If "warn", issue a SlowJoinWarning. If "ignore", do nothing. See check_join_algorithm() for more information.

TYPE: Literal['error', 'warn', 'ignore'] DEFAULT: 'error'

task

If "dedupe", the resulting pairs will have the additional restriction that record_id_l < record_id_r. If "link", no additional restriction is added. If None, will be inferred by looking at left and right: - If id(left) == id(right) or left.equals(right), then "dedupe" is assumed. - Otherwise, "link" is assumed.

TYPE: Literal['dedupe', 'link'] | None DEFAULT: None

Key-Based Blockers

Generate pairs where records share a single key, eg "first_name"

mismo.block.KeyBlocker

Blocks records together wherever they share a key, eg "emails match."

This is one of the most basic blocking rules, used very often in record linkage. This is what is used in splink.

Examples:

>>> import ibis
>>> from ibis import _
>>> import mismo
>>> ibis.options.interactive = True
>>> t = mismo.playdata.load_patents()["record_id", "name", "latitude"]
>>> t.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ name                         ┃ latitude ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ int64     │ string                       │ float64  │
├───────────┼──────────────────────────────┼──────────┤
│      2909 │ * AGILENT TECHNOLOGIES, INC. │     0.00 │
│      3574 │ * AKZO NOBEL N.V.            │     0.00 │
│      3575 │ * AKZO NOBEL NV              │     0.00 │
│      3779 │ * ALCATEL N.V.               │    52.35 │
│      3780 │ * ALCATEL N.V.               │    52.35 │
└───────────┴──────────────────────────────┴──────────┘

Block the table with itself wherever the names match:

>>> blocker = mismo.block.KeyBlocker("name")
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l         ┃ name_r         ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string         │ string         │
├─────────────┼─────────────┼────────────┼────────────┼────────────────┼────────────────┤
│        3779 │        3780 │      52.35 │  52.350000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│        3779 │        3782 │      52.35 │   0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│        3780 │        3782 │      52.35 │   0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│       25388 │     7651559 │       0.00 │  50.966667 │ DSM N.V.       │ DSM N.V.       │
│       25388 │     7651560 │       0.00 │  52.500000 │ DSM N.V.       │ DSM N.V.       │
└─────────────┴─────────────┴────────────┴────────────┴────────────────┴────────────────┘

Arbitrary blocking keys are supported. For example, block the table wherever

  • the first 5 characters of the name in uppercase, are the same AND
  • the latitudes, rounded to 1 decimal place, are the same
>>> blocker = mismo.block.KeyBlocker(_["name"][:5].upper(), _.latitude.round(1))
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l              ┃ name_r              ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string              │ string              │
├─────────────┼─────────────┼────────────┼────────────┼─────────────────────┼─────────────────────┤
│        3574 │        3575 │       0.00 │       0.00 │ * AKZO NOBEL N.V.   │ * AKZO NOBEL NV     │
│        3779 │        3780 │      52.35 │      52.35 │ * ALCATEL N.V.      │ * ALCATEL N.V.      │
│       15041 │       15042 │       0.00 │       0.00 │ * CANON EUROPA N.V  │ * CANON EUROPA N.V. │
│       15041 │       15043 │       0.00 │       0.00 │ * CANON EUROPA N.V  │ * CANON EUROPA NV   │
│       15042 │       15043 │       0.00 │       0.00 │ * CANON EUROPA N.V. │ * CANON EUROPA NV   │
└─────────────┴─────────────┴────────────┴────────────┴─────────────────────┴─────────────────────┘

We can even block on arrays! For example, first let's split each name into significant tokens:

>>> tokens = _.name.upper().split(" ").filter(lambda x: x.length() > 4)
>>> t.select(tokens)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayFilter(StringSplit(Uppercase(name), ' '), Greater(StringLength(x), 4)) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string>                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│ ['AGILENT', 'TECHNOLOGIES,']                                                │
│ ['NOBEL']                                                                   │
│ ['NOBEL']                                                                   │
│ ['ALCATEL']                                                                 │
│ ['ALCATEL']                                                                 │
│ ['ALCATEL']                                                                 │
│ ['CANON', 'EUROPA']                                                         │
│ ['CANON', 'EUROPA']                                                         │
│ ['CANON', 'EUROPA']                                                         │
│ []                                                                          │
│ …                                                                           │
└─────────────────────────────────────────────────────────────────────────────┘

Now, block the tables together wherever two records share a token. Note that this blocked * SCHLUMBERGER LIMITED with * SCHLUMBERGER TECHNOLOGY BV. because they both share the SCHLUMBERGER token.

>>> blocker = mismo.block.KeyBlocker(tokens.unnest())
>>> blocker(t, t).filter(_.name_l != _.name_r).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l                                                     ┃ name_r                                                     ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string                                                     │ string                                                     │
├─────────────┼─────────────┼────────────┼────────────┼────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│        2909 │    13390969 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Inc. Netherlands B.V  │
│        2909 │    13390970 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Inc. Netherlands B.V. │
│        2909 │    13391015 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherland B.V.       │
│        2909 │    13391055 │        0.0 │      52.50 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherlands, B.V.     │
│        2909 │    13391056 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherlands, B.V.     │
└─────────────┴─────────────┴────────────┴────────────┴────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘

mismo.block.KeyBlocker.name: str property

The name of the KeyBlocker.

mismo.block.KeyBlocker.__init__(*keys: str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred] | tuple[str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred], str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred]] | Callable[[ibis.Table, ibis.Table], tuple[ir.Column, ir.Column]], name: str | None = None) -> None

Create a KeyBlocker.

PARAMETER DESCRIPTION
keys

The keys to block on. The tables will be blocked together wherever they share ALL the keys. Each key can be any of the following:

  • A string, which is interpreted as the name of a column in both tables. eg "price" is equivalent to left.price == right.price
  • A Deferred, which is used to reference a column in a table. eg _.price.fill_null(0) is equivalent to left.price.fill_null(0) == right.price.fill_null(0)
  • A Callable that takes a table and returns a Column.
  • A 2-tuple of the above, where the first element describes the key in the left table and the second element describes the key in the right table. eg ("first_name", _.GivenName.upper()") is equivalent to left.first_name == right.GivenName.upper() This is useful when the keys have different names in the two tables.
  • A callable that takes the left and right tables and returns a tuple of columns. Left and right will be joined where the columns are equal.

TYPE: str | Deferred | Callable[[Table], Column | str | Deferred] | tuple[str | Deferred | Callable[[Table], Column | str | Deferred], str | Deferred | Callable[[Table], Column | str | Deferred]] | Callable[[Table, Table], tuple[Column, Column]] DEFAULT: ()

mismo.block.KeyBlocker.key_counts(t: ir.Table) -> CountsTable cached

Count the join keys in a table.

This is very similar to t.group_by(key).count(), except that counts NULLs, whereas this method does not count NULLs since they are not counted as a match during joins.

This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.

On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.

PARAMETER DESCRIPTION
t

The table of records.

TYPE: Table

RETURNS DESCRIPTION
CountsTable

Will have column(s) for each key and a column n with the count.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
...     ("a", 1),
...     ("b", 1),
...     ("b", 1),
...     ("c", 3),
...     ("b", 2),
...     ("c", 3),
...     (None, 4),
...     ("c", 3),
... ]
>>> letters, nums = zip(*records)
>>> t = ibis.memtable({"letter": letters, "num": nums})
>>> blocker = mismo.block.KeyBlocker("letter", "num")

Note how the (None, 4) record is not counted, since NULLs are not counted as a match during a join.

>>> counts = blocker.key_counts(t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     1 │
│ b      │     1 │     2 │
│ b      │     2 │     1 │
│ c      │     3 │     3 │
└────────┴───────┴───────┘

The returned CountsTable is a subclass of an Ibis Table with a special n_total property for convenience:

>>> counts.n_total
7
>>> isinstance(counts, ibis.Table)
True

mismo.block.KeyBlocker.pair_counts(left: ir.Table, right: ir.Table, *, task: Literal['dedupe', 'link'] | None = None) -> CountsTable cached

Count the number of pairs that would be generated by each key.

If you were to use this blocker to join left with right, how many pairs would be generated for each key?

This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.

On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.

PARAMETER DESCRIPTION
left

The left table.

TYPE: Table

right

The right table.

TYPE: Table

task

The task to count pairs for.

  • "link": each key results in n_left * n_right pairs
  • "dedupe": each key results in n_left * (n_right - 1) / 2 pairs since we will only generate pair (A, B), not also (B, A).
  • None: inferred from the input tables: if left is right, then "dedupe", otherwise "link".

TYPE: Literal['dedupe', 'link'] | None DEFAULT: None

RETURNS DESCRIPTION
CountsTable

Will have column(s) for each key and a column n with the count.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
...     ("a", 1),
...     ("b", 1),
...     ("b", 1),
...     ("c", 3),
...     ("b", 2),
...     ("c", 3),
...     (None, 4),
...     ("c", 3),
... ]
>>> letters, nums = zip(*records)
>>> t = ibis.memtable({"letter": letters, "num": nums})
>>> blocker = mismo.block.KeyBlocker("letter", "num")

If we joined t with itself using this blocker in a link task, we would end up with

  • 9 pairs in the (c, 3) block due to pairs (3,3), (3, 6), (3, 8), (6, 3), (6, 6), (6, 8), (8, 3), (8, 6), and (8, 8)
  • 4 pairs in the (b, 1) block due to pairs (1, 1), (1, 2), (2, 1), and (2, 2)
  • 1 pairs in the (a, 1) block due to pair (0, 0)
  • 1 pairs in the (b, 2) block due to pair (4, 4)
>>> counts = blocker.pair_counts(t, t, task="link").order_by("letter", "num")
>>> counts
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     1 │
│ b      │     1 │     4 │
│ b      │     2 │     1 │
│ c      │     3 │     9 │
└────────┴───────┴───────┘

If we joined t with itself using this blocker in a dedupe task, we would end up with

  • 3 pairs in the (c, 3) block due to pairs (3, 6), (3, 8), and (6, 8)
  • 1 pairs in the (b, 1) block due to pairs (1, 2)
  • 0 pairs in the (a, 1) block due to record 0 not getting blocked with itself
  • 0 pairs in the (b, 2) block due to record 4 not getting blocked with itself
>>> counts = blocker.pair_counts(t, t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     0 │
│ b      │     1 │     1 │
│ b      │     2 │     0 │
│ c      │     3 │     3 │
└────────┴───────┴───────┘

The returned CountsTable is a subclass of an Ibis Table with a special n_total property for convenience:

>>> counts.n_total
4
>>> isinstance(counts, ibis.Table)
True

mismo.block.CountsTable

Bases: TableWrapper

A table with at least an Integer column named n.

There will also be variable number of other columns that act as identifiers.

You won't create this directly, it will be returned to you from eg KeyBlocker.key_counts() or KeyBlocker.pair_counts().

mismo.block.CountsTable.n: ir.IntegerColumn instance-attribute

The column containing the count.

mismo.block.CountsTable.n_total: int cached property

n.sum(), just here for convenience.

Ensemble Blockers

Blockers that use other Blockers

mismo.block.UnionBlocker

A Blocker that takes the union of the results of other Blockers.

mismo.block.UnionBlocker.__init__(*blockers: PBlocker, labels: bool = False)

Create a UnionBlocker from the given Blockers.

PARAMETER DESCRIPTION
blockers

The Blockers to use.

TYPE: PBlocker DEFAULT: ()

labels

If True, a column of type array<string> will be added to the resulting table indicating which rules caused each record pair to be blocked. If False, the resulting table will only contain the columns of left and right.

TYPE: bool DEFAULT: False

mismo.block.upset_chart(blocked: ir.Table) -> alt.Chart

Generate an Altair-based UpSet plot for a blocked table.

An UpSet plot is useful to visualize the overlap between blocking rules. For example, how many pairs are blocked by the "given name" rule, how many pairs are blocked by the "surname name" rule, and how many pairs are blocked by both rules together.

For example, if there is one group that generates a lot of pairs, that could be an opportunity to make those rules more restrictive, so that they generate fewer pairs.

PARAMETER DESCRIPTION
blocked

The blocked table to plot.

TYPE: Table

RETURNS DESCRIPTION
Chart

An Altair chart.

Analyze: Join Algorithm

Analyze the actual algorithm that the SQL engine will use when pairing up records during the join. In particular, check for slow O(n*m) nested loop joins.

See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of how SQL engines (or at least duckdb) chooses a join algorithm.

mismo.block.get_join_algorithm(left: ir.Table, right: ir.Table, condition) -> str

Return one of the JOIN_ALGORITHMS for the outermost join in the expression.

If there are multiple joins in the query, this will return the top (outermost) one. This only works with expressions bound to a DuckDB backend. Other kinds of expressions will raise NotImplementedError.

mismo.block.check_join_algorithm(left: ir.Table, right: ir.Table, condition, *, on_slow: Literal['error', 'warn', 'ignore'] = 'error') -> None

Error or warn if the join in the expression is likely to be slow.

Issues a SlowJoinWarning or raises a SlowJoinError.

This is only implemented for the duckdb backend. All other backends will issue a warning and skip the check.

By "slow", we mean that the join algorithm is one of:

  • "NESTED_LOOP_JOIN" O(n*m)
  • "BLOCKWISE_NL_JOIN" O(n*m)
  • "CROSS_PRODUCT" O(n*m)

and not one of the fast join algorithms:

  • "EMPTY_RESULT" O(1)
  • "POSITIONAL_JOIN" O(n)
  • "HASH_JOIN" O(n)
  • "PIECEWISE_MERGE_JOIN" O(m*log(n))
  • "IE_JOIN" O(n*log(n))
  • "ASOF_JOIN" O(n*log(n))

This is done by using the EXPLAIN command to generate the query plan, and checking the join algorithm.

See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of these join algorithms.

mismo.block.JOIN_ALGORITHMS = frozenset({'EMPTY_RESULT', 'LEFT_DELIM_JOIN', 'RIGHT_DELIM_JOIN', 'BLOCKWISE_NL_JOIN', 'NESTED_LOOP_JOIN', 'HASH_JOIN', 'PIECEWISE_MERGE_JOIN', 'IE_JOIN', 'ASOF_JOIN', 'CROSS_PRODUCT', 'POSITIONAL_JOIN'}) module-attribute

mismo.block.SLOW_JOIN_ALGORITHMS = frozenset({'NESTED_LOOP_JOIN', 'BLOCKWISE_NL_JOIN', 'CROSS_PRODUCT'}) module-attribute

mismo.block.SlowJoinError

Bases: _SlowJoinMixin, ValueError

Error for slow join algorithms.

mismo.block.SlowJoinWarning

Bases: _SlowJoinMixin, UserWarning

Warning for slow join algorithms.