Skip to content

Core API

Types, functions, etc that are core to mismo and are used throughout the rest of the framework

Core Types

mismo.Linkage

A dataclass of two Tables of records (LinkedTables) and a Table of links (LinksTable) between them.

Each record in left can be linked from 0 to N records in right, and vice versa.

See Also

The Diff dataclass, for representing the special case where each record is linked to at most one other record.

mismo.Linkage.left property

The left Table.

links: LinksTable

A table of (record_id_l, record_id_r, ...) that link left and right.

mismo.Linkage.right property

right: LinkedTable

The right Table.

mismo.Linkage.__init__

__init__(
    *, left: Table, right: Table, links: Table
) -> None

Create from two Tables and a Table of links between them.

PARAMETER DESCRIPTION
left

A Table of records, with at least a column 'record_id'.

TYPE: Table

right

A Table of records, with at least a column 'record_id'.

TYPE: Table

links

A Table of links between the two tables. Must have columns 'record_id_l' and 'record_id_r', which refer to the 'record_id' columns in left and right, respectively. May have other columns. May not have duplicate (record_id_l, record_id_r) pairs.

TYPE: Table

mismo.Linkage.cache

cache() -> Self

Cache left, right, and links for faster subsequent access.

mismo.Linkage.copy

copy(
    *,
    left: LinkedTable | None = None,
    right: LinkedTable | None = None,
    links: LinksTable | None = None,
) -> Linkage

Create a new Linkage, optionally replacing the left, right, and links tables.

mismo.Linkage.from_join_condition classmethod

from_join_condition(
    left: Table, right: Table, condition: Any
) -> Self

Create a Linkage from two Tables and a join condition.

PARAMETER DESCRIPTION
left

A Table of records, with at least a column 'record_id'.

TYPE: Table

right

A Table of records, with at least a column 'record_id'.

TYPE: Table

condition

A join condition, such as a boolean expression or an ibis expression. See mismo.join_condition for more details.

TYPE: Any

RETURNS DESCRIPTION
A Linkage object.

mismo.Linkage.from_parquets classmethod

from_parquets(
    directory: str | Path,
    /,
    *,
    backend: BaseBackend | None = None,
) -> Self

Create a Linkage by reading parquets from the given directory.

link_counts_chart() -> Chart

A side by side altair Chart of left.link_counts().chart() and right.link_counts().chart().

Number of           Left Table               Number of    Right Table
  Records                                      Records
        |    █                                       |    █
100,000 | █  █                                       |    █
        | █  █                                10,000 |    █
        | █  █  █                                    |    █
 10,000 | █  █  █                                    |    █  █
        | █  █  █                                    | █  █  █
        | █  █  █                              1,000 | █  █  █
  1,000 | █  █  █  █                                 | █  █  █
        | █  █  █  █  █  █                           | █  █  █
        | █  █  █  █  █  █  █                        | █  █  █  █
    100 | █  █  █  █  █  █  █  █  █              100 | █  █  █  █
        | 0  1  2  3  4 10 12 14 23                  | 0  1  2  3
        Number of Links                              Number of Links

mismo.Linkage.to_parquets

to_parquets(
    directory: str | Path, /, *, overwrite: bool = False
) -> None

Write left, right, and links to parquet files in the given directory.

mismo.LinkedTable

Bases: TableWrapper

A table of records that are linked to another table.

Each record here can be linked to 0-N records in the other_.

This acts just like an Ibis Table, but it has a few extra attributes and methods that make it more ergonomic to work with, eg to add data from the linked table.

links_: LinksTable

The table of links between this table and other.

Trailing underscore to avoid name conflicts with column names.

mismo.LinkedTable.other_ property

other_: Self

The other table that this table is linked to.

Trailing underscore to avoid name conflicts with column names.

link_counts() -> LinkCountsTable

Describes 'There are n_records in self that linked to n_links in `other'.

This is basically a histogram of self.with_n_links()

See Also

with_n_links

Examples:

>>> import ibis
>>> import mismo
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = mismo.Linkage(left=left, right=right, links=links)

There is 1 record in left (6) that didn't match any in right. There is 1 record in left (5) that matched 1 in right. There is 1 record in left (4) that matched 2 in right.

>>> linkage.left.link_counts().order_by("n_links")
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ n_links ┃ n_records ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ int64   │ int64     │
├─────────┼───────────┤
│       0 │         1 │
│       1 │         1 │
│       2 │         1 │
└─────────┴───────────┘

All 3 records in right matched 1 in left.

>>> linkage.right.link_counts()
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ n_links ┃ n_records ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ int64   │ int64     │
├─────────┼───────────┤
│       1 │         3 │
└─────────┴───────────┘

mismo.LinkedTable.make_pair classmethod

make_pair(
    *, left: Table, right: Table, links: Table
) -> tuple[Self, Self]

Create a pair of LinkedTables from left, right, and links.

This basically just wraps the logic to make it so that the the _l and _r suffixes in the links table are consistent.

mismo.LinkedTable.with_linked_values

with_linked_values(
    *values: Deferred | Callable[[Table], Value] | None,
    default_agg: Deferred = collect(),
    **named_values: Deferred
    | Callable[[Table], Value]
    | None,
) -> Self

This table, with values from the linked records.

See the examples below, this is easy to understand with examples, but hard to describe.

PARAMETER DESCRIPTION
values

unnamed values

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: ()

named_values

named values

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: {}

RETURNS DESCRIPTION
A new LinkedTable with new columns of values from the linked records.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> this = ibis.memtable({"record_id": [40, 50, 60], "x": ["a", "b", "c"]})
>>> other = ibis.memtable({"record_id": [41, 42, 51], "y": [4.1, 4.2, 9.0]})
>>> links = ibis.memtable(
...     {"record_id_l": [40, 40, 50], "record_id_r": [41, 42, 51]}
... )
>>> lt = LinkedTable(this, other=other, links=links)
>>> lt
LinkedTable<3 records, 3 links>
┏━━━━━━━━━━━┳━━━━━━━━┓
┃ record_id ┃ x      ┃
┡━━━━━━━━━━━╇━━━━━━━━┩
│ int64     │ string │
├───────────┼────────┤
│        40 │ a      │
│        50 │ b      │
│        60 │ c      │
└───────────┴────────┘

Select exactly which values you want from other Since every record in self can be linked to 0-N records in other, we need a way of aggregating the 0-N linked values in other to a single value. By default, if the given expression is not an aggregate, we default to .collect() to combine them all into an array. If the given expression is already an aggregate, then it is kept as-is.

This uses the same semantics as ibis.Table.select(*values, **named_values).

>>> lt.with_linked_values(
...     "y",
...     ibis._.y.max().name("y_max"),
...     others="record_id",
...     other=ibis._.record_id.arbitrary(),
... )
LinkedTable<3 records, 3 links>
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ record_id ┃ x      ┃ y                    ┃ y_max   ┃ others               ┃ other ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ int64     │ string │ array<float64>       │ float64 │ array<int64>         │ int64 │
├───────────┼────────┼──────────────────────┼─────────┼──────────────────────┼───────┤
│        40 │ a      │ [4.1, 4.2]           │     4.2 │ [41, 42]             │    41 │
│        50 │ b      │ [9.0]                │     9.0 │ [51]                 │    51 │
│        60 │ c      │ NULL                 │    NULL │ NULL                 │  NULL │
└───────────┴────────┴──────────────────────┴─────────┴──────────────────────┴───────┘

One common use case for this is during the task of a lookup. I want to see the record id of the linked record, exluding when there are no links and when there are multiple (ie ambiguous) links.

>>> lt.with_n_links().with_linked_values(
...     ibis._.record_id.arbitrary().name("record_id_other"),
... ).filter(ibis._.n_links == 1)
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ x      ┃ n_links ┃ record_id_other ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ int64     │ string │ int64   │ int64           │
├───────────┼────────┼─────────┼─────────────────┤
│        50 │ b      │       1 │              51 │
└───────────┴────────┴─────────┴─────────────────┘

If no values are provided, we will by default add a column named other with all the values from other packed into an array of structs.

>>> lt.with_linked_values()
LinkedTable<3 records, 3 links>
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ x      ┃ other                                                      ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64     │ string │ array<struct<record_id: int64, y: float64>>                │
├───────────┼────────┼────────────────────────────────────────────────────────────┤
│        40 │ a      │ [{'record_id': 41, 'y': 4.1}, {'record_id': 42, 'y': 4.2}] │
│        50 │ b      │ [{'record_id': 51, 'y': 9.0}]                              │
│        60 │ c      │ NULL                                                       │
└───────────┴────────┴────────────────────────────────────────────────────────────┘
with_n_links(*, name: str = 'n_links') -> Self

Add a column to this table with the number of links each record has.

PARAMETER DESCRIPTION
name

The name of the new column.

TYPE: str DEFAULT: 'n_links'

RETURNS DESCRIPTION
A new LinkedTable with the new column.

Examples:

>>> import ibis
>>> import mismo
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = mismo.Linkage(left=left, right=right, links=links)
>>> linkage.left.with_n_links().order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ record_id ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64     │ int64   │
├───────────┼─────────┤
│         4 │       2 │
│         5 │       1 │
│         6 │       0 │
└───────────┴─────────┘
>>> linkage.right.with_n_links(name="link_count").order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ record_id ┃ link_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int64     │ int64      │
├───────────┼────────────┤
│         7 │          1 │
│         8 │          1 │
│         9 │          1 │
└───────────┴────────────┘

mismo.LinksTable

Bases: TableWrapper

A table of links between two tables.

This acts like an ibis table, guaranteed to have at least the columns record_id_l and record_id_r. It may have more columns, such as address_match_level, that describe the relationship between two records.

In addition to the columns, this table has two properties, left_ and right_, each of which is a LinkedTable object, which is a wrapper around the left and right tables respectively.

mismo.LinksTable.left property

The left table.

mismo.LinksTable.record_id_l instance-attribute

record_id_l: Column

The record_id of the left table.

mismo.LinksTable.record_id_r instance-attribute

record_id_r: Column

The record_id of the right table.

mismo.LinksTable.right property

right: LinkedTable

The right table.

mismo.LinksTable.cache

cache() -> LinksTable

Cache the links table.

mismo.LinksTable.with_both

with_both() -> LinksTable

Add all columns from left and right with suffixes _l and _r

mismo.LinksTable.with_left

with_left(
    *values: Deferred | Callable[[Table], Value] | None,
    **named_values: Deferred
    | Callable[[Table], Value]
    | None,
) -> LinksTable

Add columns from the left table to this table of links.

This allows you to add specific columns from the left table, renaming or modifying them as needed, following the ibis.Table pattern of my_table.select("my_col", new_col=my_table.foo), except here we choose from the left table.

PARAMETER DESCRIPTION
values

The columns to add from the left table. Support string names, Deferreds, etc, just like ibis.Table.select.

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: ()

named_values

Like values, but with names, just like ibis.Table.select.

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: {}

Examples:

>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [1, 2, 3], "address": ["a", "b", "c"]})
>>> right = ibis.memtable({"record_id": [8, 9], "address": ["x", "y"]})
>>> links_raw = ibis.memtable({"record_id_l": [1, 3], "record_id_r": [8, 9]})
>>> links = LinksTable(links_raw, left=left, right=right)
>>> links.with_left(
...     "address",
...     ibis._.address.upper().name("address_upper"),
...     left_address=ibis._.address,
... )
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ address ┃ address_upper ┃ left_address ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ int64       │ int64       │ string  │ string        │ string       │
├─────────────┼─────────────┼─────────┼───────────────┼──────────────┤
│           1 │           8 │ a       │ A             │ a            │
│           3 │           9 │ c       │ C             │ c            │
└─────────────┴─────────────┴─────────┴───────────────┴──────────────┘

mismo.LinksTable.with_right

with_right(
    *values: Deferred | Callable[[Table], Value] | None,
    **named_values: Deferred
    | Callable[[Table], Value]
    | None,
) -> LinksTable

Add columns from the right table to this table of links.

This allows you to add specific columns from the right table, renaming or modifying them as needed, following the ibis.Table pattern of my_table.select("my_col", new_col=my_table.foo), except here we choose from the right table.

PARAMETER DESCRIPTION
values

The columns to add from the right table. Support string names, Deferreds, etc, just like ibis.Table.select.

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: ()

named_values

Like values, but with names, just like ibis.Table.select.

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: {}

Examples:

>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [1, 2, 3], "address": ["a", "b", "c"]})
>>> right = ibis.memtable({"record_id": [8, 9], "address": ["x", "y"]})
>>> links_raw = ibis.memtable({"record_id_l": [1, 3], "record_id_r": [8, 9]})
>>> links = LinksTable(links_raw, left=left, right=right)
>>> links.with_right(
...     "address",
...     ibis._.address.upper().name("address_upper"),
...     right_address=ibis._.address,
... )
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ address ┃ address_upper ┃ right_address ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ string  │ string        │ string        │
├─────────────┼─────────────┼─────────┼───────────────┼───────────────┤
│           1 │           8 │ x       │ X             │ x             │
│           3 │           9 │ y       │ Y             │ y             │
└─────────────┴─────────────┴─────────┴───────────────┴───────────────┘

mismo.CountsTable

Bases: TableWrapper

A table with at least an Integer column named n.

There will also be variable number of other columns that act as identifiers.

You won't create this directly, it will be returned to you from eg KeyLinker.key_counts_left, KeyLinker.key_counts_right, or KeyLinker.pair_counts.

mismo.CountsTable.n instance-attribute

n: IntegerColumn

The column containing the count.

mismo.CountsTable.n_total cached

n_total() -> int

n.sum().fill_null(0), just here for convenience.

mismo.UnionTable

Bases: TableWrapper

A Table whose rows are the non-unique union of the rows from all sub-Tables.

mismo.UnionTable.tables property

tables: tuple[Table, ...]

The tuple of underlying ibis Tables.

mismo.UnionTable.__init__

__init__(tables: Iterable[Table]) -> None

Create a UnionTable from an iterable of ibis Tables.

Linkers

Utilities and classes for the blocking phase of record linkage, where we choose pairs of records to compare.

Without blocking, we would have to compare N*M records, which becomes intractable for datasets much larger than a few thousand.

mismo.Linker

Bases: Protocol

A Protocol that takes two tables of records and produces a Linkage.

mismo.Linker.__call__

__call__(left: Table, right: Table) -> Linkage

Given two tables, return a Linkage.

mismo.FullLinker

Bases: Linker

A Linker that yields all possible pairs.

This will be N x M pairs for linking tasks, and N x (M-1) pairs for deduplication tasks.

mismo.EmptyLinker

Bases: Linker

A Linker that yields no pairs.

mismo.JoinLinker

Bases: Linker

A Linker based on a join condition.

mismo.JoinLinker.__call__

__call__(left: Table, right: Table) -> Linkage

Create a linkage from the join condition.

mismo.JoinLinker.__init__

__init__(
    condition: Callable[[Table, Table], BooleanValue],
    *,
    task: Literal["dedupe", "link"] | None = None,
    on_slow: Literal["error", "warn", "ignore"] = "error",
) -> None

Create from a join condition.

PARAMETER DESCRIPTION
condition

The join condition.

TYPE: Callable[[Table, Table], BooleanValue]

task

The task to perform. If None, the task will be inferred as "dedupe" if the two tables passed to call are the same, otherwise it will be inferred as "link".

TYPE: Literal['dedupe', 'link'] | None DEFAULT: None

on_slow

What to do if the join is slow. See mismo.joins.check_join_algorithm.

TYPE: Literal['error', 'warn', 'ignore'] DEFAULT: 'error'

mismo.JoinLinker.__join_condition__

__join_condition__(
    left: Table, right: Table
) -> BooleanValue

Create a join condition.

mismo.KeyLinker

Bases: Linker

A Linker that links records wherever they share a key, eg "emails match."

This is one of the most basic blocking rules, used very often in record linkage. This is what is used in splink.

Examples:

>>> import ibis
>>> from ibis import _
>>> import mismo
>>> ibis.options.interactive = True
>>> t = mismo.playdata.load_patents()["record_id", "name", "latitude"]
>>> t.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ name                         ┃ latitude ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ int64     │ string                       │ float64  │
├───────────┼──────────────────────────────┼──────────┤
│      2909 │ * AGILENT TECHNOLOGIES, INC. │     0.00 │
│      3574 │ * AKZO NOBEL N.V.            │     0.00 │
│      3575 │ * AKZO NOBEL NV              │     0.00 │
│      3779 │ * ALCATEL N.V.               │    52.35 │
│      3780 │ * ALCATEL N.V.               │    52.35 │
└───────────┴──────────────────────────────┴──────────┘

Block the table with itself wherever the names match:

>>> linker = mismo.KeyLinker("name")
>>> linker(t, t).links.order_by(
...     "record_id_l", "record_id_r"
... ).head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l         ┃ name_r         ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string         │ string         │
├─────────────┼─────────────┼────────────┼────────────┼────────────────┼────────────────┤
│        3779 │        3780 │      52.35 │  52.350000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│        3779 │        3782 │      52.35 │   0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│        3780 │        3782 │      52.35 │   0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│       25388 │     7651559 │       0.00 │  50.966667 │ DSM N.V.       │ DSM N.V.       │
│       25388 │     7651560 │       0.00 │  52.500000 │ DSM N.V.       │ DSM N.V.       │
└─────────────┴─────────────┴────────────┴────────────┴────────────────┴────────────────┘

Arbitrary blocking keys are supported. For example, block the table wherever

  • the first 5 characters of the name in uppercase, are the same AND
  • the latitudes, rounded to 1 decimal place, are the same
>>> linker = mismo.KeyLinker((_["name"][:5].upper(), _.latitude.round(1)))
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l              ┃ name_r              ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string              │ string              │
├─────────────┼─────────────┼────────────┼────────────┼─────────────────────┼─────────────────────┤
│        3574 │        3575 │       0.00 │       0.00 │ * AKZO NOBEL N.V.   │ * AKZO NOBEL NV     │
│        3779 │        3780 │      52.35 │      52.35 │ * ALCATEL N.V.      │ * ALCATEL N.V.      │
│       15041 │       15042 │       0.00 │       0.00 │ * CANON EUROPA N.V  │ * CANON EUROPA N.V. │
│       15041 │       15043 │       0.00 │       0.00 │ * CANON EUROPA N.V  │ * CANON EUROPA NV   │
│       15042 │       15043 │       0.00 │       0.00 │ * CANON EUROPA N.V. │ * CANON EUROPA NV   │
└─────────────┴─────────────┴────────────┴────────────┴─────────────────────┴─────────────────────┘

We can even block on arrays! For example, first let's split each name into significant tokens:

>>> tokens = _.name.upper().split(" ").filter(lambda x: x.length() > 4)
>>> t.select(tokens.name("tokens"))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ tokens                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string>                │
├──────────────────────────────┤
│ ['AGILENT', 'TECHNOLOGIES,'] │
│ ['NOBEL']                    │
│ ['NOBEL']                    │
│ ['ALCATEL']                  │
│ ['ALCATEL']                  │
│ ['ALCATEL']                  │
│ ['CANON', 'EUROPA']          │
│ ['CANON', 'EUROPA']          │
│ ['CANON', 'EUROPA']          │
│ []                           │
│ …                            │
└──────────────────────────────┘

Now, block the tables together wherever two records share a token. Note that this blocked * SCHLUMBERGER LIMITED with * SCHLUMBERGER TECHNOLOGY BV. because they both share the SCHLUMBERGER token.

>>> linker = mismo.KeyLinker(tokens.unnest())
>>> linker(t, t).links.filter(_.name_l != _.name_r).order_by(
...     "record_id_l", "record_id_r"
... ).head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l                                                     ┃ name_r                                                     ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ string                                                     │ string                                                     │
├─────────────┼─────────────┼────────────┼────────────┼────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│        2909 │    13390969 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Inc. Netherlands B.V  │
│        2909 │    13390970 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Inc. Netherlands B.V. │
│        2909 │    13391015 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherland B.V.       │
│        2909 │    13391055 │        0.0 │      52.50 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherlands, B.V.     │
│        2909 │    13391056 │        0.0 │      52.35 │ * AGILENT TECHNOLOGIES, INC.                               │ Hitachi Global Storage Technologies, Netherlands, B.V.     │
└─────────────┴─────────────┴────────────┴────────────┴────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘

mismo.KeyLinker.__call__

__call__(left: Table, right: Table) -> Linkage

The linkage between the two tables.

mismo.KeyLinker.__init__

__init__(
    keys: str
    | Value
    | Deferred
    | Callable[
        [Table, Table],
        tuple[Value | Column, Value | Column],
    ]
    | Iterable[
        str
        | Value
        | Deferred
        | Callable[[Table], Value | Column | str | Deferred]
        | tuple[
            str
            | Deferred
            | Callable[
                [Table], Value | Column | str | Deferred
            ],
            str
            | Deferred
            | Callable[
                [Table], Value | Column | str | Deferred
            ],
        ]
        | Callable[
            [Table, Table],
            tuple[Value | Column, Value | Column],
        ]
    ],
    *,
    max_pairs: int | None = None,
    task: Literal["dedupe", "lookup", "link"] | None = None,
) -> None

Create a KeyBlocker.

PARAMETER DESCRIPTION
keys

The keys to block on. The tables will be blocked together wherever they share ALL the keys. Each key can be any of the following:

  • A string, which is interpreted as the name of a column in both tables. eg "price" is equivalent to left.price == right.price
  • A Deferred, which is used to reference a column in a table. eg _.price.fill_null(0) is equivalent to left.price.fill_null(0) == right.price.fill_null(0)
  • A Callable that takes a table and returns a Column.
  • A 2-tuple of the above, where the first element describes the key in the left table and the second element describes the key in the right table. eg ("first_name", _.GivenName.upper()") is equivalent to left.first_name == right.GivenName.upper() This is useful when the keys have different names in the two tables.
  • A callable that takes the left and right tables and returns a tuple of columns. Left and right will be joined where the columns are equal.

TYPE: str | Value | Deferred | Callable[[Table, Table], tuple[Value | Column, Value | Column]] | Iterable[str | Value | Deferred | Callable[[Table], Value | Column | str | Deferred] | tuple[str | Deferred | Callable[[Table], Value | Column | str | Deferred], str | Deferred | Callable[[Table], Value | Column | str | Deferred]] | Callable[[Table, Table], tuple[Value | Column, Value | Column]]]

max_pairs

The maximum number of pairs to generate for each key. This is to reduce the impact of very common keys. For example, if you are linking people, the name "John Smith" might be very common, appearing 5000 times in both left and right. This name alone would generate 5000 * 5000 = 25 million pairs, which might be too computationally expensive. If you set max_pairs=1000, then any key that generates more than 1000 pairs will be ignored.

TYPE: int | None DEFAULT: None

task

The task to count pairs for.

  • "link": each key results in n_left * n_right pairs
  • "dedupe": each key results in n_left * (n_right - 1) / 2 pairs since we will only generate pair (A, B), not also (B, A).
  • None: inferred from the input tables: if left is right, then "dedupe", otherwise "link".

TYPE: Literal['dedupe', 'lookup', 'link'] | None DEFAULT: None

mismo.KeyLinker.pair_counts

pair_counts(left: Table, right: Table) -> PairCountsTable

Count the number of pairs that would be generated by each key.

If you were to use this blocker to join left with right, how many pairs would be generated for each key?

This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.

On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.

PARAMETER DESCRIPTION
left

The left table.

TYPE: Table

right

The right table.

TYPE: Table

RETURNS DESCRIPTION
CountsTable

Will have column(s) for each key and a column n with the count.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
...     (1, "a", 1),
...     (2, "b", 1),
...     (3, "b", 1),
...     (4, "c", 3),
...     (5, "b", 2),
...     (6, "c", 3),
...     (7, None, 4),
...     (8, "c", 3),
... ]
>>> t = ibis.memtable(
...     records, schema={"record_id": int, "letter": str, "num": int}
... ).cache()
>>> t
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ record_id ┃ letter ┃ num   ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ int64     │ string │ int64 │
├───────────┼────────┼───────┤
│         1 │ a      │     1 │
│         2 │ b      │     1 │
│         3 │ b      │     1 │
│         4 │ c      │     3 │
│         5 │ b      │     2 │
│         6 │ c      │     3 │
│         7 │ NULL   │     4 │
│         8 │ c      │     3 │
└───────────┴────────┴───────┘

If we joined t with itself using this blocker in a dedupe task, we would end up with

  • 3 pairs in the (c, 3) block due to pairs (3, 6), (3, 8), and (6, 8)
  • 1 pairs in the (b, 1) block due to pairs (1, 2)
  • 0 pairs in the (a, 1) block due to record 0 not getting blocked with itself
  • 0 pairs in the (b, 2) block due to record 4 not getting blocked with itself
>>> linker = mismo.KeyLinker(["letter", "num"], task="dedupe")
>>> counts = linker.pair_counts(t, t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     0 │
│ b      │     1 │     1 │
│ b      │     2 │     0 │
│ c      │     3 │     3 │
└────────┴───────┴───────┘

If we joined t with a copy of itself using this linker in a link task, we would end up with

  • 9 pairs in the (c, 3) block due to pairs (3,3), (3, 6), (3, 8), (6, 3), (6, 6), (6, 8), (8, 3), (8, 6), and (8, 8)
  • 4 pairs in the (b, 1) block due to pairs (1, 1), (1, 2), (2, 1), and (2, 2)
  • 1 pairs in the (a, 1) block due to pair (0, 0)
  • 1 pairs in the (b, 2) block due to pair (4, 4)
>>> linker = mismo.KeyLinker(["letter", "num"], task="link")
>>> counts = linker.pair_counts(t, t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num   ┃ n     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a      │     1 │     1 │
│ b      │     1 │     4 │
│ b      │     2 │     1 │
│ c      │     3 │     9 │
└────────┴───────┴───────┘

The returned CountsTable is a subclass of an Ibis Table with a special n_total method for convenience:

>>> counts.n_total()
15
>>> isinstance(counts, ibis.Table)
True

mismo.KeyLinker.too_common_of_records

too_common_of_records(
    left: Table, right: Table
) -> tuple[Table, Table]

The prefilter clause that removes keys that would generate too many pairs.

mismo.KeyLinker.key_counts_left

key_counts_left(left: Table) -> KeyCountsTable

mismo.KeyLinker.key_counts_right

key_counts_right(right: Table) -> KeyCountsTable

mismo.OrLinker

Bases: Linker

A Linker that is the logical OR of multiple [mismo.HasJoinCondition].

Physically, this is implemented as follows: - remove any condition overlap using [mismo.joins.remove_condition_overlap] - create a LinksTable for each join condition (which should be fast) - Union the LinksTables into a single LinksTable

mismo.OrLinker.join_conditions property

join_conditions: dict[str, HasJoinCondition]

The tuple of underling HasJoinCondition objects.

sample_all_links(
    left: Table,
    right: Table,
    *,
    max_pairs: int | None = None,
) -> LinksTable

Samples up to max_pairs from all possible pairs of records.

PARAMETER DESCRIPTION
left

The left table.

TYPE: Table

right

The right table.

TYPE: Table

max_pairs

The maximum number of pairs to sample. If None, all possible pairs are sampled.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
A [LinksTable][mismo.LinksTable] with just record_id_l and record_id_r.
All pairs will be unique.

Examples:

>>> import ibis
>>> import mismo
>>> ibis.options.interactive = True
>>> t, _labels = mismo.playdata.load_febrl1()
>>> t.head(5)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id    ┃ given_name ┃ surname    ┃ street_number ┃ address_1         ┃ … ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string       │ string     │ string     │ string        │ string            │ … │
├──────────────┼────────────┼────────────┼───────────────┼───────────────────┼───┤
│ rec-0-dup-0  │ thomas     │ rokobaro   │ 12            │ herschell circuit │ … │
│ rec-0-org    │ flynn      │ rokobaro   │ 12            │ herschell circuit │ … │
│ rec-1-dup-0  │ karli      │ alderson   │ 144           │ nulsen circuit    │ … │
│ rec-1-org    │ karli      │ alderson   │ 144           │ nulsen circuit    │ … │
│ rec-10-dup-0 │ kayla      │ harrington │ NULL          │ maltby circuit    │ … │
└──────────────┴────────────┴────────────┴───────────────┴───────────────────┴───┘
>>> mismo.linkage.sample_all_links(t, t, max_pairs=7)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ record_id_l   ┃ record_id_r   ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ string        │ string        │
├───────────────┼───────────────┤
│ rec-226-dup-0 │ rec-396-org   │
│ rec-232-dup-0 │ rec-402-dup-0 │
│ rec-259-dup-0 │ rec-61-org    │
│ rec-293-dup-0 │ rec-41-dup-0  │
│ rec-448-org   │ rec-25-org    │
└───────────────┴───────────────┘

Comparing tables

mismo.Updates

Bases: TableWrapper

A Table representing how individual rows were updated.

This only represents differences in rows that exist both before and after the changes. To represent a general difference between two tables, eg with insertions and deletions, use Diff.

This represents how each column has changed between two tables. If a column only has a 'before' field, it means this column was removed. If a column only has an 'after' field, it means this column was added. If a column has both 'before' and 'after' fields, it means this column was present in both tables.

mismo.Updates.check_schemas property

check_schemas: Literal['exactly', 'names', 'lax']

The schema checking mode used for this Updates object.

mismo.Updates.filters class-attribute instance-attribute

filters = Filters

A set of filters for convenience.

Examples:

>>> u = Updates.from_tables(before, after, join_on="id")
>>> u.filter(u.filters.all_different(["name", "age"]))

mismo.Updates.__init__

__init__(
    diff_table: Table,
    /,
    *,
    check_schemas: Literal[
        "exactly", "names", "lax"
    ] = "exactly",
) -> None

Create an Updates object from a table of differences.

PARAMETER DESCRIPTION
diff_table

A table with columns that are structs with at least 'before' and 'after' fields.

TYPE: Table

check_schemas

How to check the schemas of the before and after values. - "exactly": both before and after must have the same columns and types. - "names": both before and after must have the same columns, but types can differ. - "lax": no schema checking, just that there is at least one of 'before' or 'after' in each column.

TYPE: Literal['exactly', 'names', 'lax'] DEFAULT: 'exactly'

mismo.Updates.after

after() -> Table

The table after the changes.

mismo.Updates.after_values

after_values() -> dict[str, Column]

The values after the changes.

mismo.Updates.apply_to

apply_to(
    t: Table, /, *, defaults: None | Any = NOT_SET
) -> Table

Return the input table with these updates applied to it.

PARAMETER DESCRIPTION
t

The table to apply the updates to.

TYPE: Table

defaults

If the after table has more columns than the before table, you must provide defaults. This is because there might be some rows in t that are not touched by this Updates. We need to know what to put in those columns for the untouched rows. This accepts anything that ibis.Table.mutate() accepts.

If None, as convenience, we will use null as the default for all new columns. If _util.NOT_SET, an error is raised if the after table has more columns than the before table.

TYPE: None | Any DEFAULT: NOT_SET

mismo.Updates.before

before() -> Table

The table before the changes.

mismo.Updates.before_values

before_values() -> dict[str, Column]

The values before the changes.

mismo.Updates.from_before_after classmethod

from_before_after(
    before: Mapping[str, Value] | StructValue,
    after: Mapping[str, Value] | StructValue,
    *,
    check_schemas: Literal[
        "exactly", "names", "lax"
    ] = "exactly",
) -> Updates

Create an Updates object from before and after values.

PARAMETER DESCRIPTION
before

The values before the changes.

TYPE: Mapping[str, Value] | StructValue

after

The values after the changes.

TYPE: Mapping[str, Value] | StructValue

check_schemas

How to check the schemas of the before and after values. - "exactly": both before and after must have the same columns and types. - "names": both before and after must have the same columns, but types can differ. - "lax": no schema checking, just that there is at least one of 'before' or 'after' in each column.

TYPE: Literal['exactly', 'names', 'lax'] DEFAULT: 'exactly'

RETURNS DESCRIPTION
Updates

An Updates object representing the changes.

mismo.Updates.from_tables classmethod

from_tables(
    before: Table,
    after: Table,
    *,
    join_on: str | Literal[False],
    check_schemas: Literal[
        "exactly", "names", "lax"
    ] = "exactly",
) -> Updates

Create from two different tables by joining them on a key.

Note that this results in only the rows that are present in both tables, due to the inner join on the key. Insertions and deletions should be handled separately.

mismo.Updates.is_changed

is_changed(column: str) -> BooleanColumn

Is column.before different from column.after? Never returns NULL.

mismo.Diff

A set of insertions, updates, and deletions between two tables.

This can only semantically represent 1-1 relationships, eg a row in the before table corresponds to only 0 or 1 row in the after table, and vice versa. eg "this row changed in these ways between these two tables".

To represent more general 0-N relationships, use a Linkage. eg many rows in a "dirty" dataset are linked to a single row in a "clean" dataset. Say you have a clean database of records. You just got a new batch of dirty data that might contain duplicates. Each record in the clean database might match multiple records in the dirty data. This makes it difficult to use a Diff object, because each clean record can't be paired up nicely with a single dirty record. A Linkage object is more appropriate in this case.

This is able to represent a difference between two tables with different schemas, eg if a column is added or removed.

mismo.Diff.stats property

stats: DiffStats

Statistics about this Diff.

mismo.Diff.after

after() -> Table

The table after the changes.

mismo.Diff.before

before() -> Table

The table before the changes.

mismo.Diff.cache

cache() -> Diff

Return a new Diff with all tables cached.

mismo.Diff.chart

chart() -> Chart

Create a chart that shows the flow of rows through the diff.

This is just a convenience method that calls self.stats.chart().

Rows
800,000 |                                 ▓▓  Inserted (50,000)
        |                                 ▒▒  Deleted (100,000)
700,000 |                                 ░░  Updated (200,000)
        |                                 ██  Unchanged (300,000)
600,000 |      ▒▒▒▒
        |      ▒▒▒▒             ▓▓▓▓
500,000 |      ░░░░             ░░░░
        |      ░░░░             ░░░░
400,000 |      ░░░░             ░░░░
        |      ░░░░             ░░░░
300,000 |      ████             ████
        |      ████             ████
200,000 |      ████             ████
        |      ████             ████
100,000 |      ████             ████
        |      ████             ████
      0 | Before (600,000)  After (550,000)

mismo.Diff.deletions

deletions() -> Table

Rows that were in before but not in after.

This has the same schema as before.

mismo.Diff.from_before_after classmethod

from_before_after(
    before: Table,
    after: Table,
    *,
    join_on: str | Literal[False],
) -> Self

Create from a before and after table.

mismo.Diff.from_deltas classmethod

from_deltas(
    *,
    before: Table,
    insertions: Table | None = None,
    updates: Updates | None = None,
    deletions: Table | None = None,
    updates_defaults: None | Any = NOT_SET,
) -> Self

Create from a starting point and a set of transformations.

PARAMETER DESCRIPTION
before

The table before the changes.

TYPE: Table

insertions

Rows that were in after but not in before. If None, we assume there are no insertions.

TYPE: Table | None DEFAULT: None

updates

Rows that were changed between before and after. If None, we assume there are no updates.

TYPE: Updates | None DEFAULT: None

deletions

Rows that were in before but not in after. If None, we assume there are no deletions.

TYPE: Table | None DEFAULT: None

RETURNS DESCRIPTION
Diff

mismo.Diff.from_parquets classmethod

from_parquets(
    directory: str | Path,
    /,
    *,
    backend: BaseBackend | None = None,
) -> Self

Create a Diff by reading parquets from the given directory.

mismo.Diff.insertions

insertions() -> Table

Rows that were in after but not in before.

This has the same schema as after.

mismo.Diff.to_parquets

to_parquets(
    directory: str | Path, /, *, overwrite: bool = False
) -> None

Write the tables in the changes to parquet files.

mismo.Diff.unchanged

unchanged() -> Table

Rows that were unchanged between before and after.

mismo.Diff.updates

updates() -> Updates

Rows that were changed between before and after.

self.updates().before() and self.updates().after() have the same schemas as before and after, respectively.

mismo.DiffStats

Summary statistics about a Diff, such as number of insertions, deletions, etc.

mismo.DiffStats.__init__

__init__(diff: Diff) -> None

Create from a Diff.

mismo.DiffStats.chart

chart() -> Chart

Create a chart that shows the flow of rows through the diff.

Rows
800,000 |                                 ▓▓  Inserted (50,000)
        |                                 ▒▒  Deleted (100,000)
700,000 |                                 ░░  Updated (200,000)
        |                                 ██  Unchanged (300,000)
600,000 |      ▒▒▒▒
        |      ▒▒▒▒             ▓▓▓▓
500,000 |      ░░░░             ░░░░
        |      ░░░░             ░░░░
400,000 |      ░░░░             ░░░░
        |      ░░░░             ░░░░
300,000 |      ████             ████
        |      ████             ████
200,000 |      ████             ████
        |      ████             ████
100,000 |      ████             ████
        |      ████             ████
      0 | Before (600,000)  After (550,000)

mismo.DiffStats.n_after cached

n_after() -> int

Number of rows in after.

mismo.DiffStats.n_before cached

n_before() -> int

Number of rows in before.

mismo.DiffStats.n_deletions cached

n_deletions() -> int

Number of rows that were in before but not in after.

mismo.DiffStats.n_insertions cached

n_insertions() -> int

Number of rows that were in after but not in before.

mismo.DiffStats.n_unchanged cached

n_unchanged() -> int

Number of rows that were unchanged between before and after.

mismo.DiffStats.n_updates cached

n_updates() -> int

Number of rows that were changed between before and after.

mismo.LinkCountsTable

Bases: TableWrapper

A table representing the number of records binned by number of links.

eg "There were 700 records with 0 links, 300 with 1 link, 20 with 2 links, ..."

n_links: IntegerColumn

The number of links.

mismo.LinkCountsTable.n_records instance-attribute

n_records: IntegerColumn

The number of records.

mismo.LinkCountsTable.__init__

__init__(t: Table) -> None

Create from an ibis table with exactly columns 'n_records' and 'n_links'.

PARAMETER DESCRIPTION
t

The table with exactly columns 'n_records' and 'n_links'.

TYPE: Table

mismo.LinkCountsTable.chart

chart() -> Chart

A bar chart of the number of records by the number of links.

                 Number of Records
Number of          By Link Count
  Records
        |    █
100,000 | █  █
        | █  █
        | █  █  █
 10,000 | █  █  █
        | █  █  █
        | █  █  █
  1,000 | █  █  █  █
        | █  █  █  █  █  █
        | █  █  █  █  █  █  █
    100 | █  █  █  █  █  █  █  █  █
        | 0  1  2  3  4 10 12 14 23
        Number of Links