Clustering API
After we have compared pairs of records, we need to resolve the resulting links into groups of records that all refer to the same entity. This is done with various graph algorithms, which are implemented in this module.
Algorithms
mismo.cluster.connected_components(*, links: ir.Table, records: ir.Column | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table] = None, max_iter: int | None = None, label_as: str = 'component') -> ir.Table | Datasets
Label records using connected components, based on the given links.
This uses an iterative algorithm whose runtime is linear in the diameter of the largest component (i.e. how many "hops" it takes to get from one end of a cluster to the other). This is usually acceptable for our use case, because we expect the components to be small.
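For intuition, here is a minimal pure-Python sketch of that iterative label-propagation idea; it is an illustration only, not mismo's actual implementation, which operates on Ibis tables:
>>> def propagate_min_labels(links):
...     # Start every record in its own component, labeled by its own id.
...     labels = {}
...     for l, r in links:
...         labels.setdefault(l, l)
...         labels.setdefault(r, r)
...     # Repeatedly let the smallest label flow across each link until
...     # nothing changes; the number of passes needed grows with the
...     # diameter of the largest component.
...     changed = True
...     while changed:
...         changed = False
...         for l, r in links:
...             smallest = min(labels[l], labels[r])
...             if labels[l] != smallest or labels[r] != smallest:
...                 labels[l] = labels[r] = smallest
...                 changed = True
...     return labels
>>> propagate_min_labels([("a", "x"), ("b", "x"), ("g", "h")])
{'a': 'a', 'x': 'a', 'b': 'a', 'g': 'g', 'h': 'g'}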
PARAMETER | DESCRIPTION
---|---
links | A table with the columns (record_id_l, record_id_r), corresponding to the links between records. TYPE: ir.Table
records | Table(s) of records with at least the column record_id. Note: if you supply multiple Tables, the record_ids must be the same type across all tables, and universally unique across all tables. TYPE: ir.Column, ir.Table, Iterable[ir.Table], or Mapping[str, ir.Table]; default None
max_iter | The maximum number of iterations to run. If None, run until convergence. TYPE: int or None
label_as | The name of the label column that will contain the component ID. TYPE: str

RETURNS | DESCRIPTION
---|---
result | If records is not given, a table mapping record_id to component label. If a single table of records is given, that table with the label column added. If multiple tables are given, a Datasets with each table labeled individually. TYPE: ir.Table or Datasets
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> from mismo.cluster import connected_components
>>> records1 = ibis.memtable(
... [
... ("a", 0),
... ("b", 1),
... ("c", 2),
... ("d", 3),
... ("g", 6),
... ],
... columns=["record_id", "other"],
... )
>>> records2 = ibis.memtable(
... [
... ("h", 7),
... ("x", 23),
... ("y", 24),
... ("z", 25),
... ],
... columns=["record_id", "other"],
... )
>>> links = ibis.memtable(
... [
... ("a", "x"),
... ("b", "x"),
... ("b", "y"),
... ("c", "y"),
... ("c", "z"),
... ("g", "h"),
... ],
... columns=["record_id_l", "record_id_r"],
... )
If you don't supply the records, then you just get a labeling map from record_id -> component. Note how only the record_ids that are present in links are returned, e.g. there is no record_id "d" present:
>>> connected_components(links=links).order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ int64 │
├───────────┼───────────┤
│ a │ 0 │
│ b │ 0 │
│ c │ 0 │
│ g │ 3 │
│ h │ 3 │
│ x │ 0 │
│ y │ 0 │
│ z │ 0 │
└───────────┴───────────┘
If you supply records, then the records are labeled with the component. We can also change the name of the column that contains the component:
>>> connected_components(records=records1, links=links, label_as="label").order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ record_id ┃ other ┃ label ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├───────────┼───────┼───────┤
│ a │ 0 │ 0 │
│ b │ 1 │ 0 │
│ c │ 2 │ 0 │
│ d │ 3 │ 4 │
│ g │ 6 │ 3 │
└───────────┴───────┴───────┘
You can supply multiple sets of records, which are coerced to a Datasets and returned as a Datasets, with each table of records labeled individually.
>>> a, b = connected_components(records=(records1, records2), links=links)
>>> a.order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ other ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ string │ int64 │ int64 │
├───────────┼───────┼───────────┤
│ a │ 0 │ 0 │
│ b │ 1 │ 0 │
│ c │ 2 │ 0 │
│ d │ 3 │ 4 │
│ g │ 6 │ 3 │
└───────────┴───────┴───────────┘
>>> b.order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ other ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ string │ int64 │ int64 │
├───────────┼───────┼───────────┤
│ h │ 7 │ 3 │
│ x │ 23 │ 0 │
│ y │ 24 │ 0 │
│ z │ 25 │ 0 │
└───────────┴───────┴───────────┘
mismo.cluster.degree(*, links: ir.Table, records: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table] | None = None) -> ir.Table | Datasets
Label records with their degree (number of links to other records).
This is the graph theory definition of degree, i.e. the number of edges coming into or out of a vertex. In this case, that is the number of links coming into or out of a record.
PARAMETER | DESCRIPTION
---|---
links | A table of edges with at least columns (record_id_l, record_id_r). TYPE: ir.Table
records | Table(s) of records with at least the column record_id. TYPE: ir.Table, Iterable[ir.Table], Mapping[str, ir.Table], or None
RETURNS | DESCRIPTION
---|---
result | If records is not given, a table mapping record_id to its degree. Otherwise, the given record table(s) labeled with their degree. TYPE: ir.Table or Datasets
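A minimal usage sketch, reusing the links and records1 tables from the connected_components examples above (outputs omitted, since the name of the degree column is not spelled out in the signature):
>>> from mismo.cluster import degree
>>> # With only links: a table mapping each linked record_id to its degree.
>>> degrees = degree(links=links)
>>> # With records: the records labeled with their degree.
>>> labeled = degree(records=records1, links=links)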
Evaluation
Utilities for assessing the quality of a linkage result.
mismo.cluster.mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, contingency: Any = None) -> float
Compute the mutual information between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.mutual_info_score for more information.
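For instance, here is a minimal sketch of scoring a predicted clustering against a ground truth; the same input shape applies to all of the metrics in this section:
>>> import ibis
>>> from mismo.cluster import mutual_info_score
>>> labels_true = ibis.memtable(
...     [("a", 0), ("b", 0), ("c", 1), ("d", 1)],
...     columns=["record_id", "label"],
... )
>>> labels_pred = ibis.memtable(
...     [("a", 0), ("b", 1), ("c", 1), ("d", 1)],
...     columns=["record_id", "label"],
... )
>>> score = mutual_info_score(labels_true, labels_pred)  # a plain Python float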
mismo.cluster.adjusted_mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, average_method: str = 'arithmetic') -> float
Adjusted Mutual Information between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.adjusted_mutual_info_score for more information.
mismo.cluster.normalized_mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, average_method: str = 'arithmetic') -> float
Compute the normalized mutual information between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.normalized_mutual_info_score for more information.
mismo.cluster.rand_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Compute the Rand Index between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.rand_score for more information.
mismo.cluster.adjusted_rand_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Adjusted Rand Index between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.adjusted_rand_score for more information.
mismo.cluster.fowlkes_mallows_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Measure the similarity of two clusterings of a set of points.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.fowlkes_mallows_score for more information.
mismo.cluster.homogeneity_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Homogeneity metric of a cluster labeling given a ground truth.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.homogeneity_score for more information.
mismo.cluster.completeness_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Compute completeness metric of a cluster labeling given a ground truth.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.completeness_score for more information.
mismo.cluster.v_measure_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
V-measure metric of a cluster labeling given a ground truth.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.v_measure_score for more information.
mismo.cluster.homogeneity_completeness_v_measure(labels_true: ir.Table, labels_pred: ir.Table, *, beta: float = 1.0) -> tuple[float, float, float]
Compute the homogeneity, completeness, and V-measure scores at once.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.homogeneity_completeness_v_measure for more information.
Plot
mismo.cluster.degree_dashboard(tables: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table)
Make a dashboard for exploring the degree (number of links) of records.
The "degree" of a record is the number of other records it is linked to.
Pass the entire dataset and the links between records, and use this to explore the distribution of degrees.
mismo.cluster.cluster_dashboard(ds: Datasets | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table) -> solara.Column
A Solara component that shows a cluster of records and links.
This shows ALL the supplied records and links, so be careful: you probably want to filter them down first. You can use the clusters_dashboard component for that.
This is like cytoscape_widget, but with a status bar that shows information about the selected node or edge.
PARAMETER | DESCRIPTION
---|---
ds | Table(s) of records with at least the column record_id. TYPE: Datasets, ir.Table, Iterable[ir.Table], or Mapping[str, ir.Table]
links | A table of edges with at least columns (record_id_l, record_id_r), and optionally other columns. TYPE: ir.Table
mismo.cluster.clusters_dashboard(tables: Datasets | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table) -> solara.Column
Make a dashboard for exploring different clusters of records.
Pass the entire dataset and the links between records, and use this to filter down to a particular cluster.
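For example, a sketch reusing the tables from the connected_components examples above; in a Jupyter notebook with solara installed, displaying the returned component renders the dashboard:
>>> from mismo.cluster import clusters_dashboard
>>> dashboard = clusters_dashboard(tables=(records1, records2), links=links)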