Clustering API

After we have compared pairs of records, we need to somehow resolve these links into groups of records that are all the same entity. This is done with various graph algorithms, which are implemented in this module.

Algorithms

mismo.cluster.connected_components(*, links: ir.Table, records: ir.Column | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table] = None, max_iter: int | None = None, label_as: str = 'component') -> ir.Table | Datasets

Label records using connected components, based on the given links.

This uses an iterative algorithm whose number of iterations grows linearly with the diameter of the largest component (i.e., how many "hops" it takes to get from one end of a cluster to the other). This is usually acceptable for our use case, because we expect the components to be small.
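To make the link between iteration count and component diameter concrete, here is a minimal pure-Python sketch of one common iterative approach, minimum-label propagation. It is an illustration only, not mismo's implementation: the function name `label_propagation_sketch` is hypothetical, it works on plain tuples rather than ibis tables, and it labels each component with its smallest record ID rather than with an int64.

```python
def label_propagation_sketch(links, max_iter=None):
    """Return ({record_id: label}, rounds) using min-label propagation.

    Hypothetical helper for illustration; not part of mismo.
    """
    # Every record that appears in a link starts as its own component.
    labels = {}
    for left, right in links:
        labels[left] = left
        labels[right] = right

    rounds = 0
    while max_iter is None or rounds < max_iter:
        # In each round, both ends of a link adopt the smaller of the two
        # labels they held at the end of the previous round.
        updated = dict(labels)
        for left, right in links:
            low = min(labels[left], labels[right])
            updated[left] = min(updated[left], low)
            updated[right] = min(updated[right], low)
        if updated == labels:
            break  # converged: no label changed in this round
        labels = updated
        rounds += 1
    return labels, rounds


links = [("a", "x"), ("b", "x"), ("b", "y"), ("c", "y"), ("c", "z"), ("g", "h")]
labels, rounds = label_propagation_sketch(links)
partial_labels, partial_rounds = label_propagation_sketch(links, max_iter=2)
```

In this sketch the largest component is the chain a, x, b, y, c, z, which is 5 hops long, so the label "a" needs 5 rounds to reach "z". Capping the rounds with `max_iter=2` stops early and leaves that component split across several labels.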

PARAMETER DESCRIPTION
links

A table with the columns (record_id_l, record_id_r), corresponding to the record_ids in records.

TYPE: Table

records

Table(s) of records with at least the column record_id, the column of record_ids itself, or None.

Note

If you supply multiple Tables, the record_ids must be the same type across all tables, and universally unique across all tables.

TYPE: Column | Table | Iterable[Table] | Mapping[str, Table] DEFAULT: None

max_iter

The maximum number of iterations to run. If None, run until convergence.

TYPE: int | None DEFAULT: None

label_as

The name of the label column that will contain the component ID.

TYPE: str DEFAULT: 'component'

RETURNS DESCRIPTION
result
  • If records is None, a Table will be returned with columns record_id and <label_as> of type int64 that maps record_id to component.
  • If records is a single Table, that table will be returned with a <label_as> column of type int64 added.
  • If records is an iterable/mapping of Tables, a Datasets will be returned, with a <label_as> column of type int64 added to each contained Table.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> from mismo.cluster import connected_components
>>> records1 = ibis.memtable(
...     [
...         ("a", 0),
...         ("b", 1),
...         ("c", 2),
...         ("d", 3),
...         ("g", 6),
...     ],
...     columns=["record_id", "other"],
... )
>>> records2 = ibis.memtable(
...     [
...         ("h", 7),
...         ("x", 23),
...         ("y", 24),
...         ("z", 25),
...     ],
...     columns=["record_id", "other"],
... )
>>> links = ibis.memtable(
...     [
...         ("a", "x"),
...         ("b", "x"),
...         ("b", "y"),
...         ("c", "y"),
...         ("c", "z"),
...         ("g", "h"),
...     ],
...     columns=["record_id_l", "record_id_r"],
... )

If you don't supply the records, then you just get a labeling map from record_id -> component. Note that only the record_ids that are present in links are returned; e.g., there is no record_id "d" present:

>>> connected_components(links=links).order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string    │ int64     │
├───────────┼───────────┤
│ a         │         0 │
│ b         │         0 │
│ c         │         0 │
│ g         │         3 │
│ h         │         3 │
│ x         │         0 │
│ y         │         0 │
│ z         │         0 │
└───────────┴───────────┘

If you supply records, then the records are labeled with the component. We can also change the name of the column that contains the component:

>>> connected_components(records=records1, links=links, label_as="label").order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ record_id ┃ other ┃ label ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string    │ int64 │ int64 │
├───────────┼───────┼───────┤
│ a         │     0 │     0 │
│ b         │     1 │     0 │
│ c         │     2 │     0 │
│ d         │     3 │     4 │
│ g         │     6 │     3 │
└───────────┴───────┴───────┘

You can supply multiple sets of records, which are coerced to a Datasets, and returned as a Datasets, with each table of records labeled individually.

>>> a, b = connected_components(records=(records1, records2), links=links)
>>> a.order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ other ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ string    │ int64 │ int64     │
├───────────┼───────┼───────────┤
│ a         │     0 │         0 │
│ b         │     1 │         0 │
│ c         │     2 │         0 │
│ d         │     3 │         4 │
│ g         │     6 │         3 │
└───────────┴───────┴───────────┘
>>> b.order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ other ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ string    │ int64 │ int64     │
├───────────┼───────┼───────────┤
│ h         │     7 │         3 │
│ x         │    23 │         0 │
│ y         │    24 │         0 │
│ z         │    25 │         0 │
└───────────┴───────┴───────────┘

mismo.cluster.degree(*, links: ir.Table, records: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table] | None = None) -> ir.Table | Datasets

Label records with their degree (number of links to other records).

This is the graph theory definition of degree, i.e. the number of edges coming into or out of a vertex. In this case, the number of links coming into or out of a record.

PARAMETER DESCRIPTION
links

A table of edges with at least columns (record_id_l, record_id_r).

TYPE: Table

records

Table(s) of records with at least the column record_id, or None.

TYPE: Table | Iterable[Table] | Mapping[str, Table] | None DEFAULT: None

RETURNS DESCRIPTION
result

  • If records is None, a Table will be returned with columns record_id and degree:uint64 that maps record_id to a degree.
  • If records is a single Table, that table will be returned with a degree:uint64 column added.
  • If records is an iterable/mapping of Tables, a Datasets will be returned, with a degree:uint64 column added to each contained Table.
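As a rough illustration of what "degree" means here, the following pure-Python sketch counts link endpoints per record. The name `degree_sketch` is hypothetical and the code does not call mismo. It also assumes that a record with no links gets degree 0; this page does not say whether mismo reports 0 or a null in that case.

```python
from collections import Counter


def degree_sketch(links, record_ids=None):
    """Map each record ID to the number of links that touch it.

    Hypothetical helper for illustration; not part of mismo.
    """
    counts = Counter()
    for left, right in links:
        # Each link contributes one to the degree of both of its endpoints.
        counts[left] += 1
        counts[right] += 1
    if record_ids is None:
        # Without records, only IDs that appear in the links are reported.
        return dict(counts)
    # Assumption: records that appear in no link are given degree 0.
    return {rid: counts.get(rid, 0) for rid in record_ids}


links = [("a", "x"), ("b", "x"), ("b", "y"), ("c", "y"), ("c", "z"), ("g", "h")]
from_links = degree_sketch(links)
for_records = degree_sketch(links, record_ids=["a", "b", "c", "d", "g"])
```

With the links from the connected_components example, "b" touches two links and "a" touches one, while "d" appears in no link at all.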

Evaluation

Utilities for assessing the quality of a linkage result.

mismo.cluster.mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, contingency: Any = None) -> float

Compute the mutual information between two clusterings.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.mutual_info_score for more information.
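For intuition about what the score measures, here is a minimal sketch of the standard mutual information formula over two labelings, using the natural logarithm. It works on plain dicts from record ID to label rather than on ibis tables, and `mutual_info_sketch` is a hypothetical name, not a mismo or sklearn function.

```python
import math
from collections import Counter


def mutual_info_sketch(labels_true, labels_pred):
    """Mutual information (in nats) between two {record_id: label} maps."""
    if labels_true.keys() != labels_pred.keys():
        raise ValueError("both labelings must cover the same record IDs")
    n = len(labels_true)
    true_sizes = Counter(labels_true.values())
    pred_sizes = Counter(labels_pred.values())
    # Contingency counts: how many records fall in each (true, pred) pair.
    joint = Counter((labels_true[rid], labels_pred[rid]) for rid in labels_true)
    return sum(
        (n_ij / n) * math.log(n * n_ij / (true_sizes[t] * pred_sizes[p]))
        for (t, p), n_ij in joint.items()
    )


truth = {"a": 0, "b": 0, "c": 1, "d": 1}
same_grouping = {"a": "x", "b": "x", "c": "y", "d": "y"}
crossed = {"a": 0, "b": 1, "c": 0, "d": 1}

mi_same = mutual_info_sketch(truth, same_grouping)
mi_crossed = mutual_info_sketch(truth, crossed)
```

The label values themselves do not matter, only the grouping: `same_grouping` scores log(2) against `truth` even though its labels are strings, while `crossed` shares no information with `truth` and scores 0.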

mismo.cluster.adjusted_mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, average_method: str = 'arithmetic') -> float

Adjusted Mutual Information between two clusterings.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.adjusted_mutual_info_score for more information.

mismo.cluster.normalized_mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, average_method: str = 'arithmetic') -> float

Compute the normalized mutual information between two clusterings.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.normalized_mutual_info_score for more information.

mismo.cluster.rand_score(labels_true: ir.Table, labels_pred: ir.Table) -> float

Compute the Rand Index between two clusterings.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.rand_score for more information.
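The Rand Index is the fraction of record pairs on which the two clusterings agree, i.e. both put the pair in the same cluster or both put it in different clusters. A brute-force sketch over plain dicts is shown here. The name `rand_index_sketch` is hypothetical, and the quadratic pair loop is only suitable for small examples.

```python
from itertools import combinations


def rand_index_sketch(labels_true, labels_pred):
    """Fraction of record pairs on which two {record_id: label} maps agree."""
    if labels_true.keys() != labels_pred.keys():
        raise ValueError("both labelings must cover the same record IDs")
    pairs = list(combinations(sorted(labels_true), 2))
    if not pairs:
        # Assumption: with fewer than two records the labelings trivially agree.
        return 1.0
    agreements = sum(
        (labels_true[a] == labels_true[b]) == (labels_pred[a] == labels_pred[b])
        for a, b in pairs
    )
    return agreements / len(pairs)


truth = {"a": 0, "b": 0, "c": 1, "d": 1}
predicted = {"a": 0, "b": 0, "c": 0, "d": 1}
score = rand_index_sketch(truth, predicted)
```

Here the two labelings agree on 3 of the 6 pairs (a-b, a-d, and b-d), so the score is 0.5.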

mismo.cluster.adjusted_rand_score(labels_true: ir.Table, labels_pred: ir.Table) -> float

Adjusted Rand Index between two clusterings.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.adjusted_rand_score for more information.

mismo.cluster.fowlkes_mallows_score(labels_true: ir.Table, labels_pred: ir.Table) -> float

Measure the similarity of two clusterings of a set of points.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.fowlkes_mallows_score for more information.

mismo.cluster.homogeneity_score(labels_true: ir.Table, labels_pred: ir.Table) -> float

Homogeneity metric of a cluster labeling given a ground truth.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.homogeneity_score for more information.

mismo.cluster.completeness_score(labels_true: ir.Table, labels_pred: ir.Table) -> float

Compute completeness metric of a cluster labeling given a ground truth.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.completeness_score for more information.

mismo.cluster.v_measure_score(labels_true: ir.Table, labels_pred: ir.Table) -> float

V-measure metric of a cluster labeling given a ground truth.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.v_measure_score for more information.

mismo.cluster.homogeneity_completeness_v_measure(labels_true: ir.Table, labels_pred: ir.Table, *, beta: float = 1.0) -> tuple[float, float, float]

Compute the homogeneity, completeness, and V-measure scores at once.

The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.

See sklearn.metrics.homogeneity_completeness_v_measure for more information.
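The three scores can be understood through entropies: homogeneity is the mutual information divided by the entropy of the true labels, completeness is the mutual information divided by the entropy of the predicted labels, and V-measure is their weighted harmonic mean. The sketch below follows those standard definitions on plain dicts. The function names are hypothetical, and the handling of zero entropy (scoring 1.0) is an assumption modeled on the usual convention.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (in nats) of the cluster sizes in a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels.values()).values())


def _mutual_info(labels_true, labels_pred):
    """Mutual information (in nats) between two {record_id: label} maps."""
    n = len(labels_true)
    true_sizes = Counter(labels_true.values())
    pred_sizes = Counter(labels_pred.values())
    joint = Counter((labels_true[rid], labels_pred[rid]) for rid in labels_true)
    return sum(
        (n_ij / n) * math.log(n * n_ij / (true_sizes[t] * pred_sizes[p]))
        for (t, p), n_ij in joint.items()
    )


def hcv_sketch(labels_true, labels_pred, beta=1.0):
    """Return (homogeneity, completeness, v_measure) for two labelings."""
    if labels_true.keys() != labels_pred.keys():
        raise ValueError("both labelings must cover the same record IDs")
    h_true = _entropy(labels_true)
    h_pred = _entropy(labels_pred)
    mi = _mutual_info(labels_true, labels_pred)
    # Assumption: a labeling with zero entropy (one cluster) scores 1.0.
    homogeneity = mi / h_true if h_true else 1.0
    completeness = mi / h_pred if h_pred else 1.0
    if homogeneity + completeness == 0.0:
        return homogeneity, completeness, 0.0
    v_measure = (
        (1 + beta) * homogeneity * completeness
        / (beta * homogeneity + completeness)
    )
    return homogeneity, completeness, v_measure


truth = {"a": 0, "b": 0, "c": 1, "d": 1}
singletons = {"a": 0, "b": 1, "c": 2, "d": 3}
homogeneity, completeness, v_measure = hcv_sketch(truth, singletons)
```

Putting every record in its own cluster is perfectly homogeneous (no predicted cluster mixes true labels) but only half complete, because each true cluster is split in two.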

Plot

mismo.cluster.degree_dashboard(tables: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table)

Make a dashboard for exploring the degree (number of links) of records.

The "degree" of a record is the number of other records it is linked to.

Pass the entire dataset and the links between records, and use this to explore the distribution of degrees.

mismo.cluster.cluster_dashboard(ds: Datasets | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table) -> solara.Column

A Solara component that shows a cluster of records and links.

This shows ALL the supplied records and links, so be careful: you probably want to filter them down first. You can use the clusters_dashboard component for that.

This is like cytoscape_widget, but with a status bar that shows information about the selected node or edge.

PARAMETER DESCRIPTION
ds

Table(s) of records with at least the column record_id.

TYPE: Datasets | Table | Iterable[Table] | Mapping[str, Table]

links

A table of edges with at least columns (record_id_l, record_id_r) and optionally other columns. The column "width" is used to set the width of the edges. If not given, it is determined from the column "odds", if present, or set to 5 otherwise. The column "opacity" is used to set the opacity of the edges. If not given, it is set to 0.5.

TYPE: Table

mismo.cluster.clusters_dashboard(tables: Datasets | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table) -> solara.Column

Make a dashboard for exploring different clusters of records.

Pass the entire dataset and the links between records, and use this to filter down to a particular cluster.