Clustering API
After we have compared pairs of records, we need to resolve the resulting links into groups of records that all refer to the same entity. This is done with various graph algorithms, which are implemented in this module.
Algorithms
mismo.cluster.connected_components(*, links: ir.Table, records: ir.Column | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table] = None, max_iter: int | None = None, label_as: str = 'component') -> ir.Table | Datasets
Label records using connected components, based on the given links.
This uses an iterative algorithm whose runtime is linear in the diameter of the largest component (i.e. how many "hops" it takes to get from one end of a cluster to the other). This is usually acceptable for our use case, because we expect the components to be small.
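For intuition, here is a minimal pure-Python sketch of that iterative label-propagation idea; it is an illustration only, not mismo's actual implementation, which operates on Ibis tables:
>>> def propagate_min_labels(links):
...     # Start every record in its own component, labeled by its own id.
...     labels = {}
...     for l, r in links:
...         labels.setdefault(l, l)
...         labels.setdefault(r, r)
...     # Repeatedly let the smallest label flow across each link until
...     # nothing changes; the number of passes needed grows with the
...     # diameter of the largest component.
...     changed = True
...     while changed:
...         changed = False
...         for l, r in links:
...             smallest = min(labels[l], labels[r])
...             if labels[l] != smallest or labels[r] != smallest:
...                 labels[l] = labels[r] = smallest
...                 changed = True
...     return labels
>>> propagate_min_labels([("a", "x"), ("b", "x"), ("g", "h")])
{'a': 'a', 'x': 'a', 'b': 'a', 'g': 'g', 'h': 'g'}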
PARAMETER | DESCRIPTION
---|---
links | A table with the columns (record_id_l, record_id_r), corresponding to the links between records. TYPE: ir.Table
records | Table(s) of records with at least the column record_id. Note: if you supply multiple Tables, the record_ids must be the same type across all tables, and universally unique across all tables. TYPE: ir.Column, ir.Table, Iterable[ir.Table], or Mapping[str, ir.Table]; default None
max_iter | The maximum number of iterations to run. If None, run until convergence. TYPE: int or None
label_as | The name of the label column that will contain the component ID. TYPE: str

RETURNS | DESCRIPTION
---|---
result | If records is not given, a table mapping record_id to component label. If a single table of records is given, that table with the label column added. If multiple tables are given, a Datasets with each table labeled individually. TYPE: ir.Table or Datasets
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> from mismo.cluster import connected_components
>>> records1 = ibis.memtable(
... [
... ("a", 0),
... ("b", 1),
... ("c", 2),
... ("d", 3),
... ("g", 6),
... ],
... columns=["record_id", "other"],
... )
>>> records2 = ibis.memtable(
... [
... ("h", 7),
... ("x", 23),
... ("y", 24),
... ("z", 25),
... ],
... columns=["record_id", "other"],
... )
>>> links = ibis.memtable(
... [
... ("a", "x"),
... ("b", "x"),
... ("b", "y"),
... ("c", "y"),
... ("c", "z"),
... ("g", "h"),
... ],
... columns=["record_id_l", "record_id_r"],
... )
If you don't supply the records, then you just get a labeling map from record_id -> component. Note how only the record_ids that are present in links are returned, e.g. there is no record_id "d" present:
>>> connected_components(links=links).order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ int64 │
├───────────┼───────────┤
│ a │ 0 │
│ b │ 0 │
│ c │ 0 │
│ g │ 3 │
│ h │ 3 │
│ x │ 0 │
│ y │ 0 │
│ z │ 0 │
└───────────┴───────────┘
If you supply records, then the records are labeled with the component. We can also change the name of the column that contains the component:
>>> connected_components(records=records1, links=links, label_as="label").order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ record_id ┃ other ┃ label ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├───────────┼───────┼───────┤
│ a │ 0 │ 0 │
│ b │ 1 │ 0 │
│ c │ 2 │ 0 │
│ d │ 3 │ 4 │
│ g │ 6 │ 3 │
└───────────┴───────┴───────┘
You can supply multiple sets of records, which are coerced to a Datasets and returned as a Datasets, with each table of records labeled individually.
>>> a, b = connected_components(records=(records1, records2), links=links)
>>> a.order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ other ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ string │ int64 │ int64 │
├───────────┼───────┼───────────┤
│ a │ 0 │ 0 │
│ b │ 1 │ 0 │
│ c │ 2 │ 0 │
│ d │ 3 │ 4 │
│ g │ 6 │ 3 │
└───────────┴───────┴───────────┘
>>> b.order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ record_id ┃ other ┃ component ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ string │ int64 │ int64 │
├───────────┼───────┼───────────┤
│ h │ 7 │ 3 │
│ x │ 23 │ 0 │
│ y │ 24 │ 0 │
│ z │ 25 │ 0 │
└───────────┴───────┴───────────┘
mismo.cluster.degree(*, links: ir.Table, records: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table] | None = None) -> ir.Table | Datasets
Label records with their degree (number of links to other records).
This is the graph theory definition of degree, i.e. the number of edges coming into or out of a vertex. In this case, that is the number of links coming into or out of a record.
PARAMETER | DESCRIPTION
---|---
links | A table of edges with at least columns (record_id_l, record_id_r). TYPE: ir.Table
records | Table(s) of records with at least the column record_id. TYPE: ir.Table, Iterable[ir.Table], Mapping[str, ir.Table], or None
RETURNS | DESCRIPTION
---|---
result | If records is not given, a table mapping record_id to its degree. Otherwise, the given record table(s) labeled with their degree. TYPE: ir.Table or Datasets
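A minimal usage sketch, reusing the links and records1 tables from the connected_components examples above (outputs omitted, since the name of the degree column is not spelled out in the signature):
>>> from mismo.cluster import degree
>>> # With only links: a table mapping each linked record_id to its degree.
>>> degrees = degree(links=links)
>>> # With records: the records labeled with their degree.
>>> labeled = degree(records=records1, links=links)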
Evaluation
Utilities for assessing the quality of a linkage result.
mismo.cluster.mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, contingency: Any = None) -> float
Compute the mutual information between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.mutual_info_score for more information.
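For instance, here is a minimal sketch of scoring a predicted clustering against a ground truth; the same input shape applies to all of the metrics in this section:
>>> import ibis
>>> from mismo.cluster import mutual_info_score
>>> labels_true = ibis.memtable(
...     [("a", 0), ("b", 0), ("c", 1), ("d", 1)],
...     columns=["record_id", "label"],
... )
>>> labels_pred = ibis.memtable(
...     [("a", 0), ("b", 1), ("c", 1), ("d", 1)],
...     columns=["record_id", "label"],
... )
>>> score = mutual_info_score(labels_true, labels_pred)  # a plain Python float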
mismo.cluster.adjusted_mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, average_method: str = 'arithmetic') -> float
Adjusted Mutual Information between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.adjusted_mutual_info_score for more information.
mismo.cluster.normalized_mutual_info_score(labels_true: ir.Table, labels_pred: ir.Table, *, average_method: str = 'arithmetic') -> float
Compute the normalized mutual information between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.normalized_mutual_info_score for more information.
mismo.cluster.rand_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Compute the Rand Index between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.rand_score for more information.
mismo.cluster.adjusted_rand_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Adjusted Rand Index between two clusterings.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.adjusted_rand_score for more information.
mismo.cluster.fowlkes_mallows_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Measure the similarity of two clusterings of a set of points.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.fowlkes_mallows_score for more information.
mismo.cluster.homogeneity_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Homogeneity metric of a cluster labeling given a ground truth.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.homogeneity_score for more information.
mismo.cluster.completeness_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
Compute completeness metric of a cluster labeling given a ground truth.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.completeness_score for more information.
mismo.cluster.v_measure_score(labels_true: ir.Table, labels_pred: ir.Table) -> float
V-measure metric of a cluster labeling given a ground truth.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.v_measure_score for more information.
mismo.cluster.homogeneity_completeness_v_measure(labels_true: ir.Table, labels_pred: ir.Table, *, beta: float = 1.0) -> tuple[float, float, float]
Compute the homogeneity, completeness, and V-measure scores at once.
The two input tables must have columns "record_id" and "label", a map from record ID to cluster label. They must have the same record IDs.
See sklearn.metrics.homogeneity_completeness_v_measure for more information.
Plot
mismo.cluster.degree_dashboard(tables: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table)
Make a dashboard for exploring the degree (number of links) of records.
The "degree" of a record is the number of other records it is linked to.
Pass the entire dataset and the links between records, and use this to explore the distribution of degrees.
mismo.cluster.cluster_dashboard(ds: Datasets | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table) -> solara.Column
A Solara component that shows a cluster of records and links.
This shows ALL the supplied records and links, so be careful: you probably want to filter them down first. You can use the clusters_dashboard component for that.
This is like cytoscape_widget, but with a status bar that shows information about the selected node or edge.
PARAMETER | DESCRIPTION
---|---
ds | Table(s) of records with at least the column record_id. TYPE: Datasets, ir.Table, Iterable[ir.Table], or Mapping[str, ir.Table]
links | A table of edges with at least columns (record_id_l, record_id_r), and optionally other columns. TYPE: ir.Table
mismo.cluster.clusters_dashboard(tables: Datasets | ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table], links: ir.Table) -> solara.Column
Make a dashboard for exploring different clusters of records.
Pass the entire dataset and the links between records, and use this to filter down to a particular cluster.
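For example, a sketch reusing the tables from the connected_components examples above; in a Jupyter notebook with solara installed, displaying the returned component renders the dashboard:
>>> from mismo.cluster import clusters_dashboard
>>> dashboard = clusters_dashboard(tables=(records1, records2), links=links)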