Core API
Types, functions, etc that are core to mismo and are used throughout the rest of the framework
Core Types
mismo.Linkage
A dataclass of two Tables of records (LinkedTables) and a Table of links (LinksTable) between them.
Each record in left
can be linked from 0 to N records in right
, and vice versa.
See Also
The Diff dataclass, for representing the special case where each record is linked to at most one other record.
mismo.Linkage.links
property
links: LinksTable
A table of (record_id_l, record_id_r, left
and right
.
mismo.Linkage.__init__
__init__(
*, left: Table, right: Table, links: Table
) -> None
Create from two Tables and a Table of links between them.
PARAMETER | DESCRIPTION |
---|---|
left
|
A Table of records, with at least a column 'record_id'.
TYPE:
|
right
|
A Table of records, with at least a column 'record_id'.
TYPE:
|
links
|
A Table of links between the two tables.
Must have columns 'record_id_l' and 'record_id_r', which refer to the
'record_id' columns in
TYPE:
|
mismo.Linkage.cache
cache() -> Self
Cache left, right, and links for faster subsequent access.
mismo.Linkage.copy
copy(
*,
left: LinkedTable | None = None,
right: LinkedTable | None = None,
links: LinksTable | None = None,
) -> Linkage
Create a new Linkage, optionally replacing the left, right, and links tables.
mismo.Linkage.from_join_condition
classmethod
from_join_condition(
left: Table, right: Table, condition: Any
) -> Self
Create a Linkage from two Tables and a join condition.
PARAMETER | DESCRIPTION |
---|---|
left
|
A Table of records, with at least a column 'record_id'.
TYPE:
|
right
|
A Table of records, with at least a column 'record_id'.
TYPE:
|
condition
|
A join condition, such as a boolean expression or an ibis expression. See mismo.join_condition for more details.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A Linkage object.
|
|
mismo.Linkage.from_parquets
classmethod
Create a Linkage by reading parquets from the given directory.
mismo.Linkage.link_counts_chart
link_counts_chart() -> Chart
A side by side altair Chart of left.link_counts().chart()
and right.link_counts().chart()
.
Number of Left Table Number of Right Table
Records Records
| █ | █
100,000 | █ █ | █
| █ █ 10,000 | █
| █ █ █ | █
10,000 | █ █ █ | █ █
| █ █ █ | █ █ █
| █ █ █ 1,000 | █ █ █
1,000 | █ █ █ █ | █ █ █
| █ █ █ █ █ █ | █ █ █
| █ █ █ █ █ █ █ | █ █ █ █
100 | █ █ █ █ █ █ █ █ █ 100 | █ █ █ █
| 0 1 2 3 4 10 12 14 23 | 0 1 2 3
Number of Links Number of Links
mismo.LinkedTable
Bases: TableWrapper
A table of records that are linked to another table.
Each record here can be linked to 0-N records in the other_
.
This acts just like an Ibis Table, but it has a few extra attributes and methods that make it more ergonomic to work with, eg to add data from the linked table.
mismo.LinkedTable.links_
property
links_: LinksTable
The table of links between this table and other
.
Trailing underscore to avoid name conflicts with column names.
mismo.LinkedTable.other_
property
other_: Self
The other table that this table is linked to.
Trailing underscore to avoid name conflicts with column names.
mismo.LinkedTable.link_counts
link_counts() -> LinkCountsTable
Describes 'There are n_records
in self that linked to n_links
in `other'.
This is basically a histogram of self.with_n_links()
See Also
with_n_links
Examples:
>>> import ibis
>>> import mismo
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = mismo.Linkage(left=left, right=right, links=links)
There is 1 record in left (6) that didn't match any in right. There is 1 record in left (5) that matched 1 in right. There is 1 record in left (4) that matched 2 in right.
>>> linkage.left.link_counts().order_by("n_links")
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ n_links ┃ n_records ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ int64 │ int64 │
├─────────┼───────────┤
│ 0 │ 1 │
│ 1 │ 1 │
│ 2 │ 1 │
└─────────┴───────────┘
All 3 records in right matched 1 in left.
>>> linkage.right.link_counts()
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ n_links ┃ n_records ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ int64 │ int64 │
├─────────┼───────────┤
│ 1 │ 3 │
└─────────┴───────────┘
mismo.LinkedTable.make_pair
classmethod
make_pair(
*, left: Table, right: Table, links: Table
) -> tuple[Self, Self]
Create a pair of LinkedTables from left, right, and links.
This basically just wraps the logic to make it so that the the _l and _r suffixes in the links table are consistent.
mismo.LinkedTable.with_linked_values
with_linked_values(
*values: Deferred | Callable[[Table], Value] | None,
default_agg: Deferred = collect(),
**named_values: Deferred
| Callable[[Table], Value]
| None,
) -> Self
This table, with values from the linked records.
See the examples below, this is easy to understand with examples, but hard to describe.
PARAMETER | DESCRIPTION |
---|---|
values
|
unnamed values
TYPE:
|
named_values
|
named values
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with new columns of values from the linked records.
|
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> this = ibis.memtable({"record_id": [40, 50, 60], "x": ["a", "b", "c"]})
>>> other = ibis.memtable({"record_id": [41, 42, 51], "y": [4.1, 4.2, 9.0]})
>>> links = ibis.memtable(
... {"record_id_l": [40, 40, 50], "record_id_r": [41, 42, 51]}
... )
>>> lt = LinkedTable(this, other=other, links=links)
>>> lt
LinkedTable<3 records, 3 links>
┏━━━━━━━━━━━┳━━━━━━━━┓
┃ record_id ┃ x ┃
┡━━━━━━━━━━━╇━━━━━━━━┩
│ int64 │ string │
├───────────┼────────┤
│ 40 │ a │
│ 50 │ b │
│ 60 │ c │
└───────────┴────────┘
Select exactly which values you want from other
Since every record in self
can be linked to 0-N records in other
,
we need a way of aggregating the 0-N linked values in other
to a single value.
By default, if the given expression is not an aggregate,
we default to .collect()
to combine them all into an array.
If the given expression is already an aggregate, then it is kept as-is.
This uses the same semantics as ibis.Table.select(*values, **named_values)
.
>>> lt.with_linked_values(
... "y",
... ibis._.y.max().name("y_max"),
... others="record_id",
... other=ibis._.record_id.arbitrary(),
... )
LinkedTable<3 records, 3 links>
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ record_id ┃ x ┃ y ┃ y_max ┃ others ┃ other ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ int64 │ string │ array<float64> │ float64 │ array<int64> │ int64 │
├───────────┼────────┼──────────────────────┼─────────┼──────────────────────┼───────┤
│ 40 │ a │ [4.1, 4.2] │ 4.2 │ [41, 42] │ 41 │
│ 50 │ b │ [9.0] │ 9.0 │ [51] │ 51 │
│ 60 │ c │ NULL │ NULL │ NULL │ NULL │
└───────────┴────────┴──────────────────────┴─────────┴──────────────────────┴───────┘
One common use case for this is during the task of a lookup. I want to see the record id of the linked record, exluding when there are no links and when there are multiple (ie ambiguous) links.
>>> lt.with_n_links().with_linked_values(
... ibis._.record_id.arbitrary().name("record_id_other"),
... ).filter(ibis._.n_links == 1)
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ x ┃ n_links ┃ record_id_other ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ int64 │ string │ int64 │ int64 │
├───────────┼────────┼─────────┼─────────────────┤
│ 50 │ b │ 1 │ 51 │
└───────────┴────────┴─────────┴─────────────────┘
If no values are provided, we will by default add a column named other
with all the values from other packed into an array of structs.
>>> lt.with_linked_values()
LinkedTable<3 records, 3 links>
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ x ┃ other ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ string │ array<struct<record_id: int64, y: float64>> │
├───────────┼────────┼────────────────────────────────────────────────────────────┤
│ 40 │ a │ [{'record_id': 41, 'y': 4.1}, {'record_id': 42, 'y': 4.2}] │
│ 50 │ b │ [{'record_id': 51, 'y': 9.0}] │
│ 60 │ c │ NULL │
└───────────┴────────┴────────────────────────────────────────────────────────────┘
mismo.LinkedTable.with_n_links
with_n_links(*, name: str = 'n_links') -> Self
Add a column to this table with the number of links each record has.
PARAMETER | DESCRIPTION |
---|---|
name
|
The name of the new column.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with the new column.
|
|
Examples:
>>> import ibis
>>> import mismo
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = mismo.Linkage(left=left, right=right, links=links)
>>> linkage.left.with_n_links().order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ record_id ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 4 │ 2 │
│ 5 │ 1 │
│ 6 │ 0 │
└───────────┴─────────┘
>>> linkage.right.with_n_links(name="link_count").order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ record_id ┃ link_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼────────────┤
│ 7 │ 1 │
│ 8 │ 1 │
│ 9 │ 1 │
└───────────┴────────────┘
mismo.LinksTable
Bases: TableWrapper
A table of links between two tables.
This acts like an ibis table, guaranteed to have at least the
columns record_id_l and record_id_r.
It may have more columns, such as address_match_level
,
that describe the relationship between two records.
In addition to the columns, this table has two properties, left_
and right_
,
each of which is a LinkedTable object,
which is a wrapper around the left and right tables respectively.
mismo.LinksTable.record_id_l
instance-attribute
record_id_l: Column
The record_id of the left table.
mismo.LinksTable.record_id_r
instance-attribute
record_id_r: Column
The record_id of the right table.
mismo.LinksTable.with_both
with_both() -> LinksTable
Add all columns from left
and right
with suffixes _l
and _r
mismo.LinksTable.with_left
with_left(
*values: Deferred | Callable[[Table], Value] | None,
**named_values: Deferred
| Callable[[Table], Value]
| None,
) -> LinksTable
Add columns from the left table to this table of links.
This allows you to add specific columns from the left table,
renaming or modifying them as needed, following the ibis.Table
pattern of
my_table.select("my_col", new_col=my_table.foo)
,
except here we choose from the left table.
PARAMETER | DESCRIPTION |
---|---|
values
|
The columns to add from the left table.
Support string names, Deferreds, etc, just like
TYPE:
|
named_values
|
Like values, but with names, just like
TYPE:
|
Examples:
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [1, 2, 3], "address": ["a", "b", "c"]})
>>> right = ibis.memtable({"record_id": [8, 9], "address": ["x", "y"]})
>>> links_raw = ibis.memtable({"record_id_l": [1, 3], "record_id_r": [8, 9]})
>>> links = LinksTable(links_raw, left=left, right=right)
>>> links.with_left(
... "address",
... ibis._.address.upper().name("address_upper"),
... left_address=ibis._.address,
... )
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ address ┃ address_upper ┃ left_address ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ int64 │ int64 │ string │ string │ string │
├─────────────┼─────────────┼─────────┼───────────────┼──────────────┤
│ 1 │ 8 │ a │ A │ a │
│ 3 │ 9 │ c │ C │ c │
└─────────────┴─────────────┴─────────┴───────────────┴──────────────┘
mismo.LinksTable.with_right
with_right(
*values: Deferred | Callable[[Table], Value] | None,
**named_values: Deferred
| Callable[[Table], Value]
| None,
) -> LinksTable
Add columns from the right table to this table of links.
This allows you to add specific columns from the right table,
renaming or modifying them as needed, following the ibis.Table
pattern of
my_table.select("my_col", new_col=my_table.foo)
,
except here we choose from the right table.
PARAMETER | DESCRIPTION |
---|---|
values
|
The columns to add from the right table.
Support string names, Deferreds, etc, just like
TYPE:
|
named_values
|
Like values, but with names, just like
TYPE:
|
Examples:
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [1, 2, 3], "address": ["a", "b", "c"]})
>>> right = ibis.memtable({"record_id": [8, 9], "address": ["x", "y"]})
>>> links_raw = ibis.memtable({"record_id_l": [1, 3], "record_id_r": [8, 9]})
>>> links = LinksTable(links_raw, left=left, right=right)
>>> links.with_right(
... "address",
... ibis._.address.upper().name("address_upper"),
... right_address=ibis._.address,
... )
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ address ┃ address_upper ┃ right_address ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ string │ string │ string │
├─────────────┼─────────────┼─────────┼───────────────┼───────────────┤
│ 1 │ 8 │ x │ X │ x │
│ 3 │ 9 │ y │ Y │ y │
└─────────────┴─────────────┴─────────┴───────────────┴───────────────┘
mismo.CountsTable
Bases: TableWrapper
A table with at least an Integer column named n
.
There will also be variable number of other columns that act as identifiers.
You won't create this directly, it will be returned to you from eg KeyLinker.key_counts_left, KeyLinker.key_counts_right, or KeyLinker.pair_counts.
mismo.CountsTable.n
instance-attribute
n: IntegerColumn
The column containing the count.
mismo.UnionTable
Bases: TableWrapper
A Table whose rows are the non-unique union of the rows from all sub-Tables.
Linkers
Utilities and classes for the blocking phase of record linkage, where we choose pairs of records to compare.
Without blocking, we would have to compare N*M records, which becomes intractable for datasets much larger than a few thousand.
mismo.Linker
mismo.FullLinker
mismo.JoinLinker
Bases: Linker
A Linker based on a join condition.
mismo.JoinLinker.__call__
__call__(left: Table, right: Table) -> Linkage
Create a linkage from the join condition.
mismo.JoinLinker.__init__
__init__(
condition: Callable[[Table, Table], BooleanValue],
*,
task: Literal["dedupe", "link"] | None = None,
on_slow: Literal["error", "warn", "ignore"] = "error",
) -> None
Create from a join condition.
PARAMETER | DESCRIPTION |
---|---|
condition
|
The join condition.
TYPE:
|
task
|
The task to perform. If
TYPE:
|
on_slow
|
What to do if the join is slow. See mismo.joins.check_join_algorithm.
TYPE:
|
mismo.JoinLinker.__join_condition__
__join_condition__(
left: Table, right: Table
) -> BooleanValue
Create a join condition.
mismo.KeyLinker
Bases: Linker
A Linker that links records wherever they share a key, eg "emails match."
This is one of the most basic blocking rules, used very often in record linkage.
This is what is used in splink
.
Examples:
>>> import ibis
>>> from ibis import _
>>> import mismo
>>> ibis.options.interactive = True
>>> t = mismo.playdata.load_patents()["record_id", "name", "latitude"]
>>> t.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ name ┃ latitude ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ int64 │ string │ float64 │
├───────────┼──────────────────────────────┼──────────┤
│ 2909 │ * AGILENT TECHNOLOGIES, INC. │ 0.00 │
│ 3574 │ * AKZO NOBEL N.V. │ 0.00 │
│ 3575 │ * AKZO NOBEL NV │ 0.00 │
│ 3779 │ * ALCATEL N.V. │ 52.35 │
│ 3780 │ * ALCATEL N.V. │ 52.35 │
└───────────┴──────────────────────────────┴──────────┘
Block the table with itself wherever the names match:
>>> linker = mismo.KeyLinker("name")
>>> linker(t, t).links.order_by(
... "record_id_l", "record_id_r"
... ).head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l ┃ name_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ float64 │ float64 │ string │ string │
├─────────────┼─────────────┼────────────┼────────────┼────────────────┼────────────────┤
│ 3779 │ 3780 │ 52.35 │ 52.350000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 3779 │ 3782 │ 52.35 │ 0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 3780 │ 3782 │ 52.35 │ 0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 25388 │ 7651559 │ 0.00 │ 50.966667 │ DSM N.V. │ DSM N.V. │
│ 25388 │ 7651560 │ 0.00 │ 52.500000 │ DSM N.V. │ DSM N.V. │
└─────────────┴─────────────┴────────────┴────────────┴────────────────┴────────────────┘
Arbitrary blocking keys are supported. For example, block the table wherever
- the first 5 characters of the name in uppercase, are the same AND
- the latitudes, rounded to 1 decimal place, are the same
>>> linker = mismo.KeyLinker((_["name"][:5].upper(), _.latitude.round(1)))
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l ┃ name_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ float64 │ float64 │ string │ string │
├─────────────┼─────────────┼────────────┼────────────┼─────────────────────┼─────────────────────┤
│ 3574 │ 3575 │ 0.00 │ 0.00 │ * AKZO NOBEL N.V. │ * AKZO NOBEL NV │
│ 3779 │ 3780 │ 52.35 │ 52.35 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 15041 │ 15042 │ 0.00 │ 0.00 │ * CANON EUROPA N.V │ * CANON EUROPA N.V. │
│ 15041 │ 15043 │ 0.00 │ 0.00 │ * CANON EUROPA N.V │ * CANON EUROPA NV │
│ 15042 │ 15043 │ 0.00 │ 0.00 │ * CANON EUROPA N.V. │ * CANON EUROPA NV │
└─────────────┴─────────────┴────────────┴────────────┴─────────────────────┴─────────────────────┘
We can even block on arrays! For example, first let's split each name into significant tokens:
>>> tokens = _.name.upper().split(" ").filter(lambda x: x.length() > 4)
>>> t.select(tokens.name("tokens"))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string> │
├──────────────────────────────┤
│ ['AGILENT', 'TECHNOLOGIES,'] │
│ ['NOBEL'] │
│ ['NOBEL'] │
│ ['ALCATEL'] │
│ ['ALCATEL'] │
│ ['ALCATEL'] │
│ ['CANON', 'EUROPA'] │
│ ['CANON', 'EUROPA'] │
│ ['CANON', 'EUROPA'] │
│ [] │
│ … │
└──────────────────────────────┘
Now, block the tables together wherever two records share a token.
Note that this blocked * SCHLUMBERGER LIMITED
with * SCHLUMBERGER TECHNOLOGY BV
.
because they both share the SCHLUMBERGER
token.
>>> linker = mismo.KeyLinker(tokens.unnest())
>>> linker(t, t).links.filter(_.name_l != _.name_r).order_by(
... "record_id_l", "record_id_r"
... ).head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l ┃ name_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ float64 │ float64 │ string │ string │
├─────────────┼─────────────┼────────────┼────────────┼────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ 2909 │ 13390969 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Inc. Netherlands B.V │
│ 2909 │ 13390970 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Inc. Netherlands B.V. │
│ 2909 │ 13391015 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Netherland B.V. │
│ 2909 │ 13391055 │ 0.0 │ 52.50 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Netherlands, B.V. │
│ 2909 │ 13391056 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Netherlands, B.V. │
└─────────────┴─────────────┴────────────┴────────────┴────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
mismo.KeyLinker.__call__
__call__(left: Table, right: Table) -> Linkage
The linkage between the two tables.
mismo.KeyLinker.__init__
__init__(
keys: str
| Value
| Deferred
| Callable[
[Table, Table],
tuple[Value | Column, Value | Column],
]
| Iterable[
str
| Value
| Deferred
| Callable[[Table], Value | Column | str | Deferred]
| tuple[
str
| Deferred
| Callable[
[Table], Value | Column | str | Deferred
],
str
| Deferred
| Callable[
[Table], Value | Column | str | Deferred
],
]
| Callable[
[Table, Table],
tuple[Value | Column, Value | Column],
]
],
*,
max_pairs: int | None = None,
task: Literal["dedupe", "lookup", "link"] | None = None,
) -> None
Create a KeyBlocker.
PARAMETER | DESCRIPTION |
---|---|
keys
|
The keys to block on. The tables will be blocked together wherever they share ALL the keys. Each key can be any of the following:
TYPE:
|
max_pairs
|
The maximum number of pairs to generate for each key.
This is to reduce the impact of very common keys.
For example, if you are linking people, the name "John Smith" might be
very common, appearing 5000 times in both left and right.
This name alone would generate 5000 * 5000 = 25 million pairs,
which might be too computationally expensive.
If you set
TYPE:
|
task
|
The task to count pairs for.
TYPE:
|
mismo.KeyLinker.pair_counts
pair_counts(left: Table, right: Table) -> PairCountsTable
Count the number of pairs that would be generated by each key.
If you were to use this blocker to join left
with right
,
how many pairs would be generated for each key?
This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.
On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.
PARAMETER | DESCRIPTION |
---|---|
left
|
The left table.
TYPE:
|
right
|
The right table.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
CountsTable
|
Will have column(s) for each |
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
... (1, "a", 1),
... (2, "b", 1),
... (3, "b", 1),
... (4, "c", 3),
... (5, "b", 2),
... (6, "c", 3),
... (7, None, 4),
... (8, "c", 3),
... ]
>>> t = ibis.memtable(
... records, schema={"record_id": int, "letter": str, "num": int}
... ).cache()
>>> t
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ record_id ┃ letter ┃ num ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ int64 │ string │ int64 │
├───────────┼────────┼───────┤
│ 1 │ a │ 1 │
│ 2 │ b │ 1 │
│ 3 │ b │ 1 │
│ 4 │ c │ 3 │
│ 5 │ b │ 2 │
│ 6 │ c │ 3 │
│ 7 │ NULL │ 4 │
│ 8 │ c │ 3 │
└───────────┴────────┴───────┘
If we joined t with itself using this blocker in a dedupe task, we would end up with
- 3 pairs in the (c, 3) block due to pairs (3, 6), (3, 8), and (6, 8)
- 1 pairs in the (b, 1) block due to pairs (1, 2)
- 0 pairs in the (a, 1) block due to record 0 not getting blocked with itself
- 0 pairs in the (b, 2) block due to record 4 not getting blocked with itself
>>> linker = mismo.KeyLinker(["letter", "num"], task="dedupe")
>>> counts = linker.pair_counts(t, t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num ┃ n ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a │ 1 │ 0 │
│ b │ 1 │ 1 │
│ b │ 2 │ 0 │
│ c │ 3 │ 3 │
└────────┴───────┴───────┘
If we joined t with a copy of itself using this linker in a link task, we would end up with
- 9 pairs in the (c, 3) block due to pairs (3,3), (3, 6), (3, 8), (6, 3), (6, 6), (6, 8), (8, 3), (8, 6), and (8, 8)
- 4 pairs in the (b, 1) block due to pairs (1, 1), (1, 2), (2, 1), and (2, 2)
- 1 pairs in the (a, 1) block due to pair (0, 0)
- 1 pairs in the (b, 2) block due to pair (4, 4)
>>> linker = mismo.KeyLinker(["letter", "num"], task="link")
>>> counts = linker.pair_counts(t, t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num ┃ n ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a │ 1 │ 1 │
│ b │ 1 │ 4 │
│ b │ 2 │ 1 │
│ c │ 3 │ 9 │
└────────┴───────┴───────┘
The returned CountsTable is a subclass of an Ibis Table
with a special n_total
method for convenience:
>>> counts.n_total()
15
>>> isinstance(counts, ibis.Table)
True
mismo.KeyLinker.too_common_of_records
too_common_of_records(
left: Table, right: Table
) -> tuple[Table, Table]
The prefilter clause that removes keys that would generate too many pairs.
mismo.KeyLinker.key_counts_left
key_counts_left(left: Table) -> KeyCountsTable
mismo.KeyLinker.key_counts_right
key_counts_right(right: Table) -> KeyCountsTable
mismo.OrLinker
Bases: Linker
A Linker that is the logical OR of multiple [mismo.HasJoinCondition].
Physically, this is implemented as follows: - remove any condition overlap using [mismo.joins.remove_condition_overlap] - create a LinksTable for each join condition (which should be fast) - Union the LinksTables into a single LinksTable
mismo.OrLinker.join_conditions
property
join_conditions: dict[str, HasJoinCondition]
The tuple of underling HasJoinCondition objects.
mismo.linkage.sample_all_links
sample_all_links(
left: Table,
right: Table,
*,
max_pairs: int | None = None,
) -> LinksTable
Samples up to max_pairs
from all possible pairs of records.
PARAMETER | DESCRIPTION |
---|---|
left
|
The left table.
TYPE:
|
right
|
The right table.
TYPE:
|
max_pairs
|
The maximum number of pairs to sample. If None, all possible pairs are sampled.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A [LinksTable][mismo.LinksTable] with just record_id_l and record_id_r.
|
|
All pairs will be unique.
|
|
Examples:
>>> import ibis
>>> import mismo
>>> ibis.options.interactive = True
>>> t, _labels = mismo.playdata.load_febrl1()
>>> t.head(5)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id ┃ given_name ┃ surname ┃ street_number ┃ address_1 ┃ … ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string │ string │ string │ string │ string │ … │
├──────────────┼────────────┼────────────┼───────────────┼───────────────────┼───┤
│ rec-0-dup-0 │ thomas │ rokobaro │ 12 │ herschell circuit │ … │
│ rec-0-org │ flynn │ rokobaro │ 12 │ herschell circuit │ … │
│ rec-1-dup-0 │ karli │ alderson │ 144 │ nulsen circuit │ … │
│ rec-1-org │ karli │ alderson │ 144 │ nulsen circuit │ … │
│ rec-10-dup-0 │ kayla │ harrington │ NULL │ maltby circuit │ … │
└──────────────┴────────────┴────────────┴───────────────┴───────────────────┴───┘
>>> mismo.linkage.sample_all_links(t, t, max_pairs=7)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ string │ string │
├───────────────┼───────────────┤
│ rec-226-dup-0 │ rec-396-org │
│ rec-232-dup-0 │ rec-402-dup-0 │
│ rec-259-dup-0 │ rec-61-org │
│ rec-293-dup-0 │ rec-41-dup-0 │
│ rec-448-org │ rec-25-org │
└───────────────┴───────────────┘
Comparing tables
mismo.Updates
Bases: TableWrapper
A Table representing how individual rows were updated.
This only represents differences in rows that exist both before and after the changes. To represent a general difference between two tables, eg with insertions and deletions, use Diff.
This represents how each column has changed between two tables. If a column only has a 'before' field, it means this column was removed. If a column only has an 'after' field, it means this column was added. If a column has both 'before' and 'after' fields, it means this column was present in both tables.
mismo.Updates.check_schemas
property
check_schemas: Literal['exactly', 'names', 'lax']
The schema checking mode used for this Updates object.
mismo.Updates.filters
class-attribute
instance-attribute
filters = Filters
A set of filters for convenience.
Examples:
>>> u = Updates.from_tables(before, after, join_on="id")
>>> u.filter(u.filters.all_different(["name", "age"]))
mismo.Updates.__init__
__init__(
diff_table: Table,
/,
*,
check_schemas: Literal[
"exactly", "names", "lax"
] = "exactly",
) -> None
Create an Updates object from a table of differences.
PARAMETER | DESCRIPTION |
---|---|
diff_table
|
A table with columns that are structs with at least 'before' and 'after' fields.
TYPE:
|
check_schemas
|
How to check the schemas of the before and after values. - "exactly": both before and after must have the same columns and types. - "names": both before and after must have the same columns, but types can differ. - "lax": no schema checking, just that there is at least one of 'before' or 'after' in each column.
TYPE:
|
mismo.Updates.after
after() -> Table
The table after the changes.
mismo.Updates.apply_to
apply_to(
t: Table, /, *, defaults: None | Any = NOT_SET
) -> Table
Return the input table with these updates applied to it.
PARAMETER | DESCRIPTION |
---|---|
t
|
The table to apply the updates to.
TYPE:
|
defaults
|
If the after table has more columns than the before table, you must provide defaults.
This is because there might be some rows in If None, as convenience, we will use
TYPE:
|
mismo.Updates.before
before() -> Table
The table before the changes.
mismo.Updates.from_before_after
classmethod
from_before_after(
before: Mapping[str, Value] | StructValue,
after: Mapping[str, Value] | StructValue,
*,
check_schemas: Literal[
"exactly", "names", "lax"
] = "exactly",
) -> Updates
Create an Updates object from before and after values.
PARAMETER | DESCRIPTION |
---|---|
before
|
The values before the changes. |
after
|
The values after the changes. |
check_schemas
|
How to check the schemas of the before and after values. - "exactly": both before and after must have the same columns and types. - "names": both before and after must have the same columns, but types can differ. - "lax": no schema checking, just that there is at least one of 'before' or 'after' in each column.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Updates
|
An Updates object representing the changes. |
mismo.Updates.from_tables
classmethod
from_tables(
before: Table,
after: Table,
*,
join_on: str | Literal[False],
check_schemas: Literal[
"exactly", "names", "lax"
] = "exactly",
) -> Updates
Create from two different tables by joining them on a key.
Note that this results in only the rows that are present in both tables, due to the inner join on the key. Insertions and deletions should be handled separately.
mismo.Updates.is_changed
is_changed(column: str) -> BooleanColumn
Is column.before different from column.after? Never returns NULL.
mismo.Diff
A set of insertions, updates, and deletions between two tables.
This can only semantically represent 1-1 relationships,
eg a row in the before
table corresponds to only 0 or 1 row in the after
table,
and vice versa. eg "this row changed in these ways between these two tables".
To represent more general 0-N relationships, use a Linkage. eg many rows in a "dirty" dataset are linked to a single row in a "clean" dataset. Say you have a clean database of records. You just got a new batch of dirty data that might contain duplicates. Each record in the clean database might match multiple records in the dirty data. This makes it difficult to use a Diff object, because each clean record can't be paired up nicely with a single dirty record. A Linkage object is more appropriate in this case.
This is able to represent a difference between two tables with different schemas, eg if a column is added or removed.
mismo.Diff.after
after() -> Table
The table after the changes.
mismo.Diff.before
before() -> Table
The table before the changes.
mismo.Diff.chart
chart() -> Chart
Create a chart that shows the flow of rows through the diff.
This is just a convenience method that calls self.stats.chart()
.
Rows
800,000 | ▓▓ Inserted (50,000)
| ▒▒ Deleted (100,000)
700,000 | ░░ Updated (200,000)
| ██ Unchanged (300,000)
600,000 | ▒▒▒▒
| ▒▒▒▒ ▓▓▓▓
500,000 | ░░░░ ░░░░
| ░░░░ ░░░░
400,000 | ░░░░ ░░░░
| ░░░░ ░░░░
300,000 | ████ ████
| ████ ████
200,000 | ████ ████
| ████ ████
100,000 | ████ ████
| ████ ████
0 | Before (600,000) After (550,000)
mismo.Diff.deletions
deletions() -> Table
Rows that were in before
but not in after
.
This has the same schema as before
.
mismo.Diff.from_before_after
classmethod
Create from a before and after table.
mismo.Diff.from_deltas
classmethod
from_deltas(
*,
before: Table,
insertions: Table | None = None,
updates: Updates | None = None,
deletions: Table | None = None,
updates_defaults: None | Any = NOT_SET,
) -> Self
Create from a starting point and a set of transformations.
PARAMETER | DESCRIPTION |
---|---|
before
|
The table before the changes.
TYPE:
|
insertions
|
Rows that were in
TYPE:
|
updates
|
Rows that were changed between
TYPE:
|
deletions
|
Rows that were in
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Diff
|
|
mismo.Diff.from_parquets
classmethod
Create a Diff by reading parquets from the given directory.
mismo.Diff.insertions
insertions() -> Table
Rows that were in after
but not in before
.
This has the same schema as after
.
mismo.Diff.to_parquets
Write the tables in the changes to parquet files.
mismo.Diff.unchanged
unchanged() -> Table
Rows that were unchanged between before
and after
.
mismo.Diff.updates
updates() -> Updates
Rows that were changed between before
and after
.
self.updates().before()
and self.updates().after()
have the same schemas
as before
and after
, respectively.
mismo.DiffStats
Summary statistics about a Diff, such as number of insertions, deletions, etc.
mismo.DiffStats.chart
chart() -> Chart
Create a chart that shows the flow of rows through the diff.
Rows
800,000 | ▓▓ Inserted (50,000)
| ▒▒ Deleted (100,000)
700,000 | ░░ Updated (200,000)
| ██ Unchanged (300,000)
600,000 | ▒▒▒▒
| ▒▒▒▒ ▓▓▓▓
500,000 | ░░░░ ░░░░
| ░░░░ ░░░░
400,000 | ░░░░ ░░░░
| ░░░░ ░░░░
300,000 | ████ ████
| ████ ████
200,000 | ████ ████
| ████ ████
100,000 | ████ ████
| ████ ████
0 | Before (600,000) After (550,000)
mismo.DiffStats.n_deletions
cached
n_deletions() -> int
Number of rows that were in before
but not in after
.
mismo.DiffStats.n_insertions
cached
n_insertions() -> int
Number of rows that were in after
but not in before
.
mismo.DiffStats.n_unchanged
cached
n_unchanged() -> int
Number of rows that were unchanged between before
and after
.
mismo.DiffStats.n_updates
cached
n_updates() -> int
Number of rows that were changed between before
and after
.
mismo.LinkCountsTable
Bases: TableWrapper
A table representing the number of records binned by number of links.
eg "There were 700 records with 0 links, 300 with 1 link, 20 with 2 links, ..."
mismo.LinkCountsTable.n_links
instance-attribute
n_links: IntegerColumn
The number of links.
mismo.LinkCountsTable.n_records
instance-attribute
n_records: IntegerColumn
The number of records.
mismo.LinkCountsTable.__init__
__init__(t: Table) -> None
Create from an ibis table with exactly columns 'n_records' and 'n_links'.
PARAMETER | DESCRIPTION |
---|---|
t
|
The table with exactly columns 'n_records' and 'n_links'.
TYPE:
|
mismo.LinkCountsTable.chart
chart() -> Chart
A bar chart of the number of records by the number of links.
Number of Records
Number of By Link Count
Records
| █
100,000 | █ █
| █ █
| █ █ █
10,000 | █ █ █
| █ █ █
| █ █ █
1,000 | █ █ █ █
| █ █ █ █ █ █
| █ █ █ █ █ █ █
100 | █ █ █ █ █ █ █ █ █
| 0 1 2 3 4 10 12 14 23
Number of Links