Util Classes
Mismo Types
These are some utility classes. I tried to make them be simple and open to extension.
mismo.types.Linkage
Two tables of records and links between them.
This is semantically similar to a Diff object,
except in a Diff object every row in left
is linked
to either 0 or 1 rows in right
.
Because there can't be many-to- relationships,
Diffs allow for the semantics of insertions, updates, and deletions.
eg "this row changed in these ways between these two tables".
On the other hand, a Linkage object is more general. It supports the semantics of a many-to-many relationship between two tables. Say you have a clean database of records. You just got a new batch of dirty data that might contain duplicates. Each record in the clean database might match multiple records in the dirty data. This makes it difficult to use a Diff object, because each clean record can't be paired up nicely with a single dirty record. A Linkage object is more appropriate in this case.
mismo.types.Linkage.left: LinkedTable
property
The left Table.
mismo.types.Linkage.links: ibis.Table
property
A table of (record_id_l, record_id_r, left
and right
.
mismo.types.Linkage.right: LinkedTable
property
The right Table.
mismo.types.Linkage.__init__(left: ibis.Table, right: ibis.Table, links: ibis.Table) -> None
Create from two tables and a table of links between them.
PARAMETER | DESCRIPTION |
---|---|
left |
A Table of records, with at least a column 'record_id'.
TYPE:
|
right |
A Table of records, with at least a column 'record_id'.
TYPE:
|
links |
A Table of links between the two tables.
Must have columns 'record_id_l' and 'record_id_r', which refer to the
'record_id' columns in
TYPE:
|
mismo.types.Linkage.cache() -> _typing.Self
Cache the left, right, and links tables.
RETURNS | DESCRIPTION |
---|---|
A new Linkage with the cached tables.
|
|
mismo.types.Linkage.from_parquets(directory: str | Path, /, *, backend: ibis.BaseBackend | None = None) -> _typing.Self
classmethod
Create a Linkage by reading parquets from the given directory.
mismo.types.Linkage.from_predicates(left: ibis.Table, right: ibis.Table, predicates) -> _typing.Self
classmethod
Create a Linkage from join predicates.
This is useful if you don't already have a table of links.
This will create a table of links by joining the left and right tables
on the given predicates.
It will either use the existing record_id
columns in the tables,
or create new ones if they don't exist.
PARAMETER | DESCRIPTION |
---|---|
left |
The left table.
TYPE:
|
right |
The right table.
TYPE:
|
predicates |
The join predicates. Anything that ibis.join() accepts.
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> tl = ibis.memtable({"record_id": [1, 2, 3]})
>>> tr = ibis.memtable({"record_id": [1, 2, 2]})
>>> linkage = Linkage.from_predicates(tl, tr, "record_id")
>>> linkage.links
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64 │ int64 │
├─────────────┼─────────────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 2 │ 2 │
└─────────────┴─────────────┘
mismo.types.Linkage.link_counts_chart() -> alt.Chart
A side by side altair Chart of left.link_counts(
)and
right.link_counts()`
Number of Left Table Number of Right Table
Records Records
| █ | █
100,000 | █ █ | █
| █ █ 10,000 | █
| █ █ █ | █
10,000 | █ █ █ | █ █
| █ █ █ | █ █ █
| █ █ █ 1,000 | █ █ █
1,000 | █ █ █ █ | █ █ █
| █ █ █ █ █ █ | █ █ █
| █ █ █ █ █ █ █ | █ █ █ █
100 | █ █ █ █ █ █ █ █ █ 100 | █ █ █ █
| 0 1 2 3 4 10 12 14 23 | 0 1 2 3
Number of Links Number of Links
mismo.types.Linkage.to_parquets(directory: str | Path) -> None
Write the needle, haystack, and links to parquet files in the given directory.
mismo.types.LinkedTable
Bases: TableWrapper
A table of records that are linked to another table.
Each record here can be linked to 0-N records in the other_
.
This acts just like an Ibis Table, but it has a few extra attributes and methods that make it more ergonomic to work with, eg to add data from the linked table.
mismo.types.LinkedTable.links_: ibis.Table
instance-attribute
The table of links between this table and other
.
Trailing underscore to avoid name conflicts with column names.
mismo.types.LinkedTable.other_: ibis.Table
instance-attribute
The other table that this table is linked to. Trailing underscore to avoid name conflicts with column names.
mismo.types.LinkedTable.link_counts() -> LinkCountsTable
Describes 'There are n_records
in self that linked to n_links
in `other'.
This is basically a histogram of self.with_n_links()
See Also
with_n_links
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = Linkage(left, right, links)
There is 1 record in left (6) that didn't match any in right. There is 1 record in left (5) that matched 1 in right. There is 1 record in left (4) that matched 2 in right.
>>> linkage.left.link_counts()
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ n_records ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 1 │ 2 │
│ 1 │ 1 │
│ 1 │ 0 │
└───────────┴─────────┘
All 3 records in right matched 1 in left.
>>> linkage.right.link_counts()
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ n_records ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 3 │ 1 │
└───────────┴─────────┘
mismo.types.LinkedTable.with_many_linked_values(*values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None, **named_values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None) -> _typing.Self
This table, with array<>
columns of values from linked records in other
This is very similar to with_single_linked_values
, except:
- This includes values from all linked records, not just the single match.
- Here, the values from the N linked records are returned in an array. There, since there is only one linked record, we return it directly, not as a length-1 array.
This uses the same semantics as ibis.Table.select(*values, **named_values)
to choose which values from other
to add to self
If no values are provided, we will by default add a column named other
with all the values from other packed into a struct.
PARAMETER | DESCRIPTION |
---|---|
values |
unnamed values
TYPE:
|
named_values |
named values
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with the new column.
|
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> this = ibis.memtable({"record_id": [4, 5, 6]})
>>> other = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> lt = LinkedTable(this, other, links)
>>> lt
┏━━━━━━━━━━━┓
┃ record_id ┃
┡━━━━━━━━━━━┩
│ int64 │
├───────────┤
│ 4 │
│ 5 │
│ 6 │
└───────────┘
Default is to pack everything into array
>>> lt.with_many_linked_values()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ other ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ array<struct<record_id: int64>> │
├───────────┼──────────────────────────────────────┤
│ 4 │ [{'record_id': 7}, {'record_id': 8}] │
│ 5 │ [{'record_id': 9}] │
│ 6 │ [] │
└───────────┴──────────────────────────────────────┘
Or you can select exactly which values you want. They will be returned in an array, one for each linked record:
>>> lt.with_many_linked_values(_.record_id.name("idrs"), plus_ones=_.record_id + 1)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ idrs ┃ plus_ones ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ array<int64> │ array<int64> │
├───────────┼──────────────────────┼──────────────────────┤
│ 4 │ [7, 8] │ [8, 9] │
│ 5 │ [9] │ [10] │
│ 6 │ [] │ [] │
└───────────┴──────────────────────┴──────────────────────┘
mismo.types.LinkedTable.with_n_links(name: str = 'n_links') -> _typing.Self
Add a column to this table with the number of links each record has.
PARAMETER | DESCRIPTION |
---|---|
name |
The name of the new column.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with the new column.
|
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = Linkage(left, right, links)
>>> linkage.left.with_n_links().order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ record_id ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 4 │ 2 │
│ 5 │ 1 │
│ 6 │ 0 │
└───────────┴─────────┘
>>> linkage.right.with_n_links().order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ record_id ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 7 │ 1 │
│ 8 │ 1 │
│ 9 │ 1 │
└───────────┴─────────┘
mismo.types.LinkedTable.with_single_linked_values(*values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None, **named_values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None) -> _typing.Self
This table filtered to single matches, with values from the linked record.
This is very similar to with_many_linked_values
, except:
- It filters to only include records that have exactly 1 link.
- In
with_many_linked_values
the values from the N linked records are returned in an array. Here, since there is only one linked record, we return it directly, not as a length-1 array.
This uses the same semantics as ibis.Table.select(*values, **named_values)
to choose which values from other
to add to self
If no values are provided, we will by default add a column named other
with all the values from other packed into a struct.
PARAMETER | DESCRIPTION |
---|---|
values |
unnamed values
TYPE:
|
named_values |
named values
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with the new column.
|
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> lt = LinkedTable(left, right, links)
>>> lt
┏━━━━━━━━━━━┓
┃ record_id ┃
┡━━━━━━━━━━━┩
│ int64 │
├───────────┤
│ 4 │
│ 5 │
│ 6 │
└───────────┘
We only include record with id 5, because it has exactly 1 link. Record 4 is linked to 2 records (7 and 8), and record 6 is linked to 0 records. Default is to pack everything into a struct:
>>> lt.with_single_linked_values()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ other ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ struct<record_id: int64> │
├───────────┼──────────────────────────┤
│ 5 │ {'record_id': 9} │
└───────────┴──────────────────────────┘
Or you can select exactly which values you want:
>>> lt.with_single_linked_values(_.record_id.name("idr"), plus_one=_.record_id + 1)
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ idr ┃ plus_one ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ int64 │ int64 │ int64 │
├───────────┼───────┼──────────┤
│ 5 │ 9 │ 10 │
└───────────┴───────┴──────────┘
mismo.types.LinkCountsTable
Bases: TableWrapper
A table representing the number of records binned by number of links.
eg "There were 700 records with 0 links, 300 with 1 link, 20 with 2 links, ..."
mismo.types.LinkCountsTable.n_links: ir.IntegerColumn
instance-attribute
The number of links.
mismo.types.LinkCountsTable.n_records: ir.IntegerColumn
instance-attribute
The number of records.
mismo.types.LinkCountsTable.__init__(t: ibis.Table) -> None
Create from an ibis table with exactly columns 'n_records' and 'n_links'.
PARAMETER | DESCRIPTION |
---|---|
t |
The table with exactly columns 'n_records' and 'n_links'.
TYPE:
|
mismo.types.LinkCountsTable.chart() -> alt.Chart
A bar chart of the number of records by the number of links.
Number of Records
Number of By Link Count
Records
| █
100,000 | █ █
| █ █
| █ █ █
10,000 | █ █ █
| █ █ █
| █ █ █
1,000 | █ █ █ █
| █ █ █ █ █ █
| █ █ █ █ █ █ █
100 | █ █ █ █ █ █ █ █ █
| 0 1 2 3 4 10 12 14 23
Number of Links
mismo.types.Updates
Bases: TableWrapper
A Table representing how individual rows were updated.
This only represents differences in rows that exist both before and after the changes. To represent a general difference between two tables, eg with insertions and deletions, use Diff.
This represents how each column has changed between two tables. If a column only has a 'before' field, it means this column was removed. If a column only has an 'after' field, it means this column was added. If a column has both 'before' and 'after' fields, it means this column was present in both tables.
mismo.types.Updates.filters = Filters
class-attribute
instance-attribute
A set of filters for convenience.
Examples:
>>> u = Updates.from_tables(before, after, join_on="id")
>>> u.filter(u.filters.all_different(["name", "age"]))
mismo.types.Updates.after() -> ibis.Table
The table after the changes.
mismo.types.Updates.apply_to(t: ibis.Table, *, defaults: None | Literal[_util.NOT_SET] | Any = _util.NOT_SET) -> ibis.Table
Return the input table with these updates applied to it.
PARAMETER | DESCRIPTION |
---|---|
t |
The table to apply the updates to.
TYPE:
|
defaults |
If the after table has more columns than the before table, you must provide defaults.
This is because there might be some rows in If None, as convenience, we will use |
mismo.types.Updates.before() -> ibis.Table
The table before the changes.
mismo.types.Updates.from_tables(before: ibis.Table, after: ibis.Table, *, join_on: str, schema: Literal['exactly', 'names', 'lax'] = 'exactly') -> Updates
classmethod
Create from two different tables by joining them on a key.
Note that this results in only the rows that are present in both tables, due to the inner join on the key. Insertions and deletions should be handled separately.
mismo.types.Diff
A set of insertions, updates, and deletions between two tables.
This can only semantically represent 1-1 relationships,
eg a row in the before
table corresponds to only 0 or 1 row in the after
table,
and vice versa.
To represent more general 0-N relationships, use a Linkage. eg many rows in a "dirty" dataset are linked to a single row in a "clean" dataset.
This is able to represent a difference between two tables with different schemas, eg if a column is added or removed.
mismo.types.Diff.stats: DiffStats
property
Statistics about this Diff.
mismo.types.Diff.after() -> ibis.Table
The table after the changes.
mismo.types.Diff.before() -> ibis.Table
The table before the changes.
mismo.types.Diff.cache() -> Diff
Return a new Diff with all tables cached.
mismo.types.Diff.deletions() -> ibis.Table
Rows that were in before
but not in after
.
This has the same schema as before
.
mismo.types.Diff.from_before_after(before: ibis.Table, after: ibis.Table, *, join_on: str) -> _typing.Self
classmethod
Create from a before and after table.
mismo.types.Diff.from_deltas(*, before: ibis.Table, insertions: ibis.Table | None = None, updates: Updates | None = None, deletions: ibis.Table | None = None, updates_defaults: None | Any | Literal[_util.NOT_SET] = _util.NOT_SET) -> _typing.Self
classmethod
Create from a starting point and a set of transformations.
PARAMETER | DESCRIPTION |
---|---|
before |
The table before the changes.
TYPE:
|
insertions |
Rows that were in
TYPE:
|
updates |
Rows that were changed between
TYPE:
|
deletions |
Rows that were in
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Diff
|
|
mismo.types.Diff.from_parquets(directory: str | Path, /, *, backend: ibis.BaseBackend | None = None) -> _typing.Self
classmethod
Create a Diff by reading parquets from the given directory.
mismo.types.Diff.insertions() -> ibis.Table
Rows that were in after
but not in before
.
This has the same schema as after
.
mismo.types.Diff.to_parquets(directory: str | Path) -> None
Write the tables in the changes to parquet files.
mismo.types.Diff.unchanged() -> ibis.Table
Rows that were unchanged between before
and after
.
mismo.types.Diff.updates() -> Updates
Rows that were changed between before
and after
.
self.updates().before()
and self.updates().after()
have the same schemas
as before
and after
, respectively.
mismo.types.DiffStats
Summary statistics about a Diff, such as number of insertions, deletions, etc.
mismo.types.DiffStats.__init__(diff: Diff) -> None
Create from a Diff.
mismo.types.DiffStats.chart() -> alt.Chart
Create a chart that shows the flow of rows through the diff.
Rows
800,000 | ▓▓ Inserted (50,000)
| ▒▒ Deleted (100,000)
700,000 | ░░ Updated (200,000)
| ██ Unchanged (300,000)
600,000 | ▒▒▒▒
| ▒▒▒▒ ▓▓▓▓
500,000 | ░░░░ ░░░░
| ░░░░ ░░░░
400,000 | ░░░░ ░░░░
| ░░░░ ░░░░
300,000 | ████ ████
| ████ ████
200,000 | ████ ████
| ████ ████
100,000 | ████ ████
| ████ ████
0 | Before (600,000) After (550,000)
mismo.types.DiffStats.n_after() -> int
cached
Number of rows in after
.
mismo.types.DiffStats.n_before() -> int
cached
Number of rows in before
.
mismo.types.DiffStats.n_deletions() -> int
cached
Number of rows that were in before
but not in after
.
mismo.types.DiffStats.n_insertions() -> int
cached
Number of rows that were in after
but not in before
.
mismo.types.DiffStats.n_unchanged() -> int
cached
Number of rows that were unchanged between before
and after
.
mismo.types.DiffStats.n_updates() -> int
cached
Number of rows that were changed between before
and after
.