Util Classes
Mismo Types
These are some utility classes. I tried to make them be simple and open to extension.
mismo.types.Linkage
Two tables of records and links between them.
This is semantically similar to a Diff object,
except in a Diff object every row in left
is linked
to either 0 or 1 rows in right
.
Because there can't be many-to- relationships,
Diffs allow for the semantics of insertions, updates, and deletions.
eg "this row changed in these ways between these two tables".
On the other hand, a Linkage object is more general. It supports the semantics of a many-to-many relationship between two tables. Say you have a clean database of records. You just got a new batch of dirty data that might contain duplicates. Each record in the clean database might match multiple records in the dirty data. This makes it difficult to use a Diff object, because each clean record can't be paired up nicely with a single dirty record. A Linkage object is more appropriate in this case.
mismo.types.Linkage.left: LinkedTable
property
The left Table.
mismo.types.Linkage.links: ibis.Table
property
A table of (record_id_l, record_id_r, left
and right
.
mismo.types.Linkage.right: LinkedTable
property
The right Table.
mismo.types.Linkage.__init__(left: ibis.Table, right: ibis.Table, links: ibis.Table) -> None
Create from two tables and a table of links between them.
PARAMETER | DESCRIPTION |
---|---|
left
|
A Table of records, with at least a column 'record_id'.
TYPE:
|
right
|
A Table of records, with at least a column 'record_id'.
TYPE:
|
links
|
A Table of links between the two tables.
Must have columns 'record_id_l' and 'record_id_r', which refer to the
'record_id' columns in
TYPE:
|
mismo.types.Linkage.cache() -> _typing.Self
Cache the left, right, and links tables.
RETURNS | DESCRIPTION |
---|---|
A new Linkage with the cached tables.
|
|
mismo.types.Linkage.from_parquets(directory: str | Path, /, *, backend: ibis.BaseBackend | None = None) -> _typing.Self
classmethod
Create a Linkage by reading parquets from the given directory.
mismo.types.Linkage.from_predicates(left: ibis.Table, right: ibis.Table, predicates) -> _typing.Self
classmethod
Create a Linkage from join predicates.
This is useful if you don't already have a table of links.
This will create a table of links by joining the left and right tables
on the given predicates.
It will either use the existing record_id
columns in the tables,
or create new ones if they don't exist.
PARAMETER | DESCRIPTION |
---|---|
left
|
The left table.
TYPE:
|
right
|
The right table.
TYPE:
|
predicates
|
The join predicates. Anything that ibis.join() accepts.
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> tl = ibis.memtable({"record_id": [1, 2, 3]})
>>> tr = ibis.memtable({"record_id": [1, 2, 2]})
>>> linkage = Linkage.from_predicates(tl, tr, "record_id")
>>> linkage.links
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64 │ int64 │
├─────────────┼─────────────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 2 │ 2 │
└─────────────┴─────────────┘
mismo.types.Linkage.link_counts_chart() -> alt.Chart
A side by side altair Chart of left.link_counts(
)and
right.link_counts()`
Number of Left Table Number of Right Table
Records Records
| █ | █
100,000 | █ █ | █
| █ █ 10,000 | █
| █ █ █ | █
10,000 | █ █ █ | █ █
| █ █ █ | █ █ █
| █ █ █ 1,000 | █ █ █
1,000 | █ █ █ █ | █ █ █
| █ █ █ █ █ █ | █ █ █
| █ █ █ █ █ █ █ | █ █ █ █
100 | █ █ █ █ █ █ █ █ █ 100 | █ █ █ █
| 0 1 2 3 4 10 12 14 23 | 0 1 2 3
Number of Links Number of Links
mismo.types.Linkage.to_parquets(directory: str | Path, /, *, overwrite: bool = False) -> None
Write the needle, haystack, and links to parquet files in the given directory.
mismo.types.LinkedTable
Bases: TableWrapper
A table of records that are linked to another table.
Each record here can be linked to 0-N records in the other_
.
This acts just like an Ibis Table, but it has a few extra attributes and methods that make it more ergonomic to work with, eg to add data from the linked table.
mismo.types.LinkedTable.links_: ibis.Table
instance-attribute
The table of links between this table and other
.
Trailing underscore to avoid name conflicts with column names.
mismo.types.LinkedTable.other_: ibis.Table
instance-attribute
The other table that this table is linked to. Trailing underscore to avoid name conflicts with column names.
mismo.types.LinkedTable.link_counts() -> LinkCountsTable
Describes 'There are n_records
in self that linked to n_links
in `other'.
This is basically a histogram of self.with_n_links()
See Also
with_n_links
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = Linkage(left, right, links)
There is 1 record in left (6) that didn't match any in right. There is 1 record in left (5) that matched 1 in right. There is 1 record in left (4) that matched 2 in right.
>>> linkage.left.link_counts()
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ n_records ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 1 │ 2 │
│ 1 │ 1 │
│ 1 │ 0 │
└───────────┴─────────┘
All 3 records in right matched 1 in left.
>>> linkage.right.link_counts()
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ n_records ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 3 │ 1 │
└───────────┴─────────┘
mismo.types.LinkedTable.make_pair(left: ir.Table, right: ir.Table, links: ir.Table) -> tuple[_typing.Self, _typing.Self]
classmethod
Create a pair of LinkedTables from left, right, and links.
This basically just wraps the logic to make it so that the the _l and _r suffixes in the links table are consistent.
mismo.types.LinkedTable.with_many_linked_values(*values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None, **named_values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None) -> _typing.Self
This table, with array<>
columns of values from linked records in other
This is very similar to with_single_linked_values
, except:
- This includes values from all linked records, not just the single match.
- Here, the values from the N linked records are returned in an array. There, since there is only one linked record, we return it directly, not as a length-1 array.
This uses the same semantics as ibis.Table.select(*values, **named_values)
to choose which values from other
to add to self
If no values are provided, we will by default add a column named other
with all the values from other packed into a struct.
PARAMETER | DESCRIPTION |
---|---|
values
|
unnamed values
TYPE:
|
named_values
|
named values
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with the new column.
|
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> this = ibis.memtable({"record_id": [4, 5, 6]})
>>> other = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> lt = LinkedTable(this, other, links)
>>> lt
┏━━━━━━━━━━━┓
┃ record_id ┃
┡━━━━━━━━━━━┩
│ int64 │
├───────────┤
│ 4 │
│ 5 │
│ 6 │
└───────────┘
Default is to pack everything into array
>>> lt.with_many_linked_values()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ other ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ array<struct<record_id: int64>> │
├───────────┼──────────────────────────────────────┤
│ 4 │ [{'record_id': 7}, {'record_id': 8}] │
│ 5 │ [{'record_id': 9}] │
│ 6 │ [] │
└───────────┴──────────────────────────────────────┘
Or you can select exactly which values you want. They will be returned in an array, one for each linked record:
>>> lt.with_many_linked_values(_.record_id.name("idrs"), plus_ones=_.record_id + 1)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ idrs ┃ plus_ones ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ array<int64> │ array<int64> │
├───────────┼──────────────────────┼──────────────────────┤
│ 4 │ [7, 8] │ [8, 9] │
│ 5 │ [9] │ [10] │
│ 6 │ [] │ [] │
└───────────┴──────────────────────┴──────────────────────┘
mismo.types.LinkedTable.with_n_links(*, name: str = 'n_links') -> _typing.Self
Add a column to this table with the number of links each record has.
PARAMETER | DESCRIPTION |
---|---|
name
|
The name of the new column.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with the new column.
|
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = Linkage(left, right, links)
>>> linkage.left.with_n_links().order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ record_id ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼─────────┤
│ 4 │ 2 │
│ 5 │ 1 │
│ 6 │ 0 │
└───────────┴─────────┘
>>> linkage.right.with_n_links(name="link_count").order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ record_id ┃ link_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int64 │ int64 │
├───────────┼────────────┤
│ 7 │ 1 │
│ 8 │ 1 │
│ 9 │ 1 │
└───────────┴────────────┘
mismo.types.LinkedTable.with_single_linked_values(*values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None, **named_values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None) -> _typing.Self
This table filtered to single matches, with values from the linked record.
This is very similar to with_many_linked_values
, except:
- It filters to only include records that have exactly 1 link.
- In
with_many_linked_values
the values from the N linked records are returned in an array. Here, since there is only one linked record, we return it directly, not as a length-1 array.
This uses the same semantics as ibis.Table.select(*values, **named_values)
to choose which values from other
to add to self
If no values are provided, we will by default add a column named other
with all the values from other packed into a struct.
PARAMETER | DESCRIPTION |
---|---|
values
|
unnamed values
TYPE:
|
named_values
|
named values
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
A new LinkedTable with the new column.
|
|
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> lt = LinkedTable(left, right, links)
>>> lt
┏━━━━━━━━━━━┓
┃ record_id ┃
┡━━━━━━━━━━━┩
│ int64 │
├───────────┤
│ 4 │
│ 5 │
│ 6 │
└───────────┘
We only include record with id 5, because it has exactly 1 link. Record 4 is linked to 2 records (7 and 8), and record 6 is linked to 0 records. Default is to pack everything into a struct:
>>> lt.with_single_linked_values()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ other ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ struct<record_id: int64> │
├───────────┼──────────────────────────┤
│ 5 │ {'record_id': 9} │
└───────────┴──────────────────────────┘
Or you can select exactly which values you want:
>>> lt.with_single_linked_values(_.record_id.name("idr"), plus_one=_.record_id + 1)
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ idr ┃ plus_one ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ int64 │ int64 │ int64 │
├───────────┼───────┼──────────┤
│ 5 │ 9 │ 10 │
└───────────┴───────┴──────────┘
mismo.types.LinkCountsTable
Bases: TableWrapper
A table representing the number of records binned by number of links.
eg "There were 700 records with 0 links, 300 with 1 link, 20 with 2 links, ..."
mismo.types.LinkCountsTable.n_links: ir.IntegerColumn
instance-attribute
The number of links.
mismo.types.LinkCountsTable.n_records: ir.IntegerColumn
instance-attribute
The number of records.
mismo.types.LinkCountsTable.__init__(t: ibis.Table) -> None
Create from an ibis table with exactly columns 'n_records' and 'n_links'.
PARAMETER | DESCRIPTION |
---|---|
t
|
The table with exactly columns 'n_records' and 'n_links'.
TYPE:
|
mismo.types.LinkCountsTable.chart() -> alt.Chart
A bar chart of the number of records by the number of links.
Number of Records
Number of By Link Count
Records
| █
100,000 | █ █
| █ █
| █ █ █
10,000 | █ █ █
| █ █ █
| █ █ █
1,000 | █ █ █ █
| █ █ █ █ █ █
| █ █ █ █ █ █ █
100 | █ █ █ █ █ █ █ █ █
| 0 1 2 3 4 10 12 14 23
Number of Links
mismo.types.Updates
Bases: TableWrapper
A Table representing how individual rows were updated.
This only represents differences in rows that exist both before and after the changes. To represent a general difference between two tables, eg with insertions and deletions, use Diff.
This represents how each column has changed between two tables. If a column only has a 'before' field, it means this column was removed. If a column only has an 'after' field, it means this column was added. If a column has both 'before' and 'after' fields, it means this column was present in both tables.
mismo.types.Updates.filters = Filters
class-attribute
instance-attribute
A set of filters for convenience.
Examples:
>>> u = Updates.from_tables(before, after, join_on="id")
>>> u.filter(u.filters.all_different(["name", "age"]))
mismo.types.Updates.after() -> ibis.Table
The table after the changes.
mismo.types.Updates.apply_to(t: ibis.Table, /, *, defaults: None | Any = _util.NOT_SET) -> ibis.Table
Return the input table with these updates applied to it.
PARAMETER | DESCRIPTION |
---|---|
t
|
The table to apply the updates to.
TYPE:
|
defaults
|
If the after table has more columns than the before table, you must provide defaults.
This is because there might be some rows in If None, as convenience, we will use
TYPE:
|
mismo.types.Updates.before() -> ibis.Table
The table before the changes.
mismo.types.Updates.from_tables(before: ibis.Table, after: ibis.Table, *, join_on: str, schema: Literal['exactly', 'names', 'lax'] = 'exactly') -> Updates
classmethod
Create from two different tables by joining them on a key.
Note that this results in only the rows that are present in both tables, due to the inner join on the key. Insertions and deletions should be handled separately.
mismo.types.Diff
A set of insertions, updates, and deletions between two tables.
This can only semantically represent 1-1 relationships,
eg a row in the before
table corresponds to only 0 or 1 row in the after
table,
and vice versa.
To represent more general 0-N relationships, use a Linkage. eg many rows in a "dirty" dataset are linked to a single row in a "clean" dataset.
This is able to represent a difference between two tables with different schemas, eg if a column is added or removed.
mismo.types.Diff.stats: DiffStats
property
Statistics about this Diff.
mismo.types.Diff.after() -> ibis.Table
The table after the changes.
mismo.types.Diff.before() -> ibis.Table
The table before the changes.
mismo.types.Diff.cache() -> Diff
Return a new Diff with all tables cached.
mismo.types.Diff.chart() -> alt.Chart
Create a chart that shows the flow of rows through the diff.
This is just a convenience method that calls self.stats.chart()
.
Rows
800,000 | ▓▓ Inserted (50,000)
| ▒▒ Deleted (100,000)
700,000 | ░░ Updated (200,000)
| ██ Unchanged (300,000)
600,000 | ▒▒▒▒
| ▒▒▒▒ ▓▓▓▓
500,000 | ░░░░ ░░░░
| ░░░░ ░░░░
400,000 | ░░░░ ░░░░
| ░░░░ ░░░░
300,000 | ████ ████
| ████ ████
200,000 | ████ ████
| ████ ████
100,000 | ████ ████
| ████ ████
0 | Before (600,000) After (550,000)
mismo.types.Diff.deletions() -> ibis.Table
Rows that were in before
but not in after
.
This has the same schema as before
.
mismo.types.Diff.from_before_after(before: ibis.Table, after: ibis.Table, *, join_on: str) -> _typing.Self
classmethod
Create from a before and after table.
mismo.types.Diff.from_deltas(*, before: ibis.Table, insertions: ibis.Table | None = None, updates: Updates | None = None, deletions: ibis.Table | None = None, updates_defaults: None | Any | Literal[_util.NOT_SET] = _util.NOT_SET) -> _typing.Self
classmethod
Create from a starting point and a set of transformations.
PARAMETER | DESCRIPTION |
---|---|
before
|
The table before the changes.
TYPE:
|
insertions
|
Rows that were in
TYPE:
|
updates
|
Rows that were changed between
TYPE:
|
deletions
|
Rows that were in
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Diff
|
|
mismo.types.Diff.from_parquets(directory: str | Path, /, *, backend: ibis.BaseBackend | None = None) -> _typing.Self
classmethod
Create a Diff by reading parquets from the given directory.
mismo.types.Diff.insertions() -> ibis.Table
Rows that were in after
but not in before
.
This has the same schema as after
.
mismo.types.Diff.to_parquets(directory: str | Path, /, *, overwrite: bool = False) -> None
Write the tables in the changes to parquet files.
mismo.types.Diff.unchanged() -> ibis.Table
Rows that were unchanged between before
and after
.
mismo.types.Diff.updates() -> Updates
Rows that were changed between before
and after
.
self.updates().before()
and self.updates().after()
have the same schemas
as before
and after
, respectively.
mismo.types.DiffStats
Summary statistics about a Diff, such as number of insertions, deletions, etc.
mismo.types.DiffStats.__init__(diff: Diff) -> None
Create from a Diff.
mismo.types.DiffStats.chart() -> alt.Chart
Create a chart that shows the flow of rows through the diff.
Rows
800,000 | ▓▓ Inserted (50,000)
| ▒▒ Deleted (100,000)
700,000 | ░░ Updated (200,000)
| ██ Unchanged (300,000)
600,000 | ▒▒▒▒
| ▒▒▒▒ ▓▓▓▓
500,000 | ░░░░ ░░░░
| ░░░░ ░░░░
400,000 | ░░░░ ░░░░
| ░░░░ ░░░░
300,000 | ████ ████
| ████ ████
200,000 | ████ ████
| ████ ████
100,000 | ████ ████
| ████ ████
0 | Before (600,000) After (550,000)
mismo.types.DiffStats.n_after() -> int
cached
Number of rows in after
.
mismo.types.DiffStats.n_before() -> int
cached
Number of rows in before
.
mismo.types.DiffStats.n_deletions() -> int
cached
Number of rows that were in before
but not in after
.
mismo.types.DiffStats.n_insertions() -> int
cached
Number of rows that were in after
but not in before
.
mismo.types.DiffStats.n_unchanged() -> int
cached
Number of rows that were unchanged between before
and after
.
mismo.types.DiffStats.n_updates() -> int
cached
Number of rows that were changed between before
and after
.