Skip to content

Util Classes

Mismo Types

These are some utility classes. I tried to make them be simple and open to extension.

mismo.types.Linkage

Two tables of records and links between them.

This is semantically similar to a Diff object, except in a Diff object every row in left is linked to either 0 or 1 rows in right. Because there can't be many-to- relationships, Diffs allow for the semantics of insertions, updates, and deletions. eg "this row changed in these ways between these two tables".

On the other hand, a Linkage object is more general. It supports the semantics of a many-to-many relationship between two tables. Say you have a clean database of records. You just got a new batch of dirty data that might contain duplicates. Each record in the clean database might match multiple records in the dirty data. This makes it difficult to use a Diff object, because each clean record can't be paired up nicely with a single dirty record. A Linkage object is more appropriate in this case.

mismo.types.Linkage.left: LinkedTable property

The left Table.

A table of (record_id_l, record_id_r, ...) that link left and right.

mismo.types.Linkage.right: LinkedTable property

The right Table.

mismo.types.Linkage.__init__(left: ibis.Table, right: ibis.Table, links: ibis.Table) -> None

Create from two tables and a table of links between them.

PARAMETER DESCRIPTION
left

A Table of records, with at least a column 'record_id'.

TYPE: Table

right

A Table of records, with at least a column 'record_id'.

TYPE: Table

links

A Table of links between the two tables. Must have columns 'record_id_l' and 'record_id_r', which refer to the 'record_id' columns in left and right, respectively. May have other columns. May not have duplicate (record_id_l, record_id_r) pairs.

TYPE: Table

mismo.types.Linkage.cache() -> _typing.Self

Cache the left, right, and links tables.

RETURNS DESCRIPTION
A new Linkage with the cached tables.

mismo.types.Linkage.from_parquets(directory: str | Path, /, *, backend: ibis.BaseBackend | None = None) -> _typing.Self classmethod

Create a Linkage by reading parquets from the given directory.

mismo.types.Linkage.from_predicates(left: ibis.Table, right: ibis.Table, predicates) -> _typing.Self classmethod

Create a Linkage from join predicates.

This is useful if you don't already have a table of links. This will create a table of links by joining the left and right tables on the given predicates. It will either use the existing record_id columns in the tables, or create new ones if they don't exist.

PARAMETER DESCRIPTION
left

The left table.

TYPE: Table

right

The right table.

TYPE: Table

predicates

The join predicates. Anything that ibis.join() accepts.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> tl = ibis.memtable({"record_id": [1, 2, 3]})
>>> tr = ibis.memtable({"record_id": [1, 2, 2]})
>>> linkage = Linkage.from_predicates(tl, tr, "record_id")
>>> linkage.links
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64       │ int64       │
├─────────────┼─────────────┤
│           1 │           1 │
│           2 │           2 │
│           2 │           2 │
└─────────────┴─────────────┘

A side by side altair Chart of left.link_counts()andright.link_counts()`

Number of           Left Table               Number of    Right Table
  Records                                      Records
        |    █                                       |    █
100,000 | █  █                                       |    █
        | █  █                                10,000 |    █
        | █  █  █                                    |    █
 10,000 | █  █  █                                    |    █  █
        | █  █  █                                    | █  █  █
        | █  █  █                              1,000 | █  █  █
  1,000 | █  █  █  █                                 | █  █  █
        | █  █  █  █  █  █                           | █  █  █
        | █  █  █  █  █  █  █                        | █  █  █  █
    100 | █  █  █  █  █  █  █  █  █              100 | █  █  █  █
        | 0  1  2  3  4 10 12 14 23                  | 0  1  2  3
        Number of Links                              Number of Links

mismo.types.Linkage.to_parquets(directory: str | Path) -> None

Write the needle, haystack, and links to parquet files in the given directory.

mismo.types.LinkedTable

Bases: TableWrapper

A table of records that are linked to another table.

Each record here can be linked to 0-N records in the other_.

This acts just like an Ibis Table, but it has a few extra attributes and methods that make it more ergonomic to work with, eg to add data from the linked table.

The table of links between this table and other. Trailing underscore to avoid name conflicts with column names.

mismo.types.LinkedTable.other_: ibis.Table instance-attribute

The other table that this table is linked to. Trailing underscore to avoid name conflicts with column names.

Describes 'There are n_records in self that linked to n_links in `other'.

This is basically a histogram of self.with_n_links()

See Also

with_n_links

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = Linkage(left, right, links)

There is 1 record in left (6) that didn't match any in right. There is 1 record in left (5) that matched 1 in right. There is 1 record in left (4) that matched 2 in right.

>>> linkage.left.link_counts()
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ n_records ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64     │ int64   │
├───────────┼─────────┤
│         1 │       2 │
│         1 │       1 │
│         1 │       0 │
└───────────┴─────────┘

All 3 records in right matched 1 in left.

>>> linkage.right.link_counts()
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ n_records ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64     │ int64   │
├───────────┼─────────┤
│         3 │       1 │
└───────────┴─────────┘

mismo.types.LinkedTable.with_many_linked_values(*values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None, **named_values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None) -> _typing.Self

This table, with array<> columns of values from linked records in other

This is very similar to with_single_linked_values, except:

  • This includes values from all linked records, not just the single match.
  • Here, the values from the N linked records are returned in an array. There, since there is only one linked record, we return it directly, not as a length-1 array.

This uses the same semantics as ibis.Table.select(*values, **named_values) to choose which values from other to add to self

If no values are provided, we will by default add a column named other with all the values from other packed into a struct.

PARAMETER DESCRIPTION
values

unnamed values

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: ()

named_values

named values

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: {}

RETURNS DESCRIPTION
A new LinkedTable with the new column.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> this = ibis.memtable({"record_id": [4, 5, 6]})
>>> other = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> lt = LinkedTable(this, other, links)
>>> lt
┏━━━━━━━━━━━┓
┃ record_id ┃
┡━━━━━━━━━━━┩
│ int64     │
├───────────┤
│         4 │
│         5 │
│         6 │
└───────────┘

Default is to pack everything into array>:

>>> lt.with_many_linked_values()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ other                                ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64     │ array<struct<record_id: int64>>      │
├───────────┼──────────────────────────────────────┤
│         4 │ [{'record_id': 7}, {'record_id': 8}] │
│         5 │ [{'record_id': 9}]                   │
│         6 │ []                                   │
└───────────┴──────────────────────────────────────┘

Or you can select exactly which values you want. They will be returned in an array, one for each linked record:

>>> lt.with_many_linked_values(_.record_id.name("idrs"), plus_ones=_.record_id + 1)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ idrs                 ┃ plus_ones            ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ int64     │ array<int64>         │ array<int64>         │
├───────────┼──────────────────────┼──────────────────────┤
│         4 │ [7, 8]               │ [8, 9]               │
│         5 │ [9]                  │ [10]                 │
│         6 │ []                   │ []                   │
└───────────┴──────────────────────┴──────────────────────┘

Add a column to this table with the number of links each record has.

PARAMETER DESCRIPTION
name

The name of the new column.

TYPE: str DEFAULT: 'n_links'

RETURNS DESCRIPTION
A new LinkedTable with the new column.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> linkage = Linkage(left, right, links)
>>> linkage.left.with_n_links().order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ record_id ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64     │ int64   │
├───────────┼─────────┤
│         4 │       2 │
│         5 │       1 │
│         6 │       0 │
└───────────┴─────────┘
>>> linkage.right.with_n_links().order_by("record_id")
┏━━━━━━━━━━━┳━━━━━━━━━┓
┃ record_id ┃ n_links ┃
┡━━━━━━━━━━━╇━━━━━━━━━┩
│ int64     │ int64   │
├───────────┼─────────┤
│         7 │       1 │
│         8 │       1 │
│         9 │       1 │
└───────────┴─────────┘

mismo.types.LinkedTable.with_single_linked_values(*values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None, **named_values: ibis.Deferred | Callable[[ibis.Table], ir.Value] | None) -> _typing.Self

This table filtered to single matches, with values from the linked record.

This is very similar to with_many_linked_values, except:

  • It filters to only include records that have exactly 1 link.
  • In with_many_linked_values the values from the N linked records are returned in an array. Here, since there is only one linked record, we return it directly, not as a length-1 array.

This uses the same semantics as ibis.Table.select(*values, **named_values) to choose which values from other to add to self

If no values are provided, we will by default add a column named other with all the values from other packed into a struct.

PARAMETER DESCRIPTION
values

unnamed values

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: ()

named_values

named values

TYPE: Deferred | Callable[[Table], Value] | None DEFAULT: {}

RETURNS DESCRIPTION
A new LinkedTable with the new column.

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> left = ibis.memtable({"record_id": [4, 5, 6]})
>>> right = ibis.memtable({"record_id": [7, 8, 9]})
>>> links = ibis.memtable({"record_id_l": [4, 4, 5], "record_id_r": [7, 8, 9]})
>>> lt = LinkedTable(left, right, links)
>>> lt
┏━━━━━━━━━━━┓
┃ record_id ┃
┡━━━━━━━━━━━┩
│ int64     │
├───────────┤
│         4 │
│         5 │
│         6 │
└───────────┘

We only include record with id 5, because it has exactly 1 link. Record 4 is linked to 2 records (7 and 8), and record 6 is linked to 0 records. Default is to pack everything into a struct:

>>> lt.with_single_linked_values()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ other                    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64     │ struct<record_id: int64> │
├───────────┼──────────────────────────┤
│         5 │ {'record_id': 9}         │
└───────────┴──────────────────────────┘

Or you can select exactly which values you want:

>>> lt.with_single_linked_values(_.record_id.name("idr"), plus_one=_.record_id + 1)
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ idr   ┃ plus_one ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ int64     │ int64 │ int64    │
├───────────┼───────┼──────────┤
│         5 │     9 │       10 │
└───────────┴───────┴──────────┘

mismo.types.LinkCountsTable

Bases: TableWrapper

A table representing the number of records binned by number of links.

eg "There were 700 records with 0 links, 300 with 1 link, 20 with 2 links, ..."

The number of links.

mismo.types.LinkCountsTable.n_records: ir.IntegerColumn instance-attribute

The number of records.

mismo.types.LinkCountsTable.__init__(t: ibis.Table) -> None

Create from an ibis table with exactly columns 'n_records' and 'n_links'.

PARAMETER DESCRIPTION
t

The table with exactly columns 'n_records' and 'n_links'.

TYPE: Table

mismo.types.LinkCountsTable.chart() -> alt.Chart

A bar chart of the number of records by the number of links.

                 Number of Records
Number of          By Link Count
  Records
        |    █
100,000 | █  █
        | █  █
        | █  █  █
 10,000 | █  █  █
        | █  █  █
        | █  █  █
  1,000 | █  █  █  █
        | █  █  █  █  █  █
        | █  █  █  █  █  █  █
    100 | █  █  █  █  █  █  █  █  █
        | 0  1  2  3  4 10 12 14 23
        Number of Links

mismo.types.Updates

Bases: TableWrapper

A Table representing how individual rows were updated.

This only represents differences in rows that exist both before and after the changes. To represent a general difference between two tables, eg with insertions and deletions, use Diff.

This represents how each column has changed between two tables. If a column only has a 'before' field, it means this column was removed. If a column only has an 'after' field, it means this column was added. If a column has both 'before' and 'after' fields, it means this column was present in both tables.

mismo.types.Updates.filters = Filters class-attribute instance-attribute

A set of filters for convenience.

Examples:

>>> u = Updates.from_tables(before, after, join_on="id")
>>> u.filter(u.filters.all_different(["name", "age"]))

mismo.types.Updates.after() -> ibis.Table

The table after the changes.

mismo.types.Updates.apply_to(t: ibis.Table, *, defaults: None | Literal[_util.NOT_SET] | Any = _util.NOT_SET) -> ibis.Table

Return the input table with these updates applied to it.

PARAMETER DESCRIPTION
t

The table to apply the updates to.

TYPE: Table

defaults

If the after table has more columns than the before table, you must provide defaults. This is because there might be some rows in t that are not touched by this Updates. We need to know what to put in those columns for the untouched rows. This accepts anything that ibis.Table.mutate() accepts.

If None, as convenience, we will use null as the default for all new columns. If _util.NOT_SET, an error is raised if the after table has more columns than the before table.

TYPE: None | Literal[NOT_SET] | Any DEFAULT: NOT_SET

mismo.types.Updates.before() -> ibis.Table

The table before the changes.

mismo.types.Updates.from_tables(before: ibis.Table, after: ibis.Table, *, join_on: str, schema: Literal['exactly', 'names', 'lax'] = 'exactly') -> Updates classmethod

Create from two different tables by joining them on a key.

Note that this results in only the rows that are present in both tables, due to the inner join on the key. Insertions and deletions should be handled separately.

mismo.types.Diff

A set of insertions, updates, and deletions between two tables.

This can only semantically represent 1-1 relationships, eg a row in the before table corresponds to only 0 or 1 row in the after table, and vice versa.

To represent more general 0-N relationships, use a Linkage. eg many rows in a "dirty" dataset are linked to a single row in a "clean" dataset.

This is able to represent a difference between two tables with different schemas, eg if a column is added or removed.

mismo.types.Diff.stats: DiffStats property

Statistics about this Diff.

mismo.types.Diff.after() -> ibis.Table

The table after the changes.

mismo.types.Diff.before() -> ibis.Table

The table before the changes.

mismo.types.Diff.cache() -> Diff

Return a new Diff with all tables cached.

mismo.types.Diff.deletions() -> ibis.Table

Rows that were in before but not in after.

This has the same schema as before.

mismo.types.Diff.from_before_after(before: ibis.Table, after: ibis.Table, *, join_on: str) -> _typing.Self classmethod

Create from a before and after table.

mismo.types.Diff.from_deltas(*, before: ibis.Table, insertions: ibis.Table | None = None, updates: Updates | None = None, deletions: ibis.Table | None = None, updates_defaults: None | Any | Literal[_util.NOT_SET] = _util.NOT_SET) -> _typing.Self classmethod

Create from a starting point and a set of transformations.

PARAMETER DESCRIPTION
before

The table before the changes.

TYPE: Table

insertions

Rows that were in after but not in before. If None, we assume there are no insertions.

TYPE: Table | None DEFAULT: None

updates

Rows that were changed between before and after. If None, we assume there are no updates.

TYPE: Updates | None DEFAULT: None

deletions

Rows that were in before but not in after. If None, we assume there are no deletions.

TYPE: Table | None DEFAULT: None

RETURNS DESCRIPTION
Diff

mismo.types.Diff.from_parquets(directory: str | Path, /, *, backend: ibis.BaseBackend | None = None) -> _typing.Self classmethod

Create a Diff by reading parquets from the given directory.

mismo.types.Diff.insertions() -> ibis.Table

Rows that were in after but not in before.

This has the same schema as after.

mismo.types.Diff.to_parquets(directory: str | Path) -> None

Write the tables in the changes to parquet files.

mismo.types.Diff.unchanged() -> ibis.Table

Rows that were unchanged between before and after.

mismo.types.Diff.updates() -> Updates

Rows that were changed between before and after.

self.updates().before() and self.updates().after() have the same schemas as before and after, respectively.

mismo.types.DiffStats

Summary statistics about a Diff, such as number of insertions, deletions, etc.

mismo.types.DiffStats.__init__(diff: Diff) -> None

Create from a Diff.

mismo.types.DiffStats.chart() -> alt.Chart

Create a chart that shows the flow of rows through the diff.

Rows
800,000 |                                 ▓▓  Inserted (50,000)
        |                                 ▒▒  Deleted (100,000)
700,000 |                                 ░░  Updated (200,000)
        |                                 ██  Unchanged (300,000)
600,000 |      ▒▒▒▒
        |      ▒▒▒▒             ▓▓▓▓
500,000 |      ░░░░             ░░░░
        |      ░░░░             ░░░░
400,000 |      ░░░░             ░░░░
        |      ░░░░             ░░░░
300,000 |      ████             ████
        |      ████             ████
200,000 |      ████             ████
        |      ████             ████
100,000 |      ████             ████
        |      ████             ████
      0 | Before (600,000)  After (550,000)

mismo.types.DiffStats.n_after() -> int cached

Number of rows in after.

mismo.types.DiffStats.n_before() -> int cached

Number of rows in before.

mismo.types.DiffStats.n_deletions() -> int cached

Number of rows that were in before but not in after.

mismo.types.DiffStats.n_insertions() -> int cached

Number of rows that were in after but not in before.

mismo.types.DiffStats.n_unchanged() -> int cached

Number of rows that were unchanged between before and after.

mismo.types.DiffStats.n_updates() -> int cached

Number of rows that were changed between before and after.