Datasets

mismo.Datasets

An ordered, dict-like collection of tables of records.

All tables must have a column named 'record_id' that is globally unique. The dtype of the 'record_id' column must be the same in all tables. Besides that, the schema of the tables can be different.

This abstracts over the fact that some record linkage problems are deduplication, and thus involve only one table, while others are linkage between sources, and involve two tables.
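To make the record_id requirement concrete, here is a minimal sketch that checks global uniqueness across several tables. It does not use mismo or Ibis: tables are modeled as plain lists of dicts, and the `duplicated_record_ids` helper and sample data are hypothetical.

```python
from collections import Counter


def duplicated_record_ids(tables: dict[str, list[dict]]) -> set:
    """Return every record_id that occurs more than once across all tables."""
    counts = Counter(
        row["record_id"] for rows in tables.values() for row in rows
    )
    return {record_id for record_id, n in counts.items() if n > 1}


# Illustrative data: record_id 2 appears in both tables, so it is not
# globally unique even though each table is unique on its own.
tables = {
    "left": [{"record_id": 1}, {"record_id": 2}],
    "right": [{"record_id": 2}, {"record_id": 3}],
}
duplicates = duplicated_record_ids(tables)
print(duplicates)
```

An empty result means the ids are unique across the whole collection, which is the condition described above.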

mismo.Datasets.names property

names: tuple[str, ...]

The names of the underlying tables.

mismo.Datasets.shared_schema property

shared_schema: Schema

The schema that all tables have in common.

Columns with conflicting types are omitted.

This is useful for operations that require the same schema in all tables, for example getting all the record_ids.
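As a rough illustration of this rule, the sketch below models each table's schema as a plain dict from column name to type name, and keeps only the columns that appear in every table with the same type. It does not call mismo or Ibis; the `shared_schema` function and the sample schemas are hypothetical stand-ins.

```python
def shared_schema(schemas: list[dict[str, str]]) -> dict[str, str]:
    """Keep columns that appear in every schema with an identical type."""
    if not schemas:
        return {}
    first, *rest = schemas
    return {
        col: dtype
        for col, dtype in first.items()
        # A column missing from `other` yields None, which never equals dtype.
        if all(other.get(col) == dtype for other in rest)
    }


people = {"record_id": "int64", "name": "string", "age": "int64"}
customers = {
    "record_id": "int64",
    "name": "string",
    "age": "string",  # conflicting type, so "age" is omitted
    "email": "string",  # only in one table, so "email" is omitted
}

common = shared_schema([people, customers])
print(common)
```

Here only `record_id` and `name` survive, since `age` has conflicting types and `email` exists in only one table.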

mismo.Datasets.tables property

tables: tuple[Table, ...]

The underlying tables.

mismo.Datasets.__contains__

__contains__(key: str | Table) -> bool

Check if a table is in the collection by name or value.

mismo.Datasets.__getitem__

__getitem__(key: str | int) -> Table

Get a table by name or index.

mismo.Datasets.__init__

__init__(
    tables: Table | Iterable[Table] | Mapping[str, Table],
) -> None

Create a new Datasets.

If tables is a Mapping, it is used as-is. If tables is a single table, it is named "dataset_0". If tables is an iterable of tables, then we try to find their names by:

- calling get_name() on each table. If that fails, then fall back to...
- using "left" and "right" for two tables
- using "dataset_i" otherwise
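The naming fallback for an iterable of tables can be sketched in plain Python. This is not mismo's implementation: `FakeTable` and `infer_names` are hypothetical, and the sketch assumes that if get_name() fails for any table, every table receives a fallback name.

```python
from collections.abc import Iterable
from typing import Optional


class FakeTable:
    """Hypothetical stand-in for a table; get_name() raises when unnamed."""

    def __init__(self, name: Optional[str] = None) -> None:
        self._name = name

    def get_name(self) -> str:
        if self._name is None:
            raise ValueError("table has no name")
        return self._name


def infer_names(tables: Iterable[FakeTable]) -> tuple[str, ...]:
    """Pick names for tables using the fallback order described above."""
    tables = list(tables)
    try:
        # First choice: ask each table for its own name.
        return tuple(t.get_name() for t in tables)
    except ValueError:
        pass  # Assumption: one failure sends all tables to the fallback.
    if len(tables) == 2:
        return ("left", "right")
    return tuple(f"dataset_{i}" for i in range(len(tables)))


print(infer_names([FakeTable("people"), FakeTable("customers")]))
print(infer_names([FakeTable(), FakeTable()]))
print(infer_names([FakeTable(), FakeTable(), FakeTable()]))
```

Passing a Mapping avoids the inference entirely, which is the simplest way to get predictable names.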

mismo.Datasets.__iter__

__iter__() -> Iterable[Table]

Iterate over the tables in the order they were added.

mismo.Datasets.__len__

__len__() -> int

The number of tables.

mismo.Datasets.all_record_ids

all_record_ids() -> Column

Return all unique record_ids from all tables.

mismo.Datasets.cache

cache() -> Datasets

Return a new Datasets with all tables cached.

mismo.Datasets.filter

filter(
    f: Deferred | Callable[[str, Table], Table],
) -> Datasets

Return a new Datasets with all tables filtered by f.

mismo.Datasets.items

items() -> Iterable[tuple[str, Table]]

The (name, table) pairs of the underlying tables.

mismo.Datasets.keys

keys() -> Iterable[str]

The names of the underlying tables.

mismo.Datasets.map

map(f: Deferred | Callable[[str, Table], Table]) -> Datasets

Return a new Datasets with all tables transformed by f.
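For the callable form of f, the signature suggests that f receives each table's name together with the table and returns the transformed table. A minimal sketch of that idea follows, with tables modeled as lists of dicts; `map_tables` and `tag_source` are hypothetical, and the Deferred form is not covered.

```python
from typing import Callable

Rows = list[dict]


def map_tables(
    tables: dict[str, Rows], f: Callable[[str, Rows], Rows]
) -> dict[str, Rows]:
    """Apply f(name, table) to every table, keeping names and order."""
    return {name: f(name, rows) for name, rows in tables.items()}


def tag_source(name: str, rows: Rows) -> Rows:
    """Example transform: record which table each row came from."""
    return [{**row, "source": name} for row in rows]


tables = {"left": [{"record_id": 1}], "right": [{"record_id": 2}]}
tagged = map_tables(tables, tag_source)
print(tagged)
```

The input collection is left untouched and a new one is returned, matching the "return a new Datasets" wording above.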

mismo.Datasets.unioned

unioned() -> Table

Select the self.shared_schema columns from all tables and union them.
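The select-then-union step can be pictured with plain Python data. In this sketch, tables are lists of dicts and the shared columns are passed in explicitly; the `unioned` function here is a hypothetical stand-in, and it simply concatenates rows, since whether the real method removes duplicate rows is not stated here.

```python
def unioned(tables: dict[str, list[dict]], shared_columns: list[str]) -> list[dict]:
    """Project every row onto the shared columns, then concatenate the tables."""
    return [
        {col: row[col] for col in shared_columns}
        for rows in tables.values()
        for row in rows
    ]


tables = {
    "left": [{"record_id": 1, "name": "Ann", "age": 31}],
    "right": [{"record_id": 2, "name": "Bob", "email": "bob@example.com"}],
}

# "age" and "email" are not shared, so they are dropped before the union.
combined = unioned(tables, ["record_id", "name"])
print(combined)
```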

mismo.Datasets.values

values() -> Iterable[Table]

The underlying tables.