
Datasets

mismo.Datasets

An ordered, dict-like collection of tables of records.

All tables must have a column named 'record_id' that is globally unique. The dtype of the 'record_id' column must be the same in all tables. Besides that, the schema of the tables can be different.
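The constraints above can be sketched as a small pure-Python check. This is an illustration only, not mismo's own validation code: tables are modelled as lists of row dicts rather than ibis tables, and `check_record_ids` is a hypothetical helper.

```python
from collections import Counter


def check_record_ids(tables: dict[str, list[dict]]) -> None:
    """Raise ValueError if record_id is missing, mixed-type, or duplicated.

    Hypothetical helper: each table is modelled as a list of row dicts.
    """
    seen = Counter()
    id_types = set()
    for name, rows in tables.items():
        for row in rows:
            if "record_id" not in row:
                raise ValueError(f"table {name!r} has a row without record_id")
            seen[row["record_id"]] += 1
            id_types.add(type(row["record_id"]))
    # The record_id type must agree across all tables.
    if len(id_types) > 1:
        type_names = sorted(t.__name__ for t in id_types)
        raise ValueError(f"record_id has mixed types: {type_names}")
    # Uniqueness is checked across tables, not only within each one.
    duplicates = [rid for rid, count in seen.items() if count > 1]
    if duplicates:
        raise ValueError(f"record_id values are not globally unique: {duplicates}")


# Other columns may differ between tables; only record_id is constrained.
check_record_ids(
    {
        "left": [{"record_id": 1, "name": "Ann"}],
        "right": [{"record_id": 2, "city": "Oslo"}],
    }
)
```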

This abstraction covers both kinds of record linkage problem: deduplication, which involves only one table, and linkage, which involves two tables.

mismo.Datasets.names: tuple[str, ...] property

The names of the underlying tables.

mismo.Datasets.shared_schema: ibis.Schema property

The schema that all tables have in common.

Columns with conflicting types are omitted.

This is useful for operations that require the same schema in all tables, for example getting all the record_ids.
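The rule for the shared schema can be sketched in plain Python. This is not the library's implementation: schemas are modelled as dicts from column name to a type string, `shared_schema` here is a hypothetical standalone function, and keeping the column order of the first table is an assumption.

```python
def shared_schema(schemas: list[dict[str, str]]) -> dict[str, str]:
    """Keep only columns that appear in every schema with the same type."""
    if not schemas:
        return {}
    first, *rest = schemas
    return {
        col: dtype
        for col, dtype in first.items()
        if all(other.get(col) == dtype for other in rest)
    }


left = {"record_id": "int64", "name": "string", "age": "int64"}
right = {"record_id": "int64", "name": "string", "age": "string", "city": "string"}

# "age" is dropped because its types conflict; "city" is dropped because
# it is missing from the left table.
print(shared_schema([left, right]))
# → {'record_id': 'int64', 'name': 'string'}
```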

mismo.Datasets.tables: tuple[ir.Table, ...] property

The underlying tables.

mismo.Datasets.__contains__(key: str | ir.Table) -> bool

Check if a table is in the collection by name or value.

mismo.Datasets.__getitem__(key: str | int) -> ir.Table

Get a table by name or index.

mismo.Datasets.__init__(tables: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table]) -> None

Create a new Datasets.

If tables is a Mapping, it is used as-is. If tables is a single table, it is named "dataset_0". If tables is an iterable of tables, then we try to find their names by:

- calling get_name() on each table. If that fails, then fall back to...
- using "left" and "right" for two tables
- using "dataset_i" otherwise
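The naming rules above can be sketched as follows. This is a rough model, not mismo's constructor: `FakeTable` and `resolve_names` are hypothetical, and two details are assumptions the text does not settle: that a failing get_name() raises an exception, and that one failure makes every table fall back to the default names.

```python
from collections.abc import Mapping


class FakeTable:
    """Hypothetical stand-in for a table; name=None makes get_name() fail."""

    def __init__(self, name=None):
        self._name = name

    def get_name(self):
        if self._name is None:
            raise ValueError("table has no name")
        return self._name


def resolve_names(tables):
    """Return a dict of name -> table following the documented rules."""
    if isinstance(tables, Mapping):
        return dict(tables)  # used as-is
    if isinstance(tables, FakeTable):
        return {"dataset_0": tables}  # a single table
    tables = list(tables)
    try:
        names = [t.get_name() for t in tables]
    except Exception:
        # Assumption: any failure falls back for all tables at once.
        if len(tables) == 2:
            names = ["left", "right"]
        else:
            names = [f"dataset_{i}" for i in range(len(tables))]
    return dict(zip(names, tables))


print(list(resolve_names([FakeTable(), FakeTable()])))
# → ['left', 'right']
print(list(resolve_names([FakeTable("people"), FakeTable("orgs")])))
# → ['people', 'orgs']
```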

mismo.Datasets.__iter__() -> Iterable[ir.Table]

Iterate over the tables in the order they were added.

mismo.Datasets.__len__() -> int

The number of tables.

mismo.Datasets.all_record_ids() -> ir.Column

Return all unique record_ids from all tables.

mismo.Datasets.cache() -> Datasets

Return a new Datasets with all tables cached.

mismo.Datasets.filter(f: ibis.Deferred | Callable[[str, ir.Table], ir.Table]) -> ir.Table

Return a new Datasets with all tables filtered by f.

mismo.Datasets.items() -> Iterable[tuple[str, ir.Table]]

The (name, table) pairs of the underlying tables.

mismo.Datasets.keys() -> Iterable[str]

The names of the underlying tables.

mismo.Datasets.map(f: ibis.Deferred | Callable[[str, ir.Table], ir.Table]) -> ir.Table

Return a new Datasets with all tables transformed by f.
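The callable form of f receives each table's name together with the table. A minimal pure-Python sketch of that pattern is shown here; `map_tables` and `tag_source` are hypothetical, tables are modelled as lists of row dicts, and the ibis.Deferred form is not covered.

```python
from collections.abc import Callable

Rows = list[dict]


def map_tables(
    tables: dict[str, Rows], f: Callable[[str, Rows], Rows]
) -> dict[str, Rows]:
    """Apply f(name, table) to every table, keeping names and order."""
    return {name: f(name, rows) for name, rows in tables.items()}


def tag_source(name: str, rows: Rows) -> Rows:
    """Example transform: record which table each row came from."""
    return [{**row, "source": name} for row in rows]


tables = {"left": [{"record_id": 1}], "right": [{"record_id": 2}]}
tagged = map_tables(tables, tag_source)
print(tagged["right"])
# → [{'record_id': 2, 'source': 'right'}]
```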

mismo.Datasets.unioned() -> ir.Table

Select the self.shared_schema columns from all tables and union them.
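As a rough illustration of this operation, the sketch below projects each table onto the shared columns and concatenates the rows. It is a pure-Python model, not the library's code: each table is a hypothetical (schema, rows) pair, and keeping duplicate rows is an assumption.

```python
def unioned(tables: list[tuple[dict[str, str], list[dict]]]) -> list[dict]:
    """Project every table onto the shared columns, then stack the rows."""
    schemas = [schema for schema, _ in tables]
    first, *rest = schemas
    # Shared columns: present in every schema with the same type.
    shared = [
        col
        for col, dtype in first.items()
        if all(other.get(col) == dtype for other in rest)
    ]
    return [{col: row[col] for col in shared} for _, rows in tables for row in rows]


left = (
    {"record_id": "int64", "name": "string", "age": "int64"},
    [{"record_id": 1, "name": "Ann", "age": 30}],
)
right = (
    {"record_id": "int64", "name": "string", "city": "string"},
    [{"record_id": 2, "name": "Bob", "city": "Oslo"}],
)

print(unioned([left, right]))
# → [{'record_id': 1, 'name': 'Ann'}, {'record_id': 2, 'name': 'Bob'}]
```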

mismo.Datasets.values() -> Iterable[ir.Table]

The underlying tables.