Datasets
mismo.Datasets
An ordered, dict-like collection of tables of records.
All tables must have a column named 'record_id' that is globally unique. The dtype of the 'record_id' column must be the same in all tables. Besides that, the schema of the tables can be different.
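The constraints above can be checked by hand before building a collection. The following is a minimal sketch in plain Python, not mismo's own validation: tables are modeled as lists of row dicts, and `record_id_problems` is a hypothetical helper name introduced only for illustration.

```python
def record_id_problems(tables):
    """Report violations of the record_id constraints.

    `tables` maps a table name to a list of row dicts. This is a plain-Python
    stand-in for real tables, used only for illustration.
    """
    problems = []
    owner = {}        # record_id -> name of the table where it was first seen
    id_types = set()  # Python type names seen for record_id values
    for name, rows in tables.items():
        for row in rows:
            if "record_id" not in row:
                problems.append(f"{name}: row is missing 'record_id'")
                continue
            rid = row["record_id"]
            id_types.add(type(rid).__name__)
            if rid in owner:
                # The id must be unique globally, not just within one table.
                problems.append(
                    f"duplicate record_id {rid!r} (seen in {owner[rid]} and {name})"
                )
            else:
                owner[rid] = name
    if len(id_types) > 1:
        problems.append(f"mixed record_id types: {sorted(id_types)}")
    return problems


good = {"left": [{"record_id": 1}, {"record_id": 2}], "right": [{"record_id": 3}]}
bad = {"left": [{"record_id": 1}], "right": [{"record_id": 1}, {"record_id": "x"}]}

good_problems = record_id_problems(good)
bad_problems = record_id_problems(bad)
```

Here `good_problems` is empty, while `bad_problems` reports both the id shared between the two tables and the mixed id types.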
This abstraction lets the same code handle both kinds of record linkage problem: deduplication, which involves only one table, and linkage, which involves two tables.
mismo.Datasets.names: tuple[str, ...]
property
The names of the underlying tables.
mismo.Datasets.shared_schema: ibis.Schema
property
The schema that all tables have in common.
Columns with conflicting types are omitted.
This is useful for operations that require the same schema in all tables, for example getting all the record_ids.
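The rule described above (keep a column only if every table has it with the same type) can be sketched in plain Python. This is an illustration, not mismo's implementation: schemas are modeled as dicts mapping column name to a dtype string rather than as ibis.Schema objects, and `shared_schema_sketch` is a hypothetical helper name.

```python
def shared_schema_sketch(schemas):
    """Return the columns present in every schema with an identical dtype.

    `schemas` is a list of dicts mapping column name -> dtype string.
    These dicts are a stand-in for real schema objects.
    """
    if not schemas:
        return {}
    first, *rest = schemas
    shared = {}
    for name, dtype in first.items():
        # Keep the column only if every other schema has it with the same dtype.
        if all(other.get(name) == dtype for other in rest):
            shared[name] = dtype
    return shared


left = {"record_id": "int64", "name": "string", "zip": "string"}
right = {"record_id": "int64", "name": "string", "zip": "int64", "phone": "string"}

shared = shared_schema_sketch([left, right])
print(shared)  # {'record_id': 'int64', 'name': 'string'}
```

In this example `zip` is dropped because its type conflicts between the two tables, and `phone` is dropped because only one table has it.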
mismo.Datasets.tables: tuple[ir.Table, ...]
property
The underlying tables.
mismo.Datasets.__contains__(key: str | ir.Table) -> bool
Check if a table is in the collection by name or value.
mismo.Datasets.__getitem__(key: str | int) -> ir.Table
Get a table by name or index.
mismo.Datasets.__init__(tables: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table]) -> None
Create a new Datasets.
If tables is a Mapping, then it is used as-is.
If tables is a single table, it is named "dataset_0".
If tables is an iterable of tables, then we try to find their names by:
- calling get_name() on each table. If that fails, then fall back to...
- using "left" and "right" for two tables
- using "dataset_i" otherwise
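The naming rules above can be sketched as a small standalone function. This is a sketch under stated assumptions, not mismo's actual code: `FakeTable` and `resolve_names_sketch` are hypothetical names, the stand-in table models only `get_name()`, and the sketch assumes the fallback is all-or-nothing (if any table fails to report a name, positional names are used for all of them), which the description does not specify.

```python
from collections.abc import Mapping


class FakeTable:
    """Hypothetical stand-in for a table; only get_name() is modeled."""

    def __init__(self, name=None):
        self._name = name

    def get_name(self):
        if self._name is None:
            raise AttributeError("table has no name")
        return self._name


def resolve_names_sketch(tables):
    """Return the names a collection would assign, following the rules above."""
    if isinstance(tables, Mapping):
        # A mapping already carries its names.
        return list(tables.keys())
    if isinstance(tables, FakeTable):
        # A single table always gets the fixed name.
        return ["dataset_0"]
    tables = list(tables)
    try:
        return [t.get_name() for t in tables]
    except Exception:
        # Assumption: one failure means falling back for every table.
        pass
    if len(tables) == 2:
        return ["left", "right"]
    return [f"dataset_{i}" for i in range(len(tables))]


named = resolve_names_sketch([FakeTable("people"), FakeTable("orgs")])
pair = resolve_names_sketch([FakeTable(), FakeTable()])
many = resolve_names_sketch([FakeTable(), FakeTable(), FakeTable()])
print(named, pair, many)
```

Passing a Mapping is the most predictable option, since it avoids the fallback rules entirely.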
mismo.Datasets.__iter__() -> Iterable[ir.Table]
Iterate over the tables in the order they were added.
mismo.Datasets.__len__() -> int
The number of tables.
mismo.Datasets.all_record_ids() -> ir.Column
Return all unique record_ids from all tables.
mismo.Datasets.cache() -> Datasets
Return a new Datasets with all tables cached.
mismo.Datasets.filter(f: ibis.Deferred | Callable[[str, ir.Table], ir.Table]) -> ir.Table
Return a new Datasets with all tables filtered by f.
mismo.Datasets.items() -> Iterable[tuple[str, ir.Table]]
The (name, table) pairs of the underlying tables.
mismo.Datasets.keys() -> Iterable[str]
The names of the underlying tables.
mismo.Datasets.map(f: ibis.Deferred | Callable[[str, ir.Table], ir.Table]) -> ir.Table
Return a new Datasets with all tables transformed by f.
mismo.Datasets.unioned() -> ir.Table
Select the self.shared_schema columns from all tables and union them.
mismo.Datasets.values() -> Iterable[ir.Table]
The underlying tables.