Datasets

`mismo.Datasets`

An ordered, dict-like collection of tables of records.

All tables must have a column named 'record_id' that is globally unique. The dtype of the 'record_id' column must be the same in all tables. Besides that, the schema of the tables can be different.
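As a rough illustration of what these constraints mean, the following pure-Python sketch checks them on toy data. It does not use mismo or ibis: each table is modeled as a plain dict with a `schema` and a list of `rows`, and `check_record_ids` is a hypothetical helper, not part of the library. Whether `Datasets` itself performs such a check is not stated here.

```python
def check_record_ids(tables: dict[str, dict]) -> None:
    """Raise ValueError if the record_id constraints are violated.

    Each table is modeled as {"schema": {column: dtype}, "rows": [dict, ...]}.
    This is an illustrative stand-in, not mismo's implementation.
    """
    dtypes = set()
    seen = set()
    for name, table in tables.items():
        if "record_id" not in table["schema"]:
            raise ValueError(f"table {name!r} has no 'record_id' column")
        dtypes.add(table["schema"]["record_id"])
        for row in table["rows"]:
            rid = row["record_id"]
            # Uniqueness is checked across all tables, not per table.
            if rid in seen:
                raise ValueError(f"record_id {rid!r} is not globally unique")
            seen.add(rid)
    if len(dtypes) > 1:
        raise ValueError(f"record_id dtypes differ: {sorted(dtypes)}")


left = {
    "schema": {"record_id": "int64", "name": "string"},
    "rows": [{"record_id": 1, "name": "a"}, {"record_id": 2, "name": "b"}],
}
right = {
    # Other columns may differ between tables; only record_id is constrained.
    "schema": {"record_id": "int64", "zip": "string"},
    "rows": [{"record_id": 3, "zip": "99501"}],
}

check_record_ids({"left": left, "right": right})  # passes silently
```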

This abstracts over the fact that some record linkage problems are deduplication, which involves only one table, while others are linkage, which involves two tables.

`mismo.Datasets.names: tuple[str, ...]` `property`

The names of the underlying tables.

`mismo.Datasets.shared_schema: ibis.Schema` `property`

The schema that all tables have in common.

Columns with conflicting types are omitted.

This is useful for operations that require the same schema in all tables, for example getting all the record_ids.
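The rule for the shared schema can be sketched in plain Python. The snippet below models each schema as a mapping from column name to a type string, which is an assumption made for illustration; the real property returns an `ibis.Schema`, and the helper name `shared_schema` here is a hypothetical free function.

```python
from collections.abc import Mapping


def shared_schema(schemas: list[Mapping[str, str]]) -> dict[str, str]:
    """Columns present in every schema with the same type in all of them."""
    if not schemas:
        return {}
    first, *rest = schemas
    common = {}
    for name, dtype in first.items():
        # Keep the column only if every other schema has it with an identical
        # type. A missing column or a conflicting type drops it.
        if all(other.get(name) == dtype for other in rest):
            common[name] = dtype
    return common


left = {"record_id": "int64", "name": "string", "zip": "string"}
right = {"record_id": "int64", "name": "string", "zip": "int64", "phone": "string"}

common = shared_schema([left, right])
print(common)
```

In this toy input, `zip` is dropped because its type conflicts and `phone` is dropped because only one table has it.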

`mismo.Datasets.tables: tuple[ir.Table, ...]` `property`

The underlying tables.

`mismo.Datasets.__contains__(key: str | ir.Table) -> bool`

Check if a table is in the collection by name or value.

`mismo.Datasets.__getitem__(key: str | int) -> ir.Table`

Get a table by name or index.

`mismo.Datasets.__init__(tables: ir.Table | Iterable[ir.Table] | Mapping[str, ir.Table]) -> None`

Create a new Datasets.

If `tables` is a Mapping, it is used as-is. If `tables` is a single table, it is named "dataset_0". If `tables` is an iterable of tables, then we try to find their names by:

- calling `get_name()` on each table. If that fails, then fall back to...
- using "left" and "right" for two tables
- using "dataset_i" otherwise
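The naming rule above can be sketched as follows. This is a pure-Python approximation, not mismo's code: `FakeTable` is a hypothetical stand-in for a table whose `get_name()` raises when no name is available, and the sketch assumes that a single failing `get_name()` call triggers the fallback for all tables, which the description does not spell out.

```python
from collections.abc import Mapping


class FakeTable:
    """Hypothetical stand-in for a table; get_name() raises when unnamed."""

    def __init__(self, name=None):
        self._name = name

    def get_name(self):
        if self._name is None:
            raise ValueError("table has no name")
        return self._name


def resolve_names(tables):
    """Return a dict of name -> table following the rule described above."""
    if isinstance(tables, Mapping):
        return dict(tables)  # a mapping is used as-is
    if isinstance(tables, FakeTable):
        return {"dataset_0": tables}  # a single table
    tables = list(tables)
    try:
        names = [t.get_name() for t in tables]
    except Exception:
        # Assumption: one failure triggers the fallback for every table.
        if len(tables) == 2:
            names = ["left", "right"]
        else:
            names = [f"dataset_{i}" for i in range(len(tables))]
    return dict(zip(names, tables))


print(list(resolve_names([FakeTable("people"), FakeTable("voters")])))
print(list(resolve_names([FakeTable(), FakeTable()])))
print(list(resolve_names([FakeTable(), FakeTable(), FakeTable()])))
```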

`mismo.Datasets.__iter__() -> Iterable[ir.Table]`

Iterate over the tables in the order they were added.

`mismo.Datasets.__len__() -> int`

The number of tables.

`mismo.Datasets.all_record_ids() -> ir.Column`

Return all unique record_ids from all tables.

`mismo.Datasets.cache() -> Datasets`

Return a new Datasets with all tables cached.

`mismo.Datasets.filter(f: ibis.Deferred | Callable[[str, ir.Table], ir.Table]) -> ir.Table`

Return a new Datasets with all tables filtered by f.

`mismo.Datasets.items() -> Iterable[tuple[str, ir.Table]]`

The (name, table) pairs of the underlying tables.

`mismo.Datasets.keys() -> Iterable[str]`

The names of the underlying tables.

`mismo.Datasets.map(f: ibis.Deferred | Callable[[str, ir.Table], ir.Table]) -> ir.Table`

Return a new Datasets with all tables transformed by f.
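To illustrate the callable form of `f`, which receives a table's name together with the table, here is a minimal pure-Python sketch. Tables are modeled as lists of row dicts and `map_tables` is a hypothetical helper, so this shows only the shape of the operation. The `ibis.Deferred` form is not modeled.

```python
def map_tables(tables: dict[str, list[dict]], f) -> dict[str, list[dict]]:
    """Apply f(name, table) to each table, keeping names and order."""
    return {name: f(name, table) for name, table in tables.items()}


tables = {
    "left": [{"record_id": 1}, {"record_id": 2}],
    "right": [{"record_id": 3}],
}

# Tag every row with the name of the table it came from.
tagged = map_tables(
    tables,
    lambda name, rows: [{**row, "source": name} for row in rows],
)
print(tagged["right"])
```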

`mismo.Datasets.unioned() -> ir.Table`

Select the self.shared_schema columns from all tables and union them.
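A toy version of this operation, under the assumption that tables are lists of row dicts and that the union keeps duplicate rows (the description does not say whether duplicates are removed), might look like the sketch below. The function is a hypothetical stand-in that takes the shared column names explicitly.

```python
def unioned(tables: dict[str, list[dict]], shared_columns: list[str]) -> list[dict]:
    """Project each table's rows onto the shared columns, then concatenate."""
    out = []
    for rows in tables.values():
        for row in rows:
            # Columns outside the shared schema are dropped.
            out.append({col: row[col] for col in shared_columns})
    return out


tables = {
    "left": [{"record_id": 1, "name": "Ann", "zip": "99501"}],
    "right": [{"record_id": 2, "name": "Bob", "phone": "555-0100"}],
}

combined = unioned(tables, ["record_id", "name"])
print(combined)
```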

`mismo.Datasets.values() -> Iterable[ir.Table]`

The underlying tables.
