Datasets
mismo.Datasets
An ordered, dict-like collection of tables of records.
All tables must have a column named 'record_id' that is globally unique. The dtype of the 'record_id' column must be the same in all tables. Besides that, the schema of the tables can be different.
This abstracts over the fact that some record linkage problems are deduplication, and thus involve only one table, while others are linkage and involve two tables.
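The record_id rules above can be illustrated with a small standalone check. This is a minimal sketch and not mismo's implementation: it models each table as a plain list of row dicts, and the helper name check_record_ids is hypothetical.

```python
from collections import Counter


def check_record_ids(tables):
    """Validate the record_id rules for a mapping of name -> list of row dicts.

    Hypothetical helper for illustration only; mismo works on real tables.
    """
    dtypes = set()
    counts = Counter()
    for name, rows in tables.items():
        for row in rows:
            if "record_id" not in row:
                raise ValueError(f"table {name!r} has a row without 'record_id'")
            dtypes.add(type(row["record_id"]))
            counts[row["record_id"]] += 1
    # The dtype of record_id must be the same in all tables.
    if len(dtypes) > 1:
        raise ValueError("record_id must have the same dtype in all tables")
    # record_id must be unique across all tables, not just within one.
    duplicates = sorted(rid for rid, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"record_id values are not globally unique: {duplicates}")


# The other columns may differ between tables; only record_id is constrained.
tables = {
    "left": [{"record_id": 1, "name": "Ann"}],
    "right": [{"record_id": 2, "email": "ann@example.com"}],
}
check_record_ids(tables)  # passes silently
```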
mismo.Datasets.shared_schema
property
shared_schema: Schema
The schema that all tables have in common.
Columns with conflicting types are omitted.
This is useful for operations that require the same schema in all tables, for example getting all the record_ids.
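As a rough illustration of what "in common" means here, the sketch below computes a shared schema from plain dicts that map column names to type names. The dict representation and the function shared_schema are assumptions made for this example; the real property returns a Schema object.

```python
def shared_schema(schemas):
    """Return the columns (name -> type) that every schema agrees on.

    Illustrative stand-in: each schema is a plain dict of column name -> type name.
    """
    schemas = list(schemas)
    if not schemas:
        return {}
    first, *rest = schemas
    return {
        col: dtype
        for col, dtype in first.items()
        # Keep a column only if every other schema has it with the same type.
        if all(other.get(col) == dtype for other in rest)
    }


left = {"record_id": "int64", "name": "string", "zip": "string"}
right = {"record_id": "int64", "name": "string", "zip": "int64", "email": "string"}

# "zip" is dropped because its types conflict; "email" because only one table has it.
common = shared_schema([left, right])
# common == {"record_id": "int64", "name": "string"}
```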
mismo.Datasets.__contains__
Check if a table is in the collection by name or value.
mismo.Datasets.__init__
Create a new Datasets.
If tables is a Mapping, then it is used as-is.
If tables is a single table, it is named "dataset_0".
If tables is an iterable of tables, then we try to find their names by:
- calling get_name() on each table. If that fails, then fall back to...
- using "left" and "right" for two tables
- using "dataset_i" otherwise
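The naming fallback for an iterable of tables can be sketched as follows. Several details are assumptions not confirmed by the text above: that a single failing get_name() triggers the fallback for all tables, and that the i in "dataset_i" is a zero-based position (suggested by "dataset_0" for a single table). The function infer_names and its get_name argument are hypothetical stand-ins.

```python
def infer_names(tables, get_name):
    """Pick a name for each table in an iterable.

    get_name is a hypothetical callable that returns a table's name or raises.
    """
    tables = list(tables)
    try:
        # First choice: every table reports its own name.
        return [get_name(t) for t in tables]
    except Exception:
        pass  # Assumed: any failure means falling back for all tables.
    if len(tables) == 2:
        return ["left", "right"]
    # Assumed: zero-based numbering.
    return [f"dataset_{i}" for i in range(len(tables))]


# Stand-in "tables": dicts that may or may not carry a name.
def name_of(table):
    return table["name"]  # raises KeyError for unnamed tables


named = infer_names([{"name": "people"}, {"name": "orgs"}], name_of)
pair = infer_names([{}, {}], name_of)
many = infer_names([{}, {}, {}], name_of)
# named == ["people", "orgs"]
# pair == ["left", "right"]
# many == ["dataset_0", "dataset_1", "dataset_2"]
```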
mismo.Datasets.__iter__
__iter__() -> Iterable[Table]
Iterate over the tables in the order they were added.
mismo.Datasets.all_record_ids
all_record_ids() -> Column
Return all unique record_ids from all tables.
mismo.Datasets.filter
Return a new Datasets with all tables filtered by f.
mismo.Datasets.items
The names of the underlying tables, paired with the tables themselves.
mismo.Datasets.map
Return a new Datasets with all tables transformed by f.
mismo.Datasets.unioned
unioned() -> Table
Select the self.shared_schema columns from all tables and union them.
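The select-then-union step can be pictured with plain Python data. This is a sketch under stated assumptions, not the library's code: tables are modeled as lists of row dicts, the shared columns are passed in explicitly, and this unioned function is a hypothetical stand-in for the method.

```python
def unioned(tables, shared_columns):
    """Project every row onto shared_columns, then stack all tables in order."""
    return [
        {col: row[col] for col in shared_columns}
        for rows in tables.values()
        for row in rows
    ]


tables = {
    "left": [{"record_id": 1, "name": "Ann", "zip": "99501"}],
    "right": [{"record_id": 2, "name": "Bob", "email": "bob@example.com"}],
}

# Only the columns common to both tables survive the union.
rows = unioned(tables, ["record_id", "name"])

# With record_id in the shared columns, collecting every id is one more step.
record_ids = [row["record_id"] for row in rows]
```

Because record_id is required to be globally unique, rows from different tables stay distinguishable after the union.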