Play Data
Load some toy datasets for testing and examples.
mismo.playdata.load_patents(backend: ibis.BaseBackend | None = None) -> ir.Table
Load the PATSTAT dataset.
This is a dataset of patents; the task is to determine which patents came from the same inventor.
This comes from the Dedupe Patent Example.
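RETURNS | DESCRIPTION |
---|---|
Table | An Ibis Table with the following schema: record_id (int64), label_true (int64), name_true (string), name (string), latitude (float64), longitude (float64), coauthors (string), classes (string). |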
Examples:
>>> import ibis
>>> from mismo.playdata import load_patents
>>> ibis.options.interactive = True
>>> load_patents().order_by("record_id").head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ label_true ┃ name_true ┃ name ┃ latitude ┃ longitude ┃ coauthors ┃ classes ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ string │ string │ float64 │ float64 │ string │ string │
├───────────┼────────────┼──────────────────────┼──────────────────────────────────────────────────┼──────────┼───────────┼─────────────────────────────────────────────────┼─────────────────────────────────────────────────┤
│ 2909 │ 402600 │ AGILENT TECHNOLOGIES │ * AGILENT TECHNOLOGIES, INC. │ 0.00 │ 0.000000 │ KONINK PHILIPS ELECTRONICS N V**DAVID E SNYDE… │ A61N**A61B │
│ 3574 │ 569309 │ AKZO NOBEL │ * AKZO NOBEL N.V. │ 0.00 │ 0.000000 │ TSJERK HOEKSTRA**ANDRESS K JOHNSON**TERESA M… │ G01N**B01L**C11D**G02F**F16L │
│ 3575 │ 569309 │ AKZO NOBEL │ * AKZO NOBEL NV │ 0.00 │ 0.000000 │ WILLIAM JOHN ERNEST PARR**HANS OSKARSSON**MA… │ C09K**F17D**B01F**C23F │
│ 3779 │ 656303 │ ALCATEL │ * ALCATEL N.V. │ 52.35 │ 4.916667 │ GUENTER KOCHSMEIER**ZBIGNIEW WIEGOLASKI**EVA… │ G02B**G04G**H02G**G06F │
│ 3780 │ 656303 │ ALCATEL │ * ALCATEL N.V. │ 52.35 │ 4.916667 │ ZILAN MANFRED**JOSIANE RAMOS**DUANE LYNN MO… │ H03G**B05D**H04L**H04B**C03B**C03C**G02B**H01B │
└───────────┴────────────┴──────────────────────┴──────────────────────────────────────────────────┴──────────┴───────────┴─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
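The coauthors and classes columns in the output above appear to pack several values into one string, separated by `**`. A minimal sketch of unpacking such a field and comparing two records by set overlap, assuming `**` really is the delimiter (inferred from the sample rows only); `split_multivalue` and `jaccard` are hypothetical helpers, not part of mismo:

```python
from __future__ import annotations


def split_multivalue(value: str | None, sep: str = "**") -> list[str]:
    """Split a delimited string into a list of stripped, non-empty parts."""
    if value is None:
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


def jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard similarity of two lists treated as sets (0.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Class codes copied from records 3779 and 3780 in the sample output.
classes_3779 = split_multivalue("G02B**G04G**H02G**G06F")
classes_3780 = split_multivalue("H03G**B05D**H04L**H04B**C03B**C03C**G02B**H01B")

print(classes_3779)  # ['G02B', 'G04G', 'H02G', 'G06F']
overlap = jaccard(classes_3779, classes_3780)  # one shared code out of eleven distinct
```

Set overlap is only one possible comparison; whether it is a useful signal for inventor disambiguation is not something the sample rows can establish.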
mismo.playdata.load_rldata500(backend: ibis.BaseBackend | None = None) -> ir.Table
Synthetic personal information dataset with 500 rows.
This is a synthetic dataset with noisy names and dates of birth, with the task being to determine which rows represent the same person. 10% of the records are duplicates of existing ones, and the level of noise is low. The dataset can be deduplicated with 90%+ precision and recall using simple linkage rules. It is often used as a sanity check for computational efficiency and disambiguation accuracy.
This comes from the RecordLinkage R package and was generated using the data generation component of Febrl (Freely Extensible Biomedical Record Linkage).
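RETURNS | DESCRIPTION |
---|---|
Table | An Ibis Table with the following schema: record_id (int64), label_true (int64), fname_c1 (string), fname_c2 (string), lname_c1 (string), lname_c2 (string), by (int64), bm (int64), bd (int64). |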
Examples:
>>> import ibis
>>> from mismo.playdata import load_rldata500
>>> ibis.options.interactive = True
>>> load_rldata500().head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ record_id ┃ label_true ┃ fname_c1 ┃ fname_c2 ┃ lname_c1 ┃ lname_c2 ┃ by    ┃ bm    ┃ bd    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64     │ int64      │ string   │ string   │ string   │ string   │ int64 │ int64 │ int64 │
├───────────┼────────────┼──────────┼──────────┼──────────┼──────────┼───────┼───────┼───────┤
│         0 │         34 │ CARSTEN  │ NULL     │ MEIER    │ NULL     │  1949 │     7 │    22 │
│         1 │         51 │ GERD     │ NULL     │ BAUER    │ NULL     │  1968 │     7 │    27 │
│         2 │        115 │ ROBERT   │ NULL     │ HARTMANN │ NULL     │  1930 │     4 │    30 │
│         3 │        189 │ STEFAN   │ NULL     │ WOLFF    │ NULL     │  1957 │     9 │     2 │
│         4 │         72 │ RALF     │ NULL     │ KRUEGER  │ NULL     │  1966 │     1 │    13 │
└───────────┴────────────┴──────────┴──────────┴──────────┴──────────┴───────┴───────┴───────┘
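To illustrate what a "simple linkage rule" for this kind of data might look like, here is a minimal standard-library sketch: two records match when their birth dates agree exactly and both names are similar enough. The records, the `Person` type, the `is_match` helper, and the 0.7 threshold are all assumptions made for illustration; records 2 and 3 are invented noisy duplicates, not rows from the dataset, and this is not how mismo itself links records.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import NamedTuple


class Person(NamedTuple):
    record_id: int
    fname: str
    lname: str
    by: int  # birth year
    bm: int  # birth month
    bd: int  # birth day


# Hypothetical records shaped like the rows above.
records = [
    Person(0, "CARSTEN", "MEIER", 1949, 7, 22),
    Person(1, "GERD", "BAUER", 1968, 7, 27),
    Person(2, "CARSTEN", "MEIRE", 1949, 7, 22),  # invented typo of record 0
    Person(3, "GERT", "BAUER", 1968, 7, 27),  # invented typo of record 1
]


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(p: Person, q: Person, threshold: float = 0.7) -> bool:
    """Rule: identical birth date and both names at least `threshold` similar."""
    same_birth = (p.by, p.bm, p.bd) == (q.by, q.bm, q.bd)
    return (
        same_birth
        and similarity(p.fname, q.fname) >= threshold
        and similarity(p.lname, q.lname) >= threshold
    )


matches = [
    (p.record_id, q.record_id) for p, q in combinations(records, 2) if is_match(p, q)
]
print(matches)  # [(0, 2), (1, 3)]
```

Comparing every pair is quadratic in the number of records, which is acceptable for a few hundred rows; larger datasets usually need some form of blocking first.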
mismo.playdata.load_rldata10000(backend: ibis.BaseBackend | None = None) -> ir.Table
Synthetic personal information dataset with 10000 rows.
This is a synthetic dataset with noisy names and dates of birth, with the task being to determine which rows represent the same person. 10% of the records are duplicates of existing ones, and the level of noise is low. The dataset can be deduplicated with 90%+ precision and recall using simple linkage rules. It is often used as a sanity check for computational efficiency and disambiguation accuracy.
This comes from the RecordLinkage R package and was generated using the data generation component of Febrl (Freely Extensible Biomedical Record Linkage).
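RETURNS | DESCRIPTION |
---|---|
Table | An Ibis Table with the following schema: record_id (int64), label_true (int64), fname_c1 (string), fname_c2 (string), lname_c1 (string), lname_c2 (string), by (int64), bm (int64), bd (int64). |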
Examples:
>>> import ibis
>>> from mismo.playdata import load_rldata10000
>>> ibis.options.interactive = True
>>> load_rldata10000().head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ record_id ┃ label_true ┃ fname_c1 ┃ fname_c2 ┃ lname_c1   ┃ lname_c2 ┃ by    ┃ bm    ┃ bd    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64     │ int64      │ string   │ string   │ string     │ string   │ int64 │ int64 │ int64 │
├───────────┼────────────┼──────────┼──────────┼────────────┼──────────┼───────┼───────┼───────┤
│         0 │       3606 │ FRANK    │ NULL     │ MUELLER    │ NULL     │  1967 │     9 │    27 │
│         1 │       2560 │ MARTIN   │ NULL     │ SCHWARZ    │ NULL     │  1967 │     2 │    17 │
│         2 │       3892 │ HERBERT  │ NULL     │ ZIMMERMANN │ NULL     │  1961 │    11 │     6 │
│         3 │        329 │ HANS     │ NULL     │ SCHMITT    │ NULL     │  1945 │     8 │    14 │
│         4 │       1994 │ UWE      │ NULL     │ KELLER     │ NULL     │  2000 │     7 │     5 │
└───────────┴────────────┴──────────┴──────────┴────────────┴──────────┴───────┴───────┴───────┘
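The precision and recall figures mentioned above are typically computed over pairs of records. A minimal sketch of that calculation, assuming label_true is a ground-truth entity identifier (an inference from the column name) and that predictions come as one cluster label per record; the helper names and the toy labels are hypothetical:

```python
from __future__ import annotations

from itertools import combinations


def pairs_from_labels(labels: dict[int, int]) -> set[tuple[int, int]]:
    """Return every (record_id, record_id) pair whose records share a label."""
    groups: dict[int, list[int]] = {}
    for record_id, label in labels.items():
        groups.setdefault(label, []).append(record_id)
    pairs: set[tuple[int, int]] = set()
    for members in groups.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_precision_recall(
    true_labels: dict[int, int], predicted_labels: dict[int, int]
) -> tuple[float, float]:
    """Pairwise precision and recall of predicted clusters against true ones."""
    true_pairs = pairs_from_labels(true_labels)
    predicted_pairs = pairs_from_labels(predicted_labels)
    true_positives = len(true_pairs & predicted_pairs)
    # By convention here, an empty denominator counts as a perfect score.
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 1.0
    recall = true_positives / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Toy data: record_id -> label. True duplicates are (0, 1) and (2, 3).
true_labels = {0: 10, 1: 10, 2: 11, 3: 11, 4: 12}
# The prediction finds (0, 1), misses (2, 3), and wrongly links (3, 4).
predicted_labels = {0: 1, 1: 1, 2: 2, 3: 3, 4: 3}

print(pairwise_precision_recall(true_labels, predicted_labels))  # (0.5, 0.5)
```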
mismo.playdata.load_febrl1() -> tuple[ir.Table, ir.Table]
Load the FEBRL 1 dataset.
The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets produced by it.
mismo.playdata.load_febrl2() -> tuple[ir.Table, ir.Table]
Load the FEBRL 2 dataset.
The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets produced by it.
mismo.playdata.load_febrl3() -> tuple[ir.Table, ir.Table]
Load the FEBRL 3 dataset.
The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets produced by it.