Play Data

Load some toy datasets for testing and examples.

mismo.playdata.load_patents

load_patents(
    *, backend: BaseBackend | None = None
) -> Linkage

Load the PATSTAT dataset.

This represents a dataset of patents, and the task is to determine which patents came from the same inventor.

This comes from the Dedupe Patent Example.

RETURNS DESCRIPTION
Linkage

A Linkage, where both left and right are the tables of records. Each one has the following schema:

  • record_id: uint32 A unique ID for each row in the table.
  • label_true: uint32 The manually labeled, true ID of the inventor.
  • name_true: string The manually labeled, true name of the inventor.
  • name: string The raw name on the patent.
  • latitude: float64 Geocoded from the inventor's address. 0.0 indicates no address was found.
  • longitude: float64 Geocoded from the inventor's address, like latitude.
  • coauthors: string A list of coauthors on the patent, separated by "**".
  • classes: string A list of 4-character IPC technical codes, separated by "**".

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> load_patents().left.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id ┃ label_true ┃ name_true            ┃ name                                             ┃ latitude ┃ longitude ┃ coauthors                                       ┃ classes                                         ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ uint32    │ uint32     │ string               │ string                                           │ float64  │ float64   │ string                                          │ string                                          │
├───────────┼────────────┼──────────────────────┼──────────────────────────────────────────────────┼──────────┼───────────┼─────────────────────────────────────────────────┼─────────────────────────────────────────────────┤
│      2909 │     402600 │ AGILENT TECHNOLOGIES │ * AGILENT TECHNOLOGIES, INC.                     │     0.00 │  0.000000 │ KONINK PHILIPS ELECTRONICS N V**DAVID E  SNYDE… │ A61N**A61B                                      │
│      3574 │     569309 │ AKZO NOBEL           │ * AKZO NOBEL N.V.                                │     0.00 │  0.000000 │ TSJERK  HOEKSTRA**ANDRESS K  JOHNSON**TERESA M… │ G01N**B01L**C11D**G02F**F16L                    │
│      3575 │     569309 │ AKZO NOBEL           │ * AKZO NOBEL NV                                  │     0.00 │  0.000000 │ WILLIAM JOHN ERNEST  PARR**HANS  OSKARSSON**MA… │ C09K**F17D**B01F**C23F                          │
│      3779 │     656303 │ ALCATEL              │ * ALCATEL N.V.                                   │    52.35 │  4.916667 │ GUENTER  KOCHSMEIER**ZBIGNIEW  WIEGOLASKI**EVA… │ G02B**G04G**H02G**G06F                          │
│      3780 │     656303 │ ALCATEL              │ * ALCATEL N.V.                                   │    52.35 │  4.916667 │ ZILAN  MANFRED**JOSIANE  RAMOS**DUANE LYNN  MO… │ H03G**B05D**H04L**H04B**C03B**C03C**G02B**H01B  │
└───────────┴────────────┴──────────────────────┴──────────────────────────────────────────────────┴──────────┴───────────┴─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
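The coauthors and classes columns pack several values into one string, and a latitude of 0.0 stands for a missing address. A minimal sketch of unpacking those conventions in plain Python, using values copied from the sample output above. The helper names split_packed and clean_coordinates are hypothetical and not part of mismo, and treating the whole coordinate pair as missing when latitude is 0.0 is an assumption.

```python
from __future__ import annotations

SEPARATOR = "**"  # delimiter documented for the coauthors and classes columns


def split_packed(value: str | None) -> list[str]:
    """Split a "**"-separated string into a list of stripped, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(SEPARATOR) if item.strip()]


def clean_coordinates(latitude: float, longitude: float) -> tuple[float, float] | None:
    """Return None when latitude is the 0.0 sentinel for "no address found"."""
    # Assumption: a 0.0 latitude means the whole coordinate pair is missing.
    if latitude == 0.0:
        return None
    return (latitude, longitude)


# Values taken from the sample rows for record_id 3574 and 3779.
classes = split_packed("G01N**B01L**C11D**G02F**F16L")
print(classes)
print(clean_coordinates(0.0, 0.0))
print(clean_coordinates(52.35, 4.916667))
```

Converting the sentinel to None early keeps later distance comparisons from treating every unaddressed patent as located at the same point.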

mismo.playdata.load_rldata500

load_rldata500(
    *, backend: BaseBackend | None = None
) -> Linkage

Synthetic personal information dataset with 500 rows.

This is a synthetic dataset with noisy names and dates of birth, with the task being to determine which rows represent the same person. 10% of the records are duplicates of existing ones, and the level of noise is low. The dataset can be deduplicated with 90%+ precision and recall using simple linkage rules. It is often used as a sanity check for computational efficiency and disambiguation accuracy.

This comes from the RecordLinkage R package and was generated using the data generation component of Febrl (Freely Extensible Biomedical Record Linkage).

RETURNS DESCRIPTION
Linkage

A Linkage, where both left and right are the tables of records. Each one has the following schema:

  • record_id: int64 A unique ID for each row in the table.
  • label_true: int64 The true ID of the person the record represents.
  • fname_c1: string First component of the first name.
  • fname_c2: string Second component of the first name (mostly NULL values).
  • lname_c1: string First component of the last name.
  • lname_c2: string Second component of the last name (mostly NULL values).
  • by: int64 Birth year
  • bm: int64 Birth month
  • bd: int64 Birth day

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> load_rldata500().left.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ record_id ┃ label_true ┃ fname_c1 ┃ fname_c2 ┃ lname_c1 ┃ lname_c2 ┃ by    ┃ bm    ┃ bd    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64     │ int64      │ string   │ string   │ string   │ string   │ int64 │ int64 │ int64 │
├───────────┼────────────┼──────────┼──────────┼──────────┼──────────┼───────┼───────┼───────┤
│         0 │         34 │ CARSTEN  │ NULL     │ MEIER    │ NULL     │  1949 │     7 │    22 │
│         1 │         51 │ GERD     │ NULL     │ BAUER    │ NULL     │  1968 │     7 │    27 │
│         2 │        115 │ ROBERT   │ NULL     │ HARTMANN │ NULL     │  1930 │     4 │    30 │
│         3 │        189 │ STEFAN   │ NULL     │ WOLFF    │ NULL     │  1957 │     9 │     2 │
│         4 │         72 │ RALF     │ NULL     │ KRUEGER  │ NULL     │  1966 │     1 │    13 │
└───────────┴────────────┴──────────┴──────────┴──────────┴──────────┴───────┴───────┴───────┘
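Because every record carries label_true, any linkage rule can be scored against the ground truth. The sketch below scores one simple rule (same lname_c1 and same by) using pairwise precision and recall. The five records are hypothetical, invented to mirror the schema above, and are not rows from the dataset. The rule and the function names are illustrative as well, not part of mismo.

```python
from itertools import combinations

# Hypothetical records: (record_id, label_true, fname_c1, lname_c1, by)
records = [
    (0, 10, "CARSTEN", "MEIER", 1949),
    (1, 10, "CARSTEN", "MEIER", 1949),  # true duplicate of record 0
    (2, 11, "GERD", "BAUER", 1968),
    (3, 11, "GERD", "BAUR", 1968),      # true duplicate of record 2, typo in last name
    (4, 12, "RALF", "BAUER", 1968),     # different person, same last name and birth year
]


def same_lastname_and_year(a, b):
    """Simple linkage rule: exact match on lname_c1 and by."""
    return a[3] == b[3] and a[4] == b[4]


all_pairs = list(combinations(records, 2))
predicted = {(a[0], b[0]) for a, b in all_pairs if same_lastname_and_year(a, b)}
actual = {(a[0], b[0]) for a, b in all_pairs if a[1] == b[1]}

true_positives = predicted & actual
precision = len(true_positives) / len(predicted) if predicted else 0.0
recall = len(true_positives) / len(actual) if actual else 0.0

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the rule links records 0 and 1 correctly, links 2 and 4 wrongly, and misses the typo pair 2 and 3, so both scores come out at 0.50.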

mismo.playdata.load_rldata10000

load_rldata10000(
    *, backend: BaseBackend | None = None
) -> Linkage

Synthetic personal information dataset with 10000 rows.

This is a synthetic dataset with noisy names and dates of birth, with the task being to determine which rows represent the same person. 10% of the records are duplicates of existing ones, and the level of noise is low. The dataset can be deduplicated with 90%+ precision and recall using simple linkage rules. It is often used as a sanity check for computational efficiency and disambiguation accuracy.

This comes from the RecordLinkage R package and was generated using the data generation component of Febrl (Freely Extensible Biomedical Record Linkage).

RETURNS DESCRIPTION
Linkage

A Linkage, where both left and right are the tables of records. Each one has the following schema:

  • record_id: int64 A unique ID for each row in the table.
  • label_true: int64 The true ID of the person the record represents.
  • fname_c1: string First component of the first name.
  • fname_c2: string Second component of the first name (mostly NULL values).
  • lname_c1: string First component of the last name.
  • lname_c2: string Second component of the last name (mostly NULL values).
  • by: int64 Birth year
  • bm: int64 Birth month
  • bd: int64 Birth day

Examples:

>>> import ibis
>>> ibis.options.interactive = True
>>> load_rldata10000().left.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ record_id ┃ label_true ┃ fname_c1 ┃ fname_c2 ┃ lname_c1   ┃ lname_c2 ┃ by    ┃ bm    ┃ bd    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64     │ int64      │ string   │ string   │ string     │ string   │ int64 │ int64 │ int64 │
├───────────┼────────────┼──────────┼──────────┼────────────┼──────────┼───────┼───────┼───────┤
│         0 │       3606 │ FRANK    │ NULL     │ MUELLER    │ NULL     │  1967 │     9 │    27 │
│         1 │       2560 │ MARTIN   │ NULL     │ SCHWARZ    │ NULL     │  1967 │     2 │    17 │
│         2 │       3892 │ HERBERT  │ NULL     │ ZIMMERMANN │ NULL     │  1961 │    11 │     6 │
│         3 │        329 │ HANS     │ NULL     │ SCHMITT    │ NULL     │  1945 │     8 │    14 │
│         4 │       1994 │ UWE      │ NULL     │ KELLER     │ NULL     │  2000 │     7 │     5 │
└───────────┴────────────┴──────────┴──────────┴────────────┴──────────┴───────┴───────┴───────┘
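The jump from 500 to 10000 rows matters for efficiency checks because the number of candidate pairs grows quadratically. A rough sketch of that arithmetic, plus the effect of only comparing records that share a key. The choice of key (first letter of lname_c1) is an assumption made for illustration, and n_pairs is a hypothetical helper, not a mismo function.

```python
from collections import Counter


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(n_pairs(500))     # 124750
print(n_pairs(10_000))  # 49995000

# Last names from the five sample rows shown above.
last_names = ["MUELLER", "SCHWARZ", "ZIMMERMANN", "SCHMITT", "KELLER"]

# Group by first letter and only count pairs inside each group.
block_sizes = Counter(name[0] for name in last_names)
blocked_pairs = sum(n_pairs(size) for size in block_sizes.values())

print(n_pairs(len(last_names)))  # 10
print(blocked_pairs)             # 1
```

On the five sample rows, grouping by first letter cuts 10 comparisons down to 1 (SCHWARZ and SCHMITT), at the cost of missing any true match whose key differs.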

mismo.playdata.load_febrl1

load_febrl1(
    *, backend: BaseBackend | None = None
) -> Linkage

Load the FEBRL 1 dataset.

The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets produced by it.

mismo.playdata.load_febrl2

load_febrl2(
    *, backend: BaseBackend | None = None
) -> Linkage

Load the FEBRL 2 dataset.

The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets produced by it.

mismo.playdata.load_febrl3

load_febrl3(
    *, backend: BaseBackend | None = None
) -> Linkage

Load the FEBRL 3 dataset.

The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets produced by it.