Set Utils

Utilities for set- and bag-of-word-based models.

For example, split text into a collection of tokens (usually words or ngrams), then count how many times each token appears in each document or across the whole collection of documents.

The same utilities apply to other kinds of unordered collections. For example, a movie might be tagged with multiple genres such as "fantasy" and "action".

Sets

mismo.sets.jaccard(a: ir.ArrayValue, b: ir.ArrayValue) -> ir.FloatingValue

The Jaccard similarity between two arrays.

PARAMETER DESCRIPTION
a

The first array.

TYPE: ArrayValue

b

The second array.

TYPE: ArrayValue

RETURNS DESCRIPTION
FloatingValue

The Jaccard similarity between the two arrays.
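For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The following is a minimal plain-Python sketch of that definition, not mismo's implementation. The helper name `jaccard_py` is hypothetical, and since this page does not say how `mismo.sets.jaccard` treats duplicates, empty arrays, or NULLs, the empty-input branch below is an assumption.

```python
def jaccard_py(a, b):
    """Jaccard similarity of two iterables, treated as sets (hypothetical helper)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: the similarity of two empty collections is undefined.
        return None
    return len(set_a & set_b) / len(union)


# One shared tag ("action") out of three distinct tags overall.
similarity = jaccard_py(["fantasy", "action"], ["action", "comedy"])
print(round(similarity, 4))  # 0.3333
```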

Bag-Of-Words

mismo.sets.add_array_value_counts(t: ir.Table, column: str, *, result_name: str = '{name}_counts') -> ir.Table

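value_counts() for ArrayColumns: adds a map column that gives, for each row, the number of times each distinct element appears in that row's array.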

PARAMETER DESCRIPTION
t

The input table.

TYPE: Table

column

The name of the array column to analyze.

TYPE: str

result_name

The name of the resulting column. The default is "{name}_counts", where {name} is replaced by the name of the input column.

TYPE: str DEFAULT: '{name}_counts'

Examples:

>>> import ibis
>>> from mismo.sets import add_array_value_counts
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> terms = [
...     None,
...     ["st"],
...     ["st"],
...     ["12", "main", "st"],
...     ["99", "main", "ave"],
...     ["56", "st", "joseph", "st"],
...     ["21", "glacier", "st"],
...     ["12", "glacier", "st"],
... ]
>>> t = ibis.memtable({"terms": terms})
>>> add_array_value_counts(t, "terms")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ terms                        ┃ terms_counts                     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string>                │ map<string, int64>               │
├──────────────────────────────┼──────────────────────────────────┤
│ ['st']                       │ {'st': 1}                        │
│ ['st']                       │ {'st': 1}                        │
│ ['12', 'main', 'st']         │ {'st': 1, '12': 1, 'main': 1}    │
│ ['99', 'main', 'ave']        │ {'ave': 1, '99': 1, 'main': 1}   │
│ ['56', 'st', 'joseph', 'st'] │ {'56': 1, 'joseph': 1, 'st': 2}  │
│ ['21', 'glacier', 'st']      │ {'glacier': 1, 'st': 1, '21': 1} │
│ ['12', 'glacier', 'st']      │ {'glacier': 1, 'st': 1, '12': 1} │
│ NULL                         │ NULL                             │
└──────────────────────────────┴──────────────────────────────────┘

mismo.sets.add_tfidf(t, column: str, *, result_name: str = '{name}_tfidf', normalize: bool = True)

Vectorize terms using TF-IDF.

Adds a column to the input table that contains the TF-IDF vector for the terms in the input column.

PARAMETER DESCRIPTION
t

The input table.

TYPE: Table

column

The name of the array column to analyze.

TYPE: str

result_name

The name of the resulting column. The default is "{name}_tfidf", where {name} is replaced by the name of the input column.

TYPE: str DEFAULT: '{name}_tfidf'

normalize

Whether to normalize the TF vector before multiplying by the IDF, so that vectors of different lengths can be compared fairly. The default is True.

TYPE: bool DEFAULT: True

Examples:

>>> import ibis
>>> from mismo.sets import add_tfidf
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> terms = [
...     None,
...     ["st"],
...     ["st"],
...     ["12", "main", "st"],
...     ["99", "main", "ave"],
...     ["56", "st", "joseph", "st"],
...     ["21", "glacier", "st"],
...     ["12", "glacier", "st"],
... ]
>>> t = ibis.memtable({"terms": terms})
>>> add_tfidf(t, "terms")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ terms                        ┃ terms_tfidf                                                                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string>                │ map<string, float64>                                                                 │
├──────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ ['st']                       │ {'st': 0.15415067982725836}                                                          │
│ ['st']                       │ {'st': 0.15415067982725836}                                                          │
│ ['12', 'glacier', 'st']      │ {'12': 0.7232830370915955, 'glacier': 0.7232830370915955, 'st': 0.08899893649403144} │
│ ['12', 'main', 'st']         │ {'12': 0.7232830370915955, 'main': 0.7232830370915955, 'st': 0.08899893649403144}    │
│ ['21', 'glacier', 'st']      │ {'21': 1.12347174837591, 'glacier': 0.7232830370915955, 'st': 0.08899893649403144}   │
│ ['56', 'st', 'joseph', 'st'] │ {'56': 0.7944144917481126, 'joseph': 0.7944144917481126, 'st': 0.12586350302664107}  │
│ ['99', 'main', 'ave']        │ {'main': 0.7232830370915955, 'ave': 1.12347174837591, '99': 1.12347174837591}        │
│ NULL                         │ NULL                                                                                 │
└──────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────┘
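The values in the table above are consistent with dividing each record's term counts by the L2 norm of those counts and then multiplying by ln(N / n_records), where N is the number of non-NULL records (7 here). That reading is inferred from the printed output rather than from the mismo source, so the plain-Python sketch below is an approximation under that assumption. `tfidf_py` is a hypothetical helper, not part of mismo.

```python
import math
from collections import Counter


def tfidf_py(records, normalize=True):
    """TF-IDF map per record; None records yield None (hypothetical helper)."""
    docs = [r for r in records if r is not None]
    n_docs = len(docs)
    # Document frequency: each term counts once per record that contains it.
    doc_freq = Counter(term for r in docs for term in set(r))
    result = []
    for r in records:
        if r is None:
            result.append(None)
            continue
        counts = Counter(r)
        # Assumption: normalization divides by the L2 norm of the count vector.
        norm = math.sqrt(sum(c * c for c in counts.values())) if normalize else 1.0
        result.append(
            {t: (c / norm) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return result


terms = [
    None,
    ["st"],
    ["st"],
    ["12", "main", "st"],
    ["99", "main", "ave"],
    ["56", "st", "joseph", "st"],
    ["21", "glacier", "st"],
    ["12", "glacier", "st"],
]
vectors = tfidf_py(terms)
print(round(vectors[1]["st"], 6))  # 0.154151
print(round(vectors[5]["56"], 6))  # 0.794414
```

Under these assumptions, "st" scores low in every record because it appears in 6 of the 7 non-NULL records, while a term such as "56" that appears in only one record scores high.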

mismo.sets.document_counts(terms: ir.ArrayColumn) -> ir.Table

Create a lookup Table from term to number of records containing the term.

PARAMETER DESCRIPTION
terms

One row for each record. Each row is an array of terms in that record. Each term could be a word, ngram, or other token from a string. Or, it could also represent more generic data, such as a list of tags or categories like ["red", "black"]. Each term can be any datatype, not just strings.

TYPE: ArrayColumn

RETURNS DESCRIPTION
Table

A Table with columns term and n_records. The term column contains each unique term from the input terms array. The n_records column contains the number of records in the input terms array that contain the term.

Examples:

>>> import ibis
>>> from mismo.sets import document_counts
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> addresses = [
...     "12 main st",
...     "99 main ave",
...     "56 st joseph st",
...     "21 glacier st",
...     "12 glacier st",
... ]
>>> t = ibis.memtable({"address": addresses})
>>> # split on whitespace
>>> t = t.mutate(terms=t.address.re_split(r"\s+"))
>>> document_counts(t.terms).order_by("term")
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ term    ┃ n_records ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ string  │ int64     │
├─────────┼───────────┤
│ 12      │         2 │
│ 21      │         1 │
│ 56      │         1 │
│ 99      │         1 │
│ ave     │         1 │
│ glacier │         2 │
│ joseph  │         1 │
│ main    │         2 │
│ st      │         4 │
└─────────┴───────────┘
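In the output above, "st" occurs twice in "56 st joseph st" yet adds only one to n_records, because the count is per record rather than per occurrence. The same bookkeeping can be sketched in plain Python. `document_counts_py` is a hypothetical helper used only for illustration, and skipping None records is an assumption.

```python
from collections import Counter


def document_counts_py(records):
    """Map each term to the number of records containing it (hypothetical helper)."""
    counts = Counter()
    for terms in records:
        if terms is None:
            continue  # Assumption: NULL records contribute nothing.
        counts.update(set(terms))  # set() so repeats within a record count once
    return dict(counts)


addresses = [
    "12 main st",
    "99 main ave",
    "56 st joseph st",
    "21 glacier st",
    "12 glacier st",
]
counts = document_counts_py(a.split() for a in addresses)
print(counts["st"], counts["main"], counts["ave"])  # 4 2 1
```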

mismo.sets.rare_terms(terms: ir.ArrayColumn, *, max_records_n: int | None = None, max_records_frac: float | None = None) -> ir.Column

Get the terms that appear in few records.

The returned Column is flattened. For example, if you supply a column of array<string>, the result will be of type string.

Exactly one of max_records_n or max_records_frac must be set.

PARAMETER DESCRIPTION
terms

A column of Arrays, where each array contains the terms for a record.

TYPE: ArrayColumn

max_records_n

The maximum number of records a term can appear in. The default is None.

TYPE: int | None DEFAULT: None

max_records_frac

The maximum fraction of records a term can appear in. The default is None.

TYPE: float | None DEFAULT: None

RETURNS DESCRIPTION
Column

The terms that appear in few records.
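The plain-Python sketch below illustrates one reading of the two thresholds: a term is kept when the number (or fraction) of records containing it does not exceed the limit. Whether mismo's comparison is inclusive, and whether NULL records count toward the fraction, are assumptions here. `rare_terms_py` is a hypothetical helper, not mismo's implementation.

```python
from collections import Counter


def rare_terms_py(records, *, max_records_n=None, max_records_frac=None):
    """Terms found in few records, as a sorted list (hypothetical helper)."""
    if (max_records_n is None) == (max_records_frac is None):
        raise ValueError("Exactly one of max_records_n or max_records_frac must be set.")
    # Assumption: None records are ignored, including in the fraction's denominator.
    docs = [r for r in records if r is not None]
    doc_freq = Counter(term for r in docs for term in set(r))
    if max_records_n is not None:
        limit = max_records_n
    else:
        limit = max_records_frac * len(docs)
    # Assumption: the threshold is inclusive (n <= limit).
    return sorted(term for term, n in doc_freq.items() if n <= limit)


addresses = [
    "12 main st",
    "99 main ave",
    "56 st joseph st",
    "21 glacier st",
    "12 glacier st",
]
records = [a.split() for a in addresses]
print(rare_terms_py(records, max_records_n=1))
# ['21', '56', '99', 'ave', 'joseph']
```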

mismo.sets.term_idf(terms: ir.ArrayValue) -> ir.Table

Create a lookup Table from term to IDF.

Examples:

>>> import ibis
>>> from mismo.sets import term_idf
>>> ibis.options.interactive = True
>>> addresses = [
...     "12 main st",
...     "99 main ave",
...     "56 st joseph st",
...     "21 glacier st",
...     "12 glacier st",
... ]
>>> t = ibis.memtable({"address": addresses})
>>> # split on whitespace
>>> t = t.mutate(terms=t.address.re_split(r"\s+"))
>>> term_idf(t.terms).order_by("term")
┏━━━━━━━━━┳━━━━━━━━━━┓
┃ term    ┃ idf      ┃
┡━━━━━━━━━╇━━━━━━━━━━┩
│ string  │ float64  │
├─────────┼──────────┤
│ 12      │ 0.916291 │
│ 21      │ 1.609438 │
│ 56      │ 1.609438 │
│ 99      │ 1.609438 │
│ ave     │ 1.609438 │
│ glacier │ 0.916291 │
│ joseph  │ 1.609438 │
│ main    │ 0.916291 │
│ st      │ 0.223144 │
└─────────┴──────────┘
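The idf values above match the natural logarithm of (number of records / number of records containing the term): with 5 records, ln(5/2) ≈ 0.916291, ln(5/1) ≈ 1.609438, and ln(5/4) ≈ 0.223144. That formula is inferred from the table rather than taken from the mismo source, so the sketch below is offered under that assumption. `term_idf_py` is a hypothetical helper.

```python
import math
from collections import Counter


def term_idf_py(records):
    """Map each term to ln(N / n_records) (hypothetical helper)."""
    # Assumption: None records are excluded from N.
    docs = [r for r in records if r is not None]
    # Each term counts once per record that contains it.
    doc_freq = Counter(term for r in docs for term in set(r))
    return {term: math.log(len(docs) / n) for term, n in doc_freq.items()}


addresses = [
    "12 main st",
    "99 main ave",
    "56 st joseph st",
    "21 glacier st",
    "12 glacier st",
]
idf = term_idf_py(a.split() for a in addresses)
print(round(idf["st"], 6), round(idf["main"], 6), round(idf["ave"], 6))
# 0.223144 0.916291 1.609438
```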