Set Utils

Utilities for set-based and bag-of-words models.

For example, you can split text into a collection of tokens (usually words or ngrams), then count how many times each token appears in a single document or across the whole collection of documents.

This is also relevant to other kinds of unordered collections. For example, movies might be tagged with multiple genres such as "fantasy" and "action".

Sets

`mismo.sets.jaccard(a: ir.ArrayValue, b: ir.ArrayValue) -> ir.FloatingValue`

The Jaccard similarity between two arrays.

PARAMETER	DESCRIPTION
`a`	The first array. TYPE: `ArrayValue`
`b`	The second array. TYPE: `ArrayValue`

RETURNS	DESCRIPTION
`FloatingValue`	The Jaccard similarity between the two arrays.
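
As a plain-Python sketch of the quantity involved (not mismo's implementation): the Jaccard similarity of two collections is the number of distinct elements they share divided by the number of distinct elements in either. The helper name `jaccard_py` is hypothetical, and returning 0.0 for two empty inputs is an assumption made for illustration; how `mismo.sets.jaccard` treats empty or NULL arrays is not stated here.

```python
def jaccard_py(a, b):
    """Jaccard similarity of two iterables, treated as sets (hypothetical helper)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: define the similarity of two empty sets as 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Two of the three distinct genres are shared, so the similarity is 2 / 3.
similarity = jaccard_py(["fantasy", "action"], ["action", "fantasy", "comedy"])
print(similarity)
```

Duplicates within an array do not change the result, because each input is reduced to a set first.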

Bag-Of-Words

`mismo.sets.add_array_value_counts(t: ir.Table, column: str, *, result_name: str = '{name}_counts') -> ir.Table`

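value_counts() for ArrayColumns: adds a map column that gives, for each row, the number of times each element appears in that row's array.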

PARAMETER	DESCRIPTION
`t`	The input table. TYPE: `Table`
`column`	The name of the array column to analyze. TYPE: `str`
`result_name`	The name of the resulting column. The default is "{name}_counts". TYPE: `str` DEFAULT: `'{name}_counts'`

Examples:

>>> import ibis
>>> from mismo.sets import add_array_value_counts
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> terms = [
...     None,
...     ["st"],
...     ["st"],
...     ["12", "main", "st"],
...     ["99", "main", "ave"],
...     ["56", "st", "joseph", "st"],
...     ["21", "glacier", "st"],
...     ["12", "glacier", "st"],
... ]
>>> t = ibis.memtable({"terms": terms})
>>> add_array_value_counts(t, "terms")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ terms                        ┃ terms_counts                     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string>                │ map<string, int64>               │
├──────────────────────────────┼──────────────────────────────────┤
│ ['st']                       │ {'st': 1}                        │
│ ['st']                       │ {'st': 1}                        │
│ ['12', 'main', 'st']         │ {'st': 1, '12': 1, 'main': 1}    │
│ ['99', 'main', 'ave']        │ {'ave': 1, '99': 1, 'main': 1}   │
│ ['56', 'st', 'joseph', 'st'] │ {'56': 1, 'joseph': 1, 'st': 2}  │
│ ['21', 'glacier', 'st']      │ {'glacier': 1, 'st': 1, '21': 1} │
│ ['12', 'glacier', 'st']      │ {'glacier': 1, 'st': 1, '12': 1} │
│ NULL                         │ NULL                             │
└──────────────────────────────┴──────────────────────────────────┘
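
For a single row, the result corresponds to what `collections.Counter` produces in plain Python. The sketch below is only an illustration of the per-row semantics visible in the table above; `array_value_counts_py` is a hypothetical name, not part of mismo.

```python
from collections import Counter


def array_value_counts_py(terms):
    """Count occurrences of each term within one record (hypothetical helper)."""
    if terms is None:
        # A NULL array maps to a NULL result, as in the last row of the table above.
        return None
    return dict(Counter(terms))


counts = array_value_counts_py(["56", "st", "joseph", "st"])
print(counts)
```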

`mismo.sets.add_tfidf(t, column: str, *, result_name: str = '{name}_tfidf', normalize: bool = True)`

Vectorize terms using TF-IDF.

Adds a column to the input table that contains the TF-IDF vector for the terms in the input column.

PARAMETER	DESCRIPTION
`t`	The input table. TYPE: `Table`
`column`	The name of the array column to analyze. TYPE: `str`
`result_name`	The name of the resulting column. The default is "{name}_tfidf". TYPE: `str` DEFAULT: `'{name}_tfidf'`
`normalize`	Whether to normalize the TF vector before multiplying by the IDF, so that vectors of different lengths can be compared fairly. The default is True. TYPE: `bool` DEFAULT: `True`

Examples:

>>> import ibis
>>> from mismo.sets import add_tfidf
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> terms = [
...     None,
...     ["st"],
...     ["st"],
...     ["12", "main", "st"],
...     ["99", "main", "ave"],
...     ["56", "st", "joseph", "st"],
...     ["21", "glacier", "st"],
...     ["12", "glacier", "st"],
... ]
>>> t = ibis.memtable({"terms": terms})
>>> add_tfidf(t, "terms")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ terms                        ┃ terms_tfidf                                                                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string>                │ map<string, float64>                                                                 │
├──────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ ['st']                       │ {'st': 0.15415067982725836}                                                          │
│ ['st']                       │ {'st': 0.15415067982725836}                                                          │
│ ['12', 'glacier', 'st']      │ {'12': 0.7232830370915955, 'glacier': 0.7232830370915955, 'st': 0.08899893649403144} │
│ ['12', 'main', 'st']         │ {'12': 0.7232830370915955, 'main': 0.7232830370915955, 'st': 0.08899893649403144}    │
│ ['21', 'glacier', 'st']      │ {'21': 1.12347174837591, 'glacier': 0.7232830370915955, 'st': 0.08899893649403144}   │
│ ['56', 'st', 'joseph', 'st'] │ {'56': 0.7944144917481126, 'joseph': 0.7944144917481126, 'st': 0.12586350302664107}  │
│ ['99', 'main', 'ave']        │ {'main': 0.7232830370915955, 'ave': 1.12347174837591, '99': 1.12347174837591}        │
│ NULL                         │ NULL                                                                                 │
└──────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────┘
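
The numbers in the example output above are consistent with the following recipe, which is inferred from the output rather than taken from mismo's source: the term frequency is the count of a term within a record, scaled so that the record's TF vector has unit L2 length; the IDF is the natural log of N divided by the number of records containing the term, where N counts only the non-NULL records (7 here). The function `tfidf_py` is a hypothetical plain-Python sketch under those assumptions.

```python
import math
from collections import Counter


def tfidf_py(records, normalize=True):
    """TF-IDF maps for a list of term lists; None records stay None (hypothetical helper)."""
    docs = [r for r in records if r is not None]
    n_docs = len(docs)
    # Number of records containing each term (a record counts once per term).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumption: IDF is ln(N / n_records), with N excluding NULL records.
    idf = {term: math.log(n_docs / n) for term, n in doc_freq.items()}

    result = []
    for record in records:
        if record is None:
            result.append(None)
            continue
        tf = Counter(record)
        # Assumption: "normalize" means scaling the TF vector to unit L2 length.
        scale = math.sqrt(sum(c * c for c in tf.values())) if normalize else 1.0
        result.append({term: c / scale * idf[term] for term, c in tf.items()})
    return result


terms = [
    None,
    ["st"],
    ["st"],
    ["12", "main", "st"],
    ["99", "main", "ave"],
    ["56", "st", "joseph", "st"],
    ["21", "glacier", "st"],
    ["12", "glacier", "st"],
]
vectors = tfidf_py(terms)
print(vectors[3])
```

Under these assumptions, "st" scores low because it appears in six of the seven non-NULL records, while a term such as "21" that appears in one record scores highest.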

`mismo.sets.document_counts(terms: ir.ArrayColumn) -> ir.Table`

Create a lookup Table from term to number of records containing the term.

PARAMETER	DESCRIPTION
`terms`	One row for each record. Each row is an array of terms in that record. Each term could be a word, ngram, or other token from a string. Or, it could also represent more generic data, such as a list of tags or categories like ["red", "black"]. Each term can be any datatype, not just strings. TYPE: `ArrayColumn`

RETURNS	DESCRIPTION
`Table`	A Table with columns `term` and `n_records`. The `term` column contains each unique term from the input `terms` array. The `n_records` column contains the number of records in the input `terms` array that contain that term.

Examples:

>>> import ibis
>>> from mismo.sets import document_counts
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> addresses = [
...     "12 main st",
...     "99 main ave",
...     "56 st joseph st",
...     "21 glacier st",
...     "12 glacier st",
... ]
>>> t = ibis.memtable({"address": addresses})
>>> # split on whitespace
>>> t = t.mutate(terms=t.address.re_split(r"\s+"))
>>> document_counts(t.terms).order_by("term")
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ term    ┃ n_records ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ string  │ int64     │
├─────────┼───────────┤
│ 12      │         2 │
│ 21      │         1 │
│ 56      │         1 │
│ 99      │         1 │
│ ave     │         1 │
│ glacier │         2 │
│ joseph  │         1 │
│ main    │         2 │
│ st      │         4 │
└─────────┴───────────┘
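
Note that "st" has `n_records` of 4 even though it occurs five times in total: "56 st joseph st" contains it twice but counts as one record. A minimal plain-Python sketch of that counting rule, using the hypothetical helper name `document_counts_py`:

```python
from collections import Counter


def document_counts_py(records):
    """Map each term to the number of records that contain it (hypothetical helper)."""
    counts = Counter()
    for record in records:
        # set() ensures a term repeated within one record is counted once.
        counts.update(set(record))
    return dict(counts)


addresses = [
    "12 main st",
    "99 main ave",
    "56 st joseph st",
    "21 glacier st",
    "12 glacier st",
]
doc_counts = document_counts_py([address.split() for address in addresses])
print(doc_counts["st"], doc_counts["main"], doc_counts["joseph"])  # prints: 4 2 1
```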

`mismo.sets.rare_terms(terms: ir.ArrayColumn, *, max_records_n: int | None = None, max_records_frac: float | None = None) -> ir.Column`

Get the terms that appear in few records.

The returned Column is flattened. For example, if you supply a column of array<string>, the result will be of type string.

Exactly one of max_records_n or max_records_frac must be set.

PARAMETER	DESCRIPTION
`terms`	A column of Arrays, where each array contains the terms for a record. TYPE: `ArrayColumn`
`max_records_n`	The maximum number of records a term can appear in. The default is None. TYPE: `int | None` DEFAULT: `None`
`max_records_frac`	The maximum fraction of records a term can appear in. The default is None. TYPE: `float | None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Column`	The terms that appear in few records.
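
The selection rule can be sketched in plain Python as follows. This is an illustration under assumptions, not mismo's implementation: the name `rare_terms_py` is hypothetical, the thresholds are treated as inclusive, and the fraction is taken relative to the number of non-NULL records. None of those details are stated above.

```python
from collections import Counter


def rare_terms_py(records, *, max_records_n=None, max_records_frac=None):
    """Terms appearing in at most the given number or fraction of records (hypothetical)."""
    if (max_records_n is None) == (max_records_frac is None):
        raise ValueError("Exactly one of max_records_n or max_records_frac must be set")
    docs = [r for r in records if r is not None]
    # Number of records containing each term (a record counts once per term).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    if max_records_n is not None:
        limit = max_records_n
    else:
        limit = max_records_frac * len(docs)
    # Assumption: the threshold is inclusive.
    return sorted(term for term, n in doc_freq.items() if n <= limit)


records = [
    ["12", "main", "st"],
    ["99", "main", "ave"],
    ["56", "st", "joseph", "st"],
    ["21", "glacier", "st"],
    ["12", "glacier", "st"],
]
rare_by_count = rare_terms_py(records, max_records_n=1)
rare_by_frac = rare_terms_py(records, max_records_frac=0.5)
print(rare_by_count)
print(rare_by_frac)
```

With `max_records_n=1` only terms unique to a single record are kept; with `max_records_frac=0.5` every term except "st" qualifies, since "st" appears in four of the five records.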

`mismo.sets.term_idf(terms: ir.ArrayValue) -> ir.Table`

Create a lookup Table from term to IDF.

Examples:

>>> import ibis
>>> from mismo.sets import term_idf
>>> ibis.options.interactive = True
>>> addresses = [
...     "12 main st",
...     "99 main ave",
...     "56 st joseph st",
...     "21 glacier st",
...     "12 glacier st",
... ]
>>> t = ibis.memtable({"address": addresses})
>>> # split on whitespace
>>> t = t.mutate(terms=t.address.re_split(r"\s+"))
>>> term_idf(t.terms).order_by("term")
┏━━━━━━━━━┳━━━━━━━━━━┓
┃ term    ┃ idf      ┃
┡━━━━━━━━━╇━━━━━━━━━━┩
│ string  │ float64  │
├─────────┼──────────┤
│ 12      │ 0.916291 │
│ 21      │ 1.609438 │
│ 56      │ 1.609438 │
│ 99      │ 1.609438 │
│ ave     │ 1.609438 │
│ glacier │ 0.916291 │
│ joseph  │ 1.609438 │
│ main    │ 0.916291 │
│ st      │ 0.223144 │
└─────────┴──────────┘
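
The values in the table above match the natural log of the total number of records (5) divided by the number of records containing the term. That formula is inferred from the output, not confirmed from mismo's source. A short sketch, with the hypothetical helper `idf_from_counts`:

```python
import math


def idf_from_counts(n_records_by_term, n_total):
    """Natural-log inverse document frequency per term (hypothetical helper)."""
    return {term: math.log(n_total / n) for term, n in n_records_by_term.items()}


# Record counts for three of the terms in the five addresses above.
idf = idf_from_counts({"st": 4, "main": 2, "joseph": 1}, n_total=5)
rounded = {term: round(value, 6) for term, value in idf.items()}
print(rounded)  # prints: {'st': 0.223144, 'main': 0.916291, 'joseph': 1.609438}
```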