Set Utils
Utilities for set- and bag-of-word-based models.
For example, split text into a collection of tokens (usually words or ngrams), then count the number of times they appear in each document or in the collection of documents.
This is also relevant for other sorts of unordered collections. For example, movies might be tagged with multiple genres such as "fantasy" and "action".
Sets
mismo.sets.jaccard(a: ir.ArrayValue, b: ir.ArrayValue) -> ir.FloatingValue
The Jaccard similarity between two arrays.
PARAMETER | DESCRIPTION |
---|---|
a | The first array. TYPE: ir.ArrayValue |
b | The second array. TYPE: ir.ArrayValue |

RETURNS | DESCRIPTION |
---|---|
FloatingValue | The Jaccard similarity between the two arrays. |
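For reference, Jaccard similarity is conventionally defined as the size of the intersection divided by the size of the union, with both collections treated as sets. The pure-Python sketch below illustrates only that definition. `jaccard_py` is a hypothetical helper, not part of mismo, and since this page does not say how `mismo.sets.jaccard` treats NULLs or two empty arrays, the empty-case handling here is an assumption.

```python
def jaccard_py(a, b):
    """Plain-Python illustration of Jaccard similarity (hypothetical helper)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: the similarity of two empty collections is undefined.
        return None
    return len(set_a & set_b) / len(union)


# One shared genre ("fantasy") out of three distinct genres overall.
similarity = jaccard_py(["fantasy", "action"], ["fantasy", "comedy"])
print(similarity)
```

Because the inputs are converted to sets, repeated elements within one array do not change the result.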
Bag-Of-Words
mismo.sets.add_array_value_counts(t: ir.Table, column: str, *, result_name: str = '{name}_counts') -> ir.Table
value_counts() for ArrayColumns.
PARAMETER | DESCRIPTION |
---|---|
t | The input table. TYPE: ir.Table |
column | The name of the array column to analyze. TYPE: str |
result_name | The name of the resulting column. The default is "{name}_counts". TYPE: str |
Examples:
>>> import ibis
>>> from mismo.sets import add_array_value_counts
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> terms = [
... None,
... ["st"],
... ["st"],
... ["12", "main", "st"],
... ["99", "main", "ave"],
... ["56", "st", "joseph", "st"],
... ["21", "glacier", "st"],
... ["12", "glacier", "st"],
... ]
>>> t = ibis.memtable({"terms": terms})
>>> add_array_value_counts(t, "terms")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ terms ┃ terms_counts ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string> │ map<string, int64> │
├──────────────────────────────┼──────────────────────────────────┤
│ ['st'] │ {'st': 1} │
│ ['st'] │ {'st': 1} │
│ ['12', 'main', 'st'] │ {'st': 1, '12': 1, 'main': 1} │
│ ['99', 'main', 'ave'] │ {'ave': 1, '99': 1, 'main': 1} │
│ ['56', 'st', 'joseph', 'st'] │ {'56': 1, 'joseph': 1, 'st': 2} │
│ ['21', 'glacier', 'st'] │ {'glacier': 1, 'st': 1, '21': 1} │
│ ['12', 'glacier', 'st'] │ {'glacier': 1, 'st': 1, '12': 1} │
│ NULL │ NULL │
└──────────────────────────────┴──────────────────────────────────┘
mismo.sets.add_tfidf(t, column: str, *, result_name: str = '{name}_tfidf', normalize: bool = True)
Vectorize terms using TF-IDF.
Adds a column to the input table that contains the TF-IDF vector for the terms in the input column.
PARAMETER | DESCRIPTION |
---|---|
t | The input table. |
column | The name of the array column to analyze. TYPE: str |
result_name | The name of the resulting column. The default is "{name}_tfidf". TYPE: str |
normalize | Whether to normalize the TF-vector before multiplying by the IDF. The default is True. This makes it so that vectors of different lengths can be compared fairly. TYPE: bool |
Examples:
>>> import ibis
>>> from mismo.sets import add_tfidf
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> terms = [
... None,
... ["st"],
... ["st"],
... ["12", "main", "st"],
... ["99", "main", "ave"],
... ["56", "st", "joseph", "st"],
... ["21", "glacier", "st"],
... ["12", "glacier", "st"],
... ]
>>> t = ibis.memtable({"terms": terms})
>>> add_tfidf(t, "terms")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ terms ┃ terms_tfidf ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string> │ map<string, float64> │
├──────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ ['st'] │ {'st': 0.15415067982725836} │
│ ['st'] │ {'st': 0.15415067982725836} │
│ ['12', 'glacier', 'st'] │ {'12': 0.7232830370915955, 'glacier': 0.7232830370915955, 'st': 0.08899893649403144} │
│ ['12', 'main', 'st'] │ {'12': 0.7232830370915955, 'main': 0.7232830370915955, 'st': 0.08899893649403144} │
│ ['21', 'glacier', 'st'] │ {'21': 1.12347174837591, 'glacier': 0.7232830370915955, 'st': 0.08899893649403144} │
│ ['56', 'st', 'joseph', 'st'] │ {'56': 0.7944144917481126, 'joseph': 0.7944144917481126, 'st': 0.12586350302664107} │
│ ['99', 'main', 'ave'] │ {'main': 0.7232830370915955, 'ave': 1.12347174837591, '99': 1.12347174837591} │
│ NULL │ NULL │
└──────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────┘
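The values in the table above can be reproduced with a short pure-Python sketch. The formula here is inferred from the printed output rather than taken from mismo's source, so treat it as an assumption: the IDF is the natural log of (non-NULL record count / number of records containing the term), and normalization divides the term-count vector by its L2 norm. `tfidf_py` is a hypothetical helper, not a mismo function.

```python
import math
from collections import Counter

terms = [
    None,
    ["st"],
    ["st"],
    ["12", "main", "st"],
    ["99", "main", "ave"],
    ["56", "st", "joseph", "st"],
    ["21", "glacier", "st"],
    ["12", "glacier", "st"],
]

# Assumption: NULL records are excluded from the record count.
records = [r for r in terms if r is not None]
n_records = len(records)

# Number of records containing each term; set() makes duplicates
# within one record count only once.
doc_counts = Counter(term for r in records for term in set(r))
idf = {term: math.log(n_records / n) for term, n in doc_counts.items()}


def tfidf_py(record, normalize=True):
    """Hypothetical helper: TF-IDF map for a single record."""
    tf = Counter(record)
    # Assumption: normalization divides the TF vector by its L2 norm.
    norm = math.sqrt(sum(c * c for c in tf.values())) if normalize else 1.0
    return {term: (count / norm) * idf[term] for term, count in tf.items()}


vec = tfidf_py(["56", "st", "joseph", "st"])
print(vec)
```

Under these assumptions, "st" gets a low weight because it appears in six of the seven non-NULL records, while terms that appear in a single record get the highest weights.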
mismo.sets.document_counts(terms: ir.ArrayColumn) -> ir.Table
Create a lookup Table from term to number of records containing the term.
PARAMETER | DESCRIPTION |
---|---|
terms | One row for each record. Each row is an array of terms in that record. Each term could be a word, ngram, or other token from a string. It could also represent more generic data, such as a list of tags or categories like ["red", "black"]. Each term can be any datatype, not just strings. TYPE: ir.ArrayColumn |

RETURNS | DESCRIPTION |
---|---|
Table | A Table with columns `term` and `n_records`. The `term` column contains each unique term from the input. The `n_records` column contains the number of records containing that term. |
Examples:
>>> import ibis
>>> from mismo.sets import document_counts
>>> ibis.options.interactive = True
>>> ibis.options.repr.interactive.max_length = 20
>>> addresses = [
... "12 main st",
... "99 main ave",
... "56 st joseph st",
... "21 glacier st",
... "12 glacier st",
... ]
>>> t = ibis.memtable({"address": addresses})
>>> # split on whitespace
>>> t = t.mutate(terms=t.address.re_split(r"\s+"))
>>> document_counts(t.terms).order_by("term")
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ term ┃ n_records ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ int64 │
├─────────┼───────────┤
│ 12 │ 2 │
│ 21 │ 1 │
│ 56 │ 1 │
│ 99 │ 1 │
│ ave │ 1 │
│ glacier │ 2 │
│ joseph │ 1 │
│ main │ 2 │
│ st │ 4 │
└─────────┴───────────┘
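Note that "st" is counted 4 times even though it occurs 5 times in total: a term repeated within one record still counts as one record. A minimal pure-Python sketch of that counting rule, using only the standard library (the variable names are illustrative, not part of mismo):

```python
from collections import Counter

addresses = [
    "12 main st",
    "99 main ave",
    "56 st joseph st",
    "21 glacier st",
    "12 glacier st",
]

# Split on whitespace, mirroring the re_split(r"\s+") step above.
records = [address.split() for address in addresses]

# set(record) ensures a term repeated within one record is counted once.
n_records_by_term = Counter(term for record in records for term in set(record))

for term in sorted(n_records_by_term):
    print(term, n_records_by_term[term])
```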
mismo.sets.rare_terms(terms: ir.ArrayColumn, *, max_records_n: int | None = None, max_records_frac: float | None = None) -> ir.Column
Get the terms that appear in few records.
The returned Column is flattened. E.g., if you supply a column of array<string>, the result will be of type string.
Exactly one of max_records_n or max_records_frac must be set.
PARAMETER | DESCRIPTION |
---|---|
terms | A column of Arrays, where each array contains the terms for a record. TYPE: ir.ArrayColumn |
max_records_n | The maximum number of records a term can appear in. The default is None. TYPE: `int \| None` |
max_records_frac | The maximum fraction of records a term can appear in. The default is None. TYPE: `float \| None` |

RETURNS | DESCRIPTION |
---|---|
Column | The terms that appear in few records. |
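The selection rule described above can be sketched in plain Python. `rare_terms_py` is a hypothetical stand-in, not the mismo implementation, and it makes two assumptions that this page does not confirm: the thresholds are inclusive ("at most"), and NULL records are ignored.

```python
from collections import Counter


def rare_terms_py(records, *, max_records_n=None, max_records_frac=None):
    """Hypothetical helper: terms that appear in few records."""
    if (max_records_n is None) == (max_records_frac is None):
        raise ValueError(
            "Exactly one of max_records_n or max_records_frac must be set."
        )
    # Assumption: NULL records are ignored.
    records = [r for r in records if r is not None]
    # Count each term once per record that contains it.
    counts = Counter(term for r in records for term in set(r))
    if max_records_n is not None:
        limit = max_records_n
    else:
        limit = max_records_frac * len(records)
    # Assumption: the threshold is inclusive.
    return sorted(term for term, n in counts.items() if n <= limit)


records = [
    ["12", "main", "st"],
    ["99", "main", "ave"],
    ["56", "st", "joseph", "st"],
    ["21", "glacier", "st"],
    ["12", "glacier", "st"],
]
print(rare_terms_py(records, max_records_n=1))
print(rare_terms_py(records, max_records_frac=0.5))
```

The result is a flat list of terms rather than a list of arrays, matching the flattening described above.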
mismo.sets.term_idf(terms: ir.ArrayValue) -> ir.Table
Create a lookup Table from term to IDF.
Examples:
>>> import ibis
>>> from mismo.sets import term_idf
>>> ibis.options.interactive = True
>>> addresses = [
... "12 main st",
... "99 main ave",
... "56 st joseph st",
... "21 glacier st",
... "12 glacier st",
... ]
>>> t = ibis.memtable({"address": addresses})
>>> # split on whitespace
>>> t = t.mutate(terms=t.address.re_split(r"\s+"))
>>> term_idf(t.terms).order_by("term")
┏━━━━━━━━━┳━━━━━━━━━━┓
┃ term ┃ idf ┃
┡━━━━━━━━━╇━━━━━━━━━━┩
│ string │ float64 │
├─────────┼──────────┤
│ 12 │ 0.916291 │
│ 21 │ 1.609438 │
│ 56 │ 1.609438 │
│ 99 │ 1.609438 │
│ ave │ 1.609438 │
│ glacier │ 0.916291 │
│ joseph │ 1.609438 │
│ main │ 0.916291 │
│ st │ 0.223144 │
└─────────┴──────────┘
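The IDF values above are consistent with the natural log of (total records / records containing the term). This formula is inferred from the printed output, not confirmed from mismo's source, so the sketch below is an illustration under that assumption; the variable names are not part of mismo.

```python
import math
from collections import Counter

addresses = [
    "12 main st",
    "99 main ave",
    "56 st joseph st",
    "21 glacier st",
    "12 glacier st",
]
records = [address.split() for address in addresses]

# Number of records containing each term (repeats within a record count once).
doc_counts = Counter(term for record in records for term in set(record))

# Assumed formula: idf = ln(total records / records containing the term).
idf = {term: math.log(len(records) / n) for term, n in doc_counts.items()}

for term in sorted(idf):
    print(term, round(idf[term], 6))
```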