Text API

mismo.text.norm_whitespace(texts: ir.StringValue) -> ir.StringValue

Strip leading and trailing whitespace, and collapse each run of whitespace into a single space.
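
As a rough illustration of this behavior, the following plain-Python sketch applies the same normalization to an ordinary `str`. It is not the library's implementation, which operates on ibis string expressions; the helper name `norm_whitespace_py` is hypothetical, and the sketch assumes "whitespace" means whatever Python's `str.split()` treats as whitespace.

```python
def norm_whitespace_py(text: str) -> str:
    """Plain-Python stand-in for the normalization described above (an assumption)."""
    # str.split() with no arguments splits on runs of whitespace and ignores
    # leading/trailing whitespace; joining with " " collapses each run.
    return " ".join(text.split())


print(norm_whitespace_py("  hello \t  world\n"))  # hello world
```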

mismo.text.ngrams(string: ir.StringValue, n: int) -> ir.ArrayValue

Character n-grams from a string. The order of the n-grams is not guaranteed.

PARAMETER DESCRIPTION
string

The string to generate n-grams from.

TYPE: StringValue

n

The number of characters in each n-gram.

TYPE: int

RETURNS DESCRIPTION
An array of n-grams.

Examples:

>>> import ibis
>>> from mismo.text import ngrams
>>> ngrams("abc", 2).execute()
['ab', 'bc']
>>> ngrams("", 2).execute()
[]
>>> ngrams("a", 2).execute()
[]
>>> ngrams(None, 4).execute() is None
True

Order of n-grams is not guaranteed:

>>> ngrams("abcdef", 3).execute()
['abc', 'def', 'bcd', 'cde']
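
For readers who want to check results by hand, here is a minimal plain-Python sketch of character n-grams. The name `char_ngrams` is hypothetical and this is not how the library computes them. It returns n-grams in left-to-right order and keeps duplicates, which is an assumption; compare against the library's output as a sorted list, since that order is not guaranteed.

```python
from typing import List, Optional


def char_ngrams(string: Optional[str], n: int) -> Optional[List[str]]:
    """Sliding-window character n-grams (reference sketch, not the library code)."""
    if string is None:
        # Mirror the None-in, None-out behavior shown in the examples above.
        return None
    # range() is empty when the string is shorter than n, giving [].
    return [string[i:i + n] for i in range(len(string) - n + 1)]


# Compare order-insensitively, because the library does not guarantee order.
print(sorted(char_ngrams("abcdef", 3)))  # ['abc', 'bcd', 'cde', 'def']
```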

mismo.text.tokenize(text: ir.StringValue) -> ir.ArrayValue

Split a string into tokens on whitespace.

Examples:

>>> import ibis
>>> from mismo.text import tokenize
>>> tokenize(ibis.literal("  abc    def")).execute()
['abc', 'def']
>>> tokenize(ibis.literal("  abc")).execute()
['abc']
>>> tokenize(ibis.literal(" ")).execute()
[]
>>> tokenize(ibis.null(str)).execute() is None
True

mismo.text.double_metaphone(s: ir.StringValue) -> ir.ArrayValue[ir.StringValue]

Double Metaphone phonetic encoding.

This requires the doublemetaphone package to be installed. You can install it with python -m pip install DoubleMetaphone. This is implemented as a Python UDF, so expect it to be slow.

Examples:

>>> from mismo.text import double_metaphone
>>> double_metaphone("catherine").execute()
['K0RN', 'KTRN']
>>> double_metaphone("").execute()
['', '']
>>> double_metaphone(None).execute() is None
True

mismo.text.levenshtein_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue

The Levenshtein distance between two strings, normalized to be between 0 and 1.

The ratio is defined as (lenmax - ldist)/lenmax where

  • ldist is the regular Levenshtein distance
  • lenmax is the maximum length of the two strings (i.e. the largest possible edit distance)

With this normalization, the ratio is 1 when the strings are identical and 0 when they are completely different, regardless of the length of the strings. The one exception is when both strings are empty: lenmax is 0, so the ratio is NaN.

PARAMETER DESCRIPTION
s1

The first string

TYPE: StringValue

s2

The second string

TYPE: StringValue

RETURNS DESCRIPTION
lev_ratio

The normalized Levenshtein similarity, (lenmax - ldist)/lenmax

TYPE: FloatingValue

Examples:

>>> from mismo.text import levenshtein_ratio
>>> levenshtein_ratio("mile", "mike").execute()
np.float64(0.75)
>>> levenshtein_ratio("mile", "mile").execute()
np.float64(1.0)
>>> levenshtein_ratio("mile", "").execute()
np.float64(0.0)
>>> levenshtein_ratio("", "").execute()
np.float64(nan)
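
The definition above can be checked with a small plain-Python sketch. The helper names `levenshtein` and `levenshtein_ratio_py` are hypothetical, and this is only a reference for the formula (lenmax - ldist)/lenmax, not the library's implementation, which builds an ibis expression.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_ratio_py(s1: str, s2: str) -> float:
    """(lenmax - ldist) / lenmax, with NaN when both strings are empty."""
    lenmax = max(len(s1), len(s2))
    if lenmax == 0:
        return math.nan  # 0/0, matching the last example above
    return (lenmax - levenshtein(s1, s2)) / lenmax


print(levenshtein_ratio_py("mile", "mike"))  # 0.75
```

For "mile" and "mike", ldist is 1 and lenmax is 4, so the ratio is (4 - 1)/4 = 0.75.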

mismo.text.damerau_levenshtein(a: str, b: str) -> int

The number of insertions, deletions, substitutions, and transpositions needed to get from a to b.

This is the Levenshtein distance with the addition of transpositions as a possible operation.
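
To make the effect of the extra operation concrete, here is a minimal plain-Python sketch of the optimal string alignment distance, a common restricted variant of Damerau-Levenshtein in which only adjacent characters can be swapped and no substring is edited twice. The text above does not say which variant the library uses, so treat `osa_distance` (a hypothetical name) as an illustration rather than a drop-in equivalent.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))  # 1
```

Swapping "ab" to "ba" costs 1 here, whereas plain Levenshtein distance needs 2 edits (two substitutions).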

mismo.text.damerau_levenshtein_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue

Like levenshtein_ratio, but with the Damerau-Levenshtein distance.

mismo.text.jaro_similarity(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue

The Jaro similarity between s1 and s2.

This is a number between 0 and 1, defined as sj = 1/3 * (m/l_1 + m/l_2 + (m-t)/m)

where m is the number of matching characters between s1 and s2, t is the number of transpositions between s1 and s2, and l_1 and l_2 are the lengths of s1 and s2.

Examples:

>>> import ibis
>>> from mismo.text import jaro_similarity
>>> jaro_similarity(ibis.literal("foo"), ibis.literal("foo")).execute()
np.float64(1.0)
>>> jaro_similarity(ibis.literal("foo"), ibis.literal("food")).execute()
np.float64(0.9166666666666666)
>>> jaro_similarity(ibis.null(str), ibis.literal("food")).execute()
np.float64(nan)

Be aware: comparing to an empty string always gives a similarity of 0, even when both strings are empty:

>>> jaro_similarity(ibis.literal("a"), ibis.literal("")).execute()
np.float64(0.0)
>>> jaro_similarity(ibis.literal(""), ibis.literal("")).execute()
np.float64(0.0)
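
The formula can be traced by hand with the plain-Python sketch below. The name `jaro_py` is hypothetical and this is not the library's implementation. It assumes the usual conventions that the text above does not spell out: two characters "match" only if they are equal and no further apart than max(l_1, l_2) // 2 - 1 positions, and t is half the number of matched characters that appear in a different order.

```python
def jaro_py(s1: str, s2: str) -> float:
    """Reference Jaro similarity under the standard conventions (an assumption)."""
    l1, l2 = len(s1), len(s2)
    if l1 == 0 or l2 == 0:
        return 0.0  # matches the empty-string behavior shown above
    window = max(max(l1, l2) // 2 - 1, 0)
    used1 = [False] * l1
    used2 = [False] * l2
    m = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character in s2 within the match window.
        for j in range(max(0, i - window), min(l2, i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    matched1 = [c for c, u in zip(s1, used1) if u]
    matched2 = [c for c, u in zip(s2, used2) if u]
    # Half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (m / l1 + m / l2 + (m - t) / m) / 3


print(jaro_py("foo", "foo"))  # 1.0
```

For "foo" and "food", m = 3, t = 0, l_1 = 3, and l_2 = 4, so sj = 1/3 * (3/3 + 3/4 + 3/3) = 0.9166..., in line with the example above.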

mismo.text.jaro_winkler_similarity(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue

The Jaro-Winkler similarity between s1 and s2.

The Jaro-Winkler similarity is a variant of the Jaro similarity that measures the number of edits between two strings and gives more weight to a shared prefix.

It is defined as sjw = sj + l * p * (1 - sj), where sj is the Jaro similarity, l is the length of the common prefix (up to a maximum of 4), and p is a constant scaling factor (up to a maximum of 0.25, but typically set to 0.1).

Examples:

>>> import ibis
>>> from mismo.text import jaro_winkler_similarity
>>> jaro_winkler_similarity(ibis.literal("foo"), ibis.literal("foo")).execute()
np.float64(1.0)
>>> jaro_winkler_similarity(ibis.literal("foo"), ibis.literal("food")).execute()
np.float64(0.9416666666666667)
>>> jaro_winkler_similarity(ibis.null(str), ibis.literal("food")).execute()
np.float64(nan)

Be aware: comparing to an empty string always gives a similarity of 0, even when both strings are empty:

>>> jaro_winkler_similarity(ibis.literal("a"), ibis.literal("")).execute()
np.float64(0.0)
>>> jaro_winkler_similarity(ibis.literal(""), ibis.literal("")).execute()
np.float64(0.0)
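
The prefix adjustment in the definition above can be sketched in plain Python. To keep the example focused, the hypothetical helper `winkler_adjust` takes an already-computed Jaro similarity sj as input, and it assumes the typical p = 0.1; it is an illustration of the formula, not the library's implementation.

```python
def winkler_adjust(sj: float, s1: str, s2: str, p: float = 0.1) -> float:
    """Apply sjw = sj + l * p * (1 - sj) for a given Jaro similarity sj."""
    # l is the length of the common prefix, capped at 4 characters.
    l = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        l += 1
    return sj + l * p * (1 - sj)


# "foo" vs "food": Jaro similarity is 11/12 and the common prefix has length 3.
print(round(winkler_adjust(11 / 12, "foo", "food"), 4))  # 0.9417
```

With sj = 11/12 and l = 3, the adjustment adds 3 * 0.1 * (1/12) = 0.025, which agrees with the 0.9416... shown in the examples above. When sj is 0, as for empty strings, the result stays 0 because there is no common prefix.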