Text API
mismo.text.norm_whitespace(texts: ir.StringValue) -> ir.StringValue
Strip leading and trailing whitespace and collapse runs of internal whitespace into a single space.
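Examples (a plausible usage sketch based on the description above; the exact literal output is illustrative):
>>> import ibis
>>> from mismo.text import norm_whitespace
>>> norm_whitespace(ibis.literal("  foo   bar ")).execute()
'foo bar'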
mismo.text.ngrams(string: ir.StringValue, n: int) -> ir.ArrayValue
Character n-grams from a string. The order of the n-grams is not guaranteed.
PARAMETER | DESCRIPTION
---|---
`string` | The string to generate n-grams from. TYPE: `ir.StringValue`
`n` | The number of characters in each n-gram. TYPE: `int`

RETURNS | DESCRIPTION
---|---
`ir.ArrayValue` | An array of n-grams.
Examples:
>>> import ibis
>>> from mismo.text import ngrams
>>> ngrams("abc", 2).execute()
['ab', 'bc']
>>> ngrams("", 2).execute()
[]
>>> ngrams("a", 2).execute()
[]
>>> ngrams(None, 4).execute() is None
True
Order of n-grams is not guaranteed:
>>> ngrams("abcdef", 3).execute()
['abc', 'def', 'bcd', 'cde']
mismo.text.tokenize(text: ir.StringValue) -> ir.ArrayValue
Split a string into tokens on whitespace.
Examples:
>>> import ibis
>>> from mismo.text import tokenize
>>> tokenize(ibis.literal(" abc def")).execute()
['abc', 'def']
>>> tokenize(ibis.literal(" abc")).execute()
['abc']
>>> tokenize(ibis.literal(" ")).execute()
[]
>>> tokenize(ibis.null(str)).execute() is None
True
mismo.text.double_metaphone(s: ir.StringValue) -> ir.ArrayValue[ir.StringValue]
Double Metaphone phonetic encoding.
This requires the doublemetaphone package to be installed. You can install it with `python -m pip install DoubleMetaphone`.
This uses a Python UDF, so it will be slow.
Examples:
>>> from mismo.text import double_metaphone
>>> double_metaphone("catherine").execute()
['K0RN', 'KTRN']
>>> double_metaphone("").execute()
['', '']
>>> double_metaphone(None).execute() is None
True
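Because `ir.StringValue` covers column expressions as well as scalar literals, a column-level sketch like the following should also work (the memtable and column name here are hypothetical, and results depend on the backend's UDF support):
>>> import ibis
>>> t = ibis.memtable({"name": ["catherine", "kathryn"]})
>>> df = t.mutate(codes=double_metaphone(t.name)).execute()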
mismo.text.levenshtein_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue
The Levenshtein distance between two strings, normalized to be between 0 and 1.
The ratio is defined as `(lenmax - ldist) / lenmax`, where `ldist` is the regular Levenshtein distance and `lenmax` is the maximum length of the two strings (i.e. the largest possible edit distance).
This makes the ratio 1 when the strings are identical and 0 when they are completely different. Because of this normalization, the ratio is always between 0 and 1, regardless of the length of the strings.
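For example, a worked instance of the formula (matching the doctest below): for "mile" and "mike", `ldist = 1` and `lenmax = 4`, so the ratio is `(4 - 1) / 4 = 0.75`.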
PARAMETER | DESCRIPTION
---|---
`s1` | The first string. TYPE: `ir.StringValue`
`s2` | The second string. TYPE: `ir.StringValue`

RETURNS | DESCRIPTION
---|---
`lev_ratio` | The ratio of the Levenshtein edit cost to the maximum string length. TYPE: `ir.FloatingValue`
Examples:
>>> from mismo.text import levenshtein_ratio
>>> levenshtein_ratio("mile", "mike").execute()
np.float64(0.75)
>>> levenshtein_ratio("mile", "mile").execute()
np.float64(1.0)
>>> levenshtein_ratio("mile", "").execute()
np.float64(0.0)
>>> levenshtein_ratio("", "").execute()
np.float64(nan)
mismo.text.damerau_levenshtein(a: str, b: str) -> int
The number of adds, deletes, substitutions, and transposes needed to get from `a` to `b`.
This is the Levenshtein distance with the addition of transposition as a possible operation.
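Examples (a hedged usage sketch; the call style is assumed from the annotated signature, and if this is exposed as an Ibis UDF you may need `.execute()` as in the other examples):
>>> from mismo.text import damerau_levenshtein
>>> damerau_levenshtein("abcd", "abdc")  # a single transposition
1
>>> damerau_levenshtein("abcd", "abcd")
0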
mismo.text.damerau_levenshtein_ratio(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue
Like levenshtein_ratio, but with the Damerau-Levenshtein distance.
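Examples (a plausible sketch mirroring levenshtein_ratio; the values follow from the definition, assuming the same call style):
>>> from mismo.text import damerau_levenshtein_ratio
>>> damerau_levenshtein_ratio("mile", "mlie").execute()  # one transposition, max length 4
np.float64(0.75)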
See Also: mismo.text.levenshtein_ratio
mismo.text.jaro_similarity(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue
The Jaro similarity between `s1` and `s2`.
This is a number between 0 and 1, defined as `sj = 1/3 * (m/l_1 + m/l_2 + (m - t)/m)`, where `m` is the number of matching characters between `s1` and `s2`, `t` is the number of transpositions between `s1` and `s2`, and `l_1` and `l_2` are the lengths of `s1` and `s2`.
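For example, for "foo" and "food" (matching the doctest below): `m = 3`, `t = 0`, `l_1 = 3`, `l_2 = 4`, so `sj = 1/3 * (3/3 + 3/4 + 3/3) ≈ 0.9167`.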
Examples:
>>> import ibis
>>> from mismo.text import jaro_similarity
>>> jaro_similarity(ibis.literal("foo"), ibis.literal("foo")).execute()
np.float64(1.0)
>>> jaro_similarity(ibis.literal("foo"), ibis.literal("food")).execute()
np.float64(0.9166666666666666)
>>> jaro_similarity(ibis.null(str), ibis.literal("food")).execute()
np.float64(nan)
Be aware: comparing to an empty string always has a similarity of 0:
>>> jaro_similarity(ibis.literal("a"), ibis.literal("")).execute()
np.float64(0.0)
>>> jaro_similarity(ibis.literal(""), ibis.literal("")).execute()
np.float64(0.0)
mismo.text.jaro_winkler_similarity(s1: ir.StringValue, s2: ir.StringValue) -> ir.FloatingValue
The Jaro-Winkler similarity between `s1` and `s2`.
The Jaro-Winkler similarity is a variant of the Jaro similarity that measures the number of edits between two strings and places a higher importance on the prefix.
It is defined as `sjw = sj + l * p * (1 - sj)`, where `sj` is the Jaro similarity, `l` is the length of the common prefix (up to a maximum of 4), and `p` is a constant scaling factor (up to a maximum of 0.25, but typically set to 0.1).
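For example, for "foo" and "food" (matching the doctest below): `sj ≈ 0.9167`, the common prefix "foo" gives `l = 3`, and with `p = 0.1` the result is `0.9167 + 3 * 0.1 * (1 - 0.9167) ≈ 0.9417`.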
Examples:
>>> import ibis
>>> from mismo.text import jaro_winkler_similarity
>>> jaro_winkler_similarity(ibis.literal("foo"), ibis.literal("foo")).execute()
np.float64(1.0)
>>> jaro_winkler_similarity(ibis.literal("foo"), ibis.literal("food")).execute()
np.float64(0.9416666666666667)
>>> jaro_winkler_similarity(ibis.null(str), ibis.literal("food")).execute()
np.float64(nan)
Be aware: comparing to an empty string always has a similarity of 0:
>>> jaro_winkler_similarity(ibis.literal("a"), ibis.literal("")).execute()
np.float64(0.0)
>>> jaro_winkler_similarity(ibis.literal(""), ibis.literal("")).execute()
np.float64(0.0)