Text API

mismo.text.norm_whitespace

norm_whitespace(texts: StringValue) -> StringValue

Strip leading and trailing whitespace, and replace each run of multiple whitespace characters with a single space.
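
The behavior described above can be mimicked on a plain Python string. This is only a sketch of the semantics, not mismo's implementation: the helper name `norm_whitespace_py` is hypothetical, and the real function operates on ibis string expressions rather than Python strings.

```python
import re


def norm_whitespace_py(text: str) -> str:
    """Hypothetical plain-Python analogue of norm_whitespace."""
    # Collapse every run of whitespace into one space, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(norm_whitespace_py("  abc \t  def\n"))  # abc def
```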

mismo.text.strip_accents

strip_accents(s: StringValue) -> StringValue

Remove accents, such as é -> e. Only works with duckdb.

PARAMETER	DESCRIPTION
`s`	The string to strip TYPE: `StringValue`

RETURNS	DESCRIPTION
`StringValue`	The string with accented characters replaced by their unaccented equivalents. Non-ASCII characters that are not accents, such as Ø or æ, are left unchanged.

Examples:

>>> import ibis
>>> from mismo.text import strip_accents
>>> strip_accents(ibis.literal("müller")).execute()
'muller'
>>> strip_accents(ibis.literal("François")).execute()
'Francois'
>>> strip_accents(ibis.literal("Øslo")).execute()  # Ø is not an accent
'Øslo'
>>> strip_accents(ibis.literal("æ")).execute()  # neither is this
'æ'
>>> strip_accents(ibis.literal("ɑɽⱤoW")).execute()  # neither is this
'ɑɽⱤoW'
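
For intuition, a similar effect can be approximated in plain Python with Unicode decomposition. This is a sketch under assumptions, not how mismo or DuckDB implements it: the helper `strip_accents_py` is hypothetical, and the backend may treat some characters differently.

```python
import unicodedata


def strip_accents_py(s: str) -> str:
    """Hypothetical plain-Python approximation of strip_accents."""
    # NFD splits an accented character into a base character
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents_py("müller"))    # muller
print(strip_accents_py("François"))  # Francois
print(strip_accents_py("Øslo"))      # Øslo
```

Characters such as Ø and æ survive because they are distinct letters with no base-plus-accent decomposition.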

mismo.text.ngrams

ngrams(string: StringValue, n: int) -> ArrayValue

Character n-grams from a string. The order of the n-grams is not guaranteed.

PARAMETER	DESCRIPTION
`string`	The string to generate n-grams from. TYPE: `StringValue`
`n`	The number of characters in each n-gram. TYPE: `int`

RETURNS	DESCRIPTION
`ArrayValue`	An array of n-grams.

Examples:

>>> import ibis
>>> from mismo.text import ngrams
>>> ngrams("abc", 2).execute()
['ab', 'bc']
>>> ngrams("", 2).execute()
[]
>>> ngrams("a", 2).execute()
[]
>>> ngrams(None, 4).execute() is None
True

Order of n-grams is not guaranteed:

>>> ngrams("abcdef", 3).execute()
['abc', 'def', 'bcd', 'cde']
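
A plain-Python sliding window produces the same n-grams and may help when reasoning about the results. This is an illustrative sketch, not the library's implementation: the helper `char_ngrams` is hypothetical, and unlike the ibis expression it returns the n-grams in a fixed order.

```python
from typing import List, Optional


def char_ngrams(s: Optional[str], n: int) -> Optional[List[str]]:
    """Hypothetical plain-Python analogue of ngrams."""
    if s is None:
        return None
    # Slide a window of width n across the string. When the string is
    # shorter than n, the range is empty and so is the result.
    return [s[i : i + n] for i in range(len(s) - n + 1)]


print(char_ngrams("abc", 2))  # ['ab', 'bc']
print(char_ngrams("a", 2))    # []
```

Because the real function does not guarantee order, compare results as sorted lists or sets rather than element by element.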

mismo.text.tokenize

tokenize(text: StringValue) -> ArrayValue

Split a string into tokens on whitespace.

Examples:

>>> import ibis
>>> from mismo.text import tokenize
>>> tokenize(ibis.literal("  abc    def")).execute()
['abc', 'def']
>>> tokenize(ibis.literal("  abc")).execute()
['abc']
>>> tokenize(ibis.literal(" ")).execute()
[]
>>> tokenize(ibis.null(str)).execute() is None
True
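
The same splitting rule can be reproduced on a plain Python string, which is handy for checking expectations without a backend. The helper `tokenize_py` is hypothetical and only mirrors the documented behavior; it is not mismo's implementation.

```python
from typing import List, Optional


def tokenize_py(text: Optional[str]) -> Optional[List[str]]:
    """Hypothetical plain-Python analogue of tokenize."""
    if text is None:
        return None
    # str.split() with no arguments splits on runs of whitespace
    # and discards empty leading and trailing tokens.
    return text.split()


print(tokenize_py("  abc    def"))  # ['abc', 'def']
print(tokenize_py(" "))             # []
```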

mismo.text.double_metaphone

double_metaphone(s: StringValue) -> ArrayValue[StringValue]

Double Metaphone phonetic encoding.

This requires the doublemetaphone package to be installed. You can install it with `python -m pip install DoubleMetaphone`. This uses a Python UDF, so it is slow.

Examples:

>>> from mismo.text import double_metaphone
>>> double_metaphone("catherine").execute()
['K0RN', 'KTRN']
>>> double_metaphone("").execute()
['', '']
>>> double_metaphone(None).execute() is None
True

mismo.text.levenshtein_ratio

levenshtein_ratio(
    s1: StringValue, s2: StringValue
) -> FloatingValue

A similarity based on the Levenshtein distance between two strings, normalized to be between 0 and 1.

The ratio is defined as (lenmax - ldist)/lenmax, where

ldist is the regular Levenshtein distance
lenmax is the maximum length of the two strings (i.e. the largest possible edit distance)

This makes the ratio 1 when the strings are identical and 0 when they are completely different, regardless of the length of the strings. When both strings are empty, lenmax is 0 and the ratio is undefined (nan).

PARAMETER	DESCRIPTION
`s1`	The first string TYPE: `StringValue`
`s2`	The second string TYPE: `StringValue`

RETURNS	DESCRIPTION
`lev_ratio`	The ratio (lenmax - ldist)/lenmax, between 0 and 1 TYPE: `FloatingValue`

Examples:

>>> from mismo.text import levenshtein_ratio
>>> levenshtein_ratio("mile", "mike").execute()
0.75
>>> levenshtein_ratio("mile", "mile").execute()
1.0
>>> levenshtein_ratio("mile", "").execute()
0.0
>>> levenshtein_ratio("", "").execute()
nan
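
The formula (lenmax - ldist)/lenmax can be checked by hand with a small plain-Python reference. This is a sketch for illustration only: the helpers `levenshtein` and `levenshtein_ratio_py` are hypothetical, and the real function delegates the distance computation to the backend.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def levenshtein_ratio_py(a: str, b: str) -> float:
    """Hypothetical plain-Python analogue of levenshtein_ratio."""
    lenmax = max(len(a), len(b))
    if lenmax == 0:
        # Both strings are empty, so the ratio is undefined.
        return math.nan
    return (lenmax - levenshtein(a, b)) / lenmax


print(levenshtein_ratio_py("mile", "mike"))  # 0.75
print(levenshtein_ratio_py("mile", ""))      # 0.0
```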

mismo.text.damerau_levenshtein

damerau_levenshtein(a: str, b: str) -> int

The number of insertions, deletions, substitutions, and transpositions needed to get from a to b.

This is the Levenshtein distance with transposition of adjacent characters added as a possible operation.
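
To illustrate how transpositions change the count, here is a sketch of the optimal string alignment variant of the distance. It is an assumption that this variant is representative: the source does not say which Damerau-Levenshtein variant mismo uses, and the helper `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swapping two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("ab", "ba"))      # 1
print(osa_distance("mile", "mike"))  # 1
```

Plain Levenshtein distance would score "ab" versus "ba" as 2, since it has to model the swap as two substitutions.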

mismo.text.damerau_levenshtein_ratio

damerau_levenshtein_ratio(
    s1: StringValue, s2: StringValue
) -> FloatingValue

Like levenshtein_ratio, but with the Damerau-Levenshtein distance.
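
A short worked example shows how the two ratios can differ. It assumes the ratio uses the same (lenmax - dist)/lenmax normalization as levenshtein_ratio, which the description suggests but does not state outright. The helper `ratio` is hypothetical and takes hand-computed distances as input.

```python
def ratio(dist: int, len1: int, len2: int) -> float:
    """Normalize an edit distance; assumes at least one string is non-empty."""
    lenmax = max(len1, len2)
    return (lenmax - dist) / lenmax


# "ab" vs "ba": plain Levenshtein needs 2 substitutions,
# while Damerau-Levenshtein needs a single transposition.
print(ratio(2, 2, 2))  # 0.0
print(ratio(1, 2, 2))  # 0.5
```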

mismo.text.jaro_similarity

jaro_similarity(
    s1: StringValue, s2: StringValue
) -> FloatingValue

The Jaro similarity between s1 and s2.

This is a number between 0 and 1, defined as sj = 1/3 * (m/l_1 + m/l_2 + (m-t)/m)

where m is the number of matching characters between s1 and s2, t is the number of transpositions between s1 and s2, and l_1 and l_2 are the lengths of s1 and s2.

Examples:

>>> import ibis
>>> from mismo.text import jaro_similarity
>>> jaro_similarity(ibis.literal("foo"), ibis.literal("foo")).execute()
1.0
>>> jaro_similarity(ibis.literal("foo"), ibis.literal("food")).execute()
0.9166666666666666
>>> jaro_similarity(ibis.null(str), ibis.literal("food")).execute()
nan

Be aware: comparing to an empty string always gives a similarity of 0, even when both strings are empty:

>>> jaro_similarity(ibis.literal("a"), ibis.literal("")).execute()
0.0
>>> jaro_similarity(ibis.literal(""), ibis.literal("")).execute()
0.0
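
The formula can be traced with a plain-Python reference that follows the usual textbook convention: characters match when they are equal and no further apart than max(l_1, l_2) // 2 - 1 positions, and t is half the number of matched characters that appear in a different order. This is a sketch, not the backend's implementation, and the helper `jaro_py` is hypothetical.

```python
def jaro_py(s1: str, s2: str) -> float:
    """Hypothetical plain-Python reference for the Jaro similarity."""
    if not s1 or not s2:
        return 0.0  # mirrors the documented empty-string behavior
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare the matched characters in order; each out-of-order
    # pair contributes half a transposition.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(x != y for x, y in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


print(jaro_py("foo", "foo"))             # 1.0
print(round(jaro_py("foo", "food"), 4))  # 0.9167
```

For "foo" and "food", m = 3 and t = 0, so the result is (3/3 + 3/4 + 3/3) / 3 = 11/12.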

mismo.text.jaro_winkler_similarity

jaro_winkler_similarity(
    s1: StringValue, s2: StringValue
) -> FloatingValue

The Jaro-Winkler similarity between s1 and s2.

The Jaro-Winkler similarity is a variant of the Jaro similarity that places a higher importance on a shared prefix.

It is defined as sjw = sj + l * p * (1-sj), where sj is the Jaro similarity, l is the length of the common prefix (up to a maximum of 4), and p is a constant scaling factor (at most 0.25, typically set to 0.1).

Examples:

>>> import ibis
>>> from mismo.text import jaro_winkler_similarity
>>> jaro_winkler_similarity(ibis.literal("foo"), ibis.literal("foo")).execute()
1.0
>>> jaro_winkler_similarity(ibis.literal("foo"), ibis.literal("food")).execute()
0.9416666666666667
>>> jaro_winkler_similarity(ibis.null(str), ibis.literal("food")).execute()
nan

Be aware: comparing to an empty string always gives a similarity of 0, even when both strings are empty:

>>> jaro_winkler_similarity(ibis.literal("a"), ibis.literal("")).execute()
0.0
>>> jaro_winkler_similarity(ibis.literal(""), ibis.literal("")).execute()
0.0
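
The prefix adjustment can be checked by hand with a small sketch that takes an already computed Jaro similarity as input. The helper `jaro_winkler_from_jaro` is hypothetical, and p = 0.1 is assumed here because the description calls it the typical value; the backend's actual constant is not stated in the source.

```python
def jaro_winkler_from_jaro(sj: float, s1: str, s2: str, p: float = 0.1) -> float:
    """Apply the Winkler prefix boost to a given Jaro similarity sj."""
    # Length of the common prefix, capped at 4 characters.
    prefix_len = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix_len += 1
    return sj + prefix_len * p * (1 - sj)


# The Jaro similarity of "foo" and "food" is 11/12, and they share
# the 3-character prefix "foo".
print(round(jaro_winkler_from_jaro(11 / 12, "foo", "food"), 4))  # 0.9417
```

Since the boost is multiplied by (1 - sj), a Jaro similarity of 0 or 1 is left unchanged only in the second case: a score of 1 stays 1, while a score of 0 with a shared prefix cannot occur because a shared prefix implies at least one match.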