Emails API
This module contains utilities, blockers, and comparers for email addresses.
mismo.lib.email.clean_email
clean_email(
email: StringValue, *, normalize: bool = False
) -> StringValue
Clean an email address.
- convert to lowercase
- extract anything that matches r".*(\S+@\S+).*"
If normalize is True, an additional step removes "." and "_".
This makes comparisons between two addresses more robust to noise.
For example, some email systems, such as Gmail, ignore "." in the user part.
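The cleaning steps above can be sketched in plain Python. This only illustrates the described behavior and is not mismo's implementation: the real function operates on ibis StringValue expressions, and the name clean_email_sketch and its handling of missing input are assumptions.

```python
import re
from typing import Optional


def clean_email_sketch(email: Optional[str], *, normalize: bool = False) -> Optional[str]:
    """Plain-Python illustration of the documented cleaning steps (hypothetical helper)."""
    if email is None:
        return None
    # Lowercase, then keep the first run of non-whitespace characters around an "@".
    match = re.search(r"(\S+@\S+)", email.lower())
    if match is None:
        # Assumption: input without an "@" yields a missing value.
        return None
    cleaned = match.group(1)
    if normalize:
        # Taken literally, the description removes "." and "_" everywhere,
        # including the dots in the domain part.
        cleaned = cleaned.replace(".", "").replace("_", "")
    return cleaned


print(clean_email_sketch("Contact: Bob_Smith@Example.COM now"))    # bob_smith@example.com
print(clean_email_sketch("Bob_Smith@Example.COM", normalize=True))  # bobsmith@examplecom
```

Note that the normalized form is meant for comparison only; it is no longer a deliverable address.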
mismo.lib.email.ParsedEmail
A simple data class holding an email address that has been split into parts.
mismo.lib.email.ParsedEmail.domain
instance-attribute
domain: StringValue = nullif('')
The domain part of the email address, eg 'gmail.com'.
mismo.lib.email.ParsedEmail.full
instance-attribute
full: StringValue = full
The full email address, eg 'bob.smith@gmail.com'.
mismo.lib.email.ParsedEmail.user
instance-attribute
user: StringValue = nullif('')
The user part of the email address, eg 'bob.smith' in 'bob.smith@gmail.com'.
mismo.lib.email.ParsedEmail.__init__
__init__(full: StringValue)
Parse an email address from the full string.
Does no cleaning or normalization. If you want that, use clean_email first.
PARAMETER | DESCRIPTION |
---|---|
full | The full email address. TYPE: StringValue |
mismo.lib.email.ParsedEmail.as_struct
as_struct() -> StructValue
Convert to an ibis struct.
RETURNS | DESCRIPTION |
---|---|
StructValue | An ibis struct<full: string, user: string, domain: string>. |
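As a rough illustration of what ParsedEmail holds, the sketch below splits a plain Python string into the same three fields. It is not mismo's code: the class name ParsedEmailSketch, the choice to split on the first "@", and the treatment of input without an "@" are assumptions. Empty parts become None to mirror the nullif('') defaults shown above, and as_dict stands in for as_struct.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedEmailSketch:
    """Hypothetical plain-Python analogue of ParsedEmail."""

    full: Optional[str]
    user: Optional[str] = None
    domain: Optional[str] = None

    @classmethod
    def parse(cls, full: Optional[str]) -> "ParsedEmailSketch":
        # Assumption: without an "@" there is no user or domain to report.
        if full is None or "@" not in full:
            return cls(full=full)
        # Assumption: split on the first "@".
        user, _, domain = full.partition("@")
        # Empty parts become None, mirroring nullif('').
        return cls(full=full, user=user or None, domain=domain or None)

    def as_dict(self) -> dict:
        # Stand-in for as_struct(): one record with full, user, and domain.
        return asdict(self)


parsed = ParsedEmailSketch.parse("bob.smith@gmail.com")
print(parsed.user, parsed.domain)  # bob.smith gmail.com
```

As in the real class, no cleaning happens here, so casing and surrounding text are preserved as given.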
mismo.lib.email.match_level
match_level(
e1: StructValue | StringValue,
e2: StructValue | StringValue,
*,
native_representation: Literal[
"integer", "string"
] = "integer",
) -> EmailMatchLevel
Match level of two email addresses.
PARAMETER | DESCRIPTION |
---|---|
e1 | The first email address. If a string, it will be parsed and normalized. TYPE: StructValue or StringValue |
e2 | The second email address. If a string, it will be parsed and normalized. TYPE: StructValue or StringValue |
RETURNS | DESCRIPTION |
---|---|
level | The match level. TYPE: EmailMatchLevel |
mismo.lib.email.EmailMatchLevel
Bases: MatchLevel
How closely two email addresses of the form <user>@<domain> match.
Case is ignored, and dots and underscores are removed.
mismo.lib.email.EmailMatchLevel.ELSE
class-attribute
instance-attribute
ELSE = 4
None of the other levels apply.
mismo.lib.email.EmailMatchLevel.FULL_EXACT
class-attribute
instance-attribute
FULL_EXACT = 0
The full email addresses are exactly the same.
mismo.lib.email.EmailMatchLevel.FULL_NEAR
class-attribute
instance-attribute
FULL_NEAR = 1
The full email addresses have a small edit distance.
mismo.lib.email.EmailMatchLevel.USER_EXACT
class-attribute
instance-attribute
USER_EXACT = 2
The user parts of the email addresses are exactly the same.
mismo.lib.email.EmailMatchLevel.USER_NEAR
class-attribute
instance-attribute
USER_NEAR = 3
The user parts of the email addresses have a small edit distance.
mismo.lib.email.EmailsDimension
A dimension of email addresses.
This is useful if each record contains a collection of email addresses. Two records are probably the same if they have a lot of email addresses in common.
mismo.lib.email.EmailsDimension.__init__
__init__(
column: str,
*,
column_parsed: str = "{column}_parsed",
column_compared: str = "{column}_compared",
)
Initialize the dimension.
PARAMETER | DESCRIPTION |
---|---|
column | The name of the column that holds an array of email addresses. TYPE: str |
column_parsed | The name of the column that will be filled with the parsed email addresses. TYPE: str |
column_compared | The name of the column that will be filled with the comparison results. TYPE: str |
mismo.lib.email.EmailsDimension.compare
compare(t: Table) -> Table
Add a column with the best match between all pairs of email addresses.
mismo.lib.email.EmailsDimension.prepare_for_fast_linking
prepare_for_fast_linking(t: Table) -> Table
Add a column with the parsed and normalized email addresses.
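To make "best match between all pairs" concrete, here is a minimal sketch on plain Python lists rather than ibis tables. It assumes that "best" means the lowest-numbered match level (FULL_EXACT = 0 being the closest) and that a record without emails yields None; neither detail is confirmed by the text above. The names best_pair_level and simple_level are hypothetical, and simple_level is a deliberately crude stand-in for a real comparer.

```python
from itertools import product
from typing import Callable, Optional, Sequence


def best_pair_level(
    emails_a: Sequence[str],
    emails_b: Sequence[str],
    level_fn: Callable[[str, str], int],
) -> Optional[int]:
    """Compare every email of one record with every email of the other."""
    levels = [level_fn(a, b) for a, b in product(emails_a, emails_b)]
    # Assumption: lower level = closer match; no pairs means no result.
    return min(levels) if levels else None


def simple_level(a: str, b: str) -> int:
    # Crude stand-in comparer: 0 = same address, 2 = same user part, 4 = otherwise.
    a, b = a.lower(), b.lower()
    if a == b:
        return 0
    if a.partition("@")[0] == b.partition("@")[0]:
        return 2
    return 4


record_a = ["bob@gmail.com", "bsmith@work.com"]
record_b = ["robert@home.net", "bob@yahoo.com"]
print(best_pair_level(record_a, record_b, simple_level))  # 2
```

Comparing all pairs is quadratic in the number of addresses per record, which is one reason to parse and normalize the addresses once up front, as prepare_for_fast_linking does, rather than for every pair.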