Emails API
This module contains utilities, blockers, and comparers relevant to email addresses.
mismo.lib.email.clean_email(email: ir.StringValue, *, normalize: bool = False) -> ir.StringValue
Clean an email address.
- convert to lowercase
- extract anything that matches r".(\S+@\S+)."
If normalize is True, an additional step of removing "." and "_" is performed.
This makes comparisons between two addresses more robust to noise.
For example, many email systems, such as Gmail, ignore "." in addresses.
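The cleaning steps described above can be sketched in plain Python. This is only an illustration of the documented behavior on ordinary strings, not mismo's implementation (which operates on ibis string expressions). The function name clean_email_sketch, the pattern (\S+@\S+), and the choice to return None when nothing matches are assumptions made for the example.

```python
import re
from typing import Optional

# Assumed extraction pattern: a whitespace-free run of characters containing "@".
_EMAIL_RE = re.compile(r"(\S+@\S+)")


def clean_email_sketch(email: Optional[str], *, normalize: bool = False) -> Optional[str]:
    """Plain-Python analogue of the documented steps (hypothetical helper)."""
    if email is None:
        return None
    # Step 1: convert to lowercase. Step 2: extract the address-like run.
    match = _EMAIL_RE.search(email.lower())
    if match is None:
        return None  # assumption: no match is treated as a missing value
    cleaned = match.group(1)
    if normalize:
        # Optional step: drop "." and "_" everywhere in the string.
        # Note that this sketch also strips dots from the domain part.
        cleaned = cleaned.replace(".", "").replace("_", "")
    return cleaned


print(clean_email_sketch("  Bob.Smith@Gmail.com "))                # bob.smith@gmail.com
print(clean_email_sketch("Bob_Smith@Gmail.com", normalize=True))   # bobsmith@gmailcom
```

With normalize=True, "bob.smith@gmail.com" and "bobsmith@gmail.com" reduce to the same string, which is what makes the normalized form useful for comparison.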
mismo.lib.email.ParsedEmail
A simple data class holding an email address that has been split into parts.
mismo.lib.email.ParsedEmail.domain: ir.StringValue = full.split('@')[1].nullif('')
instance-attribute
The domain part of the email address, eg 'gmail.com'.
mismo.lib.email.ParsedEmail.full: ir.StringValue = full
instance-attribute
The full email address, eg 'bob.smith@gmail.com'.
mismo.lib.email.ParsedEmail.user: ir.StringValue = full.split('@')[0].nullif('')
instance-attribute
The user part of the email address, eg 'bob.smith' of 'bob.smith@gmail.com'.
mismo.lib.email.ParsedEmail.__init__(full: ir.StringValue)
Parse an email address from the full string.
Does no cleaning or normalization. If you want that, use clean_email first.
PARAMETER | DESCRIPTION |
---|---|
full | The full email address. TYPE: ir.StringValue |
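The attribute defaults shown above (split on '@', then nullif('')) can be mirrored on plain Python strings. The sketch below is an analogue for illustration only: ParsedEmailSketch and _part are hypothetical names, and treating a missing piece (for example, no '@' in the input) as None is an assumption about how an out-of-range index behaves.

```python
from dataclasses import dataclass, field
from typing import Optional


def _part(full: Optional[str], index: int) -> Optional[str]:
    """Return the index-th '@'-separated piece, or None if missing or empty."""
    if full is None:
        return None
    pieces = full.split("@")
    if index >= len(pieces):
        return None  # assumption: an out-of-range piece is treated as null
    return pieces[index] or None  # mirrors .nullif('')


@dataclass
class ParsedEmailSketch:
    """Plain-string stand-in for ParsedEmail (hypothetical, for illustration)."""

    full: Optional[str]
    user: Optional[str] = field(init=False)
    domain: Optional[str] = field(init=False)

    def __post_init__(self) -> None:
        # No cleaning or normalization happens here, matching the note above.
        self.user = _part(self.full, 0)
        self.domain = _part(self.full, 1)


parsed = ParsedEmailSketch("bob.smith@gmail.com")
print(parsed.user, parsed.domain)  # bob.smith gmail.com
```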
mismo.lib.email.ParsedEmail.as_struct() -> ir.StructValue
Convert to an ibis struct.
RETURNS | DESCRIPTION |
---|---|
ir.StructValue | An ibis struct<full: string, user: string, domain: string>. |
mismo.lib.email.match_level(e1: ir.StructValue | ir.StringValue, e2: ir.StructValue | ir.StringValue, *, native_representation: Literal['integer', 'string'] = 'integer') -> EmailMatchLevel
Match level of two email addresses.
PARAMETER | DESCRIPTION |
---|---|
e1 | The first email address. If a string, it will be parsed and normalized. TYPE: ir.StructValue | ir.StringValue |
e2 | The second email address. If a string, it will be parsed and normalized. TYPE: ir.StructValue | ir.StringValue |

RETURNS | DESCRIPTION |
---|---|
level | The match level. TYPE: EmailMatchLevel |
mismo.lib.email.EmailMatchLevel
Bases: MatchLevel
How closely two email addresses of the form <user>@<domain> match.
Case is ignored, and dots and underscores are removed.
mismo.lib.email.EmailMatchLevel.ELSE = 4
class-attribute
instance-attribute
None of the other levels apply.
mismo.lib.email.EmailMatchLevel.FULL_EXACT = 0
class-attribute
instance-attribute
The full email addresses are exactly the same.
mismo.lib.email.EmailMatchLevel.FULL_NEAR = 1
class-attribute
instance-attribute
The full email addresses have a small edit distance.
mismo.lib.email.EmailMatchLevel.USER_EXACT = 2
class-attribute
instance-attribute
The user parts of the email addresses are exactly the same.
mismo.lib.email.EmailMatchLevel.USER_NEAR = 3
class-attribute
instance-attribute
The user parts of the email addresses have a small edit distance.
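To make the ordering of the levels concrete, here is a minimal plain-Python sketch that assigns one of the five integer values to a pair of address strings. It is not mismo's implementation: the names Level, edit_distance, and match_level_sketch are hypothetical, the checks are assumed to run from the strictest level to the loosest, and "small edit distance" is assumed to mean a Levenshtein distance of at most max_edits (default 1), since the documentation does not state a threshold.

```python
from enum import IntEnum


class Level(IntEnum):
    """Stand-in carrying the documented integer values."""

    FULL_EXACT = 0
    FULL_NEAR = 1
    USER_EXACT = 2
    USER_NEAR = 3
    ELSE = 4


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalize(email: str) -> str:
    """Ignore case and remove dots and underscores, as described above."""
    return email.lower().replace(".", "").replace("_", "")


def match_level_sketch(e1: str, e2: str, *, max_edits: int = 1) -> Level:
    """Return the first level whose condition holds (max_edits is an assumption)."""
    full1, full2 = normalize(e1), normalize(e2)
    user1, user2 = full1.split("@")[0], full2.split("@")[0]
    if full1 == full2:
        return Level.FULL_EXACT
    if edit_distance(full1, full2) <= max_edits:
        return Level.FULL_NEAR
    if user1 == user2:
        return Level.USER_EXACT
    if edit_distance(user1, user2) <= max_edits:
        return Level.USER_NEAR
    return Level.ELSE


print(match_level_sketch("Bob.Smith@gmail.com", "bobsmith@gmail.com").name)  # FULL_EXACT
print(match_level_sketch("bobsmith@gmail.com", "bobsmitt@gmail.com").name)   # FULL_NEAR
print(match_level_sketch("bobsmith@gmail.com", "bobsmith@yahoo.com").name)   # USER_EXACT
```

Because lower integers correspond to closer matches, the values can be compared directly, for example to keep only pairs at USER_EXACT or better.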
mismo.lib.email.EmailsDimension
A dimension of email addresses.
This is useful if each record contains a collection of email addresses. Two records are probably the same if they have a lot of email addresses in common.
mismo.lib.email.EmailsDimension.__init__(column: str, *, column_parsed: str = '{column}_parsed', column_compared: str = '{column}_compared')
Initialize the dimension.
PARAMETER | DESCRIPTION |
---|---|
column | The name of the column that holds an array of email addresses. TYPE: str |
column_parsed | The name of the column that will be filled with the parsed email addresses. TYPE: str DEFAULT: '{column}_parsed' |
column_compared | The name of the column that will be filled with the comparison results. TYPE: str DEFAULT: '{column}_compared' |
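The default values '{column}_parsed' and '{column}_compared' look like templates in which {column} is replaced by the value of column. The sketch below shows that reading using str.format. Both the helper name resolve_column_names and the use of str.format are assumptions, as the documentation does not say how the placeholder is filled.

```python
from typing import Tuple


def resolve_column_names(
    column: str,
    column_parsed: str = "{column}_parsed",
    column_compared: str = "{column}_compared",
) -> Tuple[str, str]:
    """Fill the {column} placeholder in both names (hypothetical helper)."""
    return (
        column_parsed.format(column=column),
        column_compared.format(column=column),
    )


print(resolve_column_names("emails"))  # ('emails_parsed', 'emails_compared')
```

Under this reading, passing an explicit name without a placeholder, such as column_parsed="parsed", leaves that name unchanged.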
mismo.lib.email.EmailsDimension.compare(t: ir.Table) -> ir.Table
Add a column with the best match between all pairs of email addresses.
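The idea of taking the best match over all pairs can be illustrated on plain Python lists. This sketch assumes that "best" means the lowest integer level (FULL_EXACT is 0) and that a record with no email addresses yields no result. The names best_level and toy_level are hypothetical, and toy_level is a deliberately crude comparer that only distinguishes an exact case-insensitive match (0) from everything else (4).

```python
from itertools import product
from typing import Callable, Optional, Sequence


def best_level(
    emails_left: Sequence[str],
    emails_right: Sequence[str],
    level_fn: Callable[[str, str], int],
) -> Optional[int]:
    """Lowest level over every (left, right) pair, or None if there are no pairs."""
    levels = [level_fn(a, b) for a, b in product(emails_left, emails_right)]
    return min(levels) if levels else None  # assumption: no pairs means no result


def toy_level(a: str, b: str) -> int:
    """Crude stand-in comparer: 0 for a case-insensitive exact match, else 4."""
    return 0 if a.lower() == b.lower() else 4


print(best_level(["a@x.com", "b@y.com"], ["B@Y.com", "c@z.com"], toy_level))  # 0
print(best_level(["a@x.com"], ["c@z.com"], toy_level))                        # 4
```

A single strong pair is enough to give two records a good score here, regardless of how many other addresses differ.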
mismo.lib.email.EmailsDimension.prepare(t: ir.Table) -> ir.Table
Add a column with the parsed and normalized email addresses.