Emails API

This module contains utilities, blockers, and comparers for working with email addresses.

mismo.lib.email.clean_email

clean_email(
    email: StringValue, *, normalize: bool = False
) -> StringValue

Clean an email address.

  • convert to lowercase
  • extract anything that matches r".*(\S+@\S+).*"

If normalize is True, "." and "_" are also removed. This makes comparisons between two addresses more robust to noise. For example, many email systems, such as Gmail, ignore "." in addresses.
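The steps above can be sketched in plain Python. This is only an illustration on ordinary strings, not the library's implementation (which operates on ibis StringValue expressions). The name clean_email_sketch, the None result when nothing matches, and the choice to strip "." and "_" from the whole address are assumptions.

```python
import re
from typing import Optional

# A run of non-whitespace characters containing an "@", per the description above.
_EMAIL_PATTERN = re.compile(r"\S+@\S+")


def clean_email_sketch(email: Optional[str], *, normalize: bool = False) -> Optional[str]:
    """Plain-Python illustration of the documented cleaning steps."""
    if email is None:
        return None
    # Step 1: lowercase. Step 2: pull out the first email-like token.
    match = _EMAIL_PATTERN.search(email.lower())
    if match is None:
        return None  # assumption: no email-like token means no result
    cleaned = match.group(0)
    if normalize:
        # Assumption: "." and "_" are stripped from the whole address,
        # so the domain loses its dots too.
        cleaned = cleaned.replace(".", "").replace("_", "")
    return cleaned


print(clean_email_sketch("  Bob.Smith@Gmail.com  "))               # bob.smith@gmail.com
print(clean_email_sketch("Bob_Smith@Gmail.com", normalize=True))   # bobsmith@gmailcom
```

In this sketch a normalized address is a comparison key only; it can no longer be used to send mail.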

mismo.lib.email.ParsedEmail

A simple data class holding an email address that has been split into parts.

mismo.lib.email.ParsedEmail.domain instance-attribute

domain: StringValue = nullif('')

The domain part of the email address, e.g. 'gmail.com'.

mismo.lib.email.ParsedEmail.full instance-attribute

full: StringValue = full

The full email address, e.g. 'bob.smith@gmail.com'.

mismo.lib.email.ParsedEmail.user instance-attribute

user: StringValue = nullif('')

The user part of the email address, e.g. 'bob.smith' in 'bob.smith@gmail.com'.

mismo.lib.email.ParsedEmail.__init__

__init__(full: StringValue)

Parse an email address from the full string.

Does no cleaning or normalization. If you want that, use clean_email first.

PARAMETER DESCRIPTION
full

The full email address.

TYPE: StringValue
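As a rough illustration of what parsing produces, the sketch below splits a plain string into the three documented fields. ParsedEmailSketch is a hypothetical stand-in, not the real class, and splitting on the last "@" is an assumption. Empty parts become None to mirror the nullif('') defaults shown above.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParsedEmailSketch:
    """Hypothetical plain-Python stand-in for ParsedEmail."""

    full: str
    user: Optional[str] = field(init=False)
    domain: Optional[str] = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: split on the last "@"; the real parsing rule may differ.
        match = re.fullmatch(r"(.*)@(.*)", self.full)
        user, domain = match.groups() if match else ("", "")
        # Mirror nullif(''): empty parts become None.
        self.user = user or None
        self.domain = domain or None


parsed = ParsedEmailSketch("Bob.Smith@Gmail.com")
print(parsed.user, parsed.domain)  # Bob.Smith Gmail.com
```

Case is preserved here because, as noted above, parsing does no cleaning or normalization.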

mismo.lib.email.ParsedEmail.as_struct

as_struct() -> StructValue

Convert to an ibis struct.

RETURNS DESCRIPTION
An ibis struct<full: string, user: string, domain: string>

mismo.lib.email.match_level

match_level(
    e1: StructValue | StringValue,
    e2: StructValue | StringValue,
    *,
    native_representation: Literal[
        "integer", "string"
    ] = "integer",
) -> EmailMatchLevel

Match level of two email addresses.

PARAMETER DESCRIPTION
e1

The first email address. If a string, it will be parsed and normalized.

TYPE: StructValue | StringValue

e2

The second email address. If a string, it will be parsed and normalized.

TYPE: StructValue | StringValue

RETURNS DESCRIPTION
level

The match level.

TYPE: EmailMatchLevel

mismo.lib.email.EmailMatchLevel

Bases: MatchLevel

How closely two email addresses of the form <user>@<domain> match.

Case is ignored, and dots and underscores are removed.

mismo.lib.email.EmailMatchLevel.ELSE class-attribute instance-attribute

ELSE = 4

None of the other levels apply.

mismo.lib.email.EmailMatchLevel.FULL_EXACT class-attribute instance-attribute

FULL_EXACT = 0

The full email addresses are exactly the same.

mismo.lib.email.EmailMatchLevel.FULL_NEAR class-attribute instance-attribute

FULL_NEAR = 1

The full email addresses have a small edit distance.

mismo.lib.email.EmailMatchLevel.USER_EXACT class-attribute instance-attribute

USER_EXACT = 2

The user parts of the email addresses are exactly the same.

mismo.lib.email.EmailMatchLevel.USER_NEAR class-attribute instance-attribute

USER_NEAR = 3

The user parts of the email addresses have a small edit distance.
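To make the levels concrete, here is a minimal plain-Python sketch that assigns one of the five documented values to a pair of strings. It is not the library's implementation. The names LevelSketch and match_level_sketch, the max_distance threshold of 1 for a "small" edit distance, the use of Levenshtein distance, and the order in which the levels are checked are all assumptions.

```python
from enum import IntEnum


class LevelSketch(IntEnum):
    """Same names and values as the documented EmailMatchLevel."""

    FULL_EXACT = 0
    FULL_NEAR = 1
    USER_EXACT = 2
    USER_NEAR = 3
    ELSE = 4


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(email: str) -> str:
    # Ignore case and drop dots and underscores, as described above.
    return email.lower().replace(".", "").replace("_", "")


def match_level_sketch(e1: str, e2: str, *, max_distance: int = 1) -> LevelSketch:
    # max_distance is a hypothetical threshold for "small edit distance".
    full1, full2 = normalize(e1), normalize(e2)
    user1, user2 = full1.split("@")[0], full2.split("@")[0]
    if full1 == full2:
        return LevelSketch.FULL_EXACT
    if edit_distance(full1, full2) <= max_distance:
        return LevelSketch.FULL_NEAR
    if user1 == user2:
        return LevelSketch.USER_EXACT
    if edit_distance(user1, user2) <= max_distance:
        return LevelSketch.USER_NEAR
    return LevelSketch.ELSE


level = match_level_sketch("Bob.Smith@gmail.com", "bobsmith@gmail.com")
print(level.value, level.name)  # 0 FULL_EXACT
```

The checks run from the strictest level to the loosest, so a pair always receives the lowest-numbered level it qualifies for.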

mismo.lib.email.EmailsDimension

A dimension of email addresses.

This is useful if each record contains a collection of email addresses. Two records are probably the same if they have a lot of email addresses in common.

mismo.lib.email.EmailsDimension.__init__

__init__(
    column: str,
    *,
    column_parsed: str = "{column}_parsed",
    column_compared: str = "{column}_compared",
)

Initialize the dimension.

PARAMETER DESCRIPTION
column

The name of the column that holds an array of email addresses.

TYPE: str

column_parsed

The name of the column that will be filled with the parsed email addresses.

TYPE: str DEFAULT: '{column}_parsed'

column_compared

The name of the column that will be filled with the comparison results.

TYPE: str DEFAULT: '{column}_compared'
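The defaults look like templates containing a {column} placeholder. The sketch below shows how such templates could resolve to concrete column names. It assumes the placeholder is filled with str.format, which the documentation above does not confirm, and resolve_column_names is a hypothetical helper rather than part of the library.

```python
def resolve_column_names(
    column: str,
    *,
    column_parsed: str = "{column}_parsed",
    column_compared: str = "{column}_compared",
) -> dict[str, str]:
    """Hypothetical helper: fill the {column} placeholder in each template."""
    return {
        "column": column,
        "column_parsed": column_parsed.format(column=column),
        "column_compared": column_compared.format(column=column),
    }


print(resolve_column_names("emails"))
# {'column': 'emails', 'column_parsed': 'emails_parsed', 'column_compared': 'emails_compared'}
```

Under this assumption, a template without the placeholder, such as "parsed", is used as the column name unchanged.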

mismo.lib.email.EmailsDimension.compare

compare(t: Table) -> Table

Add a column with the best match between all pairs of email addresses.
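The idea of a best match over all pairs can be sketched on plain Python lists. This is an illustration only: the real method works on ibis tables, and best_match and toy_level are hypothetical names. The sketch assumes that a lower level number means a closer match (FULL_EXACT is 0) and that records with no addresses yield None.

```python
from itertools import product
from typing import Callable, Optional, Sequence


def best_match(
    emails_left: Sequence[str],
    emails_right: Sequence[str],
    level_fn: Callable[[str, str], int],
) -> Optional[int]:
    """Return the lowest (best) level over every cross pair, or None if there are no pairs."""
    levels = [level_fn(a, b) for a, b in product(emails_left, emails_right)]
    return min(levels) if levels else None


def toy_level(a: str, b: str) -> int:
    # Hypothetical two-level comparer: 0 if equal ignoring case, 4 otherwise.
    return 0 if a.lower() == b.lower() else 4


print(best_match(["a@x.com", "b@y.com"], ["B@Y.com"], toy_level))  # 0
```

The number of pairs is the product of the two list lengths, so the cost grows quickly for records with many addresses.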

mismo.lib.email.EmailsDimension.prepare_for_fast_linking

prepare_for_fast_linking(t: Table) -> Table

Add a column with the parsed and normalized email addresses.