Emails API

This module contains utilities, blockers, and comparers for working with email addresses.

mismo.lib.email.clean_email(email: ir.StringValue, *, normalize: bool = False) -> ir.StringValue

Clean an email address.

  • convert to lowercase
  • extract anything that matches r".(\S+@\S+)."

If normalize is True, an additional step removes "." and "_" characters. This makes comparisons between two addresses more robust to noise. For example, many email systems, such as Gmail, ignore "." characters.
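The steps above can be approximated in plain Python to see what they do to a single string. This is only a sketch, not the library's implementation: the function name clean_email_sketch is hypothetical, the extraction keeps just the captured group (\S+@\S+), and the normalization is assumed to strip "." and "_" from the whole address, including the domain.

```python
import re

# Assumption: only the captured group from the documented pattern is kept.
_EMAIL_RE = re.compile(r"(\S+@\S+)")


def clean_email_sketch(email, normalize=False):
    """Plain-Python approximation of the documented cleaning steps."""
    if email is None:
        return None
    # Step 1: convert to lowercase. Step 2: extract the email-like token.
    match = _EMAIL_RE.search(email.lower())
    if match is None:
        return None
    cleaned = match.group(1)
    if normalize:
        # Assumption: "." and "_" are removed everywhere, domain included.
        cleaned = cleaned.replace(".", "").replace("_", "")
    return cleaned


print(clean_email_sketch("Contact: Bob.Smith@GMAIL.com"))
print(clean_email_sketch("bob_smith@gmail.com", normalize=True))
```

Under these assumptions, two spellings such as "Bob.Smith@gmail.com" and "bob_smith@gmail.com" normalize to the same string, which is the point of the normalize option.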

mismo.lib.email.ParsedEmail

A simple data class holding an email address that has been split into parts.

mismo.lib.email.ParsedEmail.domain: ir.StringValue = full.split('@')[1].nullif('') instance-attribute

The domain part of the email address, eg 'gmail.com'.

mismo.lib.email.ParsedEmail.full: ir.StringValue = full instance-attribute

The full email address, eg 'bob.smith@gmail.com'.

mismo.lib.email.ParsedEmail.user: ir.StringValue = full.split('@')[0].nullif('') instance-attribute

The user part of the email address, eg 'bob.smith' of 'bob.smith@gmail.com'.

mismo.lib.email.ParsedEmail.__init__(full: ir.StringValue)

Parse an email address from the full string.

Does no cleaning or normalization. If you want that, use clean_email first.

PARAMETER DESCRIPTION
full

The full email address.

TYPE: StringValue
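To make the splitting behaviour concrete, here is a minimal plain-Python sketch of the same idea: split on '@', take the first and second pieces, and turn empty strings into None (mirroring nullif('')). The class name ParsedEmailSketch and the helper _part are hypothetical, and treating a missing piece as None is an assumption about how the real expression behaves when there is no '@'.

```python
from dataclasses import dataclass, field
from typing import Optional


def _part(full: str, index: int) -> Optional[str]:
    """Return the index-th piece of full split on '@', or None if empty or missing."""
    parts = full.split("@")
    # Assumption: a missing piece is treated like an empty one.
    value = parts[index] if index < len(parts) else ""
    return value or None


@dataclass
class ParsedEmailSketch:
    """Plain-Python stand-in for an email split into parts."""

    full: str
    user: Optional[str] = field(init=False)
    domain: Optional[str] = field(init=False)

    def __post_init__(self) -> None:
        # No cleaning or normalization happens here; the input is used as-is.
        self.user = _part(self.full, 0)
        self.domain = _part(self.full, 1)


parsed = ParsedEmailSketch("bob.smith@gmail.com")
print(parsed.user, parsed.domain)
```

Because nothing is cleaned at this stage, an input like "Bob.Smith@Gmail.com" keeps its capitalization in every part.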

mismo.lib.email.ParsedEmail.as_struct() -> ir.StructValue

Convert to an ibis struct.

RETURNS DESCRIPTION
An ibis struct<full:string, user:string, domain:string>

mismo.lib.email.match_level(e1: ir.StructValue | ir.StringValue, e2: ir.StructValue | ir.StringValue, *, native_representation: Literal['integer', 'string'] = 'integer') -> EmailMatchLevel

Match level of two email addresses.

PARAMETER DESCRIPTION
e1

The first email address. If a string, it will be parsed and normalized.

TYPE: StructValue | StringValue

e2

The second email address. If a string, it will be parsed and normalized.

TYPE: StructValue | StringValue

RETURNS DESCRIPTION
level

The match level.

TYPE: EmailMatchLevel

mismo.lib.email.EmailMatchLevel

Bases: MatchLevel

How closely two email addresses of the form <user>@<domain> match.

Case is ignored, and dots and underscores are removed.

mismo.lib.email.EmailMatchLevel.ELSE = 4 class-attribute instance-attribute

None of the other levels apply.

mismo.lib.email.EmailMatchLevel.FULL_EXACT = 0 class-attribute instance-attribute

The full email addresses are exactly the same.

mismo.lib.email.EmailMatchLevel.FULL_NEAR = 1 class-attribute instance-attribute

The full email addresses have a small edit distance.

mismo.lib.email.EmailMatchLevel.USER_EXACT = 2 class-attribute instance-attribute

The user parts of the email addresses are exactly the same.

mismo.lib.email.EmailMatchLevel.USER_NEAR = 3 class-attribute instance-attribute

The user parts of the email addresses have a small edit distance.
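The levels can be read as a cascade checked from the strictest to the loosest. The following plain-Python sketch illustrates that reading; it is not the library's code. The names Level, edit_distance, and match_level_sketch are hypothetical, the order of checks is assumed to follow the integer values, and "small edit distance" is assumed to mean a Levenshtein distance of at most 1, which the documentation does not specify.

```python
from enum import IntEnum


class Level(IntEnum):
    FULL_EXACT = 0
    FULL_NEAR = 1
    USER_EXACT = 2
    USER_NEAR = 3
    ELSE = 4


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def _normalize(email: str) -> str:
    # Case is ignored, and dots and underscores are removed.
    return email.lower().replace(".", "").replace("_", "")


def match_level_sketch(e1: str, e2: str, max_distance: int = 1) -> Level:
    """Assumed cascade; max_distance is a made-up threshold for 'small'."""
    n1, n2 = _normalize(e1), _normalize(e2)
    u1, u2 = n1.split("@")[0], n2.split("@")[0]
    if n1 == n2:
        return Level.FULL_EXACT
    if edit_distance(n1, n2) <= max_distance:
        return Level.FULL_NEAR
    if u1 == u2:
        return Level.USER_EXACT
    if edit_distance(u1, u2) <= max_distance:
        return Level.USER_NEAR
    return Level.ELSE


print(match_level_sketch("Bob.Smith@gmail.com", "bobsmith@gmail.com").name)
print(match_level_sketch("bobsmith@gmail.com", "bobsmith@yahoo.com").name)
```

With integer levels, a lower value means a closer match, so levels from several comparisons can be ranked directly.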

mismo.lib.email.EmailsDimension

A dimension of email addresses.

This is useful if each record contains a collection of email addresses. Two records are probably the same if they have a lot of email addresses in common.

mismo.lib.email.EmailsDimension.__init__(column: str, *, column_parsed: str = '{column}_parsed', column_compared: str = '{column}_compared')

Initialize the dimension.

PARAMETER DESCRIPTION
column

The name of the column that holds an array of email addresses.

TYPE: str

column_parsed

The name of the column that will be filled with the parsed email addresses.

TYPE: str DEFAULT: '{column}_parsed'

column_compared

The name of the column that will be filled with the comparison results.

TYPE: str DEFAULT: '{column}_compared'
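The default values suggest that the output column names are templates filled in with the input column name. A small sketch of that substitution, assuming it behaves like Python's str.format (the helper resolve_column_name is hypothetical and not part of the library):

```python
def resolve_column_name(template: str, column: str) -> str:
    """Fill the {column} placeholder, if present, with the input column name."""
    return template.format(column=column)


print(resolve_column_name("{column}_parsed", "emails"))
print(resolve_column_name("{column}_compared", "emails"))
```

Under this assumption, passing a name without a placeholder, such as "my_parsed", would be used unchanged.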

mismo.lib.email.EmailsDimension.compare(t: ir.Table) -> ir.Table

Add a column with the best match between all pairs of email addresses.

mismo.lib.email.EmailsDimension.prepare(t: ir.Table) -> ir.Table

Add a column with the parsed and normalized email addresses.
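Conceptually, comparing two records that each hold several addresses means scoring every pair and keeping the best score. The sketch below shows that idea in plain Python, outside of any table machinery. The names best_pair_level and toy_level are hypothetical, toy_level is a deliberately simplified stand-in for a real match-level function, and "best" is assumed to mean the lowest integer level.

```python
from itertools import product


def toy_level(a: str, b: str) -> int:
    """Hypothetical scorer: 0 = identical, 2 = same user part, 4 = otherwise."""
    if a == b:
        return 0
    if a.split("@")[0] == b.split("@")[0]:
        return 2
    return 4


def best_pair_level(emails_a, emails_b, level_fn=toy_level):
    """Lowest level over all pairs, or None if either collection is empty."""
    levels = [level_fn(a, b) for a, b in product(emails_a, emails_b)]
    return min(levels) if levels else None


left = ["bob@gmail.com", "b.smith@work.example"]
right = ["bob@yahoo.com", "robert@home.example"]
print(best_pair_level(left, right))
```

Taking the minimum means a single strong pair is enough to rate the two records as a close match, regardless of how many other pairs differ.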