Geospatial API

This contains utilities, blockers, and comparers relevant to geospatial data.

Coordinates

mismo.lib.geo.CoordinateBlocker dataclass

Blocks two locations together if they are within a certain distance.

This isn't precise, and can include pairs that are actually up to about 2x further apart than the given threshold. This is because we use a simple grid to bin the coordinates, so

  1. This isn't accurate near the poles.
  2. This isn't accurate near the international date line (longitude 180/-180).
  3. If two coords fall within opposite corners of the same grid cell, they will be blocked together even if they are further apart than the precision, because the diagonal of a cell is longer than its horizontal or vertical side.
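To make the third limitation concrete, here is a minimal sketch of grid binning in plain Python. It is not the blocker's actual implementation: the cell size, the `KM_PER_DEGREE` constant, and the `grid_cell` helper are all assumptions chosen for illustration.

```python
import math

KM_PER_DEGREE = 111.0  # rough length of one degree of latitude (assumed constant)


def grid_cell(lat: float, lon: float, distance_km: float) -> tuple[int, int]:
    """Return the (row, col) of the grid cell that contains the coordinate."""
    cell_deg = distance_km / KM_PER_DEGREE  # cell side length, in degrees
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


# Two points near opposite corners of the same 10 km cell, close to the equator.
a = (0.001, 0.001)
b = (0.089, 0.089)
same_cell = grid_cell(*a, 10) == grid_cell(*b, 10)

# Planar approximation of the separation, reasonable this close to the equator.
separation_km = math.hypot(b[0] - a[0], b[1] - a[1]) * KM_PER_DEGREE
print(same_cell, separation_km > 10)
```

Both points share a cell, so a grid-based blocker would pair them even though they are further apart than the 10 km threshold.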

Examples:

>>> import ibis
>>> from mismo.lib.geo import CoordinateBlocker
>>> ibis.options.interactive = True
>>> conn = ibis.duckdb.connect()
>>> left = conn.create_table(
...     "left",
...     {"latlon": [{"lat": 0, "lon": 2}]},
... )
>>> right = conn.create_table(
...     "right",
...     {
...         "latitude": [0, 1, 2],
...         "longitude": [2, 3, 4],
...     },
... )
>>> blocker = CoordinateBlocker(
...     distance_km=10,
...     name="within_10_km",
...     left_coord="latlon",
...     right_lat="latitude",
...     right_lon="longitude",
... )
>>> blocker(left, right)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ latlon_l                       ┃ latitude_r ┃ longitude_r ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ struct<lat: int64, lon: int64> │ int64      │ int64       │
├────────────────────────────────┼────────────┼─────────────┤
│ {'lat': 0, 'lon': 2}           │          0 │           2 │
└────────────────────────────────┴────────────┴─────────────┘

mismo.lib.geo.CoordinateBlocker.coord: str | Deferred | Callable[[ir.Table], ir.StructColumn] | None = None class-attribute instance-attribute

The column in both tables containing the struct<lat: float, lon: float> coordinates.

mismo.lib.geo.CoordinateBlocker.distance_km: float | int instance-attribute

The (approx) max distance in kilometers that two coords will be blocked together.

mismo.lib.geo.CoordinateBlocker.lat: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None class-attribute instance-attribute

The column in both tables containing the latitude coordinates.

mismo.lib.geo.CoordinateBlocker.left_coord: str | Deferred | Callable[[ir.Table], ir.StructColumn] | None = None class-attribute instance-attribute

The column in the left table containing the struct<lat: float, lon: float> coordinates.

mismo.lib.geo.CoordinateBlocker.left_lat: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None class-attribute instance-attribute

The column in the left table containing the latitude coordinates.

mismo.lib.geo.CoordinateBlocker.left_lon: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None class-attribute instance-attribute

The column in the left table containing the longitude coordinates.

mismo.lib.geo.CoordinateBlocker.lon: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None class-attribute instance-attribute

The column in both tables containing the longitude coordinates.

mismo.lib.geo.CoordinateBlocker.name: str | None = None class-attribute instance-attribute

The name of the blocker.

mismo.lib.geo.CoordinateBlocker.right_coord: str | Deferred | Callable[[ir.Table], ir.StructColumn] | None = None class-attribute instance-attribute

The column in the right table containing the struct<lat: float, lon: float> coordinates.

mismo.lib.geo.CoordinateBlocker.right_lat: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None class-attribute instance-attribute

The column in the right table containing the latitude coordinates.

mismo.lib.geo.CoordinateBlocker.right_lon: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None class-attribute instance-attribute

The column in the right table containing the longitude coordinates.

mismo.lib.geo.CoordinateBlocker.__call__(left: ir.Table, right: ir.Table, **kwargs) -> ir.Table

Block the left and right tables, returning a table of the pairs whose coordinates are blocked together.

mismo.lib.geo.distance_km(*, lat1: ir.FloatingValue, lon1: ir.FloatingValue, lat2: ir.FloatingValue, lon2: ir.FloatingValue) -> ir.FloatingValue

The distance between two points on the Earth's surface, in kilometers.

PARAMETER DESCRIPTION
lat1

The latitude of the first point.

TYPE: FloatingValue

lon1

The longitude of the first point.

TYPE: FloatingValue

lat2

The latitude of the second point.

TYPE: FloatingValue

lon2

The longitude of the second point.

TYPE: FloatingValue

RETURNS DESCRIPTION
distance

The distance between the two points, in kilometers.

TYPE: FloatingValue
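The docs above do not say which formula is used. As a point of reference, the haversine formula is one common way to compute this kind of distance. The sketch below works on plain Python floats rather than ibis expressions, and `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, not part of mismo.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# One degree of longitude along the equator is roughly 111 km.
print(round(haversine_km(0, 0, 0, 1)))
```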

Addresses

mismo.lib.geo.us_census_geocode(t: ir.Table, format: str = 'census_{name}', *, benchmark: str | None = None, vintage: str | None = None, chunk_size: int | None = None, n_concurrent: int | None = None) -> ir.Table

Geocode US physical addresses using the US Census Bureau's geocoding service.

Uses the batch geocoding API from https://geocoding.geo.census.gov/geocoder. This only works for US physical addresses. PO Boxes are not supported. "APT 123", "UNIT B", etc are not included in the results, so you will need to extract those before geocoding.

Before geocoding, this function normalizes the input addresses and deduplicates them, so if your input table has 1M rows, but only 100k unique addresses, it will only send those 100k addresses to the API.
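The deduplication step can be pictured with a small plain-Python sketch. The `normalize` rule and the `fake_geocode_batch` stand-in are assumptions for illustration only. The real function's normalization is not documented here.

```python
def normalize(address: tuple[str, str, str, str]) -> tuple[str, ...]:
    """Hypothetical normalization: collapse whitespace and uppercase each field."""
    return tuple(" ".join(field.split()).upper() for field in address)


def geocode_deduplicated(addresses, geocode_batch):
    """Geocode each unique normalized address once, then map results to every row."""
    keys = [normalize(a) for a in addresses]
    unique = list(dict.fromkeys(keys))  # unique keys, in first-seen order
    results = dict(zip(unique, geocode_batch(unique)))
    return [results[k] for k in keys]


calls = []


def fake_geocode_batch(batch):
    # Stand-in for the real service: records what was sent, returns dummy results.
    calls.extend(batch)
    return [f"result for {street}" for street, *_ in batch]


rows = [
    ("123 Main St", "Springfield", "IL", "62701"),
    ("123  main st ", "springfield", "il", "62701"),
    ("9 Elm Ave", "Springfield", "IL", "62701"),
]
geocoded = geocode_deduplicated(rows, fake_geocode_batch)
print(len(rows), len(calls))
```

Three input rows result in only two addresses being sent, and the first two rows receive the same result.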

This took about 7 minutes to geocode 1M unique addresses in my tests.

PARAMETER DESCRIPTION
t

A table of addresses to geocode. Must have the schema:

  • street: string, the street address.
  • city: string, the city name.
  • state: string, the state name.
  • zipcode: string, the ZIP code.

TYPE: Table

format

The format to use for the output column names. See the Returns section.

TYPE: str DEFAULT: 'census_{name}'

benchmark

The geocoding benchmark to use. Default is "Public_AR_Current".

TYPE: str | None DEFAULT: None

vintage

The geocoding vintage to use. Default is "Current_Current".

TYPE: str | None DEFAULT: None

chunk_size

The number of addresses to geocode in each request. Default is 5000. The maximum allowed by the API is 10_000. This number was tuned experimentally; you probably don't need to change it.

TYPE: int | None DEFAULT: None

n_concurrent

The number of concurrent requests to make. Default is 16. This number was tuned experimentally; you probably don't need to change it.

TYPE: int | None DEFAULT: None
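How chunk_size and n_concurrent interact can be sketched with the standard library. This is only an illustration of chunked, concurrent requests, not the function's real internals: `chunks`, `geocode_all`, and `fake_send_batch` are hypothetical names, and the thread pool is an assumed mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items: list, chunk_size: int):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start : start + chunk_size]


def geocode_all(addresses, send_batch, chunk_size=5000, n_concurrent=16):
    """Send every chunk through send_batch, with up to n_concurrent in flight."""
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        # pool.map yields per-chunk results in the order the chunks were submitted.
        per_chunk = pool.map(send_batch, chunks(addresses, chunk_size))
        return [result for batch in per_chunk for result in batch]


def fake_send_batch(batch):
    # Hypothetical stand-in for one batch request to the geocoding service.
    return [address.upper() for address in batch]


addresses = [f"addr {i}" for i in range(12)]
out = geocode_all(addresses, fake_send_batch, chunk_size=5, n_concurrent=3)
print(len(out))
```

With 12 addresses and a chunk size of 5, three requests are made (5, 5, and 2 addresses). This sketch happens to preserve input order; the real function makes no such guarantee.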

RETURNS DESCRIPTION
geocoded

The input table, with the following additional columns:

  • is_match: bool, whether the address was successfully matched. If False, all the other columns will be NULL.
  • match_type: string, the type of match, eg "exact", "non_exact".
  • street: string, the normalized street address.
  • city: string, the normalized city name.
  • state: string, the normalized 2 letter state code.
  • zipcode: string, the 5 digit ZIP code.
  • latitude: float64, the latitude of the matched address.
  • longitude: float64, the longitude of the matched address.

Each of these columns is named according to the format parameter. For example, if format is "census_{name}", the columns will be named "census_is_match", "census_match_type", "census_street", etc. The order of the results is not guaranteed to match the input order.

TYPE: Table
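The naming rule for the output columns behaves like Python's `str.format` with a single `name` field. The sketch below assumes that is how the template is expanded. `OUTPUT_NAMES` and `output_columns` are hypothetical helpers, not part of mismo.

```python
OUTPUT_NAMES = [
    "is_match", "match_type", "street", "city",
    "state", "zipcode", "latitude", "longitude",
]


def output_columns(format: str = "census_{name}") -> list[str]:
    """Expand the format template once per output column name."""
    return [format.format(name=name) for name in OUTPUT_NAMES]


print(output_columns()[:3])
print(output_columns("{name}_geocoded")[:3])
```

A custom template such as "{name}_geocoded" is useful when the input table already has columns starting with "census_".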

mismo.lib.geo.AddressesDimension

Preps, blocks, and compares based on array<address> columns.

An address is a Struct of the type

    struct<
        street1: string,
        street2: string,  # eg "Apt 3"
        city: string,
        state: string,
        postal_code: string,  # zipcode in the US
        country: string,
    >

This operates on columns of type array<address>. In other words, it is useful for comparing eg people, who might have multiple addresses, and they are the same person if any of their addresses match.
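The "any of their addresses match" idea can be sketched in plain Python. The comparison rule `same_street_and_postal` is a deliberately crude, hypothetical stand-in, and the dicts carry only two of the address fields for brevity.

```python
def any_address_matches(left: list[dict], right: list[dict], same_address) -> bool:
    """True if at least one address from left matches at least one from right."""
    return any(same_address(a, b) for a in left for b in right)


def same_street_and_postal(a: dict, b: dict) -> bool:
    # Hypothetical comparison used only for this illustration.
    return (a["street1"].lower(), a["postal_code"]) == (
        b["street1"].lower(),
        b["postal_code"],
    )


alice = [
    {"street1": "123 Main St", "postal_code": "62701"},
    {"street1": "9 Elm Ave", "postal_code": "62704"},
]
bob = [{"street1": "9 elm ave", "postal_code": "62704"}]
carol = [{"street1": "77 Oak Dr", "postal_code": "62701"}]

print(any_address_matches(alice, bob, same_street_and_postal))
print(any_address_matches(alice, carol, same_street_and_postal))
```

Here alice and bob share one address, so they would be treated as a match, while alice and carol share none.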

mismo.lib.geo.AddressesDimension.prepare(t: ir.Table) -> ir.Table

Prepares the table for blocking, adding normalized and tokenized columns.

mismo.lib.geo.AddressesMatchLevel

Bases: MatchLevel

How closely two addresses match.

mismo.lib.geo.AddressesMatchLevel.ELSE = 6 class-attribute instance-attribute

None of the above.

mismo.lib.geo.AddressesMatchLevel.POSSIBLE_TYPO = 1 class-attribute instance-attribute

If you consider typos, the addresses match.

eg the Levenshtein distance is below a certain threshold.
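For reference, a threshold check on edit distance might look like the sketch below. The threshold value and the `possible_typo` helper are assumptions. The docs do not state what threshold or string fields the real comparison uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
                )
            )
        previous = current
    return previous[-1]


def possible_typo(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical check; max_distance=2 is an arbitrary illustrative value."""
    return levenshtein(a.lower(), b.lower()) <= max_distance


print(possible_typo("123 Main Stret", "123 Main Street"))
```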

mismo.lib.geo.AddressesMatchLevel.SAME_REGION = 2 class-attribute instance-attribute

The postal code, or city and state, match.

mismo.lib.geo.AddressesMatchLevel.SAME_STATE = 4 class-attribute instance-attribute

The states match.

mismo.lib.geo.AddressesMatchLevel.STREET1_AND_CITY_OR_POSTAL = 0 class-attribute instance-attribute

The street1, city, and state match.

mismo.lib.geo.AddressesMatchLevel.WITHIN_100KM = 3 class-attribute instance-attribute

The addresses are within 100 km of each other.

mismo.lib.geo.match_level(left: ir.StructValue, right: ir.StructValue) -> ir.IntegerValue

Compare two address structs, and return the match level.

mismo.lib.geo.postal_parse_address(address_string: ir.StringValue) -> ir.StructValue

Parse individual fields from an address string.

Note: To use this function, you need the optional postal library installed.

This uses the optional postal library to extract individual fields from the string using the following mapping:

  • house_number + road -> street1
  • unit -> street2
  • city -> city
  • state -> state
  • postcode -> postal_code
  • country -> country

Any additional fields parsed by postal will not be included.
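The field mapping above can be sketched as a small function over (value, label) pairs, which is the shape that postal's address parser returns. The `to_address_struct` helper is hypothetical, the input pairs are hand-written rather than real parser output, and representing missing fields as None is an assumption.

```python
def to_address_struct(parsed: list[tuple[str, str]]) -> dict:
    """Fold (value, label) pairs into the six address fields listed above."""
    by_label = {label: value for value, label in parsed}
    street_parts = [by_label[k] for k in ("house_number", "road") if k in by_label]
    return {
        "street1": " ".join(street_parts) or None,
        "street2": by_label.get("unit"),
        "city": by_label.get("city"),
        "state": by_label.get("state"),
        "postal_code": by_label.get("postcode"),
        "country": by_label.get("country"),
    }


# Hand-written pairs in the (value, label) shape; not actual parser output.
parsed = [
    ("123", "house_number"),
    ("main street", "road"),
    ("apt 3", "unit"),
    ("springfield", "city"),
    ("il", "state"),
    ("62701", "postcode"),
    ("quincy building", "house"),  # no target field, so it is dropped
]
address = to_address_struct(parsed)
print(address["street1"], "|", address["street2"])
```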

PARAMETER DESCRIPTION
address_string

The address as a single string

TYPE: StringValue

RETURNS DESCRIPTION
address

The parsed address as a Struct

TYPE: StructValue

mismo.lib.geo.postal_fingerprint_address(address: ir.StructValue) -> ir.ArrayValue

Generate multiple hashes of an address string to be used for e.g. blocking.

Note: To use this function, you need to have the optional postal library installed.

This uses the near-dupe hashing functionality of postal to expand the root of each address component, ignoring tokens such as "road" or "street" in street names.

For street names, whitespace is removed so that for example "Sea Grape Ln" and "Seagrape Ln" will both normalize to "seagrape".
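The effect on street names can be imitated with a few lines of plain Python. This is only a toy approximation of the behavior described, not how postal computes its hashes. `STREET_TYPE_TOKENS` and `street_root` are hypothetical.

```python
STREET_TYPE_TOKENS = {"ln", "lane", "rd", "road", "st", "street"}  # tiny illustrative set


def street_root(street_name: str) -> str:
    """Lowercase, drop street-type tokens, and join what remains without spaces."""
    tokens = [t for t in street_name.lower().split() if t not in STREET_TYPE_TOKENS]
    return "".join(tokens)


print(street_root("Sea Grape Ln"), street_root("Seagrape Ln"))
```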

This returns a list of normalized tokens that are the minimum required information to represent the given address.

Near-dupe hashes can be used as keys when blocking, to generate pairs of potential duplicates.

Further details about the hashing function can be found here.

Note that postal.near_dupe.near_dupe_hashes can optionally hash names and use latlon coordinates for geohashing, but this function only hashes addresses. Name and geo-hashing must be implemented elsewhere.

Examples:

>>> import ibis
>>> from mismo.lib.geo import postal_fingerprint_address
>>> address = ibis.struct({
...     "street1": "123 Main Street",
...     "street2": "",
...     "city": "Springfield",
...     "state": "IL",
...     "postal_code": "62701",
...     "country": "us",
... })
>>> postal_fingerprint_address(address).execute()
[
    "act|main street|123|springfield",
    "act|main|123|springfield",
    "apc|main street|123|62701",
    "apc|main|123|62701",
]

PARAMETER DESCRIPTION
address

The address

TYPE: StructValue

RETURNS DESCRIPTION
address_hashes

Hashes of the address.

TYPE: ArrayValue
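Using hashes as blocking keys amounts to pairing up records that share at least one hash. The sketch below shows that idea in plain Python. `candidate_pairs` is a hypothetical helper, and the hash strings for records "b" and "c" are made up by analogy with the example output, not produced by postal.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(hashes_by_id: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Return every pair of record ids that share at least one hash."""
    ids_by_hash = defaultdict(set)
    for record_id, hashes in hashes_by_id.items():
        for h in hashes:
            ids_by_hash[h].add(record_id)
    pairs = set()
    for ids in ids_by_hash.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


hashes_by_id = {
    "a": ["act|main street|123|springfield", "apc|main street|123|62701"],
    "b": ["act|main street|123|springfield", "apc|main street|123|62704"],
    "c": ["act|elm avenue|9|springfield", "apc|elm avenue|9|62701"],
}
pairs = candidate_pairs(hashes_by_id)
print(pairs)
```

Records "a" and "b" share the street-and-city hash, so they become a candidate pair even though their postal codes differ. Record "c" shares no hash and is left out.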

mismo.lib.geo.spacy_tag_address(address_string: ir.StringValue) -> ir.ArrayValue

Tag each token in a US address string with its type, eg StreetName, StreetNamePreDirectional.

Note: To use this function, you need the optional spacy-address library installed from https://github.com/NickCrews/spacy-address

This uses a trained Named Entity Recognition (NER) model in spaCy to tag tokens in an address string with the following labels:

  • AddressNumber
  • AddressNumberPrefix
  • AddressNumberSuffix
  • BuildingName
  • CornerOf
  • CountryName
  • IntersectionSeparator
  • LandmarkName
  • NotAddress
  • OccupancyIdentifier
  • OccupancyType
  • PlaceName
  • Recipient
  • StateName
  • StreetName
  • StreetNamePostDirectional
  • StreetNamePostModifier
  • StreetNamePostType
  • StreetNamePreDirectional
  • StreetNamePreModifier
  • StreetNamePreType
  • SubaddressIdentifier
  • SubaddressType
  • USPSBoxGroupID
  • USPSBoxGroupType
  • USPSBoxID
  • USPSBoxType
  • ZipCode
  • ZipPlus4

PARAMETER DESCRIPTION
address_string

The address as a single string

TYPE: StringValue

RETURNS DESCRIPTION
taggings

An array<struct<token: string, label: string>> with the tagged tokens

Examples:

>>> from mismo.lib.geo import spacy_tag_address
>>> import ibis

Note that

  • "St" isn't confused as an abbreviation for Street,
  • "Stre" is correctly tagged as a StreetNamePostType, even though it is a typo for "Street",
  • "Oklahoma" in "Oklahoma City" is correctly tagged as part of a PlaceName,
  • "Oklhoma" is correctly tagged as a StateName, even though it is a typo for "Oklahoma".

>>> spacy_tag_address(ibis.literal("456 E St Jude Stre, Oklahoma City, Oklhoma 73102-1234")).execute()
[{'token': '456', 'label': 'AddressNumber'},
{'token': 'E', 'label': 'StreetNamePreDirectional'},
{'token': 'St Jude', 'label': 'StreetName'},
{'token': 'Stre', 'label': 'StreetNamePostType'},
{'token': 'Oklahoma City', 'label': 'PlaceName'},
{'token': 'Oklhoma', 'label': 'StateName'},
{'token': '73102-1234', 'label': 'ZipCode'}]
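Once executed, the result is an ordinary list of dicts, so it can be regrouped with plain Python. The sketch below copies the output shown above into a literal and groups tokens by label. `tokens_by_label` is a hypothetical helper, not part of mismo.

```python
from collections import defaultdict

# Copied from the example output above.
taggings = [
    {"token": "456", "label": "AddressNumber"},
    {"token": "E", "label": "StreetNamePreDirectional"},
    {"token": "St Jude", "label": "StreetName"},
    {"token": "Stre", "label": "StreetNamePostType"},
    {"token": "Oklahoma City", "label": "PlaceName"},
    {"token": "Oklhoma", "label": "StateName"},
    {"token": "73102-1234", "label": "ZipCode"},
]


def tokens_by_label(taggings: list[dict]) -> dict[str, list[str]]:
    """Group tokens under their label, keeping the order they appeared in."""
    grouped = defaultdict(list)
    for tagging in taggings:
        grouped[tagging["label"]].append(tagging["token"])
    return dict(grouped)


fields = tokens_by_label(taggings)
print(fields["StreetName"], fields["PlaceName"])
```

A list per label is used because a label can appear more than once in a single address string.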