Geospatial API

This contains utilities, Linkers, and comparers relevant to geospatial data.

Coordinates

mismo.lib.geo.CoordinateLinker dataclass

Links two locations together if they are within a certain distance.

This is approximate: the result can include pairs that are up to about 2x farther apart than the given threshold. This is because we use a simple grid to bin the coordinates, so:

1. It isn't accurate near the poles.
2. It isn't accurate near the international date line (longitude 180/-180).
3. If two coords fall in opposite corners of the same grid cell, they will be blocked together even if they are farther apart than the precision, because the diagonal of a cell is longer than its horizontal or vertical side.
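The grid caveats above can be illustrated with a small standalone sketch. This is not the linker's actual implementation: `grid_cell` and the cell size `CELL_DEG` are hypothetical names chosen for illustration, and the grid here is a plain floor-division of latitude and longitude in degrees.

```python
import math


def grid_cell(lat: float, lon: float, cell_deg: float) -> tuple[int, int]:
    """Bin a coordinate into a square lat/lon grid with cells cell_deg degrees wide."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


CELL_DEG = 0.01  # hypothetical cell size, in degrees

# Opposite corners of one cell land in the same bin, although they are
# about sqrt(2) cell widths apart (measured in degrees).
corner_a = (61.1501, -150.0699)
corner_b = (61.1599, -150.0601)
same_bin = grid_cell(*corner_a, CELL_DEG) == grid_cell(*corner_b, CELL_DEG)
diagonal_deg = math.hypot(corner_b[0] - corner_a[0], corner_b[1] - corner_a[1])

# Two points a fraction of a kilometer apart, on either side of the date line,
# land in different bins because their longitudes have opposite signs.
east = (0.0, 179.999)
west = (0.0, -179.999)
split_by_date_line = grid_cell(*east, CELL_DEG) != grid_cell(*west, CELL_DEG)

# Near the poles a degree of longitude covers far less ground, so a fixed
# cell_deg corresponds to a much narrower cell in kilometers.
width_ratio_at_85_deg = math.cos(math.radians(85.0))  # relative to the equator
```

In this sketch `same_bin` and `split_by_date_line` are both true, which mirrors caveats 3 and 2 respectively.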

Examples:

>>> import ibis
>>> from mismo.lib.geo import CoordinateLinker
>>> ibis.options.interactive = True
>>> conn = ibis.duckdb.connect()
>>> left = conn.create_table(
...     "left",
...     [
...         {
...             "record_id": 0,
...             "latlon": {"lat": 61.1547800, "lon": -150.0677490},
...         }
...     ],
... )
>>> right = conn.create_table(
...     "right",
...     [
...         {
...             "record_id": 4,
...             "latitude": 61.1582056,
...             "longitude": -150.0584552,
...         },
...         {
...             "record_id": 5,
...             "latitude": 61.1582056,
...             "longitude": 0,
...         },
...         {
...             "record_id": 6,
...             "latitude": 61.1547800,
...             "longitude": -150,
...         },
...     ],
... )
>>> linker = CoordinateLinker(
...     distance_km=1,
...     left_coord="latlon",
...     right_lat="latitude",
...     right_lon="longitude",
... )
>>> linker(left, right).links
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ record_id_l ┃ latlon_l                      ┃ record_id_r ┃ latitude_r ┃ longitude_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64       │ struct<lat: float64, lon:     │ int64       │ float64    │ float64     │
│             │ float64>                      │             │            │             │
├─────────────┼───────────────────────────────┼─────────────┼────────────┼─────────────┤
│             │ {                             │             │            │             │
│           0 │     'lat': 61.15478,          │           4 │  61.158206 │ -150.058455 │
│             │     'lon': -150.067749        │             │            │             │
│             │ }                             │             │            │             │
└─────────────┴───────────────────────────────┴─────────────┴────────────┴─────────────┘

mismo.lib.geo.CoordinateLinker.coord class-attribute instance-attribute

coord: (
    str | Deferred | Callable[[Table], StructColumn] | None
) = None

The column in both tables containing the struct<lat: float, lon: float> coordinates.

mismo.lib.geo.CoordinateLinker.distance_km instance-attribute

distance_km: float | int

The approximate maximum distance, in kilometers, between two coords for them to be blocked together.

mismo.lib.geo.CoordinateLinker.lat class-attribute instance-attribute

lat: (
    str
    | Deferred
    | Callable[[Table], FloatingColumn]
    | None
) = None

The column in both tables containing the latitude coordinates.

mismo.lib.geo.CoordinateLinker.left_coord class-attribute instance-attribute

left_coord: (
    str | Deferred | Callable[[Table], StructColumn] | None
) = None

The column in the left table containing the struct<lat: float, lon: float> coordinates.

mismo.lib.geo.CoordinateLinker.left_lat class-attribute instance-attribute

left_lat: (
    str
    | Deferred
    | Callable[[Table], FloatingColumn]
    | None
) = None

The column in the left table containing the latitude coordinates.

mismo.lib.geo.CoordinateLinker.left_lon class-attribute instance-attribute

left_lon: (
    str
    | Deferred
    | Callable[[Table], FloatingColumn]
    | None
) = None

The column in the left table containing the longitude coordinates.

mismo.lib.geo.CoordinateLinker.lon class-attribute instance-attribute

lon: (
    str
    | Deferred
    | Callable[[Table], FloatingColumn]
    | None
) = None

The column in both tables containing the longitude coordinates.

mismo.lib.geo.CoordinateLinker.max_pairs class-attribute instance-attribute

max_pairs: int | None = None

The maximum number of pairs that any single block of coordinates can contain.

For example, if you have 1000 records all with the same coordinates, this would naively result in ~(1000 * 1000) / 2 = 500_000 pairs. If max_pairs is set to less than this, this group of records will be skipped.
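The arithmetic behind max_pairs can be sketched as follows. This is an illustration only, not the linker's code: `blocks_to_keep` is a hypothetical helper, it counts pairs within a single table as n * (n - 1) / 2, and treating a block with exactly max_pairs pairs as allowed is an assumption.

```python
from collections import Counter


def blocks_to_keep(block_keys, max_pairs):
    """Return the block keys whose pair count does not exceed max_pairs.

    block_keys holds one key per record (for example, a grid cell id).
    A block of n records produces n * (n - 1) // 2 unordered pairs.
    """
    sizes = Counter(block_keys)
    keep = set()
    for key, n in sizes.items():
        n_pairs = n * (n - 1) // 2
        # max_pairs=None means "no limit", matching the documented default.
        if max_pairs is None or n_pairs <= max_pairs:
            keep.add(key)
    return keep


# 1000 records share block "a" (499_500 pairs); 3 records share block "b" (3 pairs).
keys = ["a"] * 1000 + ["b"] * 3
kept_with_limit = blocks_to_keep(keys, max_pairs=100_000)
kept_without_limit = blocks_to_keep(keys, max_pairs=None)
```

With the limit set to 100_000, only block "b" survives in this sketch; without a limit, both blocks are kept.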

mismo.lib.geo.CoordinateLinker.right_coord class-attribute instance-attribute

right_coord: (
    str | Deferred | Callable[[Table], StructColumn] | None
) = None

The column in the right table containing the struct<lat: float, lon: float> coordinates.

mismo.lib.geo.CoordinateLinker.right_lat class-attribute instance-attribute

right_lat: (
    str
    | Deferred
    | Callable[[Table], FloatingColumn]
    | None
) = None

The column in the right table containing the latitude coordinates.

mismo.lib.geo.CoordinateLinker.right_lon class-attribute instance-attribute

right_lon: (
    str
    | Deferred
    | Callable[[Table], FloatingColumn]
    | None
) = None

The column in the right table containing the longitude coordinates.

mismo.lib.geo.distance_km

distance_km(
    *,
    lat1: FloatingValue,
    lon1: FloatingValue,
    lat2: FloatingValue,
    lon2: FloatingValue,
) -> FloatingValue

The distance between two points on the Earth's surface, in kilometers.

PARAMETER DESCRIPTION
lat1

The latitude of the first point.

TYPE: FloatingValue

lon1

The longitude of the first point.

TYPE: FloatingValue

lat2

The latitude of the second point.

TYPE: FloatingValue

lon2

The longitude of the second point.

TYPE: FloatingValue

RETURNS DESCRIPTION
distance

The distance between the two points, in kilometers.

TYPE: FloatingValue
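For intuition, a great-circle distance can be computed on plain Python floats with the haversine formula. The documentation above does not say which formula distance_km uses, so haversine, the helper name `haversine_km`, and the Earth radius constant are all assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The coordinates from the CoordinateLinker example:
# record 0 vs record 4 (linked) and record 0 vs record 6 (not linked).
near = haversine_km(61.1547800, -150.0677490, 61.1582056, -150.0584552)
far = haversine_km(61.1547800, -150.0677490, 61.1547800, -150.0)
```

Under these assumptions `near` comes out under 1 km and `far` above it, which is consistent with only record 4 being linked at distance_km=1 in the example.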

Addresses

mismo.lib.geo.us_census_geocode

us_census_geocode(
    t: Table,
    format: str = "census_{name}",
    *,
    benchmark: str | None = None,
    vintage: str | None = None,
    chunk_size: int | None = None,
    n_concurrent: int | None = None,
) -> Table

Geocode US physical addresses using the US Census Bureau's geocoding service.

Uses the batch geocoding API from https://geocoding.geo.census.gov/geocoder. This only works for US physical addresses. PO Boxes are not supported. "APT 123", "UNIT B", etc are not included in the results, so you will need to extract those before geocoding.

Before geocoding, this function normalizes the input addresses and deduplicates them, so if your input table has 1M rows, but only 100k unique addresses, it will only send those 100k addresses to the API.
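The normalize-then-deduplicate step described above can be sketched in plain Python. The real normalization rules are not documented here, so the ones below (upper-casing and collapsing whitespace) and the helper names `normalize` and `unique_addresses` are assumptions for illustration.

```python
FIELDS = ("street", "city", "state", "zipcode")


def normalize(address: dict) -> tuple:
    """Upper-case each field and collapse runs of whitespace (assumed rules)."""
    return tuple(" ".join((address.get(f) or "").upper().split()) for f in FIELDS)


def unique_addresses(addresses: list) -> list:
    """Return one normalized key per distinct address, in first-seen order."""
    seen = {}
    for addr in addresses:
        seen.setdefault(normalize(addr), addr)
    return list(seen)


rows = [
    {"street": "123 Main St", "city": "Springfield", "state": "IL", "zipcode": "62701"},
    {"street": "123  main st ", "city": "springfield", "state": "il", "zipcode": "62701"},
    {"street": "9 Elm Ave", "city": "Springfield", "state": "IL", "zipcode": "62701"},
]
to_send = unique_addresses(rows)  # only the distinct addresses would be sent
```

Here three input rows reduce to two distinct addresses, so only two would go to the API.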

This took about 7 minutes to geocode 1M unique addresses in my tests.

PARAMETER DESCRIPTION
t

A table of addresses to geocode. Must have the schema:

  • street: string, the street address.
  • city: string, the city name.
  • state: string, the state name.
  • zipcode: string, the ZIP code.

TYPE: Table

format

The format to use for the output column names. See the Returns section.

TYPE: str DEFAULT: 'census_{name}'

benchmark

The geocoding benchmark to use. Default is "Public_AR_Current".

TYPE: str | None DEFAULT: None

vintage

The geocoding vintage to use. Default is "Current_Current".

TYPE: str | None DEFAULT: None

chunk_size

The number of addresses to geocode in each request. Default is 5000. The maximum allowed by the API is 10_000. This number was tuned experimentally; you probably don't need to change it.

TYPE: int | None DEFAULT: None

n_concurrent

The number of concurrent requests to make. Default is 16. This number was tuned experimentally; you probably don't need to change it.

TYPE: int | None DEFAULT: None
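How chunk_size and n_concurrent interact can be sketched with the standard library. This is not the function's implementation: `chunked`, `fake_geocode_batch`, and `geocode_all` are hypothetical names, and the fake batch function stands in for one HTTP request so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items: list, chunk_size: int):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def fake_geocode_batch(batch: list) -> list:
    """Hypothetical stand-in for one batch request; it matches nothing."""
    return [{"address": a, "is_match": False} for a in batch]


def geocode_all(addresses: list, chunk_size: int = 5000, n_concurrent: int = 16) -> list:
    """Send the batches through a pool of n_concurrent worker threads."""
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        batches = pool.map(fake_geocode_batch, chunked(addresses, chunk_size))
        return [row for batch in batches for row in batch]


addresses = [f"address {i}" for i in range(12_001)]
n_requests = len(list(chunked(addresses, 5000)))
results = geocode_all(addresses)
```

With the default chunk size, 12_001 addresses need three requests in this sketch. `pool.map` happens to preserve input order here, which the real function does not promise.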

RETURNS DESCRIPTION
geocoded

The input table, with the following additional columns:

  • is_match: bool, whether the address was successfully matched. If False, all the other columns will be NULL.
  • match_type: string, the type of match, eg "exact", "non_exact".
  • street: string, the normalized street address.
  • city: string, the normalized city name.
  • state: string, the normalized 2 letter state code.
  • zipcode: string, the 5 digit ZIP code.
  • latitude: float64, the latitude of the matched address.
  • longitude: float64, the longitude of the matched address.

Each of these columns is named according to the format parameter. For example, if format is "census_{name}", the columns will be named "census_is_match", "census_match_type", "census_street", etc. The order of the results is not guaranteed to match the input order.

TYPE: Table
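The column naming described for the format parameter can be sketched with ordinary string formatting. That the function uses `str.format` internally is an assumption; `OUTPUT_FIELDS` and `output_column_names` are hypothetical names, with the field list taken from the Returns description.

```python
OUTPUT_FIELDS = [
    "is_match",
    "match_type",
    "street",
    "city",
    "state",
    "zipcode",
    "latitude",
    "longitude",
]


def output_column_names(fmt: str = "census_{name}") -> list:
    """Apply the naming template to every documented output field."""
    return [fmt.format(name=field) for field in OUTPUT_FIELDS]


default_names = output_column_names()
suffixed_names = output_column_names("{name}_geo")
```

With the default template the first names are "census_is_match" and "census_match_type", matching the example given in the Returns description.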

mismo.lib.geo.AddressesDimension

Preps, blocks, and compares based on array<address> columns.

An address is a Struct of the type

struct<
    street1: string,
    street2: string,  # eg "Apt 3"
    city: string,
    state: string,
    postal_code: string,  # zipcode in the US
    country: string,
>

This operates on columns of type array<address>. In other words, it is useful for comparing eg people, who might have multiple addresses, and they are the same person if any of their addresses match.
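The "match if any of their addresses match" idea can be sketched with plain dicts and lists. This is not how AddressesDimension is implemented: `same_address` and `any_address_matches` are hypothetical helpers, and exact equality on street1, city, and state is an assumed rule.

```python
def same_address(a: dict, b: dict) -> bool:
    """Exact match on street1, city, and state; missing values never match."""
    keys = ("street1", "city", "state")
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


def any_address_matches(left_addresses: list, right_addresses: list) -> bool:
    """True if at least one address on the left equals one on the right."""
    return any(same_address(a, b) for a in left_addresses for b in right_addresses)


home = {"street1": "123 Main St", "city": "Springfield", "state": "IL"}
work = {"street1": "9 Elm Ave", "city": "Springfield", "state": "IL"}
other = {"street1": "77 Oak Rd", "city": "Peoria", "state": "IL"}

person_a = [home, work]
person_b = [other, work]   # shares the work address with person_a
person_c = [other]         # shares nothing with person_a

a_and_b = any_address_matches(person_a, person_b)
a_and_c = any_address_matches(person_a, person_c)
```

In this sketch person_a and person_b are considered a match because one address is shared, while person_a and person_c are not.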

mismo.lib.geo.AddressesDimension.prepare_for_blocking

prepare_for_blocking(t: Table) -> Table

Prepares the table for blocking, adding normalized and tokenized columns.

mismo.lib.geo.AddressesDimension.prepare_for_fast_linking

prepare_for_fast_linking(t: Table) -> Table

Prepares the table for fast linking, adding a normalized column.

mismo.lib.geo.AddressesMatchLevel

Bases: MatchLevel

How closely two addresses match.

mismo.lib.geo.AddressesMatchLevel.ELSE class-attribute instance-attribute

ELSE = 6

None of the above.

mismo.lib.geo.AddressesMatchLevel.POSSIBLE_TYPO class-attribute instance-attribute

POSSIBLE_TYPO = 1

If you consider typos, the addresses match.

eg the Levenshtein distance is below a certain threshold.
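For reference, the edit distance mentioned above can be computed with the classic dynamic-programming recurrence. The sketch below is a generic textbook version on Python strings; the helper name `levenshtein` is illustrative, and the threshold the library applies is not stated here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


typo_distance = levenshtein("Oklhoma", "Oklahoma")
```

A single missing letter, as in "Oklhoma" versus "Oklahoma", gives a distance of 1.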

mismo.lib.geo.AddressesMatchLevel.SAME_REGION class-attribute instance-attribute

SAME_REGION = 2

The postal code, or city and state, match.

mismo.lib.geo.AddressesMatchLevel.SAME_STATE class-attribute instance-attribute

SAME_STATE = 4

The states match.

mismo.lib.geo.AddressesMatchLevel.STREET1_AND_CITY_OR_POSTAL class-attribute instance-attribute

STREET1_AND_CITY_OR_POSTAL = 0

The street1, city, and state match.

mismo.lib.geo.AddressesMatchLevel.WITHIN_100KM class-attribute instance-attribute

WITHIN_100KM = 3

The addresses are within 100 km of each other.

mismo.lib.geo.match_level

match_level(
    left: StructValue, right: StructValue
) -> IntegerValue

Compare two address structs, and return the match level.
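The idea of returning the first level whose condition holds can be sketched on plain dicts. This is a simplified illustration, not the library's logic: `Level`, `fields_equal`, and `sketch_match_level` are hypothetical names, the enum values are copied from the documentation above, level 0 follows its documented description (street1, city, and state), and the typo and distance levels are left out because they need string and geographic comparisons.

```python
from enum import IntEnum


class Level(IntEnum):
    # Values copied from the AddressesMatchLevel documentation above.
    STREET1_AND_CITY_OR_POSTAL = 0
    POSSIBLE_TYPO = 1
    SAME_REGION = 2
    WITHIN_100KM = 3
    SAME_STATE = 4
    ELSE = 6


def fields_equal(a: dict, b: dict, *keys: str) -> bool:
    """True if every key is present (not None) and equal in both addresses."""
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


def sketch_match_level(left: dict, right: dict) -> Level:
    """Check the exact-match levels from most to least specific."""
    if fields_equal(left, right, "street1", "city", "state"):
        return Level.STREET1_AND_CITY_OR_POSTAL
    # POSSIBLE_TYPO and WITHIN_100KM are intentionally not modeled here.
    if fields_equal(left, right, "postal_code") or fields_equal(left, right, "city", "state"):
        return Level.SAME_REGION
    if fields_equal(left, right, "state"):
        return Level.SAME_STATE
    return Level.ELSE


a = {"street1": "123 Main St", "city": "Springfield", "state": "IL", "postal_code": "62701"}
b = {"street1": "9 Elm Ave", "city": "Springfield", "state": "IL", "postal_code": "62704"}
c = {"street1": "77 Oak Rd", "city": "Peoria", "state": "IL", "postal_code": "61602"}
d = {"street1": "1 Pine St", "city": "Anchorage", "state": "AK", "postal_code": "99501"}
```

Because the levels are ordered integers, lower values mean closer matches, so the results can be compared or thresholded directly.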

mismo.lib.geo.postal_parse_address

postal_parse_address(
    address_string: StringValue,
) -> StructValue

Parse individual fields from an address string.

Note: To use this function, you need the optional postal library installed.

This uses the optional postal library to extract individual fields from the string using the following mapping:

  • house_number + road -> street1
  • unit -> street2
  • city -> city
  • state -> state
  • postcode -> postal_code
  • country -> country

Any additional fields parsed by postal will not be included.
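The field mapping above can be sketched without installing postal. The input below is a hard-coded list of (value, label) pairs in the shape the parser is assumed to return; `to_address_struct` and `FIELD_MAP` are hypothetical names, and joining house_number and road with a space is an assumption.

```python
# Direct one-to-one mappings from parser labels to struct fields.
FIELD_MAP = {
    "unit": "street2",
    "city": "city",
    "state": "state",
    "postcode": "postal_code",
    "country": "country",
}


def to_address_struct(parsed: list) -> dict:
    """Map (value, label) pairs onto the address struct fields.

    Labels outside the documented mapping (eg "suburb") are dropped.
    """
    labels = {}
    for value, label in parsed:
        labels.setdefault(label, value)  # keep the first value per label
    street_parts = [labels[k] for k in ("house_number", "road") if k in labels]
    address = {"street1": " ".join(street_parts) or None}
    for source, target in FIELD_MAP.items():
        address[target] = labels.get(source)
    return address


# Hypothetical parser output, written by hand for this sketch.
parsed = [
    ("123", "house_number"),
    ("main street", "road"),
    ("apt 3", "unit"),
    ("springfield", "city"),
    ("il", "state"),
    ("62701", "postcode"),
    ("downtown", "suburb"),
]
address = to_address_struct(parsed)
```

The "suburb" entry does not appear in the result, and country is None because the input has no country label.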

PARAMETER DESCRIPTION
address_string

The address as a single string

TYPE: StringValue

RETURNS DESCRIPTION
address

The parsed address as a Struct

TYPE: StructValue

mismo.lib.geo.postal_fingerprint_address

postal_fingerprint_address(
    address: StructValue,
) -> ArrayValue

Generate multiple hashes of an address string to be used for e.g. blocking.

Note: To use this function, you need to have the optional postal library installed.

This uses the near-dupe hashing functionality of postal to expand the root of each address component, ignoring tokens such as "road" or "street" in street names.

For street names, whitespace is removed so that for example "Sea Grape Ln" and "Seagrape Ln" will both normalize to "seagrape".

This returns a list of normalized tokens that are the minimum required information to represent the given address.

Near-dupe hashes can be used as keys when blocking, to generate pairs of potential duplicates.

Further details about the hashing function can be found here.

Note that postal.near_dupe.near_dupe_hashes can optionally hash names and use latlon coordinates for geohashing, but this function only hashes addresses. Name and geo-hashing must be implemented elsewhere.

Examples:

>>> address = ibis.struct(
...     {
...         "street1": "123 Main Street",
...         "street2": "",
...         "city": "Springfield",
...         "state": "IL",
...         "postal_code": "62701",
...         "country": "us",
...     }
... )
>>> postal_fingerprint_address(address).execute()
[
    "act|main street|123|springfield",
    "act|main|123|springfield",
    "apc|main street|123|62701",
    "apc|main|123|62701",
]
PARAMETER DESCRIPTION
address

The address

TYPE: StructValue

RETURNS DESCRIPTION
address_hashes

Hashes of the address.

TYPE: ArrayValue
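Using the hashes as blocking keys, as described above, amounts to pairing up records that share at least one hash. The sketch below does this with an inverted index in plain Python; `candidate_pairs` is a hypothetical helper, and the hash lists for records 2 and 3 are made-up data in the same style as the example output.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(hashes_by_record: dict) -> set:
    """Return the set of record id pairs that share at least one hash."""
    index = defaultdict(set)  # hash -> ids of records that produced it
    for record_id, hashes in hashes_by_record.items():
        for h in hashes:
            index[h].add(record_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


hashes_by_record = {
    # Record 1 uses the hashes from the example above.
    1: [
        "act|main street|123|springfield",
        "act|main|123|springfield",
        "apc|main street|123|62701",
        "apc|main|123|62701",
    ],
    # Records 2 and 3 are hypothetical.
    2: ["act|main|123|springfield", "apc|main|123|62701"],
    3: ["act|elm|9|peoria", "apc|elm|9|61602"],
}
pairs = candidate_pairs(hashes_by_record)
```

Records 1 and 2 share two hashes but are emitted as a single pair, and record 3 is never compared with either of them.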

mismo.lib.geo.spacy_tag_address

spacy_tag_address(
    address_string: StringValue,
) -> ArrayValue

Tag each token in a US address string with its type, eg StreetName, StreetNamePreDirectional.

Note: To use this function, you need the optional spacy-address library installed from https://github.com/NickCrews/spacy-address

This uses a trained Named Entity Recognition (NER) model in spaCy to tag tokens in an address string with the following labels:

  • AddressNumber
  • AddressNumberPrefix
  • AddressNumberSuffix
  • BuildingName
  • CornerOf
  • CountryName
  • IntersectionSeparator
  • LandmarkName
  • NotAddress
  • OccupancyIdentifier
  • OccupancyType
  • PlaceName
  • Recipient
  • StateName
  • StreetName
  • StreetNamePostDirectional
  • StreetNamePostModifier
  • StreetNamePostType
  • StreetNamePreDirectional
  • StreetNamePreModifier
  • StreetNamePreType
  • SubaddressIdentifier
  • SubaddressType
  • USPSBoxGroupID
  • USPSBoxGroupType
  • USPSBoxID
  • USPSBoxType
  • ZipCode
  • ZipPlus4
PARAMETER DESCRIPTION
address_string

The address as a single string

TYPE: StringValue

RETURNS DESCRIPTION
taggings

An array<struct<token: string, label: string>> with the tagged tokens

Examples:

>>> from mismo.lib.geo import spacy_tag_address
>>> import ibis

Note that

  • "St" isn't confused as an abbreviation for Street,
  • "Stre" is correctly tagged as a StreetNamePostType even though it is a typo for "Street",
  • "Oklahoma" in "Oklahoma City" is correctly tagged as part of a PlaceName,
  • "Oklhoma" is correctly tagged as a StateName even though it is a typo for "Oklahoma".

>>> spacy_tag_address(
...     ibis.literal("456 E St Jude Stre, Oklahoma City, Oklhoma 73102-1234")
... ).execute()
[{'token': '456', 'label': 'AddressNumber'},
{'token': 'E', 'label': 'StreetNamePreDirectional'},
{'token': 'St Jude', 'label': 'StreetName'},
{'token': 'Stre', 'label': 'StreetNamePostType'},
{'token': 'Oklahoma City', 'label': 'PlaceName'},
{'token': 'Oklhoma', 'label': 'StateName'},
{'token': '73102-1234', 'label': 'ZipCode'}]
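Once executed, the taggings are ordinary Python dicts, so they can be regrouped by label. The sketch below takes the output shown above as a literal; `group_by_label` is a hypothetical helper, and joining repeated labels with a space is an assumption.

```python
from collections import defaultdict


def group_by_label(taggings: list) -> dict:
    """Collect tokens per label, joining repeated labels with a space."""
    grouped = defaultdict(list)
    for tag in taggings:
        grouped[tag["label"]].append(tag["token"])
    return {label: " ".join(tokens) for label, tokens in grouped.items()}


# The tagged tokens from the example above.
taggings = [
    {"token": "456", "label": "AddressNumber"},
    {"token": "E", "label": "StreetNamePreDirectional"},
    {"token": "St Jude", "label": "StreetName"},
    {"token": "Stre", "label": "StreetNamePostType"},
    {"token": "Oklahoma City", "label": "PlaceName"},
    {"token": "Oklhoma", "label": "StateName"},
    {"token": "73102-1234", "label": "ZipCode"},
]
fields = group_by_label(taggings)
```

The resulting dict gives direct access to, for example, `fields["StreetName"]` and `fields["ZipCode"]`.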