Geospatial API
This contains utilities, blockers, and comparers relevant to geospatial data.
Coordinates
mismo.lib.geo.CoordinateBlocker
dataclass
Blocks two locations together if they are within a certain distance.
This isn't precise, and can include pairs that are up to about 2x farther apart than the given threshold. This is because we use a simple grid to bin the coordinates, so:
1. This isn't accurate near the poles.
2. This isn't accurate near the international date line (longitude 180/-180).
3. If two coords fall within opposite corners of the same grid cell, they will be blocked together even if they are further apart than the precision, due to the diagonal distance being longer than the horizontal or vertical distance.
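The last two caveats can be reproduced with a small plain-Python sketch of grid binning. This is not mismo's implementation: `grid_cell`, `approx_km`, and the 111.32 km-per-degree constant are assumptions made for illustration, and the flat-earth distance estimate is only reasonable for nearby points close to the equator.

```python
import math

KM_PER_DEGREE = 111.32  # assumed: approximate length of one degree at the equator


def grid_cell(lat, lon, distance_km):
    """Hypothetical binning: snap a coordinate to a square cell in degree space."""
    step = distance_km / KM_PER_DEGREE  # cell width in degrees
    return (math.floor(lat / step), math.floor(lon / step))


def approx_km(lat1, lon1, lat2, lon2):
    """Flat-earth distance estimate, only reasonable for nearby points."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = (lat2 - lat1) * KM_PER_DEGREE
    dx = (lon2 - lon1) * KM_PER_DEGREE * math.cos(mean_lat)
    return math.hypot(dx, dy)


# Opposite corners of one cell: same bin, but farther apart than the threshold.
a = (0.001, 0.001)
b = (0.089, 0.089)
same_cell = grid_cell(*a, 10) == grid_cell(*b, 10)
corner_distance = approx_km(*a, *b)

# Either side of the date line: only about 2 km apart, yet in different bins.
west = grid_cell(0.0, 179.99, 10)
east = grid_cell(0.0, -179.99, 10)
```

Here `same_cell` is true even though `corner_distance` exceeds the 10 km threshold, while `west` and `east` differ despite the two points being close together on the globe.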
Examples:
>>> import ibis
>>> from mismo.lib.geo import CoordinateBlocker
>>> ibis.options.interactive = True
>>> conn = ibis.duckdb.connect()
>>> left = conn.create_table(
... "left",
... {"latlon": [{"lat": 0, "lon": 2}]},
... )
>>> right = conn.create_table(
... "right",
... {
... "latitude": [0, 1, 2],
... "longitude": [2, 3, 4],
... },
... )
>>> blocker = CoordinateBlocker(
... distance_km=10,
... name="within_10_km",
... left_coord="latlon",
... right_lat="latitude",
... right_lon="longitude",
... )
>>> blocker(left, right)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ latlon_l ┃ latitude_r ┃ longitude_r ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ struct<lat: int64, lon: int64> │ int64 │ int64 │
├────────────────────────────────┼────────────┼─────────────┤
│ {'lat': 0, 'lon': 2} │ 0 │ 2 │
└────────────────────────────────┴────────────┴─────────────┘
mismo.lib.geo.CoordinateBlocker.coord: str | Deferred | Callable[[ir.Table], ir.StructColumn] | None = None
class-attribute
instance-attribute
The column in both tables containing the struct<lat: float, lon: float> coordinates.
mismo.lib.geo.CoordinateBlocker.distance_km: float | int
instance-attribute
The (approx) max distance in kilometers that two coords will be blocked together.
mismo.lib.geo.CoordinateBlocker.lat: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None
class-attribute
instance-attribute
The column in both tables containing the latitude coordinates.
mismo.lib.geo.CoordinateBlocker.left_coord: str | Deferred | Callable[[ir.Table], ir.StructColumn] | None = None
class-attribute
instance-attribute
The column in the left table containing the struct<lat: float, lon: float> coordinates.
mismo.lib.geo.CoordinateBlocker.left_lat: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None
class-attribute
instance-attribute
The column in the left table containing the latitude coordinates.
mismo.lib.geo.CoordinateBlocker.left_lon: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None
class-attribute
instance-attribute
The column in the left table containing the longitude coordinates.
mismo.lib.geo.CoordinateBlocker.lon: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None
class-attribute
instance-attribute
The column in both tables containing the longitude coordinates.
mismo.lib.geo.CoordinateBlocker.name: str | None = None
class-attribute
instance-attribute
The name of the blocker.
mismo.lib.geo.CoordinateBlocker.right_coord: str | Deferred | Callable[[ir.Table], ir.StructColumn] | None = None
class-attribute
instance-attribute
The column in the right table containing the struct<lat: float, lon: float> coordinates.
mismo.lib.geo.CoordinateBlocker.right_lat: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None
class-attribute
instance-attribute
The column in the right table containing the latitude coordinates.
mismo.lib.geo.CoordinateBlocker.right_lon: str | Deferred | Callable[[ir.Table], ir.FloatingColumn] | None = None
class-attribute
instance-attribute
The column in the right table containing the longitude coordinates.
mismo.lib.geo.CoordinateBlocker.__call__(left: ir.Table, right: ir.Table, **kwargs) -> ir.Table
Block the two tables together, returning the pairs of records whose coordinates are (approximately) within distance_km of each other.
mismo.lib.geo.distance_km(*, lat1: ir.FloatingValue, lon1: ir.FloatingValue, lat2: ir.FloatingValue, lon2: ir.FloatingValue) -> ir.FloatingValue
The distance between two points on the Earth's surface, in kilometers.
PARAMETER | DESCRIPTION |
---|---|
lat1 | The latitude of the first point. TYPE: ir.FloatingValue |
lon1 | The longitude of the first point. TYPE: ir.FloatingValue |
lat2 | The latitude of the second point. TYPE: ir.FloatingValue |
lon2 | The longitude of the second point. TYPE: ir.FloatingValue |

RETURNS | DESCRIPTION |
---|---|
distance | The distance between the two points, in kilometers. TYPE: ir.FloatingValue |
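The source does not say which formula `distance_km` uses. One common choice for great-circle distance is the haversine formula, sketched below on plain Python floats rather than ibis expressions. The function name `haversine_km` and the Earth radius constant are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed: mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```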
Addresses
mismo.lib.geo.us_census_geocode(t: ir.Table, format: str = 'census_{name}', *, benchmark: str | None = None, vintage: str | None = None, chunk_size: int | None = None, n_concurrent: int | None = None) -> ir.Table
Geocode US physical addresses using the US Census Bureau's geocoding service.
Uses the batch geocoding API from https://geocoding.geo.census.gov/geocoder. This only works for US physical addresses. PO Boxes are not supported. "APT 123", "UNIT B", etc are not included in the results, so you will need to extract those before geocoding.
Before geocoding, this function normalizes the input addresses and deduplicates them, so if your input table has 1M rows, but only 100k unique addresses, it will only send those 100k addresses to the API.
This took about 7 minutes to geocode 1M unique addresses in my tests.
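The deduplicate-then-batch idea can be sketched in plain Python. This only illustrates the approach described above: the `normalize` rule (uppercase, collapse whitespace) and the helper names are hypothetical, and the real function's normalization is not specified here. The default of 5000 mirrors the documented chunk_size default.

```python
def normalize(address):
    """Hypothetical normalization: collapse whitespace and uppercase each field."""
    return tuple(" ".join(field.upper().split()) for field in address)


def unique_chunks(addresses, chunk_size=5000):
    """Deduplicate normalized addresses, then split them into request-sized chunks."""
    # dict.fromkeys keeps the first occurrence of each key, in input order.
    unique = list(dict.fromkeys(normalize(a) for a in addresses))
    return [unique[i : i + chunk_size] for i in range(0, len(unique), chunk_size)]


rows = [
    ("1 Main St", "Anytown", "AK", "99501"),
    ("1  main st ", "anytown", "ak", "99501"),  # same address after normalizing
    ("2 Oak Ave", "Anytown", "AK", "99501"),
]
# A tiny chunk_size is used here only to make the batching visible.
chunks = unique_chunks(rows, chunk_size=1)
```

Three input rows collapse to two unique addresses, so only two single-address chunks would be sent.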
PARAMETER | DESCRIPTION |
---|---|
t | A table of addresses to geocode. Must have the columns street (string, the street address), city (string, the city name), state (string, the state name), and zipcode (string, the ZIP code). TYPE: ir.Table |
format | The format to use for the output column names. See the Returns section. TYPE: str |
benchmark | The geocoding benchmark to use. Default is "Public_AR_Current". TYPE: `str \| None` |
vintage | The geocoding vintage to use. Default is "Current_Current". TYPE: `str \| None` |
chunk_size | The number of addresses to geocode in each request. Default is 5000. The maximum allowed by the API is 10_000. This number was tuned experimentally; you probably don't need to change it. TYPE: `int \| None` |
n_concurrent | The number of concurrent requests to make. Default is 16. This number was tuned experimentally; you probably don't need to change it. TYPE: `int \| None` |

RETURNS | DESCRIPTION |
---|---|
geocoded | The input table, with the additional columns listed below. TYPE: ir.Table |

The additional columns are:
- is_match: bool, whether the address was successfully matched. If False, all the other columns will be NULL.
- match_type: string, the type of match. eg "exact", "non_exact"
- street: string, the normalized street address
- city: string, the normalized city name
- state: string, the normalized 2 letter state code
- zipcode: string, the 5 digit ZIP code
- latitude: float64, the latitude of the matched address
- longitude: float64, the longitude of the matched address
Each of these columns is named according to the format parameter.
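The format parameter is described as a template for the output column names. Assuming the template is filled in with Python's `str.format` and a `name` field (an assumption suggested by the default, 'census_{name}', not confirmed here), the resulting names could be previewed like this. `output_names` is a hypothetical helper.

```python
def output_names(fmt="census_{name}"):
    """Preview output column names under the assumed str.format templating."""
    names = [
        "is_match",
        "match_type",
        "street",
        "city",
        "state",
        "zipcode",
        "latitude",
        "longitude",
    ]
    return [fmt.format(name=n) for n in names]


default_names = output_names()
custom_names = output_names("geo_{name}")
```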
mismo.lib.geo.AddressesDimension
Preps, blocks, and compares based on array<address> columns.
An address is a struct of the type
`struct<street1: string, street2: string, city: string, state: string, postal_code: string, country: string>`
where street2 is eg "Apt 3" and postal_code is the zipcode in the US.
This operates on columns of type array<address>. In other words, it is useful for comparing eg people, who might have multiple addresses, and they are the same person if any of their addresses match.
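The "any of their addresses match" rule can be illustrated with plain Python lists of dicts standing in for the array columns. This is a sketch using exact equality; the helper name is hypothetical, and the real dimension prepares and compares addresses more loosely than this.

```python
def any_address_matches(left_addresses, right_addresses):
    """True if at least one address on the left equals one on the right."""
    return any(a == b for a in left_addresses for b in right_addresses)


home = {"street1": "123 Main St", "city": "Springfield", "state": "IL"}
office = {"street1": "9 Elm St", "city": "Springfield", "state": "IL"}
cabin = {"street1": "5 Lake Dr", "city": "Chicago", "state": "IL"}

# Two people share a home address, so they are treated as a match.
shared = any_address_matches([home, office], [cabin, home])
# No address in common.
disjoint = any_address_matches([office], [cabin])
```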
mismo.lib.geo.AddressesDimension.prepare(t: ir.Table) -> ir.Table
Prepares the table for blocking, adding normalized and tokenized columns.
mismo.lib.geo.AddressesMatchLevel
Bases: MatchLevel
How closely two addresses match.
mismo.lib.geo.AddressesMatchLevel.ELSE = 6
class-attribute
instance-attribute
None of the above.
mismo.lib.geo.AddressesMatchLevel.POSSIBLE_TYPO = 1
class-attribute
instance-attribute
If you consider typos, the addresses match.
eg the Levenshtein distance is below a certain threshold.
mismo.lib.geo.AddressesMatchLevel.SAME_REGION = 2
class-attribute
instance-attribute
The postal code, or city and state, match.
mismo.lib.geo.AddressesMatchLevel.SAME_STATE = 4
class-attribute
instance-attribute
The states match.
mismo.lib.geo.AddressesMatchLevel.STREET1_AND_CITY_OR_POSTAL = 0
class-attribute
instance-attribute
The street1, city, and state match.
mismo.lib.geo.AddressesMatchLevel.WITHIN_100KM = 3
class-attribute
instance-attribute
The addresses are within 100 km of each other.
mismo.lib.geo.match_level(left: ir.StructValue, right: ir.StructValue) -> ir.IntegerValue
Compare two address structs, and return the match level.
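As a rough picture of how such levels might be assigned, the sketch below checks conditions from most to least specific on plain dicts. The enum values and conditions are copied from the level descriptions above, but the cascade itself is an assumption, not the library's code. POSSIBLE_TYPO and WITHIN_100KM are skipped because they need fuzzy matching and coordinates.

```python
from enum import IntEnum


class Level(IntEnum):
    """Values copied from AddressesMatchLevel as documented above."""

    STREET1_AND_CITY_OR_POSTAL = 0
    POSSIBLE_TYPO = 1
    SAME_REGION = 2
    WITHIN_100KM = 3
    SAME_STATE = 4
    ELSE = 6


def same(left, right, *fields):
    """True if every named field is present on the left and equal on both sides."""
    return all(
        left.get(f) is not None and left.get(f) == right.get(f) for f in fields
    )


def sketch_match_level(left, right):
    """Hypothetical cascade: return the first (most specific) level that holds."""
    if same(left, right, "street1", "city", "state"):
        return Level.STREET1_AND_CITY_OR_POSTAL
    if same(left, right, "postal_code") or same(left, right, "city", "state"):
        return Level.SAME_REGION
    if same(left, right, "state"):
        return Level.SAME_STATE
    return Level.ELSE


home = {"street1": "123 Main St", "city": "Springfield", "state": "IL", "postal_code": "62701"}
nearby = {"street1": "9 Elm St", "city": "Springfield", "state": "IL", "postal_code": "62701"}
far = {"street1": "5 Lake Dr", "city": "Chicago", "state": "IL", "postal_code": "60601"}
```

With these inputs, `home` against `nearby` lands on SAME_REGION and `home` against `far` on SAME_STATE.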
mismo.lib.geo.postal_parse_address(address_string: ir.StringValue) -> ir.StructValue
Parse individual fields from an address string.
.. note:: To use this function, you need the optional postal library installed.
This uses the optional postal library to extract individual fields from the string using the following mapping:
- house_number + road -> street1
- unit -> street2
- city -> city
- state -> state
- postcode -> postal_code
- country -> country
Any additional fields parsed by postal will not be included.
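The mapping above can be sketched as a plain-Python function over a list of (value, label) pairs, which is the shape that postal's address parser is generally understood to return. The helper name is hypothetical, and how missing or repeated labels are handled is an assumption of this sketch.

```python
def to_address_struct(parsed):
    """Apply the label mapping above to a list of (value, label) pairs."""
    by_label = {}
    for value, label in parsed:
        by_label.setdefault(label, value)  # assumed: keep the first value per label
    street_parts = [by_label.get("house_number"), by_label.get("road")]
    street1 = " ".join(p for p in street_parts if p) or None
    return {
        "street1": street1,
        "street2": by_label.get("unit"),
        "city": by_label.get("city"),
        "state": by_label.get("state"),
        "postal_code": by_label.get("postcode"),
        "country": by_label.get("country"),
    }


# Hand-written pairs standing in for parser output.
parsed = [
    ("123", "house_number"),
    ("main street", "road"),
    ("apt 3", "unit"),
    ("springfield", "city"),
    ("il", "state"),
    ("62701", "postcode"),
    ("us", "country"),
    ("downtown", "suburb"),  # not in the mapping, so it is dropped
]
address = to_address_struct(parsed)
```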
PARAMETER | DESCRIPTION |
---|---|
address_string | The address as a single string. TYPE: ir.StringValue |

RETURNS | DESCRIPTION |
---|---|
address | The parsed address as a Struct. TYPE: ir.StructValue |
mismo.lib.geo.postal_fingerprint_address(address: ir.StructValue) -> ir.ArrayValue
Generate multiple hashes of an address string to be used for e.g. blocking.
.. note:: To use this function, you need to have the optional postal library installed.
This uses the near-dupe hashing functionality of postal to expand the root of each address component, ignoring tokens such as "road" or "street" in street names.
For street names, whitespace is removed so that for example "Sea Grape Ln" and "Seagrape Ln" will both normalize to "seagrape".
This returns a list of normalized tokens that are the minimum required information to represent the given address.
Near-dupe hashes can be used as keys when blocking, to generate pairs of potential duplicates.
Further details about the hashing function can be found here.
Note that postal.near_dupe.near_dupe_hashes can optionally hash names and use latlon coordinates for geohashing, but this function only hashes addresses. Name and geo-hashing must be implemented elsewhere.
Examples:
>>> import ibis
>>> from mismo.lib.geo import postal_fingerprint_address
>>> address = ibis.struct({
... "street1": "123 Main Street",
... "street2": "",
... "city": "Springfield",
... "state": "IL",
... "postal_code": "62701",
... "country": "us",
... })
>>> postal_fingerprint_address(address).execute()
[
"act|main street|123|springfield",
"act|main|123|springfield",
"apc|main street|123|62701",
"apc|main|123|62701",
]
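To show how such hashes could serve as blocking keys, here is a plain-Python sketch that pairs up records sharing at least one hash. It is not how mismo performs blocking. The record ids and the hashes for records "b" and "c" are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(hashes_by_record):
    """Return the set of record-id pairs that share at least one hash."""
    index = defaultdict(set)  # hash -> ids of records that produced it
    for record_id, hashes in hashes_by_record.items():
        for h in hashes:
            index[h].add(record_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    "a": ["act|main street|123|springfield", "act|main|123|springfield"],
    "b": ["act|main|123|springfield", "apc|main|123|62701"],  # illustrative
    "c": ["act|oak|9|shelbyville"],  # illustrative, shares nothing
}
pairs = candidate_pairs(records)
```

Only records "a" and "b" share a key, so they form the single candidate pair.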
PARAMETER | DESCRIPTION |
---|---|
address | The address. TYPE: ir.StructValue |

RETURNS | DESCRIPTION |
---|---|
address_hashes | Hashes of the address. TYPE: ir.ArrayValue |
mismo.lib.geo.spacy_tag_address(address_string: ir.StringValue) -> ir.ArrayValue
Tag each token in a US address string with its type, eg StreetName, StreetPreDirectional.
.. note:: To use this function, you need the optional spacy-address library installed from https://github.com/NickCrews/spacy-address
This uses a trained Named Entity Recognition (NER) model in spaCy to tag tokens in an address string with the following labels:
- AddressNumber
- AddressNumberPrefix
- AddressNumberSuffix
- BuildingName
- CornerOf
- CountryName
- IntersectionSeparator
- LandmarkName
- NotAddress
- OccupancyIdentifier
- OccupancyType
- PlaceName
- Recipient
- StateName
- StreetName
- StreetNamePostDirectional
- StreetNamePostModifier
- StreetNamePostType
- StreetNamePreDirectional
- StreetNamePreModifier
- StreetNamePreType
- SubaddressIdentifier
- SubaddressType
- USPSBoxGroupID
- USPSBoxGroupType
- USPSBoxID
- USPSBoxType
- ZipCode
- ZipPlus4
PARAMETER | DESCRIPTION |
---|---|
address_string | The address as a single string. TYPE: ir.StringValue |

RETURNS | DESCRIPTION |
---|---|
taggings | An array of {token, label} structs, as in the example below. TYPE: ir.ArrayValue |
Examples:
>>> from mismo.lib.geo import spacy_tag_address
>>> import ibis
Note that:
- "St" isn't confused as an abbreviation for Street,
- "Stre" is correctly tagged as a typo for "Street",
- "Oklahoma" in "Oklahoma City" is correctly tagged as a PlaceName,
- "Oklhoma" is correctly tagged as a typo for "Oklahoma".
>>> spacy_tag_address(ibis.literal("456 E St Jude Stre, Oklahoma City, Oklhoma 73102-1234")).execute()
[{'token': '456', 'label': 'AddressNumber'},
{'token': 'E', 'label': 'StreetNamePreDirectional'},
{'token': 'St Jude', 'label': 'StreetName'},
{'token': 'Stre', 'label': 'StreetNamePostType'},
{'token': 'Oklahoma City', 'label': 'PlaceName'},
{'token': 'Oklhoma', 'label': 'StateName'},
{'token': '73102-1234', 'label': 'ZipCode'}]
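Once executed, the result is an ordinary list of dicts, so it can be post-processed in plain Python. As a small sketch, the hypothetical helper below groups tokens by label, using the output shown above as literal input.

```python
from collections import defaultdict

# The executed output from the example above, copied as a literal.
taggings = [
    {"token": "456", "label": "AddressNumber"},
    {"token": "E", "label": "StreetNamePreDirectional"},
    {"token": "St Jude", "label": "StreetName"},
    {"token": "Stre", "label": "StreetNamePostType"},
    {"token": "Oklahoma City", "label": "PlaceName"},
    {"token": "Oklhoma", "label": "StateName"},
    {"token": "73102-1234", "label": "ZipCode"},
]


def tokens_by_label(taggings):
    """Group tagged tokens by their label, preserving token order."""
    grouped = defaultdict(list)
    for tag in taggings:
        grouped[tag["label"]].append(tag["token"])
    return dict(grouped)


fields = tokens_by_label(taggings)
```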