Blocking API
Utilities and classes for the blocking phase of record linkage, where we choose pairs of records to compare.
Without blocking, we would have to compare N*M records, which becomes intractable for datasets much larger than a few thousand.
Blockers
mismo.block.PBlocker
Bases: Protocol
A Callable that takes two tables and returns candidate pairs.
mismo.block.PBlocker.__call__(left: ir.Table, right: ir.Table, **kwargs) -> ir.Table
Return a table of candidate pairs.
Implementers must expect to be called with two tables, left
and right
.
Each one will have a column record_id
that uniquely identifies each record.
Implementers must accept a variable number of keyword arguments,
even if they don't use them for anything.
The returned table must follow the Mismo convention of blocked tables:
- The left table must have all columns suffixed with "_l".
- The right table must have all columns suffixed with "_r".
- There must be no duplicate pairs. ie there must be no two rows with the same record_id_l and record_id_r.
- There may be additional columns that could get used by later steps, such as a column of rule names used to generate the pair, or some kind of initial score.
mismo.block.CrossBlocker
A Blocker that cross joins two tables, yielding all possible pairs.
mismo.block.EmptyBlocker
A Blocker that yields no pairs.
mismo.block.ConditionBlocker
Blocks based on a join predicate.
mismo.block.ConditionBlocker.__init__(condition: Callable[[ir.Table, ir.Table], ir.BooleanValue], *, name: str | None = None)
Utils
mismo.block.n_naive_comparisons(left: ir.Table | Sized | int, right: ir.Table | Sized | int | None = None) -> int
The number of comparisons if we compared every record to every other record.
PARAMETER | DESCRIPTION |
---|---|
left |
The number of records in the left dataset, or the left dataset itself. |
right |
The number of records in the right dataset, or the right dataset itself. For dedupe tasks leave this as None. |
RETURNS | DESCRIPTION |
---|---|
int
|
The number of comparisons. |
mismo.block.sample_all_pairs(left: ir.Table, right: ir.Table, *, max_pairs: int | None = None) -> ir.Table
Samples up to max_pairs
from all possible pairs of records.
PARAMETER | DESCRIPTION |
---|---|
left |
The left table.
TYPE:
|
right |
The right table.
TYPE:
|
max_pairs |
The maximum number of pairs to sample. If None, all possible pairs are sampled.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Table
|
A blocked table of pairs, in the same schema as from other blocking functions. All pairs will be unique. |
Examples:
>>> import ibis
>>> from mismo import playdata, block
>>> ibis.options.interactive = True
>>> t, _labels = playdata.load_febrl1()
>>> t.head(5)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id ┃ given_name ┃ surname ┃ street_number ┃ address_1 ┃ … ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string │ string │ string │ string │ string │ … │
├──────────────┼────────────┼────────────┼───────────────┼───────────────────┼───┤
│ rec-0-dup-0 │ thomas │ rokobaro │ 12 │ herschell circuit │ … │
│ rec-0-org │ flynn │ rokobaro │ 12 │ herschell circuit │ … │
│ rec-1-dup-0 │ karli │ alderson │ 144 │ nulsen circuit │ … │
│ rec-1-org │ karli │ alderson │ 144 │ nulsen circuit │ … │
│ rec-10-dup-0 │ kayla │ harrington │ NULL │ maltby circuit │ … │
└──────────────┴────────────┴────────────┴───────────────┴───────────────────┴───┘
>>> block.sample_all_pairs(t, t, max_pairs = 7)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ record_id_l ┃ record_id_r ┃ address_1_l ┃ address_1_r ┃ … ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━┩
│ string │ string │ string │ string │ … │
├───────────────┼──────────────┼──────────────────┼───────────────────┼───┤
│ rec-138-org │ rec-63-org │ strachan place │ charleston street │ … │
│ rec-2-org │ rec-34-org │ finniss crescent │ messenger street │ … │
│ rec-325-dup-0 │ rec-422-org │ becker uplace │ booth crescent │ … │
│ rec-338-org │ rec-367-org │ crease place │ pickles street │ … │
│ rec-367-org │ rec-431-org │ pickles street │ kaveneys road │ … │
│ rec-448-org │ rec-69-org │ fenwick place │ torrens street │ … │
│ rec-72-org │ rec-58-dup-0 │ madigan street │ clark close │ … │
└───────────────┴──────────────┴──────────────────┴───────────────────┴───┘
mismo.block.join(left: ir.Table, right: ir.Table, *conditions: _Condition, on_slow: Literal['error', 'warn', 'ignore'] = 'error', task: Literal['dedupe', 'link'] | None = None, **kwargs) -> ir.Table
Join two tables together using the given conditions.
This is a wrapper around ibis.join
that
- Allows for slightly more flexible join conditions.
- Renames the columns in the resulting table to have "_l" and "_r" suffixes, to be consistent with Mismo's conventions.
- Sorts the output columns into a nicer order.
- Checks if the join condition is likely to cause a slow O(n*m) join algorithm.
- If
task
is "dedupe", adds an additional restriction thatrecord_id_l < record_id_r
to remove duplicates.
PARAMETER | DESCRIPTION |
---|---|
left |
The left table to block
TYPE:
|
right |
The right table to block
TYPE:
|
conditions |
The conditions that determine if two records should be blocked together. All conditions are ANDed together. Each condition can be any of the following:
Note You can't reference the input tables directly in the conditions.
eg
TYPE:
|
on_slow |
What to do if the join condition causes a slow O(n*m) join algorithm. If "error", raise a SlowJoinError. If "warn", issue a SlowJoinWarning. If "ignore", do nothing. See check_join_algorithm() for more information.
TYPE:
|
task |
If "dedupe", the resulting pairs will have the additional restriction that
TYPE:
|
Key-Based Blockers
Generate pairs where records share a single key, eg "first_name"
mismo.block.KeyBlocker
Blocks records together wherever they share a key, eg "emails match."
This is one of the most basic blocking rules, used very often in record linkage.
This is what is used in splink
.
Examples:
>>> import ibis
>>> from ibis import _
>>> import mismo
>>> ibis.options.interactive = True
>>> t = mismo.playdata.load_patents()["record_id", "name", "latitude"]
>>> t.head()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ record_id ┃ name ┃ latitude ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ int64 │ string │ float64 │
├───────────┼──────────────────────────────┼──────────┤
│ 2909 │ * AGILENT TECHNOLOGIES, INC. │ 0.00 │
│ 3574 │ * AKZO NOBEL N.V. │ 0.00 │
│ 3575 │ * AKZO NOBEL NV │ 0.00 │
│ 3779 │ * ALCATEL N.V. │ 52.35 │
│ 3780 │ * ALCATEL N.V. │ 52.35 │
└───────────┴──────────────────────────────┴──────────┘
Block the table with itself wherever the names match:
>>> blocker = mismo.block.KeyBlocker("name")
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l ┃ name_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ float64 │ float64 │ string │ string │
├─────────────┼─────────────┼────────────┼────────────┼────────────────┼────────────────┤
│ 3779 │ 3780 │ 52.35 │ 52.350000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 3779 │ 3782 │ 52.35 │ 0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 3780 │ 3782 │ 52.35 │ 0.000000 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 25388 │ 7651559 │ 0.00 │ 50.966667 │ DSM N.V. │ DSM N.V. │
│ 25388 │ 7651560 │ 0.00 │ 52.500000 │ DSM N.V. │ DSM N.V. │
└─────────────┴─────────────┴────────────┴────────────┴────────────────┴────────────────┘
Arbitrary blocking keys are supported. For example, block the table wherever
- the first 5 characters of the name in uppercase, are the same AND
- the latitudes, rounded to 1 decimal place, are the same
>>> blocker = mismo.block.KeyBlocker(_["name"][:5].upper(), _.latitude.round(1))
>>> blocker(t, t).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l ┃ name_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ float64 │ float64 │ string │ string │
├─────────────┼─────────────┼────────────┼────────────┼─────────────────────┼─────────────────────┤
│ 3574 │ 3575 │ 0.00 │ 0.00 │ * AKZO NOBEL N.V. │ * AKZO NOBEL NV │
│ 3779 │ 3780 │ 52.35 │ 52.35 │ * ALCATEL N.V. │ * ALCATEL N.V. │
│ 15041 │ 15042 │ 0.00 │ 0.00 │ * CANON EUROPA N.V │ * CANON EUROPA N.V. │
│ 15041 │ 15043 │ 0.00 │ 0.00 │ * CANON EUROPA N.V │ * CANON EUROPA NV │
│ 15042 │ 15043 │ 0.00 │ 0.00 │ * CANON EUROPA N.V. │ * CANON EUROPA NV │
└─────────────┴─────────────┴────────────┴────────────┴─────────────────────┴─────────────────────┘
We can even block on arrays! For example, first let's split each name into significant tokens:
>>> tokens = _.name.upper().split(" ").filter(lambda x: x.length() > 4)
>>> t.select(tokens)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayFilter(StringSplit(Uppercase(name), ' '), Greater(StringLength(x), 4)) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ array<string> │
├─────────────────────────────────────────────────────────────────────────────┤
│ ['AGILENT', 'TECHNOLOGIES,'] │
│ ['NOBEL'] │
│ ['NOBEL'] │
│ ['ALCATEL'] │
│ ['ALCATEL'] │
│ ['ALCATEL'] │
│ ['CANON', 'EUROPA'] │
│ ['CANON', 'EUROPA'] │
│ ['CANON', 'EUROPA'] │
│ [] │
│ … │
└─────────────────────────────────────────────────────────────────────────────┘
Now, block the tables together wherever two records share a token.
Note that this blocked * SCHLUMBERGER LIMITED
with * SCHLUMBERGER TECHNOLOGY BV
.
because they both share the SCHLUMBERGER
token.
>>> blocker = mismo.block.KeyBlocker(tokens.unnest())
>>> blocker(t, t).filter(_.name_l != _.name_r).order_by("record_id_l", "record_id_r").head()
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ name_l ┃ name_r ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ int64 │ float64 │ float64 │ string │ string │
├─────────────┼─────────────┼────────────┼────────────┼────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ 2909 │ 13390969 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Inc. Netherlands B.V │
│ 2909 │ 13390970 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Inc. Netherlands B.V. │
│ 2909 │ 13391015 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Netherland B.V. │
│ 2909 │ 13391055 │ 0.0 │ 52.50 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Netherlands, B.V. │
│ 2909 │ 13391056 │ 0.0 │ 52.35 │ * AGILENT TECHNOLOGIES, INC. │ Hitachi Global Storage Technologies, Netherlands, B.V. │
└─────────────┴─────────────┴────────────┴────────────┴────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
mismo.block.KeyBlocker.name: str
property
The name of the KeyBlocker.
mismo.block.KeyBlocker.__init__(*keys: str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred] | tuple[str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred], str | Deferred | Callable[[ibis.Table], ir.Column | str | Deferred]] | Callable[[ibis.Table, ibis.Table], tuple[ir.Column, ir.Column]], name: str | None = None) -> None
Create a KeyBlocker.
PARAMETER | DESCRIPTION |
---|---|
keys |
The keys to block on. The tables will be blocked together wherever they share ALL the keys. Each key can be any of the following:
TYPE:
|
mismo.block.KeyBlocker.key_counts(t: ir.Table) -> CountsTable
cached
Count the join keys in a table.
This is very similar to t.group_by(key).count()
, except that counts NULLs,
whereas this method does not count NULLs since they are not
counted as a match during joins.
This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.
On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.
PARAMETER | DESCRIPTION |
---|---|
t |
The table of records.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
CountsTable
|
Will have column(s) for each |
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
... ("a", 1),
... ("b", 1),
... ("b", 1),
... ("c", 3),
... ("b", 2),
... ("c", 3),
... (None, 4),
... ("c", 3),
... ]
>>> letters, nums = zip(*records)
>>> t = ibis.memtable({"letter": letters, "num": nums})
>>> blocker = mismo.block.KeyBlocker("letter", "num")
Note how the (None, 4) record is not counted, since NULLs are not counted as a match during a join.
>>> counts = blocker.key_counts(t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num ┃ n ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a │ 1 │ 1 │
│ b │ 1 │ 2 │
│ b │ 2 │ 1 │
│ c │ 3 │ 3 │
└────────┴───────┴───────┘
The returned CountsTable is a subclass of an Ibis Table
with a special n_total
property for convenience:
>>> counts.n_total
7
>>> isinstance(counts, ibis.Table)
True
mismo.block.KeyBlocker.pair_counts(left: ir.Table, right: ir.Table, *, task: Literal['dedupe', 'link'] | None = None) -> CountsTable
cached
Count the number of pairs that would be generated by each key.
If you were to use this blocker to join left
with right
,
how many pairs would be generated for each key?
This is useful for analyzing the skew of join keys. For example, if you are joining on (surname, city), there might be only 4 values for (hoessle, tinytown), which would lead to a block of 4 * 4 = 16 record pairs.
On the other hand, there could be 10_000 values for (smith, new york city). This would lead to 10_000 * 10_000 = 100_000_000 record pairs, which is likely too many for you to be able to compare.
PARAMETER | DESCRIPTION |
---|---|
left |
The left table.
TYPE:
|
right |
The right table.
TYPE:
|
task |
The task to count pairs for.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
CountsTable
|
Will have column(s) for each |
Examples:
>>> import ibis
>>> ibis.options.interactive = True
>>> import mismo
>>> records = [
... ("a", 1),
... ("b", 1),
... ("b", 1),
... ("c", 3),
... ("b", 2),
... ("c", 3),
... (None, 4),
... ("c", 3),
... ]
>>> letters, nums = zip(*records)
>>> t = ibis.memtable({"letter": letters, "num": nums})
>>> blocker = mismo.block.KeyBlocker("letter", "num")
If we joined t with itself using this blocker in a link task, we would end up with
- 9 pairs in the (c, 3) block due to pairs (3,3), (3, 6), (3, 8), (6, 3), (6, 6), (6, 8), (8, 3), (8, 6), and (8, 8)
- 4 pairs in the (b, 1) block due to pairs (1, 1), (1, 2), (2, 1), and (2, 2)
- 1 pairs in the (a, 1) block due to pair (0, 0)
- 1 pairs in the (b, 2) block due to pair (4, 4)
>>> counts = blocker.pair_counts(t, t, task="link").order_by("letter", "num")
>>> counts
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num ┃ n ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a │ 1 │ 1 │
│ b │ 1 │ 4 │
│ b │ 2 │ 1 │
│ c │ 3 │ 9 │
└────────┴───────┴───────┘
If we joined t with itself using this blocker in a dedupe task, we would end up with
- 3 pairs in the (c, 3) block due to pairs (3, 6), (3, 8), and (6, 8)
- 1 pairs in the (b, 1) block due to pairs (1, 2)
- 0 pairs in the (a, 1) block due to record 0 not getting blocked with itself
- 0 pairs in the (b, 2) block due to record 4 not getting blocked with itself
>>> counts = blocker.pair_counts(t, t)
>>> counts.order_by("letter", "num")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ letter ┃ num ┃ n ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ string │ int64 │ int64 │
├────────┼───────┼───────┤
│ a │ 1 │ 0 │
│ b │ 1 │ 1 │
│ b │ 2 │ 0 │
│ c │ 3 │ 3 │
└────────┴───────┴───────┘
The returned CountsTable is a subclass of an Ibis Table
with a special n_total
property for convenience:
>>> counts.n_total
4
>>> isinstance(counts, ibis.Table)
True
mismo.block.CountsTable
Bases: TableWrapper
A table with at least an Integer column named n
.
There will also be variable number of other columns that act as identifiers.
You won't create this directly, it will be returned to you from eg KeyBlocker.key_counts() or KeyBlocker.pair_counts().
mismo.block.CountsTable.n: ir.IntegerColumn
instance-attribute
The column containing the count.
mismo.block.CountsTable.n_total: int
cached
property
n.sum(), just here for convenience.
Ensemble Blockers
Blockers that use other Blockers
mismo.block.UnionBlocker
A Blocker that takes the union of the results of other Blockers.
mismo.block.UnionBlocker.__init__(*blockers: PBlocker, labels: bool = False)
Create a UnionBlocker from the given Blockers.
PARAMETER | DESCRIPTION |
---|---|
blockers |
The Blockers to use.
TYPE:
|
labels |
If True, a column of type
TYPE:
|
mismo.block.upset_chart(blocked: ir.Table) -> alt.Chart
Generate an Altair-based UpSet plot for a blocked table.
An UpSet plot is useful to visualize the overlap between blocking rules. For example, how many pairs are blocked by the "given name" rule, how many pairs are blocked by the "surname name" rule, and how many pairs are blocked by both rules together.
For example, if there is one group that generates a lot of pairs, that could be an opportunity to make those rules more restrictive, so that they generate fewer pairs.
PARAMETER | DESCRIPTION |
---|---|
blocked |
The blocked table to plot.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Chart
|
An Altair chart. |
Analyze: Join Algorithm
Analyze the actual algorithm that the SQL engine will use when
pairing up records during the join.
In particular, check for slow O(n*m)
nested loop joins.
See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of how SQL engines (or at least duckdb) chooses a join algorithm.
mismo.block.get_join_algorithm(left: ir.Table, right: ir.Table, condition) -> str
Return one of the JOIN_ALGORITHMS for the outermost join in the expression.
If there are multiple joins in the query, this will return the top (outermost) one. This only works with expressions bound to a DuckDB backend. Other kinds of expressions will raise NotImplementedError.
mismo.block.check_join_algorithm(left: ir.Table, right: ir.Table, condition, *, on_slow: Literal['error', 'warn', 'ignore'] = 'error') -> None
Error or warn if the join in the expression is likely to be slow.
Issues a SlowJoinWarning or raises a SlowJoinError.
This is only implemented for the duckdb backend. All other backends will issue a warning and skip the check.
By "slow", we mean that the join algorithm is one of:
- "NESTED_LOOP_JOIN" O(n*m)
- "BLOCKWISE_NL_JOIN" O(n*m)
- "CROSS_PRODUCT" O(n*m)
and not one of the fast join algorithms:
- "EMPTY_RESULT" O(1)
- "POSITIONAL_JOIN" O(n)
- "HASH_JOIN" O(n)
- "PIECEWISE_MERGE_JOIN" O(m*log(n))
- "IE_JOIN" O(n*log(n))
- "ASOF_JOIN" O(n*log(n))
This is done by using the EXPLAIN command to generate the query plan, and checking the join algorithm.
See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of these join algorithms.
mismo.block.JOIN_ALGORITHMS = frozenset({'EMPTY_RESULT', 'LEFT_DELIM_JOIN', 'RIGHT_DELIM_JOIN', 'BLOCKWISE_NL_JOIN', 'NESTED_LOOP_JOIN', 'HASH_JOIN', 'PIECEWISE_MERGE_JOIN', 'IE_JOIN', 'ASOF_JOIN', 'CROSS_PRODUCT', 'POSITIONAL_JOIN'})
module-attribute
Based on all the JOIN operators in https://github.com/duckdb/duckdb/blob/b0b1562e293718ee9279c9621cefe4cb5dc01ef9/src/common/enums/physical_operator_type.cpp#L56 (very good) explanation of these at https://duckdb.org/2022/05/27/iejoin.html