Joining
Join Utilities
mismo.HasJoinCondition
Bases: Protocol
Has __join_condition__(left: ibis.Table, right: ibis.Table)
, which returns something that ibis.join()
understands.
There are concrete implementations of this for various types of join conditions.
For example, BooleanJoinCondition
wraps a boolean
or an ibis.ir.BooleanValue
expression.
mismo.HasJoinCondition.__join_condition__
__join_condition__(
left: Table, right: Table
) -> BooleanValue | bool
Given a left and right table, return something that ibis.join()
understands.
mismo.join_condition
join_condition(
obj: IntoHasJoinCondition,
) -> HasJoinCondition
Create a HasJoinCondition from an object.
PARAMETER | DESCRIPTION |
---|---|
obj
|
The object to create a join condition from.
This can be anything that ibis understands as join condition,
such as a boolean, an ibis.ir.BooleanValue expression, a
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
An object that follows the [HasJoinCondition][mismo.HasJoinCondition] protocol.
|
|
mismo.IntoHasJoinCondition
module-attribute
IntoHasJoinCondition: TypeAlias = Union[
HasJoinCondition,
bool,
BooleanValue,
str,
Deferred,
Callable[
[Table, Table],
HasJoinCondition | bool | BooleanValue,
],
Iterable[
HasJoinCondition,
bool,
BooleanValue,
str,
Deferred,
Callable[
[Table, Table],
HasJoinCondition | bool | BooleanValue,
],
],
]
An object that can be converted into a HasJoinCondition with mismo.join_condition()
mismo.join
join(
left: Table,
right: Table,
predicates: IntoHasJoinCondition = (),
how: str = "inner",
*,
lname: str = "{name}",
rname: str = "{name}_right",
rename_all: bool = False,
)
Ibis.join, but with enhanced condition resolution.
This is a wrapper around ibis.join
that
- Allows for slightly more flexible join conditions.
- Adds an option for renaming all columns with
lname
andrname
, not just those that collide (the default behavior).
PARAMETER | DESCRIPTION |
---|---|
left
|
The left table to block
TYPE:
|
right
|
The right table to block
TYPE:
|
predicates
|
What Anything that Plus some extra sorts of inputs:
- A 2-tuple tuple of one of the above.
The first element is for the left table, the second is for the right table.
This is useful when the column names are different in the two tables,
or require some transformation/normalization.
For example
TYPE:
|
how
|
Behaves the the same as in
TYPE:
|
lname
|
Behaves the the same as in
TYPE:
|
rname
|
Behaves the the same as in
TYPE:
|
rename_all
|
Should we apply
TYPE:
|
mismo.left
module-attribute
left = var('left')
A deferred placeholder for the left table in a join.
Examples:
>>> condition = mismo.left.last_name.upper() == mismo.right.family_name.upper()
>>> import ibis
>>> my_left_table = ibis.memtable([("johnson",), ("smith",)], columns=["last_name"])
>>> my_right_table = ibis.memtable([("JOHNSON",), ("JONES",)], columns=["family_name"])
>>> mismo.join(my_left_table, my_right_table, condition)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ last_name ┃ family_name ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ string │ string │
├───────────┼─────────────┤
│ johnson │ JOHNSON │
└───────────┴─────────────┘
mismo.right
module-attribute
right = var('right')
A deferred placeholder for the right table in a join.
Examples:
>>> condition = mismo.left.last_name.upper() == mismo.right.family_name.upper()
>>> import ibis
>>> my_left_table = ibis.memtable([("johnson",), ("smith",)], columns=["last_name"])
>>> my_right_table = ibis.memtable([("JOHNSON",), ("JONES",)], columns=["family_name"])
>>> mismo.join(my_left_table, my_right_table, condition)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ last_name ┃ family_name ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ string │ string │
├───────────┼─────────────┤
│ johnson │ JOHNSON │
└───────────┴─────────────┘
Analyze: Join Algorithm
Analyze the actual algorithm that the SQL engine will use when
pairing up records during the join.
In particular, check for slow O(n*m)
nested loop joins.
See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of how SQL engines (or at least duckdb) chooses a join algorithm.
mismo.joins.get_join_algorithm
get_join_algorithm(
left: Table, right: Table, condition
) -> str
Return one of the JOIN_ALGORITHMS for the outermost join in the expression.
If there are multiple joins in the query, this will return the top (outermost) one. This only works with expressions bound to a DuckDB backend. Other kinds of expressions will raise NotImplementedError.
mismo.joins.check_join_algorithm
check_join_algorithm(
left: Table,
right: Table,
condition,
*,
on_slow: Literal["error", "warn", "ignore"] = "error",
) -> None
Error or warn if the join in the expression is likely to be slow.
Issues a SlowJoinWarning or raises a SlowJoinError.
This is only implemented for the duckdb backend. All other backends will issue a warning and skip the check.
By "slow", we mean that the join algorithm is one of:
- "NESTED_LOOP_JOIN" O(n*m)
- "BLOCKWISE_NL_JOIN" O(n*m)
- "CROSS_PRODUCT" O(n*m)
and not one of the fast join algorithms:
- "EMPTY_RESULT" O(1)
- "POSITIONAL_JOIN" O(n)
- "HASH_JOIN" O(n)
- "PIECEWISE_MERGE_JOIN" O(m*log(n))
- "IE_JOIN" O(n*log(n))
- "ASOF_JOIN" O(n*log(n))
This is done by using the EXPLAIN command to generate the query plan, and checking the join algorithm.
See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of these join algorithms.
mismo.joins.JoinAlgorithm
module-attribute
JoinAlgorithm = Literal[
"EMPTY_RESULT",
"LEFT_DELIM_JOIN",
"RIGHT_DELIM_JOIN",
"BLOCKWISE_NL_JOIN",
"NESTED_LOOP_JOIN",
"HASH_JOIN",
"PIECEWISE_MERGE_JOIN",
"IE_JOIN",
"ASOF_JOIN",
"CROSS_PRODUCT",
"POSITIONAL_JOIN",
]
mismo.joins.SlowJoinAlgorithm
module-attribute
SlowJoinAlgorithm = Literal[
"NESTED_LOOP_JOIN", "BLOCKWISE_NL_JOIN", "CROSS_PRODUCT"
]
mismo.joins.JOIN_ALGORITHMS
module-attribute
JOIN_ALGORITHMS = frozenset(__args__)
Based on all the JOIN operators in https://github.com/duckdb/duckdb/blob/b0b1562e293718ee9279c9621cefe4cb5dc01ef9/src/common/enums/physical_operator_type.cpp#L56 (very good) explanation of these at https://duckdb.org/2022/05/27/iejoin.html