Skip to content

Joining

Join Utilities

mismo.HasJoinCondition

Bases: Protocol

Has __join_condition__(left: ibis.Table, right: ibis.Table), which returns something that ibis.join() understands.

There are concrete implementations of this for various types of join conditions. For example, BooleanJoinCondition wraps a boolean or an ibis.ir.BooleanValue expression.

mismo.HasJoinCondition.__join_condition__

__join_condition__(
    left: Table, right: Table
) -> BooleanValue | bool

Given a left and right table, return something that ibis.join() understands.

mismo.join_condition

join_condition(
    obj: IntoHasJoinCondition,
) -> HasJoinCondition

Create a HasJoinCondition from an object.

PARAMETER DESCRIPTION
obj

The object to create a join condition from. This can be anything that ibis understands as join condition, such as a boolean, an ibis.ir.BooleanValue expression, a str, an ibis.Deferred, etc. It also supports other types, such as lambda left, right: <one of the above>,

TYPE: IntoHasJoinCondition

RETURNS DESCRIPTION
An object that follows the [HasJoinCondition][mismo.HasJoinCondition] protocol.

mismo.IntoHasJoinCondition module-attribute

IntoHasJoinCondition: TypeAlias = Union[
    HasJoinCondition,
    bool,
    BooleanValue,
    str,
    Deferred,
    Callable[
        [Table, Table],
        HasJoinCondition | bool | BooleanValue,
    ],
    Iterable[
        HasJoinCondition,
        bool,
        BooleanValue,
        str,
        Deferred,
        Callable[
            [Table, Table],
            HasJoinCondition | bool | BooleanValue,
        ],
    ],
]

An object that can be converted into a HasJoinCondition with mismo.join_condition()

mismo.join

join(
    left: Table,
    right: Table,
    predicates: IntoHasJoinCondition = (),
    how: str = "inner",
    *,
    lname: str = "{name}",
    rname: str = "{name}_right",
    rename_all: bool = False,
)

Ibis.join, but with enhanced condition resolution.

This is a wrapper around ibis.join that

  • Allows for slightly more flexible join conditions.
  • Adds an option for renaming all columns with lname and rname, not just those that collide (the default behavior).
PARAMETER DESCRIPTION
left

The left table to block

TYPE: Table

right

The right table to block

TYPE: Table

predicates

What ibis.join() accepts as a predicate, plus some extras.

Anything that ibis.join() accepts as a predicate: - An ibis.BooleanValue, such as left.last_name == right.surname.upper(). - A string, which is interpreted as the name of a column in both tables. eg "price" is equivalent to left.price == right.price. - A Deferred, which is used to reference a column in a table eg "_.price.fill_null(0)" is equivalent to left.price.fill_null(0) == right.price.fill_null(0)

Plus some extra sorts of inputs: - A 2-tuple tuple of one of the above. The first element is for the left table, the second is for the right table. This is useful when the column names are different in the two tables, or require some transformation/normalization. For example ("last_name", _.surname.upper()") is equivalent to left.last_name == right.surname.upper() - A Callable of signature (left: Table, right: Table, args) -> one of the above

TYPE: IntoHasJoinCondition DEFAULT: ()

how

Behaves the the same as in ibis.join()

TYPE: str DEFAULT: 'inner'

lname

Behaves the the same as in ibis.join()

TYPE: str DEFAULT: '{name}'

rname

Behaves the the same as in ibis.join()

TYPE: str DEFAULT: '{name}_right'

rename_all

Should we apply lname and rname to ALL columns in the output, or just on the ones that collide between the two tables (the default, the same as in ibis.join())

TYPE: bool DEFAULT: False

mismo.left module-attribute

left = var('left')

A deferred placeholder for the left table in a join.

Examples:

>>> condition = mismo.left.last_name.upper() == mismo.right.family_name.upper()
>>> import ibis
>>> my_left_table = ibis.memtable([("johnson",), ("smith",)], columns=["last_name"])
>>> my_right_table = ibis.memtable([("JOHNSON",), ("JONES",)], columns=["family_name"])
>>> mismo.join(my_left_table, my_right_table, condition)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ last_name ┃ family_name ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ string    │ string      │
├───────────┼─────────────┤
│ johnson   │ JOHNSON     │
└───────────┴─────────────┘

mismo.right module-attribute

right = var('right')

A deferred placeholder for the right table in a join.

Examples:

>>> condition = mismo.left.last_name.upper() == mismo.right.family_name.upper()
>>> import ibis
>>> my_left_table = ibis.memtable([("johnson",), ("smith",)], columns=["last_name"])
>>> my_right_table = ibis.memtable([("JOHNSON",), ("JONES",)], columns=["family_name"])
>>> mismo.join(my_left_table, my_right_table, condition)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ last_name ┃ family_name ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ string    │ string      │
├───────────┼─────────────┤
│ johnson   │ JOHNSON     │
└───────────┴─────────────┘

Analyze: Join Algorithm

Analyze the actual algorithm that the SQL engine will use when pairing up records during the join. In particular, check for slow O(n*m) nested loop joins.

See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of how SQL engines (or at least duckdb) chooses a join algorithm.

mismo.joins.get_join_algorithm

get_join_algorithm(
    left: Table, right: Table, condition
) -> str

Return one of the JOIN_ALGORITHMS for the outermost join in the expression.

If there are multiple joins in the query, this will return the top (outermost) one. This only works with expressions bound to a DuckDB backend. Other kinds of expressions will raise NotImplementedError.

mismo.joins.check_join_algorithm

check_join_algorithm(
    left: Table,
    right: Table,
    condition,
    *,
    on_slow: Literal["error", "warn", "ignore"] = "error",
) -> None

Error or warn if the join in the expression is likely to be slow.

Issues a SlowJoinWarning or raises a SlowJoinError.

This is only implemented for the duckdb backend. All other backends will issue a warning and skip the check.

By "slow", we mean that the join algorithm is one of:

  • "NESTED_LOOP_JOIN" O(n*m)
  • "BLOCKWISE_NL_JOIN" O(n*m)
  • "CROSS_PRODUCT" O(n*m)

and not one of the fast join algorithms:

  • "EMPTY_RESULT" O(1)
  • "POSITIONAL_JOIN" O(n)
  • "HASH_JOIN" O(n)
  • "PIECEWISE_MERGE_JOIN" O(m*log(n))
  • "IE_JOIN" O(n*log(n))
  • "ASOF_JOIN" O(n*log(n))

This is done by using the EXPLAIN command to generate the query plan, and checking the join algorithm.

See https://duckdb.org/2022/05/27/iejoin.html for a very good explanation of these join algorithms.

mismo.joins.JoinAlgorithm module-attribute

JoinAlgorithm = Literal[
    "EMPTY_RESULT",
    "LEFT_DELIM_JOIN",
    "RIGHT_DELIM_JOIN",
    "BLOCKWISE_NL_JOIN",
    "NESTED_LOOP_JOIN",
    "HASH_JOIN",
    "PIECEWISE_MERGE_JOIN",
    "IE_JOIN",
    "ASOF_JOIN",
    "CROSS_PRODUCT",
    "POSITIONAL_JOIN",
]

mismo.joins.SlowJoinAlgorithm module-attribute

SlowJoinAlgorithm = Literal[
    "NESTED_LOOP_JOIN", "BLOCKWISE_NL_JOIN", "CROSS_PRODUCT"
]

mismo.joins.JOIN_ALGORITHMS module-attribute

JOIN_ALGORITHMS = frozenset(__args__)

mismo.joins.SLOW_JOIN_ALGORITHMS module-attribute

SLOW_JOIN_ALGORITHMS = frozenset(__args__)

mismo.exceptions.SlowJoinError

Bases: SlowJoinMixin, ValueError, MismoError

Error for slow join algorithms.

mismo.exceptions.SlowJoinWarning

Bases: SlowJoinMixin, UserWarning, MismoWarning

Warning for slow join algorithms.