Fellegi-Sunter Model

See the Fellegi-Sunter Concept guide for background info.

mismo.fs.Weights

Weights for the Fellegi-Sunter model.

An unordered, dict-like collection of ComparerWeights, one for each LevelComparer of the same name.

mismo.fs.Weights.__getitem__

__getitem__(name: str) -> ComparerWeights

Get a ComparerWeights by name.

mismo.fs.Weights.__init__

__init__(comparer_weights: Iterable[ComparerWeights])

Create a new Weights object.

mismo.fs.Weights.__iter__

__iter__() -> Iterator[ComparerWeights]

Iterate over the contained ComparerWeights.

mismo.fs.Weights.__len__

__len__() -> int

The number of ComparerWeights.

mismo.fs.Weights.compare_and_score

compare_and_score(
    t: Table, level_comparers: Iterable[LevelComparer]
) -> Table

Compare and score record pairs.

Use the given level_comparers to label the record pairs, and then score the results using self.score_compared.

mismo.fs.Weights.from_json `classmethod`

from_json(json: dict | str | Path) -> Self

Create a Weights object from a JSON-serializable representation.

PARAMETER	DESCRIPTION
`json`	If a dict, assumed to be the JSON-serializable representation. Load it directly. If a str or Path, assumed to be a path to a JSON file. Load it from that file. TYPE: `dict \| str \| Path`

RETURNS	DESCRIPTION
`Weights`	The Weights object created from the JSON-serializable representation.

mismo.fs.Weights.plot

plot() -> Chart

Plot the weights for all of the LevelComparers.

mismo.fs.Weights.score_compared

score_compared(compared: Table) -> Table

Score already-compared record pairs.

This assumes that there is already one column for each LevelComparer, containing the labels for each record pair. For example, if we have a LevelComparer called "address", then we should have a column called "address" that contains labels like "exact", "one-letter-off", "same-city", etc.

For each LevelComparer, we add a column, {comparer.name}_odds. This is a number that describes how this comparer affects the likelihood of a match. For example, an odds of 10 means that this comparer increased the likelihood of a match by 10x compared to not having looked at this comparer at all. The column for a "name" comparer would be called "name_odds" and have values like 10, 0.1, 1.

In addition to these per-LevelComparer columns, we also add a column called "odds" which is the overall odds for each record pair. We calculate this by starting with the odds of 1 and then multiplying by each LevelComparer's odds to get the overall odds.
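The multiplication described above can be sketched in plain Python. This is only an illustration of the arithmetic, not mismo's implementation (which operates on table columns); the column names and odds values are made up.

```python
# Hypothetical per-comparer odds for a single record pair (made-up values).
per_comparer_odds = {"name_odds": 10.0, "address_odds": 0.1, "zipcode_odds": 1.0}


def overall_odds(odds_by_column: dict[str, float]) -> float:
    """Start with odds of 1, then multiply in each comparer's odds."""
    total = 1.0
    for value in odds_by_column.values():
        total *= value
    return total


combined = overall_odds(per_comparer_odds)
```

Here the strong evidence from the name is cancelled by the evidence against a match from the address, and the zipcode contributes nothing either way.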

mismo.fs.Weights.to_json

to_json(path: str | Path | None = None) -> dict

Return a JSON-serializable representation of the weights.

If path is given, write the dict to the file at that path in addition to returning it.

mismo.fs.ComparerWeights

The weights for a single LevelComparer.

An ordered, dict-like collection of LevelWeights, one for each level.

mismo.fs.ComparerWeights.name `property`

name: str

The name of the LevelComparer these weights are for, e.g. "name" or "address".

mismo.fs.ComparerWeights.__contains__

__contains__(name_or_index: str | int) -> bool

Check if a LevelWeights is present by name or index.

mismo.fs.ComparerWeights.__getitem__

__getitem__(name_or_index: str | int) -> LevelWeights

__getitem__(
    name_or_index: slice,
) -> tuple[LevelWeights, ...]

__getitem__(
    name_or_index: str | int | slice,
) -> LevelWeights | tuple[LevelWeights, ...]

Get a LevelWeights by name or index.

mismo.fs.ComparerWeights.__init__

__init__(name: str, level_weights: Iterable[LevelWeights])

Create a new ComparerWeights object.

mismo.fs.ComparerWeights.__iter__

__iter__() -> Iterator[LevelWeights]

Iterate over the LevelWeights, including the implicit ELSE level.

mismo.fs.ComparerWeights.__len__

__len__() -> int

The number of LevelWeights, including the implicit ELSE level.

mismo.fs.ComparerWeights.log_odds

log_odds(labels: str | int) -> float

log_odds(
    labels: StringValue | IntegerValue,
) -> FloatingValue

log_odds(
    labels: str | int | StringValue | IntegerValue,
) -> float | FloatingValue

Calculate the log odds for each record pair.

mismo.fs.ComparerWeights.match_probability

match_probability(labels: str | int) -> float

match_probability(
    labels: StringValue | IntegerValue,
) -> FloatingValue

match_probability(
    labels: str | int | StringValue | IntegerValue,
) -> float | FloatingValue

Calculate the match probability for each record pair.
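The source does not spell out how odds become a probability. A minimal sketch, assuming the standard conversion p = odds / (1 + odds), is shown here; the helper name `odds_to_probability` is hypothetical and not part of mismo.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds, defined as p / (1 - p), back to the probability p."""
    return odds / (1.0 + odds)


# Odds of 1 carry no evidence either way, so the probability is one half.
even = odds_to_probability(1.0)
# Odds of 9 mean nine matches for every non-match.
likely = odds_to_probability(9.0)
```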

mismo.fs.ComparerWeights.odds

odds(labels: str | int) -> float

odds(labels: StringValue | IntegerValue) -> FloatingValue

odds(
    labels: str | int | StringValue | IntegerValue,
) -> float | FloatingValue

Calculate the odds for each record pair.

If labels is a string or integer, then we calculate the odds for that level. For example, if labels is "close", then we calculate the odds for the "close" level. If labels is 0, then we calculate the odds for the first level. If labels is -1, then we calculate the odds for the last level (the ELSE level).

If labels is a StringValue or IntegerValue, then we do the same thing, except that we return an ibis FloatingValue instead of a python float.
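The name-or-index lookup described above can be mimicked with an ordered list in plain Python. This is a sketch of the lookup rule only; the level names and odds values are invented, and `odds_for` is a hypothetical helper rather than a mismo function.

```python
# Hypothetical levels in order, with the implicit ELSE level last.
levels = [("exact", 100.0), ("close", 5.0), ("else", 0.2)]


def odds_for(label):
    """Look up the odds of a level by its name (str) or its position (int)."""
    if isinstance(label, int):
        # Negative indices count from the end, so -1 is the ELSE level.
        return levels[label][1]
    for name, odds in levels:
        if name == label:
            return odds
    raise KeyError(label)
```

With these values, `odds_for("close")` and `odds_for(1)` refer to the same level, and `odds_for(-1)` returns the odds of the ELSE level.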

mismo.fs.ComparerWeights.plot `staticmethod`

plot() -> Chart

Plot the weights for this Comparer.

mismo.fs.LevelWeights

Weights for a single MatchLevel.

This describes, for example, "If zipcodes match perfectly, then this increases the odds of a match by 10x as compared to if we hadn't looked at zipcode".

mismo.fs.LevelWeights.log_odds `property`

log_odds: float

The log base 10 of the odds.

mismo.fs.LevelWeights.m `property`

m: float

Among true-matches, what proportion of them have this level?

1 means this level is a good indication of a match, 0 means it's a good indication of a non-match.

mismo.fs.LevelWeights.name `property`

name: str

The name of the level, e.g. "Exact Match".

mismo.fs.LevelWeights.odds `property`

odds: float

How much more likely is a match than a non-match at this level?

This is derived from m and u. This is the same thing as "Bayes Factor" in splink.

- Values below 1 are evidence against a match.
- Values above 1 are evidence for a match.
- A value of 1 means this level provides no evidence for or against a match.
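The source says only that the odds are derived from m and u. A minimal sketch, assuming the usual Fellegi-Sunter Bayes factor m / u and the base-10 logarithm used by log_odds, looks like this; the function name and the m and u values are illustrative.

```python
import math


def bayes_factor(m: float, u: float) -> float:
    """Assumed definition: share among matches divided by share among non-matches."""
    return m / u


# An "exact" level seen in 90% of matches but only 1% of non-matches.
exact_odds = bayes_factor(m=0.9, u=0.01)
exact_log_odds = math.log10(exact_odds)

# An "else" level seen in 10% of matches and 99% of non-matches.
else_odds = bayes_factor(m=0.1, u=0.99)
```

The first level gives odds of roughly 90 (strong evidence for a match), while the second gives odds well below 1 (evidence against a match).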

mismo.fs.LevelWeights.u `property`

u: float

Among non-matches, what proportion of them have this level?

1 means this level is a good indication of a non-match, 0 means it's a good indication of a match.

mismo.fs.LevelWeights.__init__

__init__(name: str, *, m: float, u: float) -> None

Create a new LevelWeights object.

mismo.fs.train_using_labels

train_using_labels(
    comparers: Iterable[LevelComparer],
    left: Table,
    right: Table,
    *,
    max_pairs: int = 1000000000,
) -> Weights

Estimate all Weights for a set of LevelComparers using labeled records.

The m parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly matching pairs. This function estimates the m parameters using the label_true columns in the input datasets.

The u parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly non-matching records. This function estimates the u parameters using random sampling.
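The two definitions above amount to counting how often each level occurs within each group of pairs. The following plain-Python sketch shows that counting on made-up labels; it is not how mismo samples or labels pairs, and `level_proportions` is a hypothetical helper.

```python
from collections import Counter


def level_proportions(labels: list[str]) -> dict[str, float]:
    """Proportion of pairs that fall into each level."""
    counts = Counter(labels)
    total = len(labels)
    return {level: n / total for level, n in counts.items()}


# Made-up levels assigned to known matching pairs and to randomly sampled pairs.
match_levels = ["exact", "exact", "exact", "close"]
nonmatch_levels = ["else", "else", "else", "else", "close"]

m = level_proportions(match_levels)     # estimates of the m parameters
u = level_proportions(nonmatch_levels)  # estimates of the u parameters
```

In this toy data "exact" never occurs among the sampled non-matches, so it has no u estimate at all; a real estimator has to handle such empty levels somehow.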

PARAMETER	DESCRIPTION
`comparers`	The comparers to train. TYPE: `Iterable[LevelComparer]`
`left`	The left dataset. TYPE: `Table`
`right`	The right dataset. TYPE: `Table`
`max_pairs`	The maximum number of pairs to sample. This is used for both the m and u estimates. TYPE: `int` DEFAULT: `1000000000`

RETURNS	DESCRIPTION
`Weights`	The estimated weights for each comparer.

mismo.fs.train_using_em

train_using_em(
    comparers: Iterable[LevelComparer],
    left: Table,
    right: Table,
    *,
    max_pairs: int | None = None,
) -> Weights

Train weights on unlabeled data using an expectation maximization algorithm.

PARAMETER	DESCRIPTION
`comparers`	The comparers to train. TYPE: `Iterable[LevelComparer]`
`left`	The left dataset. TYPE: `Table`
`right`	The right dataset. TYPE: `Table`
`max_pairs`	The maximum number of pairs to sample. If None, all pairs are used. TYPE: `int \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Weights`	The estimated weights for each comparer.
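The source does not describe the algorithm in detail. As a rough illustration of expectation maximization for this kind of model, the sketch below alternates between estimating each pair's match probability (E-step) and re-estimating m, u, and the match prior (M-step). It assumes comparers are conditionally independent and works on plain lists of level indices; the function name, starting guesses, and smoothing term `eps` are all assumptions, not mismo's implementation.

```python
def train_em(pairs, n_levels, n_iter=25, prior=0.1, eps=1e-6):
    """Estimate m, u, and the match prior from unlabeled level indices.

    pairs[i][c] is the level index that comparer c assigned to pair i.
    n_levels[c] is the number of levels of comparer c.
    """
    # Asymmetric starting guess: matches favor level 0, non-matches the last level.
    m = [[(k - l) / (k * (k + 1) / 2) for l in range(k)] for k in n_levels]
    u = [[(l + 1) / (k * (k + 1) / 2) for l in range(k)] for k in n_levels]
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        w = []
        for pair in pairs:
            pm, pu = prior, 1.0 - prior
            for c, lvl in enumerate(pair):
                pm *= m[c][lvl]
                pu *= u[c][lvl]
            w.append(pm / (pm + pu))
        # M-step: re-estimate the prior and the per-level proportions.
        total_m = sum(w)
        total_u = len(pairs) - total_m
        prior = total_m / len(pairs)
        for c, k in enumerate(n_levels):
            for l in range(k):
                w_in_level = sum(wi for wi, p in zip(w, pairs) if p[c] == l)
                n_in_level = sum(1 for p in pairs if p[c] == l)
                # eps keeps every proportion strictly positive.
                m[c][l] = (w_in_level + eps) / (total_m + k * eps)
                u[c][l] = (n_in_level - w_in_level + eps) / (total_u + k * eps)
    return m, u, prior


# Made-up data: three comparers with two levels each (0 = agree, 1 = else).
pairs = (
    [(0, 0, 0)] * 10 + [(0, 0, 1)] * 2 + [(0, 1, 0)] * 2 + [(1, 0, 0)] * 1
    + [(1, 1, 1)] * 75 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 3 + [(0, 1, 1)] * 2
)
m, u, prior = train_em(pairs, n_levels=[2, 2, 2])
```

The result depends on the starting guess: if m and u started out identical, the E-step would leave every pair at the prior and the estimates would never separate.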

mismo.fs.plot_weights

plot_weights(
    weights: ComparerWeights | Iterable[ComparerWeights],
) -> Chart

Plot the weights for Comparer(s).

Use this to:
- See which levels are common and which are rare. If nearly all pairs fall into a single level, you probably want to adjust the conditions so that pairs are more evenly distributed. For example, if you have an "exact match" level that is hardly ever used, that could be an indication that your condition is too strict and you should relax it.
- See the odds for each level. If the odds for an "exact match" level are lower than you expect, perhaps near 1, that could be an indication that your condition is too loose and there are many non-matches sneaking into that level. You should inspect those pairs and figure out how to tighten the condition so that only matches are in that level.

PARAMETER	DESCRIPTION
`weights`	The weights to plot. TYPE: `ComparerWeights \| Iterable[ComparerWeights]`

RETURNS	DESCRIPTION
`Chart`	The plot.
