Fellegi-Sunter Model

See the Fellegi-Sunter Concept guide for background info.

mismo.fs.Weights

Weights for the Fellegi-Sunter model.

An unordered, dict-like collection of ComparerWeights, one for each LevelComparer of the same name.

mismo.fs.Weights.__getitem__(name: str) -> ComparerWeights

Get a ComparerWeights by name.

mismo.fs.Weights.__init__(comparer_weights: Iterable[ComparerWeights])

Create a new Weights object.

mismo.fs.Weights.__iter__() -> Iterator[ComparerWeights]

Iterate over the contained ComparerWeights.

mismo.fs.Weights.__len__() -> int

The number of ComparerWeights.

mismo.fs.Weights.compare_and_score(t: ir.Table, level_comparers: Iterable[LevelComparer]) -> ir.Table

Compare and score record pairs.

Use the given level_comparers to label the record pairs, and then score the results using self.score_compared.

mismo.fs.Weights.from_json(json: dict | str | Path) -> Self classmethod

Create a Weights object from a JSON-serializable representation.

PARAMETER DESCRIPTION
json

If a dict, assumed to be the JSON-serializable representation. Load it directly. If a str or Path, assumed to be a path to a JSON file. Load it from that file.

TYPE: dict | str | Path

RETURNS DESCRIPTION
Weights

The Weights object created from the JSON-serializable representation.

mismo.fs.Weights.plot() -> alt.Chart

Plot the weights for all of the LevelComparers.

mismo.fs.Weights.score_compared(compared: ir.Table) -> ir.Table

Score already-compared record pairs.

This assumes that there is already one column for each LevelComparer, containing the label for each record pair. For example, if we have a LevelComparer called "address", then we should have a column called "address" that contains labels like "exact", "one-letter-off", "same-city", etc.

For each LevelComparer, we add a column, {comparer.name}_odds. This is a number that describes how this comparer affects the likelihood of a match. An odds of 10 means that this comparer increased the likelihood of a match by 10x compared to if we hadn't looked at this comparer. For instance, the column might be called "name_odds" and have values like 10, 0.1, 1.

In addition to these per-LevelComparer columns, we also add a column called "odds", which is the overall odds for each record pair. We calculate this by starting with odds of 1 and then multiplying by each LevelComparer's odds.
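The multiplication described above can be sketched in plain Python. This only illustrates the arithmetic for a single record pair and is not mismo's implementation; the `overall_odds` helper and the odds values are hypothetical.

```python
def overall_odds(per_comparer_odds: dict[str, float]) -> float:
    """Start from odds of 1 and multiply in each comparer's odds."""
    odds = 1.0
    for value in per_comparer_odds.values():
        odds *= value
    return odds


# Hypothetical per-comparer odds for one record pair.
pair = {"name_odds": 10.0, "address_odds": 0.5, "zipcode_odds": 4.0}
total = overall_odds(pair)
print(total)  # 20.0
```

A comparer with odds of 1 leaves the product unchanged, which matches the idea that it provides no evidence either way.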

mismo.fs.Weights.to_json(path: str | Path | None = None) -> dict

Return a JSON-serializable representation of the weights.

If path is given, write the dict to the file at that path in addition to returning it.

mismo.fs.ComparerWeights

The weights for a single LevelComparer.

An ordered, dict-like collection of LevelWeights, one for each level.

mismo.fs.ComparerWeights.name: str property

The name of the LevelComparer these weights are for, e.g. "name" or "address".

mismo.fs.ComparerWeights.__contains__(name_or_index: str | int) -> bool

Check if a LevelWeights is present by name or index.

mismo.fs.ComparerWeights.__getitem__(name_or_index: str | int | slice) -> LevelWeights | tuple[LevelWeights, ...]

Get a LevelWeights by name or index.

mismo.fs.ComparerWeights.__init__(name: str, level_weights: Iterable[LevelWeights])

Create a new ComparerWeights object.

mismo.fs.ComparerWeights.__iter__() -> Iterator[LevelWeights]

Iterate over the LevelWeights, including the implicit ELSE level.

mismo.fs.ComparerWeights.__len__() -> int

The number of LevelWeights, including the implicit ELSE level.

mismo.fs.ComparerWeights.log_odds(labels: str | int | ir.StringValue | ir.IntegerValue) -> float | ir.FloatingValue

Calculate the log odds for each record pair.

mismo.fs.ComparerWeights.match_probability(labels: str | int | ir.StringValue | ir.IntegerValue) -> float | ir.FloatingValue

Calculate the match probability for each record pair.
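As a rough guide to how odds and match probability relate, the sketch below uses the standard conversion probability = odds / (1 + odds). Whether mismo applies exactly this formula (for instance, without any additional prior) is an assumption here, and `odds_to_probability` is a hypothetical helper, not part of the library.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds to a probability using odds / (1 + odds)."""
    return odds / (1.0 + odds)


print(odds_to_probability(1.0))  # 0.5: no evidence either way
print(odds_to_probability(3.0))  # 0.75
```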

mismo.fs.ComparerWeights.odds(labels: str | int | ir.StringValue | ir.IntegerValue) -> float | ir.FloatingValue

Calculate the odds for each record pair.

If labels is a string or integer, then we calculate the odds for that level. For example, if labels is "close", then we calculate the odds for the "close" level. If labels is 0, then we calculate the odds for the first level. If labels is -1, then we calculate the odds for the last level (the ELSE level).

If labels is a StringValue or IntegerValue, then we do the same thing for each value, except that we return an Ibis FloatingValue instead of a Python float.
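The name-or-index lookup for scalar labels can be pictured with a small stand-in. This is a plain-Python sketch, not the library's code; the `levels` list, its odds values, and the `lookup_odds` helper are all hypothetical.

```python
from typing import Union

# Hypothetical ordered levels for an "address" comparer.
# The last entry plays the role of the implicit ELSE level.
levels = [("exact", 50.0), ("close", 5.0), ("else", 0.2)]


def lookup_odds(label: Union[str, int]) -> float:
    """Return the odds for a level given by name or by position."""
    if isinstance(label, int):
        return levels[label][1]  # negative indices count from the end
    return dict(levels)[label]


print(lookup_odds("close"))  # 5.0
print(lookup_odds(0))        # 50.0, the first level
print(lookup_odds(-1))       # 0.2, the ELSE level
```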

mismo.fs.ComparerWeights.plot() -> alt.Chart staticmethod

Plot the weights for this Comparer.

mismo.fs.LevelWeights

Weights for a single MatchLevel.

This describes, for example, "If zipcodes match perfectly, then this increases the probability of a match by 10x as compared to if we hadn't looked at zipcode".

mismo.fs.LevelWeights.log_odds: float property

The log base 10 of the odds.

mismo.fs.LevelWeights.m: float property

Among true-matches, what proportion of them have this level?

1 means this level is a good indication of a match, 0 means it's a good indication of a non-match.

mismo.fs.LevelWeights.name: str property

The name of the level, e.g. "Exact Match".

mismo.fs.LevelWeights.odds: float property

How much more likely is a match than a non-match at this level?

This is derived from m and u. This is the same thing as "Bayes Factor" in splink.

  • values below 1 are evidence against a match
  • values above 1 are evidence for a match
  • 1 means this level does not provide any evidence for or against a match
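A minimal sketch of how odds and log odds follow from m and u, assuming the usual Bayes factor definition m / u (the text above says only that odds is "derived from m and u"). The `bayes_factor` helper and the numbers are hypothetical.

```python
import math


def bayes_factor(m: float, u: float) -> float:
    """Odds for a level, assuming the usual definition m / u."""
    return m / u


# Hypothetical level seen in 90% of true matches but only 9% of non-matches.
strong = bayes_factor(m=0.9, u=0.09)
print(strong)              # about 10: evidence for a match
print(math.log10(strong))  # about 1: the log base 10 of the odds

# Hypothetical level seen in 10% of true matches and 50% of non-matches.
weak = bayes_factor(m=0.1, u=0.5)
print(weak)                # about 0.2: evidence against a match
```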

mismo.fs.LevelWeights.u: float property

Among non-matches, what proportion of them have this level?

1 means this level is a good indication of a non-match, 0 means it's a good indication of a match.

mismo.fs.LevelWeights.__init__(name: str, *, m: float, u: float) -> None

Create a new LevelWeights object.

mismo.fs.train_using_labels(comparers: Iterable[LevelComparer], left: ir.Table, right: ir.Table, *, max_pairs: int = 1000000000) -> Weights

Estimate all Weights for a set of LevelComparers using labeled records.

The m parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly matching pairs. This function estimates the m parameters using the label_true columns in the input datasets.

The u parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly non-matching records. This function estimates the u parameters using random sampling.
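The two estimates described above amount to computing level proportions within two groups of pairs. The sketch below shows that counting step in plain Python. It is not how mismo is implemented internally; the `level_proportions` helper and the label lists are made up for illustration.

```python
from collections import Counter


def level_proportions(labels: list[str]) -> dict[str, float]:
    """Proportion of pairs that fall into each level."""
    counts = Counter(labels)
    total = len(labels)
    return {level: n / total for level, n in counts.items()}


# Hypothetical levels assigned by one comparer.
# Pairs known to be true matches (from the labels):
true_match_levels = ["exact", "exact", "exact", "close"]
# Randomly sampled pairs, assumed to be almost all non-matches:
sampled_levels = ["else"] * 6 + ["close", "exact"]

m = level_proportions(true_match_levels)
u = level_proportions(sampled_levels)
print(m["exact"], u["exact"])   # 0.75 0.125
print(m["exact"] / u["exact"])  # 6.0
```

Levels that never occur in one of the groups are simply absent from that dict in this sketch, so a real implementation would need to handle them explicitly.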

PARAMETER DESCRIPTION
comparers

The comparers to train.

TYPE: Iterable[LevelComparer]

left

The left dataset.

TYPE: Table

right

The right dataset.

TYPE: Table

max_pairs

The maximum number of pairs to sample. This is used for both the m and u estimates.

TYPE: int DEFAULT: 1000000000

RETURNS DESCRIPTION
Weights

The estimated weights for each comparer.

mismo.fs.train_using_em(comparers: Iterable[LevelComparer], left: ir.Table, right: ir.Table, *, max_pairs: int | None = None) -> Weights

Train weights on unlabeled data using an expectation maximization algorithm.

PARAMETER DESCRIPTION
comparers

The comparers to train.

TYPE: Iterable[LevelComparer]

left

The left dataset.

TYPE: Table

right

The right dataset.

TYPE: Table

max_pairs

The maximum number of pairs to sample. If None, all pairs are used.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
Weights

The estimated weights for each comparer.
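For readers unfamiliar with expectation maximization in this setting, here is a toy, plain-Python sketch of the textbook Fellegi-Sunter EM loop. It assumes that comparers are conditionally independent given match status. The initialization, fixed iteration count, and the `em_fellegi_sunter` name are choices made for this illustration, and mismo's actual algorithm may differ in all of them.

```python
def em_fellegi_sunter(pairs, n_levels, n_iter=50):
    """Estimate (match proportion, m, u) from unlabeled level patterns.

    pairs: list of tuples, one level index per comparer.
    n_levels: number of levels for each comparer.
    """
    lam = 0.1  # initial guess for the proportion of matching pairs
    # Initial guess: low level indices are more common among matches.
    m = [[(n - l) / (n * (n + 1) / 2) for l in range(n)] for n in n_levels]
    u = [row[::-1] for row in m]
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        posteriors = []
        for pair in pairs:
            p_match, p_non = lam, 1.0 - lam
            for c, level in enumerate(pair):
                p_match *= m[c][level]
                p_non *= u[c][level]
            posteriors.append(p_match / (p_match + p_non))
        # M-step: re-estimate the parameters from the soft assignments.
        total_match = sum(posteriors)
        total_non = len(pairs) - total_match
        lam = total_match / len(pairs)
        for c, n in enumerate(n_levels):
            for level in range(n):
                w = sum(p for p, pair in zip(posteriors, pairs) if pair[c] == level)
                count = sum(1 for pair in pairs if pair[c] == level)
                m[c][level] = w / total_match
                u[c][level] = (count - w) / total_non
    return lam, m, u


# Hypothetical data: three comparers, each with level 0 = agree, 1 = disagree.
pairs = (
    [(0, 0, 0)] * 18
    + [(0, 0, 1), (0, 1, 0), (1, 0, 0)] * 3
    + [(0, 1, 1), (1, 0, 1), (1, 1, 0)] * 6
    + [(1, 1, 1)] * 55
)
lam, m, u = em_fellegi_sunter(pairs, n_levels=[2, 2, 2])
print(lam, m, u)
```

The sketch omits safeguards a real implementation would want, such as smoothing to avoid zero probabilities and a convergence check.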

mismo.fs.plot_weights(weights: ComparerWeights | Iterable[ComparerWeights]) -> alt.Chart

Plot the weights for Comparer(s).

Use this to:

  • See which levels are common and which are rare. If all pairs are getting matched by only one level, you probably want to adjust the conditions so that pairs are more evenly distributed. For example, if you have an "exact match" level that is hardly ever used, that could be an indication that your condition is too strict and you should relax it.
  • See the odds for each level. If the odds for an "exact match" level are lower than you expect, perhaps near 1, that could be an indication that your condition is too loose and many non-matches are sneaking into that level. You should inspect those pairs and figure out how to tighten the condition so that only matches are in that level.

PARAMETER DESCRIPTION
weights

The weights to plot.

TYPE: ComparerWeights | Iterable[ComparerWeights]

RETURNS DESCRIPTION
Chart

The plot.