Fellegi-Sunter Model
See the Fellegi-Sunter Concept guide for background info.
mismo.fs.Weights
Weights for the Fellegi-Sunter model.
An unordered, dict-like collection of ComparerWeights, one for each LevelComparer of the same name.
mismo.fs.Weights.__getitem__(name: str) -> ComparerWeights
Get a ComparerWeights by name.
mismo.fs.Weights.__init__(comparer_weights: Iterable[ComparerWeights])
Create a new Weights object.
mismo.fs.Weights.__iter__() -> Iterator[ComparerWeights]
Iterate over the contained ComparerWeights.
mismo.fs.Weights.__len__() -> int
The number of ComparerWeights.
mismo.fs.Weights.compare_and_score(t: ir.Table, level_comparers: Iterable[LevelComparer]) -> ir.Table
Compare and score record pairs.
Use the given level_comparers to label the record pairs, and then score the results using self.score_compared.
mismo.fs.Weights.from_json(json: dict | str | Path) -> Self
classmethod
Create a Weights object from a JSON-serializable representation.
PARAMETER | DESCRIPTION |
---|---|
json | If a dict, assumed to be the JSON-serializable representation. Load it directly. If a str or Path, assumed to be a path to a JSON file. Load it from that file. |
RETURNS | DESCRIPTION |
---|---|
Weights | The Weights object created from the JSON-serializable representation. |
mismo.fs.Weights.plot() -> alt.Chart
Plot the weights for all of the LevelComparers.
mismo.fs.Weights.score_compared(compared: ir.Table) -> ir.Table
Score already-compared record pairs.
This assumes that there is already one column for each LevelComparer, containing the label for each record pair. For example, if we have a LevelComparer called "address", then we should have a column called "address" that contains labels like "exact", "one-letter-off", "same-city", etc.
For each LevelComparer, we add a column, {comparer.name}_odds.
This is a number that describes how this comparer affects the likelihood
of a match. For example, an odds of 10 means that this comparer
increased the likelihood of a match by 10x as compared to if we hadn't
looked at this comparer.
For example, the column might be called "name_odds" and have values like
10, 0.1, 1.
In addition to these per-LevelComparer columns, we also add a column called "odds" which is the overall odds for each record pair. We calculate this by starting with the odds of 1 and then multiplying by each LevelComparer's odds to get the overall odds.
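The combination rule described above can be sketched in plain Python. This is only an illustration of the arithmetic for a single record pair, not mismo's implementation; the `pair_odds` values and the `overall_odds` helper are hypothetical.

```python
from math import prod

# Hypothetical per-comparer odds for one record pair (illustrative values only).
pair_odds = {"name_odds": 10.0, "address_odds": 0.1, "zipcode_odds": 1.0}


def overall_odds(per_comparer_odds: dict[str, float]) -> float:
    """Start from odds of 1 and multiply in each comparer's odds."""
    return prod(per_comparer_odds.values(), start=1.0)


# Strong evidence for (10x) and against (0.1x) cancel out here,
# and the uninformative comparer (1x) changes nothing.
odds = overall_odds(pair_odds)
```

With no comparers at all, the result stays at the starting odds of 1.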
mismo.fs.Weights.to_json(path: str | Path | None = None) -> dict
Return a JSON-serializable representation of the weights.
If path is given, write the dict to the file at that path in addition to returning it.
mismo.fs.ComparerWeights
The weights for a single LevelComparer.
An ordered, dict-like collection of LevelWeights, one for each level.
mismo.fs.ComparerWeights.name: str
property
The name of the LevelComparer these weights are for, e.g. "name" or "address".
mismo.fs.ComparerWeights.__contains__(name_or_index: str | int) -> bool
Check if a LevelWeights is present by name or index.
mismo.fs.ComparerWeights.__getitem__(name_or_index: str | int | slice) -> LevelWeights | tuple[LevelWeights, ...]
Get a LevelWeights by name or index.
mismo.fs.ComparerWeights.__init__(name: str, level_weights: Iterable[LevelWeights])
Create a new ComparerWeights object.
mismo.fs.ComparerWeights.__iter__() -> Iterator[LevelWeights]
Iterate over the LevelWeights, including the implicit ELSE level.
mismo.fs.ComparerWeights.__len__() -> int
The number of LevelWeights, including the implicit ELSE level.
mismo.fs.ComparerWeights.log_odds(labels: str | int | ir.StringValue | ir.IntegerValue) -> float | ir.FloatingValue
Calculate the log odds for each record pair.
mismo.fs.ComparerWeights.match_probability(labels: str | int | ir.StringValue | ir.IntegerValue) -> float | ir.FloatingValue
Calculate the match probability for each record pair.
mismo.fs.ComparerWeights.odds(labels: str | int | ir.StringValue | ir.IntegerValue) -> float | ir.FloatingValue
Calculate the odds for each record pair.
If labels is a string or integer, then we calculate the odds for that level. For example, if labels is "close", then we calculate the odds for the "close" level. If labels is 0, then we calculate the odds for the first level. If labels is -1, then we calculate the odds for the last level (the ELSE level).
If labels is a StringValue or IntegerValue, then we do the same thing, except that we return an ibis FloatingValue instead of a python float.
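The lookup-by-name-or-index behavior for scalar labels can be mimicked with a small stand-in. This is a sketch, not mismo's code: the `levels` list, its odds values, the name "else" for the last level, and the `odds_for` helper are all assumptions made for illustration.

```python
from __future__ import annotations

# Hypothetical levels in order, with the ELSE level last (values are made up).
levels = [("exact", 100.0), ("close", 5.0), ("else", 0.2)]


def odds_for(label: str | int) -> float:
    """Look up the odds of a level by its name or by its position."""
    if isinstance(label, int):
        # Integer labels index into the ordered levels; -1 is the last level.
        return levels[label][1]
    # String labels are matched against the level names.
    return dict(levels)[label]


by_name = odds_for("close")
first = odds_for(0)
last = odds_for(-1)
```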
mismo.fs.ComparerWeights.plot() -> alt.Chart
staticmethod
Plot the weights for this Comparer.
mismo.fs.LevelWeights
Weights for a single MatchLevel.
This describes, for example, "If zipcodes match perfectly, then this increases the probability of a match by 10x as compared to if we hadn't looked at zipcode".
mismo.fs.LevelWeights.log_odds: float
property
The log base 10 of the odds.
mismo.fs.LevelWeights.m: float
property
Among true-matches, what proportion of them have this level?
1 means this level is a good indication of a match, 0 means it's a good indication of a non-match.
mismo.fs.LevelWeights.name: str
property
The name of the level, e.g. "Exact Match".
mismo.fs.LevelWeights.odds: float
property
How much more likely is a match than a non-match at this level?
This is derived from m and u. This is the same thing as "Bayes Factor" in splink.
- values below 1 are evidence against a match
- values above 1 are evidence for a match
- 1 means this level does not provide any evidence for or against a match
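As a rough sketch of how these quantities relate, assuming the odds are computed as the ratio m / u (which is how splink defines its Bayes factor) and that log_odds is the base-10 logarithm of that ratio. The helper name and the numbers are illustrative only.

```python
import math


def odds_from_m_u(m: float, u: float) -> float:
    """Assumed definition: how much more common this level is among matches
    than among non-matches."""
    return m / u


# Hypothetical level: seen in 90% of true matches but only 9% of non-matches.
m, u = 0.9, 0.09
odds = odds_from_m_u(m, u)      # roughly 10: evidence for a match
log_odds = math.log10(odds)     # roughly 1

# A level that is equally common in both groups carries no evidence.
neutral = odds_from_m_u(0.5, 0.5)
```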
mismo.fs.LevelWeights.u: float
property
Among non-matches, what proportion of them have this level?
1 means this level is a good indication of a non-match, 0 means it's a good indication of a match.
mismo.fs.LevelWeights.__init__(name: str, *, m: float, u: float) -> None
Create a new LevelWeights object.
mismo.fs.train_using_labels(comparers: Iterable[LevelComparer], left: ir.Table, right: ir.Table, *, max_pairs: int = 1000000000) -> Weights
Estimate all Weights for a set of LevelComparers using labeled records.
The m parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly matching pairs. This function estimates the m parameters using the label_true columns in the input datasets.
The u parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly non-matching records. This function estimates the u parameters using random sampling.
PARAMETER | DESCRIPTION |
---|---|
comparers | The comparers to train. TYPE: Iterable[LevelComparer] |
left | The left dataset. TYPE: ir.Table |
right | The right dataset. TYPE: ir.Table |
max_pairs | The maximum number of pairs to sample. This is used for both the m and u estimates. TYPE: int DEFAULT: 1000000000 |
RETURNS | DESCRIPTION |
---|---|
Weights | The estimated weights for each comparer. |
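The idea behind the m and u estimates can be illustrated with a simple counting sketch. It assumes we already have the level assigned to each known-matching pair and to each sampled non-matching pair; the level names, the data, and the `level_proportions` helper are hypothetical, and mismo's actual procedure may differ.

```python
from collections import Counter


def level_proportions(levels: list[str]) -> dict[str, float]:
    """Proportion of pairs that fall into each level."""
    counts = Counter(levels)
    total = len(levels)
    return {level: n / total for level, n in counts.items()}


# Hypothetical levels observed for pairs known to be matches (from labels)...
match_levels = ["exact", "exact", "exact", "close"]
# ...and for randomly sampled pairs, assumed to be non-matches.
nonmatch_levels = ["else", "else", "else", "close"]

m = level_proportions(match_levels)     # estimates of the m parameters
u = level_proportions(nonmatch_levels)  # estimates of the u parameters
```

Levels that never occur in a sample are simply absent from the result here; a real implementation would need to handle such zero counts.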
mismo.fs.train_using_em(comparers: Iterable[LevelComparer], left: ir.Table, right: ir.Table, *, max_pairs: int | None = None) -> Weights
Train weights on unlabeled data using an expectation maximization algorithm.
PARAMETER | DESCRIPTION |
---|---|
comparers | The comparers to train. TYPE: Iterable[LevelComparer] |
left | The left dataset. TYPE: ir.Table |
right | The right dataset. TYPE: ir.Table |
max_pairs | The maximum number of pairs to sample. If None, all pairs are used. TYPE: int or None DEFAULT: None |
RETURNS | DESCRIPTION |
---|---|
Weights | The estimated weights for each comparer. |
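For intuition, here is a minimal textbook-style EM loop for a Fellegi-Sunter model. It is not mismo's implementation: it assumes every comparer has just two levels (agree / else), that comparers are conditionally independent given match status, and it uses made-up data, starting values, and function names.

```python
def clamp(x: float, eps: float = 1e-6) -> float:
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1.0 - eps)


def em_binary(patterns, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate match prevalence p and per-comparer m and u from unlabeled pairs.

    patterns: one tuple per record pair, with 1 (agree) or 0 (else) per comparer.
    """
    k = len(patterns[0])
    n = len(patterns)
    m = [m_init] * k
    u = [u_init] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        w = []
        for g in patterns:
            like_match, like_non = p, 1.0 - p
            for j, agree in enumerate(g):
                like_match *= m[j] if agree else 1.0 - m[j]
                like_non *= u[j] if agree else 1.0 - u[j]
            w.append(like_match / (like_match + like_non))
        # M-step: re-estimate parameters as weighted proportions.
        sum_w = sum(w)
        sum_not_w = n - sum_w
        p = clamp(sum_w / n)
        m = [clamp(sum(wi * g[j] for wi, g in zip(w, patterns)) / sum_w)
             for j in range(k)]
        u = [clamp(sum((1.0 - wi) * g[j] for wi, g in zip(w, patterns)) / sum_not_w)
             for j in range(k)]
    return p, m, u


# Hypothetical agreement patterns for three comparers over 102 record pairs.
patterns = (
    [(1, 1, 1)] * 10 + [(1, 1, 0)] * 2 + [(0, 0, 0)] * 70
    + [(1, 0, 0)] * 10 + [(0, 1, 0)] * 5 + [(0, 0, 1)] * 5
)
p_hat, m_hat, u_hat = em_binary(patterns)
```

The starting values encode the usual prior belief that agreement is more common among matches than among non-matches; EM can converge to a different solution from a different starting point.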
mismo.fs.plot_weights(weights: ComparerWeights | Iterable[ComparerWeights]) -> alt.Chart
Plot the weights for Comparer(s).
Use this to:
- See which levels are common and which are rare. If all pairs are getting matched by only one level, you probably want to adjust the conditions so that pairs are more evenly distributed. For example, if you have an "exact match" level that is hardly ever used, that could be an indication that your condition is too strict and you should relax it.
- See the odds for each level. If the odds for an "exact match" level are lower than you expect, perhaps near 1, that could be an indication that your condition is too loose and there are many non-matches sneaking into that level. You should inspect those pairs and figure out how to tighten the condition so that only matches are in that level.
PARAMETER | DESCRIPTION |
---|---|
weights | The weights to plot. TYPE: ComparerWeights or Iterable[ComparerWeights] |
RETURNS | DESCRIPTION |
---|---|
Chart | The plot. |