Fellegi-Sunter Model
See the Fellegi-Sunter Concept guide for background info.
mismo.fs.Weights
Weights for the Fellegi-Sunter model.
An unordered, dict-like collection of ComparerWeights, one for each LevelComparer of the same name.
mismo.fs.Weights.__getitem__
__getitem__(name: str) -> ComparerWeights
Get a ComparerWeights by name.
mismo.fs.Weights.__init__
__init__(comparer_weights: Iterable[ComparerWeights])
Create a new Weights object.
mismo.fs.Weights.__iter__
__iter__() -> Iterator[ComparerWeights]
Iterate over the contained ComparerWeights.
mismo.fs.Weights.compare_and_score
compare_and_score(
t: Table, level_comparers: Iterable[LevelComparer]
) -> Table
Compare and score record pairs.
Use the given level_comparers to label the record pairs, and then score the results using self.score_compared.
mismo.fs.Weights.from_json
classmethod
Create a Weights object from a JSON-serializable representation.
PARAMETER | DESCRIPTION |
---|---|
json | If a dict, assumed to be the JSON-serializable representation. Load it directly. If a str or Path, assumed to be a path to a JSON file. Load it from that file. |
RETURNS | DESCRIPTION |
---|---|
Weights | The Weights object created from the JSON-serializable representation. |
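The dict-or-path behaviour described for the json parameter can be sketched with the standard library alone. This is not the library's implementation: `load_json_like` is a hypothetical helper, and the payload is a placeholder, since the actual schema of a serialized Weights is not described here.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def load_json_like(source: dict | str | Path) -> dict[str, Any]:
    """Hypothetical helper: use a dict as-is, otherwise parse the JSON file at the path."""
    if isinstance(source, dict):
        return source
    return json.loads(Path(source).read_text(encoding="utf-8"))


# Placeholder payload; the real schema of a serialized Weights is an unknown here.
payload = {"placeholder": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "weights.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    from_dict = load_json_like(payload)    # dict: loaded directly
    from_path = load_json_like(path)       # Path: read from the file
    from_str = load_json_like(str(path))   # str: treated as a file path
```

All three calls yield the same dict, which mirrors the documented rule that a str is interpreted as a path rather than as JSON text.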
mismo.fs.Weights.plot
plot() -> Chart
Plot the weights for all of the LevelComparers.
mismo.fs.Weights.score_compared
score_compared(compared: Table) -> Table
Score already-compared record pairs.
This assumes that there is already one column for each LevelComparer, containing the labels for each record pair. For example, if we have a LevelComparer called "address", then we should have a column called "address" that contains labels like "exact", "one-letter-off", "same-city", etc.
For each LevelComparer, we add a column, {comparer.name}_odds.
This is a number that describes how this comparer affects the likelihood of a match. For example, an odds of 10 means that this comparer increased the likelihood of a match by 10x as compared to if we hadn't looked at this comparer.
For example, the column might be called "name_odds" and have values like 10, 0.1, 1.
In addition to these per-LevelComparer columns, we also add a column called "odds" which is the overall odds for each record pair. We calculate this by starting with the odds of 1 and then multiplying by each LevelComparer's odds to get the overall odds.
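The multiplication described above can be sketched in plain Python for a single record pair. The comparer names and odds values below are made up for illustration, and `overall_odds` is a hypothetical helper, not part of the library.

```python
import math

# Hypothetical per-comparer odds for one record pair (illustrative values only).
per_comparer_odds = {"name_odds": 10.0, "address_odds": 0.5, "zipcode_odds": 1.0}


def overall_odds(odds_by_comparer: dict) -> float:
    """Start at odds of 1 and multiply by each comparer's odds."""
    total = 1.0
    for value in odds_by_comparer.values():
        total *= value
    return total


# Mimic the added "odds" column next to the per-comparer columns.
row = dict(per_comparer_odds)
row["odds"] = overall_odds(per_comparer_odds)
```

In this sketch the strong evidence from the name (10x) is halved by the address (0.5x) and unaffected by the zipcode (1x), giving overall odds of 5.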
mismo.fs.ComparerWeights
The weights for a single LevelComparer.
An ordered, dict-like collection of LevelWeights, one for each level.
mismo.fs.ComparerWeights.name
property
name: str
The name of the LevelComparer these weights are for, e.g. "name" or "address".
mismo.fs.ComparerWeights.__contains__
Check if a LevelWeights is present by name or index.
mismo.fs.ComparerWeights.__getitem__
__getitem__(name_or_index: str | int) -> LevelWeights
__getitem__(
name_or_index: slice,
) -> tuple[LevelWeights, ...]
__getitem__(
name_or_index: str | int | slice,
) -> LevelWeights | tuple[LevelWeights, ...]
Get a LevelWeights by name or index.
mismo.fs.ComparerWeights.__init__
__init__(name: str, level_weights: Iterable[LevelWeights])
Create a new ComparerWeights object.
mismo.fs.ComparerWeights.__iter__
__iter__() -> Iterator[LevelWeights]
Iterate over the LevelWeights, including the implicit ELSE level.
mismo.fs.ComparerWeights.__len__
__len__() -> int
The number of LevelWeights, including the implicit ELSE level.
mismo.fs.ComparerWeights.log_odds
log_odds(
labels: StringValue | IntegerValue,
) -> FloatingValue
Calculate the log odds for each record pair.
mismo.fs.ComparerWeights.match_probability
match_probability(
labels: StringValue | IntegerValue,
) -> FloatingValue
Calculate the match probability for each record pair.
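For orientation, the standard identities relating odds to a probability and to log odds can be sketched as follows. This is a minimal sketch, not the library's code: the helper names are hypothetical, and the base-10 logarithm is an assumption, since the base used by log_odds is not stated here.

```python
import math


def odds_to_probability(odds: float) -> float:
    """Standard identity: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def odds_to_log_odds(odds: float) -> float:
    """Log odds; base 10 is an assumption made for this sketch."""
    return math.log10(odds)


even = odds_to_probability(1.0)    # odds of 1 carry no evidence either way
likely = odds_to_probability(9.0)  # 9-to-1 in favour of a match
```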
mismo.fs.ComparerWeights.odds
Calculate the odds for each record pair.
If labels is a string or integer, then we calculate the odds for that level. For example, if labels is "close", then we calculate the odds for the "close" level. If labels is 0, then we calculate the odds for the first level. If labels is -1, then we calculate the odds for the last level (the ELSE level).
If labels is a StringValue or IntegerValue, then we do the same thing, except that we return an ibis FloatingValue instead of a python float.
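The name-or-index lookup described above can be illustrated with a small stand-in. The level names and odds values are hypothetical, and `odds_for` is a sketch of the lookup rule only; it does not handle ibis values.

```python
from __future__ import annotations

# Hypothetical ordered levels for one comparer. The last entry stands in for
# the implicit ELSE level. Odds values are made up for illustration.
LEVELS = [("exact", 20.0), ("close", 3.0), ("else", 0.2)]


def odds_for(label: str | int) -> float:
    """Look up odds by level name, or by position (negative indexes count from the end)."""
    if isinstance(label, int):
        return LEVELS[label][1]
    for name, odds in LEVELS:
        if name == label:
            return odds
    raise KeyError(label)


by_name = odds_for("close")  # the "close" level
first = odds_for(0)          # the first level
last = odds_for(-1)          # the last level, i.e. the ELSE level
```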
mismo.fs.ComparerWeights.plot
staticmethod
plot() -> Chart
Plot the weights for this Comparer.
mismo.fs.LevelWeights
Weights for a single MatchLevel.
This describes for example "If zipcodes match perfectly, then this increases the probability of a match by 10x as compared to if we hadn't looked at zipcode".
mismo.fs.LevelWeights.m
property
m: float
Among true-matches, what proportion of them have this level?
1 means this level is a good indication of a match, 0 means it's a good indication of a non-match.
mismo.fs.LevelWeights.odds
property
odds: float
How much more likely is a match than a non-match at this level?
This is derived from m and u. This is the same thing as "Bayes Factor" in splink.
- values below 1 are evidence against a match
- values above 1 are evidence for a match
- 1 means this level does not provide any evidence for or against a match
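As a rough illustration, assuming the usual Fellegi-Sunter definition in which the Bayes factor is the ratio m / u (the text above says only that odds is derived from m and u), the three cases look like this. `bayes_factor` is a hypothetical helper and the m and u values are made up.

```python
import math


def bayes_factor(m: float, u: float) -> float:
    """Assumed definition for this sketch: odds = m / u."""
    return m / u


# Level seen in 90% of matches but only 9% of non-matches: evidence for a match.
evidence_for = bayes_factor(m=0.9, u=0.09)
# Level equally common among matches and non-matches: no evidence either way.
no_evidence = bayes_factor(m=0.5, u=0.5)
# Level rare among matches but common among non-matches: evidence against.
evidence_against = bayes_factor(m=0.1, u=0.8)
```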
mismo.fs.LevelWeights.u
property
u: float
Among non-matches, what proportion of them have this level?
1 means this level is a good indication of a non-match, 0 means it's a good indication of a match.
mismo.fs.train_using_labels
train_using_labels(
comparers: Iterable[LevelComparer],
left: Table,
right: Table,
*,
max_pairs: int = 1000000000,
) -> Weights
Estimate all Weights for a set of LevelComparers using labeled records.
The m parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly matching pairs. This function estimates the m parameters using the label_true columns in the input datasets.
The u parameters represent the proportion of record pairs that fall into each MatchLevel amongst truly non-matching records. This function estimates the u parameters using random sampling.
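Both estimates boil down to counting how often each level occurs within a group of pairs. The sketch below shows that counting step in plain Python on made-up labels; `level_proportions` is a hypothetical helper, and the way pairs are actually generated and sampled by the library is not shown.

```python
from collections import Counter


def level_proportions(labels: list) -> dict:
    """Proportion of pairs that fall into each level."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Made-up level labels for one comparer.
# Pairs known to match (for example, because their true labels agree):
labels_among_matches = ["exact", "exact", "exact", "close"]
# Randomly sampled pairs, treated as non-matches:
labels_among_non_matches = ["else", "else", "else", "close"]

m = level_proportions(labels_among_matches)
u = level_proportions(labels_among_non_matches)
```

Here the "exact" level has m = 0.75 and never appears among the sampled non-matches, while "close" is equally common in both groups and so carries little evidence.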
PARAMETER | DESCRIPTION |
---|---|
comparers | The comparers to train. |
left | The left dataset. |
right | The right dataset. |
max_pairs | The maximum number of pairs to sample. This is used for both the m and u estimates. |
RETURNS | DESCRIPTION |
---|---|
Weights | The estimated weights for each comparer. |
mismo.fs.train_using_em
train_using_em(
comparers: Iterable[LevelComparer],
left: Table,
right: Table,
*,
max_pairs: int | None = None,
) -> Weights
Train weights on unlabeled data using an expectation maximization algorithm.
PARAMETER | DESCRIPTION |
---|---|
comparers | The comparers to train. |
left | The left dataset. |
right | The right dataset. |
max_pairs | The maximum number of pairs to sample. If None, all pairs are used. |
RETURNS | DESCRIPTION |
---|---|
Weights | The estimated weights for each comparer. |
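To give a feel for what expectation maximization does here, the following is a textbook-style sketch of EM for a two-class Fellegi-Sunter model that assumes comparers are conditionally independent. It is not the library's implementation: `em_fit`, `init_dist`, the initial values, and the toy data are all assumptions made for illustration.

```python
from math import prod


def init_dist(levels, favour_first):
    """Starting guess: m favours earlier levels, u favours later ones."""
    k = len(levels)
    raw = [(k - i) if favour_first else (i + 1) for i in range(k)]
    return {lvl: w / sum(raw) for lvl, w in zip(levels, raw)}


def em_fit(patterns, levels, iterations=25, lam=0.1):
    """patterns: one tuple of level labels per pair; levels: level names per comparer."""
    m = [init_dist(lv, True) for lv in levels]
    u = [init_dist(lv, False) for lv in levels]
    n_comp = len(levels)
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match.
        post = []
        for pat in patterns:
            pm = lam * prod(m[c][pat[c]] for c in range(n_comp))
            pu = (1 - lam) * prod(u[c][pat[c]] for c in range(n_comp))
            post.append(pm / (pm + pu))
        # M-step: re-estimate the match rate and the m and u proportions.
        total_match = sum(post)
        total_non = sum(1 - p for p in post)
        lam = total_match / len(post)
        for c in range(n_comp):
            for lvl in levels[c]:
                hits = [p for p, pat in zip(post, patterns) if pat[c] == lvl]
                m[c][lvl] = sum(hits) / total_match
                u[c][lvl] = sum(1 - p for p in hits) / total_non
    return lam, m, u


# Toy data: two comparers, each with an "exact" level and an "else" level.
levels = [["exact", "else"], ["exact", "else"]]
patterns = (
    [("exact", "exact")] * 20
    + [("exact", "else")] * 5
    + [("else", "exact")] * 5
    + [("else", "else")] * 70
)
lam, m, u = em_fit(patterns, levels)
```

No labels are used: the algorithm alternates between guessing which pairs are matches and re-estimating the proportions from those guesses. The result depends on the starting values, so this sketch should be read as an illustration of the idea only.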
mismo.fs.plot_weights
plot_weights(
weights: ComparerWeights | Iterable[ComparerWeights],
) -> Chart
Plot the weights for Comparer(s).
Use this to:
- See which levels are common and which are rare. If all pairs are getting matched by only one level, you probably want to adjust the conditions so that pairs are more evenly distributed. For example, if you have an "exact match" level that is hardly ever used, that could be an indication that your condition is too strict and you should relax it.
- See the odds for each level. If the odds for an "exact match" level are lower than you expect, perhaps near 1, that could be an indication that your condition is too loose and there are many non-matches sneaking into that level. You should inspect those pairs and figure out how to tighten the condition so that only matches are in that level.
PARAMETER | DESCRIPTION |
---|---|
weights | The weights to plot. |
RETURNS | DESCRIPTION |
---|---|
Chart | The plot. |