Fuzzy matching in Python
difflib
- Docs here
Simplest use case
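The original snippet is not preserved here, so this is a minimal sketch with hypothetical input strings. They were chosen so that 13 characters match out of 27 in total, which gives the ratio 2 * 13 / 27 seen in the output below. The first argument to SequenceMatcher is the optional isjunk function; None means no element is treated as junk.

```python
from difflib import SequenceMatcher

# Hypothetical inputs: a 14-character string and a 13-character misspelling.
a = "fuzzy matching"
b = "fuzy matching"

# First positional argument is `isjunk`; None disables junk filtering.
score = SequenceMatcher(None, a, b).ratio()
print(score)
```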
0.9629629629629629
Create a helper function so we don’t need to specify None each time.
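One way to do this, sketched here with functools.partial, is to pre-fill the isjunk argument. The name matcher and the input strings are hypothetical, not taken from the original code.

```python
from difflib import SequenceMatcher
from functools import partial

# `matcher` is a hypothetical helper name; it pre-fills isjunk=None.
matcher = partial(SequenceMatcher, None)

score = matcher("fuzzy matching", "fuzy matching").ratio()
print(score)
```

Since isjunk already defaults to None, passing the sequences as keywords (a=..., b=...) would work as well.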
0.9629629629629629
Compare one sequence to multiple other sequences (SequenceMatcher caches information about the second sequence, so set it once and vary the first)
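A sketch of how this might look: the fixed sequence goes in via set_seq2 and each candidate via set_seq1. The function name ratios_against is hypothetical; the inputs are the ones visible in the output below.

```python
from difflib import SequenceMatcher


def ratios_against(target, candidates):
    """Score each candidate against one fixed target sequence."""
    sm = SequenceMatcher(None)
    sm.set_seq2(target)         # analysed once, then cached
    scores = {}
    for candidate in candidates:
        sm.set_seq1(candidate)  # only the first sequence changes
        scores[candidate] = sm.ratio()
    return scores


results = ratios_against("abc", ["abc", "ab", "abcd", "cde", "def"])
for seq, score in results.items():
    print(f"{seq}, abc -> {score:.3f}")
```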
abc, abc -> 1.000
ab, abc -> 0.800
abcd, abc -> 0.857
cde, abc -> 0.333
def, abc -> 0.000
fuzzywuzzy
Based on this tutorial.
Finding perfect or imperfect substrings
One limitation of SequenceMatcher is that two sequences that clearly refer to the same thing might get a lower score than two sequences that refer to something different.
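The baseball example from the fuzzywuzzy tutorial illustrates this. The strings below are assumed to be the ones used originally, since they give the ratios 2 * 7 / 23 and 2 * 11 / 29 that appear in the output.

```python
from difflib import SequenceMatcher

# Strings borrowed from the fuzzywuzzy tutorial (assumed inputs).
same_team = SequenceMatcher(None, "YANKEES", "NEW YORK YANKEES").ratio()
different_team = SequenceMatcher(None, "NEW YORK METS", "NEW YORK YANKEES").ratio()

print(same_team)       # same team, yet the lower score
print(different_team)  # different team, yet the higher score
```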
0.6086956521739131
0.7586206896551724
fuzzywuzzy has a useful function for this, based on what its authors call the “best-partial” heuristic: it returns the similarity score for the best-matching substring of the longer sequence, of length min(len(seq1), len(seq2)).
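With fuzzywuzzy installed, the call is fuzz.partial_ratio(s1, s2). To show the idea without the dependency, here is a simplified sketch using only difflib. It scores every window of the shorter string's length, which is not how fuzzywuzzy selects its candidate substrings, so results may differ for other inputs. The function name best_partial_ratio is hypothetical.

```python
from difflib import SequenceMatcher


def best_partial_ratio(s1, s2):
    """Simplified best-partial score on a 0-100 scale.

    Slides a window the length of the shorter string across the longer
    one and keeps the highest SequenceMatcher ratio. An illustration only,
    not fuzzywuzzy's actual implementation.
    """
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[start:start + width]).ratio()
        for start in range(len(longer) - width + 1)
    )
    return round(100 * best)


yankees = best_partial_ratio("YANKEES", "NEW YORK YANKEES")
mets = best_partial_ratio("NEW YORK METS", "NEW YORK YANKEES")
print(yankees)
print(mets)
```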
100
69
For one of my projects, I want to filter out financial transactions whose description is a perfect or near-perfect substring of another transaction’s description, so this is exactly what I need.
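A rough sketch of such a filter, using the same simplified difflib-based windowing in place of fuzzywuzzy. The function names, the sample descriptions, and the threshold of 90 are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def best_partial(short, long):
    """Best 0-100 score of `short` against any same-length window of `long`."""
    if len(short) > len(long):
        short, long = long, short
    if not short:
        return 0
    width = len(short)
    return max(
        round(100 * SequenceMatcher(None, short, long[i:i + width]).ratio())
        for i in range(len(long) - width + 1)
    )


def drop_near_substrings(descriptions, threshold=90):
    """Keep descriptions that are not near-substrings of a strictly longer one."""
    kept = []
    for desc in descriptions:
        covered = any(
            len(other) > len(desc) and best_partial(desc, other) >= threshold
            for other in descriptions
        )
        if not covered:
            kept.append(desc)
    return kept


# Hypothetical transaction descriptions.
sample = ["CARD PAYMENT TESCO STORES 1234", "TESCO STORES", "AMAZON PRIME"]
filtered = drop_near_substrings(sample)
print(filtered)
```

Only strictly longer descriptions are considered as containers, so two identical descriptions are both kept; whether that is the right behaviour depends on the data.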