Fuzzy matching in Python
difflib
- Docs here
Simplest use case
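A minimal sketch of the basic call, using `difflib.SequenceMatcher` from the standard library. The two strings are placeholders of my own, picked so that 13 characters match out of 27 in total, which gives a ratio of 2 * 13 / 27.

```python
from difflib import SequenceMatcher

# Placeholder inputs (an assumption, chosen for illustration).
a = "fuzzy matching"
b = "fuzzy matchng"

# The first argument is `isjunk`; None means no element is treated as junk.
score = SequenceMatcher(None, a, b).ratio()
print(score)
```

`ratio()` returns a float between 0 and 1, computed as twice the number of matched elements divided by the combined length of both sequences.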
0.9629629629629629
Create a helper function so we don’t need to pass None (the isjunk argument) each time.
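One way to build such a helper is `functools.partial`, which pre-fills the first positional argument. This is a sketch: the name `matcher` and the input strings are hypothetical.

```python
from difflib import SequenceMatcher
from functools import partial

# `matcher` is a hypothetical name; it behaves like SequenceMatcher
# with isjunk=None already supplied.
matcher = partial(SequenceMatcher, None)

# Placeholder inputs (an assumption, chosen for illustration).
score = matcher("fuzzy matching", "fuzzy matchng").ratio()
print(score)
```

A plain `def` wrapper that returns the ratio directly would work just as well; `partial` simply keeps the full `SequenceMatcher` object available.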
0.9629629629629629
Compare one sequence against several others (SequenceMatcher caches information about the second sequence, so set it once with set_seq2() and vary the first with set_seq1())
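A sketch of that pattern: the shared sequence is set once with `set_seq2()` and each candidate is swapped in with `set_seq1()`. The variable names are my own.

```python
from difflib import SequenceMatcher

target = "abc"
candidates = ["abc", "ab", "abcd", "cde", "def"]

matcher = SequenceMatcher(None)
matcher.set_seq2(target)  # analysed once, then reused for every candidate

lines = []
for candidate in candidates:
    matcher.set_seq1(candidate)  # only the first sequence changes
    lines.append(f"{candidate}, {target} -> {matcher.ratio():.3f}")

print("\n".join(lines))
```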
abc, abc -> 1.000
ab, abc -> 0.800
abcd, abc -> 0.857
cde, abc -> 0.333
def, abc -> 0.000
fuzzywuzzy
Based on this tutorial.
Finding perfect or imperfect substrings
One limitation of SequenceMatcher is that two sequences that clearly refer to the same thing might get a lower score than two sequences that refer to something different.
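To illustrate, here is a sketch using team names that are a common example in fuzzywuzzy introductions. I am assuming these inputs; the helper name `ratio` is also my own. The short and long names for the same team score lower than the names of two different teams that share a city prefix.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Hypothetical helper: plain SequenceMatcher ratio with no junk rule."""
    return SequenceMatcher(None, a, b).ratio()


# Same team, but the short name is only a fragment of the long one.
same_team = ratio("YANKEES", "NEW YORK YANKEES")
# Different teams, but they share the "NEW YORK " prefix.
different_team = ratio("NEW YORK METS", "NEW YORK YANKEES")

print(same_team)
print(different_team)
```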
0.6086956521739131
0.7586206896551724
fuzzywuzzy has a useful function for this based on what they call the “best-partial” heuristic, which returns the similarity score for the best-matching substring of the longer sequence whose length is min(len(seq1), len(seq2)).
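In fuzzywuzzy the function is `fuzz.partial_ratio`. To show the idea without the dependency, here is a simplified sliding-window sketch built on `difflib`. It is not the library's implementation, which may choose candidate windows differently, and the name `best_partial_score` and the inputs are my own assumptions.

```python
from difflib import SequenceMatcher


def best_partial_score(s1, s2):
    """Score (0-100) of the best window of the longer string against the shorter.

    Simplified illustration of the "best-partial" idea, not fuzzywuzzy's code.
    """
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    # Slide a window of the shorter string's length over the longer string.
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
        for i in range(len(longer) - n + 1)
    )
    return round(100 * best)


print(best_partial_score("YANKEES", "NEW YORK YANKEES"))
print(best_partial_score("NEW YORK METS", "NEW YORK YANKEES"))
```

Because every window has the same length as the shorter string, an exact substring always reaches 100, however long the surrounding text is.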
100
69
For one of my projects, I want to filter out financial transactions for which the description is a perfect or near-perfect substring of another transaction. So this is exactly what I need.
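A rough sketch of how such a filter could look, under several assumptions: the transaction descriptions are invented, the threshold of 90 is arbitrary, and `best_partial_score` is the same simplified `difflib` stand-in for `fuzz.partial_ratio` rather than the library function itself.

```python
from difflib import SequenceMatcher


def best_partial_score(s1, s2):
    """Simplified best-window score (0-100); a stand-in for a partial ratio."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
        for i in range(len(longer) - n + 1)
    )
    return round(100 * best)


def drop_near_substrings(descriptions, threshold=90):
    """Drop each description that is a (near-)substring of a longer one."""
    kept = []
    for i, desc in enumerate(descriptions):
        is_contained = any(
            j != i
            and len(other) > len(desc)
            and best_partial_score(desc, other) >= threshold
            for j, other in enumerate(descriptions)
        )
        if not is_contained:
            kept.append(desc)
    return kept


# Hypothetical descriptions, for illustration only.
transactions = [
    "AMAZON MARKETPLACE",
    "AMAZON MARKETPLACE REFUND 1234",
    "COFFEE SHOP LONDON",
]
filtered = drop_near_substrings(transactions)
print(filtered)
```

The pairwise comparison is quadratic in the number of transactions, which is fine for small batches; for larger ones it would be worth grouping candidates first.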