Fuzzy matching in Python

2021-09-27

`difflib`

Docs: see the `difflib` module in the Python standard library documentation.

import difflib

The simplest use case:

m = difflib.SequenceMatcher(None, 'NEW YORK METS', 'NEW YORK MEATS')
m.ratio()

0.9629629629629629

Create a helper function so we don't need to pass None (the "no junk filter" argument) each time.

from functools import partial
matcher = partial(difflib.SequenceMatcher, None)

matcher('NEW YORK METS', 'NEW YORK MEATS').ratio()

0.9629629629629629

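Compare one sequence to multiple other sequences. SequenceMatcher caches detailed information about the second sequence, so set it once with set_seq2() and vary only the first sequence with set_seq1().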

m = difflib.SequenceMatcher()
m.set_seq2('abc')

for s in ['abc', 'ab', 'abcd', 'cde', 'def']:
    m.set_seq1(s)
    length = len(m.a) + len(m.b)
    print('{}, {:{}} -> {:.3f}'.format(m.a, m.b, 10-length, m.ratio()))

abc, abc  -> 1.000
ab, abc   -> 0.800
abcd, abc -> 0.857
cde, abc  -> 0.333
def, abc  -> 0.000
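
The same caching pattern can be wrapped in a small helper that ranks a list of candidates against one fixed query. This is only a sketch: `rank_candidates` is a name made up for illustration, not part of `difflib`.

```python
import difflib

def rank_candidates(query, candidates):
    """Score each candidate against one fixed query, best match first.

    Hypothetical helper for illustration; not part of difflib.
    """
    m = difflib.SequenceMatcher()
    m.set_seq2(query)  # the second sequence is analysed once and cached
    scored = []
    for candidate in candidates:
        m.set_seq1(candidate)  # only the first sequence changes per iteration
        scored.append((m.ratio(), candidate))
    # Sort by score, highest first (ties fall back to the string itself)
    return sorted(scored, reverse=True)

ranked = rank_candidates('abc', ['abc', 'ab', 'abcd', 'cde', 'def'])
for score, candidate in ranked:
    print(f'{candidate:<5} {score:.3f}')
```

Sorting the `(score, candidate)` tuples keeps the code short; for long candidate lists, `max()` or `heapq.nlargest()` would avoid a full sort.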

`fuzzywuzzy`

Based on this tutorial.

Finding perfect or imperfect substrings

One limitation of SequenceMatcher is that two strings that clearly refer to the same thing can get a lower score than two strings that refer to different things.

print(matcher("YANKEES", "NEW YORK YANKEES").ratio())
matcher("NEW YORK METS", "NEW YORK YANKEES").ratio()

0.6086956521739131

0.7586206896551724

fuzzywuzzy has a useful function for this, based on what its authors call the “best-partial” heuristic: partial_ratio returns the similarity score (0 to 100) of the best-matching substring of length min(len(seq1), len(seq2)) taken from the longer string.

from fuzzywuzzy import fuzz

print(fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES"))
print(fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES"))

100
69
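
To make the heuristic concrete, here is a rough sketch of the idea using only `difflib`: slide a window the length of the shorter string over the longer one and keep the best ratio. `partial_ratio_sketch` is a hypothetical name, and this brute-force loop is an assumption about the general idea, not fuzzywuzzy's actual implementation (which picks candidate windows more cleverly), so scores may differ on other inputs.

```python
import difflib

def partial_ratio_sketch(s1, s2):
    """Brute-force 'best partial' score on a 0-100 scale.

    Hypothetical illustration; not fuzzywuzzy's real algorithm.
    """
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    m = difflib.SequenceMatcher()
    m.set_seq2(shorter)  # the shorter string stays fixed and is cached
    best = 0.0
    # Try every window of len(shorter) characters in the longer string
    for start in range(len(longer) - len(shorter) + 1):
        m.set_seq1(longer[start:start + len(shorter)])
        best = max(best, m.ratio())
    return int(round(100 * best))

print(partial_ratio_sketch("YANKEES", "NEW YORK YANKEES"))
print(partial_ratio_sketch("NEW YORK METS", "NEW YORK YANKEES"))
```

Because every window is scored, the cost grows with the product of the two string lengths, which is fine for short descriptions but slow for long texts.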

For one of my projects, I want to filter out financial transactions whose description is a perfect or near-perfect substring of another transaction's description, so this is exactly what I need.

a = ''
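
The filtering idea described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper names, the 0.9 threshold, and the sample descriptions are invented, and it uses a `difflib` sliding window in place of `fuzz.partial_ratio` so that it runs without extra dependencies.

```python
import difflib

def contains_fuzzily(short, long, threshold=0.9):
    """True if `short` closely matches some equally long window of `long`.

    Hypothetical helper; the threshold is an arbitrary illustrative choice.
    """
    if not short or len(short) > len(long):
        return False
    m = difflib.SequenceMatcher()
    m.set_seq2(short)  # fixed sequence, cached across windows
    for start in range(len(long) - len(short) + 1):
        m.set_seq1(long[start:start + len(short)])
        if m.ratio() >= threshold:
            return True
    return False

def drop_near_substrings(descriptions, threshold=0.9):
    """Keep only descriptions that are not (near-)substrings of a longer one."""
    kept = []
    for i, desc in enumerate(descriptions):
        covered = any(
            i != j
            and len(other) > len(desc)
            and contains_fuzzily(desc, other, threshold)
            for j, other in enumerate(descriptions)
        )
        if not covered:
            kept.append(desc)
    return kept

# Made-up sample data, not real transactions
sample = [
    'AMAZON PRIME',
    'CARD PAYMENT AMAZON PRIME 1234',
    'TESCO STORES',
    'AMAZN PRIME',
]
print(drop_near_substrings(sample))
```

Only strictly longer descriptions are allowed to "cover" a shorter one, so two identical descriptions never eliminate each other. The pairwise comparison is quadratic in the number of transactions, which is worth keeping in mind for large datasets.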