I discuss the use of a sum-of-longest-common-strings algorithm to measure the degree of difference between strings. This is efficient in VBA, and can be used as the basis of a closest-match alternative to VLookup and Match.
The Details...
A couple of years ago, I looked at writing a fuzzy-matching version of VLookup, to return the closest match to a search string rather than the #N/A error that comes back from the standard Excel function. I posted it on an obscure personal blog and forgot about it. But I'll be posting it here shortly, and this post is the introduction to it, with an explanation of the principles and the core VBA function that drives the whole thing.
Originally, I looked at using a Levenshtein 'Edit Distance' algorithm to compare and measure the degree of difference between strings. (Thanks are due to the inestimable Mr. Crowley for pointing me towards some basic C++ and theory-of-algorithms links.) But field-testing showed that a simpler and faster approach was required: I needed something that gives consistent results on longer strings, like addresses and sentences, without the need for a separate scoring process that examines the word order.
The simplest approach of all to comparing and scoring for similarity is searching for the longest common string. This has the obvious advantages of speed and simplicity, but it also has a weak point: simple LCS algorithms are unduly sensitive to single-letter substitutions and deletions near the midpoint of the test word. For example, 'Wednesday' is obviously closer to 'WedXesday' on an edit-distance basis than it is to 'WednesXXX', but the latter wins on a longest-common-string measure despite having more substitutions. This suggests that it would be better to score the 'Wed' as well as the 'esday', adding up all the matching substrings instead of just measuring the longest one.
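To put numbers on that, here is what the function listed below returns for those two comparisons. The scores are worked through by hand from the rules just described, so treat them as an illustration:

SumOfCommonStrings("Wednesday", "WedXesday")    returns 8  ('Wed' scores 3, 'esday' scores 5)
SumOfCommonStrings("Wednesday", "WednesXXX")    returns 6  ('Wednes' only)

A plain longest-common-string measure would rank these the other way round, which is exactly the problem.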
It turns out that the recursive algorithm I'm using to do this has an embedded sequence-sensitivity; in theory, this is a complication and a pretty heavy hint that there's some error in my logic that I ought to investigate and remove. In practice, a degree of sequence-sensitivity works well when we compare two sentences or phrases: this 'error' is a pretty good proxy for compiling a secondary score based on similarities in their word order.
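To see that sequence-sensitivity at work, compare a one-letter misspelling with a simple word swap. Again, these scores are worked through by hand from the function that follows, so treat them as an illustration:

SumOfCommonStrings("john smith", "john smyth")    returns 9  (one misspelt letter costs a single point)
SumOfCommonStrings("john smith", "smith john")    returns 5  (swapping the word order costs far more)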
Which goes to show that serendipity comes from simplicity and, if you strip out the comments, this function is a commendably compact piece of code:
Public Function SumOfCommonStrings( _
ByVal s1 As String, _
ByVal s2 As String, _
Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
Optional iScore As Integer = 0 _
) As Integer
Application.Volatile False
' N.Heffernan 06 June 2006 (somewhere over Newfoundland)
' THIS CODE IS IN THE PUBLIC DOMAIN
' Function to measure how much of String 1 is made up of substrings found in String 2
' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So would it be better to score
' the 'Wed' as well as the 'esday' ?
' Watch out for strings of differing lengths:
'
' SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
' SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.
' This is coded for clarity, not for performance.
Dim n As Integer ' length of s1
Dim m As Integer ' length of s2
Dim i As Integer ' start position in s1
Dim j As Integer ' start position in s2
Dim subs1 As String ' a substring of s1
Dim len1 As Integer ' length of subs1
Dim sBefore1 As String ' fragment of s1 before a matched substring
Dim sBefore2 As String ' fragment of s2 before the match
Dim sAfter1 As String ' fragment of s1 after the match
Dim sAfter2 As String ' fragment of s2 after the match
Dim s3 As String ' swap buffer
SumOfCommonStrings = iScore
n = Len(s1)
m = Len(s2)
If s1 = s2 Then
SumOfCommonStrings = n
Exit Function
End If
If n = 0 Or m = 0 Then
Exit Function
End If
's1 should always be the shorter of the two strings:
If n > m Then
s3 = s2
s2 = s1
s1 = s3
n = Len(s1)
m = Len(s2)
End If
' Special case: s1 is an exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n
Exit Function
End If
For len1 = n To 1 Step -1
For i = 1 To n - len1 + 1
subs1 = Mid(s1, i, len1)
j = InStr(1, s2, subs1, Compare)
If j > 0 Then
' We've found a matching substring...
iScore = iScore + len1
' Reinstate this Debug.Print statement to monitor the function:
' Debug.Print s1 & " substring (len=" & len1 & ") " & subs1 & " in " & s2 & " Scores " & len1
' Now clip out this substring from s1 and s2...
' However, we can't just concatenate the fragments before and
' after this deletion and restart: substrings that span this
' artificial join might give spurious matches. So we run the
' function on the separate 'before' and 'after' pieces. Note
' that running before1 vs before2 and after1 vs after2, without
' running before1 vs after2 and before2 vs after1, introduces
' a sequence bias. This may be undesirable, as the effect will
' be to discard match scores for transposed words in a sentence
If i > 1 And j > 1 Then
sBefore1 = Left(s1, i - 1)
sBefore2 = Left(s2, j - 1)
iScore = SumOfCommonStrings(sBefore1, _
sBefore2, _
Compare, _
iScore)
End If
If i + len1 < n And j + len1 < m Then
sAfter1 = Right(s1, n + 1 - i - len1)
sAfter2 = Right(s2, m + 1 - j - len1)
iScore = SumOfCommonStrings(sAfter1, _
sAfter2, _
Compare, _
iScore)
End If
SumOfCommonStrings = iScore
' No further recursion: don't double-count substrings of a matched substring!
Exit Function
'Reinstate this 'Else' block to monitor the function:
'Else
' No action required.
' Debug.Print s1 & " substring (len=" & len1 & ") " & subs1 & " in " & s2
End If
Next
Next
End Function
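As the comments say, the raw score only means something once it is scaled against the length of the longer string. Here is a minimal wrapper showing one way to do that; the name Similarity and its details are my own sketch, not part of the lookup code:

Public Function Similarity(ByVal s1 As String, ByVal s2 As String) As Double
' Illustrative wrapper, not part of the original function: scales the raw
' sum-of-common-strings score by the length of the LONGER string, so that
' Similarity("Wednesday", "WednesXXXday") comes out below 1.
Dim lngLen As Long
lngLen = Len(s1)
If Len(s2) > lngLen Then lngLen = Len(s2)
If lngLen = 0 Then
Similarity = 1 ' two empty strings: call them identical
Else
Similarity = SumOfCommonStrings(s1, s2) / lngLen
End If
End Function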
There is room for improvement, and I suspect that the embedded sequence sensitivity is a drawback in some applications. Consider these two addresses:
The House of Cards,
11 High Street,
and
House of Cards, The
11 High Street
They are clearly the same address, and this isn't even a typing error: moving 'The' (or titles like 'Mr' and 'Mrs') to the end of the line is accepted secretarial practice in lists that are sorted alphabetically. But my algorithm treats the transposed word 'The' as an insertion, applying a penalty without making any attempt to identify it as a transposition. In fact, it double-counts the transposition as two points of difference - deletion plus insertion - just like the unmodified Levenshtein edit-distance algorithm. So feel free to rewrite my code where it splits the test words into 'before' and 'after' fragments and resumes the search for matching substrings - but be warned, this is not as simple as you might think, and I don't see any obvious analogy with Damerau's extension of the Levenshtein algorithm. In practice, the brutal excision of articles and titles from addresses is the most reliable approach; a rough sketch of that follows.
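For what it's worth, here is one way that excision might look - a hypothetical helper (the name StripNoiseWords and the word list are mine, not part of the lookup code) that throws away articles and titles before the strings are compared:

Public Function StripNoiseWords(ByVal sText As String) As String
' Hypothetical helper, not part of the lookup code: crudely strip articles
' and titles so that 'The House of Cards' and 'House of Cards, The' reduce
' to the same string before they reach SumOfCommonStrings.
Dim vWord As Variant
Dim sOut As String
' Treat commas and full stops as word separators:
sText = Replace(Replace(sText, ",", " "), ".", " ")
For Each vWord In Split(LCase$(sText), " ")
Select Case vWord
Case "", "the", "mr", "mrs", "ms", "dr"
' Skip noise words, and the empty tokens left by doubled spaces
Case Else
sOut = sOut & vWord & " "
End Select
Next
StripNoiseWords = Trim$(sOut)
End Function

Run both addresses through something like this before scoring them and the transposed 'The' stops costing points.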
A parting shot:
I have a vague suspicion that this sum-of-longest-common-strings algorithm is functionally equivalent to the Levenshtein edit distance, but I lack the logical tools to attempt a formal proof. Would anyone care to offer some pointers? I think it's time I moved beyond simple hacks and started putting this stuff on a firmer foundation.