**Summary:**

I discuss the use of a sum-of-longest-common-strings algorithm to measure the degree of difference between strings. This is efficient in VBA, and can be used as the basis of a closest-match alternative to VLookup and Match.

**The Details...**

A couple of years ago, I looked at writing a fuzzy-matching version of VLookup, to return the closest match to a search string rather than the #NA() that comes back from the standard Excel function. I posted it in an obscure personal blog, and forgot about it. But I'll be posting it here, shortly, and *this* post is the introduction to it, with an explanation of the principles and the core VBA function that drives the whole thing.

Originally, I looked at using a Levenshtein 'Edit Distance' algorithm to compare and measure the degree of difference between strings. (Thanks are due to the inestimable Mr. Crowley for pointing me towards some basic C++ and theory-of-algorithms links.) But field-testing showed that a simpler and faster approach was required - I needed something that gives consistent results on longer strings, like addresses and sentences, without the need for a separate scoring process that examines the word order.

The simplest approach of all to comparing and scoring for similarity is searching for the longest common string. This has the obvious advantages of speed and simplicity, but it also has a weak point: simple LCS algorithms are unduly sensitive to single-letter substitutions and deletions near the midpoint of the test word. For example, 'Wed**n**esday' is obviously closer to 'Wed**X**esday' on an edit-distance basis than it is to 'WednesXXX', but the latter has the longer common string despite having more substitutions. This suggests that it would be better to score the 'Wed' as well as the 'esday', adding up *all* the matching substrings, instead of just measuring the longest one.
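To put numbers on that, here's a quick check in Python (not the VBA we're building up to - `difflib` is used here only to find the longest common substring):

```python
from difflib import SequenceMatcher

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common contiguous substring of a and b."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size

# Edit distance prefers 'WedXesday' (one substitution), but a bare
# longest-common-string score prefers 'WednesXXX':
print(lcs_len("Wednesday", "WedXesday"))   # 5 ('esday')
print(lcs_len("Wednesday", "WednesXXX"))   # 6 ('Wednes')
```

Summing 'Wed' + 'esday' instead gives 'WedXesday' a score of 8 out of 9, which matches intuition.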

It turns out that the recursive algorithm I'm using to do this has an embedded sequence-sensitivity; in theory, this is a complication and a pretty heavy hint that there's some error in my logic that I ought to investigate and remove. In practice, a degree of sequence-sensitivity works well when we compare two sentences or phrases: this 'error' is a pretty good proxy for compiling a secondary score based on similarities in their word order.

Which goes to show that serendipity comes from simplicity. If you strip out the comments, this function is a commendably compact piece of code:

```vba
Public Function SumOfCommonStrings( _
    ByVal s1 As String, _
    ByVal s2 As String, _
    Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
    Optional iScore As Integer = 0 _
) As Integer

    Application.Volatile False

    ' N.Heffernan 06 June 2006 (somewhere over Newfoundland)
    ' THIS CODE IS IN THE PUBLIC DOMAIN

    ' Measures how much of String 1 is made up of substrings found in String 2.

    ' This function uses a modified Longest Common String algorithm.
    ' Simple LCS algorithms are unduly sensitive to single-letter
    ' deletions/changes near the midpoint of the test words, eg:
    ' 'Wednesday' is obviously closer to 'WedXesday' on an edit-distance
    ' basis than it is to 'WednesXXX'. So it is better to score
    ' the 'Wed' as well as the 'esday'.

    ' Watch out for strings of differing lengths:
    '
    '    SumOfCommonStrings("Wednesday", "WednesXXXday")
    '
    ' scores the same as:
    '
    '    SumOfCommonStrings("Wednesday", "Wednesday")
    '
    ' So make sure the calling function uses the length of the longest
    ' string when calculating the degree of similarity from this score.

    ' This is coded for clarity, not for performance.

    Dim n As Integer        ' length of s1
    Dim m As Integer        ' length of s2
    Dim i As Integer        ' start position in s1
    Dim j As Integer        ' start position in s2
    Dim subs1 As String     ' a substring of s1
    Dim len1 As Integer     ' length of subs1
    Dim sBefore1 As String  ' fragment of s1 before a matched substring
    Dim sBefore2 As String  ' fragment of s2 before a matched substring
    Dim sAfter1 As String   ' fragment of s1 after a matched substring
    Dim sAfter2 As String   ' fragment of s2 after a matched substring
    Dim s3 As String        ' swap variable

    SumOfCommonStrings = iScore

    n = Len(s1)
    m = Len(s2)

    If s1 = s2 Then
        SumOfCommonStrings = iScore + n
        Exit Function
    End If

    If n = 0 Or m = 0 Then
        Exit Function
    End If

    ' s1 should always be the shorter of the two strings:
    If n > m Then
        s3 = s2
        s2 = s1
        s1 = s3
        n = Len(s1)
        m = Len(s2)
    End If

    ' Special case: s1 is an exact substring of s2.
    ' Note that this must add to iScore, not overwrite it, or recursive
    ' calls will drop the score accumulated so far.
    If InStr(1, s2, s1, Compare) Then
        SumOfCommonStrings = iScore + n
        Exit Function
    End If

    For len1 = n To 1 Step -1
        For i = 1 To n - len1 + 1

            subs1 = Mid(s1, i, len1)
            j = InStr(1, s2, subs1, Compare)

            If j > 0 Then

                ' We've found a matching substring...
                iScore = iScore + len1

                ' Reinstate this Debug.Print statement to monitor the function:
                ' Debug.Print s1 & " substring (len=" & len1 & ") " & subs1 & " in " & s2 & " Scores " & len1

                ' Now clip out this substring from s1 and s2...
                ' However, we can't just concatenate the fragments before and
                ' after this deletion and restart: substrings that span this
                ' artificial join might give spurious matches. So we run the
                ' function on the separate 'before' and 'after' pieces. Note
                ' that running before1 vs before2 and after1 vs after2, without
                ' running before1 vs after2 and before2 vs after1, introduces
                ' a sequence bias. This may be undesirable, as the effect will
                ' be to discard match scores for transposed words in a sentence.

                If i > 1 And j > 1 Then
                    sBefore1 = Left(s1, i - 1)
                    sBefore2 = Left(s2, j - 1)
                    iScore = SumOfCommonStrings(sBefore1, sBefore2, Compare, iScore)
                End If

                If i + len1 < n And j + len1 < m Then
                    sAfter1 = Right(s1, n + 1 - i - len1)
                    sAfter2 = Right(s2, m + 1 - j - len1)
                    iScore = SumOfCommonStrings(sAfter1, sAfter2, Compare, iScore)
                End If

                SumOfCommonStrings = iScore

                ' No further recursion: don't double-count substrings
                ' of a matched substring!
                Exit Function

            ' Reinstate this 'Else' block to monitor the function:
            'Else
            '    Debug.Print s1 & " substring (len=" & len1 & ") " & subs1 & " in " & s2
            End If

        Next
    Next

End Function
```
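If you want to play with the scoring logic outside Excel, here's a rough Python sketch of the same recursion. It's my translation rather than a line-for-line port - it folds the case-insensitive default into a `lower()` call and, unlike the VBA, recurses into one-character tail fragments - but it reproduces the scores discussed in this post:

```python
def sum_of_common_strings(s1: str, s2: str) -> int:
    """Score how much of the shorter string is built from substrings of the longer."""
    s1, s2 = s1.lower(), s2.lower()      # vbTextCompare is case-insensitive
    if len(s1) > len(s2):
        s1, s2 = s2, s1                  # s1 should always be the shorter string
    if not s1:
        return 0
    if s1 in s2:                         # special case: s1 is an exact substring of s2
        return len(s1)
    n = len(s1)
    for length in range(n, 0, -1):       # try the longest substrings first
        for i in range(n - length + 1):
            sub = s1[i:i + length]
            j = s2.find(sub)
            if j >= 0:
                # Score the match, then recurse on the 'before' and 'after'
                # fragments separately; pairing before-with-before and
                # after-with-after is what gives the sequence sensitivity.
                return (length
                        + sum_of_common_strings(s1[:i], s2[:j])
                        + sum_of_common_strings(s1[i + length:], s2[j + length:]))
    return 0
```

As the comments in the function warn, convert the raw score to a similarity by dividing by the length of the *longer* string: 'Wednesday' vs 'WednesXXXday' scores 9, so the similarity is 9/12, not 9/9.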

There is room for improvement, and I suspect that the embedded sequence sensitivity is a drawback in some applications. Consider these two addresses:

> The House of Cards,
> 11 High Street,

and

> House of Cards, The
> 11 High Street

They are clearly the same address, and this isn't even a typing error: moving 'The' to the end of the line (or titles like 'Mr' and 'Mrs') is accepted secretarial practice in lists that are sorted alphabetically. But my algorithm treats the transposed word 'The' as an insertion, applying a penalty without making any attempt to identify it as a transposition. In fact, it double-counts the transposition as two points of difference - deletion plus insertion - just like the unmodified Levenshtein edit-distance algorithm. So feel free to rewrite my code where it splits the test words into 'before' and 'after' fragments and resumes the search for matching substrings - but be warned, this is not as simple as you might think, and I don't see any obvious analogy with Damerau's extension of the Levenshtein algorithm. In practice the brutal excision of articles and titles from addresses is the most reliable approach.
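For what it's worth, the 'brutal excision' can be as simple as a normalisation pass before scoring. Here's a sketch, in Python for brevity; the noise-word list is entirely my own guess at what to strip, so extend it for your own data:

```python
import re

# Words to excise before comparison: articles and common titles.
# This list is a placeholder, not a recommendation.
NOISE = {"the", "a", "an", "mr", "mrs", "ms", "dr"}

def normalise(line: str) -> str:
    """Lower-case, strip punctuation, and drop articles/titles."""
    words = re.findall(r"[a-z0-9]+", line.lower())
    return " ".join(w for w in words if w not in NOISE)

# Both forms of the address reduce to the same key:
print(normalise("The House of Cards,"))    # house of cards
print(normalise("House of Cards, The"))    # house of cards
```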

**A parting shot:**

I have a vague suspicion that this sum-of-longest-common-strings algorithm is functionally equivalent to the Levenshtein edit distance, but I lack the logical tools to attempt a formal proof. Would anyone care to offer some pointers? I think it's time I moved beyond simple hacks and started putting this stuff on a firmer foundation.

I did something like this at my previous job. I believe I just took one string and compared it to another, scoring each character by how close it was to its position in the original string: a character in the same spot scored 1/1, otherwise 1/(number of positions displaced). Then I would divide the total added score by the length of the other string. Hope that makes sense. I would then compare the scores against all the other strings I was comparing with, and if the highest score was over a certain percentage I would do a manual check with a message box to decide if they were the same or not. Writing that code made my boring job much more exciting.
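(If I've read that scheme correctly, it might look something like this in Python - the `1 / (1 + displacement)` formula is my reading of the description above, so treat it as a sketch:)

```python
def positional_score(target: str, candidate: str) -> float:
    """Score each target character by how far away its nearest match sits."""
    if not target or not candidate:
        return 0.0
    total = 0.0
    for i, ch in enumerate(target):
        # positions where this character occurs in the candidate string
        positions = [j for j, c in enumerate(candidate) if c == ch]
        if positions:
            displacement = min(abs(j - i) for j in positions)
            total += 1.0 / (1 + displacement)   # same spot scores 1/1
    return total / max(len(target), len(candidate))
```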

You're moving towards Levenshtein edit distance with this approach:

http://en.wikipedia.org/wiki/Levenshtein_distance

There's nothing wrong with that - indeed, edit distance algorithms are the 'gold standard' for a rigorously-derived measure of difference between two strings - but they are very slow in VBA; and the whole point of the 'longest common string' approach is to use VBA for the role it performs well: a presentation layer, optimised for a rich user interface at the cost of raw speed, which achieves high performance by calling and controlling fast functions implemented in other languages.

Just wanted to let you know that there's a fairly serious bug in this code. Line 81 in the section that checks for the special case where s1 is a substring of s2 reads

SumOfCommonStrings = n

it should read

SumOfCommonStrings = iScore + n

Otherwise when the function is called recursively, it can drop the original iScore. Eg "eoakey" and "edwardoakey" returns 1 instead of 6, because when the recursive call on "e" and "edward" completes, it sets the iScore to 1 rather than adding the 1 to the 5 from the match on "oakey".

I got the code from StackOverflow, so if you could update it there as well, that'd be great.

Bug notwithstanding, this was very helpful. Thanks.

PS. Sorry if this is a repeat, I didn't get confirmation that my post sent the first time.

Thanks Fang, I'll have a look at that.
