Saturday, 27 February 2010

String-Comparison in VBA: a modified Longest-Common-String approach

Summary:

I discuss the use of a sum-of-longest-common-strings algorithm to measure the degree of difference between strings. This is efficient in VBA, and can be used as the basis of a closest-match alternative to VLookup and Match.


The Details...

A couple of years ago, I looked at writing a fuzzy-matching version of VLookup, to return the closest match to a search string rather than the #N/A error that comes back from the standard Excel function. I posted it on an obscure personal blog and forgot about it. But I'll be posting it here shortly, and this post is the introduction to it, with an explanation of the principles and the core VBA function that drives the whole thing.

Originally, I looked at using a Levenshtein 'Edit Distance' algorithm to compare and measure the degree of difference between strings. (Thanks are due to the inestimable Mr. Crowley for pointing me towards some basic C++ and theory-of-algorithms links.) But field-testing showed that a simpler and faster approach was required: I needed something that gives consistent results on longer strings, like addresses and sentences, without the need for a separate scoring process that examines the word order.
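
For anyone who hasn't met it: the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A bare-bones VBA rendering of the standard dynamic-programming algorithm looks something like this (a sketch for illustration, not the field-tested code I refer to above):

Public Function LevenshteinDistance( _
            ByVal s1 As String, _
            ByVal s2 As String _
                ) As Long

' Classic dynamic-programming edit distance: d(i, j) holds the cost of
' turning the first i characters of s1 into the first j characters of s2.
' Note that this comparison is case-sensitive.

Dim d() As Long
Dim i As Long, j As Long, cost As Long

ReDim d(0 To Len(s1), 0 To Len(s2))

For i = 0 To Len(s1): d(i, 0) = i: Next    ' delete everything from s1
For j = 0 To Len(s2): d(0, j) = j: Next    ' insert everything from s2

For i = 1 To Len(s1)
    For j = 1 To Len(s2)
        If Mid$(s1, i, 1) = Mid$(s2, j, 1) Then cost = 0 Else cost = 1
        d(i, j) = d(i - 1, j) + 1                                                   ' deletion
        If d(i, j - 1) + 1 < d(i, j) Then d(i, j) = d(i, j - 1) + 1                 ' insertion
        If d(i - 1, j - 1) + cost < d(i, j) Then d(i, j) = d(i - 1, j - 1) + cost   ' substitution
    Next
Next

LevenshteinDistance = d(Len(s1), Len(s2))

End Function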

The simplest approach of all to comparing and scoring for similarity is searching for the longest common string. This has the obvious advantages of speed and simplicity, but it also has a weak point: simple LCS algorithms are unduly sensitive to single-letter substitutions and deletions near the midpoint of the test word. For example, 'Wednesday' is obviously closer to 'WedXesday' on an edit-distance basis than it is to 'WednesXXX', but the latter contains the longer common string despite having more substitutions. This suggests that it would be better to score the 'Wed' as well as the 'esday', adding up all the matching substrings instead of just measuring the longest one.
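
To put numbers on that example, here is a hand-traced run of the function listed below (my own figures; they are easy to check in the VBE Immediate window):

' With SumOfCommonStrings (below) in a standard module:
' ? SumOfCommonStrings("Wednesday", "WedXesday")
'   returns 8: 'esday' scores 5, then the leading 'Wed' fragment scores 3
' ? SumOfCommonStrings("Wednesday", "WednesXXX")
'   returns 6: 'Wednes' scores 6, and the leftover 'day' finds no match in 'XXX'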

It turns out that the recursive algorithm I'm using to do this has an embedded sequence-sensitivity; in theory, this is a complication and a pretty heavy hint that there's some error in my logic that I ought to investigate and remove. In practice, a degree of sequence-sensitivity works well when we compare two sentences or phrases: this 'error' is a pretty good proxy for compiling a secondary score based on similarities in their word order.

Which goes to show that serendipity comes from simplicity. And if you strip out the comments, this function is a commendably compact piece of code:

Public Function SumOfCommonStrings( _
            ByVal s1 As String, _
            ByVal s2 As String, _
            Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
            Optional iScore As Integer = 0 _
                ) As Integer

Application.Volatile False

' N.Heffernan 06 June 2006 (somewhere over Newfoundland)
' THIS CODE IS IN THE PUBLIC DOMAIN


' Function to measure how much of String 1 is made up of substrings found in String 2

' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So would it be better to score
' the 'Wed' as well as the 'esday'?

' Watch out for strings of differing lengths:
'
'    SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
'     SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.


' This is coded for clarity, not for performance.


Dim n As Integer         ' length of s1
Dim m As Integer         ' length of s2
Dim i As Integer         ' start position in s1
Dim j As Integer         ' start position in s2
Dim subs1 As String      ' a substring of s1
Dim len1 As Integer      ' length of subs1

Dim sBefore1 As String   ' fragment of s1 before a matched substring
Dim sBefore2 As String   ' fragment of s2 before the match
Dim sAfter1 As String    ' fragment of s1 after a matched substring
Dim sAfter2 As String    ' fragment of s2 after the match

Dim s3 As String         ' swap buffer for s1 and s2


SumOfCommonStrings = iScore   ' default return: the score accumulated so far by the caller

n = Len(s1)
m = Len(s2)

If s1 = s2 Then
    SumOfCommonStrings = iScore + n   ' identical strings score their full length, on top of any accumulated score
    Exit Function
End If

If n = 0 Or m = 0 Then
    Exit Function
End If

's1 should always be the shorter of the two strings:

If n > m Then
    s3 = s2
    s2 = s1
    s1 = s3
    n = Len(s1)
    m = Len(s2)
End If

' Special case: s1 is an exact substring of s2
If InStr(1, s2, s1, Compare) Then
    SumOfCommonStrings = iScore + n   ' add n to the running total: returning bare n would drop the caller's accumulated score in recursive calls
    Exit Function
End If

For len1 = n To 1 Step -1

    For i = 1 To n - len1 + 1

        subs1 = Mid(s1, i, len1)
        j = InStr(1, s2, subs1, Compare)

        If j > 0 Then

            ' We've found a matching substring...
            iScore = iScore + len1

            ' Reinstate this Debug.Print statement to monitor the function:
            ' Debug.Print s1 & " substring (len=" & len1 & ") " & subs1 & " in " & s2 & " Scores " & len1

            ' Now clip out this substring from s1 and s2...

            ' However, we can't just concatenate the fragments before and
            ' after this deletion and restart: substrings that span this
            ' artificial join might give spurious matches. So we run the
            ' function on the separate 'before' and 'after' pieces. Note
            ' that running before1 vs before2 and after1 vs after2, without
            ' running before1 vs after2 and before2 vs after1, introduces
            ' a sequence bias. This may be undesirable, as the effect will
            ' be to discard match scores for transposed words in a sentence.
    
            If i > 1 And j > 1 Then
                sBefore1 = Left$(s1, i - 1)
                sBefore2 = Left$(s2, j - 1)
                iScore = SumOfCommonStrings(sBefore1, _
                                            sBefore2, _
                                            Compare, _
                                            iScore)
            End If
    
    
            If i + len1 < n And j + len1 < m Then
                sAfter1 = Right$(s1, n + 1 - i - len1)
                sAfter2 = Right$(s2, m + 1 - j - len1)
                iScore = SumOfCommonStrings(sAfter1, _
                                            sAfter2, _
                                            Compare, _
                                            iScore)
            End If
    
    
            SumOfCommonStrings = iScore
            ' No further recursion: don't double-count substrings of a matched substring!
            Exit Function

        ' Reinstate this 'Else' block to monitor the function:
        'Else
        '    ' No match for this substring: no action required.
        '    ' Debug.Print s1 & " substring (len=" & len1 & ") " & subs1 & " not found in " & s2

        End If

    Next


Next


End Function
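
One caveat before using the score: as the comments in the code warn, the raw result has to be scaled by the length of the longer string before two scores are comparable. A minimal sketch of such a wrapper (StringSimilarity is a name invented here, not one of the lookup functions to come):

Public Function StringSimilarity( _
            ByVal s1 As String, _
            ByVal s2 As String, _
            Optional Compare As VBA.VbCompareMethod = vbTextCompare _
                ) As Double

' Scale the raw SumOfCommonStrings score into a similarity between
' 0 and 1. Dividing by the length of the LONGER string guards against
' the differing-lengths trap described in the comments above.

Dim lngLen As Long

lngLen = Len(s1)
If Len(s2) > lngLen Then
    lngLen = Len(s2)
End If

If lngLen = 0 Then
    StringSimilarity = 1    ' treat two empty strings as identical
Else
    StringSimilarity = SumOfCommonStrings(s1, s2, Compare) / lngLen
End If

End Function

So StringSimilarity("Wednesday", "WednesXXXday") comes out at 9/12 = 0.75, rather than the deceptive perfect score that a naive division by the shorter length would give.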

There is room for improvement, and I suspect that the embedded sequence sensitivity is a drawback in some applications. Consider these two addresses:

    The House of Cards,
    11 High Street,

and

    House of Cards, The
    11 High Street

They are clearly the same address, and this isn't even a typing error: moving 'The' (or titles like 'Mr' and 'Mrs') to the end of the line is accepted secretarial practice in lists that are sorted alphabetically. But my algorithm treats the transposed word 'The' as an insertion, applying a penalty without making any attempt to identify it as a transposition. In fact, it double-counts the transposition as two points of difference - deletion plus insertion - just like the unmodified Levenshtein edit-distance algorithm. So feel free to rewrite my code where it splits the test words into 'before' and 'after' fragments and resumes the search for matching substrings - but be warned, this is not as simple as you might think, and I don't see any obvious analogy with Damerau's extension of the Levenshtein algorithm. In practice, the brutal excision of articles and titles from addresses is the most reliable approach.
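
For what it's worth, the excision can be as blunt as the sketch below. The function name and the noise-word list are inventions for illustration; tailor the list to your own data:

Public Function StripNoiseWords(ByVal sText As String) As String

' Crude pre-processing for address matching: delete commas and common
' articles/titles, so that transpositions like 'The House...' versus
' 'House..., The' no longer attract an insertion-plus-deletion penalty.

Dim vNoise As Variant
Dim i As Long
Dim sResult As String

vNoise = Array("The", "Mr", "Mrs", "Ms", "Dr")

' Pad with spaces so that whole words match at the start and end too:
sResult = " " & Replace(sText, ",", " ") & " "

For i = LBound(vNoise) To UBound(vNoise)
    sResult = Replace(sResult, " " & vNoise(i) & " ", " ", , , vbTextCompare)
Next

' Collapse any doubled spaces left behind:
Do While InStr(sResult, "  ") > 0
    sResult = Replace(sResult, "  ", " ")
Loop

StripNoiseWords = Trim$(sResult)

End Function

Run both addresses above through this and each reduces to the same string, 'House of Cards 11 High Street', so the transposed 'The' no longer costs anything.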


A parting shot:

I have a vague suspicion that this sum-of-longest-common-strings algorithm is functionally equivalent to the Levenshtein edit distance, but I lack the logical tools to attempt a formal proof. Would anyone care to offer some pointers? I think it's time I moved beyond simple hacks and started putting this stuff on a firmer foundation.

4 comments:

  1. I did something like this at my previous job. I believe I just took one string and compared it to another, scoring each character by how close it was to its position in the original string. If the character was in the same spot it would receive a score of 1/1; if not, it would receive 1/(number of positions displaced). Then I would divide the total added score of the comparison by the length of the other string. Hope that makes sense. I would then compare the scores across all the other strings I was comparing with, and if the highest score was over a certain percentage I would do a manual check with a message box to decide if they were the same or not. Writing that code made my boring job much more exciting.
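
    In rough VBA, that scheme might look something like the sketch below (a reconstruction from the description above; the name and the details are invented):

    Public Function PositionalScore(ByVal sTest As String, ByVal sTarget As String) As Double

        ' Score each character of sTest by how close the nearest matching
        ' character in sTarget sits: 1/1 for the same position, otherwise
        ' 1/(number of positions displaced). The summed score is divided
        ' by the length of the other string.

        Dim i As Long, j As Long
        Dim dBest As Double, dTotal As Double

        If Len(sTarget) = 0 Then Exit Function

        For i = 1 To Len(sTest)
            dBest = 0
            For j = 1 To Len(sTarget)
                If Mid$(sTarget, j, 1) = Mid$(sTest, i, 1) Then
                    If i = j Then
                        dBest = 1
                        Exit For    ' an exact positional match can't be beaten
                    ElseIf 1 / Abs(i - j) > dBest Then
                        dBest = 1 / Abs(i - j)
                    End If
                End If
            Next
            dTotal = dTotal + dBest
        Next

        PositionalScore = dTotal / Len(sTarget)

    End Function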

    Replies
    1. You're moving towards Levenshtein edit distance with this approach:

      http://en.wikipedia.org/wiki/Levenshtein_distance

      There's nothing wrong with that - indeed, edit-distance algorithms are the 'gold standard' for a rigorously-derived measure of the difference between two strings - but it's very slow in VBA. The whole point of the 'longest common string' approach is to use VBA for the role it performs well: a presentation layer, optimised for the advantages of a rich user interface at the price of low raw performance, which gets its speed back by calling and controlling fast functions implemented in other languages.

  2. Just wanted to let you know that there's a fairly serious bug in this code. Line 81, in the section that checks for the special case where s1 is a substring of s2, reads

    SumOfCommonStrings = n

    it should read

    SumOfCommonStrings = iScore + n

    Otherwise, when the function is called recursively, it can drop the original iScore. E.g. "eoakey" and "edwardoakey" returns 1 instead of 6, because when the recursive call on "e" and "edward" completes, it sets the iScore to 1 rather than adding the 1 to the 5 from the match on "oakey".

    I got the code from StackOverflow, so if you could update it there as well, that'd be great.

    Bug notwithstanding, this was very helpful. Thanks.

    PS. Sorry if this is a repeat, I didn't get confirmation that my post sent the first time.
