Monday 1 March 2010

VLookup() with fuzzy-matching to get a 'closest match' result


Ever had to look up a name or an address in a list that doesn't quite match, so the standard Excel 'VLookup()' and 'Match()' functions don't help you?
Here's the solution:


The answer is 'Fuzzy Matching', a process of applying rules and picking out a 'best fit'. In the case of text - words, addresses, alphanumeric codes like ISIN identifiers for shares, bonds and other financial instruments - the best approach is the Levenshtein Edit Distance algorithm, a measure of how many characters were changed or moved to get from one text to another; but I've gone for a simplified approach that adds up how much of text A is made up of recognisable fragments of text B.
It works surprisingly well, but there are some limitations. There's a more detailed discussion of this in the preceding post: feel free to offer some suggestions and comments there: what you're reading here is the application of these abstract principles to spreadsheets. Click on the image: it'll show you how the function is used.
What's missing is a version of Match() (you can code that up for yourselves), and a dedicated 'FuzzyVLookupAddress()' function, tweaked to deal with the peculiar things that we do to postal addresses to make them extraordinarily difficult for computers (and spreadsheets!) to read. Instead, I've provided a crude 'NormaliseAddress' function, which standardises all those troublesome Aves, Avenues, Streets, Saints and St.'s. Apply it to the addresses in your list (and to the search term you're trying to look up) and let me know how you get on.
Enough witter: the code's below. It's been tested for cut-and-paste out of Blogger, but there's always a health warning: Blogger will munge the line breaks, and so will whatever you're using to view this post, especially if it's an RSS feed.




Option Explicit


Public Function FuzzyVLookup(Lookup_Value As String, _
                             Table_Array As Variant, _
                             Optional Col_Index_Num As Integer = 1, _
                             Optional Compare As VbCompareMethod = vbTextCompare _
                            ) As Variant

' Find the best match for a given string in column 1 of an array of data obtained from an Excel range
' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive, unless you specify 'Compare' as 0 or vbBinaryCompare

' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the FuzzyMatchScore() function on this index value to reveal the fuzzy-matching 'score' and
' discard all results below a threshold value. Feel free to code up a 'threshold' parameter!

' If you are looking up names and addresses, use the NormaliseAddress() function on your search term and
' searched population to standardise abbreviations and word-order conventions used in British addresses.

' THIS CODE IS IN THE PUBLIC DOMAIN


Application.Volatile False

Dim dblBestMatch As Double

Dim iRowBest    As Integer
Dim dblMatch    As Double
Dim iRow        As Integer
Dim strTest     As String
Dim strInput    As String

Dim iStartCol   As Integer
Dim iEndCol     As Integer
Dim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
     'Table_Array is not an array
     FuzzyVLookup = "#VALUE"
    Exit Function
End If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
    Table_Array = Table_Array.Value
End If

' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.


iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol


Col_Index_Num = Col_Index_Num - iOffset

If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
     'Out-of-bounds
     FuzzyVLookup = "#VALUE"
    Exit Function
End If



    strInput = UCase(Lookup_Value)

    iRowBest = -1
    dblBestMatch = 0

    For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)

        strTest = ""
        strTest = Table_Array(iRow, iStartCol)

        dblMatch = 0
        dblMatch = FuzzyMatchScore(strInput, strTest, Compare)

        If dblMatch = 1 Then ' Bail out on finding an exact match
            iRowBest = iRow
            Exit For
        End If

        If dblMatch > dblBestMatch Then
            dblBestMatch = dblMatch
            iRowBest = iRow
        End If

    Next iRow


    If iRowBest = -1 Then
        FuzzyVLookup = "#NO MATCH"
        Exit Function
    End If

    FuzzyVLookup = Table_Array(iRowBest, Col_Index_Num)

End Function
   
Public Function FuzzyHLookup(Lookup_Value As String, _
                             Table_Array As Variant, _
                             Optional Row_Index_Num As Integer = 1, _
                             Optional Compare As VbCompareMethod = vbTextCompare)

' Find the best match for a given string in Row 1 of an array of data obtained from an Excel range
' This is functionally similar to HLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive, unless you specify 'Compare' as vbTextBinary.

' If your data quality is poor, you are advised to display the retrieved index value from row 1
' and use the FuzzyMatchScore() function on this index value to reveal the fuzzy-matching 'score' and
' discard all results below a threshold value. Feel free to code up a 'threshold' parameter!

' If you are looking up names and addresses, use the NormaliseAddress() function on your search term and
' searched population to standardise abbreviations and word-order conventions used in British addresses.

' THIS CODE IS IN THE PUBLIC DOMAIN


Application.Volatile False

Dim dblBestMatch As Double

Dim iColBest    As Integer
Dim dblMatch    As Double
Dim iCol        As Integer
Dim strTest     As String
Dim strInput    As String

Dim iStartRow   As Integer
Dim iEndRow     As Integer
Dim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
     'Table_Array is not an array
     FuzzyHLookup = "#VALUE"
    Exit Function
End If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
    Table_Array = Table_Array.Value
End If

' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.


iStartRow = LBound(Table_Array, 1)
iEndRow = UBound(Table_Array, 1)
iOffset = 1 - iStartRow


Row_Index_Num = Row_Index_Num - iOffset

If Row_Index_Num > iEndRow Or Row_Index_Num < iStartRow Then
     'Out-of-bounds
     FuzzyHLookup = "#VALUE"
    Exit Function
End If


    strInput = UCase(Lookup_Value)

    iColBest = -1
    dblBestMatch = 0

    For iCol = LBound(Table_Array, 2) To UBound(Table_Array, 2)

        strTest = ""
        strTest = Table_Array(iStartRow, iCol)

        dblMatch = 0
        dblMatch = FuzzyMatchScore(strInput, strTest, Compare)

        If dblMatch = 1 Then ' Bail out on finding an exact match
            iColBest = iCol
            Exit For
        End If

        If dblMatch > dblBestMatch Then
            dblBestMatch = dblMatch
            iColBest = iCol
        End If

    Next iCol


    If iColBest = -1 Then
        FuzzyHLookup = "#NO MATCH"
        Exit Function
    End If

    FuzzyHLookup = Table_Array(Row_Index_Num, iColBest)

End Function


Public Function FuzzyMatchScore(ByVal str1 As String, _
                                ByVal str2 As String, _
                                Optional Compare As VbCompareMethod = vbTextCompare _
                                ) As Double

' Returns an estimate of how closely word 1 matches word 2: this is best displayed as a percentage
' This is calculated as the fraction of the longer string that is made up of recognisable fragments of the shorter string
' There is no support for wildcards and regular expressions. Case-sensitivity is determined by the 'compare' parameter

' THIS CODE IS IN THE PUBLIC DOMAIN


Application.Volatile False

Dim maxLen As Integer
Dim minLen As Integer

    If str1 = str2 Then
        FuzzyMatchScore = 1#
        Exit Function
    End If

    If Len(str1) > Len(str2) Then
        maxLen = Len(str1)
        minLen = Len(str2)
    Else
        maxLen = Len(str2)
        minLen = Len(str1)
    End If

    If Len(str1) = 0 Or Len(str2) = 0 Then
        FuzzyMatchScore = 0#
    Else

        FuzzyMatchScore = 0#
        FuzzyMatchScore = SumOfCommonStrings(str1, str2, Compare) / maxLen

    End If
   
End Function

Public Function SumOfCommonStrings( _
                            ByVal s1 As String, _
                            ByVal s2 As String, _
                            Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
                            Optional iScore As Integer = 0 _
                                ) As Integer

Application.Volatile False

' N.Heffernan 06 June 2006 (somewhere over Newfoundland)
' THIS CODE IS IN THE PUBLIC DOMAIN


' Function to measure how much of String 1 is made up of substrings found in String 2

' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So it would be better to score
' the 'Wed' as well as the 'esday' and add up the total matched

' Watch out for strings of differing lengths:
'
'    SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
'     SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.


' This is coded for clarity, not for performance.


Dim arr() As Integer    ' Scoring matrix
Dim n As Integer        ' length of s1
Dim m As Integer        ' length of s2
Dim i As Integer        ' start position in s1
Dim j As Integer        ' start position in s2
Dim subs1 As String     ' a substring of s1
Dim len1 As Integer     ' length of subs1

Dim sBefore1            ' documented in the code
Dim sBefore2
Dim sAfter1
Dim sAfter2

Dim s3 As String


SumOfCommonStrings = iScore

n = Len(s1)
m = Len(s2)

If s1 = s2 Then
    SumOfCommonStrings = n
    Exit Function
End If

If n = 0 Or m = 0 Then
    Exit Function
End If

's1 should always be the shorter of the two strings:
If n > m Then
    s3 = s2
    s2 = s1
    s1 = s3
    n = Len(s1)
    m = Len(s2)
End If

n = Len(s1)
m = Len(s2)

' Special case: s1 is n exact substring of s2
If InStr(1, s2, s1, Compare) Then
    SumOfCommonStrings = n
    Exit Function
End If

For len1 = n To 1 Step -1

    For i = 1 To n - len1 + 1

        subs1 = Mid(s1, i, len1)
        j = 0
        j = InStr(1, s2, subs1, Compare)
       
        If j > 0 Then
       
            ' We've found a matching substring...
            iScore = iScore + len1           

          ' Now clip out this substring from s1 and s2...
          ' And search the fragments before and after this excision:

       
            If i > 1 And j > 1 Then
                sBefore1 = left(s1, i - 1)
                sBefore2 = left(s2, j - 1)
                iScore = SumOfCommonStrings(sBefore1, _
                                            sBefore2, _
                                            Compare, _
                                            iScore)
            End If
   
   
            If i + len1 < n And j + len1 < m Then
                sAfter1 = right(s1, n + 1 - i - len1)
                sAfter2 = right(s2, m + 1 - j - len1)
                iScore = SumOfCommonStrings(sAfter1, _
                                            sAfter2, _
                                            Compare, _
                                            iScore)
            End If
   
   
            SumOfCommonStrings = iScore
            Exit Function

        End If

    Next


Next


End Function


Private Function Minimum(ByVal a As Integer, _
                         ByVal b As Integer, _
                         ByVal c As Integer) As Integer
Dim min As Integer

  min = a

  If b < min Then
        min = b
  End If

  If c < min Then
        min = c
  End If

  Minimum = min

End Function



Public Function NormaliseAddress(ByVal strAddress As String) As String
Application.Volatile False
' This function is intended to remove or standardise common phrases
' and abbreviations used in British postal addresses, allowing the use
' of string-comparison algorithms in lists of names and addresses.

' Developers in other countries should review the word list used here,
' as conventions probably differ in your local language or dialect.


strAddress = " " & UCase(strAddress) & " "

strAddress = Substitute(strAddress, ",", " ")
strAddress = Substitute(strAddress, ".", " ")
strAddress = Substitute(strAddress, "-", " ")
strAddress = Substitute(strAddress, vbCrLf, " ")
strAddress = Substitute(strAddress, " BLVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " BVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " AV ", " AVENUE ")
strAddress = Substitute(strAddress, " AVE ", " AVENUE ")
strAddress = Substitute(strAddress, " RD ", " ROAD ")
strAddress = Substitute(strAddress, " WY ", " WAY ")
strAddress = Substitute(strAddress, " EST ", " ESTATE ")
strAddress = Substitute(strAddress, " PL ", " PLACE ")
strAddress = Substitute(strAddress, " PK ", " PARK ")
strAddress = Substitute(strAddress, " HSE ", " HOUSE ")
strAddress = Substitute(strAddress, " H0 ", " HOUSE ")
strAddress = Substitute(strAddress, " GDNS ", " GARDENS ")

strAddress = Substitute(strAddress, "&", "AND")
strAddress = Substitute(strAddress, " LIMITED ", " LTD ")
strAddress = Substitute(strAddress, " COMPANY ", " CO ")
strAddress = Substitute(strAddress, " CORPORATION ", " CORP ")
strAddress = Substitute(strAddress, " T/A ", " TA ")
strAddress = Substitute(strAddress, " TRADING AS ", " TA ")

' Common personal titles: these are often applied inconsistently or
' omitted, and must therefore be removed. Specific applications may
' require additional titles and their abbreviations - military rank,
' academic titles and degrees, courtesy titles of the aristocracy,
' knighthoods and honours (particularly for lists of civil servants)


strAddress = Substitute(strAddress, " ESQ ", " ")
strAddress = Substitute(strAddress, " MR ", " ")
strAddress = Substitute(strAddress, " MRS ", " ")
strAddress = Substitute(strAddress, " MISS ", " ")
strAddress = Substitute(strAddress, " MS ", " ")
strAddress = Substitute(strAddress, " MESSRS ", " ")
strAddress = Substitute(strAddress, " SIR ", " ")
strAddress = Substitute(strAddress, " OF ", " ")
strAddress = Substitute(strAddress, " DR ", " ")
strAddress = Substitute(strAddress, " OR ", " ")
strAddress = Substitute(strAddress, " IN ", " ")
strAddress = Substitute(strAddress, " THE ", " ")
strAddress = Substitute(strAddress, " REVEREND ", " REV ")
strAddress = Substitute(strAddress, " REVERENT ", " REV ")
strAddress = Substitute(strAddress, " HONOURABLE ", " HON ")
strAddress = Substitute(strAddress, " BROS ", " BROTHERS ")
strAddress = Substitute(strAddress, " ASSOC ", " ASSOCIATION ")
strAddress = Substitute(strAddress, " ASSN ", " ASSOCIATION ")

' Standardising 'St.', 'St', and 'Street'. Note that there are over 40 English
' towns and place names that contain or consist entirely of the word 'Street'.
' In addition, 'St' is a common abbreviation for 'Saint' in addresses.

' I have never seen a list of addresses where 'Street' and 'St' were used in a
' consistent way, and the only workable solution is to delete them all:


strAddress = Substitute(strAddress, " STREET ", " ")
strAddress = Substitute(strAddress, " ST ", " ")
strAddress = Substitute(strAddress, " STR ", " ")

Do While InStr(strAddress, "  ") > 0
    strAddress = Substitute(strAddress, "  ", " ")
Loop

strAddress = Trim(strAddress)

NormaliseAddress = strAddress

End Function


Public Function StripChars(myString As String, ParamArray Exceptions()) As String

' Strip out all non-alphanumeric characters from a string in a single pass
' Exceptions parameters allow you to retain specific characters (eg: spaces)

' THIS CODE IS IN THE PUBLIC DOMAIN


Application.Volatile False

Dim i As Integer
Dim iLen As Integer
Dim chrA As String * 1
Dim intA As Integer
Dim j As Integer
Dim iStart As Integer
Dim iEnd As Integer

If Not IsEmpty(Exceptions()) Then
    iStart = LBound(Exceptions)
    iEnd = UBound(Exceptions)
End If

iLen = Len(myString)

For i = 1 To iLen
    chrA = Mid(myString, i, 1)
    intA = Asc(chrA)
    Select Case intA
    Case 48 To 57, 65 To 90, 97 To 122
        StripChars = StripChars & chrA
    Case Else
        If Not IsEmpty(Exceptions()) Then
            For j = iStart To iEnd
                If chrA = Exceptions(j) Then
                    StripChars = StripChars & chrA
                    Exit For ' j
                End If
            Next j
        End If
    End Select
Next i



End Function


Private Function Substitute(ByVal Text As String, _
                            ByVal Old_Text As String, _
                            ByVal New_Text As String, _
                            Optional Instance As Long = 0, _
                            Optional Compare As VbCompareMethod = vbTextCompare _
                            ) As String


' Replace all instances (or the nth instance ) of 'Old' text with 'New'
' Unlike VB.Mid$ this method is not sensitive to length and can replace ALL instances
' This is not exposed as a Public function because there is an Excel Worksheet function
' called Substitute(). However, Workheet Functions have length constraints.

' THIS CODE IS IN THE PUBLIC DOMAIN


Dim iStart As Long
Dim iEnd As Long
Dim iLen As Long
Dim iInstance As Long
Dim strOut As String

iLen = Len(Old_Text)

If iLen = 0 Then
    Substitute = Text
    Exit Function
End If

iEnd = 0
iStart = 1

iEnd = InStr(iStart, Text, Old_Text, Compare)

If iEnd = 0 Then
    Substitute = Text
    Exit Function
End If


strOut = ""

Do Until iEnd = 0

    strOut = strOut & Mid$(Text, iStart, iEnd - iStart)
    iInstance = iInstance + 1

    If Instance = 0 Or Instance = iInstance Then
        strOut = strOut & New_Text
    Else
        strOut = strOut & Mid$(Text, iEnd, Len(Old_Text))
    End If

    iStart = iEnd + iLen
    iEnd = InStr(iStart, Text, Old_Text, Compare)

Loop

iLen = Len(Text)
strOut = strOut & Mid$(Text, iStart, iLen - iEnd)

Substitute = strOut

End Function



This is something I coded up on an obscure personal blog a couple of years ago: I get one or two enquiries about it every year (mostly appreciative) so I've tidied it up, tested it more thoroughly, and released it into the wild on Excellerando.

4 comments:

  1. Nigel - I believe I found an error in the SumOfCommonStrings subroutine. When the subroutine exits early as a result of the early test, SumOfCommonStrings is set to the length of s1 if s1=s2, or if instr(1,s2,s1,Compare)>0, while it SHOULD be setting SumOfCommonStrings = iScore PLUS the length of s1 in these two instances. For some reason, any prior iScore is forgotten in the case of a recursive call!

    Attached, is my proposed solution (see comments by dlmille in the code):

    Sub SumOfCommonStrings(...)
    ...

    If s1 = s2 Then
    SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
    Exit Function
    End If
    ...

    ' Special case: s1 is n exact substring of s2
    If InStr(1, s2, s1, Compare) Then
    SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
    Exit Function
    End If

    ...

    Please advise if you concur.

    Dave

    ReplyDelete
  2. Nigel - I believe the SumOfCommonStrings() routine has an error which is easily corrected.

    When recursive calls are made, the early tests/exits assign SumOfCommonStrings to the length of s1, when it should be iScore PLUS the length of s1.

    Proposed solution (see dlmille comments):

    Sub SumOfCommonStrings(...)
    ...
    f s1 = s2 Then
    SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
    Exit Function
    End If

    If n = 0 Or m = 0 Then
    Exit Function
    End If

    's1 should always be the shorter of the two strings:
    If n > m Then
    s3 = s2
    s2 = s1
    s1 = s3
    n = Len(s1)
    m = Len(s2)
    End If

    n = Len(s1)
    m = Len(s2)

    ' Special case: s1 is n exact substring of s2
    If InStr(1, s2, s1, Compare) Then
    SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
    Exit Function
    End If
    ...
    End Sub

    Please advise if you concur.

    Thanks!

    Dave

    ReplyDelete
  3. Apologies for the delay in replying - or noticing - these comments. I received no email notification from Bloogger, and only read them when, by accident, I changed the scripting settings in my browser and observed a popup window about comments awaiting 'moderation'... I will test your code as sonn as I get the opportunity.

    ReplyDelete
  4. Thank you very much for posting this - very useful today comparing some lists of wines!

    ReplyDelete