Ever had to look up a name or an address in a list that doesn't quite match, so the standard Excel 'VLookup()' and 'Match()' functions don't help you?
The answer is 'Fuzzy Matching', a process of applying rules and picking out a 'best fit'. In the case of text - words, addresses, alphanumeric codes like ISIN identifiers for shares, bonds and other financial instruments - the best approach is the Levenshtein Edit Distance algorithm, a measure of how many characters were changed or moved to get from one text to another; but I've gone for a simplified approach that adds up how much of text A is made up of recognisable fragments of text B.
It works surprisingly well, but there are some limitations. There's a more detailed discussion of this in the preceding post: feel free to offer some suggestions and comments there: what you're reading here is the application of these abstract principles to spreadsheets. Click on the image: it'll show you how the function is used.
What's missing is a version of Match() (you can code that up for yourselves), and a dedicated 'FuzzyVLookupAddress()' function, tweaked to deal with the peculiar things that we do to postal addresses to make them extraordinarily difficult for computers (and spreadsheets!) to read. Instead, I've provided a crude 'NormaliseAddress' function, which standardises all those troublesome Aves, Avenues, Streets, Saints and St.'s. Apply it to the addresses in your list (and to the search term you're trying to look up) and let me know how you get on.
Enough witter: the code's below. It's been tested for cut-and-paste out of Blogger, but there's always a health warning: Blogger will munge the line breaks, and so will whatever you're using to view this post, especially if it's an RSS feed.
Option Explicit
Public Function FuzzyVLookup(Lookup_Value As String, _
Table_Array As Variant, _
Optional Col_Index_Num As Integer = 1, _
Optional Compare As VbCompareMethod = vbTextCompare _
) As Variant
' Find the best match for a given string in column 1 of an array of data obtained from an Excel range
' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive, unless you specify 'Compare' as 0 or vbBinaryCompare
' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the FuzzyMatchScore() function on this index value to reveal the fuzzy-matching 'score' and
' discard all results below a threshold value. Feel free to code up a 'threshold' parameter!
' If you are looking up names and addresses, use the NormaliseAddress() function on your search term and
' searched population to standardise abbreviations and word-order conventions used in British addresses.
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iRowBest As Integer
Dim dblMatch As Double
Dim iRow As Integer
Dim strTest As String
Dim strInput As String
Dim iStartCol As Integer
Dim iEndCol As Integer
Dim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
'Table_Array is not an array
FuzzyVLookup = "#VALUE"
Exit Function
End If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
Table_Array = Table_Array.Value
End If
' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol
Col_Index_Num = Col_Index_Num - iOffset
If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
'Out-of-bounds
FuzzyVLookup = "#VALUE"
Exit Function
End If
strInput = UCase(Lookup_Value)
iRowBest = -1
dblBestMatch = 0
For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)
strTest = ""
strTest = Table_Array(iRow, iStartCol)
dblMatch = 0
dblMatch = FuzzyMatchScore(strInput, strTest, Compare)
If dblMatch = 1 Then ' Bail out on finding an exact match
iRowBest = iRow
Exit For
End If
If dblMatch > dblBestMatch Then
dblBestMatch = dblMatch
iRowBest = iRow
End If
Next iRow
If iRowBest = -1 Then
FuzzyVLookup = "#NO MATCH"
Exit Function
End If
FuzzyVLookup = Table_Array(iRowBest, Col_Index_Num)
End Function
Public Function FuzzyHLookup(Lookup_Value As String, _
Table_Array As Variant, _
Optional Row_Index_Num As Integer = 1, _
Optional Compare As VbCompareMethod = vbTextCompare)
' Find the best match for a given string in Row 1 of an array of data obtained from an Excel range
' This is functionally similar to HLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive, unless you specify 'Compare' as vbTextBinary.
' If your data quality is poor, you are advised to display the retrieved index value from row 1
' and use the FuzzyMatchScore() function on this index value to reveal the fuzzy-matching 'score' and
' discard all results below a threshold value. Feel free to code up a 'threshold' parameter!
' If you are looking up names and addresses, use the NormaliseAddress() function on your search term and
' searched population to standardise abbreviations and word-order conventions used in British addresses.
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iColBest As Integer
Dim dblMatch As Double
Dim iCol As Integer
Dim strTest As String
Dim strInput As String
Dim iStartRow As Integer
Dim iEndRow As Integer
Dim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
'Table_Array is not an array
FuzzyHLookup = "#VALUE"
Exit Function
End If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
Table_Array = Table_Array.Value
End If
' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartRow = LBound(Table_Array, 1)
iEndRow = UBound(Table_Array, 1)
iOffset = 1 - iStartRow
Row_Index_Num = Row_Index_Num - iOffset
If Row_Index_Num > iEndRow Or Row_Index_Num < iStartRow Then
'Out-of-bounds
FuzzyHLookup = "#VALUE"
Exit Function
End If
strInput = UCase(Lookup_Value)
iColBest = -1
dblBestMatch = 0
For iCol = LBound(Table_Array, 2) To UBound(Table_Array, 2)
strTest = ""
strTest = Table_Array(iStartRow, iCol)
dblMatch = 0
dblMatch = FuzzyMatchScore(strInput, strTest, Compare)
If dblMatch = 1 Then ' Bail out on finding an exact match
iColBest = iCol
Exit For
End If
If dblMatch > dblBestMatch Then
dblBestMatch = dblMatch
iColBest = iCol
End If
Next iCol
If iColBest = -1 Then
FuzzyHLookup = "#NO MATCH"
Exit Function
End If
FuzzyHLookup = Table_Array(Row_Index_Num, iColBest)
End Function
Public Function FuzzyMatchScore(ByVal str1 As String, _
ByVal str2 As String, _
Optional Compare As VbCompareMethod = vbTextCompare _
) As Double
' Returns an estimate of how closely word 1 matches word 2: this is best displayed as a percentage
' This is calculated as the fraction of the longer string that is made up of recognisable fragments of the shorter string
' There is no support for wildcards and regular expressions. Case-sensitivity is determined by the 'compare' parameter
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim maxLen As Integer
Dim minLen As Integer
If str1 = str2 Then
FuzzyMatchScore = 1#
Exit Function
End If
If Len(str1) > Len(str2) Then
maxLen = Len(str1)
minLen = Len(str2)
Else
maxLen = Len(str2)
minLen = Len(str1)
End If
If Len(str1) = 0 Or Len(str2) = 0 Then
FuzzyMatchScore = 0#
Else
FuzzyMatchScore = 0#
FuzzyMatchScore = SumOfCommonStrings(str1, str2, Compare) / maxLen
End If
End Function
Public Function SumOfCommonStrings( _
ByVal s1 As String, _
ByVal s2 As String, _
Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
Optional iScore As Integer = 0 _
) As Integer
Application.Volatile False
' N.Heffernan 06 June 2006 (somewhere over Newfoundland)
' THIS CODE IS IN THE PUBLIC DOMAIN
' Function to measure how much of String 1 is made up of substrings found in String 2
' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So it would be better to score
' the 'Wed' as well as the 'esday' and add up the total matched
' Watch out for strings of differing lengths:
'
' SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
' SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.
' This is coded for clarity, not for performance.
Dim arr() As Integer ' Scoring matrix
Dim n As Integer ' length of s1
Dim m As Integer ' length of s2
Dim i As Integer ' start position in s1
Dim j As Integer ' start position in s2
Dim subs1 As String ' a substring of s1
Dim len1 As Integer ' length of subs1
Dim sBefore1 ' documented in the code
Dim sBefore2
Dim sAfter1
Dim sAfter2
Dim s3 As String
SumOfCommonStrings = iScore
n = Len(s1)
m = Len(s2)
If s1 = s2 Then
SumOfCommonStrings = n
Exit Function
End If
If n = 0 Or m = 0 Then
Exit Function
End If
's1 should always be the shorter of the two strings:
If n > m Then
s3 = s2
s2 = s1
s1 = s3
n = Len(s1)
m = Len(s2)
End If
n = Len(s1)
m = Len(s2)
' Special case: s1 is n exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n
Exit Function
End If
For len1 = n To 1 Step -1
For i = 1 To n - len1 + 1
subs1 = Mid(s1, i, len1)
j = 0
j = InStr(1, s2, subs1, Compare)
If j > 0 Then
' We've found a matching substring...
iScore = iScore + len1
' Now clip out this substring from s1 and s2...
' And search the fragments before and after this excision:
If i > 1 And j > 1 Then
sBefore1 = left(s1, i - 1)
sBefore2 = left(s2, j - 1)
iScore = SumOfCommonStrings(sBefore1, _
sBefore2, _
Compare, _
iScore)
End If
If i + len1 < n And j + len1 < m Then
sAfter1 = right(s1, n + 1 - i - len1)
sAfter2 = right(s2, m + 1 - j - len1)
iScore = SumOfCommonStrings(sAfter1, _
sAfter2, _
Compare, _
iScore)
End If
SumOfCommonStrings = iScore
Exit Function
End If
Next
Next
End Function
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim min As Integer
min = a
If b < min Then
min = b
End If
If c < min Then
min = c
End If
Minimum = min
End Function
Public Function NormaliseAddress(ByVal strAddress As String) As String
Application.Volatile False
' This function is intended to remove or standardise common phrases
' and abbreviations used in British postal addresses, allowing the use
' of string-comparison algorithms in lists of names and addresses.
' Developers in other countries should review the word list used here,
' as conventions probably differ in your local language or dialect.
strAddress = " " & UCase(strAddress) & " "
strAddress = Substitute(strAddress, ",", " ")
strAddress = Substitute(strAddress, ".", " ")
strAddress = Substitute(strAddress, "-", " ")
strAddress = Substitute(strAddress, vbCrLf, " ")
strAddress = Substitute(strAddress, " BLVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " BVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " AV ", " AVENUE ")
strAddress = Substitute(strAddress, " AVE ", " AVENUE ")
strAddress = Substitute(strAddress, " RD ", " ROAD ")
strAddress = Substitute(strAddress, " WY ", " WAY ")
strAddress = Substitute(strAddress, " EST ", " ESTATE ")
strAddress = Substitute(strAddress, " PL ", " PLACE ")
strAddress = Substitute(strAddress, " PK ", " PARK ")
strAddress = Substitute(strAddress, " HSE ", " HOUSE ")
strAddress = Substitute(strAddress, " H0 ", " HOUSE ")
strAddress = Substitute(strAddress, " GDNS ", " GARDENS ")
strAddress = Substitute(strAddress, "&", "AND")
strAddress = Substitute(strAddress, " LIMITED ", " LTD ")
strAddress = Substitute(strAddress, " COMPANY ", " CO ")
strAddress = Substitute(strAddress, " CORPORATION ", " CORP ")
strAddress = Substitute(strAddress, " T/A ", " TA ")
strAddress = Substitute(strAddress, " TRADING AS ", " TA ")
' Common personal titles: these are often applied inconsistently or
' omitted, and must therefore be removed. Specific applications may
' require additional titles and their abbreviations - military rank,
' academic titles and degrees, courtesy titles of the aristocracy,
' knighthoods and honours (particularly for lists of civil servants)
strAddress = Substitute(strAddress, " ESQ ", " ")
strAddress = Substitute(strAddress, " MR ", " ")
strAddress = Substitute(strAddress, " MRS ", " ")
strAddress = Substitute(strAddress, " MISS ", " ")
strAddress = Substitute(strAddress, " MS ", " ")
strAddress = Substitute(strAddress, " MESSRS ", " ")
strAddress = Substitute(strAddress, " SIR ", " ")
strAddress = Substitute(strAddress, " OF ", " ")
strAddress = Substitute(strAddress, " DR ", " ")
strAddress = Substitute(strAddress, " OR ", " ")
strAddress = Substitute(strAddress, " IN ", " ")
strAddress = Substitute(strAddress, " THE ", " ")
strAddress = Substitute(strAddress, " REVEREND ", " REV ")
strAddress = Substitute(strAddress, " REVERENT ", " REV ")
strAddress = Substitute(strAddress, " HONOURABLE ", " HON ")
strAddress = Substitute(strAddress, " BROS ", " BROTHERS ")
strAddress = Substitute(strAddress, " ASSOC ", " ASSOCIATION ")
strAddress = Substitute(strAddress, " ASSN ", " ASSOCIATION ")
' Standardising 'St.', 'St', and 'Street'. Note that there are over 40 English
' towns and place names that contain or consist entirely of the word 'Street'.
' In addition, 'St' is a common abbreviation for 'Saint' in addresses.
' I have never seen a list of addresses where 'Street' and 'St' were used in a
' consistent way, and the only workable solution is to delete them all:
strAddress = Substitute(strAddress, " STREET ", " ")
strAddress = Substitute(strAddress, " ST ", " ")
strAddress = Substitute(strAddress, " STR ", " ")
Do While InStr(strAddress, " ") > 0
strAddress = Substitute(strAddress, " ", " ")
Loop
strAddress = Trim(strAddress)
NormaliseAddress = strAddress
End Function
Public Function StripChars(myString As String, ParamArray Exceptions()) As String
' Strip out all non-alphanumeric characters from a string in a single pass
' Exceptions parameters allow you to retain specific characters (eg: spaces)
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim i As Integer
Dim iLen As Integer
Dim chrA As String * 1
Dim intA As Integer
Dim j As Integer
Dim iStart As Integer
Dim iEnd As Integer
If Not IsEmpty(Exceptions()) Then
iStart = LBound(Exceptions)
iEnd = UBound(Exceptions)
End If
iLen = Len(myString)
For i = 1 To iLen
chrA = Mid(myString, i, 1)
intA = Asc(chrA)
Select Case intA
Case 48 To 57, 65 To 90, 97 To 122
StripChars = StripChars & chrA
Case Else
If Not IsEmpty(Exceptions()) Then
For j = iStart To iEnd
If chrA = Exceptions(j) Then
StripChars = StripChars & chrA
Exit For ' j
End If
Next j
End If
End Select
Next i
End Function
Private Function Substitute(ByVal Text As String, _
ByVal Old_Text As String, _
ByVal New_Text As String, _
Optional Instance As Long = 0, _
Optional Compare As VbCompareMethod = vbTextCompare _
) As String
' Replace all instances (or the nth instance ) of 'Old' text with 'New'
' Unlike VB.Mid$ this method is not sensitive to length and can replace ALL instances
' This is not exposed as a Public function because there is an Excel Worksheet function
' called Substitute(). However, Workheet Functions have length constraints.
' THIS CODE IS IN THE PUBLIC DOMAIN
Dim iStart As Long
Dim iEnd As Long
Dim iLen As Long
Dim iInstance As Long
Dim strOut As String
iLen = Len(Old_Text)
If iLen = 0 Then
Substitute = Text
Exit Function
End If
iEnd = 0
iStart = 1
iEnd = InStr(iStart, Text, Old_Text, Compare)
If iEnd = 0 Then
Substitute = Text
Exit Function
End If
strOut = ""
Do Until iEnd = 0
strOut = strOut & Mid$(Text, iStart, iEnd - iStart)
iInstance = iInstance + 1
If Instance = 0 Or Instance = iInstance Then
strOut = strOut & New_Text
Else
strOut = strOut & Mid$(Text, iEnd, Len(Old_Text))
End If
iStart = iEnd + iLen
iEnd = InStr(iStart, Text, Old_Text, Compare)
Loop
iLen = Len(Text)
strOut = strOut & Mid$(Text, iStart, iLen - iEnd)
Substitute = strOut
End Function
This is something I coded up on an obscure personal blog a couple of years ago: I get one or two enquiries about it every year (mostly appreciative) so I've tidied it up, tested it more thoroughly, and released it into the wild on Excellerando.
Nigel - I believe I found an error in the SumOfCommonStrings subroutine. When the subroutine exits early as a result of the early test, SumOfCommonStrings is set to the length of s1 if s1=s2, or if instr(1,s2,s1,Compare)>0, while it SHOULD be setting SumOfCommonStrings = iScore PLUS the length of s1 in these two instances. For some reason, any prior iScore is forgotten in the case of a recursive call!
ReplyDeleteAttached, is my proposed solution (see comments by dlmille in the code):
Sub SumOfCommonStrings(...)
...
If s1 = s2 Then
SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
Exit Function
End If
...
' Special case: s1 is n exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
Exit Function
End If
...
Please advise if you concur.
Dave
Nigel - I believe the SumOfCommonStrings() routine has an error which is easily corrected.
ReplyDeleteWhen recursive calls are made, the early tests/exits assign SumOfCommonStrings to the length of s1, when it should be iScore PLUS the length of s1.
Proposed solution (see dlmille comments):
Sub SumOfCommonStrings(...)
...
f s1 = s2 Then
SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
Exit Function
End If
If n = 0 Or m = 0 Then
Exit Function
End If
's1 should always be the shorter of the two strings:
If n > m Then
s3 = s2
s2 = s1
s1 = s3
n = Len(s1)
m = Len(s2)
End If
n = Len(s1)
m = Len(s2)
' Special case: s1 is n exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n + iScore 'dlmille modified - needs to carry the score forward
Exit Function
End If
...
End Sub
Please advise if you concur.
Thanks!
Dave
Apologies for the delay in replying - or noticing - these comments. I received no email notification from Bloogger, and only read them when, by accident, I changed the scripting settings in my browser and observed a popup window about comments awaiting 'moderation'... I will test your code as sonn as I get the opportunity.
ReplyDeleteThank you very much for posting this - very useful today comparing some lists of wines!
ReplyDelete