Thursday, 23 September 2010

Hashing algorithms: Adler32 implemented in VBA

I could've sworn I posted this here, years ago, but here's an implementation of Adler-32 in 32-bit VBA.

There's a horrible hack in it: Adler-32 returns a 32-bit integer, and the VBA Long is a signed integer with a range ± (2^31) -1, so I've implemented a 'wrap around' of the overflow at +2^31, restarting at -2^31 +1. And done something I really, really shouldn't have done with a floating-point variable. Eventually everyone, everywhere, will have 64-bit Office and this'll be kind of quaint and unnecessary... Right?

Of course, the real question is: why bother?

It boils down to the common question of checking for changes: if you don't want to use the 'on change' event, or you're dealing with data directly in VBA before it hits the sheet, large data sets need something better than an item-by-item brute force approach. At least, if you're doing it more than once: the cost of rolling each item into your hash is always more than the cost of the one-by-one comparison...

...And that's still true if you're importing a fast hashing algorithm from MySQL or one of the web API libraries (try MDA5, if you can get at an exposed function), unless you can find something that reads VBA variant arrays directly and relieve your VBA thread of the task of enumerating the list values into the imported function.

Meanwhile, here's a hash algorithm that's within reach of VBA: Adler32. The details are in Wikipedia’s article on Adler32: and an hour's testing will teach you some lessons about hashing:
  1. 'Hash collisions' (differing data sets returning the same hash code) are more common than you expected, especially with data containing repeated patterns (like dates);>
  2. Choice of hashing algorithm is important;
  3. ...And that choice is more of an art than a science;
  4. Admitting that you really shouldn't have bothered and resorting to brute force is often the better part of valour.

Adler-32 is actually more useful as a tool to teach those lessons, than as a workaday checksum. It's great for detecting changes in lists of more than 100 distinct items; it's tolerable, on a list of 24 randomly-generated 8-letter words (hash collisions at 1 in 1800 attempts) and it starts giving you single-digit percentage occurrences of the hash collision error in a list of 50 not-so-distinct option maturities, where the differences are mostly in the last 10 chars and *those* ten chars are recurring 3-month maturity dates.

By the time you're comparing pairs of 6-letter strings, more than 10% of your changes will be missed by the checksum in a non-random data set. And then you realise that might as well be using string comparison for that kind of trivial computation anyway.

So the answer is always: test it.

Meanwhile, here's the algorithm, horrible hacks and all:

Option Explicit

Public Function CheckSum(ByRef ColArray As Variant) As Long
Application.Volatile False

' Returns an Adler32 checksum of all the numeric and text values in a column

' Capture data from cells as myRange.Value2 and use a 32-bit checksum to see
' if any value in the range subsequently changes. You can run this on multi-
' column ranges, but it's MUCH faster to run this separately for each column
' Note that the VBA Long Integer data type is *not* a 32-bit integer, it's a
' signed integer with a range of  ± (2^31) -1. So our return value is signed
' and return values exceeding +2^31 -1 'wraparound' and restart at -2^31 +1.

' Coding Notes:

' This is intended for use in VBA, and not for use on the worksheet. Use the
' setting  'Option Private Module' to hide CheckSum from the function wizard

' Author: Nigel Heffernan, May 2006
' Acknowledgements and thanks to Paul Crowley, who recommended Adler-32

' Please note that this code is in the public domain. Mark it clearly, with
' the author's name, and segregate it from any proprietary code if you need
' to assert ownership & commercial confidentiality on your proprietary code

Const LONG_LIMIT As Long = (2 ^ 31) - 1
Const MOD_ADLER As Long = 65521

Dim a As Long
Dim b As Long

Dim i As Long
Dim j As Long
Dim k As Long

Dim arrByte() As Byte

Dim dblOverflow As Double

If TypeName(ColArray) = "Range" Then
    ColArray = ColArray.Value2
End If

If IsEmpty(ColArray) Then
    CheckSum = 0
    Exit Function
End If

If (VarType(ColArray) And vbArray) = 0 Then
    ' single-cell range, or a scalar data type
    ReDim arrData(0 To 0, 0 To 0)
    arrData(0, 0) = CStr(ColArray)
    arrData = ColArray
End If

a = 1
b = 0

For j = LBound(arrData, 2) To UBound(arrData, 2)
    For i = LBound(arrData, 1) To UBound(arrData, 1)
        ' VBA Strings are byte arrays: arrByte(n) is faster than Mid$(s, n)
        arrByte = CStr(arrData(i, j))  ' Is this type conversion efficient?
        For k = LBound(arrByte) To UBound(arrByte)
            a = (a + arrByte(k)) Mod MOD_ADLER
            b = (b + a) Mod MOD_ADLER
        Next k
        ' Terminating each item with a 'vTab' char constructs a better hash
        ' than vbNullString which, being equal to zero, adds no information
        ' to the hash and therefore permits the clash ABCD+EFGH = ABC+DEFGH
        ' However, we wish to avoid inefficient string concatenation, so we
        ' roll the terminating character's bytecode directly into the hash:
        a = (a + 11) Mod MOD_ADLER                ' vbVerticalTab = Chr(11)
        b = (b + a) Mod MOD_ADLER
    Next i
    ' Roll the column into the hash with a terminating horizontal tab char:
    a = (a + 9) Mod MOD_ADLER                     ' Horizontal Tab = Chr(9)
    b = (b + a) Mod MOD_ADLER

Next j

' Using a float in an integer calculation? We can get away with it, because
' the float error for a VBA double is < ±0.5 with numbers smaller than 2^32

dblOverflow = (1# * b * MOD_ADLER) + a

If dblOverflow > LONG_LIMIT Then  ' wraparound 2^31 to 1-(2^31)
    Do Until dblOverflow < LONG_LIMIT
        dblOverflow = dblOverflow - LONG_LIMIT
    CheckSum = 1 + dblOverflow - LONG_LIMIT
    CheckSum = b * MOD_ADLER + a
End If

End Function

No comments:

Post a Comment