I posted some VBA code to Split and Join 2D arrays using optimised string-handling a while ago: here's a function using the same logic to write an array to a csv file.
There's some interesting surprises in this kind of simple operation when you meet Unicode characters - and I found out that the horrible hack that I use for reading Excel ranges into complex SQL queries via csv files has some problems when the file contains (say) Arabic company names.
Without going into too much detail, VBA is internally unicode-compliant (strings using 'wide' chars encoding each character in two bytes have been there since well before the turn of the century) but Excel assumes that the outside world runs on ANSI code, or UTF encodings that require a code page. This makes life difficult when you're writing to a file, and reading it again with something else out of Redmond that can manage Unicode text, but hasn't *quite* got it right with the things that other Microsoft products do with this 'We speak Unicode but the outside world is ANSI' thing.
If you do need more detail (and actually, you probably do; there's a lot of misconceptions around and those of us who work in the Microsoft Office 'stack' have some of the most annoying ones), I recommend a quick re-read of Joel Spolsky's blog post: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Other stuff: If you call the function repeatedly with the same file name, it'll check that the file's unchanged since the last 'ArrayToCSVfile' write, using an Adler-32 checksum on the file contents. If it's still the same data, it'll bail out. This is *embarrassing* if you're reusing the file name for different data, so be sure to delete the pre-existing files in your calling function if you do that.
And so, without further ado:
Writing an Excel range to a csv file
Public Function ArrayToCSVfile(ByRef InputArray As Variant, _
ByVal FilePath As String, _
Optional ByVal CoerceText As Boolean = True _
) As Long
' Output an array to a csv file and returns the row count.
' CoerceText=TRUE will encapsulate all items, numeric or not, in quote marks.
' This code handles unicode, and outputs a file that can be read by Microsoft
' ODBC and OLEDB database drivers.
' The first row is assumed to be a list of unique column names. Non-unique or
' blank names are replaced by the F0, F1, F2... sequential names generated by
' widely-used database engines (including MS-Access, JET & OLEDB text drivers)
' Blank rows after the last data row are not written to file.
' The function stores checksums of every file that it writes; we do not over-
' write a pre-existing file if a check on the file name discovers a record in
' the checksum list, and a check on the file contents shows that it still has
' the same checksum. There's an overhead to this preliminary file 'read' of a
' pre-existing file (our VBA implementation of the Adler32 hash can only read
' 25 Mbytes per second) but this is much faster than an uneccessary overwrite
On Error Resume Next
' Coding note: we're not doing any string-handling in VBA.Strings: allocating
' deallocating and (especially!) concatenating are SLOW. We are using the VBA
' Join and Split functions ONLY. Feel free to optimise further by declaring a
' faster set of string functions from the Kernel if you want to.
'
' Other optimisations: type pun. Byte Arrays are interchangeable with strings
' Some of our loops through these arrays have a 'step' of 2. This optimises a
' search-andreplace for ANSI chars in an array of 2-byte unicodes.
' ** THIS CODE IS IN THE PUBLIC DOMAIN **
' Nigel Heffernan Excellerando.Blogspot.com
Dim COMMA As String
Dim BLANK As String
Dim EOROW As String
COMMA = ChrW$(44)
BLANK = ChrW$(13) & ChrW$(10) & ChrW$(13) & ChrW$(10)
EOROW = ChrW$(13) & ChrW$(10)
Dim i As Long
Dim j As Long
Dim k As Long
Dim i_LBound As Long
Dim i_UBound As Long
Dim j_LBound As Long
Dim j_UBound As Long
Dim k_lBound As Long
Dim k_uBound As Long
Dim iCheckSum As Long
Dim iRowCount As Long
Dim hndFile As Long
Dim arrBytes() As Byte
Dim arrTemp1() As String
Dim arrTemp2() As String
Dim arrTemp3(0 To 2) As String
Dim strBlankRow As String
Dim boolSkipRow As Boolean
Dim boolNumeric As Boolean
Dim strHeader As String
Dim arrHeader() As Byte
Static FileCheckSums As Scripting.Dictionary
If FileCheckSums Is Nothing Then
Set FileCheckSums = New Scripting.Dictionary
End If
If Len(VBA.FileSystem.Dir(FilePath)) <> 0 Then
iCheckSum = FileCheckSum(FilePath)
If FileCheckSums(FilePath) = iCheckSum Then
ArrayToCSVfile = -1
Exit Function ' The file's unchanged since we last created it.
Else
VBA.FileSystem.Kill FilePath
End If
End If
i_LBound = LBound(InputArray, 1)
i_UBound = UBound(InputArray, 1)
j_LBound = LBound(InputArray, 2)
j_UBound = UBound(InputArray, 2)
ReDim arrTemp1(i_LBound To i_UBound)
ReDim arrTemp2(j_LBound To j_UBound)
' We start with a 2-byte 'Wide' char. This coerces all subsequent operations to unicode
arrTemp3(0) = ChrW$(34) ' Encapsulating quote
arrTemp3(1) = vbNullString ' The field value will go here
arrTemp3(2) = ChrW$(34) ' Encapsulating quote
' Special handling for the header row. Not optimised, but it's only one row
i = i_LBound
For j = j_LBound To j_UBound
arrTemp3(1) = ChrW(70) & j ' Columns must have a unique header. Default F0, F1...
If IsError(InputArray(i, j)) Then
' no action
ElseIf IsNull(InputArray(i, j)) Then
' no action
ElseIf IsEmpty(InputArray(i, j)) Then
' no action
ElseIf Len(InputArray(i, j)) = 0 Then
' no action
Else
If IsDate(InputArray(i, j)) Then
arrTemp3(1) = Round(CDbl(CVDate(InputArray(i, j))), 8)
Else
arrBytes = CStr(InputArray(i, j))
For k = LBound(arrBytes) To UBound(arrBytes) Step 2
Select Case arrBytes(k)
Case 10, 13, 9, 44, 160 ' replaces CR, LF, Tab, Comma, and non-breaking
arrBytes(k) = 32 ' spaces with the standard ANSI space character
Case 34
arrBytes(k) = 39
End Select
Next k
arrTemp3(1) = arrBytes
End If
End If
arrTemp2(j) = Join(arrTemp3, vbNullString)
' Remove duplicated field names
For k = j_LBound To j - 1 Step 1
If StrComp(arrTemp2(k), arrTemp2(j), vbTextCompare) = 0 Then
arrTemp2(j) = ChrW(34) & "F" & j & ChrW(34) ' Non-unique: revert to default
Exit For
End If
Next k
Next j
arrTemp1(i) = Join(arrTemp2, COMMA)
' Data body. This is heavily optimised to avoid VBA.String functions with allocations
For i = 1 + i_LBound To i_UBound
boolSkipRow = True
For j = j_LBound To j_UBound
If IsEmpty(InputArray(i, j)) Then 'This condition is so common that we separate it out into its
arrTemp2(j) = vbNullString 'own IF...THEN clause & subordinate the rest into nested IFs
Else
If IsError(InputArray(i, j)) Then
arrTemp2(j) = vbNullString '' was #ERROR
ElseIf IsNull(InputArray(i, j)) Then
arrTemp2(j) = vbNullString
ElseIf Len(InputArray(i, j)) = 0 Then
arrTemp2(j) = vbNullString
Else
boolSkipRow = False ' This is definitely a non-blank row
If IsDate(InputArray(i, j)) Then
arrTemp2(j) = InputArray(i, j) ' Safer to Round(CDbl(CVDate(InputArray(i, j))), 8)
' but that's costly for performance. You are better
' off trusting Range.Value2 to create input arrays.
Else
arrBytes = CStr(InputArray(i, j))
For k = LBound(arrBytes) To UBound(arrBytes) Step 2
Select Case arrBytes(k)
Case 10, 13, 9, 44, 160 ' replace CR, LF, Tab, Comma,   with space
If arrBytes(k + 1) = 0 Then arrBytes(k) = 32
Case 34
If arrBytes(k + 1) = 0 Then arrBytes(k) = 39
End Select
Next k
arrTemp2(j) = arrBytes
arrBytes = vbNullString
End If
End If
End If ' isempty
Next j
If boolSkipRow Then
arrTemp1(i) = vbNullString
iRowCount = iRowCount - 1
Else
If CoerceText Then ' encapsulate all fields in quotes
For j = j_LBound To j_UBound
arrTemp3(1) = arrTemp2(j)
arrTemp2(j) = Join$(arrTemp3, vbNullString)
Next j
Else
For j = j_LBound To j_UBound
arrBytes = arrTemp2(j)
boolNumeric = True
For k = LBound(arrBytes) To UBound(arrBytes) Step 2
If arrBytes(k) < 45 Or arrBytes(k) > 57 Then
boolNumeric = False
Exit For
End If
Next k
If boolNumeric Then
For k = 1 + LBound(arrBytes) To UBound(arrBytes) Step 2
If arrBytes(k) <> 0 Then
boolNumeric = False
Exit For
End If
Next k
End If
arrBytes = vbNullString
If Not boolNumeric Then
arrTemp3(1) = arrTemp2(j)
arrTemp2(j) = Join(arrTemp3, vbNullString)
End If
Next j
End If
arrTemp1(i) = Join(arrTemp2, COMMA)
End If
Next i
iRowCount = i + iRowCount - 2
If iRowCount < 1 Then
iRowCount = 0 ' Note: this count excludes the header
End If
' **** WHY THIS IS COMMENTED OUT **** **** **** **** **** **** **** ****
'
' Microsoft ODBC and OLEDB database drivers cannot read the field names from
' the header when a unicode byte order mark (&HFF & &HFE) is inserted at the
' start of the text by Scripting.FileSystemObject 'Write' methods. Trying to
' work around this by writing byte arrays will fail; FSO 'Write' detects the
' string encoding automatically, and won't let you hack around it by writing
' the header as UTF-8 (or 'Narrow' string) and appending the rest as unicode
'
' (Yes, I tried some revolting hacks to get around it: don't *ever* do that)
'
' **** **** **** **** **** **** **** **** **** **** **** **** **** **** ****
'
' With FSO.OpenTextFile(FilePath, ForWriting, True, TristateTrue)
' .Write Join(arrTemp1, EOROW)
' .Close
' End With ' textstream object from objFSO.OpenTextFile
'
' **** **** **** **** **** **** **** **** **** **** **** **** **** **** ****
' **** WHY WE 'PUT' A BYTE ARRAY INSTEAD OF A VBA STRING VARIABLE **** ****
'
' Put #hndFile, , StrConv(Join(arrTemp1, EOROW), vbUnicode)
' Put #hndFile, , Join(arrTemp1, EOROW)
'
' If you pass unicode, Wide or UTF-16 string variables to PUT, it prepends a
' Unicode Byte Order Mark to the data which, when written to your file, will
' render the field names illegible to Microsoft's JET ODBC and ACE-OLEDB SQL
' drivers (which can actually read unicode field names, if the helpful label
' isn't in the way). However, the 'PUT' statements write a Byte array as-is.
'
' **** **** **** **** **** **** **** **** **** **** **** **** **** **** ****
arrBytes = Join(arrTemp1, EOROW)
' Remove empty rows after the data: this is so common in arrays from Excel
' ranges that the performance penalty is acceptable (one big allocation in
' the Redim Preserve statement) but you may prefer to comment out the code
' We *could* do a Replace on BLANK to get internal blank rows as well, but
' I don't trust the unicode handling and the performance penalty is higher
k_lBound = LBound(arrBytes)
k_uBound = UBound(arrBytes)
For k = k_uBound - 1 To k_lBound Step -1
If arrBytes(k) <> 0 Then
If Not (arrBytes(k) = 10 Or arrBytes(k) = 13) Then
Exit For
End If
End If
Next k
ReDim Preserve arrBytes(k_lBound To k + 1)
hndFile = FreeFile
Open FilePath For Binary As #hndFile
Put #hndFile, , arrBytes
Close #hndFile
FileCheckSums(FilePath) = StringCheckSum(arrBytes)
Erase arrBytes
ArrayToCSVfile = iRowCount
ExitSub:
On Error Resume Next
Erase arrTemp1
Erase arrTemp2
Exit Function
ErrSub:
Resume ExitSub
End Function
You'll need this, too: the file and string checksum functions called in the code.
A VBA implementation of the Adler-32 checksum, running on byte arrays instead of using VBA string-handling.
This includes another Heffernan Horrible Hack: the VBA Long Integer data type doesn't go up tp 2³², it's a signed integer for ±2³¹. So there's a wraparound at 2³¹-1, which feels rather quaint in this modern age of 64-bit LongLong integers.
However, there is old-worlde quaintness, and there's mediƦval barbarism: the final operation of an Adler-32 hashing function is a multiplication that can and does blow past 2³², and I'm using a Floating-point double to do it. If my castle is ever threatened by a mob of peasants with pitchforks and torches, I might encapsulate that in a conditional-compilation block on #VBA7, with a proper LongLong integer in the 64-bit block and the barbarism confined to the #Else block.
Public Function StringCheckSum(ByRef ByteArray() As Byte) As Long
Application.Volatile False
' Returns an Adler32 checksum of a string: typically a large file's contents
' Note that the VBA Long Integer data type is *not* a 32-bit integer, it's a
' signed integer with a range of ± (2^31) -1. So our return value is signed
' and return values exceeding +2^31 -1 'wraparound' and restart at -2^31 +1.
' Test your results. Some data sets (especially repeating dates) have double
' digit collision rates, and you'll need to find a different hash algorithm.
' Coding Notes:
' What, didn't you know that a Byte Array and a string are type-compatible?
' This is intended for use in VBA, and not for use on the worksheet. Use the
' setting 'Option Private Module' to hide CheckSum from the function wizard
' Author: Nigel Heffernan, May 2006 http://excellerando.blogspot.com
' Acknowledgements and thanks to Paul Crowley, who recommended Adler-32
' Please note that this code is in the public domain. Mark it clearly, with
' the author's name, and segregate it from any proprietary code if you need
' to assert ownership & commercial confidentiality on your proprietary code
Const LONG_LIMIT As Long = (2 ^ 31) - 1
Const MOD_ADLER As Long = 65521
Dim a As Long
Dim b As Long
Dim i As Long
Dim dblOverflow As Double
Dim i_LBound As Long
Dim i_UBound As Long
a = 1
b = 0
i_LBound = LBound(ByteArray)
i_UBound = UBound(ByteArray)
For i = i_LBound To i_UBound
a = (a + ByteArray(i)) Mod MOD_ADLER
b = (b + a) Mod MOD_ADLER
Next i
' Using a float in an integer calculation? We can get away with it, because
' the float error for a VBA double is < ±0.5 with numbers smaller than 2^32
dblOverflow = (1# * b * MOD_ADLER) + a
If dblOverflow > LONG_LIMIT Then ' wraparound 2^31 to 1-(2^31)
Do Until dblOverflow < LONG_LIMIT
dblOverflow = dblOverflow - LONG_LIMIT
Loop
StringCheckSum = 1 + dblOverflow - LONG_LIMIT
Else
StringCheckSum = b * MOD_ADLER + a
End If
End Function
Public Function FileCheckSum(strFilePath As String) As Long
Application.Volatile False
On Error Resume Next
'Return an ADLER-32 checksum from a file
' Throttle repeated calls using static variables. WARNING
' this assumes the file hasn't changed in the last 500 ms
Static LastFile As String
Static LastCall As Single
Static LastHash As Long
If LastFile = strFilePath Then
If VBA.Timer - LastCall < 0.5 Then
FileCheckSum = LastHash
Exit Function
Else
LastCall = VBA.Timer
End If
Else
LastFile = strFilePath
LastCall = VBA.Timer
End If
Dim hndFile As Long
Dim arrBytes() As Byte
Dim lenData As Long
hndFile = FreeFile
Open strFilePath For Binary As #hndFile
ReDim arrBytes(0 To LOF(hndFile) - 1)
Get #hndFile, , arrBytes
Close #hndFile
FileCheckSum = StringCheckSum(arrBytes)
Erase arrBytes
LastHash = FileCheckSum
End Function
Share and enjoy.
I should thank the estimable Paul Crowley for showing me the Adler-32 algorith, years ago, and much else besides; but a gentleman aficionado of algrorithmic elegance might prefer to dissociate himself from such abuses.
No comments:
Post a Comment