I’ve been working with a lot of raw data files lately — data files that come in some delimited format which I must move into a SQL database. Most of the files are CSV files, and some of them have over 100 columns.
Using BCP to import CSV files into SQL is a pleasure. The only annoyance is writing those XML format files. Actually, the annoyance is not writing them, it’s going through the CSV files (which, of course, are undocumented) to determine what delimiters are used in each row.
Here’s a sample of what I mean:
2934,128321,2782,"2007-04-32","Excluded",2321,,22
Fortunately, most CSV files are consistent in their use of quotes, but going through dozens of columns to determine the terminators is a pain. The terminators in the above example aren’t just commas; they could also be quotes. Column three, for example, ends with a comma followed by a quote.
To generate the <record>
section of my format files, I wrote the following script, which reads the first line of a text file, finds each comma, determines if it is preceded or followed by a quote, and generates the appropriate XML.
//get filename var args = WScript.Arguments; if (args.length == 0) { WScript.Echo('Usage: getdelims.vbs <filename>'); WScript.Quit(); } var filename = args(0); //read file var fso = new ActiveXObject("Scripting.FileSystemObject"); var file = fso.OpenTextFile(filename,1); var contents = file.ReadLine(); file.Close(); file = null; fso = null; //find commas var cnt = 0; for (var i = 1; i < contents.length; i++) { if ( contents.substr(i,1) != ',' ) continue; cnt++; delim = ","; if ( contents.substr(i-1,1) == '"' ) delim = '",'; if ( i+1 < contents.length && contents.substr(i+1,1) == '"' ) delim += '"'; WScript.Echo('\t<FIELD ID="' + cnt + '" xsi:type="CharTerm" TERMINATOR="' + delim + '" />'); }
The output can be copy/pasted right into your format file. The example content above would generate the following.
<field ID="1" xsi:type="CharTerm" TERMINATOR="," /> <field ID="2" xsi:type="CharTerm" TERMINATOR="," /> <field ID="3" xsi:type="CharTerm" TERMINATOR=","" /> <field ID="4" xsi:type="CharTerm" TERMINATOR="","" /> <field ID="5" xsi:type="CharTerm" TERMINATOR=""," /> <field ID="6" xsi:type="CharTerm" TERMINATOR="," /> <field ID="7" xsi:type="CharTerm" TERMINATOR="," />
A few minutes writing a script, and I won’t be looking at CSV files with too many columns ever again (at least, not for the reason of writing XML format files).