Here’s a few tips for those people who provide raw data in text files (think CSV files and the like).
- Surround all text fields in single quotes, even if a comma does not exist in a specific instance. By failing to do this, you lose field delimiter consistency on a row-by-row basis, forcing the contents of the field to be parsed separately (i.e. stripping the “optional” quotes).
- Use consistent line endings. Pick one and stick with it for all your files. Use either (CR/LF), (LF), or (CR) — and use the same in all your files.
- Put column headings in the first row only. This is more a convenience than a necessity. If you make your first row column headings, make sure it is only the first row.
- Every row past the first should have an identical schema. Don’t try to be fancy and have different row types in one file. Each file should have the same number and sequence of columns.
- Provide delimiters for all columns, even when a row does not have data for them. For example, in a comma-delimited file with five columns, if a row has data in only the first two columns, make sure the row ends with three commas. Failure to do so implies missing or incomplete data.
When text files following these guidelines, I can write a script to import them into a SQL table (using BCP and a simple batch file) in a few minutes. Each guideline that is broken requires additional cleanup steps and more complex data import steps, and adds significant development (and debugging) time that shouldn’t be necessary.