• snh 51890 (7/23/2014)


    Nice article. This is a typical operation when the server is in a SOHO-type situation, or maybe on some older XP-era gear that is co-located somewhere. Basically, anywhere the person using SSMS also has access to the C:\ drive of the server.

    As my career has progressed and my situations have become more corporate, server administration is heavily managed. So things that used to be simple, like bulk importing a file located on the same server, are near impossible now.

    I've grown fond of using Linked Servers for large CSV files I only need temporarily (a sketch follows the list below). Usually this means buying the server admin a beer to enable this feature. The benefits are -

    • Treats the data file as a database
    • Does not have to be imported into an existing database, which means it's a little easier on the transaction logs
    • Can select against it using four-part naming (server.database.schema.object)
    • When done, merely drop the linked server; the file itself is untouched
    • The file does not have to be on the SQL Server itself
    • Can be done entirely by script using T-SQL
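
    Here's a minimal sketch of that approach, assuming the Microsoft.ACE.OLEDB.12.0 provider is installed on the SQL Server and that the CSV file lives in C:\Temp (both the provider version and the path are assumptions; adjust for your environment, and note that column names and types can be described in a schema.ini file in the same directory):

    -- Create a linked server over a directory of text files.
    -- The directory acts as the "database"; each file is a "table".
    EXEC sp_addlinkedserver
         @server     = N'CsvFiles',
         @srvproduct = N'ACE',
         @provider   = N'Microsoft.ACE.OLEDB.12.0',
         @datasrc    = N'C:\Temp',
         @provstr    = N'Text';

    -- Query the file with four-part naming; the catalog and schema
    -- parts are omitted, and the dot in the file name becomes a hash
    -- (MyData.csv -> MyData#csv).
    SELECT *
      FROM CsvFiles...[MyData#csv];

    -- When done, simply drop the linked server; the file is untouched.
    EXEC sp_dropserver @server = N'CsvFiles';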

    Sean

    Gosh… I wouldn't even unintentionally demean the use of BULK INSERT as simply a SOHO or XP-era method. BULK INSERT is also not limited to using only those drives that are available on the server itself. It's easy enough to set up a directory on another server that can act as a common area where such temporary use of a file can easily be accomplished without having to bribe the DBAs (although we do very much appreciate the offer of beer :-D).
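
    For example, something like the following works fine, provided the SQL Server service account (not your own login) has read permission on the share. The share, file, and table names here are made up for illustration:

    -- BULK INSERT from a UNC path on another server.
    BULK INSERT dbo.SomeStagingTable
    FROM '\\FileServer\Imports\MyData.csv'
    WITH (
          FIELDTERMINATOR = ',',
          ROWTERMINATOR   = '\n',
          FIRSTROW        = 2   -- skip the header row
         );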

    Shifting gears a bit, with a little bit of planning, there's also absolutely no need for such imports to take a toll on log files. Since I normally won't let imported data anywhere near a production database until it's been fully validated and cleansed, I see nothing wrong with loading smaller files (the file provided in the article is incredibly small, weighing in at only 5MB) into TempDB, which also easily allows for minimal logging (not covered in the article) for the initial load.
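
    As a sketch of that (the table, columns, and file path are made up), loading into a heap in TempDB with TABLOCK meets the usual conditions for minimal logging, since TempDB always uses the SIMPLE recovery model:

    -- Stage the raw file in TempDB so the load never touches the
    -- production database's transaction log.
    CREATE TABLE tempdb.dbo.RawImport
    (
        Col1 VARCHAR(100),
        Col2 VARCHAR(100),
        Col3 VARCHAR(100)
    );

    BULK INSERT tempdb.dbo.RawImport
    FROM 'C:\Temp\MyData.csv'
    WITH (
          FIELDTERMINATOR = ',',
          ROWTERMINATOR   = '\n',
          FIRSTROW        = 2,
          TABLOCK          -- required for minimal logging on a heap
         );

    -- Validate and cleanse here, then move only the good rows to the
    -- real table.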

    Although I'll admit that you have to train the data providers a bit not to include some of the problems that can appear in improperly formatted CSV, TSV, and fixed-field files, it's no more difficult to train them in that than it is to convey the idea that things like XML files must also be properly formatted. Of course, such properly formatted CSV and TSV files don't suffer the incredible amount of tag bloat that (for example) XML files carry. We have about a terabyte of raw data that we necessarily have to keep at work. If it were stored as XML, it would occupy somewhere between 8 and 16 terabytes. While disk space is relatively cheap, I can't justify wasting 7 to 15 terabytes of it, nor the extra time to transmit or receive large quantities of data whose format we already know.

    For transmission of large quantities of data between disconnected remote SQL Servers, BULK INSERT (and BCP) both absolutely rock when it comes to "Native" format files. The format is incredibly tight when it comes to transmitting large amounts of numeric data in particular. Because values are stored in their binary form, the data doesn't need to be explicitly checked for data-type (although, inherently, it is), and even the largest INT values still take only 4 bytes (for example).
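
    A quick sketch of the round trip (the server, database, and table names are placeholders):

    -- Export in native format on the source server. This is a bcp
    -- command, run from a command prompt rather than in T-SQL:
    --   bcp SourceDb.dbo.BigTable out C:\Temp\BigTable.dat -n -T -S SourceServer
    -- (-n = native format, -T = trusted connection)

    -- Import the native file on the destination server.
    BULK INSERT dbo.BigTable
    FROM 'C:\Temp\BigTable.dat'
    WITH (DATAFILETYPE = 'native', TABLOCK);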

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)