SQL-only ETL using a bulk insert into a temporary table (SQL Spackle)

  • Comments posted to this topic are about the item SQL-only ETL using a bulk insert into a temporary table (SQL Spackle)

  • It's a nice article, short and simple.

However, I would have liked it if you had mentioned one of the main problems I've seen with beginners (including me), which is that the file must be in a location accessible to the server. Most of us have been bitten by this problem, using a local address instead of the address the server needs (or putting the file on the server).

    Another improvement would have been to include some references for more information such as how to make it dynamic or how to use format files.
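    As an example of the "dynamic" angle, here is a minimal sketch using dynamic SQL, since BULK INSERT won't accept a variable for the file name directly (the staging columns, path, and delimiters below are illustrative assumptions, not anything from the article):

        -- Assumed all-VARCHAR staging table; the real column layout will differ.
        CREATE TABLE #Staging (Col1 VARCHAR(50), Col2 VARCHAR(50), Col3 VARCHAR(50));

        DECLARE @FilePath NVARCHAR(260) = N'\\SomeFileServer\Imports\SomeFile.csv';  -- hypothetical UNC path
        DECLARE @Sql      NVARCHAR(MAX);

        -- Build the statement dynamically because BULK INSERT can't take the path from a variable.
        SET @Sql = N'BULK INSERT #Staging
                     FROM ''' + REPLACE(@FilePath, N'''', N'''''') + N'''
                     WITH (FIRSTROW = 2, FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'');';

        EXEC sys.sp_executesql @Sql;

    A temp table created in the outer scope is still visible to the dynamic SQL, so the rows land in #Staging as usual.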

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Nice simple article. I may have to try this out sometime.

  • Very nice article! Thank you so much!

Extremely academic -- since, as was pointed out, the file must be on a disk local to that instance of SQL Server. That doesn't happen often, and most DBAs would argue against it.

It would be better if you could target files local to a Windows client and bulk-load those into SQL Server.

A look at SQLXMLBulkLoad may be worthwhile, since more and more files move from business to business in an XML format, and it can run from any Windows client.

Nice article. This is a typical operation when the server is in a SOHO-type situation or maybe some older XP-era gear that is co-located somewhere. Basically, anyplace where the person using SSMS also has access to the C:\ drive of the server.

As my career has progressed and my situations have become more corporate, the server administration is heavily managed. So things that were once simple, like bulk importing a file located on the same server, are near impossible now.

    I've grown fond of using Linked Servers for large CSV files I only need temporarily. Usually this means buying the server admin a beer to enable this feature. The benefits are -

    • Treats the data file as a database
• Does not have to be imported into an existing database, which means it's a little easier on the transaction logs
• Can select against it using four-part addressing (server.database.schema.table)
    • When done, merely unlink the file (server)
    • File does not have to be on server
    • Done entirely by script using T-SQL
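    A rough sketch of that linked-server setup follows; the provider, folder, and file names below are assumptions, and whichever text-capable OLE DB provider is installed on the server would be used:

        -- One-time setup (needs elevated rights): expose a folder of text files as a linked server.
        EXEC sys.sp_addlinkedserver
             @server     = N'CSV_FILES',                 -- hypothetical name
             @srvproduct = N'ACE 12.0',
             @provider   = N'Microsoft.ACE.OLEDB.12.0',
             @datasrc    = N'C:\CsvDrop',                -- folder holding the CSV files
             @provstr    = N'Text;HDR=YES;FMT=Delimited';

        -- Each file in the folder shows up as a "table"; note the # in place of the file extension's dot.
        SELECT *
        FROM   CSV_FILES...[SomeBigFile#csv];

        -- When finished, simply unlink it.
        EXEC sys.sp_dropserver @server = N'CSV_FILES';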

    Sean

  • I'm actually very torn on this article...

On the one hand, I agree. As noted by some of the others, the article would certainly have benefited from the inclusion of some extra information, such as which machine drive-letter-referenced files must live on, the fact that you can use UNC paths instead, etc.

On the other hand, this IS a "Spackle" article and, as it says at the beginning of the article, "These short articles are NOT meant to be complete solutions". Is it a 5 Star article in the classic sense? No. But, if you don't know about BULK INSERT, it IS, most definitely, a MUST READ article because, although it doesn't fill an entire crack in knowledge of T-SQL, it DOES demonstrate the very important and academic concept that the Extract, Transform, and Load process for data doesn't need to be complicated. While that might not provide a benefit to those of us who already know about the likes of BULK INSERT, it does provide the start of an answer to the recent outbreak of posts on this and other sites that can be generically summarized as "How can I import a file into a table?"

What's unfortunately not stated in the article, and requires someone to have enough intellectual curiosity to try the code out and do a little research on their own, is the fact that the code provided imports more than 122 THOUSAND rows and then extracts the required (501) rows in just over a second. And the author provided a substantial amount of test data to prove it! The article clearly demonstrates that it doesn't require learning a new language (such as PowerShell), doesn't require the use of another system/program (such as SSIS or any of a thousand other programs), and doesn't require much additional knowledge of T-SQL.

In other words, this is a very high-altitude and incredibly simple "introduction" to BULK INSERT that says "Look! Importing files can be easy!" and, for those with enough intellectual curiosity to look up BULK INSERT in Books Online or other web articles, provides a clue to an extremely effective ETL method that they might not have otherwise considered or might not even have been aware of.

With all of that in mind, I say "Well done, Mr. Kulp". Add a little more detail, and the advantages of using a particular feature, to future "Spackle" articles and keep 'em coming!
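    For anyone who has never seen the command, the basic pattern the article demonstrates boils down to something like this (the columns, delimiters, path, and filter below are illustrative assumptions, not the article's actual layout):

        -- Stage the whole file into a temp table, then pull out just the rows you need.
        CREATE TABLE #Import (AccountID VARCHAR(20), TranDate VARCHAR(30), Amount VARCHAR(20));  -- hypothetical columns

        BULK INSERT #Import
        FROM 'C:\Imports\SampleData.csv'   -- path as seen by the SQL Server service, or a UNC path
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

        SELECT AccountID, TranDate, Amount
        FROM   #Import
        WHERE  AccountID = '12345';        -- whatever filter extracts the rows you actually need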

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just to be clear, the file doesn't need to be on the same server as SQL Server. Typically, the files will be stored in a separate location and referenced by a network address (\\SomeFileServer\Path\file).

    Luis C.
How could we extend this code to add the ability to convert data types, like parsing a datetime?

Also, how would we do an upsert per row?

  • sgross 10581 (7/23/2014)


How could we extend this code to add the ability to convert data types, like parsing a datetime?

Also, how would we do an upsert per row?

    Unless you can guarantee that the person(s) or application will always provide a proper date and time, it's best to import the data into a staging table (as was done in the article) as VARCHAR and validate the data using standard techniques for validating such things. Once validated, simply inserting the data from the staging table to the final data table will do many implicit conversions such as converting proper VARCHAR renditions of date and time to an actual DATETIME (for example) data type of the target column in the final table.

    As for "Upserts", it should be handled as any other data. You have a source table (the staging table) and a target table (the final table). I'll typically pre-mark each row with "I" (for insert) or "U" (for update) (sometimes, "D" for delete but I normally don't delete data due to audit requirements) using simple joins like you would in any classic upsert, and then do a separate insert and update. You could also use MERGE between the staging and target tables. BULK INSERT is not meant to provide upsert capabilities on its own. It's meant to be relatively simple and very fast... and it is.

If the source of the data is very good, you can, in fact, import into a staging table that has the correct data types. Even if the source is questionable in quality, you can set up BULK INSERT to sequester any bad lines of data in a separate file for future analysis and repair.
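    A rough sketch of that flow, with the table names, columns, and error-file path all being assumptions:

        -- 1) Land the raw file in an all-VARCHAR staging table and sequester bad lines to a separate file.
        CREATE TABLE #Staging (CustomerID VARCHAR(20), TranDate VARCHAR(30), Amount VARCHAR(20));

        BULK INSERT #Staging
        FROM '\\SomeFileServer\Imports\SomeFile.csv'
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n',
              ERRORFILE = '\\SomeFileServer\Imports\SomeFile.bad',  -- rejected rows land here for later repair
              MAXERRORS = 1000);

        -- 2) Validate, then upsert into the final table; implicit conversion handles VARCHAR -> DATETIME, etc.
        MERGE dbo.FinalTable AS tgt
        USING (SELECT CustomerID, TranDate, Amount
               FROM   #Staging
               WHERE  TRY_CONVERT(datetime, TranDate) IS NOT NULL) AS src  -- keep only rows that validate
           ON tgt.CustomerID = src.CustomerID
        WHEN MATCHED THEN
            UPDATE SET tgt.TranDate = src.TranDate, tgt.Amount = src.Amount
        WHEN NOT MATCHED THEN
            INSERT (CustomerID, TranDate, Amount)
            VALUES (src.CustomerID, src.TranDate, src.Amount);

    The separate INSERT/UPDATE with "I"/"U" markers described above works just as well; MERGE is shown here only because it is compact.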

    --Jeff Moden



  • snh 51890 (7/23/2014)


Nice article. This is a typical operation when the server is in a SOHO-type situation or maybe some older XP-era gear that is co-located somewhere. Basically, anyplace where the person using SSMS also has access to the C:\ drive of the server.

As my career has progressed and my situations have become more corporate, the server administration is heavily managed. So things that were once simple, like bulk importing a file located on the same server, are near impossible now.

    I've grown fond of using Linked Servers for large CSV files I only need temporarily. Usually this means buying the server admin a beer to enable this feature. The benefits are -

    • Treats the data file as a database
• Does not have to be imported into an existing database, which means it's a little easier on the transaction logs
• Can select against it using four-part addressing (server.database.schema.table)
    • When done, merely unlink the file (server)
    • File does not have to be on server
    • Done entirely by script using T-SQL

    Sean

Gosh… I wouldn't even unintentionally demean the use of BULK INSERT as merely a SOHO or XP-era method. BULK INSERT is also not relegated to using only those drives that are available on the server itself. It's easy enough to set up a directory on another server that can act as a common area where such temporary use of a file can easily be accomplished without having to bribe the DBAs (although we do very much appreciate the offer of beer :-D).

    Shifting gears a bit, with a very little bit of planning, there's also absolutely no need for such imports to take a toll on log files. Since I normally won't let imported data anywhere near a production database until it's been fully validated and cleansed, I see nothing wrong with loading smaller files (the file provided in the article is incredibly small and only weighs in at 5MB) into TempDB, which also easily allows for minimal logging (not covered in the article) for the initial load.

Although I'll admit that you have to train the data providers a bit not to include some of the problems that can appear in improperly formatted CSV, TSV, and fixed-field files, it's no more difficult to train them in that than it is to convey the idea that things like XML files must also be properly formed. Of course, such properly formatted CSV and TSV files don't suffer the incredible amount of tag bloat that (for example) XML files carry. We have about a Terabyte of raw data that we necessarily have to keep at work. If they were XML files, they'd occupy somewhere between 8 and 16 Terabytes. While disk space is relatively cheap, I can't justify wasting 7 to 15 Terabytes of it, nor can I justify the extra time to transmit or receive large quantities of data whose format we already know.

For transmission of large quantities of data between disconnected remote SQL Servers, BULK INSERT (and BCP) absolutely rock when it comes to "Native" format files. The format is incredibly tight when it comes to transmitting large amounts of numeric data in particular. It means that the data doesn't need to be specifically checked for data type (although, inherently, it does) and that even the largest INT values still only take 4 bytes (for example).
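    A sketch of the minimal-logging and native-format points (object names and paths are assumptions; tempdb uses the SIMPLE recovery model, so an empty heap plus TABLOCK qualifies for a minimally logged load):

        -- Load into an empty heap in tempdb with a table lock so the load can be minimally logged.
        CREATE TABLE #RawImport (Col1 VARCHAR(50), Col2 VARCHAR(50), Col3 VARCHAR(50));

        BULK INSERT #RawImport
        FROM '\\SomeFileServer\Imports\SomeFile.csv'
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n',
              TABLOCK);   -- the table lock is one of the requirements for minimal logging

        -- For server-to-server transfers, a native-format file (produced by bcp with its native switch)
        -- can be loaded the same way:
        -- BULK INSERT dbo.SomeTable
        -- FROM '\\SomeFileServer\Transfers\SomeTable.dat'
        -- WITH (DATAFILETYPE = 'native', TABLOCK);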

    --Jeff Moden



To anyone reading this article and subsequent comments, know that BULK INSERT introduces problems and carries with it many more hidden limitations than it offers benefits. BCP (the command-line counterpart to T-SQL's BULK INSERT) and SSIS (whose OLE DB Destination component with fast load enabled is the SSIS counterpart to T-SQL's BULK INSERT), and even the SqlBulkCopy class in .NET (the .NET counterpart to T-SQL's BULK INSERT), are components that offer you more flexibility than BULK INSERT does when designing a data solution. The perspective of the BULK INSERT command with respect to the file, i.e. that the path you provide to the BULK INSERT command must be visible, and permissions granted, from the perspective of the server hosting SQL Server, is the haranguing downfall of BULK INSERT when working in a distributed computing environment where your application is likely not running on the same server where your database instance is being hosted.

Edit: just wanted to make sure to point out SSIS's counterpart to BULK INSERT. All the counterparts I listed actually use the same API as BULK INSERT does, namely the Bulk Copy API, which is a feature of the SQL Server Database Engine; these components are all just conduits to get to that API.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (7/24/2014)


To anyone reading this article and subsequent comments, know that BULK INSERT introduces problems and carries with it many more hidden limitations than it offers benefits. BCP (the command-line counterpart to T-SQL's BULK INSERT) and SSIS, even the SqlBulkCopy class in .NET (the .NET counterpart to T-SQL's BULK INSERT), are components that offer you more flexibility than BULK INSERT does when designing a data solution. The perspective of the BULK INSERT command with respect to the file, i.e. that the path you provide to the BULK INSERT command must be visible, and permissions granted, from the perspective of the server hosting SQL Server, is the haranguing downfall of BULK INSERT when working in a distributed computing environment where your application is likely not running on the same server where your database instance is being hosted.

    As with SSIS or any other tool, there does have to be a proper connection to wherever you're importing the data from. Ostensibly, ETL doesn't happen by accident and there should be a plan as to both where the data will live and who or what can import it in the form of privs. I have a large distributed environment where different data lives on different machines and have not had the problems that you speak of.

There's also a huge advantage that BULK INSERT has over BCP. The output of BULK INSERT is easily directed to a temp table in a stored procedure and doesn't actually require an excursion to the command line.

I will admit that neither BCP nor BULK INSERT is actually supported by MS for things like true CSV imports. Tools like SSIS do allow for the easy identification of a text qualifier and will handle some malformed data. There's the rub, though. People easily allow malformed data in text files and get upset when a tool won't handle it, yet they just as easily understand that XML has to be properly formed or there could be import errors.

    Every tool has its requirements for use, seemingly annoying nuances, and workarounds. For well formed and consistent data, BULK INSERT is a great tool for me.
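    One footnote on the text-qualifier point: SQL Server 2017 and later added CSV-aware options to BULK INSERT, which cover quoted fields natively (the path and table below are assumptions):

        -- SQL Server 2017+ only: FORMAT = 'CSV' honors quoted fields, so embedded commas survive.
        BULK INSERT #Import
        FROM '\\SomeFileServer\Imports\QuotedFields.csv'
        WITH (FORMAT = 'CSV',
              FIELDQUOTE = '"',   -- the text qualifier
              FIRSTROW = 2,
              ROWTERMINATOR = '\n');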

    --Jeff Moden



We do exactly this for some major flat-file data loads. The files are so wide that they're pretty much impossible to work with in the SSIS user interface, but simple using this technique.

  • Jeff Moden (7/24/2014)


    opc.three (7/24/2014)


To anyone reading this article and subsequent comments, know that BULK INSERT introduces problems and carries with it many more hidden limitations than it offers benefits. BCP (the command-line counterpart to T-SQL's BULK INSERT) and SSIS, even the SqlBulkCopy class in .NET (the .NET counterpart to T-SQL's BULK INSERT), are components that offer you more flexibility than BULK INSERT does when designing a data solution. The perspective of the BULK INSERT command with respect to the file, i.e. that the path you provide to the BULK INSERT command must be visible, and permissions granted, from the perspective of the server hosting SQL Server, is the haranguing downfall of BULK INSERT when working in a distributed computing environment where your application is likely not running on the same server where your database instance is being hosted.

    As with SSIS or any other tool, there does have to be a proper connection to wherever you're importing the data from. Ostensibly, ETL doesn't happen by accident and there should be a plan as to both where the data will live and who or what can import it in the form of privs. I have a large distributed environment where different data lives on different machines and have not had the problems that you speak of.

That is because your SQL Server service account has been granted an extensive set of permissions within your distributed environment. Not everyone is OK with the approach of letting SQL Server reach out to the file system for much else than writing backups.

    Every tool has its requirements for use, seemingly annoying nuances, and workarounds. For well formed and consistent data, BULK INSERT is a great tool for me.

    For one-off tasks BULK INSERT can come in handy but I avoid incorporating it into permanent solutions.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato
