SQL-only ETL using a bulk insert into a temporary table (SQL Spackle)

  • Bulk Insert works in most cases, but if your operation is based on sequence you can't rely on bulk insert, as it loads the data from the file regardless of sequence. We had quite a few issues with this.

  • shah.muhammad (8/7/2014)


    Bulk Insert works in most cases, but if your operation is based on sequence you can't rely on bulk insert, as it loads the data from the file regardless of sequence. We had quite a few issues with this.

    That's good to know. I wonder why that is?

  • Stan Kulp-439977 (8/7/2014)


    shah.muhammad (8/7/2014)


    Bulk Insert works in most cases, but if your operation is based on sequence you can't rely on bulk insert, as it loads the data from the file regardless of sequence. We had quite a few issues with this.

    That's good to know. I wonder why that is?

    That's because relational databases don't care about physical order.

  • Luis Cazares (8/7/2014)


    Stan Kulp-439977 (8/7/2014)


    shah.muhammad (8/7/2014)


    Bulk Insert works in most cases, but if your operation is based on sequence you can't rely on bulk insert, as it loads the data from the file regardless of sequence. We had quite a few issues with this.

    That's good to know. I wonder why that is?

    That's because relational databases don't care about physical order.

    Would adding an identity field to the temporary table cause the records to be inserted in order?

  • shah.muhammad (8/7/2014)


    Bulk Insert works in most cases, but if your operation is based on sequence you can't rely on bulk insert, as it loads the data from the file regardless of sequence. We had quite a few issues with this.

    If you set BATCHSIZE to 1, the import may be slower than with a larger setting, but the rows will be inserted into your table in the same order in which they are read from the data file. You can also use a VIEW that selects all columns except the IDENTITY column from a staging table to act as a façade for BULK INSERT.
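    As a minimal sketch of the BATCHSIZE = 1 option (the table name, file path, and terminators here are placeholders of my own, not details from the demo below):

    BULK INSERT dbo.staging_table -- hypothetical staging table
    FROM 'C:\@\data.txt'          -- hypothetical data file
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n',
        BATCHSIZE = 1 -- one row per batch: slower, but rows arrive in file order
    );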

    Demo:

    Open the attached zip file and save the 32MB text file inside as C:\@\bi.tbl. The first four lines of the text file are below and are repeated 980,514 times, making a file with ~3.9 million lines in it. You'll see in the code sample below where I check the identity assigned to each line to validate that the file was inserted in order.

    Joe|10
    John|12
    Mary|8
    Jane|14

    USE YourDatabaseName
    GO
    IF EXISTS ( SELECT *
                FROM sys.objects
                WHERE object_id = OBJECT_ID(N'dbo.bi')
                      AND type IN (N'U') )
        DROP TABLE dbo.bi;
    GO
    CREATE TABLE dbo.bi (line_number INT NOT NULL IDENTITY(1,1), name VARCHAR(100), age INT);
    GO
    IF EXISTS ( SELECT *
                FROM sys.views
                WHERE object_id = OBJECT_ID(N'dbo.bi_facade') )
        DROP VIEW dbo.bi_facade;
    GO
    -- the façade view exposes every column except the IDENTITY column
    CREATE VIEW dbo.bi_facade AS SELECT name, age FROM dbo.bi;
    GO
    BULK INSERT YourDatabaseName.dbo.bi_facade
    FROM 'C:\@\bi.tbl'
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n'
    );
    GO
    -- should return 0 rows if lines in the file were inserted into the table in order
    SELECT TOP 10 *
    FROM dbo.bi
    WHERE (name = 'Joe'  AND line_number % 4 != 1)
       OR (name = 'John' AND line_number % 4 != 2)
       OR (name = 'Mary' AND line_number % 4 != 3)
       OR (name = 'Jane' AND line_number % 4 != 0);
    GO

    Note: the script consistently takes about 15 seconds to run on my home machine.

    As I mentioned in an earlier post, T-SQL's BULK INSERT, the SQL Server command-line tool bcp.exe, the .NET Base Class Library class System.Data.SqlClient.SqlBulkCopy and SSIS's OLE DB Destination Component all connect to SQL Server's BulkCopy API, so it stands to reason that a batch size option would be exposed through all of these tools, and it is. In BULK INSERT it is BATCHSIZE; in bcp.exe it is the -b option; in SqlBulkCopy it is the BatchSize property; and in SSIS's OLE DB Destination Component it is the Fast Load option FastLoadMaxInsertCommitSize.
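    For comparison, a single-row batch through bcp.exe might look like the sketch below; the server name and trusted-connection switch are assumptions of mine, not details from this thread:

    bcp YourDatabaseName.dbo.bi_facade in C:\@\bi.tbl -S YourServer -T -c -t "|" -b 1

    Here -c loads character data, -t sets the field terminator, and -b 1 sets the batch size to a single row, mirroring BATCHSIZE = 1 in BULK INSERT.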


  • Stan Kulp-439977 (8/7/2014)


    shah.muhammad (8/7/2014)


    Bulk Insert works in most cases, but if your operation is based on sequence you can't rely on bulk insert, as it loads the data from the file regardless of sequence. We had quite a few issues with this.

    That's good to know. I wonder why that is?

    It is because once a batch of rows is submitted to SQL Server, it is up to the database engine to decide how that data is applied to the table. If you submit a batch of 1,000 rows, the engine may commit part of the batch before another part if it happens to split the batch and send rows to different CPUs for insert into the table. The only way to keep the data in your batches from being committed to your table in an order other than the one in which it was read from the file is to submit batches containing only a single row.

