Bulk insert changing order of file...

  • I have a flat file that I am importing into my database, but the rows are not going into the table in the same order as in the flat file. I know the data is not arriving in order, because records that should appear later in the file are showing up earlier, in the middle of another section of data where they don't belong.

    The flat file is created in a specific order and contains only one column of data, so I can't ORDER BY that column to recreate the order.

    How can I get the bulk insert task or command to keep the order of the flat file?

  • You really can't. SQL table rows don't have an "order" unless you can use data to assign one.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • newbie.with.sql (6/16/2011)


    I have a flat file that I am importing into my database, but the rows are not going into the table in the same order as in the flat file. I know the data is not arriving in order, because records that should appear later in the file are showing up earlier, in the middle of another section of data where they don't belong.

    The flat file is created in a specific order and contains only one column of data, so I can't ORDER BY that column to recreate the order.

    How can I get the bulk insert task or command to keep the order of the flat file?

    Step 1:

    > Make sure you have an IDENTITY column on your load table.

    Step 2:

    > Option 1: In theory you could try setting the threads to 1 and the batch commit size to 1. I have not tried it, but that may guarantee an ordered load. A batch commit size of 1 will be slow...slow like issuing a ton of sequential batches, each with one INSERT statement, but it may get the job done. How big is the file?

    > Option 2: The bcp.exe utility could work for you if Option 1 fails. bcp reads and loads files sequentially and, to my knowledge, is single-threaded, so the same theory above would apply. If you set the batch commit size to 1 (the -b option) you should get an ordered load.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I'll try adding the IDENTITY column, since I am using Option 2, to see if it helps... Thanks,

  • opc.three (6/16/2011)


    newbie.with.sql (6/16/2011)


    I have a flat file that I am importing into my database, but the rows are not going into the table in the same order as in the flat file. I know the data is not arriving in order, because records that should appear later in the file are showing up earlier, in the middle of another section of data where they don't belong.

    The flat file is created in a specific order and contains only one column of data, so I can't ORDER BY that column to recreate the order.

    How can I get the bulk insert task or command to keep the order of the flat file?

    Step 1:

    > Make sure you have an IDENTITY column on your load table.

    Step 2:

    > Option 1: In theory you could try setting the threads to 1 and the batch commit size to 1. I have not tried it, but that may guarantee an ordered load. A batch commit size of 1 will be slow...slow like issuing a ton of sequential batches, each with one INSERT statement, but it may get the job done. How big is the file?

    > Option 2: The bcp.exe utility could work for you if Option 1 fails. bcp reads and loads files sequentially and, to my knowledge, is single-threaded, so the same theory above would apply. If you set the batch commit size to 1 (the -b option) you should get an ordered load.

    Unless you add a column to the source file that states the order of the rows, you really can't count on these working either.

    SQL Server doesn't store "the rows are in this order" data unless you tell it to, tell it what the order is, and define that order either based on the data or by manually assigning it to each row.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • That's why the IDENTITY column is added in Step 1.

    You're not going for "physical order", are you? I'm pretty sure all that's needed is logical order. Once the data is loaded, this will work just fine:

    SELECT *
    FROM dbo.load_table
    ORDER BY id -- id being the IDENTITY column

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (6/17/2011)


    That's why the IDENTITY column is added in Step 1.

    You're not going for "physical order", are you? I'm pretty sure all that's needed is logical order. Once the data is loaded, this will work just fine:

    SELECT *
    FROM dbo.load_table
    ORDER BY id -- id being the IDENTITY column

    You would need to verify it through some means. All it takes is one other process running on the CPU, and sequentiality can be ruined. You might get what you want, or you might not, and you won't know unless you manually verify it. Like you said in your post, "should". That's not "will". You can't count on it.

    If the order matters, and apparently it does, it either needs to be ordered in the source file, or manually ordered/verified in the destination table.

    Honestly, ordering in the source file usually isn't that hard. Excel can do that for you in many cases. So can several text editors that can add line numbers to documents.

    Outside of that, import through a VB file reader. Those are line-by-line by definition, using a For...Next loop, and you can force ordering into the table that way. But don't use an ID column; use a counter in the loop. That makes it certain (it adds sequencing in the source, essentially, by adding it inline during the pull).
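The counter-in-the-loop pattern described above is language-agnostic. A minimal sketch in Python (with SQLite standing in for the destination table; the function and column names here are illustrative, not from this thread):

```python
import sqlite3

def load_ordered(path, conn):
    # Read the file line by line and store each line together with a
    # sequence number taken from the loop counter, so the original file
    # order is captured as data instead of assumed from insert order.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_table (seq INTEGER, some_info TEXT)")
    with open(path) as f:
        for seq, line in enumerate(f, start=1):  # counter = file position
            conn.execute(
                "INSERT INTO load_table (seq, some_info) VALUES (?, ?)",
                (seq, line.rstrip("\n")))
    conn.commit()
```

After the load, SELECT ... ORDER BY seq reproduces the file order deterministically, because the order is now stored in the data itself.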

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (6/17/2011)


    opc.three (6/17/2011)


    That's why the IDENTITY column is added in Step 1.

    You're not going for "physical order", are you? I'm pretty sure all that's needed is logical order. Once the data is loaded, this will work just fine:

    SELECT *
    FROM dbo.load_table
    ORDER BY id -- id being the IDENTITY column

    You would need to verify it through some means. All it takes is one other process running on the CPU, and sequentiality can be ruined. You might get what you want, or you might not, and you won't know unless you manually verify it. Like you said in your post, "should". That's not "will". You can't count on it.

    If the order matters, and apparently it does, it either needs to be ordered in the source file, or manually ordered/verified in the destination table.

    Honestly, ordering in the source file usually isn't that hard. Excel can do that for you in many cases. So can several text editors that can add line numbers to documents.

    Outside of that, import through a VB file reader. Those are line-by-line by definition, using a For...Next loop, and you can force ordering into the table that way. But don't use an ID column; use a counter in the loop. That makes it certain (it adds sequencing in the source, essentially, by adding it inline during the pull).

    I guess I don't see how it could not be ordered. With a single-threaded file loader (bcp) reading the file one row at a time, and a batch size of 1 (meaning a commit is issued every time a row is pushed to the table), then ACID being what it is, it would (there, I said it) be ordered in the table as it is in the file.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I tried using this line via a stored procedure, but the records were still mixed up:

    bulk insert Data from 'C:\SQLData\Import\P11.dat' with (fieldterminator = '', rowterminator = '\r', rows_per_batch = 1)

    Is this incorrect? It took about the same amount of time to insert the records as it did without the rows_per_batch addition, which seems strange.

  • GSquared,

    How do I manually assign it to each row?

    Thanks

  • The bcp.exe utility is what you want. bcp is much different from BULK INSERT.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I couldn't help myself...it's been a few years since I used bcp regularly, so I had to put what I said to the test.


    Step 1) Execute this code against your database server to build a table and a simple view on top of the table to mask the identity column from bcp. The view makes it more convenient for us because it alleviates the need to use a format file.

    USE tempdb
    GO
    IF OBJECT_ID(N'dbo.ordered_load_view') > 0
        DROP VIEW dbo.ordered_load_view ;
    GO
    IF OBJECT_ID(N'dbo.ordered_load_table') > 0
        DROP TABLE dbo.ordered_load_table ;
    GO
    CREATE TABLE dbo.ordered_load_table (
        id INT IDENTITY(1,1) NOT NULL,
        some_info VARCHAR(100) NOT NULL) ;
    GO
    CREATE VIEW dbo.ordered_load_view
    AS
    SELECT some_info FROM dbo.ordered_load_table
    GO

    Step 2) Save the text file from the zip file attached to this post to C:\@\ and then run this bcp command from a Cmd prompt (adjust the database server\instance as needed) to populate the table via the view. The text file contains 1 million lines of text, each line containing its own line number, so it is easy to compare with the IDENTITY column later.

    bcp tempdb.dbo.ordered_load_view in C:\@\1.txt -b 1 -c -T -S server\instance

    Step 3) Execute this T-SQL code to verify that the data was inserted in the order of the file:

    USE tempdb
    GO
    SET NOCOUNT OFF ;
    GO
    RAISERROR('checking for rows loaded out of order',10,1) WITH NOWAIT ;
    GO
    SELECT *
    FROM ordered_load_table
    -- this checks if any data file row numbers did not match the id assigned by the DB
    WHERE id != CAST(some_info AS INT) ;
    GO

    On my two-proc machine running SQL 2008 R2 Standard Edition, 0 rows were returned from the verification check.* This tells us that the technique could work for you, newbie.with.sql.


    I also decided to look into the options available with BULK INSERT, since I have less experience with it than with bcp and was not sure if it used the same API (still not sure about that), so I figured this was a good opportunity to explore it. I found that by setting the BATCHSIZE option to 1, I was able to achieve the same result as with the bcp command above. Here is the code for that test, using the same file:

    USE tempdb
    GO
    SET NOCOUNT ON ;
    GO
    IF OBJECT_ID(N'dbo.ordered_load_view') > 0
        DROP VIEW dbo.ordered_load_view ;
    GO
    IF OBJECT_ID(N'dbo.ordered_load_table') > 0
        DROP TABLE dbo.ordered_load_table ;
    GO
    CREATE TABLE dbo.ordered_load_table (
        id INT IDENTITY(1,1) NOT NULL,
        some_info VARCHAR(10) NOT NULL) ;
    GO
    CREATE VIEW dbo.ordered_load_view
    AS
    SELECT some_info FROM dbo.ordered_load_table
    GO
    RAISERROR('starting BULK INSERT',10,1) WITH NOWAIT ;
    GO
    BULK INSERT tempdb.dbo.ordered_load_view FROM 'C:\@\1.txt'
    WITH (ROWTERMINATOR = '',
          BATCHSIZE = 1, -- Edit: this is the key to the technique. SET NOCOUNT OFF to see why
          ROWS_PER_BATCH = 1,
          MAXERRORS = 0) ;
    GO
    RAISERROR('BULK INSERT complete',10,1) WITH NOWAIT ;
    GO
    RAISERROR('checking for rows loaded out of order',10,1) WITH NOWAIT ;
    GO
    SET NOCOUNT OFF ;
    GO
    SELECT *
    FROM ordered_load_table
    -- this checks if any data file row numbers did not match the id assigned by the DB
    WHERE id != CAST(some_info AS INT) ;
    GO

    I wasn't sure, because I do not use BULK INSERT much, but in fact 0 rows were returned from the verification check.* This means that BULK INSERT could also work for you, newbie.with.sql, by using the BATCHSIZE option.


    Loading the 1-million-line file took ~35 minutes for BULK INSERT and ~41 minutes for bcp (402.51 records per second). I knew it would take a while loading one row at a time, which is why I asked how big the file was. If you have million-row files and can't wait 35 minutes for them to load, then you may want to look into adding the row numbers to the file itself. That will be more programming work but will certainly allow the file to be loaded fast.


    * I started two additional database windows running simulation code in an infinite loop and checked to ensure I had activity on both CPUs before I started the BCP and BULK INSERT commands.
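The "add the row numbers to the file itself" preprocessing mentioned above could be as small as this Python sketch (the function name and the pipe delimiter are my choices, not from the post):

```python
def number_file(src_path, dst_path, sep="|"):
    # Prepend a 1-based line number to every line so the table can use a
    # fast multi-batch load and still be ordered by that column afterward.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for n, line in enumerate(src, start=1):
            dst.write(f"{n}{sep}{line}")
```

The numbered file would then load at full speed with a normal BATCHSIZE, and the line-number column replaces the IDENTITY column as the ordering key.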

    EDIT: re-attached zip file with test file...it was somehow lost along the way

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato
