Skipping Columns (& other tricks) Using BCP Format Files (SQL Spackle)

  • Andy DBA (1/23/2014)


    Jeff Moden (1/23/2014)


    If the quotes are consistent in the column, then you are, in fact, missing out on a format file option that MS didn't document because they "don't support CSVs". \" is what you use as part of the FIELDTERMINATOR column to represent double quotes. Importing such CSVs using BCP and BULK INSERT will be the subject of another article.
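    To make the \" trick concrete, here's a sketch of what such a format file might look like for a two-column, fully quoted CSV (hypothetical column names; the leading double quote on the first field of each row still has to be "eaten" separately, as discussed below):

    ```
    11.0
    2
    1 SQLCHAR 0 8000 "\",\""    1 Col1 ""
    2 SQLCHAR 0 8000 "\"\r\n"   2 Col2 ""
    ```

    The idea is that the closing quote of one field, the comma, and the opening quote of the next field are all treated as a single terminator string.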

    Of course! Yes, I do remember coming across the approach where double quotes are included as part of the field terminator. There's also a way to "eat" the leading double quote from the first column, but I'll wait for your next article on the topic. I was able to take an approach where I didn't need format files, but I guess what I was really hoping for was a magically-treat-double-quotes-like-Excel-does command line option! :-D

    The old 32-bit "Jet" drivers used to work a treat for exactly that using OPENROWSET. I've not tried the 64-bit ACE drivers, but rumor has it they work just as well. The problem with OPENROWSET is that it requires some serious privs, whereas BULK INSERT does not.
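    For reference, the OPENROWSET pattern being described looks something like this (a sketch only: the folder path and file name are placeholders, and the ACE provider must be installed and ad hoc distributed queries enabled):

    ```sql
    -- Read a quoted CSV through the ACE text driver; the driver itself
    -- handles the double quotes "like Excel does".
    SELECT *
    FROM OPENROWSET(
             'Microsoft.ACE.OLEDB.12.0',
             'Text;Database=C:\SomeFolder;HDR=YES;FMT=Delimited',
             'SELECT * FROM [SomeFile.csv]'
         );
    ```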

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This article helped me big time when I had to load the 18GB NSRL text file here. All I needed was the MD5 column, though I find it somewhat odd that the destination table has to have all the columns in it, as opposed to just the one you're loading into.

    I ran into a similar problem as another poster with the text file having double quotes, but deleting dupes and then replacing the double quotes in a table performed about as well as can be expected, especially vs. PowerShell, which processes large text files embarrassingly slowly, even using System.IO.StreamReader or System.IO.File. LBAL is just as bad as RBAR 😛

    SQL and format file contents are below.

    USE [nist];

    --CREATE TABLE dbo.nist_stage
    --(
    --    [SHA1]         VARCHAR(10),
    --    [MD5]          CHAR(40),
    --    [CRC32]        VARCHAR(10),
    --    [FileName]     VARCHAR(10),
    --    [FileSize]     VARCHAR(10),
    --    [ProductCode]  VARCHAR(10),
    --    [OpSystemCode] VARCHAR(10),
    --    [SpecialCode]  VARCHAR(10)
    --)
    --DROP TABLE dbo.nist_stage
    --TRUNCATE TABLE dbo.nist_stage
    --CREATE CLUSTERED INDEX [MD5] ON [dbo].[nist_stage] ([MD5] ASC)
    --DROP INDEX [MD5] ON [dbo].[nist_stage]

    BULK INSERT dbo.nist_stage
    FROM 'M:\bcp\NSRLFile.txt'
    WITH (
         BATCHSIZE    = 50000
        ,CODEPAGE     = 'RAW'
        ,DATAFILETYPE = 'char'
        ,FIRSTROW     = 2
        ,FORMATFILE   = 'M:\bcp\Admin\format.fmt'
        ,MAXERRORS    = 2000000000
        ,ERRORFILE    = 'M:\bcp\Admin\error.log'
        ,TABLOCK
    );

    --SELECT TOP 1000 * FROM dbo.nist_stage

    ;WITH dedupe AS (
        SELECT MD5, ROW_NUMBER() OVER (PARTITION BY MD5 ORDER BY MD5) Rn
        FROM dbo.nist_stage AS ns
    )
    DELETE D
    FROM dedupe D
    WHERE D.Rn > 1;

    UPDATE dbo.nist_stage
    SET MD5 = REPLACE(MD5, '"', '');

    DROP INDEX [MD5] ON [dbo].[NSRLFiles];

    INSERT INTO dbo.NSRLFiles (MD5)
    SELECT ns.MD5
    FROM dbo.nist_stage AS ns;

    CREATE CLUSTERED INDEX [MD5] ON [dbo].[NSRLFiles] ([MD5] ASC);

    /*
    Times with clustered index present on staging table:
        Load:                 31:51
        Delete dupes:          3:53
        Remove double quotes:  8:25

    Times without clustered index present on staging table:
        Load:                  5:04
        Delete dupes:          9:42
        Remove double quotes:  1:32

    Times with PowerShell processing the file:
        Delete dupes:         Ew
        Remove double quotes: Nope
        Load:                 Why?
    */

    format.fmt:

    11.0
    8
    1 SQLCHAR 0 8000 ","  0 SHA1         ""
    2 SQLCHAR 0 8000 ","  2 MD5          ""
    3 SQLCHAR 0 8000 ","  0 CRC32        ""
    4 SQLCHAR 0 8000 ","  0 FileName     ""
    5 SQLCHAR 0 8000 ","  0 FileSize     ""
    6 SQLCHAR 0 8000 ","  0 ProductCode  ""
    7 SQLCHAR 0 8000 ","  0 OpSystemCode ""
    8 SQLCHAR 0 8000 "\r" 0 SpecialCode  ""
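    For anyone curious what the format file's column-skip (the `0` server-column entries), FIRSTROW = 2, the quote stripping, and the dedupe CTE boil down to, here's a minimal Python sketch of the same logic. This is purely illustrative, not a recommendation over BULK INSERT at 18GB (the timings above make that case); the sample data is made up.

    ```python
    import csv
    import io

    # Miniature stand-in for NSRLFile.txt: quoted CSV, header row, duplicates.
    sample = io.StringIO(
        '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
        '"00000079FD7AAC9B","ABC123","C9B2","a.txt","4","1","2",""\n'
        '"00000079FD7AAC9B","ABC123","C9B2","a.txt","4","1","2",""\n'
        '"0000012BCE8A86C9","DEF456","AB11","b.txt","8","1","2",""\n'
    )

    def load_md5s(fileobj):
        """Keep only field 2 (MD5), skip the header row, and de-dupe --
        roughly what the format file's server-column-0 entries, FIRSTROW = 2,
        the REPLACE, and the dedupe CTE accomplish together."""
        reader = csv.reader(fileobj)   # the csv module eats the double quotes
        next(reader)                   # FIRSTROW = 2: skip the header
        seen = set()
        out = []
        for row in reader:
            md5 = row[1]               # second field only; the rest are skipped
            if md5 not in seen:
                seen.add(md5)
                out.append(md5)
        return out

    md5s = load_md5s(sample)
    print(md5s)  # ['ABC123', 'DEF456']
    ```
    
    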

  • sqldriver - Thursday, August 21, 2014 9:46 AM


    Crud... my most sincere apologies.  Talk about being "behind", I totally missed your good feedback from 4 years ago.  Thank you for the kudo and thank you for the time you spent writing some great feedback on what you did.  It should help others a lot.

    --Jeff Moden



  • Jeff Moden - Tuesday, July 17, 2018 3:27 PM


    Crud... my most sincere apologies.  Talk about being "behind", I totally missed your good feedback from 4 years ago.  Thank you for the kudo and thank you for the time you spent writing some great feedback on what you did.  It should help others a lot.

    Yeah, but it bumped this article back up there in the latest posts, so hopefully more people will come across this one. It is a great article. If only everyone knew how many things can be done, and how well, using BULK INSERT and bcp.
    SSIS seems to be overkill at times, and bcp and BULK INSERT can be excellent (and faster) options in those cases. But I don't hate SSIS like maybe some others do 🙂

    Sue

  • Thanks for the great feedback, Sue.  And thank you for all the great posts you've made.  You're definitely a heavy hitter and I'd love to see you start writing articles.  You've got some great stuff to share.  It's easy to come up with a topic... pick one of the many posts you've answered and write an article about it.

    As for SSIS, I seem to come across as an "SSIS Hater".  I'm actually not... I just hate what a lot of people have done with it. 😀

    --Jeff Moden


