• This article helped me big time having to load the 18Gb text file here[/url]. All I needed was the MD5 column, though I find it somewhat odd that the destination table has to have all the columns in it, as opposed to just the one you're loading into from.

    I ran into a similar problem as another poster with the text file having double quotes, but deleting dupes from and then replacing the double quotes in a table performed about as well as can be expected, especially vs. PowerShell, which processes large text files embarrassingly slow, even using the System.IO.StreamReader or System.IO.File. LBAL is just as bad as RBAR 😛

    SQL and format file contents are below.

    USE [nist]

    --CREATE TABLE dbo.nist_stage

    -- (

    --[SHA1] VARCHAR(10),

    --[MD5] CHAR(40),

    --[CRC32] VARCHAR(10),

    --[FileName] VARCHAR(10),

    --[FileSize] VARCHAR(10),

    --[ProductCode] VARCHAR(10),

    --[OpSystemCode] VARCHAR(10),

    --[SpecialCode] VARCHAR(10)

    -- )

    --DROP TABLE dbo.nist_stage

    --TRUNCATE TABLE dbo.nist_stage

    --CREATE CLUSTERED INDEX [MD5] ON [dbo].[nist_stage]

    --(

    --[MD5] ASC

    --)

    --DROP INDEX [MD5] ON [dbo].[nist_stage]

    BULK INSERT dbo.nist_stage

    FROM 'M:\bcp\NSRLFile.txt'

    WITH (

    BATCHSIZE = 50000

    ,CODEPAGE = 'RAW'

    ,DATAFILETYPE = 'char'

    ,FIRSTROW = 2

    ,FORMATFILE = 'M:\bcp\Admin\format.fmt'

    ,MAXERRORS = 2000000000

    ,ERRORFILE = 'M:\bcp\Admin\error.log'

    ,TABLOCK

    )

    --SELECT TOP 1000 * FROMdbo.nist_stage

    ;WITH dedupe AS (

    SELECT MD5, ROW_NUMBER() OVER (PARTITION BY MD5 ORDER BY MD5) Rn

    FROM dbo.nist_stage AS ns

    )

    DELETE D

    FROM dedupe D

    WHERE D.Rn > 1

    UPDATE dbo.nist_stage

    SET MD5 = REPLACE(MD5, '"', '')

    DROP INDEX [MD5] ON [dbo].[NSRLFiles]

    INSERT INTO dbo.NSRLFiles (MD5)

    SELECT ns.MD5

    FROM dbo.nist_stage AS ns;

    CREATE CLUSTERED INDEX [MD5] ON [dbo].[NSRLFiles]

    (

    [MD5] ASC

    )

    /*

    Times with clustered index present on staging table

    Load: 31:51

    Delete dupes: 3:53

    Remove double quotes: 8:25

    Time without clustered index present on staging table

    Load: 5:04

    Delete dupes: 9:42

    Remove double quotes: 1:32

    Times with PowerShell processing file:

    Delete dupes: Ew

    Remove double quotes: Nope

    Load: Why?

    */

    11.0

    8

    1 SQLCHAR 0 8000 "," 0 SHA1 ""

    2 SQLCHAR 0 8000 "," 2 MD5 ""

    3 SQLCHAR 0 8000 "," 0 CRC32 ""

    4 SQLCHAR 0 8000 "," 0 FileName ""

    5 SQLCHAR 0 8000 "," 0 FileSize ""

    6 SQLCHAR 0 8000 "," 0 ProductCode ""

    7 SQLCHAR 0 8000 "," 0 OpSystemCode ""

    8 SQLCHAR 0 8000 "\r" 0 SpecialCode ""