Using OPENROWSET to import CSV files

  • Please also consider using an SSIS package to import data from files to a database - that is its primary purpose in life, and I don't like to see a good tool get upset. πŸ˜‰

  • Paul White (11/6/2009)


    Please also consider using an SSIS package to import data from files to a database - that is its primary purpose in life, and I don't like to see a good tool get upset. πŸ˜‰

    SSIS requires a fixed number of columns in the import definition.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (11/10/2009)


    SSIS requires a fixed number of columns in the import definition.

    Well it certainly prefers to deal with stable metadata, that's for sure. But, unless the requirement is actually to import a file with truly 'any structure' into an unspecified table, there's often a way around it.

Most frequently I've come across this as different files being subsets of a larger structure, or something like that. If there's some structure or logic to work with, it's often possible to do.

  • Jeff Moden (11/6/2009)


    Simon Parry (11/6/2009)


    Hi

    i have tried using BULK INSERT

    using

set @string ='BULK INSERT CSVTest FROM ''\\10.1.2.107\rp_uploaded_files\'+@file+''' WITH (FIELDTERMINATOR = '','',ROWTERMINATOR = ''\n'')'

    exec (@string)

but this requires the destination table to have the same number of columns as the import file.

The files I want to import have an unknown number of columns.

    Can you help at all?

    Simon

    Yep... attach a copy of one of the files (unless it has private info in it) to your next post and tell me what you think you'd like to do with it.

    Simon?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi

    Sorry for the delay

I have attached 3 sample files; it's possible each one of these may be used.

If the files were the same size each time I could do it, but because they have a varying number of columns I'm stuck.

    Many thanks for all your help πŸ™‚

    Simon

  • That's the other piece of the puzzle I need, Simon... what do you want to do with the data when it's brought in? Separate tables? Common Table? If a common table, what do you want it to look like?

    ...ooooorrrrrrr....

    ... do you just want to be able to use the data short term as if it were in a table?

    --Jeff Moden



  • Hi Jeff

Thanks for your reply.

I would want the data stored in separate tables.

    thanks

    Simon

Using BULK INSERT won't work if there are a variable number of columns in the data file.

    Something like:

    fld1A,fld2A,fld3A

    fld1B,fld2B,fld3B

    fld1C,fld2C,fld3C,fld4C

Oddly, the old DTS data pump/bulk insert tasks were fine with the above.

    They must have used the MSDASQL driver for the insertion process.

    πŸ˜‰

  • Wow... sorry about that, folks. I really lost track of this post.

    Simon, after only 5 months of delay :blush:, do you still need any help on this?

    --Jeff Moden



  • Jeff Moden (4/23/2010)


    Wow... sorry about that, folks. I really lost track of this post.

    Simon, after only 5 months of delay :blush:, do you still need any help on this?

Jeff, I would like to see a solution for this... I smell that I might run into this problem in a short while. Mine is exactly what Simon Parry was talking about: I am going to get inconsistent CSV files (meaning, the number of columns will differ every day) and I need to import them into a table. So if you have a solution, and I am sure you'll have one which will eventually be the best πŸ˜‰, please share with us!!

Thanks in advance, Jeff!!

  • I have 2 solutions in mind.

    Solution 1:

1. Ask the CSV team to tell us if extra columns will be added in today's file.

2. Create many format (*.fmt) files and use them according to the CSV team's response.

3. Utilize BCP/BULK INSERT along with the format files to import them (a sketch of one such format file follows this list).
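For illustration, a minimal non-XML format file for a hypothetical three-column version of the feed might look like this (the column names, widths, file name, and target table below are made up for the sketch)...

9.0
3
1   SQLCHAR   0   8000   ","      1   Col1   ""
2   SQLCHAR   0   8000   ","      2   Col2   ""
3   SQLCHAR   0   8000   "\r\n"   3   Col3   ""

... and the matching import would be something like...

BULK INSERT dbo.CSVTest3
FROM '\\10.1.2.107\rp_uploaded_files\Sample1.csv'
WITH (FORMATFILE = '\\10.1.2.107\rp_uploaded_files\ThreeCol.fmt', FIRSTROW = 2); --FIRSTROW = 2 skips the header row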

    Solution 2:

1. Use OPENROWSET to capture the column names in the CSV file.

2. Create a table with those column names (using NVARCHAR for all the columns, of course).

3. Create a BULK INSERT statement using the above table name, dynamically.

4. EXEC the dynamic statement.

Hmmm... hope these 2 solutions are feasible. I did a POC of Solution 1 and it worked fine. For Solution 2, I have yet to start (a rough sketch follows this post)! I am also quite interested in GSquared's import-to-temp-table solution; I will give it a shot in some time!

Now, Jeff, waiting for your inputs on this "tricky" problem. πŸ™‚

    Cheers!!
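Since Solution 2 is only described in prose above, here is a rough, untested sketch of how it could go, assuming ad hoc distributed queries are enabled and the Microsoft Text ODBC driver can actually read the file (the folder, file, and table names are placeholders, and as Jeff notes further down, this driver approach will not cope with every file)...

--===== 1. Let the text driver read the file once, just to expose the column names.
 SELECT TOP 0 *
   INTO #Discovered
   FROM OPENROWSET('MSDASQL',
                   'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\Temp;',
                   'SELECT * FROM Sample1.csv');

--===== 2. Build the CREATE TABLE from tempdb's metadata (NVARCHAR everywhere, as proposed).
DECLARE @SQL NVARCHAR(MAX);
 SELECT @SQL = ISNULL(@SQL + ',', '') + QUOTENAME(name) + ' NVARCHAR(4000)'
   FROM tempdb.sys.columns
  WHERE object_id = OBJECT_ID('tempdb..#Discovered')
  ORDER BY column_id;
 SELECT @SQL = 'CREATE TABLE dbo.Import_Sample1 (' + @SQL + ');';
   EXEC (@SQL);

--===== 3 & 4. Build and EXEC the BULK INSERT dynamically, skipping the header row.
   EXEC ('BULK INSERT dbo.Import_Sample1
          FROM ''C:\Temp\Sample1.csv''
          WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);');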

  • ColdCoffee (4/24/2010)


I am also quite interested in GSquared's import-to-temp-table solution; I will give it a shot in some time!

    BWAA-HAA!!! Try it on the Sample2.CSV file the OP provided in his original ZIP file... Surprise, surprise, surprise! πŸ˜›

    There is a way to handle it, though...

    --Jeff Moden



  • I put the OP's files in C:\Temp to have the following code run...

--DROP TABLE #Staging,#Header,#Data
--GO
--===== Suppress auto display of rowcounts
    SET NOCOUNT ON;

--===== Declare local variables
DECLARE @pFullPath VARCHAR(256), --Full path to the file. May be a UNC. Would be a parameter in a stored proc
        @SQLBulk   VARCHAR(MAX), --Dynamic Bulk Insert command
        @SQLSelect VARCHAR(MAX)  --Dynamic Select command
;
 SELECT @pFullPath = 'C:\Temp\Sample2.csv' --Just for testing purposes.
;
--===== Create the staging table. The others will be created on the fly.
 CREATE TABLE #Staging
        (LineData VARCHAR(8000))
;
--===== Bulk Insert the dynamically named file
     -- (terminators assumed to be '\n', lost in the forum rendering;
     --  each whole line lands in the single LineData column)
 SELECT @SQLBulk =
' BULK INSERT #Staging
   FROM ' + QUOTENAME(@pFullPath,'''') + '
   WITH (
         FIELDTERMINATOR = ''\n'',
         ROWTERMINATOR   = ''\n''
        );'
;
   EXEC (@SQLBulk)
;
--===== Split and isolate the header data in a table and cover
     -- for missing header names.
   WITH
cteNumber AS
(
 SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1 AS LineNumber,
        LineData
   FROM #Staging
)
 SELECT split.ItemNumber,
        QUOTENAME(ISNULL(NULLIF(split.Item,''),split.ItemNumber)) AS Data
   INTO #Header
   FROM cteNumber stage
  CROSS APPLY (SELECT * FROM dbo.DelimitedSplit8K(LineData,',')) split
  WHERE stage.LineNumber = 0
;
--===== Split and isolate the data
   WITH
cteNumber AS
(
 SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1 AS LineNumber,
        LineData
   FROM #Staging
)
 SELECT stage.LineNumber,
        split.ItemNumber,
        split.Item AS Data
   INTO #Data
   FROM cteNumber stage
  CROSS APPLY (SELECT * FROM dbo.DelimitedSplit8K(LineData,',')) split
  WHERE LineNumber > 0
;
--===== Put everything together as a single result set with the proper column names and all
     -- Create the dynamic part of the SELECT list first
 SELECT @SQLSelect = ISNULL(@SQLSelect+','+CHAR(10),'')
      + 'MAX(CASE WHEN d.ItemNumber =' + STR(h.ItemNumber,3) + ' THEN Data ELSE '''' END) AS ' + h.Data
   FROM #Header h
;
     -- Add the static part of the query to the dynamic part
 SELECT @SQLSelect = 'SELECT d.LineNumber,' + CHAR(10)
      + @SQLSelect + CHAR(10)
      + 'FROM #Data d GROUP BY d.LineNumber'
;
--===== We're ready to rock... return the result set
   EXEC (@SQLSelect)
;
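Just to make the dynamic part easier to follow: for a file whose header row happened to be ID,Name,Amount (hypothetical names, purely for illustration), the two SELECTs above would leave @SQLSelect holding a cross-tab like this...

SELECT d.LineNumber,
MAX(CASE WHEN d.ItemNumber =  1 THEN Data ELSE '' END) AS [ID],
MAX(CASE WHEN d.ItemNumber =  2 THEN Data ELSE '' END) AS [Name],
MAX(CASE WHEN d.ItemNumber =  3 THEN Data ELSE '' END) AS [Amount]
FROM #Data d GROUP BY d.LineNumber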

    ... and here's my splitter function...

CREATE FUNCTION dbo.DelimitedSplit8K
/***************************************************************************************************
 Purpose:
 Split a given string at a given delimiter and return a list of the split elements (items).

 Returns:
 iTVF containing the following:
 ItemNumber = Element position of Item as a BIGINT (not converted to INT to eliminate a CAST)
 Item       = Element value as a VARCHAR(8000)

 CROSS APPLY Usage Example:
---------------------------------------------------------------------------------------------------
--===== Conditionally drop the test tables to make reruns easier for testing.
     -- (this is NOT a part of the solution)
     IF OBJECT_ID('tempdb..#JBMTest') IS NOT NULL
        DROP TABLE #JBMTest
;
--===== Create and populate a test table on the fly (this is NOT a part of the solution).
 SELECT *
   INTO #JBMTest
   FROM (
         SELECT 1,'1,10,100,1000,10000,100000,1000000' UNION ALL
         SELECT 2,'2000000,200000,20000,2000,200,20,2' UNION ALL
         SELECT 3, 'This,is,a,test'                    UNION ALL
         SELECT 4, 'and so is this'                    UNION ALL
         SELECT 5, 'This, too (no pun intended)'
        ) d (SomeID,SomeValue)
;
GO
--===== Split the CSV column for the whole table using CROSS APPLY (this is the solution)
 SELECT test.SomeID, split.ItemNumber, split.Item
   FROM #JBMTest test
  CROSS APPLY
        (
         SELECT ItemNumber, Item
           FROM dbo.DelimitedSplit8K(test.SomeValue,',')
        ) split
;
---------------------------------------------------------------------------------------------------
 Notes:
 1. Optimized for VARCHAR(8000) or less.
 2. Optimized for single character delimiter.
 3. Optimized for use with CROSS APPLY.
 4. Does not "trim" elements just in case leading or trailing blanks are intended.
 5. If you don't know how a Tally table can be used to replace loops, please see the following...
    http://www.sqlservercentral.com/articles/T-SQL/62867/
 6. Changing this function to use VARCHAR(MAX) will cause it to run twice as slow. It's just the
    nature of VARCHAR(MAX) whether it fits in-row or not.
 7. Multi-machine testing for the method of using UNPIVOT instead of 10 SELECT/UNION ALLs shows
    that the UNPIVOT method is quite machine dependent and can slow things down quite a bit.
 8. Performance testing shows using "TOP" for the limiting criteria of "N" is actually
    slower and slightly more CPU intensive than the traditional WHERE N < LEN(@pString) + 2.
 9. Performance testing shows using ORDER BY (SELECT x) where "x" is anything is actually
    slower and slightly more CPU intensive than the traditional ORDER BY (SELECT N).

 Credits:
 This code is the product of many people's efforts including but not limited to the following:
 cteTally concept originally by Itzik Ben-Gan and "decimalized" by Lynn Pettis (and others) for a
 bit of extra speed and finally redacted by Jeff Moden for a different slant on readability and
 compactness. Hats off to Paul White for his simple explanations of CROSS APPLY. Finally,
 special thanks to Erland Sommarskog for his tireless efforts to help people understand
 what you can actually do with T-SQL. I also thank whoever wrote the first article I ever saw
 on "numbers tables" which is located at the following URL ...
 http://sqlserver2000.databases.aspfaq.com/why-should-i-consider-using-an-auxiliary-numbers-table.html

 Revision History:
 Rev 00 - 20 Jan 2010 - Concept: Lynn Pettis and others.
                        Redaction/Implementation: Jeff Moden
        - Base 10 redaction and reduction for CTE. (Total rewrite)
 Rev 01 - 13 Mar 2010 - Jeff Moden
        - Removed one additional concatenation and one subtraction from the SUBSTRING in the
          SELECT list for that tiny bit of extra speed.
 Rev 02 - 14 Apr 2010 - Jeff Moden
        - No code changes. Added CROSS APPLY usage example to the header, some additional credits,
          and extra documentation.
 Rev 03 - 18 Apr 2010 - Jeff Moden
        - No code changes. Added notes 7, 8, and 9 about certain "optimizations" that don't
          actually work for this type of function.
***************************************************************************************************/
--===== Define I/O parameters
        (
        @pString    VARCHAR(8000),
        @pDelimiter CHAR(1)
        )
RETURNS TABLE
AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values up to
     -- 10,000... enough to cover VARCHAR(8000)
WITH
      E1(N) AS ( --=== Create Ten 1's
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 --10
               ),
      E2(N) AS (SELECT 1 FROM E1 a, E1 b),   --100
      E4(N) AS (SELECT 1 FROM E2 a, E2 b),   --10,000
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM E4)
--===== Do the split
 SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemNumber,
        N AS StartPosition,
        SUBSTRING(@pString, N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, N) - N) AS Item
   FROM cteTally
  WHERE N < LEN(@pString) + 2
    AND SUBSTRING(@pDelimiter + @pString, N, 1) = @pDelimiter
;

    --Jeff Moden



Thanks, Jeff, for posting your approach. As I don't have SSMS at home, I will check out your code tomorrow and let you know.

But I have a doubt here. From parsing your code, it looks like you are utilising two tables for the CSV data. Hmmm, but the Solution 2 I proposed seems to use only one table for the CSV data. Anyway, I will try out both options and give you the result on Monday.

    Thanks again, Jeff! πŸ™‚

    Cheers!!

    C'est Pras!

  • ColdCoffee (4/24/2010)


Thanks, Jeff, for posting your approach. As I don't have SSMS at home, I will check out your code tomorrow and let you know.

But I have a doubt here. From parsing your code, it looks like you are utilising two tables for the CSV data. Hmmm, but the Solution 2 I proposed seems to use only one table for the CSV data. Anyway, I will try out both options and give you the result on Monday.

    Thanks again, Jeff! πŸ™‚

    Cheers!!

    C'est Pras!

The problem with your first suggestion is that you can't anticipate what humans will add, nor what they'll call it, which is exactly the current problem... unknown things being added without warning.

I like your second solution, but the problem with that is... neither OPENROWSET nor a text-based linked server will open the Sample2.csv file. You could, however, use the method I used to load and parse the header line to build the column list, and you might then be able to use a delimited BULK INSERT, but that would also require creating the table with dynamic SQL, which adds additional complexity to the problem (although it will probably be faster than doing a Tally split).

    --Jeff Moden


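For completeness, here is a minimal, untested sketch of the hybrid Jeff describes just above: reuse the already-split #Header from his earlier script to create a real table with dynamic SQL, then re-import the same file delimited. The target table name is a placeholder, @pFullPath is assumed to still be in scope from that script, and as he says, ragged rows like Sample2.csv's will still need extra handling...

DECLARE @SQLCreate VARCHAR(MAX), @SQLBulk2 VARCHAR(MAX);

--===== Build a CREATE TABLE from the header names captured in #Header
     -- (Data is already QUOTENAME'd there, so it is safe to use as a column name).
 SELECT @SQLCreate = ISNULL(@SQLCreate + ',', '') + h.Data + ' VARCHAR(8000)'
   FROM #Header h
  ORDER BY h.ItemNumber;
 SELECT @SQLCreate = 'CREATE TABLE dbo.ImportTarget (' + @SQLCreate + ');';
   EXEC (@SQLCreate);

--===== Re-import the same file, this time comma-delimited, skipping the header row.
 SELECT @SQLBulk2 = 'BULK INSERT dbo.ImportTarget
                     FROM ' + QUOTENAME(@pFullPath,'''') + '
                     WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);';
   EXEC (@SQLBulk2);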
