Importing CSV files without knowing destination schema

  • I have 4 very large CSV files (a large number of records and a large number of fields) and need to script the process of importing the data from these CSVs into SQL Server. I gather that the BULK INSERT command should do the job here; however, it would appear that this command requires that a suitable table is created first. I don't have the exact schema of the database that the CSVs were exported from and would like to avoid having to go through the CSVs manually, creating a table with the right number and size of fields.

    So I thought I would use the Import Data wizard to manually import the CSVs (thus creating a table with the correct fields). Then I'd be able to script the creation of the table, and include that part in my Bulk Insert script (just before I call the actual Bulk Insert command itself). I hope that part makes sense. Please let me know if there's a better way of doing this.

    Problem is that although I've tried using the "Suggest Types" option, when I actually go to import the file I get errors warning of truncation, even when I go through and manually increase the size of the fields... Is there an easier way to do this?
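    An aside: the "right number and size of fields" can also be discovered by scanning the file once and generating a CREATE TABLE with VARCHAR columns sized to the widest value seen in each column. Outside SQL Server, a minimal Python sketch of that idea (the function name and sample data are hypothetical, not part of any SQL Server tool):

    ```python
    import csv
    import io

    def suggest_create_table(csv_text, table_name, delimiter=","):
        """Scan a delimited file once and emit a CREATE TABLE statement
        whose VARCHAR widths cover the widest value seen per column."""
        reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
        headers = next(reader)           # assume first row holds column names
        widths = [1] * len(headers)      # minimum width of VARCHAR(1)
        for row in reader:
            for i, value in enumerate(row[:len(headers)]):
                widths[i] = max(widths[i], len(value))
        cols = ",\n  ".join(
            f"[{name}] VARCHAR({width})" for name, width in zip(headers, widths)
        )
        return f"CREATE TABLE [{table_name}] (\n  {cols}\n)"

    sample = "Id,Name\n1,Alice\n22,Bartholomew\n"
    print(suggest_create_table(sample, "Artists"))
    ```

    For very large files you would read the file in a streaming fashion rather than holding it in a string, but the "widest value wins" scan is the same.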

  • Do the files have to be CSV? This might be easier if they were XML.

    Facts are stubborn things, but statistics are more pliable - Mark Twain
    Carolyn
    SQLServerSpecialists

  • here's my suggestion:

    add a folder which contains all of your .csv files as a linked server.

    a text linked server will show every .txt file and every .csv file as a table...

    so you can do something really simple like

    SELECT *
      INTO NEWTABLE
      FROM TxtSvr...sample#csv

    which would create the table on sql server with the same structure.

    I think the text driver assumes the first row in a CSV file contains the names of the columns.

    here's the syntax:

    --===== Create a linked server to the drive and path you desire.
    EXEC dbo.sp_AddLinkedServer TxtSvr,
         'Jet 4.0',
         'Microsoft.Jet.OLEDB.4.0',
         'C:\',
         NULL,
         'Text'
    GO
    --===== Set up login mappings.
    EXEC dbo.sp_AddLinkedSrvLogin TxtSvr, FALSE, NULL, Admin, NULL
    GO
    --===== List the tables in the linked server, which is really a list of
    --      file names in the directory. Note that the "#" sign in the
    --      Table_Name is where the period in the filename actually goes.
    EXEC dbo.sp_Tables_Ex TxtSvr
    GO
    --===== Query one of the files by using a four-part name.
    SELECT *
      FROM TxtSvr...sample#csv
    --===== Drop the text server
    EXEC dbo.sp_DropServer 'TxtSvr', 'DropLogins'
    GO

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (8/27/2008)


    here's my suggestion: add a folder which contains all of your .csv files as a linked server... [snip]

    Heh... looks real familiar... 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Lowell, Hi Jeff,

    Thanks for your suggestion - sounds just right for my situation. Everything works fine until I get to this bit:

    SELECT *
      FROM TxtSvr...sample#csv

    It says the table isn't there. In fact, the text server contains just three tables: debuglog#txt, services#txt and test#txt. None of my CSV files have been picked up. Any ideas on what I need to change?

  • Ok, apologies for that guys - I needed to put the csv files on the sql server, not my local drive :blush:

    Hmmm, sorry that this is dragging on, everyone - now that I have the files in the right place (;)) the tables are being picked up. However, when I open the tables, there are twice as many rows in the table as in the CSV file. And every value in every row/column is NULL.

    I should say that my CSV files are actually TAB separated. However, I've tried replacing the tabs with commas and although I then get the correct number of columns, I'm still getting all NULLs. Not sure if it makes any difference, but I've got tons of columns (>50)...
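    An aside: rather than guessing, the delimiter of a file like this can be detected programmatically. A minimal Python sketch using the standard library's csv.Sniffer (the helper name and sample data are hypothetical):

    ```python
    import csv

    def detect_delimiter(sample_text):
        """Guess the delimiter of a delimited-text sample using csv.Sniffer,
        restricted to a few likely candidates. Falls back to tab on failure."""
        try:
            dialect = csv.Sniffer().sniff(sample_text, delimiters=",\t;|")
            return dialect.delimiter
        except csv.Error:
            return "\t"

    tsv_sample = "Id\tName\n1\tAlice\n2\tBob\n3\tCarol\n"
    print(repr(detect_delimiter(tsv_sample)))
    ```

    A few representative lines from the top of the file are enough for the sniffer; there is no need to load a huge file.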

  • thomas.lemesurier (8/28/2008)


    ... when I open the tables, there are twice as many rows in the table as in the CSV file, and every value in every row/column is NULL. ... my CSV files are actually TAB separated ...

    I'm pretty sure the first row determines the number of columns... if there's anything preceding the data, that might be an issue.

    I could understand getting some rows that are null for all fields at the end of the file; every CrLf (Carriage Return + Line Feed... Char(13) + Char(10)) is used to determine whether a row exists; if there are a bunch of CrLfs at the end of the document, I'd expect that, and would use a WHERE clause to ignore them (WHERE COL1 IS NOT NULL or something).

    sometimes a better text editor can help; I use EditPlus, which has find and replace that allows regular expressions and more; if I were to open a file like this in a linked text server, you can see that rows 9 thru 14 would exist and be null.

    Note how this text editor displays CrLf with a paragraph symbol, tabs as a double-arrow character, and spaces as a floating period... it helps visualize the data much better.

    If ALL of your data is null, it might be that the file is Unix formatted... it has Char(10) as the row terminator and doesn't have the additional Char(13) that is expected;

    you could find and replace, or, for example, with EditPlus you can open the file, change the Document...File Format...PC/UNIX/Mac setting to PC, and re-save the text file.

    Lots of other text editors have the same ability.
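    An aside: the editor round-trip described above (re-saving a Unix-formatted file in PC format) can also be scripted. A minimal Python sketch, assuming the file fits in memory (the function name is hypothetical):

    ```python
    def to_pc_format(data: bytes) -> bytes:
        """Convert Unix (bare Lf) or mixed line endings to Windows CrLf,
        which is the row terminator the Jet text driver expects."""
        # Normalize any existing CrLf down to bare Lf first,
        # so already-correct lines don't get doubled to CrCrLf.
        normalized = data.replace(b"\r\n", b"\n")
        return normalized.replace(b"\n", b"\r\n")

    unix_bytes = b"Id\tName\n1\tAlice\n"
    print(to_pc_format(unix_bytes))
    ```

    Working on bytes rather than text sidesteps any encoding questions; the file contents pass through untouched except for the terminators.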

    Lowell



  • thomas.lemesurier (8/28/2008)


    ... when I open the tables, there are twice as many rows in the table as in the CSV file, and every value in every row/column is NULL. ... my CSV files are actually TAB separated ...

    Time for you to attach a file so we can see what it actually looks like.

    --Jeff Moden



  • Hmmm, Lowell hit the nail on the head - it's a Unicode file. But having said that, I've tried saving it out as ANSI (just using Notepad) but it's still giving me nothing but NULLs.

    OK, I've uploaded a cut down version of the file and stuck a .txt extension on the end so it can be uploaded. Thanks,

    Tom

  • well your example shows up just fine for me...

    renamed the file to "Artists_mini.csv", as only one period is allowed to exist for a text server to see it:

    selected from the text server, saw everything basically as undelimited.

    i replaced all tabs with a comma.

    I saved as PC format.

    selected from the text server.

    changed the format to UNIX.

    selected from the text server.

    was able to see the data either way.

    Lowell



  • maybe a comma in the existing data is affecting the real data you are looking at...

    Can you search the file for pre-existing commas before you replace the tabs with commas? Some of the fields, like "Humple And Marlin", might be a company name like "Microsoft,Inc." that would mess up your assumptions.
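    An aside: this embedded-comma problem is exactly what proper CSV quoting solves. Instead of a blind find-and-replace of tabs with commas, re-writing the file through a CSV library quotes any field that already contains a comma. A minimal Python sketch (names and sample data hypothetical):

    ```python
    import csv
    import io

    def tsv_to_csv(tsv_text: str) -> str:
        """Re-write tab-separated text as comma-separated, letting the csv
        module quote any field that already contains a comma."""
        out = io.StringIO()
        writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
        for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
            writer.writerow(row)
        return out.getvalue()

    tsv = "Id\tCompany\n1\tMicrosoft,Inc.\n"
    print(tsv_to_csv(tsv))
    # The company field comes out as "Microsoft,Inc." - quoted, so the
    # embedded comma no longer splits the column.
    ```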

    Lowell



  • thomas.lemesurier (8/28/2008)


    Hmmm, Lowell hit the nail on the head - it's a unicode file. But having said that, I've tried saving it out as ANSI (just using notepad) just it's still giving me nothing but NULLS.

    OK, I've uploaded a cut down version of the file and stuck a .txt extension on the end so it can be uploaded. Thanks,

    Tom

    Heh... doesn't matter now that we know what the file actually looks like. 😀

    Quick! What's the difference between a CSV file and a TSV file? :hehe: Just the delimiter... if we can read a CSV file into a temp table without knowing what's in the CSV file, there's absolutely no reason why we can't do the same with a TSV file... we just gotta let the system know what the delimiter is. Once we're done importing, we should set everything back to the way it was, as well.

    As Lowell stated, the system doesn't like file names with two periods in it... so I, too, changed the filename to Artists_mini.txt

    Here's the tested code... I've used your "mini" file as is. As usual, the details are in the comments. 😉

    /**********************************************************************************************************************
     Purpose:
     Script to read from a TSV (Tab Separated Values) file having an unknown number of columns of unknown data type.
     This demo is setup to read a given file called Artists_mini.txt located in C:\Temp.

     Author: Jeff Moden - 28 Aug 2008
    **********************************************************************************************************************/
    --===== Declare the local variables we'll need. Names are self-documenting
    DECLARE @RootKey        SYSNAME,
            @Key            SYSNAME,
            @Value_Name     SYSNAME,
            @Type           SYSNAME,
            @PreviousValue  SYSNAME,
            @NewValue       SYSNAME

    --===== Preset the "constants". These are self-documenting as well
     SELECT @RootKey    = N'HKEY_LOCAL_MACHINE',
            @Key        = N'SOFTWARE\Microsoft\Jet\4.0\Engines\Text',
            @Value_Name = N'Format',
            @Type       = N'REG_SZ',
            @NewValue   = N'TabDelimited' --May be a character using N'Delimited(,)' where the comma is the character
                                          --Original setting is usually N'CSVDelimited'

    --===== Remember the previous value so we can set it back
       EXEC Master.dbo.xp_Instance_RegRead
            @RootKey    = @RootKey,
            @Key        = @Key,
            @Value_Name = @Value_Name,
            @Value      = @PreviousValue OUTPUT

    --===== Show what the original delimiter setting was set to.
         -- This, of course, may be commented out
     SELECT 'Previous delimiter setting = ' + @PreviousValue

    --===== All set... define the new temporary delimiter
       EXEC Master.dbo.xp_Instance_RegWrite
            @RootKey    = @RootKey,
            @Key        = @Key,
            @Value_Name = @Value_Name,
            @Type       = @Type,
            @Value      = @NewValue

    --===== Read the TAB delimited file without knowing any of the columns
         -- Notes: "Database =" identifies the directory the file is in
         --        The FROM clause identifies the file name
     SELECT *
       INTO #MyHead
       FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
                       'Text;Database=C:\Temp;HDR=YES;FMT=Delimited',
                       'SELECT * FROM Artists_mini.txt')

    --===== Show the content of the table we just populated
     SELECT * FROM #MyHead

    --===== Restore the original delimiter setting
       EXEC Master.dbo.xp_Instance_RegWrite
            @RootKey    = @RootKey,
            @Key        = @Key,
            @Value_Name = @Value_Name,
            @Type       = @Type,
            @Value      = @PreviousValue

    --===== Cleanup after the demo
    DROP TABLE #MyHead

    Send pretzels... I already have enough beer! 😛

    --Jeff Moden



  • Thanks to both of you for your help with this. The server went down last night and has stayed down, so currently I can't test it, but it looks good. Thanks again,

    Tom

  • Thanks for the feedback... and the pretzel!:P

    --Jeff Moden


