Import to MSSql from varying tab delimited text files

  • Hello everyone,

    I've been presented with the following task and failed to come up with a working solution. Can I pretty please get someone to help me figure out how to achieve this?:

    I need to import data to a MSSql table from massive (read: a million and a half rows, every single day) logs that come in .txt format separated in tabs with a ";" symbol and then have some stored procedures analyze that data to generate some reports in an excel file with that info. The text files include the column headers in the first row and the data starts on the second one.

    The challenge is that the text files differ in column order and count every single day.

    The analysis that I need to do only needs about 15 columns from the nearly 90-120 that those files include, and those columns sadly happen to be in a different order in those files.

    I'm at a loss right now, may I get a hand with this?

  • To clarify, are the columns the same, just in a different order daily? ie. (Fname, DOB, City, State, ZIP) and then possible (State, Zip, Fname, City, DOB) the next? Or are the column names always different?

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • if you want to go the all TSQL route, then Openrowset of a single file,or linked server for a folder full of files is probably the answer.

    with an linked server/openrowset call to a text file, the first row can be treated as column names, and you can simply select the fifteen or so columns, regardless of their order in the file itself.

    my examples below assume you've successfully installed the 64 bit verison of the ACE drivers, so let us know if you need help with that, it can be tricky if you alreayd installed Office 32 bit.

    on the delimiter issue, i'd like to see a couple rows of sample data to test with.

    if they delimiters are not the standard comma/line feeds, you need to modify the registry, and then put it back to 'normal afterwards for, say pipe delimiter or semicolons.

    http://www.youdidwhatwithtsql.com/tsql-query-pipe-delimited-text-files-with-openrowset/429/

    an openrowset example, so you can see if you know the column names, you don't need SELECT *, but SELECT ColumnList:

    select * from OpenRowset('MSDASQL',

    'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\Data\;Extended Properties="text;HDR=YES;FMT=Delimited"'

    ,'select top 10 * from C:\Data\MailItems.txt')

    and here's a linked server for a folder full of text files:

    --#################################################################################################

    --Linked server Syntax for Folder Full Of Text Files

    --#################################################################################################

    --add a folder as a linked server to access all .txt and .csv files in the folder

    DECLARE @server sysname,

    @srvproduct nvarchar(256),

    @provider nvarchar(256),

    @datasrc nvarchar(100),

    @location nvarchar(100),

    @provstr nvarchar(100),

    @catalog sysname,

    @sql varchar(1000)

    SET @server = N'TxtSvr'

    SET @srvproduct = N'OLE DB Provider for ACE'

    SET @provider = N'Microsoft.ACE.OLEDB.12.0'

    SET @datasrc = N'C:\Data\'

    SET @provstr ='Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\;Extended Properties="text;HDR=YES;FMT=Delimited" '

    set @provstr = 'Text'

    EXEC sp_addlinkedserver @server,@srvproduct,@provider,@datasrc,@provstr,@provstr

    --===== Create a linked server to the drive and path you desire.

    --EXEC dbo.sp_AddLinkedServer TxtSvr,

    -- 'MSDASQL',

    -- 'Microsoft.ACE.OLEDB.12.0',

    -- 'C:\',

    -- NULL,

    -- 'Text'

    GO

    --===== Set up login mappings.

    EXEC dbo.sp_AddLinkedSrvLogin TxtSvr, FALSE, NULL, Admin, NULL

    GO

    --===== List the tables in the linked server which is really a list of

    -- file names in the directory. Note that the "#" sign in the

    -- Table_Name is where the period in the filename actually goes.

    EXEC dbo.sp_Tables_Ex TxtSvr

    GO

    --===== Query one of the files by using a four-part name.

    SELECT *

    FROM TxtSvr...[LEgalName_NickName_List#txt]

    --===== Drop the text server

    EXEC dbo.sp_DropServer 'TxtSvr', 'DropLogins'

    GO

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks for the replies!

    @mydoggiejessie

    The columns are consistent for the first 12 rows, then they start to differ greatly for each file, in both order and number. Sadly, I need data from columns that are beyond those 12 rows.

    Bellow I include a sample.

    @lowell

    That's a lot to digest!

    I like the way this solution sounds. Currently I do have office 32 bit installed (on a x64 System) and if I'm reading the obcd data source manager right I do not have the 64 bit ACE drivers installed.

    How should I install those drivers without screwing up anything to be able to try the method you suggest?

    Regarding the sample data, this is what the first rows look like on one of the files (a little painful to look at )

    num;date;time;orig;type;action;alert;i/f_name;i/f_dir;product;log_sys_message;ProductFamily;rule;rule_uid;rule_name;service_id;src;dst;proto;xlatesrc;xlatedst;NAT_rulenum;NAT_addtnl_rulenum;service;s_port;xlatedport;xlatesport;inzone;outzone;message_info;scheme:;methods:;peer gateway;encryption failure:;partner;community;vpn_user;fw_subproduct;vpn_feature_name;ICMP;ICMP Type;ICMP Code;srckeyid;dstkeyid;IKE:;CookieI;CookieR;msgid;IKE notification:;Certificate DN:;IKE IDs:;user;VPN internal source IP;start_time;elapsed;connection_uid;reason:;activity;Update Status;sig_ver;update_src;subs_exp;reason;Session:;L2TP:;PPP:;MAC:;OM:;om_method:;assigned_IP:;machine:;PS;rule_guid;hit;policy;first_hit_time;last_hit_time;special_properties;log_id;reject_category;Log ID;spi;encryption fail reason:;segment_time;packets;bytes;client_inbound_packets;client_outbound_packets;server_inbound_packets;server_outbound_packets;client_inbound_bytes;client_outbound_bytes;server_inbound_bytes;server_outbound_bytes;client_inbound_interface;client_outbound_interface;server_inbound_interface;server_outbound_interface;DCE-RPC Interface UUID;Protection Name;Severity;Confidence Level;protection_id;SmartDefense Profile;Performance Impact;Industry Reference;Protection Type;TCP flags;Attack Info;attack;FollowUp;Total logs;Suppressed logs;message;old IP;old port;new IP;new port;cvpn_category;auth_method;auth_status;snid;group;user_group;Client Type:;description;access_status;url;reject_id;cvpn_resource;session_duration_time;Internal_CA:;serial_num:;dn:;SSL VPN Session ID:

    0;30Jan2014;23:59:01;192.168.1.1;control; ;;daemon;inbound;VPN&FW;Log file has been switched to: 235900.log;Network;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    1;30Jan2014;23:46:58;192.168.1.1;log;accept;;eth7;inbound;VPN&FW;;Network;54;{AA8B5B73};;http;192.168.1.1;192.168.1.1;tcp;;192.168.1.1;34;5;221;4349;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    2;30Jan2014;23:46:58;192.168.1.1;log;accept;;eth7;inbound;VPN&FW;;Network;65;{AE48-DF47CCE49};;Napster_directory_2381;192.168.1.1;192.168.1.1;tcp;;192.168.1.1;34;2;535;64543;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    3;30Jan2014;23:46:58;192.168.1.1;log;accept;;eth0;inbound;VPN&FW;;Network;34;{43D4-843-F36B};;smtp;192.168.1.1;192.168.1.1;tcp;;;;;45;55431;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    Oh, and for comparison the headers from a second file looks like this:

    num;date;time;orig;type;action;alert;i/f_name;i/f_dir;product;log_sys_message;ProductFamily;inzone;outzone;rule;rule_uid;rule_name;service_id;src;dst;proto;xlatesrc;xlatedst;NAT_rulenum;NAT_addtnl_rulenum;service;s_port;xlatedport;xlatesport;ICMP;ICMP Type;ICMP Code;scheme:;methods:;peer gateway;encryption failure:;partner;community;fw_subproduct;vpn_feature_name;message_info;vpn_user;user;Session:;L2TP:;PPP:;MAC:;OM:;om_method:;assigned_IP:;machine:;srckeyid;dstkeyid;IKE:;CookieI;CookieR;msgid;IKE notification:;Certificate DN:;IKE IDs:;start_time;connection_uid;PS;activity;Update Status;sig_ver;update_src;subs_exp;reason;Protection Name;Severity;Confidence Level;protection_id;SmartDefense Profile;Performance Impact;Industry Reference;Protection Type;Packet info;Attack Info;attack;FollowUp;rule_guid;hit;policy;first_hit_time;last_hit_time;special_properties;log_id;cvpn_category;auth_method;auth_status;snid;reject_id;session_duration_time;reject_category;segment_time;elapsed;packets;bytes;client_inbound_packets;client_outbound_packets;server_inbound_packets;server_outbound_packets;client_inbound_bytes;client_outbound_bytes;server_inbound_bytes;server_outbound_bytes;client_inbound_interface;client_outbound_interface;server_inbound_interface;server_outbound_interface;message;old IP;old port;new IP;new port;Log ID;reason:;spi;encryption fail reason:;VPN internal source IP;group;user_group;Client Type:;description;access_status;url;cvpn_resource;SSL VPN Session ID:;TCP flags;TCP packet out of state;tcp_flags;DCE-RPC Interface UUID;message:;site_name;outgoing_url;dstname;resource;sys_message:;StormAgentName;StormAgentAction;capture_uuid;Total logs;Suppressed logs;Internal_CA:;serial_num:;dn:

    As you can see, the first columns are the same, then start to differ more and more.

    Openrowset sure looks like what I need, although it seems like it's going to take me a while to get it to work.

  • Will the columns you need have the same name every time?

    If so, an alternative could be to use the header row to build a staging table, load the file into the staging table and use the columns you need from there.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Hello Sr. Luis,

    Indeed, the columns I need are always the same, they just change their position in the row.

    I did think of doing something along those lines, but never quite figured how to do so (I'm no SQL genius, barely got to do a simple cli BCP script that worked for 1 txt file).

    I'll be doing some reading on Openrowset to find out how to do that.

  • You could merge Lowell's and Luis' suggestions together: use the OPENROWSET to bring everything into a temporary "staging" table, then select out the column names you want (in the order you want), then INSERT them into a final table where you can run your prcedures against them... that way no matter how many they send or in whichever order, you can always only select out what you need.

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • the key for the install of the AccessDatabaseEngine_x64.exe from microsoft is there's a /passive parameter to install the 64 bit regardless of whether the 32 bit was isntalled or not.:

    http://www.microsoft.com/en-us/download/details.aspx?id=13255

    make sure you open an Administrative command prompt window, and run it with the c:\Downloads\AccessDatabaseEngine_x64.exe /passive

    command line flag;

    this will force the install of the drivers, even if you have 32 bit office installed;

    otherwise you get some error about 32 bit Office preventing the install.

    After that is installed in SSMS you need to do this just once:

    --Required settings for the provider to work correctly as a linked server

    EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'AllowInProcess', 1

    GO

    EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'DynamicParameters', 1

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • My select don't work properly (problem in the picture). I use SQL Server 2008r2.

    select * from OpenRowset('MSDASQL',

    'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=D:\MALEP_BCP\CSV_VSND\;Extended properties=''HDR=YES; FMT=Delimited(;)''',

    'select * from pokus.csv')

    With schema.ini work OK.

    But I need use various delimited and defined in SQL.

    Help me pls.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply