OpenRowSet No Field Terminators

  • Hi,
To start, I have minimal experience with OpenRowSet and BCP. I usually do processing with SSIS, but this new project is using SQL Server Express, so I don't have that luxury.
    I need to import data from files which have no field delimiters at all.
Also, each ROW is actually a column, so it would be something like (NOT ACTUAL DATA):

    01Mary
    02Jones
    03USA
    01Bob
    02Smith
    03UK

I wanted to use the following syntax to look at the data first and then work with it some more to do the import:
    SELECT a.ColID, a.ColData
    FROM
    OPENROWSET(BULK 'C:\MyDataFile.txt'
             ,FORMATFILE = 'C:\MyDataFileFormat.xml')    AS a;

The problem I am having is the TERMINATOR entry in the format file. Below is the format file where I tried NULL; I didn't think it would work, but I gave it a shot. I have looked at the docs and there doesn't seem to be an option for no field terminator.
    <?xml version="1.0"?>
    <BCPFORMAT 
    xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <RECORD>
      <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\0" LENGTH="2" COLLATION="SQL_Latin1_General_CP1_CI_AS"/> 
      <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    </RECORD>
    <ROW>
      <COLUMN SOURCE="1" NAME="ColID" xsi:type="SQLVARYCHAR"/>
      <COLUMN SOURCE="2" NAME="ColData" xsi:type="SQLVARYCHAR"/>
    </ROW>
    </BCPFORMAT>

    Am I barking up the wrong tree with this syntax?  
    Thanks for any and all thoughts
    Cheers,
    Steve

Is there an option with the format file to have the first column not be varchar but just char?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

Thanks for the quick response, sgmunson.
    I hadn't noticed the SQLVARYCHAR line. This prompted me to look at the other values. When I changed the xsi:type and removed the collation clauses, I was able to look at the data.

    <?xml version="1.0"?>
    <BCPFORMAT 
    xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <RECORD>
      <FIELD ID="1" xsi:type="NativeFixed" LENGTH="2" /> 
      <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\n"/>
    </RECORD>
    <ROW>
      <COLUMN SOURCE="1" NAME="ColID" xsi:type="SQLCHAR"/>
      <COLUMN SOURCE="2" NAME="ColData" xsi:type="SQLVARYCHAR"/>
    </ROW>
    </BCPFORMAT>

    Cheers,
    Steve

  • With a length-delimited file, I define one column with the total length in the OPENROWSET, and use SUBSTRING for the individual columns, as in the sketch below.
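
    A minimal sketch of that approach (hedged: it assumes a format file that defines the whole line as a single field named RawData, and the single-column format file path is a made-up name following the earlier examples in this thread):

    -- Format file would define one field covering the whole line, e.g.:
    --   <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n"/>
    --   <COLUMN SOURCE="1" NAME="RawData" xsi:type="SQLVARYCHAR"/>
    SELECT SUBSTRING(a.RawData, 1, 2)              AS ColID    -- fixed 2-char field ID
          ,SUBSTRING(a.RawData, 3, LEN(a.RawData)) AS ColData  -- remainder of the line
    FROM OPENROWSET(BULK 'C:\MyDataFile.txt'
                   ,FORMATFILE = 'C:\MyDataFileSingleCol.xml') AS a;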

  • Hi Joe Torre,
Thanks for the input. Do you think SUBSTRING would be more efficient? The files I will be processing can get quite large, e.g. 500 MB+ with millions of rows.
    Cheers,
    Steve

  • Just following up with something I messed up on. In the format file, for the TERMINATOR I only used '\n'. This meant a carriage return was left at the end of each result row.
    I needed to change this to '\r\n', as below:

    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n"/>  

    Cheers
    Steve

  • I've used this method on tables with millions of rows; it runs pretty fast.

  • SteveD SQL - Monday, February 26, 2018 11:00 PM

    Hi Joe Torre,
Thanks for the input. Do you think SUBSTRING would be more efficient? The files I will be processing can get quite large, e.g. 500 MB+ with millions of rows.
    Cheers,
    Steve

    Yikes!   This file format for data in that kind of quantity is inherently bad news, and rather seriously inefficient.  I'd be fighting hard to get that file format changed, or to get direct access to the database it is coming from.   I would NOT want to be using SUBSTRING in that kind of quantity if I could avoid it.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Hi sgmunson,
Unfortunately, the data is coming from a major pharmaceutical association in Germany. They have probably been doing it this way for decades. Not much chance lil' ole me is going to get anything changed 😉

Edit: Joe, yup, on a local PC I got 65 million+ records in, in about 7 minutes. However, on another thread I found out I really need to get a row number in there, so I expect that will add a lot of time to this process. It probably won't be a long-term solution, but it gets the data in so I can start playing around.

    Cheers,
    Steve

  • Using the ROW_NUMBER() function in a view doesn't add a lot of overhead; however, adding an IDENTITY column to the target table might be faster. A sketch of the IDENTITY approach follows.
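
    A minimal sketch of the IDENTITY-column approach (hedged: the staging table name and column sizes are illustrative, and without an ORDER hint on the bulk rowset the numbering is not strictly guaranteed to follow file order):

    -- Hypothetical staging table; the IDENTITY column numbers rows
    -- as they are inserted, so no separate ROW_NUMBER() pass is needed.
    CREATE TABLE dbo.RawImport (
        RowNum  int IDENTITY(1,1) NOT NULL,
        ColID   char(2)       NOT NULL,
        ColData varchar(8000) NULL
    );

    INSERT INTO dbo.RawImport (ColID, ColData)
    SELECT a.ColID, a.ColData
    FROM OPENROWSET(BULK 'C:\MyDataFile.txt'
                   ,FORMATFILE = 'C:\MyDataFileFormat.xml') AS a;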

  • To read that much data, with multi-row requirements like incrementing the row number on rows that begin with 00, you might be better off writing your own code in C#, PowerShell, or whatever language you're comfortable with.  It's not that complicated.

    Define a DataTable object that matches the target table.  Open a stream reader for the text file.
    Read each row from the text file, check the first two chars to determine the field.  If "00" increment the row number and start a new output row.
    Every 5000 or so rows write the DataTable to the SQL server with a SqlBulkCopy object, and empty the DataTable Rows collection.

    I would expect this to run much faster than importing millions of rows directly with BCP, then running any kind of query to PIVOT the data and add row numbers while copying it to another table.
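
    A rough C# sketch of that loop (hedged: the connection string, target table, column names, and the record-start marker are assumptions for illustration; the sample data earlier in the thread starts each record with "01", while the "00" above may apply to a different file):

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;

    class BulkLoader
    {
        static void Main()
        {
            const string connStr = @"Server=.\SQLEXPRESS;Database=MyDb;Integrated Security=true;";  // assumed
            const string recordStart = "01";  // field ID assumed to begin a new record
            const int batchSize = 5000;

            // DataTable shaped like the (hypothetical) pivoted target table.
            // SqlBulkCopy maps columns by ordinal, so the target must match this layout.
            var table = new DataTable();
            table.Columns.Add("RowNum", typeof(int));
            table.Columns.Add("Name", typeof(string));     // field 01
            table.Columns.Add("Surname", typeof(string));  // field 02
            table.Columns.Add("Country", typeof(string));  // field 03

            int rowNum = 0;
            DataRow current = null;

            using (var bulk = new SqlBulkCopy(connStr) { DestinationTableName = "dbo.People" })
            using (var reader = new StreamReader(@"C:\MyDataFile.txt"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    if (line.Length < 2) continue;          // skip malformed lines
                    string id = line.Substring(0, 2);
                    string data = line.Substring(2);

                    if (id == recordStart)
                    {
                        // Flush a full batch before starting a new output row.
                        if (table.Rows.Count >= batchSize)
                        {
                            bulk.WriteToServer(table);
                            table.Rows.Clear();
                        }
                        rowNum++;
                        current = table.NewRow();
                        current["RowNum"] = rowNum;
                        table.Rows.Add(current);
                    }
                    if (current == null) continue;          // data before first record marker

                    switch (id)
                    {
                        case "01": current["Name"] = data; break;
                        case "02": current["Surname"] = data; break;
                        case "03": current["Country"] = data; break;
                    }
                }
                if (table.Rows.Count > 0) bulk.WriteToServer(table);  // final partial batch
            }
        }
    }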

  • @Joe: Yes, I have done it with an identity column.
    @Scott: Yes, I have already spoken with the client about doing a lot of the heavy lifting in C#.

    Cheers to all for input
    Steve
