Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

SED and the Big Bad UNIX File

By Peter Ward,

Recently whilst at a client site, I received a distressed holler from the analyst sitting in the cubicle next to me.  “What do you mean there is no row delimiter”?  This holler was closely followed by a cry for help asking “How do I open a 4GB file”?  After hearing this cry I popped my head up ‘Prairie Dog’ style to see if I could assist.

The analyst was attempting to import a delimited file to a SQL Server table using the SQL Server 2000 Import/Export Data Transformation Wizard and was receiving the error below:

Many of you who have a UNIX background may already be aware of the powerful SED utility.  However up until a couple of years ago I didn’t realise the existence of such a utility until having this exact problem, but in a slightly different scenario.  I was trying to work out why a bulk insert was failing on line 365000 of a 1.9 million row file and Goggle came up with the answer to use SED.  My ignorance of SED probably related to sleeping through far too many Unix 101 lectures at university, but that’s another story.

SED is a UNIX Stream Editor which has been ported to Windows and there are numerous flavours available under the GNU GPL licence.  My personal favourite is Super-Sed V3.59 which can be downloaded from http://sed.sourceforge.net/grabbag/ssed/sed-3.59.zip.   SED is a non-interactive editor that works from the command line allowing changes to be made to the contents of a file without having to open it.  Much like the Find and Replace functionality in Microsoft Word, but on steroids.  One of the other features of SED is the ability to read specified rows in a file.  Anyway, enough of the sales pitch for SED and back to the 4GB file without any row delimiters.

The first thing that I though about with regards to the file was to have a look at the first ten records or so to see if there was anything that looked like a row delimiter.  This is where SED comes into play (although HEAD could also be used http://en.wikipedia.org/wiki/Head_(Unix)).  As opening a file that is 4GB with Notepad will give you the error message below and you will have to wait a while to open a 4GB file even if your text editor of choice supports a file this big.

The command below will output the lines between the specified start and end lines using SED.

For example:

sed -n startline,endlinep filename

To output the first 10 lines of the specified file to a file name c:\data.txt the following command can be used:

sed -n 1,10p c:\data.dat > c:\data.txt

Some of you may have guessed by now that the reason that a delimiter was not detected by the Import/Export Data Transformation Wizard was that the row delimiter was a character other than the default newline character.  By default the Import/Export Data Wizards row delimiter is a CRLF (ie. Carriage Return and NL line feed, decimal values 13 and 10).  The big bad file was originally generated by a UNIX application, therefore the newline is indicated by a LF as opposed to a CRLF which is used by default for files on DOS based systems.  So all that needed to be done to import the file was to change the row delimiter.  As previously mentioned, SED can be used as a Find and Replace utility so an alternative would be to use SED to replace a LF with a CRLF.  The command below can be used to convert UNIX newlines (LF) to DOS newlines (CRLF):

sed -n p filename

To convert UNIX newlines (LF) to DOS format for the file data.dat the following command can be used:

sed -n p c:\data.dat > c:\data.txt

However, as with most things it is never just as easy as changing just the newline indicator, as I am sure you can all guess what happed next!  The dreaded Import failure:

After selecting OK it appears that there was an error on line 131892 of the file.

There are no prizes for guessing which utility we are going to use so that we can identify what is wrong with line 131892 of the file being imported…SED.

As per the example to look at lines 1 to 10 of the file SED can be used to look at lines 131880 to 131920 to see what is different between these line and the line where the error has occurred.

For example:

sed -n 131880,131920p c:\data.dat > c:\data.txt

After identifying the rouge record using SED and determining that it could be deleted from the source file we received the message we were look for:

I hope that this article has shown you a new way to assist when data import errors occur in large files.  The following article is a great place to start if you are looking for additional one liners that can be used with SED http://www.student.northpark.edu/pemente/sed/sed1line52.txt.

Total article views: 9194 | Views in the last 30 days: 1
 
Related Articles
FORUM

Bulk Insert Failure: NewLine Character in Data

Bulk Insert Failure: NewLine Character in Data

FORUM

import flat file through SSIS

import a comma delimited file through SSIS got error

FORUM

Flat files without delimiters

Importing flat files without delimiters to SLQserver2005

FORUM

Import error

Import error

FORUM

how do i import text file with different delimited...HELP

how do i import text file with different delimited...HELP

Tags
dts    
sql server 7    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones