Binary Files Import and Export

By Sergey Benner

Introduction

I've been searching for a reasonably simple and fast way to import and export binary files with SQL Server 2005. The approach I arrived at may not be the best or the fastest, and there is certainly room for improvement, but it works well and suits my purposes.

Prepare the filepath table and an out.txt file.

We begin with a file that contains the absolute paths of the binary files. Assume the binaries have the extension .dat. I use the GNU find utility to get the file names:

find d:\malware -name "*.dat" -type f > out.txt

This creates out.txt with one full path per line. Next I create a table to hold the paths:

CREATE TABLE files
(filepath varchar(512),
--file_id INT IDENTITY(1,1)
)

Choose whatever VARCHAR size suits the depth of your directory structure and the length of your file names; the entire path and name of each file must fit in this column. We load the file paths using the following SQL:

BULK INSERT files
FROM 'd:\out.txt'
WITH (
--fieldterminator='',
rowterminator='\n',
codepage=1251);

Trick:

I use the CODEPAGE option here because some of my files are named using the Russian code page CP1251.

If needed, you can later add constraints, indexes, or an IDENTITY column to speed up subsequent SELECTs, or for any other purpose you like.

BEWARE:

Watch out for truncation errors. They can occur because of find's output format: find writes Unix-style line endings (LF only), so you will probably have to convert out.txt to the Windows CR-LF format before the BULK INSERT.
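One way to sidestep the line-ending issue (and the dependency on GNU find) entirely is to generate the list with a short Python script that writes CR-LF endings directly. This is just a sketch, not part of the original workflow; the directory, extension, output name, and CP1251 encoding mirror the examples in this article:

```python
import os

def write_file_list(root, ext, out_path):
    """Recursively collect files with the given extension and write
    their absolute paths, one per line, with Windows CR-LF endings."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(ext):
                paths.append(os.path.join(os.path.abspath(dirpath), name))
    # newline="" disables newline translation so "\r\n" is written verbatim
    with open(out_path, "w", encoding="cp1251", newline="") as f:
        for p in sorted(paths):
            f.write(p + "\r\n")

write_file_list(r"d:\malware", ".dat", "out.txt")
```

The resulting out.txt is already in the format BULK INSERT expects, so no separate conversion step is needed.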

The files table now contains one row per file path.

Loading the files.

IMHO, before any bulk loading operation you should switch the database's recovery model to SIMPLE or BULK_LOGGED, if you have the luxury to do so, and disable the index (file_id_idx) on the file_id column of the bindata table if you have created one (see below).
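For example (a sketch only; db1 is the database name used later in this article, and file_id_idx is the optional index shown below):

```sql
-- switch to minimal logging for the load (requires ALTER permission)
ALTER DATABASE db1 SET RECOVERY BULK_LOGGED;
ALTER INDEX file_id_idx ON bindata DISABLE;

-- ... run the bulk load here ...

-- rebuild the index and restore the recovery model afterwards
ALTER INDEX file_id_idx ON bindata REBUILD;
ALTER DATABASE db1 SET RECOVERY FULL;
```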

Creating the table with binary data.

CREATE TABLE bindata
(data varbinary(max),
--file_id INT IDENTITY(1,1)
)

/*
CREATE NONCLUSTERED INDEX file_id_idx
ON bindata (file_id) */

Loading procedure:

DECLARE
    @filename VARCHAR(512),
    @sql VARCHAR(1000); -- large enough for the statement plus the longest path

DECLARE a_cur CURSOR STATIC FOR
    SELECT filepath FROM files;
    /* Add WHERE file_id > (SELECT MAX(file_id) FROM bindata)
       when you are NOT loading from the [files] table for the first time. */

OPEN a_cur;
FETCH NEXT FROM a_cur INTO @filename;
WHILE (@@fetch_status = 0)
BEGIN
    -- dynamic SQL is needed to splice the @filename variable into OPENROWSET
    SET @sql = '
    INSERT INTO bindata(data)
    SELECT * FROM
    OPENROWSET(BULK ''' + @filename + ''', SINGLE_BLOB) AS f;';
    EXEC (@sql);
    FETCH NEXT FROM a_cur INTO @filename;
END;
CLOSE a_cur;
DEALLOCATE a_cur;
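Once the cursor finishes, a quick sanity check (a sketch) is to compare the row count and total bytes in bindata against the source directory:

```sql
SELECT COUNT(*)              AS files_loaded,
       SUM(DATALENGTH(data)) AS total_bytes
FROM bindata;
```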

The process of loading the data depends, of course, on the number and size of the files you want to load.

Once this is completed, the bindata table contains one binary row per file.

Exporting the files from the database

To dump the files from the database I use the script represented below. In order to execute the script, first you have to enable the xp_cmdshell extended stored procedure as shown below.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;
GO

To read more on xp_cmdshell please refer to http://technet.microsoft.com/en-us/library/ms190693.aspx

Exporting procedure:

DECLARE @a INT,
        @i INT,
        @filename VARCHAR(300),
        @cmd VARCHAR(500);
SET @i = 0;

-- to speed this query up a bit you may want to create an index on file_id
SELECT @a = MAX(file_id) FROM bindata;

WHILE @i != @a
BEGIN
    SET @filename = 'd:\tmp\' + CAST(@i AS VARCHAR) + '.dat';
    SELECT @filename;
    SET @cmd = 'bcp "SELECT data FROM db1.dbo.bindata WHERE file_id=' +
               CAST(@i + 1 AS VARCHAR) + '" queryout ' + @filename + ' -T -N';
    EXEC master..xp_cmdshell @cmd;
    SET @i = @i + 1;
END;
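Because xp_cmdshell widens the server's attack surface, it is worth turning it back off once the export has finished (a sketch, mirroring the enabling step above):

```sql
EXEC sp_configure 'xp_cmdshell', 0;
RECONFIGURE;
```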

Problems with File Deltas

Finally, I should mention the problem of loading only the delta: the files that appeared in the same directories after the previous load into the database. My task requires keeping the binaries on the file system as well as in the database, and I have two ways of dealing with this.

Method #1:

There are actually multiple ways to do this; my way is to dump the existing file paths from the database using the BCP utility:

bcp "select filepath from db.dbo.files" queryout out.txt -T -c -C 1251

Again I use -C 1251 here because my file names contain CP1251 characters. Note that my database uses the collation 'SQL_Latin1_General_CP1251_CI_AS'.

We create a new list using the find utility again:

find d:\malware\ -type f -name "*.dat" > new.txt

To get the delta we use the diff utility:

 diff -u out.txt new.txt 

The unified diff output looks something like this:

--- out.txt Wed Jun 09 21:42:34 2008
+++ new.txt Wed Jun 09 21:44:05 2008
@@ -36,3 +36,9 @@
d:\malware\02\010.dat
d:\malware\02\011.dat
d:\malware\02\014.dat
+d:\malware\02\016.dat
+d:\malware\02\017.dat
+d:\malware\02\018.dat
+d:\malware\02\019.dat
+d:\malware\02\020.dat
+d:\malware\02\021.dat

I use the GNU grep and sed utilities to cleanse the delta data.

diff -u out.txt new.txt | grep ^+d | sed -e 's/^+//' > delta.txt
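If diff, grep, and sed are not at hand, the same delta can be computed with a few lines of Python (a sketch only; the function name is mine, and the CP1251 encoding matches the file lists above):

```python
def compute_delta(old_path, new_path, delta_path, encoding="cp1251"):
    """Write to delta_path every path present in new_path but not in old_path."""
    with open(old_path, encoding=encoding) as f:
        old = set(line.strip() for line in f if line.strip())
    with open(new_path, encoding=encoding) as f:
        new = [line.strip() for line in f if line.strip()]
    with open(delta_path, "w", encoding=encoding) as f:
        for path in new:
            if path not in old:
                f.write(path + "\n")
```

Usage would be compute_delta("out.txt", "new.txt", "delta.txt"), producing the same file as the diff/grep/sed pipeline.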

The delta.txt file will then contain:

d:\malware\02\016.dat
d:\malware\02\017.dat
d:\malware\02\018.dat
d:\malware\02\019.dat
d:\malware\02\020.dat
d:\malware\02\021.dat

Then we load the delta into the files table and load the files themselves using the procedure described above.
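The delta load itself is the same BULK INSERT as before, just pointed at the delta file (a sketch; adjust the path to wherever delta.txt lives):

```sql
BULK INSERT files
FROM 'd:\delta.txt'
WITH (rowterminator = '\n', codepage = 1251);
```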

Method #2:

We bulk insert the whole newly created path list once again, relying on BULK INSERT not checking constraints by default (relevant if, for example, you have created a PRIMARY KEY on the filepath column of the files table). Before the BULK INSERT, write down the current MAX(file_id) from the files table.

Finding the duplicates:

SELECT * FROM files
WHERE filepath IN (
    SELECT filepath FROM files
    GROUP BY filepath
    HAVING COUNT(*) > 1)
AND file_id > YOUR_MEMORIZED_MAX_FILE_ID

or

SELECT f1.filepath AS dups, f2.file_id
FROM files f1
INNER JOIN files f2 ON f1.filepath = f2.filepath
WHERE f2.file_id > YOUR_MEMORIZED_MAX_FILE_ID
GROUP BY f1.filepath, f2.file_id
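As a variation on the queries above, SQL Server 2005's ROW_NUMBER() lets you delete the duplicates in a single statement (a sketch, assuming the files table has the file_id column uncommented):

```sql
WITH numbered AS (
    SELECT file_id,
           ROW_NUMBER() OVER (PARTITION BY filepath
                              ORDER BY file_id) AS rn
    FROM files
)
DELETE FROM numbered
WHERE rn > 1;
```

Deleting through the CTE keeps the first occurrence of each filepath and removes the rest.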

This gives you the duplicate rows, and you can then easily eliminate them.

Conclusion

So this was basically my way of using SQL Server 2005 as storage for binaries. SQL Server 2005 did its job perfectly well, and the techniques used here accomplished this particular task without problems. But I'm sure others have their own ways of performing it, which have every right to exist as well.

 

 
