• I'm finishing up a rewrite of our import process. We import millions of rows through a Visual FoxPro app running on the employee's computer; it can import CSV, fixed-length, or DBF files. Our old approach was to process the data locally and insert the formatted data. We tried several variations: UNION ALL, multi-row VALUES statements, and batches of single INSERT statements. All of them were slow, mainly because the error-handling logic stayed on the client. Here are the issues we faced.

    1) Parsing the data on the client side and sending it in batches. When a batch failed, we retried it as single inserts to find the bad record and continue on. This was slow because we had to loop locally and build single statements from the batch, which in turn caused more network traffic.

    2) Our import process allowed fields that could not be stored in the destination table to be kept in an aux table holding the textual representation of the column (most of our data arrives as CSV or fixed length), so we had to return the inserted IDs in order to insert this extra data. That added another round of network traffic and error handling.

    The best speed we could achieve was under 500 rows/sec.

    Our solution was to build a C# COM object that uses SqlBulkCopy and then calls a stored procedure that knows how to import the data based on the file type.
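
    For illustration, the client-side half looks roughly like the sketch below. This is a trimmed-down outline, not our production COM object: the procedure name usp_ImportFromStager, its @StagerTable parameter, and the method signature are placeholders.

        // Bulk-load the parsed DataTable into the staging table, then hand off
        // to the server-side import. Trimmed-down sketch; names are placeholders.
        using System.Data;
        using System.Data.SqlClient;

        public static class StagerLoader
        {
            public static void LoadAndImport(DataTable parsedRows, string connectionString, string stagerTable)
            {
                using (var conn = new SqlConnection(connectionString))
                {
                    conn.Open();

                    // One streamed bulk copy instead of thousands of INSERT statements.
                    using (var bulk = new SqlBulkCopy(conn))
                    {
                        bulk.DestinationTableName = stagerTable;
                        bulk.BatchSize = 10000;   // commit in chunks, not one giant transaction
                        bulk.BulkCopyTimeout = 0; // large files can take a while
                        foreach (DataColumn col in parsedRows.Columns)
                            bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName);
                        bulk.WriteToServer(parsedRows);
                    }

                    // All row-level error handling now lives in the stored procedure.
                    using (var cmd = new SqlCommand("usp_ImportFromStager", conn))
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        cmd.CommandTimeout = 0;
                        cmd.Parameters.AddWithValue("@StagerTable", stagerTable);
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }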

    The process removed as much error handling as possible from the client by using a staging table. The staging table is created by parsing the DataTable, which results in one of the following structures:

    1) Fixed Length = Column_0 varchar(max)

    2) CSV = Column_0 varchar(max), Column_1 varchar(max).....

    3) DBF = Column_0 <mapped field type>, Column_1 <mapped field type>...

    All structures include a StagerRowID int IDENTITY(1,1) column. Once the stager is loaded, a stored procedure imports the data into the destination table in batches of 10K rows. If a batch insert fails, it switches to single inserts and logs the rows that fail into the StagerResults table. The extra data described above is also imported during this step. A sketch of how the stager DDL can be generated follows.
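
    To make the three structures concrete, here is roughly how the CREATE TABLE statement can be derived from the parsed DataTable. This is a simplified sketch: the MapToSqlType helper and its type list are placeholders for the real DBF type mapping, and the server-side batching/fallback logic is not shown.

        // Build the stager DDL from the parsed DataTable. For CSV/fixed length every
        // column is varchar(max); for DBF the .NET column type is mapped to a SQL type.
        using System;
        using System.Data;
        using System.Text;

        public static class StagerDdl
        {
            public static string BuildStagerDdl(DataTable parsedRows, string stagerTable)
            {
                var sql = new StringBuilder();
                sql.AppendFormat("CREATE TABLE {0} (StagerRowID int IDENTITY(1,1)", stagerTable);

                for (int i = 0; i < parsedRows.Columns.Count; i++)
                {
                    sql.AppendFormat(", Column_{0} {1}", i, MapToSqlType(parsedRows.Columns[i]));
                }

                sql.Append(")");
                return sql.ToString();
            }

            // Hypothetical, minimal mapping; the real DBF mapping is richer.
            private static string MapToSqlType(DataColumn col)
            {
                if (col.DataType == typeof(int)) return "int";
                if (col.DataType == typeof(decimal)) return "decimal(18,4)";
                if (col.DataType == typeof(DateTime)) return "datetime";
                if (col.DataType == typeof(bool)) return "bit";
                return "varchar(max)"; // CSV / fixed-length columns land here
            }
        }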

    Results:

    DBF with 233,415 records

    * Current Code: 10:30 (mm:ss) ~ 370.5 rows/sec
    * New Code: 2:30 ~ 1,556 rows/sec

    Fixed length with 2,259,995 records

    * Current Code: 85:30 ~ 440 rows/sec
    * New Code: 8:00 ~ 4,708 rows/sec

    I'm still optimizing our CSV process, but this is the best option we could find for importing large files from an employee's computer. Without file access to the server we can't do a bulk copy/bcp there, and bcp wouldn't meet our needs for tracking errors anyway.

    /* Anything is possible but is it worth it? */