SQLServerCentral Article

SQL CLR Multi-Threaded File Copy


I created a CLR C# Stored Procedure to Copy, Move and/or Delete files directly from T-SQL. This Procedure lets you pass in large sets of files, Queues them up, and processes them in parallel with an adjustable parallel thread pool size.

We use RoboCopy, like many of you, to move files such as backups, data exports, reports, and archive scripts. In most cases RoboCopy has worked great but recently we have been having more problems with performance because the size and quantity of these files has increased. Many of these large backup files are used to create dev, test, or stage work environments or to start Disaster Recovery (DR) copies of production with either log shipping or mirroring which require a full restore of the database to get started. Now we are having problems getting these backups copied and restored within the maintenance windows available to us.

We started exploring multi-file backups where a single backup is written to up to 64 files in parallel. Multi-file backups greatly increased the speed for both the backups and the restores but, even with 64 files that were 1/64th the origional size, we were still having a copy bottleneck. The RoboCopy process was still only pushing the files one at a time and not even coming close to utilizing the full bandwidth available. This did help alleviate starting over from the beginning if we had a network “Hiccup” during the copy, but t was not speeding up the total time to get the files copied.

I did notice that there was a multi-threaded version of RoboCopy but it was only available on specific OS versions. I started looking at other multi-threaded file copy processes such as RichCopy but found them cumbersome to use and deploy to every machine that wanted to use them.

At this point we have been using CLR fairly heavily and have a great build and deploy process set up which can push out our CLR code to several hundred servers quickly and reliably so I thought about how much better it would be if file handling was done directly from T-SQL. But in order to do this, I would need to so some things that we had not done before in CLR. Mainly Multi-Threading and Asynchronous processing.

Our primary requirement was that the procedure would not be asynchronous but the work it was doing would be. In other words, each file being copied would spin up a new child thread but the procedure would watch each child thread and not complete until all child threads were complete. This way we would be sure the process was complete without having to build in a polling process. In order to make the calls asynchronous, I used the System.Threading.Thread Class. In order to pass enough information to the child thread so that it could run on its own and behave correctly, I built the call to pass a CurrentStatus Class. 

        public class CurrentStatus 
            public string Action; 
            public string Name; 
            public string Source; 
            public string Destination; 
            public long Size; 
            public DateTime Started; 
            public DateTime Finnished; 
            public long Coppied; 
            public Thread Thrd; 
            public Thread Parent; 
            public string StatusMSG; 
            public int StatusCode; 

Another important requirement was that all child threads would be killed if the main procedure was cancelled. You would not want zombie threads still copying files if the code had been aborted. This was a little harder to get working as I could not get the child threads to observe the cancel or abort being sent to the parent on their own. I did get the desired result by getting each child thread to look for the parent thread (CurrentStatus.Parent) in between each buffered read/write cycle in the copy. If the child thread looks for the parent and finds it is no longer active, the child thread aborts the copy and drops out.

We also needed some sort of status update to be returned so that the query results would show progress. It was also important that this would not overload the output with too much noise. Each time a child thread starts or compleats, a line is sent to the output. The UpdateInterval parameter is used to set how often the parent procedure reports the current status of the running child threads. There is also a Verbose parameter which can compleatly turn off all output if it is too much noise. 

The procedure would need a queuing system so that you could limit the number of parallel threads. We found many cases where too many threads would cause memory issues and when copying across the network, too many connections would cause the server to think it was being DOS'd and kill all the connections. The procedure allows you to specify the QueueMax parameter to limit the number of threads that would be allowed to run at the same time. After the number of threads specified are called and running, the parent procedure continues to check the progress of each running thread. As each child thread completes, the parent procedure start a new child thread if there were still entries to be done. Once all threads have completed, then the parent thread will exit.

RoboCopy uses the "1/1/1980" date on files being copied to identify files that it was working on but had not yet completed. We found this to be a very usefull method to externally check in on a copy process to see if it was done, so I recreated that ability within the procedure. When re-starting a copy process I could also look to see if the destination existed with the "1/1/1980" date and identify files that could restart copying from the point it left off. 

I did have a problem when it came to passing in the data to the procedure. I wanted to pass a table parameter with the files to be coppied but quickly found out the CLR Procedures do not allow table parameters. My solution was to use an XML parameter. XML gave me the ability to pass an unlimited list of files and all of the settings as a single parameter in an XML block that looks like the example below.

  <Settings QueueMax="32" ForceOverwrite="false" Verbose="1" UpdateInterval="300">
    <CopyFile Source="C:\admin_db_20131101230601.SQB" Destination="D:\admin_db_20131101230601.SQB" />
    <CopyFile Source="C:\admin_dfntl_20131104193948.SQD" Destination="D:\admin_dfntl_20131104193948.SQD" />

This gives me the ability to have control parameters and the list of files in the same block of data and I designed this XML to be easily generated from a T-SQL Query by using the FOR XML clause. Here is an example of the query used to generate the XML. This uses another CLR Function that returns a directory listing of files and could be replaced with any process that returns a list of file names.

DECLARE   @Data             XML
          ,@Source          VarChar(max)
          ,@Destination     VarChar(max)
          ,@Mask            VarChar(max)
SELECT    @Source           = ‘C:\’
          ,@Destination     = ‘D:\’
          ,@Mask            = ‘dbaadmin*’
;WITH     Settings
          SELECT    32          AS [QueueMax]            -- Max Number of files coppied at once.
                    ,'false'    AS [ForceOverwrite]      -- true,false
                    ,1          AS [Verbose]             -- -1 = Silent, 0 = Normal, 1 = Percent Updates
                    ,300        AS [UpdateInterval]      -- rate of progress updates in Seconds
          ,CopyFile -- MoveFile, DeleteFile
          SELECT    FullPathName             AS [Source]
                    ,@Destination + Name     AS [Destination]
          FROM      dbaudf_DirectoryList2(@Source,@Mask,0)
SELECT    @Data = (
                  SELECT          *
                                  ,(SELECT * FROM CopyFile FOR XML RAW ('CopyFile'), TYPE)
                  FROM            Settings
                  FOR XML RAW ('Settings'),TYPE, ROOT('FileProcess')

Notice the naming of the CTE CopyFile. This can be changed to MoveFile or DeleteFile, along with the call in the final part of the query to change the XML records into Move or Delete records. I don’t recommend using more than one method in a single XML block but it could be possible to mix them if needed. This single query made it possible to pass all the data needed to the CLR code within a single parameter.



4.33 (6)

You rated this post out of 5. Change rating




4.33 (6)

You rated this post out of 5. Change rating