Jeff Moden (9/6/2012)
UMG Developer (9/5/2012)
Adam Machanic (9/5/2012)
How are you seeing string splits used where such a difference would matter? (Again, I am *only* talking about string split operations.)
Generally when I see string splitters used, it's for passing a set of arguments to a stored procedure, and it ends up in some query like:
WHERE ID IN (SELECT CONVERT(INT, output) FROM dbo.StringSplitFunction(@values))
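For what it's worth, on SQL Server 2016 and later the built-in STRING_SPLIT covers that same pattern without any user-defined splitter at all (dbo.SomeTable is just an illustrative name here):

SELECT t.*
FROM dbo.SomeTable AS t
WHERE t.ID IN (SELECT CONVERT(INT, s.value) -- STRING_SPLIT's output column is named "value"
               FROM STRING_SPLIT(@values, ',') AS s);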
I have cases where a table has ~20 million rows and each row contains a string that needs to be split. There, the splitter gets called 20 million times in a single INSERT query, so any optimization that makes it faster can make a big difference.
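That per-row case looks roughly like this, assuming the same dbo.StringSplitFunction as above (table and column names are illustrative):

INSERT INTO dbo.TargetTable (SourceID, Element)
SELECT src.ID, split.output -- "output" is the splitter's return column, as in the query above
FROM dbo.SourceTable AS src
CROSS APPLY dbo.StringSplitFunction(src.DelimitedCol) AS split;

so whatever the splitter costs gets multiplied by every row in the source table.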
I have to agree with Adam on this one. Why would you store such a volume of comma-separated data in a database? A normalized sister table would be much more efficient.
Actually, I am in exactly this situation right now, Jeff. As I’m sure you know, I would never voluntarily store data like that, not permanently anyway.
The problem, though, is that this is how the data is delivered to me: as a multi-million-row pipe-delimited text file, downloaded regularly from the IRS website. I import the data with BULK INSERT, but (to answer the obvious question) I cannot use BULK INSERT to do the splitting, because a few rows at the beginning and the end are non-conformant (they are not real data rows), and as we all know, BULK INSERT (and BCP) are not exactly tolerant of non-conforming input rows. Nor can I exclude them reliably with BULK INSERT, because we cannot know for sure ahead of time how many rows, or which ones, at the beginning and end will be non-conformant. And for legal reasons, we have to be sure that we have not lost any rows.
So the obvious solution is to BULK INSERT them into a staging table with one big VARCHAR() column and then split it inside of SQL, where we can more readily detect and handle the exceptions. But that's not nearly as fast as BULK INSERT …
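In outline, the staging step is something like this (the file path and table name are hypothetical; FIELDTERMINATOR is set to a character that cannot occur in the file, so each whole line lands in the single column):

CREATE TABLE dbo.IrsStaging (RawLine VARCHAR(8000) NULL); -- one wide column, lines left unsplit

BULK INSERT dbo.IrsStaging
FROM 'C:\import\irs_extract.txt' -- hypothetical path
WITH (FIELDTERMINATOR = '\0', ROWTERMINATOR = '\n');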
To import and split a million-row file with BULK INSERT takes me about 10 seconds (when I know ahead of time which rows are bad). Just making a SELECT INTO copy of the imported table takes about 4 seconds. But splitting it with the fastest T-SQL splitters I can find takes about 2-4 minutes! And the rows aren't big: just 7 columns and about 100 characters. So right now I'm using Paul's latest CLR splitter, which does pretty well (about 17 seconds). That's going to cause some maintenance issues for this customer, but given the speed difference, they'll probably agree to it.
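The exception detection itself is cheap, by the way: with 7 columns, every real data row has exactly 6 pipes, so something like this (a sketch against the hypothetical staging table above; DATALENGTH rather than LEN so trailing spaces don't skew the count) isolates the junk rows while keeping them around:

SELECT st.RawLine
INTO dbo.IrsRejects -- every excluded row is preserved, so nothing is silently lost
FROM dbo.IrsStaging AS st
WHERE DATALENGTH(st.RawLine) - DATALENGTH(REPLACE(st.RawLine, '|', '')) <> 6;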