• bevanward - Wednesday, March 21, 2018 5:22 AM

    Hi Scott

    Just an initial thought that might be completely off, but I figured I should mention it first up.

    Can you run a bunch of batches in parallel?

    If you can monopolize the resources on the server then ...

    Run the original SQL2012 lag split for each column concurrently 

    For example run a bunch of these concurrently:

    INSERT INTO tmp<field one> (PrimaryKey, Ordinal, Item)
    SELECT PrimaryKey, Ordinal, Item
    FROM   <table>
    CROSS APPLY dbo.DelimitedSplit8K_LEAD(<field one>, '|');

    DELETE tmp<field one> WHERE Item IS NULL;

    If you monitor resource consumption on the server as you add more of these, and measure the overall time, that may be the simplest way to find the sweet spot for the improvement.
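    If you'd rather watch the concurrent batches from inside SQL Server than from PerfMon, a quick look at the DMVs works too. A sketch (standard `sys.dm_exec_requests` / `sys.dm_exec_sql_text` views, nothing specific to this workload):

    ```sql
    -- Rough view of what each concurrent split batch is doing right now.
    -- One row per executing request, excluding this monitoring session itself.
    SELECT r.session_id,
           r.status,
           r.cpu_time,
           r.total_elapsed_time,
           r.wait_type,
           t.text AS sql_text
    FROM   sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE  r.session_id <> @@SPID;
    ```

    Re-running that as you add batches shows whether the new ones are burning CPU or just piling up on waits.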

    Then once it is all complete, recompose the results into the final required shape?
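    The recomposition could be as simple as joining the per-column temp tables back together on the shared keys. A sketch, assuming two hypothetical temp tables tmpField1 and tmpField2 each holding (PrimaryKey, Ordinal, Item) from the concurrent splits (use the join type that matches your data; FULL OUTER covers columns with differing item counts per row):

    ```sql
    -- Sketch only: tmpField1 / tmpField2 / FinalShape are placeholder names.
    -- Stitch the per-column split results back into one row per
    -- (PrimaryKey, Ordinal) pair.
    SELECT COALESCE(f1.PrimaryKey, f2.PrimaryKey) AS PrimaryKey,
           COALESCE(f1.Ordinal,    f2.Ordinal)    AS Ordinal,
           f1.Item AS Field1Item,
           f2.Item AS Field2Item
    INTO   FinalShape
    FROM   tmpField1 AS f1
    FULL OUTER JOIN tmpField2 AS f2
           ON  f2.PrimaryKey = f1.PrimaryKey
           AND f2.Ordinal    = f1.Ordinal;
    ```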

    I think with that number of records you should be able to get that processing time down a lot.

    Will have more of a look tonight when I get off.

    Thanks
    Bevan

    Hi Bevan,

    Thanks for the idea. I suppose I could embed the separate queries as separate data sources in SSDT instead of using a view, and run those queries in parallel in SSDT. For various reasons, I like embedding the logic in views and using those views as a data source in SSDT where possible. But that doesn't mean I'm married to that approach, especially for performance gains.

    I've had limited success using temp tables in SSDT, but I think there are options (persistent connections) that would make this possible. Of course, I could just create permanent tables in my database and then drop them afterwards.

    Having said that, I'm quite happy with the performance gains from Eirikur's code.

    I've had good success in SSDT using the Balanced Data Distributor in other jobs, but I've never used it with a view, especially a complex view such as mine. I wonder whether I could use the BDD with my current view to achieve parallel processing? One thing to consider is that the BDD doesn't guarantee any order in the target table, but that shouldn't be an issue in my processing.

    When I ran a single query for performance testing, it ran in 30 minutes (see above post). When I ran the data load in my Dev environment overnight, the entire view / data load ran in 32 minutes. So I'm wondering whether splitting the queries into parallel processes will gain me much?

    Regards,
    Scott