DelimitedSplit8K versus DelimitedSplit_Long

  • Referenced sites and threads:

    Processing strings:
    http://www.sqlservercentral.com/Forums/Topic1772477-391-1.aspx

    Tally OH! An Improved SQL 8K “CSV Splitter” Function:
    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    Reaping the benefits of the Window functions in T-SQL:
    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    This topic concerns the processing of a large CSV string.

    In the thread 'Processing strings' there is a request for help with CSV files.

    Often a CSV file is much longer than 8K, so a version of DelimitedSplit8K that can handle longer strings is needed.

    I used DelimitedSplit8K and made some 'large' *) changes to it:

    Changed the name to DelimitedSplit_Long.

    Made the string varchar(max).

    Made the tally table much larger.

    Extended the maximum of the CHARINDEX.

    First of all, this seems to work. Of course I expect a performance hit, but I have not tested this extensively.
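    In outline the changes amount to something like the sketch below. (This is a simplified illustration built on the structure of the original DelimitedSplit8K, not the complete DelimitedSplit_Long function; the function name is just a placeholder.)

    -- Simplified sketch only: varchar(max) input, inline tally up to 1,000,000 rows,
    -- and the CHARINDEX fallback length raised from 8000 to 10,000,000.
    CREATE FUNCTION dbo.DelimitedSplit_Long_Sketch
    (
        @pString    varchar(max),
        @pDelimiter char(1)
    )
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)), -- 10 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),          -- 100 rows
         E6(N) AS (SELECT 1 FROM E2 a, E2 b, E2 c),    -- 1,000,000 rows ("tally to a million")
    cteTally(N) AS
    (
        SELECT TOP (ISNULL(DATALENGTH(@pString),0))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM E6
    ),
    cteStart(N1) AS
    (
        SELECT 1 UNION ALL
        SELECT t.N + 1 FROM cteTally t WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
    ),
    cteLen(N1, L1) AS
    (
        SELECT s.N1,
               -- fallback length raised from 8000 ("CHARINDEX up to 10 million")
               ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0) - s.N1, 10000000)
        FROM cteStart s
    )
    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY l.N1),
           Item       = SUBSTRING(@pString, l.N1, l.L1)
    FROM cteLen l;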

    Before starting testing and tuning, I am seeking some advice here.

    My questions are:

    Any other suggested changes?

    Are there 'optimal' methods to do this?

    Any other advice on extending the 8K version?

    The tally table now goes up to a million. (But it should be larger.)

    The CHARINDEX now goes up to 10 million. (But it should be larger.)

    Neither is enough for very large CSV files.

    Remark: I started this new thread because I think the subject of enlarging the capacity of the 8K version deserves its own thread. (I have not abandoned the 'Processing strings' thread.)

    Thanks for your time and attention,

    Ben

    *)

    Or some small 'size' changes to it. 😛

    Especially thanks to :

    The Tally or Auxiliary table promoters.

    Ben-Gan, Itzik (For the Gaps and Islands solutions.)

    Jeff Moden (For his work on the 8K splitter.)

    Others who have contributed.

  • ben.brugman (4/4/2016)


    Referenced sites and threads:

    Any other advice on extending the 8K version?

    ...Remark: I started this new thread because I think the subject of enlarging the capacity of the 8K version deserves its own thread.

    First, this is the splitter I use for strings larger than 8K.

    CREATE FUNCTION dbo.DelimitedSplit2B
    (
        @pString varchar(max),
        @pDelimiter char(1)
    )
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    WITH L1(N) AS
    (
        SELECT N
        FROM (VALUES
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
        (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(N)
    ), --216 values
    cteTally(N) AS
    (
        SELECT 0 UNION ALL
        SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c CROSS JOIN L1 d -- 216^4 self-joins
        --2,176,782,336 rows: enough to handle 2,147,483,647 characters (the varchar(max) limit)
    ),
    cteStart(N1) AS
    (
        SELECT t.N+1
        FROM cteTally t
        WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
    )
    SELECT
        ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
        Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1)
               OVER (ORDER BY s.N1) - 1),0)-s.N1,DATALENGTH(ISNULL(@pString,1))))
    FROM cteStart s;
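    For example (an arbitrary test string, just to show it splitting input well past the 8K mark):

    DECLARE @csv varchar(max) = REPLICATE(CAST('abc,' AS varchar(max)), 5000) + 'xyz'; -- 20,003 characters
    SELECT ItemNumber, Item
    FROM dbo.DelimitedSplit2B(@csv, ','); -- returns 5,001 rows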

    I started with DelimitedSplit8K_LEAD, which uses LEAD and therefore requires SQL Server 2012 (just mentioning this because you posted your question in a 2008 thread). The changes I made are:

    1. Changed the input data type to varchar(max).

    2. Changed the tally table logic to support up to ~2.2B rows, enough to handle a varchar(max) string.

    3. Changed the ISNULL logic that handles the final delimiter to be the length of the string (looking back I should update that to just be 2147483648 but did not think of it at the time).

    If you need a splitter for SQL 2008 or 2005 you could start with DelimitedSplit8K and make the same changes. I do a lot of ETL work and frequently have to split strings longer than 8K; this guy has served me well.

    Any other suggested changes?

    If you work on a 2014+ server you can use an "Electric Tally Table" (an in-memory tally table). If you go to the final page of the discussion for the DelimitedSplit8K article and go back a few pages, Paul White did this and saw some improvements.
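    For anyone curious, a rough sketch of what an in-memory tally table can look like (this is not Paul White's code; it assumes SQL Server 2014+ and a database that already has a MEMORY_OPTIMIZED_DATA filegroup, and the table name and bucket count are illustrative only):

    CREATE TABLE dbo.TallyInMem
    (
        N int NOT NULL,
        CONSTRAINT PK_TallyInMem PRIMARY KEY NONCLUSTERED HASH (N) WITH (BUCKET_COUNT = 1048576)
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);

    -- Populate once; SCHEMA_ONLY tables are emptied on restart, so re-populate after a restart.
    INSERT INTO dbo.TallyInMem (N)
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b;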

    The tally table now goes up to a million. (But it should be larger.)

    The CHARINDEX now goes up to 10 million. (But it should be larger.)

    Neither is enough for very large CSV files.

    When you go bigger than 8K your only option is varchar(max) (or nvarchar(max) if you go that route). In that case you need to be able to support 2,147,483,647 characters, as I mentioned earlier. That's why things slow down when dealing with varchar(max): it's like you have varchar(1), varchar(2) ... varchar(8000), and then varchar(2147483647). Plus, checking longer strings for a delimiter requires your CTE tally table to generate many more rows.
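    To illustrate the jump (a trivial example; note that REPLICATE must be fed a varchar(max) input to return more than 8000 characters):

    DECLARE @s varchar(max) = REPLICATE(CAST('x' AS varchar(max)), 100000);
    SELECT DATALENGTH(@s) AS Chars; -- 100000: only possible with the MAX type
    -- DECLARE @t varchar(100000);  -- would not compile: varchar(n) tops out at 8000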

    This is all good to keep in mind for if/when you develop your own set-based functions for dealing with strings (if you have not done so already).

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • You should also understand that SQL Server deplores joins to the MAX datatypes and doing so usually causes a 2:1 loss in performance to the point where a well written mTVF with a WHILE loop in it will tie and, many times, beat the Tally version of a long string splitter.
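    For context, here is a minimal sketch of the kind of WHILE-loop mTVF being described (an illustration only, with a hypothetical name, not a tuned implementation):

    CREATE FUNCTION dbo.SplitMax_WhileLoop
    (
        @pString    varchar(max),
        @pDelimiter char(1)
    )
    RETURNS @Result TABLE (ItemNumber int IDENTITY(1,1), Item varchar(max))
    AS
    BEGIN
        DECLARE @Start bigint = 1,
                @Pos   bigint = CHARINDEX(@pDelimiter, @pString);

        -- Walk the string delimiter by delimiter; CHARINDEX does the scanning,
        -- so there is no join against the MAX datatype.
        WHILE @Pos > 0
        BEGIN
            INSERT @Result (Item) VALUES (SUBSTRING(@pString, @Start, @Pos - @Start));
            SET @Start = @Pos + 1;
            SET @Pos   = CHARINDEX(@pDelimiter, @pString, @Start);
        END;

        -- Last element (or the whole string when it contains no delimiter).
        INSERT @Result (Item) VALUES (SUBSTRING(@pString, @Start, DATALENGTH(@pString)));

        RETURN;
    END;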

    If you really need to split long strings on a regular basis, it's really worth writing a good CLR for the task.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
