Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • peter-757102 (9/11/2012)


    Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer in not using indexes due to collation differences.

    An example scenario could be this:

    The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations; otherwise it loses much of its appeal.

    Why would there be a need to specify collations everywhere? For the sake of clearing up the ambiguity, this is the part of the code where the comparison is done using a binary collation (only for the delimiter, so it can be used in most cases):

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    And this is the part where the original string is split into items

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    So I do not think we are losing the default collation.
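
    To illustrate the point (a sketch of mine, not part of the original function): the COLLATE clause is applied to @pString only inside CHARINDEX, so it affects the delimiter search but not the collation of the value that SUBSTRING returns. SQL_VARIANT_PROPERTY can be used to confirm this:

    DECLARE @pString VARCHAR(8000) = 'abc,def';
    -- The binary collation only governs where the comma is found;
    -- the extracted item itself keeps the default collation of the variable.
    SELECT SQL_VARIANT_PROPERTY(
               SUBSTRING(@pString, 1,
                   CHARINDEX(',', @pString COLLATE Latin1_General_BIN) - 1),
               'Collation') AS ItemCollation;
    -- Expected: the current database's default collation, not Latin1_General_BIN.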

  • Usman Butt (9/11/2012)


    peter-757102 (9/11/2012)


    Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer in not using indexes due to collation differences.

    An example scenario could be this:

    The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations; otherwise it loses much of its appeal.

    Why would there be a need to specify collations everywhere? For the sake of clearing up the ambiguity, this is the part of the code where the comparison is done using a binary collation (only for the delimiter, so it can be used in most cases):

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    And this is the part where the original string is split into items

    SUBSTRING(@pString, (N+1),

    ISNULL(NULLIF(CHARINDEX(@pDelimiter ,

    @pString COLLATE Latin1_General_BIN, (N+1)), 0)

    - (N+1), 8000) )

    So I do not think we are losing the default collation.

    My concern was just a general remark for such routines, not based on any specific implementation. I explicitly agreed that a binary comparison for the delimiter made sense. My concern is only about the final output, which I would like to see in the original collation, or in the database default when the input is a constant or variable.

    This is just to make further processing as clean as possible and able to use indexes without collation specifiers. If your implementation takes care of all that, all the better!

    It is easy to test for any function. Just output the result to a table using the select ... into TableX ... from ... construct and then check the collation on the columns using sp_help TableX.
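
    For example, a minimal version of that check (assuming the splitter is the article's dbo.DelimitedSplit8K; any splitter returning ItemNumber and Item columns would do):

    -- Capture the splitter output into a new table, then inspect the column collations.
    SELECT ItemNumber, Item
    INTO   TableX
    FROM   dbo.DelimitedSplit8K('abc,def,ghi', ',');

    EXEC sp_help 'TableX'; -- the Item column should show the database default collation
    DROP TABLE TableX;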

  • Usman Butt (9/11/2012)


    Another thing I just noticed is the return data type for the item is nvarchar(4000). Should not it be nvarchar(max)?

    Not for me, no 🙂

    The source code is there for anyone that does want to split strings where elements may be > 4000 characters.

  • m.t.cleary (9/11/2012)


    Can you avoid the conversion to Unicode by casting to a byte array?

    Yes, but it all gets very inconvenient from there on in - for example working out whether the byte stream was originally Unicode or not.

  • SQL Kiwi (9/11/2012)


    Usman Butt (9/11/2012)


    Another thing I just noticed is the return data type for the item is nvarchar(4000). Should not it be nvarchar(max)?

    Not for me, no 🙂

    The source code is there for anyone that does want to split strings where elements may be > 4000 characters.

    hahahaha..:-D Yes, one can change it since the source code is available. But since it is to be taken as the generic splitter :cool:, it should be handling strings with any length of elements :hehe: Should say we are comparing oranges with oranges when it comes to varchar(max) versions 😀

    But on a serious note, is there any performance degradation if nvarchar(max) is used? I do not remember the link, but I read somewhere that nvarchar(max) can sometimes speed up execution compared to nvarchar(N), though that may depend on the logic of the code. BTW, in my initial testing, I did see some improved performance when I changed the return type to, say, nvarchar(20).

  • Usman Butt (9/12/2012)


    hahahaha..:-D Yes, one can change it since the source code is available. But since it is to be taken as the generic splitter :cool:, it should be handling strings with any length of elements :hehe: Should say we are comparing oranges with oranges when it comes to varchar(max) versions 😀

    I know - and it has that annoying limitation of only accepting a maximum of 2GB of string as input too! 😀

    But on a serious note, is there any performance degradation if nvarchar(max) is used? I do not remember the link, but I read somewhere that nvarchar(max) can sometimes speed up execution compared to nvarchar(N), though that may depend on the logic of the code. BTW, in my initial testing, I did see some improved performance when I changed the return type to, say, nvarchar(20).

    I have never tried it. There quite probably would be, since handling LOB types always incurs overhead - a little in the CLR and a great deal more in the SQL Server engine. Improving performance by shortening the maximum length of the output from nvarchar(4000) to nvarchar(20) surprises me; I can't think of a reason that would affect anything. On the general topic of increasing the speed of the CLR splitter - no, I haven't tried much on that at all. It has always been the fastest method for me, and I have never seen it be the performance bottleneck for a real problem I have had. I also value keeping the implementation nice and simple.
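
    For anyone who wants to measure it, a rough timing sketch (the two function names below are placeholders for a CLR splitter deployed once with an nvarchar(4000) return column and once with nvarchar(max); the Item column name follows the article's convention):

    SET STATISTICS TIME ON;

    -- Assign to a variable so display time does not pollute the measurement.
    DECLARE @Bitbucket NVARCHAR(4000);

    SELECT @Bitbucket = split.Item
    FROM   dbo.SplitString_4000(REPLICATE(N'x,', 2000), N',') split; -- nvarchar(4000) version

    SELECT @Bitbucket = split.Item
    FROM   dbo.SplitString_Max(REPLICATE(N'x,', 2000), N',') split;  -- nvarchar(max) version

    SET STATISTICS TIME OFF;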

  • I was thinking that there would be a separate version for ASCII which would avoid the conversions to and from UNICODE and probably make it even faster. The convenience of having a single function counts for a lot given the speed is already good.

    SQL Kiwi (9/11/2012)


    m.t.cleary (9/11/2012)


    Can you avoid the conversion to Unicode by casting to a byte array?

    Yes, but it all gets very inconvenient from there on in - for example working out whether the byte stream was originally Unicode or not.

    I found a similar failure to proceed on a physical server - a desktop running Vista 32-bit on an aging AMD processor. When I changed the script to select into local variables instead of inserting into temp tables, it completed in good time. There is probably a problem with tempdb allocation or expansion.

    This is OK for micro-benchmarks, but if you want a realistic test you should use the splitter output rather than discarding it. If you can't fix tempdb, I'd change the script to delete the temporary tables as you go, or just run one splitter variant at a time.
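
    The variable-assignment approach mentioned above looks roughly like this (a sketch only - the test table dbo.CsvTest and its CsvString column are placeholder names, and the splitter is assumed to be the article's dbo.DelimitedSplit8K):

    -- Consume the splitter output into local variables instead of a temp table,
    -- so tempdb allocation does not dominate the timing.
    DECLARE @ItemNumber BIGINT, @Item VARCHAR(8000);

    SELECT @ItemNumber = split.ItemNumber,
           @Item       = split.Item
    FROM   dbo.CsvTest src
    CROSS APPLY dbo.DelimitedSplit8K(src.CsvString, ',') split;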

    -mark

    ErikEckhardt (3/19/2012)


    Second try: failed. 3 days and 20-some-odd hours later, exact same results as before, only with wait time 334,605,998. 🙂

    I think I will not be running this again on my VM SQL instance. That's okay, there are plenty other places I can try it...

    The entire SQL Server could not be used for any other purpose for 2 days. I have something logging to a table every minute and it could not operate.

  • I had an online csv splitter made that can be used on any platform including mac...

    meh won't link...

    http://www.promotional-products.org/automatic-online-csv-file-splitter.html

  • david 97103 (10/5/2012)


    I had an online csv splitter made that can be used on any platform including mac...

    meh won't link...

    http://www.promotional-products.org/automatic-online-csv-file-splitter.html

    That's nice, and thanks for the freebie, but it's not exactly what we're talking about here. That splits the files into multiple files. It doesn't actually split the rows into elements and insert them into a database table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks, Jeff!

    I was inspired by this to make a CSV splitter that also respects double-quotes, such as 10,"20,30",40, so it doesn't split on the commas inside the quotes.

    Style note - the commas are placed before the CTE names (instead of after) so there is room to insert a testing statement for seeing the results of an intermediate CTE, by uncommenting and running just that section.

    IF object_id('dbo.parseCSV') IS NOT NULL DROP FUNCTION dbo.parseCSV

    GO

    --parseCSV.sql Derived from: http://www.sqlservercentral.com/articles/Tally+Table/72993/

    /*

    We assume input data is comma-separated values with no quotes EXCEPT for fields completely surrounded by quotes.

    That is, the opening quote is either column 1 or just after a comma, and the closing quote is either end-of-string or just before a comma.

    Usage Example:

    SELECT * FROM dbo.parseCSV('Fat,"10,20,30",and,Lazy')

    --*/

    CREATE FUNCTION dbo.parseCSV (@pString VARCHAR(8000))

    --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...

    -- enough to cover VARCHAR(8000)

    --You can test by executing from this Declare to the end or an uncommented SELECT after a cte definition.

    --declare @pString varchar(1000)='1,"4,6",9'; set @pString='a1,",xb,",c1';--set @pString='Fat,"10,20,30",and,Lazy';--set @pString='a,b,"c,d"';--set @pString='"1,2","3,4"';

    WITH E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ) --10E+1 or 10 rows

    ,E2(N) AS (SELECT 1 FROM E1 a, E1 b) --10E+2 or 100 rows

    ,E4(N) AS (SELECT 1 FROM E2 a, E2 b) --10E+4 or 10,000 rows max

    ,cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    ,cteQtCom AS (--find all quotes and commas

    SELECT t.N

    , CASE WHEN SUBSTRING(@pString,t.N,1)=',' THEN 1 ELSE 0 END [isComma]

    , CASE WHEN SUBSTRING(@pString,t.N,1)='"' THEN 1 ELSE 0 END [isQuote]

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) IN (',','"')

    )--cteQtCom

    --SELECT * FROM cteQtCom

    ,cteNumQuotesBefore AS (

    SELECT t.N, t.isComma, t.isQuote

    , ISNULL((SELECT SUM(t2.isQuote) FROM cteQtCom t2 WHERE t2.N<t.N),0) [qb4]

    FROM cteQtCom t

    )--cteNumQuotesBefore

    --SELECT * FROM cteNumQuotesBefore

    ,cteCommasNotWithinQuotes AS (--if the number of quotes before is EVEN then this comma is not inside quotes.

    SELECT t.N

    FROM cteNumQuotesBefore t

    WHERE t.isComma=1 AND (qb4%2)=0

    )--cteCommasNotWithinQuotes

    --SELECT *, SUBSTRING(@pString,1,N)[substr] FROM cteCommasNotWithinQuotes

    ,cteStart(N1,this_startquote,prev_endquote) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1[N1], CASE WHEN SUBSTRING(@pString,1,1)='"' THEN 1 ELSE 0 END[this_startquote], 0[prev_endquote]

    UNION ALL

    SELECT t.N+1[N1]

    , CASE WHEN SUBSTRING(@pString,t.N+1,1)='"' THEN 1 ELSE 0 END[this_startquote]--this field starts with quote

    , CASE WHEN SUBSTRING(@pString,t.N-1,1)='"' THEN 1 ELSE 0 END[prev_endquote]--previous field ended with quote

    --both could be true for contiguous quoted fields: "1","2"

    FROM cteCommasNotWithinQuotes t

    )--cteStart

    --SELECT * FROM cteStart

    ,cteLen AS(--==== Return start and length (for use in substring)

    SELECT s.N1, s.this_startquote

    ,CASE WHEN ending_comma.N1 IS NOT NULL THEN ending_comma.N1-s.N1-1

    ELSE LEN(@pString)-s.N1+1 END [L1]

    ,CASE WHEN ending_comma.prev_endquote IS NOT NULL THEN ending_comma.prev_endquote

    ELSE CASE WHEN RIGHT(@pString,1)='"' THEN 1 ELSE 0 END END[this_endquote]

    FROM cteStart s

    OUTER APPLY (

    SELECT TOP 1 e.N1,e.prev_endquote

    FROM cteStart e

    WHERE e.N1>s.N1

    ORDER BY N1

    ) ending_comma

    )--cteLen

    --SELECT *,SUBSTRING(@pString,N1,L1)[substr] FROM cteLen

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY LL.N1),

    Item = SUBSTRING(@pString, LL.N1+LL.this_startquote, LL.L1-this_startquote-this_endquote)

    FROM cteLen LL

    ;

    GO
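
    For reference, tracing the logic above, the usage example in the header comment should return four items (quotes stripped, inner commas preserved):

    --SELECT * FROM dbo.parseCSV('Fat,"10,20,30",and,Lazy')
    -- ItemNumber  Item
    -- ----------  ---------
    -- 1           Fat
    -- 2           10,20,30
    -- 3           and
    -- 4           Lazy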

    And here is a little tester code:

    --Create a few hundred test cases by all combinations of A,B,C where each can be bare or "surrounded by quotes"

    DECLARE @a TABLE (x VARCHAR(200))

    INSERT INTO @a VALUES (''),('a1'),('qa1,qa2'),(',xa,')

    DECLARE @q TABLE (iq TINYINT, q VARCHAR(1))

    INSERT INTO @q VALUES (0,''),(1,'"')

    IF object_id('tempdb..#T') IS NOT NULL DROP TABLE #T

    CREATE TABLE #T (id INT IDENTITY(1,1),s VARCHAR(200))

    INSERT INTO #T

    SELECT QA.q+A.x+QA.q

    +','+QB.q+B.x+QB.q

    +','+QC.q+C.x+QC.q

    FROM @a A CROSS JOIN @q QA

    CROSS JOIN @a B CROSS JOIN @q QB

    CROSS JOIN @a C CROSS JOIN @q QC

    --Run the test -- 512 input rows, 2112 output rows -- up to you to eyeball the results

    IF 1=0

    SELECT id,s, r.*

    FROM #T

    OUTER APPLY dbo.parseCSV(s) r

    --show fields in separate columns so more compact:

    SELECT s, [1][one],[2][two],[3][three],[4][four],[5][five],[6][six],[7][seven]

    FROM (

    SELECT t.id, t.s, r.ItemNumber, r.Item

    FROM #T t

    OUTER APPLY dbo.parseCSV(s) r

    ) AS SRC

    PIVOT (

    MAX(Item)

    FOR ItemNumber IN ([1],[2],[3],[4],[5],[6],[7])

    ) AS pvt

    IF 1=0 BEGIN--testa --prints as we go, so we can find where a problem happened:

    DECLARE @n INT=0

    WHILE EXISTS(SELECT * FROM #T WHERE id>@n) BEGIN--n

    SELECT @n=MIN(id) FROM #T WHERE id>@n

    DECLARE @s1 VARCHAR(200)

    SELECT @s1=s FROM #T WHERE id=@n

    PRINT CONVERT(VARCHAR,@n)+': '+@s1

    SELECT @n[n],* FROM dbo.parseCSV(@s1) v

    END--n

    END--testa

  • Hi Jeff,

    Can you please test your new functions' performance against the following function:

    CREATE FUNCTION [dbo].[ParseString]

    (

    @String VarChar(8000),

    @Delimiter VarChar(1)

    ) RETURNS TABLE

    AS

    RETURN

    (

    WITH Tokens(LinePos, StartPos, EndPos) AS

    (

    SELECT

    1,

    1,

    CharIndex(@Delimiter, @String)

    UNION ALL

    SELECT

    LinePos + 1,

    EndPos + 1,

    CharIndex(@Delimiter, @String, EndPos + 1)

    FROM

    Tokens

    WHERE

    EndPos > 0

    )

    SELECT

    CONVERT(Int, LinePos) AS RowNumber,

    SubString

    (

    @String,

    StartPos,

    CASE

    WHEN EndPos > 0 THEN (EndPos - StartPos)

    ELSE 8000

    END

    )

    AS StringValue

    from Tokens

    )

    and let us know which performs best?

    Regards,

    Dirk van der Watt

  • Dirk vd Watt (12/28/2012)


    Hi Jeff,

    Can you please test your new functions' performance against the following function:

    CREATE FUNCTION [dbo].[ParseString]

    (

    @String VarChar(8000),

    @Delimiter VarChar(1)

    ) RETURNS TABLE

    AS

    RETURN

    (

    WITH Tokens(LinePos, StartPos, EndPos) AS

    (

    SELECT

    1,

    1,

    CharIndex(@Delimiter, @String)

    UNION ALL

    SELECT

    LinePos + 1,

    EndPos + 1,

    CharIndex(@Delimiter, @String, EndPos + 1)

    FROM

    Tokens

    WHERE

    EndPos > 0

    )

    SELECT

    CONVERT(Int, LinePos) AS RowNumber,

    SubString

    (

    @String,

    StartPos,

    CASE

    WHEN EndPos > 0 THEN (EndPos - StartPos)

    ELSE 8000

    END

    )

    AS StringValue

    from Tokens

    )

    and let us know which performs best?

    Regards,

    Dirk van der Watt

    That's a recursive CTE that counts. Please see the following article.

    http://www.sqlservercentral.com/articles/T-SQL/74118/

    Considering the findings in that article and the fact that I provided a complete test harness so that anyone could do a full test on their own functions, please understand when I say "Can you please test {your own} new functions' performance ... and let us know which performs best?" 🙂
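
    Also worth remembering when testing the recursive version: a recursive CTE is capped by the default MAXRECURSION limit of 100, so an inline splitter like ParseString above will fail on strings with more than roughly 100 delimiters unless the calling query supplies a hint, and that hint cannot be embedded in the inline function itself. A sketch of the workaround:

    -- The hint has to go on the outer query; it cannot live inside the inline TVF.
    SELECT RowNumber, StringValue
    FROM   dbo.ParseString(REPLICATE('x,', 500), ',')
    OPTION (MAXRECURSION 0); -- 0 = no limit; without it, more than 100 elements aborts the statement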

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • what about xml?

    declare @table table (id int identity(1,1), textdata varchar(4000))

    insert @table

    exec xp_cmdshell 'wmic volume get capacity, drivetype, freespace, caption /format:csv'

    ; with cte as (

    select Data.value('(/root/r/text())[1]','varchar(1000)') Node

    , convert(bigint,Data.value('(/root/r/text())[2]','varchar(1000)')) / (1024 * 1024 * 1024) Capacity

    , Data.value('(/root/r/text())[3]','varchar(1000)') Caption

    , Data.value('(/root/r/text())[4]','varchar(1000)') DriveType

    , convert(bigint,replace(Data.value('(/root/r/text())[5]','varchar(1000)'),char(10),'')) / (1024 * 1024 * 1024) FreeSpace

    from @table

    cross apply (select convert(xml,'<root><r>' + replace(textdata,',','</r><r>') + '</r></root>') as Data) textdataXML

    where Data.value('(/root/r/text())[4]','varchar(1000)') = '3'

    )

    select Caption DriveName

    , Capacity

    , FreeSpace

    , convert(money,FreeSpace) / convert(money,Capacity) PercentFree

    from cte

    where Capacity > 0

    order by 3 desc

  • telcogod (12/28/2012)


    what about xml?

    Did you read the whole article and look at the performance comparisons? This style of xml splitter was on the list and it was not as fast.

    As Jeff said above, there is a full test harness posted. Try out your xml splitter next to the DelimitedSplit8K. You might be surprised.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
