Creating Delimited Strings

  • Jeff Moden (4/14/2011)


    I guess I'm confused. I see the XLM method beat all but one or two of the CLR's in most groups in your latest test results (which are very well done, by the way).

    Thanks. Sorry if my last set of results were confusing. The multiple CLR options under each use-case are independent, i.e. all CLR methods deliver the same results however they have different internal implementations and some have additional inputs. The intent was to test which CLR method would be best suited for each use-case (e.g. sorted ASC, sorted DESC, unsorted, desired delimiter, etc. plus all permutations thereof) implying you may end up calling a different CLR method per use-case.

    For the optimization I was talking about, we could start by removing an unnecessary and costly concatenation...

    ORDER BY N',' + ErrorDetail ASC

    Should be just ORDER BY ErrorDetail ASC

    Good catch. The concatenation is required when using ORDER BY with DISTINCT and I incorrectly copied and pasted my way into using it for all use cases. It will save a few cycles.

    Here are my results after making those changes for two different size sets of test data:

    500,000 rows in test.dbo.OPC3Test:

    Uniqueness Sort Order Delimiter Fastest Method Least CPU Method

    -------------------------------------------------------------------------------------

    DISTINCT None * Comma * CLR CLR

    DISTINCT None * Not Comma CLR Tie

    -------------------------------------------------------------------------------------

    DISTINCT ASC Comma * Tie XML **

    DISTINCT ASC Not Comma XML ** XML **

    -------------------------------------------------------------------------------------

    DISTINCT DESC Comma * CLR Tie

    DISTINCT DESC Not Comma XML ** XML **

    -------------------------------------------------------------------------------------

    NOT DISTINCT None * Comma * XML ** XML **

    NOT DISTINCT None * Not Comma XML ** XML **

    -------------------------------------------------------------------------------------

    NOT DISTINCT ASC Comma * XML ** CLR

    NOT DISTINCT ASC Not Comma XML ** CLR

    -------------------------------------------------------------------------------------

    NOT DISTINCT DESC Comma * XML ** CLR

    NOT DISTINCT DESC Not Comma XML ** CLR

    -------------------------------------------------------------------------------------

    Fastest: XML 8, CLR 3, Tie 1

    Most Efficient: XML 5, CLR 5, Tie 2

    300,000 rows in test.dbo.OPC3Test:

    Uniqueness Sort Order Delimiter Fastest Method Least CPU Method

    -------------------------------------------------------------------------------------

    DISTINCT None * Comma * CLR CLR

    DISTINCT None * Not Comma CLR Tie

    -------------------------------------------------------------------------------------

    DISTINCT ASC Comma * CLR CLR

    DISTINCT ASC Not Comma XML ** XML **

    -------------------------------------------------------------------------------------

    DISTINCT DESC Comma * CLR CLR

    DISTINCT DESC Not Comma XML ** XML **

    -------------------------------------------------------------------------------------

    NOT DISTINCT None * Comma * XML ** XML **

    NOT DISTINCT None * Not Comma XML ** XML **

    -------------------------------------------------------------------------------------

    NOT DISTINCT ASC Comma * XML ** CLR

    NOT DISTINCT ASC Not Comma XML ** CLR

    -------------------------------------------------------------------------------------

    NOT DISTINCT DESC Comma * XML ** CLR

    NOT DISTINCT DESC Not Comma XML ** CLR

    -------------------------------------------------------------------------------------

    Fastest: XML 8, CLR 4

    Most Efficient: XML 4, CLR 7, Tie 1

    * CLR default, requires one less custom input, implies caller must call correct CLR

    method for use case to achieve best performance

    ** XML does not maintain angle brackets, e.g. it escapes < and > to &lt; and &gt;

    Edit: It's worth mentioning, regarding Ties, I ran each use-case 10 times scoring a winner or a tie for each run. After 10 runs if the # wins was equal I marked it a Tie.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I have looked around for a solution using TYPE but have come up empty...what is it that you know and why have you been holding out on us? 😀

    I went down the REPLACE-road and performance tanked in a hurry...plus it hit me that it's not just angle brackets we have to account for...all five of these characters are escaped:

    " is escaped as &quot;

    ' is escaped as &apos;

    & is escaped as &amp;

    < is escaped as &lt;

    > is escaped as &gt;

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (4/15/2011)


    I have looked around for a solution using TYPE but have come up empty...what is it that you know and why have you been holding out on us? 😀

    I went down the REPLACE-road and performance tanked in a hurry...plus it hit me that it's not just angle brackets we have to account for...all five of these characters are escaped:

    " is escaped as &quot;

    ' is escaped as &apos;

    & is escaped as &amp;

    < is escaped as &lt;

    > is escaped as &gt;

    You are possibly looking for this:

    SELECT DocID,

    STUFF((SELECT DISTINCT

    N',' + ErrorDetail

    FROM dbo.OPC3Test s2

    WHERE s2.DocID = s1.DocID

    FOR

    XML PATH(''),TYPE

    ).value('(./text())[1]','NVARCHAR(400)'), 1, 1, '') AS [Skills]

    FROM dbo.OPC3Test s1

    GROUP BY s1.DocID

    ORDER BY s1.DocID ;

    All it does is ensure the XML is encoded as XML data (,TYPE) and then retrieves the text() of the XML node. This should correctly handle special characters such as <>&'"

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • opc.three (4/15/2011)


    Thanks. Sorry if my last set of results were confusing.

    ...

    ...

    Here are my results after making those changes for two different size sets of test data:

    I guess my comment about confusion was confusing. 😛

    Heh... and I like the first test results you made MUCH better. They weren't confusing at all. I always prefer to know if the horse won by a nose or by a furlong. :hehe: I very much like the way you go through your tests and the way you format their output.

    My confusion was more in line with the following...

    Considering the max that the CLR methods beat the XML methods by was only about 600 milliseconds across 300,000 rows and that the XML methods usually won for duration, I couldn't understand why anyone would anyone consider managing external code for this. And, no... I'm not an anti-CLR zealot or trying to pick a fight. I just know a lot of folks would like their DBA's to enable the use of SQLCLR's and those types of numbers just aren't going to fly as a possible justification.

    As a bit of a side bar, I'm seeing some (just an opinion because I wasn't there for the tests) anomolies for some of the timed outputs. For example, a descencing sort is normally just as fast as an ascending sort unless and index is involved. Are you sure your machine was in a "quiet" state while you were running your tests? I ask because my HP laptop has a world of things that happen automatically and unpredictably unless I conciously turn them off. Even short things like checking for virus definitions cause a flurry of action in the background and, when I see such anomolies in my data, I know I forgot to turn one of those bloody automations off. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks mister.magoo, those are the goods!

    Jeff Moden (4/15/2011)


    opc.three (4/15/2011)


    Thanks. Sorry if my last set of results were confusing.

    ...

    ...

    Here are my results after making those changes for two different size sets of test data:

    I guess my comment about confusion was confusing. 😛

    Heh... and I like the first test results you made MUCH better. They weren't confusing at all. I always prefer to know if the horse won by a nose or by a furlong. :hehe: I very much like the way you go through your tests and the way you format their output.

    My confusion was more in line with the following...

    Considering the max that the CLR methods beat the XML methods by was only about 600 milliseconds across 300,000 rows and that the XML methods usually won for duration, I couldn't understand why anyone would anyone consider managing external code for this. And, no... I'm not an anti-CLR zealot or trying to pick a fight. I just know a lot of folks would like their DBA's to enable the use of SQLCLR's and those types of numbers just aren't going to fly as a possible justification.

    As a bit of a side bar, I'm seeing some (just an opinion because I wasn't there for the tests) anomolies for some of the timed outputs. For example, a descencing sort is normally just as fast as an ascending sort unless and index is involved. Are you sure your machine was in a "quiet" state while you were running your tests? I ask because my HP laptop has a world of things that happen automatically and unpredictably unless I conciously turn them off. Even short things like checking for virus definitions cause a flurry of action in the background and, when I see such anomolies in my data, I know I forgot to turn one of those bloody automations off. 🙂

    Thank you very much for taking the time to review and for your comments. I really do appreciate it. This will not devolve into a fight between two zealots as I am not a CLR zealot...but you have me on alert now 😉

    The primary reason for starting this was to find a way to simplify the code required to have a string concatenatation aggregate without sacrificing performance. I think the CLR methods have merit on these grounds in some scenarios, but as usual it depends. Your comment regarding 600ms over 300K rows could be turned around too...for cases where the CLR was slightly outperformed by the XML brevity of code could be a factor in choosing to use the CLR.

    At this point I am not certain of the test results displayed so far...they are "in general" numbers at best and are only useful to me to make some decisions about which methods should "move on to the next round" so to speak. This is my first foray into timing results for purposes of presenting. I too was seeing anomalies even between some of the CLR methods in the same category which is why I went for the tabular, compiled results of 10 tests per category. However, after posting those I started noticing that if I run each test individually, as opposed to as a set by category, that some of the individual tests would show drastically better results. At the moment I have suspicions that it has something to do with running multiple statements in a batch while flicking STATISTICS IO OFF and ON in between statements but I have no concrete proof.

    I have plans to re-run all tests individually, multiple times each, and to record the results in a table where I can then play around with averages, comparisons, etc.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • This whole conversation has been incredibly enlightening. I appreciate all the back and forth about testing methods, etc. I'll be coming back to read this thread over and over again for future reference.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Results attached.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Originally I was not going to include the "XML PATH" method considering the method escapes the five characters shown in an earlier post. I thought think it less useful than the PATH,TYPE method in terms of a generic solution, but for completeness I wanted to show results for all methods.

    The first set of results only showed the "XML PATH" results for 600K rows. The attached results also includes results for "XML PATH" method results for a 300K dataset.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I am excluding the "XML PATH" method from consideration as a generally accepted way to solve the group string concatenation problem. The "XML PATH,TYPE" method outperformed it in most cases plus the "XML PATH" method suffers from an issue with escaped characters. That said, it did finish first in elapsed time in the "non-distinct, unsorted" category for 600K rows so if you know your data it definitely has its place.

    For elapsed time:

    - If you exclude the "XML PATH" method, CLR outperformed XML methods in 4/6 categories for 300K and 600K rows.

    - If you include the "XML PATH" method, CLR outperformed XML methods in 4/6 categories for 300K and 3/6 categories for 600K rows.

    For cpu time:

    - If you exclude the "XML PATH" method, CLR outperformed XML methods in 4/6 categories for 300K and 600K rows.

    - If you include the "XML PATH", method CLR outperformed XML methods in 3/6 categories for 300K and 600K rows.

    <my_estimation>The CLR methods are worthy of use as a drop-in alternative to both of the XML methods. The CLR showed a performance advantage in the majority of my test cases, plus it offers a more intuitive method to solve the problem and it results in more readable code.</my_estimation>

    As usual, depending (there's that word again) on the situation, if you need to squeeze every last bit of performance out of your system you'll need to choose the appropriate CLR or XML method. It is important to note that there are many other ways to solve the problem and that I have only tested 3 of them. This article on simple-talk.com[/url] mentions many more methods. I chose to focus on CLR and XML because they have the widest acceptance in terms of performance, scalability and the least amount of limitations.

    I have not published it yet, but I have created a project on CodePlex to distribute the CLR methods as drop-ins, implementable as T-SQL scripts, i.e. no .NET compilations or programming experience will be required to implement or use them:

    http://groupconcat.codeplex.com

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I finally got to a place where I felt comfortable enough to publish the project...but it's still in Beta 😀

    If you have the time and inclination to review the site and provide corrections I would be much obliged...and if you have time use them and provide feedback...even better!

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Viewing 10 posts - 31 through 40 (of 40 total)

    You must be logged in to reply to this topic. Login to reply