Performance issue with tally solution

  • Adam Machanic (9/5/2012)


    Well the only logical way to respond to that is to ask you why you're storing these strings in a table. Dare I ask?

    I'm not storing them that way; that is the way the application stores the data, and it is the only way for us to access it and report against it. I can store them split, but there is no flag that indicates which line(s) have been updated each day, so we have to re-split the entire table every time we need updated data. (I suppose we could keep hashes of each row or something, but it took under an hour to split the last time I ran it, so it wasn't that big of an issue.)

    And, going deeper down the rabbit hole... How big are the strings in question?

    I would say up to 25k characters, but probably averaging closer to 3k.
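
    (For illustration only: a minimal sketch of the "keep hashes of each row" idea mentioned above, so only changed rows get re-split. All table and column names are hypothetical, and note that HASHBYTES is limited to 8000 bytes of input before SQL Server 2016, so strings this long would need chunking or a CLR hash.)

    -- Hypothetical schema: dbo.SourceTable(RowID, CsvData) holds the delimited strings,
    -- dbo.RowHashes(RowID, RowHash) holds the hash captured on the previous run.
    SELECT  s.RowID,
            s.CsvData
    INTO    #ChangedRows
    FROM    dbo.SourceTable AS s
    LEFT JOIN dbo.RowHashes AS h
            ON h.RowID = s.RowID
    WHERE   h.RowID IS NULL                              -- new row
       OR   h.RowHash <> HASHBYTES('SHA1', s.CsvData);   -- changed row
    -- Only the rows in #ChangedRows need to be re-split and merged into the split table.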

  • UMG Developer (9/5/2012)


    And, going deeper down the rabbit hole... How big are the strings in question?

    I would say up to 25k characters, but probably averaging closer to 3k.

    Ah, and there is the rub: You're over the 4k threshold, which means that you're stuck with the larger definition or you're hoping that the query optimizer gives you proper startup filters (and bothers to honor them).

    (https://connect.microsoft.com/SQLServer/feedback/details/757383/startup-predicates-not-honored-in-plans-with-parallel-nested-loops-subtrees)

    --
    Adam Machanic
    whoisactive

  • Adam Machanic (9/5/2012)


    Ah, and there is the rub: You're over the 4k threshold, which means that you're stuck with the larger definition or you're hoping that the query optimizer gives you proper startup filters (and bothers to honor them).

    But I can split the split operation into two pieces, one that uses the >4k version and one that uses the <=4k version. I don't know how much time that would actually save me, and I don't have time to run trials right now.

    I do have other cases where I have 100s of thousands of rows that have a column that needs to be split, and it is never longer than 255 characters, so I could safely use the <=4k version for that job.
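
    (For illustration: a minimal sketch of that two-piece approach, assuming two splitter variants exist, a dbo.Split4K taking NVARCHAR(4000) and a dbo.SplitMax taking the MAX type. All object names are hypothetical.)

    -- Route short strings to the cheaper <=4k splitter and long ones to the MAX splitter.
    -- The length predicates guarantee the 4k path never silently truncates anything.
    INSERT INTO dbo.SplitResults (RowID, ItemNumber, Item)
    SELECT  t.RowID, s.ItemNumber, s.Item
    FROM    dbo.SourceTable AS t
    CROSS APPLY dbo.Split4K(t.CsvData, ',') AS s
    WHERE   LEN(t.CsvData) <= 4000

    UNION ALL

    SELECT  t.RowID, s.ItemNumber, s.Item
    FROM    dbo.SourceTable AS t
    CROSS APPLY dbo.SplitMax(t.CsvData, ',') AS s
    WHERE   LEN(t.CsvData) > 4000;

    (Whether that second pass over the data costs more than the per-call savings is exactly the trade-off raised in the next reply.)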

  • UMG Developer (9/5/2012)


    But I can split the split operation into two pieces, one that uses the >4k version and one that uses the <=4k version. I don't know how much time that would actually save me, and I don't have time to run trials right now.

    Let's see, 9 microseconds per call vs. the overhead of scanning all of the data twice? Hm. Best of luck with that 😉

    --
    Adam Machanic
    whoisactive

  • UMG Developer (9/5/2012)


    Adam Machanic (9/5/2012)


    How are you seeing string splits used where such a difference would matter? (Again, I am *only* talking about string split operations.)

    Generally when I see string splitters used it's for passing a set of arguments to a stored procedure, and it will end up in some query like:

    SELECT *
    FROM Table
    WHERE ID IN (SELECT CONVERT(INT, output) FROM dbo.StringSplitFunction(@values))

    Adam,

    I have cases where I have a table with ~20 million rows, and each row contains a string that needs to be split. That is a case where the splitter gets called 20 million times in a single INSERT query, so any optimization that makes it faster can make a big difference.

    I have to agree with Adam on this one. Why would you store such a volume of comma-separated data in a database? A normalized sister table would be much more efficient.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
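
    (For illustration: a hedged sketch of the ~20-million-row scenario above combined with the "normalized sister table" suggestion; one set-based INSERT, with the splitter applied once per source row. Object names are hypothetical.)

    -- dbo.SourceTableItems is the normalized sister table: one row per delimited element.
    INSERT INTO dbo.SourceTableItems (SourceRowID, ItemNumber, Item)
    SELECT  t.RowID,
            s.ItemNumber,
            s.Item
    FROM    dbo.SourceTable AS t
    CROSS APPLY dbo.StringSplitFunction(t.CsvData) AS s;   -- T-SQL tally or CLR splitter TVF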

  • Jeff Moden (9/6/2012)


    UMG Developer (9/5/2012)


    Adam Machanic (9/5/2012)


    How are you seeing string splits used where such a difference would matter? (Again, I am *only* talking about string split operations.)

    Generally when I see string splitters used it's for passing a set of arguments to a stored procedure, and it will end up in some query like:

    SELECT *
    FROM Table
    WHERE ID IN (SELECT CONVERT(INT, output) FROM dbo.StringSplitFunction(@values))

    Adam,

    I have cases where I have a table with ~20 million rows, and each row contains a string that needs to be split. That is a case where the splitter gets called 20 million times in a single INSERT query, so any optimization that makes it faster can make a big difference.

    I have to agree with Adam on this one. Why would you store such a volume of comma-separated data in a database? A normalized sister table would be much more efficient.

    I guess it's not really about "why would you store such...", it's more about "if you have it (maybe including some staging storage) and you need to transform it". The source data for transformation can come from very different sources, including non-relational ones, so you may not be able to dictate the format of the input feeds...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Hi all,

    No more comments on my newer CLR version? That's unfortunate. Where's Florian these days?

    Anyway, I just discovered what I consider to be a major issue with CLR TVF-based solutions: they don't unload gracefully when an AppDomain recycle occurs. This means that your string split function can be merrily running along, splitting strings, and the next thing you know you've hit a strange and misleading exception -- because some OTHER person's code failed.

    Bad state of affairs. And it's not the case with the other major types of CLR modules (stored procedures and scalar UDFs), so it seems like this should be fixable. If you'd like to see that happen, please vote here:

    https://connect.microsoft.com/SQLServer/feedback/details/765930/clr-tvfs-are-not-gracefully-handled-during-appdomain-recycles

    Thanks!

    Adam

    --
    Adam Machanic
    whoisactive

  • Jeff Moden (9/6/2012)


    UMG Developer (9/5/2012)


    Adam Machanic (9/5/2012)


    How are you seeing string splits used where such a difference would matter? (Again, I am *only* talking about string split operations.)

    Generally when I see string splitters used it's for passing a set of arguments to a stored procedure, and it will end up in some query like:

    SELECT *
    FROM Table
    WHERE ID IN (SELECT CONVERT(INT, output) FROM dbo.StringSplitFunction(@values))

    Adam,

    I have cases where I have a table with ~20 million rows, and each row contains a string that needs to be split. That is a case where the splitter gets called 20 million times in a single INSERT query, so any optimization that makes it faster can make a big difference.

    I have to agree with Adam on this one. Why would you store such a volume of comma-separated data in a database? A normalized sister table would be much more efficient.

    Actually, I am in exactly this situation right now, Jeff. As I'm sure you know, I would never voluntarily store data like that, not permanently anyway.

    The problem, though, is that this is how the data is being delivered to me: as a multi-million-row, pipe-delimited text file, downloaded regularly from the IRS website. I import the data with BULK INSERT, but (to answer the obvious question) I cannot use BULK INSERT to do the splitting, because a few rows at the beginning and the end are non-conformant (they are not real data rows). And as we all know, BULK INSERT (and BCP) are not exactly tolerant of non-conforming input rows. Nor can I exclude them reliably with BULK INSERT, because we cannot know for sure ahead of time how many rows, or which ones, at the beginning and end will be non-conformant. And for legal reasons, we have to be sure that we have not lost any rows.

    So the obvious solution is to BULK INSERT them into a staging table with one big VARCHAR() column and then split it inside of SQL, where we can more readily detect and handle the exceptions. But that's not nearly as fast as BULK INSERT ...

    To import and split a million-row file with BULK INSERT takes me about 10 seconds (when I know ahead of time which rows are bad). Just making a SELECT INTO copy of the imported table takes about 4 seconds. But splitting it with the fastest T-SQL splitters I can find takes about 2-4 minutes! And the rows aren't big: just 7 columns with about 100 characters. So right now I'm using Paul's latest CLR splitter, which does pretty well (about 17 seconds). That's going to cause some maintenance issues for this customer, but given the speed difference, they'll probably agree to it.
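
    (A rough sketch of that staging-then-split approach, for illustration only; the file path, splitter name, and target columns are placeholders, and the field terminator just needs to be a character that never appears in the file so each whole line lands in the single wide column.)

    -- 1) Bulk load every raw line into a one-column staging table.
    CREATE TABLE #Staging (RawLine VARCHAR(500));

    BULK INSERT #Staging
    FROM 'C:\feeds\irs_extract.txt'
    WITH (FIELDTERMINATOR = '\0', ROWTERMINATOR = '\n', TABLOCK);

    -- Add a key after the load so each line can be pivoted back into columns.
    ALTER TABLE #Staging ADD LineID INT IDENTITY(1,1) NOT NULL;

    -- 2) Split only the conformant lines; anything that doesn't yield exactly 7 elements
    --    is a header/trailer row to be inspected and accounted for, not silently lost.
    INSERT INTO dbo.IrsTarget (Col1, Col2, Col3, Col4, Col5, Col6, Col7)
    SELECT  MAX(CASE s.ItemNumber WHEN 1 THEN s.Item END),
            MAX(CASE s.ItemNumber WHEN 2 THEN s.Item END),
            MAX(CASE s.ItemNumber WHEN 3 THEN s.Item END),
            MAX(CASE s.ItemNumber WHEN 4 THEN s.Item END),
            MAX(CASE s.ItemNumber WHEN 5 THEN s.Item END),
            MAX(CASE s.ItemNumber WHEN 6 THEN s.Item END),
            MAX(CASE s.ItemNumber WHEN 7 THEN s.Item END)
    FROM    #Staging AS t
    CROSS APPLY dbo.SplitterCLR(t.RawLine, '|') AS s
    GROUP BY t.LineID
    HAVING  COUNT(*) = 7;   -- exactly 7 pipe-delimited fields per line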

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (10/3/2012)


    So the obvious solution is to BULK INSERT them into a staging table with one big VARCHAR() column and then split it inside of SQL, where we can more readily detect and handle the exceptions. But that's not nearly as fast as BULK INSERT ...

    The "obvious" solution to me would be to use an external ETL package (either SSIS, or a custom thing built in C# or Powershell), do the business logic and string splitting outside of SQL Server, then bulk in the result. But of course "obvious" is very, very context dependent and you probably haven't told the entire story.

    So right now, I'm using Paul's latest CLR splitter...

    You mean mine, that Paul posted the binaries for! 🙂

    --
    Adam Machanic
    whoisactive

  • Adam Machanic (10/3/2012)


    RBarryYoung (10/3/2012)

    So right now, I'm using Paul's latest CLR splitter...

    You mean mine, that Paul posted the binaries for! 🙂

    I am using this one here: http://www.sqlservercentral.com/Forums/FindPost1356773.aspx. I had assumed that it was Paul's because he had not mentioned anyone else there, and I had not seen you on that thread recently. If this is in fact your work, Adam, then I sincerely apologize for not crediting you properly.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (10/3/2012)


    I am using this one here: http://www.sqlservercentral.com/Forums/FindPost1356773.aspx, I had assumed that it was Paul's...

    Yes that's mine.

  • RBarryYoung (10/3/2012)


    The problem, though, is that this is how the data is being delivered to me: as a multi-million-row, pipe-delimited text file, downloaded regularly from the IRS website. I import the data with BULK INSERT, but (to answer the obvious question) I cannot use BULK INSERT to do the splitting, because a few rows at the beginning and the end are non-conformant (they are not real data rows). And as we all know, BULK INSERT (and BCP) are not exactly tolerant of non-conforming input rows. Nor can I exclude them reliably with BULK INSERT, because we cannot know for sure ahead of time how many rows, or which ones, at the beginning and end will be non-conformant. And for legal reasons, we have to be sure that we have not lost any rows.

    Have you thought about combining the CLR splitter with OPENROWSET...BULK to avoid the staging table? That gives us a way to achieve minimally-logged inserts directly from the source file while transforming the data (i.e., rejecting bad rows and splitting good ones) in the bulk flow. I wrote a demo of the basic idea here: http://www.sqlservercentral.com/Forums/FindPost916633.aspx
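
    (For reference, a rough sketch of that pattern; the file path, the format file that presents each raw line as a single wide column named RawLine, the CLR splitter name, and the target table are all placeholders.)

    -- Stream the file through OPENROWSET...BULK, reject non-conformant lines, split the rest,
    -- and insert with TABLOCK so the load can be minimally logged when the usual bulk-load
    -- conditions are met. No staging table required.
    INSERT INTO dbo.IrsLineItems WITH (TABLOCK) (RawLine, ItemNumber, Item)
    SELECT  f.RawLine,
            s.ItemNumber,
            s.Item
    FROM    OPENROWSET(BULK 'C:\feeds\irs_extract.txt',
                       FORMATFILE = 'C:\feeds\one_wide_column.fmt') AS f
    CROSS APPLY dbo.SplitterCLR(f.RawLine, '|') AS s
    WHERE   LEN(f.RawLine) - LEN(REPLACE(f.RawLine, '|', '')) = 6;   -- keep only 7-field rows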

  • Adam Machanic (10/3/2012)


    Where's Florian these days?

    Haven't seen him for ages on SSC. He is @Florian_Reischl on twitter, if that helps.

    Voted.
