Running totals problem

  • Hi all -

    Here's my current problem. I have a table whose records are segregated into several subsets. For each subset, I want to build a file.

    This file is going to contain a row for each record in the table. As it stands right now, this is how I'm solving the problem.

    As an example of the data set I'm dealing with:

    CREATE TABLE #DataTable
    (
    ID INT IDENTITY PRIMARY KEY,
    Criteria VARCHAR(MAX)
    )

    INSERT INTO #DataTable (Criteria)
    SELECT TOP 10000 REPLICATE(RAND(CHECKSUM(NEWID())) * 100, 20)
    FROM Master.dbo.SysColumns sc1
    CROSS JOIN Master.dbo.SysColumns sc2

    CREATE TABLE #Temp
    (
    RowNum INT IDENTITY (1,1) NOT NULL,
    Criteria VARCHAR(MAX),
    EndResult VARCHAR(MAX)
    CONSTRAINT PK_RowNum
    PRIMARY KEY NONCLUSTERED (RowNum)
    WITH FILLFACTOR = 100
    )

    INSERT INTO #Temp (Criteria, EndResult)
    SELECT Criteria, ''
    FROM #DataTable
    WHERE Clause

    For each subset, I'm running this query:

    UPDATE #Temp
    SET @RunningTotalVariable = EndResult = @RunningTotalVariable + Criteria
    FROM #Temp

    The problem is that this query is taking a very long time to handle the records. In some cases there are as many as 10,000 records to go through. The root of the problem seems to be that the running total is being performed against string values, which takes the database engine a very long time to process.

    I tested the same script with the strings replaced by simple integers, and it handles them without a problem. I also tried it with very short, simple strings, and it works too, although it takes longer.

    My strings, though, can be up to 50-100 characters long, and the database just doesn't seem to be able to deal with that.

    I also tried this code instead:

    SELECT @RunningTotalVariable = @RunningTotalVariable + Criteria
    FROM #Temp

    Same problem though, takes way too long for it to run.

    So, the question is simple: what can I do to solve this problem? The end result I need is a string of type VARCHAR(MAX) containing a concatenation of all of the records in the table that match the given criteria.

    I'm considering perhaps that the best solution may be to employ a .NET assembly, since perhaps .NET can handle this task better than SQL Server can. Hopefully I don't have to resort to that though, since I'd really rather not have to use a CLR for this.

    One other thing that I'm considering is, rather than creating my final VARCHAR(MAX) variable with the contents of the entire file, then writing the contents to the file, maybe a better solution would be to write line-by-line to the file. However, this would require me to loop through the temp table line by line, which doesn't exactly follow best practices.

  • The major issue I can see so far is using varchar(max).

    If possible for your solution, change it to varchar(8000) or less.

    It seems like the downside of varchar(max) (being too large to fit into a page) has a huge effect on your query...

    Edit: make sure to change it for all columns and variables used during the quirky update!



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

    Maybe I'm missing something, but from the code posted, it looks as though, since all the data is defined as varchar, the "+" operator would invoke string concatenation, not addition. If the data is truly characters and not numeric, how would you want the "running total" to be determined?

    BTW, I tried to run this and after defining @RunningTotalVariable as varchar(max), the result column in #temp stayed as NULL. Is there something else you did to test your update?

  • @john-2:

    I expect string concatenation as well, based on the OP's statement:

    My strings, though, can be up to 50-100 characters long, and the database just doesn't seem to be able to deal with that.

    After declaring @RunningTotalVariable you need to initialize it to an empty string (''), since concatenating NULL + something will result in NULL.
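    A minimal sketch of that initialization (assuming the default CONCAT_NULL_YIELDS_NULL ON setting; variable and table names follow the OP's code):

    ```sql
    DECLARE @RunningTotalVariable VARCHAR(MAX);

    -- The variable starts out NULL, so NULL + anything yields NULL
    SELECT @RunningTotalVariable + 'abc';   -- returns NULL

    -- Initialize to an empty string before concatenating
    SET @RunningTotalVariable = '';

    SELECT @RunningTotalVariable = @RunningTotalVariable + Criteria
    FROM #Temp;
    ```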

    Regarding the sample code itself:

    I strongly recommend against running it. The reason is very simple:

    It will concatenate all 10,000 rows, leading to a string of 1,400,000 characters (10,000 rows * 140 characters per row) in the final row. In total, there will be over 7 billion characters in that column... Building that will take a while...

    @kramaswamy

    I guess the sample you provided doesn't reflect the real scenario...

    How long is an average (and max) resulting row after you're done with concatenation?



    Lutz

  • Actually, the size of the string is about that, yes. Some of the larger subsets have over 1 million characters in them.

    So, yeah, they are storing a ton of data into that temp table. Could that be the problem, that the temp table is just too large to handle all the data?

    Even still though, I do need to solve this problem somehow 😛 Any advice?

  • kramaswamy (1/5/2010)


    Actually, the size of the string is about that, yes. Some of the larger subsets have over 1 million characters in them.

    So, yeah, they are storing a ton of data into that temp table. Could that be the problem, that the temp table is just too large to handle all the data?

    Even still though, I do need to solve this problem somehow 😛 Any advice?

    I can't think of any business reason to build a string with a million characters, but anyway... Nor do I know whether you need all intermediate results or just the very last row. If you only need the final result, you could use string concatenation via FOR XML PATH.

    Something like

    SELECT TOP 1
    (select '' + Criteria from #Temp t2 for xml path(''))
    from #Temp t1

    If that doesn't help, you should describe the business case in more detail, including some more descriptive sample data.



    Lutz

  • Wow - that works amazingly well! Took a 45 minute operation down to 5 seconds!

    Can you explain to me how this is working? I'm a bit fuzzy on the whole XML thing, so I'm not really sure what is going on with the script.

  • Has to do with how the string values are actually concatenated. IIRC, straight concatenation, @Var1 = @Var1 + @Var2 for instance, actually copies the data from @Var1 and @Var2 to a new location in memory then changes the pointer for @Var1 to the new location. The XML version actually appends the data from @Var2 to the end of the data in @Var1. Not sure how that works, but I do recall this being explained somewhere on SSC much better than I just did.

  • Sorry for the double post, but there is one more problem left now.

    The string, as it is written to the file, is converting many of the characters to their HTML codes. Here's what's actually happening:

    SELECT TOP 1 @PrevURL = @PrevURL +
    (
    SELECT '' + CHAR(13) + CHAR(10) + '<url><loc>' + URL + '</loc><lastmod>' + REPLACE(CONVERT(VARCHAR, DateCreated, 111), '/', '-') + 'T' + CONVERT(VARCHAR, DateCreated, 8) + '-05:00</lastmod><changefreq>daily</changefreq><priority>' + CAST(Priority AS VARCHAR) + '</priority></url>'
    FROM #Temp t2 FOR XML PATH('')
    )
    FROM #Temp t1

    This should result in, for example,

    <url><loc>http://whitepapers.technologyevaluation.com/view_Document/1286/Demand-Driven-Strategies-for-Complex-Manufacturing.html</loc><lastmod>2009-09-26T05:00:00-05:00</lastmod><changefreq>daily</changefreq><priority>1.00</priority></url>

    Instead, it gives me, for example,

    |lt;url|gt;|lt;loc|gt;http://whitepapers.technologyevaluation.com/view_Document/1286/Demand-Driven-Strategies-for-Complex-Manufacturing.html|lt;/loc|gt;|lt;lastmod|gt;2009-09-26T05:00:00-05:00|lt;/lastmod|gt;|lt;changefreq|gt;daily|lt;/changefreq|gt;|lt;priority|gt;1.00|lt;/priority|gt;|lt;/url|gt;|#x0D;

    The "|" characters are actually "&" characters, but if I put those, it converts them automatically to their HTML equivalent. Any ideas on how I can fix this? I know I could just add some REPLACE rules, but I'd rather just tell it to do it the "right" way from the start.

  • Could be me, but I didn't see any differences between the two strings.

    Edit: And yes, there is a way to fix it, I just can't remember it off the top of my head, but I know it is somewhere on SSC. Will need to do some looking.
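    One widely used way around the entity escaping (a sketch only, abbreviated to the <loc> element and not tested against the exact query above) is to add the TYPE directive and extract the result with the xml value() method, which returns the un-escaped string:

    ```sql
    SELECT TOP 1 @PrevURL = @PrevURL +
    (
    SELECT '' + CHAR(13) + CHAR(10) + '<url><loc>' + URL + '</loc></url>'
    FROM #Temp t2
    FOR XML PATH(''), TYPE
    ).value('.', 'VARCHAR(MAX)')
    FROM #Temp t1
    ```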

  • Lol - yeah, see the correction I posted 😛 Had to replace the "&" characters, since otherwise it just showed the correct string 😛

  • Never mind - figured it out! Actually starting to learn how this XML stuff works finally, lol.

    In case you're curious, here's the solution I employed:

    SELECT TOP 1 @PrevURL = @PrevURL + REPLACE(
    (
    --SELECT '' + CHAR(13) + CHAR(10) + '<url><loc>' + URL + '</loc><lastmod>' + REPLACE(CONVERT(VARCHAR, DateCreated, 111), '/', '-') + 'T' + CONVERT(VARCHAR, DateCreated, 8) + '-05:00</lastmod><changefreq>daily</changefreq><priority>' + CAST(Priority AS VARCHAR) + '</priority></url>'
    --SELECT '' + CHAR(60) + 'url' + CHAR(62) + CHAR(60) + 'loc' + CHAR(62) + URL + CHAR(60) + '/loc' + CHAR(62) + CHAR(60) + 'lastmod' + CHAR(62) + REPLACE(CONVERT(VARCHAR, DateCreated, 111), '/', '-') + 'T' + CONVERT(VARCHAR, DateCreated, 8) + '-05:00' + CHAR(60) + '/lastmod' + CHAR(62) + CHAR(60) + 'changefreq' + CHAR(62) + 'daily' + CHAR(60) + '/changefreq' + CHAR(62) + CHAR(60) + 'priority' + CHAR(62) + '1.00' + CHAR(60) + '/priority' + CHAR(62) + CHAR(60) + '/url' + CHAR(62)
    SELECT
    URL "url/loc",
    REPLACE(CONVERT(VARCHAR, DateCreated, 111), '/', '-') + 'T' + CONVERT(VARCHAR, DateCreated, 8) + '-05:00' "url/lastmod",
    'daily' "url/changefreq",
    CAST(Priority AS VARCHAR) "url/priority"
    FROM #Temp t2 FOR XML PATH('')
    ), '<url>', CHAR(13) + CHAR(10) + '<url>')
    FROM #Temp t1

    Basically, rather than trying to artificially build a string into an XML structure, I just built the XML structure the normal way, and then turned it into a string. Works like a charm!

  • kramaswamy (1/5/2010)


    Wow - that works amazingly well! Took a 45 minute operation down to 5 seconds!

    Can you explain to me how this is working? I'm a bit fuzzy on the whole XML thing, so I'm not really sure what is going on with the script.

    Ok, I'll give it a try. (a little late since quite a few replies happened in between...)

    But before we talk about how the xml stuff works I'd like to explain why this concept in general is a lot faster than the quirky update:

    Usually, the quirky update method is used if you need to assign a running value (either character or numbers) to each row. If we just need the final result, we'd usually use the SUM() function for numbers instead of storing each intermediate value.

    Therefore, the performance problem wasn't caused by the quirky update. The quirky update was simply the wrong method to begin with.

    Now to the XML stuff:

    Let's start with the inner SELECT:

    select '' + Criteria from #Temp t2 for xml path('')

    When using FOR XML PATH we convert the data into an xml structure.

    Step 1:

    SELECT TOP 2 Criteria from #Temp t2 for xml path ('test')

    /* result set
    <test>
    <Criteria>4.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.94563</Criteria>
    </test>
    <test>
    <Criteria>33.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.5796</Criteria>
    </test>
    */

    As you can see, each row is wrapped with two xml tags (column name and "test")

    Since we don't want to have those tags we replace 'test' with an empty string and also define an empty string to replace the column name.

    SELECT TOP 2 '' + Criteria from #Temp t2 for xml path ('')

    /* result set
    4.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.945634.9456333.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.579633.5796
    */

    If we remove the TOP 2 clause, we'd get the data concatenated just as we want it. Unfortunately, it's an XML format.

    And while trying to explain it I just figured that we don't need the outer reference to #Temp table. A simple SELECT is enough to get the output as varchar(max) again.

    So it finally looks like:

    SELECT (select '' + Criteria from #Temp t2 for xml path(''))

    This concept can also be used to concatenate rows in a comma separated list grouped by a separate column (the usual case).

    An example can be found here.
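    A sketch of that usual case (using a hypothetical Orders table with CustomerID and Product columns), with STUFF dropping the leading comma:

    ```sql
    -- Hypothetical table: Orders(CustomerID INT, Product VARCHAR(50))
    SELECT o.CustomerID,
           STUFF((SELECT ',' + o2.Product
                  FROM Orders o2
                  WHERE o2.CustomerID = o.CustomerID
                  FOR XML PATH('')), 1, 1, '') AS ProductList
    FROM Orders o
    GROUP BY o.CustomerID;
    ```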

    Finally, the change we just "discovered" should make the code even faster...



    Lutz

  • Oh, neat. So I can do this instead:

    SELECT
    URL "loc",
    REPLACE(CONVERT(VARCHAR, DateCreated, 111), '/', '-') + 'T' + CONVERT(VARCHAR, DateCreated, 8) + '-05:00' "lastmod",
    'daily' "changefreq",
    CAST(Priority AS VARCHAR) "priority"
    FROM #Temp t2 FOR XML PATH('url')

    To have it use "url" as the root, instead of having no root. Nice!

  • Cool Beans!
