TSQL help needed

  • Hi I need some help in programing the below scenario, can Some one help me.

    Source Table Structure

    Resource_Id Skill_Id

    101 sqlserver

    101 vb.net

    101 oracle

    102 sqlserver

    102 java

    102 excel

    103 vb.net

    103 java

    103 oracle

    I need the out put in the below format, I tried some methods , but they are taking more than expected time , Performance is the key factor .

    Final Output

    101 sqlserver vb.net oracle

    102 sqlserver ---- ---- java excel

    103 ---- vb.net oracle java

  • See if these are quick enough for you. The use of FOR XML is the fastest concatenation technique I've come across, but this sort of string manipulation is often better accomplished by the calling application outside SQL.

    The second, CrossTab, technique can be modified to produce the output you wanted; I just thought it was easier to read this way.

    declare @skills table (Resource_Id int, Skill_Id varchar(20))

    insert into @skills

    select 101, 'sqlserver' union all

    select 101, 'vb.net' union all

    select 101, 'oracle' union all

    select 102, 'sqlserver' union all

    select 102, 'java' union all

    select 102, 'excel' union all

    select 103, 'vb.net' union all

    select 103, 'java' union all

    select 103, 'oracle'

    --

    select * from @skills s1

    -- Concatenated Format

    set statistics time on;

    SELECT Resource_Id,stuff((SELECT ',' + Skill_Id

    FROM @skills s2

    WHERE s2.Resource_Id= s1.resource_ID -- must match GROUP BY below

    ORDER BY Skill_Id

    FOR XML PATH('')

    ),1,1,'') as [Skills]

    FROM @skills s1

    GROUP BY s1.Resource_Id -- without GROUP BY multiple rows are returned

    ORDER BY s1.Resource_Id

    set statistics time off;

    -- CrossTab Format

    SELECT Resource_Id

    ,MAX(case when skill_id = 'Excel' then 'Yes' else '' end) as Excel

    ,MAX(case when skill_id = 'Java' then 'Yes' else '' end) as Java

    ,MAX(case when skill_id = 'Oracle' then 'Yes' else '' end) as Oracle

    ,MAX(case when skill_id = 'SQLServer' then 'Yes' else '' end) as SQLServer

    ,MAX(case when skill_id = 'VB.Net' then 'Yes' else '' end) as [VB.Net]

    FROM @skills

    Group by Resource_Id

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (4/11/2009)


    See if these are quick enough for you. The use of FOR XML is the fastest concatenation technique I've come across, but this sort of string manipulation is often better accomplished by the calling application outside SQL.

    Like everything else, it depends.

    Test setup... (code window doubled up the word "ADD" and will need repair when you copy it)

    USE TempDB

    GO

    --===== Create and populate a 1,000,000 row test table.

    -- Column "RowNum" has a range of 1 to 100,000 unique numbers

    -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

    -- for all rows.

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    SELECT TOP 1000000

    SomeID = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1

    CROSS JOIN

    Master.dbo.SysColumns t2

    GO

    --===== Add a Clustered PK

    ALTER TABLE dbo.JBMTest

    ADD PRIMARY KEY CLUSTERED (SomeID)

    GO

    --===== Add an index to support the concatenation

    CREATE INDEX IX_JBMTest_SomeInt ON dbo.JBMTest (SomeInt,SomeLetters2)

    GO

    --===== Create a function that uses VARCHAR(8000)

    CREATE FUNCTION dbo.Concat8KTest

    (@SomeInt INT)

    RETURNS VARCHAR(8000)

    AS

    BEGIN

    DECLARE @ReturnString VARCHAR(8000)

    SELECT @ReturnString = ISNULL(@ReturnString+',' ,'') + SomeLetters2

    FROM dbo.JBMTest

    WHERE SomeInt = @SomeInt

    ORDER BY SomeLetters2

    RETURN @ReturnString

    END

    GO

    --===== Create a function that uses VARCHAR(MAX)

    CREATE FUNCTION dbo.ConcatMAXTest

    (@SomeInt INT)

    RETURNS VARCHAR(MAX)

    AS

    BEGIN

    DECLARE @ReturnString VARCHAR(MAX)

    SELECT @ReturnString = ISNULL(@ReturnString+',' ,'') + SomeLetters2

    FROM dbo.JBMTest

    WHERE SomeInt = @SomeInt

    ORDER BY SomeLetters2

    RETURN @ReturnString

    END

    GO

    Test 8k, MAX, and XML methods...

    --===== Test the 8k solution

    SELECT SomeInt,dbo.ConcatTest(SomeInt)

    FROM dbo.JBMTest

    GROUP BY SomeInt

    ORDER BY SomeInt

    GO

    --===== Test the MAX solution

    SELECT SomeInt,dbo.ConcatTest(SomeInt)

    FROM dbo.JBMTest

    GROUP BY SomeInt

    ORDER BY SomeInt

    GO

    --===== Test the XML solution

    SELECT t1.SomeInt,STUFF((SELECT ',' + t2.SomeLetters2

    FROM dbo.JBMTest t2

    WHERE t2.SomeInt = t1.SomeInt --- must match GROUP BY below

    ORDER BY t2.SomeLetters2

    FOR XML PATH(''))

    ,1,1,'') AS SomeLetters2

    FROM dbo.JBMTest t1

    GROUP BY t1.SomeInt --- without GROUP BY multiple rows are returned

    ORDER BY t1.SomeInt

    GO

    Peformance and resource results...

    Passing only 50,000 rows to the app is a heck of a lot cheaper than clogging the pipe with a million rows for the app to do it. And, it looks to me like the XML method is not only slower for total duration, but it's almost twice as hard on the CPU as a good ol' fashioned concatenation function.

    Of course, for any of these methods, you need the correct index if you don't want to wait a week for it to finish.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff you shoulda been a lawyer :-D. Wasn't it you asking why people needed to do this on another thread? I'm in total agreement that passing summary data takes a huge load off the system. But I thought I was on the other side of this dispute the last time you schooled me.

    Yes it depends and, given your index, the simple concatenation function works faster. But, what if the index doesn't exist in the OPs environment? The XML concatenation still runs in under a minute for your million row table, and I gave up on the concatenation function after it ran for over 5. Also, I couldn't duplicate your cpu numbers. On my machine, the function with Max took over 12,000 ms while the XML ran in the 3300-3900 range, and the 8k function ran around 3100-3300. (The XML was the clear loser on the elapsed time though due to I/O, taking almost twice as long as the concatenation functions, The functions could get the job done with a single scan of the nonclustered index and a fraction of the logical reads.)

    I have a question about the nonclustered index. You are depending on it for the sequencing of the someletters combinations, so you avoid having to do an ORDER BY anywhere but on the someint column. But I thought you or Gail said that you couldn't count on anything but a clustered index or an ORDER BY clause to maintain sequence. Did I misunderstand?

    Finally, since it depends and given that the OP's question concerned a very specific set of values, couldn't we have also used this approach? It results in CPU and elapsed times under 1850 ms, with the nonclustered index in place.

    set statistics time on;

    ;with CrossTab as (

    SELECT someInt

    ,MAX(case when someLetters2 = 'AA' then ',AA' else '' end) as AA

    ,MAX(case when someletters2 = 'BB' then ',BB' else '' end) as BB

    ,MAX(case when someletters2 = 'CC' then ',CC' else '' end) as CC

    ,MAX(case when someletters2 = 'DD' then ',DD' else '' end) as DD

    ,MAX(case when someletters2 = 'EE' then ',EE' else '' end) as EE

    FROM JBMTest

    Group by someInt

    )

    select isnull(STUFF(AA+BB+CC+DD+EE,1,1,''),'')

    from CrossTab

    set statistics time off;

    Thanks again for furthering my education! 🙂

    Bob

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (4/12/2009)


    Jeff you shoulda been a lawyer :-D. Wasn't it you asking why people needed to do this on another thread? I'm in total agreement that passing summary data takes a huge load off the system. But I thought I was on the other side of this dispute the last time you schooled me.

    Heh... I guess you're right about that. The problem with concatenation is that it's not really summary data no matter how you handle it. Bloody concatenation can be justified either way depending on whether you're IO bound or CPU bound. My normal outrage about the subject is that it's very costly either way and I really want to know why any given OP wants to do it to begin with. Either you're gonna send all of the data with a bunch of CrLf's or all of the data with a bunch of commas. Both the server and the client are going to have to work at processing it and the pipe will suffer one way or the other. I pretty much hate the idea of concatenation no matter where it's done.

    Yes it depends and, given your index, the simple concatenation function works faster. But, what if the index doesn't exist in the OPs environment? The XML concatenation still runs in under a minute for your million row table, and I gave up on the concatenation function after it ran for over 5. Also, I couldn't duplicate your cpu numbers. On my machine, the function with Max took over 12,000 ms while the XML ran in the 3300-3900 range, and the 8k function ran around 3100-3300. (The XML was the clear loser on the elapsed time though due to I/O, taking almost twice as long as the concatenation functions, The functions could get the job done with a single scan of the nonclustered index and a fraction of the logical reads.)

    That's the other reason I don't care for concatenation. I don't understand all of the reasons for it, but the process varies greatly from machine to machine.

    So far as the possibility of the missing index goes, if I had a million row table and considering that the fastest method returns in a whopping minute without it, you can be darned sure I'd bust a hump or a DBA's head to ensure the index was placed. 🙂 It's all part of performance tuning, etc.

    I have a question about the nonclustered index. You are depending on it for the sequencing of the someletters combinations, so you avoid having to do an ORDER BY anywhere but on the someint column. But I thought you or Gail said that you couldn't count on anything but a clustered index or an ORDER BY clause to maintain sequence. Did I misunderstand?

    Nope... I don't see where I left out any ORDER BY's for SomeLetters2... both functions contain it and so does the XML. And, yes, contrary to popular belief, it's not safe to depend on a clustered index as an ORDER BY in Selects... just the "quirky update". That's part of the reason why I'm rewriting the "Running Total" article.

    Finally, since it depends and given that the OP's question concerned a very specific set of values, couldn't we have also used this approach? It results in CPU and elapsed times under 1850 ms, with the nonclustered index in place.

    Absolutely. But what happens if someone adds another "category" to the table? The code has no warning that it's not picking everything up. And people who say another category will never be added always live long enough to find out they're eventually wrong. 😉

    Because of my rants on performance and scalability, many think that speed is my only goal. They don't realize that accuracy and data integrity is always my first goal. I'd gladly have something take 3-5 seconds on a million row table and never have to worry about it producing the wrong answer if a new category is added than to run in half that time with the eventual potential for a wrong answer.

    To wit, my mantra is "Make it work, make it fast, make it pretty... and it ain't done 'til it's pretty."

    Thanks again for furthering my education! 🙂

    The feeling is more than mutual. Whenever someone like yourself asks such good questions, I always have to go back and re-examine the reasons for any convictions I may have. Normally, if a conviction cannot be backed up with code, it's not worth being a conviction.

    I sure do hope I answered your questions and concerns on this one. Concatenation is always difficult to defend no matter how you do it.

    Thanks for the feedback, Bob.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry about raising the ORDER BY issue. I threw that in while looking at the tests and just had a brain-short not thinking about the functions themselves.

    You're absolutely right about the problem with the crossTab being what to do when the set of possible values increases. The only play there is to do it in dynamic SQL. Hopefully, there would be a table somewhere that limits the valid codes that could be entered, or at least an index for a quick SELECT DISTINCT. Possibly a trigger could be written to rebuild a stored procedure or table-valued function whenever the content of that table changes. (I am sometimes amazed that the dba's and security people let me write code that writes code. Haven't they ever heard of Skynet?)

    The only justification I can give you for concatenation is that when people are looking at data, it allows a lot more to fit on a page where they can visually scan it. Every time I've had a request for it, it's been in the context of a report of some kind. I am still keeping the FOR XML technique in the toolbox, but it is immediately followed by your code from above, because with the proper index, it's hard to beat.

    Jeff, thanks for taking the time to listen to my questions and follow up. I invariably learn something worthwhile from the exchange. It's a pleasure to be able to ask questions and discuss competing considerations without it being all about egos. Now I have to go look at Flo's CLR parsing.

    (P.S. Florian is a guy's name.) 😛

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (4/12/2009)


    But I thought you or Gail said that you couldn't count on anything but a clustered index or an ORDER BY clause to maintain sequence.

    Wouldn't have been me. I tell people that you can't depend on anything other than an ORDER BY to maintain sequence. The update is a side-effect of how the engine currently works, but there's no guarantee that someone won't find another edge case next week where it breaks.

    The only time the optimiser will care about the order rows are returned in is when there's an ORDER BY on the query.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks, Gail. I was sure about the ORDER BY, but not the clustered index. I couldn't find any of the threads where either was discussed.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Jeff,

    wow..its simply great to see the way you are thinking on the questions and answers...

  • I think Nirmal might need something like this

    declare @skills table (Resource_Id int, Skill_Id varchar(20))

    insert into @skills

    select 101, 'sqlserver' union all

    select 101, 'vb.net' union all

    select 101, 'oracle' union all

    select 102, 'sqlserver' union all

    select 102, 'java' union all

    select 102, 'excel' union all

    select 103, 'vb.net' union all

    select 103, 'java' union all

    select 103, 'oracle'

    SELECT Resource_Id, ISNULL(sqlserver, '----'), ISNULL([vb.net], '----'), ISNULL(oracle,'----'), ISNULL(java, '----'), ISNULL(excel, '----')

    FROM @skills

    PIVOT(MAX(Skill_Id) FOR Skill_Id

    IN(sqlserver,[vb.net],oracle,java,excel)) AS P;

    If you look carefully at his sample output the data is not just concatenated but sorted as well and when a skill is missing you have four dashes. Of course this solution is going to work with a finite number of skills only. The output generated by the code above is not exactly matching the sample output because there are trailing dashes for skills further down the list.

    PS. Bob , thanks for the insert code....

    ---------------------------------------------
    [font="Verdana"]Nothing is impossible.
    It is just a matter of time and money.[/font]

  • VK... my observation is that there is a world of valuable information to be learned from the people who hang around SSC. I shamelessly take every opportunity to pick their brains.

    Jacek0... you're welcome, I think.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Jeff, in your posted test harness, you need to rename the functions in the final tests. It has "SELECT SomeInt, dbo.ConcatTest...", but it needs to be Concat8kTest and ConcatMaxTest, per the create scripts. No big deal, but you might want to edit the post accordingly.

    Also, I got very different CPU and total time readings than you on those. Average for the 8k UDF was 5300 milliseconds CPU, 9200 total; Max UDF was 8500 CPU, 13000 total; XML inline was 2000 CPU, 2000 total.

    The XML version's IO stats are out the roof, with 154k logical reads, compared to 1985 for each of the other two, but that didn't slow it down apparently. On my system, anyway.

    Of course, since the test data is pretty much random, the actual number of reads and scans will vary, as it depends on the number of unique SomeInt values and how many rows each of those has in the table.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Just to test it a bit further:

    CREATE FUNCTION dbo.ConcatXMLTest (@SomeInt_in INT)

    RETURNS VARCHAR(8000)

    AS BEGIN

    RETURN (SELECT

    STUFF((SELECT

    ',' + t2.SomeLetters2

    FROM

    dbo.JBMTest t2

    WHERE

    t2.SomeInt = @SomeInt_in

    ORDER BY

    t2.SomeLetters2

    FOR

    XML PATH(''),type).value('.[1]', 'varchar(MAX)'), 1, 1, ''))

    END ;

    SELECT

    SomeInt,

    dbo.ConcatXMLTest(SomeInt)

    FROM

    dbo.JBMTest

    GROUP BY

    SomeInt

    ORDER BY

    SomeInt ;

    That got the scan count down to 1 and the reads to 1985, but took the CPU up to over 10 seconds and total to nearly 14 seconds (up from 2 for each for the inline version).

    Adding "type" and the value function to the inline XML query increased its run-time to 3800 milliseconds, but allows the string values to contain characters that are XML-specific, like "&", which will otherwise cause it to error out.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Just for one more test, I took the type and value functions out of the XML version of the UDF, and that took it down to 6.5 seconds CPU and 9.5 seconds total. Same scan and read stats either way, of course.

    That's slightly worse than the 8k string function, and better than the Max string function, so I modified the XML UDF one more time to make it return varchar(max), and that took it up to 7 seconds CPU, 10 seconds total. No effect on IO. That's better than the Max string function still.

    With both the type and value functions, and returning varchar(max), the XML UDF went up to nearly 11 seconds CPU and nearly 14 seconds total, which makes it the worst version of them all!

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Gus, I rewrote your function as an inline table valued function and it gets the job done in half the time of the ConcatMaxTest function, but still three times longer than the Concat8KTest function.

    There is just no beating the IO advantage of scanning the nonclustered index a single time.

    ALTER FUNCTION [dbo].[itvfConcatXML]

    (

    @someInt int

    )

    RETURNS TABLE

    AS

    RETURN

    (

    SELECT STUFF((SELECT ',' + t2.SomeLetters2

    FROM dbo.JBMTest t2

    WHERE t2.SomeInt = @SomeInt

    ORDER BY t2.SomeLetters2

    FOR XML PATH(''),type).value('.[1]', 'varchar(MAX)'), 1, 1, '') as conCatString

    )

    I used a CTE to feed it a distinct list.

    ;with someints as (select distinct someint from jbmtest)

    SELECT t1.SomeInt,f.ConCatString

    FROM someints t1

    cross apply dbo.itvfConcatXML(t1.someint) f

    ORDER BY t1.SomeInt

    Table 'JBMTest'. Scan count 50001, logical reads 154059, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 6864 ms, elapsed time = 6901 ms.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply