Group by column A but compare column B

  • This one has had me stumped for the last few hours and at this stage I think I need some help...

    I need to compare multiple groups from a single table and to identify where items listed in col B match. For example: -

    Col A...............Col B

    John................Apple

    John................Orange

    John................Banana

    Mary................Orange

    Mary................Strawberry

    David...............Apple

    David...............Orange

    David...............Banana

    I want 'John' and 'David' returned because their items in col B match. Hope this makes sense! Thanks in advance! G

  • Couple of questions:

    Is this oversimplified or is it a homework question? If homework, what's the chapter's subject so we can help you find the right place with this?

    If it's real data, please see the first link in my signature for proper setup for code assistance. Schema, sample data, expected results, current work to be modified. We should be able to fix you right up from there with minimal back and forth.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • I found a way to do this. Might not be optimal but it's a way.

    However, I want to know what is this for, as Craig asked.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (9/27/2012)


    I found a way to do this. Might not be optimal but it's a way.

    However, I want to know what is this for, as Craig asked.

    There's a few approaches, from row-collapse to after compare rowcount checks. Depends on volume of data, # of items per partition, etc. I just want to know what the OP is actually dealing with.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • I would like to know more about those methods. Could you tell me where to find information? I couldn't find much.

    This is what I came with. It's doing many reads.

    DECLARE @TestTABLE(

    colAvarchar(15),

    colBvarchar(15))

    INSERT @Test VALUES

    ('John', 'Apple'),

    ('John','Orange'),

    ('John','Banana'),

    ('Mary','Orange'),

    ('Mary','Strawberry'),

    ('Luis','Orange'),

    ('Luis','Strawberry'),

    ('Luis','Apple'),

    ('Mike','Orange'),

    ('Mike','Strawberry'),

    ('Mike','Apple'),

    ('David','Apple'),

    ('David','Orange'),

    ('David','Banana');

    WITH CTE AS(

    SELECT colA, SUM( value) total_value

    FROM @Test a

    JOIN (

    SELECT colB,

    POWER( 2, ROW_NUMBER() OVER( order by colB) - 1 ) AS value

    FROM (SELECT DISTINCT ColB FROM @Test)x) b ON a.colB = B.colB

    GROUP BY colA

    )

    SELECT *

    FROM @Test

    WHERE colA IN(

    SELECT a.colA

    FROM CTE a

    JOIN CTE b ON a.total_value = b.total_value AND a.colA <> b.colA)

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (9/27/2012)


    I would like to know more about those methods. Could you tell me where to find information? I couldn't find much.

    This is what I came with. It's doing many reads.

    PM sent so it doesn't end up with the OP going off on a few methods that may not work well depending on his schema and indexing.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • My apologies - my original post was using over simplified data. My real data has a list of computer names in column A and names of installed applications in column B. I want to find computers that have the same software stack installed. The actual data is very long so here is a cut down version

    IF OBJECT_ID('TempDB..#mytable','U') IS NOT NULL

    DROP TABLE #mytable

    CREATE TABLE #MyTable(

    [ComputerName] [nvarchar](255) NULL,

    [ApplicationName] [nvarchar](255) NULL

    )

    INSERT INTO #MyTable

    (ComputerName, ApplicationName)

    SELECT 'DCZC445645S','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC445645S','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC445645S','Windows Media Player 9' UNION ALL

    SELECT 'DCZC445645S','Microsoft Office 2003 Excel' UNION ALL

    SELECT 'DCZC843531X','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC843531X','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC843531X','Windows Media Player 9' UNION ALL

    SELECT 'DCZC843531X','Microsoft Office 2003 Excel' UNION ALL

    SELECT 'DCZC2231124','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC2231124','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC2231124','Windows Media Player 9' UNION ALL

    SELECT 'DCZC2231124','Microsoft Office 2003 Excel' UNION ALL

    SELECT 'DCZC84352Z9','!NovaDigm Deployment Repository' UNION ALL

    SELECT 'DCZC84352Z9','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC84352Z9','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC84352Z9','Adobe Reader 9' UNION ALL

    SELECT 'DCZC84352Z9','Citrix Program Neighborhood Agent 10' UNION ALL

    SELECT 'DCZC60548AS','!NovaDigm Deployment Repository' UNION ALL

    SELECT 'DCZC60548AS','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC60548AS','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC60548AS','Adobe Reader 9' UNION ALL

    SELECT 'DCZC60548AS','Citrix Program Neighborhood Agent 10' UNION ALL

    SELECT 'DCZC789784C','Microsoft Internet Explorer 8' UNION ALL

    SELECT 'DCZC789784C','Windows Media Player 9' UNION ALL

    SELECT 'DCZC789784C','Microsoft Office 2003 Excel'

    The query I have so far is...

    SELECT A.ComputerName Comp1, B.ComputerName Comp2

    FROM #MyTable A

    JOIN (SELECT ComputerName, count(ApplicationName) CountBs

    FROM #MyTable

    GROUP BY ComputerName) G ON G.ComputerName = A.ComputerName

    JOIN #MyTable B on A.ApplicationName = B.ApplicationName and A.ComputerName < B.ComputerName

    GROUP BY A.ComputerName, B.ComputerName, G.CountBs HAVING COUNT(DISTINCT A.ApplicationName) = G.CountBs

    This gives me the following results...

    Comp1Comp2

    DCZC2231124DCZC445645S

    DCZC2231124DCZC843531X

    DCZC445645SDCZC843531X

    DCZC60548ASDCZC84352Z9

    But because there are hundreds of computer names this is going to make interpreting the above results a nightmare so I'm looking for something along the lines of....

    DCZC2231124DCZC445645S DCZC843531X

    DCZC60548ASDCZC84352Z9

    Craig - thanks for the tips on forum etiquette, I hope this is more in line with what you were looking for!

    Thanks in advance again!

  • gary.harrington (9/27/2012)


    My apologies - my original post was using over simplified data. My real data has a list of computer names in column A and names of installed applications in column B. I want to find computers that have the same software stack installed. The actual data is very long so here is a cut down version

    Excellent, this is perfect. A few additional items. I realize we're not looking at the origination schema here, however, does the source have an identity column that can be used against the compname/appname, and what indexes apply to these two (three) columns in the base table? It'll affect your approach, especially with very large data sets.

    Craig - thanks for the tips on forum etiquette, I hope this is more in line with what you were looking for!

    Thanks in advance again!

    Yep, nearly on target but because you're going to need something optimal I need just a little more, otherwise the code may run like a three legged dog. You're new though, and are willing to do the work once it's pointed out, so you're good to go there. ๐Ÿ™‚

    Now that I have a general idea of what you're dealing with however, here's the general answer to Luis' question earlier (don't use these out of the box, though...):

    DECLARE @TestTABLE(

    colAvarchar(15),

    colBvarchar(15))

    INSERT @Test VALUES

    ('John', 'Apple'),

    ('John','Orange'),

    ('John','Banana'),

    ('Mary','Orange'),

    ('Mary','Strawberry'),

    ('Luis','Orange'),

    ('Luis','Strawberry'),

    ('Luis','Apple'),

    ('Mike','Orange'),

    ('Mike','Strawberry'),

    ('Mike','Apple'),

    ('David','Apple'),

    ('David','Orange'),

    ('David','Banana');

    -- Row Collapse using FOR XML

    ;WITH cte AS

    (SELECT DISTINCT

    colA,

    (SELECT ColB + ',' AS 'data()'

    FROM @Test AS t2

    WHERE t2.ColA = t.ColA FOR XML PATH( '')) AS OneRowVal

    FROm

    @Test AS t

    )

    SELECT

    t.ColA AS MatchedName1,

    t2.ColA AS MatchedName2

    FROM

    cte AS t

    JOIN

    cte AS t2

    ONt2.OneRowVal = t.OneRowVal

    AND t2.ColA <> t.ColA

    -- PreCount and Join Check

    IF OBJECT_ID('tempdb..#Master') IS NOT NULL

    DROP TABLE #Master

    CREATE TABLE #Master

    ( ColA VARCHAR(15),

    cnt INT)

    CREATE CLUSTERED INDEX idx_master ON #Master

    (cnt)

    INSERT INTO #Master

    SELECT

    ColA,

    COUNT(*) AS cnt

    FROM

    @Test

    GROUP BY

    ColA

    --SELECT * FROM #Master

    -- This makes sure we don't deal with items with 'extras'.

    ;WITH CoreCompareList AS

    (SELECT

    m.ColA AS Name1,

    m2.ColA AS Name2,

    m.cnt AS cnt

    FROM

    #Master AS m

    JOIN

    #Master AS m2

    ONm.cnt = m2.cnt

    AND m.ColA <> m2.ColA

    )

    , CountList AS

    (SELECT

    ccl.Name1,

    ccl.Name2,

    COUNT(*) AS cnt

    FROM

    CoreCompareList AS ccl

    JOIN

    @Test AS t

    ONccl.Name1 = t.ColA

    JOIN

    @Test AS t2

    ONccl.Name2 = t2.ColA

    AND t.ColB = t2.ColB

    GROUP BY

    ccl.Name1,

    ccl.Name2

    )

    SELECT

    ccl.Name1,

    ccl.Name2

    FROM

    CoreCompareList AS ccl

    JOIN

    CountList AS cl

    ONccl.Name1 = cl.Name1

    AND ccl.Name2 = cl.Name2

    AND ccl.cnt = cl.cnt


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • I take it back. Because of the length and datatype of your application information, there's a completely different approach I'd personally take here. Don't worry about the extra information, but can you confirm there's an index in the base table (non-clustered by preference) that uses ComputerName on the first column and includes (somewhere) ApplicationName?

    I'm building out the rest of it now. The chained combination pivot is where it gets messy.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Some notes in the code, others below.

    IF OBJECT_ID('TempDB..#mytable','U') IS NOT NULL

    DROP TABLE #mytable

    CREATE TABLE #MyTable(

    [ComputerName] [nvarchar](255) NULL,

    [ApplicationName] [nvarchar](255) NULL

    )

    INSERT INTO #MyTable

    (ComputerName, ApplicationName)

    SELECT 'DCZC445645S','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC445645S','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC445645S','Windows Media Player 9' UNION ALL

    SELECT 'DCZC445645S','Microsoft Office 2003 Excel' UNION ALL

    SELECT 'DCZC843531X','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC843531X','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC843531X','Windows Media Player 9' UNION ALL

    SELECT 'DCZC843531X','Microsoft Office 2003 Excel' UNION ALL

    SELECT 'DCZC2231124','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC2231124','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC2231124','Windows Media Player 9' UNION ALL

    SELECT 'DCZC2231124','Microsoft Office 2003 Excel' UNION ALL

    SELECT 'DCZC84352Z9','!NovaDigm Deployment Repository' UNION ALL

    SELECT 'DCZC84352Z9','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC84352Z9','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC84352Z9','Adobe Reader 9' UNION ALL

    SELECT 'DCZC84352Z9','Citrix Program Neighborhood Agent 10' UNION ALL

    SELECT 'DCZC60548AS','!NovaDigm Deployment Repository' UNION ALL

    SELECT 'DCZC60548AS','1E NomadBranch 3' UNION ALL

    SELECT 'DCZC60548AS','1E WakeUp Agent 6' UNION ALL

    SELECT 'DCZC60548AS','Adobe Reader 9' UNION ALL

    SELECT 'DCZC60548AS','Citrix Program Neighborhood Agent 10' UNION ALL

    SELECT 'DCZC789784C','Microsoft Internet Explorer 8' UNION ALL

    SELECT 'DCZC789784C','Windows Media Player 9' UNION ALL

    SELECT 'DCZC789784C','Microsoft Office 2003 Excel'

    IF OBJECT_ID('tempdb..#Checksums') IS NOT NULL

    DROP TABLE #Checksums

    CREATE TABLE #Checksums

    ( csID INT IDENTITY( 1,1) NOT NULL,

    ComputerName nVARCHAR(255) NOT NULL,

    ChecksumValue INT)

    CREATE CLUSTERED INDEX idx_Checksums ON #Checksums

    ( ChecksumValue)

    ;WITH cte AS

    (SELECT DISTINCT

    ComputerName,

    (SELECT ApplicationName + ',' AS 'data()'

    FROM #MyTable AS t2

    WHERE t2.ComputerName = t.ComputerName FOR XML PATH( '')) AS OneRowVal

    FROm

    #MyTable AS t

    )

    INSERT INTO #Checksums

    SELECT

    ComputerName,

    CHECKSUM( OneRowVal)

    FROM

    cte

    SELECT

    c.ChecksumValue,

    c.ComputerName AS Name1,

    c2.ComputerName AS Name2

    FROM

    #Checksums AS c

    JOIN

    #Checksums AS c2

    ONc.ChecksumValue = c2.ChecksumValue

    AND c.csID < c2.csID -- Controls 'back and forth' duplication

    Gary, this should get you the existing results you have as quickly as possible as long as you have the above mentioned index available.

    The problem is de-duping the existing values in the result set. The only way I have to do it offhand is a bit messy but it DOES get you there.

    ;WITH CompareComplete AS

    (SELECT

    c.ChecksumValue,

    c.ComputerName AS Name1,

    c2.ComputerName AS Name2

    FROM

    #Checksums AS c

    JOIN

    #Checksums AS c2

    ONc.ChecksumValue = c2.ChecksumValue

    AND c.csID < c2.csID -- Controls 'back and forth' duplication

    )

    , UnionedResults AS

    (SELECT

    ChecksumValue,

    Name1 AS CompName

    FROM

    CompareComplete

    UNION -- This will force a list distinction, do NOT use ALL

    SELECT

    ChecksumValue,

    Name2

    FROM

    CompareComplete

    )

    SELECT DISTINCT

    ChecksumValue,

    (SELECT CompName + ',' AS 'data()'

    FROM UnionedResults AS t2

    WHERE t2.ChecksumValue = t.ChecksumValue FOR XML PATH( '')) AS OneRowVal

    FROm

    UnionedResults AS t

    I'm hoping someone can clean up the formatting of the results a bit. This process is one I've ran against larger comparison sets (VARCHAR(4000)s) which absolutely killed me in the compare performance, thus the use of CHECKSUM to remove the possible 1000+ character comparisons. The use of the Temp table also gets you off the primary data sources.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Here's another, relatively straightforward approach:

    ;WITH ApplicationsGrouped AS (

    SELECT ComputerName

    ,Applications=STUFF((

    SELECT ';' + ApplicationName

    FROM #mytable b

    WHERE a.ComputerName = b.ComputerName

    ORDER BY ApplicationName

    FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)')

    ,1, 1, '')

    FROM #mytable a

    GROUP BY ComputerName)

    SELECT Applications

    ,Computers=STUFF((

    SELECT ';' + ComputerName

    FROM ApplicationsGrouped b

    WHERE a.Applications = b.Applications

    ORDER BY ComputerName

    FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)')

    ,1, 1, '')

    FROM ApplicationsGrouped a

    GROUP BY Applications

    HAVING COUNT(ComputerName) > 1


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Craig - This is perfect! Much appreciated!.... and yes, I have a non-clustered index on the table using ComputerName and ApplicationName

    Thanks a million!

    G

  • Nice Dwain! Thanks for taking the time to look at this for me!

  • Here's a fourth method:

    -- solution

    ;WITH ParentsOnly AS (

    SELECT ComputerName, ChildRows = COUNT(*)

    FROM #MyTable

    GROUP BY ComputerName

    )

    SELECT a.*, b.*, x.*

    FROM ParentsOnly a

    INNER JOIN ParentsOnly b ON b.ChildRows = a.ChildRows AND b.ComputerName > a.ComputerName

    CROSS APPLY ( -- match up the child rows

    SELECT MatchingRows = COUNT(*)

    FROM #MyTable ai

    INNER JOIN #MyTable bi ON bi.ComputerName = b.ComputerName

    AND bi.ApplicationName = ai.ApplicationName

    WHERE ai.ComputerName = a.ComputerName

    ) x

    WHERE a.ChildRows = x.MatchingRows

    ORDER BY a.ComputerName, b.ComputerName

    โ€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.โ€ - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Another one

    WITH GrpCounts(ComputerName, ApplicationName,GrpCnt) AS (

    SELECT ComputerName, ApplicationName,

    COUNT(*) OVER(PARTITION BY ComputerName)

    FROM #MyTable),

    Results(ComputerNameA,ComputerNameB) AS (

    SELECT x.ComputerName , y.ComputerName

    FROM GrpCounts x

    INNER JOIN GrpCounts y ON y.ApplicationName = x.ApplicationName AND y.ComputerName < x.ComputerName AND x.GrpCnt=y.GrpCnt

    GROUP BY x.ComputerName, y.ComputerName, x.GrpCnt

    HAVING COUNT(*) = x.GrpCnt)

    SELECT r1.ComputerNameA +

    (SELECT ' ' + r3.ComputerNameB AS "text()"

    FROM Results r3

    WHERE r3.ComputerNameA = r1.ComputerNameA

    ORDER BY r3.ComputerNameB

    FOR XML PATH('')) Computers

    FROM Results r1

    WHERE NOT EXISTS(SELECT * FROM Results r2 WHERE r1.ComputerNameA = r2.ComputerNameB)

    GROUP BY r1.ComputerNameA

    ORDER BY r1.ComputerNameA;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply