Group by column A but compare column B

Question

Group by column A but compare column B

gary.harrington

SSC Rookie

Points: 42
More actions
September 27, 2012 at 1:19 pm

#268314

This one has had me stumped for the last few hours and at this stage I think I need some help...
I need to compare multiple groups from a single table and to identify where items listed in col B match. For example: -
Col A...............Col B
John................Apple
John................Orange
John................Banana
Mary................Orange
Mary................Strawberry
David...............Apple
David...............Orange
David...............Banana
I want 'John' and 'David' returned because their items in col B match. Hope this makes sense! Thanks in advance! G

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 1

Couple of questions:

Is this oversimplified or is it a homework question? If homework, what's the chapter's subject so we can help you find the right place with this?

If it's real data, please see the first link in my signature for proper setup for code assistance. Schema, sample data, expected results, current work to be modified. We should be able to fix you right up from there with minimal back and forth.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Luis Cazares SSC Guru Points: 183706 More actions · Answer 2

I found a way to do this. Might not be optimal but it's a way.

However, I want to know what is this for, as Craig asked.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 3

Luis Cazares (9/27/2012)
I found a way to do this. Might not be optimal but it's a way.
However, I want to know what is this for, as Craig asked.

There's a few approaches, from row-collapse to after compare rowcount checks. Depends on volume of data, # of items per partition, etc. I just want to know what the OP is actually dealing with.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Luis Cazares SSC Guru Points: 183706 More actions · Answer 4

I would like to know more about those methods. Could you tell me where to find information? I couldn't find much.

This is what I came with. It's doing many reads.

DECLARE @TestTABLE(

colAvarchar(15),

colBvarchar(15))

INSERT @Test VALUES

('John', 'Apple'),

('John','Orange'),

('John','Banana'),

('Mary','Orange'),

('Mary','Strawberry'),

('Luis','Orange'),

('Luis','Strawberry'),

('Luis','Apple'),

('Mike','Orange'),

('Mike','Strawberry'),

('Mike','Apple'),

('David','Apple'),

('David','Orange'),

('David','Banana');

WITH CTE AS(

SELECT colA, SUM( value) total_value

FROM @Test a

JOIN (

SELECT colB,

POWER( 2, ROW_NUMBER() OVER( order by colB) - 1 ) AS value

FROM (SELECT DISTINCT ColB FROM @Test)x) b ON a.colB = B.colB

GROUP BY colA

)

SELECT *

FROM @Test

WHERE colA IN(

SELECT a.colA

FROM CTE a

JOIN CTE b ON a.total_value = b.total_value AND a.colA <> b.colA)

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 5

Luis Cazares (9/27/2012)
I would like to know more about those methods. Could you tell me where to find information? I couldn't find much.
This is what I came with. It's doing many reads.

PM sent so it doesn't end up with the OP going off on a few methods that may not work well depending on his schema and indexing.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

gary.harrington SSC Rookie Points: 42 More actions · Answer 6

My apologies - my original post was using over simplified data. My real data has a list of computer names in column A and names of installed applications in column B. I want to find computers that have the same software stack installed. The actual data is very long so here is a cut down version

IF OBJECT_ID('TempDB..#mytable','U') IS NOT NULL

DROP TABLE #mytable

CREATE TABLE #MyTable(

[ComputerName] [nvarchar](255) NULL,

[ApplicationName] [nvarchar](255) NULL

)

INSERT INTO #MyTable

(ComputerName, ApplicationName)

SELECT 'DCZC445645S','1E NomadBranch 3' UNION ALL

SELECT 'DCZC445645S','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC445645S','Windows Media Player 9' UNION ALL

SELECT 'DCZC445645S','Microsoft Office 2003 Excel' UNION ALL

SELECT 'DCZC843531X','1E NomadBranch 3' UNION ALL

SELECT 'DCZC843531X','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC843531X','Windows Media Player 9' UNION ALL

SELECT 'DCZC843531X','Microsoft Office 2003 Excel' UNION ALL

SELECT 'DCZC2231124','1E NomadBranch 3' UNION ALL

SELECT 'DCZC2231124','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC2231124','Windows Media Player 9' UNION ALL

SELECT 'DCZC2231124','Microsoft Office 2003 Excel' UNION ALL

SELECT 'DCZC84352Z9','!NovaDigm Deployment Repository' UNION ALL

SELECT 'DCZC84352Z9','1E NomadBranch 3' UNION ALL

SELECT 'DCZC84352Z9','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC84352Z9','Adobe Reader 9' UNION ALL

SELECT 'DCZC84352Z9','Citrix Program Neighborhood Agent 10' UNION ALL

SELECT 'DCZC60548AS','!NovaDigm Deployment Repository' UNION ALL

SELECT 'DCZC60548AS','1E NomadBranch 3' UNION ALL

SELECT 'DCZC60548AS','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC60548AS','Adobe Reader 9' UNION ALL

SELECT 'DCZC60548AS','Citrix Program Neighborhood Agent 10' UNION ALL

SELECT 'DCZC789784C','Microsoft Internet Explorer 8' UNION ALL

SELECT 'DCZC789784C','Windows Media Player 9' UNION ALL

SELECT 'DCZC789784C','Microsoft Office 2003 Excel'

The query I have so far is...

SELECT A.ComputerName Comp1, B.ComputerName Comp2

FROM #MyTable A

JOIN (SELECT ComputerName, count(ApplicationName) CountBs

FROM #MyTable

GROUP BY ComputerName) G ON G.ComputerName = A.ComputerName

JOIN #MyTable B on A.ApplicationName = B.ApplicationName and A.ComputerName < B.ComputerName

GROUP BY A.ComputerName, B.ComputerName, G.CountBs HAVING COUNT(DISTINCT A.ApplicationName) = G.CountBs

This gives me the following results...

Comp1Comp2

DCZC2231124DCZC445645S

DCZC2231124DCZC843531X

DCZC445645SDCZC843531X

DCZC60548ASDCZC84352Z9

But because there are hundreds of computer names this is going to make interpreting the above results a nightmare so I'm looking for something along the lines of....

DCZC2231124DCZC445645S DCZC843531X

DCZC60548ASDCZC84352Z9

Craig - thanks for the tips on forum etiquette, I hope this is more in line with what you were looking for!

Thanks in advance again!

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 7

gary.harrington (9/27/2012)
My apologies - my original post was using over simplified data. My real data has a list of computer names in column A and names of installed applications in column B. I want to find computers that have the same software stack installed. The actual data is very long so here is a cut down version

Excellent, this is perfect. A few additional items. I realize we're not looking at the origination schema here, however, does the source have an identity column that can be used against the compname/appname, and what indexes apply to these two (three) columns in the base table? It'll affect your approach, especially with very large data sets.

Craig - thanks for the tips on forum etiquette, I hope this is more in line with what you were looking for!
Thanks in advance again!

Yep, nearly on target but because you're going to need something optimal I need just a little more, otherwise the code may run like a three legged dog. You're new though, and are willing to do the work once it's pointed out, so you're good to go there. 🙂

Now that I have a general idea of what you're dealing with however, here's the general answer to Luis' question earlier (don't use these out of the box, though...):

DECLARE @TestTABLE(

colAvarchar(15),

colBvarchar(15))

INSERT @Test VALUES

('John', 'Apple'),

('John','Orange'),

('John','Banana'),

('Mary','Orange'),

('Mary','Strawberry'),

('Luis','Orange'),

('Luis','Strawberry'),

('Luis','Apple'),

('Mike','Orange'),

('Mike','Strawberry'),

('Mike','Apple'),

('David','Apple'),

('David','Orange'),

('David','Banana');

-- Row Collapse using FOR XML

;WITH cte AS

(SELECT DISTINCT

colA,

(SELECT ColB + ',' AS 'data()'

FROM @Test AS t2

WHERE t2.ColA = t.ColA FOR XML PATH( '')) AS OneRowVal

FROm

@Test AS t

)

SELECT

t.ColA AS MatchedName1,

t2.ColA AS MatchedName2

FROM

cte AS t

JOIN

cte AS t2

ONt2.OneRowVal = t.OneRowVal

AND t2.ColA <> t.ColA

-- PreCount and Join Check

IF OBJECT_ID('tempdb..#Master') IS NOT NULL

DROP TABLE #Master

CREATE TABLE #Master

( ColA VARCHAR(15),

cnt INT)

CREATE CLUSTERED INDEX idx_master ON #Master

(cnt)

INSERT INTO #Master

SELECT

ColA,

COUNT(*) AS cnt

FROM

@Test

GROUP BY

ColA

--SELECT * FROM #Master

-- This makes sure we don't deal with items with 'extras'.

;WITH CoreCompareList AS

(SELECT

m.ColA AS Name1,

m2.ColA AS Name2,

m.cnt AS cnt

FROM

#Master AS m

JOIN

#Master AS m2

ONm.cnt = m2.cnt

AND m.ColA <> m2.ColA

)

, CountList AS

(SELECT

ccl.Name1,

ccl.Name2,

COUNT(*) AS cnt

FROM

CoreCompareList AS ccl

JOIN

@Test AS t

ONccl.Name1 = t.ColA

JOIN

@Test AS t2

ONccl.Name2 = t2.ColA

AND t.ColB = t2.ColB

GROUP BY

ccl.Name1,

ccl.Name2

)

SELECT

ccl.Name1,

ccl.Name2

FROM

CoreCompareList AS ccl

JOIN

CountList AS cl

ONccl.Name1 = cl.Name1

AND ccl.Name2 = cl.Name2

AND ccl.cnt = cl.cnt

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 8

I take it back. Because of the length and datatype of your application information, there's a completely different approach I'd personally take here. Don't worry about the extra information, but can you confirm there's an index in the base table (non-clustered by preference) that uses ComputerName on the first column and includes (somewhere) ApplicationName?

I'm building out the rest of it now. The chained combination pivot is where it gets messy.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 9

Some notes in the code, others below.

IF OBJECT_ID('TempDB..#mytable','U') IS NOT NULL

DROP TABLE #mytable

CREATE TABLE #MyTable(

[ComputerName] [nvarchar](255) NULL,

[ApplicationName] [nvarchar](255) NULL

)

INSERT INTO #MyTable

(ComputerName, ApplicationName)

SELECT 'DCZC445645S','1E NomadBranch 3' UNION ALL

SELECT 'DCZC445645S','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC445645S','Windows Media Player 9' UNION ALL

SELECT 'DCZC445645S','Microsoft Office 2003 Excel' UNION ALL

SELECT 'DCZC843531X','1E NomadBranch 3' UNION ALL

SELECT 'DCZC843531X','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC843531X','Windows Media Player 9' UNION ALL

SELECT 'DCZC843531X','Microsoft Office 2003 Excel' UNION ALL

SELECT 'DCZC2231124','1E NomadBranch 3' UNION ALL

SELECT 'DCZC2231124','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC2231124','Windows Media Player 9' UNION ALL

SELECT 'DCZC2231124','Microsoft Office 2003 Excel' UNION ALL

SELECT 'DCZC84352Z9','!NovaDigm Deployment Repository' UNION ALL

SELECT 'DCZC84352Z9','1E NomadBranch 3' UNION ALL

SELECT 'DCZC84352Z9','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC84352Z9','Adobe Reader 9' UNION ALL

SELECT 'DCZC84352Z9','Citrix Program Neighborhood Agent 10' UNION ALL

SELECT 'DCZC60548AS','!NovaDigm Deployment Repository' UNION ALL

SELECT 'DCZC60548AS','1E NomadBranch 3' UNION ALL

SELECT 'DCZC60548AS','1E WakeUp Agent 6' UNION ALL

SELECT 'DCZC60548AS','Adobe Reader 9' UNION ALL

SELECT 'DCZC60548AS','Citrix Program Neighborhood Agent 10' UNION ALL

SELECT 'DCZC789784C','Microsoft Internet Explorer 8' UNION ALL

SELECT 'DCZC789784C','Windows Media Player 9' UNION ALL

SELECT 'DCZC789784C','Microsoft Office 2003 Excel'

IF OBJECT_ID('tempdb..#Checksums') IS NOT NULL

DROP TABLE #Checksums

CREATE TABLE #Checksums

( csID INT IDENTITY( 1,1) NOT NULL,

ComputerName nVARCHAR(255) NOT NULL,

ChecksumValue INT)

CREATE CLUSTERED INDEX idx_Checksums ON #Checksums

( ChecksumValue)

;WITH cte AS

(SELECT DISTINCT

ComputerName,

(SELECT ApplicationName + ',' AS 'data()'

FROM #MyTable AS t2

WHERE t2.ComputerName = t.ComputerName FOR XML PATH( '')) AS OneRowVal

FROm

#MyTable AS t

)

INSERT INTO #Checksums

SELECT

ComputerName,

CHECKSUM( OneRowVal)

FROM

cte

SELECT

c.ChecksumValue,

c.ComputerName AS Name1,

c2.ComputerName AS Name2

FROM

#Checksums AS c

JOIN

#Checksums AS c2

ONc.ChecksumValue = c2.ChecksumValue

AND c.csID < c2.csID -- Controls 'back and forth' duplication

Gary, this should get you the existing results you have as quickly as possible as long as you have the above mentioned index available.

The problem is de-duping the existing values in the result set. The only way I have to do it offhand is a bit messy but it DOES get you there.

;WITH CompareComplete AS

(SELECT

c.ChecksumValue,

c.ComputerName AS Name1,

c2.ComputerName AS Name2

FROM

#Checksums AS c

JOIN

#Checksums AS c2

ONc.ChecksumValue = c2.ChecksumValue

AND c.csID < c2.csID -- Controls 'back and forth' duplication

)

, UnionedResults AS

(SELECT

ChecksumValue,

Name1 AS CompName

FROM

CompareComplete

UNION -- This will force a list distinction, do NOT use ALL

SELECT

ChecksumValue,

Name2

FROM

CompareComplete

)

SELECT DISTINCT

ChecksumValue,

(SELECT CompName + ',' AS 'data()'

FROM UnionedResults AS t2

WHERE t2.ChecksumValue = t.ChecksumValue FOR XML PATH( '')) AS OneRowVal

FROm

UnionedResults AS t

I'm hoping someone can clean up the formatting of the results a bit. This process is one I've ran against larger comparison sets (VARCHAR(4000)s) which absolutely killed me in the compare performance, thus the use of CHECKSUM to remove the possible 1000+ character comparisons. The use of the Temp table also gets you off the primary data sources.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Dwain Camps SSC Guru Points: 86978 More actions · Answer 10

Here's another, relatively straightforward approach:

;WITH ApplicationsGrouped AS (

SELECT ComputerName

,Applications=STUFF((

SELECT ';' + ApplicationName

FROM #mytable b

WHERE a.ComputerName = b.ComputerName

ORDER BY ApplicationName

FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)')

,1, 1, '')

FROM #mytable a

GROUP BY ComputerName)

SELECT Applications

,Computers=STUFF((

SELECT ';' + ComputerName

FROM ApplicationsGrouped b

WHERE a.Applications = b.Applications

ORDER BY ComputerName

FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)')

,1, 1, '')

FROM ApplicationsGrouped a

GROUP BY Applications

HAVING COUNT(ComputerName) > 1

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

gary.harrington SSC Rookie Points: 42 More actions · Answer 11

Craig - This is perfect! Much appreciated!.... and yes, I have a non-clustered index on the table using ComputerName and ApplicationName

Thanks a million!

G

gary.harrington SSC Rookie Points: 42 More actions · Answer 12

Nice Dwain! Thanks for taking the time to look at this for me!

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 13

Here's a fourth method:

-- solution

;WITH ParentsOnly AS (

SELECT ComputerName, ChildRows = COUNT(*)

FROM #MyTable

GROUP BY ComputerName

)

SELECT a.*, b.*, x.*

FROM ParentsOnly a

INNER JOIN ParentsOnly b ON b.ChildRows = a.ChildRows AND b.ComputerName > a.ComputerName

CROSS APPLY ( -- match up the child rows

SELECT MatchingRows = COUNT(*)

FROM #MyTable ai

INNER JOIN #MyTable bi ON bi.ComputerName = b.ComputerName

AND bi.ApplicationName = ai.ApplicationName

WHERE ai.ComputerName = a.ComputerName

) x

WHERE a.ChildRows = x.MatchingRows

ORDER BY a.ComputerName, b.ComputerName

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Mark Cowne One Orange Chip Points: 26986 More actions · Answer 14

Another one

WITH GrpCounts(ComputerName, ApplicationName,GrpCnt) AS (

SELECT ComputerName, ApplicationName,

COUNT(*) OVER(PARTITION BY ComputerName)

FROM #MyTable),

Results(ComputerNameA,ComputerNameB) AS (

SELECT x.ComputerName , y.ComputerName

FROM GrpCounts x

INNER JOIN GrpCounts y ON y.ApplicationName = x.ApplicationName AND y.ComputerName < x.ComputerName AND x.GrpCnt=y.GrpCnt

GROUP BY x.ComputerName, y.ComputerName, x.GrpCnt

HAVING COUNT(*) = x.GrpCnt)

SELECT r1.ComputerNameA +

(SELECT ' ' + r3.ComputerNameB AS "text()"

FROM Results r3

WHERE r3.ComputerNameA = r1.ComputerNameA

ORDER BY r3.ComputerNameB

FOR XML PATH('')) Computers

FROM Results r1

WHERE NOT EXISTS(SELECT * FROM Results r2 WHERE r1.ComputerNameA = r2.ComputerNameB)

GROUP BY r1.ComputerNameA

ORDER BY r1.ComputerNameA;

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537