avoiding distinct operator ?

Question

avoiding distinct operator ?

vsamantha35

SSChampion

Points: 11529
More actions
December 26, 2017 at 9:23 pm

#341416

Hi All,
Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?
test query
===========
SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)
Thanks,
Sam

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

ZZartin SSC-Dedicated Points: 30894 More actions · Answer 1

Yes, change the join criteria so it's not generating duplicates, but it kind of depends on what's generating the duplicates. In this case one option might be to change the t3.c2 in (10,20,30,40) clause to a series of union statements.

drew.allen SSC Guru Points: 76944 More actions · Answer 2

vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,
Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?
test query
===========
SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)
Thanks,
Sam

If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Jeff Moden SSC Guru Points: 1003853 More actions · Answer 3

drew.allen - Wednesday, December 27, 2017 7:13 AM
vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,
Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?
test query
===========
SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)
Thanks,
Sam
If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).
Drew

I'll second that. In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is. In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs. It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List. It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

vsamantha35 SSChampion Points: 11529 More actions · Answer 4

Jeff Moden - Wednesday, December 27, 2017 8:49 AM
drew.allen - Wednesday, December 27, 2017 7:13 AM
vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,
Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?
test query
===========
SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)
Thanks,
Sam
If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).
Drew
I'll second that. In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is. In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs. It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.
The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).
It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.
As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List. It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.

Okay, makes sense. Thanks for those wise suggestions.

Jeff Moden SSC Guru Points: 1003853 More actions · Answer 5

vsamantha35 - Wednesday, December 27, 2017 11:44 AM
Jeff Moden - Wednesday, December 27, 2017 8:49 AM
drew.allen - Wednesday, December 27, 2017 7:13 AM
vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,
Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?
test query
===========
SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)
Thanks,
Sam
If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).
Drew
I'll second that. In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is. In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs. It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.
The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).
It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.
As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List. It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.
Okay, makes sense. Thanks for those wise suggestions.

Thanks for the feedback. Also, keep in mind that, sometimes, DISTINCT is exactly what IS needed. As always, "It Depends". It's like saying that "all cursors are bad". They're actually not IF they're used for the right things. "It Depends". 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)