SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


avoiding distinct operator ?


avoiding distinct operator ?

Author
Message
vsamantha35
vsamantha35
SSCertifiable
SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)

Group: General Forum Members
Points: 5683 Visits: 1080
Hi All,

Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

test query
===========

SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)

Thanks,

Sam
ZZartin
ZZartin
SSCoach
SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)

Group: General Forum Members
Points: 16257 Visits: 15391
Yes, change the join criteria so it's not generating duplicates, but it kind of depends on what's generating the duplicates. In this case one option might be to change the t3.c2 in (10,20,30,40) clause to a series of union statements.
drew.allen
drew.allen
SSC-Forever
SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)SSC-Forever (45K reputation)

Group: General Forum Members
Points: 45187 Visits: 14686
vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,

Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

test query
===========

SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)

Thanks,

Sam

If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

Drew


J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
How to post data/code on a forum to get the best help.
How to Post Performance Problems
Make sure that you include code in the appropriate IFCode tags, e.g. [code=sql]<your code here>[/code]. You can find the IFCode tags under the INSERT options when you are writing a post.
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)

Group: General Forum Members
Points: 618337 Visits: 45200
drew.allen - Wednesday, December 27, 2017 7:13 AM
vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,

Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

test query
===========

SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)

Thanks,

Sam

If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

Drew


I'll second that. In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is. In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs. It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List. It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
vsamantha35
vsamantha35
SSCertifiable
SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)SSCertifiable (5.7K reputation)

Group: General Forum Members
Points: 5683 Visits: 1080
Jeff Moden - Wednesday, December 27, 2017 8:49 AM
drew.allen - Wednesday, December 27, 2017 7:13 AM
vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,

Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

test query
===========

SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)

Thanks,

Sam

If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

Drew


I'll second that. In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is. In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs. It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List. It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.


Okay, makes sense. Thanks for those wise suggestions.
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)SSC Guru (618K reputation)

Group: General Forum Members
Points: 618337 Visits: 45200
vsamantha35 - Wednesday, December 27, 2017 11:44 AM
Jeff Moden - Wednesday, December 27, 2017 8:49 AM
drew.allen - Wednesday, December 27, 2017 7:13 AM
vsamantha35 - Tuesday, December 26, 2017 9:23 PM
Hi All,

Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

test query
===========

SELECT DISTINCT
t1.c1,t1.c2,c6,
t2.c1,t2.c3,t2.c4,
t3.c1,t3.c2,t3.c3,
FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
WHERE t1.iscurrent = 0
and t3.c2 in (10,20,30,40)

Thanks,

Sam

If you are getting duplicates, your data is bad or your query is wrong. An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

Drew


I'll second that. In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is. In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs. It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List. It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.


Okay, makes sense. Thanks for those wise suggestions.


Thanks for the feedback. Also, keep in mind that, sometimes, DISTINCT is exactly what IS needed. As always, "It Depends". It's like saying that "all cursors are bad". They're actually not IF they're used for the right things. "It Depends". Wink

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum








































































































































































SQLServerCentral


Search