avoiding distinct operator ?

  • Hi All,

    Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

    test query
    ===========

    SELECT DISTINCT
    t1.c1,t1.c2,c6,
    t2.c1,t2.c3,t2.c4,
    t3.c1,t3.c2,t3.c3,
    FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
    INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
    WHERE t1.iscurrent = 0
    and t3.c2 in (10,20,30,40)

    Thanks,

    Sam

  • Yes, change the join criteria so it's not generating duplicates, but it kind of depends on what's generating the duplicates.  In this case one option might be to change the  t3.c2 in (10,20,30,40) clause to a series of union statements.

  • vsamantha35 - Tuesday, December 26, 2017 9:23 PM

    Hi All,

    Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

    test query
    ===========

    SELECT DISTINCT
    t1.c1,t1.c2,c6,
    t2.c1,t2.c3,t2.c4,
    t3.c1,t3.c2,t3.c3,
    FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
    INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
    WHERE t1.iscurrent = 0
    and t3.c2 in (10,20,30,40)

    Thanks,

    Sam

    If you are getting duplicates, your data is bad or your query is wrong.  An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Wednesday, December 27, 2017 7:13 AM

    vsamantha35 - Tuesday, December 26, 2017 9:23 PM

    Hi All,

    Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

    test query
    ===========

    SELECT DISTINCT
    t1.c1,t1.c2,c6,
    t2.c1,t2.c3,t2.c4,
    t3.c1,t3.c2,t3.c3,
    FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
    INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
    WHERE t1.iscurrent = 0
    and t3.c2 in (10,20,30,40)

    Thanks,

    Sam

    If you are getting duplicates, your data is bad or your query is wrong.  An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

    Drew

    I'll second that.  In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is.  In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs.  It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

    The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

    It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

    As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List.  It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Wednesday, December 27, 2017 8:49 AM

    drew.allen - Wednesday, December 27, 2017 7:13 AM

    vsamantha35 - Tuesday, December 26, 2017 9:23 PM

    Hi All,

    Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

    test query
    ===========

    SELECT DISTINCT
    t1.c1,t1.c2,c6,
    t2.c1,t2.c3,t2.c4,
    t3.c1,t3.c2,t3.c3,
    FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
    INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
    WHERE t1.iscurrent = 0
    and t3.c2 in (10,20,30,40)

    Thanks,

    Sam

    If you are getting duplicates, your data is bad or your query is wrong.  An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

    Drew

    I'll second that.  In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is.  In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs.  It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

    The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

    It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

    As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List.  It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.

    Okay, makes sense. Thanks for those wise suggestions.

  • vsamantha35 - Wednesday, December 27, 2017 11:44 AM

    Jeff Moden - Wednesday, December 27, 2017 8:49 AM

    drew.allen - Wednesday, December 27, 2017 7:13 AM

    vsamantha35 - Tuesday, December 26, 2017 9:23 PM

    Hi All,

    Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

    test query
    ===========

    SELECT DISTINCT
    t1.c1,t1.c2,c6,
    t2.c1,t2.c3,t2.c4,
    t3.c1,t3.c2,t3.c3,
    FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
    INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
    WHERE t1.iscurrent = 0
    and t3.c2 in (10,20,30,40)

    Thanks,

    Sam

    If you are getting duplicates, your data is bad or your query is wrong.  An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

    Drew

    I'll second that.  In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is.  In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs.  It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

    The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

    It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

    As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List.  It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.

    Okay, makes sense. Thanks for those wise suggestions.

    Thanks for the feedback.  Also, keep in mind that, sometimes, DISTINCT is exactly what IS needed.  As always, "It Depends".  It's like saying that "all cursors are bad".  They're actually not IF they're used for the right things.  "It Depends". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply