• Jeff Moden - Wednesday, December 27, 2017 8:49 AM

    drew.allen - Wednesday, December 27, 2017 7:13 AM

    vsamantha35 - Tuesday, December 26, 2017 9:23 PM

    Hi All,

    Is there a better way to re-write below type of queries? Most developers tend to keep DISTINCT Operator to avoid duplicates. Is there a nice alternate way to avoid DISTINCT and still getting results with duplicates and has better performance?

    test query
    ===========

    SELECT DISTINCT
    t1.c1,t1.c2,c6,
    t2.c1,t2.c3,t2.c4,
    t3.c1,t3.c2,t3.c3,
    FROM t1 INNER JOIN t2 ON t1.c2 = t2.c4
    INNER JOIN t3 ON t3.c1 = t2.c3 and t3.c3 = t1.c6
    WHERE t1.iscurrent = 0
    and t3.c2 in (10,20,30,40)

    Thanks,

    Sam

    If you are getting duplicates, your data is bad or your query is wrong.  An ounce of prevention (cleaning up your data) is worth a pound of cure (trying to find workarounds to DISTINCT).

    Drew

    I'll second that.  In most cases, this happens because someone thought it would be a whole lot easier if they denormalized data or they don't know what normalization is.  In some cases, it's simply because people don't understand the data and have left some important criteria out of the JOINs.  It can also happen for the same reason when people write huge "all in one" queries where a little "Divide'n'Conquer" to capture the main controlling data early in a Temp Table will make a server crushing 45 minute run take only 3 seconds to execute because all of the "distinct" functionality is no longer necessary and millions (sometimes billions) of internal rows no longer need to be generated.

    The other thing to watch for is people that use GROUP BY to solve the same issues that DISTINCT does (usually, incorrectly).

    It's not possible for us to make a recommendation without knowing the nature of the data in the tables in the query but the "Divide'n'Conquer" method may be just what the doctor order even for a query with so few joins especially if some of the tables have a lot of data.

    As a bit of a sidebar (and I know it's only example code), whenever there's more than one table involved in a query, ALWAYS use the 2 part naming convention (tablealias.columnname) for all columns in the SELECT List.  It's makes troubleshooting a whole lot easier and will prevent a meltdown if someone happens to add an identically named column to one of the other tables.

    Okay, makes sense. Thanks for those wise suggestions.