Bypassing the keyword DISTINCT

  • Hello,

    I have been told that the keyword DISTINCT is rather costly for SQL Server and was wondering which statement we could use to bypass this keyword and still get the same result set.

    consider this:

    CREATE TABLE [dbo].[tbl_1](
        [tbl1_id] [int] NOT NULL,
        [field1] [varchar](50) COLLATE Latin1_General_BIN NOT NULL
    ) ON [PRIMARY]
    GO

    CREATE TABLE [dbo].[tbl_2](
        [tbl2_id] [int] NOT NULL,
        [tbl1_id] [int] NOT NULL,
        [tbl2_actief] [bit] NOT NULL CONSTRAINT [DF_tbl_2_tbl2_actief] DEFAULT ((1))
    ) ON [PRIMARY]

    INSERT INTO [tbl_1] ([tbl1_id], [field1])
    SELECT 1, 'A' UNION ALL
    SELECT 2, 'B' UNION ALL
    SELECT 3, 'C'

    INSERT INTO [tbl_2] ([tbl2_id], [tbl1_id], [tbl2_actief])
    SELECT 1, 1, 1 UNION ALL
    SELECT 2, 1, 1 UNION ALL
    SELECT 3, 2, 0

    SELECT DISTINCT [t].[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1

    shows me only:

    tbl1_id field1

    1 A

    This works for me because it shows me all fields from table 1 for which there is a relation in table 2, and that relation is active in table 2.

    Because I need only the fields from table 1 and none from table 2, the DISTINCT does what it should for me.

    But can we write another SELECT statement that does not need the DISTINCT and still gives me the same results back?

    PS: Is it really true that the DISTINCT keyword is very expensive, or am I asking a stupid question here?

    wkr,

    Eddy

  • So you want only the unique field1 values from tbl_1? Is that the requirement?

    And it's okay to have duplicate entries in tbl_2?

    GROUP BY or ROW_NUMBER can mimic DISTINCT as well.
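
    For example, something like this against your sample tables should bring back the same row as the DISTINCT version (just a quick, untested sketch; the EXISTS form is an extra option since you only need columns from tbl_1):

    -- GROUP BY over the selected columns removes the duplicates after the join
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1
    GROUP BY t.[tbl1_id], [field1]

    -- EXISTS avoids producing the duplicates in the first place (a semi-join)
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    WHERE EXISTS (SELECT 1
                  FROM [tbl_2] AS t2
                  WHERE t2.[tbl1_id] = t.[tbl1_id]
                    AND t2.[tbl2_actief] = 1)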

    SQLServerNewbieMCITP: Database Administrator SQL Server 2005
  • That is indeed the requirement.

    DISTINCT works fine, don't get me wrong, but because I have been told it's more expensive, I'm looking for a faster query method if possible,

    because my result sets in production will contain several thousand rows and the query will be called multiple times a minute during peak hours.

    So I have just created my full statement for a view, and for now with the test data it runs lightning fast.

    The query plan seems fine too,

    but I still want to be proactive for when we reach production numbers.

    The full view SELECT statement looks like:

    SELECT b.BRL_ID, b.PAR_ID_FIRMA, b.OFF_ID, b.PAR_ID_TYPE, b.PAR_ID_DISC,
           b.ADR_ID_KLANT, b.ARG_ID, b.ART_ID, b.ADR_ID_LAADNR, b.ADR_ID_LOSNR,
           b.PLS_ID_LADEN, b.PLS_ID_LOSSEN, b.PAR_ID_EENHEID, b.PAR_ID_MUNT,
           b.BRL_BASISNR, b.BRL_BEGDATUM, b.BRL_ENDDATUM, b.BRL_REMOVED,
           b.BRL_REGDATUM, dbo.ADRES.ADR_NAAM AS KLANTNAAM, ADRES_1.ADR_NAAM AS LAADPLAATS,
           ADRES_2.ADR_NAAM AS LOSPLAATS, dbo.PLAATS.PLS_CODE AS PLAATSCODELADEN, dbo.PLAATS.PLS_OMSCHR AS PLAATSLADEN,
           PLAATS_1.PLS_CODE AS PLAATSCODELOSSEN, PLAATS_1.PLS_OMSCHR AS PLAATSLOSSEN, dbo.ADRSRT.ADS_NR AS KLANTNR,
           dbo.MASTERREL.PAR_ID_TYPE AS PAR_ID_TYPE_MASTER, dbo.ARTIKEL.ART_OMS_NED AS ARTIKELNAAM
    FROM (SELECT DISTINCT
                 b.BRL_ID, b.OFF_ID, b.MRL_ID, b.PAR_ID_FIRMA, b.PAR_ID_TYPE, b.PAR_ID_DISC, b.BRL_REGDATUM, b.BRL_BEGDATUM, b.BRL_ENDDATUM,
                 b.BRL_REMOVED, b.BRL_REMOVED_LISA, b.BRL_BASISNR, b.ADR_ID_KLANT, b.ARG_ID, b.ART_ID, b.ADR_ID_LAADNR, b.ADR_ID_LOSNR,
                 b.PAR_ID_EENHEID, b.PAR_ID_MUNT, b.PLS_ID_LADEN, b.PLS_ID_LOSSEN, b.BRL_FACTLIJST, b.BRL_OPM_FACTLIJST, b.BRL_OPM_FACTUUR,
                 b.BRL_OPM_NED_LAAPLAATS, b.BRL_OPM_FRA_LAADPLAATS, b.BRL_OPM_NED_LOSPLAATS, b.BRL_OPM_FRA_LOSPLAATS,
                 b.BRL_OPL_EIGENDOM
          FROM dbo.BASISREL AS b
          INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
          WHERE d.DLR_REMOVED = 0) AS b
    INNER JOIN dbo.MASTERREL ON b.MRL_ID = dbo.MASTERREL.MRL_ID
    INNER JOIN dbo.ARTIKEL ON b.ART_ID = dbo.ARTIKEL.ART_ID
    LEFT OUTER JOIN dbo.ADRES AS ADRES_2 ON b.ADR_ID_LOSNR = ADRES_2.ADR_ID
    LEFT OUTER JOIN dbo.ADRES AS ADRES_1 ON b.ADR_ID_LAADNR = ADRES_1.ADR_ID
    LEFT OUTER JOIN dbo.PLAATS AS PLAATS_1 ON b.PLS_ID_LOSSEN = PLAATS_1.PLS_ID
    LEFT OUTER JOIN dbo.PLAATS ON b.PLS_ID_LADEN = dbo.PLAATS.PLS_ID
    LEFT OUTER JOIN dbo.ADRES ON b.ADR_ID_KLANT = dbo.ADRES.ADR_ID
    LEFT OUTER JOIN dbo.ADRSRT ON dbo.ADRES.ADR_ID = dbo.ADRSRT.ADR_ID AND dbo.ADRSRT.ADS_TYPE = 1

    If any expert sees something that could make this "always" run fast no matter how many thousand records there are in BASISREL, then please tell me how to improve this one.

    Wkr,

    Eddy

  • How many rows are returned by this...

    SELECT DISTINCT
        b.BRL_ID, b.OFF_ID, b.MRL_ID, b.PAR_ID_FIRMA, b.PAR_ID_TYPE, b.PAR_ID_DISC, b.BRL_REGDATUM, b.BRL_BEGDATUM, b.BRL_ENDDATUM,
        b.BRL_REMOVED, b.BRL_REMOVED_LISA, b.BRL_BASISNR, b.ADR_ID_KLANT, b.ARG_ID, b.ART_ID, b.ADR_ID_LAADNR, b.ADR_ID_LOSNR,
        b.PAR_ID_EENHEID, b.PAR_ID_MUNT, b.PLS_ID_LADEN, b.PLS_ID_LOSSEN, b.BRL_FACTLIJST, b.BRL_OPM_FACTLIJST, b.BRL_OPM_FACTUUR,
        b.BRL_OPM_NED_LAAPLAATS, b.BRL_OPM_FRA_LAADPLAATS, b.BRL_OPM_NED_LOSPLAATS, b.BRL_OPM_FRA_LOSPLAATS,
        b.BRL_OPL_EIGENDOM
    FROM dbo.BASISREL AS b
    INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
    WHERE d.DLR_REMOVED = 0

    ...with and without the DISTINCT?
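
    (A quick way to get both counts is to wrap the query in a COUNT, roughly like this, untested:)

    -- without the DISTINCT
    SELECT COUNT(*)
    FROM dbo.BASISREL AS b
    INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
    WHERE d.DLR_REMOVED = 0

    -- with the DISTINCT: count the rows of the derived table
    SELECT COUNT(*)
    FROM (SELECT DISTINCT b.BRL_ID, b.OFF_ID, b.MRL_ID -- ...plus the rest of the column list above
          FROM dbo.BASISREL AS b
          INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
          WHERE d.DLR_REMOVED = 0) AS x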

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • In my test tables I get 4 rows when I run it without the DISTINCT and 3 rows if I use the DISTINCT keyword,

    which is exactly what it should return, because I have put 2 DEELREL rows with the same BASISREL.BRL_ID in them.

    But we need to create a selection view for our planning team, and they do not need the DEELREL data at that point; that's why I want only the "distinct" BASISREL rows returned here.

    In our data schema there is ALWAYS AT LEAST 1 DEELREL for each BASISREL.
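
    For reference, what I have in mind is something along these lines instead of the DISTINCT derived table (only a sketch with the column list shortened; I have not tested whether it is equivalent):

    SELECT b.BRL_ID, b.OFF_ID, b.MRL_ID -- ...plus the rest of the BASISREL columns from the view
    FROM dbo.BASISREL AS b
    WHERE EXISTS (SELECT 1
                  FROM dbo.DEELREL AS d
                  WHERE d.BRL_ID = b.BRL_ID
                    AND d.DLR_REMOVED = 0)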

    Wkr,

    Eddy

  • How many rows in DEELREL and BASISREL?

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Can you make this join 1 to 1? b.BRL_ID = d.BRL_ID

    Where did you read that DISTINCT had a high cost associated with it?

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Sorts are expensive operations and DISTINCT requires a sort, as do most other methods of removing duplicates from a result set.

    The best 'workaround' is to find out why there are duplicates (bad data, bad database design, bad query) and fix that. If you don't need DISTINCT, don't use it.
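
    To see what the duplicate removal actually costs, one option (a rough sketch against the small sample tables from the first post, not the production schema) is to run the query with and without DISTINCT and compare the statistics output and the actual plans, looking for a Sort or Hash Match (Aggregate) operator:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- with DISTINCT: look for a Sort or Hash Match (Aggregate) in the actual plan
    SELECT DISTINCT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1;

    -- without DISTINCT: same joins, no duplicate-removal operator
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1;

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;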

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Makes sense. That said, I've read that in certain circumstances, such as IN clauses, it's usually beneficial to add DISTINCT, even if you shouldn't really need it. Not sure if it's accurate as I've never done any testing of it. It should help if it eliminates duplicates in an IN clause though.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Garadin (11/20/2008)


    Makes sense. That said, I've read that in certain circumstances, such as IN clauses, it's usually beneficial to add DISTINCT, even if you shouldn't really need it. Not sure if it's accurate as I've never done any testing of it. It should help if it eliminates duplicates in an IN clause though.

    It's not only not beneficial, it can be harmful.

    Duplicates in an IN are ignored. All that's important to the IN is what values exist, not how many times a value repeats. The optimiser typically uses a semi-join for an IN with a subquery; since only the presence of a row is important, the full join isn't actually done.

    Most of the time the optimiser ignores a DISTINCT inside an IN, because it's irrelevant. Sometimes it does run the distinct, and then it's a waste of resources.
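
    To illustrate with the sample tables from the first post (a sketch): these two are handled the same way, so the DISTINCT in the second buys nothing.

    -- the IN only cares whether a matching value exists
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    WHERE t.[tbl1_id] IN (SELECT t2.[tbl1_id] FROM [tbl_2] AS t2 WHERE t2.[tbl2_actief] = 1);

    -- adding DISTINCT inside the IN changes nothing; at best it is simply ignored
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    WHERE t.[tbl1_id] IN (SELECT DISTINCT t2.[tbl1_id] FROM [tbl_2] AS t2 WHERE t2.[tbl2_actief] = 1);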

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (11/20/2008)


    It's not only not beneficial, it can be harmful.

    Duplicates in an IN are ignored. All that's important to the IN is what values exist, not how many times a value repeats. The optimiser typically uses a semi-join for an IN with a subquery; since only the presence of a row is important, the full join isn't actually done.

    Most of the time the optimiser ignores a DISTINCT inside an IN, because it's irrelevant. Sometimes it does run the distinct, and then it's a waste of resources.

    Good to know. Guess I'll stop doing that :Wow: Stupid multi-database query optimization book.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Test, test, test....

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Indeed. The problem at the moment is that I don't really understand how to read execution plans or see the specifics of why one plan might be better than another. I mean, I can of course say "Well... this one runs in 1/3 the time... it's better!", but the specifics are still very much a mystery. :ermm: Definitely need to do some reading on that.

    Thanks for the tips Gail.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Distinct in my eyes is only good for analysing why/if there are duplicate records in your record set.

    Just like subqueries are only good for refactoring!

    Carlton..

  • Garadin (11/20/2008)


    Indeed. The problem at the moment is that I don't really understand how to read execution plans or see the specifics of why one plan might be better than another.

    Take a look at Grant Fritchey's book. I think Redgate's still giving it away. If not, mail him. Also take a look at my blog:

    http://sqlinthewild.co.za/index.php/category/sql-server/execution-plans/

    If there's anything you want to see that's not there, let me know.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
