Bypassing the keyword DISTINCT

  • Hello,

    I have been told that the keyword DISTINCT is rather costly for SQL Server and was wondering which statement we could use to bypass this keyword and still get the same result set.

    consider this:

    CREATE TABLE [dbo].[tbl_1](
        [tbl1_id] [int] NOT NULL,
        [field1] [varchar](50) COLLATE Latin1_General_BIN NOT NULL
    ) ON [PRIMARY]
    GO

    CREATE TABLE [dbo].[tbl_2](
        [tbl2_id] [int] NOT NULL,
        [tbl1_id] [int] NOT NULL,
        [tbl2_actief] [bit] NOT NULL CONSTRAINT [DF_tbl_2_tbl2_actief] DEFAULT ((1))
    ) ON [PRIMARY]

    INSERT INTO [tbl_1] ([tbl1_id], [field1])
    SELECT 1, 'A' UNION ALL
    SELECT 2, 'B' UNION ALL
    SELECT 3, 'C'

    INSERT INTO [tbl_2] ([tbl2_id], [tbl1_id], [tbl2_actief])
    SELECT 1, 1, 1 UNION ALL
    SELECT 2, 1, 1 UNION ALL
    SELECT 3, 2, 0

    SELECT DISTINCT [t].[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1

    shows me only:

    tbl1_id field1

    1 A

    This works for me because it shows me all fields from table 1 for which there is a relation in table 2, and that relation is active in table 2.

    Because I need only the fields from table 1 and none from table 2, the DISTINCT does what it should for me.

    But can we write another SELECT statement that does not need the DISTINCT and still gives me the same results back?

    PS: Is it really true that the DISTINCT keyword is very expensive, or am I asking a stupid question here?

    wkr,

    Eddy

  • So you want only the unique field1 values from tbl_1? Is that the requirement?

    And it's okay to have duplicate entries in tbl_2?

    GROUP BY or ROW_NUMBER can mimic DISTINCT as well.
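
    For example, something like this against your sample tables should bring back the same row as the DISTINCT version (just a quick, untested sketch; the EXISTS form is an extra option since you only need columns from tbl_1):

    -- GROUP BY over the selected columns removes the duplicates after the join
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1
    GROUP BY t.[tbl1_id], [field1]

    -- EXISTS avoids producing the duplicates in the first place (a semi-join)
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    WHERE EXISTS (SELECT 1
                  FROM [tbl_2] AS t2
                  WHERE t2.[tbl1_id] = t.[tbl1_id]
                    AND t2.[tbl2_actief] = 1)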

    SQLServerNewbieMCITP: Database Administrator SQL Server 2005
  • That is indeed the requirement.

    DISTINCT works fine, don't get me wrong, but because I have been told it's more expensive, I'm looking for a faster query method if possible,

    because my result sets in production will contain several thousand rows and the query will be called multiple times a minute during peak hours.

    So I have just created my full statement for a view, and for now with the test data it runs lightning fast.

    The query plan seems fine too,

    but I still want to be proactive for when we reach production numbers.

    The full view SELECT statement looks like:

    SELECT b.BRL_ID, b.PAR_ID_FIRMA, b.OFF_ID, b.PAR_ID_TYPE, b.PAR_ID_DISC,
           b.ADR_ID_KLANT, b.ARG_ID, b.ART_ID, b.ADR_ID_LAADNR, b.ADR_ID_LOSNR,
           b.PLS_ID_LADEN, b.PLS_ID_LOSSEN, b.PAR_ID_EENHEID, b.PAR_ID_MUNT,
           b.BRL_BASISNR, b.BRL_BEGDATUM, b.BRL_ENDDATUM, b.BRL_REMOVED,
           b.BRL_REGDATUM, dbo.ADRES.ADR_NAAM AS KLANTNAAM, ADRES_1.ADR_NAAM AS LAADPLAATS,
           ADRES_2.ADR_NAAM AS LOSPLAATS, dbo.PLAATS.PLS_CODE AS PLAATSCODELADEN, dbo.PLAATS.PLS_OMSCHR AS PLAATSLADEN,
           PLAATS_1.PLS_CODE AS PLAATSCODELOSSEN, PLAATS_1.PLS_OMSCHR AS PLAATSLOSSEN, dbo.ADRSRT.ADS_NR AS KLANTNR,
           dbo.MASTERREL.PAR_ID_TYPE AS PAR_ID_TYPE_MASTER, dbo.ARTIKEL.ART_OMS_NED AS ARTIKELNAAM
    FROM (SELECT DISTINCT
                 b.BRL_ID, b.OFF_ID, b.MRL_ID, b.PAR_ID_FIRMA, b.PAR_ID_TYPE, b.PAR_ID_DISC, b.BRL_REGDATUM, b.BRL_BEGDATUM, b.BRL_ENDDATUM,
                 b.BRL_REMOVED, b.BRL_REMOVED_LISA, b.BRL_BASISNR, b.ADR_ID_KLANT, b.ARG_ID, b.ART_ID, b.ADR_ID_LAADNR, b.ADR_ID_LOSNR,
                 b.PAR_ID_EENHEID, b.PAR_ID_MUNT, b.PLS_ID_LADEN, b.PLS_ID_LOSSEN, b.BRL_FACTLIJST, b.BRL_OPM_FACTLIJST, b.BRL_OPM_FACTUUR,
                 b.BRL_OPM_NED_LAAPLAATS, b.BRL_OPM_FRA_LAADPLAATS, b.BRL_OPM_NED_LOSPLAATS, b.BRL_OPM_FRA_LOSPLAATS,
                 b.BRL_OPL_EIGENDOM
          FROM dbo.BASISREL AS b
          INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
          WHERE d.DLR_REMOVED = 0) AS b
    INNER JOIN dbo.MASTERREL ON b.MRL_ID = dbo.MASTERREL.MRL_ID
    INNER JOIN dbo.ARTIKEL ON b.ART_ID = dbo.ARTIKEL.ART_ID
    LEFT OUTER JOIN dbo.ADRES AS ADRES_2 ON b.ADR_ID_LOSNR = ADRES_2.ADR_ID
    LEFT OUTER JOIN dbo.ADRES AS ADRES_1 ON b.ADR_ID_LAADNR = ADRES_1.ADR_ID
    LEFT OUTER JOIN dbo.PLAATS AS PLAATS_1 ON b.PLS_ID_LOSSEN = PLAATS_1.PLS_ID
    LEFT OUTER JOIN dbo.PLAATS ON b.PLS_ID_LADEN = dbo.PLAATS.PLS_ID
    LEFT OUTER JOIN dbo.ADRES ON b.ADR_ID_KLANT = dbo.ADRES.ADR_ID
    LEFT OUTER JOIN dbo.ADRSRT ON dbo.ADRES.ADR_ID = dbo.ADRSRT.ADR_ID AND dbo.ADRSRT.ADS_TYPE = 1

    If any expert sees something that could make this "always" run fast no matter how many thousand records there are in BASISREL, then please tell me how to improve this one.

    Wkr,

    Eddy

  • How many rows are returned by this...

    SELECT DISTINCT
        b.BRL_ID, b.OFF_ID, b.MRL_ID, b.PAR_ID_FIRMA, b.PAR_ID_TYPE, b.PAR_ID_DISC, b.BRL_REGDATUM, b.BRL_BEGDATUM, b.BRL_ENDDATUM,
        b.BRL_REMOVED, b.BRL_REMOVED_LISA, b.BRL_BASISNR, b.ADR_ID_KLANT, b.ARG_ID, b.ART_ID, b.ADR_ID_LAADNR, b.ADR_ID_LOSNR,
        b.PAR_ID_EENHEID, b.PAR_ID_MUNT, b.PLS_ID_LADEN, b.PLS_ID_LOSSEN, b.BRL_FACTLIJST, b.BRL_OPM_FACTLIJST, b.BRL_OPM_FACTUUR,
        b.BRL_OPM_NED_LAAPLAATS, b.BRL_OPM_FRA_LAADPLAATS, b.BRL_OPM_NED_LOSPLAATS, b.BRL_OPM_FRA_LOSPLAATS,
        b.BRL_OPL_EIGENDOM
    FROM dbo.BASISREL AS b
    INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
    WHERE d.DLR_REMOVED = 0

    ...with and without the DISTINCT?
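
    (A quick way to get both counts is to wrap the query in a COUNT, roughly like this, untested:)

    -- without the DISTINCT
    SELECT COUNT(*)
    FROM dbo.BASISREL AS b
    INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
    WHERE d.DLR_REMOVED = 0

    -- with the DISTINCT: count the rows of the derived table
    SELECT COUNT(*)
    FROM (SELECT DISTINCT b.BRL_ID, b.OFF_ID, b.MRL_ID -- ...plus the rest of the column list above
          FROM dbo.BASISREL AS b
          INNER JOIN dbo.DEELREL AS d ON b.BRL_ID = d.BRL_ID
          WHERE d.DLR_REMOVED = 0) AS x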

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • In my test tables I get 4 rows when I run it without the DISTINCT and 3 rows if I use the DISTINCT keyword,

    which is exactly what it should return, because I have put 2 DEELREL rows with the same BASISREL.BRL_ID in them.

    But we need to create a selection view for our planning team, and they do not need the DEELREL data at that point; that's why I want only the "distinct" BASISREL rows returned here.

    In our data schema there is ALWAYS AT LEAST 1 DEELREL for each BASISREL.
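
    For reference, what I have in mind is something along these lines instead of the DISTINCT derived table (only a sketch with the column list shortened; I have not tested whether it is equivalent):

    SELECT b.BRL_ID, b.OFF_ID, b.MRL_ID -- ...plus the rest of the BASISREL columns from the view
    FROM dbo.BASISREL AS b
    WHERE EXISTS (SELECT 1
                  FROM dbo.DEELREL AS d
                  WHERE d.BRL_ID = b.BRL_ID
                    AND d.DLR_REMOVED = 0)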

    Wkr,

    Eddy

  • How many rows in DEELREL and BASISREL?

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Can you make this join 1 to 1? b.BRL_ID = d.BRL_ID

    Where did you read that DISTINCT had a high cost associated with it?

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Sorts are expensive operations and DISTINCT requires a sort, as do most other methods of removing duplicates from a result set.

    The best 'workaround' is to find out why there are duplicates (bad data, bad database design, bad query) and fix that. If you don't need DISTINCT, don't use it.
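
    To see what the duplicate removal actually costs, one option (a rough sketch against the small sample tables from the first post, not the production schema) is to run the query with and without DISTINCT and compare the statistics output and the actual plans, looking for a Sort or Hash Match (Aggregate) operator:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- with DISTINCT: look for a Sort or Hash Match (Aggregate) in the actual plan
    SELECT DISTINCT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1;

    -- without DISTINCT: same joins, no duplicate-removal operator
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    INNER JOIN [tbl_2] AS t2 ON t.[tbl1_id] = t2.[tbl1_id]
    WHERE [tbl2_actief] = 1;

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;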

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Makes sense. That said, I've read that in certain circumstances, such as IN clauses, it's usually beneficial to add DISTINCT, even if you shouldn't really need it. Not sure if it's accurate as I've never done any testing of it. It should help if it eliminates duplicates in an IN clause though.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Garadin (11/20/2008)


    Makes sense. That said, I've read that in certain circumstances, such as IN clauses, it's usually beneficial to add DISTINCT, even if you shouldn't really need it. Not sure if it's accurate as I've never done any testing of it. It should help if it eliminates duplicates in an IN clause though.

    It's not only not beneficial, it can be harmful.

    Duplicates in an IN are ignored. All that's important to the IN is what values exist, not how many times a value repeats. The optimiser typically uses a semi-join for an IN with a subquery; since only the presence of a row is important, the full join isn't actually done.

    Most of the time the optimiser ignores a DISTINCT inside an IN, because it's irrelevant. Sometimes it does run the distinct, and then it's a waste of resources.
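
    To illustrate with the sample tables from the first post (a sketch): these two are handled the same way, so the DISTINCT in the second buys nothing.

    -- the IN only cares whether a matching value exists
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    WHERE t.[tbl1_id] IN (SELECT t2.[tbl1_id] FROM [tbl_2] AS t2 WHERE t2.[tbl2_actief] = 1);

    -- adding DISTINCT inside the IN changes nothing; at best it is simply ignored
    SELECT t.[tbl1_id], [field1]
    FROM [tbl_1] AS t
    WHERE t.[tbl1_id] IN (SELECT DISTINCT t2.[tbl1_id] FROM [tbl_2] AS t2 WHERE t2.[tbl2_actief] = 1);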

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (11/20/2008)


    It's not only not beneficial, it can be harmful.

    Duplicates in an IN are ignored. All that's important to the IN is what values exist, not how many times a value repeats. The optimiser typically uses a semi-join for an IN with a subquery; since only the presence of a row is important, the full join isn't actually done.

    Most of the time the optimiser ignores a DISTINCT inside an IN, because it's irrelevant. Sometimes it does run the distinct, and then it's a waste of resources.

    Good to know. Guess I'll stop doing that :Wow: Stupid multi-database query optimization book.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Test, test, test....

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Indeed. The problem at the moment is that I don't really understand how to read execution plans or see the specifics of why one plan might be better than another. I mean, I can of course say "Well... this one runs in 1/3 the time... it's better!", but the specifics are still very much a mystery. :ermm: Definitely need to do some reading on that.

    Thanks for the tips Gail.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Distinct in my eyes is only good for analysing why/if there are duplicate records in your record set.

    Just like subqueries are only good for refactoring!

    Carlton..

  • Garadin (11/20/2008)


    Indeed. The problem at the moment is that I don't really understand how to read execution plans or see the specifics of why one plan might be better than another.

    Take a look at Grant Fritchey's book. I think Redgate's still giving it away. If not, mail him. Also take a look at my blog:

    http://sqlinthewild.co.za/index.php/category/sql-server/execution-plans/

    If there's anything you want to see that's not there, let me know.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
