Group by performance FK / PK

Question

Peter Kruis

Mr or Mrs. 500

Points: 577
July 10, 2014 at 9:03 am

#296431

Hello all,

I have something I can't explain; I hope some of you can.

We have a medium-sized database with the following tables:
- PA: 525000 records
- PR: 780000 records
- R: 1000 records
- B: 45 records

Keys:
- PA: PK = PAARDCODE
- PR: PK = PAARDREGISTERCODE
- PR: FK = PAARDCODE
- PR: FK = REGISTERCODE
- R: PK = REGISTERCODE
- R: FK = BOEKCODE
- B: PK = BOEKCODE

When I group by B.BOEKCODE, the query takes 10 seconds (or more when a WHERE clause is added).
When I group by R.BOEKCODE, the query takes less than 2 seconds.

SELECT B.BOEKOMSCHRIJVING, B.BOEKCODE -- or R.BOEKCODE
FROM PA
INNER JOIN PR ON PA.PAARDCODE = PR.PAARDCODE
INNER JOIN R ON R.REGISTERCODE = PR.REGISTERCODE
INNER JOIN B ON R.BOEKCODE = B.BOEKCODE
GROUP BY BOEKOMSCHRIJVING, B.BOEKCODE -- or R.BOEKCODE
ORDER BY BOEKOMSCHRIJVING

Why is the B.BOEKCODE version so much slower than the R.BOEKCODE version?

Thanks in advance,
Peter

sgmunson SSC Guru Points: 110639 · Answer 1

Have you looked at the query's execution plan? It would likely reveal the reason. Also, is there more than one record in the table with alias B for a given BOEKCODE value? If so, you may be traversing a lot of records that aren't needed, because including the B table in the result set requires retrieving all the records from that table for that BOEKCODE value. Since you are using GROUP BY, you don't need the B table data if you aren't aggregating anything from it, so using the R table's version is fine.

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

Peter Kruis Mr or Mrs. 500 Points: 577 · Answer 2

Hi Sgmunson,

Thanks for looking into this. I will look at the execution plan to see what is happening.

I need BOEKOMSCHRIJVING from the B table; that's why I added the B table.

As for extra records: every PA has at least one PR, each PR refers to exactly one R, and each R has one B.
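
One way to keep BOEKOMSCHRIJVING while still grouping on R.BOEKCODE is to aggregate first and join to B afterwards. The following is only a sketch under the assumptions stated in this thread (B.BOEKCODE is the primary key of B, and PA, PR, R, B stand in for the real table names); the alias G is made up for illustration, and whether it actually runs faster would have to be checked against the execution plan:

```sql
-- Sketch only: group on R.BOEKCODE first, then look up the description in B.
-- Because B.BOEKCODE is stated to be B's primary key, each group matches at most one B row,
-- so the result should be the same as grouping on B.BOEKCODE and BOEKOMSCHRIJVING.
SELECT B.BOEKOMSCHRIJVING, B.BOEKCODE
FROM B
INNER JOIN (
    SELECT R.BOEKCODE
    FROM PA
    INNER JOIN PR ON PA.PAARDCODE = PR.PAARDCODE
    INNER JOIN R ON R.REGISTERCODE = PR.REGISTERCODE
    GROUP BY R.BOEKCODE
) AS G ON G.BOEKCODE = B.BOEKCODE   -- G is a hypothetical alias
ORDER BY B.BOEKOMSCHRIJVING;
```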

sgmunson SSC Guru Points: 110639 · Answer 3

Given your additional info, once you have the query plan, take a close look at the indexes on the B table. See whether any of them contains both BOEKCODE and the other selected field. I'm guessing not, in which case selecting that field from the B table means traversing all the records rather than hitting an index that contains just the selected fields, but I wouldn't want to offer any guarantee on that. The query plan will likely make clear what's happening when you select BOEKCODE from the B table instead of the R table. You may be able to anticipate the plan's answer just by running the following query:

SELECT BOEKCODE, BOEKOMSCHRIJVING, COUNT(*)
FROM B
GROUP BY BOEKCODE, BOEKOMSCHRIJVING
HAVING COUNT(*) > 1

If that query returns any rows, you might want to substitute the following into your query in place of the B table:

SELECT DISTINCT BOEKCODE, BOEKOMSCHRIJVING
FROM B
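
As a rough sketch of that substitution, assuming the table and column names from the original post, the de-duplicated rows can be joined as a derived table in place of B. The alias BD is only an illustrative name:

```sql
-- Sketch only: B replaced by a de-duplicated derived table (BD is a hypothetical alias).
SELECT BD.BOEKOMSCHRIJVING, BD.BOEKCODE
FROM PA
INNER JOIN PR ON PA.PAARDCODE = PR.PAARDCODE
INNER JOIN R ON R.REGISTERCODE = PR.REGISTERCODE
INNER JOIN (
    SELECT DISTINCT BOEKCODE, BOEKOMSCHRIJVING
    FROM B
) AS BD ON R.BOEKCODE = BD.BOEKCODE
GROUP BY BD.BOEKOMSCHRIJVING, BD.BOEKCODE
ORDER BY BD.BOEKOMSCHRIJVING;
```

If BOEKCODE really is the primary key of B, as stated in the question, the duplicate check should return no rows and this substitution should not be needed.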


Peter Kruis Mr or Mrs. 500 Points: 577 · Answer 4

Hi Steve,

I checked the query execution plan, but this is the first time I have looked at one, and I have no idea how to see what is wrong. I tried to follow this website: link, but I still don't understand it. Can you help me read this plan?

Thanks in advance!

Peter

The slow one: [execution plan screenshot]

The fast one, also divided into 2 screens for readability: [execution plan screenshots]

sgmunson SSC Guru Points: 110639 · Answer 5

One thing I noticed right away is that the "slow" plan has two "Nested Loop" icons, while the "fast" plan has none and includes parallelism. Parallel operations usually speed things up, and avoiding nested loops over this many rows tends to help as well. I may not be an expert at reading these plans, but seeing the nested-loop plan run so much slower than one that makes good use of the indexes and runs in parallel suggests that my original thinking was quite likely correct. Retrieving BOEKCODE from the B table is costing a lot; in the fast version it probably comes from the index on the other table, so not having to traverse the individual records of the B table gets you out of a costly pair of nested loops.

Next time you want to post a SQL execution plan, save the plan with the .sqlplan extension, then zip the file and attach it to your post. It is much easier for others to look at the plan that way. Because I can't see the properties of the individual icons, I can't be certain of my conclusion, though I doubt that examining those properties would change it in this specific case. Someone with more experience reading plans might be in a better position to do so, and to explain it better than I can.
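
To put numbers next to the two plans, the two variants can be run back to back with I/O and timing statistics switched on. This is only a sketch using the abbreviated table names from the question; the figures appear in the Messages tab in Management Studio:

```sql
-- Sketch only: compare logical reads and elapsed time of the two variants.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant 1: group by B.BOEKCODE (reported as slow)
SELECT B.BOEKOMSCHRIJVING, B.BOEKCODE
FROM PA
INNER JOIN PR ON PA.PAARDCODE = PR.PAARDCODE
INNER JOIN R ON R.REGISTERCODE = PR.REGISTERCODE
INNER JOIN B ON R.BOEKCODE = B.BOEKCODE
GROUP BY B.BOEKOMSCHRIJVING, B.BOEKCODE
ORDER BY B.BOEKOMSCHRIJVING;

-- Variant 2: group by R.BOEKCODE (reported as fast)
SELECT B.BOEKOMSCHRIJVING, R.BOEKCODE
FROM PA
INNER JOIN PR ON PA.PAARDCODE = PR.PAARDCODE
INNER JOIN R ON R.REGISTERCODE = PR.REGISTERCODE
INNER JOIN B ON R.BOEKCODE = B.BOEKCODE
GROUP BY B.BOEKOMSCHRIJVING, R.BOEKCODE
ORDER BY B.BOEKOMSCHRIJVING;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```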

ScottPletcher SSC Guru Points: 100942 · Answer 6

If possible, please attach the actual query plan xml as an xml file, rather than just a picture of the plan. There are row counts and other stats available in the query plan that are critical to analyzing it but that can't be seen from a static picture alone.
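
A minimal sketch of one way to capture the actual plan as XML, assuming the query from the original post: with SET STATISTICS XML ON, the statement runs normally and returns an extra result containing the plan, which can be opened in Management Studio and saved to a file.

```sql
-- Sketch only: run the query and return its actual execution plan as XML.
SET STATISTICS XML ON;

SELECT B.BOEKOMSCHRIJVING, B.BOEKCODE -- or R.BOEKCODE
FROM PA
INNER JOIN PR ON PA.PAARDCODE = PR.PAARDCODE
INNER JOIN R ON R.REGISTERCODE = PR.REGISTERCODE
INNER JOIN B ON R.BOEKCODE = B.BOEKCODE
GROUP BY BOEKOMSCHRIJVING, B.BOEKCODE -- or R.BOEKCODE
ORDER BY BOEKOMSCHRIJVING;

SET STATISTICS XML OFF;
```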

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Peter Kruis Mr or Mrs. 500 Points: 577 · Answer 7

Hi,

Sorry for the late response; I was out of town for the weekend. The zip file should be attached.

Peter

ScottPletcher SSC Guru Points: 100942 · Answer 8

Interesting, but not definitive.

Would you please run these commands on that database and post the results? That will show what indexes SQL "thinks" are missing, and how existing indexes are being used:

USE [DH_KWPN]

SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

DECLARE @list_missing_indexes bit
DECLARE @table_name_pattern sysname

--NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.
SET @list_missing_indexes = 1 --1=list missing index(es); 0=don't.

PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

IF @list_missing_indexes = 1
BEGIN
    SELECT
        GETDATE() AS capture_date,
        DB_NAME(mid.database_id) AS Db_Name,
        OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,
        mid.equality_columns, mid.inequality_columns, mid.included_columns,
        ca1.max_days_active,
        migs.*,
        mid.statement, mid.object_id, mid.index_handle
    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)
    CROSS APPLY (
        SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'
    ) AS ca1
    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON
        mig.index_handle = mid.index_handle
    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON
        migs.group_handle = mig.index_group_handle
    --order by
        --DB_NAME, Table_Name, equality_columns
    WHERE
        1 = 1
        AND mid.database_id = DB_ID()
        AND OBJECT_NAME(mid.object_id) IN (
            'SHA_PAARDACT',
            'SHA_PAARDREGISTER',
            'SHA_REGISTER',
            'SHA_BOEK'
        )
    ORDER BY
        --avg_total_user_cost * (user_seeks + user_scans) DESC,
        Db_Name, Table_Name, equality_columns, inequality_columns
END --IF

PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)

-- list index usage stats (seeks, scans, etc.)
SELECT
    ius2.row_num, DB_NAME() AS db_name,
    i.name AS index_name,
    OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name,
    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,
    dps.row_count,
    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,
    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,
    fk.Reference_Count AS fk_ref_count,
    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,
    ca1.max_days_active,
    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,
    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update
FROM sys.indexes i WITH (NOLOCK)
INNER JOIN sys.objects o WITH (NOLOCK) ON
    o.object_id = i.object_id
CROSS APPLY (
    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'
) AS ca1
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal > 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        ic.key_ordinal
    FOR XML PATH('')
) AS key_cols (key_cols)
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal = 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        COL_NAME(object_id, ic.column_id)
    FOR XML PATH('')
) AS nonkey_cols (nonkey_cols)
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
    dps.object_id = i.object_id AND
    dps.index_id = i.index_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
    ius.database_id = DB_ID() AND
    ius.object_id = i.object_id AND
    ius.index_id = i.index_id
LEFT OUTER JOIN (
    SELECT
        database_id, object_id, MAX(user_scans) AS user_scans,
        ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans
    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)
    WHERE
        database_id = DB_ID()
        --AND index_id > 0
    GROUP BY
        database_id, object_id
) AS ius2 ON
    ius2.database_id = DB_ID() AND
    ius2.object_id = i.object_id
LEFT OUTER JOIN (
    SELECT
        referenced_object_id, COUNT(*) AS Reference_Count
    FROM sys.foreign_keys WITH (NOLOCK)
    WHERE
        is_disabled = 0
    GROUP BY
        referenced_object_id
) AS fk ON
    fk.referenced_object_id = i.object_id
WHERE
    i.object_id > 100 AND
    i.is_hypothetical = 0 AND
    i.type IN (0, 1, 2) AND
    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND
    (
        o.name IN (
            'SHA_PAARDACT',
            'SHA_PAARDREGISTER',
            'SHA_REGISTER',
            'SHA_BOEK'
        )
    )
ORDER BY
    --row_count DESC,
    --ius.user_scans DESC,
    --ius2.row_num, --user_scans+user_seeks
    -- list clustered index first, if any, then other index(es)
    db_name, table_name, CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END, index_name

PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

Edit: Corrected CROSS APPLY column alias.

Peter Kruis Mr or Mrs. 500 Points: 577 · Answer 9

Hi Scott,

The results are attached (hopefully in a usable format).

Peter

ScottPletcher SSC Guru Points: 100942 · Answer 10

I think the Query1 data got scrambled or something.

Can you use a spreadsheet instead? After you run the queries, in the Results/Output area, left-click in the empty box to the left of the first column name, which should highlight the entire result set. Then right-click, choose "Copy with Headers", and paste it into the spreadsheet. Both results can go into the same spreadsheet.

Based on my analysis of the Query2 results so far, you'd definitely gain performance by clustering SHA_PAARDREGISTER on PAARDCODE instead of on PAARDREGISTERCODE, although you'd then definitely want to create a nonclustered index on PAARDREGISTERCODE.

Peter Kruis Mr or Mrs. 500 Points: 577 · Answer 11

July 17, 2014 at 1:33 am

#1730139

As requested, here are the results in grid format as well.

ScottPletcher SSC Guru Points: 100942 · Answer 12

Yep, looks great, thanks!

Here are my scripted recommendations for index changes/rebuilds. I don't have time right now, but I can explain more later if/when you have questions. Hope this helps!

------------------------------------------------------------------------------------------------------------------------
--Table: SHA_PAARDACT

-- DROPs
DROP INDEX [IDX_FPS_PAARDACT_ACTUEEL] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT
--!!(see Comment1 below in CREATEs)
DROP INDEX [IDX_FPS_PAARDACT_GEBOORTEDATUM] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT
DROP INDEX [IDX_FPS_PAARDACT_GESLACHT] ON SHA_PAARDACT --to be recreated with included columns
DROP INDEX [IDX_FPS_PAARDACT_KEURSEIZ] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT
DROP INDEX [IDX_FPS_PAARDACT_REGISTER] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT
DROP INDEX [IX_SHA_PAARDACT] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT

-- CREATEs
CREATE NONCLUSTERED INDEX [IDX_FPS_PAARDACT_EXPORTDATUM_IMPORTDATUM]
    ON SHA_PAARDACT ( EXPORTDATUM, IMPORTDATUM )
    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )
    ON [PRIMARY]

--Comment1: i think [GEBOORTEDATUM] is birth date??(name looks Dutch(?), but based on my very limited knowledge of German);
-- if instead it's a long (~20+ bytes) column, you can leave it in its own index and remove it from this one
CREATE UNIQUE NONCLUSTERED INDEX [IDX_FPS_PAARDACT_GESLACHT]
    ON SHA_PAARDACT ( GESLACHT, PAARDCODE )
    INCLUDE ( ACTUEEL, DATUMOVERLIJDEN, GEBOORTEDATUM, KEURINGSSEIZOEN, REGISTERCODE )
    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )
    ON [PRIMARY]

------------------------------------------------------------------------------------------------------------------------
--Table: SHA_PAARDREGISTER !*changing clustering key*!

-- DROPs
DROP INDEX [IDX_FPS_PAARDREGISTER_DATUMTOT] ON SHA_PAARDREGISTER --will be covered by recreated IDX_FPS_PAARDREGISTER_REG
DROP INDEX [IDX_FPS_PAARDREGISTER_DATUMVAN] ON SHA_PAARDREGISTER --will be covered by recreated IDX_FPS_PAARDREGISTER_REG
DROP INDEX [IDX_FPS_PAARDREGISTER_PAARD] ON SHA_PAARDREGISTER --will become new clustering key, so nonclus index no longer needed
DROP INDEX [IDX_FPS_PAARDREGISTER_REG] ON SHA_PAARDREGISTER --to be recreated with included columns
ALTER TABLE SHA_PAARDREGISTER DROP CONSTRAINT [PK_FPS_PAARDREGISTER] --changing clustered index, must drop existing one first

-- CREATEs
CREATE UNIQUE CLUSTERED INDEX [CLX_FPS_PAARDREGISTER]
    ON SHA_PAARDREGISTER ( PAARDCODE, PAARDREGISTERCODE )
    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )
    ON [PRIMARY]

ALTER TABLE SHA_PAARDREGISTER
    ADD CONSTRAINT [PK_FPS_PAARDREGISTER] PRIMARY KEY ( PAARDREGISTERCODE )
    WITH ( FILLFACTOR = 100, ONLINE = ON, SORT_IN_TEMPDB = ON )
    ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IDX_FPS_PAARDREGISTER_REG]
    ON SHA_PAARDREGISTER ( REGISTERCODE )
    INCLUDE ( DATUMVAN, DATUMTOT, PAARDCODE )
    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )
    ON [PRIMARY]

------------------------------------------------------------------------------------------------------------------------
--Table: SHA_REGISTER

-- DROPs
DROP INDEX [IDX_FPS_REGISTER_BOEK] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER
DROP INDEX [IDX_FPS_REGISTER_GESLACHT] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER
DROP INDEX [IDX_FPS_REGISTER_PREDIKAAT] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER
DROP INDEX [IDX_FPS_REGISTER_REGISTERTYPE] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER

Peter Kruis Mr or Mrs. 500 Points: 577 · Answer 13

Hi Scott,

When I try to run the scripts I get:

'Online index operations can only be performed in Enterprise edition of SQL Server.'

Is there a way to do this on a non-Enterprise edition?

Peter

ScottPletcher SSC Guru Points: 100942 · Answer 14

Sorry, sure, just remove "ONLINE = ON," from all the index commands.
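
For illustration, this is how one of the statements from the script would look with the option removed; the same change would apply to every other command that specifies ONLINE = ON. The edition check at the top is an optional extra, not part of the original script:

```sql
-- Sketch only: confirm which edition the instance is running.
SELECT SERVERPROPERTY('Edition') AS edition;

-- One of the CREATE statements from the script, with "ONLINE = ON," removed.
CREATE NONCLUSTERED INDEX [IDX_FPS_PAARDREGISTER_REG]
    ON SHA_PAARDREGISTER ( REGISTERCODE )
    INCLUDE ( DATUMVAN, DATUMTOT, PAARDCODE )
    WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON )
    ON [PRIMARY];
```

Without ONLINE = ON the index build holds locks on the table while it runs, so it may be worth scheduling these changes for a quiet period.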
