Group by performance FK / PK

  • Hello all,

    I have something I can't explain, hope some of you can?

    We have a medium sized database with the next tables:

    - PA: 525000 records

    - PR: 780000 records

    - R: 1000 records

    - B: 45 records

    PA: PK = PAARDCODE

    PR: PK = PAARDREGISTERCODE

    PR: FK = PAARDCODE

    PR: FK = REGISTERCODE

    R: PK = REGISTERCODE

    R: FK = BOEKCODE

    B: PK = BOEKCODE

    When I group by B.BOEKCODE the query lasts: 10 (or more when 'where' option is added) seconds

    When I group by R.BOEKCODE the query lasts less than 2 seconds.

    SELECT B.BOEKOMSCHRIJVING, B.BOEKCODE -- or R.BOEKCODE

    FROM PA

    INNER JOIN PR ON PA.PAARDCODE = PR.PAARDCODE

    INNER JOIN R ON R.REGISTERCODE = PR.REGISTERCODE

    INNER JOIN B ON R.BOEKCODE = B.BOEKCODE

    GROUP BY BOEKOMSCHRIJVING, B.BOEKCODE -- or R.BOEKCODE

    ORDER BY BOEKOMSCHRIJVING

    Why is the option of B.BOEKCODE a lot slower compared to R.BOEKCODE?

    Thanks in advance,

    Peter

  • Have you looked at the query's execution plan? This would likely reveal the reason. Also, is there more than one record in the table with alias B for a given BOEKCODE value ? In that case you may be traversing a lot of records that aren't needed, as including the B table in the result set requires retreiving all the records from that table for that BOEKCODE value. As you are using GROUP BY, you don't need the B table data if you aren't aggregating anything from it, so using the R table's version is fine.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson (7/10/2014)


    Have you looked at the query's execution plan? This would likely reveal the reason. Also, is there more than one record in the table with alias B for a given BOEKCODE value ? In that case you may be traversing a lot of records that aren't needed, as including the B table in the result set requires retreiving all the records from that table for that BOEKCODE value. As you are using GROUP BY, you don't need the B table data if you aren't aggregating anything from it, so using the R table's version is fine.

    Hi Sgmunson,

    Thanks for looking into this, I will look at the execution plan to see what is happening.

    From the 'B' table, I need the 'BOEKOMSCHRIJVING', that's why I've added the 'B' table.

    For the extra records, Every PA has (at least) one PR, each PR contains 1 R, and each R has one B

  • Given your additional info, this suggests that once you have the query plan, take a close look at the indexes on the B table. See if there are any that have both the BOEKCODE as well as the other selected field. I'm guessing not, and by selecting that field from the B table, you have to traverse all the records rather than hit an index that just contains the other selected field, but I wouldn't want to offer any guarantee on that. The query plan will likely make things clear as to what's happening when you select it from the B table instead of the R table. You may be able to anticipate the plan's answer just by running the following query:

    SELECT BOEKCODE, BOEKOMSCHRIJVING, COUNT(*)

    FROM B

    GROUP BY BOEKCODE, BOEKOMSCHRIJVING

    HAVING COUNT(*) > 1

    If that query returns any rows, you might want to substitute the following into your query in place of the B table:

    SELECT DISTINCT BOEKCODE, BOEKOMSCHRIJVING

    FROM B

    peter 67432 (7/10/2014)


    sgmunson (7/10/2014)


    Have you looked at the query's execution plan? This would likely reveal the reason. Also, is there more than one record in the table with alias B for a given BOEKCODE value ? In that case you may be traversing a lot of records that aren't needed, as including the B table in the result set requires retreiving all the records from that table for that BOEKCODE value. As you are using GROUP BY, you don't need the B table data if you aren't aggregating anything from it, so using the R table's version is fine.

    Hi Sgmunson,

    Thanks for looking into this, I will look at the execution plan to see what is happening.

    From the 'B' table, I need the 'BOEKOMSCHRIJVING', that's why I've added the 'B' table.

    For the extra records, Every PA has (at least) one PR, each PR contains 1 R, and each R has one B

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Hi Steve,

    I checked the query execution plan, however it is the first time I am looking in such thing, and I have no idea how to see what is wrong. I tried to follow this website: link[/url], however I still don't understand it, can you help me with reading this plan?

    Thanks in advance!

    Peter

    The slow one:

    The fast one, also divided in 2 screens for readability.

  • One thing I noticed right away was that the "slow" one has two "Nested Loop" icons, while the "fast" one does not, and includes parallelism. Taking advantage of parallel operations usually speeds things up. Having no nested loops is also an advantage. I may not be an expert at reading these plans, but the conclusion I can draw from seeing the nested loop plan going a lot slower than one that takes good advantage of the indexes and operates in parallel is that my original thinking was quite likely correct. Having to retrieve the BOEKCODE from the B table is costing a lot, as it probably comes from the index in the other table, so not having to traverse the individual records from the B table gets you out of a costly pair of nested loops. Next time you want to post a SQL execution plan, be sure to save the plan with the .sqlplan extension, then zip the file and attach it to your post. It is much easier for others to look at the plan that way, and because I can't see the properties of the individual icons, I can't be certain of my conclusion, but I don't think I'm likely to learn differently by examining those properties in this specific case. Someone with more experience reading them might be in a better position to do so, and to better explain it than I can.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • If possible, please attach the actual query plan xml as an xml file, rather than just a picture of the plan. There are row counts and other stats available in the query plan that are critical to analyzing it but that can't be seen from a static picture alone.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Hi,

    Sorry for the late response, was a weekend out of town. The zip file should be attached.

    Peter

  • Interesting, but not definitive.

    Would you please run these commands on that database and post the results? That will show what indexes SQL "thinks" are missing, and how existing indexes are being used:

    USE [DH_KWPN]

    SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

    DECLARE @list_missing_indexes bit

    DECLARE @table_name_pattern sysname

    --NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.

    SET @list_missing_indexes = 1 --1=list missing index(es); 0=don't.

    PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

    IF @list_missing_indexes = 1

    BEGIN

    SELECT

    GETDATE() AS capture_date,

    DB_NAME(mid.database_id) AS Db_Name,

    OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,

    mid.equality_columns, mid.inequality_columns, mid.included_columns,

    ca1.max_days_active,

    migs.*,

    mid.statement, mid.object_id, mid.index_handle

    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)

    CROSS APPLY (

    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON

    mig.index_handle = mid.index_handle

    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON

    migs.group_handle = mig.index_group_handle

    --order by

    --DB_NAME, Table_Name, equality_columns

    WHERE

    1 = 1

    AND mid.database_id = DB_ID()

    AND OBJECT_NAME(mid.object_id) IN (

    'SHA_PAARDACT',

    'SHA_PAARDREGISTER',

    'SHA_REGISTER',

    'SHA_BOEK'

    )

    ORDER BY

    --avg_total_user_cost * (user_seeks + user_scans) DESC,

    Db_Name, Table_Name, equality_columns, inequality_columns

    END --IF

    PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)

    -- list index usage stats (seeks, scans, etc.)

    SELECT

    ius2.row_num, DB_NAME() AS db_name,

    i.name AS index_name,

    OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name,

    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,

    dps.row_count,

    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,

    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,

    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,

    fk.Reference_Count AS fk_ref_count,

    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,

    ca1.max_days_active,

    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,

    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update

    FROM sys.indexes i WITH (NOLOCK)

    INNER JOIN sys.objects o WITH (NOLOCK) ON

    o.object_id = i.object_id

    CROSS APPLY (

    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal > 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    ic.key_ordinal

    FOR XML PATH('')

    ) AS key_cols (key_cols)

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal = 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    COL_NAME(object_id, ic.column_id)

    FOR XML PATH('')

    ) AS nonkey_cols (nonkey_cols)

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = i.object_id AND

    dps.index_id = i.index_id

    LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON

    ius.database_id = DB_ID() AND

    ius.object_id = i.object_id AND

    ius.index_id = i.index_id

    LEFT OUTER JOIN (

    SELECT

    database_id, object_id, MAX(user_scans) AS user_scans,

    ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans

    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)

    WHERE

    database_id = DB_ID()

    --AND index_id > 0

    GROUP BY

    database_id, object_id

    ) AS ius2 ON

    ius2.database_id = DB_ID() AND

    ius2.object_id = i.object_id

    LEFT OUTER JOIN (

    SELECT

    referenced_object_id, COUNT(*) AS Reference_Count

    FROM sys.foreign_keys WITH (NOLOCK)

    WHERE

    is_disabled = 0

    GROUP BY

    referenced_object_id

    ) AS fk ON

    fk.referenced_object_id = i.object_id

    WHERE

    i.object_id > 100 AND

    i.is_hypothetical = 0 AND

    i.type IN (0, 1, 2) AND

    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND

    (

    o.name IN (

    'SHA_PAARDACT',

    'SHA_PAARDREGISTER',

    'SHA_REGISTER',

    'SHA_BOEK'

    )

    )

    ORDER BY

    --row_count DESC,

    --ius.user_scans DESC,

    --ius2.row_num, --user_scans+user_seeks

    -- list clustered index first, if any, then other index(es)

    db_name, table_name, CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END, index_name

    PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

    Edit: Corrected CROSS APPLY column alias.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Hi Scott,

    In attachment the results (hopefully good format)

    Peter

  • I think that Query1 data got scrambled or something.

    Can you use a spreadsheet instead? After you run the queries, in the Results/Output area, left-click in the empty box to the left of the first column name, which should highlight the entire result set. Then right-click, and do "Copy with Headers" and paste it into the spreadsheet. Both results can go into the same spreadsheet.

    Based on analysis so far of the query2, you'd definitely gain performance by clustering SHA_PAARDREGISTER on PAARDCODE instead of on PAARDREGISTERCODE, although you'd definitely want to create a nonclustered index on PAARDREGISTERCODE.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • As asked also the results in grid format.

  • Yep, looks great, thanks!

    Here's my scripted recommendations for index changes/rebuilds. I don't have time right now, but can explain more later if/when you have qs. Hope this helps!

    ------------------------------------------------------------------------------------------------------------------------

    --Table: SHA_PAARDACT

    -- DROPs

    DROP INDEX [IDX_FPS_PAARDACT_ACTUEEL] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT

    --!!(see Comment1 below in CREATEs)

    DROP INDEX [IDX_FPS_PAARDACT_GEBOORTEDATUM] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT

    DROP INDEX [IDX_FPS_PAARDACT_GESLACHT] ON SHA_PAARDACT --to be recreated with included columns

    DROP INDEX [IDX_FPS_PAARDACT_KEURSEIZ] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT

    DROP INDEX [IDX_FPS_PAARDACT_REGISTER] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT

    DROP INDEX [IX_SHA_PAARDACT] ON SHA_PAARDACT --will be covered by recreated IDX_FPS_PAARDACT_GESLACHT

    -- CREATEs

    CREATE NONCLUSTERED INDEX [IDX_FPS_PAARDACT_EXPORTDATUM_IMPORTDATUM]

    ON SHA_PAARDACT ( EXPORTDATUM, IMPORTDATUM )

    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )

    ON [PRIMARY]

    --Comment1: i think [GEBOORTEDATUM] is birth date??(name looks Dutch(?), but based on my very limited knowledge of German);

    -- if instead it's a long (~20+ bytes) column, you can leave it in its own index and remove it from this one

    CREATE UNIQUE NONCLUSTERED INDEX [IDX_FPS_PAARDACT_GESLACHT]

    ON SHA_PAARDACT ( GESLACHT, PAARDCODE )

    INCLUDE ( ACTUEEL, DATUMOVERLIJDEN, GEBOORTEDATUM, KEURINGSSEIZOEN, REGISTERCODE )

    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )

    ON [PRIMARY]

    ------------------------------------------------------------------------------------------------------------------------

    --Table: SHA_PAARDREGISTER !*changing clustering key*!

    -- DROPs

    DROP INDEX [IDX_FPS_PAARDREGISTER_DATUMTOT] ON SHA_PAARDREGISTER --will be covered by recreated IDX_FPS_PAARDREGISTER_REG

    DROP INDEX [IDX_FPS_PAARDREGISTER_DATUMVAN] ON SHA_PAARDREGISTER --will be covered by recreated IDX_FPS_PAARDREGISTER_REG

    DROP INDEX [IDX_FPS_PAARDREGISTER_PAARD] ON SHA_PAARDREGISTER --will become new clustering key, so nonclus index no longer needed

    DROP INDEX [IDX_FPS_PAARDREGISTER_REG] ON SHA_PAARDREGISTER --to be recreated with included columns

    ALTER TABLE SHA_PAARDREGISTER DROP CONSTRAINT [PK_FPS_PAARDREGISTER] --changing clustered index, must drop existing one first

    -- CREATEs

    CREATE UNIQUE CLUSTERED INDEX [CLX_FPS_PAARDREGISTER]

    ON SHA_PAARDREGISTER ( PAARDCODE, PAARDREGISTERCODE )

    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )

    ON [PRIMARY]

    ALTER TABLE SHA_PAARDREGISTER

    ADD CONSTRAINT [PK_FPS_PAARDREGISTER] PRIMARY KEY ( PAARDREGISTERCODE )

    WITH ( FILLFACTOR = 100, ONLINE = ON, SORT_IN_TEMPDB = ON )

    ON [PRIMARY]

    CREATE NONCLUSTERED INDEX [IDX_FPS_PAARDREGISTER_REG]

    ON SHA_PAARDREGISTER ( REGISTERCODE )

    INCLUDE ( DATUMVAN, DATUMTOT, PAARDCODE )

    WITH ( FILLFACTOR = 99, ONLINE = ON, SORT_IN_TEMPDB = ON )

    ON [PRIMARY]

    ------------------------------------------------------------------------------------------------------------------------

    --Table: SHA_REGISTER

    -- DROPs

    DROP INDEX [IDX_FPS_REGISTER_BOEK] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER

    DROP INDEX [IDX_FPS_REGISTER_GESLACHT] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER

    DROP INDEX [IDX_FPS_REGISTER_PREDIKAAT] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER

    DROP INDEX [IDX_FPS_REGISTER_REGISTERTYPE] ON SHA_REGISTER --will be covered by UIX_FPS_REGISTER

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Hi Scott,

    When I try to run the scripts I get:

    'Online index operations can only be performed in Enterprise edition of SQL Server.'

    Is there a way to do this in the not enterprise version?

    Peter

  • peter 67432 (7/25/2014)


    Hi Scott,

    When I try to run the scripts I get:

    'Online index operations can only be performed in Enterprise edition of SQL Server.'

    Is there a way to do this in the not enterprise version?

    Peter

    Sorry, sure, just remove "ONLINE = ON," from all the index commands.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply