query optimizer issue

  • Hi,

    I'm running into an sql query optimization issue that I would like to share and get help to overcome it best.

    The environment is based on SQL 2008 Express SP1.

    Here is the issue in simplified form.

    Two tables:

    City ( [id] int, [CountryCode] char(2) )

    CityNameAliases ( [id] int, [CityNameAlias] nvarchar(200) )

    Table [City] has about 2.4 million records and [City].[id] is PRIMARY KEY. I also have NON-Unique index on [CountryCode].

    CityNameAliases has about 2.7 million records and has NON-Unique index on [id].

    The goal is to get City Ids that are from cpecific country and have at least one alias matching pattern '%pattern%'.

    The obvious queries that should do the job for country 'cc' are:

    SELECT c.id FROM City c

    WHERE c.CountryCode='cc' AND c.id IN

    (SELECT id FROM CityNameAliases WHERE CityNameAlias LIKE '%pattern%')

    -- OR --

    SELECT cna.id FROM CityNameAliases cna

    WHERE cna.id IN (SELECT id FROM City WHERE CountryCode='cc')

    AND cna.CityNameAlias LIKE '%pattern%'

    -- OR the corresponding JOIN-type queries.

    The problem is that the server always scans the whole CountryCodeAliases to match the pattern (which is very expensive) and then filters the resulting subset by CountryCode criteria (which is cheap).

    It should definetely scan first by CountryCode and then by the search pattern.

    I have managed to get the desired behaviour by declaring a table variable, then select all Cities and NameAliases by countrycode and inserting them into the table variable and THEN filter by the pattern. This way I got execution times 8 times shorter.

    Please, advice!

  • Sam Lowry (6/3/2009)


    Hi,

    I'm running into an sql query optimization issue that I would like to share and get help to overcome it best.

    The environment is based on SQL 2008 Express SP1.

    Here is the issue in simplified form.

    Two tables:

    City ( [id] int, [CountryCode] char(2) )

    CityNameAliases ( [id] int, [CityNameAlias] nvarchar(200) )

    Table [City] has about 2.4 million records and [City].[id] is PRIMARY KEY. I also have NON-UNIQUE index on [CountryCode].

    CityNameAliases has about 2.7 million records.

    The goal is to get City Ids that are from cpecific country and have at least one alias matching pattern '%pattern%'.

    The obvious queries that should do the job for country 'cc' are:

    SELECT c.id FROM City c

    WHERE c.CountryCode='cc' AND c.id IN

    (SELECT id FROM CityNameAliases WHERE CityNameAlias LIKE '%pattern%')

    -- OR --

    SELECT cna.CountryCode FROM CityNameAliases cna

    WHERE cna.id IN (SELECT id FROM City WHERE CountryCode='cc')

    AND cna.CityNameAlias LIKE '%pattern%'

    -- OR the corresponding JOIN-type queries.

    The problem is that the server always scans the whole CountryCodeAliases to match the pattern (which is very expensive) and then filters the resulting subset by CountryCode criteria (which is cheap).

    It should definetely scan first by CountryCode and then by the search pattern.

    I have managed to get the desired behaviour by declaring a table variable, then select all Cities and NameAliases by countrycode and inserting them into the table variable and THEN filter by the pattern. This way I got execution times 8 times shorter.

    Please, advice!

    How about this?

    with CTE AS

    (

    SELECT c.id

    FROM City c

    WHERE c.CountryCode='cc'

    )

    SELECT id

    FROM CityNameAliases

    WHERE id IN (select id from CTE)

    and CityNameAlias LIKE '%pattern%'

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • as I understand it, the LIKE '%pattern%' will always require a tablescan, since parts of the string are not indexed.

    if the pattern is 'pattern%', so the value always starts with the pattern, you might get better performance, but that might not be what your application needs.

    Have you considered a full text index on the column? that might help I think.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Some additional information is needed to assist you, so please run the below two sets of SQL statements to gather index statistical information and post the output. Note that for the second SQL that runs DBCC, you need to run the result and the post that output.

    Please be sure to set the "results to text" on

    DECLARE @Table_name sysname

    SET@Table_name = 'OrgUnitHierarchy'

    -- run the output of this SQL

    SELECT'print ''Statistics for ' + sys.indexes.name + '''

    go

    DBCC SHOW_STATISTICS ( ''' + @Table_name + ''' , ' + sys.indexes.name + ') WITH NO_INFOMSGS, HISTOGRAM '

    FROMsys.tables

    JOINsys.indexes

    ON sys.indexes.OBJECT_ID = sys.tables.OBJECT_ID

    WHERE sys.tables.name = @Table_name

    SELECTsys.tables.name

    ,sys.indexes.name

    ,sys.indexes.index_id

    ,STATS_DATE(sys.tables.OBJECT_ID , sys.indexes.index_id ) AS StatisticUpdateTs

    FROMsys.tables

    JOINsys.indexes

    ON sys.indexes.OBJECT_ID = sys.tables.OBJECT_ID

    WHEREsys.tables.name = @Table_name

    SQL = Scarcely Qualifies as a Language

  • Wayne,

    The actual execution plan of the query you propose is the same as with the two quries I posted. Same poor performance ;-(

  • Lowell,

    as I understand it, the LIKE '%pattern%' will always require a tablescan, since parts of the string are not indexed.

    if the pattern is 'pattern%', so the value always starts with the pattern, you might get better performance, but that might not be what your application needs.

    Exactly.

    Have you considered a full text index on the column? that might help I think.

    Full text search matches by a whole keyword or phrase. I need to filter by part of the city name alias which part may be in the middle of the alias.

  • Carl Federl,

    I'll run the statements later today and provide the result.

  • Carl Federl,

    The statistics you requested are available here.

    I'm ready to post them is CSV/XLS format if required. Just let me know.

    Thanks in advance.

  • Thanks Sam, looking at the statistics right now and will get back to you soon.

    Can you advise the frequency of changes (insert,updates and deletes)?

    Is this user entered on a one by one basis or mass updated on a scheduled basis ?

    Are there any other columns in the tables expecially any variable length datatypes or columns that are updated between null and a value ?

    SQL = Scarcely Qualifies as a Language

  • Carl,

    Can you advise the frequency of changes (insert,updates and deletes)?

    Is this user entered on a one by one basis or mass updated on a scheduled basis ?

    The tables are fresh with only one bulk insert of all records. No updates or deletes so far.

    Are there any other columns in the tables expecially any variable length datatypes or columns that are updated between null and a value ?

    No other columns. I created a test database especialy to deal with this issue. The two tables schemas are exactly as specified in my first post.

  • The index statistics for the IX_City show a high degree of skewness with the top 10 countries having 42% of the total rows and the top 23 having 60% of the total rows. A non-skewed distribution would have each country having the same number of citites. This means that when a country with a large percentage of the cities is searched, the least cost execution plan is to perform a table scan but when a country with a low percentage of the cities is searched, a nested loop index scan may have the least cost.

    RANGE_HI_KEYEQ_ROWSRankPct of TotalRuning TotalRunning Pct

    RU179,73517.36179,7357.36%

    ID174,05227.13353,78714.49%

    US167,27736.85521,06421.34%

    CN102,03644.18623,10025.52%

    DE96,92853.97720,02829.49%

    PK86,21163.53806,23933.03%

    FR73,15473.00879,39336.02%

    IR63,96182.62943,35438.64%

    MX46,65291.91990,00640.55%

    TR41,291101.691,031,29742.25%

    PE41,014111.681,072,31143.93%

    KP39,856121.631,112,16845.56%

    NG39,547131.621,151,71447.18%

    BR38,474141.581,190,18948.75%

    PH36,114151.481,226,30350.23%

    KR35,764161.461,262,06651.70%

    PL35,340171.451,297,40653.15%

    TH33,607181.381,331,01354.52%

    AF33,395191.371,364,40855.89%

    ES29,478201.211,393,88657.10%

    BD28,234211.161,422,12058.25%

    LK27,492221.131,449,61259.38%

    SE27,476231.131,477,08860.51%

    I am going to assume that there is similar degree of skewness in the city name alias within country. You can confirm this by running:

    SELECTCity.CountryCode

    ,COUNT(*) AS CityNameAliasCnt

    FROMCity

    JOINCityNameAliases

    on CityNameAliases.Id= City.Id

    GROUP BY City.CountryCode

    ORDER BY COUNT(*) DESC

    Recommendations:

    A) As the search criteria includes CountryCode, the optimal index should have CountryCode and CityNameAlias , in that order, as the first two indexed columns. As these two columns are in different tables, either:

    1) redundantly have this column in the CityNameAliases table.

    2) Create a materialized view based on joining the City and CityNameAliases table.

    Additionally:

    1) This index should cover the query by including the City.id as the third column in the index.

    2) There may be a benefit from defining this index as unique and clustered.

    B) If the data is updated on an infrequent basis and there are no updates that increase the row length, use a fill factor of 100% to reduce the number of pages, which will reduce the disk IO and pages in memory to a minimum. Please also check for any index optimizations routines that specify using the server level default fill factor and revise the definition to instead use the index specific fill factor.

    CREATETABLE CityNameAliases

    ( [id] intNOT NULL

    , [CountryCode] char(2)NOT NULL

    , [CityNameAlias] nvarchar(200) ) NOT NULL

    )

    CREATE UNIQUE CLUSTERED INDEX CityNameAliases_X on CityNameAliases

    (CountryCode, CityNameAlias, [id])

    WITH FILLFACTOR = 100

    go

    The query would then be:

    SELECTDISTINCT CityNameAliases.[Id]

    FROMCityNameAliases

    WHERE CityNameAliases.CountryCode='cc'

    ANDCityNameAliases.CityNameAlias LIKE '%pattern%'

    The execution plan should show an index seek and the maximum logical reads would be for the country with the most city aliases ( RU with 179,735 cities).

    Good Luck and hope this helps.

    SQL = Scarcely Qualifies as a Language

  • Carl,

    Thank you very much for your last post. Every single word makes perfect sense.

    It's obvious that dealing with your recommendations will take some time, so I'll post the results in a couple of days.

    Thanks.

  • Here are the results of applying Carl's recommendations:

    1. Materialized view:

    I Created a view with schemabinding option combining id,CountryCode and CityNameAlias. Created clustered index on (CountryCode,CityNameAlias). There was no performance improvement at all. The actual execution plan was the same as with the original queries - 83% cost table scan (CityNameAlias). Typical query execution time - 9 seconds.

    2. New table with exactly the same structure/scheme as the view described above - dramatic improvement! Finally index seek instead of full table scan. Typical query execution time - 2 seconds.

    The execution plan report even recommended creating new index ON CountryCode with "INCLUDE (CityNameAlias)" option. This new index even improved performance further - typical query execution time - 0-1 second.

  • For the materialized view, you indicate that the actual execution plan is using the base tables rather than the materialized view, which indicates that you are using SQL Server Standard Edition. Only Enterprise Edition and Developer Edition will not replace a materialized view with the underlying view definition unless you include the "WITH (NOEXPAND)" hint after the view name.

    To continue the topic, is this data from geonames.org ? Be aware that their data structure is denormalized and if you make your tables match their structure, you will encounter performance problems.

    SQL = Scarcely Qualifies as a Language

  • Carl Federl (6/8/2009)


    For the materialized view, you indicate that the actual execution plan is using the base tables rather than the materialized view, which indicates that you are using SQL Server Standard Edition. Only Enterprise Edition and Developer Edition will not replace a materialized view with the underlying view definition unless you include the "WITH (NOEXPAND)" hint after the view name.

    I understand now. The final application will run on SQL Enterprise Edition, so the view will do the job.

    To continue the topic, is this data from geonames.org ? Be aware that their data structure is denormalized and if you make your tables match their structure, you will encounter performance problems.

    Yes, the data is from geonames.org and I'm aware that their dump files don't represent their internal data structures in normalized form. We have developed special tools for dealing with these dump files and importing/refreshing our sql data.

    Thanks Carl.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply