query optimizer issue

Question

query optimizer issue

Sam Lowry

SSC Journeyman

Points: 95
More actions
June 3, 2009 at 12:10 pm

#97661

Hi,
I'm running into an sql query optimization issue that I would like to share and get help to overcome it best.
The environment is based on SQL 2008 Express SP1.
Here is the issue in simplified form.
Two tables:
City ( [id] int, [CountryCode] char(2) )
CityNameAliases ( [id] int, [CityNameAlias] nvarchar(200) )
Table [City] has about 2.4 million records and [City].[id] is PRIMARY KEY. I also have NON-Unique index on [CountryCode].
CityNameAliases has about 2.7 million records and has NON-Unique index on [id].
The goal is to get City Ids that are from cpecific country and have at least one alias matching pattern '%pattern%'.
The obvious queries that should do the job for country 'cc' are:
SELECT c.id FROM City c
WHERE c.CountryCode='cc' AND c.id IN
(SELECT id FROM CityNameAliases WHERE CityNameAlias LIKE '%pattern%')
-- OR --
SELECT cna.id FROM CityNameAliases cna
WHERE cna.id IN (SELECT id FROM City WHERE CountryCode='cc')
AND cna.CityNameAlias LIKE '%pattern%'
-- OR the corresponding JOIN-type queries.
The problem is that the server always scans the whole CountryCodeAliases to match the pattern (which is very expensive) and then filters the resulting subset by CountryCode criteria (which is cheap).
It should definetely scan first by CountryCode and then by the search pattern.
I have managed to get the desired behaviour by declaring a table variable, then select all Cities and NameAliases by countrycode and inserting them into the table variable and THEN filter by the pattern. This way I got execution times 8 times shorter.
Please, advice!

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

WayneS SSC Guru Points: 95461 More actions · Answer 1

Sam Lowry (6/3/2009)
Hi,
I'm running into an sql query optimization issue that I would like to share and get help to overcome it best.
The environment is based on SQL 2008 Express SP1.
Here is the issue in simplified form.
Two tables:
City ( [id] int, [CountryCode] char(2) )
CityNameAliases ( [id] int, [CityNameAlias] nvarchar(200) )
Table [City] has about 2.4 million records and [City].[id] is PRIMARY KEY. I also have NON-UNIQUE index on [CountryCode].
CityNameAliases has about 2.7 million records.
The goal is to get City Ids that are from cpecific country and have at least one alias matching pattern '%pattern%'.
The obvious queries that should do the job for country 'cc' are:
SELECT c.id FROM City c
WHERE c.CountryCode='cc' AND c.id IN
(SELECT id FROM CityNameAliases WHERE CityNameAlias LIKE '%pattern%')
-- OR --
SELECT cna.CountryCode FROM CityNameAliases cna
WHERE cna.id IN (SELECT id FROM City WHERE CountryCode='cc')
AND cna.CityNameAlias LIKE '%pattern%'
-- OR the corresponding JOIN-type queries.
The problem is that the server always scans the whole CountryCodeAliases to match the pattern (which is very expensive) and then filters the resulting subset by CountryCode criteria (which is cheap).
It should definetely scan first by CountryCode and then by the search pattern.
I have managed to get the desired behaviour by declaring a table variable, then select all Cities and NameAliases by countrycode and inserting them into the table variable and THEN filter by the pattern. This way I got execution times 8 times shorter.
Please, advice!

How about this?

with CTE AS

(

SELECT c.id

FROM City c

WHERE c.CountryCode='cc'

)

SELECT id

FROM CityNameAliases

WHERE id IN (select id from CTE)

and CityNameAlias LIKE '%pattern%'

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Lowell SSC Guru Points: 323748 More actions · Answer 2

as I understand it, the LIKE '%pattern%' will always require a tablescan, since parts of the string are not indexed.

if the pattern is 'pattern%', so the value always starts with the pattern, you might get better performance, but that might not be what your application needs.

Have you considered a full text index on the column? that might help I think.

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Carl Federl One Orange Chip Points: 25384 More actions · Answer 3

Some additional information is needed to assist you, so please run the below two sets of SQL statements to gather index statistical information and post the output. Note that for the second SQL that runs DBCC, you need to run the result and the post that output.

Please be sure to set the "results to text" on

DECLARE @Table_name sysname

SET@Table_name = 'OrgUnitHierarchy'

-- run the output of this SQL

SELECT'print ''Statistics for ' + sys.indexes.name + '''

go

DBCC SHOW_STATISTICS ( ''' + @Table_name + ''' , ' + sys.indexes.name + ') WITH NO_INFOMSGS, HISTOGRAM '

FROMsys.tables

JOINsys.indexes

ON sys.indexes.OBJECT_ID = sys.tables.OBJECT_ID

WHERE sys.tables.name = @Table_name

SELECTsys.tables.name

,sys.indexes.name

,sys.indexes.index_id

,STATS_DATE(sys.tables.OBJECT_ID , sys.indexes.index_id ) AS StatisticUpdateTs

FROMsys.tables

JOINsys.indexes

ON sys.indexes.OBJECT_ID = sys.tables.OBJECT_ID

WHEREsys.tables.name = @Table_name

SQL = Scarcely Qualifies as a Language

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 4

Wayne,

The actual execution plan of the query you propose is the same as with the two quries I posted. Same poor performance ;-(

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 5

Lowell,

as I understand it, the LIKE '%pattern%' will always require a tablescan, since parts of the string are not indexed.
if the pattern is 'pattern%', so the value always starts with the pattern, you might get better performance, but that might not be what your application needs.

Exactly.

Have you considered a full text index on the column? that might help I think.

Full text search matches by a whole keyword or phrase. I need to filter by part of the city name alias which part may be in the middle of the alias.

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 6

Carl Federl,

I'll run the statements later today and provide the result.

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 7

Carl Federl,

The statistics you requested are available here.

I'm ready to post them is CSV/XLS format if required. Just let me know.

Thanks in advance.

Carl Federl One Orange Chip Points: 25384 More actions · Answer 8

Thanks Sam, looking at the statistics right now and will get back to you soon.

Can you advise the frequency of changes (insert,updates and deletes)?

Is this user entered on a one by one basis or mass updated on a scheduled basis ?

Are there any other columns in the tables expecially any variable length datatypes or columns that are updated between null and a value ?

SQL = Scarcely Qualifies as a Language

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 9

Carl,

Can you advise the frequency of changes (insert,updates and deletes)?
Is this user entered on a one by one basis or mass updated on a scheduled basis ?

The tables are fresh with only one bulk insert of all records. No updates or deletes so far.

Are there any other columns in the tables expecially any variable length datatypes or columns that are updated between null and a value ?

No other columns. I created a test database especialy to deal with this issue. The two tables schemas are exactly as specified in my first post.

Carl Federl One Orange Chip Points: 25384 More actions · Answer 10

The index statistics for the IX_City show a high degree of skewness with the top 10 countries having 42% of the total rows and the top 23 having 60% of the total rows. A non-skewed distribution would have each country having the same number of citites. This means that when a country with a large percentage of the cities is searched, the least cost execution plan is to perform a table scan but when a country with a low percentage of the cities is searched, a nested loop index scan may have the least cost.

RANGE_HI_KEYEQ_ROWSRankPct of TotalRuning TotalRunning Pct
RU179,73517.36179,7357.36%
ID174,05227.13353,78714.49%
US167,27736.85521,06421.34%
CN102,03644.18623,10025.52%
DE96,92853.97720,02829.49%
PK86,21163.53806,23933.03%
FR73,15473.00879,39336.02%
IR63,96182.62943,35438.64%
MX46,65291.91990,00640.55%
TR41,291101.691,031,29742.25%
PE41,014111.681,072,31143.93%
KP39,856121.631,112,16845.56%
NG39,547131.621,151,71447.18%
BR38,474141.581,190,18948.75%
PH36,114151.481,226,30350.23%
KR35,764161.461,262,06651.70%
PL35,340171.451,297,40653.15%
TH33,607181.381,331,01354.52%
AF33,395191.371,364,40855.89%
ES29,478201.211,393,88657.10%
BD28,234211.161,422,12058.25%
LK27,492221.131,449,61259.38%
SE27,476231.131,477,08860.51%

I am going to assume that there is similar degree of skewness in the city name alias within country. You can confirm this by running:

SELECTCity.CountryCode

,COUNT(*) AS CityNameAliasCnt

FROMCity

JOINCityNameAliases

on CityNameAliases.Id= City.Id

GROUP BY City.CountryCode

ORDER BY COUNT(*) DESC

Recommendations:

A) As the search criteria includes CountryCode, the optimal index should have CountryCode and CityNameAlias , in that order, as the first two indexed columns. As these two columns are in different tables, either:

1) redundantly have this column in the CityNameAliases table.

2) Create a materialized view based on joining the City and CityNameAliases table.

Additionally:

1) This index should cover the query by including the City.id as the third column in the index.

2) There may be a benefit from defining this index as unique and clustered.

B) If the data is updated on an infrequent basis and there are no updates that increase the row length, use a fill factor of 100% to reduce the number of pages, which will reduce the disk IO and pages in memory to a minimum. Please also check for any index optimizations routines that specify using the server level default fill factor and revise the definition to instead use the index specific fill factor.

CREATETABLE CityNameAliases

( [id] intNOT NULL

, [CountryCode] char(2)NOT NULL

, [CityNameAlias] nvarchar(200) ) NOT NULL

)

CREATE UNIQUE CLUSTERED INDEX CityNameAliases_X on CityNameAliases

(CountryCode, CityNameAlias, [id])

WITH FILLFACTOR = 100

go

The query would then be:

SELECTDISTINCT CityNameAliases.[Id]

FROMCityNameAliases

WHERE CityNameAliases.CountryCode='cc'

ANDCityNameAliases.CityNameAlias LIKE '%pattern%'

The execution plan should show an index seek and the maximum logical reads would be for the country with the most city aliases ( RU with 179,735 cities).

Good Luck and hope this helps.

SQL = Scarcely Qualifies as a Language

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 11

Carl,

Thank you very much for your last post. Every single word makes perfect sense.

It's obvious that dealing with your recommendations will take some time, so I'll post the results in a couple of days.

Thanks.

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 12

Here are the results of applying Carl's recommendations:

1. Materialized view:

I Created a view with schemabinding option combining id,CountryCode and CityNameAlias. Created clustered index on (CountryCode,CityNameAlias). There was no performance improvement at all. The actual execution plan was the same as with the original queries - 83% cost table scan (CityNameAlias). Typical query execution time - 9 seconds.

2. New table with exactly the same structure/scheme as the view described above - dramatic improvement! Finally index seek instead of full table scan. Typical query execution time - 2 seconds.

The execution plan report even recommended creating new index ON CountryCode with "INCLUDE (CityNameAlias)" option. This new index even improved performance further - typical query execution time - 0-1 second.

Carl Federl One Orange Chip Points: 25384 More actions · Answer 13

For the materialized view, you indicate that the actual execution plan is using the base tables rather than the materialized view, which indicates that you are using SQL Server Standard Edition. Only Enterprise Edition and Developer Edition will not replace a materialized view with the underlying view definition unless you include the "WITH (NOEXPAND)" hint after the view name.

To continue the topic, is this data from geonames.org ? Be aware that their data structure is denormalized and if you make your tables match their structure, you will encounter performance problems.

SQL = Scarcely Qualifies as a Language

Sam Lowry SSC Journeyman Points: 95 More actions · Answer 14

Carl Federl (6/8/2009)
For the materialized view, you indicate that the actual execution plan is using the base tables rather than the materialized view, which indicates that you are using SQL Server Standard Edition. Only Enterprise Edition and Developer Edition will not replace a materialized view with the underlying view definition unless you include the "WITH (NOEXPAND)" hint after the view name.

I understand now. The final application will run on SQL Enterprise Edition, so the view will do the job.

To continue the topic, is this data from geonames.org ? Be aware that their data structure is denormalized and if you make your tables match their structure, you will encounter performance problems.

Yes, the data is from geonames.org and I'm aware that their dump files don't represent their internal data structures in normalized form. We have developed special tools for dealing with these dump files and importing/refreshing our sql data.

Thanks Carl.