June 3, 2009 at 12:10 pm
Hi,
I'm running into an sql query optimization issue that I would like to share and get help to overcome it best.
The environment is based on SQL 2008 Express SP1.
Here is the issue in simplified form.
Two tables:
City ( [id] int, [CountryCode] char(2) )
CityNameAliases ( [id] int, [CityNameAlias] nvarchar(200) )
Table [City] has about 2.4 million records and [City].[id] is PRIMARY KEY. I also have NON-Unique index on [CountryCode].
CityNameAliases has about 2.7 million records and has NON-Unique index on [id].
The goal is to get City Ids that are from cpecific country and have at least one alias matching pattern '%pattern%'.
The obvious queries that should do the job for country 'cc' are:
SELECT c.id FROM City c
WHERE c.CountryCode='cc' AND c.id IN
(SELECT id FROM CityNameAliases WHERE CityNameAlias LIKE '%pattern%')
-- OR --
SELECT cna.id FROM CityNameAliases cna
WHERE cna.id IN (SELECT id FROM City WHERE CountryCode='cc')
AND cna.CityNameAlias LIKE '%pattern%'
-- OR the corresponding JOIN-type queries.
The problem is that the server always scans the whole CountryCodeAliases to match the pattern (which is very expensive) and then filters the resulting subset by CountryCode criteria (which is cheap).
It should definetely scan first by CountryCode and then by the search pattern.
I have managed to get the desired behaviour by declaring a table variable, then select all Cities and NameAliases by countrycode and inserting them into the table variable and THEN filter by the pattern. This way I got execution times 8 times shorter.
Please, advice!
June 3, 2009 at 6:38 pm
Sam Lowry (6/3/2009)
Hi,I'm running into an sql query optimization issue that I would like to share and get help to overcome it best.
The environment is based on SQL 2008 Express SP1.
Here is the issue in simplified form.
Two tables:
City ( [id] int, [CountryCode] char(2) )
CityNameAliases ( [id] int, [CityNameAlias] nvarchar(200) )
Table [City] has about 2.4 million records and [City].[id] is PRIMARY KEY. I also have NON-UNIQUE index on [CountryCode].
CityNameAliases has about 2.7 million records.
The goal is to get City Ids that are from cpecific country and have at least one alias matching pattern '%pattern%'.
The obvious queries that should do the job for country 'cc' are:
SELECT c.id FROM City c
WHERE c.CountryCode='cc' AND c.id IN
(SELECT id FROM CityNameAliases WHERE CityNameAlias LIKE '%pattern%')
-- OR --
SELECT cna.CountryCode FROM CityNameAliases cna
WHERE cna.id IN (SELECT id FROM City WHERE CountryCode='cc')
AND cna.CityNameAlias LIKE '%pattern%'
-- OR the corresponding JOIN-type queries.
The problem is that the server always scans the whole CountryCodeAliases to match the pattern (which is very expensive) and then filters the resulting subset by CountryCode criteria (which is cheap).
It should definetely scan first by CountryCode and then by the search pattern.
I have managed to get the desired behaviour by declaring a table variable, then select all Cities and NameAliases by countrycode and inserting them into the table variable and THEN filter by the pattern. This way I got execution times 8 times shorter.
Please, advice!
How about this?
with CTE AS
(
SELECT c.id
FROM City c
WHERE c.CountryCode='cc'
)
SELECT id
FROM CityNameAliases
WHERE id IN (select id from CTE)
and CityNameAlias LIKE '%pattern%'
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
June 3, 2009 at 6:50 pm
as I understand it, the LIKE '%pattern%' will always require a tablescan, since parts of the string are not indexed.
if the pattern is 'pattern%', so the value always starts with the pattern, you might get better performance, but that might not be what your application needs.
Have you considered a full text index on the column? that might help I think.
Lowell
June 3, 2009 at 7:20 pm
Some additional information is needed to assist you, so please run the below two sets of SQL statements to gather index statistical information and post the output. Note that for the second SQL that runs DBCC, you need to run the result and the post that output.
Please be sure to set the "results to text" on
DECLARE @Table_name sysname
SET@Table_name = 'OrgUnitHierarchy'
-- run the output of this SQL
SELECT'print ''Statistics for ' + sys.indexes.name + '''
go
DBCC SHOW_STATISTICS ( ''' + @Table_name + ''' , ' + sys.indexes.name + ') WITH NO_INFOMSGS, HISTOGRAM '
FROMsys.tables
JOINsys.indexes
ON sys.indexes.OBJECT_ID = sys.tables.OBJECT_ID
WHERE sys.tables.name = @Table_name
SELECTsys.tables.name
,sys.indexes.name
,sys.indexes.index_id
,STATS_DATE(sys.tables.OBJECT_ID , sys.indexes.index_id ) AS StatisticUpdateTs
FROMsys.tables
JOINsys.indexes
ON sys.indexes.OBJECT_ID = sys.tables.OBJECT_ID
WHEREsys.tables.name = @Table_name
SQL = Scarcely Qualifies as a Language
June 3, 2009 at 11:05 pm
Wayne,
The actual execution plan of the query you propose is the same as with the two quries I posted. Same poor performance ;-(
June 3, 2009 at 11:12 pm
Lowell,
as I understand it, the LIKE '%pattern%' will always require a tablescan, since parts of the string are not indexed.
if the pattern is 'pattern%', so the value always starts with the pattern, you might get better performance, but that might not be what your application needs.
Exactly.
Have you considered a full text index on the column? that might help I think.
Full text search matches by a whole keyword or phrase. I need to filter by part of the city name alias which part may be in the middle of the alias.
June 3, 2009 at 11:16 pm
Carl Federl,
I'll run the statements later today and provide the result.
June 4, 2009 at 6:35 am
Thanks Sam, looking at the statistics right now and will get back to you soon.
Can you advise the frequency of changes (insert,updates and deletes)?
Is this user entered on a one by one basis or mass updated on a scheduled basis ?
Are there any other columns in the tables expecially any variable length datatypes or columns that are updated between null and a value ?
SQL = Scarcely Qualifies as a Language
June 4, 2009 at 7:42 am
Carl,
Can you advise the frequency of changes (insert,updates and deletes)?
Is this user entered on a one by one basis or mass updated on a scheduled basis ?
The tables are fresh with only one bulk insert of all records. No updates or deletes so far.
Are there any other columns in the tables expecially any variable length datatypes or columns that are updated between null and a value ?
No other columns. I created a test database especialy to deal with this issue. The two tables schemas are exactly as specified in my first post.
June 4, 2009 at 7:48 am
The index statistics for the IX_City show a high degree of skewness with the top 10 countries having 42% of the total rows and the top 23 having 60% of the total rows. A non-skewed distribution would have each country having the same number of citites. This means that when a country with a large percentage of the cities is searched, the least cost execution plan is to perform a table scan but when a country with a low percentage of the cities is searched, a nested loop index scan may have the least cost.
RANGE_HI_KEYEQ_ROWSRankPct of TotalRuning TotalRunning Pct
RU179,73517.36179,7357.36%
ID174,05227.13353,78714.49%
US167,27736.85521,06421.34%
CN102,03644.18623,10025.52%
DE96,92853.97720,02829.49%
PK86,21163.53806,23933.03%
FR73,15473.00879,39336.02%
IR63,96182.62943,35438.64%
MX46,65291.91990,00640.55%
TR41,291101.691,031,29742.25%
PE41,014111.681,072,31143.93%
KP39,856121.631,112,16845.56%
NG39,547131.621,151,71447.18%
BR38,474141.581,190,18948.75%
PH36,114151.481,226,30350.23%
KR35,764161.461,262,06651.70%
PL35,340171.451,297,40653.15%
TH33,607181.381,331,01354.52%
AF33,395191.371,364,40855.89%
ES29,478201.211,393,88657.10%
BD28,234211.161,422,12058.25%
LK27,492221.131,449,61259.38%
SE27,476231.131,477,08860.51%
I am going to assume that there is similar degree of skewness in the city name alias within country. You can confirm this by running:
SELECTCity.CountryCode
,COUNT(*) AS CityNameAliasCnt
FROMCity
JOINCityNameAliases
on CityNameAliases.Id= City.Id
GROUP BY City.CountryCode
ORDER BY COUNT(*) DESC
Recommendations:
A) As the search criteria includes CountryCode, the optimal index should have CountryCode and CityNameAlias , in that order, as the first two indexed columns. As these two columns are in different tables, either:
1) redundantly have this column in the CityNameAliases table.
2) Create a materialized view based on joining the City and CityNameAliases table.
Additionally:
1) This index should cover the query by including the City.id as the third column in the index.
2) There may be a benefit from defining this index as unique and clustered.
B) If the data is updated on an infrequent basis and there are no updates that increase the row length, use a fill factor of 100% to reduce the number of pages, which will reduce the disk IO and pages in memory to a minimum. Please also check for any index optimizations routines that specify using the server level default fill factor and revise the definition to instead use the index specific fill factor.
CREATETABLE CityNameAliases
( [id] intNOT NULL
, [CountryCode] char(2)NOT NULL
, [CityNameAlias] nvarchar(200) ) NOT NULL
)
CREATE UNIQUE CLUSTERED INDEX CityNameAliases_X on CityNameAliases
(CountryCode, CityNameAlias, [id])
WITH FILLFACTOR = 100
go
The query would then be:
SELECTDISTINCT CityNameAliases.[Id]
FROMCityNameAliases
WHERE CityNameAliases.CountryCode='cc'
ANDCityNameAliases.CityNameAlias LIKE '%pattern%'
The execution plan should show an index seek and the maximum logical reads would be for the country with the most city aliases ( RU with 179,735 cities).
Good Luck and hope this helps.
SQL = Scarcely Qualifies as a Language
June 4, 2009 at 8:05 am
Carl,
Thank you very much for your last post. Every single word makes perfect sense.
It's obvious that dealing with your recommendations will take some time, so I'll post the results in a couple of days.
Thanks.
June 8, 2009 at 2:25 am
Here are the results of applying Carl's recommendations:
1. Materialized view:
I Created a view with schemabinding option combining id,CountryCode and CityNameAlias. Created clustered index on (CountryCode,CityNameAlias). There was no performance improvement at all. The actual execution plan was the same as with the original queries - 83% cost table scan (CityNameAlias). Typical query execution time - 9 seconds.
2. New table with exactly the same structure/scheme as the view described above - dramatic improvement! Finally index seek instead of full table scan. Typical query execution time - 2 seconds.
The execution plan report even recommended creating new index ON CountryCode with "INCLUDE (CityNameAlias)" option. This new index even improved performance further - typical query execution time - 0-1 second.
June 8, 2009 at 5:41 am
For the materialized view, you indicate that the actual execution plan is using the base tables rather than the materialized view, which indicates that you are using SQL Server Standard Edition. Only Enterprise Edition and Developer Edition will not replace a materialized view with the underlying view definition unless you include the "WITH (NOEXPAND)" hint after the view name.
To continue the topic, is this data from geonames.org ? Be aware that their data structure is denormalized and if you make your tables match their structure, you will encounter performance problems.
SQL = Scarcely Qualifies as a Language
June 8, 2009 at 6:15 am
Carl Federl (6/8/2009)
For the materialized view, you indicate that the actual execution plan is using the base tables rather than the materialized view, which indicates that you are using SQL Server Standard Edition. Only Enterprise Edition and Developer Edition will not replace a materialized view with the underlying view definition unless you include the "WITH (NOEXPAND)" hint after the view name.
I understand now. The final application will run on SQL Enterprise Edition, so the view will do the job.
To continue the topic, is this data from geonames.org ? Be aware that their data structure is denormalized and if you make your tables match their structure, you will encounter performance problems.
Yes, the data is from geonames.org and I'm aware that their dump files don't represent their internal data structures in normalized form. We have developed special tools for dealing with these dump files and importing/refreshing our sql data.
Thanks Carl.
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply