Hi Razzgor,

    Thanks for posting the plan. Like I said, I'm not a spatial expert - but I do know a fair bit about plans. So let's start with some generic stuff.

    1. Don't trust the percentages in the plan. These are based on the estimated cost of the operator, not on the actual cost (which is not measured per operator). If slow processing is caused by bad estimates, these percentages are based on those bad estimates. They will not show where the real pain is.

    2. The table scan is not a problem. The query has no filter on the #fact_table, so it will need to read all rows, and scanning is the fastest and most efficient way to fetch all rows. The fact that it's a table scan, not a (clustered) index scan, proves that you have no clustered index on the table. For a permanent table that is almost always a bad thing; for a temp table not so much, though it can be. But in this case it is not the problem.

    3. Outer joins give the optimizer much less freedom than inner joins. Do you really need this to be an outer join? Even if that is the case, I would still like to know if you get much better performance from an inner join.

    4. The way you built the query makes it hard for the optimizer to optimize. I don't know the datatype and contents of the IP_Converted column, but I do know that you convert it to varchar(20), then convert that (using STGeomFromText) to a point, which is then fed into the STWithin function. This means that whatever statistics the optimizer may have on IP_Converted are useless for this query. Do you really need this complex double conversion?

    Can you try adapting the procedure to add an extra column to the temp table that holds the result of the conversion (so the point, as geometry data type)? It can be a normal column, or a persisted computed column. Then build a spatial index on that column too. Does this help?
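    To make that concrete, here is a rough sketch of what I mean. I don't know your real column names or SRID, so the key column (id), the SRID (0), and the bounding box values are all assumptions you will need to adapt; also note that creating a spatial index requires a clustered primary key on the table first (which, per point 2, the temp table currently lacks):

    ```sql
    -- Add a normal geometry column and populate it once,
    -- so the conversion happens one time instead of per join probe.
    ALTER TABLE #fact_table ADD IP_Point geometry;

    UPDATE #fact_table
    SET    IP_Point = geometry::STGeomFromText(CONVERT(varchar(20), IP_Converted), 0);

    -- A spatial index requires a clustered primary key; 'id' is an assumed key column.
    ALTER TABLE #fact_table ADD CONSTRAINT PK_fact_table PRIMARY KEY CLUSTERED (id);

    CREATE SPATIAL INDEX six_fact_IP_Point
    ON #fact_table (IP_Point)
    WITH (BOUNDING_BOX = (0, 0, 500, 500));  -- pick a box that covers your actual data
    ```

    Then change the join to use IP_Point directly instead of the nested conversions.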

    5. You wrote in the original post that your benchmark worked fine, but it doesn't work well from within the report. Is the code exactly the same, or are there differences? And how about the execution plans? (I don't know how your report works; you may have to use profiler or extended events to get the exact query).

    6. The plan you posted will fetch each row from the temporary fact table (the top-right table scan), then do some arithmetic in the Compute Scalar (that's where the geometry value is calculated from the IP_Converted column). Then, for each row, the GEO2IP table is scanned (the clustered index scan at the bottom right - this IS a big part of the problem); those rows are passed through a filter (I'm willing to bet that this includes the georange.STWithin calculation), and then the join is performed. For a query like this, I would much rather have seen a merge or hash join, but I think those are not possible for geometry/geography functions. A nested loops join like this, but using the spatial index in the bottom part (instead of the clustered index scan + filter), would probably be the best option. Which, of course, brings us back to your original question. 😉

    Can you try to create a repro? I understand that the data, and maybe even the schema, is confidential. That doesn't hinder a repro. You can create new tables, with fake column names, put in some fake data (but similar enough to your real problem to demonstrate the behaviour). Then post that schema (as CREATE TABLE statements) and data (as INSERT statements) here. Don't go overboard with the sample data - if you can repro the issue with 10 rows, don't post 1,000.
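    Something along these lines would be enough (all names and data here are made up by me, and the geometry values are just tiny example shapes - substitute whatever resembles your real data):

    ```sql
    -- Fake stand-in for the permanent GEO2IP table
    CREATE TABLE dbo.Repro_GEO2IP
    (
        RangeID  int      NOT NULL PRIMARY KEY,
        GeoRange geometry NOT NULL
    );

    INSERT INTO dbo.Repro_GEO2IP (RangeID, GeoRange)
    VALUES (1, geometry::STGeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))', 0)),
           (2, geometry::STGeomFromText('POLYGON((5 5, 15 5, 15 15, 5 15, 5 5))', 0));

    -- Fake stand-in for the temp fact table
    CREATE TABLE #fact_repro
    (
        FactID       int         NOT NULL,
        IP_Converted varchar(20) NOT NULL   -- use the same datatype as your real column
    );

    INSERT INTO #fact_repro (FactID, IP_Converted)
    VALUES (1, 'POINT(3 3)'),
           (2, 'POINT(12 12)');
    ```

    With a script like that, plus the actual query, anyone here can reproduce and experiment with the plan.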

    Some other questions:

    * How many rows are in each of the two tables involved?

    * Are the values in IP_Converted all different? If not, how many distinct values are there? If the number is low, you could try to rewrite the query to nudge the optimizer toward a plan that first sorts on IP_Converted and performs the join only once per distinct value.
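    One way to force that "once per distinct value" behaviour by hand is to do the expensive spatial join against the distinct values only, then join the result back to the fact table on the cheap varchar equality. A sketch (RangeID and GeoRange are column names I assumed; the join condition mirrors what I think your query does):

    ```sql
    WITH DistinctPoints AS
    (
        SELECT DISTINCT IP_Converted
        FROM   #fact_table
    ),
    Matched AS
    (
        -- Expensive spatial predicate runs once per distinct value only
        SELECT dp.IP_Converted, g.RangeID
        FROM   DistinctPoints AS dp
        JOIN   dbo.GEO2IP AS g
          ON   geometry::STGeomFromText(CONVERT(varchar(20), dp.IP_Converted), 0)
                   .STWithin(g.GeoRange) = 1
    )
    SELECT     f.*, m.RangeID
    FROM       #fact_table AS f
    LEFT JOIN  Matched AS m
      ON       m.IP_Converted = f.IP_Converted;
    ```

    Whether this helps depends entirely on how many duplicates there are, so check the distinct count first.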

    * In an actual execution plan, what are the rowcounts (both estimated and actual) for all of the arrows? And what are the execution counts (both estimated and actual) for the operators on the bottom part of the nested loops join?

    * If I understand the query, you have a bunch of points in the temp table (in IP_Converted), a bunch of areas in the geo_range, and now want to report each point with all areas that enclose the point. An idea I have might or might not help at all - depending on the actual data distribution. What if you first create a new geometry object that includes all the points in the temp table (no idea how to do that - like I said, I have no experience with Spatial), then use STOverlaps between that area and the data in GEO2IP to quickly weed out all rows that can never overlap with anything. (This would be useful if, for instance, GEO2IP covers the whole world and the temp table is always restricted to a small area, like a single state; it would not be useful if the temp table can cover all of the area).
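    If you want to experiment with that idea, a sketch could look like the following (SQL Server 2012 or later; GeoRange is an assumed column name). I used EnvelopeAggregate to build one rectangle around all the points, and STIntersects rather than STOverlaps for the pre-filter, because STOverlaps returns 0 when one geometry fully contains the other, which is exactly the case you do want to keep:

    ```sql
    -- Build a single bounding rectangle around all points in the temp table
    DECLARE @AllPoints geometry;

    SELECT @AllPoints = geometry::EnvelopeAggregate(
               geometry::STGeomFromText(CONVERT(varchar(20), IP_Converted), 0))
    FROM   #fact_table;

    -- Keep only ranges that can possibly contain any of the points
    SELECT g.*
    INTO   #candidate_ranges
    FROM   dbo.GEO2IP AS g
    WHERE  g.GeoRange.STIntersects(@AllPoints) = 1;
    ```

    Then run the real join against #candidate_ranges instead of GEO2IP. Again, this only pays off if the pre-filter removes a large fraction of the ranges.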

    EDIT: Sorry for the overlap with some previous questions; I somehow overlooked those messages before posting mine.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/