Nearest neighbor, within the given shape query

  • Hi,

    I have a table with 257 mil records with latitude and longitude data.

    My goal is to find the closest intersecting values from a locations table (88 rows) and update any of the 257 mil records that are applicable with the location_Name and Location_Group_Name.

    The query I have works but doesn't perform well on such a big data set.

    CREATE TABLE #Positions -- Base table 257 mil rows. Actual table has 20 columns

    (

    IDBigInt PRIMARY KEY,

    LatitudeDec(10,6),

    LongitudeDec(10,6),

    Location_Name Varchar(50),

    Location_Group_NameVarchar(50)

    )

    CREATE TABLE #Locations -- 88 rows

    (

    IDInt,

    LatitudeDec(10,6),

    LongitudeDec(10,6),

    GeoShapeGeoGraphy,

    Loc_NameVarchar(50),

    Loc_Grp_NameVarchar(50),

    )

    Attached you will find the tables, test data, a function to measure distance and some queries that work but are too slow for this much data.

    Here are the best resources i found.

    Isaac Kunen

    Questioned asked here by drjdewhurst

    Posted link in previous link

    I have only been working with this stuff for a week now, could really use some help.

  • You will get a lot more responses if you post your tables and data in a consumable format. Not many people are willing to download rar files from people they don't know.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Thanks for the tip Sean. No .sql available, had to make a choice.

    Here it is again as .doc and .zip.

    Can also do .txt or post all 1130 lines of code if anyone prefers.

  • DennisPost (11/24/2014)


    Thanks for the tip Sean. No .sql available, had to make a choice.

    Here it is again as .doc and .zip.

    Can also do .txt or post all 1130 lines of code if anyone prefers.

    Replacing on potentially dangerous file type with another is not going to do much for you. What do you mean by no .sql available? Don't you have the content for a file as that? Wouldn't be too hard to change the extension of a text file to sql.

    Why do you need 1130 lines to demonstrate the issue here? Or are you saying your function is 1130 lines?

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (11/24/2014)


    DennisPost (11/24/2014)


    Thanks for the tip Sean. No .sql available, had to make a choice.

    Here it is again as .doc and .zip.

    Can also do .txt or post all 1130 lines of code if anyone prefers.

    Replacing on potentially dangerous file type with another is not going to do much for you. What do you mean by no .sql available? Don't you have the content for a file as that? Wouldn't be too hard to change the extension of a text file to sql.

    Why do you need 1130 lines to demonstrate the issue here? Or are you saying your function is 1130 lines?

    When I try to upload my file as .sql it says. See the .png

    A file you attempted to upload is not a permitted type.

    I need 1130 lines of code to demonstrate how I am tackling this. Test data for the #Position table is 1000 + 20 for the #Location table and some more the function I am using to measure distance. If you know of a way to generate random data for geography datatypes and how to generate latitude and longitudes that intersect with them, I'm keep to learn.:-)

    Which format would you choose?

  • Post your data as all text, as insert statements into your demonstration tables. Obviously, works best if you don't post zetabyte databases this way LOL just post a sample set that demonstrates the issue.

    edit:

    http://msdn.microsoft.com/en-us/library/bb895266.aspx might help build some sample data from textual geographical coordinates.

    -- basically you want a stack of these sorts of things:

    INSERT INTO SpatialTable (GeogCol1)

    VALUES (geography::STGeomFromText('LINESTRING(-122.360 47.656, -122.343 47.656)', 4326));

    INSERT INTO SpatialTable (GeogCol1)

    VALUES (geography::STGeomFromText('POLYGON((-122.358 47.653, -122.348 47.649, -122.348 47.658, -122.358 47.658, -122.358 47.653))', 4326));

  • Also check out this link: http://msdn.microsoft.com/en-us/magazine/jj133823.aspx

    This talks about putting locations into "sectors" to chip away at the efficiency problems, maybe you can get some insight into some of the issues you'll run into with these sort of big geographical datasets.

    Interesting stuff!

    edit: as a bonus, that page has c# routines to create dummy datasets.

  • Thanks for the link Patrick, was a good read.

    We use the sectors technique for other purposes, but our sectors are 6m x 6m. Unfortunately even sectors this small are not accurate enough.

    Most of the #Position and #Location data will be in one city.

    Here's the simplified sample code.

    USE TempDB

    GO

    IF OBJECT_ID('TempDB..#Positions') IS NOT NULL

    DROP TABLE #Positions

    CREATE TABLE #Positions -- Actual table has 257 mil rows, 20 columns

    (

    Position_IDBigInt CONSTRAINT PK_PosID PRIMARY KEY,

    LatitudeDec(10,6),

    LongitudeDec(10,6),

    Location_NameVarchar(50)

    )

    INSERT#Positions

    SELECT581763287, 51.546783, 4.803814, NULL UNION ALL

    SELECT581763288, 52.070960, 5.123537, NULL UNION ALL

    SELECT581773709, 51.118755, 3.561964, NULL UNION ALL

    SELECT581773710, 51.118759, 3.562069, NULL UNION ALL

    SELECT581773711, 51.118816, 3.562144, NULL

    IF OBJECT_ID('TempDB..#Locations') IS NOT NULL

    DROP TABLE #Locations

    CREATE TABLE #Locations -- 88 rows

    (

    Loc_IDInt CONSTRAINT PK_LocID PRIMARY KEY,

    LatitudeDec(10,6),

    LongitudeDec(10,6),

    GeoShapeGeoGraphy,

    Loc_NameVarchar(50)

    )

    INSERT#Locations

    SELECT638767, 52.070732, 5.123783, 0xE6100000010405000000FA8777B2D7084A4077EDB30B107C144039C124B041094A4077EDB30B107C144039C124B041094A40E42C30C86F811440FA8777B2D7084A40E42C30C86F811440FA8777B2D7084A4077EDB30B107C144001000000020000000001000000FFFFFFFF0000000003, 'Location 1' UNION ALL

    SELECT797174, 51.546888, 4.803603, 0xE6100000010405000000DF03BDD1CBC54940170659B43A341340F58B15D235C64940170659B43A341340F58B15D235C649401C97157D8A391340DF03BDD1CBC549401C97157D8A391340DF03BDD1CBC54940170659B43A34134001000000020000000001000000FFFFFFFF0000000003, 'Location 2'

    IF EXISTS (SELECT * FROM sys.objects WHERE Name = 'GetDistance' AND [Type] = 'FN')

    DROP FUNCTION GetDistance

    GO

    CREATE FUNCTION [dbo].[GetDistance]

    (

    @Latitude1 Decimal(10,6),

    @Longitude1Decimal(10,6),

    @Latitude2 Decimal(10,6),

    @Longitude2 Decimal(10,6)

    )

    RETURNS INT WITH SCHEMABINDING

    AS

    BEGIN

    DECLARE @Temp Float

    SET @Temp = SIN(@Latitude1/57.2957795130823) *

    SIN(@Latitude2/57.2957795130823) +

    COS(@Latitude1/57.2957795130823) *

    COS(@Latitude2/57.2957795130823) *

    COS(

    @Longitude2/57.2957795130823 -

    @Longitude1 /

    57.2957795130823

    )

    IF @Temp > 1

    SET @Temp = 1

    ELSE IF @Temp < -1

    SET @Temp = -1

    RETURN CAST((3958.75586574 * ACOS(@Temp) * 1609.344) AS INT)

    END

    GO

    UPDATE#Positions

    SETLocation_Name = L.Loc_Name

    FROM#Positions P WITH (NOLOCK)

    OUTER APPLY

    (

    SELECTTOP(1) L.Loc_Name

    FROM#Locations L WITH (NOLOCK)

    WHEREGeoshape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1

    ORDERBY dbo.GetDistance(P.Latitude, P.Longitude, L.latitude, L.longitude) ASC

    ) L

    SELECT*

    FROM#Positions

    IF OBJECT_ID('TempDB..#Positions') IS NOT NULL

    DROP TABLE #Positions

    IF OBJECT_ID('TempDB..#Locations') IS NOT NULL

    DROP TABLE #Locations

    IF EXISTS (SELECT * FROM sys.objects WHERE Name = 'GetDistance' AND [Type] = 'FN')

    DROP FUNCTION GetDistance

  • Right off hand, you might want to try to convert that function into an inline calculation. Other experts might give better advice, but it might be the case that multiline functions exact a performance penalty. Yes, moving the calculations to the update statement will probably be cumbersome, but I think sacrificing some readability for getting rid of some of what I guess is an interpretation based penalty might pay off.

    Something like the following, although I doubt I'm doing the right thing as I hacked off a few things that you might need to be included (casting as int, limitting the figures to -1 or 1, etc...) just to get your sample to run 🙂

    UPDATE#Positions

    SETLocation_Name = L.Loc_Name

    FROM#Positions P WITH (NOLOCK)

    OUTER APPLY

    (

    SELECTTOP(1) L.Loc_Name

    FROM#Locations L WITH (NOLOCK)

    WHEREGeoshape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1

    ORDERBY

    --dbo.GetDistance(P.Latitude, P.Longitude, L.latitude, L.longitude) ASC

    (3958.75586574 * ACOS

    (

    SIN(P.Latitude/57.2957795130823) *

    SIN(L.Latitude/57.2957795130823) +

    COS(P.Latitude/57.2957795130823) *

    COS(L.Latitude/57.2957795130823) *

    COS(

    L.Longitude/57.2957795130823 -

    P.Longitude /

    57.2957795130823

    )

    )

    )

    * 1609.344

    --

    ) L

  • Even completely removing the order by doesn't help.:(

  • DennisPost (11/26/2014)


    Even completely removing the order by doesn't help.:(

    Yeah that does sort of indicate that 257 million rows is a bit of data to push through looking for nearby neighbors. It seems doable as you're only checking 88 locations but most of my SQL is counting stuff and drawing reports, hopefully someone else can lend some insight.

    Here's some stuff on spatial indexing: http://technet.microsoft.com/en-us/library/bb964712(v=sql.105).aspx , I found that pretty interesting. Still, the fact that you've removed the "order by" without even getting help with performance (not even taking into consideration correctness) does tell a bit. Maybe trying 88 passes through the dataset looking for intersections for each location, but thats just moving the right side of the "apply" to the left though, if that even makes sense to do.

    Sorry to have you spend your time uploading samples and then not actually helping, it IS really helpful for the forum here to assist in getting answers, but that sort of applies only in the cases that good help is actually forthcoming.

  • I was able to speed up the query by reducing the intersect results by correcting the geography orientation.

    In the end a VIEW was sufficient as only a small portion of the position rows are queried.

    SELECTP.*

    , Locs.Location_Name

    , Locs.Location_Group_Name

    FROMdbo.Positions P WITH (NOLOCK)

    OUTER APPLY

    (

    SELECTTOP(1) Location_Name, Location_Group_Name

    FROMLocations L WITH (NOLOCK)

    CROSS APPLY

    (

    VALUES

    (

    -- Repair the ring orientation on geography datatype.

    Geography::STGeomFromWKB

    (Geometry::STGeomFromText(L.Geography_Shape.STAsText(), 4326).STUnion

    (Geometry::STGeomFromText(L.Geography.STAsText(), 4326).STStartPoint()

    ).STAsBinary(), 4326))

    ) Val1(GeogShape)

    WHEREVal1.GeogShape.STIntersects(geography::Point(P.Latitude, P.Longitude, 4326)) = 1

    ORDERBY dbo.GetDistance(P.Latitude, P.Longitude, L.Latitude, L.Longitude) ASC

    ) Locs

  • Thanks for the help patrickmcginnis59 10839.

    You my not have given a complete answer, but I learned a lot for the links and insights you provided.

    My colleague and I also devised a Sectoring style solution with all location IDs and their intersecting sectors. Unfortunately the resulting multi-billion row table made this unfeasible. (+100K locations with +100 million used sectors). Any changes to a locations shape, which occurs frequently, would have to be recalculated. Other attempts to generate the relevant table data on the fly resulted in terrible performance.

  • Thanks for posting your conclusion as a followup, interesting stuff!

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply