Geography Data Type - Best Practices

  • Scenario: We currently have an Address table with roughly 500,000 rows in it. This data represents store locations. Each address has the street, city, zip, etc., and also has a Latitude and Longitude stored as float columns. This table is from a 3rd party system, therefore we cannot make changes to it.

    Goal: We need to store geolocations for employees as they perform work in the various stores. The employees will be using a phone friendly website and we will regularly be capturing their geolocation. Once we have this information, it will be cross checked against the address table to ensure the employees went to the stores they were supposed to according to their schedule. This table will get big very fast. Easily 10,000 employees (possibly up to 20,000 eventually) having their geolocation stored several times a day.

    Proposed Solution: Based on what I've read, it sounds like I should make use of the geography data type. I envision a very simple table such as:

    create table EmployeeLocation
    (
        EmployeeID int NOT NULL,
        GeoLocation geography NOT NULL,
        InsertDate datetime NOT NULL
    );
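    A minimal sketch of populating that table (the parameter names and the use of GETUTCDATE() are my assumptions, not from the thread): the geography point is built from the lat/long the phone reports. Note that geography::Point takes latitude first, then longitude, with SRID 4326 (WGS 84).

    ```sql
    -- Hypothetical insert for one captured location (@EmployeeID, @Latitude,
    -- @Longitude are assumed parameters supplied by the phone-friendly site).
    INSERT INTO EmployeeLocation (EmployeeID, GeoLocation, InsertDate)
    VALUES (@EmployeeID,
            geography::Point(@Latitude, @Longitude, 4326),  -- latitude FIRST
            GETUTCDATE());
    ```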

    From here, I believe I would need to clone the Address table and add a geography column representing the lat/long and another geography column that stores that point with a buffer utilizing STBuffer(). I could then see which values in EmployeeLocation are within the buffer zone for each store. I spent most of yesterday reading articles on geography, geometry, STBuffer(), STBufferWithTolerance() and have come away a bit confused as to the best approach. It seems STBuffer() is CPU intensive, but if you use a spatial index in conjunction with Filter(), it can be more efficient.
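    As a rough illustration of the Filter() point (the table and column names here are my assumptions): Filter() is an index-only primary filter that can return false positives, so it is typically paired with the exact STIntersects() test on the survivors.

    ```sql
    -- Hypothetical pre-filter + exact test; StoreGeo/GeoBuffer are assumed names
    -- for a cloned Address table holding a pre-computed STBuffer() polygon.
    SELECT el.EmployeeID, s.StoreID
    FROM   dbo.EmployeeLocation AS el
    JOIN   dbo.StoreGeo AS s
      ON   s.GeoBuffer.Filter(el.GeoLocation) = 1          -- cheap, index-driven, approximate
     AND   s.GeoBuffer.STIntersects(el.GeoLocation) = 1;   -- exact confirmation
    ```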

    Important Considerations: Cross checking an employee's route will not be a real time operation. It would be something done off hours on a periodic basis to verify their route, therefore we are not as concerned about CPU usage. Additionally, we have the employee's schedule which we could then use to compare against the Address table and the EmployeeLocation table, thereby narrowing the records processed. The massaging of this data is not an immediate need, however I need to keep that in mind with what we develop now as we capture this data.

    Is this a viable approach? Any caveats?

    Thank you.

    Lisa

  • Honestly, if I were you, I'd set up a proof-of-concept database, try some sample data that way, and see if it does what you need and what you expect.

    It sounds like you're on the right track, but testing with some samples will tell you a lot more than me typing an opinion here.

    Have you tested it?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thank you for your reply. I've just started testing how to work with our data, simply to get the concepts down. I think I've definitely settled on storing the data as geography rather than geometry. Doing the comparison with the store data is where it will truly be put to the test. I'm going to do further investigation with STBuffer() and STDistance(). Like everything, it seems there are pros and cons to each.

  • Lisa, just this once, I will urge you to reconsider.

    Queries against spatial indexes using a geography datatype will run much slower. Geography is required to get accurate distances between lat/long coordinates at scales large enough for the curvature of the earth to come into play. It is totally unnecessary for determining intersections of points and polygons on the scale you are describing. Testing will reveal this to you, so by all means take the excellent advice offered by Gus.

    Best of luck to you anyway.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • VERY helpful information Dixie Flatline. Are you suggesting using geometry instead of geography, or ditching both entirely? I understand testing is definitely needed. We are just expecting that the user's locations will be acceptable if they are within a range of the lat/long of the store, hence I thought using the buffer around the store would accomplish this.

  • How accurate do you need it to be, and how big is the total space you are mapping?

    If, for example, all of your stores are in one city, then just figure out how many miles there are per degree of long at your lat, and do a square calculation and you'll probably be "good enough".

    Latitude is easy. It's about 69 miles per degree.

    Longitude is a little more complex, since 1° of longitude = (69.11) x (cosine of the latitude). So if you are, for example, in Seattle, WA, USA, which sits at about 47.6 degrees north latitude, cos(47.6°) is roughly 0.674, which works out to about 47 miles per degree of longitude.

    So .1 degrees long is about 4.7 miles there, and .1 degree lat is about 6.9 miles. Just find the absolute difference between the person's lat and the store's lat, and the absolute difference between the person's long and the store's long, and check against those. No need for trig in the comparison itself. Refine to yards or whatever if you need to do so.

    Is it exact? Nah. But there will only be a few edge cases where it's not accurate enough.

    It's more complex if you're mapping, say, all of Europe or North America, because the miles/kilometers per degree of longitude will vary significantly over that much distance. But you'll still be comparing a float to a float if you stick to "sides of a square" instead of "hypotenuse of a triangle", and float comparisons are about the fastest thing a CPU can do.
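    A hedged sketch of that "sides of a square" check, assuming the employee readings also keep raw Latitude/Longitude float columns (those column names and the tolerance values are assumptions to be tuned):

    ```sql
    DECLARE @LatTol float = 0.002;   -- ~0.14 miles of latitude
    DECLARE @LonTol float = 0.003;   -- a comparable distance at mid-US latitudes

    SELECT el.EmployeeID, a.AddressID, el.InsertDate
    FROM   dbo.EmployeeLocation AS el
    JOIN   dbo.Address AS a
      ON   ABS(el.Latitude  - a.Latitude)  <= @LatTol   -- plain float math,
     AND   ABS(el.Longitude - a.Longitude) <= @LonTol;  -- no trig, no spatial types
    ```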

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • We are just expecting that the user's locations will be acceptable if they are within a range of the lat/long of the store, hence I thought using the buffer around the store would accomplish this.

    1. I absolutely agree that using STBuffer() around the store's location is the way to go. My point was that you can store the polygon produced by STBuffer(), and build an index on it, rather than store just the point produced by the lat/long coordinates of each store.

    If you don't store the STBuffer() results, your query will have to do a table scan of all stores, generate the buffered area, then see if the employee's location is at a store using STIntersects(). You could just as easily do a table scan and use STDistance() to test if the employee's location was within a specified distance from the store's front door. Storing the buffered results in your stores table enables you to build an index that STIntersects() can take advantage of.
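    A minimal sketch of that suggestion (StoreGeo and its columns are assumed names, and the 150 m radius is a placeholder to be tuned per store size):

    ```sql
    CREATE TABLE StoreGeo
    (
        StoreID   int       NOT NULL PRIMARY KEY,
        GeoPoint  geography NOT NULL,   -- the store's lat/long as a point
        GeoBuffer geography NOT NULL    -- STBuffer() polygon, computed once
    );

    INSERT INTO StoreGeo (StoreID, GeoPoint, GeoBuffer)
    SELECT AddressID,
           geography::Point(Latitude, Longitude, 4326),
           geography::Point(Latitude, Longitude, 4326).STBuffer(150)  -- meters
    FROM   dbo.Address;

    CREATE SPATIAL INDEX IX_StoreGeo_Buffer ON StoreGeo (GeoBuffer);

    -- Off-hours cross-check: which stored buffers did each reading fall inside?
    SELECT el.EmployeeID, s.StoreID, el.InsertDate
    FROM   dbo.EmployeeLocation AS el
    JOIN   StoreGeo AS s
      ON   s.GeoBuffer.STIntersects(el.GeoLocation) = 1;
    ```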

    2. I was suggesting using geometry instead of geography for purposes of performance. At high volumes of data (millions of rows), a geometry spatial index significantly outperforms a geography spatial index. At small scales (a buffer radius of only 100 meters, for example), there should be only a small fraction of a percent difference in the results produced by geometry instead of geography.
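    If geometry does win on performance in testing, here is a sketch of the planar equivalent (again with assumed names). One caveat worth flagging: on a geometry column holding raw lat/long degrees, STBuffer() works in degrees rather than meters, so the radius has to be converted (0.001 degrees of latitude is roughly 111 m).

    ```sql
    -- Hypothetical planar copy of the same data; note geometry::Point takes
    -- X (longitude) first, unlike geography::Point.
    ALTER TABLE StoreGeo ADD GeoBufferPlanar geometry;

    UPDATE StoreGeo
    SET GeoBufferPlanar =
        geometry::Point(Longitude, Latitude, 0).STBuffer(0.0015);  -- ~150 m of latitude
    ```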

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • GSquared:

    I checked the Addresses table and it covers United States and Canada so it is a large area. As far as accuracy, I have asked what the tolerance should be. Since these employees are visiting stores (could be as big as a Walmart or as small as a 7-11), we need to come up with the proper tolerance that would assume they actually did visit the store.

    I will keep in mind your very important point that "float comparisons are about the fastest thing a CPU can do."

    Dixie Flatline:

    Makes perfect sense (store the polygon produced by STBuffer(), and build an index on it).

    It may be a while before we get to actually working with and comparing the data. For now, we just need to start storing employee geolocations. Once we do come up with a solution, I will update this post to save someone else a possible headache.

    Thank you BOTH very much! I knew I could rely on this forum!

    Lisa
