Geometric Median / Weiszfeld's algorithm

  • Does anyone have a function that determines the geometric function from a series of spatial points (geometric or geographic.)?

    If not I am ready to build one ... I found a clean C++ implementation that I can translate to T-SQL.

    Russ

  • Haven't done Weiszfeld's algorithm in TSQL, would love giving it a go as it is a good example of iterative approach which potentially could be converted into a set based solution. Could you possibly share the C++ code?

    😎

  • I've had a quick look at this and I suspect while it could be done in T-SQL (painfully and slowly :-D), a CLR function built around your C++ would probably be a better option.

    Here's a link to a python version Eirikur

  • mickyT (9/22/2014)


    I've had a quick look at this and I suspect while it could be done in T-SQL (painfully and slowly :-D), a CLR function built around your C++ would probably be a better option.

    Here's a link to a python version Eirikur

    Thanks!

    😎

  • Looking at it further, it may not be so bad in t-sql.

    Here's my crack at it. It currently uses a while loop because I'm having a few issues turning it into a recursive CTE. Might be able to sort something out later.

    I also included a formula for a Geometric Mean.

    -- Setup some sample data

    --------------------------

    SELECT * INTO sample

    FROM (VALUES

    (Geometry::Point(54,37,0))

    ,(Geometry::Point(45,75,0))

    ,(Geometry::Point(34,45,0))

    ,(Geometry::Point(87,124,0))

    ,(Geometry::Point(15,68,0))

    ,(Geometry::Point(32,84,0))

    ,(Geometry::Point(87,48,0))

    ,(Geometry::Point(01,78,0))

    ,(Geometry::Point(78,35,0))

    ,(Geometry::Point(125,78,0))

    ,(Geometry::Point(57,14,0))

    ) p(point);

    -- Do that actual work

    ----------------------

    DECLARE @tempX float = 0.0;

    DECLARE @tempY float = 0.0;

    DECLARE @iterations int = 50;

    DECLARE @denomsum float;

    SELECT @tempX = EXP(SUM(LOG(point.STX))/COUNT(*)),

    @tempY = EXP(SUM(LOG(point.STY))/COUNT(*))

    FROM sample;

    WHILE @iterations > 0

    BEGIN

    SET @iterations -= 1;

    SELECT @denomsum = SUM(1 / SQRT(POWER(@tempX - point.STX,2) + POWER(@tempY - point.STY,2)))

    FROM sample;

    SELECT @tempX = SUM((point.STX * (1 / SQRT(POWER(@tempX - point.STX,2) + POWER(@tempY - point.STY,2)))) / @denomsum),

    @tempy = SUM((point.STY * (1 / SQRT(POWER(@tempX - point.STX,2) + POWER(@tempY - point.STY,2)))) / @denomsum)

    FROM sample;

    END;

    -- Show the results

    -------------------

    SELECT point.STBuffer(2) from sample

    UNION ALL

    SELECT Geometry::Point(EXP(SUM(LOG(point.STX))/COUNT(*)), EXP(SUM(LOG(point.STY))/COUNT(*)), 0).STBuffer(4) -- GeometricMean

    FROM sample

    UNION ALL

    SELECT Geometry::Point(@tempx, @tempy,0).STBuffer(6) -- GeometricMedian

    ;

  • I pulled it off of another forum (what is the etiquette relating to that?). I'm better off converting it to T-SQL for my purpose as it will be used to populate a report in a third party app.

    // header files for IO functions and math

    #include <cstdio>

    #include <cmath>

    // the maximul value n can take

    const int maxn = 50001;

    // given a point (x, y) on a grid, we can find its left/right/up/down neighbors

    // by using these constants: (x + dx[0], y + dy[0]) = upper neighbor etc.

    const int dx[] = {-1, 0, 1, 0};

    const int dy[] = {0, 1, 0, -1};

    // controls the precision - this should give you an answer accurate to 3 decimals

    const double eps = 0.001;

    // input and output files

    FILE *in = fopen("adapost2.in","r"), *out = fopen("adapost2.out","w");

    // stores a point in 2d space

    struct punct

    {

    double x, y;

    };

    // how many points are in the input file

    int n;

    // stores the points in the input file

    punct a[maxn];

    // stores the answer to the question

    double x, y;

    // finds the sum of (euclidean) distances from each input point to (x, y)

    double dist(double x, double y)

    {

    double ret = 0;

    for ( int i = 1; i <= n; ++i )

    {

    double dx = a.x - x;

    double dy = a.y - y;

    ret += sqrt(dx*dx + dy*dy); // classical distance formula

    }

    return ret;

    }

    // reads the input

    void read()

    {

    fscanf(in, "%d", &n); // read n from the first

    // read n points next, one on each line

    for ( int i = 1; i <= n; ++i )

    fscanf(in, "%lf %lf", &a.x, &a.y), // reads a point

    x += a.x,

    y += a.y; // we add the x and y at first, because we will start by approximating the answer as the center of gravity

    // divide by the number of points (n) to get the center of gravity

    x /= n;

    y /= n;

    }

    // implements the solving algorithm

    void go()

    {

    // start by finding the sum of distances to the center of gravity

    double d = dist(x, y);

    // our step value, chosen by experimentation

    double step = 100.0;

    // done is used to keep track of updates: if none of the neighbors of the current

    // point that are *step* steps away improve the solution, then *step* is too big

    // and we need to look closer to the current point, so we must half *step*.

    int done = 0;

    // while we still need a more precise answer

    while ( step > eps )

    {

    done = 0;

    for ( int i = 0; i < 4; ++i )

    {

    // check the neighbors in all 4 directions.

    double nx = (double)x + step*dx;

    double ny = (double)y + step*dy;

    // find the sum of distances to each neighbor

    double t = dist(nx, ny);

    // if a neighbor offers a better sum of distances

    if ( t < d )

    {

    update the current minimum

    d = t;

    x = nx;

    y = ny;

    // an improvement has been made, so

    // don't half step in the next iteration, because we might need

    // to jump the same amount again

    done = 1;

    break;

    }

    }

    // half the step size, because no update has been made, so we might have

    // jumped too much, and now we need to head back some.

    if ( !done )

    step /= 2;

    }

    }

    int main()

    {

    read();

    go();

    // print the answer with 4 decimal points

    fprintf(out, "%.4lf %.4lf", x, y);

    return 0;

    }

    Eirikur Eiriksson (9/22/2014)


    Haven't done Weiszfeld's algorithm in TSQL, would love giving it a go as it is a good example of iterative approach which potentially could be converted into a set based solution. Could you possibly share the C++ code?

    😎

  • Nice, I'm going to check it out tomorrow morning. I was using Geometric Mean and comparing it to Absolute distance from a separate point (to measure efficiency). If I run this against my data (which will be around 5,000 samples) to make sure the geographic median always has closer sums of distances than the mean centroids and my facility location.

    Russ

    mickyT (9/22/2014)


    Looking at it further, it may not be so bad in t-sql.

    Here's my crack at it. It currently uses a while loop because I'm having a few issues turning it into a recursive CTE. Might be able to sort something out later.

    I also included a formula for a Geometric Mean.

  • While I'm on a bit of a roll. Had a bit of success with a rCTE, but it does require a iTVF to be created.

    Removed: Incorrect results

  • Thanks a bunch mickyT!!! I've added a user-defined table type to make this re-usable from my GIS database

    CREATE TYPE udttPDXGeom AS TABLE (geompt GEOMETRY)

    CREATE TYPE udttPDXGeog AS TABLE (geogpt GEOGRAPHY)

    And added a function that does your math (allowing for iterations to be a parameter so it is flexible upon usage):

    CREATE FUNCTION udfPDXGeomMedian (@geomPts udttPDXGeom READONLY, @iIterations NUMERIC)

    RETURNS GEOMETRY AS

    BEGIN

    DECLARE @tempXFLOAT = 0.0;

    DECLARE @tempYFLOAT = 0.0;

    DECLARE @denomsumFLOAT;

    SELECT@tempx = EXP(SUM(LOG(geompt.STX))/COUNT(*)),

    @tempy = EXP(SUM(LOG(geompt.STY))/COUNT(*)) FROM @geomPts

    WHILE @iIterations > 0

    BEGIN

    SET @iIterations -= 1;

    SELECT @denomsum = SUM(1 / SQRT(POWER(@tempX - geompt.STX,2) + POWER(@tempY - geompt.STY,2)))

    FROM @geomPts

    SELECT @tempX = SUM((geompt.STX * (1 / SQRT(POWER(@tempX - geompt.STX,2) + POWER(@tempY - geompt.STY,2)))) / @denomsum),

    @tempy = SUM((geompt.STY * (1 / SQRT(POWER(@tempX - geompt.STX,2) + POWER(@tempY - geompt.STY,2)))) / @denomsum)

    FROM @geomPts;

    END

    RETURN GEOMETRY::Point(@tempx, @tempy,0)

    END

    So the usage is down to this for me ...

    DECLARE @sample AS udttPDXGeom

    INSERT INTO @sample

    VALUES

    (Geometry::Point(54,37,0))

    ,(Geometry::Point(45,75,0))

    ,(Geometry::Point(34,45,0))

    ,(Geometry::Point(87,124,0))

    ,(Geometry::Point(15,68,0))

    ,(Geometry::Point(32,84,0))

    ,(Geometry::Point(87,48,0))

    ,(Geometry::Point(01,78,0))

    ,(Geometry::Point(78,35,0))

    ,(Geometry::Point(125,78,0))

    ,(Geometry::Point(57,14,0))

    SELECT geompt.STBuffer(2) FROM @sample UNION ALL

    SELECT dbo.udfPDXGeomMedian(@sample, 50).STBuffer(6)

    Do you think this would translate will to GEOGRAPHY data types? I have all of my locations in both the spatial "4326" SRID as well as the "2601" Stateplane planar coordinates ... but the google maps API only supports Lat/Long so I would have to translate in my code which is expensive.

    Russ

  • Quick question, how accurate does this have to be? The reason for asking is that one can do a good approximation using the spatial data type methods and functions, happy to demonstrate if that's good enough.

    I have looked at the python and the C++ samples, think mikyT has pretty much nailed that approach, would like to prove otherwise if I have time:-D (good job mikyT)

    😎

  • Well that didn't go so well ....

    CREATE FUNCTION udfPDXGeogMedian (@geogPts udttPDXGeog READONLY, @iIterations NUMERIC)

    RETURNS GEOGRAPHY AS

    BEGIN

    DECLARE @tempXFLOAT = 0.0;

    DECLARE @tempYFLOAT = 0.0;

    DECLARE @denomsumFLOAT;

    SELECT@tempx = EXP(SUM(LOG(geogpt.Long))/COUNT(*)),

    @tempy = EXP(SUM(LOG(geogpt.Lat))/COUNT(*)) FROM @geogPts

    WHILE @iIterations > 0

    BEGIN

    SET @iIterations -= 1;

    SELECT @denomsum = SUM(1 / SQRT(POWER(@tempX - geogpt.Long,2) + POWER(@tempY - geogpt.Lat,2)))

    FROM @geogPts

    SELECT @tempX = SUM((geogpt.Lat * (1 / SQRT(POWER(@tempX - geogpt.Long,2) + POWER(@tempY - geogpt.Lat,2)))) / @denomsum),

    @tempy = SUM((geogpt.Lat * (1 / SQRT(POWER(@tempX - geogpt.Long,2) + POWER(@tempY - geogpt.Lat,2)))) / @denomsum)

    FROM @geogPts;

    END

    RETURN GEOGRAPHY::Point(@tempx, @tempy,0)

    END

    Sample data:

    DECLARE @sampleGeog AS udttPDXGeog

    INSERT INTO @sampleGeog

    VALUES

    (GEOGRAPHY::Point(44.478311,-122.62196,4326))

    ,(GEOGRAPHY::Point(44.4600192,-122.6435081,4326))

    ,(GEOGRAPHY::Point(44.5137321,-122.6458555,4326))

    ,(GEOGRAPHY::Point(44.505355,-122.58344,4326))

    ,(GEOGRAPHY::Point(44.518456,-122.6453,4326))

    ,(GEOGRAPHY::Point(44.4941156,-122.5793172,4326))

    SELECT geogpt.STBuffer(100) FROM @sampleGeog

    UNION ALL SELECT dbo.udfPDXGeogMedian(@sampleGeog, 500).STBuffer(300)

    Yields:

    (6 row(s) affected)

    Msg 3623, Level 16, State 1, Line 13

    An invalid floating point operation occurred.

  • The more accurate it is, the more money I will save :-). So for an area of 30 miles by 30 miles, I would like to get the Geographic Median within maybe a tenth of a mile. I'm dealing with ten of thousands of trips a year so every bit helps.

    Russ

  • busraker (9/24/2014)


    I'm dealing with ten of thousands of trips a year so every bit helps.

    Russ

    Sounds a little like travelling salesman here:-)

    This isn't too high of a volume, cannot say much but these are kind of hourly type volumes where I'm coming from. I'll piece together an example and post it shortly when I have time.

    😎

  • Eirikur Eiriksson (9/24/2014)


    Sounds a little like travelling salesman here:-)

    This isn't too high of a volume, cannot say much but these are kind of hourly type volumes where I'm coming from. I'll piece together an example and post it shortly when I have time.

    😎

    The only miles I log are on the CPU bus 🙂

    I'm thinking I will simply cast the log and lat as Geometries instead of Geographies and convert back to geography after getting the geometric median. It will be close enough I think, and may better then the Geometric Mean

  • busraker (9/24/2014)


    Well that didn't go so well ....

    ...

    SELECT@tempx = EXP(SUM(LOG(geogpt.Long))/COUNT(*)),

    @tempy = EXP(SUM(LOG(geogpt.Lat))/COUNT(*)) FROM @geogPts

    ...

    END

    Yields:

    (6 row(s) affected)

    Msg 3623, Level 16, State 1, Line 13

    An invalid floating point operation occurred.

    The floating point error will be due to the LOG getting a negative I think.

    I suspect the geography version will not be accurate enough for you. It would depend on the north south spread of your data.

    If you aren't adverse to using CLRs, there is a few libraries out there that can help with reprojections.

    http://sqlspatialtools.codeplex.com/wikipage?title=Current%20Contents&referringTitle=Home

    http://www.doogal.co.uk/dotnetcoords.php

Viewing 15 posts - 1 through 15 (of 24 total)

You must be logged in to reply to this topic. Login to reply