• Jeff Moden (6/26/2008)


    The other option was taking each line of the calculation, adding a column to a join table between the two sets, and calculating one column at a time on the whole set of the join table. This was ugly code, and nowhere near so simple to maintain. But it was amazingly fast.

    How many rows did you end up with in that table, Gus? Was it a full Cartesian product or just a triangular join ((2-1)=ABS(1-2))

    Limited cartesian join, depending on business rules. For example, the 5-million to 1,500 join had about 300-million rows. (To be more precise, it was an inner theta join on a range of data, not a full cross join, but definitely many-to-many within pre-specified limits.)

    Then I broke up each part of a spherical distance calculation into a column. For example, the first thing you calculate is SIN(Latitude1/57.29577951), so I had a SinLat1 column, and had it run that as one calculation; next is SIN(Latitude2/57.29577951), same treatment. Instead of trying to run the whole calculation at once (which I did test), I had it run each piece on each row, and store the data in the column. Doing it as calculated columns ended up being slower in this case, as did trying to run the whole thing at once. Main problem was that 300-million floating point operations takes a LOT of RAM, and there are 12 such operations per distance calculation. That 3.6-billion calculations, and each takes a minimum of 16 bytes (two float data types), plus whatever overhead is used for the actual calculations. Meant it had to push the larger sets onto the hard drive, since I just simply didn't have that much RAM available. (That's a minimum of just about 57 Gig just for this, not including what the OS, SQL Server, etc., need, if I'm calculating it correctly. As opposed to 4.8 Gig per calculation when broken up.)

    Breaking it up like this meant each calculation one column at a time darn near pegged the RAM, but didn't end up in the swap file/tempdb.

    Another option would have been doing whole calculations atomically, but on a smaller subset of the rows at a time. That means a cursor/loop, and I just plain don't like messing around with those if I can avoid it.

    As mentioned, this is a heavy-lifting option that probably doesn't apply in very many cases, but it sure as heck was better than an inline function in this case, which is the point.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon