The Set-Based Limit

  • I actually had something a while back that worked better procedurally than set-based, similar to the PIN code thing here.

    Six character alpha-numerics, each one needs to be unique, certain limitations on combinations allowed (nothing that could end up being obscene/racist/etc.), certain limitations on which letters to use, because of readability issues and phonic similarity (when someone is reading "N10501" over the phone, it's hard to tell if it's "M10501" or "N10501", so we didn't inclue any "N"s in the combinations, and so on). And nothing below four characters allowed.

    Needed in batches of about 5-10,000 at a time, usually about 100k per week.

    Because of various letter limitations, ended up with basically a six-digit, base-23 number system, which is a little over 148-million valid codes. Enough for about 10 years of operation before we'd have to expand to seven digits.

    It was simple enough to set up a table of the allowed characters and build a six-level self-cross-join (with the last two being outer joins to allow four- or five-digit combos), with "distinct" in the select.

    I set it to run at the end of work on a Monday, and when I came in Tuesday, the machine it was running on had run out of disk space and tempdb and its log file were completely filling everything up! It had crashed completely after six hours.

    So I fell back and punted, and created a set of nested loops that would iterate through the characters for each digit, and insert the rows one at a time into a table in a database with simple recovery. It finished the whole thing in about an hour, then I deleted the "don't use anything that resembles these combinations" list in a single pass, and was done with that in another hour.

    After further testing, I found that creating them in batches of 12,167 (23 cubed), by replacing the innermost three While loops with a pre-created table, sped up the process so that it finished in a few minutes.

    In that case, the purely set-based solution, with no loops, etc., in my part of the code, was simply too much for the hardware I was running it on, and I had to break the problem down into smaller pieces and run those procedurally to keep the server from having fits.

    So, there's on situation where I consider procedural code valid.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Jeff Moden (1/22/2009)


    peter (1/22/2009)


    Jeff, any luck in finding the saved "function argument" you mentioned?

    No, sorry... I can't even remember who the "argument" was with...

    Touch a minimal set of data, a minimal number of times, and in a minimal amount of set based steps as the optimizer or query complexity allows you to get away with. It is all about getting your results fast and efficient (and preferably in a non-blocking way) and not about a holy war of set vs procedural.

    Use set based constructs where you can as it exploits indexes and such and break it up in several set based stages when it gets to complex or inefficent due to optimizer/information limitations. This later is perceived as procedural by some. I would like to counter that this observation it is irrelevant and that set based solutons usually manage to achieve the primary goal in one step...doing the least amount of work...but sometimes you nead to break things up to achive this goal.

    And remember that under the hood ALL software is procedural, the set based aproach is just a mathematical model on top of it that allows for automatic optimizations to happen...and as we all know, no single model is perfect.

    That's mostly right (especially about touching the data the fewest number of times possible) except for one thing that I argue over and over... people simply give up on "set based" solutions too early because they simply haven't studied their trade well enough to "just know" the proper set based solution (I emphasize "proper" because there's a lot of wrong ways to do it and it sometimes picks up a bad name from those wrong ways). For me, it is a bit of a "holy war" because I've seen so much very slow, resource demanding code come from it along with some of the really lame excuses people use to justify RBAR.

    I understand how you feel...there is a quite some bad code and even more bad advice out on the net and the deeper I get into things the more that seems to be the case. But rest assured that your articles do help people of every skill level, and I myself picked up a few tricks from SQL Central and you in particular. Personally I try to be pragmatic in that when confronted with a possible new ingredient for a solution I try to understand and apply it. If it works well and meets certain professional criteria (maintainability / abstraction) then I stick with it until something better is found by me or someone else. And I always try to improve on the original if nothing else just to understand the ingredient I use better and a lot of the time I succeed :).

    Right now I do feel tempted to try and put an article together that analyzes functions more in-depth and demonstrates how to best use them according to my personal experience. That is without drowning in RBAR or being heavy on procedural constructs that most of the time are not needed. But given that it takes me ages just to write this post (I have this problem with every post I make), don't expect it soon. A lot of the possible material is already earlier posts on this forum, but to make an informative coherent article takes a lot more work.

    I will give it some deeper thought!

  • GSquared (1/22/2009)


    Needed in batches of about 5-10,000 at a time, usually about 100k per week.

    So, there's on situation where I consider procedural code valid.

    Are you saying it took overnight to run 10,000 of these? Also, what were you using for the random selections?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/22/2009)


    GSquared (1/22/2009)


    Needed in batches of about 5-10,000 at a time, usually about 100k per week.

    So, there's on situation where I consider procedural code valid.

    Are you saying it took overnight to run 10,000 of these? Also, what were you using for the random selections?

    Not at all random. And no, it didn't take overnight to run 10k of these, it took overnight to try to pre-generate the whole set.

    Here's a bit more of the business scenario:

    You're mailing about 100k cards per week, a little over 5-million per year, each one needs to have a barcode and a matching human-readable code on it, between 4 and 6 characters, and needs to be completely unique. If needed (doubtful), it could expand to 7 characters at some undefined point in the future. The generated codes need to not spell out words, need to not have "license plate words", and definitely need to not have anything offensive in them. By "license plate words", I mean combinations like "I69U" (sexual), or "H8YU" ("H8" = "hate"), or "G02HL" (G-zero-two-H-L).

    When the address lists are being processed, the whole process, include CASS certification, postal presort, assigning these unique codes, and output in mail-merge format for printing, needs to take under 5 minutes, to fit the needs for production and mailing.

    These batches can be anywhere from 5,000 at a time to 30,000 at a time, but if more than 10,000 at a time, it gets broken down into individual presort and output lists no larger than 10,000 each.

    The CASS and presort engine runs approximately 5k/minute on existing hardware and software. Simple hardware upgrade, by test, could get about a 10% improvement for a cost comparable to 6 months' profit for this line of business, and was thus not a valid ROI.

    Given those business requirements, what would your solution be? And, yes, in that line of business, these requirements all made sense and were valid.

    Mine was to pre-generate and pre-clean the combinations, store those in a table with an ID column, and then assign them by simply joining to the ID column in the address list table.

    Pre-generating them the way I did took a few hours total, and then the assignment would take a few milliseconds for each batch.

    Not that it will matter, since that company is out of business, but if you can come up with something more efficient, I'd be curious about it.

    Edit: One more requirement is that the codes have to be easy to read and understand over a phone, across regional accents, when being read by people who are mostly over 60 years old. (Voices and intonation often change as age increases.) Thus, you need to eliminate either "M" or "N", since they are too similar over the phone, and you can't include "O", because it looks too much like a zero. Ended up eliminating "P" and "D", since they sound too much like "B", and "Z" because it looks like a "2" to some people. Can't remember all the letters we got rid of, but we ended up with base-28 if I remember correctly.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Assuming my numbers are right, you generated a set of over 480 million codes overnight (before dropping the H8 and 69s)??

    I'd definitly not consider this slow by all means... and it's quite obvious why the 7th digit is not really required :-P. Can you imagine anyone needing and 7th or 8th digit for physical mailings (not e-mails) :w00t:.

  • Yeah, it was something in that range. I don't have access to that database any more (since it was for a company that no longer exists), but it was pretty huge.

    And, as per my original statement on it, imagine what tempdb would go through if I just tried to create that with a set of cross joins all in one pass.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I'm not sure it was worth storing 43 years worth of numbers (480,000,000/30,000/365) even if (at 6 bytes with no overhead) that works out to be just 2.8 gig. Everything else sounds pretty copasetic.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • peter (1/22/2009)


    Right now I do feel tempted to try and put an article together that analyzes functions more in-depth and demonstrates how to best use them according to my personal experience. That is without drowning in RBAR or being heavy on procedural constructs that most of the time are not needed. But given that it takes me ages just to write this post (I have this problem with every post I make), don't expect it soon. A lot of the possible material is already earlier posts on this forum, but to make an informative coherent article takes a lot more work.

    I will give it some deeper thought!

    Heh... while you're thinking, write in down in a Word document and save your examples... Then, all you have to do is organize the document, copy the code in, and in only a week or two, viola! You have a world class article.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/23/2009)


    I'm not sure it was worth storing 43 years worth of numbers (480,000,000/30,000/365) even if (at 6 bytes with no overhead) that works out to be just 2.8 gig. Everything else sounds pretty copasetic.

    We were using about 5-million per year, and that business line was expected to at least double in the next couple of years. If the marketing people had been able to market, even a little bit, and the salespeople had any inclination whatsoever to sell, it could actually have been pushed up to 10X volume in about three years, and kept going up from there. (50-million pieces per year was definitely possible.)

    Instead, they got a wild hair up their ... um ... well ..... Instead they decided to drop both of the successful product lines and pursue a line of business that had no proven track record, huge overhead, tremendous development cost and timeline, and nothing going for it except a sort of "wow, that would be cool if it could possibly work" ... well, that and and commissions that were higher than the successful product lines.

    480-million was overkill, but perhaps not as badly as it might sound.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Like I said... just a little of 2.8 GB... I've got files much larger than that.

    I've been through that with 2 companies now... they ignore the flagship product that could make a lot of bread and butter and go wandering off into the unknown for the "Gee-whiz" of it all. Terrible way to crash and burn because they never actually see it coming. Hurts a lot of good people working for them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/23/2009)


    Everything else sounds pretty copasetic.

    That's the first time in many years I've have to go and look up the definition of a word because I had absolutely no idea what it meant and couldn't work it out! 🙂

    For those, like me, who don't use North American English:

    co·pa·cet·ic or co·pa·set·ic

    adj.

    Very satisfactory or acceptable; fine: "You had to be a good judge of what a man was like, and the English was copacetic" John O'Hara.

    --------------------------------------------------------------------------------

    [Origin unknown.]

    (From http://www.thefreedictionary.com/copacetic)

    Derek

  • Heh... if it's not in the "SQL Dictionary", I usually don't get it, either. I learned one just recently that very well describes some of the source of "flames" on some of the threads I've seen over the years... cool sounding word (pair of words, actually)...

    [font="Arial Black"]ad hominem [/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Is that from when I told someone to stop using those on you, Jeff? Just last week, if I remember correctly.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (1/26/2009)


    Is that from when I told someone to stop using those on you, Jeff? Just last week, if I remember correctly.

    Yes, Sir! And I have to tell you, thank you very much for the position you took on that one... it was very timely and very much appreciated.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You're welcome.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 15 posts - 61 through 74 (of 74 total)

You must be logged in to reply to this topic. Login to reply