Introduction to Bitmasking in SQL Server 2005

  • "This kind of programming is sending us back to the old days of 1950's mag tape files and bit fiddling with assembly language. "

    Hi Joe,

    In prefacing the article, I could have included the fact that sometimes I like to just "geek" with things, just for the sake of "geeking". You know - Quality Geek Time? However, as I point out often to my students some of those practices in your web article "Ten things that I hate about you" - you have to realize that not everything will be relational, nor will it be in an acceptable normal form. Nor will it be what Dr Codd envisioned. Did he envision blobs in the database? Hardly. Did he envision (gasp) XML data and XML data types? No, not at all.

    A toolkit is used most effectively when the appropriate tool is best fit to the job. I did warn that this is not something that should be used all the time, nor is it an attempt to overpower or overcomplicate a solution. I know that it is probably aggravating to you but we have debated things before, and it's good for a lively debate. Remember when I posed a dozen questions to you about NULLs at SQLPass in Grapevine last year? Good, thought provoking arguments make the developer community richer. Blasting a technique that it sends us back to another day ... not so much. Fact is, things change. Did anyone ever think that we'd be using IDENTITY attribute all over creation? Go look at Microsoft's "Best Practice" material - they have them everywhere and strongly recommend them. Normally, I teach my students that 1) they're not relational, 2) they don't describe attributes, and 3) attributes aren't dependent on them. Wonder why they are, then, used so much?

    Most of your pet peeves, I've agreed with you. For this one, I do think that there is a use for it, regardless of whether it's 1950, 1960, or 2006!

    Oh, and thanks the copy of your book. It's great!

  • You could use a number of different data types for this operation - it's personal preference. Lots of folks might be included to use an integer, it doesn't matter. The varbinary or binary simply allows you to look at the mask and map the representation easier...

  • Not only 2000, but what about 7? Or 6.5? Or 4.2! Yes all of them work equally as well. It made more sense to title as I did rather than, for a SQL Site, calll it "Intro to Bitmasking in FirebirdSQL"

     

  • You have a point. I really don't think that the code is all that difficult, though. What you might gain is replacing brute-force, or lengthy, DDL statements for this fictitious 100 column table. If you're faced with a challenge to capture ten columns - not recommended.

    Another example where this was used in a most clever way, where I first saw it as such, was in the original Match.com search engine, created back around 1996-7. The mask stored many, many values which allowed the "first cut" of a search stored procedure to eliminate many unwanted candidates for match making. From that result set, the search process then looked to the actual tables where the data was stored for further refinement, processing using WHERE and EXISTS clauses.

    Again, just an approach to think about. It isn't applicable for everything that is stored in the database!

  • You could possibly do that, but the CASE statement is required nonetheless. IE, the order of the decode (3,2,1) must be as such or the AND might occur against the wrong value. (Try this in the incorrect order and see if it works - it won't!). Therefore, it's semantics. Positioning is required regardless, so to me it's personal preference.

  • Of course the big drawback to this type of data encoding is that you have to completely decode it before you can do anything meaningful with it. This means that SQL server can not use indexes to find entries having a particular bit set, except when you are looking for a COMPLETE SET of bits (ex: a specific bit pattern where every possible bit column is specified).

    We chose to implement a single attribute table for this type of storage in our application. Three columns: DeviceName, AttributeName and AttributeValue. For us, a clustered PK on DeviceName/AttributeName worked best (since we most often search by device name).

    This implementation is not limited to the number of attributes that can be recorded (i.e. you can add a new attribute without any DDL), is easily indexed and performs very well.


    --Mitch

  • That comes in Part 2! Stay tuned

  • Great article, thank you. I think bitmasking is particularly useful for dealing with data permutations (e.g. dimension members in a data mart). And for those people who think 100 columns are better than one - did you people just graduate from vo-tech? Sure, storage and hardware may be cheap but time is not. Who is going to maintain the schema for all these columns? What about the data dictionary (and you WILL need one for 100 columns)? Where do you draw the line - 10 cols, 100 cols, 1000 cols? What about the downstream users of the data like analysts and report developers? "Hmmm, let's see; column A547DD tracks the OS while columns A547DE tracks cookies. Piece of cake." I think not.

    This solution saves time, and quite frankly if you are paying $100+/hr for an app developer when they have to continuously reference the definition for 50, 100, or 1000 columns the cheap hardware really becomes immaterial. Hardware is capex, labor is not. Remember people: accountants rule the business, you do not.


    James Stover, McDBA

  • Mr Stover -

    "And for those people who think 100 columns are better than one - did you people just graduate from vo-tech? Sure, storage and hardware may be cheap but time is not. Who is going to maintain the schema for all these columns? ..."

    Multiple columns would indeed be the wrong answer. Normalizing those values into rows in another table, with a foreign key relationship, would both solve both the original problem and your maintenance-of-schema objection.

  • I'm inclined to disagree with the data mart example. If you refer to Ralph Kimball's The Data Warehouse Toolkit he does mention that any data that is mapped or masked (I'm paraphrasing) is not optimal. Here, space is not an issue - and decoding on the fly would probably not be the best idea either, especially if you have users and analysts looking at data through a cube viewer. How are you going to render values in that situation I have no clue. I have not seen a data mart/data warehouse/star schema implementation that uses something like this, and I probably wouldn't recommend it.

    To answer Mr. Aldrich re: normalization, you have a good idea but it might not be practical depending on the situation. Suppose that you had 20 attributes. Care to guess how many permutations (without grouping) you come up with? If you guessed 20! you are correct....big, big number. So again, this may or may not be possible, actually trying to normalize any columns to reduce redunancy.

    I'll get into indexing and prioritizing (that one mentioned earlier as well) in the next segment...that is if anyone actually wants the next segment.

  • I'll get into indexing and prioritizing (that one mentioned earlier as well) in the next segment...that is if anyone actually wants the next segment.

    Hang in there, Lee, and write that next article. From a purely theoretical view we would want to avoid bitmasks. But sometimes they are useful to solve a practical problem.

     

    K. Brian Kelley
    @kbriankelley

  • In a recent article posted here, the writer was trying to make an argument for setting the nocount on. He made a point that the extra row generated increased delay by .3%. If this is the case, then one can make the argument for using bit masking as it reduces the amount of data being transferred over the network. In the event you normalize the data, then you have multiple rows to process; in the event you create different columns, you are still increasing the amount of data transmitted. Not to mention having to generate DDL statements if you want to add additional values to process.

    In the end, as software developer and database admins, we are always looking to make our process as efficient and maintainable as possible. This technique can be another tool in our arsenal.

    Kudos to Lee for writing an article that explores yet another way for us to code.

  • Interesting article.  I tend to agree with Joe Celko on this, BUT, the bitmask manipulation is something that a good programmer should know in the languages he/she is using.

    For example, I used something similar to a bitmask to evaluate a set of hierarchial rules.  I could have three factors upon which my decision is based..  (this is a simplified version of what I did in a healthcare application - it allowed for many rule factors to be evaluated in different orders with new factors being added on the fly)

    1. Doctor

    2. Type of procedure

    3. Intended recipient of the letter  (ie referring doctor, internal notes, the patient, etc)

    You can have a table storing columns such as

    Doctor

    ProcType

    RecipientType

    TemplateToUse  -  refers to the template table.  could be NULL

    SendingMethod  -  indicates how the letter should be sent.  could be NULL

    etc, etc... Other properties stored here

    I set up a loop to go from 7 down through to 0 - bitmask values 111 through 000.  For each loop iteration I can use the bitmask to determine whether or not I include that factor in the check for a rule match.  I'm not explaining this very well - apologies   (was up late!) 

    If I was to attempt to match for Doctor 7, ProcType 2 and RecipientType 1 to get the desired Template, sending method, etc and I had the following rules recorded

    Doctor       ProcType       RecipientType      TTU      SM

    null            null                null                   fallback  post

    7               null               1                       temp1   email

    7               2                  null                    temp2   null

    null            null               1                       null       post

    etc

    It depends on which rule factor has most significance (doctor, procType or recipientType) as to what eventual TTU and SM (plus others) I get.  Using the bitmask to AND with my doctor=7, procType=2 and recipientType=1 factors I can check all possible rule combinations, in the correct order of checking them (from most specific to least specific) for rule values in the table.  I can stop when I have all non-null values for TTU, SM, etc.  My first rule in the table above (with nulls for the rule factors) ensures that I get, if nothing else was found, at least a fallback template and sending via the postal system.

    Anyhow, I typically steer clear of storing "encoded" values in columns in the DB for many reasons, but a big one is that an end user attempting to write a report in Crystal Reports with very minimal knowledge of SQL and Crystal functions, etc would not be able to decode the field - bitmask, IP as integer or otherwise.  Also, if you need to capture an extra bit, I'd hate to visit various reports to ensure their decoding routine can cope with the extra bit(s).  If the workaround suggested is a database view that does the bitmask decoding then you might as well just store the values as discrete bits in the table anyway.

    It was a good article - I'm looking forward to the next one.  I like the ones that provoke good (not cranky) discussions   Cheers!

  • "Hmmm, let's see; column A547DD tracks the OS while columns A547DE tracks cookies. Piece of cake." I think not.

    I can see where you're coming from now ... 'cause it's easier to have 100 undocumented bits than 100 columns that have been defined in a dictionary.

    (smacking forehead)


    --Mitch

  • Oh my heavens what crap!

    WARNING TO NEWBIES: Please do not be dissuaded by the seeming "balance" of comments supporting bit-masking as a "useful" technique.

    POINT #1: You cannot do bitmasking in a relational database. Period. Doing this "technique" by definition makes it no longer relational. (Sure, you can do it in some "relational" products, the same way you can hammer nails with an HP calculator. Doesn't mean you should.) This sort of "step outside the rules" way have custom handling data values was *exactly* the sort of idiocy relational was invented to avoid. Heed Celko's post earlier and the others that echo this sentiment.

    (To be clear, putting data into a "relational" product does not make it a relational database any more than typing characters in Word makes you a Lawyer. See http://www.datazulu.com/portal/Home/tabid/36/EntryID/4/Default.aspx.)

    POINT #2: It *is* useful to understand how this kind of thing works. This is because at some point in your career you are likely to inherit some crap like this built by people doing what's "cool" instead of what's right and you will have to untangle the mess. To that end, this is a well-written article. Good use of screen-shots, etc.

    I hope Part 2 covers how to screen-capture your order processing system, do text recognition, and print out a spreadsheet the secretary can use to type into the fulfillment system. This is another example of craptastic systems we are likely to inherit. It might be good to know the mind-set of those who build such things.

    POINT #3: Be especially cautious of so-called performance benefit. It is very unlikely in SQL Server due to the availability of 1-bit columns. (As other comments point out.) You should explore such extreme violations of clarity and logic like this technique advocates ONLY for cases where you prove the benefit and the solution is in an ultra-performance-sensitive area.

    (For cases I've inherited, 8 of 9 of them had no performance benefit whatever. Some were worse. They did cause confusion though, all 9 of them. One of the 9 had a very minor performance benefit and it was not a performance-critical system. Across dozens of systems I have never seen one that had such a bit-crunch and performance problem that this technique was the best solution.)

    For those who wish to turn a relational database engine into an indexed file system or a C++ persistence layer, then all bets are off. But most of us are trying to collect and manipulate data in ways that are most useful and transparent.

    To me, using bit-masking in an otherwise relational database is like injecting a little bit of COBOL into your Java app just to show you know COBOL. In every case I've seen this where the person was still around, it was clear the only *real* reason they did it was to feel smart. Well, sorry, but that's a very *dumb* reason to put hacks like this into a system.

Viewing 15 posts - 16 through 30 (of 59 total)

You must be logged in to reply to this topic. Login to reply