Introduction to Bitmasking in SQL Server 2005

  • Good One

  • Hello...

     

    I have to agree with most former posters! Bitmasking is NOT the way to go... I can see the "posibility" to store "ONE" flagged enumerations inside a single field(This is "possibe but BAD"), but mixing several fields into one is EVIL. It will almost allways lead to some chaos.

    If you think about bitmasking in the suggested manner, why not deserialize complete object graphs directly into the Database? You will be able to get along with only 3 fields and one Table for your WHOLE System! Isnt that sweet? Of course you could encode the objecttype into the binary data also and you could do without a primary key... Now we got 1 Table with one field to store all our data! Sweet!

    But i am interested in indexing this system... Will you create a index for each datatype/bit position, so you can determine if bit 37 is set or not? A regular index wont help here, so you have to get fancy there also... (I can only think along the lines of computed colums (myBitMaskField & (128+256)) which need to be indexes seperately, but then you would have the "extra columns which you wanted to avoid...)

  • Eric Wilson - I could not have said it better. Well put.

    This is important information to have in one's toolkit, only for the times when you have to undo it in a system you've inherited from someone who thought it was a good idea 🙂

  • The points critcal of this article are well made.  That said - there is certainly a place for this sort of thing in a system designed to quickly capture lots of information in a condensed format from very many concurrent users.  I've used bitmasks to do just that and asyncronously replicated the data to relational data stores where the "secretary with an excel spreadsheet" can get at the information represented.  In fact the example the author mentions, that of capturing session information, is a perfect one for this sort of thing.  Bitmasking is obviously not a good idea everywhere however.  Did anyone say it was?  BTW - the 50's were great!  Good article!

  • Call me a "purist" or whatever, I don't care.  This article is well written and this kind of stuff is fun and interesting (in limited quantities), but it is also completely stupid to do in a relational database, period, end of story!  Any benefit accrued by this approach will be more than offset by the loss of data integrity and clarity.

    The arguments in favor of this technique betray a sad lack of understanding.  (I almost said "shocking" but this kind of ignorance has long ago ceased to shock me.) 

    The Relational Model says very little about data types and says absolutely NOTHING about limiting data type complexity, so arguments like "well, Dr. Codd couldn't have had data type X in mind when he came up with the RM, therefore the RM must be inadequate for handling data type X...blah, blah, blah..." are just stupidity of the highest order. 

    The problem with complex data types (including relation valued data types) is that in order to support them relationally you must have operators for them.  Having operators for a data type is much more than just being able to insert, update, select, and delete them.  The difficulties involved in implementing complex data types have NOTHING to do with the RM.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • I think I agree that bitmasks are to be avoided in most instances.  But what happens when the number of bit columns is not known at design time, for example the processor affinity mask?

    John

  • My stand - I liked the article (very well, and professionally, written imho),I want to understand the technique, but am NOT in favour of it's usage in almost any circumstance I can think off ... but that doesn't mean those curcumstances don't exist.

    I am looking forward to the second article, but I would like to see it more clearly stated in the article that this is an advanced technique, for use in specialized circumstances - I fear going into a client site in years to come, and seeing many systems absed on this design because "it was a really good way to do things - you should see the article I read on it...."

    Again, Lee - kudo's on the presentation of the articles, and on your repsonses in the forum which have been pertinent and not a dogmatic defence...

  • Ref. the example I gave above ... handles this very nicely. And, if you use VARxxx columns, it's not *overly* wasteful. Still not as small as a BIT column, nor as small as an encoded bit as described in this article, but I think that's already been well argued.


    --Mitch

  • Indeed it won't work - and that was my point.

    The bitmask should be 3 ie binary 00000011 to mask out everything except the bottom two bits. Then, Windows would decode as 01 = 1, Linux would decode as 10 = 2, and Mac as 11 = 3.  It would not matter in which order the comparisons were done.  Whereas in the original article, by masking everything but one bit, both Mac and Windows would decode as 1, and the order of the comparisons would need to be considered to ensure that it worked.

    In other words:

    mask 00000011 AND windows 00000001 = 1

    mask 00000011 AND linux       00000010 = 2

    mask 00000011 AND Mac        00000011 = 3

    works whatever order they are compared.  Also, I was looking at the technique, not just its implementation in T-SQL.  Someone could take your example and struggle to make it work in VB or C#

     

     

  • The IP Address problem is one that I am facing.

    When stored as a string it is a VARCHAR(15), it is human readable but difficult to do range searches.

    When stored as a BIGINT you can convert 192.168.10.1 to 192168010001. Still readable (just), you can do a range search on it and it now takes 8 bytes. For the pedants amongst you, yes I know you have to consider whether the column is NULLable.

    When stored as an INT by combining the octets you loose readability, but storage is now 4 bytes and you can do range searches.

    Consider that the IP address is coming from the log files of a large web farm registering millions of hits per hour and the 4 byte solution starts to look very attractive.

    When it comes to deciphering the IP address when it is encoded within an INT you don't. Well, not straight away.

    You do your data mining first and the significant IP addresses get decoded.

    If you want to do searches for a range of IP addresses you use the udf that encodes the IP address in the first place to tell you what encoded values to search for.

    I wouldn't recommend storing an IP address as 4 separate TINYINT fields for the reasons Joe stated. An octet by itself has no meaning, it is scalar but not atomic.

  • For IP address manipulation, we store them as BIGINTs. We have two UDFs; one that encodes an IP address as a BIGINT and one that converts back. Information is decoded just before display and encoded before used in a query. By doing the same with the netmask, we can find the range of addresses that are on a particular subnet and simply select those devices between the range of BIGINTs represented by the range.

    -- Mitch


    --Mitch

  • Our data analysts wanted to build a model that excluded IP addresses in a pretty obscure range

    The problem was coming up with a simple query that would work on 4 separate fields simultaneously.

    I like the display column idea but what is the performance like?

  • Thanks, but the WHERE Oct_1 IN () clause doesn't really provide an easy solution.

    I can't remember the specific IP address range for the Google ROBOT but similar exclusions look for something like

    72.108.10.2 to 73.90.205.1

    WHERE(
    Oct_1 = 72 
    AND (Oct_2>108 OR (Oct_2=108 AND Oct_3>10) OR(Oct_2=108 AND Oct_3=10 AND Oct_4>=2))
    )
    OR
    Oct_1 = 73
    AND (Oct_2<90 OR (Oct_2=90 AND Oct_3<=205) OR(Oct_2=90 AND Oct_3=205 AND Oct_4<=1))
    )
    

    I suspect that the data analysts may raise the relative complexity as an issue.

    Particularaly as a lot of queries are for ad-hoc data mining though I suppose that we could build a template for the data analysts to use.

  • What about adding an indexed calculated field based on those 4 columns.  Than would make the range search much more simpler and it should be much more overhead on the server (not more than having to scan an index (not sure here but it seems that a scan is hard to avoid)).

  • I don't think that there is any easy answer.

    One of the examples in the "Update your development skills to SQL2005" course uses a CLR datatype to hold an IP address.

    I will have to experiment to see if that works.

    The dilema I am facing is that on one hand I need efficient storage but on the other I need readable data.

    I haven't index a calculated field so I'm not sure if the field is calculated on the fly or whether it is materialised.

Viewing 15 posts - 31 through 45 (of 59 total)

You must be logged in to reply to this topic. Login to reply