Huge Bitmap

  • I've recently encountered problem in work. We have quite huge bitmap which we use frequently and I was assigned to optimize it. It was obviously slow but let me just describe data size.

    It's about 50000000 (5*10^6) records and each has about 5000 (5*10^3) bits - each and every one important (donghh!).

    At first it was stored in few tables, each with row_id and few hundreds bit columns and no index, even PK (dongh! 2). Obviously - every search ended with choosing at least one table and full table scan. It was pain.

    My first suggestion was to pack bits into INT columns with ncl indexes on each one which has few advantages:

    1) one table instead of a couple

    2) possible index scans instead of table scan

    It works fine if you look in one column and has acceptable performance when searching on few columns. Still - few dozens columns criteria (while infrequent it is still possible) will still be slow as SQLServer has to scan several indexes and INTERSECT or ADD results (probably some combination). I understand that it is possible to make index seek instead of index scan for few, most frequently used bits in each column (instead of col & POWER(2,bit_no) = POWER(2,bit_no) i will get few <=, >= operators and several values) but it will not work for each and everyone.

    Do you have any suggestions how to handle this size of bitmaps in SQL Server efficiently (time efficiency is more important than storage efficency)?

  • Could you explain the need for such bitmap and the way you search it?

    It looks like like one of solutions for full text search I was evaluating many years ago and discarded.

  • I don't think it has anything to do with text index.

    It is a database of items which can be described with hundreds features. The features are ussually not correlated. Now - I perform two type of searches: to get the cardinality of the set and to get the set itself. The random query may look like (in my version of DB):

    SELECT COUNT(1) FROM table WHERE (col1 & POWER(2,24) = 0 OR col7 & POWER(2, 4) != 0) AND (col17 & POWER(2,0) != 0 OR (col211 & POWER(2,3) = 0 AND col77 & POWER(2,8) = 0))

    and

    SELECT row_id FROM table WHERE (col1 & POWER(2,24) = 0 OR col7 & POWER(2, 4) != 0) AND (col17 & POWER(2,0) != 0 OR (col211 & POWER(2,3) = 0 AND col77 & POWER(2,8) = 0))

    Note that row_id is Primary Key.

  • Index would help you with ordered search. Such bitmap search is random search of bit matrix. I'm not aware of any index type that would help you here. It always has to search the whole table.

    Alternatives are based on patterns.

    To what degree can you alter the table?

    What's the frequency of features? If each item can have 5-20, you can have a "compressed bitmap" to reduce the amount of data

    How many features you search at once? If the number is relatively low, reversing bit matrix would help to reduce the amount of IO.

    Are the any rules for grouping, hierarchy or any other relations between items or feature set? If so, you can transform it to sparse matrix.

  • The underlying application sounds a good candidate for sparse columns in SQL Server 2008.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop HΓ©lder CΓ’mara

  • To what degree can you alter the table?

    I can redesign it completely.

    What's the frequency of features?

    To be honest - I'm not sure. I would say that on average there are about few dozens of 1s, maybe more.

    Compressed bitmap

    Honestly I don't know what do you mean by that. (I know what compressed is but I've no idea what would be helpful and possible to describe with this phrase).

    How many features you search at once?

    The problem is that it's not me who will be using it. My guess is that query should not use more than 20 bits at once. I will test reversing data - as I understand you mean PIVOT.

    Sparse matrix, while generally very interesting, were introduced in 08'. I'm working on 05'.

  • Without knowing the structure of your bitmaps, but assuming there is a product ID or something similar for each row, can you change the table to have a row for each id and bit combination? That makes for lots of rows, but they would be small rows, and easily indexable. What sort of information do your queries return, and how are the results used?

  • Ross McMicken (3/18/2009)


    Without knowing the structure of your bitmaps, but assuming there is a product ID or something similar for each row, can you change the table to have a row for each id and bit combination? That makes for lots of rows, but they would be small rows, and easily indexable. What sort of information do your queries return, and how are the results used?

    I've thought about it, but I suspect that T-SQL statements INTERSECT and UNION are not that fast.

    About the results: I've posted queries few posts above. The difference is in WHERE segment only.

  • My approach would be alike Ross:

    Your product table without any bits. A second table with the product-id and one row for each set bit (setting).

    To get products with a specific configuration use either a CTE or a sub-select. INSERT the requested settings into a temp table (or another CTE). and join the temp table with your setting table and group by product-id as last you need a HAVING clause which says COUNT(*) needs to be the count of items within your temp table. So you never have more than three JOINed tables and it stays fast even for searches with 50 or more settings.

    Hope this was comprehensible... If not give me a short feedback and I'll try again.

    Greets

    Flo

  • Hi Flo,

    Your idea there is intriguing, but it left me with a couple of questions:

    1. Does it support AND / OR combined searches as in the posted examples?

    2. Can it look for unset bits too?

    3. If most bits are set in reality (dunno if this is the case or not) - wouldn't the bit-per-row-per-ID table approach 6M * 5k = 30Grows? (Gigarows are a new SI unit πŸ˜‰

    My thoughts are that this is a type of problem which is only seriously optimisable if one has knowledge of the data. The idea of selecting up to twenty bits, combined using AND & OR, set and unset, from a pool of 5000 bits across 6 million IDs is kinda scary! If we knew that most bits were unset, or other process-dependent information, a specific optimisation approach could be worked out. Maybe πŸ™‚

  • I tinkered with this idea a little about 2 weeks age (I will continue later; not enough time right now) but:

    any combination of and/or is possible if you use union/intersect and count product_ids in the resulting set

    unset bits are just bits for which there is now row (id, bit_no) so ids with unset bits are s.t. like

    (SELECT id FROM table WHERE id != ALL (SELECT id FROM table WHERE bit_no = {no}))

    and then you can intersect/union the set with s.t. else.

    But the raw size is quite enormouse. It was easily above 100m. rows and counting.

    On the other hand - much smaller number of indexes πŸ˜‰

    I'm just not sure about speed of INTERSECT/UNION

  • Hey there nimdil,

    In the quick test I threw together, the intersect was implemented as an inner loop join, seeking on an index on both inner and outer inputs. It seems a good plan for the number of rows I had in my table, but I share your concerns about its ability to scale. At some point the QO is going to decide that a hash/merge is the way forward, and...well I dread to think.

    Sure - unset bits are the absence of a row - but doesn't that require duplicating the 'search for set bits' code including it's group by = count of bits required logic?

    Implementing the check for unset bits (the != ALL or NOT EXISTS query example you listed) will require a bit_no IN (list_of_bit_numbers_to_check) rather than an equality, and with maybe a dozen not-set checks, the QO might not come up with a fantastic plan...if history is any guide.

    To cope with the AND & OR issue, wouldn't the set and/or unset searches need to be run for each part of the predicate, stored in a temporary area, and then custom SQL written to implement the AND/OR using INTERSECT or UNION ALL? I suppose on could do something fairly horrible with dynamic SQL...:blink:

    The potential physical (and intermediate result) row sizes are, as you say, staggering.

    All of the above is off the top of my head - I have only tried the smallest of simulations (32 bits for options and a few hundred rows in the row-per-set-option-per-ID table) - so I'm happy for anyone to point out an errors in my reasoning, as always πŸ™‚

    My intuitive position is that optimisation over the whole problem space is unrealistic. If we knew a few more things about the data, and customer requirements, there might be easier, quicker, and cheaper optimisations available.

    /paul

  • Hi Paul

    1. Does it support AND / OR combined searches as in the posted examples?

    It depends. For simple any-ORs sure you can use a join without the having-condition. For combination of AND and OR it becomes a little tricky. Maybe you can extend the temp table to query with a group-id and the required matching count but I didn't try for now. This should be possible:

    bit1 and bit2 and (bit3 or bit4 or bit5) and (bit6 or bit7)

    If you want to handle this I would say this will be the wrong solution:

    (bit1 and (bit2 or bit3)) or (bit4 or (bit5 and bit6))

    2. Can it look for unset bits too?

    LEFT OUTER or NOT EXISTING? To avoid to much joins to the bits-table here also temp table concept should be used.

    3. If most bits are set in reality (dunno if this is the case or not) - wouldn't the bit-per-row-per-ID table approach 6M * 5k = 30Grows? (Gigarows are a new SI unit πŸ˜‰

    My thoughts are that this is a type of problem which is only seriously optimisable if one has knowledge of the data. The idea of selecting up to twenty bits, combined using AND & OR, set and unset, from a pool of 5000 bits across 6 million IDs is kinda scary! If we knew that most bits were unset, or other process-dependent information, a specific optimisation approach could be worked out. Maybe πŸ™‚

    Well... I would say optimization of giga-row tables could be a new challenge :-). This solution depends on the average count of settings per product. I also could make sense to use two or more setting tables or put some of the more unusual settings back to an INT bitmap.

    As you see there are much dependencies, might's and may's...

    Greets

    Flo

  • Hey flo,

    Agreed. I put some further thoughts in my previous post which you might not have seen yet.

    I wonder whether the OP would see any benefit from using BIGINT columns to stuff the bits in and indexing those. Sure, a couple of index scans will be required, but I can't help wondering if it might not be more efficient (at least in the short term, until a full analysis can be done) than the end result of all this normalizing, ANDing, Oring, INTERSECTing, UNIONing and EXISTing...

    If that's a painfully naiive suggestion, please feel free to say! πŸ˜€

    /Paul

    Addendum: ...or even storing the entire bitmap as a binary type (5000 bits = 625 bytes), and using a CLR routine to to a full-bitmap comparison in one step? As long as the binary type used wasn't MAX and the CLR routine does no data access and is created with the appropriate attributes, it should be parallelizable...:w00t:

  • As we are working on 32-bit environment I've seen no advantage of BIGINT over INT as all bit-wise operations are made, as I recall, on 32bit segments. So more or less any bit operation on BIGINT is about two processor operations instead of one on INT. On the other hand - it is possible that fewer indexes could be accessed if BIGINTs are used as storage for bits.

    one huge binary bitmap is tempting idea, but I don't have much experience with CLR in SQL Server as my boss dislike CLR functions in SQL Server completely (as for now) :/

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply