Huge Bitmap

  • nimdil (4/3/2009)


    As we are working in a 32-bit environment I've seen no advantage of BIGINT over INT, since all bit-wise operations are, as I recall, performed on 32-bit segments. So more or less any bit operation on a BIGINT takes about two processor operations instead of one on an INT. On the other hand - it is possible that fewer indexes would have to be accessed if BIGINTs were used as storage for the bits.

    One huge binary bitmap is a tempting idea, but I don't have much experience with CLR in SQL Server, as my boss dislikes CLR functions in SQL Server completely (for now) :/

    Let's say you had 64 bits to compare. Would you rather compare a 64-bit value once, perhaps using two CPU instructions, scanning one index...OR...scan two indexes and compare twice as many 32-bit values? 😉

    This might be an excellent opportunity to learn a new skill and show your boss:

    (a) How clever you are and how he should double your rate immediately; and

    (b) How useful CLR can be (assuming my idea isn't complete idiocy)*

    * on this, I did have some success with implementing a compression function in C# CLR which was parallelizable and really helped in a specific case. This was before 2008 of course!

  • Well yes - if all the bits are actually in one column. An extremely rare case.

    About my boss... well - he is also an IT guy. He just doesn't trust CLR in SQL 😉 I assume I could work on this idea in deep secrecy (all documentation encrypted with a strong password etc. :P) and finally - kaboom! It works flawlessly, we can use it and be happy.

    But there is one thing which should be noted - an argument against BIGINT and BINARY(8000).

    With bit operations we get an index scan, not an index seek. So scanning an index on BINARY(8000) is like scanning the whole table. Not very good. And an INT index should be smaller than a BIGINT index, so if I have to test two bits in two columns, scanning two INT columns should be faster than scanning two BIGINT columns. Quite important.
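
    Just to illustrate what I mean (made-up table and column names) - a bit-test predicate like this hides the column inside an expression, so the optimizer can only scan the index, never seek:

    SELECT Id
    FROM dbo.Products
    WHERE Flags & 4 = 4 -- test bit 2; the bitwise AND makes the predicate non-sargable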

  • Not quite. Let me try again:

    [Note that the probability of finding the bits you are looking for in one column improves linearly with the number of bits in that column...]

    Whatever the number of bits, they all have to be read and compared at some stage. Index scans are expensive. Reading all the bits in, say, 6 indexes (BIGINT) is clearly better than reading 12 indexes (INT). See?

    I can't help you with your boss's shortcomings, sadly 😉

    Scanning an index on a BIGINT is not at all like scanning the whole table. The index will be 8 bytes wide plus overhead. Many more index rows will fit in a single 8K page than rows of the table as a whole. Therefore more index rows will fit into buffer cache, and into any physical IO that occurs. This is one of the fundamental reasons that indexes work so well - it is important that you grasp it fully!
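
    If you want to see this for yourself, here is a rough sketch (hypothetical table name - point it at any indexed table; 'DETAILED' mode is needed for the record-size column) that compares the leaf-level page count and average row size of each index:

    SELECT i.name AS index_name,
           s.index_type_desc,
           s.page_count,
           s.avg_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyBitmapTable'), NULL, NULL, 'DETAILED') AS s
    JOIN sys.indexes AS i ON i.object_id = s.object_id AND i.index_id = s.index_id
    WHERE s.index_level = 0 -- leaf level only
    ORDER BY s.page_count

    The narrow nonclustered indexes will show far fewer pages than the clustered index (the table itself).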

    I'll leave finding the logical flaw in your last paragraph to you! Hint: following that line of thinking, SMALLINTs would be better than INTs, TINYINTs even better, and - well - BITs would be awesome!!!

    Finally, yes - bitwise operations always require a scan because all values must be compared. Don't be scared of scans though! They really are a good solution, if all rows must be examined.

    /Paul

  • I understand. My point is that the probability of the 12 bits I need landing in only 6 BIGINTs (when I have 5000 bits) is quite low without good statistics. I don't have good statistics :]

  • nimdil (4/3/2009)


    I understand. My point is that the probability of the 12 bits I need landing in only 6 BIGINTs (when I have 5000 bits) is quite low without good statistics. I don't have good statistics :]

    But you know which bits you are after, so you should know in which indexes those bits will be found, no?

    5000 / 64 bits = 79 columns of BIGINT

    5000 / 32 bits = 157 columns of INT

    So bits 0-31 are in the first INT column, and bits 0-63 would be in the first BIGINT column, and so on.

    If you know you need to compare bits 1, 29, 44, 62 (as a simple example) you would have to scan the first two INT indexes. Or scan one BIGINT index. Granted, the distribution won't always be this favourable, but still.
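
    To make the mapping concrete, a quick sketch (my own variable names, assuming 0-based bit numbering) that works out which column, and which bit within it, a given bit number falls into:

    DECLARE @BitNumber INT = 44

    SELECT @BitNumber / 32 AS IntColumnNumber,    -- 0-based INT column holding the bit
           @BitNumber % 32 AS IntBitPosition,     -- bit position within that INT
           @BitNumber / 64 AS BigIntColumnNumber, -- 0-based BIGINT column holding the bit
           @BitNumber % 64 AS BigIntBitPosition   -- bit position within that BIGINT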

    Worst case you need 20 bits. They all happen to fall in different indexes, both for the INT case and for the BIGINT case. So that's 20 index scans. I can only suggest you try this out for yourself to see how little difference it makes in practice.

    Best case, 10 index scans are required on the BIGINT versus 20 index scans for the INT. I hope you can see that the 10 index scans will always be massively quicker than 20 scans, even if the index is 4 bytes less wide in the second case.

    /Paul

    P.S. With N to 2N index scans for an N-bit search, you can see why I like the 1-scan CLR idea.

  • Hi Paul

    I tested both solutions and you are completely right with BIGINT (or BINARY in combination with a CLR). I see there is still much to learn for a non-DBA ;-). I'm a developer.

    Just for information, my tests...

    Sample tables with some sample data:

    DBCC SHRINKDATABASE (Sandbox)

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    IF (OBJECT_ID('BMProduct') IS NOT NULL)
        DROP TABLE BMProduct

    IF (OBJECT_ID('BMSetting') IS NOT NULL)
        DROP TABLE BMSetting

    GO

    CREATE TABLE BMProduct
    (
        Id INT IDENTITY,
        name VARCHAR(100),
        BigInfo BIGINT,
        PRIMARY KEY CLUSTERED (Id)
    )

    CREATE INDEX IX_BMProduct_BigInfo ON BMProduct (BigInfo)

    CREATE TABLE BMSetting
    (
        ProductId INT,
        Setting INT,
        PRIMARY KEY CLUSTERED (Setting, ProductId)
    )

    GO

    CREATE INDEX IX_BMSetting_Setting ON BMSetting (Setting)

    -- some products

    INSERT INTO BMProduct (name, BigInfo)
    SELECT TOP(1000000) c1.name, 0
    FROM master.sys.all_columns c1
    CROSS JOIN master.sys.all_columns c2

    GO

    -- some BIGINT sample settings

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 1) WHERE Id % 2 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 2) WHERE Id % 3 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 3) WHERE Id % 4 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 4) WHERE Id % 5 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 5) WHERE Id % 6 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 6) WHERE Id % 7 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 7) WHERE Id % 8 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 8) WHERE Id % 9 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 9) WHERE Id % 10 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 10) WHERE Id % 11 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 11) WHERE Id % 12 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 12) WHERE Id % 13 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 13) WHERE Id % 14 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 14) WHERE Id % 15 = 0

    UPDATE BMProduct SET BigInfo = BigInfo | POWER(2, 15) WHERE Id % 16 = 0

    -- some related sample settings

    INSERT INTO BMSetting SELECT Id, 1 FROM BMProduct WHERE Id % 2 = 0

    INSERT INTO BMSetting SELECT Id, 2 FROM BMProduct WHERE Id % 3 = 0

    INSERT INTO BMSetting SELECT Id, 3 FROM BMProduct WHERE Id % 4 = 0

    INSERT INTO BMSetting SELECT Id, 4 FROM BMProduct WHERE Id % 5 = 0

    INSERT INTO BMSetting SELECT Id, 5 FROM BMProduct WHERE Id % 6 = 0

    INSERT INTO BMSetting SELECT Id, 6 FROM BMProduct WHERE Id % 7 = 0

    INSERT INTO BMSetting SELECT Id, 7 FROM BMProduct WHERE Id % 8 = 0

    INSERT INTO BMSetting SELECT Id, 8 FROM BMProduct WHERE Id % 9 = 0

    INSERT INTO BMSetting SELECT Id, 9 FROM BMProduct WHERE Id % 10 = 0

    INSERT INTO BMSetting SELECT Id, 10 FROM BMProduct WHERE Id % 11 = 0

    INSERT INTO BMSetting SELECT Id, 11 FROM BMProduct WHERE Id % 12 = 0

    INSERT INTO BMSetting SELECT Id, 12 FROM BMProduct WHERE Id % 13 = 0

    INSERT INTO BMSetting SELECT Id, 13 FROM BMProduct WHERE Id % 14 = 0

    INSERT INTO BMSetting SELECT Id, 14 FROM BMProduct WHERE Id % 15 = 0

    INSERT INTO BMSetting SELECT Id, 15 FROM BMProduct WHERE Id % 16 = 0

    GO

    --SELECT COUNT(*) FROM BMProduct

    -- Result: 1,000,000

    --SELECT COUNT(*) FROM BMSetting

    -- Result: 2,380,726

    SET STATISTICS TIME ON

    SET STATISTICS IO ON

    Test: Give me all products with setting "3 and 4"

    -- Over BigInfo

    SELECT COUNT(*)
    FROM BMProduct
    WHERE (BigInfo & POWER(2, 3) != 0)
    AND (BigInfo & POWER(2, 4) != 0)

    -- -----------------------------------------------------

    -- STATISTICS

    --SQL Server parse and compile time:

    -- CPU time = 0 ms, elapsed time = 4 ms.

    --SQL Server parse and compile time:

    -- CPU time = 0 ms, elapsed time = 0 ms.

    --(1 row(s) affected)

    --Table 'BMProduct'. Scan count 1, logical reads 2239, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- SQL Server Execution Times:

    -- CPU time = 312 ms, elapsed time = 323 ms.

    -- -----------------------------------------------------

    -- Over Setting table

    ;WITH curr (Setting, GroupId, ResultCount) AS
    (
        SELECT 2, 1, 2
        UNION SELECT 4, 1, 2
    ),
    s (ProductId) AS
    (
        SELECT ProductId
        FROM BMSetting s1
        JOIN curr ON s1.Setting = curr.Setting
        GROUP BY s1.ProductId, curr.GroupId, curr.ResultCount
        HAVING COUNT(*) = curr.ResultCount
    )
    SELECT COUNT(*)
    FROM BMProduct p
    JOIN s ON p.Id = s.ProductId

    -- -----------------------------------------------------

    -- STATISTICS

    --SQL Server parse and compile time:

    -- CPU time = 0 ms, elapsed time = 0 ms.

    --(1 row(s) affected)

    --Table 'BMSetting'. Scan count 2, logical reads 736, physical reads 139, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    --Table 'BMProduct'. Scan count 3, logical reads 2269, physical reads 0, read-ahead reads 4, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    --Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    --Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- SQL Server Execution Times:

    -- CPU time = 1422 ms, elapsed time = 1485 ms.

    -- -----------------------------------------------------

    Test: Give me all products with setting "2 and 5 and (7 or 9)

    -- Over BigInfo

    SELECT COUNT(*)
    FROM BMProduct
    WHERE (BigInfo & POWER(2, 2) != 0)
    AND (BigInfo & POWER(2, 5) != 0)
    AND ((BigInfo & POWER(2, 7) != 0)
    OR (BigInfo & POWER(2, 9) != 0))

    -- -----------------------------------------------------

    -- STATISTICS

    --SQL Server parse and compile time:

    -- CPU time = 5 ms, elapsed time = 5 ms.

    --(1 row(s) affected)

    --Table 'BMProduct'. Scan count 1, logical reads 2239, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- SQL Server Execution Times:

    -- CPU time = 344 ms, elapsed time = 351 ms.

    -- -----------------------------------------------------

    -- Over Setting table

    ;WITH curr (Setting, GroupId, ResultCount) AS
    (
        SELECT 2, 1, 2
        UNION SELECT 5, 1, 2
        UNION SELECT 7, 2, 1
        UNION SELECT 9, 2, 1
    ),
    s (ProductId) AS
    (
        SELECT s1.ProductId
        FROM curr
        JOIN BMSetting s1 ON curr.Setting = s1.Setting
        GROUP BY s1.ProductId, curr.GroupId, curr.ResultCount
        HAVING COUNT(*) = curr.ResultCount
    )
    SELECT COUNT(*)
    FROM BMProduct p
    JOIN s ON p.Id = s.ProductId

    -- -----------------------------------------------------

    -- STATISTICS

    --SQL Server parse and compile time:

    -- CPU time = 0 ms, elapsed time = 0 ms.

    --(1 row(s) affected)

    --Table 'BMSetting'. Scan count 4, logical reads 1005, physical reads 1, read-ahead reads 2, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    --Table 'BMProduct'. Scan count 1, logical reads 4113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- SQL Server Execution Times:

    -- CPU time = 1391 ms, elapsed time = 1422 ms.

    -- -----------------------------------------------------

    Test: Give me all products with setting "5 and not (9 or 10)"

    -- Over BigInfo

    SELECT COUNT(*)
    FROM BMProduct
    WHERE (BigInfo & POWER(2, 5) != 0)
    AND ((BigInfo & POWER(2, 9) = 0)
    OR (BigInfo & POWER(2, 10) = 0))

    -- -----------------------------------------------------

    -- STATISTICS

    --SQL Server parse and compile time:

    -- CPU time = 4 ms, elapsed time = 4 ms.

    --(1 row(s) affected)

    --Table 'BMProduct'. Scan count 1, logical reads 2239, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- SQL Server Execution Times:

    -- CPU time = 282 ms, elapsed time = 290 ms.

    -- -----------------------------------------------------

    -- Over Setting table

    ;WITH curr (Setting, GroupId, ResultCount) AS
    (
        SELECT 5, 1, 1
        UNION SELECT 9, 2, 0
        UNION SELECT 10, 2, 0
    ),
    s (ProductId) AS
    (
        SELECT s2.ProductId
        FROM curr
        JOIN BMSetting s2 ON curr.Setting = s2.Setting
        GROUP BY s2.ProductId, curr.GroupId, curr.ResultCount
        HAVING COUNT(*) = curr.ResultCount
    )
    SELECT COUNT(*)
    FROM BMProduct p
    JOIN s ON p.Id = s.ProductId

    -- -----------------------------------------------------

    -- STATISTICS

    --SQL Server parse and compile time:

    -- CPU time = 0 ms, elapsed time = 0 ms.

    --(1 row(s) affected)

    --Table 'BMSetting'. Scan count 3, logical reads 500, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    --Table 'BMProduct'. Scan count 1, logical reads 4113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    -- SQL Server Execution Times:

    -- CPU time = 812 ms, elapsed time = 843 ms.

    -- -----------------------------------------------------

    My environment is no real server; I just tried this on my workstation. But it is clear that the solution with BIGINT is much faster and saves resources.

    So thanks for the education 🙂

    Greets

    Flo

  • Anyway - worth trying. I think I will mirror the INT version in a BIGINT table and compare some real-life stats.

  • Flo,

    Awesome script - nice job!

    That's some neat SQL for a 'developer' 😀

    I wish the developers at my workplace were as SQL-literate. I particularly like the numbers table generation from master (always a good trick)...but it's all very good stuff.

    I'm glad I'm not alone in finding this problem fascinating. I must admit I spent my lunch hour thinking about it, and my workplace PC has some unusual tables on my local SQL Server now :blush:

    Again, top-class work!

    Paul

  • Hey Paul

    Paul White (4/3/2009)


    Flo,

    Awesome script - nice job!

    Thanks for that!

    That's some neat SQL for a 'developer' 😀

    😎

    I wish the developers at my workplace were as SQL-literate. I particularly like the numbers table generation from master (always a good trick)...but it's all very good stuff.

    Until now I was in two minds between the master-db solution and the Tally table, but the master-db approach wins by a margin.

    I'm glad I'm not alone in finding this problem fascinating. I must admit I spent my lunch hour thinking about it, and my workplace PC has some unusual tables on my local SQL Server now :blush:

    Same for me... but since the database is called Sandbox it seems to be okay. :hehe:

    Greets

    Flo

  • One more thought from me:

    In the worst possible case, where all flags are independent and no sensible optimizations can be made from the database manager's knowledge of the data, the following might be worth considering:

    Condense all 5000 flag bits into a fixed-length BINARY(625). (625 bytes = 5000 bits).

    Store this compacted value in the table via a CONVERT to CHAR(625).
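
    As a minimal sketch of the table this assumes (my own names; the dbo.BitMapChar table and [flags] column used below would look roughly like this):

    CREATE TABLE dbo.BitMapChar
    (
        ProductId INT NOT NULL PRIMARY KEY,
        [flags] CHAR(625) NOT NULL -- 625 bytes = 5000 flag bits
    )

    CREATE NONCLUSTERED INDEX IX_BitMapChar_flags ON dbo.BitMapChar ([flags])

    A 625-byte key fits comfortably under the 900-byte index key limit, so the nonclustered index on [flags] is allowed.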

    To test a given bit, we just need to find which byte the bit is in (some maths involving an integer division by 8) and then test the specific bit within that byte (bit number mod 8). The bit value to test will be POWER(2, (bit number mod 8)).

    We can extract the single byte to test using SUBSTRING, and get the character value using the ASCII function (or explicit CONVERTs if you prefer).

    For example, to test bit number 1073:

    byte number = 1073 \ 8 = 134

    bit position within the byte = 1073 % 8 = 1

    bit value = 2^1 => 2

    expression:

    WHERE ASCII(SUBSTRING([flags], 134, 1)) & 2 = 2

    Clearly, we can combine tests for set and unset bits any way we like:

    SELECT COUNT_BIG(*)
    FROM dbo.BitMapChar
    WHERE ASCII(SUBSTRING([flags], 134, 1)) & 2 = 2
    AND (ASCII(SUBSTRING([flags], 574, 1)) & 32 = 32 OR
         ASCII(SUBSTRING([flags], 63, 1)) & 24 = 16)

    Note that the test on byte 63 combines multiple tests in a single bitmask: mask 24 covers bits 3 and 4, and requiring the result to equal 16 means bit 4 must be set and bit 3 must be clear.

    On a test table with 800,000 such rows (product ID = primary key, non-clustered index on the [flags] column) a query with *20* bit tests all in different bytes completed in under 2 seconds from cold on my very ordinary laptop computer (2GHz P4 single-core CPU, 2GB RAM), and in 79ms with a warm cache. The logical read count was 28,895 - pretty darn good given the task at hand.

    The query plan shows a simple single scan of the NC index, with all of the bit tests being performed with the index scan operator. For a table with 6M rows, the QO might choose a parallel plan but I can't test this with only one CPU!
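
    To save working the byte position and mask out by hand each time, a little sketch (my own variable names, following the same integer-division convention as the worked example above):

    DECLARE @BitNumber INT = 1073

    DECLARE @BytePos INT = @BitNumber / 8          -- byte position, as in the worked example
    DECLARE @Mask INT = POWER(2, @BitNumber % 8)   -- bit value within that byte

    SELECT COUNT_BIG(*)
    FROM dbo.BitMapChar
    WHERE ASCII(SUBSTRING([flags], @BytePos, 1)) & @Mask = @Mask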

    /Paul 😎

  • Hi Paul

    Very nice and powerful solution! I'm taking my hat off to you!

    Anyway, for five years I had these ugly bit-masks in my project (but we needed only 64 options). Since we just redesigned the whole system I changed the INT to BITs and I'm really glad to be rid of these non-readable "phone numbers" like 29438459... 😛

    Best regards

    Flo

  • Hey flo,

    Thanks so much, very kind of you to say.

    I agree that those 'phone numbers' are a pain to work with...:pinch:

    I've enjoyed this thread, and I think you have too. With luck the OP may get some use from it, or at least it might give him/her some new ideas or directions.

    Have a great weekend.

    /Paul

  • Hi Paul!

    I've enjoyed this thread, and I think you have too. With luck the OP may get some use from it, or at least it might give him/her some new ideas or directions.

    I definitely enjoyed this thread! I always like threads which don't end with the first answer but start a discussion about the best solution.

    I'm out for now. It's quite good weather here and maybe I'll have a look at one of your nice German "Biergarten" 😛

    Have a great weekend, too!

    Flo

  • Again - good idea.

    I will try to test them all on Wednesday (or at least initiate the data migration to make the tests possible 😉 ). I will definitely post some real-life statistics, as I am currently collecting some queries. Unfortunately my boss will give me new tasks as soon as the current solution is "good enough", which usually is far from the best.

    Florian Reischl (4/4/2009)


    "phone numbers" like 29438459... 😛

    In such cases I like to prepare a few API-like functions to transform the readable format into the real format.
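
    For instance, a minimal sketch of such a helper (made-up names, just to show the idea) that tests a single bit in a BIGINT bitmap:

    CREATE FUNCTION dbo.HasFlag (@Bitmap BIGINT, @BitNumber INT)
    RETURNS BIT
    AS
    BEGIN
        -- Returns 1 if bit @BitNumber (0-62) of @Bitmap is set, otherwise 0
        RETURN CASE WHEN @Bitmap & POWER(CAST(2 AS BIGINT), @BitNumber) <> 0
                    THEN 1 ELSE 0 END
    END

    It could then be used as WHERE dbo.HasFlag(BigInfo, 3) = 1 - which still forces a scan, of course, but at least it reads better than the raw masks.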
