Storing IPv4 Addresses for Performance

  • Jeff Moden (8/29/2009)


    heh... understood... I just don't want developers to use a tool just because it's there... I want them to use it for all the right reasons as I suspect you do.

    And, my solution didn't require 11 bytes as you suggest... it only required 4 and, because of the computed columns, it allowed greatly expanded functionality without further obfuscation of the IP address.

    I agree, don't use a tool simply because its there. But I've met a bunch of people who know a lot about SQL and very little about other tools. This leads to what I call the hammer and nail problem. When all you have is a hammer (SQL) every problem looks like nail.

    There are a lot of tools, some are better for some cases, some better for others. There is one thing I am (personally) fundamentally opposed to, that is SQLCLR user defined datatypes, I'm sure they would be great, right up until the moment you have to change the code, then it becomes a nightmare.. Keep all tools on your pallete and chose the one that fits your problem.

    Also, it looks like Jean-Sebastien's solution stores the text and calculates the int, in that case you spend up to 11 extra bytes to store the address. As a thought, if you store the int and calc the text, it is probably considered deterministic and could be persisted.. I was commenting originally on his solution and not yours..

    CEWII

  • Ah... sorry. My mistake, Elliott. Thank you for the feedback. 🙂

    And, to your good point and if I remember correctly, all of the calculated columns constructed from the saved 4 octets are, in fact, deterministic as you suggest they might be.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Elliott W (8/31/2009)


    Also, it looks like Jean-Sebastien's solution stores the text and calculates the int, in that case you spend up to 11 extra bytes to store the address. As a thought, if you store the int and calc the text, it is probably considered deterministic and could be persisted.. I was commenting originally on his solution and not yours..

    CEWII

    I only used static text to calculate the int variation as a demonstration and for testing in previous posts of this thread. In a real world environment, I would not store the static text version, I would store the int version and use the computed column to retrieve the text version.

  • Jean-Sebastien Carle (8/31/2009)


    I only used static text to calculate the int variation as a demonstration and for testing in previous posts of this thread. In a real world environment, I would not store the static text version, I would store the int version and use the computed column to retrieve the text version.

    Wonderful, so we agree. Altogether, I thought this was a wonderful discussion with lots of variations and some cool ideas. I look forward to you addendum.

    Since that direction (int->text) would be deterministic you could persist it, I'd want to performance test it, but it is an option and it would depend on how often I used it..

    CEWII

  • Dynamically computed : (100000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 204 ms, elapsed time = 4500 ms.

    Persisted : (100000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 94 ms, elapsed time = 4293 ms.

    In terms of disk space, persisted required 1.484 MB of additional storage over 100,000 rows (an average of 15.56 bytes per row).

    Both are lightweight solutions, so you can trade up between what's more important for your particular situation : speed or disk space.

  • 207ms over 100K records hardly seems worth it.. Good to know though..

    CEWII

  • nimdil (8/19/2009)


    nicholastjh (8/19/2009)


    Nice article. Very informative.

    But at the final function where 128 is subtracted from the result of ParseName, before being multiplied by 16777216 (2^24), could potentially yield negative results, if the first final octet is not above 128.

    For example: 127.0.0.1 would result in -16777215.

    That's the whole point. Some IP addresses will map to negative integers. As there are 2^32 IP Addresses and 2^31 non-negative integers in SQL Server we have to use positive as well as negative values.

    Yes, of course we have to have some negative values. But I would be much happier if the integer stored actually consisted of the 32 bit IP address, instead of some bizarre different bit pattern, and the conversions in the article do not achieve that. If we are going to pass this IP address to some low leveldata comms software, we need the number to be negative when the relevant octet is above 127, and positive when it is under 128 -not the other way round. The dotted notation 1.1.1.1 represents the 32 bit value 0x01010101, not 0x81010101, and it is not useful to store it as the latter instead of the former. Of course it's trivial to alter the conversion to do the right thing.

    I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding: with the approach in the article if some application component decides it wants the 32-bit IP address it has to do an extra conversion, because the database is providing only varchar and the wrong 32-bit pattern; that will probably destroy efficiency at some point in the future. It's much better if the DBA finds out what the external world uses and follows suit (in this case, uses the correct 32 bit pattern - which probably ought to be described as varbinary(4) rather than int, since IP protocol works in terms of streams of octets, not integers). Of course there will be casses where what the external world uses will be hopelessly ineficient - but that certainly isn't the case here: be under no illusion that any serious communications software anywhere uses the varbinary(15) representation internally, it's just a device for the human interface, internally the representation is 4 octets.

    Tom

  • Tom.Thomson (10/28/2009)


    Yes, of course we have to have some negative values. But I would be much happier if the integer stored actually consisted of the 32 bit IP address, instead of some bizarre different bit pattern, and the conversions in the article do not achieve that. If we are going to pass this IP address to some low leveldata comms software, we need the number to be negative when the relevant octet is above 127, and positive when it is under 128 -not the other way round. The dotted notation 1.1.1.1 represents the 32 bit value 0x01010101, not 0x81010101, and it is not useful to store it as the latter instead of the former. Of course it's trivial to alter the conversion to do the right thing.

    I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding: with the approach in the article if some application component decides it wants the 32-bit IP address it has to do an extra conversion, because the database is providing only varchar and the wrong 32-bit pattern; that will probably destroy efficiency at some point in the future. It's much better if the DBA finds out what the external world uses and follows suit (in this case, uses the correct 32 bit pattern - which probably ought to be described as varbinary(4) rather than int, since IP protocol works in terms of streams of octets, not integers). Of course there will be casses where what the external world uses will be hopelessly ineficient - but that certainly isn't the case here: be under no illusion that any serious communications software anywhere uses the varbinary(15) representation internally, it's just a device for the human interface, internally the representation is 4 octets.

    I think the underlying discussion eventually came to the conclusion that a calculated column abstracts the underlying storage from the presentation. Given this, whether you want to use the varbinary(4) or int methods is effectively irrelevant since the computed column hides it.

    I am not sure about this statement "I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding". I don't see how this really applies..

    CEWII

  • Also, the choice between varbinary(4) (which should really be binary(4)) and int comes down to deciding between being able to natively bitmask IPs with subnet from within SQL Server or not. The reason I chose int regardless of the shift into negative is that I can write a stored procedure to retrieve all IPs of a given subnet, which could not be possible with binary(4) and would require me to retrieve the full list and then bitmask filter it in the application layer.

  • Elliott W (10/28/2009)


    I am not sure about this statement "I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding". I don't see how this really applies..

    CEWII

    The point I was trying to make was that the article appeared to start from the point of view that what was needed was just an efficient 32-bit representation of a particular set of varchar(15) strings. Well, that's ignoring the fact that those strings are actually a human interface representation of 32 bit IPv4 adresses, so that the natural - and ultimately the most efficient (since real software using IP addresses often wants the real 32-bit IPv4 address, not some different 32-bit value) - representation is to use the IP address to represent itself. It was as if the author knew that IP addresses were in theory 32-bit values but were not aware that that is what they are in practise, not just in theory, that the 32-bit value is what almost all software, other than user interface presentation software, actually uses, and therefor thought that any old 32-bit pattern would be just as good a representation as the 32-bit IP address itself as long as there was a one-to-one correspondance.

    With the article's representation, there is a serious possibility of dangerous misunderstandings: anyone experienced in data communications who sees a 32 bit int column used to represent IPv4 address will naturally assume that the integer is the one represented by the same bit pattern as the IP address itself; so that approach would need caveats all over the documentation and in code comments and even then would risk people taking the integer as the IP address, even though it clearly isn't. So there's an "is this way of doing it really safe" question as well as efficiency concerns.

    Tom

  • My intention was not to mislead or misinform, however do keep in mind that the article was dedicated to SQL Server efficiency. I am fully versed in how 32-bit network addresses function and I understand your concern for incorrect translation. As Elliott has pointed out before, there is no one shoe fits all solution. The SQL Server int datatype was chosen and worked around its limitations for it's native advantages. The binary datatype has unfortunate limitations that are not present with int. If native translation from SQL Server to your application is important or frequent because you do a lot of in application IP address processing then binary is the way to go for you.

  • >Now I'm going to test the retrieval of the varbinary(4) back to a string.. When I'm all done I will attach the code for all.

    Have you done this?

  • Yes, see page one of postings..

    We have tested int to string using function and computed column and SQLCLR

    We have tested binary(4) to string using function and computed column and SQLCLR

    CEWII

  • Jean-Sebastien Carle (10/28/2009)


    Also, the choice between varbinary(4) (which should really be binary(4)) and int comes down to deciding between being able to natively bitmask IPs with subnet from within SQL Server or not. The reason I chose int regardless of the shift into negative is that I can write a stored procedure to retrieve all IPs of a given subnet, which could not be possible with binary(4) and would require me to retrieve the full list and then bitmask filter it in the application layer.

    Yes, it should be binary(4) not varbinary(4). But I don't understand at all why you think I can't do subnet checks in SQL using binary(4) or, for that matter, the 4 tinyints representation.

    The check for whether address a occurs in the subnet base b mask m is quite simple:

    ((a XOR b) AND m) = 0

    which is easy to translate into SQL - I guess we agree on that?

    (Of course if all three values are ints, and wre produced from the varchar(15) representation according to your scheme, this doesn't work! To make it work you would have to have converted the mask using an algorithm that would deliver an IP address from the varchar(15) representation instead of a different integer, so your system would have to end up having to have both conversions in it).

    If a is 4 tinyints a1 a2 a3 a4 and the base and mask values are similarly given in small parts, the check is

    (((a1 XOR b1) AND m1) = 0) and

    (((a2 XOR b2) AND m2) = 0) and

    (((a3 XOR b3) AND m3) = 0) and

    (((a3 XOR b3) AND m3) = 0)

    This too is easy to convert to SQL.

    If a is an array of 4 octets, and the other values also, the test is

    (((a[1] XOR b[1]) AND m[1] = 0) and

    ...etcetera

    which is also trivial to turn into SQL. But it isn't even neccessary to worry about the 4 components if you are using binary(4) to represent the array of 4 octets, since bitwise XOR and AND are directly applicable to binary(4) values (see for example http://msdn.microsoft.com/en-us/library/ms190277%28SQL.105%29.aspx) - the only difference from the integer case is that you might write 0x00000000 instead of 0.

    So why can you not do it in SQL, why do you have to break out into the application layer?

    Tom

  • Tom.Thomson (10/29/2009)


    The point I was trying to make was that the article appeared to start from the point of view that what was needed was just an efficient 32-bit representation of a particular set of varchar(15) strings. Well, that's ignoring the fact that those strings are actually a human interface representation of 32 bit IPv4 adresses, so that the natural - and ultimately the most efficient (since real software using IP addresses often wants the real 32-bit IPv4 address, not some different 32-bit value) - representation is to use the IP address to represent itself. It was as if the author knew that IP addresses were in theory 32-bit values but were not aware that that is what they are in practise, not just in theory, that the 32-bit value is what almost all software, other than user interface presentation software, actually uses, and therefor thought that any old 32-bit pattern would be just as good a representation as the 32-bit IP address itself as long as there was a one-to-one correspondance.

    With the article's representation, there is a serious possibility of dangerous misunderstandings: anyone experienced in data communications who sees a 32 bit int column used to represent IPv4 address will naturally assume that the integer is the one represented by the same bit pattern as the IP address itself; so that approach would need caveats all over the documentation and in code comments and even then would risk people taking the integer as the IP address, even though it clearly isn't. So there's an "is this way of doing it really safe" question as well as efficiency concerns.

    Tom,

    I take exception to several things here.

    1. There was no confusion that the string version was for human readability and had nothing to do with storage.

    2. The statement: "anyone experienced in data communications who sees a 32 bit int column used to represent IPv4 address will naturally assume that the integer is the one represented by the same bit pattern as the IP address itself" I find to be invalid. If I see a field that is not clear I assume VERY LITTLE, maybe I'm the minority, but I wouldn't just assume that it is the 4 octet bit pattern, I would ask some questions.

    3. "It was as if the author knew that IP addresses were in theory 32-bit values but were not aware that that is what they are in practise, not just in theory, that the 32-bit value is what almost all software, other than user interface presentation software, actually uses, and therefor thought that any old 32-bit pattern would be just as good a representation as the 32-bit IP address itself as long as there was a one-to-one correspondance." We we reading the same article? I didn't see any issue here. I thought it was pretty clear, they do represent a 32-bit value BUT SQL can't hold the first octet without some manipulation.

    4. Most people would use this implementation in a logging application so there isn't going to be much system-to-system communication. You highlight potential issues with "so that approach would need caveats all over the documentation and in code comments and even then would risk people taking the integer as the IP address, even though it clearly isn't" that cover what I view to be the exception. Could these things happen, sure but I think that the potential risk is not anywhere near as dire or as likely to happen as stated. Your data dictionary just needs to clearly state what it is and how you get to the value.

    You are clearly unhappy with this solution, so don't use it. You highlight some points that in my opinion are the exceptions that don't negate the use of this method. Developers will invariably make some assumptions that are wrong, but their assumptions don't negate the methodologies that they made the assumptions about.

    With that I'll climb off my soapbox..

    So now I'll take about MY assumptions..

    For a logging application I don't see any clear winner between binary(4) and int, especially with the use of the computed column to make the data readable when we query it for humans to read.

    IF I were working on an application that dealt with IP addresses a lot and I thought for a second that I would be doing any submasking, I would almost certainly go with the int type.

    My data dictionary would state that the field is an encoded field containing the 4 octets and the formula used to encode them. I would also state what the human readable field name is if there is one..

    My design documentation would reference which method I chose and some quick reasons why, it would also likely contain the formulas to go back and forth as well as a reference to there being a human readable column if applicable.

    I would expect my developers to read and understand this document and if not to ask questions.

    CEWII

Viewing 15 posts - 76 through 90 (of 100 total)

You must be logged in to reply to this topic. Login to reply