Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««7891011»»»

Storing IPv4 Addresses for Performance Expand / Collapse
Author
Message
Posted Monday, August 31, 2009 8:14 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, May 15, 2014 5:11 PM
Points: 6,067, Visits: 5,283
207ms over 100K records hardly seems worth it.. Good to know though..

CEWII
Post #780407
Posted Wednesday, October 28, 2009 11:28 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Today @ 1:33 PM
Points: 8,571, Visits: 9,076
nimdil (8/19/2009)
nicholastjh (8/19/2009)
Nice article. Very informative.

But at the final function where 128 is subtracted from the result of ParseName, before being multiplied by 16777216 (2^24), could potentially yield negative results, if the first final octet is not above 128.

For example: 127.0.0.1 would result in -16777215.

That's the whole point. Some IP addresses will map to negative integers. As there are 2^32 IP Addresses and 2^31 non-negative integers in SQL Server we have to use positive as well as negative values.



Yes, of course we have to have some negative values. But I would be much happier if the integer stored actually consisted of the 32 bit IP address, instead of some bizarre different bit pattern, and the conversions in the article do not achieve that. If we are going to pass this IP address to some low leveldata comms software, we need the number to be negative when the relevant octet is above 127, and positive when it is under 128 -not the other way round. The dotted notation 1.1.1.1 represents the 32 bit value 0x01010101, not 0x81010101, and it is not useful to store it as the latter instead of the former. Of course it's trivial to alter the conversion to do the right thing.

I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding: with the approach in the article if some application component decides it wants the 32-bit IP address it has to do an extra conversion, because the database is providing only varchar and the wrong 32-bit pattern; that will probably destroy efficiency at some point in the future. It's much better if the DBA finds out what the external world uses and follows suit (in this case, uses the correct 32 bit pattern - which probably ought to be described as varbinary(4) rather than int, since IP protocol works in terms of streams of octets, not integers). Of course there will be casses where what the external world uses will be hopelessly ineficient - but that certainly isn't the case here: be under no illusion that any serious communications software anywhere uses the varbinary(15) representation internally, it's just a device for the human interface, internally the representation is 4 octets.


Tom
Post #810160
Posted Wednesday, October 28, 2009 12:25 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, May 15, 2014 5:11 PM
Points: 6,067, Visits: 5,283
Tom.Thomson (10/28/2009)
Yes, of course we have to have some negative values. But I would be much happier if the integer stored actually consisted of the 32 bit IP address, instead of some bizarre different bit pattern, and the conversions in the article do not achieve that. If we are going to pass this IP address to some low leveldata comms software, we need the number to be negative when the relevant octet is above 127, and positive when it is under 128 -not the other way round. The dotted notation 1.1.1.1 represents the 32 bit value 0x01010101, not 0x81010101, and it is not useful to store it as the latter instead of the former. Of course it's trivial to alter the conversion to do the right thing.

I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding: with the approach in the article if some application component decides it wants the 32-bit IP address it has to do an extra conversion, because the database is providing only varchar and the wrong 32-bit pattern; that will probably destroy efficiency at some point in the future. It's much better if the DBA finds out what the external world uses and follows suit (in this case, uses the correct 32 bit pattern - which probably ought to be described as varbinary(4) rather than int, since IP protocol works in terms of streams of octets, not integers). Of course there will be casses where what the external world uses will be hopelessly ineficient - but that certainly isn't the case here: be under no illusion that any serious communications software anywhere uses the varbinary(15) representation internally, it's just a device for the human interface, internally the representation is 4 octets.

I think the underlying discussion eventually came to the conclusion that a calculated column abstracts the underlying storage from the presentation. Given this, whether you want to use the varbinary(4) or int methods is effectively irrelevant since the computed column hides it.

I am not sure about this statement "I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding". I don't see how this really applies..

CEWII
Post #810215
Posted Wednesday, October 28, 2009 12:31 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, December 22, 2010 12:12 AM
Points: 21, Visits: 81
Also, the choice between varbinary(4) (which should really be binary(4)) and int comes down to deciding between being able to natively bitmask IPs with subnet from within SQL Server or not. The reason I chose int regardless of the shift into negative is that I can write a stored procedure to retrieve all IPs of a given subnet, which could not be possible with binary(4) and would require me to retrieve the full list and then bitmask filter it in the application layer.
Post #810219
Posted Thursday, October 29, 2009 7:07 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Today @ 1:33 PM
Points: 8,571, Visits: 9,076
Elliott W (10/28/2009)
I am not sure about this statement "I think this nicely illustrates that a little knowledge is a dangerous thing - it shows us what might happen when the DBA doesn't really understand what the data is but acts on his misunderstqanding". I don't see how this really applies..

CEWII

The point I was trying to make was that the article appeared to start from the point of view that what was needed was just an efficient 32-bit representation of a particular set of varchar(15) strings. Well, that's ignoring the fact that those strings are actually a human interface representation of 32 bit IPv4 adresses, so that the natural - and ultimately the most efficient (since real software using IP addresses often wants the real 32-bit IPv4 address, not some different 32-bit value) - representation is to use the IP address to represent itself. It was as if the author knew that IP addresses were in theory 32-bit values but were not aware that that is what they are in practise, not just in theory, that the 32-bit value is what almost all software, other than user interface presentation software, actually uses, and therefor thought that any old 32-bit pattern would be just as good a representation as the 32-bit IP address itself as long as there was a one-to-one correspondance.
With the article's representation, there is a serious possibility of dangerous misunderstandings: anyone experienced in data communications who sees a 32 bit int column used to represent IPv4 address will naturally assume that the integer is the one represented by the same bit pattern as the IP address itself; so that approach would need caveats all over the documentation and in code comments and even then would risk people taking the integer as the IP address, even though it clearly isn't. So there's an "is this way of doing it really safe" question as well as efficiency concerns.


Tom
Post #810725
Posted Thursday, October 29, 2009 7:23 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, December 22, 2010 12:12 AM
Points: 21, Visits: 81
My intention was not to mislead or misinform, however do keep in mind that the article was dedicated to SQL Server efficiency. I am fully versed in how 32-bit network addresses function and I understand your concern for incorrect translation. As Elliott has pointed out before, there is no one shoe fits all solution. The SQL Server int datatype was chosen and worked around its limitations for it's native advantages. The binary datatype has unfortunate limitations that are not present with int. If native translation from SQL Server to your application is important or frequent because you do a lot of in application IP address processing then binary is the way to go for you.
Post #810733
Posted Thursday, October 29, 2009 7:47 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Friday, September 2, 2011 3:22 AM
Points: 1,227, Visits: 154
>Now I'm going to test the retrieval of the varbinary(4) back to a string.. When I'm all done I will attach the code for all.

Have you done this?
Post #810765
Posted Thursday, October 29, 2009 7:58 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, May 15, 2014 5:11 PM
Points: 6,067, Visits: 5,283
Yes, see page one of postings..

We have tested int to string using function and computed column and SQLCLR
We have tested binary(4) to string using function and computed column and SQLCLR

CEWII
Post #810778
Posted Thursday, October 29, 2009 8:10 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Today @ 1:33 PM
Points: 8,571, Visits: 9,076
Jean-Sebastien Carle (10/28/2009)
Also, the choice between varbinary(4) (which should really be binary(4)) and int comes down to deciding between being able to natively bitmask IPs with subnet from within SQL Server or not. The reason I chose int regardless of the shift into negative is that I can write a stored procedure to retrieve all IPs of a given subnet, which could not be possible with binary(4) and would require me to retrieve the full list and then bitmask filter it in the application layer.

Yes, it should be binary(4) not varbinary(4). But I don't understand at all why you think I can't do subnet checks in SQL using binary(4) or, for that matter, the 4 tinyints representation.

The check for whether address a occurs in the subnet base b mask m is quite simple:
((a XOR b) AND m) = 0
which is easy to translate into SQL - I guess we agree on that?

(Of course if all three values are ints, and wre produced from the varchar(15) representation according to your scheme, this doesn't work! To make it work you would have to have converted the mask using an algorithm that would deliver an IP address from the varchar(15) representation instead of a different integer, so your system would have to end up having to have both conversions in it).

If a is 4 tinyints a1 a2 a3 a4 and the base and mask values are similarly given in small parts, the check is
(((a1 XOR b1) AND m1) = 0) and
(((a2 XOR b2) AND m2) = 0) and
(((a3 XOR b3) AND m3) = 0) and
(((a3 XOR b3) AND m3) = 0)

This too is easy to convert to SQL.

If a is an array of 4 octets, and the other values also, the test is
(((a[1] XOR b[1]) AND m[1] = 0) and
...etcetera
which is also trivial to turn into SQL. But it isn't even neccessary to worry about the 4 components if you are using binary(4) to represent the array of 4 octets, since bitwise XOR and AND are directly applicable to binary(4) values (see for example http://msdn.microsoft.com/en-us/library/ms190277%28SQL.105%29.aspx) - the only difference from the integer case is that you might write 0x00000000 instead of 0.

So why can you not do it in SQL, why do you have to break out into the application layer?


Tom
Post #810790
Posted Thursday, October 29, 2009 8:25 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, May 15, 2014 5:11 PM
Points: 6,067, Visits: 5,283
Tom.Thomson (10/29/2009)
The point I was trying to make was that the article appeared to start from the point of view that what was needed was just an efficient 32-bit representation of a particular set of varchar(15) strings. Well, that's ignoring the fact that those strings are actually a human interface representation of 32 bit IPv4 adresses, so that the natural - and ultimately the most efficient (since real software using IP addresses often wants the real 32-bit IPv4 address, not some different 32-bit value) - representation is to use the IP address to represent itself. It was as if the author knew that IP addresses were in theory 32-bit values but were not aware that that is what they are in practise, not just in theory, that the 32-bit value is what almost all software, other than user interface presentation software, actually uses, and therefor thought that any old 32-bit pattern would be just as good a representation as the 32-bit IP address itself as long as there was a one-to-one correspondance.
With the article's representation, there is a serious possibility of dangerous misunderstandings: anyone experienced in data communications who sees a 32 bit int column used to represent IPv4 address will naturally assume that the integer is the one represented by the same bit pattern as the IP address itself; so that approach would need caveats all over the documentation and in code comments and even then would risk people taking the integer as the IP address, even though it clearly isn't. So there's an "is this way of doing it really safe" question as well as efficiency concerns.

Tom,
I take exception to several things here.
1. There was no confusion that the string version was for human readability and had nothing to do with storage.
2. The statement: "anyone experienced in data communications who sees a 32 bit int column used to represent IPv4 address will naturally assume that the integer is the one represented by the same bit pattern as the IP address itself" I find to be invalid. If I see a field that is not clear I assume VERY LITTLE, maybe I'm the minority, but I wouldn't just assume that it is the 4 octet bit pattern, I would ask some questions.
3. "It was as if the author knew that IP addresses were in theory 32-bit values but were not aware that that is what they are in practise, not just in theory, that the 32-bit value is what almost all software, other than user interface presentation software, actually uses, and therefor thought that any old 32-bit pattern would be just as good a representation as the 32-bit IP address itself as long as there was a one-to-one correspondance." We we reading the same article? I didn't see any issue here. I thought it was pretty clear, they do represent a 32-bit value BUT SQL can't hold the first octet without some manipulation.
4. Most people would use this implementation in a logging application so there isn't going to be much system-to-system communication. You highlight potential issues with "so that approach would need caveats all over the documentation and in code comments and even then would risk people taking the integer as the IP address, even though it clearly isn't" that cover what I view to be the exception. Could these things happen, sure but I think that the potential risk is not anywhere near as dire or as likely to happen as stated. Your data dictionary just needs to clearly state what it is and how you get to the value.

You are clearly unhappy with this solution, so don't use it. You highlight some points that in my opinion are the exceptions that don't negate the use of this method. Developers will invariably make some assumptions that are wrong, but their assumptions don't negate the methodologies that they made the assumptions about.

With that I'll climb off my soapbox..

So now I'll take about MY assumptions..
For a logging application I don't see any clear winner between binary(4) and int, especially with the use of the computed column to make the data readable when we query it for humans to read.
IF I were working on an application that dealt with IP addresses a lot and I thought for a second that I would be doing any submasking, I would almost certainly go with the int type.
My data dictionary would state that the field is an encoded field containing the 4 octets and the formula used to encode them. I would also state what the human readable field name is if there is one..
My design documentation would reference which method I chose and some quick reasons why, it would also likely contain the formulas to go back and forth as well as a reference to there being a human readable column if applicable.
I would expect my developers to read and understand this document and if not to ask questions.

CEWII
Post #810809
« Prev Topic | Next Topic »

Add to briefcase «««7891011»»»

Permissions Expand / Collapse