Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««7891011»»

Storing IPv4 Addresses for Performance Expand / Collapse
Author
Message
Posted Thursday, October 29, 2009 2:21 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Yesterday @ 10:00 AM
Points: 8,551, Visits: 9,043
Elliott W (10/29/2009)

Tom,
I take exception to several things here.

Elliot, if you think I intended to offend the author or malign him in any way, you are quite wrong. If I gave that impression, I expressed myself badly.
1. There was no confusion that the string version was for human readability and had nothing to do with storage.

There was a mention at the front of the article that the standard published way back when specified the IP address as 32 bits (4 octets) and that the string of 7 to 15 characters was there to make it easy to read. Then the article dived off into how to convert this string representation back to 32 bits with giving any consideration as to whether the 32-bit IP address might be a perfectly good representation.
2. The statement: "anyone experienced in data communications who sees a 32 bit int column used to represent IPv4 address will naturally assume that the integer is the one represented by the same bit pattern as the IP address itself" I find to be invalid. If I see a field that is not clear I assume VERY LITTLE, maybe I'm the minority, but I wouldn't just assume that it is the 4 octet bit pattern, I would ask some questions.

I always hope people won't make unwarranted assumptions too; but after more than 4 decades of experience in computing I've reached (with sadness) the conclusion that although that is my hope I should not permit it to become my expectation.
3. "It was as if the author knew that IP addresses were in theory 32-bit values but were not aware that that is what they are in practise, not just in theory, that the 32-bit value is what almost all software, other than user interface presentation software, actually uses, and therefor thought that any old 32-bit pattern would be just as good a representation as the 32-bit IP address itself as long as there was a one-to-one correspondance." We we reading the same article? I didn't see any issue here. I thought it was pretty clear, they do represent a 32-bit value BUT SQL can't hold the first octet without some manipulation.

But it is so trivial to get the IP address from the string representation - just change
(CAST(PARSENAME(@IP,4) AS INT) - 128) * 16777216

to read instead
CASE WHEN CAST(PARSENAME(@IP,4) AS INT)  > 127 
THEN CAST(PARSENAME(@IP,4) AS INT) - 256
ELSE CAST(PARSENAME(@IP,4) AS INT)
END * 16777216

and that changes the code in the paper to do it. I think if the author had considered using the natural representation and rejected it he would have mentioned it, so it appears to me that he didn't consider it. That's not terribly surprising, he is a DBA not a data communications developer, but in my view DBAs should ask people with application field expertise before taking decisions like the choice of representation of an IP address.
As for comments about the risks of subsequent developers screwing up when the see this approach, I guess I have a greater awareness of how often developers quickly skim domcumentation and miss important points, leading to problems down the line; or perhaps I'm just more of a pessimist. Anyway, these risks are secondary to the risk that the real IP address will turn out to be needed once you have some sort of IP address in your DB.
For a logging application I don't see any clear winner between binary(4) and int, especially with the use of the computed column to make the data readable when we query it for humans to read.

I don't disagree - there isn't much to choose between them for a logging application - but maybe the natural int representation has some advantages over the weird one of the article?. For a non-logging application, whether the natural representation is there or not is more important that whether it is there in its int verson or its binary(4) version.
IF I were working on an application that dealt with IP addresses a lot and I thought for a second that I would be doing any submasking, I would almost certainly go with the int type.

That sounds as if you want to do subnet checking by arithmetic comparison rather than by bitwise XOR and AND; be warned that this will not work with the representation in the article - for example the subnet <126.0.0.0 MASK 249.0.0.0>, although it is a contiguous subrange of 32 bit iP address values (considered as 32 bit unsigned ints) is contiguous neither in the representation of the article nor in the natural representation as 32-bit signed ints. So it's better to use the bitwise masking operations (which are identical for binary(4) and for int). But the bitwise masking operations don't give the correct result if IP addresses and the netmask are all converted to 32 bit form using the article's algorithm - they do only if the natural representation is used for the netmask - in which case why not use it for the addresses as well? (I don't much care about binary(4) vs int, just about natural versus unnatural representation.)


Tom
Post #811199
Posted Thursday, October 29, 2009 8:01 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, December 22, 2010 12:12 AM
Points: 21, Visits: 81
Tom.Thomson (10/29/2009)
... he is a DBA not a data communications developer, but in my view DBAs should ask people with application field expertise before taking decisions like the choice of representation of an IP address.

How are you so sure that I am a DBA? I am, in fact, a Senior Systems Architect and Developer and simply extend my skill with SQL Server. I have a very solid grasp of networking down to the bit level. I have not blatantly ignored any of its fundamental principles. The perhaps unorthodox approach I took was the one that I believe had the best fit within SQL Server’s sometimes unexplainable limitations. It would appear that you’ve so closely focused on an IPv4 address’s natural representation that you’re ignoring the adaptations that are required within SQL Server to be able to do operations such as atomic subnet mask filtering.

So I ask you, please show us how you would, using T-SQL, take the IP address 192.168.1.1 in binary form (0xC0A80101), apply a mask of 255.255.255.0 in binary form (0xFFFFFF00), and compare it to the network 192.168.1.0 in binary form (0xC0A80100) to see if the IP address is part of the network.
Post #811328
Posted Saturday, October 31, 2009 5:56 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Yesterday @ 10:00 AM
Points: 8,551, Visits: 9,043
Jean-Sebastien Carle (10/29/2009)
How are you so sure that I am a DBA? I am, in fact, a Senior Systems Architect and Developer and simply extend my skill with SQL Server. I have a very solid grasp of networking down to the bit level. I have not blatantly ignored any of its fundamental principles.

It seemed to me to be probable that you were a DBA - after all, most of the articles posted on SQL ServerCentral are written by DBAs, and you certainly appeared to give no consideration to using the natural representation. Nor did you say anything about transforming masks from varchar(15) form, which suggested to me that you hadn't thought of the impact of flipping the MS bit in the mask (flipping that sign bit is what your code does). I didn't at any point suggest that you had blatantly ignored any fundamental principle, nor that you didn't have a solid grasp of networking; if I gave the impression that I was suggesting any of that I'm very sorry, I certainly didn't intend to. I did suggest that you were not a data comms programmer, which I still think is true (for example I think a data comms programmer would know that the first RFC to specify a 32 bit internet address was RFC 760 of Jan 1980, rather than the more recent document that you referenced - and even that wasn't where it was first introduced)- or have you written and/or debugged chunks of code to implement protocols in layer 4 (transport) and downward?

The perhaps unorthodox approach I took was the one that I believe had the best fit within SQL Server’s sometimes unexplainable limitations. It would appear that you’ve so closely focused on an IPv4 address’s natural representation that you’re ignoring the adaptations that are required within SQL Server to be able to do operations such as atomic subnet mask filtering.

Well, since T-SQL code to transform the string form to the natural representation is only trivially different from what you suggested (see snippet in the comment which has annoyed you) it doesn't seem to me that the choice of representation was really to do with SQL limitations; and I already explained in another comment how I would handle subnet checking, but as I just wrote that in algebraic notation, not in T-SQL, I've provided some T-SQL below. As to focus, well it's pretty trivial to store the natural representation so the question to me was why on earth anyone would want to do anything else.

So I ask you, please show us how you would, using T-SQL, take the IP address 192.168.1.1 in binary form (0xC0A80101), apply a mask of 255.255.255.0 in binary form (0xFFFFFF00), and compare it to the network 192.168.1.0 in binary form (0xC0A80100) to see if the IP address is part of the network.

It's easy enough - although I don't understand why you want me to use binary, since the code snippet in the comment that offended you is clearly based on it being int. But code for binary is close to identical to that for int, so here it is:

-- first set up the addresses and mask specified by JS-C
declare @netAddress binary(4) set @netAddress = 0xC0A80100
declare @netMask binary(4) set @netMask = 0xFFFFFF00
declare @candidate binary(4) set @candidate = 0xC0A80101

-- now the one-liner test to see whether the two addresses are in the same subnet if the specified mask is used
if ((@netAddress ^ cast(@candidate as int)) & @netMask ) = 0
select 'The candidate address is in the target network'
else
select 'The candidate address is not in the target network'

go

This works for all valid network addresses and netmasks, and all valid candidate addresses. It is symmetrical in the two addresses - what we are actually checking is whether the two subnets each containing one of the given addresses and both having the given mask are identical, so the cast is an annoyance (and one that really is caused by a ludicrous restriction in SQL - why on earth can either argument of ^ be binary but not both) which can be avoided by using int instead of binary(4) (and the symmetry of course merans that the cast can be moved to the other address instead in contexts where that would be more efficient): the real issue is about what bit pattern should be used and not about how it should be typed. There's one thing that sometimes makes me (mistakenly) think that binary might have an advantage: machines I worked on a (very) long time ago required int to be word-aligned, which made it a pain in the butt for byte-stream (even worse for bit-stream, like HDLC and SDLC - but of course bit-stuffing is now done in hardware) protocols, but that can be dealt with at the DB interface, it's not relevant internally, so I should ignore that thought.

It's not actually difficult to produce code that works when the numbers are transformed as you suggest: it just needs one extra XOR (and some brackets) to test whether 192.168.1.42 mask 255.255.255.0 and 192.168.1.192 mask 255.255.255.0 are both the same when you are representing the addresses and mask by the integers 0x40A8012A, 0x40A801C0, and 0x7FFFFF00, even while making sure that you don't accidentally include include for example 64.168.1.200 in that subnet - but the test is no longer "no bits are 1 in the mask where the two addresses have different bits in that position", so it no longer matches the definition of the subnet in a straightforward manner. And that, to me, seems to be pointless obscurity. One can fix it by saying "we'll use the mask as its own representation" but that would seem very strange when you've rejected using the IP address as its own representation because you have to work around the sign bit, since the masks and the IP addresses have identical structures and identical human-readable forms (with of course some constraints on which values are allowed to be used as masks).







Tom
Post #812009
Posted Saturday, October 31, 2009 9:56 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, May 15, 2014 5:11 PM
Points: 6,067, Visits: 5,283
You know they say about assumptions..

Also I think the statement:
It seemed to me to be probable that you were a DBA - after all, most of the articles posted on SQL ServerCentral are written by DBAs, and you certainly appeared to give no consideration to using the natural representation.

Is wrong, while I may be working as a DBA right now, that is not what I have been doing for the last several decades.. And very few of the DBAs that I know are only working in that role now, they all have a much wider range of experience.

I had to think about whether I wanted to respond at all..

That's not terribly surprising, he is a DBA not a data communications developer, but in my view DBAs should ask people with application field expertise before taking decisions like the choice of representation of an IP address.

I am taking exception to the premise of "not a data communications developer". That makes the statement that the front-end guy should be setting the datatypes, and that is a BAD assumption. Most (and I know I'm generalizing) front-end developers get into SQL so they can get data for their front-ends, and a data guy can usually tell the databases that were built by front-end guys, the front-end guys just didn't know any better, involvement of a data guy is usually pretty clear. Now as a subject matter expert in SQL I would ask the "data communications developer" what he needed and how he was going to use it and I would set the type, now if the developer was unhappy with that we would likely haggle. But then end decision is MINE, because it is MY function to be a an expert in SQL.

CEWII
Post #812027
Posted Saturday, October 31, 2009 11:22 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, December 22, 2010 12:12 AM
Points: 21, Visits: 81
Tom.Thomson (10/31/2009)
It seemed to me to be probable that you were a DBA - after all, most of the articles posted on SQL ServerCentral are written by DBAs, and you certainly appeared to give no consideration to using the natural representation ... if I gave the impression that I was suggesting any of that I'm very sorry, I certainly didn't intend to.
I'm not upset in any way, but there's a saying that I like to keep in the forefront of my mind. To assume makes an *** out of u and me.


I did suggest that you were not a data comms programmer ...
If you mean on the embedded level, that is most certainly true. I do hold my own when it comes to networking nonetheless because of the large quantity of networks I have deployed and managed as well as the amount of network focused applications I have developed that involved either subnetting or packet filtering.


... would know that the first RFC to specify a 32 bit internet address was RFC 760 of Jan 1980, rather than the more recent document that you referenced - and even that wasn't where it was first introduced ...
The reference to the RFC wasn't meant as a technical accuracy to the first mention as such and, if you notice my wording, I don't specify as much either. I was simply being a writer through that passage in an attempt to make the article seem less monotonous. Even the driest of us nerds can get bored of looking at flat code and table summaries.


... have you written and/or debugged chunks of code to implement protocols in layer 4 (transport) and downward?
I cannot say that I have had the luxury to do so, no.


... see snippet in the comment which has annoyed you ... the code snippet in the comment that offended ...
I am not quite sure how you infered that I was annoyed. I appologize if I led you to believe as such.


it's pretty trivial to store the natural representation so the question to me was why on earth anyone would want to do anything else
Because storage is not the only thing to take into consideration. If, in your application or usage, the only requirement is to log and store IPv4 addresses, then I would perhaps even question if there was a need to store the data in SQL Server. Traditionally, SQL Server's role is not only for storage but also to act as a mechanism to interact, aggregate and analyse data as well.


It's easy enough - although I don't understand why you want me to use binary, since the code snippet in the comment that offended you is clearly based on it being int.
Because you yourself have repeated several times that the natural representation of an address is in its binary 32-bit unmodified form. I simply wanted to see how you would achieve the required mathematics to accomplish subnet filtering while working around SQL Server's limitations. Specifically the fact that "In a bitwise operation, only one expression can be of either binary or varbinary data type.".


... why on earth can either argument of ^ be binary but not both ...
As much as I agree, it is outside of our control. What we can control is how we choose to work around those restrictions to provide the best solution to satisfy our goal. That goal may be different for different people. Some may choose performance as their target, others storage efficiency and perhaps someone like yourself may choose to negate either for the purpose of natural representation. (Not to say that doing so cannot be done without achieving either of the other goals as well.)


... one thing that sometimes makes me (mistakenly) think that binary might have an advantage: machines I worked on a (very) long time ago required int to be word-aligned, which made it a pain in the butt for byte-stream (even worse for bit-stream, like HDLC and SDLC - but of course bit-stuffing is now done in hardware) protocols, but that can be dealt with at the DB interface, it's not relevant internally, so I should ignore that thought.
SQL Server's 32 bit integer is its most used and most optimized datatype. Everything from indexes to storage and aggregate functions to joins have all been tweaked to gills in terms of efficiency and performance for this datatype. There are other datatypes that will perform similarly but I do not know of a datatype that can beat the integer.


It's not actually difficult to produce code that works when the numbers are transformed as you suggest: it just needs one extra XOR (and some brackets) to test ...
I did not believe it as something that was difficult and difficulty is an arbitrary thing once wrapped inside a stored procedure or a function. It is mostly a curiosity to see your approach to the matter and, again, how you would work with SQL Server's limitations.


... it no longer matches the definition of the subnet in a straightforward manner. And that, to me, seems to be pointless obscurity. One can fix it by saying "we'll use the mask as its own representation" but that would seem very strange when you've rejected using the IP address as its own representation because you have to work around the sign bit, since the masks and the IP addresses have identical structures and identical human-readable forms (with of course some constraints on which values are allowed to be used as masks).
I have to admit that at this point, I was no longer clear on the point you were trying to make as it is unclear what is pointlessly obscure, your XOR approach or my bitshifted approach.


Now allow me to make some points of my own.

I understand your point of view regarding storing IP addresses under their natural representations but being that SQL Server's integer datatype is its most efficient, I would still prefer to store the address as such regardless of the fact that it requires additional transformation to do so. At the end of the day, what is most important to me is how SQL Server performs with the data I choose to store within my database.

Generally speaking, IP addresses exist within a database as a result of logging. The logging will often cumulate over time to generate hundreds of thousands if not millions of rows. When operations are done over those rows, they usually revolve around verifying if an IP address already exists, verifying the number of times a specific IP address has been logged or finding logged entries that match either a list of addresses or networks. The efficiency of doing those operations is based on the performance achieved when SQL Server executes the mathematics, the aggregate functions, the joins and/or sorts required to do so.

Binary does not perform as well as the integer although the difference may be negligable if either your dataset is small or your operations are simplistic.

To illustrate, I used my test database. Within it I have two IP address tables. The first one ([dbo].[IPAddresses]) contains 100,000 random IP addresses in three forms varchar(15), int and binary. The second one ([dbo].[ManyIPAddresses]) contains 1,000,000 random IP addresses (10 copies of the [dbo].[IPAddresses] table).


If we look at simply filtering 1,000,000 rows to find IP addresses from a certain subnet, here is the setup for our comparison:
DECLARE @BinaryNetwork binary(4) SET @BinaryNetwork = [dbo].[ParseToBinary]('192.168.1.0')
DECLARE @BinaryMask binary(4) SET @BinaryMask = [dbo].[ParseToBinary]('255.255.255.0')
DECLARE @IntegerNetwork int SET @IntegerNetwork = [dbo].[ParseToInt]('192.168.1.0')


Storing the address as binary(4) and filtering using XOR:
SET STATISTICS TIME ON

SELECT [StringAddress]
FROM [dbo].[ManyIPAddresses]
WHERE ((@BinaryNetwork ^ CAST([BinaryAddress] AS int)) & @BinaryMask) = 0

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 140 ms, elapsed time = 162 ms.

Storing the address as an int and filtering using AND with an int mask:
SET STATISTICS TIME ON

SELECT [StringAddress]
FROM [dbo].[ManyIPAddresses]
WHERE [IntegerAddress] & @BinaryMask = @IntegerNetwork

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 98 ms.


Now if we compare the results between doing a binary JOIN vs an integer JOIN, this is what we achieve.

Doing a binary JOIN:
SET STATISTICS TIME ON

SELECT dbo.IPAddresses.StringAddress
FROM dbo.IPAddresses
INNER JOIN dbo.ManyIPAddresses
ON dbo.IPAddresses.BinaryAddress = dbo.ManyIPAddresses.BinaryAddress

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 1718 ms, elapsed time = 45913 ms.

Doing an integer JOIN:
SET STATISTICS TIME ON

SELECT dbo.IPAddresses.StringAddress
FROM dbo.IPAddresses
INNER JOIN dbo.ManyIPAddresses
ON dbo.IPAddresses.IntegerAddress = dbo.ManyIPAddresses.IntegerAddress

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 1032 ms, elapsed time = 45409 ms.


As for sorting, this is what we achieve.

Doing a binary SORT:
SET STATISTICS TIME ON

SELECT [StringAddress]
FROM [dbo].[ManyIPAddresses]
ORDER BY [BinaryAddress] ASC

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 3390 ms, elapsed time = 46790 ms.

Doing an integer SORT:
SET STATISTICS TIME ON

SELECT [StringAddress]
FROM [dbo].[ManyIPAddresses]
ORDER BY [IntegerAddress] ASC

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 2625 ms, elapsed time = 45974 ms.


And let's finish with some grouping.

Doing a binary GROUP BY:
SET STATISTICS TIME ON

SELECT [BinaryAddress], COUNT([BinaryAddress])
FROM [dbo].[ManyIPAddresses]
GROUP BY [BinaryAddress]

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 610 ms, elapsed time = 3723 ms.

Doing an integer GROUP BY:
SET STATISTICS TIME ON

SELECT [IntegerAddress], COUNT([IntegerAddress])
FROM [dbo].[ManyIPAddresses]
GROUP BY [IntegerAddress]

SET STATISTICS TIME OFF

SQL Server Execution Times:
CPU time = 484 ms, elapsed time = 2911 ms.


So as you can see, not only does binary have its mathematical limitations, it also has its performance limitations. That's not to mention the fact that certain operations simply cannot be done on binary fields. Attempting to perform an aggregate function like AVG on a binary field will yield the error "Operand data type binary is invalid for avg operator.". Although this is something that would never be performed on IP addresses, it is still to be noted. But I digress.

With the obvious performance advantages offered by the integer datatype, I would not hesitate to store IP addresses in the same way as I described in my article. The cost of transformation at the time of insert is minimal at best. With the speed of computed columns, transformation from its stored integer form back to a human readable varchar(15) or a naturally represented binary(4) can be also be achieved at a minimal cost. So is storing an IP address in its natural representation more important then storing it for efficiency and performance? I don't believe so but perhaps that's a need you may have.
Post #812076
Posted Monday, November 2, 2009 4:43 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Friday, September 2, 2011 3:22 AM
Points: 1,227, Visits: 154
Thankyou (Ladies and) Gentlemen
for this ongoing discussion. Recent items have shown me the need to make further comparions on what is the best way to store an IP Address.
Post #812305
Posted Tuesday, November 3, 2009 3:15 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Today @ 10:26 AM
Points: 77, Visits: 166
I'm not sure if I already said this in the thread, and it's a minor point, but I am not sure that int is the best thing about SQL Server generally - because when I performed an experiment on encoding char(6) alphanumeric strings (A-Z 0-9) - six bytes - into four bytes int in SQL Server 2000 as more economical indexed keys, the int representations seemed to perform worse as keys than the strings. I expect I have the details on file somewhere. That suggests that Microsoft worked hard on getting char key indexing to work really well. However, I may have overlooked other factors.

On another occasion I found that there was not much performance difference between int, smallint, and tinyint, presumably because a modern server's processor does 32 bit or 64 bit computation anyway. And bit stunk (they may have improved it since), and has use limitations as mentioned, so in practical programming I usually put bit data in tinyint - I've also considered using a user-defined 'tinybit' data type which would just be tinyint with or without a constraint of being <= 1 or IN (0, 1). Having said that, I don't know what CAST costs here. Maybe nothing.

One factor that my experiments probably excluded is hard disk access time, because my data probably ended up cached in RAM. Because of that, int ought to do better than char(6) in real applications, if - if - query performance is limited by disk speed and not by processor. In other words, if it isn't a fancy query that requires a lot of calculation per unit of data. But I've decided not to rely too much on what I think is going to work well in SQL Server.
Post #812832
Posted Tuesday, November 3, 2009 5:56 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Yesterday @ 10:00 AM
Points: 8,551, Visits: 9,043
Jean-Sebastien Carle (10/31/2009)
<much snipped>
Tom.Thomson (10/31/2009)It's easy enough - although I don't understand why you want me to use binary, since the code snippet in the comment that offended you is clearly based on it being int.

Because you yourself have repeated several times that the natural representation of an address is in its binary 32-bit unmodified form.
<more snipped>

I think we have managed to get ourselves misunderstanding each-other quite a lot.

When I said "the natural representation" I meant one that use the bit pattern that is the IP address - whether it's typed as binary(4) or as int doesn't come into it. That's why I didn't (as I pointed out) understand why you wanted me to provide the code for the binary(4) version. Your integer representation flips the top bit, so it isn't a natural representation, and I already had posted a suggested mod to your transformation which delivers an integer without flipping the top bit, which was the natural representation I was advocating. This is particularly important when using the transformation to convert a mask from dotted form to a bit pattern, because if your original transformation is used for the mask the test
WHERE [IntegerAddress] & @IntegerMask = @IntegerNetwork
will sometimes give the wrong result: for example it will include 64.1.1.0 in the network <192.1.1.0 mask 255.255.255.255> because @IntegerMask will be positive instead of negative; and economy of code suggests that once you have the right transformation for the mask you may as well use it for all transformations from 3 dot format to int.

Also, I dislike that version of the test for a simple reason: it assumes that the subnet is always specified using the lowest address that it contains. But in practise how the subnet ios specified tends to vary according to where you are in the network - in a subnet with mask M, a machine with address A will tend to view the subnet as <A MASK M> rather than as <A&M MASK M>. That's why I would use a test which checks whether there is any bit position where the mask is 1 and the network and candidate addresses differ, if the subnet information was coming from the outside world as parameters to a stored proc. This could be expresssed (assuming that @IntegerMask doesn't have the top bit flipped) by
WHERE [IntegerAddress] & @IntegerMask = @IntegerNetwork & @IntegerMask
as well as by the expression using XOR. Obviously when a subnet is stored in a table it makes a lot of sense to use its lowest address as the network address, but I would prefer not to impose that restriction at the API between the DB and the outside world.

I agree completely with you that efficiency matters. Bothe performance and storage have to be considered, and if one has an address table with various forms in computed columns and only one form in storage, the best form for storage will be integer because it comes top for both storage efficiency (best equal with a few other forms) and performance. But I still think that the integer that doesn't have the flipped top bit is a better representation than the one that does.

When one is going for efficiency, one extra case that needs to be considered is the case where a table with a lot of addresses in it has to be filtered to get the rows in a particular subnet. I'm not sure that the optimiser will manage to use an index on the address if the column is subject to an & operation or an ^ operation in the filter; so I would probably want to convert the subnet to a pair of addresses, the lower and upper bound, for this case; then the index can certainly be used. Also one may want to discover whether two subnets and disjoint, or whether one is a subnet of the other. In these cases too it's useful to convert to lower bound - upper bound form. Should the two bounds be the standard representation of the subnet if it has to be held in the table, or should it be lower bound and mask If it's lower bound and upper bound, the mask can be held (if wanted) as a computed colum easily enough. If it's lower cound and mask, the upper bound can be a computed column. So I guess one would have to look at the performace and all operations and the frequency of each and cdecide accordingly - difficult in a world with incremental requirement discovery, incremental development, and incremental release.


Tom
Post #812871
Posted Tuesday, November 3, 2009 6:13 AM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Today @ 11:16 AM
Points: 133, Visits: 1,269
Jean-Sebastien Carle (10/31/2009)

Generally speaking, IP addresses exist within a database as a result of logging. The logging will often cumulate over time to generate hundreds of thousands if not millions of rows. When operations are done over those rows, they usually revolve around verifying if an IP address already exists, verifying the number of times a specific IP address has been logged or finding logged entries that match either a list of addresses or networks. The efficiency of doing those operations is based on the performance achieved when SQL Server executes the mathematics, the aggregate functions, the joins and/or sorts required to do so.

Binary does not perform as well as the integer although the difference may be negligable if either your dataset is small or your operations are simplistic.

To illustrate, I used my test database. Within it I have two IP address tables. The first one ([dbo].[IPAddresses]) contains 100,000 random IP addresses in three forms varchar(15), int and binary. The second one ([dbo].[ManyIPAddresses]) contains 1,000,000 random IP addresses (10 copies of the [dbo].[IPAddresses] table).


If we look at simply filtering 1,000,000 rows to find IP addresses from a certain subnet, here is the setup for our comparison:
DECLARE @BinaryNetwork binary(4) SET @BinaryNetwork = [dbo].[ParseToBinary]('192.168.1.0')
DECLARE @BinaryMask binary(4) SET @BinaryMask = [dbo].[ParseToBinary]('255.255.255.0')
DECLARE @IntegerNetwork int SET @IntegerNetwork = [dbo].[ParseToInt]('192.168.1.0')



[snip]


So as you can see, not only does binary have its mathematical limitations, it also has its performance limitations. That's not to mention the fact that certain operations simply cannot be done on binary fields. Attempting to perform an aggregate function like AVG on a binary field will yield the error "Operand data type binary is invalid for avg operator.". Although this is something that would never be performed on IP addresses, it is still to be noted. But I digress.

With the obvious performance advantages offered by the integer datatype, I would not hesitate to store IP addresses in the same way as I described in my article. The cost of transformation at the time of insert is minimal at best. With the speed of computed columns, transformation from its stored integer form back to a human readable varchar(15) or a naturally represented binary(4) can be also be achieved at a minimal cost. So is storing an IP address in its natural representation more important then storing it for efficiency and performance? I don't believe so but perhaps that's a need you may have.


I am not a network programmer, so I'll get that out of the way up front.

However, don't bitmasks always represent contiguous blocks of address space? Therefore, wouldn't you get better performance if you calculate the beginning and ending addresses in the range represented by the bitmask and then use BETWEEN rather than bitwise operators? This eliminates per-row implicit conversions within the query and it should be sargable. I'd be interested to see how it compares to the test results you posted.

As for using AVG on IP addresses, I can't think of a case where that even makes sense. (Your comment indicates you agree.) As such, I don't see much point in noting it. In fact, it may be a VERY minor point, but I can see a case to be made for preferring binary(4) specifically because it doesn't even offer operations like AVG that have no meaning relevant to the data.

Andrew


--Andrew
Post #812880
Posted Tuesday, November 3, 2009 6:36 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Friday, September 2, 2011 3:22 AM
Points: 1,227, Visits: 154
I agree that aggregate functions make little sense for IP Addresses.

This is a very important discussion for developers such as myself as originating IP addresses need to be logged quickly and efficiently with regard to space.
Post #812894
« Prev Topic | Next Topic »

Add to briefcase «««7891011»»

Permissions Expand / Collapse