SQLServerCentral Article

Storing IPs in SQL Server

,

What is the best way to store an IP Address in SQL Server

Imagine that you are asked to design a database for a data analysis team to perform web site traffic analysis.

This brief will have many requirements and amongst them will be the need to store IP addresses.

Search engines such as Google have a fixed range of IP addresses so the team can easily separate out

traffic from true visitors Vs traffic from bots. Although the team have a good rudimentary knowledge of T-SQL the data must

be relatively simple to query. So with regard to IP addresses the requirement is as follows:

  • Store IP addresses efficiently
  • Allow retrieval of IP addresses in a machine readable format
  • Allow simple querying on a range or ranges of IP addresses

The data type question

If traffic to your web site is high then the choice of data types is going to be important.

You could keep the IP address as a VARCHAR(15) but given the nature of what an IP address actually is, 4 integers in the range 0 to 255, this seems a trifle wasteful.

My initial thought was to use techniques described in Lee Everest's article Introduction to bitmasking in SQL2005

but if you read the forum discussion of the article you will see an interesting suggestion by Joe Celko.

So what possiblities do we have? Let us consider the ip address 192.168.0.5

MethodStorageComment
VARCHAR(15)Between 7 and 15 bytesStores the IP address in human readable but this is wasteful.
BIGINT8 bytesWe can represent our IP address as 192168000005. It is stretching the definition of human readability somewhat but this depends on your audience.
INT4 bytesOur IP address is no longer human readable being represented as 1084751877.
Four separate TINYINT fields4 bytesOur address is now both efficient and human readable just as Joe Celko pointed out.

SQL 2005 CLR User Defined Types

SQL2005 provides us with one other possiblity. The .NET assembly user defined type.

I was fortunate to go on the Microsoft "Updating Your Database Development Skills to Microsoft SQL Server 2005" (Course 2734B) which included

an IP address UDT. As I am not sure of the copyright issues surrounding the code for the UDT I am not including the source code here but the UDT

provided the following functionality.

  • Accept an ip address in the form nnn.nnn.nnn.nnn
  • Return the individual bytes of an IP address
  • Return the string representation of the IP address
  • Return a varbinary representation of the IP address
  • Return a string with the PING command and the IP address. I removed this from the code as it was irrelevant.

There are similar functions available on the web.

Test methodology

I decided to create five tables and into each put 1 million rows. Each table would contain a single INT IDENTITY field (IPID)acting as the primary key, plus the chosen method of storing an IP address.

TableComment
IPAddressSourceIP Address stored as 4 separate TINY INT fields
IPAddressINTIP Address stored as a single INT field
IPAddressBIGINTIP Address stored as a single BIGINT field
IPAddressVARCHARIP Address stored as a single VARCHAR(15) field
IPAddressUDTIP Address stored in a .NET CLR data type

After population I ran an sp_spaceused on each table using the stored procedures described in

Automated Monitoring Database Size Using sp_spaceused.

For those who caught the article first time around I have added the equivalent procedure for SQL2000 in the discussion forum

attached to the article.

Initial results

TableNameRowsReservedDataIndexSizeUnused
dbo.IPAddressBIGINT1,000,00020,87220,784880
dbo.IPAddressINT1,000,00016,90416,8087216
dbo.IPAddressSource1,000,00016,90416,8087224
dbo.IPAddressUDT1,000,00049,9284969619240
dbo.IPAddressVARCHAR1,000,00030,08829,9601208

Straight away we can see that the table using the CLR UDT is by far the worst in terms of storage size taking nearly 50MB

to store our million rows.

Running an sp_help reveals why.

Column_nameTypeComputedLength...etc
IPIDintno4 
IPAddressIPAddressno37 

There is very little information available on the storage of data in CLR UDTs but what appears to happen is that

the instance of the datatype (including structure and data) is serialised and stored in a VARBINARY type format in the field.

This sort of begs the question as to what real-world use are CLR UDTs?

The motivation to try a CLR IPAddress type was that a single type could encapsulate the IP address. The type is atomic.

Storing the data as 4 TINYINT fields means that any one of the four parts of the IP address is meaningless in its own

right. It is not atomic.

In this case we can see that there is a heavy overhead in gaining atomicity.

The indexing question

We have established that for readability and efficient storage of data storing an IP address as 4 separate TINYINT fields is the most efficient. But what about searching on IP Addresses?

What I need to be able to do is to identify ranges of IP addresses that correspond to Web Bots. Given the sheer volume of records

I am going to need to discover the most efficient method of doing this.

The SQL Query engine decides for itself whether or not there is a benefit in using an index. If the index is not regarded as being

selective enough the engine will simply ignore the index.

To give an example I had a table containing details of cars where the primary key was make, model and year.

WHERE make='FORD' resulted in a TABLE SCAN

WHERE make='MASERATI' resulted in an INDEX SEEK

The only difference in the WHERE clause was the value being searched for.

Again if you look at the discussion for Lee Everest's article

to make the index as selective as possible Joe Celko suggests that an index be placed across the four TINYINT columns representing the IP Address but from the right-most

first, in effect recording the IP address backwards. As the digits in an IP address vary the most in the right hand portion and least in the left hand portion this

should make the index as selective as it can possibly be.

My first port of call was to create such an index on my dbo.IPAddressSource table.

The next step was to devise a query that would search for a range of IP addresses such as 72.100.5.25 to 74.50.25.1



/*
    Define parameters to test various ranges of IP addresses
*/
DECLARE
@LowIP1 TINYINT , @HighIP1 TINYINT ,
@LowIP2 TINYINT , @HighIP2 TINYINT ,
@LowIP3 TINYINT , @HighIP3 TINYINT ,
@LowIP4 TINYINT, @HighIP4 TINYINT 
SELECT 
@LowIP1 = 72, @HighIP1 = 74 ,
@LowIP2 = 100, @HighIP2 = 50 ,
@LowIP3 = 5, @HighIP3 = 25 ,
@LowIP4 = 25, @HighIP4 = 1 

SELECT * FROM dbo.IPAddressSource
WHERE
(
IP1= @LowIP1 
AND (
IP2>@LowIP2
OR (IP2=@LowIP2 AND ((IP3 = @LowIP3 AND IP4>=@LowIP4) OR IP3>@LowIP3))
)
)
OR (IP1>@LowIP1 AND IP1OR
(
IP1 = @HighIP1
AND (
IP2 OR (IP2=@HighIP2  AND ((IP3=@HighIP3 AND IP4OR IP3

As we can see the WHERE clause is a bit convoluted and this is just searching for one range of ip addresses. We should also look at the execution plan.

Convoluted WHERE clause execution plan

The subtree cost is 4.56566

An alternative approach

We have established that storing an IP address as 4 TINYINT fields achieves two objectives

  • Storing the IP address in an efficient format
  • Storing the IP address in a format that is easily read by a human being

The problem areas are as follows

  • Searching for a range of addresses requires convoluted WHERE conditions
  • The query is expensive

As an alternative experiment I decided to create a new table called dbo.IPAddressCALC as follows

IF NOT EXISTS(SELECT 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='IPAddressCalc')
CREATE TABLE dbo.IPAddressCalc(
IPID int NOT NULL,
IP1 tinyint NOT NULL,
IP2 tinyint NOT NULL,
IP3 tinyint NOT NULL,
IP4 tinyint NOT NULL,
IPAddress AS 
CAST(
((CAST(IP1 AS BIGINT)* 16777216)
+( CAST(IP2 AS BIGINT)* 65536)
+( CAST(IP3 AS BIGINT)*256)
+IP4)
-2147483648 AS INT),
 CONSTRAINT PK_IPAddressCalc PRIMARY KEY CLUSTERED (IPID ASC)
) 
GO

Our IPAddress field is a calculated field using the bitmasking technique and contains an INT value.

To aid our search I placed an index on this column. To search this column now our query would become

SELECT * FROM dbo.IPAddressCalc
WHERE IPAddress BETWEEN @Low AND @High

We would probably create function to return the values for @Low and @High from supplied IP addresses.

Our WHERE clause is much simpler and it is now easy to write a query that can handle multiple ranges of ip addresses.

Our execution plan is also slightly different.

Simple WHERE clause execution plan

The subtree cost is 3.73884

Wrapping it all up

The final step is to example sp_spaceused results for our two tables. We are interested

in the amount of data and index space taken up by each item.

TableNameRowsReservedDataIndexSizeUnused
dbo.IPAddressCalc1,000,00030,86416,80813,952104
dbo.IPAddressSource1,000,00027,85616,80810,96088

Adding a calculated column has no affect on the data held within the table, however, over 1 million rows the index takes

up just under 3MB more space.

The final note to add is that our analysis tables would be populated using some form of bulk loading process of the previous

day's results. Placing an index on a calculated field of a highly dynamic table could potentially cause performance problems.

Addendum

To develop this solution I generated 1 million random IP addresses.

For those of us who have cursed the SQL Server RAND function my population query was as follows


IF NOT EXISTS(SELECT 1 FROM dbo.IPAddressSource)
BEGIN
SET ROWCOUNT 1000000
INSERT INTO dbo.IPAddressSource(IP1, IP2, IP3, IP4)
SELECT 
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) )
FROM master.dbo.sysobjects AS O1 ,
master.dbo.sysobjects AS O2 
SET ROWCOUNT 0
END

The method works because NEWID() returns a number that is supposed to be unique in time and space and consists fo a consistant format

containing hexadecimal strings.

What I am doing is taking the first block of 8 characters, converting the resulting string to a VARBINARY value and this resulting value

is then converted into an integer value.

I don't actually care what the integer value is as I am simply feeding it into the RAND function as a seed.

Rate

4.75 (8)

You rated this post out of 5. Change rating

Share

Share

Rate

4.75 (8)

You rated this post out of 5. Change rating