Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Storing IPs in SQL Server

By David Poole,

What is the best way to store an IP Address in SQL Server

Imagine that you are asked to design a database for a data analysis team to perform web site traffic analysis. This brief will have many requirements and amongst them will be the need to store IP addresses.

Search engines such as Google have a fixed range of IP addresses so the team can easily separate out traffic from true visitors Vs traffic from bots. Although the team have a good rudimentary knowledge of T-SQL the data must be relatively simple to query. So with regard to IP addresses the requirement is as follows:

  • Store IP addresses efficiently
  • Allow retrieval of IP addresses in a machine readable format
  • Allow simple querying on a range or ranges of IP addresses

The data type question

If traffic to your web site is high then the choice of data types is going to be important. You could keep the IP address as a VARCHAR(15) but given the nature of what an IP address actually is, 4 integers in the range 0 to 255, this seems a trifle wasteful.

My initial thought was to use techniques described in Lee Everest's article Introduction to bitmasking in SQL2005 but if you read the forum discussion of the article you will see an interesting suggestion by Joe Celko.

So what possiblities do we have? Let us consider the ip address 192.168.0.5

Method Storage Comment
VARCHAR(15) Between 7 and 15 bytes Stores the IP address in human readable but this is wasteful.
BIGINT 8 bytes We can represent our IP address as 192168000005. It is stretching the definition of human readability somewhat but this depends on your audience.
INT 4 bytes Our IP address is no longer human readable being represented as 1084751877.
Four separate TINYINT fields 4 bytes Our address is now both efficient and human readable just as Joe Celko pointed out.

SQL 2005 CLR User Defined Types

SQL2005 provides us with one other possiblity. The .NET assembly user defined type.

I was fortunate to go on the Microsoft "Updating Your Database Development Skills to Microsoft SQL Server 2005" (Course 2734B) which included an IP address UDT. As I am not sure of the copyright issues surrounding the code for the UDT I am not including the source code here but the UDT provided the following functionality.

  • Accept an ip address in the form nnn.nnn.nnn.nnn
  • Return the individual bytes of an IP address
  • Return the string representation of the IP address
  • Return a varbinary representation of the IP address
  • Return a string with the PING command and the IP address. I removed this from the code as it was irrelevant.

There are similar functions available on the web.

Test methodology

I decided to create five tables and into each put 1 million rows. Each table would contain a single INT IDENTITY field (IPID)acting as the primary key, plus the chosen method of storing an IP address.

Table Comment
IPAddressSource IP Address stored as 4 separate TINY INT fields
IPAddressINT IP Address stored as a single INT field
IPAddressBIGINT IP Address stored as a single BIGINT field
IPAddressVARCHAR IP Address stored as a single VARCHAR(15) field
IPAddressUDT IP Address stored in a .NET CLR data type

After population I ran an sp_spaceused on each table using the stored procedures described in Automated Monitoring Database Size Using sp_spaceused. For those who caught the article first time around I have added the equivalent procedure for SQL2000 in the discussion forum attached to the article.

Initial results

TableName Rows Reserved Data IndexSize Unused
dbo.IPAddressBIGINT1,000,00020,87220,784880
dbo.IPAddressINT1,000,00016,904 16,8087216
dbo.IPAddressSource1,000,00016,90416,8087224
dbo.IPAddressUDT1,000,00049,9284969619240
dbo.IPAddressVARCHAR1,000,00030,08829,9601208

Straight away we can see that the table using the CLR UDT is by far the worst in terms of storage size taking nearly 50MB to store our million rows.

Running an sp_help reveals why.

Column_name Type Computed Length ...etc
IPID int no 4  
IPAddress IPAddress no 37  

There is very little information available on the storage of data in CLR UDTs but what appears to happen is that the instance of the datatype (including structure and data) is serialised and stored in a VARBINARY type format in the field.

This sort of begs the question as to what real-world use are CLR UDTs?

The motivation to try a CLR IPAddress type was that a single type could encapsulate the IP address. The type is atomic.

Storing the data as 4 TINYINT fields means that any one of the four parts of the IP address is meaningless in its own right. It is not atomic.

In this case we can see that there is a heavy overhead in gaining atomicity.

The indexing question

We have established that for readability and efficient storage of data storing an IP address as 4 separate TINYINT fields is the most efficient. But what about searching on IP Addresses?

What I need to be able to do is to identify ranges of IP addresses that correspond to Web Bots. Given the sheer volume of records I am going to need to discover the most efficient method of doing this.

The SQL Query engine decides for itself whether or not there is a benefit in using an index. If the index is not regarded as being selective enough the engine will simply ignore the index.

To give an example I had a table containing details of cars where the primary key was make, model and year.
WHERE make='FORD' resulted in a TABLE SCAN
WHERE make='MASERATI' resulted in an INDEX SEEK

The only difference in the WHERE clause was the value being searched for.

Again if you look at the discussion for Lee Everest's article to make the index as selective as possible Joe Celko suggests that an index be placed across the four TINYINT columns representing the IP Address but from the right-most first, in effect recording the IP address backwards. As the digits in an IP address vary the most in the right hand portion and least in the left hand portion this should make the index as selective as it can possibly be.

My first port of call was to create such an index on my dbo.IPAddressSource table.

The next step was to devise a query that would search for a range of IP addresses such as 72.100.5.25 to 74.50.25.1



/*
    Define parameters to test various ranges of IP addresses
*/

DECLARE
	@LowIP1 TINYINT , @HighIP1 TINYINT ,
	@LowIP2 TINYINT , @HighIP2 TINYINT ,
	@LowIP3 TINYINT , @HighIP3 TINYINT ,
	@LowIP4 TINYINT	, @HighIP4 TINYINT 

SELECT 
	@LowIP1 = 72	, @HighIP1 = 74 ,
	@LowIP2 = 100	, @HighIP2 = 50 ,
	@LowIP3 = 5	, @HighIP3 = 25 ,
	@LowIP4 = 25	, @HighIP4 = 1 
	
SELECT * FROM dbo.IPAddressSource
WHERE
	(
		IP1= @LowIP1 
		AND (
			IP2>@LowIP2
			OR (IP2=@LowIP2 AND ((IP3 = @LowIP3 AND IP4>=@LowIP4) OR IP3>@LowIP3))
		)
	)
	OR (IP1>@LowIP1 AND IP1<@HighIP1)
OR
	(
		IP1 = @HighIP1
		AND (
			IP2 < @HighIP2
			OR (IP2=@HighIP2  AND ((IP3=@HighIP3 AND IP4<=@HighIP4) OR IP3<@HighIP3))
	)
)


As we can see the WHERE clause is a bit convoluted and this is just searching for one range of ip addresses. We should also look at the execution plan.

Convoluted WHERE clause execution plan

The subtree cost is 4.56566

An alternative approach

We have established that storing an IP address as 4 TINYINT fields achieves two objectives

  • Storing the IP address in an efficient format
  • Storing the IP address in a format that is easily read by a human being

The problem areas are as follows

  • Searching for a range of addresses requires convoluted WHERE conditions
  • The query is expensive

As an alternative experiment I decided to create a new table called dbo.IPAddressCALC as follows

IF NOT EXISTS(SELECT 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='IPAddressCalc')
	CREATE TABLE dbo.IPAddressCalc(
		IPID int NOT NULL,
		IP1 tinyint NOT NULL,
		IP2 tinyint NOT NULL,
		IP3 tinyint NOT NULL,
		IP4 tinyint NOT NULL,
		IPAddress AS 
		CAST(
			((CAST(IP1 AS BIGINT)* 16777216)
		+	( CAST(IP2 AS BIGINT)* 65536)
		+	( CAST(IP3 AS BIGINT)*256)
		+	IP4)
		-2147483648 AS INT),
	 CONSTRAINT PK_IPAddressCalc PRIMARY KEY CLUSTERED (	IPID ASC)
	) 
GO

Our IPAddress field is a calculated field using the bitmasking technique and contains an INT value.

To aid our search I placed an index on this column. To search this column now our query would become

SELECT * FROM dbo.IPAddressCalc
WHERE IPAddress BETWEEN @Low AND @High

We would probably create function to return the values for @Low and @High from supplied IP addresses.

Our WHERE clause is much simpler and it is now easy to write a query that can handle multiple ranges of ip addresses. Our execution plan is also slightly different.

Simple WHERE clause execution plan

The subtree cost is 3.73884

Wrapping it all up

The final step is to example sp_spaceused results for our two tables. We are interested in the amount of data and index space taken up by each item.

TableName Rows Reserved Data IndexSize Unused
dbo.IPAddressCalc1,000,00030,86416,80813,952104
dbo.IPAddressSource1,000,00027,85616,80810,96088

Adding a calculated column has no affect on the data held within the table, however, over 1 million rows the index takes up just under 3MB more space.

The final note to add is that our analysis tables would be populated using some form of bulk loading process of the previous day's results. Placing an index on a calculated field of a highly dynamic table could potentially cause performance problems.

Addendum

To develop this solution I generated 1 million random IP addresses.

For those of us who have cursed the SQL Server RAND function my population query was as follows


IF NOT EXISTS(SELECT 1 FROM dbo.IPAddressSource)
	BEGIN
		SET ROWCOUNT 1000000
		INSERT INTO dbo.IPAddressSource(IP1, IP2, IP3, IP4)
		SELECT 
			FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
			FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
			FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
			FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) )

		FROM master.dbo.sysobjects AS O1 ,
		master.dbo.sysobjects AS O2 

		SET ROWCOUNT 0
	END

The method works because NEWID() returns a number that is supposed to be unique in time and space and consists fo a consistant format containing hexadecimal strings.

What I am doing is taking the first block of 8 characters, converting the resulting string to a VARBINARY value and this resulting value is then converted into an integer value.

I don't actually care what the integer value is as I am simply feeding it into the RAND function as a seed.

Total article views: 13962 | Views in the last 30 days: 12
 
Related Articles
FORUM

Question about included columns in nonclustered indexes: Where exactly are they stored?

Do nonclustered indexes just store pointers to clustered indexes?If so where are the included column...

FORUM

Generate MAc addresses with a stored procedure

Generate MAc addresses with a stored procedure

ARTICLE

Storing IPv4 Addresses for Performance

An in-depth analysis on how to store IPv4 addresses to achieve maximum efficiency and performance in...

BLOG

Denali — Day 5: Column-store indexes (aka Project Apollo)

Denali – Day 5: Column-store indexes (aka Project Apollo)   There is another great achievement...

FORUM

Known problems with bit-columns?

Use tinyint instead?

Tags
advanced    
advanced querying    
database design    
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones