SQLServerCentral Article

Storing IPv4 Addresses for Performance

,

When designing tables, it is important to take into consideration the impact that your choice of columns and its data types will have on performance. The list of data types available has also been growing, making design choices even more difficult. Often real world data is not optimized for performance and it is necessary to find methods of transforming this information to allow efficient storage within tables that take both speed and size into consideration.

Evaluating the human readability factor

It is very rare that SQL Server operates on its own without an external application, website or service responsible for the presentation layer of table data. Data is also often accessed with SQL Server acting as an intermediate between the application and the actual table data. Deciding at which level to implement the human readability factor can have very large impacts on performance. If we choose to apply this at the table level, we negate the need for data conversion when displaying results at the expense of increasing our query costs when comparing, filtering or sorting. If we implement the transformation at the application level, any performance benefit achieved in SQL Server can quickly become negated if large amounts of data need to be displayed simultaneously.

IPv4 addresses are interesting in their own right. We've become used to seeing them in a traditional four number three dot form, yet, by their very nature, they require very little real world human readability. IPv4 addresses are usually stored in databases as result of automatic logging, either for security or data mining reasons. Sometimes, IPv4 addresses are be stored to be compared against other IPv4 addresses such as, for example, a list of all known search engine IPv4 addresses.

David Poole does a good job of talking about the different methods commonly used for storing IPv4 addresses in his article Storing IPs in SQL Server. I felt, however, that more weight was needed to be attributed to performance.

Improving IPv4 address storage and query performance

I started digging into the depths of the issue when faced with a newly acquired list of search engine IPv4 addresses to compare against a rapidly growing log of page views. Since IPv4 addresses were being stored only for protection against abuse as well as occasional activity monitoring, there was no performance aspect to consider. The only requirement was human readability, and as such, IPv4 addresses were being stored as VARCHAR(15). Presented with this new listed of IPv4 addresses to compare, it was clear that VARCHAR would no longer be acceptable.

In 1981, the Information Sciences Institute published RFC 791 with a single line that would change the world: "A route data is composed of a series of internet addresses. Each internet address is 32 bits or 4 octets". Little did they know that this would become one of the world's most significant numbers. The 32-bit INT data type for SQL Server is also the most efficient one. Storing the value as an integer remains the best way to achieve efficiency and performance. Doing so also only stores 4 bytes instead of the 8 bytes of BIGINT or the 17 of VARCHAR(15).

We can do the conversion from standard notation to an integer (and vice versa) at the application level. It is however outside the scope of this article and is dependent on the programming language used.

When dealing with SQL Server, unfortunately, since we only have 32-bit signed integers available (there is no 32-bit unsigned integer data type), we have to do some shifting of values into the negatives in order to account for this. For maximum performance, it is also best to do the conversion using the least amount of operations as possible.

Converting a string IPv4 address to an integer IPv4 address

To convert a string IPv4 address into an integer IPv4 address we simply do a little bit shifting. Each octet of an IPv4 address represents 8 of the 32 bits. The first octet is bit shifted by 24 bits and therefore we multiply its value by 16,777,216. The second octet is bit shifted by 16 bits, so this time we multiply by 65,536. The third octet is bit shift by 8 bits, as such we multiply by 256. The final octet does not need to be bit shifted. Finally, to allow the result to fit inside the INT data type, we subtract the total of all bit shifted octects by 2,147,483,648.

Now let's look at some of the different ways that we can accomplish that.

Using SUBSTRING and CHARINDEX

The most common approach is to grab the octets using familiar SUBSTRING and CHARINDEX lookups.

CREATE FUNCTION [dbo].[IPAddressToInteger]
(
@IPVARCHAR(15)
)
RETURNS INT
AS
BEGIN
  RETURN
 CAST(

       (CAST(SUBSTRING(@IP, 0, CHARINDEX('.', @IP)) AS BIGINT) * 16777216) +

   (CAST(SUBSTRING(@IP, CHARINDEX('.', @IP) + 1, CHARINDEX('.', @IP,
   CHARINDEX('.', @IP) + 1) - (CHARINDEX('.', @IP) + 1)) AS BIGINT) * 65536) +

   (CAST(REVERSE(SUBSTRING(REVERSE(@IP), CHARINDEX('.', REVERSE(@IP)) + 1, CHARINDEX('.',
    REVERSE(@IP), CHARINDEX('.', REVERSE(@IP)) + 1) - (CHARINDEX('.', REVERSE(@IP))
    + 1))) AS BIGINT) * 256) +

    (CAST(REVERSE(SUBSTRING(REVERSE(@IP), 1, CHARINDEX('.', REVERSE(@IP)) - 1)) AS BIGINT))

    - 2147483648
 AS INT)
END

Using PARSENAME

An alternative approach is to use PARSENAME to extract the four octets. Although PARSENAME was never designed for this, SQL Server's path naming is also dot delimited, which is why this works.

CREATE FUNCTION [dbo].[IPAddressToInteger]
(
@IPVARCHAR(15)
)
RETURNS INT
AS
BEGIN
RETURN
CAST(

(CAST(PARSENAME(@IP,4) AS BIGINT) * 16777216)
(CAST(PARSENAME(@IP,3) AS BIGINT) * 65536) +
(CAST(PARSENAME(@IP,2) AS BIGINT) * 256) +
CAST(PARSENAME(@IP,1) AS BIGINT)
- 2147483648
AS INT)
END

Improving the math for SQL Server

Since the math overflows the INT data type, we have no choice but to CAST to BIGINT, do the math, and then CAST back to INT after having shifted the result into the negatives. This results in unnecessary double-casting. Knowing ahead of time that the end result is shifted, we can rewrite the math to do the negative shift at the beginning. To do so, we simply subtract 128 from the first octet before we multiply by 16,777,216.

An improved SUBSTRING and CHARINDEX conversion

Here is the SUBSTRING and CHARINDEX method with the new math.

CREATE FUNCTION [dbo].[IPAddressToInteger]
(
@IPVARCHAR(15)
)
RETURNS INT
AS
BEGIN
  RETURN
     ((CAST(SUBSTRING(@IP, 0, CHARINDEX('.', @IP)) AS INT) - 128) * 16777216) +
 (CAST(SUBSTRING(@IP, CHARINDEX('.', @IP) + 1, CHARINDEX('.', @IP,
 CHARINDEX('.', @IP) + 1) - (CHARINDEX('.', @IP) + 1)) AS INT) * 65536) +
 (CAST(REVERSE(SUBSTRING(REVERSE(@IP), CHARINDEX('.', REVERSE(@IP)) + 1, CHARINDEX('.',
 REVERSE(@IP), CHARINDEX('.', REVERSE(@IP)) + 1) - (CHARINDEX('.', REVERSE(@IP))
 + 1))) AS INT) * 256) +

(CAST(REVERSE(SUBSTRING(REVERSE(@IP), 1, CHARINDEX('.', REVERSE(@IP)) - 1)) AS INT))
END

An improved PARSENAME conversion

Here is the PARSENAME method with the new math.

CREATE FUNCTION [dbo].[IPAddressToInteger]
(
@IPVARCHAR(15)
)
RETURNS INT
AS
BEGIN
 RETURN
   ((CAST(PARSENAME(@IP,4) AS INT) - 128) * 16777216) +
   (CAST(PARSENAME(@IP,3) AS INT) * 65536) +
   (CAST(PARSENAME(@IP,2) AS INT) * 256) +
   CAST(PARSENAME(@IP,1) AS INT)
END

Performance testing the different conversion methods

So now that we have four similar methods that we can use to do the conversion, how do we know which one is the most efficient? Using performance testing of course!

To eliminate deviations caused by table lookups, I looped each method 10,000 times on the same VARCHAR(15) IPv4 address. I performed the test 10 times to make sure the results were consistent.

SUBSTRING and CHARINDEX3334 ms
PARSENAME3325 ms
Improved SUBSTRING and CHARINDEX3332 ms
Improved PARSENAME3323 ms

It is very clear that there is little difference in performance between the different methods. However, it is also evident that PARSENAME is faster than multiple SUBSTRING and CHARINDEX calls. Removing the double casting also helps.

Converting from an integer IPv4 address to a string IPv4 address

Not many methods exist for converting an integer IPv4 address to a string. You can either determine the octets using division and remainder subtractions or by using bit masking and division.

Division and remainder method

CREATE FUNCTION [dbo].[IPAddressToString]
(
        @IP             INT
)
RETURNS VARCHAR(15)
AS
BEGIN
  DECLARE
   @IPAddressString                VARCHAR(15),
   @Octet                                  TINYINT,
   @Remainder                              DECIMAL
        
   SET @Octet = @IP / 16777216
   SET @Remainder = @IP - (@Octet * 16777216)
   SET @IPAddressString = CAST((@Octet + 128) AS VARCHAR)
   SET @Octet = @Remainder / 65536
   SET @Remainder = @Remainder - (@Octet * 65536)
   SET @IPAddressString = @IPAddressString + '.' + CAST(@Octet AS VARCHAR)
   SET @Octet = @Remainder / 256
   SET @Remainder = @Remainder - (@Octet * 256)
   SET @IPAddressString = @IPAddressString + '.' + CAST(@Octet AS VARCHAR)
   SET @IPAddressString = @IPAddressString + '.' + CAST(@Remainder AS VARCHAR)
   RETURN @IPAddressString
END

Bit masking and division method

CREATE FUNCTION [dbo].[IPAddressToString]
(
        @IP             INT
)
RETURNS VARCHAR(15)
AS
BEGIN
  RETURN
  CAST(((@IP & 0xFF000000) / 16777216) + 128 AS varchar(3)) + '.' +
  CAST((@IP & 0x00FF0000) / 65536 AS varchar(3)) + '.' +
  CAST((@IP & 0x0000FF00) / 256 AS varchar(3)) + '.' +
  CAST((@IP & 0x000000FF) AS varchar(3))
END

It is not necessary to do any performance testing. The overhead from the remainder operations is far too evident. Bit masking and division is the obvious winner here.

Different table designs

We have already established storing the IPv4 address as an INT offers the best performance but how does the human readability conversion affect performance?

Typical scenarios that will insert IPv4 addresses into a table are usually a result of logging. As previously mentioned, the conversion can occur within the Stored Procedure responsible for the INSERT. As there are no alternative approaches at the SQL Server level, there is no performance consideration to analyse since the only comparison could be versus the application level, which is outside the scope of this article. Instead, I will focus on the conversion aspect from an integer IPv4 address to a string IPv4 address at the SQL Server level.

Fortunately, I was able to use a live visitor log table to do some real-world performance testing. I copied the contents of the table to another table within the database to prevent changes to data from occurring during testing. I ran a straight SELECT and a COUNT based SELECT using three methods at the same time which allows the Execution Plan to automatically figure out the performance balance.

Additional column within the same table

An obvious method is to create an additional VARCHAR(15) column in the table and populate it at the same time as the INT column. This takes up an additional 17 bytes per row in the table but removes the need to do any conversions when querying the data.

Unsurprisingly, the simple SELECT query cost is the lowest of the three as there is no computational or hash matching required.

SELECT
[VisitID],
[IPAddressInteger],
[IPAddressString]
FROM [Visits]

The picture is quite different with the second query. The additional column becomes a burden making this method the most costly when performing an aggregated query.

SELECT
[IPAddressInteger],
[IPAddressString],
[TotalVisits] = COUNT([VisitID])
FROM [Visits]
GROUP BY
[IPAddressInteger],
[IPAddressString]

Second table for lookups

With an additional table to store the reference values, we can use an INNER JOIN to match our results. At 55%, the Hash Match is clearly creating a lot of additional overhead. This negates the storage benefits of normalizing the column.

SELECT
[Visits].[VisitID],
[Visits].[IPAddressInteger],
[IPAddresses].[IPAddressString]
FROM [Visits]
INNER JOIN [IPAddresses]
ON [Visits].[IPAddressInteger] = [IPAddresses].[IPAddressInteger]

SQL Server's efficiency with the INT data type really shines here as the query cost low.

SELECT
[Visits].[IPAddressInteger],
[IPAddresses].[IPAddressString],
[TotalVisits] = COUNT([Visits].[VisitID])
FROM [Visits]
INNER JOIN [IPAddresses]
ON [Visits].[IPAddressInteger] = [IPAddresses].[IPAddressInteger]
GROUP BY
[IPAddresses].[IPAddressInteger],
[Visits].[IPAddressString]

On the fly conversion

Using the improved PARSENAME conversion as a scalar function, the conversion overhead for the simple SELECT is almost unnoticeable.

SELECT
[VisitID],
[IPAddressInteger],
[IPAddressString] = [dbo].[IPAddressToString]([IPAddressInteger])
FROM [Visits]

Leveraging the benefits of using the INT data type, the aggregated sub query has the lowest cost.

SELECT
[IPAddressInteger],
[IPAddressString] = [dbo].[IPAddressToString]([IPAddressInteger]),
[TotalVisits]
FROM (
SELECT [IPAddressInteger], [TotalVisits] = COUNT([VisitID])
FROM [Visits]
GROUP BY [IPAddressInteger]
) AS [Visits]

Using a second table cannot be recommended as there is too much overhead caused by the INNER JOIN. With such a small difference between using an additional column and converting the value on the fly, the decision on which method to use is placed on the results from the aggregated query. The parallelism caused by having an additional column increases the query cost significantly and makes on the fly conversion the golden choice.

High performance and low storage costs

Using small datasets, performance tuning often results in no apparent real-world benefits. The benefits can quickly become clear as datasets, query complexity and server loads grow. Ten milliseconds may not seem like much, but ten milliseconds may mean the difference between thirty seconds or a single second when compounded under the right circumstances. With an extremely rapid conversion in both directions, it is evident that converting and storing an IPv4 address exclusively as an INT gives us the best performance. With the low storage requirements of only 4 bytes per row, we achieve both computational and storage efficiency. The SQL Server INT data type is the best method for storing IPv4 addresses.

Rate

4.33 (33)

You rated this post out of 5. Change rating

Share

Share

Rate

4.33 (33)

You rated this post out of 5. Change rating