Storing IPv4 Addresses for Performance

  • I really don't have a strong opinion on how to store IP addresses, since I've yet to be asked to do so for any of my clients (thankfully).

    But nonetheless I feel obliged to comment, because in my previous life as a network engineer, I used IP addresses a lot.

    I believe in storing data in the most 'natural' way for the type of data. So I store whole numbers as int, freeform text as varchar, monetary values as decimal or money, dates as dates etc.

    So from that perspective, probably the 'most correct' data type is the binary(4), which doesn't add any meaning to the data. i.e. it isn't trying to pretend that the data is a number, which it isn't.

    Some people like to think of an IP address as four small numbers, and so for them the 'natural' way to store the data would be in four tinyints. However, strictly speaking in the modern IPv4 Internet, an IP address is a 32-bit pointer, and not a number (or set of numbers) at all. The dotted-octet form is merely to make the address easy for people to remember and type, and has no other meaning.

    And with respect to subnetting, in my time as a network engineer (which I still do part-time to 'keep my hand in'), apart from internal (inside a firewall) company or home networks, I very seldom see an octet-aligned network mask (e.g. 255.255.0.0), much more common are sub- or super- netted masks (e.g. 255.128.0.0 or 255.255.255.224). So if you're wanting to use subnetting as an argument in favour of your choice of storage for IP addresses, then storing it as four tinyints I think I would find quite awkward because finding addresses belonging to a non-aligned subnet would require extra care and some spooky bit-shifting (for T-SQL).

    Having said all that, I think everyone should remember that there's no one-size-fits-all approach to this kind of thing, and there's no point in getting hot under the collar because someone doesn't do it your way.

    Do whatever suits your requirements best. As demonstrated by several posts, there's nothing stopping you from doing a bit of 'mix-and-match' and adding an extra int or binary(4) to your four tinyint columns, or the other way around, should your needs change.

  • nimdil (8/19/2009)


    That's the whole point. Some IP addresses will map to negative integers. As there are 2^32 IP Addresses and 2^31 non-negative integers in SQL Server we have to use positive as well as negative values.

    The idea with four columns - while not bad - has some flaws:

    1. There is no merits with 1NF as IP address is a complete being. Would you store postal code with few char(1) columns?

    2. If you want to perform table wide bit operation with mask (mask may be from 0 to 32 actually) you will have to either map four tinyints to one int the same way it is presented in the article or map it to bigint. Not good for performance :>

    If you want readability - just use view with some ip_int_to_ip_string computed column.

    Ahh I see.

    My mistake. Overlooked that part. Thanks for clearing it up.

  • Gosh... I'm really torn here. Jean-Sebastien, you did a heck of a nice job.

    I can't take anything away from the author. This is a very well written article with great examples and a noble purpose... I just don't agree with the extra obfuscation that occurs just to save 3 to 9 bytes per entry when there are so many other methods. Without prior knowledge and if the documentation for the method were lost and not included in the embedded comments of the code or the function code were lost, very few people could actually reverse engineer the reasoning behind this ingenious but obscure method of squashing an IPV4 into an INT datatype.

    Further, I really don't agree with the idea of using UDF's for this because they just slow things down... a lot! 3.3xx seconds for all of the methods demonstrated is pretty tough for only 10,000 rows and that only gives you one method of interrogating the table for IP addresses.

    How many of you believe that Calendar tables are useful and provide good set-based performance where RBAR may otherwise have to be used? Quite a few, I would imagine. So why relegate yourselves to the use of UDF's for relatively simple IPV4 transformations?

    While I agree that it was never the goal of this article, as Ryan Price said in his good response, not any one method will fit all requirements. You can, however, anticipate that requirements will change or be added as time wears on (your boss will love you when you say "I anticipated your future need and it's already done"). A table similar to that of a Calendar table would do nicely. Think of it as a "Rosetta Stone" for IPV4 addresses.

    Now, if space is a real crunch, you can simply make computed columns to cover most of the eventualities and still get away with only 4 bytes although (materialized) persisted computed columns would provide a bit of extra performance. Done correctly, you can account for all sorts of masking, a Binary version of the IPV4, the two most common forms of character based/human readable IPV4's, a BIGINT IPV4, a human readable (kind of) Numeric IPV4, AND the highly desirable 4 x TinyInt form… all for just 4 bytes plus indexes on columns that you think you'll actually use. And, it's fast as hell. All 10,000 rows are split and inserted into the "final" target table in just 187 milliseconds (sans all indexes but the PK) and the SELECT of all 10,000 rows including ALL of the calculated methods I just mentioned only takes 684 milliseconds to be returned to the (ugh! Slow) screen. That's on a 7 year old P4 1.8Ghz desktop box using SQL Server 2005 DE sp3!

    Heh… yeah, I know… "Hey Jeff! Put up or shut up!" I'll do both… 😛

    /*****************************************************************************************************************
     This section is just to setup the test environment. It is NOT a part of the solution.  --Jeff Moden
    *****************************************************************************************************************/
    --===== Make sure the stats timer is off for reruns
        SET STATISTICS TIME OFF
    ;
    --===== Conditionally drop the test tables for test reruns
         -- This table will contain "x" number of whole IP addresses as if collected from some file or whatever.
         IF OBJECT_ID('TempDB..#RawIPV4','U') IS NOT NULL
            DROP TABLE #RawIPV4
    ;
         -- This table simulates our final target table for the TinyInt Octets
         IF OBJECT_ID('TempDB..#TargetTable','U') IS NOT NULL
            DROP TABLE #TargetTable
    ;
    GO
    --===== Create the target table for this test.  In real life,
         -- this would be a permanent table somewhere.
     CREATE TABLE #TargetTable
            (
            TargetTableID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
            SomeID        INT     NOT NULL,
            Octet4        TINYINT NOT NULL,
            Octet3        TINYINT NOT NULL,
            Octet2        TINYINT NOT NULL,
            Octet1        TINYINT NOT NULL,
            BinaryIP   AS CAST(Octet4 AS BINARY(1))
                        + CAST(Octet3 AS BINARY(1))
                        + CAST(Octet2 AS BINARY(1))
                        + CAST(Octet1 AS BINARY(1)),
            StringIP1  AS CAST(Octet4 AS VARCHAR(3)) + '.'
                        + CAST(Octet3 AS VARCHAR(3)) + '.'
                        + CAST(Octet2 AS VARCHAR(3)) + '.'
                        + CAST(Octet1 AS VARCHAR(3)),
            StringIp2  AS RIGHT('000' + CAST(Octet4 AS VARCHAR(3)) + '.', 4)
                        + RIGHT('000' + CAST(Octet3 AS VARCHAR(3)) + '.', 4)
                        + RIGHT('000' + CAST(Octet2 AS VARCHAR(3)) + '.', 4)
                        + RIGHT('000' + CAST(Octet1 AS VARCHAR(3))      , 3),
            BigIntIP   AS Octet4 * CAST(16777216 AS BIGINT)
                        + Octet3 * 65536
                        + Octet2 * 256
                        + Octet1,
            NumericIP  AS Octet4 * CAST(1000000000 AS BigInt)
                        + Octet3 * 1000000
                        + Octet2 * 1000
                        + Octet1 
            )
    ;
    GO
    --===== Create "x" number of totally random IPV4 addresses
     SELECT TOP 10000
            SomeID = IDENTITY(INT,1,1), --Wouldn't be an IDENTITY in real life.
                                        --Might be just some arbitrary unique ID
                                        --or whatever.
            IPV4   = CAST(ABS(CHECKSUM(NEWID()))%256 AS VARCHAR(3)) + '.'
                   + CAST(ABS(CHECKSUM(NEWID()))%256 AS VARCHAR(3)) + '.'
                   + CAST(ABS(CHECKSUM(NEWID()))%256 AS VARCHAR(3)) + '.'
                   + CAST(ABS(CHECKSUM(NEWID()))%256 AS VARCHAR(3))
       INTO #RawIPV4
       FROM Master.sys.All_Columns ac1
      CROSS JOIN Master.sys.All_Columns ac2
    ;
    --===== Display the first 100 rows from the test table
     SELECT TOP 100 *
       FROM #RawIPV4
    ;
    GO
    --===== Start the stats timer and do the split to the screen
        SET STATISTICS TIME ON
    ;
    /*****************************************************************************************************************
     This section solves the problem of splitting the individual Octets out of the IP address and assembling them
     as rows with 4 TinyInt Octets.  The computed columns do ALL the rest.
    *****************************************************************************************************************/
      PRINT '========== Time to INSERT ================================================================'
    ;
        SET STATISTICS TIME ON
    ;
       WITH cteSplit AS
    (
     SELECT ip.SomeID,
            Octet4 = CAST(PARSENAME(ip.IPV4,4) AS TINYINT),
            Octet3 = CAST(PARSENAME(ip.IPV4,3) AS TINYINT),
            Octet2 = CAST(PARSENAME(ip.IPV4,2) AS TINYINT),
            Octet1 = CAST(PARSENAME(ip.IPV4,1) AS TINYINT)
       FROM #RawIPV4 ip
    )
     INSERT INTO #TargetTable
            ( SomeID,   Octet4,   Octet3,   Octet2,   Octet1)
     SELECT s.SomeID, s.Octet4, s.Octet3, s.Octet2, s.Octet1
       FROM cteSplit s
    ;
        SET STATISTICS TIME OFF
    ;
      PRINT '========== Time to DISPLAY ================================================================'
    ;
        SET STATISTICS TIME ON
    ;
    --===== All done... display all the rows of the target table as a sanity check.
         -- Make sure you check out the "Messages" tab in SSMS for the CPU time and 
         -- duration of the solution code above.
     SELECT tt.*
       FROM #TargetTable tt
    ;
        SET STATISTICS TIME OFF
    ;
    --===== Let's see what the table uses as is (same as just 4 TinyInt's and the 2 ID's)
       EXEC dbo.sp_SpaceUsed #Targettable
    
    --===== Just to prove that all of the calculated columns are deterministic
         -- and, therefor, Indexable.  You only need to include the indexes
         -- for columns that you know you will use.
    -- CREATE INDEX IX_TargetTable_BinaryIP  ON #TargetTable (BinaryIP)
    --;
    -- CREATE INDEX IX_TargetTable_StringIP1 ON #TargetTable (StringIP1)
    --;
    -- CREATE INDEX IX_TargetTable_StringIP2 ON #TargetTable (StringIP2)
    --;
    -- CREATE INDEX IX_TargetTable_BigIntIP  ON #TargetTable (BigIntIP)
    --;
    -- CREATE INDEX IX_TargetTable_NumericIP ON #TargetTable (NumericIP)
    --;
    
    --===== I'm being a bit lazy here to show all the indexes
    --   DBCC SHOWCONTIG (#TargetTable) WITH ALL_INDEXES, TABLERESULTS
    --;
    --   EXEC dbo.sp_SpaceUsed #Targettable
    --;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • How about if you do a two-stage conversion - from octets to binary(4), and then CAST to int? Does that hold the same 4 bytes internally? Does that sign bit cause trouble? And does it actually matter? Only the whole Internet has no subnetting at all. If your enterprise network isn't subnetted class A at least then you may be insane.

  • You are correct Jeff - it was a very good article. Much better IMHO than much of what is put out on the web these days.

    As for your constructs, well, there is a very good reason why you are a SQL Server MVP! Nicely done. 😎

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I've seen a lot of posts regarding the poor performance of UDFs. I believe that's missing the point. Generally speaking, there is no need for displaying 10,000 IPv4 addresses in human readable form simultaneously. The majority of the work with IPv4 addresses in databases is usually the result of logging or statistics. In the case of logging, logs are usually only consulted a few rows at a time in human readable form, whilst in terms of statistics, usually in a report summary which is also only a few rows at a time.

    Most operations involve combinations of DISTINCT, COUNT, GROUP BY, HAVING and so on and so forth, on usually what is hundreds of thousands if not millions of rows before ever having a single row returned for human readability. For maximum efficiency during those operations, performance is highest with the data types that SQL Server is most optimized for. A lot of behind the doors optimization goes on with integers due to a combination of the long standing 32-bit architecture it evolved on as well as the recurring use of integers for primary and foreign keys.

    Also, when dealing with million row scenarios, additional storage bytes can have drastic impact on both storage requirements and performance. For the sake of argument as ridiculously improbable as this is, if you were to create a single calendar lookup table for all 4,294,967,296 possible IPv4 addresses, the primary key (which could be considered the integer version of the IP if you started at -2,147,483,648) alone would take up 16GB of storage space. Adding a four part tinyint equivalent to that would raise the total to 32GB. If we add a varchar(15) field as well, we're now up to 100GB of storage space.

    Although performance and storage for four tinyint fields is similar to int, the additional complexity required to do bitmask operations on a range of IPs as well as the split into four different fields makes this a choice I would personally pass on. binary(4) would have been my prefered choice because as Ryan pointed out, it is the most natural way to represent an IPv4 address. Unfortunately, SQL Server's inability to provide bitwise operands for binary as it does for integer led me to the integer bitshifting solution I presented.

    As many have stated previously, there is no one shoe fits all solution. I believe however that the disparity between the number of human readable rows to be displayed versus the number of internally calculated, queried, filtered and sorted rows is so high that the performance detriment of using UDF conversions is negligible at best.

  • Well said Jean-Sebastien, well said.

    CEWII

  • The funny thing about this whole topic is: while INT is almost surely the best way to store IPv4 data type in most scenarios, it is interesting why there is no native IPv4 and IPv6 data types in SQL Server 08' with possible bitwise operations on them supported similarly to data types?

  • I repeatedly forgot that the point of the article was performance - assume that processing of IPv4 address data is critical, how do you best store it?

    Processing is likely to require bitwise operations, yes. And how are we receiving the data, too?

    I don't rule out the x 10 GBytes lookup table - you probably only need that all online unless you're running the whole Internet, and no one does that so you're fine 🙂 You only need a subset of that. Unless you're building the world's largest network filter, to defend against denial of service from China, Russia, etc. Or if you are there and you want to stop attacks from the West.

  • Jean-Sebastien Carle (8/20/2009)


    I've seen a lot of posts regarding the poor performance of UDFs. I believe that's missing the point. Generally speaking, there is no need for displaying 10,000 IPv4 addresses in human readable form simultaneously. The majority of the work with IPv4 addresses in databases is usually the result of logging or statistics. In the case of logging, logs are usually only consulted a few rows at a time in human readable form, whilst in terms of statistics, usually in a report summary which is also only a few rows at a time.

    Most operations involve combinations of DISTINCT, COUNT, GROUP BY, HAVING and so on and so forth, on usually what is hundreds of thousands if not millions of rows before ever having a single row returned for human readability. For maximum efficiency during those operations, performance is highest with the data types that SQL Server is most optimized for. A lot of behind the doors optimization goes on with integers due to a combination of the long standing 32-bit architecture it evolved on as well as the recurring use of integers for primary and foreign keys.

    Also, when dealing with million row scenarios, additional storage bytes can have drastic impact on both storage requirements and performance. For the sake of argument as ridiculously improbable as this is, if you were to create a single calendar lookup table for all 4,294,967,296 possible IPv4 addresses, the primary key (which could be considered the integer version of the IP if you started at -2,147,483,648) alone would take up 16GB of storage space. Adding a four part tinyint equivalent to that would raise the total to 32GB. If we add a varchar(15) field as well, we're now up to 100GB of storage space.

    Although performance and storage for four tinyint fields is similar to int, the additional complexity required to do bitmask operations on a range of IPs as well as the split into four different fields makes this a choice I would personally pass on. binary(4) would have been my prefered choice because as Ryan pointed out, it is the most natural way to represent an IPv4 address. Unfortunately, SQL Server's inability to provide bitwise operands for binary as it does for integer led me to the integer bitshifting solution I presented.

    As many have stated previously, there is no one shoe fits all solution. I believe however that the disparity between the number of human readable rows to be displayed versus the number of internally calculated, queried, filtered and sorted rows is so high that the performance detriment of using UDF conversions is negligible at best.

    You've seriously mistaken what I said. No one in their right mind would create all possible IPV4 addresses. The idea of the "Calendar" table was to make it simple to calculate dates. So goes the code I posted and only for the given IP addresses.

    As for displaying all of the data, I agree... totally NOT necessary. To wit, that's why I put the data in a table... all 4 bytes for each address. The purpose of the calculated columns is to make it easy to sort, join, and mask virtually any type of "format" that may be deemed necessary to use by requirements.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • rja.carnegie (8/20/2009)


    I don't rule out the x 10 GBytes lookup table - you probably only need that all online unless you're running the whole Internet, and no one does that so you're fine

    You should rule it out... it's not what I suggested. I suggested that, like a Calendar table, each IP address in captured data could easily be formatted for virtually any use no matter what format was necessary.

    And you also need to read my notes on performance. Why take 3.3xx seconds to produce an int when you can take 0.2 seconds to derive virtually any other format you need?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Elliott W (8/20/2009)


    Well said Jean-Sebastien, well said.

    CEWII

    You need to go back and look at my code then. I've not suggested the creation of every possible IP address.... I've suggested a way to format them virtually any way and still only store 4 bytes without the additional obfuscation that the author's method requires for conversion to integer.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TheSQLGuru (8/20/2009)


    You are correct Jeff - it was a very good article. Much better IMHO than much of what is put out on the web these days.

    As for your constructs, well, there is a very good reason why you are a SQL Server MVP! Nicely done. 😎

    Coming from you, that's a high compliment. Thank you, Sir.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I must appologize Jeff, in my haste I misread part of your code. It is indeed an interesting approach that I will explore further.

  • Jeff Moden (8/20/2009)


    Elliott W (8/20/2009)


    Well said Jean-Sebastien, well said.

    CEWII

    You need to go back and look at my code then. I've not suggested the creation of every possible IP address.... I've suggested a way to format them virtually any way and still only store 4 bytes without the additional obfuscation that the author's method requires for conversion to integer.

    Jeff, I may not have been clear, he makes good points, but the single most important sentence is:

    "As many have stated previously, there is no one shoe fits all solution."

    I think his points are valid, and as a long time programmer I know there is almost always at LEAST two good ways to solve a problem, each with their own tradeoffs. I can see three good ways to solve this problem, one is Jean-Sebastien's, the other is the binary(4) solution, and the last is 4 tinyints. There are probably a couple more that aren't obvious. The point I'm getting at is that you need to chose a solution that fits your problem and all problems you are likely to solve in the future, none of the solutions prevent you from doing this. I too am going to review your solution in greater detail..

    CEWII

Viewing 15 posts - 31 through 45 (of 100 total)

You must be logged in to reply to this topic. Login to reply