Guru needed- Real versus BigInt

  • Hi-

    I have a HUGE list of phone numbers currently stored as BigInts - which are 8 bytes a piece. They look like this:

    Rather than (123) 456-7890, I store them like this: 1234567890. It really saves space.

    However, it appears that the Real data type is only 4 bytes(Floats are 8) and may want to make the change for a 4 byte savings per row. I just need to see if a phone number exists in the database at all - so after running some tests - it looks like the Real data type is working fine at a HUGE savings.

    Anyone have any suggestions ?

  • Whoops-

    Not working at all.

    If I have one number 1234567890 in the table as a real and do a SELECT WHERE X = 1234567890, it returns the one row - fine.

    But, it also returns a row if I search for 1234567891, or 2, or 3 etc...

    I knew it was only an approximation using a mantissa, but this really confims the problem. Sticking with BigInt.

    - B

  • Ok - saving phone numbers as bigint is wasting a lot of space.

    I use the varchar data type for phone numbers since I do not use the numbers for any calculations. The same with postal codes.

    Real is an approximate (floating point) data type. Since you are not using a decimal point I would not use that data type.

    Dr. Peter Venkman: Generally you don't see that kind of behavior in a major appliance.

    Patrick

    Quand on parle du loup, on en voit la queue

  • Patrick,

    If you save a phone number as a varchar, it will be at least 10 bytes per item. The BigInt is only 8 - a BIG savings with many millions of rows. Test it and you'll see.

  • If your data is numeric, store it in a numeric typed column. This will requier more work if your data-requierment changes, but it's a good practise to type your data accurately. It will prevent abuse of your column ("just for this case" generaly takes way longer than predicted !)

    If on the other hand you want to store phonenumbers like "555-sqlserver", you could start thinking of a presentation issue and change the numeric requierment, or add a presentation-column.

    You have (glad it's during test) encountered a downside of datatype "float".

    If indexed, your index (size and speed) also reflects the benefit/downside of your datatype choice.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • In addition to what has been mentioned above, this might also be of some help

    http://www.sqlservercentral.com/forum/link.asp?TOPIC_ID=17632

    Frank

    http://www.insidesql.de

    http://www.familienzirkus.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Also consider

    Over here in the UK we start national numbers with a zero (eg 01273 876986) whereas international numbers start with two zeros (eg 00 01 786 876 5678). If you don't store the leading zeros you get the wrong phone number and any numeric data type strips the leading zeros...

  • Another consideration for those not using numeric, I would use char instead of varchar for such a small data type, especially if you are not storing international and the lengths will be constant. That will reduce some of the perfomance issue. ( still not as fast as numeric )

  • Heh-

    I like ErasmusJ's comments.

    I guess it's really no big deal - just trying to be as clean as possible. Perhaps the searching will be a bt faster as well - but really, who gives a %hit.

  • In the US the numbers are 10 digits (at least for the foreseeable future) so it makes things a bit easier. In this case I would either use bigint to store the number or if there is any potential this may need to be recreated on a 7 server user Numeric(10,0) (Numeric will cost one more byte). In either case use a constriant to make sure the number is between 2002000000 AND 9999999999, and that the secound set of 3 (the exchange in the format NPA-NXX-XXXXX where NXX is the exchange) is not < 100 or = 500 and that the Area Code (NPA portion) != 700 as those are resevered or outside the numbering system (I used to have a complete guidline but lost when I left the Long Distance company I worked for). If you concern is saving space then you could use a int and a tinyint to piece together but makes a bit more dificult in working with. In that the tinyint would be the first number of the Area Code and the ini would be all the reast between 000000000 and 999999999, only will be 5 bytes then.

    If you need to deal with internation number either use a varchar (so 0's not lost and can be wide or narrow) or use bigint or numeric that is large enough to do whole number and a bit field for international or not. Not rules that I know will at all there.

  • I do a lot of work with phone number filtering on a daily basis. The structure for every column I have throughout the system that holds the phone number is a char(10). I did this because I didn't want to have to fight the battle of losing leading zeroes and I'm not doing math on them anyway. The cost of 2 bytes per record (even though there are dozens of millions of records to process and filter) doesn't really cause me much pause, because storage is cheap. The more important thing to me for the filtering and comparisons I need to do for OnNet/OffNet filtering, BadANI filtering, CLEC filtering 42Block filtering, Etc filtering, is that I have more fast spindles, processors, and memory available so I can get my filtering done and billing submitted quickly and efficiently. <br><Br>If the dataset you are working with is in the hundreds of millions of records (you have a few more than I do apparently (must be White Pages stuff)) then your filtering processes would likely take longer than mine (depending on what you are filtering for / out). It might help to know what your filtering steps are to discuss the topic in more detail, but I stand by my decision to go with char(10) for speed and usability.

  • BillyWilly, sorry, I forgot to say that I deal with international numbers most of the time.

    As others have said, numeric will cut off the leading zeros and I need those.

    Dr. Peter Venkman: Generally you don't see that kind of behavior in a major appliance.

    Patrick

    Quand on parle du loup, on en voit la queue

  • I'm only dealing with the new "Do Not Call" phone number lists provided by the US States and the US Feds(both provide lists).

    No International numbers here.

  • I would still stick to char(10) and maybe consider a date column on the table so you know when you loaded the specific numbers into your system (in case there is a loophole where they can come off the list after a certain amount of time), and consider a clustered index on the phone number column (faster for searching against but slower for loading). I would also run a duplicate remover each time you load a new update to the list from the DNC list. My Do Not Bill list is only about 35 million, but that's after duplicate removal. <Opinion> I would bet a dozen really good donuts that the feds and states are not bothering to remove duplicated entries from their lists, so you may want to take that action yourself on your data. </Opinion> Let me know if you need a good dup remover for a list much like what you're discussing... I have one written that I have to run monthly (and it is quick).

  • Jeffo...

    Yes,

    I have a Clustered index on the BigInt phone column.

    I'm actually using 2 tables for this - the key is using a 1 byte TinyInt ID for all my rows for easy Kill and Refill when new states come in.

    CREATE TABLE [dbo].[cbDoNotCallRoot] (

    [ID_DoNotCallRoot] [tinyint] IDENTITY (1, 1) NOT NULL ,

    [ListTyp] [char] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,

    [StateNme] [char] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,

    [LastUpdatedDte] [smalldatetime] NOT NULL

    ) ON [PRIMARY]

    GO

    CREATE TABLE [dbo].[cbDoNotCall] (

    [ID_DoNotCallRoot] [tinyint] NOT NULL ,

    [PhoneNum] [bigint] NOT NULL

    ) ON [PRIMARY]

    GO

    CREATE CLUSTERED INDEX [IXC_cbDoNotCall$PhoneNum] ON [dbo].[cbDoNotCall]([PhoneNum]) WITH FILLFACTOR = 100 ON [PRIMARY]

    GO

    CREATE UNIQUE CLUSTERED INDEX [IXUC_cbDoNotCallRoot$Type_State] ON [dbo].[cbDoNotCallRoot]([StateNme], [ListTyp]) WITH FILLFACTOR = 100 ON [PRIMARY]

    GO

    I would LOVE to see your duplcate checking code if it's quick and dirty.

    - B

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply