The Largest Database

  • It's not for me, but who out there wants to work on the largest database? The NSA is collecting phone records of calls made and brags that "It's the largest database ever assembled in the world."

    If they're really recording all phone calls, then it has to be billions of rows. Heck, it's probably tens of millions or more a day! After all, there are more than 200 million people in the US, and with cell phones, home businesses, and prepaid calling cards, there's probably a good percentage of those people making a call a day.
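
    (A rough back-of-the-envelope, just for illustration: even at a quarter of a call per person per day, that's around 50 million rows a day, or something like 18 billion rows a year - squarely in the 10 or 20 billion range below.)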

    Can you imagine having 10 or 20 billion rows of data that you want to index? How about testing a query that joins this table with a list of the people, a much smaller 200 million record table, and you forget the WHERE clause? Or what about a genius manager asking for a list of people that didn't make a phone call between two dates? Probably have time for some nice coffee breaks there.
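
    Just to make that last request concrete, it would probably end up as an anti-join something like the sketch below (the table and column names are invented purely for illustration; any real schema would look different):

        -- Hypothetical schema: dbo.People (~200 million rows) and
        -- dbo.CallRecords (tens of billions of rows).
        -- Everyone with no call record between two dates: an anti-join that has
        -- to probe the giant table for every person. Bring coffee.
        SELECT p.PersonId, p.PersonName
        FROM dbo.People AS p
        WHERE NOT EXISTS (
            SELECT 1
            FROM dbo.CallRecords AS c
            WHERE c.CallerPersonId = p.PersonId
              AND c.CallStart >= '20060101'
              AND c.CallStart <  '20060201'
        );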

    I've worked with smaller databases, 10s of GB or less, for most of my career and even on older hardware, things ran relatively fast. That includes backups, restores, broken queries, etc. Even cross joins of my largest tables might be annoying, but would run within minutes or an hour or two. It's also made me less concerned about some of the scale issues that the VLDB folks face, like maintenance windows, online operations, or even table recovery. In most of my jobs, if someone deleted a table, like me forgetting a WHERE clause, I could just restore the latest backup on another machine and then move the data across. Often I could even do this on the same server!!!
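
    For what it's worth, on a small database that "restore a copy and pull the data back" trick is only a couple of statements. A minimal sketch, with the database name, backup path, and logical file names all invented for the example:

        -- Restore the latest full backup side by side as a differently named copy
        -- (paths and names below are made up for illustration).
        RESTORE DATABASE SalesDB_Recover
        FROM DISK = N'D:\Backups\SalesDB_Full.bak'
        WITH MOVE N'SalesDB_Data' TO N'E:\Recover\SalesDB_Recover.mdf',
             MOVE N'SalesDB_Log'  TO N'E:\Recover\SalesDB_Recover.ldf',
             RECOVERY;

        -- Then copy back only the rows that went missing.
        INSERT INTO SalesDB.dbo.Orders
        SELECT r.*
        FROM SalesDB_Recover.dbo.Orders AS r
        WHERE NOT EXISTS (SELECT 1 FROM SalesDB.dbo.Orders AS o
                          WHERE o.OrderId = r.OrderId);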

    But that's not possible in a VLDB environment. Or even a large database. I had a 600GB or so database at JD Edwards and we didn't have space to restore that database onto another server to recover data. It would require a complete restore of the entire system. Fortunately it was a warehouse and could be reloaded without data loss if we ran into that type of situation.

    VLDBs present some complex challenges for the simplest things that we take for granted. I can imagine that it's fun to work out solutions to some of those problems, like reindexing, defragmenting, and the other tedious, mundane chores that are routine on smaller databases. At least until something's broken.

    I know it's not fun to try and fix something in that type of environment when your boss or his boss is watching over your shoulder.

    Steve Jones

  • Impressive.. wouldn't Google be a challenger for the largest database?


    ------------------------------
    Life is far too important to be taken seriously

  • "I know it's not fun to try and fix something in that type of environment when you boss or his boss is watching over your shoulder. "


    Actually, I'd suggest that there are few environments in which it's fun to fix anything with your boss looking over your shoulder...

    -----------------

    C8H10N4O2

  • I'm part of a small group presently tasked with developing a database and application that is projected to become a VLDB. At some points, we have seen growth of around 2 to 10GB or more of data per day.

    This challenge has been ongoing for the past 7 years, through all sorts of software and several core changes, each with its own advantages and disadvantages. Overall, it all comes down to how you plan to use the data.

    In our particular system, we have found that there is around 40GB of data that's constantly changing and updating. The rest is simply collecting and being stored for later use or archiving. Archiving is easy, as you can distribute this data across multiple machines as necessary; for example, one of our developmental archives currently uses around 100 DB servers, each running MySQL (a cost-driven choice at this quantity) and holding around 70GB of archive data.

    We have found that through parallel access and indexing we have achieved our desired result, but this comes more from how it's programmed. We knew beforehand how the data was going to be used and developed our scale-out based on that. Without knowing how the data will be used (to a large extent), it would be almost impossible to handle a database over 100GB.

    All in all, it really comes down to the developer(s) behind the infrastructure. Time is money, but a poorly designed or hastily written database can cost far more in the long run, when a query that should take 200ms ends up taking 200 seconds or 200 minutes.
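
    That 200ms-versus-200-seconds gap usually comes down to whether the access path was planned for. As a minimal T-SQL sketch (the table and column names are invented; a covering index stands in for whatever access structure fits the real workload):

        -- Hypothetical hot table dbo.Readings, queried by device and time range.
        -- Without an index on (DeviceId, ReadingTime) this is a full scan;
        -- with it, the same query becomes a narrow range seek.
        CREATE INDEX IX_Readings_Device_Time
            ON dbo.Readings (DeviceId, ReadingTime)
            INCLUDE (ReadingValue);

        SELECT ReadingTime, ReadingValue
        FROM dbo.Readings
        WHERE DeviceId = 42
          AND ReadingTime >= '20060501'
          AND ReadingTime <  '20060502';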

    Databases are going to continue to develop over time, and as technology gets better we will find better ways of handling things. At the present time, I/O subsystems and the companies behind them are holding database development back, because I/O is the core part of a database: without I/O there is no DB, and conversely, if your I/O is slow, it doesn't matter how good the database is or how well it's indexed, it'll be slow too.

    I think at the scale of VLDBs you need much more planning and projection before implementation; it takes broader thinking, and thinking outside the box, to achieve success. Always remember the golden rule... "Nothing is impossible."

  • Seems like the critical measures of database size are row count, transactions per second, user count, and model complexity.

    Any thoughts on other metrics to measure size with?

  • Interesting. Use is an important factor in design.

    My thought devolves to the legal uses of this data - not whether you can capture the data in the first place.

    Does one have a legal right to data about one's actions (in this case a phone call) that is captured by a governmental entity? Further, can you subpoena the data for legal use? Say you know you are a suspect in a murder investigation, and you know the fact that you made a phone call could produce an alibi for you. Lacking that, you look pretty guilty and might get the lethal injection. Could you subpoena that data for your defense? If you could do it for this purpose, where do you draw the line backwards? Could the database be subpoenaed in a divorce proceeding to better establish, through the showing of routine and frequent phone calls, that one of the parties was involved in an illicit relationship outside the marriage?

  • I have been dealing with real-time and near-time VLDB environments housing 500 to 1000 GB for the last 4 years and the following summarizes the size management strategies employed:

    - VL table partitioning and recombining with updateable views

    - multi-filegroups

    - partition per filegroup

    - indexes per partition

    - weekly full reindexing

    - daily stats updates

    - archiving by rotating the oldest partitions out of the view (a rough sketch of this pattern follows below)
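
    As a rough T-SQL sketch of how that partition-plus-view rotation can look (all object names here are invented; real member tables also need CHECK constraints on the partitioning column and matching primary keys so the view stays updateable and the optimizer can eliminate partitions):

        -- Each month lives in its own table, ideally on its own filegroup,
        -- and the view UNIONs the "online" months together.
        CREATE VIEW dbo.CallDetail
        AS
        SELECT * FROM dbo.CallDetail_2006_03
        UNION ALL
        SELECT * FROM dbo.CallDetail_2006_04
        UNION ALL
        SELECT * FROM dbo.CallDetail_2006_05;
        GO

        -- Archiving: redefine the view without the oldest month and with the
        -- newest one, then back up or detach the rotated-out table at leisure.
        ALTER VIEW dbo.CallDetail
        AS
        SELECT * FROM dbo.CallDetail_2006_04
        UNION ALL
        SELECT * FROM dbo.CallDetail_2006_05
        UNION ALL
        SELECT * FROM dbo.CallDetail_2006_06;
        GO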

  • I believe that the size of the database mentioned by Steve is off by at least 2 orders of magnitude. Yes, two orders of magnitude. I know this from experience working at a CLEC (a competing telco after deregulation). We were small, maybe a few million phone numbers. Our database table for completed (billed) phone calls was on the order of 1.6 billion rows! We only kept 3 months of this detailed information online. The remainder was archived. This was just call billing information that was fixed, and one could 'optimize' the 9's. I just cannot fathom how the government could even attempt to store, and better yet analyze, this volume of data. Could you imagine tables with trillions of rows? Oh, and lest we forget our blobs - all of the digitized (recorded) phone calls, at least one per call record. Then what about breaking down (parsing) the digitized conversations into words for further analysis? Soon trillions turns into quadrillions and quintillions. I just do not see how this information could be gathered, loaded and analyzed in any reasonable fashion to where it could provide any 'proactive' value.

    Regards,
    Rudy Komacsar
    Senior Database Administrator
    "Ave Caesar! - Morituri te salutamus."

  • The production size issues for relational databases are one thing (and I agree with Rudy, I think the orders of magnitude are off), but sizing for use in development is an entirely different can of worms. It becomes a big problem when you can't create a completely isolated development environment because even a 10% sample data set won't fit inside a Virtual PC image.

    I primarily do BI consulting, and processing an OLAP cube on top of three quarters of a billion rows of development data is a PITA. Especially when you know that you're working with a 10% "sample" of real production health insurance claim data for a "small" regional health insurance carrier. It's a challenge to determine whether your 10% "sample" is really representative or not...
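
    One crude way to gut-check that (a sketch only, with invented table and column names) is to compare the distribution of a key dimension in the sample against whatever production percentages you do know:

        -- Mix of claim types in the hypothetical 10% development sample;
        -- compare these percentages against the known production mix to spot skew.
        SELECT ClaimType,
               COUNT(*) AS SampleRows,
               100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS SamplePct
        FROM dbo.ClaimsSample
        GROUP BY ClaimType
        ORDER BY SamplePct DESC;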

    20 billion rows is nothin'. I'll bet the boys in Vegas who do all the "real" cutting-edge BI work are playing with row counts a couple of orders of magnitude larger...

    I'd love to know how many point of sale records Wal-Mart has in their warehouse. Extracts that I've seen in the past for just a few manufacturers' product lines at a time in selected parts of the continental US for just three months of data were in the 400M record range.

    [Does somebody at the NSA think that it makes people feel better about the program to brag about the size of their database(s)? Doh!]

  • What is really amazing is that I heard that the NSA does this all over a modem line.

    You know, dial up..

  • What I would like to know is what kind of hardware this thing is implemented on; must be truly impressive, but thanks to working folks the government has very deep pockets! Also, what RDBMS do you think it is running? My guess would be Oracle....

    Ken...

  • Regarding querying the data like Steve mentioned... I'd be interested as to whether this is within the capabilities of Analysis Services or not.

    It's a shame that there aren't TPC multidimensional DB benchmarks as there are for relational DBs. I'd really like to see that.

    Incidentally (digressing slightly), I've heard a whisper that AS is going to be used somewhere within the MSN Search team. To what extent, I have no idea! It'd be fascinating if it were servicing the actual searches - and that would be a hell of a lot of data!

    -Jamie


  • I would imagine that only interesting conversations were recorded. The equipment intercepting the call would have logic in it searching for keywords spoken during the first 1-5 minutes of the call. Up to that point the call is held entirely in RAM. If a keyword pattern is detected, then the call is saved along with its metadata for analysis. Most calls are held ephemerally in RAM for a short duration and then totally flushed, metadata and all. This way relatively little data (less than 1/100 or maybe 1/1000 of 1% of calls) is actually written to disk. I'm sure that the keyword analysis is constantly tweaked and would be considered the genius of the system.

    Some calls would be always flagged based on their metadata - numbers related to known terrorists/organizations and the like - whether the calls were significant in their content or not. This would provide a webbing effect for an investigation.

  • By the way, there is always a disinformation effect that needs to be considered. I could imagine that it might be beneficial for the bad guys to believe that every one of their phone calls was being monitored - that there really is such a database as premised by this article. This would cause the bad guys to adopt an unusual pattern in their communications that could be more easily detected. As in, the pattern is not in what you see but in what is missing from what you would expect to see.

    So, you have to take with a grain of salt the whole notion that this massive database in the sky is just a piece of disinformation.

  • My personal best is almost two billion records, with IMS (pharma sector).
