Challenged with modeling a huge DB

  • Hi,

    I'm tasked with developing a gaming web application system backed by a relatively large database. It should hold data about 300 million players, each player's orders/payments (~3B records in total), and other player info. Note that the Players entity is subject to frequent reads and writes: every operation a user performs (login, purchase, starting to play) requires loading their info and validating that they're authorized.

    I have a few questions in this regard:

    1. Is there a DB technology that can simply store 300M records and return a single indexed record very quickly (e.g. within 0.3 seconds)? By "simply" I mean without special techniques like sharding.

    2. Are NoSQL databases REALLY ready for the mission in terms of reliability, supporting tools, simplicity of maintenance, etc.?

    3. I thought of splitting the entire dataset into multiple smaller databases (~1 million records each), where each DB represents a subset of the players (e.g. players with ids 1-100000). Some component would hold the mapping from player id to DB (the user with id 33443 belongs to DB xxx); this way I keep each DB lean. Whenever some cross-DB data is needed (e.g. the top-ranked player among all players in the system), I will keep those queries ready in a separate DB. Would love to hear your feedback.

    4. Kindly recommend any reading, technologies, best practices... 🙂

    cheers,

    Yonatan

  • goldberg.yoni (5/7/2015)


    Hi,

    I'm tasked with developing a gaming web application system backed by a relatively large database. It should hold data about 300 million players, each player's orders/payments (~3B records in total), and other player info. Note that the Players entity is subject to frequent reads and writes: every operation a user performs (login, purchase, starting to play) requires loading their info and validating that they're authorized.

    I have a few questions in this regard:

    1. Is there a DB technology that can simply store 300M records and return a single indexed record very quickly (e.g. within 0.3 seconds)? By "simply" I mean without special techniques like sharding.

    No easy answer here. Could you retrieve a "single record" in the time referenced without sharding? Sure - but define what that record looks like: how many columns, of what size? Depending on your answers, maybe not. Can your system be built? Yes. Can it be built without some type of "special technique"? Unlikely. One more point: designing for 300 million players when you currently have ZERO could be a mistake. It would be best to design simply but effectively, keeping in mind that you may need to shard (or similar) in the future, without doing so now.

    2. Are NoSQL databases REALLY ready for the mission in terms of reliability, supporting tools, simplicity of maintenance, etc.?

    Yes, absolutely. But most of them are geared toward collecting data, not so much toward reporting on it. For ID/value-style saves and retrievals, they will likely be very fast. The problem comes when you have to retrieve by something other than the ID.
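    As a rough illustration of the trade-off described above (a Python sketch with made-up names, using a plain dict as a stand-in for a key-value store): lookup by primary key is a direct hit, but any other access path needs a secondary index you maintain yourself.

```python
# Sketch of the ID/value access pattern: a key-value store retrieves by
# primary key directly, but querying by any other attribute requires a
# secondary index kept in sync by the application.

players = {}    # primary store: player_id -> record
by_email = {}   # hand-maintained secondary index: email -> player_id

def save_player(player_id, record):
    players[player_id] = record
    by_email[record["email"]] = player_id   # keep the index in sync ourselves

def get_by_id(player_id):
    return players.get(player_id)           # fast path: direct key lookup

def get_by_email(email):
    # without the secondary index this would require a full scan
    player_id = by_email.get(email)
    return players.get(player_id) if player_id is not None else None

save_player(33443, {"email": "yonatan@example.com", "rank": 7})
assert get_by_id(33443)["rank"] == 7
assert get_by_email("yonatan@example.com")["rank"] == 7
```

    In a real NoSQL store the "index" would be a second table or a built-in secondary index, but the cost model is the same: retrieval by anything other than the key is the expensive case.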

    3. I thought of splitting the entire dataset into multiple smaller databases (~1 million records each), where each DB represents a subset of the players (e.g. players with ids 1-100000). Some component would hold the mapping from player id to DB (the user with id 33443 belongs to DB xxx); this way I keep each DB lean. Whenever some cross-DB data is needed (e.g. the top-ranked player among all players in the system), I will keep those queries ready in a separate DB. Would love to hear your feedback.

    4. Kindly recommend any reading, technologies, best practices... 🙂

    cheers,

    Yonatan

    Not sure I'd split by player ID. It's too arbitrary, and you might find that all the "hot" players (the ones with the most activity) land in a single DB, causing a lot of pain. Having aggregate data in a separate storage system (it doesn't even have to be a database) is a good idea. Split the data according to purpose: reporting systems in one place; OLTP such as billing, player management, and other structured information in another. Streaming play information absolutely would go into one of the NoSQL systems, because that's what they're designed for. Assume multiple structures for multiple behaviors.
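    For what the "mapping component" could look like: one common alternative to contiguous ID ranges is hashing the player id, which spreads active players across shards rather than clustering them. A minimal Python sketch (shard names are illustrative, not a real deployment):

```python
# Directory-style shard routing: hash the player id to pick a shard.
# Hashing (rather than contiguous id ranges) spreads "hot" players
# across shards instead of concentrating them in one DB.
import hashlib

SHARDS = ["db_00", "db_01", "db_02", "db_03"]  # illustrative shard names

def shard_for(player_id: int) -> str:
    # stable hash so the same player always routes to the same shard
    digest = hashlib.md5(str(player_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

assert shard_for(33443) == shard_for(33443)   # deterministic routing
assert shard_for(33443) in SHARDS
```

    The catch, as noted above, is that hash routing makes cross-shard queries (like a global leaderboard) even harder, which is why keeping pre-built aggregates in a separate store is the usual companion to this design.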

    Again, I'd start with the fundamentals. Build a clear, accurate data structure with proper normalization. Then work from there.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • goldberg.yoni (5/7/2015)


    Hi,

    I'm tasked with developing a gaming web application system backed by a relatively large database. It should hold data about 300 million players, each player's orders/payments (~3B records in total), and other player info. Note that the Players entity is subject to frequent reads and writes: every operation a user performs (login, purchase, starting to play) requires loading their info and validating that they're authorized.

    I have a few questions in this regard:

    1. Is there a DB technology that can simply store 300M records and return a single indexed record very quickly (e.g. within 0.3 seconds)? By "simply" I mean without special techniques like sharding.

    2. Are NoSQL databases REALLY ready for the mission in terms of reliability, supporting tools, simplicity of maintenance, etc.?

    3. I thought of splitting the entire dataset into multiple smaller databases (~1 million records each), where each DB represents a subset of the players (e.g. players with ids 1-100000). Some component would hold the mapping from player id to DB (the user with id 33443 belongs to DB xxx); this way I keep each DB lean. Whenever some cross-DB data is needed (e.g. the top-ranked player among all players in the system), I will keep those queries ready in a separate DB. Would love to hear your feedback.

    4. Kindly recommend any reading, technologies, best practices... 🙂

    cheers,

    Yonatan

    A) Why are you tasked with this requirement? What is your experience developing high-scale, high-performance database applications, especially on the database side? If you don't have that experience, as well as SIGNIFICANT experience tuning SQL Server, you are set up to fail spectacularly.

    B) Caching will be key to the performance of this application, as it is for every high-scale system in existence.

    C) NoSQL databases have NO BUSINESS in this discussion. You are taking payments and doing other things that MUST BE TRANSACTIONALLY CONSISTENT, and you CANNOT lose committed data! NoSQL really offers neither. Here is a link about MongoDB, for example:

    https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads

    D) Nor does NoSQL offer "simplicity of maintenance". It DOES have its place in the world, obviously, and there could be some form of logging or activity metrics where it would be useful in this system.

    E) SQL Azure has automated range-based sharding like you mention, built right into a framework where you can manage the key ranges and the application knows which shard to reference for each call. Bob Duffy has a very nice SQL Saturday presentation where he shows using this to scale a system to a massive number of operations, well above what he accomplished via traditional means. A bunch of caveats, limitations, provisos, etc. would come with putting a system like yours in the cloud, though (excessive chattiness being the main trap most devs fall prey to, with improper caching being the second).

    In general, ACID-COMPLIANT, DATABASE-LEVEL sharding is frickin' HARD to do right and do well. It is only a bit less hard to manage properly in the app tier.
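    The range-based shard map described above can be sketched in a few lines: each shard owns a contiguous key range, and a binary search finds the owning shard for any player id. This is a toy illustration in Python (shard names and boundaries invented), not the actual Azure framework:

```python
# Range-based shard map: each shard owns a contiguous range of player ids;
# bisect finds the owning shard in O(log n) over the range boundaries.
import bisect

# (range_start, shard_name), sorted by range_start; boundaries illustrative
RANGES = [(1, "shard_a"), (100_000_001, "shard_b"), (200_000_001, "shard_c")]
_starts = [start for start, _ in RANGES]

def shard_for(player_id: int) -> str:
    idx = bisect.bisect_right(_starts, player_id) - 1
    if idx < 0:
        raise KeyError(f"no shard owns id {player_id}")
    return RANGES[idx][1]

assert shard_for(33443) == "shard_a"
assert shard_for(150_000_000) == "shard_b"
```

    The hard part the paragraph above alludes to isn't this lookup; it's keeping the map consistent with the data while ranges split, merge, and move between shards without losing transactional guarantees.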

    F) You may want to take a look at NuoDB. I would LOVE the chance to do a real project on that platform (http://www.nuodb.com/)

    G) Pre-aggregation will be a key part of your success, I can promise you.
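    As an example of what pre-aggregation buys you here: the "top-ranked player" query from the original post can be maintained incrementally at write time instead of scanning 300M rows at read time. A minimal Python sketch (sizes and scores invented):

```python
# Pre-aggregation sketch: maintain the "top N players" answer incrementally
# as scores arrive, so reads never scan the full player table.
import heapq

TOP_N = 10
_top = []   # min-heap of (score, player_id); root is the smallest of the top N

def record_score(player_id: int, score: int) -> None:
    if len(_top) < TOP_N:
        heapq.heappush(_top, (score, player_id))
    elif score > _top[0][0]:
        heapq.heapreplace(_top, (score, player_id))  # evict current minimum

def top_players():
    return sorted(_top, reverse=True)   # cheap: at most N items, not 300M

for pid, s in [(1, 50), (2, 90), (3, 70)]:
    record_score(pid, s)
assert top_players()[0] == (90, 2)
```

    The same idea applies to any expensive cross-player aggregate: pay a tiny constant cost on each write so the read is effectively free.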

    H) You may also want to examine SQL Server 2014's In-Memory OLTP technology. I am pretty sure there are sections of this app where it will be key.

    I) SQL Server can ABSOLUTELY provide this level of scale, at even better than your 300ms response-time target. I have worked with auction sites, high-volume online retailers, etc. that I was able to tune up to over 10,000 transactions per second, and there was still legacy crap in their apps we could have fixed to easily get higher than that.

    J) One final recommendation: hire me to help you do this right 😎

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on Google's mail service
