We have a very large client with 5000+ stores that will be connecting via a web app, and I'm looking for the best architectural design to make everything run efficiently. Putting aside how it's coded, what are some recommendations for best performance given the high number of incoming connections?
The application is 80% reads and about 20% writes/updates. We batch insert new data in at night or through the day at regular intervals.
We have thought about a partitioned approach (10 servers) using a load balancer to route certain store number ranges to certain servers, i.e., at the network level, reading the packet to get the first digit of the store number or something like that. Or we could let all connections come in to one server for reads and another for writes, with merge or peer-to-peer replication between them. Or do we just keep everything on one really beefy server and use snapshot isolation?
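To make the first option concrete, here is a minimal sketch of the store-number routing idea. Everything in it is an assumption for illustration (the server hostnames, the shard count, the range sizes); the real routing would live in the load balancer or a thin proxy layer, not in application code like this.

```python
# Hypothetical routing sketch: map ~5000 store numbers onto 10 partition
# servers. Hostnames and range sizes are made up for illustration.

SHARD_COUNT = 10
STORES_PER_SHARD = 500  # ~5000 stores spread over 10 servers

def shard_for_store_range(store_number: int) -> str:
    """Contiguous-range routing: stores 0-499 -> server 00, 500-999 -> 01, etc."""
    if store_number < 0:
        raise ValueError("store number must be non-negative")
    index = (store_number // STORES_PER_SHARD) % SHARD_COUNT
    return f"db-server-{index:02d}"

def shard_for_store_modulo(store_number: int) -> str:
    """Modulo routing: spreads load evenly even if store numbers were
    assigned unevenly (e.g., big early stores clustered in low numbers)."""
    return f"db-server-{store_number % SHARD_COUNT:02d}"
```

One caveat with routing on the first digit of the store number specifically: digits are rarely uniformly distributed across stores, so a modulo or hash on the full number usually balances the partitions better than a leading-digit rule.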
I'm not a hardware guy, but I managed an ecommerce project a few years ago for a store doing $50 million in annual sales and thousands of online orders per day. We estimated that the store would lose $10,000 per hour if the system was down during prime sales hours. I realize this is slightly different from a B2B environment, but the cost of downtime is probably on a similar scale.
That client, like most clients, had a hard time seeing the potential dangers of failing to prepare for disaster. When a clumsy backhoe operator somewhere down the road from the co-lo facility dug up the city's main internet trunk line, 2/3rds of the city was without internet access for 8+ hours in the middle of a work week. That was $100,000 in revenue the client lost. The backup plan finally assembled cost half that much.
The original architecture with load balanced production servers, an equal number of load-balanced backup servers updated via transactional replication, separate development servers including a dedicated pre-production staging server, backup plans, and backups to the backups made everyone feel safe. Until the "Day of the Backhoe."
After that they found the money to create a parallel geo-located system in another state. The next time we had a major failure (there's always a next time), we had everything rolled over and running off the out-of-state server location in less than 10 minutes. I know this doesn't answer your question specifically, but I wanted to point out the major danger of putting so much firepower in one place. Once you do get a good environment set up, you should strongly consider setting up a disaster recovery site somewhere far away. Actually, it should all be part of your original plan and tightly integrated right from the start. There are many good co-lo facilities around the country that will monitor such a site for you, like the one we use: Peak10. I'm sure there are thousands of companies in the northeast right now who wish they'd had a disaster recovery site that was out of Hurricane Sandy's path.