Guest Editorial: That ain't a Database, it's a Spreadsheet

  • Comments posted to this topic are about the item Guest Editorial: That ain't a Database, it's a Spreadsheet

    Best wishes,
    Phil Factor

  • Last real system I worked on, I loaded the database with a 100% fill of six months' worth of data using Red Gate's data generator (3 million rows).

    Six months was the retention period requirement.

    Then I quadrupled the data, i.e. two years' worth (12 million rows).

    Half an hour or less was the process time limit for the big stored proc that generated the predictions - it ran in just shy of eight minutes. 😉

    Job Done.
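
    For anyone without the Red Gate tool to hand, a rough T-SQL sketch of the same idea - filling a table to production-scale volume before testing - might look something like this. The table and column names are made up purely for illustration, not taken from my actual system:

    -- Hypothetical table used purely for the load test.
    IF OBJECT_ID('dbo.CallUsage') IS NULL
        CREATE TABLE dbo.CallUsage (
            CallId          bigint IDENTITY PRIMARY KEY,
            CustomerId      int       NOT NULL,
            CallStart       datetime2 NOT NULL,
            DurationSeconds int       NOT NULL
        );

    -- Pad it out to roughly two years' worth of rows before load testing.
    ;WITH n AS (
        SELECT TOP (12000000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
        FROM sys.all_objects a
        CROSS JOIN sys.all_objects b
        CROSS JOIN sys.all_objects c
    )
    INSERT INTO dbo.CallUsage (CustomerId, CallStart, DurationSeconds)
    SELECT CAST(rn % 50000 + 1 AS int),                                  -- spread across customers
           DATEADD(SECOND, -CAST(rn % 63072000 AS int), SYSDATETIME()),  -- spread across ~2 years
           CAST(rn % 3600 AS int)                                        -- call length up to an hour
    FROM n;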

    --Shaun

    Hiding under a desk from SSIS Implementation Work :crazy:

  • This is my first reply to a topic on SQLServerCentral. I have been a DBA for 10 years. Sometimes I feel that I do not have the knowledge you all have, but hey, I have been able to handle terabyte databases with millions of rows, so maybe I am doing something right, no?

    I definitely have a point of view on this topic. Working for a telecommunications company, handling their SQL Servers, I feel I might have a point.

    My point is very simple: keep the retention period as low as possible in a production environment when dealing with transactional databases.

    We can talk about indexing, defragmentation, vertical partitioning, you name it.... All good things to do, I accept that.

    But one thing that will not change is this (a rough purge sketch follows the list below):

    1. Querying tables, especially with table scans, is so much faster with fewer rows

    2. Maintenance on those tables, and in particular reindexing, is so much faster

    3. Disruption of service, if it happens, is almost nothing

    4. Invasive, heavy DB maintenance tasks are less of a burden:

    a. backup strategies are faster to implement and execute (especially with compressed backup files)

    b. restore strategies are faster to implement

    c. DW extraction is faster to implement and execute

    d. High availability strategies (both load balancing and failover) are faster to implement
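
    To make the retention point concrete, here is a minimal sketch of the kind of batched purge job I mean, assuming a hypothetical transactional table with a date column (the names and batch size are illustrative only; adjust them to your own environment):

    -- Batched purge of everything older than the retention period.
    -- dbo.CallDetail and CallDate are hypothetical names.
    DECLARE @cutoff datetime2 = DATEADD(MONTH, -6, SYSDATETIME());
    DECLARE @rows int = 1;

    WHILE @rows > 0
    BEGIN
        DELETE TOP (50000)
        FROM dbo.CallDetail
        WHERE CallDate < @cutoff;

        SET @rows = @@ROWCOUNT;   -- stop once nothing old is left
    END;

    Deleting in small batches keeps the transaction log and blocking under control, which is exactly why the disruption of service stays minimal.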

    I think DBAs should focus more on managing data, not databases or servers, mainly because the data is more important than the container that holds it.

    Then, when all data management strategies (purging, reindexing, backing up, extracting) are done, we can focus on optimizing the rest of it.

    This is my experience with large databases in telecommunication companies.

    All industries are different, but with this approach DBAs can clearly align business needs with a proper approach to the data.

    Clement

  • I once worked for a consulting company (national level, but not one of the biggest) that contracted to replace an older system with SQL Server. The development DBA was, shall we say, a nice guy. He and a lot of other people who were billing high consulting rates created a system that was much slower than the older system we were replacing.

    I was asked to hang around after others were moved to other projects to get the animal into production.

    What Phil is describing is exactly our mistake. We tested everything with teaspoons of data when the customer needed gallons. When we hit production the system fell flat on its face. When I started to look under the covers, I was shocked at how many developers were doing things like writing reports that pulled whole tables from the database and processed them on client machines instead of using concise queries. You can get away with that when the test table has 10 rows, but it's awfully clumsy when the production table has millions. What was the indexing strategy? Create one when someone complains about speed, not before. Again, you don't need a lot of indexes when you're only pulling a few rows of test data.
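
    To illustrate the contrast, here is roughly what we kept finding versus what should have been written, on a made-up Orders table (the table, columns, and index are invented for the example, not from the actual system):

    -- The pattern we kept finding: drag the whole table back to the client
    -- and filter/aggregate it there.
    SELECT * FROM dbo.Orders;

    -- The concise alternative: let the server answer the actual question,
    -- with an index that supports the predicate.
    CREATE INDEX IX_Orders_CustomerId_OrderDate
        ON dbo.Orders (CustomerId, OrderDate)
        INCLUDE (TotalAmount);

    DECLARE @CustomerId int = 42;   -- example parameter

    SELECT OrderId, OrderDate, TotalAmount
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId
      AND OrderDate >= DATEADD(MONTH, -1, SYSDATETIME());

    With 10 test rows both versions look instant; with millions of production rows, only the second one survives.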

    Yes, testing along the way with data volumes that the system is expected to handle is excellent advice.

    ___________________________________________________
    "Politicians are like diapers. They both need changing regularly and for the same reason."

  • I don't think your editorial (like many I have seen along this topic line) is realistic. Sure, it would be great on Day One to "know" the size of your final production database - but even a statement such as that is highly flawed. Think about it - Is the size of a database ever fixed? No, always fluid. So what does "final production database" really mean? Is that the database on Day One, Day One Hundred, Day One Thousand?

    Maybe in telecommunications work you can say "We have X number of customers who make X number of calls..." and from that you extrapolate expected size and load - but let me pose a question that in my experience, is more the rule than the exception...

    Let's say you work for a large company - who cares what they do, but.... How many JPG image files are there RIGHT NOW on your entire company network? Do you know? Can you guess? Is it 100, maybe 1000, maybe 1 million, maybe 10 million - the fact is, you don't know and likely cannot know easily. So how do you design and test a database to track them? How do you load up test records to "match" that expected load? Suppose you were counting these JPG image files - how many are actually duplicates of the same file? You cannot know that any better than you know the total number.

    Another example - Imagine if you got a job tomorrow in a company with a 250-node network, and you were asked to design a SQL database to track all Word documents in the company on every server and workstation. How would you proceed in assessing just how many records you would be dealing with to start? How many new Word documents are generated each day? Each month, each year? You cannot simply "pick a number" with any degree of certainty.

    This is something our companies deal with all the time. We work in businesses and applications where you cannot simply "pick" a number and say "this is where we're going to be when we're running in production." It's simplistic to assume most businesses are quantifiable - in fact, the majority are not, and though I think good design and testing are vital, to present it as some sort of science where you just "figure" the total load you will be dealing with is just not real-world thinking for the majority of businesses.

    Plan for the worst, build for the best, but don't presume you can "know" anything on Day One because 99.99% of the time, you will be wrong unless you are truly fortunate to be working in a field where these numbers are truly known on Day One.

    There's no such thing as dumb questions, only poorly thought-out answers...

  • Well then, what is your fully tested method of operation in those cases? Take a good guess, then go with 25 times more data and adjust as needed, or something along those lines?

  • I agree with blandry.

    My answer, a couple of posts above, was not really an answer.

    It was more an attempt to find a solution so that your production servers can be sized properly.

    That said, we can develop a load test with X million rows that would be considered representative of a full-size production server.

    Once we get near that amount in production, an alert would tell us to set up several servers to balance the load.

    So although, theoretically, you can never know the amount of data you will deal with, you can "manage" it by spreading the load across different servers.

    By doing so, the load testing becomes entirely meaningful!

    The sizing of the load test then depends on how frequently you will extract, purge, and reallocate resources.

    Clement

  • In my case the 3 million rows was a full population of the database for six months' worth of usage.

    I had specific fixed limits in the system I was testing, and actual data would be at most 2 million rows, given the nature of the application. So doing all the load testing with 4-6 times the allowed data volumes was a good test.

    Hey, that's the beauty of Red Gate SQL Data Generator, though. 😉

    Hiding under a desk from SSIS Implementation Work :crazy:

  • I tend to agree with Phil and yet disagree. I feel initial development should start out small, in a concept mode, using isolated data in a strictly development-only environment. However, the trick is to have not just one or two levels of development and testing but at least four levels, where each level of testing and debugging is progressively more production-like prior to cutting a new system or maintenance changes over to production.

    Granted, implementing a series of separate servers (or farms) for Development, Test (1 and/or 2), Sim-Prod, and Production could be very expensive for a small business. However, the heavy production system Phil outlines is not that of a small business, or at least no longer falls into the small-business category. And the firm involved must be willing to invest in a more robust data environment: in hardware, software, and especially in data and code migration and testing policies.

    Ron K.

    "Any fool can write code that a computer can understand. Good programmers write code that humans can understand." -- Martin Fowler

  • I think this is a very valid point, but I would suggest working against two similarly structured databases for development purposes: one with lots of data and one with a little. The reason is that when you are performance-testing your database code, you want the mega-datasets; when you are testing the accuracy of your code, you probably want smaller sets whose results you can control, to make sure you are getting exactly the results you seek from your queries.

    Our production databases have millions of records in some of the tables, and I definitely want to make sure my work can perform well against them - but I also need to make sure that hidden in the thousands of results returned from some of my queries there aren't some of the wrong records (and that some of the right ones aren't missing). For simple selects this isn't a big problem, but when the query logic gets complicated it is very easy to make a small mistake that has a big impact in terms of data integrity.
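
    As a small illustration of the "controlled set" half of that, I keep a handful of hand-built rows whose correct answer I know in advance (the names here are invented for the example):

    -- Tiny hand-built dataset where the correct answer is known in advance.
    DECLARE @Orders TABLE (OrderId int, CustomerId int, Amount decimal(10,2));
    INSERT INTO @Orders VALUES (1, 100, 10.00), (2, 100, 15.00), (3, 200, 99.00);

    -- The query under test: total per customer.
    SELECT CustomerId, SUM(Amount) AS Total
    FROM @Orders
    GROUP BY CustomerId;
    -- Expected: customer 100 -> 25.00, customer 200 -> 99.00.
    -- At this size, a wrong or missing row is obvious at a glance.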

    --
    Anye Mercy
    "Service Unavailable is not an Error" -- John, ENOM support
    "You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
    "Civilization exists by geologic consent, subject to change without notice." -- Will Durant

  • While I agree with blandry that having a solid idea of data throughput is nearly impossible, I don't think that invalidates Phil's point. I think what Shaun was talking about is the perfect example:

    Shaun McGuile (12/18/2008)


    In my case the 3 million rows was a full population of the database for six months' worth of usage.

    I had specific fixed limits in the system I was testing, and actual data would be at most 2 million rows, given the nature of the application. So doing all the load testing with 4-6 times the allowed data volumes was a good test.

    Hey, that's the beauty of Red Gate SQL Data Generator, though. 😉

    He had an inkling of what production numbers would look like, but the more important thing was performing the test itself. If you can't know how much data is going to be in the database (and you have a data generator...), filling it up with a ludicrous number of rows and doing performance tests is still valid.

  • When I worked as a technical architect, I would never dream of letting a project go into development unless we had carefully specified the loading (e.g. transactions per second) and the data metrics (e.g. no. of records) that the system had to deal with. This makes sense because then the developers can do their unit testing to those figures (with allowance for the inherent optimism of humanity), and the testers can test to those figures. Then the project can be signed off as being fit for production service.

    If you don't have those figures, how then can you DBAs specify the hardware, or produce and cost out the Service Level Agreements? How can you developers do unit testing? In my example, the company hadn't really thought it through, but we very quickly had all those figures calculated and the hardware in place. I think that my response to those who don't have a proper specification of the system is just to express huge waves of sympathy. I feel your pain. You're more firefighters than developers or DBAs.

    Best wishes,
    Phil Factor

  • Sorry, BLandry, but I have to respectfully disagree. I think you can do reasonable estimates on important systems as to how big they can get. Calculating table and index size is almost trivial, but everything depends. If the company is a brand-new startup, then doing accurate estimates will be difficult, especially if the company takes off and ramps up at an unanticipated and explosive rate. But that reinforces Phil's point: test under a heavy (or monstrous) load; then it will perform well under a light load, and you have confidence of acceptable performance under a heavy one.
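
    As a rough illustration of that arithmetic (the figures here are invented for the example, not taken from the editorial): a table expected to reach 12 million rows at an average of 200 bytes per row needs about 12,000,000 x 200 ≈ 2.4 GB of data pages; add something like 20-50 per cent for indexes and fill factor, and you have a defensible first estimate to size hardware against and to load-test with.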

    It's definitely harder to accurately forecast size and load for new projects, much easier for redevelopments and conversions.

    -----
    Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson

  • Phil,

    Thanks for the break and guest editorial. I had a good time skiing with the kids yesterday.

    I agree that you want to have an idea of scale and build development systems to that level. There are times you have an inkling of scale, and you should test to that level.

    For most of my career, as I've built systems, we had no idea of the load. So we've not been sure where to test, though we've always tried to account for scale and load early on. You don't want to get wrapped up in it, but take a good guess and test beyond that guess. Then plan for things to get busier, or spend a few minutes making sure you haven't prevented things from evolving.

    To some extent, experience teaches you how to build small systems that can grow. Or it should.

  • Phil Factor (12/18/2008)


    When I worked as a technical architect, I would never dream of letting a project go into development unless we had carefully specified the loading (e.g. transactions per second) and the data metrics (e.g. no. of records) that the system had to deal with.

    You've obviously worked in companies with very much more static businesses than I have, then!

    Although, in an ideal world, it would be great to have the numbers for the exact number of records, transactions, queries, etc., in my experience the best you get is usually very rough and likely to change depending on business performance. The best I've usually managed is to get the most up-to-date estimates and predictions you can, then assume that they're going to be off by at least a factor of two. If the system can cope with that, there's a chance the production system will be OK for a while!

    On another point, I would agree with the idea of multiple levels of testing during development. When doing early testing, use small samples of data so you get fast turnaround; once things are believed to be OK, step up to realistic levels so you can see what it's really going to be like; finally, before release, try it with what you consider ridiculously high levels to see how it's likely to degrade. Chances are the real world will turn out somewhere between your realistic and ridiculous estimates.

    Of course, this also relates to developers always wanting the latest/fastest hardware when they should have some of the oldest/slowest...

    If they can produce acceptable performance on that old junk...

    Derek

