Will this design scale?

  • Go ahead and guess whether this will scale -- I am. Obviously I can't post my schema here, but we have a new product in our company (.NET, C#, SQL Server 2005). The client data is all being stored in one database (hundreds of clients) on a SQL Server 2005 Enterprise dual-node (active/passive) cluster. The SQL boxes are currently using 6GB of memory with AWE enabled, running Windows Server 2003, fibre-attached to a leased SAN. I'm sure more memory can be installed, probably up to 16GB since it's all 32-bit.

    We're now looking into our own storage devices to save $$. The data file is currently 170GB with 6 million "unique" records and should grow to over 50 million over the next year (250 million rows across all tables). We have a design which is highly normalized (way too normalized, in my old-school opinion), with 160 tables, 400+ indexes and 250+ foreign keys. It was designed with an object modeling tool (Vision, I believe).

    Select/update/insert etc. statements are generally crafted by an object, not in stored procedures. Profiler shows many, many joins in these statements. But everyone tells us SQL Server is designed to perform with such a schema and can cache the execution plans for "ad hoc" statements. We're also told only to denormalize as a last resort.

    I understand the data integrity advantages of such a design -- but reporting on the data, or fixing it, isn't much fun. It seems the trend is towards all database access being done by some tool that magically reads the schema for you -- are the days of actually knowing the tables and data well enough to "hand-craft" queries over?

  • The numbers you quote seem OK to me. It really comes down to what the application is doing and what implications exist with respect to reporting off the same database.

    If the application is not well behaved, you will most likely see blocking/locking issues.

    Reporting against a normalised schema can be hard work (doing joins all the time). You can help yourself out by creating views.

    HOWEVER, reporting and transactional workloads frequently conflict. Reports regularly ask for hundreds or thousands of records and may take a while to get them, while transactional applications typically update one or two records at a time, so reports may delay updates. I have seen huge conflicts caused by this sort of interaction. There are a number of workarounds, including indexed views, indexes to suit the report requirements (including covering indexes), denormalisation, and a dedicated reporting database (on the same or a different server).
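
    To illustrate the indexed-view idea, here's a minimal sketch against a hypothetical orders table (the names are made up; a real one would be built over your own reporting joins). On 2005 the view needs SCHEMABINDING, a COUNT_BIG(*) when it aggregates, and non-nullable columns under SUM:

    CREATE VIEW dbo.V_ORDER_TOTALS_BY_CLIENT
    WITH SCHEMABINDING
    AS
    SELECT CLIENT_ID,
           SUM(ORDER_AMOUNT) AS TOTAL_ORDER_AMOUNT,  -- ORDER_AMOUNT assumed NOT NULL
           COUNT_BIG(*)      AS ORDER_COUNT
    FROM dbo.ORDERS
    GROUP BY CLIENT_ID;
    GO

    -- The unique clustered index is what materialises the view; Enterprise Edition
    -- can then use it automatically for matching report queries.
    CREATE UNIQUE CLUSTERED INDEX IX_V_ORDER_TOTALS_BY_CLIENT
        ON dbo.V_ORDER_TOTALS_BY_CLIENT (CLIENT_ID);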

    As for using tools instead of hand crafting SQL - there certainly is a trend towards using tools. The query building functionality within the tools is improving. Creating views/sprocs for use by reports is still a useful thing to do - it does tend to ensure that valid joins and business rules are used.

  • We will need to build a separate system for reporting (a data warehouse). The application limits recordsets returned to the browser to 2,000 records. One immediate hurdle is the 16,000-records-per-hour load rate of the conversion program (loading from our legacy databases). Some of our larger legacy clients have 5-15 million records, plus lots of scanned images that will be converted (not stored in the DB as blobs, just on the file system). At that rate, 5 million records alone is roughly 13 days of loading. With all clients in one database and an expectation that a given client will only be offline for 2 or 3 days...?

    I'm used to a world where each client has their own database so this one seems risky.

  • In general terms, it doesn't sound bad, but it all comes down to the code, the indexes and the code, and if I didn't mention it, the code. How you query the database will make or break you.

    Since it's an existing system, before I'd say "Yes, this will scale" I'd capture the performance in production over a few days using Profiler and perfmon. Then gather all the information on the most frequently run queries, the longest running queries, etc. Check their execution plans. Look at wait states. Then, assuming nothing was out of line, you can make an educated estimate of how well it will scale.
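
    If it helps, here's a rough sketch of the kind of plan-cache query that surfaces the heaviest statements on 2005 (these are the standard DMVs; swap the ORDER BY to chase reads, duration or execution count instead), plus the one-liner for wait stats:

    -- Top statements by CPU from the plan cache
    SELECT TOP 20
           qs.execution_count,
           qs.total_worker_time,
           qs.total_elapsed_time,
           SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
                     (CASE qs.statement_end_offset
                           WHEN -1 THEN DATALENGTH(st.text)
                           ELSE qs.statement_end_offset
                      END - qs.statement_start_offset) / 2 + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_worker_time DESC;

    -- Where the server has been waiting since the last restart
    SELECT * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC;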

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I'd actually consider designing so you can move clients to separate databases. SQL Server scales, but to a point. There may come a time when a client doubles or triples and you might want to move them to another database/server.

    Ad hoc query plans can be reused, but the statements must be textually identical. Craig Freedman did a great blog series on this a while back, looking at how query plans are reused or not. Something as small as whitespace can make a difference, so if you are generating statements with a tool, that consistency can be a benefit. However, if statements get cut and pasted with the order of clauses or the whitespace changed, you can force new plans.
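
    A quick way to see this for yourself on 2005 is to run two ad hoc statements that differ only in whitespace and then look in the plan cache (the table name below is hypothetical; for something this trivial, simple parameterisation may also add a shared Prepared plan, but the two Adhoc text entries are still cached separately):

    SELECT ACCOUNT_ID FROM dbo.ACCOUNT WHERE ACCOUNT_ID = 1;
    SELECT ACCOUNT_ID FROM dbo.ACCOUNT WHERE ACCOUNT_ID = 1 ;  -- one extra space

    SELECT cp.usecounts, cp.objtype, st.text
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    WHERE st.text LIKE '%dbo.ACCOUNT WHERE%';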

  • I'd have three issues with the one-database strategy....

    If one database is out of service, all clients are off the air. Not good news for your customer support people, never mind your customers.

    Also, your security model had better be perfect, because otherwise some customer may end up seeing another's data.

    And if one customer ever needs to roll back data to an earlier state, then under a one-database design you've got a large negotiation problem.

  • Yes, I'm not happy with the all-in-one-database approach. I'm told the new application either requires or assumes this. It may turn out that performance, if nothing else, will require separate databases or even separate SQL Server systems.

    Adhoc "data fixes" are supposed to be needed less often but whenever you do something like that the "where clause" becomes even more critical since you risk updating the wrong client's data.

    During conversion loads, any performance hit is going to affect everyone, etc etc etc.

  • Scaling isn't about the number of objects but about usage, so in effect your question can never be answered. The only way to answer it is to stress test, scaling up and scaling out both users and data -- anything else is speculation and rumour.

    As I'm currently working on exactly what you're asking about (but with more tables, indexes, users, etc.), what I can say is that going to 64-bit and adding as much memory as you can will help -- if you have memory then things happen quicker and you don't have issues with disk performance. I was reading an article which says Microsoft has been working on parallel processing of queries in memory (not parallel plans) -- they have a way to go yet, it seems -- but adding memory is your best approach. In my case, having more memory than databases has definitely improved matters. Most scalability is achieved from the application -- the old 20/80 split (hardware/software).

    If reporting is an issue, replicating a database for reporting would be a good move.

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • My money would be on "not", but my guess is that you're probably obligated to make a go of it. I'm actually responding to two of Randy's posts here (one related to Agile in general, the other being this one, which I think may be a bit tongue-in-cheek -- meaning that Randy doesn't think it will scale either but for whatever reason can't say so out loud).

    First things first: based on your original post you're headed for a 1.5TB database in the very near future (170GB for 6 million records scales to roughly 1.4TB at 50 million), yet you're looking to save $$ on storage costs. If anything, you're going to need to spend some serious $$$$ to scale to that size with a database like you've described. It ain't going to happen on 4x500GB SATA drives in RAID 5. As the DBA, the ball is definitely in your court in this regard -- you can see the train a comin', so best to raise it as an issue sooner rather than later. Have any specific database performance metrics been requested/documented at this point?

    As for the ORM (select/insert/update/delete via ad hoc SQL), it's really a matter of how your developers are using the tool. If they're creating "big bang" objects (e.g. get everything I could possibly ever need) with tons of parameters, etc., it won't scale. On the other hand, if your developers are creating objects for specific functions (e.g. get a customer list by state, get customer X by customer_id, get a list of customer X's orders), much like you or I would create stored procedures for specific functions, you may not be in too bad a shape -- some ORMs, including Hibernate, LINQ, etc., actually generate very good/efficient SQL. In my experience it's a tossup as to which direction any particular group of developers will take -- might be worth asking them about it and maybe discussing the pros/cons of each approach ("big bang" = bad). Agile folks are big on doing absolutely nothing more than required to meet a particular requirement -- and guess what, that exact same approach works nicely when dealing with a database: ask for nothing more than what you need, update no more than what you need, etc.
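
    To make the contrast concrete, here's a minimal T-SQL sketch using a hypothetical CUSTOMERS/ORDERS schema (in this system the equivalent statements would come out of the ORM rather than be hand-written):

    DECLARE @CustomerId INT;
    SET @CustomerId = 42;   -- placeholder

    -- "Big bang": drag back everything the object might ever need
    SELECT c.*, o.*
    FROM dbo.CUSTOMERS AS c
    LEFT JOIN dbo.ORDERS AS o ON o.CUSTOMER_ID = c.CUSTOMER_ID
    WHERE c.CUSTOMER_ID = @CustomerId;

    -- Purpose-built: ask only for what this particular function actually needs
    SELECT o.ORDER_ID, o.ORDER_DATE, o.ORDER_AMOUNT
    FROM dbo.ORDERS AS o
    WHERE o.CUSTOMER_ID = @CustomerId
    ORDER BY o.ORDER_DATE DESC;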

    Agile & ORM's are most definitely not the end of the DBA - if anything I expect us to be even more in demand as the developers find that their new and wonderful ORM crutch has gotten them in trouble whether from performance or data quality (dealt with both). Probably one of the scariest things about ORM's in general is that by using the database server as nothing but a "stupid data store" it's going to hurt us all in the end as there is no way for other products/projects to get at the business logic that is getting moved out of the database layer - yes, you could use web services, etc. but my guess is that particular feature hasn't been included in the business layer as of yet (or even worse, people are starting to duplicate business logic here, there and everywhere like in reports)? IMHO one of the worst features of most ORM's is that they are often language specific and designed to be consumed by applications - not reporting tools, ETL tools, or any of the other applications that may need to interpret/enforce those business rules.

    Joe

  • I see your point about having business logic scattered about. We are just now talking about a "data warehouse" for reporting -- possibly via a second log shipping arrangement. We don't have a DBA per se; some instantiations of "class agile" don't include a named DBA 🙂

    As a data analyst I took over "disaster recovery", setting up backups and log shipping. I try to feed suggestions on performance/configuration to the developer team who designed the database and are coding the application. They're smart folks and have added indexed views, adjusted fill factors to reduce the frequency of index rebuilds, etc. I suspect that as the product matures a small data group will be created or an experienced DBA will be brought in.

    Right now I'm researching auto_update_statistics_async, since they've told me the system is more on the OLTP side -- they've been having to run DBCC FREEPROCCACHE on a regular basis.
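
    In case it helps, a minimal sketch of checking and turning on the async option (the database name is a placeholder, and AUTO_UPDATE_STATISTICS itself must be ON for the async form to do anything):

    -- Check current settings for every database
    SELECT name, is_auto_update_stats_on, is_auto_update_stats_async_on
    FROM sys.databases;

    -- Enable asynchronous statistics updates on a specific database
    ALTER DATABASE [YourClientDatabase]
    SET AUTO_UPDATE_STATISTICS_ASYNC ON;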

  • Oh sh*t -- dynamic SQL. I hope your security model is robust.

    FREEPROCCACHE... you lose the benefit of the cache every time it is run.

    No tables qualified by dbo... you're inviting extra system overhead to resolve the schema (see the quick illustration below).

    TOP 2000 without an ORDER BY... you're inviting random results.

    The code as posted is difficult to read... have you got an execution plan? Maybe we could spot a performance issue from that.
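
    A quick illustration of the schema-qualification point (the table name is hypothetical):

    -- Unqualified: SQL Server first checks the caller's default schema, then dbo
    SELECT ACCOUNT_ID FROM ACCOUNT;

    -- Qualified: resolves straight to dbo.ACCOUNT and lets plans be shared more predictably
    SELECT ACCOUNT_ID FROM dbo.ACCOUNT;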

  • Holy moly. Referring back to my statement, after having seen this code, I wish to refine my original estimate.

    Nope. Won't scale. Next.

    BTW, I reformatted the query so I could actually see it. All those IN and NOT IN statements are going to kill your performance, especially on a system the size you're talking about. Also, correct me if I'm wrong here, but I see a TOP without an ORDER BY. Do they know that they can, and probably will, get different result sets for the exact same parameters? You've got problems coming at you.

    SELECT TOP 2000
        PROPERTY18.ORIGINAL_MANUFACTURER_ID AS PROPERTY18_ORIGINAL_MANUFACTURER_ID
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.OWNERSHIP_DOC_ISSUING_STATE_ABBR AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_OWNERSHIP_DOC_ISSUING_STATE_ABBR
        ,SERVICE_REQUEST15.REQUEST_TRANSACTION_TYPE AS SERVICE_REQUEST15_REQUEST_TRANSACTION_TYPE
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.LIENHOLDER_IDENTIFIER AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_LIENHOLDER_IDENTIFIER
        ,ACCOUNT110.ACCOUNT_ID AS ACCOUNT110_ACCOUNT_ID
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.CUSTOM_ATTRIBUTE_1 AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_CUSTOM_ATTRIBUTE_1
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.CUSTOM_ATTRIBUTE_2 AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_CUSTOM_ATTRIBUTE_2
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.CUSTOM_ATTRIBUTE_3 AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_CUSTOM_ATTRIBUTE_3
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.CUSTOM_ATTRIBUTE_4 AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_CUSTOM_ATTRIBUTE_4
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.BORROWER_FULL_NAMES AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_BORROWER_FULL_NAMES
        ,ACCOUNT_OWNERSHIP_DOC_SUMMARY02.OWNER_FULL_NAMES AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02_OWNER_FULL_NAMES
        ,SERVICE_REQUEST15.RECORDED_REQUEST_DATE AS SERVICE_REQUEST15_RECORDED_REQUEST_DATE
        ,ACCOUNT110.ACTUAL_PAYOFF_DATE AS ACCOUNT110_ACTUAL_PAYOFF_DATE
        ,SERVICE_REQUEST15.REQUEST_MECHANISM_TYPE AS SERVICE_REQUEST15_REQUEST_MECHANISM_TYPE
        ,OWNERSHIP_DOC13.OWNERSHIP_DOC_ID AS OWNERSHIP_DOC13_OWNERSHIP_DOC_ID
    FROM ACCOUNT_OWNERSHIP_DOC_SUMMARY AS ACCOUNT_OWNERSHIP_DOC_SUMMARY02
    INNER JOIN ( OWNERSHIP_DOC AS OWNERSHIP_DOC13
        INNER JOIN ( SERVICED_COLLATERAL_GROUP_ITEM AS SERVICED_COLLATERAL_GROUP_ITEM14
            INNER JOIN ( SERVICE_REQUEST AS SERVICE_REQUEST15
                INNER JOIN STATE_MESSAGE_EXCHANGE AS STATE_MESSAGE_EXCHANGE16
                    ON SERVICE_REQUEST15.SERVICE_REQUEST_ID = STATE_MESSAGE_EXCHANGE16.COLLATERAL_GROUP_REQUESTSERVICE_REQUEST_ID
            )
            ON SERVICED_COLLATERAL_GROUP_ITEM14.COLLATERAL_GROUP_REQUESTSERVICE_REQUEST_ID = SERVICE_REQUEST15.SERVICE_REQUEST_ID
        )
        ON OWNERSHIP_DOC13.OWNERSHIP_DOC_ID = SERVICED_COLLATERAL_GROUP_ITEM14.OWNERSHIP_DOC_ID
        INNER JOIN PROPERTY AS PROPERTY18
            ON OWNERSHIP_DOC13.OWNERSHIP_DOC_PROPERTY_ID = PROPERTY18.PROPERTY_ID
    )
    ON ACCOUNT_OWNERSHIP_DOC_SUMMARY02.OWNERSHIP_DOC_ID = OWNERSHIP_DOC13.OWNERSHIP_DOC_ID
    LEFT OUTER JOIN ( PROPERTY AS PROPERTY19
        INNER JOIN ACCOUNT AS ACCOUNT110
            ON PROPERTY19.ACCOUNT_ID = ACCOUNT110.ACCOUNT_ID
    )
    ON ACCOUNT_OWNERSHIP_DOC_SUMMARY02.ACCOUNT_PROPERTYPROPERTY_ID = PROPERTY19.PROPERTY_ID
    WHERE ( ( ACCOUNT_OWNERSHIP_DOC_SUMMARY02.STATUS <> @ACCOUNT_OWNERSHIP_DOC_SUMMARY02_STATUS_0
              AND ACCOUNT_OWNERSHIP_DOC_SUMMARY02.CLIENTORGANIZATION_ID = @ACCOUNT_OWNERSHIP_DOC_SUMMARY02_CLIENTORGANIZATION_ID_1
              AND SERVICE_REQUEST15.REQUEST_TRANSACTION_TYPE NOT IN ( @SERVICE_REQUEST15_REQUEST_TRANSACTION_TYPE_2 )
              AND STATE_MESSAGE_EXCHANGE16.MESSAGE_EXCHANGE_STATUS = @STATE_MESSAGE_EXCHANGE16_MESSAGE_EXCHANGE_STATUS_3
              AND STATE_MESSAGE_EXCHANGE16.BUSINESS_PROCESS_STATUS = @STATE_MESSAGE_EXCHANGE16_BUSINESS_PROCESS_STATUS_4
              AND SERVICE_REQUEST15.BUSINESS_PROCESS_STATUS = @SERVICE_REQUEST15_BUSINESS_PROCESS_STATUS_5
              AND SERVICE_REQUEST15.REQUEST_MECHANISM_TYPE = @SERVICE_REQUEST15_REQUEST_MECHANISM_TYPE_6
            )
            AND ( ( SERVICED_COLLATERAL_GROUP_ITEM14.CONCRETE_TYPE IN ( @SERVICED_COLLATERAL_GROUP_ITEM14_CONCRETE_TYPE_7 ) )
                  AND ( PROPERTY18.CONCRETE_TYPE IN ( @PROPERTY18_CONCRETE_TYPE_8, @PROPERTY18_CONCRETE_TYPE_9 ) )
                )
          )

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I should be able to get an execution plan. I tried to edit the post to reformat the T-SQL but never saw a "post" button, just preview/edit/spell check. Perhaps the ORDER BY doesn't matter, because what they're doing is preventing a user (browser GUI) from hitting the database for 2 million records. You get to see 2,000, with a message that there may be more. Who needs more than that displayed in HTML -- anything huge should be done by a reporting group against a separate system, right?

    We've converted a few hundred of our smallest clients to this system, but we have a big one coming up with a few million records for one client. Right now the conversion rate is under 30K records per hour, and I would imagine the client expects the conversion to occur over a weekend. :w00t: The mdf is now 175GB with about 6 million "unique" records. I've recommended a full-blown test conversion for this large client, and I think they will do that much. As far as a stress test of this system with one or two large clients converted into a test system -- not sure that will happen.

  • SET ROWCOUNT 2000 would be far more effective than TOP 2000 in limiting returned data... as TOP implies sorting, which would be an extra processing step.

    At 30K per hour, 1 million records will take 33 hours and 2 million 66+ hours... getting "a couple of million" converted over a weekend looks very ambitious, especially when you don't seem to allow for any problem analysis/fixing time, any backup time, etc.

  • A little googling indicates that (1) whether SET ROWCOUNT or TOP performs better is unclear without testing, and (2) SET ROWCOUNT is deprecated.
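
    For reference, a minimal side-by-side of the two forms against a hypothetical dbo.ACCOUNT table (the 2,000-row cap is the application's, per the earlier posts):

    SET ROWCOUNT 2000;                   -- caps every following statement in this session
    SELECT ACCOUNT_ID FROM dbo.ACCOUNT;
    SET ROWCOUNT 0;                      -- always reset, or later statements stay capped

    SELECT TOP (2000) ACCOUNT_ID         -- TOP is scoped to this one statement;
    FROM dbo.ACCOUNT                     -- the ORDER BY is what makes the 2,000 rows
    ORDER BY ACCOUNT_ID;                 -- deterministic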
