Working with large database tables

  • We are in the process of converting data from a mainframe system into SQL Server for our accounting system, and we need to query this data for reporting purposes. One of the tables in this database has at least 10 million records. Other tables linking to this table have around 3 million rows or so. In addition to adding indexes to this table, what would be the best approach to querying these tables? Any reading material on best practices for querying large databases that anyone can suggest would be great.

    Thank you.

  • Without more information, the answer is: it depends. What kind of queries? Aggregation? What does the schema look like, etc.?

    😎

  • We're planning to do all sorts of aggregations and SELECT statements. The main tables I'm talking about record financial transactions (mostly payments and invoices, but also reversals and voided items). These transactions are for different contacts and organizations. So at any given point in time, one of our customers may want to know what his or her balance is. In order to do this, we would have to add up all the transactions to come up with a balance for this year or even as a whole. We're thinking about running an ETL job twice a day to extract this aggregated data to a data warehouse, which would contain the total balance for each customer. We face two problems. First, the warehouse would be out of balance between loads; a nightly job would work, but during the day, how can we best do this without taking too many resources away from the users currently on the system? Second, how can we best gather the data without picking up incorrect information (dirty reads)?

    We have very little experience dealing with databases with tons of records, so any sources I can read and learn from would be greatly appreciated.

    Thank you.

  • HildaJ (5/23/2014)


    We're planning to do all sorts of aggregations and SELECT statements. The main tables I'm talking about record financial transactions (mostly payments and invoices, but also reversals and voided items). These transactions are for different contacts and organizations. So at any given point in time, one of our customers may want to know what his or her balance is. In order to do this, we would have to add up all the transactions to come up with a balance for this year or even as a whole. We're thinking about running an ETL job twice a day to extract this aggregated data to a data warehouse, which would contain the total balance for each customer. We face two problems. First, the warehouse would be out of balance between loads; a nightly job would work, but during the day, how can we best do this without taking too many resources away from the users currently on the system? Second, how can we best gather the data without picking up incorrect information (dirty reads)?

    We have very little experience dealing with databases with tons of records, so any sources I can read and learn from would be greatly appreciated.

    Thank you.

    For something like this (not much more than a large checkbook for multiple accounts), I don't believe that a DW would actually buy you anything but grief. Proper design will be the lifesaver for performance and accuracy with this.

    Even running balances on 10 million rows can be done in seconds.
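
    As an illustrative sketch only (hypothetical dbo.Transactions table with CustomerID, TransactionID, TranDate and Amount columns; the ROWS window frame needs SQL Server 2012 or later), a set-based running balance of that kind might look like:

        -- Minimal sketch, hypothetical names: one pass over the table, no RBAR.
        SELECT  CustomerID,
                TranDate,
                Amount,
                Balance = SUM(Amount) OVER (PARTITION BY CustomerID
                                            ORDER BY TranDate, TransactionID
                                            ROWS UNBOUNDED PRECEDING)
        FROM    dbo.Transactions;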

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There is a wealth of technology options to help you out on this; the pitfall is to choose one before the requirements are carved in stone! A handful of millions or hundreds of millions of records is not normally considered large these days, but even a few thousand records can form an insurmountable obstacle if the technology chosen and the design do not fit the purpose.

    Start by getting something working, test it against the requirements and improve if needed. This way, you can hopefully avoid another pitfall, which is premature optimization. Hope this helps, and feel free to ask further.

    😎

  • Eirikur Eiriksson (5/23/2014)


    There is a wealth of technology options to help you out on this; the pitfall is to choose one before the requirements are carved in stone! A handful of millions or hundreds of millions of records is not normally considered large these days, but even a few thousand records can form an insurmountable obstacle if the technology chosen and the design do not fit the purpose.

    Start by getting something working, test it against the requirements and improve if needed. This way, you can hopefully avoid another pitfall, which is premature optimization. Hope this helps, and feel free to ask further.

    😎

    Ahhhhh... be real, REAL careful there. A lot of people make the mistake of confusing premature optimization with good coding practices, and there's sometimes no way to fix it after the fact. For example, writing a CURSOR to do something that could be done in a high-performance, set-based manner isn't avoiding premature optimization... it's just avoiding the reality that you should embrace good coding practices.
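
    As a purely illustrative, hypothetical contrast, the single set-based statement below does what a cursor loop over dbo.Transactions would do by fetching and accumulating one row at a time:

        -- Hypothetical sketch: the set-based equivalent of a row-by-row
        -- cursor that accumulates each customer's total.
        SELECT  CustomerID,
                Balance = SUM(Amount)
        FROM    dbo.Transactions
        GROUP BY CustomerID;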

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson (5/23/2014)


    Start by getting something working, test it against the requirements and improve if needed. This way, you can hopefully avoid another pitfall, which is premature optimization.

    Doing a proper job upfront is not premature optimisation. It's doing your job to the best of your ability.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

  • To be honest, 10 million rows isn't particularly large. That's about what I put into tables when I'm doing query testing to get a good, but not huge, load.

    Write your queries in the simplest way possible, following all good practices around well-performing queries. Create indexes to support the queries. Then test and make sure that they perform within your requirements. If they don't, then consider changing the queries into more complex forms (premature optimisation would be going straight to the complex, non-standard methods without testing and seeing whether they are required).
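
    For instance (an assumption-laden sketch, not from the post), an index supporting per-customer, date-ranged balance queries on a hypothetical transactions table might be:

        -- Hypothetical: key columns match the query's filter and sort;
        -- Amount is included so the aggregate is answered from the index alone.
        CREATE NONCLUSTERED INDEX IX_Transactions_Customer_Date
            ON dbo.Transactions (CustomerID, TranDate)
            INCLUDE (Amount);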

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

  • Must say that I'm slightly puzzled. I have read my post over and over, but I cannot find any implicit or explicit suggestion of ignoring good coding or working practices. And most certainly, if that is the impression left by my post, it is not the intention. Could someone help me out here if that is the case?

    😎

  • Eirikur Eiriksson (5/24/2014)


    Must say that I'm slightly puzzled. I have read my post over and over, but I cannot find any implicit or explicit suggestion of ignoring good coding or working practices. And most certainly, if that is the impression left by my post, it is not the intention. Could someone help me out here if that is the case?

    😎

    It was this... particularly the bolded part...

    [font="Arial Black"]Start by getting something working[/font], test it against the requirements and improve if needed. This way, you can hopefully avoid another pitfall which is premature optimization.

    I've become quite sensitive to how such advice is worded because it's been my very unpleasant experience that "getting something working" frequently means making all character-based columns NVARCHAR(4000), all integer columns NUMERIC(18,0), using While Loops and scalar functions, single-row triggers, single-row non-CRUD stored procedures, 6 or 8 levels of stored procedures, views of views, functions of functions, testing it against 10 rows (because the While Loops they know how to write take too long to build a couple million rows of test data), and then calling it "good enough". And, when you challenge someone on why they coded in such a fashion, they give you this big ol' s4it-eating grin and say "pre-optimization is the root of all evil". The quoted saying is correct... how they've applied it is not.

    I've also found the words "test it against the requirements and improve if needed" can also be misconstrued because people are rarely, if ever, given either performance or scalability requirements and that opens up a whole world of excuses for crap code. :pinch:

    I didn't mean for you to take it personally. I just thought it needed some clarification of what the "something" to get working should be, which folks like you and me take for granted.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/25/2014)


    Eirikur Eiriksson (5/24/2014)


    Must say that I'm slightly puzzled. I have read my post over and over, but I cannot find any implicit or explicit suggestion of ignoring good coding or working practices. And most certainly, if that is the impression left by my post, it is not the intention. Could someone help me out here if that is the case?

    😎

    It was this... particularly the bolded part...

    [font="Arial Black"]Start by getting something working[/font], test it against the requirements and improve if needed. This way, you can hopefully avoid another pitfall which is premature optimization.

    I've become quite sensitive to how such advice is worded because it's been my very unpleasant experience that "getting something working" frequently means making all character-based columns NVARCHAR(4000), all integer columns NUMERIC(18,0), using While Loops and scalar functions, single-row triggers, single-row non-CRUD stored procedures, 6 or 8 levels of stored procedures, views of views, functions of functions, testing it against 10 rows (because the While Loops they know how to write take too long to build a couple million rows of test data), and then calling it "good enough". And, when you challenge someone on why they coded in such a fashion, they give you this big ol' s4it-eating grin and say "pre-optimization is the root of all evil". The quoted saying is correct... how they've applied it is not.

    I've also found the words "test it against the requirements and improve if needed" can also be misconstrued because people are rarely, if ever, given either performance or scalability requirements and that opens up a whole world of excuses for crap code. :pinch:

    I didn't mean for you to take it personally. I just thought it needed some clarification of what the "something" to get working should be, which folks like you and me take for granted.

    Thank you, Jeff, for the explanation; I do see your point. :Whistling:

    😎

  • Jeff Moden (5/25/2014)


    Eirikur Eiriksson (5/24/2014)


    Must say that I'm slightly puzzled. I have read my post over and over, but I cannot find any implicit or explicit suggestion of ignoring good coding or working practices. And most certainly, if that is the impression left by my post, it is not the intention. Could someone help me out here if that is the case?

    😎

    It was this... particularly the bolded part...

    [font="Arial Black"]Start by getting something working[/font], test it against the requirements and improve if needed. This way, you can hopefully avoid another pitfall which is premature optimization.

    Same here. It had the feeling of 'just get something working, no matter how bad, no matter how hard to maintain. It can always be fixed later'. In my experience, the 'fix it later' part never happens.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

  • I have seen older systems store all data as characters. In your initial ETL to the SQL Server, I would encourage proper conversions (make date fields DATE or DATETIME2(n), numbers perhaps DECIMAL(m,n), etc.). Also, if more real-time reporting on the accounting data is required, consider transactional replication to the warehouse from the SQL Server source. You can take monthly snapshots per account and sum the latest snapshot together with the transactions within the current month, so that real-time reporting aggregates do not have to consider too many rows.
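
    A rough sketch of that snapshot-plus-delta idea (all table and column names hypothetical; DATEFROMPARTS needs SQL Server 2012 or later):

        -- Hypothetical: current balance = last month-end snapshot + this month's rows,
        -- so the live aggregate only ever scans the current month.
        DECLARE @MonthStart date = DATEFROMPARTS(YEAR(GETDATE()), MONTH(GETDATE()), 1);

        SELECT  s.AccountID,
                Balance = s.SnapshotBalance + ISNULL(SUM(t.Amount), 0)
        FROM    dbo.MonthlyBalanceSnapshot AS s
        LEFT JOIN dbo.Transactions AS t
               ON t.AccountID = s.AccountID
              AND t.TranDate >= @MonthStart
        WHERE   s.SnapshotDate = DATEADD(DAY, -1, @MonthStart) -- last day of prior month
        GROUP BY s.AccountID, s.SnapshotBalance;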

    ----------------------------------------------------

  • HildaJ (5/23/2014)


    So at any given point in time, one of our customers may want to know what his or her balance is.

    This is relatively simple and a good use of triggers. Maintaining a current-balance table is typically expected in any high-transaction financial system. You need to know whether a withdrawal/payment can be made before you even add the transaction to the main table. Calculating this in real time every time you want to check can be painful.
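
    A bare-bones sketch of such a trigger (hypothetical names; a production version would also have to handle UPDATE/DELETE, accounts with no balance row yet, and concurrency):

        -- Hypothetical: keep dbo.CurrentBalance in step with inserted transactions.
        CREATE TRIGGER trg_Transactions_Balance
        ON dbo.Transactions
        AFTER INSERT
        AS
        BEGIN
            SET NOCOUNT ON;

            UPDATE cb
            SET    cb.Balance = cb.Balance + i.TotalAmount
            FROM   dbo.CurrentBalance AS cb
            JOIN  (SELECT AccountID, TotalAmount = SUM(Amount)
                   FROM   inserted
                   GROUP BY AccountID) AS i
              ON   i.AccountID = cb.AccountID;
        END;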

    In order to do this, we would have to add up all the transactions to come up with a balance for this year or even as a whole.

    Why would you ever want a balance for 'this year'? You want to know whether another transaction can occur, no? If you're looking to produce a banking-style report showing the balance as of a particular date, what I recommend you look at is running totals in SSRS. That way you can pull the customer's transactions and let the report do the day-to-day valuations.
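
    In SSRS that is typically a one-line textbox expression using the built-in RunningValue aggregate (the dataset name here is hypothetical):

        =RunningValue(Fields!Amount.Value, Sum, "TransactionsDataset")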

    We're thinking about running an ETL job twice a day to extract this aggregated data to a data warehouse, which would contain the total balance for each customer.

    Overkill. Proper setup and indexing mean that you won't need this out-of-date system to handle immediate balance expectations. You do that when you're trying to do BI work on 'your spending for the year, by category' type stuff... or whatever (it depends on the questions you want to ask). I wouldn't do it for balance reporting.

    We have very little experience dealing with databases with tons of records, so any sources I can read and learn from would be greatly appreciated.

    To echo the above, 10-20 million records isn't a lot anymore. You just need to get yourself straightened out on indexing, expected usage, query techniques, etc., so you can set it up right.

    EDIT: Sorry, nasty necro here. I really should have looked at the dates on the OP.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Evil Kraig,

    Since my initial post we have developed an approach to those issues. We have started building a data warehouse into which we're extracting the needed transactions. We have not implemented this solution in production yet, but we will in the coming months.

    Thank you for your post.
