Quick design question for database used for statistical/analysis purposes

  • I have a quick SQL Server design question about an unusual (at least for me!) database situation: 

    Situation: We are creating a SQL Server 2012 database to store data that is exported daily from an old mainframe system. That system is operated by another firm; we have no control over it. It collects day-to-day transactions (retail sales records). Each day, admins on that system dump out that day's activity and FTP it to us. The database we're charged with creating must store the last 30 days of data; e.g., on Dec. 31 we'll import that day's data and delete records in our system dated Dec. 1 or earlier. It's like a perpetually changing 30-day snapshot of the mainframe's data. We'll use this data on our end for statistical/analytic purposes. It will never be edited by us. The data is pretty darn big; a month's worth of activity can be around 50 gigabytes (and space on our SQL Server boxes is unfortunately very limited right now).

    My question: In creating the initial .MDF file, I'm considering a size of ~75 GB, which seems like it would more than cover the typical size of 30 days of data. But I have no clue as to the appropriate size for the .LDF (log file). Since we're short on server space, and since this is just a dump of data that will never be edited, should I set the initial size of the log to something very tiny (with no autogrowth)?

    I hope this is not a dumb question, but I'm accustomed to creating databases in which edits are made constantly by users (and thus the log is important). This one is just a revolving 30-day warehouse used only for a ton of read-only queries to explore the data. Would the log be of any real importance in this scenario? If you have any insights or suggestions as to the appropriate sizes of the .LDF (and .MDF) files, or have any other insights on the design of databases such as this, I would greatly appreciate your help! Thanks so much and best wishes for a happy holiday season!

  • The issue with all-at-once large data loads is log file growth. The best way to deal with that is to chunk the load out into small sets with a commit after each. SSIS does this for you.
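
    For illustration, here's a rough T-SQL sketch of that idea outside of SSIS (the table and column names are invented for the example; in SSIS you'd get the same effect from the batch/commit size settings on the destination):

        -- Move the nightly dump from a staging table into the main table in
        -- small batches; each INSERT commits on its own, so no single
        -- transaction ever holds a huge amount of log.
        DECLARE @BatchSize int    = 50000;
        DECLARE @LastId    bigint = 0;
        DECLARE @MaxId     bigint = (SELECT MAX(RowId) FROM dbo.DailySales_Staging);

        WHILE @LastId < ISNULL(@MaxId, 0)
        BEGIN
            INSERT INTO dbo.DailySales (RowId, SaleDate, StoreId, Amount)
            SELECT RowId, SaleDate, StoreId, Amount
            FROM   dbo.DailySales_Staging
            WHERE  RowId >  @LastId
              AND  RowId <= @LastId + @BatchSize;

            SET @LastId = @LastId + @BatchSize;
        END;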

  • There are two ways to accomplish this, and it depends on whether you have Enterprise Edition or not. If not, use partitioned views: each day goes in its own table and you rebuild the view after each load (or use synonyms for the 30 daily tables and switch the synonyms daily). If you have EE you could use a partitioned table. While writing this I started thinking the partitioned view would be easier.
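
    A bare-bones sketch of the partitioned-view idea, in case it helps to picture it (all names and date literals are invented):

        -- One table per day; the CHECK constraint on the date is what lets the
        -- optimizer skip the tables a query can't possibly need.
        CREATE TABLE dbo.Sales_20121201
        (
            SaleDate date  NOT NULL CHECK (SaleDate = '20121201'),
            StoreId  int   NOT NULL,
            Amount   money NOT NULL
        );

        CREATE TABLE dbo.Sales_20121202
        (
            SaleDate date  NOT NULL CHECK (SaleDate = '20121202'),
            StoreId  int   NOT NULL,
            Amount   money NOT NULL
        );
        -- ...and so on, one table per day in the window.
        GO

        -- The view is just a UNION ALL over the daily tables; re-create or ALTER
        -- it after each night's load so it always covers the current 30 days.
        CREATE VIEW dbo.Sales_Last30Days
        AS
        SELECT SaleDate, StoreId, Amount FROM dbo.Sales_20121201
        UNION ALL
        SELECT SaleDate, StoreId, Amount FROM dbo.Sales_20121202;
        -- UNION ALL ... (one SELECT per daily table)
        GO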

  • Combine Joe's answer with Lynn's answer and you're most of the way there.

    To get the precise size of the log you're going to have to do some testing once you've built out your load process. Measure that, add a little headroom for exceptions, and then you should be good to go. It will be much smaller than normal because, based on what you've told us, you'll only ever need it for the load processing.
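
    One simple way to do that measuring, as a sketch: run a full-size test load and then check how much of the log it actually used, e.g.:

        -- Size and percent-used of every database's log.
        DBCC SQLPERF (LOGSPACE);

        -- Or, on SQL Server 2012, just for the current database:
        SELECT total_log_size_in_bytes / 1048576.0 AS log_size_mb,
               used_log_space_in_percent
        FROM   sys.dm_db_log_space_usage;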

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • If you use a single table for each day, the tables will be smaller. You would only need 31 tables for the data: 30 of the tables are used in the partitioned view, and the 31st table can then be truncated and used for the next day's data load.
    Do a Bing or Google search on Partitioned Views.
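
    To make the daily rotation concrete (a sketch only; the table name, constraint name, dates, and file path are all made up):

        -- Reuse the table that just fell out of the 30-day window.
        TRUNCATE TABLE dbo.Sales_Day31;

        -- Re-point its date constraint at the new day before loading.
        ALTER TABLE dbo.Sales_Day31 DROP CONSTRAINT CK_Sales_Day31_Date;
        ALTER TABLE dbo.Sales_Day31
            ADD CONSTRAINT CK_Sales_Day31_Date CHECK (SaleDate = '20121231');

        -- Load the new day's file (BULK INSERT shown here as just one option).
        BULK INSERT dbo.Sales_Day31
        FROM 'D:\Imports\sales_20121231.txt'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

        -- Then rebuild/ALTER the view (or switch the synonym) so it unions the
        -- current 30 tables.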

  • Thanks Joe, Lynn & Grant! Your insights are all very helpful!

    We do not have Enterprise Edition. A local partitioned-view approach may be appropriate; we had not really considered it. Very interesting!

    Up till now, we'd been thinking of tossing everything into one big table, such that each day we'd add new records containing last night's activity, and then run a DELETE query to nuke records that had been imported more than 30 days ago.

    Given the size of the stuff to be imported, this is a lot of data churning, and thus the log could grow large quickly. That's why we were wondering what size we should set it to in the initial creation of the database (and what autogrowth, if any, we should allow). Once imported, there will be zero editing of the data; we'll just be running a bunch of SELECT queries on it. Perhaps my question should have been more along the lines of, "What log size would generally make sense in this sort of data-warehouse situation?"
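
    If we do end up going the single-big-table route, I imagine the nightly purge would be chunked the way Joe described for the load, something like this (table and column names are just placeholders):

        -- Chunked purge: each small DELETE commits on its own, so the log never
        -- has to hold one giant delete transaction.
        DECLARE @Rows int = 1;

        WHILE @Rows > 0
        BEGIN
            DELETE TOP (50000)
            FROM dbo.Sales
            WHERE SaleDate < DATEADD(DAY, -30, CAST(GETDATE() AS date));

            SET @Rows = @@ROWCOUNT;
        END;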

    Given that we'll be checking daily to make sure last night's import went okay, should the log be just large enough to serve as a record of what occurred during last night's import routine? Or maybe we should hold a few days' worth of activity, to account for weekends when staff might not be able to verify that last night's import ran okay?

    I apologize if my questions seem confused or uninformed. Our staff is spread across a dozen different projects with (of course!) different technologies and platforms, and none of us is a SQL Server expert. Your insights really are valuable, thanks again!

