Performance tuning for ETL

  • Hi,

    We currently have a very large data warehouse (approximately 700 GB) and are busy doing history loads from flat files. The box is running 8 processors, 8 GB of RAM, 14 disk drives on RAID 5, and SQL Server 2000. We have two filegroups: one primary and one for indexes.

    Every load takes about 22 hours to complete, and the number of records per load varies from around 600,000 to about 900,000. Monitoring performance while a load was running showed that memory utilization wasn't even at half and the processors were also not maxed out. It seems the problem lies with I/O operations taking too long.

    Any suggestions on how we can improve I/O performance?

    Thanks,

    Martin.

  • - Avoid non-clustered indexes at load time (consider dropping them before the load and re-creating them after).

    - Sort your data according to the clustering index of your target tables before loading.

    - Use BULK INSERT whenever possible; see the sketch below.
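
    A minimal T-SQL sketch of that pattern, assuming a hypothetical fact table dbo.FactSales clustered on SaleDate and a flat file pre-sorted on that key (all names here are illustrative):

    ```sql
    -- Drop non-clustered indexes before the load...
    DROP INDEX FactSales.IX_FactSales_CustomerID

    -- ...bulk load the pre-sorted file, telling SQL Server the data already
    -- matches the clustered index order so it can skip the sort:
    BULK INSERT dbo.FactSales
    FROM 'D:\loads\factsales.dat'
    WITH (
        ORDER (SaleDate ASC),   -- file is pre-sorted on the clustering key
        TABLOCK,                -- allows a minimally logged bulk load
        BATCHSIZE = 100000
    )

    -- ...then re-create the non-clustered indexes afterwards:
    CREATE INDEX IX_FactSales_CustomerID ON dbo.FactSales (CustomerID)
    ```

    With TABLOCK and the database in the bulk-logged recovery model, the load can be minimally logged, which is usually the biggest single win.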

    I hope this gives some hints ...

    Johan


  • You say 14 disks configured in a RAID 5 array? I think this might be the problem.

    We have a similar setup for a 600 GB database, configured as follows:

    C: RAID 1 (OS)

    D: RAID 1 tempdb (this gave us a massive boost; see the tempdb sketch below)

    E: RAID 1+0 data files

    F: RAID 1+0 log files

    G: RAID 1+0 indexes

    H: RAID 1+0 out-of-row data (ntext, text, binary)

    It runs absolutely brilliantly.

    Please note that these are separate arrays and not partitions of a single array.

    We also use two controller cards for the mirroring, so we have redundancy on the controller and any bottlenecks at that component are eased.
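
    For what it's worth, moving tempdb to its own drive is just two ALTER DATABASE statements on SQL Server 2000. A minimal sketch, assuming tempdb still uses its default logical file names (tempdev / templog); the D: paths are illustrative:

    ```sql
    USE master

    -- Point tempdb's files at the dedicated RAID 1 drive:
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = 'D:\tempdb.mdf')

    ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = 'D:\templog.ldf')

    -- The move takes effect the next time the SQL Server service restarts.
    ```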

    MVDBA

  • RAID 5 is pants for data writing for a database. You may get the most disk space for the spindles, but at poor performance. By the way, splitting filegroups across a single array doesn't work for performance; filegroups work best on separate arrays.

    RAID 10 is much better, but you'd need more disks.

    Check the disk stats in PerfMon to be sure; I/O completion time and disk idle time are the two most useful counters (use idle time, since usage = 100 - idle time is often more accurate than the disk usage counter). If the I/O completion times are high (I'd suggest over 10 ms), then you have disk problems.
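
    Alongside PerfMon, you can see I/O stalls from inside SQL Server 2000 itself via fn_virtualfilestats. A minimal sketch; the database name is hypothetical:

    ```sql
    -- Cumulative I/O statistics per file since the last restart.
    DECLARE @db int
    SET @db = DB_ID('YourDW')              -- hypothetical database name

    SELECT  FileId,
            NumberReads,
            NumberWrites,
            IoStallMS                      -- total ms spent waiting on I/O
    FROM    ::fn_virtualfilestats(@db, 1)  -- file id 1 = primary data file
    ```

    A high IoStallMS relative to the read/write counts points the same way as high I/O completion times in PerfMon.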

    To be honest, disk and CPU don't always show the problems, and you might want to check that the actual load procs/code are good.

    I solved a slow load, taking it from 12 hours to 6 seconds, by removing an inline function in a query.
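
    To illustrate the kind of thing that can do it (the table and function names here are hypothetical): a scalar function wrapped around a column is evaluated row by row and stops the optimizer from using an index on that column.

    ```sql
    -- Slow: the UDF runs once per row, so the predicate is not sargable.
    SELECT OrderID
    FROM   dbo.Orders
    WHERE  dbo.fn_FiscalYear(OrderDate) = 2005

    -- Fast: inlining the equivalent logic as a date range lets an index on
    -- OrderDate be used as a seek instead of a full scan.
    SELECT OrderID
    FROM   dbo.Orders
    WHERE  OrderDate >= '20040701' AND OrderDate < '20050701'
    ```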

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Thanks for all the replies and comments; it is much appreciated. The current environment is a very complex one, and they don't really want to change any of the code at the moment, so we are looking for quick wins where possible...

    I'm not too strong on the physical architecture side, so your suggestions really help. Please feel free to send more my way should you have any.

    Thanks,

    Martin.

  • It sounds like it may be the import process that is at fault rather than the server.

    A rule of thumb says at most 20% of the gain is in hardware and 80% is in tuning the software; I've always found that to be true unless the hardware was way off kilter.

    I'd seriously look at the code, 'cos I bet that's where the bottleneck is! When I tracked down the function issue, the server was basically just sitting there, apparently doing nothing.

    I assume you have AWE memory set, btw? With 8 GB of RAM, 32-bit SQL Server 2000 can only address the memory above 4 GB if AWE is enabled; a sketch follows below.
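
    For reference, a minimal sketch of enabling AWE on SQL Server 2000; note it also needs the /PAE switch in boot.ini, the "Lock Pages in Memory" right for the service account, and a service restart to take effect. The 7168 MB cap is just an example for an 8 GB box:

    ```sql
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE

    EXEC sp_configure 'awe enabled', 1
    RECONFIGURE

    -- With AWE on, set an explicit ceiling so the OS keeps some headroom:
    EXEC sp_configure 'max server memory', 7168
    RECONFIGURE
    ```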

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Extend your filegroup before loading a large amount of data; extending a data file during the loading process is very slow. For ETL, if the data requires further processing, it is always better to load the source into a staging table (in its own filegroup), then build the indexes and process (clean) the data, etc. A sketch of both follows below.
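
    A minimal sketch of both ideas on SQL Server 2000; the database, file, filegroup, and column names are all hypothetical:

    ```sql
    -- Pre-grow the data file so the load never pauses for autogrow:
    ALTER DATABASE YourDW
    MODIFY FILE (NAME = YourDW_Staging_Data, SIZE = 51200MB)  -- ~50 GB

    -- Load into a heap on its own filegroup, then index and clean afterwards:
    CREATE TABLE dbo.StageSales (
        SaleID   int      NOT NULL,
        SaleDate datetime NOT NULL,
        Amount   money    NOT NULL
    ) ON StagingFG
    ```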

    Good Luck!!
