Splitting TempDB across multiple files

  • GilaMonster - Friday, January 5, 2018 3:00 AM

    Tava - Friday, January 5, 2018 2:46 AM

      Is it purely when you see lot of PageIO and CXpacket wait MS time

    Those have got nothing to do with TempDB allocation contention. PageIOLatch is an IO wait (pages loaded into memory) and CXPacket is related to parallelism.

    TempDB contention = PageLatch waits (not PageIOLatch) on the TempDB allocation pages (most commonly 2:1:3)

    Thanks, Gail, for your insight.

    Here is a list of snapshots captured at 5-min interval using sp_WhoisActive. I should have made it clear that the waits were of PageIOLatch_UP type. For 4 days a week this job would finish by 2:15am. The next day or two it becomes a game of luck. It runs a single encrypted SP (this is a leased solution so I have little visibility into the actual code). The source database is a daily restore of our production database. Data volume is quite static throughout the week. The end result of this SP is a table with 70k-80k rows that get merged into another DW database. MAXDOP is set to 8 and there are 8 tempdb data files. Adding additional data files made things worth the next day so I rolled back to 8 files. At this juncture I'm using weekly reboot to avoid recurrence. I'd appreciate any further suggestions. Thanks.

  • PageIOLatch = contention on the IO subsystem. That won't be fixed by more files, unless the files are on separate IO subsystems.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster - Friday, January 5, 2018 1:23 PM

    PageIOLatch = contention on the IO subsystem. That won't be fixed by more files, unless the files are on separate IO subsystems.

    Thank you very much, Gail.

  • This was removed by the editor as SPAM

  • Thanks. I shall remember to provide more details. I'll start recording PLE changes from now on.

    Windows Server 2016 Standard Ed, 256GB Physical Memory. 24 Logical Processors
    SQL Server 2016 Standard Edition, SP1 CU6
    Min Mem: 225280MB
    Max Mem: 245760MB
    Max DOP: 8
    Cost Threshold for Parallelism: 50

    Tempdb has 8 data files and 1 log file and all of them are on a 446GB SSD drive. I was told last week by my manager that it was NOT a high-performance SSD. (I thought everything in that box was enterprise-grade at time of my original post because of all the FusionIO cards. Now I know this SSD could be the weakest link).

    PLE's at this moment are:

    ObjectName~InstanceName~PLE
    SQLServer:Buffer Node~001~158601
    SQLServer:Buffer Node~000~185136
     

    Thanks.

  • This was removed by the editor as SPAM

  • subramaniam.chandrasekar - Wednesday, January 10, 2018 6:30 AM

    RandomStream - Monday, January 8, 2018 10:39 AM

    Thanks. I shall remember to provide more details. I'll start recording PLE changes from now on.

    Windows Server 2016 Standard Ed, 256GB Physical Memory. 24 Logical Processors
    SQL Server 2016 Standard Edition, SP1 CU6
    Min Mem: 225280MB
    Max Mem: 245760MB
    Max DOP: 8
    Cost Threshold for Parallelism: 50

    Tempdb has 8 data files and 1 log file and all of them are on a 446GB SSD drive. I was told last week by my manager that it was NOT a high-performance SSD. (I thought everything in that box was enterprise-grade at time of my original post because of all the FusionIO cards. Now I know this SSD could be the weakest link).

    PLE's at this moment are:

    ObjectName~InstanceName~PLE
    SQLServer:Buffer Node~001~158601
    SQLServer:Buffer Node~000~185136
     

    Thanks.

    Please monitor your PLE daily and use performance counters if you want, thanks !

    I've set up a job to keep track of PLE. Thank you for the suggestion.

  • 1) You absolutely can harm performance by having too many tempdb files on insufficient IO performance storage. I have fixed this numerous times at clients.

    2) PLE is an essentially useless metric for monitoring, and has been for quite some time. average disk sec/read and average disk sec/write is what you need to key off of.

    3) RandomStream, it seems you have hijacked this thread. That stuff should have been posted on a new thread.

    4) Having said that:
    a) SQL Server Standard Edition 2016 can only use 128GB of RAM (although it is good you have double that, especially with FusionIO in play). Still should change memory settings.
    b) When the big job is running, is there ANYTHING else running on the SERVER?
    c) Do you have any visibility into the big job? I would be willing to bet it is rather inefficient.
    d) If your tempdb SSD is slowing to those IO stall numbers it is OBVIOUSLY bottlenecked. What model SSD is it? How is it connected to the server? Is that connection shared? Have you tried moving tempdb to the FusionIO card? 
    e) Is the firmware and drivers up to date for all aspects of your IO system (including the disk themselves)? If you are missing any critical/urgent updates I strongly recommend testing and rolling those out.
    f) You mentioned 24 cores. So I will guess it is 2X 6-core hyperthreaded CPUs. So MAXDOP of 8 is more likely to get you non-local memory access. If I am correct, drop to 6 and see how that goes. I would even try 4 and 0 to see if they make a difference.
    g) You may not be able to see the code, but you should be able to do an indexing analysis. If any of your CXPACKET waits are related to other than tempdb I would take a crack at that.
    h) I STRONGLY recommend hiring a professional tuner to give your system a review.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru - Wednesday, January 10, 2018 9:45 AM

    1) You absolutely can harm performance by having too many tempdb files on insufficient IO performance storage. I have fixed this numerous times at clients.

    2) PLE is an essentially useless metric for monitoring, and has been for quite some time. average disk sec/read and average disk sec/write is what you need to key off of.

    3) RandomStream, it seems you have hijacked this thread. That stuff should have been posted on a new thread.

    4) Having said that:
    a) SQL Server Standard Edition 2016 can only use 128GB of RAM (although it is good you have double that, especially with FusionIO in play). Still should change memory settings.
    b) When the big job is running, is there ANYTHING else running on the SERVER?
    c) Do you have any visibility into the big job? I would be willing to bet it is rather inefficient.
    d) If your tempdb SSD is slowing to those IO stall numbers it is OBVIOUSLY bottlenecked. What model SSD is it? How is it connected to the server? Is that connection shared? Have you tried moving tempdb to the FusionIO card? 
    e) Is the firmware and drivers up to date for all aspects of your IO system (including the disk themselves)? If you are missing any critical/urgent updates I strongly recommend testing and rolling those out.
    f) You mentioned 24 cores. So I will guess it is 2X 6-core hyperthreaded CPUs. So MAXDOP of 8 is more likely to get you non-local memory access. If I am correct, drop to 6 and see how that goes. I would even try 4 and 0 to see if they make a difference.
    g) You may not be able to see the code, but you should be able to do an indexing analysis. If any of your CXPACKET waits are related to other than tempdb I would take a crack at that.
    h) I STRONGLY recommend hiring a professional tuner to give your system a review.

    Indeed I have and I apologize and will stop with this post, after these comments.
    On your Point 2: I had never suspected memory pressure until I two days ago from this thread. I thought I should be aware of all variables and it wouldn't hurt to collect data.

    On Point 3: yes, my bad.

    On 4.a). MS changed wording on memory limit after SQL2016 SP1 to: Memory: Maximum buffer pool size per instance.
    4.b) Absolutely no other big job is running. During night time this is the one and only job.
    4.c) No visibility to the big job. It is a leased solution and the SP is encrypted. All I know is it does several steps of data crunching in tempdb. I am sure it is not efficient but this is a legacy system and all processes are iron-cast.
    4.d) I have great trouble to find out the make/specs of the SSD. We have a very strict environment and everything is on need-to-know basis. I hope it is not connected via a USB3 cable (jk) and will recommend moving tempdb to FusionIO.
    4.e) A ticket remains open with HP and they have done all the patching they could find. OS patching is done monthly.
    4.f) You mentioned 24 cores... I 'Hijacked' this thread because I wanted to ask whether I should reduce the number of tempdb files and MAXDOP at the same time. But that thought was lost along the way.
    4.g) The main database supporting this nightly job is a daily refresh from the production database. Indexes are being maintained weekly. Yes, there are quite a bunch of missing indexes but since this is a leased solution, I am not at liberty to make any chances, including fill-factors.
    4.h) I STRONGLY recommend hiring a professional tuner to give your system a review: I certainly will. Thank you for your suggestion.

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply