Tempdb allocation and Tempdb filling up

  • Hi,

    I have an odd situation.
    We have hosted our SQL servers on AWS (Amazon Cloud), and we have some overnight jobs running.
    We recently went from Provisioned IOPS to General Purpose SSD, but the throughput is very similar with the way we have configured the server.
    We are generating a lot of data, but it is workable at the moment. Tempdb was set at 30 GB, which seemed fine.

    A few weeks back the jobs started filling up tempdb. After a few changes, I fixed its size at 500 GB with no autogrowth, to avoid taking the instance down.
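
    For reference, a quick way to confirm the file size and growth settings is something like this (just a sketch; run it in tempdb):

    use tempdb;
    select name ,
       size * 8 / 1024 as size_MB ,  -- size is stored in 8 KB pages
       max_size ,                    -- 0 = no growth allowed, -1 = grow until the disk is full
       growth ,                      -- 0 = autogrowth disabled
       is_percent_growth
    from sys.database_files;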

    The jobs now run OK some days, and then the next day tempdb will just fill up to 500 GB and fail the jobs. I have monitored the usage and I can see the procedure that causes this, but it is not consistent. The same amount of data is produced each night, and in fact all the data across all our databases is no more than 150 GB. Also, I can re-run the jobs in the morning after a failure and they run through fine.

    As the jobs run through ok sometimes I am unsure of what might cause this.

    Could this be a bug?
    Could this be the execution plan?
    It seems strange that it is not a consistent failure.

    Has anyone had any experience of this?

    Any thoughts would be welcome

    Thanks
    Graeme

  • Graeme

    The execution plan would definitely be a good place to start - the actual execution plan, please, so that we can compare estimated numbers with actual numbers.

    Does your monitoring show what tempdb is actually filling up with - is it user-defined temp tables or table variables, or something else, such as spills caused by incorrect memory grants, or possibly even index maintenance using SORT_IN_TEMPDB taking place at the same time?
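
    If your monitoring doesn't break it down already, a query along these lines (run in tempdb) will show the split between user objects, internal objects and the version store - a rough sketch:

    SELECT SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_MB ,
        SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_MB ,
        SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_MB ,
        SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_MB
    FROM sys.dm_db_file_space_usage;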

    John

  • John,

    Thanks for the response.

    Tempdb is being consumed by internal objects only, which would suggest a lot of worktables, hash matching and sorting.
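
    For what it's worth, a quick way to see which running queries hold large memory grants (it is hash joins and sorts that exceed their grant that spill to tempdb) is something like this - just a sketch:

    select mg.session_id ,
       mg.requested_memory_kb ,
       mg.granted_memory_kb ,
       mg.used_memory_kb ,
       mg.max_used_memory_kb ,
       t.text as query_text
    from sys.dm_exec_query_memory_grants as mg
    cross apply sys.dm_exec_sql_text(mg.sql_handle) as t
    order by mg.granted_memory_kb desc;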

    There is nothing else running on the server as far as maintenance is concerned.

    This hasn't happened before. Could the amount of data being generated be hitting a limit for the procedure/execution plan? We have looked at the execution plan and it has been consistent across various runs, but it is dealing with large chunks of data.
    I have wondered whether the procedure needs a re-write.

    Thanks again
    Graeme

  • Without seeing the stored procedure and the actual execution plan, we're just guessing.  Could be any number of things.  Are the actual figures also consistent?  How volatile is your data, and when was the last time statistics were updated on each of the tables involved?
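
    For the statistics question, something along these lines (run in the relevant database) shows when each statistic was last updated and how many modifications have accumulated since; sys.dm_db_stats_properties needs 2008 R2 SP2 / 2012 SP1 or later:

    SELECT OBJECT_NAME(s.object_id) AS table_name ,
        s.name AS stats_name ,
        sp.last_updated ,
        sp.rows ,
        sp.modification_counter
    FROM sys.stats AS s
        CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
    ORDER BY sp.last_updated;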

    John

  • SELECT TE.name AS [EventName] ,
       T.DatabaseName ,
       t.DatabaseID ,
       t.NTDomainName ,
       t.ApplicationName ,
       t.LoginName ,
       t.SPID ,
       t.Duration ,
       t.StartTime ,
       t.EndTime,
       t.PlanHandle,
       t.SqlHandle
      
    FROM  sys.fn_trace_gettable(CONVERT(VARCHAR(150), ( SELECT TOP 1
                         f.[value]
                       FROM  sys.fn_trace_getinfo(NULL) f
                      WHERE f.property = 2
                      )), DEFAULT) T
       JOIN sys.trace_events TE ON T.EventClass = TE.trace_event_id
    WHERE T.DatabaseName = 'tempdb'
       AND TE.name IN ('Data File Auto Grow', 'Data File Auto Shrink')
            
    ORDER BY t.StartTime ;

    The script above shows what forced tempdb to grow.

  • select * from sys.dm_db_task_space_usage
    where internal_objects_alloc_page_count <> 0

    The script above shows which tasks have allocated internal-object pages in tempdb; compare the alloc and dealloc counts to see what has not yet been released.

  • Hi 
    Thanks for your comments.
    We are using Amazon RDS, which is a little restrictive, but that said we are only running procedures.
    Our data is not that volatile; we apply some daily changes but nothing major.
    We run all the stats updates after dropping and re-applying indexes.
    I am gathering data using the script below.
    The weird thing is that for the procedure causing the problems (I suspect it is), the plan is only returned when there is a problem.
    Running the script below every 5 mins does not capture the plan when there are no problems.
    I will be trying other things.

    Thanks for your interest
    Graeme

    select
      t1.session_id
      , t1.request_id
      , int_task_alloc_GB = cast((t1.internal_objects_alloc_page_count * 8./1024./1024.) as numeric(10,1))
      , int_task_dealloc_GB = cast((t1.internal_objects_dealloc_page_count * 8./1024./1024.) as numeric(10,1))
        , usr_task_alloc_GB = cast((t1.user_objects_alloc_page_count * 8./1024./1024.) as numeric(10,1))
      , usr_task_dealloc_GB = cast((t1.user_objects_dealloc_page_count * 8./1024./1024.) as numeric(10,1))
      , host= case when t1.session_id <= 50 then 'SYS' else s1.host_name end
      , s1.login_name
      , s1.status
      , s1.last_request_start_time
      , s1.last_request_end_time
      , s1.row_count
      , s1.transaction_isolation_level
      , query_text=
       coalesce((SELECT SUBSTRING(text, t2.statement_start_offset/2 + 1,
        (CASE WHEN statement_end_offset = -1
         THEN LEN(CONVERT(nvarchar(max),text)) * 2
           ELSE statement_end_offset
         END - t2.statement_start_offset)/2)
       FROM sys.dm_exec_sql_text(t2.sql_handle)) , 'Not currently executing')
      , query_plan=(SELECT query_plan from sys.dm_exec_query_plan(t2.plan_handle))
        , GETDate() As RunDate
    FROM
      (Select session_id, request_id
      , internal_objects_alloc_page_count
        , user_objects_alloc_page_count
      , internal_objects_dealloc_page_count
        , user_objects_dealloc_page_count
      from sys.dm_db_task_space_usage
      ) as t1
    left join sys.dm_exec_requests as t2 on
      t1.session_id = t2.session_id
      and t1.request_id = t2.request_id
    left join sys.dm_exec_sessions as s1 on
      t1.session_id=s1.session_id
    WHERE
      t1.session_id > 50 -- ignore system
      and t1.session_id <> @@SPID -- ignore this request itself
    ORDER BY t1.session_id DESC;

  • Graeme

    That script isn't going to help much here because the plan cache doesn't store the actual execution plan.  You need to either run the query interactively and capture the plan, or use Query Store in SQL Server 2016.
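
    Once you're on 2016 with Query Store enabled, pulling the captured plans and aggregated runtime stats for the suspect procedure looks roughly like this (the procedure name is a placeholder):

    SELECT q.query_id ,
        p.plan_id ,
        TRY_CONVERT(xml, p.query_plan) AS query_plan ,
        rs.count_executions ,
        rs.avg_duration ,
        rs.avg_query_max_used_memory
    FROM sys.query_store_query AS q
        JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
        JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
    WHERE q.object_id = OBJECT_ID('dbo.YourProcedure')  -- placeholder procedure name
    ORDER BY rs.avg_duration DESC;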

    Like I said, stored procedure code and execution plan are minimum requirements for us to be able to provide non-speculative help.

    John

    Thanks, yes, I've realized that 🙁

    A slight update.
    I have been running Profiler against two runs: one that suddenly fails and one that succeeds.
    In the run that fails, at the point of failure I see a lot of these statements:

    Clustered Index Scan(OBJECT:([DATABASE].[sys].[filetable_updates_2105058535].[FFtUpdateIdx]), ORDERED FORWARD)

    These run 3 times a minute for 2 hours before the main job fails.
    We don't have any FILESTREAM or Always On running, and I have read that these scans are common and low impact. But as they don't occur on the successful run, it is a bit of a coincidence.
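
    To rule it out, a quick check for FileTables and the FILESTREAM setting would be something like this (a sketch; is_filetable is only there on 2012 and later):

    select serverproperty('FilestreamEffectiveLevel') as filestream_level;

    select name
    from sys.tables
    where is_filetable = 1;  -- any FileTables in the user database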

    Do you think this could be the problem/bug? And any thoughts on why these should suddenly start appearing?
    Any thoughts would be appreciated

    Graeme
