Help diagnosing performance issue

  • We have a nightly ETLV job that each of our 30 clients runs every night. Depending on the size of the client, it can run from 1 to 7 hours. We have one client, clientx, where the ETLV job takes about 7 hours, but about 3 times a month it takes 20 hours. Looking at the logs, it is always the same proc, procx, that is the issue. Normally procx runs at clientx for 35 minutes, but on these occasions it runs 838, 839, 828, 835, or 825 minutes, and then finishes successfully! There does NOT seem to be a pattern in day of week, time of day procx starts, time of day procx finishes, amount of data, etc. But it is strange that if it does not finish in 35 or so minutes, it finishes in 832 ± 7 minutes.

    Yesterday, when the ETLV at clientx was running over by several hours, I could tell from our log that it was in procx. I ran sp_who2 and there was nothing in the BlkBy column for any SPID. Unfortunately I am not good at diagnosing performance problems and do not remember exactly what sp_who2 said about procx. I know it was not taking any CPU; I believe the Status was SUSPENDED and the Command said UPDATE.

    What can I do so that in the future I can better diagnose the issue? I know that it is procx, which has 3 insert and 10 update statements against the same db. I do not want to do something that will impact the entire ETLV job, which is already pushing its 7-hour limit.

    I want some diagnosis steps that I can run if I am called while procx is running, and some diagnosis steps that I can set up in advance, because many times I am just told, "oh, yesterday we had that issue again."

    Thanks

  • It's hard to say without any code, but whenever something like this happens, the first thing I look for is "bad" parameter sniffing. I'd start by reading up on it in this blog post.
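
    If it does turn out to be sniffing, a quick test is to force a fresh compile and see whether the runtime snaps back to 35 minutes. A sketch only; 'dbo.procx' and the update below are placeholders, not your actual code:

    -- one-off test: evict procx's cached plans so the next run compiles fresh
    exec sp_recompile 'dbo.procx';

    -- or, inside procx, pin the suspect statement to a per-run plan
    -- (hypothetical table/parameter names)
    update dbo.SomeTable
    set    SomeCol = @SomeValue
    where  ClientId = @ClientId
    option (recompile);   -- pays a compile each run, but the plan fits these parameters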

  • You can try looking at the query plan to see if it is doing table scans during the procedure, and index appropriately. I would definitely read the link about parameter sniffing.
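
    If you want to see the plan while procx is actually running, the plan-cache DMVs will hand it to you. A sketch; nothing here is specific to your system:

    -- cached plan for every live request; look for table/clustered index scans
    select r.session_id, r.status, r.wait_type, r.wait_time, qp.query_plan
    from sys.dm_exec_requests r
    cross apply sys.dm_exec_query_plan(r.plan_handle) qp
    where r.session_id <> @@spid;

    Click the query_plan XML in SSMS to open it graphically.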

    Tom

  • bgrossnickle (12/11/2014)



    Could be bad parameter sniffing, as the others have suggested. Here are a couple of things you could do to investigate beyond the parameter sniffing angle.

    First, you could get Adam Machanic's sp_WhoIsActive and run that while the ETLV is going. It is a custom script built on the ideas behind sp_who2. Lots of people use it, so it is safe and you can easily get help understanding what it does.

    Second, you could get sp_AskBrent from BrentOzar.com and run that to capture what is happening as well. You absolutely can set up either of these two procedures to run in a job and log data to a table for later review (see the sketch below).

    A third option is to set up a job that captures long-running queries to a trace file. You can then review the trace file after the fact to pinpoint high-duration statements. If you do this, I would definitely set it up so that it also captures blocking information. This third option could have a much higher impact on your system than the first two.
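
    For example, here is a minimal sketch of logging sp_WhoIsActive to a table from a scheduled job. It assumes sp_WhoIsActive is installed; the table name dbo.WhoIsActive_Log is just a placeholder, so pick your own:

    -- one-time setup: let sp_WhoIsActive generate the logging table for you
    declare @schema varchar(max);
    exec sp_WhoIsActive @return_schema = 1, @schema = @schema output;
    set @schema = replace(@schema, '<table_name>', 'dbo.WhoIsActive_Log');
    exec (@schema);

    -- job step, scheduled every minute or two during the ETLV window
    exec sp_WhoIsActive
         @get_plans = 1,                      -- include the query plan
         @find_block_leaders = 1,             -- flag the head of any blocking chain
         @destination_table = 'dbo.WhoIsActive_Log';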

  • This sounds like the ETL performance gremlins I've been chasing lately. If we take it that the procedure and database are capable of better performance, I would suspect the problem is outside SQL.

    First, is the server on a SAN or in a VM? If so, it is possible that your increased write times are not because of SQL, but because of increased latency and IO wait times getting to the disks. Your SQL instance may be idle with plenty of resources while another server on the same blade is having a snapshot taken, or the SAN is replicating.
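
    A quick way to sanity-check the storage theory is per-file latency from sys.dm_io_virtual_file_stats. These counters are cumulative since the instance started, so during an incident take two snapshots a few minutes apart and compare the deltas. A sketch:

    -- average read/write latency per database file since instance startup
    select db_name(vfs.database_id) as db,
           mf.physical_name,
           vfs.io_stall_read_ms  / nullif(vfs.num_of_reads, 0)  as avg_read_ms,
           vfs.io_stall_write_ms / nullif(vfs.num_of_writes, 0) as avg_write_ms
    from sys.dm_io_virtual_file_stats(null, null) vfs
    join sys.master_files mf
      on mf.database_id = vfs.database_id
     and mf.file_id     = vfs.file_id
    order by avg_write_ms desc;

    If writes that normally average a few ms are running at tens or hundreds of ms during the slow window, the SAN/VM layer is where to dig.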

    Assuming all clients are executing the same procedure, performance has to be a factor of when the jobs are run, the size of the dataset being inserted, or the bandwidth between you and that client. (Is the long-running client on a remote network, by chance? What is that connection like, and what else is going through that pipe at load time?) You could prove or disprove this theory by setting up a test scenario where you load the same large data set over a local connection, removing the LAN/WAN from the equation.
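
    One quick check on the bandwidth theory while the job is stuck (a sketch): a request that is waiting on the client or the network to consume its results typically shows ASYNC_NETWORK_IO as its wait type:

    -- live requests currently stalled on the client/network
    select session_id, status, wait_type, wait_time as wait_ms, last_wait_type
    from sys.dm_exec_requests
    where wait_type = N'ASYNC_NETWORK_IO'
       or last_wait_type = N'ASYNC_NETWORK_IO';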

    Alternatively, it could be other queries/jobs on the box that are not causing locks but are keeping the server tied up.

    Here's a bit of code I wrote that I run in front of my maintenance plans and ETL jobs to record what other open connections exist while my code is running. You can drop this into your own database and inspect connections on a specific database, or, if you control the database in question, drop the history table and detective procedure in there. Once you do that, you just need to arrange for the call to capture history before your other jobs run. You might want to enhance this procedure/history table with the SPID or user name of whoever caused the history to be written.

    If you find that not only are there no locks during the load, but SQL is also not busy doing other things, then the issue has to be external.

    if not exists (select 'x' from sys.schemas where name = 'ID')
        exec('create schema ID');
    go

    if not exists (select 'x' from sys.objects where name = 'LoadTime_Foreign_Process_History' and type = 'U')
        create table ID.LoadTime_Foreign_Process_History
        (
            spid         varchar(20),
            db_name      varchar(256),
            user_name    varchar(256),
            blocked      smallint,
            waitresource varchar(512),
            physical_io  bigint,
            memusage     int,
            login_time   datetime,
            last_batch   datetime,
            status       varchar(60),
            hostname     varchar(256),
            program_name varchar(256),
            hostprocess  varchar(20),
            cmd          varchar(32),
            command      nvarchar(max),  -- full statement text (nvarchar(max) instead of the deprecated TEXT)
            lastwaittype varchar(64),
            UPDATE_TS    datetime default getdate()
        );
    go

    if exists (select * from sys.objects
               where object_id = object_id(N'[ID].[Foreign_Process_Detective]') and type = 'P')
        drop procedure [ID].[Foreign_Process_Detective];
    go

    create procedure [ID].[Foreign_Process_Detective]
        (@DbName sysname = NULL)  -- defaults to the current database when omitted
    as
    if @DbName is null
        select @DbName = db_name();

    set nocount on;

    begin try
        insert into ID.LoadTime_Foreign_Process_History
            (spid, db_name, user_name, blocked, waitresource, physical_io, memusage, login_time,
             last_batch, status, hostname, program_name, hostprocess, cmd, command, lastwaittype, update_ts)
        select
            case when pr.spid = @@spid then 'This' else convert(varchar(10), pr.spid) end as spid,
            rtrim(db_name(pr.dbid)) as db_name,
            rtrim(pr.loginame) as user_name,
            pr.blocked, pr.waitresource, pr.physical_io, pr.memusage, pr.login_time,
            pr.last_batch, pr.status, rtrim(pr.hostname), rtrim(pr.program_name),
            pr.hostprocess, pr.cmd,
            (select text from sys.dm_exec_sql_text(pr.sql_handle)) as command,
            pr.lastwaittype, getdate()
        from master.sys.sysprocesses pr
        where lower(db_name(pr.dbid)) = lower(@DbName);  -- lower both sides so the match is case-insensitive

        print('Foreign Process History written to: LoadTime_Foreign_Process_History for database ' + @DbName);
    end try
    begin catch
        print('Unable to write process activity to LoadTime_Foreign_Process_History. Actual Error: ' + ERROR_MESSAGE());
    end catch
    go

    You can call the procedure with or without a database name; without one, it uses the current database.

    exec ID.Foreign_Process_Detective;
    go

    exec ID.Foreign_Process_Detective 'DATAMART';
    go

    select * from ID.LoadTime_Foreign_Process_History;
    go

    Hope that helps or gives you some places to look.
