Stuck with a unique JOIN scenario

  • I have a scenario where I need to return values in a very unique way given the schema that is in place that I can't modify.

    I have the following tables:

    The Workflow table will contain workflow records. From a business perspective Workflows are like decision trees that can contain/kickoff zero/one/multiple child workflows. But each workflow, whether it's a child or parent, will contain a record in this table. The other gotcha is that there is no limit to the number of children a workflow can contain. One parent workflow can contain multiple children, which also contain children, etc., etc. And the parent workflow will have it's own steps (WorkflowDetails) that do its' own thing.

    CREATE TABLE [dbo].[Workflow] (

    Id int IDENTITY(1,1) NOT NULL,

    Name nvarchar(30) NULL,

    RequestedOn datetime NULL,

    InternalStatus nchar(2) NULL,

    LastUpdatedDateTime datetime NULL

    )

    Example Data:

    Id Name RequestedOn InternalStatusLastUpdatedDateTime

    20379Child3 2015-11-06 13:53:07.043 FN2015-11-06 13:53:10.663

    20378Child2 2015-11-06 13:53:04.640 FN2015-11-06 13:53:05.873

    20377Child1 2015-11-06 13:51:46.217FN2015-11-06 13:53:03.580

    20376Parent1 2015-11-06 13:51:26.187FN2015-11-06 13:53:11.770

    20375Parent2 2015-11-05 19:53:59.110FN2015-11-05 19:54:01.530

    18217 Child4 2015-08-18 21:09:27.797 AB 2015-08-18 21:11:04.573

    18216 Parent3 2015-8-18 21:09:24.753 AB 2015-08-18 21:11:04.513

    The WorkflowDetails table will contain all of the steps within every Workflow.

    CREATE TABLE dbo.WorkflowDetails (

    ID int IDENTITY(1,1) NOT NULL,

    ExecutionID int NULL, --This is Workflow.Id from above

    Action nvarchar(100) NULL

    )

    WorkflowDetails example data:

    Id ExecutionIDAction

    30508420376Finish Marker1

    30508320376Create Contract

    30508220376Create Contract Details

    30508120379Set Workflow output variable3

    30508020379VM Details

    30507920378Rule1

    30507820379Decision1

    30507720379Start

    280376 18217 Choice1

    280375 18217 Gate2

    280370 18216 CallChildWorkFlow1

    280369 18216 Decision4

    The only way to map what workflows are tied to one another in the parent/child relationship is with another table called WorkflowRelations

    CREATE TABLE [dbo].[WorkflowRelations](

    [ExecutionId] [int] NOT NULL,

    [ParentExecutionId] [int] NOT NULL,

    [TopParentExecutionId] [int] NOT NULL

    )

    Sample data from WorkflowRelations:

    ExecutionId ParentExecutionIdTopParentExecutionId

    20377 20376 20376

    20378 20376 20376

    20379 20376 20376

    18217 18216 18216

    Things get even more interesting because there is a Request table that houses those processes that kick off the main workflow.

    CREATE TABLE Request(

    id int IDENTITY(1,1) NOT NULL,

    user_id nvarchar(50) NULL,

    productId nvarchar(50) NULL,

    execution_id int NULL,

    orderNum int NULL,

    createDate datetime NULL

    )

    The Request.execution_id is always the parent workflow execution Id.

    Example data from Request:

    iduser_idproductId execution_id orderNum createDate

    1289testacctVMSetup 20376 1285 2015-11-06 08:51:23.493

    1259 testacct SecurityBadge 18216 1259 2015-08-18 17:09:24.067

    So now my issue....For each Request.execution_id that has the potential to have it's own workflowdetails, child workflowdetails, child of child workflowdetails, etc. I need to return one row that contains the following:

    For 20376:

    RequestNo - which is Request.orderNum + '-' + Request.execution_id = 1285-20376

    Request.user_id = testacct

    Request.productId = VMSetup

    Request.createDate = 2015-11-06 08:51:23.493

    Workflow.Name = Parent1 - Because I only want to track the parent Workflow....if it exists as the engine may have died after the Request was generated and before the Workflow kicked off

    Workflow.InternalStatus = FN - This is the latest status of Parent1. Potential values are as follows:

    WHEN 'FN' THEN 'Finished'

    WHEN 'FE' THEN 'Finished with Errors'

    WHEN 'AB' THEN 'Aborted'

    WHEN 'EX' THEN 'Running'

    WHEN 'PA' THEN 'Paused'

    WHEN 'SL' THEN 'Waiting'

    ELSE 'Other'

    WorkflowDetails.ExecutionId = 20376 - The issue is I want the last WorkflowDetails.ExecutionId that exists for the entire chain. In this case since the full workflow finished, it's the last step of Parent1 (20376). But, if for some reason the workflow failed in Child2 at WorkflowDetails.Id = 305079 and everything above 305079 didn't happen due to this, I'd need 20378 to be the value that is returned here.

    WorkflowDetails.Action - name associated with WorkflowDetails.ExecutionId

    The previous person that wrote the query used a CURSOR and I'm trying to rewrite using a set-based approach because the cursor took over 4 minutes to run with the amount of data.

    Here is the script that I have but I'm not getting the correct child workflow.

    DECLARE @ParamStartDate datetime = NULL, @ParamEndDate datetime= NULL-- Input dates

    , @ServiceId nvarchar(max)= NULL-- To filter by Service

    , @ParentWFName nvarchar(max)= NULL-- To filter by Parent WF Name

    , @ReportType nvarchar(30) = 'Debug'

    DECLARE @StartDate datetime = COALESCE(@ParamStartDate,DATEADD(Month,-3,GETDATE()))

    DECLARE @TempEndDate date = COALESCE(@ParamEndDate,GETDATE())

    DECLARE @EndDate datetime = cast(cast(@TempEndDate as nvarchar(10)) + ' 23:59:59' as datetime)

    ---==================================================================================================

    -- DECLARE Table to store results from processing each requests getting details from the executed WFs

    DECLARE @WFdetails TABLE(ParentWF_ExecutionID int, LastWF_Name nvarchar(100), LastWF_InternalStatus nchar(2)

    , LastWF_ExecutionID int, LastWF_ActionName nvarchar(100))

    DECLARE @WFName nvarchar(100)

    DECLARE @WFId int

    DECLARE @TotalRequestsReported int

    -- =================

    -- GET THE LIST OF REQUESTS

    SELECT

    RD.id, RD.user_id, RD.productId, RD.execution_id, RD.orderNum, RD.createDate

    INTO #RequestDetailedList

    FROM [dbo].[Request] RD with (nolock)

    LEFT OUTER JOIN Workflow SWE with (nolock) on

    RD.execution_id = SWE.id --In case there are any requests that don't have an associated Workflow

    WHERE createDate >= @StartDate

    AND createDate <= @EndDate

    AND (@ServiceName IS NULL OR RD.productId=@ServiceId) --If we wanted to target a specific product or request

    AND (@ParentWFName IS NULL OR SWE.Name=@ParentWFName)

    order by id desc

    --- Removed cursor and went with set based approach below

    -- DECLARE cursor to iterate on Requests

    /*DECLARE WFLIST CURSOR FOR

    SELECT execution_id from #RequestDetailedList with (nolock)

    -- Iterate on the list of requests getting details from the WF execution

    OPEN WFLIST

    FETCH NEXT FROM WFLIST

    INTO @WFId

    WHILE @@FETCH_STATUS = 0

    BEGIN

    -- FOR EACH EXECUTION_ID from the Request list, get the last executed action

    INSERT INTO @WFdetails

    SELECT TOP 1 @WFId as ParentWF_ExecutionID

    , SWE.Name as LastWF_Name

    , SWE.InternalStatus as LastWF_InternalStatus

    , SWED.ExecutionID as LastWF_ExecutionID

    , Action as LastWF_ActionName

    FROM [dbo].WorkflowDetails SWED with (nolock)

    INNER join [dbo].[Workflow] SWE with (nolock)

    on SWED.ExecutionID = SWE.ID

    where SWED.ExecutionID in (

    SELECT exid FROM dbo.udf_get_related_workflows2(@WFId) --this UDF returns a table containing the parent Workflow and all children under parent

    ) order by SWED.Id desc

    FETCH NEXT FROM WFLIST

    INTO @WFId

    END

    CLOSE WFLIST

    DEALLOCATE WFLIST

    */

    INSERT INTO @WFdetails

    SELECT A.ParentWF_ExecutionID

    , A.LastWF_Name

    , A.LastWF_InternalStatus

    , A.LastWF_ExecutionID

    , SWED.Action as LastWF_ActionName

    FROM

    [dbo].WorkflowDetails SWED with (nolock)

    INNER JOIN

    (

    SELECT RDL.execution_id as ParentWF_ExecutionID

    , SWE.Name as LastWF_Name

    , SWE.InternalStatus as LastWF_InternalStatus

    , SWED.ExecutionID as LastWF_ExecutionID

    , MAX(SWED.Id) AS MaxSWEDId

    FROM [dbo].WorkflowDetails SWED with (nolock)

    INNER JOIN #RequestDetailedList RDL on SWED.ExecutionID = RDL.execution_id

    INNER join [dbo].Workflow SWE with (nolock)

    on SWED.ExecutionID = SWE.ID

    where SWED.ExecutionID in (

    SELECT exid FROM dbo.udf_get_related_workflows2(RDL.execution_id)

    )

    GROUP BY RDL.execution_id

    , SWE.Name

    , SWE.InternalStatus

    , SWED.ExecutionID ) A ON A.MaxSWEDId = SWED.ID

    order by SWED.Id desc

    -- CREATE the final result with details for each request

    select cast(RDL.order_number as nvarchar(20)) + '-' + cast(RDL.execution_id as nvarchar(20)) as RequestNo

    , RDL.user_logon as Request_SubmittedBy

    , RDL.product_name as Request_ServiceName

    , RDL.createDate as Request_CreateDate

    , SWE.Name as ParentWF_Name

    , SWE.InternalStatus as ParentWF_InternalStatus

    , LastWF.*

    , (CASE LastWF_InternalStatus -- Based on the internal status and the last action executed, the request get the real final status

    WHEN 'FN' THEN 'Finished Unexpectedly'

    WHEN 'FE' THEN 'Finished with Errors'

    WHEN 'AB' THEN 'Aborted'

    WHEN 'EX' THEN 'Running'

    WHEN 'PA' THEN 'Paused'

    WHEN 'SL' THEN 'Waiting'

    ELSE 'Other'

    END) as ServiceRealStatus

    , CAST(YEAR(RDL.createDate) as varchar(4)) + '-' + RIGHT('0'+CAST(MONTH(RDL.createDate) as varchar(2)),2) AS Request_MonthYear

    into #Results

    from @WFdetails LastWF

    left JOIN #RequestDetailedList RDL with (nolock)

    on RDL.execution_id = LastWF.ParentWF_ExecutionID

    left join[dbo].Workflow SWE with (nolock)

    on SWE.ID=RDL.execution_id

    order by Request_CreateDate DESC

    SET @TotalRequestsReported = @@ROWCOUNT

    -- RETURN THE EXPECTED RESULTS BASED ON THE REPORT TYPE

    -- ============================================================

    IF @ReportType = 'Debug'

    SELECT * FROM #Results ORDER BY ServiceRealStatus

    Any thoughts on my set-based query to return the correct data so that I can find the max(SWED.Id) for the parent/child in the Request and then associate it back to the original parent Workflow and Request? If you use my data and query above, the data for 18216 doesn't come out correctly. It's only finding the max WorkflowDetailsId for 18216 where it needs to pull in the max WorkflowDetailsId for 18217 that is associated with it.

  • I am sure what you wrote took a considerable amount of time to write. But I am afraid at least I cannot quickly determine a solution to your problem without seeking many clarifications. To avoid a need to seek many clarifications, I think readers (or at least I) would be more able to address your problem, if you minimized the use of English and maximized the use of T-SQL. For example, if you were to provide readers with no more than one runnable script, we may be able to help. That script should include ALTER TABLE ... CONSTRAINT ... declarations (instead of writing "This is Workflow.Id from above" or writing "another in the parent/child relationship ", or not identifying a primary key in table WorkflowRelations, etc.), should include INSERT statements (instead of "Sample data from" or "Example data"), should omit columns or objects that are not needed in the T-SQL (such as the Id column for the Workflow table, @ReportType, etc.), and should include complete DDL for all objects (such as SWExecute and udf_get_related_workflows2). The script you offer should be followed by the actual results (using the scripted INSERT data) and the expected results. If you can run that script in a new, empty database, you are assured that we can run the same script, which means we will know your foundation :). I realize what I just wrote is a lot of work, but in addition to helping us, it will help you boil a problem down to its barest form;-).

    Looking at the overall T-SQL, I am not certain the use of a cursor is the cause of your performance concern. It appears there are other problems with the T-SQL. Because I cannot understand your problem (please accept my apologies), I will instead offer you some general performance tips.

    Table variables

    DECLARE @WFdetails TABLE

    (ParentWF_ExecutionID int,

    LastWF_Name nvarchar(100),

    LastWF_InternalStatus nchar(2),

    LastWF_ExecutionID int,

    LastWF_ActionName nvarchar(100))

    are convenient, but SQL Server 2008 always considers a table variable to contain merely one row (one page of IO). When a table variable is used in a JOIN

    from @WFdetails LastWF

    left JOIN #RequestDetailedList RDL

    there is a distinct risk that SQL Server will incorrectly assume the table variable is less costly (has fewer rows) than the table being JOINed (#RequestDetailedList). You can avoid that risk by instead using a Common Table Expression (CTE). You already have most of the CTE written, starting with SELECT A.ParentWF_ExecutionID By using a CTE instead of a table variable, you will be allowing SQL Server to see statistics that estimate the actual number of rows (as opposed to SQL Server assuming just one row) The same applies to #RequestDetailedList, even though SQL Server will recompile this batch (to get the current statistics for #RequestDetailedList) at certain thresholds (triggered by data inserts into #RequestDetailedList).

    It may appear INSERT INTO @WFdetails has SELECTed an order by SWED.Id desc, but relational database tables (and table variables) never have a defined order. You must always SELECT the order, even if it appears you had INSERTed an "order". That is true even if you run a simple SELECT * FROM @WFdetails (without an ORDER BY clause) and you see results in the expected order. Unsure whether this relates to unexpected results, but I doubt this is a performance concern.

    Short-circuiting predicates (i.e. WHERE clauses and JOIN clauses), such as AND (@ServiceName IS NULL OR RD.productId=@ServiceId) --If we wanted to target a specific product or request

    AND (@ParentWFName IS NULL OR SWE.Name=@ParentWFName)

    will frequently introduce poor performance. See Dynamic Search Conditions in T-SQL. In particular section "4. Static SQL with OPTION (RECOMPILE)" Given there are only 4 permutations of the above predicates (the permutations are @ServiceName IS NULL AND @ParentWFName IS NULL, @ServiceName IS NOT NULL AND @ParentWFName IS NULL, @ServiceName IS NULL AND @ParentWFName IS NOT NULL, and @ServiceName IS NOT NULL AND @ParentWFName IS NOT NULL), you could move those variables into 4 control-of-flow IF statements. Doing so would allow SQL Server to compile 4 execution plans, each tailored for each of the 4 permutations.

    Using UDF's in predicates, such aswhere SWED.ExecutionID in (

    SELECT exid FROM [PMGSPE].dbo.udf_get_related_workflows2(@WFId) --this UDF returns a table containing the parent Workflow and all children under parent

    is also risky (or not, depending upon its DDL). I suspect a CTE could be used to replace the UDF, JOINed with SWED.ExecutionID. By using a CTE instead of a UDF, a future developer (or you) would be able to consider all objects accessed by this batch/stored procedure (without having to inspect the DDL for the UDF).

    Given that the previous author used the NOLOCK hint on certain tables, I suspect those tables are where your performance concern actually exists (as opposed to the cursor's rather simple SELECT). I would consider whether the NOLOCK'ed predicates

    RDL.execution_id = LastWF.ParentWF_ExecutionID and SWE.ID=RDL.execution_id

    and

    on SWED.ExecutionID = SWE.ID

    where SWED.ExecutionID in (

    SELECT exid FROM dbo.udf_get_related_workflows2(RDL.execution_id)

    ) and WHERE createDate >= @StartDate

    AND createDate <= @EndDate

    AND (@ServiceName IS NULL OR RD.productId=@ServiceId) --If we wanted to target a specific product or request

    AND (@ParentWFName IS NULL OR SWE.Name=@ParentWFName)

    are indexed, and create indexes (to test performance) if they are not.

    If you are instead interested in getting a quick answer (about where a performance problem exists), BCP OUT the data (or backup the database), script out the tables (if not using a backup), find an otherwise very inactive SQL Server, create or restore a test database (if not restored, run the scripts), BCP IN the data, open SSMS query window, select @@SPID, configure a SQL Server Profiler trace to collect the SP:STmtCompleted event (filtered by the @@SPID), start the SQL Server Profiler trace, run the batch (or stored procedure), stop the trace, and find the maximum duration in the trace. I suspect you will find the longest duration will be spent inserting into #RequestDetailedList, inserting into @WFdetails, or selecting into #Results (i.e. not the cursor's select from #RequestDetailedList). The answer depends upon how many rows are in each table, the table variable or the temporary table, and what SQL Server knows or doesn't know about the expected row counts within them.

    It may be tempting (because it is extremely quick) to use SSMS to generate a Estimated Query Plan for the T-SQL as written. Unfortunately, an Estimated Query Plan will not reveal missing indexes and will not display accurate cost estimates. Those will not happen because the T-SQL as written uses temporary tables and table variables (which are not populated until run time, well after the estimated plan is created). By replacing them with CTEs, the estimated plan will be more accurate and a missing index might be suggested (that would not be suggested when table variables and temporary tables are used). There is still the problem of variable values not being known to SQL Server at the time when the estimated plan is compiled and displayed, but that can be partially overcome by replacing variables with a typical value, expressed as a literal value (such as WHERE SWE.Name='A Typical Parent Workflow Name' instead of WHERE SWE.Name=@ParentWFName). With the T-SQL as written, inspecting its estimated query plans will be misleading. In contrast, a trace of the T-SQL as written (when run against realistic data) will focus your efforts, by more reliably exposing needed performance fixes.

    In short, I think you risk wasting precious time by rewriting the cursor, because I think the batch's salient performance problem does not concern the cursor:cool:

  • Thanks for the suggestions.

  • I see scalar UDF's in the where clause operating on a column and not a variable.

    Convert the UDF to a TVF, then cross apply onto the TVF.

    Not just any TVF, a single statement TVF.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • Quite formally you may first refactor cursor part of code to

    INSERT INTO @WFdetails

    select RDL.execution_id as ParentWF_ExecutionID,

    zzz.LastWF_Name,

    ...

    from #RequestDetailedList RDL

    cross apply (

    SELECT TOP 1 SWE.Name as

    , SWE.InternalStatus as LastWF_InternalStatus

    , SWED.ExecutionID as LastWF_ExecutionID

    , Action as LastWF_ActionName

    FROM [dbo].WorkflowDetails SWED with (nolock)

    INNER join [dbo].[Workflow] SWE with (nolock)

    on SWED.ExecutionID = SWE.ID

    where SWED.ExecutionID in (

    SELECT exid FROM dbo.udf_get_related_workflows2(RDL.execution_id) --this UDF returns a table containing the parent Workflow and all children under parent

    ) order by SWED.Id desc

    ) zzz

    and then try to optimize it.

  • Thanks Serg and MadAdmin....I took your suggestions and did a similar approach based on the specific outcome I was looking for and I got it to work.

    Much appreciated!

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply