Most recent non-null record; Tracking data provenance in calculations

  • Chrissy321

    SSCoach

    Points: 15486

    First off thanks to all those who share their expertise on this forum. Apology #1: I am really in SQL Server 2005. An upgrade is in the works so a more preferred solution in 2016 would be a bonus. Apology #2: I have to obfuscate my data so much that it becomes very generic but hopefully I have fully demonstrated the problem.

    This is really a two-part question. I have a vendor who provides data on a daily basis about certain entities. I take this data and follow business rules to generate my own calculated value.

    1) If the vendor fails to provide a data point on the most recent day I want to use the most recent non-null value for that entity. Any of my datapoints can be null  and this most recent non-null value logic needs to apply to any of them(DataPoint1,DataPoint2, and DataPoint3). This article seems to be state of the art but I don't believe it will not work in 2005. https://www.itprotoday.com/software-development/last-non-null-puzzle

    2) Given that the value of my calculation can be sourced from any of the datapoints on any of the dates I want to be able to track which data point provided the value and from which day.

    Thanks if you can help.
    IF OBJECT_ID('dbo.SourceData', 'U') IS NOT NULL DROP TABLE dbo.SourceData;

    CREATE TABLE dbo.SourceData
    (
        EntityID INT,
        ReportingDate DATETIME,
        EntityType CHAR(1),
        DataPoint1 DECIMAL(5,1),
        DataPoint2 DECIMAL(5,1),
        DataPoint3 DECIMAL(5,1)
    )

    INSERT INTO dbo.SourceData VALUES (1,'2018-07-01','A',1.3,2.0,3.0)
    INSERT INTO dbo.SourceData VALUES (2,'2018-07-01','B',3,3,3)
    INSERT INTO dbo.SourceData VALUES (3,'2018-07-01','C',2.1,2.3,2.7)
    INSERT INTO dbo.SourceData VALUES (4,'2018-07-01','C',2.6,2.3,2.7)
    INSERT INTO dbo.SourceData VALUES (5,'2018-07-01','C',2.6,2.6,2.7)
    INSERT INTO dbo.SourceData VALUES (1,'2018-07-02','A',NULL,2.0,3.0)
    INSERT INTO dbo.SourceData VALUES (2,'2018-07-02','B',3,3,3)
    INSERT INTO dbo.SourceData VALUES (3,'2018-07-02','C',2.1,2.3,2.7)
    INSERT INTO dbo.SourceData VALUES (4,'2018-07-02','C',2.6,2.3,2.7)
    INSERT INTO dbo.SourceData VALUES (5,'2018-07-02','C',2.6,2.6,NULL)

    --This demonstrates my calculation but the expected results should have one record per EntityID
    SELECT
    CASE EntityType
    WHEN 'A' THEN DataPoint1
    WHEN 'B' THEN COALESCE(DataPoint1,DataPoint2)
    ELSE    
        CASE WHEN DataPoint1 > DataPoint2 THEN DataPoint1 ELSE DataPoint3 END
    END
    AS MyCalculation,
    *
    FROM dbo.SourceData

    --Expected results
    SELECT '1.3' AS MyCalculation,'1' AS EntityID, '2018-07-01' AS SourceDate,'DataPoint1' AS SourceDatapoint UNION ALL
    SELECT '3.0'AS MyCalculation,'2' AS EntityID,'2018-07-02' AS SourceDate,'DataPoint1' AS SourceDatapoint UNION ALL
    SELECT '2.7'AS MyCalculation,'3' AS EntityID,'2018-07-02' AS SourceDate,'DataPoint3' AS SourceDatapoint UNION ALL
    SELECT '2.6'AS MyCalculation,'4' AS EntityID,'2018-07-02' AS SourceDate,'DataPoint1' AS SourceDatapoint UNION ALL
    SELECT '2.7'AS MyCalculation,'5' AS EntityID,'2018-07-01' AS SourceDate,'DataPoint3' AS SourceDatapoint

  • Luis Cazares

    SSC Guru

    Points: 183519

    Probably something like this?

    WITH CTE AS(
      SELECT
       CASE
        WHEN EntityType = 'A' THEN DataPoint1
        WHEN EntityType = 'B' THEN COALESCE(DataPoint1,DataPoint2)
        WHEN DataPoint1 > DataPoint2 THEN DataPoint1
        ELSE DataPoint3
       END
       AS MyCalculation,
       CASE
        WHEN EntityType = 'A' THEN 'DataPoint1'
        WHEN EntityType = 'B' AND Datapoint1 IS NOT NULL THEN 'DataPoint1'
        WHEN EntityType = 'B' AND Datapoint1 IS NULL THEN 'DataPoint2'
        WHEN DataPoint1 > DataPoint2 THEN 'DataPoint1'
        ELSE 'DataPoint3'
       END AS SourceDataPoint,
       ROW_NUMBER() OVER(PARTITION BY EntityID ORDER BY ReportingDate DESC) AS rn,
       EntityID,
       ReportingDate AS SourceDate
      FROM dbo.SourceData
      WHERE CASE
        WHEN EntityType = 'A' THEN DataPoint1
        WHEN EntityType = 'B' THEN COALESCE(DataPoint1,DataPoint2)
        WHEN DataPoint1 > DataPoint2 THEN DataPoint1
        ELSE DataPoint3
       END IS NOT NULL
    )
    SELECT CTE.MyCalculation,
      CTE.EntityID,
      CTE.SourceDate,
      CTE.SourceDataPoint
    FROM CTE
    WHERE CTE.rn = 1;

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Chrissy321

    SSCoach

    Points: 15486

    Very nice, thank you. Putting aside the most recent non-null record issue I started toying with the below. The advantage being that I don't have too have a second expanded calculation which may simplify matters if the logic gets deep. Definitely kludgey since I don't have a ways to distribute unique identifiers across these 3 fields:  DataPoint1ID, DataPoint2ID and DataPoint3ID.

    I expect I'll go with your solution but I thought I'd share my thought experiment. Thanks again.

    IF OBJECT_ID('dbo.SourceData', 'U') IS NOT NULL DROP TABLE dbo.SourceData;

    CREATE TABLE dbo.SourceData
    (
        EntityID INT,
        ReportingDate DATETIME,
        EntityType CHAR(1),
        DataPoint1 DECIMAL(5,1),
        DataPoint1ID INT,
        DataPoint2 DECIMAL(5,1),
        DataPoint2ID INT,
        DataPoint3 DECIMAL(5,1) ,
        DataPoint3ID INT
    )

    INSERT INTO dbo.SourceData VALUES (1,'2018-07-01','A',1.3,1,2.0,11,3.0,21)
    INSERT INTO dbo.SourceData VALUES (2,'2018-07-01','B',3,2,3,12,3,22)
    INSERT INTO dbo.SourceData VALUES (3,'2018-07-01','C',2.1,3,2.3,13,2.7,23)
    INSERT INTO dbo.SourceData VALUES (4,'2018-07-01','C',2.6,4,2.3,14,2.7,24)
    INSERT INTO dbo.SourceData VALUES (5,'2018-07-01','C',2.6,5,2.6,15,2.7,25)

    --Luis's solution
    SELECT
      CASE
      WHEN EntityType = 'A' THEN DataPoint1
      WHEN EntityType = 'B' THEN COALESCE(DataPoint1,DataPoint2)
      WHEN DataPoint1 > DataPoint2 THEN DataPoint1
      ELSE DataPoint3
      END
      AS MyCalculation,
      CASE
      WHEN EntityType = 'A' THEN 'DataPoint1'
      WHEN EntityType = 'B' AND Datapoint1 IS NOT NULL THEN 'DataPoint1'
      WHEN EntityType = 'B' AND Datapoint1 IS NULL THEN 'DataPoint2'
      WHEN DataPoint1 > DataPoint2 THEN 'DataPoint1'
      ELSE 'DataPoint3'
      END AS SourceDataPoint,
      EntityID,
      ReportingDate AS SourceDate
    FROM dbo.SourceData
    WHERE CASE
      WHEN EntityType = 'A' THEN DataPoint1
      WHEN EntityType = 'B' THEN COALESCE(DataPoint1,DataPoint2)
      WHEN DataPoint1 > DataPoint2 THEN DataPoint1
      ELSE DataPoint3
      END IS NOT NULL

    IF OBJECT_ID('tempdb..#DataPointSourceIDS') IS NOT NULL DROP TABLE #DataPointSourceIDS
    IF OBJECT_ID('tempdb..#SourceIDS') IS NOT NULL DROP TABLE #SourceIDS

    SELECT
    *
    INTO #DataPointSourceIDS
    FROM
    (
        SELECT
                Datapoint1 AS DataPoint,
                DataPoint1ID AS DataPointID,
                ReportingDate AS SourceDate,
                'Datapoint1' AS SourceDatePoint
            FROM dbo.SourceData
            UNION ALL
            SELECT
                    Datapoint2,
                    DataPoint2ID,
                ReportingDate AS SourceDate,
                'Datapoint2' AS SourceDatePoint
            FROM dbo.SourceData
            UNION ALL
            SELECT
                Datapoint3,
                DataPoint3ID,
                ReportingDate AS SourceDate,
                'Datapoint3' AS SourceDatePoint
            FROM dbo.SourceData
    ) DPSIDS

    SELECT
    CASE EntityType
    WHEN 'A' THEN DataPoint1ID
    WHEN 'B' THEN COALESCE(DataPoint1ID,DataPoint2ID)
    ELSE    
        CASE WHEN DataPoint1 > DataPoint2 THEN DataPoint1ID ELSE DataPoint3ID END
    END
    AS MyCalculationID,
    *
    INTO #SourceIDS
    FROM dbo.SourceData

    SELECT
    --SID.MyCalculationID,
    (SELECT DataPoint FROM #DataPointSourceIDS WHERE DataPointID = SID.MyCalculationID) AS MyCalculation,
    (SELECT SourceDatePoint FROM #DataPointSourceIDS WHERE DataPointID = SID.MyCalculationID) AS SourceDataPoint,
    EntityID,
    (SELECT SourceDate FROM #DataPointSourceIDS WHERE DataPointID = SID.MyCalculationID) AS SourceDate
    FROM #SourceIDS SID

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply