detect if previous record is higher or lower ?

Question

detect if previous record is higher or lower ?

Viewing 10 posts - 16 through 25 (of 25 total)

You must be logged in to reply to this topic. Login to reply

mick burden SSCarpal Tunnel Points: 4432 More actions · Answer 1

Jeff, I’d like to take you up on your offer regarding Quirky Update. I’d appreciate some sample code regarding this. Many thanks for your patience, have a great weekend.

kind regards

Mick

Jeff Moden SSC Guru Points: 1004619 More actions · Answer 2

I'll try to hammer out an example tonight after work, Mick. Sorry for the delay in responding.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mick burden SSCarpal Tunnel Points: 4432 More actions · Answer 3

Jeff Moden - Tuesday, June 5, 2018 7:50 AM
I'll try to hammer out an example tonight after work, Mick. Sorry for the delay in responding.

That's brilliant, thank you Jeff

Jeff Moden SSC Guru Points: 1004619 More actions · Answer 4

Ugh! I'm going to have to take it back. I'm assuming that you want the return to return to the screen. If that's the case, then the "Quirky Update" is no faster than David's fine code and also has the disadvantage of needing an extra column and being unsupported. With that in mind, I'm not going to post the code for it because I don't want someone that might not be able to support it to use it when there's a very good supported method available (David's fine code).

I will, however, post the test data generator I used for my tests in case someone want's to race David's code.
--===================================================================================================================== -- Create a million row test table with 1000 names each having 1000 dates and each of those having an -- [AMOUNT of JOBS] ranging from 1 to 10. -- Not to worry, this only takes 2 to 4 seconds to execute on my litte I5 laptop. --===================================================================================================================== -- DROP TABLE dbo.Test -- Commented out for safety --===== Create the table. CREATE TABLE dbo.Test ( [Name] VARCHAR(50) NOT NULL ,[Date] SMALLDATETIME NOT NULL ,[AMOUNT of JOBS] INT NOT NULL ) ; --===== Add the quintessential Clustered Index CREATE UNIQUE CLUSTERED INDEX PK_Test ON dbo.Test ([Name],[Date]) ; --===== Populate the test table with random constrained test data. WITH cteTally AS (--==== Generate a simple thousand row source of sequential numbers. SELECT TOP 1000 N = CONVERT(INT,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM sys.all_columns ac1 CROSS JOIN sys.all_columns ac2 )--===== Creates a cross join of a thousand names with a thousand dates (1 million rows) and assigns -- a random number from 1 to 10 to each and dumps it into the test table. INSERT INTO dbo.Test WITH (TABLOCK) (Name, [Date], [AMOUNT of JOBS]) SELECT Name = 'Person'+RIGHT(t1.N+10000,4) --Simulated names ,[Date] = DATEADD(dd,t2.N-1,'2015') ,[AMOUNT of JOBS] = ABS(CHECkSUM(NEWID())%10)+1 FROM cteTally t1 CROSS JOIN cteTally t2 ; SELECT * FROM dbo.Test;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Eirikur Eiriksson SSC Guru Points: 182948 More actions · Answer 5

I recommend using a straight self join rather than the CTE as the CTE will have both sort operators and a hash match in the plan.

Self join example
USE TEEST; GO SET NOCOUNT ON;

IF OBJECT_ID(N'dbo.TBL_TEST_PREV') IS NOT NULL DROP TABLE dbo.TBL_TEST_PREV;

CREATE TABLE dbo.TBL_TEST_PREV ( [Name] VARCHAR(50) ,[Date] smalldatetime ,[AMOUNT of JOBS] Int ) ; INSERT INTO dbo.TBL_TEST_PREV(name,[date],[AMOUNT of JOBS]) VALUES ('John','12/jan/2017','3') ,('John','13/jan/2017','4') ,('John','14/jan/2017','2') ,('John','15/jan/2017','4') ,('John','16/jan/2017','4') ,('John','17/jan/2017','1');

SELECT TP.Name ,TP.Date ,TP.[AMOUNT of JOBS] AS CURR_JOBS ,CASE SIGN(TP.[AMOUNT of JOBS] - ISNULL(PP.[AMOUNT of JOBS],TP.[AMOUNT of JOBS])) WHEN -1 THEN 'DOWN' WHEN 0 THEN 'STATIC' WHEN 1 THEN 'UP' END AS DIRECTION FROM dbo.TBL_TEST_PREV TP LEFT OUTER JOIN dbo.TBL_TEST_PREV PP ON TP.Name = PP.Name AND TP.Date = PP.Date + 1 ;

Plugging Jeff's fine data sample into a simple test harness will show the difference
USE TEEST; GO SET NOCOUNT ON;

-- /* --===================================================================================================================== -- Create a million row test table with 1000 names each having 1000 dates and each of those having an -- [AMOUNT of JOBS] ranging from 1 to 10. -- Not to worry, this only takes 2 to 4 seconds to execute on my litte I5 laptop. --===================================================================================================================== -- DROP TABLE dbo.Test -- Commented out for safety --===== Create the table. CREATE TABLE dbo.Test ( [Name] VARCHAR(50) NOT NULL ,[Date] SMALLDATETIME NOT NULL ,[AMOUNT of JOBS] INT NOT NULL ) ; --===== Add the quintessential Clustered Index CREATE UNIQUE CLUSTERED INDEX PK_Test ON dbo.Test ([Name],[Date]) ; --===== Populate the test table with random constrained test data. WITH cteTally AS (--==== Generate a simple thousand row source of sequential numbers. SELECT TOP 1000 N = CONVERT(INT,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM sys.all_columns ac1 CROSS JOIN sys.all_columns ac2 )--===== Creates a cross join of a thousand names with a thousand dates (1 million rows) and assigns -- a random number from 1 to 10 to each and dumps it into the test table. INSERT INTO dbo.Test WITH (TABLOCK) (Name, [Date], [AMOUNT of JOBS]) SELECT Name = 'Person'+RIGHT(t1.N+10000,4) --Simulated names ,[Date] = DATEADD(dd,t2.N-1,'2015') ,[AMOUNT of JOBS] = ABS(CHECkSUM(NEWID())%10)+1 FROM cteTally t1 CROSS JOIN cteTally t2 ; -- */ DECLARE @CHAR_BUCKET VARCHAR(50) = ''; DECLARE @DATE_BUCKET SMALLDATETIME = GETDATE(); DECLARE @INT_BUCKET INT = 0; DECLARE @STR_BUCKET VARCHAR(10) = ''; DECLARE @TRES TABLE ( TT_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()) ,TT_TXT VARCHAR(50) NOT NULL );

INSERT INTO @TRES(TT_TXT) VALUES('DRY RUN'); SELECT @CHAR_BUCKET = T.Name ,@DATE_BUCKET = T.Date ,@INT_BUCKET = T.[AMOUNT of JOBS] FROM dbo.Test T; INSERT INTO @TRES(TT_TXT) VALUES('DRY RUN');

INSERT INTO @TRES(TT_TXT) VALUES('SELF JOIN'); SELECT @CHAR_BUCKET = TP.Name ,@DATE_BUCKET = TP.Date ,@INT_BUCKET = TP.[AMOUNT of JOBS] -- AS CURR_JOBS ,@STR_BUCKET = CASE SIGN(TP.[AMOUNT of JOBS] - ISNULL(PP.[AMOUNT of JOBS],TP.[AMOUNT of JOBS])) WHEN -1 THEN 'DOWN' WHEN 0 THEN 'STATIC' WHEN 1 THEN 'UP' END --AS DIRECTION FROM dbo.Test TP LEFT OUTER JOIN dbo.Test PP ON TP.Name = PP.Name AND TP.Date = PP.Date + 1 ; INSERT INTO @TRES(TT_TXT) VALUES('SELF JOIN');

INSERT INTO @TRES(TT_TXT) VALUES('CTE JOIN'); ;WITH cte ([Name],[Date],[AMOUNT of JOBS],RowNum) AS ( SELECT [Name],[Date],[AMOUNT of JOBS], ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY [Date] ASC) FROM dbo.Test ) SELECT @CHAR_BUCKET = cte.[Name] ,@DATE_BUCKET = cte.[Date] ,@INT_BUCKET = cte.[AMOUNT of JOBS], @STR_BUCKET = CASE WHEN prev.[AMOUNT of JOBS] IS NULL THEN 'Static' WHEN prev.[AMOUNT of JOBS] = cte.[AMOUNT of JOBS] THEN 'Static' WHEN prev.[AMOUNT of JOBS] < cte.[AMOUNT of JOBS] THEN 'Up' ELSE 'Down' END FROM cte LEFT JOIN cte prev ON prev.[Name] = cte.[Name] AND prev.RowNum = cte.RowNum-1; INSERT INTO @TRES(TT_TXT) VALUES('CTE JOIN');

SELECT T.TT_TXT ,DATEDIFF(MICROSECOND,MIN(T.TT_TS),MAX(T.TT_TS)) AS DURATION FROM @TRES T GROUP BY T.TT_TXT ORDER BY DURATION;

Results
TT_TXT DURATION ----------- --------- DRY RUN 176776 SELF JOIN 1250797 CTE JOIN 2585184

Jeff Moden SSC Guru Points: 1004619 More actions · Answer 6

The only problem with using date in such a fashion is that it must be certain and continuous without any gaps or duplication. Of course, there are rabbit holes for all code.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Eirikur Eiriksson SSC Guru Points: 182948 More actions · Answer 7

Jeff Moden - Wednesday, June 6, 2018 6:17 AM
The only problem with using date in such a fashion is that it must be certain and continuous without any gaps or duplication. Of course, there are rabbit holes for all code.

As always Jeff, you never run out of spanners
My take on it is that if there are gaps, then those should be regarded as restarts, that is the value returned would be 'STATIC'

BTW, did also test a calculated column, but since the function cannot be made deterministic and therefore the column cannot be persisted, the performance is too poor to even post the code. Further, in this case, using the LAG Window function, slows down drastically after the internal limit is reach and it ends up slower than the self-join. An obvious alternative would be to add a normal column and insert the previous value on the insert.

xsevensinzx One Orange Chip Points: 25560 More actions · Answer 8

David Burrows - Wednesday, May 23, 2018 2:03 AM
Performance could be an issue here and Jeff's suggestion of Quirky Update would probably the best.
However, a possibility
WITH cte ([Name],[Date],[AMOUNT of JOBS],RowNum) AS ( SELECT [Name],[Date],[AMOUNT of JOBS], ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY [Date] ASC) FROM test ) SELECT cte.[Name],cte.[Date],cte.[AMOUNT of JOBS], CASE WHEN prev.[AMOUNT of JOBS] IS NULL THEN 'Static' WHEN prev.[AMOUNT of JOBS] = cte.[AMOUNT of JOBS] THEN 'Static' WHEN prev.[AMOUNT of JOBS] < cte.[AMOUNT of JOBS] THEN 'Up' ELSE 'Down' END FROM cte LEFT JOIN cte prev ON prev.[Name] = cte.[Name] AND prev.RowNum = cte.RowNum-1;

This is how we do it with attribution gap analyze for the most part. We do this on billions of records easy enough with CTE's or even just JOIN (SELECT ROW_NUMBER) AS b num = num - 1 as the joined table. While not the most optimized approach it works easy enough to offset record positions. These days, we do this entire process in a data lake (Azure) as it's pretty fast to create lobs of datasets and join them using USQL.

Missing data is easily solved with a proper date dimension table if you're doing this based on dates. This is accomplished simply by joining your date dimension table and flagging it with a missing flag in the event you have gaps in date ranges for example. The record still gets joined as it gets assigned a row number with no related attributes or metrics. Similar approaches can be created on other sequential data.

In cases where I use this, we don't care for gaps in data though and we have tons of duplicates (which in the CTE case, duplicates do not cause us issues). We are measuring the time difference down to the microsecond from the record that comes after the prior record in consumer journeys with ads. This allows us to see how much time between certain ads--certain tactics--takes on the way either to a conversion or a non-conversion.

Alternatively, the fastest approach for us (using Tally table prob be nice) with the above method is to create a copy of the dataset and index the joined keys. This ultimately is the final approach on extremely large datasets. But bear in mind, I did this originally in MPP where duplication of data has to happen in most use cases. Now I do this in the data lake where we have scalability to create multiple schemas in a single query and slice/dice/chop/etc on the fly per query. Works out nicely.

Jeff Moden SSC Guru Points: 1004619 More actions · Answer 9

Eirikur Eiriksson - Wednesday, June 6, 2018 6:56 AM
Jeff Moden - Wednesday, June 6, 2018 6:17 AM
The only problem with using date in such a fashion is that it must be certain and continuous without any gaps or duplication. Of course, there are rabbit holes for all code.
As always Jeff, you never run out of spanners
My take on it is that if there are gaps, then those should be regarded as restarts, that is the value returned would be 'STATIC'
BTW, did also test a calculated column, but since the function cannot be made deterministic and therefore the column cannot be persisted, the performance is too poor to even post the code. Further, in this case, using the LAG Window function, slows down drastically after the internal limit is reach and it ends up slower than the self-join. An obvious alternative would be to add a normal column and insert the previous value on the insert.

Yes. That'll work. They just need to know that would happen. I wonder what they'd want to do for someone that made a sale only every other day, though. It's not a problem of the code. It's a problem of what they expect and that's actually the only thing I wanted to point out to them (not you... you get it).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Eirikur Eiriksson SSC Guru Points: 182948 More actions · Answer 10

Jeff Moden - Thursday, June 7, 2018 6:50 AM
Eirikur Eiriksson - Wednesday, June 6, 2018 6:56 AM
Jeff Moden - Wednesday, June 6, 2018 6:17 AM
The only problem with using date in such a fashion is that it must be certain and continuous without any gaps or duplication. Of course, there are rabbit holes for all code.
As always Jeff, you never run out of spanners
My take on it is that if there are gaps, then those should be regarded as restarts, that is the value returned would be 'STATIC'
BTW, did also test a calculated column, but since the function cannot be made deterministic and therefore the column cannot be persisted, the performance is too poor to even post the code. Further, in this case, using the LAG Window function, slows down drastically after the internal limit is reach and it ends up slower than the self-join. An obvious alternative would be to add a normal column and insert the previous value on the insert.
Yes. That'll work. They just need to know that would happen. I wonder what they'd want to do for someone that made a sale only every other day, though. It's not a problem of the code. It's a problem of what they expect and that's actually the only thing I wanted to point out to them (not you... you get it).

Good point as always Jeff, no the question to the OP would be how to deal with gaps or missing dates, if it is a "sales by day" related, are missing dates zero in sales which means a down - up or are those "closed for business" dates which should be ignored? If it is the latter, then I would go with a filtering on the sequence type of approach. The inherit problem with the ROW_NUMBER is that if the correct index isn't in place, then it will introduce a sort operator into the execution plan. Even with the index, there will still be a rather hefty cost of a HASH MATCH join which resource cost is directly related to the cardinality of the set. Reaching a resource limit, it will spill to tempdb and bomb out the performance.

Further, the ROW_NUMBER approach degrades in performance when the internal cardinality limit is reached, pushes the set from a highly optimised temp table structure to a physical temp table which has all the locks and latches.

detect if previous record is higher or lower ?

Cookies on SQLServerCentral