Unique runs

Question

Unique runs

sjsubscribe

SSCrazy

Points: 2080
More actions
August 29, 2012 at 7:45 pm

#261194

I am looking for a query that will produce a list of unique runs for the following data, where EvDate is unique and only one EID is associated with a date. The following is a sample from the original data set, which as about 300K records.
CREATE TABLE tempTable (EvDate date NOT NULL, EID int NOT NULL,
CONSTRAINT PKtemp PRIMARY KEY CLUSTERED (EvDate ASC, EID ASC))
GO
CREATE UNIQUE NONCLUSTERED INDEX IXtemp ON tempTable (EvDate DESC)
GO
INSERT INTO tempTable (EvDate,EID)
SELECT '02/27/2012',1
UNION ALL
SELECT '03/05/2012',4
UNION ALL
SELECT '03/12/2012',5
UNION ALL
SELECT '03/19/2012',3
UNION ALL
SELECT '03/26/2012',2
UNION ALL
SELECT '04/02/2012',7
UNION ALL
SELECT '04/09/2012',1
UNION ALL
SELECT '04/16/2012',4
UNION ALL
SELECT '04/23/2012',9
UNION ALL
SELECT '04/30/2012',8
UNION ALL
SELECT '05/07/2012',7
UNION ALL
SELECT '05/14/2012',5
UNION ALL
SELECT '05/21/2012',3
UNION ALL
SELECT '05/28/2012',2
UNION ALL
SELECT '05/30/2012',2
UNION ALL
SELECT '06/04/2012',2
UNION ALL
SELECT '06/11/2012',4
UNION ALL
SELECT '06/18/2012',5
UNION ALL
SELECT '06/25/2012',6
UNION ALL
SELECT '07/02/2012',1
UNION ALL
SELECT '07/09/2012',3
UNION ALL
SELECT '07/16/2012',8
UNION ALL
SELECT '07/23/2012',3
UNION ALL
SELECT '07/30/2012',6
UNION ALL
SELECT '08/06/2012',5
UNION ALL
SELECT '08/13/2012',4
UNION ALL
SELECT '08/20/2012',1
UNION ALL
SELECT '08/27/2012',1
Is there a way I can find a count of the number of unique EIDs in a row, summarized by start and end dates as shown below.
UniqueRuns is the count of consecutively different EIDs. From 2/27/2012 to 4/02/2012 every EID is unique for those 6 dates. On 4/9/2012, the EID is 1, which is a repeat from 2/27/2012. So the count resets for the next unique run. That run ends after 8 dates on 5/28/2012. New run begins on 5/30/2012, but ends on that day becuase the same EID=2 is for the next date. A new run begins. So on.
I want to find the longest unique runs in a format as shown below. Ties are sorted by EndDate.
StartDateEndDate UniqueRuns
04/09/201205/28/20128
06/04/201207/16/20127
02/27/201204/02/20126
07/23/201208/20/20125
05/30/1205/30/121
08/27/201208/27/20121
Unlike the traditional islands and gaps problems based on dates, here it's based on the uniqueness of the EIDs.
I'm stumped as to how I can convert this into a traditional islands-and-gaps problem. I've not looked at any of the new T-SQL functions in SQL Server 2012 for something like this.
Thanks for any insights.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Dwain Camps SSC Guru Points: 86908 More actions · Answer 1

First of all, and a good thing for you because I suck at them, I don't think this is a gaps and islands problem. It looks more like a cumulative totals problem, so I propose it be handled with a quirky update. You'll need to add a helper column to your table and reverse the primary key/index clustering.

CREATE TABLE #tempTable (EvDate date NOT NULL, EID int NOT NULL

,helpergroupno INT,

CONSTRAINT PKtemp PRIMARY KEY NONCLUSTERED (EvDate ASC, EID ASC))

CREATE UNIQUE CLUSTERED INDEX IXtemp ON #tempTable (EvDate DESC )

Now add in your setup data and then run this:

DECLARE @tempstr VARCHAR(100) = ''

,@tempno INT = 1

UPDATE #tempTable

SET @tempstr = CASE

WHEN CHARINDEX('[' + CAST(EID AS VARCHAR) + ']', @tempstr) = 0

THEN @tempstr + '[' + CAST(EID AS VARCHAR) + ']'

ELSE '[' + CAST(EID AS VARCHAR) + ']' END

,@tempno = helpergroupno = CASE

WHEN @tempstr = '[' + CAST(EID AS VARCHAR) + ']'

THEN @tempno + 1

ELSE @tempno END

OPTION(MAXDOP 1)

SELECT startdate=MIN(EvDate), enddate=MAX(EvDate), [Count of EIDs]=COUNT(*)

FROM #tempTable

GROUP BY helpergroupno

DROP TABLE #tempTable

Results are:

startdateenddateCount of EIDs

2012-02-272012-04-026

2012-04-092012-05-288

2012-05-302012-05-301

2012-06-042012-07-167

2012-07-232012-08-205

2012-08-272012-08-271

Please advise if this is correct. 😀

Edit: Corrected the code above to handle EID > 9 and added this explanation:

I realized that my suggestion that this is a cumulative sums problem may be a bit enigmatic, so let me say that it is except that instead of accumulating totals you're accumulating events.

If you're unclear as to how it works you can do this:

1. Add a second (VARCHAR(100)) helper column and dump the current value of @tempstr into it as I have done with @tempno in the row that follows.

2. SELECT * from the table without grouping.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

sjsubscribe SSCrazy Points: 2080 More actions · Answer 2

Dwain, works perfectly. Thanks very very much.

First off, I never thought about quirky update though I used it before.

Second, your code ran in 5.1 seconds for 300K records on SQL 2012. Impressive.

--S.J.

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 3

...
...
,@tempno = helpergroupno = CASE
...

Please note, the "three-parts" SET is not reliable. You better to split it to two of "two-part" SET's.

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Dwain Camps SSC Guru Points: 86908 More actions · Answer 4

sjsubscribe (8/30/2012)
Dwain, works perfectly. Thanks very very much.
First off, I never thought about quirky update though I used it before.
Second, your code ran in 5.1 seconds for 300K records on SQL 2012. Impressive.
--S.J.

SJ - Glad it worked for you.

I must say, you led me down the garden path suggesting it was gaps and islands. Maybe it is but I couldn't figure it from that perspective. Once I realized QU would work, it didn't take long to get it working.

Eugene - I'm not sure I understand what you mean. I mean I get the part saying that the 3 part SET might not be reliable. But can you demonstrate a working solution using a 2 part SET?

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 5

Search for "OMG" In J.M. article:

http://www.sqlservercentral.com/articles/T-SQL/68467/

It does explain the problem in details and gives good example.

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Dwain Camps SSC Guru Points: 86908 More actions · Answer 6

Eugene Elutin (8/30/2012)
Search for "OMG" In J.M. article:
http://www.sqlservercentral.com/articles/T-SQL/68467/
It does explain the problem in details and gives good example.

Ah yes, the bible on this subject. It appears Jeff has revised it since the last time I read it. New rules including the part about the 2 part update. So I'll need to brush up.

For now though, it seems to have worked for SJ but s/he should probably take a look at the article also and at least add the TABLOCK hint if this is being done in something other than a temp table (refer to the OMG and Rules section) and it is not just a one off requirement.

Thanks for letting me know about this Eugene!

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 7

No probs.

Actually it was never safe to use three part set in quirky update.

If your read BoL abnout UPDATE you will find that:

Setting Variables and Columns
Variable names can be used in UPDATE statements to show the old and new values affected, but this should be used only when the UPDATE statement affects a single record. If the UPDATE statement affects multiple records, to return the old and new values for each record, use the OUTPUT clause.

http://msdn.microsoft.com/en-us/library/ms177523(v=sql.90).aspx

Unfortunately, explicit explanation of the above behaviour/restriction was removed from BoL for SQL2008 and above, but this behaviour/restriction is still in place.

So, as you can see it's quite expected that it will not work for quirky update.

I never used "three-part" SET in quirky update myself due to the simple reason: Quirky update method is older than "three part" SET" feature of SQL Server (I think it was introduced in SQL2K), so I always used two set lines with standard two-parts set's.

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 8

Eugene Elutin (8/30/2012)
...
...
,@tempno = helpergroupno = CASE
...
Please note, the "three-parts" SET is not reliable. You better to split it to two of "two-part" SET's.

That's absolutely not true. It's a documented feature and it works just fine. What may not be "reliable" is people's understanding of it thanks to the terrible explanation that MS included in Books Online.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 9

To clarify, I'm just taking exception to when people say that the 3 part update isn't "reliable" and it b oils down to what the semantics of what someone considers the word "unreliable" to actually mean.

To me, "unreliable" means that it should never be trusted to work. As I said throughout the article on running totals, one of the rules is that you first have to make it all work. Once you get it working, then it will work reliably.

It's like taking into consideration the problems that people have documented about the relatively new MERGE statement. It sometimes does strange things that cause the code to not work as expected. Does that make it "unreliable"? Perhaps. I like to think of it as it being very reliable in giving you problems if you use it a certain way and very reliable in giving you correct answers if you use it in another.

To quote the article...

It's just one option that happens to work well for running totals and running counts.

Interpret that as meaning that the 3 part update is reliable for "normal" Quirky Update running totals and counts. If you do something strange with it, then it will be "unreliable".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

sjsubscribe SSCrazy Points: 2080 More actions · Answer 10

Eugene, thanks for 3-parts set warning. I read it in the JM article and went back and fixed some of my old quirky update code as well. But all of them were using temp tables -- within the confines of "reliable" as Jeff described. I'ver never implemented quirky update for permanent data, just for intermediate calculations.

Dwain, concatenating ints as strings into an array is something I'd never would have thought despite the quirky update trick. I just increased the @tmpstr varchar size to 4000 to handle the foreseeable future. Also I'm glad you didn't fall for my note about islands and gaps, which was misleading. Lastly, though I've tested and implemented this on SQL 2012, your solution should work on 2000/5/8 as well. I started playing with try_convert, concat, IIF, and other newer functions in SQL 2012, but they would not be as portable.

Dwain Camps SSC Guru Points: 86908 More actions · Answer 11

sjsubscribe (8/30/2012)
Eugene, thanks for 3-parts set warning. I read it in the JM article and went back and fixed some of my old quirky update code as well. But all of them were using temp tables -- within the confines of "reliable" as Jeff described. I'ver never implemented quirky update for permanent data, just for intermediate calculations.

You should be able to convert my submission to a 2 part SET easily and in a temp table it is probably pretty safe.

sjsubscribe (8/30/2012)
Dwain, concatenating ints as strings into an array is something I'd never would have thought despite the quirky update trick. I just increased the @tmpstr varchar size to 4000 to handle the foreseeable future. Also I'm glad you didn't fall for my note about islands and gaps, which was misleading. Lastly, though I've tested and implemented this on SQL 2012, your solution should work on 2000/5/8 as well. I started playing with try_convert, concat, IIF, and other newer functions in SQL 2012, but they would not be as portable.

Actually the idea came to me partly through this past article that I wrote (http://www.sqlservercentral.com/articles/sql+n-Tuples/89809/), even though the rest of that is inapplicable here. I am also gratified to hear you explored the solution and came to the conclusion that you should make the string size larger - good on you!

I did fall for your suggestion on gaps and islands, otherwise I would have solved it a lot faster. It's why I try to tell developers not to approach problems with pre-conceived notions (like someone else's idea about how it should work). The mind plays tricks with you when that happens. I spent a good 30 minutes on it as a gaps and islands problem. When I realized the QU approach would work, I knocked it out in about 10 minutes. Nonetheless you are forgiven. 😀

And yes, the backwards portability is good. I heard someone suggest once that MS might deprecate the feature in the future, but I think that notion was pretty well shot down. So safe for the future I think.

I'm glad it was fast too, although I was shooting for 3M rows in 5 sec. 🙂

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 12

Jeff Moden (8/30/2012)
Eugene Elutin (8/30/2012)
...
...
,@tempno = helpergroupno = CASE
...
Please note, the "three-parts" SET is not reliable. You better to split it to two of "two-part" SET's.
That's absolutely not true. It's a documented feature and it works just fine. What may not be "reliable" is people's understanding of it thanks to the terrible explanation that MS included in Books Online.

I called it "unreliable" only in relevance to "quirky" update method. I agree that BoL doesn't explain its behaviour as we would like it to do, but it does quite clearly state that

...this should be used only when the UPDATE statement affects a single record...

(yeah, as I said they removed any text about it in BoL for SQL2008 and later)

To be on a safe side, I would use two-part SET only in a "quirky" update.

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

sjsubscribe SSCrazy Points: 2080 More actions · Answer 13

I'm glad it was fast too, although I was shooting for 3M rows in 5 sec. 🙂

It will probably be that fast if I had optimized the underlying source data (views). I've not yet implemented it in production.