SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Reduce records in a slowly changing dimension table


Reduce records in a slowly changing dimension table

Author
Message
Chrissy321
Chrissy321
SSCrazy Eights
SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)

Group: General Forum Members
Points: 8898 Visits: 5173
I have a table that tracks groups over time using a start and end date columns. Some of the groups were loaded with a daily snapshot method where new records were inserted with equal start and end dates rather than the end date being incremented to show continuing membership in a group.

I have some sample data below.

I would like to delete all the records showing member 2 belong to group B and insert 1 record having a start date of 2000-01-01 and an end date of 2000-01-04. The 4 records showing member 4 belong to group D for different periods should be replaced with 2 records: 2017-01-01 - 2017-01-02 and 2017-01-04 - 2017-01-05

Thanks if you can help. I am working in SQL Server 2008 (SP3)



IF OBJECT_ID('dbo.GroupMembers', 'U') IS NOT NULL DROP TABLE dbo.GroupMembers;

CREATE TABLE dbo.GroupMembers
(
GroupCode VARCHAR(1) NOT NULL,
GroupMember INT NOT NULL,
StartDate DATE NOT NULL,
EndDate DATE NULL
)


--Some of my groups have proper ranges for start and end dates
INSERT INTO dbo.GroupMembers VALUES ('A', 1, '20000101', '20170101')

--Some of my groups have were populated with a snapshot for every day
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000101', '20000101')
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000102', '20000102')
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000103', '20000103')
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000104', '20000104')

--Other groups are open ended and that is fine
INSERT INTO dbo.GroupMembers VALUES ('C', 3, '20000104', NULL)

--Some members may be removed and then re-added
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170101', '20170101')
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170102', '20170102')
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170104', '20170104')
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170105', '20170105')

SELECT * FROM dbo.GroupMembers


Lynn Pettis
Lynn Pettis
SSC Guru
SSC Guru (226K reputation)SSC Guru (226K reputation)SSC Guru (226K reputation)SSC Guru (226K reputation)SSC Guru (226K reputation)SSC Guru (226K reputation)SSC Guru (226K reputation)SSC Guru (226K reputation)

Group: General Forum Members
Points: 226339 Visits: 40443

So, are you looking for something like this?


IF OBJECT_ID('dbo.GroupMembers', 'U') IS NOT NULL DROP TABLE dbo.GroupMembers;

CREATE TABLE dbo.GroupMembers
(
GroupCode VARCHAR(1) NOT NULL,
GroupMember INT NOT NULL,
StartDate DATE NOT NULL,
EndDate DATE NULL
);


--Some of my groups have proper ranges for start and end dates
INSERT INTO dbo.GroupMembers VALUES ('A', 1, '20000101', '20170101');

--Some of my groups have were populated with a snapshot for every day
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000101', '20000101');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000102', '20000102');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000103', '20000103');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000104', '20000104');

--Other groups are open ended and that is fine
INSERT INTO dbo.GroupMembers VALUES ('C', 3, '20000104', NULL);

--Some members may be removed and then re-added
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170101', '20170101');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170102', '20170102');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170104', '20170104');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170105', '20170105');


SELECT * FROM dbo.GroupMembers;

WITH base AS (
SELECT
[GroupCode]
, [GroupMember]
, [StartDate]
, [EndDate]
, rn = ROW_NUMBER() OVER (PARTITION BY [GroupCode], [GroupMember] ORDER BY [StartDate])
FROM
dbo.GroupMembers
), Interim AS (
SELECT
[b].[GroupCode]
, [b].[GroupMember]
, [b].[StartDate]
, [b].[EndDate]
, [b].[rn]
, [GroupDate] = DATEADD(DAY,-[b].[rn],[b].[StartDate])
FROM
base b
)
SELECT
[i].[GroupCode]
, [i].[GroupMember]
, [StartDate] = MIN([i].[StartDate])
, [EndDate] = MAX([i].[EndDate])
FROM
[Interim] i
GROUP BY
[i].[GroupCode]
, [i].[GroupMember]
, [i].[GroupDate];



Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Chrissy321
Chrissy321
SSCrazy Eights
SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)

Group: General Forum Members
Points: 8898 Visits: 5173
Yes that is exactly what I am looking for. This should help me eliminate 99% of the records. Thanks so much.
Chrissy321
Chrissy321
SSCrazy Eights
SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)

Group: General Forum Members
Points: 8898 Visits: 5173
I neglected to provide sample data that would illustrate a particular scenario and naturally the code Lynn provided does not address it. The scenario is some groups may have both existing ranges and the snapshot records (records where the start and end date is the same). When there are 2 date range records with no gaps between I would like to consolidate these multiple records

--New sample data that illustrates what I neglected to show earlier; Some members may have existing ranges and snapshot records
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171201', '20171215');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171216', '20171216');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171217', '20171217');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171218', '20171218');

In the spirit of showing you the code I have worked on I have what may be a 'solution' but it is rather inelegant. I used a calendar table to un-consolidate the data before consolidating it with the code Lynn provided.
Below will have the new sample data; Lynn's code and then I jam the unconsolidation into Lynn's code as a subquery.
Any suggestions on how to do this properly? It seems silly to -un-consolidate the data.
IF OBJECT_ID('dbo.GroupMembers', 'U') IS NOT NULL DROP TABLE dbo.GroupMembers; 
CREATE TABLE dbo.GroupMembers
(
GroupCode VARCHAR(1) NOT NULL,
GroupMember INT NOT NULL,
StartDate DATE NOT NULL,
EndDate DATE NULL
);


--Some of my groups have proper ranges for start and end dates
INSERT INTO dbo.GroupMembers VALUES ('A', 1, '20170101', '20170201');

--Some of my groups have were populated with a snapshot for every day
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000101', '20000101');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000102', '20000102');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000103', '20000103');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000104', '20000104');

--Other groups are open ended and that is fine
INSERT INTO dbo.GroupMembers VALUES ('C', 3, '20000104', NULL);

--Some members may be removed and then re-added
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170101', '20170101');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170102', '20170102');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170104', '20170104');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170105', '20170105');

--Some members may have existing ranges and snapshot records
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171201', '20171215');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171216', '20171216');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171217', '20171217');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171218', '20171218');

--SELECT * FROM dbo.GroupMembers;

--This wil have multiple E records
WITH
base AS
(
SELECT
GroupCode,
GroupMember,
StartDate,
EndDate,
rn = ROW_NUMBER() OVER (PARTITION BY GroupCode, GroupMember ORDER BY StartDate)
FROM dbo.GroupMembers
)
,
Interim AS
(
SELECT
b.GroupCode,
b.GroupMember,
b.StartDate,
b.EndDate,
b.rn,
GroupDate = DATEADD(DAY,-b.rn,b.StartDate)
FROM base b
)

SELECT
i.GroupCode,
i.GroupMember,
StartDate = MIN(i.StartDate),
EndDate = MAX(i.EndDate)
FROM Interim i
GROUP BY
i.GroupCode,
i.GroupMember,
i.GroupDate
ORDER BY
i.GroupCode; ;

--If I explode the date ranges into the separate records in a subquery then the E records will be consolidated
WITH
base AS
(
SELECT
GroupCode,
GroupMember,
StartDate,
EndDate,
rn = ROW_NUMBER() OVER (PARTITION BY GroupCode, GroupMember ORDER BY StartDate)
FROM
(
SELECT
#T1.GroupCode,
#T1.GroupMember,
C.CalendarDate AS StartDate,
C.CalendarDate AS EndDate
FROM #T1
INNER JOIN dbo.Calendar C
ON C.CalendarDate BETWEEN #T1.StartDate AND #T1.EndDate

UNION ALL

SELECT
GroupCode,
GroupMember,
StartDate,
EndDate
FROM #T1
WHERE EndDate IS NULL
)Unconsolidation
)
,
Interim AS
(
SELECT
b.GroupCode,
b.GroupMember,
b.StartDate,
b.EndDate,
b.rn,
GroupDate = DATEADD(DAY,-b.rn,b.StartDate)
FROM base b
)

SELECT
i.GroupCode,
i.GroupMember,
StartDate = MIN(i.StartDate),
EndDate = MAX(i.EndDate)
FROM Interim i
GROUP BY
i.GroupCode,
i.GroupMember,
i.GroupDate
ORDER BY
i.GroupCode;


Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (512K reputation)SSC Guru (512K reputation)SSC Guru (512K reputation)SSC Guru (512K reputation)SSC Guru (512K reputation)SSC Guru (512K reputation)SSC Guru (512K reputation)SSC Guru (512K reputation)

Group: General Forum Members
Points: 512827 Visits: 44328
Chrissy321 - Tuesday, December 26, 2017 5:00 PM
I neglected to provide sample data that would illustrate a particular scenario and naturally the code Lynn provided does not address it. The scenario is some groups may have both existing ranges and the snapshot records (records where the start and end date is the same). When there are 2 date range records with no gaps between I would like to consolidate these multiple records

--New sample data that illustrates what I neglected to show earlier; Some members may have existing ranges and snapshot records
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171201', '20171215');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171216', '20171216');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171217', '20171217');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171218', '20171218');

In the spirit of showing you the code I have worked on I have what may be a 'solution' but it is rather inelegant. I used a calendar table to un-consolidate the data before consolidating it with the code Lynn provided.
Below will have the new sample data; Lynn's code and then I jam the unconsolidation into Lynn's code as a subquery.
Any suggestions on how to do this properly? It seems silly to -un-consolidate the data.
IF OBJECT_ID('dbo.GroupMembers', 'U') IS NOT NULL DROP TABLE dbo.GroupMembers; 
CREATE TABLE dbo.GroupMembers
(
GroupCode VARCHAR(1) NOT NULL,
GroupMember INT NOT NULL,
StartDate DATE NOT NULL,
EndDate DATE NULL
);


--Some of my groups have proper ranges for start and end dates
INSERT INTO dbo.GroupMembers VALUES ('A', 1, '20170101', '20170201');

--Some of my groups have were populated with a snapshot for every day
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000101', '20000101');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000102', '20000102');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000103', '20000103');
INSERT INTO dbo.GroupMembers VALUES ('B', 2, '20000104', '20000104');

--Other groups are open ended and that is fine
INSERT INTO dbo.GroupMembers VALUES ('C', 3, '20000104', NULL);

--Some members may be removed and then re-added
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170101', '20170101');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170102', '20170102');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170104', '20170104');
INSERT INTO dbo.GroupMembers VALUES ('D', 4, '20170105', '20170105');

--Some members may have existing ranges and snapshot records
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171201', '20171215');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171216', '20171216');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171217', '20171217');
INSERT INTO dbo.GroupMembers VALUES ('E', 5, '20171218', '20171218');

--SELECT * FROM dbo.GroupMembers;

--This wil have multiple E records
WITH
base AS
(
SELECT
GroupCode,
GroupMember,
StartDate,
EndDate,
rn = ROW_NUMBER() OVER (PARTITION BY GroupCode, GroupMember ORDER BY StartDate)
FROM dbo.GroupMembers
)
,
Interim AS
(
SELECT
b.GroupCode,
b.GroupMember,
b.StartDate,
b.EndDate,
b.rn,
GroupDate = DATEADD(DAY,-b.rn,b.StartDate)
FROM base b
)

SELECT
i.GroupCode,
i.GroupMember,
StartDate = MIN(i.StartDate),
EndDate = MAX(i.EndDate)
FROM Interim i
GROUP BY
i.GroupCode,
i.GroupMember,
i.GroupDate
ORDER BY
i.GroupCode; ;

--If I explode the date ranges into the separate records in a subquery then the E records will be consolidated
WITH
base AS
(
SELECT
GroupCode,
GroupMember,
StartDate,
EndDate,
rn = ROW_NUMBER() OVER (PARTITION BY GroupCode, GroupMember ORDER BY StartDate)
FROM
(
SELECT
#T1.GroupCode,
#T1.GroupMember,
C.CalendarDate AS StartDate,
C.CalendarDate AS EndDate
FROM #T1
INNER JOIN dbo.Calendar C
ON C.CalendarDate BETWEEN #T1.StartDate AND #T1.EndDate

UNION ALL

SELECT
GroupCode,
GroupMember,
StartDate,
EndDate
FROM #T1
WHERE EndDate IS NULL
)Unconsolidation
)
,
Interim AS
(
SELECT
b.GroupCode,
b.GroupMember,
b.StartDate,
b.EndDate,
b.rn,
GroupDate = DATEADD(DAY,-b.rn,b.StartDate)
FROM base b
)

SELECT
i.GroupCode,
i.GroupMember,
StartDate = MIN(i.StartDate),
EndDate = MAX(i.EndDate)
FROM Interim i
GROUP BY
i.GroupCode,
i.GroupMember,
i.GroupDate
ORDER BY
i.GroupCode;



You're destroying the whole reason for SCDs on this table. You should never have gaps between "records" and, if you check the "records", you'll likely find that something changed, which is why a new "record" was required. If you do have absolutely identical rows except for the start and end date of the row and those are adjacent to other rows, then you need to fix your SCD marking and row generation process before you get into deleting all the nice history you've built up.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
aaron.reese
aaron.reese
SSCertifiable
SSCertifiable (6.5K reputation)SSCertifiable (6.5K reputation)SSCertifiable (6.5K reputation)SSCertifiable (6.5K reputation)SSCertifiable (6.5K reputation)SSCertifiable (6.5K reputation)SSCertifiable (6.5K reputation)SSCertifiable (6.5K reputation)

Group: General Forum Members
Points: 6473 Visits: 1029
Happy new year all Smile

Giving the original OP the benefit of the doubt, it looks like either the load strategy has been changed or different sources are employing different strategies. It is clear from the sample data that this is abstracted from the real problem. @ Jeff, if the SCD data is looking at continuous days attendance then a gap of one day would be perfectly acceptable and would definitely generate a new SCD record - I have used this pattern myself when tracking data quality issues to identify failure rate and mean-time-to-fix. I think the OP has the right approach to fixing the issue which is to explode out the records with the contiguous days, then find islands and gaps using a calendar table and collapse the records back down. There may be more elegant ways to solve the problem but if it is a one-off data realignment I would not be looking to spend days on code to optimise the solution. Do what is pragmatic and move on.
Chrissy321
Chrissy321
SSCrazy Eights
SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)SSCrazy Eights (8.9K reputation)

Group: General Forum Members
Points: 8898 Visits: 5173
Thanks for your response. Yes there have been different methods of populating this table. As Jeff said one of the methods is broken and needs to be fixed. I will post up a different thread once I get a chance to take a crack at writing code to properly update a SCD table. If any one had any links/sample code on this that would be appreciated.

My original post was looking for a one-off fix and what I posted subsequently seems adequate so i will do what is pragmatic and move on. The existing broken process is not broken in the sense that the data is incorrect but rather it is always inserting records rather than updating which resulted in performance problems. My one-off fix should not destroy any data, all the information should be preserved just represented by a smaller set of records. Thanks all.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum








































































































































































SQLServerCentral


Search