I wonder if there is a better way to solve this problem?

  • Apologies for the vague topic; I'm not sure how else to say it. I have data coming in from source A. It has to be "cleaned up" then sent off to destination B (and also stored here in an ODS).

    There is a specific issue with some of the data that comes in. Users enter their own information on a website over which I have no control. The interface permits them to do the following (apologies, I stink at formatting tables here):

    first_name last_name degree1 degree2 degree3 degree4

    ---------- ---------- ------ ------ ------ -------

    John Smith BA MBA

    They can enter up to four degrees/certifications or none at all. They are not required by the interface to start with cert1, so you may see someone with two qualifications fill in fields 1 and 3 or 4, or 2 and 3, or 1 and 4.

    The data must arrive at the destination looking like this:

    first_name last_name degree1 degree2 degree3 degree4

    ---------- ---------- ------ ------ ------ -------

    John Smith BA MBA

    So basically all of these values have to be squished and condensed to the left so there are no blank fields between certifications, if that makes sense.

    First, I did this (I know, I know...temp tables):

    SELECTTRANSACTION_ID,Degree,Value

    INTO #tmp_degreeunpivot

    FROM(SELECT TRANSACTION_ID,DEGREES1,DEGREES2,DEGREES3,DEGREES4 FROM DBO.mysourcetable

    WHERE(DEGREES1 <> '' OR DEGREES2 <> '' OR DEGREES3 <> '' OR DEGREES4 <> '')) AS P

    UNPIVOT (Degree FOR Value IN (DEGREES1,DEGREES2,DEGREES3,DEGREES4)) unpvt

    ORDER BY TRANSACTION_ID;

    Then I deleted the DEGREE1-DEGREE4 values that existed in the source heap for the TRANSACTION_ID values in #tmp_degreeunpivot.

    Then I did this:

    WITH DEGREE_CTE (TRANSACTION_ID,Degree,DEGREERANK)

    AS

    (

    SELECTTDP.TRANSACTION_ID,

    TDP.Degree,

    'DEGREES' + CAST(TDP.RNK AS char(1)) AS DEGREERANK

    FROM(SELECTTRANSACTION_ID,

    DEGREE,

    RANK() OVER (PARTITION BY TRANSACTION_ID ORDER BY VALUE ASC) AS RNK

    FROM#tmp_degreeunpivot) AS TDP

    )

    SELECTTRANSACTION_ID,

    DEGREES1,

    DEGREES2,

    DEGREES3,

    DEGREES4

    INTO#tmp_degreepivot

    FROM(SELECT TRANSACTION_ID,Degree,DEGREERANK FROM DEGREE_CTE) AS up

    PIVOT (MAX(DEGREE) for DEGREERANK in (DEGREES1,DEGREES2,DEGREES3,DEGREES4)) AS PVT;

    Then I updated the source with the values in the correct columns:

    UPDATEME

    SETME.DEGREES1 = COALESCE(PVT.DEGREES1,''),

    ME.DEGREES2 = COALESCE(PVT.DEGREES2,''),

    ME.DEGREES3 = COALESCE(PVT.DEGREES3,''),

    ME.DEGREES4 = COALESCE(PVT.DEGREES4,'')

    FROM#tmp_degreepivot AS PVT

    INNER JOINdbo.mysourcetable AS ME

    ONPVT.TRANSACTION_ID = ME.TRANSACTION_ID;

    Does it work? Yes. Does it perform well? Reasonably, and the files we'll be working with never have more than 100 records per day in them, so if it takes an extra few seconds to run it's not a big deal.

    Am I happy with it? Yes and no. I'm glad that I solved the problem, but I keep wondering if there was a better way to do it. A more elegant way.

  • I wish you could post table schema and sample data.

    SET NOCOUNT ON;

    USE tempdb;

    GO

    WITH C1 AS (

    SELECT

    T.first_name,

    T.last_name,

    R.degree,

    R.pos,

    ROW_NUMBER()OVER(PARTITION BY T.first_name, T.last_name ORDER BY R.pos) AS rn

    FROM

    (

    VALUES

    ('John', 'Smith', '', 'BA', '', 'MBA'),

    ('Some', 'Name', 'BA', 'MBA', '', 'PhD')

    ) AS T(first_name, last_name, degree1, degree2, degree3, degree4)

    CROSS APPLY

    (

    VALUES

    (degree1, 1),

    (degree2, 2),

    (degree3, 3),

    (degree4, 4)

    ) AS R(degree, pos)

    WHERE

    degree > ''

    )

    SELECT

    first_name,

    last_name,

    MAX(CASE WHEN rn = 1 THEN degree ELSE '' END) AS degree1,

    MAX(CASE WHEN rn = 2 THEN degree ELSE '' END) AS degree2,

    MAX(CASE WHEN rn = 3 THEN degree ELSE '' END) AS degree3,

    MAX(CASE WHEN rn = 4 THEN degree ELSE '' END) AS degree4

    FROM

    C1

    GROUP BY

    first_name,

    last_name;

    GO

    I am using VALUES as the row constructor to simulate your table, but the idea here is to use APPLY operator to unpivot the data, exclude the rows where there is no degree, enumerate the rest and do pivot again.

  • hunchback (5/15/2014)


    I wish you could post table schema and sample data.

    SET NOCOUNT ON;

    USE tempdb;

    GO

    WITH C1 AS (

    SELECT

    T.first_name,

    T.last_name,

    R.degree,

    R.pos,

    ROW_NUMBER()OVER(PARTITION BY T.first_name, T.last_name ORDER BY R.pos) AS rn

    FROM

    (

    VALUES

    ('John', 'Smith', '', 'BA', '', 'MBA'),

    ('Some', 'Name', 'BA', 'MBA', '', 'PhD')

    ) AS T(first_name, last_name, degree1, degree2, degree3, degree4)

    CROSS APPLY

    (

    VALUES

    (degree1, 1),

    (degree2, 2),

    (degree3, 3),

    (degree4, 4)

    ) AS R(degree, pos)

    WHERE

    degree > ''

    )

    SELECT

    first_name,

    last_name,

    MAX(CASE WHEN rn = 1 THEN degree ELSE '' END) AS degree1,

    MAX(CASE WHEN rn = 2 THEN degree ELSE '' END) AS degree2,

    MAX(CASE WHEN rn = 3 THEN degree ELSE '' END) AS degree3,

    MAX(CASE WHEN rn = 4 THEN degree ELSE '' END) AS degree4

    FROM

    C1

    GROUP BY

    first_name,

    last_name;

    GO

    I am using VALUES as the row constructor to simulate your table, but the idea here is to use APPLY operator to unpivot the data, exclude the rows where there is no degree, enumerate the rest and do pivot again.

    Awesome, this works! 🙂 Thank you!

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply