Reconcile employees with multiple IDs into Unique IDs

Question

Post reply

Reconcile employees with multiple IDs into Unique IDs

masterelaichi

SSC-Addicted

Points: 444

August 17, 2022 at 10:12 am

#4077328

Hi,

I have a scenario where my employee table has multiple IDs. The background to this scenario is that due to a change in the reporting systems employees who overlapped during the change ended up with more than one ID as in the table below

CREATE TABLE #input 
(
    Emp_codevarchar(300),
    Employee_Namevarchar(300),
    SSN varchar(300)
);

INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00001', 'Employee Name 1', 'SSN1 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00002', 'Employee Name 2', 'SSN2 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00003', 'Employee Name 3', 'SSN3 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00004', 'Employee Name 4', 'SSN4 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00005', 'Employee Name 5', 'SSN5 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00006', 'Employee Name 6', 'SSN6 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('A00007', 'Employee Name 7', 'SSN7 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90001', 'Employee Name 1', 'SSN1 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90002', 'Employee Name 2', 'SSN2 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90003', 'Employee Name 3', 'SSN3 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90005', 'Employee Name 5', 'SSN5 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90006', 'Employee Name 6', 'SSN6 ');
INSERT INTO #input (Emp_code,Employee_Name,SSN ) VALUES ('B90007', 'Employee Name 7', 'SSN7 ');

As can be seen from the table included, Employee Name 1 has two employee codes - A00001 and B90001. However, every employee has a unique SSN. In the list above, only Employee Name 4 has one Employee Code, which means this employee didn't exist in the previous system

I am trying to reconcile both sets of employees codes based on SSN such that there is a unique employee code which is a combination of the "New System"+"Old System" (where available)

The output I am trying to achieve should look something like this

CREATE TABLE #output 
(
    Emp_codevarchar(300),
    Employee_Namevarchar(300),
    SSN varchar(300)
);

INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00001_B90001', 'Employee Name 1', 'SSN1 ');
INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00002_B90002', 'Employee Name 2', 'SSN2 ');
INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00003_B90003', 'Employee Name 3', 'SSN3 ');
INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00004', 'Employee Name 4', 'SSN4 ');
INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00005_B90005', 'Employee Name 5', 'SSN5 ');
INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00006_B90006', 'Employee Name 6', 'SSN6 ');
INSERT INTO #output (Emp_code,Employee_Name,SSN ) VALUES ('A00007_B90007', 'Employee Name 7', 'SSN7 ');

I attempted this using Pivot table but getting stuck in splitting the SSN into multiple employee ids

Any help would be greatly appreciated

Thanks!

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

arvind.patil 98284 Mr or Mrs. 500 Points: 510 More actions · Answer 1

Hi ,

you can try using below query

Select Employee_Name,SSN,STRING_AGG(Emp_code,'_') within group (order by Emp_code) from #input

group by Employee_Name,SSN

ScottPletcher SSC Guru Points: 100961 More actions · Answer 2

SELECT
    MIN(Emp_code) + CASE WHEN COUNT(*) > 1 THEN '_' + MAX(Emp_code) ELSE '' END AS Emp_code,
    MAX(Employee_Name) AS Employee_Name,
    SSN
FROM #input
GROUP BY SSN
ORDER BY SSN

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

masterelaichi SSC-Addicted Points: 444 More actions · Answer 3

Thanks for that. The sample I included was just a subset of my table. The information I am working with is payroll data. The table contains employees with multiple pay dates and pay types such tax, leave, bonus, and a few other specific to the company. This adds an extra level of granularity because of which there can be multiple instances of an Employee ID

I have tried to expand my dataset a bit more in the input table below if it helps

CREATE TABLE #input 
(
    Employee_ID varchar(300),
    SSN varchar(300),
    Pay_ID varchar(300),
    Date varchar(300),
    AmountPaid varchar(300),
    Col1 varchar(300),
    Col2 varchar(300),
    Col3 varchar(300),
    Col4 varchar(300),
    source varchar(300)
);

INSERT INTO #input (Employee_ID,SSN,Pay_ID,Date,AmountPaid,Col1,Col2,Col3,Col4,source )
VALUES
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '412', 'RandomText1', 'RandomText2', 'RandomText3', 'RandomText4', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '164', 'RandomText2', 'RandomText3', 'RandomText4', 'RandomText5', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20180114', '14/01/2018', '295', 'RandomText3', 'RandomText4', 'RandomText5', 'RandomText6', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '146', 'RandomText4', 'RandomText5', 'RandomText6', 'RandomText7', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '183', 'RandomText5', 'RandomText6', 'RandomText7', 'RandomText8', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '211', 'RandomText6', 'RandomText7', 'RandomText8', 'RandomText9', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170521', '21/05/2017', '271', 'RandomText7', 'RandomText7', 'RandomText7', 'RandomText7', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '310', 'RandomText8', 'RandomText8', 'RandomText8', 'RandomText8', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '449', 'RandomText9', 'RandomText9', 'RandomText9', 'RandomText9', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170312', '12/03/2017', '385', 'RandomText10', 'RandomText10', 'RandomText10', 'RandomText10', 'SYS2 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '353', 'RandomText11', 'RandomText11', 'RandomText11', 'RandomText11', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '305', 'RandomText12', 'RandomText12', 'RandomText12', 'RandomText12', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '481', 'RandomText13', 'RandomText13', 'RandomText13', 'RandomText13', 'SYS1 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '461', 'RandomText14', 'RandomText14', 'RandomText14', 'RandomText14', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '363', 'RandomText15', 'RandomText15', 'RandomText15', 'RandomText15', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '499', 'RandomText16', 'RandomText16', 'RandomText16', 'RandomText16', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160626', '26/06/2016', '362', 'RandomText17', 'RandomText17', 'RandomText17', 'RandomText17', 'SYS2 ');

When I tried the Sting_Agg code, the result I get is a column with as many occurrences of an ID separated by a delimiter, which is not what I need. The expected output should be

CREATE TABLE #output 
(
    Employee_IDvarchar(300),
    SSNvarchar(300),
    Pay_IDvarchar(300),
    Datevarchar(300),
    AmountPaidvarchar(300),
    Col1varchar(300),
    Col2varchar(300),
    Col3varchar(300),
    Col4varchar(300),
    source varchar(300)
);

INSERT INTO #output (Employee_ID,SSN,Pay_ID,Date,AmountPaid,Col1,Col2,Col3,Col4,source )
VALUES
    ('A00001_B00001', 'SSN1', 'A00001_20170910', '10/09/2017', '412', 'RandomText1', 'RandomText2', 'RandomText3', 'RandomText4', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170910', '10/09/2017', '164', 'RandomText2', 'RandomText3', 'RandomText4', 'RandomText5', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'A00001_20180114', '14/01/2018', '295', 'RandomText3', 'RandomText4', 'RandomText5', 'RandomText6', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '146', 'RandomText4', 'RandomText5', 'RandomText6', 'RandomText7', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '183', 'RandomText5', 'RandomText6', 'RandomText7', 'RandomText8', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '211', 'RandomText6', 'RandomText7', 'RandomText8', 'RandomText9', 'SYS1 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170521', '21/05/2017', '271', 'RandomText7', 'RandomText7', 'RandomText7', 'RandomText7', 'SYS2 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170409', '9/04/2017', '310', 'RandomText8', 'RandomText8', 'RandomText8', 'RandomText8', 'SYS2 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170409', '9/04/2017', '449', 'RandomText9', 'RandomText9', 'RandomText9', 'RandomText9', 'SYS2 '),
    ('A00001_B00001', 'SSN1', 'A00001_20170312', '12/03/2017', '385', 'RandomText10', 'RandomText10', 'RandomText10', 'RandomText10', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'A00002_20201126', '26/11/2020', '353', 'RandomText11', 'RandomText11', 'RandomText11', 'RandomText11', 'SYS1 '),
    ('A00002_B00002', 'SSN2', 'A00002_20201126', '26/11/2020', '305', 'RandomText12', 'RandomText12', 'RandomText12', 'RandomText12', 'SYS1 '),
    ('A00002_B00002', 'SSN2', 'A00002_20201126', '26/11/2020', '481', 'RandomText13', 'RandomText13', 'RandomText13', 'RandomText13', 'SYS1 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '461', 'RandomText14', 'RandomText14', 'RandomText14', 'RandomText14', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '363', 'RandomText15', 'RandomText15', 'RandomText15', 'RandomText15', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '499', 'RandomText16', 'RandomText16', 'RandomText16', 'RandomText16', 'SYS2 '),
    ('A00002_B00002', 'SSN2', 'B00002_20160626', '26/06/2016', '362', 'RandomText17', 'RandomText17', 'RandomText17', 'RandomText17', 'SYS2 ');

The number of rows remains the same in the input and the output. Because an SSN can have two ids associated with, I want to combine the two individual ids coming from the different systems and make it the new Employee ID

Hope that helped

Thanks

This reply was modified 1 year, 9 months ago by masterelaichi.

kaj SSCommitted Points: 1580 More actions · Answer 4

How about something like this?

WITH ssn_empID AS (
        SELECT DISTINCT 
            Employee_ID, SSN
        FROM #input
    ),
    ssn_empIDagg AS (
        SELECT 
            STRING_AGG(Employee_ID,'_') AS newEmployee_ID,
            SSN
        FROM ssn_empID
        GROUP BY SSN
    )
SELECT 
    se.newEmployee_ID,
    i.*
FROM #input i
JOIN ssn_empIDagg se ON se.ssn = i.ssn

masterelaichi SSC-Addicted Points: 444 More actions · Answer 5

hmm, I tried the above code. I end up getting a number ids concatenated together. For example, if ID A00001 has multiple rows on 3 paydates and Id B00001 has multiple rows for 2 paydates, I end up getting "A00001_A00001 _A00001_B00001 _B00001"

Jeff Moden SSC Guru Points: 1003933 More actions · Answer 6

masterelaichi wrote:

However, every employee has a unique SSN.

For the love of all that is holy in SQL Server, please tell me that your SSNs are NOT in plain, 9 digit text! If they are, you folks really need to fix them and I mean NOW! And, no... I'm not talking about hashing... I'm talking about real encryption with a salt and a master key, etc.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

masterelaichi SSC-Addicted Points: 444 More actions · Answer 7

The column in my dataset is not actually SSN. I just used it as example in this scenario to highlight a unique column

Jeff Moden SSC Guru Points: 1003933 More actions · Answer 8

Hopefully, that's true. The last person that told me that was at a company the I worked at temporarily. They told me that AFTER I had already seen that it was untrue.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

masterelaichi SSC-Addicted Points: 444 More actions · Answer 9

I think I managed to solve it using two CTEs and unioning them. I then used another CTE on the combined table to remove any duplicates

This reply was modified 1 year, 9 months ago by masterelaichi.

sgmunson SSC Guru Points: 110639 More actions · Answer 10

I'm pretty sure you don't need multiple CTE's to accomplish this task. Given the data supplied in the original post, the following should work (note that varchar(300) is rather wasteful for the data supplied, so I brought that down to just 20):

CREATE TABLE #input (
    Employee_ID varchar(20),
    SSN varchar(20),
    Pay_ID varchar(20),
    [Date] varchar(20),
    AmountPaid varchar(20),
    Col1 varchar(20),
    Col2 varchar(20),
    Col3 varchar(20),
    Col4 varchar(20),
    [source] varchar(20)
);

INSERT INTO #input (Employee_ID,SSN,Pay_ID,Date,AmountPaid,Col1,Col2,Col3,Col4,source )
VALUES
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '412', 'RandomText1', 'RandomText2', 'RandomText3', 'RandomText4', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170910', '10/09/2017', '164', 'RandomText2', 'RandomText3', 'RandomText4', 'RandomText5', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20180114', '14/01/2018', '295', 'RandomText3', 'RandomText4', 'RandomText5', 'RandomText6', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '146', 'RandomText4', 'RandomText5', 'RandomText6', 'RandomText7', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '183', 'RandomText5', 'RandomText6', 'RandomText7', 'RandomText8', 'SYS1 '),
    ('B00001', 'SSN1', 'B00001_20201210', '10/12/2020', '211', 'RandomText6', 'RandomText7', 'RandomText8', 'RandomText9', 'SYS1 '),
    ('A00001', 'SSN1', 'A00001_20170521', '21/05/2017', '271', 'RandomText7', 'RandomText7', 'RandomText7', 'RandomText7', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '310', 'RandomText8', 'RandomText8', 'RandomText8', 'RandomText8', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170409', '9/04/2017', '449', 'RandomText9', 'RandomText9', 'RandomText9', 'RandomText9', 'SYS2 '),
    ('A00001', 'SSN1', 'A00001_20170312', '12/03/2017', '385', 'RandomText10', 'RandomText10', 'RandomText10', 'RandomText10', 'SYS2 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '353', 'RandomText11', 'RandomText11', 'RandomText11', 'RandomText11', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '305', 'RandomText12', 'RandomText12', 'RandomText12', 'RandomText12', 'SYS1 '),
    ('A00002', 'SSN2', 'A00002_20201126', '26/11/2020', '481', 'RandomText13', 'RandomText13', 'RandomText13', 'RandomText13', 'SYS1 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '461', 'RandomText14', 'RandomText14', 'RandomText14', 'RandomText14', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '363', 'RandomText15', 'RandomText15', 'RandomText15', 'RandomText15', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160612', '12/06/2016', '499', 'RandomText16', 'RandomText16', 'RandomText16', 'RandomText16', 'SYS2 '),
    ('B00002', 'SSN2', 'B00002_20160626', '26/06/2016', '362', 'RandomText17', 'RandomText17', 'RandomText17', 'RandomText17', 'SYS2 ');

WITH AGGREGATES AS (

SELECT DISTINCT I.SSN, I.Employee_ID
FROM #input AS I
)
SELECT
A.SSN,
STRING_AGG(A.Employee_ID, '_') WITHIN GROUP (ORDER BY A.Employee_ID) AS NEW_EMP_ID
FROM AGGREGATES AS A
GROUP BY A.SSN
ORDER BY A.SSN;

The results are:

SSNNEW_EMP_ID
SSN1A00001_B00001
SSN2A00002_B00002

This reply was modified 1 year, 9 months ago by sgmunson. Reason: o instead of 0 needed fixed
This reply was modified 1 year, 9 months ago by sgmunson. Reason: Forgot to include results

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

masterelaichi SSC-Addicted Points: 444 More actions · Answer 11

The reason I used two CTEs is because I have another dataset in which the ID are in two different columns leading to duplicates

With the sample above, I am not able to extract the complete payroll data with all the columns. Here is the code I tried

WITH AGGREGATES AS (

SELECT DISTINCT I.*
FROM #input AS I
)
SELECT
A.SSN,
STRING_AGG(A.Employee_ID, '_') WITHIN GROUP (ORDER BY A.Employee_ID) AS NEW_EMP_ID
FROM AGGREGATES AS A
GROUP BY A.SSN
ORDER BY A.SSN;


Result
SSNNEW_EMP_ID
SSN1A00001_A00001_A00001_A00001_A00001_A00001_A00001_B00001_B00001_B00001
SSN2A00002_A00002_A00002_B00002_B00002_B00002_B00002