SQLServerCentral Article

Boosting Data Accuracy: Resolving Common Data Quality Issues Using SQL

,

Project Context and Data Presentation Challenges

In our projects we have seen, data quality issues manifest in various forms, such as NULL values, incorrect formats, invalid characters, duplicate records, and non-ASCII characters to name a few. This article explores SQL functions for identifying and resolving some of the common data quality issues such as NULL values, inconsistent data formats, invalid characters, duplicates, and non-ASCII characters.

  1. Handling NULL Values
  2. Resolving Data Format Issues
  3. Identifying Invalid Characters
  4. Managing Duplicate Records
  5. Addressing Non-ASCII Characters

Use Case/Examples

Lets start by creating a table and inserting sample data.

---STEP 1: Create Table DDL Statement
CREATE TABLE Employees (
    ID INT ,
    Employee NVARCHAR(100),
    Email NVARCHAR(100),
    JoinDate NVARCHAR(10),
    Salary NVARCHAR(50),
    Department NVARCHAR(50)
);
---STEP 2:  Insert sample data
INSERT INTO Employees (ID, Employee, Email, JoinDate, Salary, Department) VALUES
(1, 'Johnson', 'johnson@email', '2023-01-07', '60000', 'IT'),
(2, 'ABC123', 'abc@domain.com', '21/10/2019', '50000', 'HR'),
(3, 'Robert', 'Robert@email.com', '2024-09-15', '70000', 'IT'),
(4, 'Robert', 'Rober@email.com', '2023-09-15', '70000a', 'IT'),
(5, 'Jack D@vis', 'jack.davis@email.com', '2021-11-11', NULL, 'HR'),
(6, NULL, 'john123@email.com', '2022-06-21', '48000', 'Marketing'),
(7, 'José Gómez', 'jose.gomez@email.com', '2021-02-10', '65000', 'Marketing'),
(8, '"这个产品非常好"', 'mark.sparks@email.com', '2021-12-05', '52000', 'Sales');

After inserting the sample data, the table would be as shown below:

IDEmployeeEmailJoinDateSalaryDepartment
1Johnsonjohnson@email2023-01-0760000IT
2ABC123abc@domain.com21/10/201950000HR
3RobertRobert@email.com2024-09-1570000IT
4RobertRober@email.com2023-09-1570000aIT
5Jack D@visjack.davis@email.com2021-11-11NULLHR
6NULLjohn123@email.com2022-06-2148000Marketing
7José Gómezjose.gomez@email.com2021-02-1065000Marketing
8这个产品非常好mark.sparks@email.com2021-12-0552000Sales

Now we can look at the different use cases of these SQL functions that help to rectify the data quality issues

USE CASE 1: Identifying and Handling NULL Values

To identify NULL Values, use "IS NULL" to locate records with missing data.

--Find records with NULL values as in the examples of "Employee" or "Salary" column
SELECT * FROM Employees
 WHERE Employee IS NULL OR Salary IS NULL;

The results would be:

IDEmployeeEmailJoinDateSalaryDepartment
5Jack D@visjack.davis@email.com2021-11-11NULLHR
6NULLjohn123@email.com2022-06-2148000Marketing

Use COALESCE()/ ISNULL() to replace NULL values with default values.  In this example we do two things:

  • Replace NULL values in the Employee column with the string 'Unknown' using ISNULL() function
  • Replace NULL value in the "Salary" column to '0' using COALESCE()function
-- Replace NULL values with default values using COALESCE() or ISNULL()
SELECT
  ID
, ISNULL (Employee, 'Unknown') AS Employee
, Email
, JoinDate
, COALESCE (Salary, '0') AS Salary
, Department
FROM Employees;

The output shows replacing NULL values in Employee with "Unknown" & Salary with "0"  would now be displayed as shown below:

IDEmployeeEmailJoinDateSalaryDepartment
5Jack D@visjack.davis@email.com2021-11-110HR
6Unknownjohn123@email.com2022-06-2148000Marketing

USE CASE 2: Resolving Data Format Issues (Dates, Emails)

These would be crucial as consistent data formats are vital for accurate analysis and reporting. In this example we will look at the Identifying two different formatting issues: date format and email formats using inbuilt functions.

-- 1a)Identifying DATE FORMAT ISSUE: rows where JoinDate is not a valid date
SELECT * FROM Employees
 WHERE ISDATE(JoinDate) = 0

The query correctly identifies rows where JoinDateis not a valid date, either because it's in an invalid format, contains non-date characters, or is NULL. The default data type for storing dates in SQL is “YYYY-MM-DD” format. In our table, 21/10/2019 is not proper valid date format.

The ISDATE() function is a SQL Server function that checks if the value in the JoinDate column can be interpreted as a valid date. ISDATE() returns:

      • 1 if the value can be converted to a valid date (i.e., the format is correct, and the value is a valid date).
      • 0 if the value cannot be interpreted as a valid date (i.e., it may be in an invalid format or contain non-date characters).

The WHERE ISDATE(JoinDate) = 0 clause filters out any rows where JoinDate is not a valid date (i.e., it either cannot be interpreted as a date or contains invalid characters)

The output of the query is shown:

IDEmployeeEmailJoinDateSalaryDepartment
2ABC123abc@domain.com21/10/201950000HR

Next, let's look at identifying email format issues.

-- 1b)Identifying EMAIL FORMAT ISSUE: Use PATINDEX to check the proper email address format
SELECT * FROM Employees
 WHERE PATINDEX(‘%@%.%’, Email) = 0;

In this query, the PATINDEX function is used to search for a pattern in a string and return the position of the first occurrence of that pattern. If the pattern is found, it returns the starting position of the pattern; otherwise, it returns 0. The parameters are:

  • %Pattern%: The pattern you want to search for, enclosed in percent signs (%).
  • Expression: The string expression (column or literal) in which to search for the pattern.
  • PATINDEX('%@%.%', Email) = 0: This condition filters the rows where the Email column does not contain an @ followed by a  .(dot).

The Output from this query is shown below:

IDEmployeeEmailJoinDateSalaryDepartment
1Johnsonjohnson@email2023-01-0760000IT

Handling the Formatting Issues

In the examples for dates and emails, you can choose to update any alternative values where applicable as per your need but use the same technique to do updates.

To fix the date issue, we can use TRY_CONVERT with this code:

-- 2a) Handling the Date Format Issue
UPDATE Employees
SET JoinDate = TRY_CONVERT(DATE, JoinDate, 103) WHERE ISDATE(JoinDate) = 0;

In this code, TRY_CONVERT(DATE, JoinDate, 103), attempts to convert the JoinDate column to a DATE data type. 103 is the style code for converting a date in the dd/mm/yyyy format (i.e., British/French style). TRY_CONVERT is a safe version of CONVERT. If the conversion fails (e.g., if the date format is invalid or not recognized), it will return NULL instead of raising an error.

The WHERE ISDATE(JoinDate) = 0 condition ensures that only rows where JoinDate is invalid (i.e., - where ISDATE returns 0) will be updated.

--2b) Handling the Email Format Issue
UPDATE Employees
 SET Email = ‘johnson@email.com’ 
 WHERE PATINDEX('%@%.%', Email) = 0;

In this code, the SET Email = johnson@email.com’ sets the email attribute to a default or placeholder value (you can choose to update another value if necessary). The WHERE PATINDEX('%@%.%', Email) = 0: This ensures the update only applies to rows where the email attribute does not contain a valid email format, as indicated by the PATINDEX function returning '0'.

The output from these two updates would now be displayed as below:

IDEmployeeEmailJoinDateSalaryDepartment
2ABC123abc@domain.com2019-10-2150000HR
1Johnsonjohnson@email.com2023-01-0760000IT

USE CASE 3: Identifying and Resolving Invalid Characters

In this example we will look at identifying invalid characters especially for names or numeric fields. We'll start with this code:

-- 1a) Identifying invalid characters in NAME field
SELECT *
 FROM Employees
 WHERE employee LIKE  '%[^A-Za-z ]%'

In this code, for identifying invalid characters, such as an incorrectly formatted name. Here is the explanation for this code: Pattern %[^A-Za-z ]%

  • The caret (^) is a not operatior for the next part.
  • The [^A-Za-z ] part of the pattern matches any character that is not a letter (A-Z, a-z) or a space. It could match digits, punctuation marks, special characters, or any other non-alphabetical character.
  • The % symbols before and after [^A-Za-z ] indicate that any sequence of characters (including none) can precede or follow the non-letter, non-space character

Next, let's look for invalid characters in a field that should only contain numbers. Here is the code

--1b) Detecting Invalid Characters in Numeral field (Check if Salary contains non-numeric chars)
SELECT *
 FROM Employees
 WHERE Salary LIKE '%[^0-9]%';

The pattern %[^0-9]%, which matches any character except the ones inside the brackets, will find non-numeric characters in salary fields. The pattern means any string that contains at least one character that is not a digit.

The output would be displayed as below

IDEmployeeEmailJoinDateSalaryDepartment
2ABC123abc@domain.com2019-10-2150000HR
5Jack D@visjack.davis@email.com2021-11-110HR
8这个产品非常好mark.sparks@email.com2021-12-0552000Sales
4RobertRober@email.com2023-09-1570000aIT

Handling Invalid Characters in Name and Numbers

You can choose to update any alternative values where applicable as per your need but use the same technique to do updates. Since we already know the specific Employee IDs  that require modification, we have applied a straightforward UPDATE statement to address the necessary changes.

-- Step 2a: Resolving Invalid Characters in Employee Name field
UPDATE Employees SET
    Employee = CASE
                 WHEN ID = 2 THEN 'ABC'
                 WHEN ID = 5 THEN 'Jack Davis'
                 WHEN ID = 8 THEN 'Unknown Employee'
                 ELSE Employee
               END
-- Step 2b: Resolving Invalid Characters in Salary field
UPDATE Employees SET
    salary = '7000' where ID=4 and Salary LIKE '%[^0-9]%'​

The output post handling the data format changes for email and date would now be displayed as below

IDEmployeeEmailJoinDateSalaryDepartment
2ABCabc@domain.com2019-10-2150000HR
5Jack Davisjack.davis@email.com2021-11-110HR
8Unknown Employeemark.sparks@email.com2021-12-0552000Sales
4RobertRober@email.com2023-09-1570000IT

USE CASE 4: Identifying and Resolving Duplicates

In this example we will look at identifying duplicate employees by employee name. We will use GROUP BY with HAVING to find duplicate entries.

-- Find duplicate records based on Employee using GROUP BY and HAVING
SELECT Employee, COUNT(*)
 FROM Employees
 GROUP BY Employee
 HAVING COUNT(*) > 1

In this code, the GROUP BY Employee groups the rows by the values in the Employee column. This means that instead of counting all rows in the table as a single group, the query groups the data by each unique employee. The HAVING COUNT(*) > 1 clause is used to filter the results after the GROUP BY operation based on aggregate functions like COUNT.Count(*)>1nterested in groups where the count of employees is greater than 1

The output is a single row.

EmployeeCount(*)
Robert2

Handle duplicates by Employee Name by using the ROW_NUMBER() function to find duplicate and keep only unique records. You can choose to update any alternative values where applicable as per your need but use the same technique to do updates.

-- Handling the Duplicate Records (Keep the latest created record)
WITH Duplicates AS
(
SELECT ID, Employee, Email, ROW_NUMBER() OVER (PARTITION BY Employee,Email ORDER BY JoinDate desc) AS RowNumber FROM Employees
)
DELETE FROM Employees
WHERE ID IN (SELECT ID FROM Duplicates WHERE RowNumber > 1);

The code explanation uses a CTE to identify and rank the duplicate records based on the Employee and Email columns. The ROW_NUMBER() function is used to rank records and determine which duplicates to delete, with the most recent record being retained. The PARTITION BY Employee, Email assigns unique sequential number to each record within the partitioned groups (based on Employee and Email). The ORDER BY JoinDate DESC ensures that the most recent record (based on the JoinDate descending order) within each duplicate group gets the RowNumber = 1. All subsequent duplicate rows will have RowNumbers greater than 1.

We use the delete to remove all rows from the Employees table that are considered duplicates, which are the older records, those with a RowNumber greater than 1. The most recent record (with RowNumber = 1) for each unique combination of Employee and Email is not deleted, ensuring that the latest entry is retained.

The output BEFORE DELETION of duplicates would be as below with the rank assigned to all the employee records.

SELECT *, ROW_NUMBER() OVER (PARTITION BY Employee,Email ORDER BY JoinDate desc) 
AS Rn FROM Employees
IDEmployeeEmailJoinDateSalaryDepartmentROWNumber
1Johnsonjohnson@email2023-07-0160000IT1
2ABC123abc@domain.com21/10/201950000HR1
3RobertRobert@email.com2024-09-1570000IT1
4RobertRober@email.com2023-09-1570000aIT2
5Jack D@visjack.davis@email.com2021-11-11NULLHR1
6NULLjohn123@email.com2022-06-2148000Marketing1
7José Gómezjose.gomez@email.com2021-02-1065000Marketing1
8这个产品非常好mark.sparks@email.com2021-12-0552000Sales1

The output after the deletion of duplicates would be as below

IDEmployeeEmailJoinDateSalaryDepartment
1Johnsonjohnson@email2023-07-0160000IT
2ABC123abc@domain.com21/10/201950000HR
3RobertRobert@email.com2024-09-1570000IT
5Jack D@visjack.davis@email.com2021-11-11NULLHR
6NULLjohn123@email.com2022-06-2148000Marketing
7José Gómezjose.gomez@email.com2021-02-1065000Marketing
8这个产品非常好mark.sparks@email.com2021-12-0552000Sales

USE CASE 5: Addressing Non-ASCII Characters

In this example we will look at identifying Non- ASCII characters. This would become important in some context where Non-ASCII characters may cause display issues or encoding errors, especially in text fields.

-- Step1 :Detecting Non-ASCII Characters: Use COLLATE Latin1_General_BIN LIKE '%[^ -~]%'to locate entries -- with characters outside the ASCII range.
-- Check if Employee Name contains any character outside the ASCII range (this includes accented          
-- characters and non-Latin scripts like Chinese) 
SELECT *
 FROM Employees
 WHERE Employee COLLATE Latin1_General_BIN LIKE '%[^ -~]%';

The explanation of this code is that the COLLATE Latin1_General_BIN clause treats characters according to their binary representation. The LIKE '%[^ -~]%': This pattern matches any character that is NOT in the ASCII range. The results of this code are shown here:

IDEmployeeEmailJoinDateSalaryDepartment
7José Gómezjose.gomez@email.com2021-02-1065000Marketing
8这个产品非常好mark.sparks@email.com2021-12-0552000Sales

In gandling Non ASCII characters, you can choose to update any alternative values where applicable as per your need. Since we already know the specific employee name that requires modification, we can apply a straightforward UPDATE statement to address the necessary changes.

UPDATE Employees SET Employee = 'Unknown Employee' WHERE Employee COLLATE Latin1_General_BIN LIKE '%[^ -~]%';

The output, post update, of Non- ASCII employee name to "Unknown Employee" would be displayed in the table as below

IDEmployeeEmailJoinDateSalaryDepartment
7Unknown Employeejose.gomez@email.com2021-02-1065000Marketing
8Unknown Employeemark.sparks@email.com2021-12-0552000Sales

Conclusion

In conclusion, addressing poor data quality is paramount for large enterprises seeking to unlock meaningful value from their data assets. The challenges posed by the variety & volume of data underscore the importance of implementing comprehensive data management strategies that prioritize accuracy and consistency. By leveraging SQL's powerful functions, organizations can proactively identify and resolve data quality issues, thereby enhancing the integrity of their data. High-quality, reliable data not only leads to better business insights and informed decision-making but also minimizes errors and reduces the need for costly manual interventions.

 

Rate

5 (2)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (2)

You rated this post out of 5. Change rating