Project Context and Data Presentation Challenges
In our projects we have seen, data quality issues manifest in various forms, such as NULL values, incorrect formats, invalid characters, duplicate records, and non-ASCII characters to name a few. This article explores SQL functions for identifying and resolving some of the common data quality issues such as NULL values, inconsistent data formats, invalid characters, duplicates, and non-ASCII characters.
- Handling NULL Values
- Resolving Data Format Issues
- Identifying Invalid Characters
- Managing Duplicate Records
- Addressing Non-ASCII Characters
Use Case/Examples
Lets start by creating a table and inserting sample data.
---STEP 1: Create Table DDL Statement CREATE TABLE Employees ( ID INT , Employee NVARCHAR(100), Email NVARCHAR(100), JoinDate NVARCHAR(10), Salary NVARCHAR(50), Department NVARCHAR(50) );
---STEP 2: Insert sample data INSERT INTO Employees (ID, Employee, Email, JoinDate, Salary, Department) VALUES (1, 'Johnson', 'johnson@email', '2023-01-07', '60000', 'IT'), (2, 'ABC123', 'abc@domain.com', '21/10/2019', '50000', 'HR'), (3, 'Robert', 'Robert@email.com', '2024-09-15', '70000', 'IT'), (4, 'Robert', 'Rober@email.com', '2023-09-15', '70000a', 'IT'), (5, 'Jack D@vis', 'jack.davis@email.com', '2021-11-11', NULL, 'HR'), (6, NULL, 'john123@email.com', '2022-06-21', '48000', 'Marketing'), (7, 'José Gómez', 'jose.gomez@email.com', '2021-02-10', '65000', 'Marketing'), (8, '"这个产品非常好"', 'mark.sparks@email.com', '2021-12-05', '52000', 'Sales');
After inserting the sample data, the table would be as shown below:
ID | Employee | JoinDate | Salary | Department | |
1 | Johnson | johnson@email | 2023-01-07 | 60000 | IT |
2 | ABC123 | abc@domain.com | 21/10/2019 | 50000 | HR |
3 | Robert | Robert@email.com | 2024-09-15 | 70000 | IT |
4 | Robert | Rober@email.com | 2023-09-15 | 70000a | IT |
5 | Jack D@vis | jack.davis@email.com | 2021-11-11 | NULL | HR |
6 | NULL | john123@email.com | 2022-06-21 | 48000 | Marketing |
7 | José Gómez | jose.gomez@email.com | 2021-02-10 | 65000 | Marketing |
8 | 这个产品非常好 | mark.sparks@email.com | 2021-12-05 | 52000 | Sales |
Now we can look at the different use cases of these SQL functions that help to rectify the data quality issues
USE CASE 1: Identifying and Handling NULL Values
To identify NULL Values, use "IS NULL" to locate records with missing data.
--Find records with NULL values as in the examples of "Employee" or "Salary" column SELECT * FROM Employees WHERE Employee IS NULL OR Salary IS NULL;
The results would be:
ID | Employee | JoinDate | Salary | Department | |
5 | Jack D@vis | jack.davis@email.com | 2021-11-11 | NULL | HR |
6 | NULL | john123@email.com | 2022-06-21 | 48000 | Marketing |
Use COALESCE()/ ISNULL() to replace NULL values with default values. In this example we do two things:
- Replace NULL values in the Employee column with the string 'Unknown' using ISNULL() function
- Replace NULL value in the "Salary" column to '0' using COALESCE()function
-- Replace NULL values with default values using COALESCE() or ISNULL() SELECT ID , ISNULL (Employee, 'Unknown') AS Employee , Email , JoinDate , COALESCE (Salary, '0') AS Salary , Department FROM Employees;
The output shows replacing NULL values in Employee with "Unknown" & Salary with "0" would now be displayed as shown below:
ID | Employee | JoinDate | Salary | Department | |
5 | Jack D@vis | jack.davis@email.com | 2021-11-11 | 0 | HR |
6 | Unknown | john123@email.com | 2022-06-21 | 48000 | Marketing |
USE CASE 2: Resolving Data Format Issues (Dates, Emails)
These would be crucial as consistent data formats are vital for accurate analysis and reporting. In this example we will look at the Identifying two different formatting issues: date format and email formats using inbuilt functions.
-- 1a)Identifying DATE FORMAT ISSUE: rows where JoinDate is not a valid date SELECT * FROM Employees WHERE ISDATE(JoinDate) = 0
The query correctly identifies rows where JoinDateis not a valid date, either because it's in an invalid format, contains non-date characters, or is NULL. The default data type for storing dates in SQL is “YYYY-MM-DD” format. In our table, 21/10/2019 is not proper valid date format.
The ISDATE() function is a SQL Server function that checks if the value in the JoinDate column can be interpreted as a valid date. ISDATE() returns:
- 1 if the value can be converted to a valid date (i.e., the format is correct, and the value is a valid date).
- 0 if the value cannot be interpreted as a valid date (i.e., it may be in an invalid format or contain non-date characters).
The WHERE ISDATE(JoinDate) = 0 clause filters out any rows where JoinDate is not a valid date (i.e., it either cannot be interpreted as a date or contains invalid characters)
The output of the query is shown:
ID | Employee | JoinDate | Salary | Department | |
2 | ABC123 | abc@domain.com | 21/10/2019 | 50000 | HR |
Next, let's look at identifying email format issues.
-- 1b)Identifying EMAIL FORMAT ISSUE: Use PATINDEX to check the proper email address format SELECT * FROM Employees WHERE PATINDEX(‘%@%.%’, Email) = 0;
In this query, the PATINDEX function is used to search for a pattern in a string and return the position of the first occurrence of that pattern. If the pattern is found, it returns the starting position of the pattern; otherwise, it returns 0. The parameters are:
- %Pattern%: The pattern you want to search for, enclosed in percent signs (%).
- Expression: The string expression (column or literal) in which to search for the pattern.
- PATINDEX('%@%.%', Email) = 0: This condition filters the rows where the Email column does not contain an @ followed by a .(dot).
The Output from this query is shown below:
ID | Employee | JoinDate | Salary | Department | |
1 | Johnson | johnson@email | 2023-01-07 | 60000 | IT |
Handling the Formatting Issues
In the examples for dates and emails, you can choose to update any alternative values where applicable as per your need but use the same technique to do updates.
To fix the date issue, we can use TRY_CONVERT with this code:
-- 2a) Handling the Date Format Issue UPDATE Employees SET JoinDate = TRY_CONVERT(DATE, JoinDate, 103) WHERE ISDATE(JoinDate) = 0;
In this code, TRY_CONVERT(DATE, JoinDate, 103), attempts to convert the JoinDate column to a DATE data type. 103 is the style code for converting a date in the dd/mm/yyyy format (i.e., British/French style). TRY_CONVERT is a safe version of CONVERT. If the conversion fails (e.g., if the date format is invalid or not recognized), it will return NULL instead of raising an error.
The WHERE ISDATE(JoinDate) = 0 condition ensures that only rows where JoinDate is invalid (i.e., - where ISDATE returns 0) will be updated.
--2b) Handling the Email Format Issue UPDATE Employees SET Email = ‘johnson@email.com’ WHERE PATINDEX('%@%.%', Email) = 0;
In this code, the SET Email = johnson@email.com’ sets the email attribute to a default or placeholder value (you can choose to update another value if necessary). The WHERE PATINDEX('%@%.%', Email) = 0: This ensures the update only applies to rows where the email attribute does not contain a valid email format, as indicated by the PATINDEX function returning '0'.
The output from these two updates would now be displayed as below:
ID | Employee | JoinDate | Salary | Department | |
2 | ABC123 | abc@domain.com | 2019-10-21 | 50000 | HR |
1 | Johnson | johnson@email.com | 2023-01-07 | 60000 | IT |
USE CASE 3: Identifying and Resolving Invalid Characters
In this example we will look at identifying invalid characters especially for names or numeric fields. We'll start with this code:
-- 1a) Identifying invalid characters in NAME field SELECT * FROM Employees WHERE employee LIKE '%[^A-Za-z ]%'
In this code, for identifying invalid characters, such as an incorrectly formatted name. Here is the explanation for this code: Pattern %[^A-Za-z ]%
- The caret (^) is a not operatior for the next part.
- The [^A-Za-z ] part of the pattern matches any character that is not a letter (A-Z, a-z) or a space. It could match digits, punctuation marks, special characters, or any other non-alphabetical character.
- The % symbols before and after [^A-Za-z ] indicate that any sequence of characters (including none) can precede or follow the non-letter, non-space character
Next, let's look for invalid characters in a field that should only contain numbers. Here is the code
--1b) Detecting Invalid Characters in Numeral field (Check if Salary contains non-numeric chars) SELECT * FROM Employees WHERE Salary LIKE '%[^0-9]%';
The pattern %[^0-9]%, which matches any character except the ones inside the brackets, will find non-numeric characters in salary fields. The pattern means any string that contains at least one character that is not a digit.
The output would be displayed as below
ID | Employee | JoinDate | Salary | Department | |
2 | ABC123 | abc@domain.com | 2019-10-21 | 50000 | HR |
5 | Jack D@vis | jack.davis@email.com | 2021-11-11 | 0 | HR |
8 | 这个产品非常好 | mark.sparks@email.com | 2021-12-05 | 52000 | Sales |
4 | Robert | Rober@email.com | 2023-09-15 | 70000a | IT |
Handling Invalid Characters in Name and Numbers
You can choose to update any alternative values where applicable as per your need but use the same technique to do updates. Since we already know the specific Employee IDs that require modification, we have applied a straightforward UPDATE statement to address the necessary changes.
-- Step 2a: Resolving Invalid Characters in Employee Name field UPDATE Employees SET Employee = CASE WHEN ID = 2 THEN 'ABC' WHEN ID = 5 THEN 'Jack Davis' WHEN ID = 8 THEN 'Unknown Employee' ELSE Employee END
-- Step 2b: Resolving Invalid Characters in Salary field UPDATE Employees SET salary = '7000' where ID=4 and Salary LIKE '%[^0-9]%'
The output post handling the data format changes for email and date would now be displayed as below
ID | Employee | JoinDate | Salary | Department | |
2 | ABC | abc@domain.com | 2019-10-21 | 50000 | HR |
5 | Jack Davis | jack.davis@email.com | 2021-11-11 | 0 | HR |
8 | Unknown Employee | mark.sparks@email.com | 2021-12-05 | 52000 | Sales |
4 | Robert | Rober@email.com | 2023-09-15 | 70000 | IT |
USE CASE 4: Identifying and Resolving Duplicates
In this example we will look at identifying duplicate employees by employee name. We will use GROUP BY with HAVING to find duplicate entries.
-- Find duplicate records based on Employee using GROUP BY and HAVING SELECT Employee, COUNT(*) FROM Employees GROUP BY Employee HAVING COUNT(*) > 1
In this code, the GROUP BY Employee groups the rows by the values in the Employee column. This means that instead of counting all rows in the table as a single group, the query groups the data by each unique employee. The HAVING COUNT(*) > 1 clause is used to filter the results after the GROUP BY operation based on aggregate functions like COUNT.Count(*)>1nterested in groups where the count of employees is greater than 1
The output is a single row.
Employee | Count(*) |
Robert | 2 |
Handle duplicates by Employee Name by using the ROW_NUMBER() function to find duplicate and keep only unique records. You can choose to update any alternative values where applicable as per your need but use the same technique to do updates.
-- Handling the Duplicate Records (Keep the latest created record) WITH Duplicates AS ( SELECT ID, Employee, Email, ROW_NUMBER() OVER (PARTITION BY Employee,Email ORDER BY JoinDate desc) AS RowNumber FROM Employees ) DELETE FROM Employees WHERE ID IN (SELECT ID FROM Duplicates WHERE RowNumber > 1);
The code explanation uses a CTE to identify and rank the duplicate records based on the Employee and Email columns. The ROW_NUMBER() function is used to rank records and determine which duplicates to delete, with the most recent record being retained. The PARTITION BY Employee, Email assigns unique sequential number to each record within the partitioned groups (based on Employee and Email). The ORDER BY JoinDate DESC ensures that the most recent record (based on the JoinDate descending order) within each duplicate group gets the RowNumber = 1. All subsequent duplicate rows will have RowNumbers greater than 1.
We use the delete to remove all rows from the Employees table that are considered duplicates, which are the older records, those with a RowNumber greater than 1. The most recent record (with RowNumber = 1) for each unique combination of Employee and Email is not deleted, ensuring that the latest entry is retained.
The output BEFORE DELETION of duplicates would be as below with the rank assigned to all the employee records.
SELECT *, ROW_NUMBER() OVER (PARTITION BY Employee,Email ORDER BY JoinDate desc) AS Rn FROM Employees
ID | Employee | JoinDate | Salary | Department | ROWNumber | |
1 | Johnson | johnson@email | 2023-07-01 | 60000 | IT | 1 |
2 | ABC123 | abc@domain.com | 21/10/2019 | 50000 | HR | 1 |
3 | Robert | Robert@email.com | 2024-09-15 | 70000 | IT | 1 |
4 | Robert | Rober@email.com | 2023-09-15 | 70000a | IT | 2 |
5 | Jack D@vis | jack.davis@email.com | 2021-11-11 | NULL | HR | 1 |
6 | NULL | john123@email.com | 2022-06-21 | 48000 | Marketing | 1 |
7 | José Gómez | jose.gomez@email.com | 2021-02-10 | 65000 | Marketing | 1 |
8 | 这个产品非常好 | mark.sparks@email.com | 2021-12-05 | 52000 | Sales | 1 |
The output after the deletion of duplicates would be as below
ID | Employee | JoinDate | Salary | Department | |
1 | Johnson | johnson@email | 2023-07-01 | 60000 | IT |
2 | ABC123 | abc@domain.com | 21/10/2019 | 50000 | HR |
3 | Robert | Robert@email.com | 2024-09-15 | 70000 | IT |
5 | Jack D@vis | jack.davis@email.com | 2021-11-11 | NULL | HR |
6 | NULL | john123@email.com | 2022-06-21 | 48000 | Marketing |
7 | José Gómez | jose.gomez@email.com | 2021-02-10 | 65000 | Marketing |
8 | 这个产品非常好 | mark.sparks@email.com | 2021-12-05 | 52000 | Sales |
USE CASE 5: Addressing Non-ASCII Characters
In this example we will look at identifying Non- ASCII characters. This would become important in some context where Non-ASCII characters may cause display issues or encoding errors, especially in text fields.
-- Step1 :Detecting Non-ASCII Characters: Use COLLATE Latin1_General_BIN LIKE '%[^ -~]%'to locate entries -- with characters outside the ASCII range. -- Check if Employee Name contains any character outside the ASCII range (this includes accented -- characters and non-Latin scripts like Chinese) SELECT * FROM Employees WHERE Employee COLLATE Latin1_General_BIN LIKE '%[^ -~]%';
The explanation of this code is that the COLLATE Latin1_General_BIN clause treats characters according to their binary representation. The LIKE '%[^ -~]%': This pattern matches any character that is NOT in the ASCII range. The results of this code are shown here:
ID | Employee | JoinDate | Salary | Department | |
7 | José Gómez | jose.gomez@email.com | 2021-02-10 | 65000 | Marketing |
8 | 这个产品非常好 | mark.sparks@email.com | 2021-12-05 | 52000 | Sales |
In gandling Non ASCII characters, you can choose to update any alternative values where applicable as per your need. Since we already know the specific employee name that requires modification, we can apply a straightforward UPDATE statement to address the necessary changes.
UPDATE Employees SET Employee = 'Unknown Employee' WHERE Employee COLLATE Latin1_General_BIN LIKE '%[^ -~]%';
The output, post update, of Non- ASCII employee name to "Unknown Employee" would be displayed in the table as below
ID | Employee | JoinDate | Salary | Department | |
7 | Unknown Employee | jose.gomez@email.com | 2021-02-10 | 65000 | Marketing |
8 | Unknown Employee | mark.sparks@email.com | 2021-12-05 | 52000 | Sales |
Conclusion
In conclusion, addressing poor data quality is paramount for large enterprises seeking to unlock meaningful value from their data assets. The challenges posed by the variety & volume of data underscore the importance of implementing comprehensive data management strategies that prioritize accuracy and consistency. By leveraging SQL's powerful functions, organizations can proactively identify and resolve data quality issues, thereby enhancing the integrity of their data. High-quality, reliable data not only leads to better business insights and informed decision-making but also minimizes errors and reduces the need for costly manual interventions.