SQLServerCentral Article

Proper Storage and Handling of Personal Names

,

Typical Design

When designing a data warehouse, what I've encountered is that typically there is a table in a source database that stores name, birthdate, and other info relating to a person. Often, when I encounter personal names in a source table, I find that there are fields for storing and differentiating name parts, similar to this:

first_name varchar(20) not null,
middle_name varchar(20) null,
last_name varchar(25) not null

What we can see here is that certain assumptions have been made by the designers of the source system:

  1. each person entered will have a first and last name, but may or may not have a middle name
  2. each person entered will not have a first name longer than 20 characters, or a last name longer than 25 characters

Problems arise when a system like this must be used to store names that do not fit this pattern. For example, there are people with very long names. How would this system handle names longer than 25 characters? 

Some Example Personal Names

Let's look at some examples. Here are some customers that need to be input into our system:

  • Tyrion Lannister
  • Lee Jun-fan (AKA Bruce Lee)
  • Sinéad O'Connor
  • Cher
  • Nebuchadnezzar
  • Gordon Matthew Thomas Sumner (AKA Sting)
  • Shreya Ghoshal
  • Smith (the first name is missing or not known)

Let's create a new schema for the demo 

CREATE TABLE demo.Customers(
  cst_id int IDENTITY(1,1) NOT NULL,
  cst_last_name varchar(25) NOT NULL,
  cst_first_name varchar(20) NOT NULL, 
  cst_middle_name varchar(20) NULL,
  CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED 
(
  cst_id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

The Customers table above will handle most names, but try this one:

INSERT INTO demo.Customers(
  cst_last_name
  , cst_first_name
  , cst_middle_name
)
VALUES  ('Wolfeschlegelsteinhausenbergerdorff','Hubert','Blaine')

and you will receive the error

Msg 8152, Level 16, State 14, Line 1

String or binary data would be truncated.

The statement has been terminated.

Truncating the name ahead of the INSERT is a possibility, but does it make sense to do this? Generally, it's rare for a name to be longer than 25 characters, but would there be any negatives if we allow storage of 50 characters for a name part, or 100 characters for a full name? In this age of cheaper storage, perhaps we should consider planning for the storage of longer names.

Let's look at the other examples shown above and insert them into the table.

INSERT INTO demo.Customers(
    cst_last_name
  , cst_first_name
  , cst_middle_name
)
VALUES ('Lannister','Tyrion',null)
  , ('Jun-fan','Lee',null) -- OR is it  ('Lee','Jun-fan',null) ?
  , ('O''Connor','Sinéad',null)
  , ('Cher',null,null)
  , ('Nebuchadnezzar',null,null)
  , ('Sumner','Gordon','Matthew Thomas')
  , ('Ghoshal','Shreya',null) 
  , ('Smith',null,null)

If you run this, you'll get this error:

Msg 515, Level 16, State 2, Line 1

Cannot insert the value NULL into column 'cst_first_name', table 'Tung.demo.Customers'; column does not allow nulls. INSERT fails.

The statement has been terminated.

Now, there are thousands of mononymous people in the world, particularly in countries like India and Indonesia. These are folks that have only one name, and so their names do not fit in the "first_name last_name" or "first_name middle_name last_name" patterns. 

How can we represent those names if two names are required (not null) like in the example above? I have seen a number of workarounds, like "NFN" (No First Name), "Blank", "Unknown", and even "NOFIRSTNAME". It's not pretty!

So, we will need to allow NULLs in the cst_first_name field if we are going to allow for mononyms. Let's ALTER the column to allow NULLs so the INSERT will run without error:

ALTER TABLE demo.Customers 
  ALTER COLUMN cst_first_name varchar(20) NULL

Now, try that INSERT again. It works just fine. However, we may want to differentiate between unknown or missing names and proper mononyms. We might consider re-creating this table with an added cst_mononym_flag column:

DROP TABLE demo.Customers 
CREATE TABLE demo.Customers(
  cst_id int IDENTITY(1,1) NOT NULL,
  cst_last_name varchar(50) NOT NULL,
  cst_first_name varchar(50) NULL, 
  cst_middle_name varchar(50) NULL,
  cst_mononym_flag char(1) NOT NULL, -- or use BIT type
 CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED 
(
  cst_id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

Name Ordering

Now, we can differentiate between a mononym and a missing value in the next example, but first I want to mention another assumption that I have come across many times: that a full name should be represented by "first_name middle_name last_name", in that order. The problem here is that in many Asian countries, the order of names differs: "last_name first_name".

So this begs another question: should we be using the field or column names "last_name" and "first_name", when those names denote order, and yet this order is not always going to be what we expect? It may be better to use the term "surname" (or family name) rather than "last_name", and "given_name" rather than first_name". 

CREATE TABLE demo.Customers_3(
  cst_id int IDENTITY(1,1) NOT NULL,
  cst_surname varchar(50) NOT NULL,
  cst_given_name varchar(50) NULL, 
  cst_middle_name varchar(50) NULL,
  cst_mononym_flag char(1) NOT NULL, -- or use BIT type
 CONSTRAINT PK_Customers_3 PRIMARY KEY CLUSTERED 
(
  cst_id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO demo.Customers_3(
    cst_surname 
  , cst_given_name 
  , cst_middle_name
  , cst_mononym_flag
)
VALUES ('Lannister','Tyrion',null,'N')
  , ('Lee','Jun-fan',null,'N')
  , ('O''Connor','Sinéad',null,'N')
  , ('Cher',null,null,'Y')
  , ('Nebuchadnezzar',null,null,'Y')
  , ('Sumner','Gordon','Matthew Thomas','N')
  , ('Ghoshal','Shreya',null,'N') 
  , ('Smith',null,null,'N')
  

Accents and International Considerations

Can your name fields handle certain punctuation marks, like apostrophes and dashes? How about diacritics (accent marks)? All of these things must be taken into consideration when designing an application front- and back-end. In SQL Server, we need to consider our database collation settings to ensure it will properly deal with the kind of characters we will be dealing with in our data.

Full Names Instead Of Name Parts

We should also seriously consider if it's necessary to store and differentiate name parts. Why not have a "full_name" column, that can more easily handle most any name, and a familiar or short name variation for usage in correspondence? 

CREATE TABLE demo.Customers_4(
  cst_id int IDENTITY(1,1) NOT NULL,
  cst_full_name varchar(100) NOT NULL,
  cst_short_name varchar(50) NULL, 
 CONSTRAINT PK_Customers_4 PRIMARY KEY CLUSTERED 
(
  cst_id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO demo.Customers_4(
    cst_full_name
  , cst_short_name
)
VALUES ('Tyrion Lannister','Tyrion')
  ,('Lee Jun-fan','Bruce')
  ,('Sinéad O''Connor','Sinéad')
  ,('Cher','Cher')
  ,('Nebuchadnezzar','Nebuchadnezzar')
  ,('Gordon Matthew Thomas Sumner','Sting')
  ,('Shreya Ghoshal','Shreya') 
  ,('Smith','Smith')

Native Language Names

Global data will often require names to be stored in the native language of the user or customer, in addition to the Latinized form. We can include Unicode columns to store this:

CREATE TABLE demo.Customers_5(
  cst_id int IDENTITY(1,1) NOT NULL,
  cst_full_name varchar(100) NOT NULL,
  cst_full_name_native_language nvarchar(100) NULL,
  cst_short_name varchar(50) NULL, 
  cst_short_name_native_language nvarchar(50) NULL,
 CONSTRAINT PK_Customers_5 PRIMARY KEY CLUSTERED 
(
  cst_id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO demo.Customers_5(
    cst_full_name
  , cst_full_name_native_language
  , cst_short_name
  , cst_short_name_native_language
)
VALUES ('Tyrion Lannister',N'Tyrion Lannister','Tyrion',N'Tyrion')
  , ('Lee Jun-fan',N'???','Bruce',N'??')
  , ('Sinéad O''Connor',N'Sinéad O''Connor','Sinéad',N'Sinéad')
  , ('Cher',N'Cher','Cher',N'Cher')
  , ('Nebuchadnezzar',N'????????','Nebuchadnezzar',N'????????')
  , ('Gordon Matthew Thomas Sumner',N'Gordon Matthew Thomas Sumner','Sting',N'Sting')
  , ('Shreya Ghoshal',N'??????? ?????','Shreya',N'???????') 
  , ('Smith',N'Smith','Sir',N'Sir')

Summary

In summary, we really ought to give more thought to the entry, storage, and display of personal names, particularly in an age where customer service is of such high priority, and where we see much larger databases encompassing more global data.

Rate

3.62 (29)

You rated this post out of 5. Change rating

Share

Share

Rate

3.62 (29)

You rated this post out of 5. Change rating