String as primary key

Question

Post reply

String as primary key

N_Muller

SSCarpal Tunnel

Points: 4037
More actions
February 26, 2015 at 12:07 pm

#315688

I have tables in database where a VARCHAR(50) string is unique identifier. The database currently has an integer identity column as clustered primary key, and there's a non-clustered index on string column. The customer always queries based on a defined set of the identifier (string) column.
I wonder if someone sees an advantage of adding a persisted computed column to the table as the checksum of the string column, and then create a non-clustered index on the checksum and the string. When a customer requests data, we would compute the checksum of the customer provided identifier and add to the where clause or join, that the checksum and string must match.
Will SQL Server perform checksum check (integer) and only if it succeeds, perform the string check, in which case I see an advantage of added the checksum column? Or will SQL Server always check for both the checksum and string, in which case the additional column only adds unnecessary overhead? To note is the fact that the table(s) will have millions of rows, but the customer will request data for at at most, 100 or so identifiers.

Viewing 15 posts - 1 through 15 (of 30 total)

You must be logged in to reply to this topic. Login to reply

ZZartin SSC-Dedicated Points: 30894 More actions · Answer 1

Is there a problem just letting them query the column directly?

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 2

There is no problem letting them query the string directly. I'm mostly interested on the performance, though.

ScottPletcher SSC Guru Points: 100946 More actions · Answer 3

Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.

For example:

CREATE TABLE dbo.strings (

string_id int IDENTITY(1, 1),

string varchar(50)

)

In the main table:

string_id int

You might also store the string, denormalized, in the main table as well.

For lookups:

SELECT ...

FROM main_table

WHERE

string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))

Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:

WHERE string IN (...list_of_string_values_to_match...)

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Michael Valentine Jones SSC Guru Points: 64818 More actions · Answer 4

I have done this in order to avoid having to create an index on the string column.

Since this column is a unique identifier, you would probably still need a unique index or or unique constraint, so I doubt you would see much benefit.

sql-lover SSCoach Points: 18530 More actions · Answer 5

N_Muller (2/26/2015)
There is no problem letting them query the string directly. I'm mostly interested on the performance, though.

Well... what's the current performance. Can you post the execution plan? What's the T-SQL query. How many records do you have in that column, and how selective are they.

ScottPletcher SSC Guru Points: 100946 More actions · Answer 6

Also, let's look at SQL's missing index and index usage stats for the table:

USE [<your_db_name_here>] --change to desired db if not already there

SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

DECLARE @list_missing_indexes bit

DECLARE @table_name_pattern sysname

--NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.

SET @list_missing_indexes = 1

SET @table_name_pattern = '<your_table_name_goes_here>' --'%'=all tables.

--SET @table_name_pattern = '%'

PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

IF @list_missing_indexes = 1

BEGIN

SELECT

GETDATE() AS capture_date,

DB_NAME(mid.database_id) AS Db_Name,

dps.row_count,

OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,

mid.equality_columns, mid.inequality_columns, mid.included_columns,

user_seeks, user_scans, ca1.max_days_active, unique_compiles,

last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,

system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,

mid.statement, mid.object_id, mid.index_handle

FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)

CROSS APPLY (

SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

) AS ca1

LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON

mig.index_handle = mid.index_handle

LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON

migs.group_handle = mig.index_group_handle

LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

dps.object_id = mid.object_id AND

dps.index_id IN (0, 1)

--order by

--DB_NAME, Table_Name, equality_columns

WHERE

1 = 1

AND mid.database_id = DB_ID() --only current db

AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern

--AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))

ORDER BY

--avg_total_user_cost * (user_seeks + user_scans) DESC,

Db_Name, Table_Name, equality_columns, inequality_columns

END --IF

PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)

-- list index usage stats (seeks, scans, etc.)

SELECT

ius2.row_num, DB_NAME() AS db_name,

CASE WHEN i.name LIKE ca2.table_name + '%'

THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)

ELSE i.name END AS index_name,

CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +

CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],

ca2.table_name,

i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,

dps.row_count,

SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,

ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,

ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,

fk.Reference_Count AS fk_ref_count,

DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date

ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,

FILEGROUP_NAME(i.data_space_id) AS filegroup_name,

ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,

ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update

FROM sys.indexes i WITH (NOLOCK)

INNER JOIN sys.objects o WITH (NOLOCK) ON

o.object_id = i.object_id

CROSS JOIN (

SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'

) AS ca1

CROSS APPLY (

SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name

) AS ca2

OUTER APPLY (

SELECT

', ' + COL_NAME(object_id, ic.column_id)

FROM sys.index_columns ic

WHERE

ic.key_ordinal > 0 AND

ic.object_id = i.object_id AND

ic.index_id = i.index_id

ORDER BY

ic.key_ordinal

FOR XML PATH('')

) AS key_cols (key_cols)

OUTER APPLY (

SELECT

', ' + COL_NAME(object_id, ic.column_id)

FROM sys.index_columns ic

WHERE

ic.key_ordinal = 0 AND

ic.object_id = i.object_id AND

ic.index_id = i.index_id

ORDER BY

COL_NAME(object_id, ic.column_id)

FOR XML PATH('')

) AS nonkey_cols (nonkey_cols)

LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

dps.object_id = i.object_id AND

dps.index_id = i.index_id

LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON

ius.database_id = DB_ID() AND

ius.object_id = i.object_id AND

ius.index_id = i.index_id

LEFT OUTER JOIN (

SELECT

database_id, object_id, MAX(user_scans) AS user_scans,

ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans

FROM sys.dm_db_index_usage_stats WITH (NOLOCK)

WHERE

database_id = DB_ID()

--AND index_id > 0

GROUP BY

database_id, object_id

) AS ius2 ON

ius2.database_id = DB_ID() AND

ius2.object_id = i.object_id

LEFT OUTER JOIN (

SELECT

referenced_object_id, COUNT(*) AS Reference_Count

FROM sys.foreign_keys

WHERE

is_disabled = 0

GROUP BY

referenced_object_id

) AS fk ON

fk.referenced_object_id = i.object_id

WHERE

i.object_id > 100 AND

i.is_hypothetical = 0 AND

i.type IN (0, 1, 2) AND

o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND

(

o.name LIKE @table_name_pattern AND

o.name NOT LIKE 'dtprop%' AND

o.name NOT LIKE 'filestream[_]' AND

o.name NOT LIKE 'MSpeer%' AND

o.name NOT LIKE 'MSpub%' AND

--o.name NOT LIKE 'queue[_]%' AND

o.name NOT LIKE 'sys%'

)

--AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')

ORDER BY

--row_count DESC,

--ius.user_scans DESC,

--ius2.row_num, --user_scans&|user_seeks

db_name, table_name,

-- list clustered index first, if any, then other index(es)

CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,

key_cols

PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

SET DEADLOCK_PRIORITY NORMAL

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 7

ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.
For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)

Curious, but where do you get this:

Clustering on identity is often very damaging to overall performance.

I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 8

Relatively static data can be clustered on any column that fits your querying criteria. However if you deal with a high transaction level OLTP where new records associated with the primary read-filtering column are added at high rates, you want to make sure you don't degrade write performance. As of today, I've been doing this by making the clustered index an ever-increasing column, be that an int or bigint identity or GUID column. I'm open to suggestions that don't degrade write performance and improves read performance.

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 9

Michael Valentine Jones (2/26/2015)
I have done this in order to avoid having to create an index on the string column.
Since this column is a unique identifier, you would probably still need a unique index or or unique constraint, so I doubt you would see much benefit.

This is what I have done so far. Checksum doesn't guarantee uniqueness, but is quite fast. I added an index on the computed checksum column (thus the need for a persistent computed column). Depending on the scenario, I would add the string itself in the "include" portion of the index.

Grant Fritchey SSC Guru Points: 398679 More actions · Answer 10

My concern would be how the function is applied to arrive at the checksum. But I'm with everyone else, why not just let them hit the string. Also, I'd consider looking at that as the clustered index since it's unique. You're saying it's the most commonly used access path to the data. It might be the best choice for the cluster. Is the identity column used for searches, or just as a unique value?

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Jeff Moden SSC Guru Points: 1003861 More actions · Answer 11

N_Muller (2/26/2015)
I have tables in database where a VARCHAR(50) string is unique identifier. The database currently has an integer identity column as clustered primary key, and there's a non-clustered index on string column. The customer always queries based on a defined set of the identifier (string) column.
I wonder if someone sees an advantage of adding a persisted computed column to the table as the checksum of the string column, and then create a non-clustered index on the checksum and the string. When a customer requests data, we would compute the checksum of the customer provided identifier and add to the where clause or join, that the checksum and string must match.
Will SQL Server perform checksum check (integer) and only if it succeeds, perform the string check, in which case I see an advantage of added the checksum column? Or will SQL Server always check for both the checksum and string, in which case the additional column only adds unnecessary overhead? To note is the fact that the table(s) will have millions of rows, but the customer will request data for at at most, 100 or so identifiers.

Let's back the performance cart up a bit here. You say the data in the string column is a "unique identifier". What precisely does that mean? Is it a GUID? If so, then it should absolutely NOT be the clustered index. Is the string ever increasing in value? If not, this it should probably not be the clustered index here, either.

What kind of data does this string have in it?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ScottPletcher SSC Guru Points: 100946 More actions · Answer 12

Lynn Pettis (2/26/2015)
ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.
For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
Curious, but where do you get this:
Clustering on identity is often very damaging to overall performance.
I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.

Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 13

ScottPletcher (2/27/2015)
Lynn Pettis (2/26/2015)
ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.
For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
Curious, but where do you get this:
Clustering on identity is often very damaging to overall performance.
I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.
Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.

I am going to say "it depends." You also have to look at the environment you are running as well. One thing to look at is if you are also mirroring to an HR site over a slow WAN connection. If you have to rebuild a clustered index in that situation you could flood the wan connection.

ScottPletcher SSC Guru Points: 100946 More actions · Answer 14

Lynn Pettis (2/27/2015)
ScottPletcher (2/27/2015)
Lynn Pettis (2/26/2015)
ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.
For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
Curious, but where do you get this:
Clustering on identity is often very damaging to overall performance.
I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.
Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.
I am going to say "it depends." You also have to look at the environment you are running as well. One thing to look at is if you are also mirroring to an HR site over a slow WAN connection. If you have to rebuild a clustered index in that situation you could flood the wan connection.

But this is necessary only because it was improperly clustered to begin with. All the more reason to never allow a "default" clustering key of identity, and instead always base the clustering key on the that specific table's actual usage.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.