String as primary key

  • ScottPletcher (2/27/2015)


    Lynn Pettis (2/27/2015)


    ScottPletcher (2/27/2015)


    Lynn Pettis (2/26/2015)


    ScottPletcher (2/26/2015)


Either cluster directly on the string itself. Or, if you don't want to do that, encode the string into an int (or smallint if you have fewer than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.

    For example:

CREATE TABLE dbo.strings (
    string_id int IDENTITY(1, 1),
    string varchar(50)
);

    In the main table:

    string_id int

You might also store the string, denormalized, in the main table.

    For lookups:

SELECT ...
FROM main_table
WHERE string_id IN (SELECT string_id FROM strings WHERE string IN (<list of string values to look up>));

    Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:

    WHERE string IN (...list_of_string_values_to_match...)
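
If you go the encoded-int route, one possible end-to-end layout is sketched below. The clustering choices and the main_table / payload names are purely illustrative, not part of the schema discussed above: cluster the lookup table on the string so the encode step is a single seek, and cluster the main table on the encoded int.

-- Make the string itself the clustering key of the lookup table, so encoding a
-- string to its int is one clustered seek; keep string_id as a nonclustered PK.
ALTER TABLE dbo.strings
    ADD CONSTRAINT PK_strings PRIMARY KEY NONCLUSTERED (string_id),
        CONSTRAINT UQ_strings_string UNIQUE CLUSTERED (string);

-- Hypothetical main table, clustered on the encoded int rather than an identity.
CREATE TABLE dbo.main_table (
    string_id int NOT NULL,
    payload varchar(100) NULL  -- stand-in for the real columns
);

CREATE CLUSTERED INDEX CIX_main_table_string_id ON dbo.main_table (string_id);

With that in place, the IN (SELECT string_id ...) lookup above becomes clustered seeks on both tables.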

    Curious, but where do you get this:

    Clustering on identity is often very damaging to overall performance.

I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduced row-level locking, the issue of a hot spot during inserts was significantly reduced.

    Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.

I am going to say "it depends." You also have to look at the environment you are running in as well. One thing to look at is whether you are also mirroring to a DR site over a slow WAN connection. If you have to rebuild a clustered index in that situation, you could flood the WAN connection.

But this is necessary only because it was improperly clustered to begin with. All the more reason to never allow a "default" clustering key of identity, and instead always base the clustering key on that specific table's actual usage.

    I'm just going to agree to disagree.

  • As a general rule, the clustered key should be sequentially incremented (to avoid page splits and accumulating fragmentation) and it should be narrow, because it will be used as the rowid in non-clustered indexes. This varchar(50) doesn't fit the bill.

However, you said the column is unique, and it's used as a search argument, so you can place a unique non-clustered index on it. The idea about creating a checksum or hash on the column for use as an indexed search will only work if the user supplies the entire column value as an argument. How likely is that if the values are up to 50 characters? Typically for text columns of this length the users supply only the first few characters for a truncated search.

I think just keep the identity as the clustered key and place a unique non-clustered index on the varchar(50) column.

As for which column to designate as the primary key, in a situation like this where there are multiple unique keys, my rule is that whichever column would be leveraged as a foreign key in another table should be designated as the primary key. So in this case the surrogate identity key should be the primary key rather than the natural varchar(50) key.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (2/27/2015)


    As a general rule, the clustered key should be sequentially incremented (to avoid page splits and accumulating fragmentation) and it should be narrow, because it will be used as the rowid in non-clustered indexes. This varchar(50) doesn't fit the bill.

However, you said the column is unique, and it's used as a search argument, so you can place a unique non-clustered index on it. The idea about creating a checksum or hash on the column for use as an indexed search will only work if the user supplies the entire column value as an argument. How likely is that if the values are up to 50 characters? Typically for text columns of this length the users supply only the first few characters for a truncated search.

I think just keep the identity as the clustered key and place a unique non-clustered index on the varchar(50) column.

As for which column to designate as the primary key, in a situation like this where there are multiple unique keys, my rule is that whichever column would be leveraged as a foreign key in another table should be designated as the primary key. So in this case the surrogate identity key should be the primary key rather than the natural varchar(50) key.

    Yes, as a general rule. But that's why experts are experts, they know when to override a general rule for a specific case.

    In this case, you should get far better overall performance making the varchar(50) a clustering key or an encoded int representing the varchar(50) rather than an identity column. Sometimes you have to actually think instead of relying on rote rules.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

From what you've described, I'm envisioning the table as looking something like this, both in form and function, but it would help if we had the actual table definition.

create table Product
(
    ProductID int not null identity(1,1) primary key clustered,
    ProductDescription varchar(50) not null unique nonclustered
);

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • ScottPletcher (2/27/2015)


    Eric M Russell (2/27/2015)


    As a general rule, the clustered key should be sequentially incremented (to avoid page splits and accumulating fragmentation) and it should be narrow, because it will be used as the rowid in non-clustered indexes. This varchar(50) doesn't fit the bill.

However, you said the column is unique, and it's used as a search argument, so you can place a unique non-clustered index on it. The idea about creating a checksum or hash on the column for use as an indexed search will only work if the user supplies the entire column value as an argument. How likely is that if the values are up to 50 characters? Typically for text columns of this length the users supply only the first few characters for a truncated search.

I think just keep the identity as the clustered key and place a unique non-clustered index on the varchar(50) column.

As for which column to designate as the primary key, in a situation like this where there are multiple unique keys, my rule is that whichever column would be leveraged as a foreign key in another table should be designated as the primary key. So in this case the surrogate identity key should be the primary key rather than the natural varchar(50) key.

    Yes, as a general rule. But that's why experts are experts, they know when to override a general rule for a specific case.

    In this case, you should get far better overall performance making the varchar(50) a clustering key or an encoded int representing the varchar(50) rather than an identity column. Sometimes you have to actually think instead of relying on rote rules.

Experts always speak in general terms, unless they have the actual subject of discussion in front of them, which is why I asked for the table definition. Based on what has been discussed, it's unclear whether clustering on that varchar(50) column would provide any performance benefit. I doubt it seriously. It's generally, almost always, better to put a fat non-sequential key in a non-clustered index and, depending on the usage of the queries, perhaps also add a couple more included columns to make it a covering index.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • ScottPletcher (2/27/2015)


    Eric M Russell (2/27/2015)


    As a general rule, the clustered key should be sequentially incremented (to avoid page splits and accumulating fragmentation) and it should be narrow, because it will be used as the rowid in non-clustered indexes. This varchar(50) doesn't fit the bill.

However, you said the column is unique, and it's used as a search argument, so you can place a unique non-clustered index on it. The idea about creating a checksum or hash on the column for use as an indexed search will only work if the user supplies the entire column value as an argument. How likely is that if the values are up to 50 characters? Typically for text columns of this length the users supply only the first few characters for a truncated search.

I think just keep the identity as the clustered key and place a unique non-clustered index on the varchar(50) column.

As for which column to designate as the primary key, in a situation like this where there are multiple unique keys, my rule is that whichever column would be leveraged as a foreign key in another table should be designated as the primary key. So in this case the surrogate identity key should be the primary key rather than the natural varchar(50) key.

    Yes, as a general rule. But that's why experts are experts, they know when to override a general rule for a specific case.

    In this case, you should get far better overall performance making the varchar(50) a clustering key or an encoded int representing the varchar(50) rather than an identity column. Sometimes you have to actually think instead of relying on rote rules.

Your final statement is absolutely true, although a bit obnoxious.

One of the things the "experts" will also remember to consider is what page splits can do during inserts (they can actually cause totally unnecessary timeouts) and what not having an increasing value on a large, actively inserted table will do to the reserved size of the table, which will affect everything from simple index maintenance to backup duration and restore size/duration, not to mention what it could do to range-scan queries on nightly runs. As already mentioned, you also have to consider the possible extreme impact on all non-clustered indexes on the table.

Although I certainly agree that there are exceptions to the general rules that "experts" have, all recommendations against the "general rule" should be accompanied by a full explanation of such caveats, and I'm sure that I've not covered them all.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/27/2015)


    ScottPletcher (2/27/2015)


    Eric M Russell (2/27/2015)


    As a general rule, the clustered key should be sequentially incremented (to avoid page splits and accumulating fragmentation) and it should be narrow, because it will be used as the rowid in non-clustered indexes. This varchar(50) doesn't fit the bill.

However, you said the column is unique, and it's used as a search argument, so you can place a unique non-clustered index on it. The idea about creating a checksum or hash on the column for use as an indexed search will only work if the user supplies the entire column value as an argument. How likely is that if the values are up to 50 characters? Typically for text columns of this length the users supply only the first few characters for a truncated search.

I think just keep the identity as the clustered key and place a unique non-clustered index on the varchar(50) column.

As for which column to designate as the primary key, in a situation like this where there are multiple unique keys, my rule is that whichever column would be leveraged as a foreign key in another table should be designated as the primary key. So in this case the surrogate identity key should be the primary key rather than the natural varchar(50) key.

    Yes, as a general rule. But that's why experts are experts, they know when to override a general rule for a specific case.

    In this case, you should get far better overall performance making the varchar(50) a clustering key or an encoded int representing the varchar(50) rather than an identity column. Sometimes you have to actually think instead of relying on rote rules.

Your final statement is absolutely true, although a bit obnoxious.

One of the things the "experts" will also remember to consider is what page splits can do during inserts (they can actually cause totally unnecessary timeouts) and what not having an increasing value on a large, actively inserted table will do to the reserved size of the table, which will affect everything from simple index maintenance to backup duration and restore size/duration, not to mention what it could do to range-scan queries on nightly runs. As already mentioned, you also have to consider the possible extreme impact on all non-clustered indexes on the table.

Although I certainly agree that there are exceptions to the general rules that "experts" have, all recommendations against the "general rule" should be accompanied by a full explanation of such caveats, and I'm sure that I've not covered them all.

A lower FILLFACTOR might be required. Since users always provide the up-to-50-byte key, the idea is to avoid having to create nonclustered indexes. A nonclustered index creates more I/O, since you have to do a key lookup to the main table after reading the nonclus index. At a certain "tipping point", SQL simply scans the entire table instead.

    By creating a nonclus covering index(es), you've typically doubled the storage size needed. Moreover, if the string key would have fragmented the clustered index, it will also fragment the nonclustered index, which you're now doing all of your lookups on. Then you have a fragmented lookup and an extra I/O back to the main table.

    The average length of the key needs to be checked too. If, for example, it averages 8 bytes, I'd be very inclined to just make the string the clustering key. If it averages 40 bytes, I'd be very inclined to use the surrogate key. Under no circumstances with the specific conditions would I ever use an identity, as it's useless for lookup here.
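
Checking that average is a one-liner; here's a sketch against the dbo.strings example from earlier in the thread (substitute the real table and column names):

-- Average and maximum stored length of the candidate key, in bytes.
SELECT AVG(1.0 * DATALENGTH(string)) AS avg_key_bytes,
       MAX(DATALENGTH(string)) AS max_key_bytes
FROM dbo.strings;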

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • The devil is always in the details. Without knowing specifics we can only give ideas here. For example, what is wrong with clustering on something non-sequential if you have a weekend maintenance plan to reindex with a fillfactor? My preference is heaps because the last pages all contain highly accessed recent data. Clustering is most useful if it matches reporting order and most data is reported. What is his expected ratio of reads to writes? Make sure to optimize for what is actually needed by the application.

If we're talking billions of rows, then size is the overriding factor, and I would create a hash index with no included columns, fetch the few rows that match on the hash, and do the secondary string-match test on the rows returned instead of an included-columns check in the index. Minimizing data cache turnover would be paramount. And minimizing data size improves everything from the percentage in cache to backup times and disk cost.
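
A sketch of that hash approach against the hypothetical Product table from earlier in the thread (the computed column and index names are made up, and the string predicate is repeated in the WHERE clause because CHECKSUM can collide):

-- 4-byte hash of the wide key, indexed with no included columns.
ALTER TABLE Product
    ADD ProductDescriptionHash AS CHECKSUM(ProductDescription);

CREATE NONCLUSTERED INDEX IX_Product_ProductDescriptionHash
    ON Product (ProductDescriptionHash);

-- Seek on the hash to fetch the few candidate rows, then re-check the string
-- on the rows returned to discard hash collisions.
SELECT ProductID, ProductDescription
FROM Product
WHERE ProductDescriptionHash = CHECKSUM('some lookup value')
  AND ProductDescription = 'some lookup value';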

    As for the original poster's question, he almost has to try different implementations to score the results for his environment.

Yes... at the "tipping point", a table scan will occur instead. More correctly, on a clustered table, a clustered index scan will occur. On very large tables, having a FILL FACTOR of, say, 80% will automatically mean that all such table scans will take 20% longer because of the partially filled pages right after an index defrag, and the part of restores represented by that table will also take 20% longer. Keep in mind that the FILL FACTOR affects all parts of the table/index, including the totally static areas, not just the active areas.

And agreed on the covering NCIs. Although they can be very helpful for performance, it's a balancing act: all NCIs represent a duplication of data and require quite a bit of maintenance because they are almost always not keyed according to the order of insertion.

Again, I agree that there may be an "expert" situation that requires deviation from the general recommendations, but they should more frequently resemble chickens' teeth than fish scales.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ScottPletcher (2/27/2015)


A lower FILLFACTOR might be required. Since users always provide the up-to-50-byte key, the idea is to avoid having to create nonclustered indexes. A nonclustered index creates more I/O, since you have to do a key lookup to the main table after reading the nonclus index. At a certain "tipping point", SQL simply scans the entire table instead.

    By creating a nonclus covering index(es), you've typically doubled the storage size needed. Moreover, if the string key would have fragmented the clustered index, it will also fragment the nonclustered index, which you're now doing all of your lookups on. Then you have a fragmented lookup and an extra I/O back to the main table.

    The average length of the key needs to be checked too. If, for example, it averages 8 bytes, I'd be very inclined to just make the string the clustering key. If it averages 40 bytes, I'd be very inclined to use the surrogate key. Under no circumstances with the specific conditions would I ever use an identity, as it's useless for lookup here.

    Scott,

Even if the identity column is ultimately not deemed useful for clustering, the column is most probably still needed, because practically all tables need a compact surrogate key when the natural key is something like VARCHAR(50). You have to look beyond the narrow use case scenario presented where users query the table by description. The fact that the description column is unique means that this is probably a reference table, so there will probably be a need for a foreign key relationship with other tables. You don't want to use a VARCHAR(50) column as a foreign key. I'll go out on a limb and say that practically every table (by default) needs a simple short surrogate key if the only natural key is wide and complex. Even if the purpose isn't clear at first for reporting, it will come into play later on for somebody on the application side or ETL side.
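
For example, against the hypothetical Product table sketched earlier, a referencing table would carry the 4-byte surrogate rather than repeating the varchar(50) value (the OrderLine table and its columns are made up purely for illustration):

create table OrderLine
(
    OrderLineID int not null identity(1,1) primary key clustered,
    ProductID int not null references Product (ProductID), -- 4-byte FK to the surrogate key
    Quantity int not null
);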

    You've stated a concern that a non-clustered index on the VARCHAR(50) column (I wish we had a name for this thing) would effectively double the storage requirements of the table and would still require a bookmark lookup into the table. However, we just don't know that. For all we know, this table could have 30 more columns in addition to the varchar key. Also, like I said earlier, the addition of included columns on the non-clustered index could potentially make it a covering index. But of course we don't know the column definition of the table or user query, so we just don't know given the information we have.

    Without the DDL for the table and some sample queries, we just don't have enough information to perform a cost benefit analysis of what column should be the clustered key or to answer the other related questions.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

While I still don't see enough info from the OP for us to provide a better answer, I do agree with Eric on this:

    Eric M Russell (2/27/2015)


As for which column to designate as the primary key, in a situation like this where there are multiple unique keys, my rule is that whichever column would be leveraged as a foreign key in another table should be designated as the primary key. So in this case the surrogate identity key should be the primary key rather than the natural varchar(50) key.

    In most cases, this statement is true.

I was facing this situation myself at work. I drastically improved the performance of a table with 1 billion records by changing a very "fat" clustered index on a varchar column to a non-clustered index, with a surrogate ID becoming the new PK and clustered index.

In my case, the fragmentation due to inserts was killing me.

But again, the OP should expose the table's schema and current execution plan, and maybe the number of rows, in order to decide what could be a better index for his particular situation.

  • N_Muller (2/26/2015)


    ...

    ...

To note is the fact that the table(s) will have millions of rows, but the customer will request data for, at most, 100 or so identifiers.

    Regarding concerns about whether queries against a non-clustered covering index would reach a tipping point and fall back to a table scan, the comment above seems to indicate this will not be likely.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (2/27/2015)


    ScottPletcher (2/27/2015)


A lower FILLFACTOR might be required. Since users always provide the up-to-50-byte key, the idea is to avoid having to create nonclustered indexes. A nonclustered index creates more I/O, since you have to do a key lookup to the main table after reading the nonclus index. At a certain "tipping point", SQL simply scans the entire table instead.

    By creating a nonclus covering index(es), you've typically doubled the storage size needed. Moreover, if the string key would have fragmented the clustered index, it will also fragment the nonclustered index, which you're now doing all of your lookups on. Then you have a fragmented lookup and an extra I/O back to the main table.

    The average length of the key needs to be checked too. If, for example, it averages 8 bytes, I'd be very inclined to just make the string the clustering key. If it averages 40 bytes, I'd be very inclined to use the surrogate key. Under no circumstances with the specific conditions would I ever use an identity, as it's useless for lookup here.

    Scott,

Even if the identity column is ultimately not deemed useful for clustering, the column is most probably still needed, because practically all tables need a compact surrogate key when the natural key is something like VARCHAR(50). You have to look beyond the narrow use case scenario presented where users query the table by description. The fact that the description column is unique means that this is probably a reference table, so there will probably be a need for a foreign key relationship with other tables. You don't want to use a VARCHAR(50) column as a foreign key. I'll go out on a limb and say that practically every table (by default) needs a simple short surrogate key if the only natural key is wide and complex. Even if the purpose isn't clear at first for reporting, it will come into play later on for somebody on the application side or ETL side.

    You've stated a concern that a non-clustered index on the VARCHAR(50) column (I wish we had a name for this thing) would effectively double the storage requirements of the table and would still require a bookmark lookup into the table. However, we just don't know that. For all we know, this table could have 30 more columns in addition to the varchar key. Also, like I said earlier, the addition of included columns on the non-clustered index could potentially make it a covering index. But of course we don't know the column definition of the table or user query, so we just don't know given the information we have.

    Without the DDL for the table and some sample queries, we just don't have enough information to perform a cost benefit analysis of what column should be the clustered key or to answer the other related questions.

I'm dealing only with the proper way to cluster the table. I have no problem with an identity as a non-clustered PK, if you really feel you need it. Although I wouldn't a priori always assume FK(s) would be needed; there would need to be a reasonable business basis for deciding that.

As to size, of course just adding the nonclus index adds size. And the more columns you include, to make the non-clus index(es) covering, the more you're increasing the overall size. That's likely as much as, or more than, a somewhat lower fillfactor would cost.

    By using a nonclustered index for the always-looked-up key, you're always doubling I/O for every key-value lookup. That's a huge price to pay. Likely much more than an occasional page split happening on INSERTs.

    You're adding large amounts of disk and large amounts of I/O. That's a lot of overhead. A LOT. Be careful of doing that by rote.

A FILLFACTOR of 80% doesn't necessarily increase a table scan by 20%, as INSERTs are being done and they can fill up that 20%. But table scans should be rare anyway. And, as an offset, we've saved space and I/O by not needing a nonclus index on the actual key column. If, however, you need different fillfactors for different areas of the table, due to INSERT "bunching", you do have to partition the table.

As to rarity, I've found that the majority, 60-70%, of identity clusters had better-performing key column(s) available. I've gotten huge gains in performance just by getting the best clustered index on tables in place of the identity.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Eric M Russell (2/27/2015)


    N_Muller (2/26/2015)


    ...

    ...

To note is the fact that the table(s) will have millions of rows, but the customer will request data for, at most, 100 or so identifiers.

    Regarding concerns about whether queries against a non-clustered covering index would reach a tipping point and fall back to a table scan, the comment above seems to indicate this will not be likely.

    Not so fast. SQL looks at it by page counts, not row counts. And SQL will assume each row from a nonclus index will require a lookup read of a separate page in the main SQL table. If the rows aren't that long, you might be surprised at how few total rows can tip SQL into a table scan.
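
A rough way to see the numbers involved, sketched against the hypothetical Product table (sys.dm_db_partition_stats reports rows and pages per index):

-- Compare the rows a query is expected to look up against the page count, not the row count.
SELECT i.name AS index_name,
       ps.row_count,
       ps.in_row_data_page_count
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.Product');

When the estimated lookups approach a meaningful fraction of in_row_data_page_count, the optimizer starts to favor a scan over nonclustered seeks plus key lookups.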

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

A non-clustered index won't necessarily result in double I/O on reads. Given a well designed non-clustered index with all required columns included, it can entirely cover the user query without requiring lookups into the table. The page reads are significantly fewer. The index could even be a small fraction of the size of the table, depending on how many columns other than the varchar key are in the table. An NCI on the wide natural key is very commonly put to good use in this way.
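
As a sketch against the hypothetical Product table, assuming the real table also carries columns such as ListPrice and CategoryID that the user query selects (both column names are made up here):

create unique nonclustered index UX_Product_ProductDescription
    on Product (ProductDescription)
    include (ListPrice, CategoryID); -- whatever columns the real query actually returns

-- Covered entirely by the index above; no key lookup into the base table.
select ProductDescription, ListPrice, CategoryID
from Product
where ProductDescription in ('Widget, blue, 50mm', 'Widget, red, 50mm');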

Also, if this is more of an OLAP table than an OLTP table, then I'm totally not concerned about doubling up on total disk storage or doubling up on insert I/O. I've had OLAP tables where I've tripled up on storage by adding indexes. Yes, I've confirmed they're all put to good use. This is fine for medium-sized tables that are infrequently written (like nightly batch loads) and very frequently queried during normal business hours. At least for me, a table with a few million rows is a medium-sized table, if not on the small end of the scale.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
