Primary keys for an OLTP database

  • Jeff Moden - Friday, December 28, 2018 8:59 AM

    So, I wondered what the results would be on the 48 core, 384GB RAM, SSD fire breather at work... here are those results from the first run.  Subsequent runs did pan out roughly the same way...

    

    My i5-4310U system (4 logical cores @ 2.00GHz, 512GB SSD, Microsoft SQL Server 2016 (SP1-GDR) (KB4458842) - 13.0.4224.16 (X64)) is optimized to the extreme for running the SQL Server service. Quite amusing to see it outperform the 48 core, 384GB RAM, SSD fire breather at your work; maybe I should drop around for some tuning!
    😎


    T_TXT              DURATION
    ------------------ -----------
    IDENTITY 3             1900222
    IDENTITY 1             1900764
    IDENTITY 2             1921514
    NEWSEQUENTIALID 2      2401202
    NEWSEQUENTIALID 1      2404004
    NEWSEQUENTIALID 3      2472017
    NEWID 2                3437846
    NEWID 3                3561873
    NEWID 1                3589680
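
    To make the gaps in the table above easier to read, here is a minimal sketch that averages the three runs per key type and expresses each average relative to IDENTITY. The numbers are copied from the table; the unit of DURATION is not stated in the post, so only the ratios are meaningful. The variable names are illustrative.

```python
from statistics import mean

# DURATION values copied from the results table (three runs per key type).
durations = {
    "IDENTITY": [1900222, 1900764, 1921514],
    "NEWSEQUENTIALID": [2401202, 2404004, 2472017],
    "NEWID": [3437846, 3561873, 3589680],
}

# Average per key type, then the ratio against the IDENTITY average.
averages = {name: mean(runs) for name, runs in durations.items()}
baseline = averages["IDENTITY"]
relative = {name: avg / baseline for name, avg in averages.items()}

for name in sorted(relative, key=relative.get):
    print(f"{name:<16} avg={averages[name]:>12.0f}  x{relative[name]:.2f} vs IDENTITY")
```

    This only summarizes a single-session test; it says nothing about behavior under concurrent load.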



  • xsevensinzx - Saturday, December 29, 2018 7:11 AM

    Jeff Moden - Friday, December 28, 2018 9:24 AM

    xsevensinzx - Friday, December 28, 2018 8:08 AM

    Jeff Moden - Friday, December 28, 2018 7:54 AM

    xsevensinzx - Friday, December 28, 2018 7:47 AM

    I've said it in other threads: I did not like the performance of a clustered (non-unique) GUID on billions of records in a single fact table within an SMP system. I had to replace the keys with sequential values. But in the MPP columnstore world, I absolutely love GUIDs because the randomness is great for spreading the data pretty evenly across N databases, without any single database holding more or fewer records than the others.

    Now there's an interesting aspect.  Thanks.

    Just for further clarification.

    You have to hash the data in each table using a key. This is like adding a clustered index to define how the data is stored on disk. Since there are 60 databases with N disks per database, you want to ensure the key you select distributes the data evenly across those databases. For example, if you have 500 million records with random keys and 1 billion records with 0 as their key, then 1 of those 60 databases will have the 1 billion zero-key records stuck in it, while the other 59 databases share the remaining random keys as evenly as they can, which is likely about 8 million per database.

    If you write a query to read some of those 0 keys, you would have just 1 computer working for you, whereas a query against the other keys would have 59 databases and their computers working for you. Thus, using a GUID in place of these keys, including the 0 keys, can help distribute the data evenly across all computers/databases/etc.: hopefully about 25 million per database in this example (1.5 billion records across 60 databases).

    Using the sequential here may be bad because it ticks the ranges forward in a linear fashion causing the distribution to always shift forward with the data as it comes into the system. The same is true for date/time.
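
    The skew described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not how any particular MPP product hashes rows: SHA-256 modulo 60 stands in for the engine's internal hash function, the row counts are scaled down by 10,000, and the names `distribution_for` and `rows_per_distribution` are made up for illustration.

```python
import hashlib
import uuid
from collections import Counter

DISTRIBUTIONS = 60  # number of databases in the example above


def distribution_for(key: str) -> int:
    """Map a key to a distribution with a stable hash (stand-in for the engine's hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def rows_per_distribution(keys) -> Counter:
    """Count how many rows land on each distribution."""
    return Counter(distribution_for(k) for k in keys)


# Scaled-down example: 1 simulated row = 10,000 real rows.
random_keys = [str(uuid.uuid4()) for _ in range(50_000)]  # stands in for 500 million
zero_keys = ["0"] * 100_000                               # stands in for 1 billion

# Case 1: the original keys, where two thirds of the rows share the key 0.
skewed = rows_per_distribution(random_keys + zero_keys)

# Case 2: every row gets a GUID as its distribution key.
guid_keyed = rows_per_distribution(str(uuid.uuid4()) for _ in range(150_000))

print("skewed: max/min rows per distribution:", max(skewed.values()), min(skewed.values()))
print("GUID:   max/min rows per distribution:", max(guid_keyed.values()), min(guid_keyed.values()))
```

    In the skewed case one distribution holds at least the 100,000 zero-key rows, while the GUID case stays close to 2,500 rows per distribution (the scaled-down equivalent of 25 million).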

    Good stuff right there and it's confirmation of what I told folks when they wanted to invest in an MPP appliance at one of the companies that I do work for.  The hype (I know you already know this and have confirmed that knowledge with what you wrote above) was that MPP would make things run 30X faster.  Everyone but me was ready to loosen up their purse strings even though I didn't have much knowledge of MPP appliances but knew enough to know better.  And so I challenged the salesman on the spot... "Tell them what modifications to the underlying data structure and code are necessary to achieve that 30X improvement.  Then explain why that expenditure, plus the cost of development and testing, not to mention the cost of the appliance itself, is better than spending the time to tweak the code to make it run 60 to 1000 times faster" and cited several major examples where we had done so.

    Understand, that they don't have the billions of rows in tables that you do.  They only recently (in the last year) had a database go over the 2TB mark and I just got rid of half of that data for them (with their concurrence, of course) because it was never accessed and was duplicated in other places.

    Thanks again for the awesome information.  I'll probably never have the opportunity to work on such a system as what you do.  To be honest, I'm not sure I'd want to but it would be interesting.

    Yes, you will need to alter the model to make it work in the MPP world. The prime example is that the hash distribution can only use a single key in most MPP systems; you cannot, for example, hash on more than one key. This forces you to keep copies of the data hashed on different keys for different computational processes. Dimension tables with fewer than 60 unique keys or under 2 GB in size need to be treated differently from larger dimension tables in how they are distributed across those 60 databases. For example, instead of hashing on a key, you have to choose replicate, which does not distribute the data across the 60 databases but copies the entire dataset to each database. When you modify that table, you will need to constantly rebuild it with SELECT TOP 1 * FROM DimTable. Dimension tables are commonly modified when new dimensional values come into the system, so you have to factor that in.
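
    The rule of thumb above (replicate small dimension tables, hash everything else) can be written down as a tiny helper. This is only a sketch of the heuristic as stated in the post; the function name, the table names, and the sample sizes are all made up, and a real design would also weigh join patterns and update frequency.

```python
DISTRIBUTIONS = 60    # databases in the example above
SMALL_TABLE_GB = 2.0  # size threshold quoted in the post


def choose_distribution(unique_keys: int, size_gb: float) -> str:
    """Return 'REPLICATE' for small dimension tables, otherwise 'HASH'.

    Hypothetical helper: it only encodes the rule of thumb from the post.
    """
    if unique_keys < DISTRIBUTIONS or size_gb < SMALL_TABLE_GB:
        return "REPLICATE"
    return "HASH"


# Illustrative, made-up tables: (name, unique keys, size in GB).
tables = [
    ("DimCountry", 45, 0.01),
    ("DimCustomer", 5_000_000, 1.2),
    ("FactSales", 2_000_000_000, 900.0),
]
for name, keys, size in tables:
    print(name, "->", choose_distribution(keys, size))
```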

    I won't even go into the design decisions needed for concurrency slots, as most MPP systems do not allow every process to run at the same time. Queuing systems with concurrency values are needed, meaning that not only do you have to schedule your ETL jobs, you also have to put them into priority order and decide how much concurrency they can use when running in batches.

    Anyways, I moved to MPP not just for the big data stuff, but also because a good portion of my data is alphanumeric, like GUIDs. This is surely a good example of where GUIDs perform best and can be used with ease.

    Understood but what a fascinating sidebar you've provided!  Thanks for taking the time to explain a bit about MPP, especially how it all relates to the use of GUIDs.  I've added this discussion to my briefcase.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson - Saturday, December 29, 2018 8:08 AM

    My i5-4310U system (4 logical cores @ 2.00GHz, 512GB SSD, Microsoft SQL Server 2016 (SP1-GDR) (KB4458842) - 13.0.4224.16 (X64)) is optimized to the extreme for running the SQL Server service. Quite amusing to see it outperform the 48 core, 384GB RAM, SSD fire breather at your work; maybe I should drop around for some tuning!

    Ya know, it's funny that you've mentioned that.  I've noticed for quite a while that my laptop (an older Vaio with a config similar to yours but with no SSDs and only 4GB dedicated to SQL Server) frequently outstrips the fire-breather at work in certain areas.  In other cases, the one at work blows my laptop away.  I've been trying to figure out why but, so far, haven't found the cause.  If we weren't using SSDs at work, I'd be tempted to blame it on the ol' sector offset problem, except that some memory-only stuff is also slower.

    I'd love to have someone else take a look sometime.  I can't do such a thing remotely, though.  Totally "against the rules" I have to follow.  I might be able to convince folks to make an exception, though, especially if it's a "watched" session.

    --Jeff Moden



  • Jeff Moden - Sunday, December 30, 2018 9:45 AM

    Eirikur Eiriksson - Saturday, December 29, 2018 8:08 AM

    My i5-4310U system (4 logical cores @ 2.00GHz, 512GB SSD, Microsoft SQL Server 2016 (SP1-GDR) (KB4458842) - 13.0.4224.16 (X64)) is optimized to the extreme for running the SQL Server service. Quite amusing to see it outperform the 48 core, 384GB RAM, SSD fire breather at your work; maybe I should drop around for some tuning!

    Ya know, it's funny that you've mentioned that.  I've noticed for quite a while that my laptop (an older Vaio with a config similar to yours but with no SSDs and only 4GB dedicated to SQL Server) frequently outstrips the fire-breather at work in certain areas.  In other cases, the one at work blows my laptop away.  I've been trying to figure out why but, so far, haven't found the cause.  If we weren't using SSDs at work, I'd be tempted to blame it on the ol' sector offset problem, except that some memory-only stuff is also slower.

    I'd love to have someone else take a look sometime.  I can't do such a thing remotely, though.  Totally "against the rules" I have to follow.  I might be able to convince folks to make an exception, though, especially if it's a "watched" session.

    Send me a PM or an email with the full specs of the fire breather; I'm very interested in knowing what causes the difference.
    😎 

    I'm prepping my own fire breather: 64 cores, 512 GB memory, eight storage controllers, and 8 TB of latest-generation SSDs, which exceed 100,000 IOPS each, four per channel. I'm slightly hesitant, as it costs me more to run it than to heat the house.

  • Eirikur Eiriksson - Sunday, December 30, 2018 10:53 AM

    Jeff Moden - Sunday, December 30, 2018 9:45 AM

    Eirikur Eiriksson - Saturday, December 29, 2018 8:08 AM

    My i5-4310U system (4 logical cores @ 2.00GHz, 512GB SSD, Microsoft SQL Server 2016 (SP1-GDR) (KB4458842) - 13.0.4224.16 (X64)) is optimized to the extreme for running the SQL Server service. Quite amusing to see it outperform the 48 core, 384GB RAM, SSD fire breather at your work; maybe I should drop around for some tuning!

    Ya know, it's funny that you've mentioned that.  I've noticed for quite a while that my laptop (an older Vaio with a config similar to yours but with no SSDs and only 4GB dedicated to SQL Server) frequently outstrips the fire-breather at work in certain areas.  In other cases, the one at work blows my laptop away.  I've been trying to figure out why but, so far, haven't found the cause.  If we weren't using SSDs at work, I'd be tempted to blame it on the ol' sector offset problem, except that some memory-only stuff is also slower.

    I'd love to have someone else take a look sometime.  I can't do such a thing remotely, though.  Totally "against the rules" I have to follow.  I might be able to convince folks to make an exception, though, especially if it's a "watched" session.

    Send me a PM or an email with the full specs of the fire breather; I'm very interested in knowing what causes the difference.
    😎 

    I'm prepping my own fire breather: 64 cores, 512 GB memory, eight storage controllers, and 8 TB of latest-generation SSDs, which exceed 100,000 IOPS each, four per channel. I'm slightly hesitant, as it costs me more to run it than to heat the house.

    Heh... so stop heating the house. 😀  The system will do that for you.  Buy a 20" (~51 cm) box fan to push some air around.

    Will do on the specs.  Anything in particular that you'd like me to include so that I don't miss it?

    --Jeff Moden



  • Although this topic came to my attention quite late, it is more relevant than ever. I see lots of people favor a sequential INT/BIGINT as the PK but - as Jeff has mentioned - it does not scale. I would say it sucks 🙂

    The topic of the article is OLTP! OLTP systems in general do not work with big data sets the way DWH systems do, so index fragmentation is not a big issue here!

    The demo from Eirikur (where IDENTITY wins!) has nothing to do with the heavy-load scenarios of real-world OLTP systems. If just one process is hitting a table, you won't run into "last page contention".

    My tests, which I show whenever I do a session about DML optimization, show that a random value (whether a GUID or something else) always wins.
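
    The "last page contention" argument can be illustrated with a toy model. This is a sketch, not a benchmark: it treats the leaf level of an index as sorted keys packed into fixed-size pages and simply counts how many simultaneous inserts target the same page. The page capacity, row counts, and session count are arbitrary assumptions, and page splits, latches, and the real storage engine are not modeled.

```python
import bisect
import random
from collections import Counter

ROWS_PER_PAGE = 100      # hypothetical leaf-page capacity
EXISTING_ROWS = 100_000  # rows already in the table
CONCURRENT_INSERTS = 64  # sessions inserting at the same instant
KEY_SPACE = 2**63


def page_boundaries(sorted_keys):
    """First key of every leaf page when the sorted keys are packed page by page."""
    return sorted_keys[::ROWS_PER_PAGE]


def hottest_page_hits(existing_keys, new_keys):
    """Largest number of simultaneous inserts that target the same leaf page."""
    boundaries = page_boundaries(sorted(existing_keys))
    # bisect_right - 1 is the index of the page whose key range contains the new key.
    targets = Counter(max(bisect.bisect_right(boundaries, k) - 1, 0) for k in new_keys)
    return max(targets.values())


rng = random.Random(42)

# Ascending key (IDENTITY-like): every new key is larger than all existing keys.
ascending_existing = list(range(1, EXISTING_ROWS + 1))
ascending_new = list(range(EXISTING_ROWS + 1, EXISTING_ROWS + 1 + CONCURRENT_INSERTS))

# Random key (GUID-like): new keys fall anywhere in the key space.
random_existing = [rng.randrange(KEY_SPACE) for _ in range(EXISTING_ROWS)]
random_new = [rng.randrange(KEY_SPACE) for _ in range(CONCURRENT_INSERTS)]

ascending_hits = hottest_page_hits(ascending_existing, ascending_new)
random_hits = hottest_page_hits(random_existing, random_new)
print("ascending key: inserts on hottest page =", ascending_hits)
print("random key:    inserts on hottest page =", random_hits)
```

    With an ascending key all 64 sessions land on the last page, while random keys spread across the roughly 1,000 pages, so only a handful of sessions ever share one.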

    A big table with billions of rows can still be a big problem because of the B-tree traversal needed to reach the page where the new record has to be stored.

    In this case I favor HEAPS with a NONCLUSTERED PK over a clustered index (C.I.). In most of my tests and workload optimizations at customer sites, heaps won the race when combined with an intelligent underlying file layout (e.g. I usually implement databases with a heavy DML workload on multiple files in a filegroup!)

    My personal advice here is:

    • You can use an ascending key in a business environment with <= 1000 transactions / min
    • If your system needs to scale, you should think about redesigning the PK data type and value AND about using a C.I. vs. a HEAP

    Microsoft Certified Master: SQL Server 2008
    MVP - Data Platform (2013 - ...)
    my blog: http://www.sqlmaster.de (german only!)

  • What I haven't seen in this conversation is any control for uniqueness. Certainly one can argue that a monotonically increasing record (not row) counter is unique. It does, however, allow otherwise entirely duplicate rows to be entered but they are, of course, unique because they have a new incremental value. How will you control uniqueness assuming that you find it incorrect to have a line item billed to a customer 100 times when they only ordered 1...but the database has not controlled for truly unique content?

    I can already hear the shouts that such business rules belong in the application. Then you better make sure your application covers all the things support personnel and/or DBAs might need to accomplish because SSMS will not be constrained by the beautiful micro service business rule implementation behind your mobile application. And believe me, someone WILL need to go into your database to fix things.

    ------------
    Buy the ticket, take the ride. -- Hunter S. Thompson

  • Bryant McClellan wrote:

    What I haven't seen in this conversation is any control for uniqueness.

    ... How will you control uniqueness assuming that you find it incorrect to have a line item billed to a customer 100 times when they only ordered 1...but the database has not controlled for truly unique content?

    ...

    Generally speaking, the primary key is the key intended for reference in foreign key relationships with other tables. But there can still be another unique index on the natural key. For example, we wouldn't expect the same point of sale (POS) terminal to process two transactions at the exact same moment. In the real world, it could potentially happen (maybe) if two POS terminals were set up incorrectly with the same PosID. But here we have a sequentially incremented primary key and then an index enforcing uniqueness on three columns that represent logical place + time.

    PRIMARY KEY:   PosTransactionId (INT)

    UNIQUE INDEX:   Store (VARCHAR(5)), PosID (SMALLINT), TransactionTime (DATETIME)
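
    A minimal, runnable sketch of the layout above: a surrogate primary key plus a unique index on the natural key. It uses Python's built-in sqlite3 as a stand-in so it can run anywhere; on SQL Server the surrogate would typically be an IDENTITY column and the DDL would differ slightly. The table name, index name, and sample row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate key for foreign key references, natural key columns alongside it.
conn.execute(
    """
    CREATE TABLE PosTransaction (
        PosTransactionId INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        Store            VARCHAR(5) NOT NULL,
        PosID            SMALLINT   NOT NULL,
        TransactionTime  DATETIME   NOT NULL
    )
    """
)

# The unique index is what actually stops duplicate business rows.
conn.execute(
    "CREATE UNIQUE INDEX UX_PosTransaction_Natural "
    "ON PosTransaction (Store, PosID, TransactionTime)"
)

insert = "INSERT INTO PosTransaction (Store, PosID, TransactionTime) VALUES (?, ?, ?)"
row = ("S0001", 7, "2018-12-29 08:08:00")  # made-up sample transaction
conn.execute(insert, row)

# Inserting the same place + time again gets a new surrogate value,
# but the unique index on the natural key rejects it.
try:
    conn.execute(insert, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM PosTransaction").fetchone()[0]
print("duplicate rejected:", duplicate_rejected, "| rows stored:", row_count)
```

    Because the constraint lives in the database, it also applies to ad hoc fixes made outside the application.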

     

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Another point to consider; if the primary key is also a foreign key in other tables, there could be a significant impact in using one that takes up 16 bytes instead of 4 (or 8).
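
    A rough back-of-the-envelope sketch of that impact. The byte widths are the fixed sizes of the SQL Server types (INT 4, BIGINT 8, UNIQUEIDENTIFIER 16); the row counts are made up, and the calculation deliberately ignores row overhead, page fill, and any indexes that also carry the key.

```python
# Fixed storage size of each key type in bytes.
KEY_BYTES = {"INT": 4, "BIGINT": 8, "UNIQUEIDENTIFIER": 16}


def key_storage_bytes(key_type: str, parent_rows: int, child_rows: int) -> int:
    """Bytes spent on the key value alone: once per parent row (PK), once per child row (FK)."""
    return KEY_BYTES[key_type] * (parent_rows + child_rows)


# Hypothetical volumes: 10 million parent rows referenced by 100 million child rows.
PARENT_ROWS = 10_000_000
CHILD_ROWS = 100_000_000

for key_type in KEY_BYTES:
    total = key_storage_bytes(key_type, PARENT_ROWS, CHILD_ROWS)
    print(f"{key_type:<17} {total / 1_000_000:>8.0f} MB of key values")
```

    The 16-byte key costs four times the INT and twice the BIGINT in every table that repeats it.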

    On the subject of natural vs artificial keys; I tend to use an INT or BIGINT surrogate in addition to any available natural key. The surrogate would be the one repeated in other tables as a foreign key, but the customer would quote the more memorable natural key. Which of them gets used as the primary key in the main table would depend on the types of query anticipated.

  • And fairly straightforward to accomplish. All I was after is that it is a common failing that usually manifests itself at the worst possible time. I was also reminded of this from some of Jeff Moden's earlier comments regarding the definition of natural keys.

    ------------
    Buy the ticket, take the ride. -- Hunter S. Thompson
