Is hyper-threading still relevant for SQL Server?

  • Comments posted to this topic are about the item Is hyper-threading still relevant for SQL Server?

  • The main problem with HT is that the physical core shares registers and processor cache with the "hyper" (virtual) core. This often causes a full context switch with a flush when the physical core and the virtual core are both working.

    SQL Server is one of the most efficient applications on the Windows operating system, and it is hurt badly by full context switches on the core, as happens when HT is enabled.

    This problem gets worse when the SQL Server is increasing its CPU usage.

    /Niels Grove-Rasmussen

  • My experience broadly echoes Glenn's recommendation: off for OLAP, on for OLTP. But, as you mention in the article, no two workloads are ever the same.

    These sorts of scenarios are perfect candidates for comparative testing using RML tools. Any critical system upgrade (hardware, firmware, service pack etc) should be subjected to a test run using a baseline workload, captured from a live system and replayed using OStress. While you’re at it, do a test run with HT enabled and disabled and see what effect it has on your particular system.
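
As a rough illustration of the HT-on versus HT-off comparison suggested above, here is a minimal Python sketch that compares the median replay time of two sets of test runs. The function name `relative_change` and the durations are hypothetical placeholders, not real measurements or part of the RML tools; in practice the numbers would come from your own replayed baseline workload.

```python
from statistics import median


def relative_change(baseline_secs, candidate_secs):
    """Return the fractional change in median duration from baseline to candidate.

    A negative result means the candidate configuration finished faster.
    """
    base = median(baseline_secs)
    cand = median(candidate_secs)
    return (cand - base) / base


# Placeholder replay durations in seconds, for illustration only.
ht_off_runs = [100.0, 102.0, 98.0]
ht_on_runs = [90.0, 91.0, 89.0]

change = relative_change(ht_off_runs, ht_on_runs)
print(f"HT on vs. HT off: {change:+.1%} change in median replay time")
```

Using the median of several replays rather than a single run reduces the influence of one unusually slow or fast run on the comparison.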

  • Small correction:

    AMD doesn't implement Hyper-Threading, it implements Two Strong Threads.

    Intel combats [the expense of one thread per core] with Hyper-Threading, which allows each physical core to work on two threads. Over-provisioning is assumed, meaning you rely on under-utilization to extract additional performance from each core. This is a relatively inexpensive technology. But it’s also quite limited in the benefits it offers. Some workloads don’t see any speed-up from Hyper-Threading. Others barely crack double-digit performance gains.

    AMD is trying to define a third approach to threading it calls Two Strong Threads. Whereas Hyper-Threading only duplicates architectural states, the Bulldozer design shares the front-end (fetch/decode) and back-end of the core (through a shared L2 cache), but duplicates integer schedulers and execution pipelines, offering dedicated hardware to each of two threads.

  • I work with a business system that is mainly OLTP, but a lot of long-running ad hoc queries go on, so it's more like a combination with OLAP. Unfortunately I never get the technician to turn off HT, but I measure SQL waits, and in a rather large production environment I tried MAXDOP 0 and MAXDOP set to the number of actual physical cores. The latter gives the fewest waits. I even tried setting MAXDOP to values between 0 and the number of physical cores, but the number of physical cores is still the best. Well, that's my experience.

  • "and so two threads can be executed simultaneously"

    And that's the mistake most people make when it comes to hyperthreading. The two threads on the logical cores don't run simultaneously, but alternately. Each logical core is basically only a set of registers.

    The plus in hyperthreading is twofold. First, the OS needs to do less context switching because more CPU threads are available, and context switching is expensive. Second, when one thread has to wait for cache, I/O or other things, the other thread can utilize the physical core. The maximum gain I've seen is 120%-130% CPU usage on the two cores.

    Hyperthreading is OK in a low utilization environment with many threads competing for CPU. In a high CPU environment (such as OLAP) it can even start to work against you as funny scheduling artefacts start to occur. With heavy workloads, I've seen HT machines perform worse than the same machine with HT turned off, even on Intel's newest CPUs.

    Another thing to watch is that the CPU high water mark is not 100%, but 60%-65% with HT, and you don't know exactly where it is. With CPU target normally 70% or below, this becomes 45% or below with HT turned on.

    HT is good stuff on certain workloads and bad news on others. Be sure to benchmark HT on and off for your particular workload.
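
The high-water-mark arithmetic in the post above can be sketched as follows. This is a simplified model that assumes the usual CPU target simply scales by the utilisation at which an HT system is effectively saturated; the function name `effective_cpu_target` is hypothetical, and the 60%-65% and 70% figures are the ones quoted in the post.

```python
def effective_cpu_target(target_utilisation, ht_saturation_point):
    """Scale a non-HT CPU target by the reported utilisation at which an
    HT-enabled system is assumed to be saturated."""
    return target_utilisation * ht_saturation_point


# 70% is the usual target quoted above; 60%-65% is the quoted HT high-water mark.
for saturation in (0.60, 0.65):
    scaled = effective_cpu_target(0.70, saturation)
    print(f"saturation at {saturation:.0%}: keep reported CPU below about {scaled:.1%}")
```

Under these assumptions the scaled target falls between roughly 42% and 45.5%, which matches the "45% or below" figure given in the post.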

  • Many thanks for the feedback so far, and especially to h.berg for correcting my misunderstanding with regard to thread execution.

    It's interesting that you mention "...less context switching because more CPU threads are available..." because the increase in the number of threads actually seems quite small. E.g. a quad-core processor with and without HT will have 288 and 256 worker threads respectively (on 32-bit) -- so actually significantly fewer threads per scheduler. Jonathan Kehayias has written quite extensively on this topic, and it's the resulting "thread contention" that, as you observe, can cause you trouble if you have HT enabled and a lot of long-running queries in your workload.

    Cheers,

    Tony.
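
The 288 and 256 figures above follow from SQL Server's documented default for max worker threads. Here is a minimal Python sketch of that calculation, assuming the default setting (max worker threads = 0), at most 64 logical CPUs, and one scheduler per logical CPU; the function names are hypothetical.

```python
def default_max_worker_threads(logical_cpus, is_64_bit=False):
    """Default max worker threads for up to 64 logical CPUs.

    32-bit: 256 for up to 4 CPUs, then 8 more per additional CPU.
    64-bit: 512 for up to 4 CPUs, then 16 more per additional CPU.
    """
    base, step = (512, 16) if is_64_bit else (256, 8)
    if logical_cpus <= 4:
        return base
    return base + (logical_cpus - 4) * step


def threads_per_scheduler(logical_cpus, is_64_bit=False):
    """Worker threads available per scheduler, assuming one scheduler per logical CPU."""
    return default_max_worker_threads(logical_cpus, is_64_bit) / logical_cpus


# Quad-core on 32-bit: 4 logical CPUs without HT, 8 with HT.
for cpus in (4, 8):
    total = default_max_worker_threads(cpus)
    print(f"{cpus} logical CPUs: {total} workers, {threads_per_scheduler(cpus):.0f} per scheduler")
```

With HT enabled the total rises only from 256 to 288, so the workers available per scheduler drop from 64 to 36.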

  • What I mentioned were CPU threads, not OS threads. Scheduling a different OS thread on a CPU thread requires a context switch for the given logical core (aka CPU thread). A context switch is an expensive operation: the old OS thread state has to be saved and the new OS thread state has to be restored.

    The more CPU threads are available, the less often the OS has to switch a different OS thread on a CPU thread and thus the fewer context switches there are.

    On a system with many OS threads that have a low CPU demand per OS thread, like a web server or an OLTP database with many short queries, this is a large part of the reason why an HT CPU outperforms a non-HT CPU with the same number of physical cores.

  • I apologize for coming in late, but I've got a nice new server sitting around with modern Nehalem CPUs and scores of GB of RAM, and a limited time window to benchmark hyperthreading against non-hyperthreading, and I'd like to ask if anyone has a highly (up to 64+ thread) parallel test script available.

    If there are some good samples available which can exploit dozens of cores/threads, I'll be happy to run them both with and (hopefully) without hyperthreading, to get some actual numbers on the difference.

  • I don't have a script for you, but I can give you a really, really big recommendation to enable HT.

  • h.berg-884044 (1/24/2011)


    and so two threads can be executed simultaneously

    And that's the mistake most people make when it comes to hyperthreading. The two threads on the logical cores don't run simultaneously, but alternately.

    The Nehalem micro-architecture front end can decode up to four instructions per clock cycle. This decoding is the only thing that alternates between the two hardware threads.

    The resulting stream of micro-ops (uops) passes to the execution engine, which can have as many as 128 uops in flight at any one time. The execution engine contains a number of features including out-of-order execution, speculative execution, branch prediction and so on, all of which combine to try to keep as many of the issue ports and available execution units as busy as possible. Nehalem has six issue ports and can process a maximum of 6 uops per clock. The pipelined nature of the execution units means that an operation like an integer multiply can have a latency of 3 clocks (so each operation takes 3 clocks start to finish), but a throughput of 1 clock (so a new integer multiply can be completed every clock).

    The normal situation is that the execution units (and pipelines) in Nehalem are under-utilized for a variety of reasons, including instruction mix, cache latency and branch mispredictions. Adding a second hardware thread enhances execution unit and pipeline utilization by providing a greater number and mix of uops.

    The execution units are completely unaware of HT - they just operate on the stream of uops. Execution is therefore parallel, with uops from both hardware threads progressing through the pipelines at each clock.

    Each logical core is basically only a set of registers.

    HT hardware threads are not just a set of registers. Core resources may be replicated (one per hardware thread e.g. register state, return stack buffer, large page TLB), partitioned (statically allocated between threads e.g. small page TLB, reorder buffer), competitively shared (shared between threads according to demand e.g. caches, reservation station), or unaware (HT has no impact e.g. execution units).

    Maximum gain I've seen is 120%-130% CPU usage on the two cores.

    Intel quote 1.25x (25%) as a general guide, but SQL Server can often do much better because it is written to scale particularly well with additional processing resources. I regularly see 1.6x or more on large parallel queries - both on my 4-core i7 laptop and our 64-core (128 hardware thread) production servers running an ETL/OLAP workload.

    Paul
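
The latency-versus-throughput point in the post above can be illustrated with a deliberately simplified Python model. It assumes a single, fully pipelined execution unit and ignores everything else in the core, so it is a sketch of the idea rather than a cycle-accurate description of Nehalem; the function names are hypothetical, and the 3-clock latency and 1-clock throughput are the integer multiply figures quoted in the post.

```python
def cycles_for_independent_ops(n_ops, latency=3, issue_interval=1):
    """Clocks to complete n independent operations on one pipelined unit.

    The first result appears after `latency` clocks; each later result
    follows `issue_interval` clocks after the previous one.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


def cycles_for_dependent_chain(n_ops, latency=3):
    """Clocks when each operation must wait for the previous result,
    so the pipeline cannot overlap them."""
    return n_ops * latency


n = 100
print("independent multiplies:", cycles_for_independent_ops(n), "clocks")
print("dependent chain:       ", cycles_for_dependent_chain(n), "clocks")
```

In this model a dependent chain leaves the unit idle for two of every three clocks. A second hardware thread supplies additional uops that do not depend on the first thread's results, which is one way the idle slots can be filled.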

  • AndrewJacksonZA (5/11/2011)


    I don't have a script for you, but I can give you a really, really big recommendation to enable HT.

    +1. Absolutely enable HT on Nehalem architecture or later.
