• h.berg-884044 (1/24/2011)


    and so two threads can be executed simultaneously

    And that's the mistake most people make when it comes to hyperthreading. The two threads on the logical cores don't run simultaneously, but alternately.

    The Nehalem micro-architecture front end can decode up to four instructions per clock cycle. This decoding is the only thing that alternates between the two hardware threads.
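    The alternating decode can be sketched as a toy model (illustrative only; the thread names, 4-wide decode width and strict round-robin policy are simplifying assumptions, not a cycle-accurate description of the front end):

```python
# Toy sketch of front-end decode alternating between two hardware threads.
# Each "cycle" the front end decodes up to 4 instructions from one thread,
# then switches to the other thread for the next cycle.
from collections import deque

def decode(thread_a, thread_b, width=4):
    """Interleave two instruction streams, width instructions per turn."""
    queues = {"A": deque(thread_a), "B": deque(thread_b)}
    stream = []
    turn = "A"
    while queues["A"] or queues["B"]:
        q = queues[turn]
        for _ in range(min(width, len(q))):
            stream.append((turn, q.popleft()))
        turn = "B" if turn == "A" else "A"  # alternate threads each cycle
    return stream

uops = decode(range(6), range(6))
# One cycle decodes A0-A3, the next decodes B0-B3, then A4-A5, then B4-B5.
```

    The output is a single interleaved uop stream, which is exactly what the execution engine sees: it has no notion of which cycle's decode produced which uop.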

    The resulting stream of micro-ops (uops) passes to the execution engine, which can have as many as 128 uops in flight at any one time. The execution engine contains a number of features including out-of-order execution, speculative execution, branch prediction and so on, all of which combine to keep as many of the issue ports and available execution units as busy as possible. Nehalem has six issue ports and can process a maximum of six uops per clock. The pipelined nature of the execution units means that an operation like an integer multiply can have a latency of 3 clocks (each operation takes 3 clocks start to finish) but a throughput of 1 clock (a new integer multiply can be completed every clock).
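    The latency-versus-throughput arithmetic for a fully pipelined unit works out as follows (a minimal sketch; the function name is mine, and the 3-clock latency / 1-clock throughput figures are the ones quoted above):

```python
# Cycles to finish n independent ops on one pipelined execution unit.
# Latency: cycles for one op start to finish. Throughput: cycles between
# successive completions once the pipeline is full.
def cycles_for(n_ops, latency=3, throughput=1):
    if n_ops == 0:
        return 0
    # The first op completes after `latency` cycles; each further op
    # completes `throughput` cycles after the previous one.
    return latency + (n_ops - 1) * throughput

print(cycles_for(1))                              # 3  - one multiply
print(cycles_for(10))                             # 12 - 10 independent multiplies
print(cycles_for(10, latency=3, throughput=3))    # 30 - same unit, unpipelined
```

    So ten independent multiplies finish in 12 clocks rather than 30 - but only if there are enough independent uops to keep the pipeline fed, which is where the second hardware thread comes in.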

    The normal situation is that the execution units (and pipelines) in Nehalem are under-utilized for a variety of reasons, including instruction mix, cache latency and branch mispredictions. Adding a second hardware thread enhances execution unit and pipeline utilization by providing a greater number and mix of uops.

    The execution units are completely unaware of HT - they just operate on the stream of uops. Execution is therefore parallel, with uops from both hardware threads progressing through the pipelines at each clock.
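    A deliberately crude model of that effect (the per-thread ready-uop count of 3 is an invented illustrative number, not a Nehalem measurement; only the six issue ports come from the text above):

```python
# Toy model: dependencies limit how many uops one thread has ready per
# cycle, so a single thread leaves issue ports idle. A second thread's
# independent uops can fill those ports in the same cycle.
PORTS = 6  # Nehalem issue ports

def utilization(ready_per_thread, n_threads):
    """Fraction of issue ports filled in one cycle."""
    issued = min(PORTS, ready_per_thread * n_threads)
    return issued / PORTS

print(utilization(3, 1))  # 0.5 - one thread, half the ports idle
print(utilization(3, 2))  # 1.0 - second thread fills the remaining ports
```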

    Each logical core is basically only a set of registers.

    HT hardware threads are not just a set of registers. Core resources fall into four classes:

    - Replicated: one copy per hardware thread (e.g. register state, return stack buffer, large-page TLB)
    - Partitioned: statically allocated between threads (e.g. small-page TLB, re-order buffer)
    - Competitively shared: allocated between threads according to demand (e.g. caches, reservation station)
    - Unaware: HT has no impact (e.g. execution units)

    Maximum gain I've seen is 120%-130% CPU usage on the two cores.

    Intel quote 1.25x (25%) as a general guide, but SQL Server can often do much better because it is written to scale particularly well with additional processing resources. I regularly see 1.6x or more on large parallel queries - both on my 4-core i7 laptop and our 64-core (128 hardware thread) production servers running an ETL/OLAP workload.

    Paul