Instruction-level Multithreading to improve processor utilization
No one wants the hardware in their computer sitting idle – we all want to get as much useful work out of our hardware as possible. Mechanisms such as cache and branch prediction have been incorporated into processors to minimize the amount of processor idle time caused by memory accesses and changes in program flow; however, these mechanism are not perfect.
There are still times that the processor could be idle waiting for data or computational results to become available – these delays are relatively short, generally less than a few hundred clock cycles, typically around ten. The operating system software context switch to another runnable task takes on the order of hundreds of cycles. Thus, the overheads are too large for the operating system to switch to another runnable tasks to hide these short times of idleness.
One approach to get better utilization is to have the physical processor support multiple logical processors. If one logical processor has to wait for some result, the physical processor can switch to processing instructions from other logical processors to keep the hardware busy doing useful work and get better utilization of the hardware.
The hardware cost of supporting multiple logical processors is relatively modest. The Pentium 4 support of two logical processors on one physical core was estimated to be 5% of the die space providing up to a 30% improvement in performance.
The idea of multiplexing the processor hardware between multiple logical processors has been around for quite some time. The 1960’s CDC 6600 peripheral processor implemented ten logical processors sharing the same processor hardware to hide the latency of memory accesses. Thus, the CDC 6600 peripheral processor effectively got ten times the processing performance with a little added hardware to store the state information for each of the logical processors.
The main advantage of multithreading is to fill the previously idle delays due to operation latency with useful work. A program that just has a single thread of execution is not going to get any performance improvement from hardware that supports multithreading, and the individual thread are not any faster due to the multithreading. However, the aggregate throughput of the multiple threads is higher because the hardware is better utilized.
Improved performance is not a given with instruction level multithreading, since resources are shared between the logical processors. If one logical processor uses 100% of a resource, running additional threads on other logical processors will not improve utilization past 100%. One example of such a shared resource is the memory interface to RAM that is shared between the logical processors. If one thread can use all the memory interface bandwidth, adding more threads running on the other logical processors will simply split that bandwidth between them, not providing any improvement in aggregate throughput.
In some cases splitting the resources between multiple logical processors can hurt performance. Having additional logical processors may cause excessive cache misses because the working sets of the multiple logical processors exceed the size of the cache. Thus, rather than getting a performance improvement with logical processors the physical processor might spend a greater portion of its time waiting for cache misses to be satisfied.
Whether multithreading helps is very application and hardware implementation specific. You should benchmark performance both with and without multithreading enabled to determine whether multithreading helps. To improve the probability that your application benefits from multithreading:
- Expose the concurrency in the code so there are multiple threads that can be assigned to separate logical processors.
- Ensure that the data working sets fit into the reduced cache space due to the cache being shared between multiple threads. For example if the processor has four physical cores, each handling two logical processors, and a last-level cache of 8MB, there could be eight threads each with approximately 1MB of cache space. If the code assumed that each processor had 2MB of cache space, this could lead to excessive cache evictions and memory accesses.
- Be careful with spin-lock operations on multithreaded processors. Naive implementations of spin-lock operation can cause the spin-wait loop to slow down other threads sharing the physical processor because the spin-wait is a very tight loop.
Investigating multithreading further
The Intel optimization manual has a entire chapter about optimizing code for multicore and multithreading for their processors which includes suggestions on how to address some of the issues: