Part IV·Microarchitecture·Chapter 30 of 62


Thread-Level Parallelism

May 16, 2026·21 min read·advanced

When ILP gains plateaued in the mid-2000s and the only ways to keep using more transistors per chip were either to widen SIMD or to replicate cores, the industry chose to do both — and primarily the latter. Thread-level parallelism (TLP) is what multi-core hardware exploits: distinct threads of execution, each with its own program counter and stack, running concurrently on different cores or sharing a single core via fine-grained interleaving.

TLP is a different beast from ILP and DLP. It does not come from a single program's internal structure; it comes from the program being explicitly written or transformed to use multiple threads. The hardware does not automatically extract TLP the way OoO extracts ILP: the program must be parallel by design, and the operating system and runtime must distribute its threads across cores.

This chapter is about the hardware that runs threads: multi-core processors, simultaneous multithreading (SMT), the interconnects that link cores together, and the system-level structures (NUMA, system controllers) that scale to many cores. The next chapter (Chapter 31) examines the correctness problem: how cache coherence and memory consistency keep multi-threaded programs' views of memory consistent.

01.Multi-Core Basics

A multi-core processor is a chip with several CPU cores, each capable of running an independent thread. Each core has its own front end (fetch, decode, rename), back end (issue queue, execution units, ROB), private L1 caches, and usually a private L2. The cores share a last-level cache (L3 or higher) and the memory controllers.

The simplest multi-core picture:

Figure: Four-core multi-core chip: each core holds private L1 and L2 caches, all four share an L3, and a memory controller below the LLC drives DRAM

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum height=0.9cm, align=center}]
  % Cores at top
  \node[blk, minimum width=1.7cm] (c0) at (1, -1) {Core 0\\L1\\L2};
  \node[blk, minimum width=1.7cm] (c1) at (3.2, -1) {Core 1\\L1\\L2};
  \node[blk, minimum width=1.7cm] (c2) at (5.4, -1) {Core 2\\L1\\L2};
  \node[blk, minimum width=1.7cm] (c3) at (7.6, -1) {Core 3\\L1\\L2};
  % Shared L3
  \node[blk, minimum width=8cm] (l3) at (4.3, -3) {Shared L3 / LLC};
  % Memory controller
  \node[blk, minimum width=8cm] (mc) at (4.3, -4.5) {Memory Controller};
  % DRAM
  \node[blk, minimum width=2cm] (dram) at (4.3, -6) {DRAM};
  \draw[->] (c0.south) -- (1, -2.7);
  \draw[->] (c1.south) -- (3.2, -2.7);
  \draw[->] (c2.south) -- (5.4, -2.7);
  \draw[->] (c3.south) -- (7.6, -2.7);
  \draw[->] (l3) -- (mc);
  \draw[->] (mc) -- (dram);
\end{tikzpicture}

Each core executes its own thread, fetched from its own program counter, with its own register state. The cores share memory through the cache hierarchy: a load on core 0 can observe data that core 1 wrote, once the coherence protocol has propagated the write through the hierarchy.

The cores are connected by an on-chip interconnect: a network of wires and switches that carries cache-line transfers, snoop requests, and other coherence traffic between cores. Section 10 covers interconnects in more detail.

The number of cores has grown rapidly. Early multi-core chips (mid-2000s) had 2 cores; mid-2010s servers had 16-32; current high-end servers have 128 or more. Mobile and laptop chips often have a heterogeneous mix of high-performance and high-efficiency cores.

02.Why TLP

The case for TLP is fundamentally about transistor budgets and power.

A single core with $N$ transistors achieves some performance $P_1(N)$. By Pollack's rule, $P_1(N) \approx \sqrt{N}$ in suitably normalized units: quadrupling the transistors only about doubles the performance, so the marginal performance per transistor falls as the core grows.

Two cores with $N/2$ transistors each achieve aggregate $P_2 \approx 2 \cdot P_1(N/2) \approx 2 \sqrt{N/2} = \sqrt{2N}$. That is $\sqrt{2} \approx 1.41$ times the performance of the single fat core for any $N$, provided the workload can keep both cores busy in parallel.
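As a quick numerical check of this argument, the sketch below evaluates Pollack's rule for one large core against several smaller cores built from the same transistor budget. The function names and the budget are illustrative, the units are arbitrary, and the model assumes a perfectly parallel workload.

```python
import math

def core_perf(transistors: float) -> float:
    """Pollack's rule: single-core performance scales as sqrt(transistors)."""
    return math.sqrt(transistors)

def aggregate_perf(transistors: float, cores: int) -> float:
    """Total throughput of `cores` equal cores sharing one transistor budget.

    Assumes the workload keeps every core busy (no serial fraction).
    """
    return cores * core_perf(transistors / cores)

if __name__ == "__main__":
    budget = 1_000_000  # arbitrary transistor budget
    for k in (1, 2, 4, 8):
        ratio = aggregate_perf(budget, k) / core_perf(budget)
        print(f"{k} core(s): {ratio:.2f}x the single large core")
```

Under these assumptions the ratio is $\sqrt{k}$ for $k$ cores, so the advantage of splitting grows with the core count only as long as the software can use every core.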

Power follows a similar argument. Doubling a single core's frequency requires more than doubling its power (because of voltage scaling); running two cores at the original frequency uses about double the power. Two cores at original frequency are more power-efficient than one core at double frequency.
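The power argument can be made concrete with the standard dynamic-power relation $P \propto C V^2 f$. The sketch below assumes, as an idealization, that supply voltage must scale linearly with frequency; real voltage-frequency curves differ, and static (leakage) power is ignored.

```python
def dynamic_power(freq_scale: float, cores: int = 1) -> float:
    """Relative dynamic power, with P ~ C * V^2 * f per core.

    Idealized assumption: V scales linearly with f, so per-core power
    grows as freq_scale ** 3. Leakage is ignored.
    """
    voltage_scale = freq_scale  # assumption, not a measured V/f curve
    return cores * (voltage_scale ** 2) * freq_scale

one_fast_core = dynamic_power(freq_scale=2.0, cores=1)   # 8.0
two_base_cores = dynamic_power(freq_scale=1.0, cores=2)  # 2.0
```

Under this model, one core at double frequency costs about four times the power of two cores at the original frequency, for the same nominal throughput on parallel work.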

The combination — more aggregate performance, better power efficiency — is what drove the multi-core shift around 2005. Since then, single-thread performance has grown at perhaps 5-15% per year, while core counts have grown 10-30% per year on the high end.

The catch: TLP is software-visible. A serial program does not benefit from more cores. A program that uses one thread runs on one core, leaving the others idle. Realizing TLP's benefit requires writing parallel software, which is harder than writing serial software.

03.Programming Models for TLP

There are several models for writing parallel programs.

Threads with shared memory. The classic POSIX threads (pthread_create, mutexes, condition variables), the equivalent on Windows, and modern higher-level abstractions (C++ std::thread, Java threads, .NET tasks). The threads share a single address space; the program coordinates through synchronization primitives.
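A minimal sketch of the shared-memory model, using Python's `threading` module: several threads update one counter in a shared address space and coordinate through a mutex. The function name and parameters are illustrative. In standard CPython builds the global interpreter lock prevents these threads from running bytecode in parallel, so this shows the programming model rather than a speedup.

```python
import threading

def run_counter(num_threads: int = 4, increments: int = 10_000) -> int:
    """Each thread adds to one shared counter, guarded by a mutex."""
    total = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal total
        for _ in range(increments):
            with lock:          # critical section: read-modify-write
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait for all workers to finish
    return total
```

Without the lock, the read-modify-write on `total` is not guaranteed to be atomic, and updates could be lost.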

Message passing. Threads or processes communicate by sending messages rather than sharing memory. MPI (Message Passing Interface) is the standard for scientific computing across nodes. Erlang and Go popularized lightweight actor- and channel-based message passing within a process.
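For contrast, here is a sketch of the message-passing style inside one process, with `queue.Queue` standing in for a channel. The worker owns its accumulator and shares nothing with the sender except messages. All names are illustrative; MPI or Go channels follow the same pattern with different APIs.

```python
import queue
import threading

def sum_of_squares(values):
    """Producer sends work over a channel; the worker replies with a result."""
    inbox = queue.Queue()       # messages to the worker
    outbox = queue.Queue()      # reply from the worker
    done = object()             # sentinel marking end of stream

    def worker():
        acc = 0                 # private state, never shared directly
        while True:
            msg = inbox.get()
            if msg is done:
                break
            acc += msg * msg
        outbox.put(acc)

    t = threading.Thread(target=worker)
    t.start()
    for v in values:
        inbox.put(v)
    inbox.put(done)
    result = outbox.get()       # blocks until the worker replies
    t.join()
    return result
```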

Task-based parallelism. Express the work as a graph of tasks; a runtime schedules tasks onto threads. Examples: Cilk, Intel TBB, OpenMP tasks, .NET TPL, Java's ForkJoinPool.
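A small sketch of the task-based style, using `concurrent.futures` from the Python standard library as the runtime. The program describes independent tasks and a join step; the pool decides which thread runs which task. The chunk size and worker count are arbitrary choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, chunk_size=1000, workers=4):
    """Split a reduction into independent tasks; the pool schedules them."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(sum, chunk) for chunk in chunks]  # leaf tasks
        return sum(f.result() for f in futures)                  # join step
```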

Data-parallel models. Express the parallelism as operations on collections; the runtime handles distribution. Examples: OpenMP parallel for, .NET PLINQ, Rayon for Rust, Java parallel streams.

SIMT / GPU. The model used for GPUs: launch thousands of threads, each on one data element, with the hardware grouping them into lockstep warps. CUDA, OpenCL, Metal, ROCm.

The hardware does not care which model the software uses; from the hardware's perspective, threads are just threads. But different models impose different patterns of synchronization and memory access, and different patterns interact differently with the cache hierarchy and coherence protocols.

04.Amdahl's Law

Amdahl's law states the fundamental limit on parallel speedup. If a program has a serial fraction $s$ that cannot be parallelized and a parallel fraction $1-s$, then on $N$ processors the maximum speedup is:

$\text{Speedup}(N) = \frac{1}{s + (1-s)/N}.$

As $N \to \infty$:

$\text{Speedup}(\infty) = \frac{1}{s}.$

Even with infinite processors, the speedup is bounded by the inverse of the serial fraction. A program with $s = 0.05$ (5% serial work) caps at 20× speedup, no matter how many cores. A program with $s = 0.20$ caps at 5×.

The implications:

Even small serial fractions limit parallel scaling severely.
Reducing $s$ — making more of the program parallel — has bigger payoff than adding more cores past a point.
Programs with truly minimal serial fractions can scale to thousands of cores; most programs cannot.
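The formula is easy to evaluate directly. The following sketch computes the speedup and its limit; the function names are illustrative.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup(N) = 1 / (s + (1 - s) / N)."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / processors)

def amdahl_limit(serial_fraction: float) -> float:
    """Speedup as N grows without bound: 1 / s."""
    return 1.0 / serial_fraction

if __name__ == "__main__":
    for n in (4, 16, 64, 1024):
        print(n, round(amdahl_speedup(0.05, n), 2))
```

With $s = 0.05$, 64 processors give a speedup of about 15.4 and 1024 processors about 19.6, already close to the limit of 20.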

A more optimistic counterpart, Gustafson's law, says the relevant question is not "how fast can I make this fixed problem with more cores" but "how big a problem can I solve in fixed time with more cores." If the parallel work scales with the problem size and the serial work stays constant, then arbitrarily large speedups are achievable on arbitrarily large problems. This is closer to the experience of HPC and large-scale data processing.

In practice, both perspectives matter. Strong scaling (fixed problem, more cores) is bounded by Amdahl. Weak scaling (problem grows with cores) is more permissive. Real applications are evaluated in both terms.
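Gustafson's scaled speedup is usually written $S(N) = s + (1 - s)N$, where $s$ is the serial share of the run time measured on the $N$-processor machine. The sketch below evaluates it; the function name is illustrative.

```python
def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Scaled speedup S(N) = s + (1 - s) * N.

    Here s is the serial share of run time on the N-processor machine,
    and the parallel part of the problem is assumed to grow with N.
    """
    s = serial_fraction
    return s + (1.0 - s) * processors
```

For $s = 0.05$ and $N = 64$ this gives 60.85, compared with about 15.4 under Amdahl's fixed-problem assumption. The two numbers answer different questions rather than contradicting each other.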

05.Heterogeneous Cores: Big and Little

Modern mobile and laptop chips often have heterogeneous core configurations: a small number of high-performance ("big") cores and a larger number of high-efficiency ("little") cores on the same chip.

ARM popularized this with big.LITTLE in 2011: a Cortex-A15 (out-of-order, high IPC, hot) paired with a Cortex-A7 (in-order, lower IPC, very efficient). The OS can run intensive threads on the big cores and background threads on the little cores, optimizing for performance and battery life.

Intel introduced its hybrid architecture (Performance cores + Efficient cores) starting with Alder Lake (2021). Apple's M-series has used similar hybrid configurations since the M1; the M1 Ultra, for example, has 16 performance cores and 4 efficiency cores.

The cores share an ISA but have very different micro-architectures: the big cores are wide OoO superscalar; the little cores are narrower in-order or modest OoO. Programs are unaware (mostly) — the OS scheduler decides which thread runs where, often guided by hints from the hardware (thread directors) about each thread's behavior.

The architectural challenge: a thread migrated from a big core to a little core must continue working correctly. Both cores implement the same ISA; the program does not change. Performance counters and timing differ between core types, which can confuse software that profiles and tunes per-core, but correctness is preserved.

06.Simultaneous Multithreading (SMT)

A single core's execution resources are often underutilized. The OoO machinery can issue, say, 6 µops per cycle, but the average IPC of a thread is 2-3 — roughly half of peak. The unused issue slots are wasted.

Simultaneous multithreading (SMT) lets a single core run multiple threads concurrently, sharing the same execution resources. Each thread has its own architectural state (PC, registers, etc.) but shares the core's execution units, caches, and so on. The hardware fetches and issues from both threads each cycle, blending their µops into the issue queue and letting them all compete for resources.

The benefit: when one thread stalls (cache miss, branch misprediction, long-latency op), the other thread keeps running on the same core, using the resources the first thread cannot. The aggregate throughput of two SMT-sharing threads on one core is typically 1.2-1.4× a single thread on the same core — not 2×, but a substantial improvement at low hardware cost.
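To see why the gain depends on how the threads' stalls line up, consider the toy issue-slot model below. It is a deliberately crude sketch: each thread is described only by how many µops it could issue in each cycle, the per-cycle demand patterns are hypothetical, and contention for caches, queues, and predictors is ignored.

```python
def simulate_ipc(demands, issue_width=6):
    """Average µops issued per cycle for threads sharing one core.

    `demands` is a list of per-thread sequences; demands[t][c] is how many
    µops thread t could issue in cycle c (0 models a stalled cycle).
    Toy model: slots are granted in thread order until the width is used up.
    """
    cycles = len(demands[0])
    issued = 0
    for c in range(cycles):
        free = issue_width
        for thread in demands:
            grant = min(thread[c], free)
            issued += grant
            free -= grant
    return issued / cycles

busy_then_stalled = [4, 4, 4, 0, 0, 0]   # hypothetical thread behaviour
stalled_then_busy = [0, 0, 0, 4, 4, 4]

alone = simulate_ipc([busy_then_stalled])                             # 2.0
overlapping = simulate_ipc([busy_then_stalled, busy_then_stalled])    # 3.0
complementary = simulate_ipc([busy_then_stalled, stalled_then_busy])  # 4.0
```

In this model the second thread adds between 1.5× and 2× throughput depending on stall overlap. Measured gains are lower because real threads also compete for cache capacity, queue entries, and memory bandwidth, which the model leaves out.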

SMT in real implementations:

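Intel Hyper-Threading. 2-thread SMT, introduced with the Pentium 4, absent from the Core and Core 2 generations, and reintroduced with Nehalem. It is also absent from later Atom cores and the E-cores derived from them. Each SMT-enabled core appears to the OS as 2 logical processors.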
IBM POWER. 4-thread or 8-thread SMT in some POWER cores. The execution resources are sized to feed many threads simultaneously.
AMD. 2-thread SMT in Zen architectures.
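ARM. Most Arm-designed cores have no SMT; exceptions include the Neoverse E1 and Cortex-A65. Some third-party server cores, such as Marvell's ThunderX2, have implemented it.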

The hardware cost of SMT is modest: per-thread architectural state (registers, PC, status), a small extension of the rename machinery, slight extension of the front end. The execution resources are the same. SMT is one of the highest performance-per-transistor wins in modern processors.

The drawbacks:

Per-thread performance can drop. If a single thread was already fully utilizing the core, adding a second thread takes resources away from it. SMT improves aggregate throughput but may slow individual threads.
Cache contention. Two threads share the L1 caches; their working sets compete. A thread whose working set fits in cache when running alone may see only about half the effective capacity when sharing, and start missing.
Security risks. Threads on the same core share micro-architectural state (branch predictors, caches, ports), creating side channels. A malicious thread can sometimes infer secrets from a sibling thread's behavior. Cloud providers often disable SMT on multi-tenant hardware for this reason.

SMT is useful for throughput-oriented workloads (servers handling many requests) and for workloads with frequent stalls (memory-bound code). It is less useful for latency-oriented workloads where each thread wants the whole core. The OS scheduler can be aware of SMT topology and schedule accordingly.

07.Synchronization Primitives in Hardware

For threads to coordinate, the hardware provides primitive operations that operate atomically across all the cores' caches. We have seen these in Chapter 12 in the context of single-thread atomics; for TLP they take on broader importance.

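Atomic read-modify-write. A single instruction reads a memory location, modifies it, and writes the result back, all atomically with respect to other cores. Examples are x86's LOCK ADD, LOCK XCHG, LOCK CMPXCHG and friends. RISC architectures traditionally build the same effect from a load-reserved/store-conditional pair: LR/SC on RISC-V, LDXR/STXR on AArch64. Both also offer single-instruction atomics (the AMO instructions of RISC-V's A extension, and AArch64's LSE instructions such as LDADD and CAS).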

Compare-and-swap (CAS). A specific atomic that reads a location, compares it to an expected value, and writes a new value only if they match. CAS is sufficient to implement most concurrent algorithms.
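The CAS retry loop is the pattern most concurrent algorithms build on. Python does not expose a hardware CAS, so the sketch below emulates one: the `AtomicCell` class is a hypothetical stand-in whose internal lock plays the role of the atomicity that the hardware gives a single instruction. The point of the example is the shape of the loop in `atomic_increment`, not its performance.

```python
import threading

class AtomicCell:
    """Software stand-in for a memory word that supports CAS.

    A lock emulates the atomicity that hardware provides for one
    instruction (for example LOCK CMPXCHG on x86).
    """
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True

def atomic_increment(cell):
    """Classic CAS retry loop: reread and retry if another thread won."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old + 1

def cas_counter_demo(num_threads=4, increments=1000):
    cell = AtomicCell(0)

    def worker():
        for _ in range(increments):
            atomic_increment(cell)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()
```

A failed CAS means another thread changed the value between the load and the swap; the loop simply retries with the fresh value.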

Memory barriers. Instructions that prevent the hardware from reordering memory operations across them. The exact semantics depend on the memory model (see Chapter 31).

The cache-coherence machinery (Chapter 31) makes these primitives work efficiently: an atomic operation typically requires the cache line to be exclusively owned by the executing core, and the protocol arranges that. Heavily contested atomic operations (a million threads CAS-ing the same location) cause the line to ping-pong between caches, causing severe slowdowns. This is a fundamental scaling limit; high-contention shared state defeats parallelism.

08.Lock-Free and Wait-Free Programming

Building on atomic primitives, software can implement lock-free data structures: structures where multiple threads can operate simultaneously without using mutual-exclusion locks. The classic example is a lock-free queue using CAS to update head and tail pointers.

Lock-free means that at least one thread is guaranteed to make progress at any time, even if others are interrupted or stalled. Wait-free is stronger: every thread completes each of its operations in a bounded number of its own steps, regardless of what other threads do.

Lock-free structures are appealing in highly concurrent code: they avoid the priority-inversion problems of locks and can scale better. But they are notoriously hard to design correctly. Subtle bugs (the ABA problem, memory reclamation issues, ordering races) are common. Most real systems use locks for the bulk of synchronization, with lock-free tricks reserved for specific hot paths.

The hardware enables both approaches; the software's choice depends on requirements. Either way, the underlying atomic primitives and memory-model guarantees are what the hardware provides.

09.Hardware Transactional Memory

For a brief period in the 2010s, several major CPU vendors added transactional memory to mainstream ISAs, hoping to give software a higher-level concurrency primitive than the atomic-and-barrier toolkit. The idea is that a thread can mark a region of code as a transaction: the hardware speculatively executes it, holds the writes in a private buffer, monitors the cache lines it has read, and at the end of the transaction either commits the writes atomically (if no conflict was detected) or aborts and rolls back to the start (if another thread touched a conflicting line). The programmer writes ordinary serial-looking code; the hardware provides the atomicity.
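The commit-or-abort logic can be illustrated with a small software model. This is a sketch of the idea only, not of any vendor's implementation: it tracks per-address version numbers instead of cache lines, validates at commit time rather than detecting conflicts as they happen, and is meant to be driven from a single thread. The class and attribute names are hypothetical.

```python
class Memory:
    """Shared memory with a per-address version counter (toy model)."""
    def __init__(self):
        self.values = {}     # addr -> committed value
        self.versions = {}   # addr -> number of commits that wrote addr

class Transaction:
    """Buffers writes privately and monitors every address it touches."""
    def __init__(self, memory):
        self.memory = memory
        self.monitored = {}     # addr -> version seen at first access
        self.write_buffer = {}  # addr -> speculative value

    def _monitor(self, addr):
        self.monitored.setdefault(addr, self.memory.versions.get(addr, 0))

    def load(self, addr):
        self._monitor(addr)
        if addr in self.write_buffer:        # see our own buffered write
            return self.write_buffer[addr]
        return self.memory.values.get(addr, 0)

    def store(self, addr, value):
        self._monitor(addr)
        self.write_buffer[addr] = value

    def commit(self):
        """Publish the writes if nothing monitored has changed; else abort."""
        for addr, seen in self.monitored.items():
            if self.memory.versions.get(addr, 0) != seen:
                self.write_buffer.clear()    # abort: discard speculative state
                return False
        for addr, value in self.write_buffer.items():
            self.memory.values[addr] = value
            self.memory.versions[addr] = self.memory.versions.get(addr, 0) + 1
        return True

mem = Memory()
t1, t2 = Transaction(mem), Transaction(mem)
t1.store("x", t1.load("x") + 1)
t2.store("x", t2.load("x") + 1)
first = t1.commit()    # succeeds
second = t2.commit()   # aborts: "x" changed after t2 read it
```

A caller whose transaction aborts would typically retry it or fall back to a conventional lock.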

Intel TSX (Transactional Synchronization Extensions), introduced with Haswell in 2013, exposed two interfaces: Hardware Lock Elision (HLE), in which XACQUIRE and XRELEASE prefixes on lock-manipulating instructions mark the start and end of an elided critical section so that existing lock-based code benefits with minimal change, and Restricted Transactional Memory (RTM), with explicit XBEGIN, XEND, XABORT instructions for code written with TSX in mind. IBM POWER8 added a similar facility, and z/Architecture has had transactional memory, including a constrained mode, since the zEC12.

The story since has been complicated. Intel disabled TSX on Haswell and early Broadwell chips by microcode update because of a functional bug, re-enabled it on later generations, then disabled it on many parts through microcode in 2021, after TSX had been shown to be exploitable as a side channel (the TSX Asynchronous Abort, or TAA, vulnerability disclosed in 2019). At the time of writing, transactional memory is not available on most current x86 cores. IBM z still ships it, while POWER dropped it with POWER10; it is not widely used outside specialist niches.

The architectural ideas remain interesting and influential. The notion of a hardware-monitored read-set and write-set turns out to be useful for speculative lock elision (where the hardware optimistically runs through a lock without acquiring it and falls back if there is a conflict) and for checkpointing in the OoO machinery of Chapter 25. Software transactional memory (STM) libraries continue to use the same model in pure software, with much higher overhead but no hardware dependency. The lesson, as with VLIW, is that an attractive architectural idea is not always commercially viable; the right balance of hardware support, programming model, and security has been hard to strike for HTM.

10.On-Chip Interconnect

Connecting many cores requires a fast on-chip network. The cores must be able to exchange cache lines, send snoop requests, and route memory traffic to the right memory controllers — all with low latency and high bandwidth.

Several topologies have been used.

Crossbar. Direct connection from every source to every destination. Latency is uniform and low. Cost grows as $N^2$ in connections, so it doesn't scale beyond ~16 cores.

Ring. Cores connected in a ring; each message travels around the ring until it reaches its destination. Used in older Intel server chips (Sandy Bridge through Broadwell). Latency is proportional to distance around the ring.

Mesh. Cores arranged in a 2D grid; each core has a router that connects to its four neighbors. Used in Intel's Skylake-SP and later server chips (24+ cores). Latency depends on the path; bandwidth scales well.

Switched fabric. A general routed network with arbitrary topology. Used in some many-core designs and in chiplet-based processors (AMD's Infinity Fabric).

NUMA effect. As cores get farther apart (in physical placement, in number of hops), the latency between them grows. A core accessing data in a distant L3 slice or a distant memory controller pays more. This non-uniform memory access (NUMA) effect is small within a chip but significant across chip boundaries (multi-socket systems).

The interconnect is one of the most active research areas in chip design. Its bandwidth and latency directly limit how many cores a chip can usefully contain.

11.Chiplets and Disaggregated Multi-Core Designs

The upper bound on the number of cores in a single die is set by the reticle limit (the largest area a lithography step can pattern in one shot, around 800 mm² on current processes), the yield penalty of large dies (a single defect can ruin the whole chip, and because defects fall roughly at random across the wafer, the chance that a die contains one grows with its area), and the cost of producing the most advanced node. The response in recent designs has been chiplets: a CPU is built from several smaller dies packaged together, each die manufactured in the most cost-effective process for its function.
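The yield argument can be made quantitative with the simple Poisson yield model, in which the probability that a die has no defects is $e^{-A D_0}$ for die area $A$ and defect density $D_0$. This model is a common first approximation, not necessarily what any foundry uses, and the defect density below is a made-up value chosen only for illustration.

```python
import math

def die_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Poisson yield model: probability that a die has zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)

D0 = 0.001  # hypothetical defect density (defects per mm^2)

monolithic = die_yield(800, D0)  # one reticle-sized die
chiplet = die_yield(100, D0)     # one of eight small dies with the same total area
```

With these numbers roughly 45% of the large dies are defect-free against about 90% of the small ones. Because small dies can be tested before packaging, the defective ones are discarded individually instead of taking a whole 800 mm² chip with them.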

AMD's Zen-based server parts have used multi-die packages since 2017, and the chiplet split described here since Zen 2 (2019). A modern AMD server CPU contains several Core Chiplet Dies (CCDs), each a small die on the most advanced process typically holding 8 cores plus an L3 slice, and a separate I/O Die (IOD) on an older, cheaper process holding the memory controllers, PCIe, USB, and inter-CCD interconnect. The Infinity Fabric carries cache-coherent traffic between the CCDs and the IOD. The architectural cost is a higher latency to memory and to other CCDs (the request has to leave the CCD, cross to the IOD, and possibly cross again to another CCD); the architectural benefit is that very large core counts (96 in Zen 4 EPYC, more in newer generations) become economically feasible.

Intel's Sapphire Rapids uses a tile architecture: four large dies are arranged in a 2-by-2 pattern with a fast die-to-die interconnect (EMIB, embedded multi-die interconnect bridges) so that the four tiles function as a single mesh-connected core complex. Each tile holds cores, L3, memory controllers, and PCIe; the package as a whole presents a single unified address space. Its successors keep the multi-tile approach with different tile counts.

Apple's M1 Ultra and M2 Ultra combine two M-series dies through an UltraFusion interconnect with very high bandwidth. The combined chip presents itself as a single unified system to software — with caches, memory, and GPU all behaving as one large machine — even though it is physically two dies.

The architectural consequences for software are mostly invisible: the cache coherence still works, the memory model is still the same, the OS scheduler sees a homogeneous (or heterogeneous) set of cores. What changes is the latency topology: cores within a chiplet are close, cores across chiplets are farther apart, and memory has additional NUMA-like layers. Performance-tuned software is increasingly aware of which cores share which last-level cache, which memory controllers are local, and which paths cross chiplet boundaries. The OS exposes this through CPU topology (Linux's lscpu shows it; Windows has analogous APIs).
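On Linux, the same topology information that `lscpu` prints is available to programs through sysfs. The sketch below reads which CPUs share a cache with a given CPU. It assumes the usual sysfs layout under `/sys/devices/system/cpu`; cache index 3 is commonly the L3 but that is not guaranteed, and the function returns `None` where the file does not exist. The helper names are illustrative.

```python
from pathlib import Path

def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as "0-3,8-11" into individual CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def cache_siblings(cpu: int = 0, index: int = 3):
    """CPUs sharing the cache at sysfs `index` with `cpu`, or None if absent.

    Assumes Linux sysfs; which cache level an index refers to can vary.
    """
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache/index{index}/shared_cpu_list")
    if not path.exists():
        return None
    return parse_cpu_list(path.read_text())
```

On a chiplet-based part, the CPUs returned for one core would typically be those in the same chiplet, which is the group a latency-sensitive program would want to keep its cooperating threads within.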

The broader trend is that the boundary of the chip is becoming a stack of layers rather than a single line. Multi-die packages, advanced packaging (TSMC's CoWoS, Intel's Foveros), and silicon photonics for inter-package links are pushing the system architecture into territory that used to be reserved for the motherboard. We will return to packaging in Chapter 55.

12.NUMA and Multi-Socket Systems

Beyond a single chip, large servers connect multiple chips in multi-socket configurations. A typical 2-socket server has two CPU chips, each with its own DRAM, connected by a high-speed inter-socket link (Intel UPI, AMD Infinity Fabric).

Each chip's cores can access memory attached to the other chip, but the access goes through the inter-socket link and pays a latency penalty (typically 1.5-2× the local access latency) and a bandwidth penalty. This is the macro-scale NUMA (Non-Uniform Memory Access) effect.
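A back-of-the-envelope model shows how much the remote penalty matters. The local latency and penalty factor below are illustrative placeholders (the penalty sits inside the 1.5-2× range just quoted), not measurements of any particular system.

```python
def average_latency_ns(local_fraction: float,
                       local_ns: float = 90.0,
                       remote_penalty: float = 1.7) -> float:
    """Mean memory latency when some accesses cross the socket link.

    local_ns and remote_penalty are illustrative placeholders.
    """
    remote_ns = local_ns * remote_penalty
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
```

With these placeholder values, a thread whose accesses are split evenly between the two sockets sees an average of 121.5 ns instead of 90 ns, a 35% increase in latency before any bandwidth effects are counted.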

NUMA-aware software:

Allocates memory on the same node where the using thread runs (numactl, mbind, set_mempolicy on Linux).
Pins threads to specific cores or nodes to keep them near their data.
Uses data structures that limit cross-node sharing, such as per-node queues, counters, or replicas that are merged across nodes only when needed.
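Thread pinning, the second item above, can be sketched with the Python standard library. `os.sched_setaffinity` is available on Linux only, so the helper below (an illustrative name) does nothing elsewhere. Which CPU numbers belong to which NUMA node is system-specific and has to be looked up separately, for example with `lscpu` or `numactl --hardware`.

```python
import os

def pin_to_cpus(cpus):
    """Restrict the caller to `cpus` and return the resulting CPU set.

    Uses os.sched_setaffinity, which exists on Linux only; on other
    platforms this returns None and changes nothing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)   # pid 0 means "the caller"
    return os.sched_getaffinity(0)
```

Once a thread is pinned to the cores of one node, the default local-allocation policy described below tends to place the pages it first touches on that node's memory.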

Operating systems are NUMA-aware: they try to schedule threads on nodes where their memory was allocated and to allocate new pages on the local node by default. But the program must cooperate, especially in latency-sensitive applications.

For very large systems (hundreds of cores across many sockets), the NUMA effect dominates performance. Distributed-memory parallelism (MPI, message passing) becomes more efficient than shared-memory parallelism past some scale, even though the hardware still provides shared memory in principle.

13.Power and Heterogeneity

A modern multi-core chip dissipates substantial power, and the heat generated is the primary constraint on how fast and how many cores can run simultaneously. Several techniques manage this.

Per-core frequency and voltage scaling. Each core can run at a different frequency and voltage. Idle or lightly-loaded cores run slowly to save power; heavily-loaded cores run fast. The chip's power management unit monitors thermal and current conditions and adjusts.

Turbo boost. When only a few cores are active, those cores can boost above their nominal frequency, using the thermal headroom of the idle cores. When more cores become active, frequencies drop back. This is dynamic voltage/frequency scaling (DVFS) applied opportunistically within the chip's shared power and thermal budget.

Per-core power gating. Idle cores can be completely powered down. Wake-up is slow by core standards, typically tens of microseconds or more, but the idle power saved is substantial.

Heterogeneous cores. Big.LITTLE-style designs explicitly use small cores for low-power background work and big cores only for performance-critical work.

The OS, power-management firmware, and hardware all participate in these decisions. The total power budget for a chip is a few hundred watts in a server, a few tens of watts in a laptop, a few watts in a phone. Within that budget, the hardware tries to give each thread the best performance it can, given the current thermal state.

14.Scalability Limits

Multi-core scaling has its own walls.

The serial fraction. Amdahl's law caps speedup. Real applications have non-trivial serial sections that limit how far parallel speedup goes.

Synchronization overhead. Contended locks and atomic operations cause coherence traffic; heavily-contested shared state can cause performance to degrade with more cores rather than improve.

Cache contention. Threads on different cores compete for ownership of cache lines if they share data (including false sharing, where unrelated variables sit on the same line), and for cache capacity if they don't. The L3 is shared but finite.

Memory bandwidth. All cores share the memory controllers. With enough cores, memory bandwidth becomes the bottleneck, and adding more cores does not help.

Power. A chip's thermal envelope is fixed. Adding cores either forces the per-core frequency down or pushes the chip past its thermal limit.

These walls are why core counts have grown but per-thread performance has not skyrocketed. A 64-core chip is not 64× faster than a 1-core chip on most workloads; it might be 30-40× on well-parallelized work, much less on poorly-parallelized work.

15.Summary

Thread-level parallelism is the dominant axis of performance growth in modern computing. Multi-core processors put many cores on a chip; SMT shares each core among multiple threads; heterogeneous designs mix big and little cores; on-chip interconnects link them all together; multi-socket systems extend the model to multiple chips and NUMA effects.

TLP requires explicit parallelism in the software: a program written serially does not benefit from more cores. Programming models — threads, message passing, task graphs, data-parallel collections — give software different ways to express the parallelism. Synchronization primitives provided by the hardware (atomic operations, memory barriers) let threads coordinate.

The performance ceiling is set by Amdahl's law (serial fractions), synchronization overhead, cache contention, and memory bandwidth. Real applications scale to a few cores easily, to dozens with care, and to hundreds only with substantial effort and design. Beyond that, distributed-memory programming and accelerators take over.

The correctness of multi-threaded code rests on the cache-coherence and memory-consistency models that the hardware provides. The next chapter develops them in detail.
