Part IV·Microarchitecture·Chapter 25 of 62

Part IVMicroarchitecture

Out-of-Order Execution

May 16, 2026·23 min read·intermediate

An in-order superscalar processor stalls whenever the next instruction in the program is not ready. Even if dozens of independent instructions wait behind that one stalled instruction, the hardware cannot reach them. On modern code with frequent cache misses and long-latency operations, this is intolerable: the processor's wide pipeline runs at a fraction of its peak throughput.

The solution is to decouple the order in which instructions execute from the order in which they appear in the program. Instructions still enter the pipeline in program order (the front end is sequential) and they still retire in program order (so that the architectural state evolves predictably and exceptions can be handled), but in between, they may execute in any order their data dependencies allow. An older instruction stuck waiting for a cache miss does not block a younger independent instruction from running; the younger one issues, executes, and parks its result, ready to retire when its turn comes.

This idea, called out-of-order (OoO) execution, is the most consequential micro-architectural technique of the past forty years. It first appeared in research designs in the 1960s (Tomasulo's algorithm in the IBM 360/91, 1967), entered mainstream desktop processors in the 1990s (Intel Pentium Pro, 1995), and is now essentially universal in high-performance CPUs. This chapter develops it from the ground up.

01.What Out-of-Order Means

The idealization to keep in mind: every instruction sits in a pool, waiting for its data inputs. As soon as an instruction's inputs are all available and an execution unit is free, the instruction runs. Instructions complete in whatever order their data dependencies and resource availability allow. When an instruction completes, its result is broadcast back to the waiting pool, where dependent instructions notice and become ready themselves.

To make this work while preserving the architectural illusion of sequential execution, several pieces of machinery are needed.

Register renaming removes false dependencies (WAR and WAW) by assigning each instruction's destination to a fresh physical register. The architectural register names are mapped to physical registers through a rename map; only the true RAW dependencies remain.

A reorder buffer (ROB) holds in-flight instructions in program order. Each ROB entry tracks an instruction's status — issued, executing, completed — and its result. When the oldest instruction in the ROB has completed, it retires, making its result architecturally visible.

An issue queue (also called a reservation station or scheduler) holds instructions that have been decoded and renamed but have not yet executed. Instructions wait here until their inputs are ready and an appropriate execution unit is free.

A common data bus or broadcast network carries completed results back to the issue queue, where they wake up dependent instructions.

The pipeline of an OoO processor, conceptually, is:

Figure: Out-of-order pipeline: fetch, decode, rename and allocate, issue queue, execute, writeback to ROB, retire, with a wakeup broadcast feeding back to the issue queue

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum height=0.9cm, align=center}]
  \node[blk, minimum width=1.6cm] (f)   at (1, -0.5)    {Fetch};
  \node[blk, minimum width=1.8cm] (d)   at (3.2, -0.5)  {Decode};
  \node[blk, minimum width=2.0cm] (r)   at (5.6, -0.5)  {Rename\\+ Allocate};
  \node[blk, minimum width=2.2cm] (iq)  at (8.2, -0.5)  {Issue Queue};
  \node[blk, minimum width=1.8cm] (ex)  at (10.7, -0.5) {Execute};
  \node[blk, minimum width=2.0cm] (wb)  at (13.1, -0.5) {Writeback\\to ROB};
  \node[blk, minimum width=1.6cm] (ret) at (15.5, -0.5) {Retire};
  \draw[->] (f) -- (d);
  \draw[->] (d) -- (r);
  \draw[->] (r) -- (iq);
  \draw[->] (iq) -- (ex);
  \draw[->] (ex) -- (wb);
  \draw[->] (wb) -- (ret);
  \draw[->, dashed] (wb.south) to[bend left=25] node[below] {wakeup broadcast} (iq.south);
\end{tikzpicture}

The issue queue is the heart of the OoO machine: instructions sit in it until they can run, the order of issuing is determined by dependency-readiness rather than program order, and the broadcast wakes up dependents as they finish.

02.Register Renaming

The first piece of OoO machinery, and the conceptual foundation, is register renaming. Consider:

Assembly

add  x1, x2, x3       # writes x1
sub  x4, x1, x5       # reads x1 (RAW: true dependency on 1)
mul  x1, x6, x7       # writes x1 (WAW: output dependency on 1)
add  x8, x1, x9       # reads x1 (RAW on 3)

Instructions 1 and 3 both write x1. In an in-order pipeline, this is fine: 1 writes, 2 reads it, 3 overwrites, 4 reads the new value. But if we want 3 and 4 to execute concurrently with 1 and 2 — say, because 1 has a dependence on a slow load — we have a problem. The architectural register x1 cannot simultaneously hold the result of 1 (which 2 needs) and the result of 3 (which 4 needs).

The fix is to give each instruction's destination a fresh register. Internally, the processor has a large pool of physical registers, much larger than the architectural register file. When instruction 1 says "write x1", the renamer allocates physical register p47 and remaps x1 → p47. When instruction 3 says "write x1", the renamer allocates p52 and remaps x1 → p52. Now:

Plain Text

add  p47, p33, p21    # x1 → p47
sub  p55, p47, p18    # reads p47 (the right one, from 1)
mul  p52, p41, p9     # x1 → p52  (different physical register!)
add  p60, p52, p25    # reads p52 (from 3)

The WAW between 1 and 3 has vanished: they write different physical registers. The WAR that would have existed between 2 and 3 (2 reading x1 just before 3 overwrites it) has also vanished. Only the true RAW dependencies remain: 2 depends on 1 because it reads p47 that 1 writes; 4 depends on 3 because it reads p52 that 3 writes.

The architectural register file is now a projection of the physical register file. The mapping is dynamic: at any moment, the renamer's table says which physical register currently corresponds to each architectural register. As instructions retire in order, the architectural state evolves predictably even though the physical state is being shuffled around far ahead of retirement.

A typical rename pipeline:

Read the source architectural registers from the rename map: each architectural register name maps to the physical register holding its current value.
Allocate a fresh physical register for the destination.
Update the rename map: dest_arch → new_phys.
Forward the renamed instruction to the issue queue, with all references replaced by physical register numbers.

The cost: the rename stage adds 1-2 pipeline stages (and increases misprediction penalty correspondingly), and the rename map and free list are non-trivial pieces of hardware. The benefit: false dependencies are eliminated, freeing the scheduler to run instructions in any order their true data dependencies allow.

The size of the physical register file matters. It bounds how many in-flight instructions can have outstanding writes simultaneously. Modern processors have 200+ physical integer registers and similar numbers for FP/vector. Once the free list is exhausted, the rename stage stalls until older instructions retire and free their old physical registers.

03.The Reorder Buffer

Instructions complete out of order, but the architectural state must evolve in order. The reorder buffer (ROB) is the structure that makes this work.

The ROB is a circular queue of in-flight instructions, in program order. Each entry tracks:

The instruction's PC.
Its destination architectural register.
The physical register holding its result (the freshly allocated one from rename).
The previous physical register that mapped to this architectural register (needed for misprediction recovery, see below).
Status: issued, executing, completed.
Any exception or branch outcome.

Instructions enter the ROB in program order during the rename stage and leave the ROB in program order during the retire stage. In between, their execution may complete in any order; the ROB just records the completion.

Retirement happens when the oldest ROB entry has completed and there are no exceptions. At retirement:

The instruction's result physical register becomes the architectural mapping for its destination register.
The previously-architectural physical register is returned to the free list.
The ROB entry is freed.
The architectural state has now advanced by one instruction.

If a retirement-stage instruction has raised an exception, retirement is delayed: the exception handler runs in place, after the architectural state of all older instructions is in place. We will revisit precise exception handling in a later section.

The ROB size is one of the headline numbers for any modern processor. A larger ROB allows more in-flight instructions, which means more opportunity for OoO scheduling to find independent work to do during long stalls. Modern processors have ROBs of 256-600+ entries. The Apple M1's ROB is reportedly around 630 entries; recent Intel cores around 512.

A larger ROB is not free. It costs area and power, and several aspects of the processor scale with ROB size: the rename map's depth (each in-flight instruction's prior rename must be tracked for recovery), the misprediction recovery cost (more in-flight instructions to flush), and the issue queue's effective scope. The ROB must also keep up with the back end's retirement rate; a too-small ROB throttles a fast back end.

04.The Issue Queue and Scheduling

Renamed instructions wait in the issue queue for their inputs to become available and an execution unit to be free. The issue queue is one of the most performance-critical structures in an OoO processor.

Each issue-queue entry contains:

The instruction's opcode and any immediate operands.
The physical register numbers of the source operands.
Ready bits indicating whether each source has been produced.
The destination physical register.
The execution port(s) the instruction can issue to.

The scheduling logic, every cycle:

Identifies which instructions have all their sources ready.
Among the ready instructions, picks the oldest (or some other criterion) for each available execution port.
Dispatches the picked instructions to the execution units.
Removes them from the issue queue.

When an executing instruction produces a result, it broadcasts the destination physical register number on a common data bus (CDB) or equivalent network. Every issue-queue entry compares the broadcast register number against its source registers; matches set the corresponding ready bit. The next cycle, those instructions are eligible to issue.

The issue queue's complexity grows with its size. Each cycle, the wakeup logic compares the broadcast tags against every source in every entry. If the issue queue has 100 entries with 2 sources each, that's 200 comparators every cycle. Combined with the select logic that picks among the ready entries, the wakeup-and-select loop is on a tight timing budget — typically a single cycle.

Several techniques manage this complexity.

Distributed reservation stations. Tomasulo's original design had separate reservation stations at each functional unit, with each station a small queue. This distributes the wakeup logic but complicates the broadcast network.

Unified scheduler. Most modern designs use a single unified scheduler with all instructions in one queue. The wakeup logic is centralized; the select logic chooses one instruction per port per cycle.

Speculative wakeup. For very fast back-to-back issuing of dependent instructions, the wakeup is speculative: a dependent instruction is woken up before its source's result is actually ready, on the assumption that it will be ready in time. If the speculation is wrong (e.g., the source instruction takes longer than expected, as with a cache miss), the dependent has to be replayed.

Non-data-capturing schedulers. Older designs stored source values in the issue queue (the reservation stations would capture the data). Modern designs store only the source register numbers; the actual values are read from the physical register file at issue time. This shrinks each issue-queue entry significantly.

Issue-queue size, like ROB size, is a major performance lever. Larger issue queues find more parallelism but cost area, power, and timing. Modern issue queues range from about 60 entries (small cores) to 200+ entries (largest server cores).

05.The Wakeup-Select Loop

The single tightest timing loop in a modern out-of-order processor is the wakeup-select path of the issue queue. To issue dependent instructions in back-to-back cycles — essential, since chains of dependent ops are common — the processor must (1) execute the producer in cycle $N$ , (2) broadcast its destination tag at the end of cycle $N$ so that dependents wake up in cycle $N+1$ , (3) select among the now-ready instructions, and (4) issue the chosen ones in cycle $N+1$ . The whole sequence has to fit in a single clock cycle, end to end.

This is harder than it looks. The wakeup compares the broadcast tag against every source-register field in every issue-queue entry; the select logic then finds the oldest ready instruction for each port from among potentially dozens of candidates. Both operations have logic depth that grows with the issue-queue size, and the wires have to span the queue. As issue queues grow, the loop tends to violate timing.

One response is to split the loop across cycles. The wakeup happens in cycle $N$ ; the select in cycle $N+1$ ; back-to-back issue is no longer possible, and dependent instructions execute with a one-cycle gap. This sacrifices IPC on dependency chains and is unattractive on the high-frequency targets where the loop's timing is hardest.

A more common response is speculative wakeup. The producer broadcasts its tag before its result is actually computed, on the assumption that the producer's latency is what the schedule expected. For one-cycle ALU operations the assumption is straightforward; for loads, the schedule assumes a cache hit. The dependent issues optimistically and reads the producer's value from the bypass network when it arrives.

When the assumption is wrong — the load misses in the L1, the producer takes longer than expected — the speculatively-woken-up dependents and their dependents (the entire cone of speculation) must be replayed. Mechanisms vary. Some designs re-issue the affected instructions from the issue queue, holding their slot until the producer's value is genuinely ready. Others squash a window of instructions and refetch them. Replay machinery is one of the major hidden complexities of modern OoO designs and a frequent source of pathological performance cliffs: a program that hits the replay path frequently can run substantially slower than its peak IPC suggests.

The practical reading is that the issue queue is not the orderly waiting room of the textbook but a noisy place where instructions wake up speculatively, sometimes execute, sometimes get knocked back into the queue, and sometimes drag a cluster of dependents through the same fate. The IPC numbers reported on workloads reflect a steady-state balance between successful speculation and replays, with the balance set by the cache miss rate, the latency variance of variable-latency units, and the depth of dependency chains in the code.

06.Move Elimination and Zero Idioms

The rename machinery makes some optimizations possible that have no analog on an in-order machine. Two of them are common enough to be worth naming.

Move elimination. A register-to-register move — mov rax, rbx on x86, mov x0, x1 on AArch64 — is a common instruction in compiler-generated code, accounting for several percent of dynamic instructions. On the architectural level it copies bx's value into ax; on the renamed level it does not need to. Instead, the rename stage simply maps the architectural destination to the same physical register that the source was already mapped to. Both names now refer to the same physical register; no execution-port bandwidth is consumed. The instruction completes "in the rename stage" and is marked done in the ROB.

The accounting is delicate: the physical register's reference count must be tracked, since multiple architectural names refer to it, and the register cannot be freed until all of them are overwritten. But the win is real: a few percent of all dynamic instructions vanish from the back end.

Zero idioms. An instruction like xor eax, eax on x86 is a common way to zero a register; the same is true of sub r0, r0, r0 and many other patterns. The rename stage recognizes these idioms and maps the destination to a special zero physical register (a register file entry hard-wired to zero) without consuming an execution slot. Modern x86 cores recognize a substantial collection of zero idioms; AArch64 has the architectural xzr/wzr zero register, so the issue does not arise.

A related optimization is immediate generation: instructions that produce a small constant from a fixed operation may be eliminated similarly. The architectural effect is invisible; the micro-architectural effect is that effective IPC on real code is somewhat higher than a literal count of dynamic instructions would suggest.

We will discuss these decode-and-rename-stage optimizations more thoroughly in Chapter 27, where they sit alongside macro-op fusion and other front-end techniques.

07.Misprediction Recovery

OoO machines speculate aggressively: they fetch, rename, issue, and execute past unresolved branches. When a branch is mispredicted, the work done past it has to be unwound.

The challenge is that some of that work has already updated processor state — physical registers have been written, rename mappings have changed, the ROB has filled with instructions younger than the branch. Recovery has to undo all of it without touching the genuinely-architectural state of older instructions.

Several techniques exist.

Checkpointing. At every branch (or every prediction), the processor takes a checkpoint of the rename map. If the branch is mispredicted, the rename map is restored from the checkpoint, and physical registers freed by the squashed instructions return to the free list. The ROB entries past the branch are dropped. Fast but storage-expensive: each checkpoint costs the size of the rename map.

Walking the ROB. Instead of checkpointing, the processor walks the ROB backward from the youngest entry to the mispredicted branch, undoing each rename mapping by restoring the prior physical register that the entry recorded. Slower (multiple cycles) but no checkpoint storage needed.

Hybrid. Many modern designs checkpoint at every branch but only the most recent few branches; older mispredictions trigger an ROB walk.

Either way, recovery in a modern processor takes a small number of cycles plus the time to refill the pipeline. The total misprediction penalty is dominated by the pipeline depth (10-15 cycles to refill) plus the recovery cost (1-3 cycles). This penalty is high but bounded; the role of the branch predictor (Chapter 23) is to make sure mispredictions are rare enough that the cost is manageable.

08.Memory Operations: A Preview

Memory operations — loads and stores — are the hardest part of OoO execution. They have an extra source of "dependency" beyond register dataflow: a load may depend on a store through memory, and the hardware has to detect this and serialize them appropriately.

Consider:

Assembly

1: st   x1, 0(x10)        # store x1 to address x10
2: ld   x2, 0(x11)        # load from x11 — does it alias x10?

If x10 == x11, the load reads what the store wrote, and the load must wait for the store. If they differ, the load is independent and can execute earlier.

The processor cannot tell whether they alias until both addresses are known. It must therefore handle memory dependencies dynamically. This is the job of the load/store queue and memory disambiguation logic, which we will explore in Chapter 26.

For now, suffice it to say that stores typically wait in a queue until they retire (so that they only modify memory in program order), and loads check against older stores' addresses for matches before reading from memory or forwarding from a store. Speculation is also used: a load may speculatively assume it does not alias older stores and execute early; if it turns out to alias, the load (and its dependents) are squashed and replayed.

09.Precise Exceptions

A precise exception is one in which all instructions before the faulting instruction have completed and none after it have. The architectural state is exactly as if the program had executed up to the faulting instruction and stopped. This is what software relies on for sensible exception handling: the OS can read the registers, see exactly where the fault happened, and either fix it (page fault) or kill the process (segfault).

OoO execution makes precise exceptions tricky. Younger instructions may have already executed — written their results to physical registers, modified internal state — by the time an older instruction faults. If the OS sees those younger results, it sees an inconsistent state.

The ROB makes precise exceptions possible. An instruction's result becomes architecturally visible only at retirement; until then, it is only in a physical register that the architectural rename map does not yet refer to. If an older instruction faults, the processor:

Stops retiring.
Squashes all younger instructions in the ROB.
Frees their physical registers.
Restores the rename map to the state just before the faulting instruction.
Invokes the exception handler.

The architectural state at the point of the exception is exactly the state right before the faulting instruction. The handler sees a clean, in-order view, as if the program had executed serially.

This is one of the reasons retirement happens in order. If the architectural state were updated as instructions completed (out of order), precise exceptions would be impossible. The ROB's in-order retirement is the linchpin that makes OoO execution architecturally invisible.

10.A Worked Example

To make the OoO machinery concrete, walk through a small program on a hypothetical OoO machine with: 4-wide front end, 4 ALU ports, 100-entry ROB, 60-entry issue queue, full register renaming, 1-cycle ALU latency, 4-cycle load latency, 300-cycle main-memory latency.

The program:

Assembly

ld   x1, 0(x10)        # cache miss (300 cycles)
add  x2, x1, x3        # depends on x1
mul  x4, x5, x6        # independent
sub  x7, x8, x9        # independent
xor  x11, x12, x13     # independent
add  x14, x4, x15      # depends on x3 (mul)
ld   x16, 0(x17)       # cache hit
add  x18, x16, x19     # depends on x16

Cycle 1-2: Front end fetches and decodes 1-4.

Cycle 3: Rename allocates physical registers for 1-4. ROB entries 1-4 created. Issue queue receives 1-4.

Cycle 4: Issue queue dispatches:

Instruction 1 (load) to a load port. Cache misses, off to memory.
Instruction 3 (mul) to a multiplier port (3 cycles).
Instruction 4 (sub) to an ALU port (1 cycle).
Instruction 2 (add) waits — its source x1 is not ready.

Cycle 5: Sub completes. Front end fetches 5-8. Rename and issue them.

Instruction 5 (xor) issues to ALU.
Instruction 7 (load) issues to load port. Cache hit, will return in 4 cycles.

Cycle 6: Xor completes. Mul still in flight.

Cycle 7: Mul completes. Instruction 6 (add, depends on x4 from mul) becomes ready. Issue.

Cycle 8: Instruction 6 (add) completes. Load 7 about to return.

Cycle 9: Load 7 returns. Instruction 8 (add, depends on x16) becomes ready. Issue.

Cycle 10: Instruction 8 completes.

So far: instructions 3, 4, 5, 6, 7, 8 have completed. Instruction 1 is still off in memory. Instruction 2 is still waiting in the issue queue. The processor has done useful work for 6 instructions during the 300-cycle stall on instruction 1.

In cycle ~304, the load from main memory returns. Instruction 1 completes; instruction 2 wakes up; in cycle 305, instruction 2 issues; in cycle 306, it completes. Now retirement can proceed. Instructions 1, 2, 3, 4, 5, 6, 7, 8 retire in order in cycles 307-308 (assuming 4-wide retire).

The total elapsed time is about 308 cycles for 8 instructions, so apparent IPC is low. But during those 308 cycles, the processor was filling in other instructions from the program: the front end was fetching ahead, the OoO back end was finding independent work everywhere it could. Over the long run, the processor's IPC reflects the amount of independent parallelism in the program, not just the in-order dependency chain.

This is the central virtue of OoO execution: long stalls do not bring the machine to a halt. As long as there is independent work in the ROB (or that can be fetched into it), the back end stays busy.

11.Memory-Level Parallelism

A crucial benefit of OoO that is sometimes overlooked: memory-level parallelism (MLP), the ability to have multiple cache misses in flight simultaneously.

Consider a loop walking a linked list:

Assembly

loop:
  ld   x1, 0(x2)         # load next pointer
  ...
  mv   x2, x1
  bne  x2, zero, loop

Each iteration depends on the previous (the pointer chain), so OoO cannot help with the dependency. But with prefetching or speculative execution past the branch, multiple cache misses can be in flight at once, and the total latency is amortized.

For data structures without dependency chains — array iteration, hash-table lookups across many keys, scatter-gather operations — OoO with a large enough ROB can keep dozens of cache misses in flight simultaneously. The effective memory bandwidth is multiplied by the parallelism. This is one of the largest practical benefits of OoO execution on memory-bound code.

12.When Out-of-Order Doesn't Help

OoO execution depends on having independent instructions in the ROB. When it doesn't, OoO cannot help.

Long dependency chains. A program structured as a linear chain of dependent operations — each instruction reading the result of the previous — has no parallelism for OoO to find. The machine reduces to in-order behavior plus the overhead of the OoO machinery. Pointer chasing and many cryptographic algorithms are like this.

Frequent mispredictions. A tight loop with hard-to-predict branches spends most of its time in misprediction recovery. The big OoO window is wasted because each misprediction flushes it. Highly data-dependent control flow defeats OoO benefit.

Tiny ROB-relevant programs. Some algorithms have parallelism but at distances that exceed the ROB. Sorting a large array, for example, has parallelism, but the parallel work may be thousands of instructions apart — beyond the reach of even a 600-entry ROB.

For these workloads, simpler in-order designs may be more efficient (per watt, per area). This is why mobile and embedded chips often have a mix of small in-order and large OoO cores: the right tool for the right work.

13.Summary

Out-of-order execution decouples the order of execution from the order of the program. Instructions enter and leave the back end in program order, but in between, they execute as soon as their data dependencies and resource availability allow. The mechanism rests on register renaming (eliminating false dependencies), the reorder buffer (preserving in-order retirement and precise exceptions), the issue queue (selecting ready instructions to execute), and a broadcast network that wakes up dependents when their producers complete.

OoO execution is the dominant technique behind modern high-performance CPUs. It hides long-latency operations like cache misses, exposes memory-level parallelism, and turns a wide pipeline's potential into actual IPC. The cost is a complex back end: large physical register files, large ROBs, fast issue queues, and intricate misprediction-recovery machinery. The benefit, on typical software, is a large multiple in throughput compared to in-order designs of similar width.

Memory operations are the hardest part of OoO and have their own dedicated machinery — the load/store queue, memory disambiguation, store-to-load forwarding. Chapter 26 covers them in detail.

Book mode

	1: add x1, x2, x3 # writes x1
	2: sub x4, x1, x5 # reads x1 (RAW: true dependency on 1)
	3: mul x1, x6, x7 # writes x1 (WAW: output dependency on 1)
	4: add x8, x1, x9 # reads x1 (RAW on 3)

	1: add p47, p33, p21 # x1 → p47
	2: sub p55, p47, p18 # reads p47 (the right one, from 1)
	3: mul p52, p41, p9 # x1 → p52 (different physical register!)
	4: add p60, p52, p25 # reads p52 (from 3)

	1: ld x1, 0(x10) # cache miss (300 cycles)
	2: add x2, x1, x3 # depends on x1
	3: mul x4, x5, x6 # independent
	4: sub x7, x8, x9 # independent
	5: xor x11, x12, x13 # independent
	6: add x14, x4, x15 # depends on x3 (mul)
	7: ld x16, 0(x17) # cache hit
	8: add x18, x16, x19 # depends on x16

	1: st x1, 0(x10) # store x1 to address x10
	2: ld x2, 0(x11) # load from x11 — does it alias x10?

	loop:
	ld x1, 0(x2) # load next pointer
	...
	mv x2, x1
	bne x2, zero, loop