Part VISA Case Studies

x86-64 Micro-Architecture

May 16, 2026 · 20 min read · advanced

The previous four chapters described x86-64 as an architecture: the historical evolution, the programming model, the system architecture, and the floating-point/SIMD facilities. This chapter looks at the micro-architecture — how Intel and AMD have actually built fast x86-64 chips. The principles of out-of-order execution, branch prediction, caching, and so on were developed in Part V; here we apply them to specific real-world cores and discuss where their designers have made interesting choices.

The chapter has three parts. First, a tour of a representative modern Intel core (Golden Cove / Redwood Cove, the P-cores in Alder Lake / Raptor Lake / Meteor Lake). Second, a parallel tour of a modern AMD core (Zen 4, the basis of Ryzen 7000 and EPYC 9004). Third, comparative observations: where Intel and AMD differ, what each does well, and how x86-64 cores compare to ARM and RISC-V.

The numbers and structures here change every CPU generation. Treat the specifics as illustrative; the patterns are what matter, and they tend to persist across generations.

01. Modern Intel: Golden Cove and Successors

Intel's "P-core" lineage (high-performance core) traces back to the Pentium Pro (1995). Skylake (2015) was a long-lived variant. Sunny Cove (Ice Lake, 2019) was a major redesign. Golden Cove (2021) and Raptor Cove (2022) followed; Redwood Cove and Lion Cove are the most recent generations as of 2026.

We use Golden Cove as the reference, with notes on later changes.

Front End

Branch prediction. Two-level TAGE-style predictor with very large history tables; specialized handlers for indirect branches (ITTAGE), returns (RAS, 32 entries with prediction beyond the depth via the BTB), and loops. Excellent on most workloads; mispredict rate often under 1% for typical applications.

Instruction fetch. 32-byte fetch from the L1 instruction cache (32 KiB, 8-way associative). Up to 6 instructions per cycle delivered to the decoders.

Decode. 6-wide decoder: one complex decoder + five simple decoders. The complex decoder handles instructions that produce up to 4 µops; the simple decoders handle instructions that produce a single µop. Instructions that need longer sequences (e.g., string ops) are expanded by the microcode sequencer.

µop cache. A separate cache of pre-decoded µop sequences, capable of delivering up to 8 µops per cycle. On hits, the regular decoders are bypassed. The µop cache (around 4K µops) is filled from the decoders' output and is one of Intel's secret weapons: most steady-state hot loops fit and get the high delivery bandwidth.

Loop Stream Detector. When a loop fits in the µop queue, the front end stops fetching/decoding entirely and replays the µops directly from the queue. Saves power and pipeline activity.

The front end can sustain ~6 instructions / 8 µops per cycle when the µop cache hits. When the µop cache misses, the decode bandwidth limits delivery to 6 instructions per cycle.

Back End

Rename. ~6 µops per cycle renamed into the OoO back end. The PRF has hundreds of integer registers (around 280) and similar FP/vector entries. Rename also handles move elimination, zeroing-idiom elimination, and dependency-breaking on idioms.

ROB. ~512 entries, deep enough to track many in-flight branches and memory operations. Window size is one of the bigger jumps in recent generations.
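
A rough way to see why window size matters: at a sustained allocation rate of w µops per cycle, a ROB of N entries fills in about N / w cycles, and that is roughly the longest stall it can hide behind independent work. The sketch below plugs in the figures quoted in this section (a 512-entry ROB, about 6 µops renamed per cycle). The 300-cycle memory latency is an assumed round number for illustration, not a measurement.

```python
def cycles_covered(rob_entries: int, uops_per_cycle: float) -> float:
    """Cycles of latency the window can hide while allocating at full rate."""
    return rob_entries / uops_per_cycle


def entries_needed(latency_cycles: float, uops_per_cycle: float) -> float:
    """Window size needed to keep allocating for the whole latency (Little's law)."""
    return latency_cycles * uops_per_cycle


if __name__ == "__main__":
    print(round(cycles_covered(512, 6), 1))  # 85.3
    print(entries_needed(300, 6))            # 1800 (300 cycles is an assumed latency)
```

Under these assumptions even a 512-entry window covers well under a hundred cycles at full width, which is one reason each generation keeps growing it.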

Schedulers. Unified for some functional units, distributed for others. Total scheduler entries: ~150-200. Up to 12 µops can issue per cycle to functional units.

Execution ports. Twelve in Golden Cove:

  • 5 integer ALU ports (some also handle multiply, branch, etc.). Three of these also host the FP/SIMD pipes, so FP/SIMD work shares ports with integer work rather than adding to the count.
  • 3 load ports (address generation plus load).
  • 2 store-address ports.
  • 2 store-data ports (a store issues as two µops: one for the address, one for the data).

The actual port assignment is asymmetric — particular instructions can only execute on particular ports. The scheduler figures out the best port for each µop based on availability and instruction type.
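
The effect of asymmetric ports can be illustrated with a toy binding policy: give each µop to the least-loaded port that is able to execute it. This is only a sketch. The port names and the capability map are hypothetical simplifications, and real schedulers use more elaborate heuristics than a greedy pass.

```python
from collections import Counter

# Hypothetical, simplified capability map for illustration only.
ALLOWED_PORTS = {
    "alu": ("p0", "p1", "p5", "p6", "p10"),
    "fma": ("p0", "p1", "p5"),
    "load": ("p2", "p3", "p11"),
    "branch": ("p6",),
}


def bind_ports(uops, allowed=ALLOWED_PORTS):
    """Greedily bind each µop to the least-loaded port that can execute it.

    Returns (per-port load, cycles needed if each port issues one µop per cycle).
    """
    load = Counter()
    # Bind the most constrained µops first so flexible ones fill the gaps.
    for kind in sorted(uops, key=lambda k: len(allowed[k])):
        port = min(allowed[kind], key=lambda p: load[p])
        load[port] += 1
    return dict(load), max(load.values(), default=0)


if __name__ == "__main__":
    mix = ["alu"] * 4 + ["fma"] * 3 + ["branch"] * 2
    print(bind_ports(mix))
```

With this mix, the two branches pile onto the single branch-capable port, so the batch needs two cycles even though nine µops would fit in one cycle on twelve symmetric ports.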

Memory. L1D 48 KiB, 12-way, 5-cycle load-use latency. Three loads + two stores per cycle in the best case. L2 is 1.25 MiB to 2 MiB per core (depending on segment), about 14-cycle latency. L3 is shared, ~30-50 cycles latency, sized at 1.875 MiB per slice in Sapphire Rapids servers.

TLBs. L1 dTLB ~96 entries for 4 KiB pages, L1 iTLB ~256 entries. L2 unified ~2048 entries. Separate small TLBs for 2 MiB and 1 GiB pages.

AVX-512

P-cores include AVX-512 hardware. On consumer chips with hybrid topology (Alder Lake/Raptor Lake), AVX-512 is fused off because the E-cores don't support it. On servers (Sapphire Rapids, Emerald Rapids) and on some Sapphire Rapids workstations, AVX-512 is enabled.

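When enabled, the FP/SIMD pipes execute AVX-512 instructions at full rate. On server parts, two of the three FP/SIMD pipes fuse to form one 512-bit FMA pipe and the third provides a second, so the core sustains two 512-bit FMAs per cycle. The result: peak DP throughput of 2 FMAs × 8 lanes × 2 FLOPs = 32 DP FLOPs/cycle/core. The extra power draw is managed through frequency licenses, discussed later in this chapter.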
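
Peak figures of this kind are simple arithmetic: FMA pipes, times 64-bit lanes per vector, times two floating-point operations per fused multiply-add. A minimal sketch follows; the pipe counts passed in are illustrative inputs, not claims about any particular core.

```python
def peak_dp_flops_per_cycle(fma_pipes: int, vector_bits: int) -> int:
    """Peak double-precision FLOPs per cycle for one core.

    Each FMA pipe completes one fused multiply-add per 64-bit lane per cycle,
    and an FMA counts as two floating-point operations.
    """
    lanes = vector_bits // 64
    return fma_pipes * lanes * 2


if __name__ == "__main__":
    for pipes, bits in [(2, 256), (2, 512), (3, 512)]:
        print(pipes, bits, peak_dp_flops_per_cycle(pipes, bits))  # 16, 32, 48
```

Sustained throughput is usually lower, since loads, stores, and frequency limits all intrude.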

Lion Cove and Beyond

Lion Cove (2024) brought significant changes:

  • 8-wide allocation (up from 6).
  • Wider issue (~18 ports).
  • Larger ROB (576 entries).
  • Removed hyperthreading from P-cores (a controversial choice; trades some multithreaded throughput for area, power, and single-thread efficiency).
  • New L0/L1 cache hierarchy (L0 smaller and faster, L1 larger as backup).

Each generation pushes structures wider and deeper. The micro-architecture is in continuous evolution; what was top of the line two years ago is now mid-tier.

02. Intel E-Cores: Gracemont and Successors

Intel's hybrid topology pairs P-cores with E-cores (efficient cores). The E-core lineage (Tremont → Gracemont → Crestmont → Skymont) descends from Atom; it is a smaller, more efficient core targeting throughput per watt rather than peak per-thread performance.

Gracemont (Alder Lake's E-core) characteristics:

  • 6-wide front end (unusual for an "efficient" core — it's not small).
  • 256 ROB entries.
  • 17 execution ports (yes, more than the P-core).
  • Smaller L1D (32 KiB, 3-cycle load-use) alongside a 64 KiB L1I.
  • 2 MiB shared L2 across 4 E-cores.
  • AVX2 (no AVX-512).
  • No hyperthreading.

E-cores are not weak. They are reasonably wide and deep, but they prioritize area and power efficiency over peak frequency. Per-clock IPC is around 80-90% of the P-core for typical integer code; running at lower frequency (~3-4 GHz vs P-core's 5+ GHz), they end up at 50-60% of the per-core performance, but at perhaps 1/4 the power.
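
The per-core estimate is just the product of the IPC ratio and the frequency ratio. A small sketch makes the arithmetic explicit, using mid-range values from the paragraph above (IPC at 85% of the P-core, 3.5 GHz against 5.0 GHz, one quarter of the power); the specific midpoints are chosen here for illustration.

```python
def relative_performance(ipc_ratio: float, e_ghz: float, p_ghz: float) -> float:
    """E-core per-core performance as a fraction of the P-core's."""
    return ipc_ratio * (e_ghz / p_ghz)


def relative_efficiency(perf_ratio: float, power_ratio: float) -> float:
    """Performance per watt relative to the P-core."""
    return perf_ratio / power_ratio


if __name__ == "__main__":
    perf = relative_performance(0.85, 3.5, 5.0)
    print(round(perf, 3))                             # 0.595
    print(round(relative_efficiency(perf, 0.25), 2))  # 2.38
```

On these numbers an E-core delivers roughly 60% of a P-core's performance at more than twice the performance per watt.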

Skymont (Lunar Lake / Arrow Lake's E-core) brings even more improvements: wider front-end, larger structures, aiming to close the gap with P-cores further. The hybrid arrangement is becoming the norm.

03. AMD Zen Lineage

AMD's modern x86-64 cores are the Zen family, designed from scratch starting in 2012-2014 and shipping in 2017. Generations: Zen, Zen 2, Zen 3, Zen 4, Zen 5 (2024). Each is a refinement of the previous; Zen 3 was a particularly large jump (merging each die's two four-core complexes into a single eight-core complex with one unified L3 cache).

We use Zen 4 (Ryzen 7000, EPYC 9004) as the reference.

Front End

Branch prediction. TAGE-style predictor with deep history. Indirect predictor via ITTAGE-like structure. Return stack: 32 entries.

Instruction fetch. 32-byte fetch from L1 I-cache (32 KiB, 8-way associative).

Decode. 4-wide decoder with µop cache. Smaller decode width than Intel, but the µop cache and good front-end design largely mitigate the difference.

µop cache. ~6.75K µops, larger than Intel's. Delivers up to ~8 µops per cycle when hit.

Rename is 6-wide, so the front end can sustain ~6 µops per cycle from the µop cache and ~4 from regular decode.
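
How much the narrower decoders cost depends on how often the µop cache hits. A first-order sketch, assuming the two delivery rates above (about 6 µops per cycle from the µop cache, about 4 from the decoders) and ignoring any penalty for switching between the two paths:

```python
def sustained_uops_per_cycle(hit_rate: float,
                             cache_rate: float = 6.0,
                             decode_rate: float = 4.0) -> float:
    """Average µops per cycle when a fraction hit_rate of µops come from the µop cache.

    Time per µop is the weighted sum of the two delivery paths, so the
    combined rate is a weighted harmonic mean, not an arithmetic one.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cycles_per_uop = hit_rate / cache_rate + (1.0 - hit_rate) / decode_rate
    return 1.0 / cycles_per_uop


if __name__ == "__main__":
    print(round(sustained_uops_per_cycle(0.8), 2))  # 5.45
```

At an 80% hit rate the model gives about 5.45 µops per cycle, which is why a large µop cache largely hides the 4-wide decoder in steady-state loops.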

Back End

Rename. 6 µops per cycle in Zen 4. PRF: ~224 integer + ~192 FP/SIMD.

ROB. ~320 entries. Smaller than Intel's, traditionally — though Zen 5 grew this further.

Schedulers. Distributed: separate scheduler per execution group. Total scheduler entries: ~144.

Execution ports.

  • 4 ALU pipes (each can do most integer ops).
  • 3 AGU pipes (for memory addresses).
  • 4 FP/SIMD pipes, each 256 bits wide: two handle multiply/FMA and two handle add (AVX-512 instructions take 2 cycles in Zen 4's "double-pumped" scheme).

AMD's port count is similar to Intel's but the grouping differs. The compiler doesn't directly target specific ports, but the scheduler's port assignment affects achievable throughput on specific code patterns.

Memory. L1D 32 KiB, 8-way, 4-cycle load-use. Two loads + one store per cycle (Zen 4; Zen 5 widened this). L2 1 MiB private per core, ~14-cycle latency. L3 shared per core complex (CCX), 16-32 MiB depending on chip variant, ~50-cycle latency.

TLBs. Similar capacities to Intel: ~64 L1 dTLB, ~64 iTLB, ~3072 L2 unified.

AVX-512 in Zen 4

Zen 4 was the first AMD core to support AVX-512, but with a "double-pumped" implementation: each 512-bit operation is issued once and then occupies one of the existing 256-bit FP units for two consecutive cycles. This means AVX-512 instructions execute correctly but at half the per-pipe throughput of a "true" 512-bit implementation.

The advantage: AVX-512 support without the area, power, and frequency cost of full 512-bit datapaths. Many AVX-512 codes still benefit (mask registers, more registers, embedded broadcasts, FMA forms) without needing peak throughput. For workloads dominated by pure FP throughput, the 512-bit Intel implementation is faster; for mixed workloads, Zen 4's approach is competitive at lower power.

Zen 5 (2024) brought a true 512-bit FP datapath, joining Intel in native 512-bit execution.

Chiplet Design

A defining AMD feature is chiplets: instead of one monolithic die, an EPYC or Ryzen processor is composed of:

  • One or more CCDs (Core Complex Dies) holding 8 cores each.
  • One IOD (I/O Die) holding the memory controller, PCIe controllers, USB, etc.
  • Connected via the Infinity Fabric interconnect.

Each CCD has its own L3 cache (32 MiB in Zen 4, shared among 8 cores in the CCX). Cores in different CCDs accessing each other's L3 must go through Infinity Fabric, with the additional latency that implies.

Chiplets let AMD scale up core counts (32-128 cores in EPYC) without needing huge dies. Each CCD is a small die that is comparatively cheap to manufacture at good yield. The IOD can be on a different process node (e.g., 6 nm IOD with 5 nm CCDs in Zen 4). Chiplets are AMD's secret weapon in the server market.

The trade-off: cross-CCD communication is slower than within-CCD. Cache-line transfers between cores in different CCDs go through the IOD, taking ~80 ns vs ~15 ns within a CCD. NUMA-aware software pinning helps mitigate this.
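
The value of pinning can be estimated from the two latencies above, and on Linux a process can be restricted to one CCD with `os.sched_setaffinity`. The sketch below is hedged in one important way: it assumes logical CPU ids are numbered contiguously per CCD, which is a hypothetical layout. Real numbering, especially with SMT enabled, should be read from the operating system's topology information.

```python
import os

CORES_PER_CCD = 8  # from the text; SMT siblings and real CPU numbering are ignored


def ccd_cpus(ccd_index: int, cores_per_ccd: int = CORES_PER_CCD) -> set:
    """Logical CPU ids for one CCD, ASSUMING ids are contiguous per CCD."""
    start = ccd_index * cores_per_ccd
    return set(range(start, start + cores_per_ccd))


def mean_transfer_ns(cross_ccd_fraction: float,
                     within_ns: float = 15.0, cross_ns: float = 80.0) -> float:
    """Average core-to-core cache-line transfer latency for a given traffic mix."""
    return cross_ccd_fraction * cross_ns + (1.0 - cross_ccd_fraction) * within_ns


def pin_to_ccd(ccd_index: int) -> bool:
    """Pin the calling process to one CCD on Linux; return False if unsupported.

    May raise OSError if the assumed CPU ids do not exist on this machine.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, ccd_cpus(ccd_index))
    return True


if __name__ == "__main__":
    print(mean_transfer_ns(0.5))  # 47.5
    print(mean_transfer_ns(0.0))  # 15.0
```

If half of a workload's sharing crosses CCDs, the average transfer costs about 47.5 ns in this model; keeping communicating threads on one CCD brings it back to 15 ns.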

Intel's Sapphire Rapids and later use a similar "tile" approach, with multiple dies on a single package connected by EMIB. The chiplet approach is becoming industry standard.

04. Comparative Observations

A few patterns emerge from comparing modern Intel and AMD.

Front-End Width

Intel: 6 (Golden Cove) → 8 (Lion Cove). AMD: 4 (Zen 4) → 2 × 4 (Zen 5, two decode clusters).

Intel has historically had wider decode. AMD has pursued similar effective bandwidth through bigger µop caches and good predictors. Both are currently converging in IPC.

µop Cache Size

AMD typically has a larger µop cache than Intel (around 6.75K vs Intel's 4K range). This compensates for narrower decode by delivering high bandwidth from the cache.

Cache Hierarchy

Intel typically has a non-inclusive L3, larger L2 (Sapphire Rapids: 2 MiB per core L2). AMD has a smaller L2 (1 MiB per core) but huge L3 (32 MiB shared per CCX, 96 MiB with 3D V-Cache).

The AMD pattern (smaller L2, huge L3) tends to be better for large working sets that fit in the giant L3. The Intel pattern (larger L2, smaller per-core L3 share) tends to be better for working sets that fit in the L2.
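
A first-fit model shows the crossover. The L2 figures below and the 96 MiB L3 echo this chapter. The 30 MiB shared L3 for the "large L2" design and the 300-cycle DRAM latency are assumed round numbers, L1 is omitted for brevity, and the model ignores associativity, sharing between cores, and inclusion policy.

```python
KIB, MIB = 1024, 1024 * 1024

# (level name, capacity in bytes, load latency in cycles).
# The 30 MiB L3 and the DRAM latency are ASSUMED values for illustration.
LARGE_L2 = [("L2", 2 * MIB, 14), ("L3", 30 * MIB, 40)]
HUGE_L3 = [("L2", 1 * MIB, 14), ("L3", 96 * MIB, 50)]
DRAM_CYCLES = 300


def serving_level(working_set: int, hierarchy) -> tuple:
    """Return (name, latency) of the smallest level that holds the working set."""
    for name, capacity, latency in hierarchy:
        if working_set <= capacity:
            return name, latency
    return "DRAM", DRAM_CYCLES


if __name__ == "__main__":
    for size in (int(1.5 * MIB), 64 * MIB):
        print(size // KIB, "KiB:",
              serving_level(size, LARGE_L2), serving_level(size, HUGE_L3))
```

A 1.5 MiB working set is served from L2 in the first design but from L3 in the second; a 64 MiB working set fits only in the huge L3 and spills to DRAM otherwise.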

SMT (Hyper-Threading)

AMD has SMT in all Zen cores (2-way). Intel has SMT in P-cores up through Redwood Cove (2-way), but removed it from Lion Cove. The SMT decision is contentious: it boosts total throughput by 15-30% on well-mixed workloads, but it slows each individual thread when siblings contend for shared resources, and it creates side-channel risks (Spectre-like attacks across hyperthreads).

AVX-512

Intel servers: full 512-bit support. Intel consumer (Alder Lake and later): fused off. AMD Zen 4: double-pumped 256-bit. AMD Zen 5: full 512-bit.

The AVX-512 ecosystem support is therefore strongest on Intel servers and (currently) AMD Zen 5. Cross-platform binaries usually default to AVX2.
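
In practice this means choosing a code path at run time. The sketch below shows the selection logic only, driven by the feature flags that Linux exposes in /proc/cpuinfo (`avx2` and `avx512f` are the kernel's flag names). A real dispatcher would normally use CPUID through a library and would check the specific AVX-512 subsets a kernel needs; the level names returned here are made up for the example.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Collect the feature flags from Linux /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def pick_simd_level(flags: set) -> str:
    """Choose the widest code path the CPU advertises, falling back to a baseline."""
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "baseline"


if __name__ == "__main__":
    sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma\n"
    print(pick_simd_level(parse_cpu_flags(sample)))  # avx2
```

On a Linux x86-64 machine the same functions can be fed the contents of /proc/cpuinfo directly.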

Topology

Intel: hybrid (P + E cores) since Alder Lake (2021). AMD: traditionally homogeneous (all cores identical); more recently some asymmetric configurations (such as the Ryzen 9 7950X3D, which pairs one CCD with V-Cache and one without).

The hybrid model helps power efficiency on consumer workloads (browsing, light productivity) where most threads can use small E-cores. On homogeneous workloads (compilation, rendering), the difference matters less.

05. Comparison with ARM and RISC-V

x86-64 cores are large and complex compared to ARM or RISC-V cores at similar performance levels. A few reasons:

  • Variable-length, prefix-laden encoding requires complex multi-cycle decoders.
  • The two-operand integer ISA forces extra mov instructions, which are eliminated at rename — but the rename hardware must be very capable.
  • TSO memory ordering requires ordering enforcement in the LSQ that ARM/RISC-V do not need (or implement more flexibly).
  • The legacy ISA features (segmentation, x87, real mode, all the modes) require area and validation effort.

Despite this, modern x86-64 cores compete very effectively with ARM cores on per-thread performance. Apple's M-series cores (Firestorm, Avalanche, Everest) and ARM's Neoverse cores have shown that ARM can match x86-64, but x86-64 has not been outclassed. The CISC-vs-RISC-instruction-set question, on its own, has limited impact at the level of modern out-of-order cores: by the time the code is executed, both have decoded into similar µop streams. What matters more is the surrounding micro-architecture — predictor, cache, OoO width, fab process — and there the competition is direct.

06. Why is x86-64 Still Competitive?

A reasonable question: why hasn't ARM displaced x86-64 entirely on PCs and servers, given ARM's success on mobile?

Several reasons:

  • Software ecosystem. Decades of x86-64 binaries: applications, drivers, kernels, libraries. Migration is expensive. ARM has been making inroads (Apple Silicon, Ampere servers, Microsoft's Snapdragon X), but the bulk of installed base remains x86-64.

  • Performance is competitive. Apple's M-series notebooks compete strongly with x86-64 laptops, but on raw compute throughput, top Intel and AMD chips remain at the front. The gap closes both ways across generations.

  • Dual-vendor competition. Intel and AMD compete fiercely. This drives improvement on both sides faster than a single-vendor environment would.

  • Investment depth. x86-64 cores have benefited from sustained, large investment. The micro-architectural know-how accumulated by Intel and AMD is substantial.

  • Specialized accelerators. Modern x86-64 chips include AMX, GPU integration (Intel iGPUs, AMD APUs), AI engines (Intel NPU, AMD Ryzen AI). The x86-64 SoC has caught up to ARM SoCs on integration.

The competition is healthy and ongoing. Predictions that ARM will displace x86-64 in PCs/servers within a year or two have been made for at least 15 years; the displacement hasn't happened, though ARM's share has grown. Predictions that x86-64 will collapse have similarly proven premature.

07. Side Channels and Speculation Security

Modern x86-64 cores have also been the canonical examples of the speculative-execution side channels (Chapter 23, Chapter 26): Spectre, Meltdown, MDS, L1TF, ZombieLoad, RIDL, RSB-Underflow, Branch History Injection, and many others. The rich speculative back end that delivers high single-thread performance also creates many opportunities for information leakage through micro-architectural state.

Each new attack has required micro-architectural and software mitigations:

  • Page-table isolation (KPTI / PTI) for Meltdown.
  • IBRS, IBPB, STIBP, eIBRS (indirect branch restriction).
  • L1D flush on VM entry (L1TF).
  • VERW for buffer flushing (MDS).
  • Retbleed, RetVoid mitigations.
  • Compiler-inserted retpolines (now mostly replaced with eIBRS).

Many of these mitigations cost performance, sometimes significantly (5-30% in worst cases for syscall-heavy or virtualization-heavy workloads). Modern x86-64 cores have integrated some of these mitigations into hardware to reduce the cost.
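
On Linux, the kernel reports which of these mitigations are active through files under /sys/devices/system/cpu/vulnerabilities, one file per issue, each holding a one-line status such as "Mitigation: PTI" or "Not affected". A minimal sketch for collecting them (the function name is ours, and the directory is a parameter so the code can be exercised on any system):

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(directory: Path = VULN_DIR) -> dict:
    """Map each vulnerability name to the kernel's one-line status string."""
    if not directory.is_dir():
        return {}  # not Linux, or sysfs not available
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(directory.iterdir())
        if entry.is_file()
    }


if __name__ == "__main__":
    for name, status in mitigation_status().items():
        print(f"{name:28s} {status}")
```

Comparing this listing across kernel boot options is a quick way to see which mitigations a given performance measurement actually includes.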

This will be revisited in detail in Chapter 51 (Advanced Branch and Speculation), which covers the Spectre/Meltdown family in depth.

08. Hyperthreading and Simultaneous Multithreading

Intel's Hyper-Threading Technology (HT) and AMD's Simultaneous Multithreading (SMT) are two implementations of the same underlying idea: sharing one physical core's execution resources between two architectural threads. Each thread has its own architectural register state, program counter, and other architectural state; the front-end fetch, the rename and scheduler, the execution units, the physical register file, the load/store buffers, and the caches are shared.

The motivation is utilization. A single thread rarely keeps a wide back end busy: cache misses, branch mispredictions, and serial dependence chains leave execution ports idle for a substantial fraction of cycles. A second thread's instructions, drawn from an independent dependence graph, can fill the gaps. Typical throughput gains in well-mixed workloads are 15–30%, occasionally as high as 40%, but each thread runs slower than it would alone because of the resource sharing.
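
The trade can be put in numbers. If the core's total throughput rises by a gain g with two threads, then under the simplifying assumption that the threads share the core equally, each one runs at (1 + g) / 2 of its solo speed:

```python
def smt_per_thread_speed(throughput_gain: float, threads: int = 2) -> float:
    """Per-thread speed relative to running alone, assuming equal sharing.

    throughput_gain is the core's total gain from SMT, e.g. 0.25 for +25%.
    """
    return (1.0 + throughput_gain) / threads


if __name__ == "__main__":
    print(smt_per_thread_speed(0.25))  # 0.625
```

A 25% throughput gain therefore corresponds to each thread running at 62.5% of the speed it would reach with the core to itself.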

Resources within a hyperthreaded core fall into three categories:

  • Replicated: architectural state (including each thread's register mappings) and the return-address stack. Each thread has its own.
  • Statically partitioned: certain queues (the ROB, the load and store buffers, the micro-op queue) split equally between threads when both are active, so that one thread cannot starve the other. When only one thread runs, that thread gets the full resource.
  • Dynamically shared: the schedulers, execution units, physical register file, and caches. Whoever has ready instructions gets the cycles.

The pause instruction is a hint to the core that the executing thread is in a spin-wait loop and the partner thread should be given more resources; the OS scheduler can also use the MWAIT states to put one thread to sleep while the other runs at full strength. AMD's SMT and Intel's HT are essentially identical at this functional level, with detailed differences in the partitioning rules.

Security has complicated the picture. Several of the speculation side channels (L1TF, MDS, the various sibling-thread leaks) cross thread boundaries within a hyperthreaded core, so trust-domain crossings (kernel↔user, hypervisor↔guest, container↔container) sometimes require disabling SMT or scheduling sibling threads from the same trust domain. Linux's core scheduling feature implements the latter; many cloud providers offer SMT-disabled instance types for the former. The performance cost of disabling SMT is workload-dependent but often material.
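
Any policy that keeps sibling threads in one trust domain first needs to know which logical CPUs are siblings. On Linux this is exposed per CPU in sysfs, in the file topology/thread_siblings_list, as a list such as "0,16" or "0-1". A minimal sketch of reading it follows; the helper names are ours, and the sysfs root is a parameter for testability.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list:
    """Expand a sysfs CPU list such as "0,16" or "0-3,8" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, sep, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return sorted(cpus)


def smt_siblings(cpu: int, sysfs_root: str = "/sys/devices/system/cpu") -> list:
    """Logical CPUs sharing a physical core with `cpu` (Linux only)."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


if __name__ == "__main__":
    print(parse_cpu_list("0,16"))   # [0, 16]
    print(parse_cpu_list("0-3,8"))  # [0, 1, 2, 3, 8]
```

A scheduler or launcher can use the resulting groups to place only mutually trusting threads on the same physical core.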

Some designs go further: the IBM POWER family ships four-way and eight-way SMT in production. Mainstream x86 cores have not pursued SMT4 (Intel's Xeon Phi was a notable exception): the additional sharing erodes single-thread performance further, and the gains in throughput on integer workloads are limited. Intel's recent direction, the hybrid P-core/E-core design, effectively adds parallel slots for throughput in a different way, by putting many small E-cores on the die rather than splitting big cores into more threads.

09. Frequency, Power, and the AVX License

A modern x86-64 core's frequency is not a fixed property; it varies dynamically with workload, temperature, and power budget. Two systems are at play. Turbo Boost (Intel) and Precision Boost (AMD) raise frequency above the nominal base when thermal and power headroom allows; Thermal Throttling drops frequency when limits are hit. The OS does not directly control frequency on modern x86; it sets performance hints (P-states, EPP) and the hardware's power-management unit chooses the actual frequency several times per millisecond.

The interaction with SIMD has been notable. On Skylake-X (the first server-class AVX-512 implementation, 2017) and several successors, executing AVX2 or AVX-512 instructions caused the core to drop to a lower license level, reducing maximum turbo frequency by several hundred megahertz. The mechanism reflected the higher dynamic power and current draw of wide-SIMD execution; the core stayed at the reduced frequency for some hundreds of microseconds after the last wide-SIMD instruction, on the assumption more would follow. The effect was the source of many surprising performance regressions: a function that occasionally used AVX-512 could slow the entire program by reducing the frequency of the surrounding scalar code.
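
The regression is easy to reproduce on paper by combining Amdahl's law with a frequency ratio. The sketch below makes a pessimistic assumption, stated in the code, that the reduced frequency applies to the whole run. The 3.5 GHz and 3.0 GHz figures are hypothetical, chosen to match a drop of several hundred megahertz.

```python
def relative_runtime(simd_fraction: float, simd_speedup: float,
                     base_ghz: float, licensed_ghz: float) -> float:
    """Runtime after vectorizing, relative to the scalar original (1.0 = unchanged).

    Pessimistic ASSUMPTION: the reduced frequency applies to the whole run,
    as it would if wide-SIMD instructions recur before the license relaxes.
    """
    work = (1.0 - simd_fraction) + simd_fraction / simd_speedup
    return work * (base_ghz / licensed_ghz)


if __name__ == "__main__":
    # 10% of the runtime vectorized with a 2x speedup: a net slowdown.
    print(round(relative_runtime(0.10, 2.0, 3.5, 3.0), 3))  # 1.108
    # 80% vectorized with a 4x speedup: the SIMD win dominates.
    print(round(relative_runtime(0.80, 4.0, 3.5, 3.0), 3))  # 0.467
```

Vectorizing a small fraction of the program makes it about 11% slower under these assumptions, while vectorizing most of it still more than halves the runtime.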

This behaviour was significantly relaxed on Ice Lake (Sunny Cove, 2019–2020) and subsequent generations, where the per-core power delivery and the dynamic frequency control became fine-grained enough that AVX-512 could often run at full turbo, with the throttle triggering only on sustained heavy SIMD load. Tiger Lake, Sapphire Rapids, and the recent Lion Cove generations continue the trend; on the latest cores the AVX license penalty is small or zero for typical workloads. AMD's Zen 4 and Zen 5 implementations of AVX-512 (Zen 4 double-pumps the 512-bit operations through 256-bit datapaths; Zen 5 has a true 512-bit datapath) avoid the historical Intel frequency penalty by virtue of the wider, lower-clocked baseline they target.

The broader lesson — that the cost of SIMD is not just the instruction itself but its effect on surrounding code through frequency, power, and thermal coupling — is one of the durable subtleties of x86-64 performance engineering. We will return to power, thermal, and physical design in Chapter 52.

10. Recent Generations: 2024 and Beyond

The two-year cadence of major core updates has continued through the writing of this book. A brief tour, with the caveat that any year-by-year description ages quickly:

Intel Lion Cove (2024) is the P-core in Lunar Lake (mobile, 2024) and Arrow Lake (desktop, late 2024). Lion Cove widens the rename to 8-wide, increases the ROB to 576 entries, and (notably) drops Hyper-Threading from the P-core entirely; throughput parallelism is provided instead by the larger number of E-cores in hybrid configurations. The integer scheduler is split into multiple smaller schedulers in a Zen-like distributed style. Branch prediction grows again to a TAGE-SC-L design with a much larger BTB.

Intel Skymont (2024) is the E-core paired with Lion Cove. It is dramatically wider than Gracemont — a 9-wide front end and 26 execution ports of various kinds — to the point that on some integer workloads it approaches Raptor Cove (the previous P-core) IPC at significantly lower power. This blurs the P/E distinction and is the strongest argument yet for hybrid topology.

AMD Zen 5 (2024) widens the front end to 8-wide decode (in two parallel 4-wide clusters fed by a redesigned op cache), grows the ROB to about 450 entries, doubles the L1-D bandwidth, and implements true 512-bit AVX-512 datapaths. Zen 5c (the cloud-density variant) shares the design with smaller caches and lower clocks. On floating-point and AVX-512 workloads, Zen 5 leads or matches Lion Cove; on lightly-threaded workloads it is competitive but not dominant.

Foveros, EMIB, and chiplet packaging have become essential to both vendors' high-end designs. Intel's Meteor Lake (2023) and Lunar Lake (2024) use Foveros 3D packaging to stack tiles for compute, graphics, I/O, and SoC; Arrow Lake follows on desktop. AMD continues with the CCD+IOD organic-substrate chiplet approach in Zen 5 desktop and server parts and uses 3D V-Cache (a stacked L3 die) in selected SKUs to dramatically grow effective last-level cache. Packaging is now a first-class architectural variable; Chapter 55 will treat it as such.

These generations are simultaneously the cutting edge of high-performance x86-64 implementation and the most aggressive pushers of the boundary between architecture, micro-architecture, and physical design — each new chip blurs the line further.

11. Summary

Modern x86-64 cores from Intel (Golden Cove, Lion Cove and successors) and AMD (Zen 4, Zen 5) are deep, wide, sophisticated implementations of the x86-64 ISA. They feature 6-8 wide front ends, µop caches, large branch predictors (TAGE-class), 300-600 entry ROBs, 12-18 execution ports, multi-level TLBs and caches, and SIMD throughput up to ~32 DP FLOPs per cycle per core. The legacy ISA's complexity is largely paid for in the front-end decoder and in legacy-feature implementation; the back-end is essentially a RISC-style OoO engine and competes effectively with ARM and RISC-V at similar performance levels.

Intel's hybrid topology (P + E cores) and AMD's chiplet design (CCD + IOD) represent two different architectural answers to scaling. Both are now mainstream. AVX-512, after a rocky deployment, is on its way to being universally available again. Side-channel security has driven significant micro-architectural change across the past several years.

This concludes Part VII. We have walked through x86-64 from history through programming model, system architecture, SIMD, and now micro-architecture. Part VIII does the same for ARM, the second great ISA family of modern computing — first looking at its history and overview, then its programming model, system architecture, SIMD/vector capabilities, and the micro-architecture of leading ARM implementations.
