Part V·ISA Case Studies·Chapter 41 of 62


AArch64 Micro-Architecture

May 16, 2026·19 min read·advanced


This chapter looks at how modern AArch64 chips are actually built. Where Chapter 36 toured the Intel and AMD x86-64 cores, this chapter does the same for the leading AArch64 implementations: ARM Ltd.'s Cortex-A and Cortex-X cores, ARM Ltd.'s Neoverse server cores, Apple's M-series cores, and Qualcomm's Oryon. We close with comparisons across vendors and against x86-64.

The diversity of AArch64 implementations is much wider than that of x86-64. Where x86-64 has essentially two design teams (Intel and AMD), AArch64 has at least a dozen serious teams: ARM Ltd. itself (with multiple core families), Apple, Qualcomm, Samsung (historical), Ampere, NVIDIA, AWS (designs Graviton in-house using ARM IP), and others. Each makes different trade-offs. This chapter samples the most influential current designs.

The numbers given are illustrative and change every generation; the structural patterns matter more than the specific values.

01.ARM Cortex-A Series

The Cortex-A series is ARM Ltd.'s mainline application-class cores, used in nearly every Android phone (often customized by Qualcomm or Samsung) and many tablets, set-top boxes, and embedded systems.

big.LITTLE History

ARM uses three tiers of cores in modern mobile chips:

Cortex-A "small" cores (efficiency tier): Cortex-A55, A510, A520. In-order or limited OoO. Run at moderate frequency for background tasks, low power.
Cortex-A "big" cores (performance tier): Cortex-A75, A76, A77, A78, A715, A720. Out-of-order, moderate width.
Cortex-X "ultra" cores (peak tier): Cortex-X1, X2, X3, X4, X925. Deepest and widest cores in the Cortex line.

A typical 2024 Android flagship SoC has 1 X-core + 3-5 big A-cores + 2-4 small A-cores in a DynamIQ cluster.

Cortex-A720 (representative big core, 2023)

Front end: 5-wide decode. Mid-sized branch predictor (TAGE-class). 64 KiB I-cache, 4-way.
Back end: 6-wide rename. ~270-entry ROB. ~10 execution ports: 4 ALU, 2 branch, 2 FP/SIMD/NEON, 2 load/store.
Caches: 64 KiB L1D 4-way, 4-cycle. 256-512 KiB private L2, ~12-cycle. Shared L3 in DynamIQ cluster (configurable 0-32 MiB), ~30-cycle.
SIMD: 2× 128-bit NEON pipes. SVE2 in 128-bit form.
Frequency: 3.0-3.3 GHz typical in mobile silicon.

A720 is a balanced, relatively efficient design — ARM's mid-tier high-performance core. It ships in tens of millions of phones.

Cortex-X4 (peak performance, 2023)

Front end: 10-wide decode (ARM's widest yet). Large TAGE predictor with deep history. 64 KiB I-cache.
Back end: 10-wide rename. ~384-entry ROB. ~16 execution ports: 6 ALU, 2 branch, 4 FP/SIMD, 4 load/store (2L+2S).
Caches: 64 KiB L1D 4-way, 4-cycle. Up to 2 MiB private L2.
SIMD: 4× 128-bit FP/NEON pipes; SVE2 in 128-bit form.
Frequency: ~3.3-3.5 GHz in early Cortex-X4 SoCs (Snapdragon 8 Gen 3, Dimensity 9300).

X4 is the peak-IPC ARM Ltd. core, comparable to recent Apple cores in some metrics (though typically not matching them on integer IPC).

Cortex-X925 (2024)

The successor to X4, with another width bump:

Front end: ~12-wide.
Back end: ~12-wide rename, ~432-entry ROB.
More aggressive predictor, larger structures throughout.
Frequency target ~3.5-3.7 GHz.

The trend across the X-series is clear: each generation pushes structures wider and deeper, mirroring the analogous trend on Intel and AMD cores. ARM Ltd. has been catching up to Apple's IPC over the last 3-4 generations.

Cortex-A520 (representative small core, 2023)

In-order pipeline: ~3-wide.
No OoO: simpler structures, no rename, no large reorder buffer.
Branch prediction: small but capable.
Caches: 32-64 KiB L1D, shared L2 with sibling cores.
SIMD: 1× 128-bit NEON pipe; SVE2 in 128-bit.
Frequency: 2.0-2.3 GHz.
Power: a small fraction (≤25%) of an X4 at similar workload.

Small cores are tiny in area and run cool; for background tasks (notifications, location updates, music playback) they are perfect. The SoC scheduler keeps them busy and lets the big and X cores sleep.

02.ARM Neoverse: Server Cores

The Neoverse line is ARM Ltd.'s server- and infrastructure-targeted cores. They share design DNA with the Cortex-A line but with server-relevant features (more cache, RAS extensions, multi-socket coherence support, larger TLBs, no dynamic frequency scaling).

Neoverse N-series (efficiency-focused servers)

Neoverse N1 (2019): 64 cores in AWS Graviton 2. 4-wide OoO core derived from Cortex-A76, ~140-entry ROB, 64 KiB L1, 1 MiB L2.

Neoverse N2 (2021): used in Microsoft Cobalt 100 and Alibaba Yitian 710. 5-wide decode, 6-wide rename, 160-entry ROB.

Neoverse N3 (2024): refinements; deployed in newer cloud chips.

Neoverse V-series (performance-focused servers)

Neoverse V1 (2020): used in AWS Graviton 3. 8-wide decode, 256-bit SVE, 192-entry ROB.

Neoverse V2 (2022): used in NVIDIA Grace, AWS Graviton 4, and Google Axion. 8-wide decode, 256-bit-equivalent SVE2, 320-entry ROB. Cache: 64 KiB L1D, 2 MiB L2 private.

Neoverse V3 (2024-25): wider, deeper, with newer SVE2 features; used in Microsoft Cobalt 200.

V-series cores prioritize per-thread performance for HPC, databases, and some cloud workloads where tail latency matters. They have bigger structures than N-series but lower core counts per chip.

Neoverse Cluster Architectures

A Neoverse-based server has dozens to hundreds of cores connected via ARM's coherent interconnect (CMN-650 or CMN-700). The fabric handles cache coherence (MOESI-like), inter-core IPIs (via GICv3+), and interfaces to memory controllers (typically DDR5) and PCIe controllers.

A typical 2024 Neoverse server chip:

128-192 V2 or N2 cores in a single die (or chiplets).
L2 per core: 1-2 MiB.
L3 / SLC: 96-256 MiB shared.
Memory: 8-12 channels DDR5.
PCIe: Gen5, 64-128 lanes.
Coherent interconnect: mesh fabric with cross-bar elements.
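
The channel counts in the list above can be turned into a rough peak-bandwidth figure. The sketch below is only back-of-envelope arithmetic: the DDR5-5600 speed grade and the 64 data bits per channel are assumptions chosen for illustration, not figures from any particular chip, and the function name is hypothetical.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal units)."""
    return channels * mega_transfers * 1e6 * bytes_per_transfer / 1e9


ASSUMED_MT_S = 5600  # assumption: DDR5-5600; real platforms vary

for channels in (8, 12):
    total = peak_bandwidth_gbs(channels, ASSUMED_MT_S)
    for cores in (128, 192):
        print(f"{channels} channels, {cores} cores: "
              f"{total:.1f} GB/s peak, {total / cores:.2f} GB/s per core")
```

Under these assumptions each core gets only a few GB/s of peak DRAM bandwidth, which is one reason such chips pair high core counts with a large shared L3 / SLC.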

These chips compete directly with x86-64 servers from Intel (Sapphire Rapids, Granite Rapids, Sierra Forest) and AMD (EPYC Bergamo, EPYC Turin). They are typically more energy-efficient per core, slower per thread on some workloads, and very strong on cloud-native (containerized, parallel) workloads.

03.Apple's M-Series Cores

Apple's chip-design team is widely regarded as the leading single-thread CPU designer in the industry. Apple-designed SoCs have shipped in iPhones and iPads since the A4 (2010), with custom CPU cores from the A6 (2012) onward, and in Macs since 2020. The M-series Macs have demonstrated that AArch64 can compete at the high end of laptop and workstation performance.

The Apple core lineage:

Cyclone (A7, 2013): the first 64-bit consumer ARM core.
Typhoon (A8), Twister (A9), Hurricane (A10), Monsoon (A11), Vortex (A12), Lightning (A13), Firestorm (A14, M1).
Avalanche (A15, M2), Everest (A16), and the later generations behind the A17 Pro / M3 and the M4.

(The exact code names for A17 onwards aren't all publicly documented, but the cadence continues.)

Apple Firestorm (M1, 2020)

The famous core that started Apple's Mac transition.

Front end: 8-wide decode. Massive branch predictor (multiple TAGE structures).
Back end: 8-wide rename. ~630-entry ROB (much larger than any x86 contemporary). 354-entry register files (integer and FP).
Schedulers: 6 ALU + 2 branch + 4 FP/SIMD + 4 load/store ports.
Caches: 192 KiB L1I + 128 KiB L1D (notably large; 4-cycle access despite size). 12 MiB L2 shared per cluster (4 P-cores).
NEON: 4× 128-bit FP pipes.
Frequency: 3.2 GHz in M1.
No SVE: Apple has its own AMX-style matrix coprocessor.

The structural numbers — particularly the ROB and the L1 sizes — are well beyond what Intel and AMD ship. Combined with Apple's predictor and execution engine, Firestorm achieved roughly 30-50% higher integer IPC than contemporary Intel cores at significantly lower power. This was the surprise of the M1: not a small ARM core competing in laptops, but an aggressive desktop-class core in a laptop power envelope.

Avalanche (M2) and Later P-Cores (M3, M4)

Successors with progressive improvements:

Larger ROB (~700+ entries).
More execution ports.
Improved cache hierarchy.
Higher frequency (3.5 GHz in M2, 4.0+ GHz in M3, 4.5 GHz in M4).
Larger SLC (system-level cache, shared with GPU and other accelerators).

Each generation extends what was already a remarkably wide core. M4 (2024) is at 4.5 GHz, with reported per-thread performance among the highest in any laptop.

How Apple Achieves It

Apple's advantages compound:

Wide core with huge ROB: more in-flight instructions, more memory parallelism, better hiding of cache miss latency.
Large L1 caches: more L1 hits, fewer L2 trips, better effective bandwidth.
Excellent branch predictor: very high prediction accuracy, deep speculation works without too many mispredicts.
Dedicated AMX matrix coprocessor: offloads matrix math from main pipes.
Tight chip integration: unified memory architecture means GPU shares CPU's memory; no explicit copies needed.
Process technology: Apple is usually an early adopter of leading-edge TSMC nodes (5nm, 3nm). Competitors have mostly reached the same nodes, so the node itself is only a temporary edge; the more durable difference is that Apple spends a larger transistor budget per core.
Per-product tuning: the chip is designed to match exactly the OS and product requirements; less generality means less area "wasted" on unused features.
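
The first item in the list above, that a larger ROB hides more miss latency, can be made concrete with a rough calculation. The sketch below estimates how long a core can keep accepting new instructions behind a blocked load before the ROB fills. The ROB sizes and frequencies are the illustrative figures from this chapter; the sustained IPC of 2.0 under a miss is an assumption, and `miss_shadow_ns` is a hypothetical helper name.

```python
def miss_shadow_ns(rob_entries: int, ipc_under_miss: float, freq_ghz: float) -> float:
    """Time the core keeps dispatching behind a blocked load before the ROB fills."""
    cycles = rob_entries / ipc_under_miss
    return cycles / freq_ghz


ASSUMED_IPC = 2.0  # assumption: instructions entering the ROB per cycle during the miss

cores = {
    "Firestorm (M1)": (630, 3.2),  # (ROB entries, GHz), figures from this chapter
    "Cortex-A720": (270, 3.0),
}

for name, (rob, ghz) in cores.items():
    print(f"{name}: ROB covers about {miss_shadow_ns(rob, ASSUMED_IPC, ghz):.0f} ns")
```

With these inputs the larger window covers roughly 98 ns and the smaller one 45 ns. If a DRAM miss costs on the order of 100 ns (also an assumption), the first core can in principle keep working for most of the miss and overlap further misses, while the second stalls about halfway through.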

The result is a CPU that feels faster than the GHz numbers suggest. M-series Macs in real-world tasks frequently outperform x86 laptops at the same MSRP, especially in single-threaded responsiveness.

E-Cores (Icestorm, Blizzard, Sawtooth)

Apple's efficiency cores are also notable:

Icestorm (M1 E-core): 4-wide decode, ~200-entry ROB. Smaller than the P-core but not a tiny in-order core.
Blizzard (M2): wider, more aggressive.
Sawtooth and successors: continued evolution.

Apple's E-cores hit roughly 30% of the P-core performance at 10% of the power. On a typical Mac workload the P-cores sleep most of the time, because macOS aggressively migrates light tasks to E-cores. The result is excellent battery life on Mac laptops.
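
The 30% / 10% figures imply a simple energy-per-task calculation. A minimal sketch, taking the chapter's rough ratios as given (the function name is illustrative):

```python
def relative_energy_per_task(rel_perf: float, rel_power: float) -> float:
    """Energy to finish a fixed task, relative to the P-core.

    Runtime scales as 1 / rel_perf, and energy = power * runtime.
    """
    return rel_power / rel_perf


# Chapter's rough figures: ~30% of P-core performance at ~10% of its power.
e_core_energy = relative_energy_per_task(rel_perf=0.30, rel_power=0.10)
perf_per_watt_gain = 0.30 / 0.10

print(f"E-core energy per task: {e_core_energy:.2f}x the P-core's")
print(f"E-core perf/watt:       {perf_per_watt_gain:.1f}x the P-core's")
```

On these numbers a background task takes about 3.3 times longer on an E-core but consumes about a third of the energy, which is the trade the scheduler makes whenever latency does not matter.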

04.Qualcomm Oryon

Qualcomm acquired Nuvia in 2021, which gave them access to a custom AArch64 design originally targeted at servers. The result is Oryon, used in Snapdragon X Elite (2024) and Snapdragon X2 Elite (2025) for Windows on ARM laptops.

Front end: 8-wide decode, large branch predictor.
Back end: ~8-wide rename, ~450-entry ROB.
Caches: large L1 and L2; 192 KiB L1I, 96 KiB L1D, 12 MiB L2 per cluster.
NEON: 4× 128-bit pipes.
Frequency: 3.4-4.0 GHz depending on variant.

Oryon's IPC is competitive with Apple Firestorm/Avalanche on integer workloads. Qualcomm's first-generation Snapdragon X Elite chip launched in 2024 with 12 cores, all Oryon (homogeneous, no big.LITTLE). The Snapdragon X2 (2025) added a hybrid topology with smaller Oryon-derived efficiency cores.

The arrival of Oryon in Windows laptops was the first time x86-class single-thread performance was available on Windows on ARM, making the platform genuinely competitive.

05.Samsung Mongoose (historical)

Samsung designed custom AArch64 cores from 2016 to 2019 (Exynos M1 through M5). The designs were ambitious but never matched the contemporary cores from ARM Ltd. and Apple. Samsung discontinued Mongoose in 2019 and reverted to using ARM Cortex cores in Exynos chips. It is a reminder that custom AArch64 design is hard, even for a large semiconductor company.

06.NVIDIA Grace and Project Denver

NVIDIA has done several AArch64 designs:

Project Denver (early 2010s): a code-translating ARM core for Tegra. Software translated guest ARM instructions to a wider internal VLIW. Discontinued; performance was disappointing.
Carmel (Xavier SoC, 2018): 8-wide custom core; modest deployment.
Grace (2023): based on ARM Neoverse V2; not a custom core. NVIDIA's AArch64 server CPU paired with Hopper GPUs for AI/HPC.

NVIDIA's strength is in GPUs. They have largely accepted that ARM Ltd.'s Neoverse cores are good enough for their CPU role, and they put their design effort into GPUs and fabric.

07.Comparing AArch64 Cores

A rough qualitative ranking by per-thread performance, as of 2026:

Apple M4 P-core (~4.5 GHz, ~700-entry ROB, vast caches). Top of the AArch64 single-thread ladder.
Qualcomm Oryon (X2 generation), similar IPC to Apple, slightly lower frequency.
ARM Cortex-X925 / Cortex-X4: aggressive, but typically a step behind Apple.
ARM Neoverse V3 / V2: designed for servers; per-thread is good but not chasing Apple's peak.
AMD Ryzen / EPYC (Zen 5) and Intel Lion Cove / Redwood Cove: roughly comparable to current-gen ARM Ltd. Cortex-X cores in IPC, possibly higher frequency.
ARM Cortex-A720 / A78: solid mainstream.
Cortex-A520 / A510: efficiency cores; not aiming for IPC.

The first three can plausibly trade blows with the best x86-64 cores. The top of the AArch64 stack (Apple, Oryon) currently has a slight edge in single-thread per-watt; the top of the x86-64 stack (Lion Cove, Zen 5) is competitive in absolute single-thread but at higher power.

For multi-thread server workloads, the comparison shifts. On the x86-64 side, AMD's dense EPYC Turin parts (192 Zen 5c cores, the successor to Bergamo) and Intel Sierra Forest (288 E-cores) reach extremely high core counts; Ampere One M (192 cores) and AWS Graviton 4 (96 cores) are AArch64's high-core-count entries. Workload behavior determines which wins.

08.big.LITTLE Versus Hybrid: Same Idea

Apple's hybrid (P + E cores), Intel's hybrid (P + E since Alder Lake), and ARM's big.LITTLE / DynamIQ (X + A "big" + A "small") are all variants of the same idea: have multiple core types optimized for different points in the perf/watt curve.

The OS scheduler is critical. macOS's QoS-aware scheduler routes tasks to E-cores by default and only escalates to P-cores when needed. Linux's EAS (Energy-Aware Scheduler) does similar work for ARM big.LITTLE. Windows 11's scheduler, guided by the hardware hints from Intel's Thread Director, does it for Intel hybrid. All three converged on the model after years of refinement.

09.Cache and Memory

ARM cores tend to have larger L1 caches than x86-64 cores, typically because ARM's lower clock speeds permit larger arrays within the same load-to-use cycle budget. Apple's cores have especially large L1s (96-128 KiB for D, 192 KiB for I in P-cores).

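L2 arrangements differ. ARM Ltd.'s designs and x86-64 cores typically use a private per-core L2 (roughly 1-2 MiB), while Apple and Qualcomm Oryon share a large L2 (12 MiB in the examples above) across a cluster. L3/SLC sizes vary widely: AMD has the largest (96 MiB with 3D V-Cache), Apple has medium-sized (12-24 MiB SLC shared with GPU). ARM Ltd.'s designs let SoC designers configure L3 size up to many MiB.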

Memory subsystems differ. Apple uses unified memory (CPU and GPU share LPDDR5x in the M-series package). ARM-based cloud servers use DDR5 in standard DIMMs, often with high channel counts. x86-64 servers use DDR5 (and increasingly CXL-attached memory).

10.Branch Prediction

All modern AArch64 cores use TAGE-class predictors with substantial history. ARM Ltd.'s cores tend to use more standard ARM predictor designs; Apple has its own custom predictor that has been notably accurate. Branch mispredict rates below 1% on typical workloads are achieved across the board.

For indirect branches (function pointers, virtual calls), all use ITTAGE-style predictors. Apple's predictor is particularly good on indirect branches; this matters disproportionately for languages like Swift and Objective-C, where dynamic dispatch is heavy.

11.Memory Ordering

AArch64's weak memory model is implemented by all cores. The performance impact of explicit barriers is generally small for DMB ISH (a few cycles per barrier) and larger for DSB, which waits for outstanding memory accesses to complete; contended atomics with full sequential consistency can also be costly. LSE atomics (single-instruction CAS, LDADD, etc.) are a meaningful improvement over the LDXR/STXR load-exclusive/store-exclusive (LL/SC) loop in contended scenarios.
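
Why the exclusive-load/exclusive-store loop suffers under contention can be seen in a toy model. The sketch below is not how hardware is built: it assumes one exclusive monitor per thread on a single word, a worst-case lockstep schedule, and ignores coherence traffic (which an LSE atomic still pays). The class and function names are hypothetical.

```python
from __future__ import annotations


class Cell:
    """One memory word plus a per-thread exclusive monitor (simplified)."""

    def __init__(self, value: int = 0) -> None:
        self.value = value
        self.monitors: set[int] = set()

    def ldxr(self, tid: int) -> int:
        """Load-exclusive: read the word and arm this thread's monitor."""
        self.monitors.add(tid)
        return self.value

    def stxr(self, tid: int, new_value: int) -> bool:
        """Store-exclusive: succeeds only if this thread's monitor is still armed."""
        if tid not in self.monitors:
            return False
        self.value = new_value
        self.monitors.clear()  # a successful store disarms every monitor
        return True

    def ldadd(self, amount: int) -> int:
        """LSE-style atomic add: one indivisible read-modify-write."""
        old = self.value
        self.value = old + amount
        self.monitors.clear()
        return old


def lockstep_llsc_increments(n_threads: int) -> tuple[int, int]:
    """Every pending thread loads, then every one tries to store; losers retry."""
    cell = Cell()
    pending = list(range(n_threads))
    attempts = 0
    while pending:
        loaded = {tid: cell.ldxr(tid) for tid in pending}
        still_pending = []
        for tid in pending:
            attempts += 1
            if not cell.stxr(tid, loaded[tid] + 1):
                still_pending.append(tid)
        pending = still_pending
    return cell.value, attempts


def lse_increments(n_threads: int) -> tuple[int, int]:
    """Each thread issues exactly one atomic add; there is nothing to retry."""
    cell = Cell()
    for _ in range(n_threads):
        cell.ldadd(1)
    return cell.value, n_threads


print("LL/SC loop (final value, store attempts):", lockstep_llsc_increments(4))
print("LSE LDADD  (final value, operations):    ", lse_increments(4))
```

In this schedule only one store-exclusive succeeds per round, so four threads need ten store attempts to perform four increments, while the single-instruction form needs four operations. Real contention is less regular, but the retry growth is the effect LSE was designed to remove.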

A specific concern: porting code from x86-64 (TSO) to AArch64 (weak) sometimes reveals ordering bugs that were latent on x86-64. Modern test suites and tools such as ThreadSanitizer catch many of these, but the ordering difference is real and occasionally bites real software.
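
The classic latent bug is the message-passing pattern: one thread writes data and then sets a flag, another reads the flag and then the data. The sketch below enumerates outcomes under two deliberately simplified models (an assumption of this example, not a faithful rendering of either architecture's formal model): "tso" lets only a later load pass an earlier store, while "weak" lets any two accesses to different locations reorder when no barrier is present.

```python
from itertools import permutations

# Each op is (kind, location) or (kind, location, register); stores write 1.
WRITER = [("store", "data"), ("store", "flag")]
READER = [("load", "flag", "r1"), ("load", "data", "r2")]


def may_pass(earlier, later, model):
    """May `later` be performed before `earlier` within one thread?"""
    if earlier[1] == later[1]:
        return False  # same location: never reordered
    if model == "weak":
        return True
    return earlier[0] == "store" and later[0] == "load"  # tso


def allowed_orders(thread, model):
    """All execution orders of one thread that the model permits."""
    orders = []
    for perm in permutations(thread):
        legal = all(
            may_pass(b, a, model)
            for i, a in enumerate(perm)
            for b in perm[i + 1:]
            if thread.index(b) < thread.index(a)  # a overtook b
        )
        if legal:
            orders.append(list(perm))
    return orders


def interleavings(xs, ys):
    """Every merge of two sequences that keeps each one's internal order."""
    if not xs or not ys:
        yield list(xs) + list(ys)
        return
    for rest in interleavings(xs[1:], ys):
        yield [xs[0]] + rest
    for rest in interleavings(xs, ys[1:]):
        yield [ys[0]] + rest


def outcomes(model):
    """Set of (r1, r2) results reachable under the given model."""
    results = set()
    for w in allowed_orders(WRITER, model):
        for r in allowed_orders(READER, model):
            for schedule in interleavings(w, r):
                mem, regs = {"data": 0, "flag": 0}, {}
                for op in schedule:
                    if op[0] == "store":
                        mem[op[1]] = 1
                    else:
                        regs[op[2]] = mem[op[1]]
                results.add((regs["r1"], regs["r2"]))
    return results


for model in ("tso", "weak"):
    print(model, sorted(outcomes(model)))
```

The outcome (r1, r2) = (1, 0), where the reader sees the flag but stale data, appears only under the weak model. On AArch64 the usual fix is a release store to the flag and an acquire load of it (STLR / LDAR, or the equivalent C11 or C++11 atomics), which restores the ordering the code silently relied on.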

12.Power and Thermals

ARM cores are widely regarded as more energy-efficient than x86-64 cores at similar performance points. The reasons compound:

Simpler decode: fixed-width instructions, no prefix walking, no µop translation overhead at the same scale.
Simpler memory ordering: weak model needs less LSQ checking.
Smaller per-instruction overhead: in general, AArch64's instruction encoding is more uniform.
Different design priorities: ARM cores are typically designed for fanless or moderate cooling; x86 cores often for high TDP. Both can be designed at low or high power, but the historical defaults differ.

Apple's M-series achieves famous laptop runtimes (15-25 hours of normal use) partly because of efficient cores and partly because the entire SoC (GPU, neural engine, memory) is designed for low average power.

For server workloads, the perf/watt advantage of AArch64 over x86-64 is generally 20-40% for cloud-native workloads (web servers, microservices, containers). For HPC and number-crunching workloads, the gap is smaller and depends on whether AVX-512 (or SVE) is fully utilized.

13.Side Channels and Speculation Security

AArch64 cores are vulnerable to many of the same speculation-related side-channel attacks as x86-64: Spectre v1, Spectre v2, Spectre v4 (speculative store bypass), and Branch History Injection. Mitigations include:

ARMv8.5+ hardware mitigation features: SSBS (Speculative Store Bypass Safe), CSDB (Consumption Speculation Data Barrier).
Apple's PAC: indirectly mitigates some classes of speculative attacks by validating pointers.
Software-emitted CSDB or DSB barriers in critical sequences.
Indirect branch tracking via BTI (ARMv8.5).

We will revisit speculative side channels in Chapter 51 (Advanced Branch and Speculation).

14.Why Apple's Cores Are So Wide: Decode and the Fixed-Width Advantage

A recurring observation across this chapter has been that AArch64 cores, Apple's especially, are noticeably wider than their x86-64 contemporaries. Apple's Avalanche/Everest P-cores have an 8–9 wide decode, the M4 generation goes wider still, and ARM's Cortex-X4 sits at 10-wide. Intel's Lion Cove and AMD's Zen 5, by contrast, top out around 8-wide effective decode (often achieved by combining several smaller decoders with a micro-op cache). The asymmetry has a clean structural explanation that connects back to the ISA differences described in Part VII and at the start of this part.

A fixed-width 32-bit AArch64 instruction stream can be sliced into N parallel decoders by the obvious mechanism: take the next N×4 bytes of the fetched cache line and hand each 4-byte slot to its own decoder. There is no dependency between adjacent decoders' work because there is no ambiguity about where each instruction starts and ends. The decode logic itself is straightforward: a relatively small case-split on the top bits of the instruction word selects from a few dozen encoding groups, and each group has a regular layout for register fields and immediates. Building a 10-wide AArch64 decoder is essentially a matter of replicating one decoder ten times.

A variable-length x86-64 instruction stream is fundamentally serial in the byte dimension. The first instruction starts at the cache-line offset; the second starts wherever the first ends, which depends on the prefix-and-opcode walk through the first instruction's bytes. Decoder N cannot start work until decoder N-1 has produced a length, so naive replication does not scale. Real x86 decoders use pre-decode hardware (a coarse step that marks instruction boundaries before the main decoders see the bytes), parallel speculative starts at multiple positions, or fall back to the micro-op cache that bypasses decode entirely once a hot path's micro-op stream has been recorded. All of these work, but each adds complexity, area, and power, and none scales cleanly past about 8 decoders. Beyond that point, the front end has to widen via the micro-op cache rather than via more native decoders.
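
The contrast between the two boundary-finding problems can be sketched in a few lines. The variable-length rule used here is a made-up toy (the low two bits of the first byte give a length of 1 to 4 bytes); it is not x86 encoding, and all function names are hypothetical. The point is only the shape of the dependency.

```python
def fixed_width_starts(window: bytes, width: int = 4) -> list[int]:
    """Every slot's start is known up front: slot i begins at i * width."""
    return [i * width for i in range(len(window) // width)]


def toy_length(first_byte: int) -> int:
    """Hypothetical length rule (NOT real x86): low two bits encode 1..4 bytes."""
    return (first_byte & 0b11) + 1


def variable_length_starts(window: bytes) -> list[int]:
    """Each start depends on the previous instruction's decoded length."""
    starts, offset = [], 0
    while offset < len(window):
        starts.append(offset)
        offset += toy_length(window[offset])
    return starts


def speculative_starts(window: bytes) -> list[int]:
    """Pre-decode style: compute a length at every byte, then chain the results."""
    length_at = [toy_length(b) for b in window]  # independent per byte, so parallelisable
    starts, offset = [], 0
    while offset < len(window):  # the chain is still serial, but each step is trivial
        starts.append(offset)
        offset += length_at[offset]
    return starts


WINDOW = bytes([0x02, 0xAA, 0xBB, 0x00, 0x03, 0x11, 0x22, 0x33,
                0x01, 0x44, 0x00, 0x00, 0x02, 0x55, 0x66, 0x00])

print("fixed-width:    ", fixed_width_starts(WINDOW))
print("variable-length:", variable_length_starts(WINDOW))
print("speculative:    ", speculative_starts(WINDOW))
```

The fixed-width starts are a pure function of the slot index. The speculative version recovers parallelism for the variable-length stream, but only by computing a length at all 16 byte positions to find 8 instructions, which is the kind of extra area and power the paragraph above describes.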

The consequence is visible in three places. First, AArch64 cores can spend their decode area more aggressively, devoting silicon to wider downstream structures (rename, schedulers, register files, execution ports) rather than to clever boundary-finding. Second, the micro-op cache, which is a load-bearing component of every modern x86 design, is largely unnecessary on AArch64; ARM cores have some form of decoded-op cache but it is smaller and contributes proportionately less to throughput. Third, code density is the counterweight: x86's variable-length encoding can produce denser code (the margin varies widely with compiler and workload), buying a smaller instruction-cache footprint at the cost of decode complexity. Whether the trade is favourable depends on workload; on cache-bound integer code x86 sometimes pulls ahead, while on decode-bound or speculation-heavy code AArch64 tends to win.

The history is instructive too. The original argument for x86's variable-length encoding in 1978 — that program memory was scarce and instruction density mattered — has been steadily eroded by cheap memory and large caches. The argument for AArch64's fixed-width encoding in 2011 was the inverse: that decode parallelism was becoming the dominant front-end concern. Fifteen years on, the latter argument has held up well, and Apple's wide cores are part of the evidence.

None of this implies AArch64 is destined to win on performance — the back-end execution and memory-system engineering matters at least as much, and AMD and Intel continue to ship excellent silicon — but it does explain why the front-end widths have diverged the way they have.

15.ARM in 2026 and Beyond

The ARM core landscape is more dynamic than x86-64's two-vendor competition.

Apple has sustained world-class designs and shows no signs of slowing.
Qualcomm has reentered the high-end with Oryon and is iterating fast.
ARM Ltd. has closed much of the gap to Apple with X-series cores; Cortex-X925 and successors are very competitive.
AWS, Microsoft, Google are all designing or co-designing custom ARM server CPUs.
Ampere continues to push core counts on its custom AmpereOne.
NVIDIA uses Neoverse but is rumored to be working on custom cores.

The diversity is healthy: more design effort, more implementations, more competition. For the user, it means continued steady improvement in per-thread performance, perf/watt, and core counts across the AArch64 ecosystem.

16.Summary

AArch64 is implemented by a more diverse vendor ecosystem than x86-64. ARM Ltd. provides the Cortex-A and Neoverse cores used in most stock implementations. Apple, Qualcomm (Oryon), and several others design custom cores. The current performance leader is Apple's M-series, which has demonstrated genuinely class-leading single-thread performance with very wide cores (8-10 wide rename, 600-700 entry ROBs, 96-192 KiB L1 caches). Qualcomm's Oryon is competitive. ARM Ltd.'s Cortex-X line has narrowed the gap.

The micro-architectural patterns are familiar from x86-64: deep OoO with large structures, sophisticated branch prediction, multi-level caches with TLBs and prefetchers, hybrid topology mixing performance and efficiency cores, and security features layered over the base design. The overall character of AArch64 cores tends toward efficiency and cleanliness; x86-64 cores tend toward higher peak frequencies and absolute throughput. Both are competitive, and the technology is converging in many respects.

This concludes Part VIII. Parts I-VIII have built the conceptual framework, walked through digital design and computer organization, taught the major ISAs (x86-64 and AArch64) in depth, and surveyed how modern implementations are built. Part IX turns to RISC-V — the open-source ISA that has gained substantial momentum in embedded, research, and increasingly mainstream applications.
