Part IV·Microarchitecture·Chapter 29 of 62


Data-Level Parallelism

May 16, 2026·23 min read·advanced


A surprising amount of useful computing follows the same shape: take a long array of data, do the same operation to each element, write the results back. Image processing applies a filter to each pixel. Scientific simulation updates each grid point. Machine learning multiplies long vectors of weights and activations. Graphics shading transforms each vertex or fragment. The work is embarrassingly parallel: the operations on each element are independent of every other element.

A scalar processor wastes most of its capability on this kind of work. To add two arrays of a million floats, it issues a million add instructions, fetched and decoded one by one, even though the operations are identical and independent. The fetch, decode, and rename machinery handles each one as if it were unique.

Data-level parallelism (DLP) addresses this directly. A single instruction operates on multiple data elements at once. The hardware that performs the parallel operation is wide; the instruction stream that drives it stays narrow. The architecture extends the ISA with vector or SIMD instructions; the implementation builds wide functional units that compute on many elements per cycle.

This chapter covers DLP from two angles: the architecture (what kinds of instructions and registers ISAs add), and the implementation (how the wide hardware is organized). We start with the basic SIMD model that x86 SSE/AVX and ARM NEON use, then look at the more flexible vector model from RISC-V V and ARM SVE, and end with a brief look at how GPUs take DLP to its logical extreme.

01.SIMD: Single Instruction, Multiple Data

The classic data-parallel model is SIMD: a single instruction operates on multiple data elements held in a wide register. For example, a 128-bit register holds four 32-bit floats; a single SIMD add adds two such registers element-wise, producing four sums in parallel.

Figure: SIMD element-wise add: two 128-bit registers each holding four 32-bit lanes are added, producing four sums in a third register in one instruction

LaTeX

\begin{tikzpicture}[font=\small, line cap=round]
  % register a (top row)
  \node[anchor=east] at (-0.1, -0.35) {register a:};
  \draw[thick] (0, -0.7) rectangle (1.5, 0); \node at (0.75, -0.35) {a0};
  \draw[thick] (1.5, -0.7) rectangle (3, 0); \node at (2.25, -0.35) {a1};
  \draw[thick] (3, -0.7) rectangle (4.5, 0); \node at (3.75, -0.35) {a2};
  \draw[thick] (4.5, -0.7) rectangle (6, 0); \node at (5.25, -0.35) {a3};
  % register b
  \node[anchor=east] at (-0.1, -1.35) {register b:};
  \draw[thick] (0, -1.7) rectangle (1.5, -1); \node at (0.75, -1.35) {b0};
  \draw[thick] (1.5, -1.7) rectangle (3, -1); \node at (2.25, -1.35) {b1};
  \draw[thick] (3, -1.7) rectangle (4.5, -1); \node at (3.75, -1.35) {b2};
  \draw[thick] (4.5, -1.7) rectangle (6, -1); \node at (5.25, -1.35) {b3};
  % ADD line
  \draw[thick] (-0.1, -2) -- (6.1, -2); \node[anchor=west] at (6.2, -2) {ADD};
  % register c
  \node[anchor=east] at (-0.1, -2.6) {register c:};
  \draw[thick] (0, -2.85) rectangle (1.5, -2.35); \node[font=\footnotesize] at (0.75, -2.6) {a0+b0};
  \draw[thick] (1.5, -2.85) rectangle (3, -2.35); \node[font=\footnotesize] at (2.25, -2.6) {a1+b1};
  \draw[thick] (3, -2.85) rectangle (4.5, -2.35); \node[font=\footnotesize] at (3.75, -2.6) {a2+b2};
  \draw[thick] (4.5, -2.85) rectangle (6, -2.35); \node[font=\footnotesize] at (5.25, -2.6) {a3+b3};
\end{tikzpicture}

The register width is the vector width in bits; the number of elements is the width divided by the element size. A 256-bit register holds 8 floats, 4 doubles, 16 16-bit ints, or 32 8-bit ints. The same register, viewed as different element types, can be operated on differently.
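The lane arithmetic is simple enough to write down. The following is a minimal sketch in C; `lanes_per_register` is a hypothetical helper name used only for illustration, not part of any SIMD API.

```c
#include <stddef.h>

/* Number of elements (lanes) that fit in one SIMD register.
   Hypothetical helper for illustration; both arguments are in bits. */
static size_t lanes_per_register(size_t register_bits, size_t element_bits) {
    return register_bits / element_bits;
}
```

Under this sketch, `lanes_per_register(256, 32)` gives 8 and `lanes_per_register(256, 8)` gives 32, matching the counts above.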

x86's SIMD has gone through several generations:

MMX (1996). 64-bit integer SIMD. Aliased the x87 FP registers, so MMX and x87 floating-point code could not be freely mixed.
SSE (1999). 128-bit registers (xmm0–xmm7, extended to xmm0–xmm15 in 64-bit mode), separate from FP. Originally single-precision FP only; SSE2 added integer and double-precision operations, and SSE3, SSSE3, and SSE4 extended the set further.
AVX (2011). 256-bit registers (ymm0–ymm15). Introduced three-operand form (separate destination from sources).
AVX2 (2013). Integer operations widened to 256 bits.
AVX-512 (2016+). 512-bit registers (zmm0–zmm31), masked operations, embedded broadcasts.

ARM's path:

VFP (1998). Scalar-only floating-point.
NEON (2005). 128-bit SIMD, integer and FP. Standard on AArch64.

ARM's AArch64 base architecture requires NEON, so 128-bit SIMD is universally available on AArch64 cores.

The hardware that implements SIMD is straightforward in principle. A 128-bit ALU is essentially four 32-bit ALUs running in lockstep. They share decode, issue, and operand-read logic; only the actual computation is replicated. The cost in transistors is much less than four separate scalar ALUs would be, because the front-end overhead is paid once for the whole vector.

The instruction set scales with the register width. Each new SIMD generation typically adds: wider versions of existing operations (add, multiply, multiply-add, compare), specialized operations (horizontal sums, dot products, packing/unpacking, shuffles), conversion operations (int to float, narrow to wide), and operation predication (mask which elements to update).

02.A Simple SIMD Loop

A loop that adds two float arrays, scalar form:

for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

In assembly, on AArch64:

Assembly

loop:
    ldr   s0, [x0, x3, lsl #2]       ; load a[i]
    ldr   s1, [x1, x3, lsl #2]       ; load b[i]
    fadd  s2, s0, s1                 ; add
    str   s2, [x2, x3, lsl #2]       ; store c[i]
    add   x3, x3, #1                 ; i++
    cmp   x3, x4                     ; i < n
    b.lt  loop

Seven instructions per element, four of which do the actual work (load, load, add, store). The other three (add, cmp, b.lt) are loop overhead.

The same loop with NEON, processing 4 floats per iteration:

Assembly

loop:
    ld1   {v0.4s}, [x0], #16         ; load 4 floats from a
    ld1   {v1.4s}, [x1], #16         ; load 4 floats from b
    fadd  v2.4s, v0.4s, v1.4s        ; add 4 floats
    st1   {v2.4s}, [x2], #16         ; store 4 floats to c
    sub   x4, x4, #4                 ; n -= 4
    cbnz  x4, loop

Six instructions per iteration, with each iteration processing 4 elements (the loop as written assumes n is a positive multiple of 4). The instructions mirror the scalar loop (load, load, add, store, decrement, branch), but each does 4 elements of work. The instruction count per element drops from 7 to 1.5, and the throughput is roughly 4× scalar.

The actual speedup depends on hardware: how many SIMD operations the core can issue per cycle, the latency of the SIMD adds, and the cache behavior. A typical modern core can issue at least 2 SIMD ops per cycle and executes them with latency comparable to the scalar FP equivalents, so 4× throughput is realistic on this kind of code as long as the arrays stay in cache.

03.The SIMD Programming Model

SIMD is a low-level abstraction. Several ways to access it.

Hand-written assembly. Maximum control, minimum portability.

Compiler intrinsics. Each SIMD instruction is exposed as a C/C++ function. The compiler emits the corresponding instruction:

#include <immintrin.h>

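// Build with AVX enabled (for example -mavx on GCC or Clang).
void add_arrays(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // 8 floats, no alignment required
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(c + i, vc);
    }
    for (; i < n; i++) {                      // scalar tail for the leftover elements
        c[i] = a[i] + b[i];
    }
}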

Intrinsics give the programmer direct access to SIMD instructions while letting the compiler handle register allocation and scheduling. They are the dominant way SIMD code is written today.

Compiler auto-vectorization. The compiler detects loops it can vectorize and emits SIMD code automatically. Modern compilers (GCC, LLVM, Intel ICC) do this routinely, but only when the dependency analysis is clean. Pointer aliasing, complex control flow, and irregular memory access patterns defeat auto-vectorization.

Higher-level libraries. NumPy, Eigen, BLAS implementations, and similar libraries call SIMD-tuned kernels under the hood. Most numerical Python code, for example, ends up running highly-tuned SIMD inside C libraries.

Domain-specific languages. OpenCL, CUDA, Metal, and similar models target SIMD/vector hardware (especially GPUs) with explicit data-parallel constructs.

The choice of abstraction depends on how much performance is needed and how much of the code base you are willing to tie to specific hardware.
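To make the aliasing point about auto-vectorization concrete, here is a minimal sketch in C99. The function name `scale_add` is hypothetical. The `restrict` qualifiers promise the compiler that the two arrays do not overlap, which removes one common obstacle to vectorization. Whether a particular compiler actually vectorizes the loop still depends on its flags and target.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]
   The restrict qualifiers tell the compiler that x and y never overlap,
   so a store to y[i] cannot change a later x[j]. */
void scale_add(size_t n, float alpha, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Without `restrict`, the compiler has to assume that writing `y[i]` might modify some `x[j]`, and it typically either gives up or emits a runtime overlap check in front of a vectorized version.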

04.Limitations of Fixed-Width SIMD

The classic SIMD model has several pain points.

Width-specific code. SSE code (128-bit), AVX code (256-bit), and AVX-512 code (512-bit) all do the same thing but with different instructions and different element counts. A program optimized for AVX has to be rewritten for AVX-512 to get the wider speed-up. The intrinsics differ; the registers differ; the instruction encodings differ. This is a serious source of code-base bloat in performance-critical software.

Tail handling. A SIMD loop processes 4 (or 8 or 16) elements per iteration. If the array length is not a multiple of the vector width, the leftover elements have to be handled separately, typically with a scalar tail loop. This is code-size overhead and complicates the source.

Misalignment. Many older SIMD instructions require their memory operands to be aligned to the vector width (16 bytes for 128-bit, 32 bytes for 256-bit). Misaligned data needs separate (historically slower) instructions or extra handling. Older SIMD ISAs had separate aligned and unaligned forms; newer ones accept unaligned addresses in most instructions but still pay a small cost when an access crosses a cache-line boundary.

Predication. What if you want to do the SIMD operation on only some elements (those satisfying a condition)? Older SIMD models had to compute the full vector and then mask the results; the lanes that should not have been computed do work anyway. AVX-512 introduced masked operations that disable specific lanes, addressing this. NEON requires explicit blend instructions.

Reductions. Combining the elements of a vector into a single value (sum, max, min) is a sequential operation: pairs of elements combine, then pairs of pairs, until one value remains. SIMD does not naturally support reductions; explicit horizontal operations (or shuffle-and-add sequences) are needed. The cost is a few extra instructions at the end of a reduction loop.

These limitations are not fundamental to data-level parallelism; they are artifacts of the fixed-width SIMD design. Vector ISAs (next section) address them.

05.Vector ISAs: RISC-V V and ARM SVE

The classic vector architectures of the 1970s — Cray-1 and its descendants — were length-agnostic: a single instruction operated on a vector whose length was set in a control register. The same code ran on hardware with vector-register sizes anywhere from a few elements to thousands. The Cray-1 had 64-element vector registers; the Convex C-2 had 128; later CDC machines had 256.

Modern vector ISAs revive this idea, with refinements.

RISC-V V (Vector Extension)

RISC-V's V extension is a flexible vector ISA. Each implementation has its own VLEN (vector register length, in bits), but the same instructions work at any VLEN. The program asks the hardware "what's your VLEN?" and the answer determines how many elements per instruction. Code written for a small core with VLEN=128 runs on a large core with VLEN=512 without modification, getting a free speedup.

Programs use the V extension by setting a vector type (element width, register grouping, and tail and mask policy) and then issuing instructions that operate on the elements according to that type and the current vector length.

Assembly

loop:
    vsetvli  t0, a0, e32, m1, ta, ma   # t0 = vl, at most min(a0, VLMAX); 32-bit elements
    vle32.v  v0, (a1)                  # load t0 floats from a
    vle32.v  v1, (a2)                  # load t0 floats from b
    vfadd.vv v2, v0, v1                # add element-wise
    vse32.v  v2, (a3)                  # store t0 floats to c
    sub      a0, a0, t0                # remaining count
    slli     t1, t0, 2                 # bytes consumed = t0 * 4
    add      a1, a1, t1                # advance pointers
    add      a2, a2, t1
    add      a3, a3, t1
    bnez     a0, loop

The vsetvli instruction asks the hardware for the largest vector length it can handle, given the remaining elements a0 and the type e32 (32-bit elements). The result t0 is the actual vector length for this iteration. Subsequent instructions process exactly t0 elements. On the last iteration, when a0 is small, t0 is smaller, and the loop processes the tail without a separate scalar loop.
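The control flow of that loop can be modelled in plain C. This is a scalar sketch, not vector code: `VLMAX` is a hypothetical hardware limit chosen for illustration, and `set_vl` stands in for vsetvli under the simplifying assumption that the hardware always grants min(remaining, VLMAX).

```c
#include <stddef.h>

/* Hypothetical hardware limit: elements per vector register at e32, m1. */
#define VLMAX 8

/* Models vsetvli: grant at most VLMAX elements of the requested count. */
static size_t set_vl(size_t avl) {
    return avl < VLMAX ? avl : VLMAX;
}

/* Strip-mined c = a + b. Returns the number of loop iterations taken. */
size_t vec_add_stripmined(const float *a, const float *b, float *c, size_t n) {
    size_t iterations = 0;
    while (n != 0) {
        size_t vl = set_vl(n);                 /* vsetvli t0, a0, ... */
        for (size_t lane = 0; lane < vl; lane++) {
            c[lane] = a[lane] + b[lane];       /* vle32 / vfadd / vse32 */
        }
        a += vl;                               /* advance pointers by vl elements */
        b += vl;
        c += vl;
        n -= vl;                               /* remaining count */
        iterations++;
    }
    return iterations;
}
```

With `VLMAX` set to 8 and n = 19, the model takes three iterations with vector lengths 8, 8, and 3. The final short iteration is the tail, handled by the same loop body.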

This length-agnostic model addresses the SIMD issues directly:

Width-specific code is gone. The same code runs on any VLEN.
Tail handling is automatic. The last iteration just has a smaller t0.
Predication. Mask registers are part of the architecture; instructions take a mask operand to disable lanes.
Reductions. Dedicated reduction instructions (sum, max, etc.) are part of the ISA.

RISC-V V also supports a richer set of operations than typical SIMD: gather and scatter (indirect memory access), strided memory access, segment loads/stores, and more.

ARM SVE (Scalable Vector Extension)

ARM's SVE is conceptually similar to RISC-V V, with its own design choices. SVE2 is the version targeting general-purpose computing; the original SVE was aimed at HPC.

SVE register width can be 128 to 2048 bits (in 128-bit increments), set by the implementation. SVE code written for a 256-bit machine runs unchanged on a 1024-bit machine, with up to 4× the throughput. Shipping implementations already span several widths: Fujitsu's A64FX uses 512 bits, Arm's Neoverse V1 uses 256, and Neoverse N2 and V2 use 128.

SVE's most distinctive feature is predicate registers that mask vector operations. A typical SVE loop:

Assembly

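    mov      x3, #0                        ; i = 0
    whilelo  p0.s, x3, x4                  ; predicate: lanes with i + lane < n
loop:
    ld1w     z0.s, p0/z, [x0, x3, lsl #2]
    ld1w     z1.s, p0/z, [x1, x3, lsl #2]
    fadd     z2.s, z0.s, z1.s
    st1w     z2.s, p0, [x2, x3, lsl #2]
    incw     x3                            ; advance i by the hardware vector length
    whilelo  p0.s, x3, x4                  ; predicate for the next iteration, sets flags
    b.first  loop                          ; continue while the first lane is active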

The whilelo instruction sets predicate bits to 1 for lanes whose index is less than n, 0 for lanes beyond. Subsequent loads and stores use this predicate to disable out-of-range lanes. The loop's iteration count is determined by the hardware; tail handling is automatic.
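A scalar C model can make the predicate mechanics explicit. This is a sketch under stated assumptions: `LANES` is a hypothetical lane count (a 128-bit implementation with 32-bit elements), and `while_lo` is an illustrative stand-in for the whilelo instruction, not a real intrinsic.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 4  /* hypothetical: 128-bit vectors, 32-bit elements */

/* Models whilelo: lane k is active when i + k < n. Returns the active count. */
static size_t while_lo(bool pred[LANES], size_t i, size_t n) {
    size_t active = 0;
    for (size_t k = 0; k < LANES; k++) {
        pred[k] = (i + k < n);
        active += pred[k];
    }
    return active;
}

/* Predicated c = a + b: inactive lanes neither load nor store. */
void vec_add_predicated(const float *a, const float *b, float *c, size_t n) {
    bool pred[LANES];
    for (size_t i = 0; while_lo(pred, i, n) != 0; i += LANES) {  /* incw */
        for (size_t k = 0; k < LANES; k++) {
            if (pred[k]) {
                c[i + k] = a[i + k] + b[i + k];
            }
        }
    }
}
```

With n = 6, the first iteration runs with all four lanes active and the second with only two, so elements past the end of the arrays are never touched.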

SVE has gained significant traction in HPC and is appearing in smartphones through Armv9 cores that implement SVE2, typically at a 128-bit width. The flexibility comes at a small cost (slightly more complex hardware than fixed-width SIMD), but the software portability benefit is large.

06.Implementing Vector Hardware

The hardware that implements SIMD or vector instructions has several distinctive features.

Wide ALUs. A 256-bit add is essentially eight 32-bit adds running in parallel. The carry chain within each 32-bit lane is isolated; carries do not propagate across lane boundaries. A wide ALU is therefore not much harder to design than a narrow one, because the lanes are independent.

Wide register files. A 32-entry register file with 512-bit registers is a 2 KB array (32 × 64 bytes), with several read and write ports, and an out-of-order core needs additional physical registers beyond that for renaming. Vector register files are large and take a significant fraction of the chip's area in vector-heavy designs.

Wide load/store paths. A vector load delivers many bytes per cycle. The L1 D-cache's read port has to be wide enough — typically the full vector width or half of it. Older SIMD implementations used a narrow port and took multiple cycles for a wide load; modern implementations have wide paths, treating SIMD loads as fast as scalar loads in throughput.

Specialized execution units. Vector and SIMD units include extras beyond plain arithmetic: shuffle units that rearrange elements within a vector, gather/scatter units that handle indirect memory access, masked-execution logic that selectively enables lanes.
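The lane isolation described under wide ALUs can be imitated in software, which makes the idea easy to test. The sketch below packs four 8-bit lanes into one 32-bit word and adds them so that no carry crosses a lane boundary. The helper name `add_u8x4` is hypothetical. Real SIMD hardware cuts the carry chain in the adder itself rather than using masks like these.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes with wraparound, keeping carries inside each lane. */
static uint32_t add_u8x4(uint32_t a, uint32_t b) {
    /* Add the low 7 bits of every lane; any carry lands in bit 7 of the same lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Sum of the top bits of each lane, ignoring the carry out of the lane. */
    uint32_t top = (a ^ b) & 0x80808080u;
    return low7 ^ top;
}
```

For the inputs 0x01FF7F80 and 0x01010180, `add_u8x4` returns 0x02008000: the 0xFF + 0x01 lane wraps to 0x00 without disturbing its neighbour. An ordinary 32-bit add of the same inputs gives 0x03008100, because carries leak across the lane boundaries.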

The vector hardware sits alongside scalar hardware on the same chip. They share the front end and decoder, but the back end has separate vector and scalar register files, separate execution units, and separate result paths. A modern core might have:

2-4 scalar integer ALUs.
2-3 SIMD/vector pipes, each with its own ALU, fed from a shared vector register file.
1-2 load and 1 store unit, each capable of vector-width transfers.

The vector and scalar pipes feed into a common ROB and commit in program order, just like scalar ops, but execute in their own datapaths.

07.Vector Cost Model

A simple cost model: a vector operation of width $W$ on $N$ elements takes $\lceil N / W \rceil$ vector instructions, each costing perhaps 1-3 cycles plus any latency. The throughput is bounded by:

The vector unit's issue width (typically 1-2 vector ops per cycle).
The memory subsystem's bandwidth (vector loads consume L1 D-cache bandwidth quickly).
Reduction or tail-handling overhead.

The peak speedup over scalar code is roughly $W$ (the vector width in elements), but real speedups are often only 50-80% of peak because of the limits above. For well-suited code (long arrays, regular access, no branches inside the loop), speedups close to peak are achievable.

For ill-suited code (irregular access, lots of branches, short loops, complex dependencies), vectorization may achieve little or even hurt performance because of the setup overhead.
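The cost model is small enough to express as two helper functions. This is an illustrative sketch: the names `vector_instructions` and `estimated_speedup` are hypothetical, and the efficiency factor is an input you would have to measure, not something the model predicts.

```c
#include <stddef.h>

/* Vector instructions needed for n elements at w elements each: ceil(n / w). */
static size_t vector_instructions(size_t n, size_t w) {
    return (n + w - 1) / w;
}

/* Estimated speedup over scalar code: peak width scaled by an efficiency in (0, 1]. */
static double estimated_speedup(size_t w, double efficiency) {
    return (double)w * efficiency;
}
```

For 8-wide vectors, 1000 elements need 125 vector instructions and 1003 elements need 126, the last one mostly empty. At an efficiency between 0.5 and 0.8, the estimated speedup at that width falls between 4× and 6.4×.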

08.When to Use SIMD vs. Scalar

The decision is workload-dependent.

Use SIMD/vector when:

The work consists of many independent elements doing the same operation.
Memory access is regular (sequential or strided).
The element count is large enough to amortize setup overhead.
The control flow within the loop is simple (or maskable).

Stay scalar when:

Operations are inherently sequential (pointer chasing, parsing).
Branches inside the loop are unpredictable and lane-divergent.
Element count is small (a vector setup followed by tail handling is overhead).
Memory access is irregular and gather/scatter is too slow.

Modern compilers try to vectorize automatically, and hints such as #pragma omp simd or restrict-qualified pointers often let them succeed; explicit intrinsics cover the cases where they do not. Performance-critical code in image processing, video encoding, scientific computing, machine learning kernels, cryptography, and physics simulation is essentially all SIMD code today.

09.Masking, Gather/Scatter, and Reductions

Three mechanisms recur often enough across SIMD and vector ISAs that they deserve their own treatment: predication via masks, irregular memory access via gather and scatter, and the reduction of a vector to a scalar.

Mask registers and predication. Real loops contain conditional logic, and a vector implementation has to do something with the lanes whose condition is false. The two common approaches are blending (compute both branches, then merge with a mask) and predication (carry a per-lane mask through subsequent instructions, with masked-off lanes neither writing results nor faulting). AVX-512 introduced eight architectural mask registers (k0–k7) and a zeroing or merging writemask on most vector instructions; AArch64 SVE and SME use predicate registers (p0–p15) for the same purpose; RISC-V V uses register v0 as the mask source. In all three, the governing predicate is part of the instruction encoding, so a single masked vector add can compute c[i] = a[i] + b[i] only where mask[i] is true and leave the other lanes alone.

Mask predication is what makes irregular control flow vectorizable. A loop with an inner if becomes a vector comparison that produces a mask, followed by masked vector arithmetic. The cost is that lanes whose predicate is false still consume execution time (the work is done in the ALU and discarded), so heavily branchy code with low average activity wastes hardware. An implementation is free to skip work when an entire predicate is false, but the architectural model is still that all lanes execute.
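The compare-then-masked-operate pattern can be sketched in scalar C. The example below models one vector's worth of the loop `if (a[i] > 0) c[i] = a[i] + b[i];`. `LANES` and the function name `masked_add_positive` are hypothetical, and the inner loops stand in for single vector instructions.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 8  /* hypothetical vector width in elements */

/* Step 1 is a vector compare that produces a mask. Step 2 is a masked add
   with merging behaviour: inactive lanes keep their old value in c.
   Returns the number of lanes that did useful work. */
size_t masked_add_positive(const float *a, const float *b, float *c) {
    bool mask[LANES];
    size_t active = 0;
    for (size_t k = 0; k < LANES; k++) {   /* vector compare: a > 0 */
        mask[k] = a[k] > 0.0f;
        active += mask[k];
    }
    for (size_t k = 0; k < LANES; k++) {   /* masked add, merging */
        float sum = a[k] + b[k];           /* every lane still computes */
        c[k] = mask[k] ? sum : c[k];       /* only active lanes write */
    }
    return active;
}
```

The ratio of the return value to `LANES` is the lane utilization. When it is low, most of the additions in the second loop are computed and thrown away, which is the waste described above.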

Gather and scatter. Most SIMD loads expect a single contiguous slab of memory; many real workloads do not have one. Gather takes a vector of indices and produces a vector of loaded values, one per lane: v[i] = mem[base + idx[i]]. Scatter is the reverse: mem[base + idx[i]] = v[i]. AVX2 added gather; AVX-512 added scatter; SVE has LD1H, LD1W and similar with vector index forms; RISC-V V has vluxei and vsuxei.

Gather and scatter let irregular access patterns vectorize, but the hardware cost is severe: each lane potentially touches a different cache line, requiring multiple cache-port accesses, multiple TLB lookups, and serialization in the worst case. A gather across $N$ lanes that hits a single line costs roughly the same as a scalar load; a gather that hits $N$ lines costs $N$ scalar loads plus the vector overhead. The practical lesson is that gather/scatter is a tool of last resort — if the access pattern can be reorganized to be contiguous, a contiguous load is dramatically faster.
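The semantics and the cost argument can both be modelled in a few lines of C. This is a sketch with assumed parameters: `LANES` is a hypothetical vector width, `LINE_BYTES` assumes 64-byte cache lines, and `lines_touched` assumes the base address is line-aligned. None of these names come from a real API.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8        /* hypothetical vector width in elements */
#define LINE_BYTES 64  /* assumed cache-line size */

/* Gather: out[k] = base[idx[k]] for each lane. */
void gather_f32(const float *base, const uint32_t *idx, float *out) {
    for (size_t k = 0; k < LANES; k++) {
        out[k] = base[idx[k]];
    }
}

/* Scatter: base[idx[k]] = in[k] for each lane. */
void scatter_f32(float *base, const uint32_t *idx, const float *in) {
    for (size_t k = 0; k < LANES; k++) {
        base[idx[k]] = in[k];
    }
}

/* Count the distinct cache lines one gather touches (base assumed line-aligned). */
size_t lines_touched(const uint32_t *idx) {
    size_t count = 0;
    for (size_t k = 0; k < LANES; k++) {
        size_t line = (idx[k] * sizeof(float)) / LINE_BYTES;
        int seen = 0;
        for (size_t j = 0; j < k; j++) {
            if ((idx[j] * sizeof(float)) / LINE_BYTES == line) {
                seen = 1;
                break;
            }
        }
        count += !seen;
    }
    return count;
}
```

Indices 0 through 7 touch a single line, while indices 0, 16, 32, ..., 112 touch eight different lines. The second gather therefore needs roughly eight times the cache accesses for the same number of loaded elements.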

Reductions. Many algorithms end with a reduction: a vector of partial results combined into a scalar (sum, dot product, max, min). Naively, a reduction is a tree of pairwise combinations that takes $\log_2 N$ steps for $N$ lanes, with each step using lane-shuffling instructions to bring partner elements together. AArch64 NEON has dedicated addv, smaxv, sminv instructions that perform the reduction in hardware; AVX provides hadd and shuffles that build the tree; SVE has predicated uaddv, faddv, andv instructions; RISC-V V has the vredsum, vredmax, vredmin family. In all cases, the reduction is a serial point in an otherwise parallel loop and often dominates the loop's tail latency, so well-vectorized code amortizes reductions across many iterations of the parallel work.
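The pairwise tree can be sketched in scalar C. `LANES` is a hypothetical power-of-two vector width and `reduce_sum_tree` is an illustrative name. Each pass of the outer loop corresponds to one shuffle-and-add step on real hardware.

```c
#include <stddef.h>

#define LANES 8  /* hypothetical vector width; must be a power of two */

/* Pairwise tree reduction: the live width halves after each step.
   Returns the sum and writes the number of combining steps to *steps. */
float reduce_sum_tree(const float *in, int *steps) {
    float v[LANES];
    for (size_t k = 0; k < LANES; k++) {
        v[k] = in[k];
    }
    *steps = 0;
    for (size_t width = LANES / 2; width >= 1; width /= 2) {
        for (size_t k = 0; k < width; k++) {
            v[k] += v[k + width];   /* lane k combines with its partner */
        }
        (*steps)++;
    }
    return v[0];
}
```

For eight lanes holding 1 through 8, the function returns 36 after three steps, which is $\log_2 8$. Note that the tree adds the elements in a different order than a left-to-right scalar loop, so floating-point results can differ in the last bits for inputs that do not sum exactly.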

The interaction of these three mechanisms — masking, irregular memory, reduction — is what determines whether a real algorithm vectorizes well. SIMD primitives have grown progressively richer along all three axes; AVX-512, SVE, RVV, and SME all include sophisticated mask, gather/scatter, and reduction support that earlier SIMD generations (SSE, NEON ARMv7) lacked.

10.GPUs as Extreme DLP

GPUs take data-level parallelism to its logical extreme. A modern GPU has thousands of small execution units, organized into groups that each run one instruction across many data elements. To a first approximation, the whole machine behaves like a collection of very wide vector units.

The GPU's architecture (covered in detail in Chapter 56) differs from a CPU's vector unit:

Massive width. Hundreds to thousands of "lanes" instead of dozens.
SIMT (Single Instruction, Multiple Threads). Each lane has its own register state and program counter — sort of. The hardware groups lanes into warps or wavefronts (32 or 64 lanes) that execute together. Within a warp, all lanes run the same instruction; if they diverge (some take a branch one way, some the other), the hardware masks the inactive lanes.
Many warps in flight. Each compute unit holds many warps simultaneously and switches between them every cycle to hide memory latency. Where a CPU uses caches and OoO to hide latency, a GPU uses warp switching.
Specialized memory hierarchy. GPUs have large per-thread register files, a small fast on-chip memory shared by cooperating threads, and a large global memory with high bandwidth but high latency. The hierarchy and access patterns are different from CPU caches.

The GPU's programming model — CUDA, OpenCL, Metal, Vulkan compute, ROCm — is built around launching thousands of threads, each operating on one or a few data elements. The hardware groups them into warps and runs them in lockstep within each warp.
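The cost of divergence inside a warp can be illustrated with a deliberately simplified C model. It assumes 32-lane warps and a plain if/else with no nesting. The function names are hypothetical, and real hardware uses more elaborate reconvergence mechanisms than this.

```c
#include <stdint.h>

#define WARP_SIZE 32

/* Build the active mask for one warp: bit k is set when lane k takes the branch. */
uint32_t branch_mask(const int *cond) {
    uint32_t mask = 0;
    for (int k = 0; k < WARP_SIZE; k++) {
        if (cond[k]) {
            mask |= (uint32_t)1 << k;
        }
    }
    return mask;
}

/* Passes the warp makes over an if/else: one per side that has at least
   one active lane. A divergent warp pays for both sides. */
int passes_for_branch(uint32_t taken_mask) {
    uint32_t not_taken_mask = ~taken_mask;  /* all 32 bits correspond to lanes */
    return (taken_mask != 0) + (not_taken_mask != 0);
}
```

A warp whose lanes all agree makes one pass. A warp whose lanes alternate makes two passes, each with half of the lanes masked off, so it delivers half the useful work per issued instruction.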

For embarrassingly parallel workloads — graphics rendering, ML training and inference, physics simulation, image processing at scale — GPUs deliver throughput well beyond what a CPU's SIMD can offer. The cost is programming model: the GPU model is restrictive and software has to be written specifically for it.

11.SIMD and Vectorization in Modern Software

A summary of where DLP shows up in modern computing:

CPU SIMD. Used in nearly every modern application via libraries: BLAS, FFTW, OpenSSL, video codecs, image processing, JSON parsing, hash functions, regex engines. The application code is scalar; the heavy lifting is in tuned SIMD libraries.
CPU vector ISAs. RISC-V V and ARM SVE are spreading; HPC code is increasingly portable across implementations because of length-agnostic vector ISAs.
GPU compute. ML, graphics, and HPC use GPUs for the bulk of computation. CUDA and similar frameworks dominate.
DSPs. Specialized chips (audio DSPs, baseband processors, ML accelerators) use VLIW or SIMD heavily.
Tensor processors and matrix accelerators. New ISA extensions (AMX on x86, SME on ARM) and dedicated accelerators (TPUs, NPUs) extend DLP to matrix operations, matching the patterns of ML.

DLP, far from being a niche, is one of the dominant axes of modern computing. Single-thread ILP gains have plateaued; DLP gains are still growing rapidly as ISAs widen and accelerators specialize.

12.Matrix Extensions: AMX, SME, and Dedicated Accelerators

The single largest workload driving CPU SIMD evolution in the 2020s is machine learning, and ML at the kernel level is dominated by matrix multiplication. The dot-product-and-accumulate pattern at the heart of GEMM (general matrix-matrix multiply) is regular enough that even classical SIMD vectorizes it well, but it has so much intrinsic parallelism that even the widest vector ISAs leave performance on the table. The architectural response is matrix instructions that operate on 2-D tiles of data in a single operation.

Intel AMX (Advanced Matrix Extensions), introduced in Sapphire Rapids (2023), adds eight 1024-byte tile registers (architecturally configured as 16 rows of 64 bytes by default) and a small set of matrix instructions: TDPBSSD and friends for 8-bit integer dot-product-accumulate, TDPBF16PS for bfloat16, and tile load/store instructions. A single TDPBSSD instruction on full tiles performs a $16 \times 64 \times 16$ matrix multiply-accumulate (16,384 multiply-accumulates in one instruction) and runs on a dedicated AMX execution unit rather than the regular SIMD pipes. The throughput is roughly an order of magnitude higher than AVX-512 on the same workload, and the energy per operation is correspondingly lower.
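The arithmetic behind that tile operation can be written as a plain C reference loop. This sketch captures only the math. It uses ordinary row-major arrays and ignores the packed operand layout and tile configuration that the real instruction requires. The names `TILE_M`, `TILE_K`, `TILE_N`, and `tile_dot_accumulate` are hypothetical.

```c
#include <stdint.h>

enum { TILE_M = 16, TILE_K = 64, TILE_N = 16 };  /* tile shape from the text */

/* c[m][n] += sum over k of a[m][k] * b[k][n], with int8 inputs and int32
   accumulators, all stored row-major in flat arrays.
   Returns the number of multiply-accumulates performed. */
long tile_dot_accumulate(int32_t *c, const int8_t *a, const int8_t *b) {
    long macs = 0;
    for (int m = 0; m < TILE_M; m++) {
        for (int n = 0; n < TILE_N; n++) {
            for (int k = 0; k < TILE_K; k++) {
                c[m * TILE_N + n] += (int32_t)a[m * TILE_K + k] * (int32_t)b[k * TILE_N + n];
                macs++;
            }
        }
    }
    return macs;
}
```

The function returns 16 × 16 × 64 = 16,384, the number of multiply-accumulates one full-tile instruction replaces.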

Arm SME (Scalable Matrix Extension), an optional extension introduced in Armv9.2-A that builds on SVE2, takes a similar approach with vector-length-agnostic semantics. SME introduces the ZA register, a 2-D tile whose dimensions scale with the implementation's vector length, and a streaming mode in which the rest of the SVE register file is repurposed for matrix operands. SME instructions perform outer products and accumulate them into the ZA tile. Apple's M4 reportedly implements an SME-compatible matrix engine; ARM's expectation is that future server and mobile cores will implement SME natively.
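The outer-product-and-accumulate step can be modelled in scalar C. This is a sketch of the arithmetic only: `SVL_WORDS` is a hypothetical streaming vector length in 32-bit elements, the local array `za` stands in for one ZA tile, and the function name is illustrative.

```c
#include <stddef.h>

#define SVL_WORDS 4  /* hypothetical streaming vector length, in 32-bit elements */

/* za[i][j] += x[i] * y[j]: one outer product of two vectors, accumulated
   into a square tile whose side equals the vector length. */
void outer_product_accumulate(float za[SVL_WORDS][SVL_WORDS],
                              const float *x, const float *y) {
    for (size_t i = 0; i < SVL_WORDS; i++) {
        for (size_t j = 0; j < SVL_WORDS; j++) {
            za[i][j] += x[i] * y[j];
        }
    }
}
```

A matrix product can be built from this step by calling it once per index along the shared dimension, passing a column of the left matrix as `x` and the matching row of the right matrix as `y`. Two vectors of length L produce L × L multiply-accumulates per step, which is why the tile form extracts more work per operand loaded than a vector FMA does.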

Dedicated accelerators — Google's TPU, NVIDIA's Tensor Cores (which are integrated into each streaming multiprocessor of recent GPUs), AMD's matrix cores, Apple's Neural Engine — push matrix specialization further still. Each is a fixed-function unit that performs matrix multiply-accumulate on specific data types (FP16, BF16, INT8, FP8, FP4 in newer designs) at throughput well beyond what general-purpose SIMD can deliver. The CPU or GPU programming model exposes them through libraries (cuBLAS, oneDNN, ML Compute) rather than through individual intrinsics.

The architectural pattern is consistent: as a workload becomes economically important enough, ISAs add specialized instructions for it; as those instructions become standard, dedicated execution units take over from general-purpose ones. Matrix multiplication is the current example; in the past, the same trajectory took FP from co-processors into the main pipeline (Chapter 4) and SIMD from extensions into mainstream ISA. We will return to accelerators more broadly in Chapter 56.

13.Summary

Data-level parallelism exploits the regular, independent operations on arrays of data that are common in scientific, graphics, and ML workloads. SIMD architectures (SSE, AVX, NEON) widen the registers and ALUs to process several elements per instruction; vector ISAs (RISC-V V, ARM SVE) generalize to length-agnostic instructions that adapt to the implementation's hardware width. GPUs take the model further with thousands of parallel lanes and warp-based scheduling.

The compiler and the programmer expose DLP through auto-vectorization, intrinsics, vector pragmas, and specialized libraries. Real speedups depend on the regularity of the workload: regular array operations vectorize well, irregular data structures less so. For suitable workloads, DLP delivers throughput far beyond what ILP can extract.

DLP and ILP are complementary, not alternatives. A modern processor exploits both: a wide OoO core with SIMD execution units handles ILP within and around vector instructions while DLP within each vector instruction handles the regular parallelism. The next chapter takes the parallelism one level higher: across threads, on multi-core processors.
