Part V·ISA Case Studies·Chapter 35 of 62

Part VISA Case Studies

x86-64 Floating-Point and SIMD

May 16, 2026·18 min read·advanced

Floating-point and SIMD instructions account for a disproportionate share of x86-64's performance-critical work. Numerical libraries, image and video processing, machine-learning kernels, audio…

Floating-point and SIMD instructions account for a disproportionate share of x86-64's performance-critical work. Numerical libraries, image and video processing, machine-learning kernels, audio codecs, scientific simulations, cryptography — almost every compute-intensive workload depends on floating-point or SIMD throughput. The x86-64 ISA has accumulated an enormous and intricate vector-processing capability, evolving from the 1980s x87 FPU through SSE, AVX, AVX-512, and most recently AMX. This chapter walks through that evolution, the IEEE 754 standard the FPU implements, and how SIMD code is actually written.

We have seen the data-level parallelism story conceptually in Chapter 29. Here we look at it through the x86-64 lens: the specific registers, instructions, encodings, and idioms.

01.IEEE 754 Floating-Point

Modern x86-64 floating-point is governed by IEEE 754 (latest revision: IEEE 754-2019). The standard defines the bit-level format of floating-point numbers, the rules for arithmetic operations, the handling of special values (NaN, infinity, denormals), and rounding modes.

Formats

x86-64 supports three binary formats:

Format	Total bits	Sign	Exponent	Mantissa	Bias	Precision
Single (`float`)	32	1	8	23 (+1 implicit)	127	~7 decimal digits
Double (`double`)	64	1	11	52 (+1 implicit)	1023	~16 decimal digits
Extended	80	1	15	64 (explicit)	16383	~19 decimal digits

The x87 FPU uses the 80-bit extended format internally. SSE/AVX use single and double precision.

A normal value is encoded as:

$x = (-1)^s \times 1.m \times 2^{e - \text{bias}}$

where $s$ is the sign bit, $m$ is the mantissa, and $e$ is the biased exponent.

Special encodings:

Zero: exponent = 0, mantissa = 0. Both +0 and −0 exist.
Denormals (subnormals): exponent = 0, mantissa ≠ 0. Represent very small values close to zero, with reduced precision. The implicit leading 1 is replaced by 0.
Infinities: exponent = all 1s, mantissa = 0. Both +∞ and −∞.
NaN (Not a Number): exponent = all 1s, mantissa ≠ 0. Quiet NaNs (qNaN) propagate silently; signaling NaNs (sNaN) raise an exception.

Half-precision (16-bit, float16 or __fp16) and BFloat16 are recent additions for ML workloads, supported by some AVX-512 sub-extensions and AMX.

Rounding Modes

IEEE 754 defines four rounding modes:

Round to nearest, ties to even (the default).
Round toward zero (truncate).
Round toward +∞ (ceiling).
Round toward −∞ (floor).

Mode is set in the FPU control word (x87) or the MXCSR register (SSE/AVX).

Exceptions

IEEE 754 defines five exception conditions:

Invalid operation (e.g., 0/0, sqrt of negative).
Division by zero.
Overflow (result too large to represent, returns ±∞).
Underflow (result too small, returns denormal or zero).
Inexact (result was rounded).

Each can either set a flag (default) or raise a trap (#XF in x86-64). Most code runs with all traps masked and reads flags only when it cares.

Why It Matters

The IEEE 754 details are not pedantic trivia. Numerical results depend on:

Whether your code uses single or double precision.
The rounding mode in effect.
Whether intermediate computations are kept in 80-bit extended precision (x87) or rounded to 64 bits at each step (SSE/AVX).
Whether the compiler reorders expressions (-ffast-math allows reorderings that violate IEEE rules).
Whether denormals are flushed to zero (FTZ/DAZ flags in MXCSR — common in performance-critical code, since denormals slow down most FPUs).

Compiler flags that disable strict IEEE compliance (-ffast-math) can give 2-5× speedups but make results subtly different. For numerical code where reproducibility matters, IEEE compliance must be preserved.

02.The x87 FPU

x87 was the original Intel floating-point coprocessor (1980), brought on-die in the 80486 (1989). It defines:

Eight 80-bit FP registers, st(0) through st(7), organized as a stack.
Operations push, pop, and combine stack entries.
A control word (rounding mode, exception masks, precision control).
A status word (condition codes, exception flags, top-of-stack pointer).

A typical x87 sequence:

Assembly

fld   qword ptr [a]     ; push a onto FP stack: st(0) = a
fld   qword ptr [b]     ; push b: st(0) = b, st(1) = a
fmul                    ; st(0) = a * b, pop one
fld   qword ptr [c]     ; push c: st(0) = c, st(1) = a*b
fadd                    ; st(0) = c + a*b, pop one
fstp  qword ptr [r]     ; pop and store to r

Stack-based encoding is compact (each operation implicitly references st(0) and possibly st(1)) but awkward for compilers. Register allocation becomes a stack-juggling exercise: the compiler must shuffle values up and down to keep the right operand on top.

x87 also has a unique feature: 80-bit internal precision. Loads and stores convert between memory format (32 or 64 bit) and the internal 80-bit format. Intermediate computations keep extra precision, reducing rounding error in long calculations. SSE and AVX always work in the destination's precision (32 or 64 bit), so equivalent calculations may give slightly different results.

x87 also has trigonometric and transcendental instructions: FSIN, FCOS, FPTAN, FPATAN, F2XM1, FYL2X, FSQRT, etc. These are microcoded and slow (50-200 cycles each). Modern code mostly avoids them, using software libraries that compute via polynomial approximations on SSE/AVX.

In practice, modern x86-64 compilers do not emit x87 code by default. The 80-bit extended type (long double in C on Linux, but double on Windows) is the main remaining use. For all standard float and double arithmetic, the compiler emits SSE/AVX scalar instructions.

03.SSE: The First Modern SIMD

The Streaming SIMD Extensions (SSE), introduced with the Pentium III in 1999, defined the first modern SIMD ISA on x86. Key features:

Eight new 128-bit registers, xmm0 through xmm7 (later extended to xmm0-xmm15 in x86-64).
A new control register, MXCSR, holding rounding mode, exception flags, FTZ/DAZ.
Initial scope: 4 single-precision floats per register.
A new exception vector: #XF (SIMD floating-point exception).

Subsequent revisions broadened it:

SSE2 (Pentium 4, 2000): added 64-bit double precision (2 doubles per xmm), and 128-bit integer SIMD (16 bytes, 8 shorts, 4 ints, or 2 longs per xmm).
SSE3 (Prescott, 2004): horizontal arithmetic, complex-number support.
SSSE3 (Core 2, 2006): general byte-shuffle (PSHUFB), saturated math.
SSE4.1, SSE4.2: dot product, blend, packed integer compare, string operations, CRC32, POPCNT.

After SSE2, virtually every x86-64 chip supports the full SSE family. By x86-64 baseline, SSE2 is mandatory; the compiler is free to use it without a runtime check.

A few representative SSE instructions:

Assembly

movaps xmm0, [rdi]              ; load 4 floats (aligned)
addps  xmm0, xmm1               ; 4 parallel float adds
mulps  xmm0, xmm2               ; 4 parallel float multiplies
movaps [rsi], xmm0              ; store
addsd  xmm0, xmm1               ; one double add (scalar)
addpd  xmm0, xmm1               ; two double adds (packed)
paddd  xmm0, xmm1               ; 4 packed integer adds (32-bit)

The naming pattern: op + suffix:

ps = packed single (4 floats).
pd = packed double (2 doubles).
ss = scalar single (1 float, low element).
sd = scalar double (1 double, low element).
b/w/d/q = packed bytes / words / dwords / qwords.

The same pattern carries over to AVX with the v prefix and wider widths.

04.AVX: Three-Operand SIMD at 256 Bits

AVX (Sandy Bridge, 2011) was a major redesign:

Doubles register width to 256 bits. The xmm registers become the lower halves of ymm0 through ymm15.
Adds three-operand form: destination, source 1, source 2. The destination no longer overwrites a source.
Adds the VEX prefix for cleaner encoding.
Adds new instructions including FMA (in AVX2).

Three-operand form is a big productivity win:

Assembly

; Old SSE (two-operand): destination overwritten
movaps xmm0, xmm1   ; copy
addps  xmm0, xmm2   ; xmm0 = xmm1 + xmm2
; AVX (three-operand): no copy needed
vaddps ymm0, ymm1, ymm2   ; ymm0 = ymm1 + ymm2

VEX-encoded instructions also implicitly zero the upper 128 bits of the destination ymm register (a "zero-upper" effect), avoiding the SSE-AVX transition penalty when mixing.

AVX2 (Haswell, 2013) extended integer SIMD to 256 bits and added several useful operations:

256-bit integer arithmetic, shifts, compares.
Gather instructions (VGATHER): load from non-contiguous addresses given a base and a vector of indices.
VPSLLVD, VPSRLVD: per-lane variable shifts.
FMA (Fused Multiply-Add): vfmadd231ps zmm0, zmm1, zmm2 computes zmm0 = zmm1 * zmm2 + zmm0 with a single rounding step. Doubles the FP throughput on workloads that can use it.

AVX/AVX2 is the SIMD baseline assumed by most modern numerical and ML libraries on x86-64. Compilers auto-vectorize loops to AVX2 when targeting Haswell or newer.

05.AVX-512: 512-Bit and Mask Registers

AVX-512 (Knights Landing 2016, then Skylake-X 2017) is a much more substantial extension:

Doubles register width again, to 512 bits. The ymm registers are the lower halves of zmm0 through zmm31 (32 vector registers, double the AVX count).
Adds mask registers k0-k7, used for predicated execution: each lane in a SIMD operation can be masked off via a per-lane bit.
Adds EVEX prefix for encoding.
Adds many new operations: per-lane permutes, embedded broadcasts, embedded rounding, conflict detection, etc.

Mask registers solve a long-standing SIMD problem. Consider:

for (int i = 0; i < n; i++) {
    if (a[i] > 0)
        b[i] = c[i];
}

With AVX2, vectorizing requires either a select pattern (compute both branches, blend) or scalar fallback at boundaries. With AVX-512:

Assembly

vmovups   zmm0, [rdi + rcx]            ; load 16 floats from a
vcmpgtps  k1, zmm0, zmm_zero            ; k1 = mask of lanes where a[i] > 0
vmovups   zmm1, [rdx + rcx]             ; load c (could mask to skip)
vmovups   [rsi + rcx]{k1}, zmm1         ; store only where mask says yes

The mask bit for each lane controls whether that lane participates. The store uses the mask to write only the active lanes. No scalar fallback needed.

AVX-512 is split into many sub-extensions:

AVX-512F (Foundation): 512-bit registers, mask registers, basic FP and integer ops.
AVX-512CD (Conflict Detection): for vectorizing histogram-like loops.
AVX-512BW: byte/word integer ops.
AVX-512DQ: dword/qword integer and double-precision ops.
AVX-512VL: 128-bit and 256-bit versions of AVX-512 instructions (so you can use mask registers and EVEX encoding without going to 512 bits).
AVX-512VNNI: vector neural-net instructions (INT8 dot products for ML).
AVX-512BF16: BFloat16 support.
AVX-512FP16: half-precision support.
AVX-512_VPCLMULQDQ, AVX-512_GFNI: cryptography helpers.

A given chip implements some subset. Server chips (Xeon Scalable, EPYC) tend to implement many; consumer chips fewer; Atom/E-cores often skip AVX-512 entirely.

The hybrid topology (P-cores with AVX-512, E-cores without) caused Intel to disable AVX-512 on consumer Alder Lake and later (P-cores have it physically but it's fused off). AMD's Zen 4 brought AVX-512 to consumer chips, with a "double-pumped" 256-bit implementation that runs AVX-512 instructions correctly but at lower throughput.

06.Frequency and Throughput Considerations

Wide SIMD imposes power and thermal constraints. Running AVX-512 at full utilization can:

Draw significantly more power than scalar code.
Cause the CPU to drop its operating frequency (a "license" mechanism in Intel server chips).
Affect adjacent cores' thermal headroom.

The result: a workload that uses AVX-512 may run at a lower frequency, partially offsetting the SIMD speedup. For code that is only sometimes vectorized (e.g., a long mostly-scalar program with one vectorized hot spot), the frequency drop can outweigh the speedup. Intel's later silicon (Ice Lake and Sapphire Rapids) reduced this penalty significantly; AMD's Zen 4 has it but smaller.

This is a subtle point that experienced SIMD programmers have to be aware of: just because you can use AVX-512 doesn't mean you should. Profile carefully.

07.FMA and the Throughput Story

For dense numerical kernels, the headline number is FLOPs per cycle. For double precision on a modern x86-64 chip:

1 cycle / instruction.
2 fused FP ops per FMA instruction (one mul + one add).
8 doubles per AVX-512 register (or 4 per AVX-256, or 2 per SSE).
Often 2 FMA-capable execution ports.

So AVX-512: $2 \text{ ports} \times 1 \text{ FMA/cycle} \times 8 \text{ lanes} \times 2 \text{ FLOPs/lane} = 32$ DP FLOPs/cycle/core.

At 3 GHz, that's 96 GFLOPS per core. A 32-core CPU can theoretically reach 3 TFLOPS double-precision. (In practice, memory bandwidth limits actual achieved performance for most code.)

This is the throughput that BLAS libraries (Intel MKL, OpenBLAS) target with carefully tuned kernels. Achieving 80-90% of peak is realistic for matrix multiplication; lower for less compute-dense kernels.

08.AMX: Tile-Based Matrix Operations

AMX (Advanced Matrix Extensions, Sapphire Rapids 2023) adds matrix-tile operations to x86-64. It defines:

Eight tile registers tmm0 through tmm7, each a 2D register up to 16 rows × 64 bytes (= 1 KiB per tile).
Tile configuration via LDTILECFG (set up dimensions and palettes).
Matrix multiply instructions: TDPBSSD (signed×signed→int32 dot product), TDPBF16PS (bfloat16×bfloat16→float32 multiply-accumulate), and others.

A single TDPBF16PS instruction performs a 16×16 outer-product accumulation: 256 FMAs in one instruction, accumulating into a tile register.

AMX is targeted at machine-learning inference and training. It's a meaningful capability bump for AI workloads but requires explicit programming (or a library like oneDNN). The ecosystem support is still maturing.

09.Programming SIMD

Three main approaches.

Compiler auto-vectorization. The compiler analyzes loops, decides whether they can be vectorized, and emits SIMD code. Effective for simple loops with regular access patterns and no dependencies. Less effective for complex loops with conditionals, irregular access, or potential aliasing.

for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

GCC with -O3 will auto-vectorize this to AVX2 or AVX-512 (depending on -march).

Intrinsics. C/C++ functions that map (mostly) one-to-one to SIMD instructions. Headers: <xmmintrin.h> (SSE), <immintrin.h> (AVX, AVX2, AVX-512).

#include <immintrin.h>
void vadd(float* a, float* b, float* c, int n) {
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(c + i, vc);
    }
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

Intrinsics give precise control over what instructions are emitted. They are how performance-critical libraries (FFmpeg, OpenSSL crypto, math libraries, FFT implementations) are written.

Inline assembly. Direct assembly inside C/C++ source. Most flexible but most painful. Reserved for the very last optimization steps, or for instructions not exposed as intrinsics.

There are also tools like ISPC (Intel SPMD Program Compiler) that take a SIMT-style source language and emit SIMD code, hiding the lane structure. Useful for code that maps naturally to a SIMT model.

10.SSE/AVX/AVX-512 Compatibility

When code is built with AVX-512 instructions, it cannot run on a chip without AVX-512 — the new instructions are illegal opcodes. This creates a deployment dilemma: build for the lowest common denominator (SSE2) and miss out on newer instructions, or build for newer ISAs and lose compatibility with older systems.

Two solutions:

Multi-versioned binaries. Compile multiple versions (SSE2, AVX2, AVX-512) into one binary and dispatch at runtime via CPUID. GCC has __attribute__((target("avx2"))) and ifunc for this. glibc's math functions, for example, dispatch automatically.

Build for a specific target audience. Server software builds for AVX2 or AVX-512. Game distributors might build for SSE4.2. Languages like Rust have target_feature attributes to enable per-function ISA selection.

11.A Worked Example: Matrix Multiplication Kernel

A small example of how a tuned SIMD kernel looks. Computing C = A × B with single-precision floats, blocked for AVX-512:

// 16x16 micro-kernel: accumulates into 16x16 tile of C
void matmul_microkernel(float* A, float* B, float* C, int K, int lda, int ldb, int ldc) {
    __m512 c[16];
    for (int i = 0; i < 16; i++)
        c[i] = _mm512_loadu_ps(C + i*ldc);

    for (int k = 0; k < K; k++) {
        __m512 b = _mm512_loadu_ps(B + k*ldb);
        for (int i = 0; i < 16; i++) {
            __m512 a = _mm512_set1_ps(A[i*lda + k]);
            c[i] = _mm512_fmadd_ps(a, b, c[i]);
        }
    }

    for (int i = 0; i < 16; i++)
        _mm512_storeu_ps(C + i*ldc, c[i]);
}

The inner loop performs 16 FMAs per iteration of k, each an FMA on 16-wide single-precision vectors: 512 FLOPs per iteration. The C tile stays in registers throughout, accumulating partial sums. This pattern is the heart of every BLAS Level-3 kernel.

A real BLAS implementation wraps this in cache-blocking layers (L1, L2, L3), packs A and B into contiguous buffers, and tunes parameters per chip. The result is the basis of most numerical computing — every dense linear algebra operation eventually goes through code like this.

12.Cryptography and Bit Manipulation

x86-64 has dedicated instructions for cryptography:

AES-NI (AESENC, AESDEC, AESKEYGENASSIST): hardware AES round operations.
SHA: SHA-1 and SHA-256 message-schedule and round operations.
PCLMULQDQ: carryless multiplication, the heart of GHASH (used in AES-GCM) and CRC.
VAES, VPCLMULQDQ: vectorized versions across SIMD lanes.

A modern x86-64 chip with AES-NI can encrypt at 5-10 GB/s/core, vastly faster than software AES. This is one reason why disk and network encryption have minimal performance cost on modern hardware.

RDRAND and RDSEED provide hardware random number generation, drawing from an on-die entropy source. Used by /dev/random, OpenSSL, and similar.

13.Denormals, Exceptions, and the Practical Cost of Strict IEEE

The IEEE 754 sections at the start of this chapter describe the standard's correctness properties; an honest treatment of x86 SIMD must also describe their performance implications, which are large enough that real numerical software pays attention.

Denormal numbers — the very small values whose magnitude falls below the smallest representable normalized number — are part of IEEE 754 and represent values gradually approaching zero with progressively reduced precision. On older Intel cores (and on essentially all FP hardware predating around 2010), arithmetic on denormals took a microcoded slow path that could be 10–100× slower than normal-number arithmetic. A single denormal in a hot loop could destroy performance with no obvious symptom.

The MXCSR register provides two control bits to opt out of strict denormal handling at execution time:

FTZ (Flush To Zero): denormal results are rounded to zero rather than produced.
DAZ (Denormals Are Zero): denormal operands are treated as zero on the way in.

With both bits set, the FP unit never has to handle denormals, and the performance cliff disappears at the cost of a small numerical-accuracy compromise. Audio processing libraries, game engines, and many DSP applications set FTZ and DAZ unconditionally; numerical libraries that need full precision (LAPACK, scientific simulation codes) leave them clear and accept the cost. C compilers expose this through -ffast-math (which sets FTZ/DAZ along with several other relaxations) and through pragmas; Linux's glibc sets FTZ/DAZ by default in some configurations and not in others, which has produced mysterious cross-system performance differences over the years.

Modern Intel and AMD cores have made denormal handling much faster on most operations — within a factor of 2 or so of normal arithmetic, rather than 100× — but the cliff has not entirely vanished. Performance-sensitive code still sets FTZ/DAZ, and benchmark portability still depends on knowing the setting.

FP exceptions in IEEE 754 are the conditions Invalid, Divide-by-Zero, Overflow, Underflow, and Inexact, each with a corresponding sticky flag in MXCSR and an enable mask. By default, all five exceptions are masked: the operation produces a defined default result (NaN for invalid, infinity for overflow, denormal/zero for underflow, the rounded value for inexact) and the corresponding sticky flag is set. With an exception unmasked, the operation traps to the OS, which delivers a SIGFPE (Unix) or structured exception (Windows). Almost all production code runs with all exceptions masked; unmasked exceptions are useful only for debugging.

Reproducibility is the deepest practical concern. The same FP operations on the same data can produce different results on different hardware, with different compiler flags, with different vector widths (because reduction order changes), and with different SIMD generations (FMA versus separate multiply-add changes the rounding once). Code that needs bit-exact reproducibility — financial calculations, regulatory simulations, scientific replication — must specify rounding mode, denormal handling, and FMA usage explicitly, and often disables auto-vectorization to keep the operation order predictable. Code that does not need bit-exactness gains substantial performance from letting the compiler and hardware choose freely. The tension between reproducibility and speed is a permanent feature of FP programming on x86 and indeed on every modern architecture.

14.Summary

x86-64's floating-point and SIMD evolved from the stack-based 80-bit x87 FPU through SSE (128-bit packed FP and integer), AVX (256-bit, three-operand, FMA), AVX-512 (512-bit, mask registers, 32 vector regs), and AMX (matrix tiles). Each generation targeted higher arithmetic throughput per cycle, and modern x86-64 chips can perform tens of FLOPs per cycle per core.

IEEE 754 governs the semantics, with single, double, and (for x87) extended precision, plus newer half-precision and BFloat16 formats. The MXCSR register controls rounding, exception masking, and FTZ/DAZ. Programming SIMD relies on compiler auto-vectorization, intrinsic functions, or hand-written assembly, with multi-versioned binaries handling deployment to varied target hardware.

The wide SIMD capabilities are the throughput backbone of x86-64 in numerical, AI, multimedia, and cryptographic workloads. The next chapter brings together the full x86-64 picture by looking at how Intel and AMD have actually built modern implementations of the ISA: the front end, back end, cache hierarchy, and core trade-offs that distinguish today's high-performance x86-64 chips.

Book mode

	fld qword ptr [a] ; push a onto FP stack: st(0) = a
	fld qword ptr [b] ; push b: st(0) = b, st(1) = a
	fmul ; st(0) = a * b, pop one
	fld qword ptr [c] ; push c: st(0) = c, st(1) = a*b
	fadd ; st(0) = c + a*b, pop one
	fstp qword ptr [r] ; pop and store to r

	movaps xmm0, [rdi] ; load 4 floats (aligned)
	addps xmm0, xmm1 ; 4 parallel float adds
	mulps xmm0, xmm2 ; 4 parallel float multiplies
	movaps [rsi], xmm0 ; store

	addsd xmm0, xmm1 ; one double add (scalar)
	addpd xmm0, xmm1 ; two double adds (packed)

	paddd xmm0, xmm1 ; 4 packed integer adds (32-bit)

	; Old SSE (two-operand): destination overwritten
	movaps xmm0, xmm1 ; copy
	addps xmm0, xmm2 ; xmm0 = xmm1 + xmm2

	; AVX (three-operand): no copy needed
	vaddps ymm0, ymm1, ymm2 ; ymm0 = ymm1 + ymm2

	vmovups zmm0, [rdi + rcx] ; load 16 floats from a
	vcmpgtps k1, zmm0, zmm_zero ; k1 = mask of lanes where a[i] > 0
	vmovups zmm1, [rdx + rcx] ; load c (could mask to skip)
	vmovups [rsi + rcx]{k1}, zmm1 ; store only where mask says yes