Part V·ISA Case Studies·Chapter 33 of 62

Part VISA Case Studies

x86-64 Programming Model

May 16, 2026·18 min read·advanced

This chapter is the programmer's-eye view of x86-64. The previous chapter sketched the ISA's history and structural shape; this one develops what the programmer actually sees: how instructions are…

This chapter is the programmer's-eye view of x86-64. The previous chapter sketched the ISA's history and structural shape; this one develops what the programmer actually sees: how instructions are categorized, how operands are specified, what idioms compilers emit, and how a typical program's machine code is organized. The treatment is concrete and example-driven, on the assumption that x86-64 is best learned by reading and writing it.

We will use AT&T syntax for some examples and Intel syntax for others, since both appear in real-world tools (GCC and the Linux kernel use AT&T; Windows tools and the Intel manuals use Intel). The key visual difference: AT&T puts the destination on the right (movq %rax, %rbx means rbx ← rax), while Intel puts the destination on the left (mov rbx, rax means rbx ← rax). Operands in AT&T have % register prefixes and $ immediate prefixes; Intel has none. Both are common; we will note which is which.

01.Instruction Categories

x86-64's many hundreds of instructions cluster into a handful of categories.

Data Movement

MOV. The basic load/store/copy. Many forms:

Assembly

mov  rax, rbx           ; register to register
mov  rax, [rbx]         ; load from memory
mov  [rbx], rax         ; store to memory
mov  rax, 0x1234        ; immediate to register
mov  rax, [rip + label] ; PC-relative load

MOV is single-cycle on the back end and is often eliminated entirely at rename (Chapter 27).

PUSH / POP. Push or pop a 64-bit value on the stack:

Assembly

push rax        ; rsp -= 8; [rsp] = rax
pop  rax        ; rax = [rsp]; rsp += 8

LEA. Load Effective Address. Computes an address without dereferencing:

Assembly

lea rax, [rbx + rcx*4 + 16] ; rax = rbx + rcx*4 + 16

MOVZX / MOVSX. Move with zero or sign extension:

Assembly

movzx rax, byte ptr [rbx]   ; load byte, zero-extend to 64 bits
movsx rax, word ptr [rbx]   ; load 16-bit, sign-extend to 64 bits

CMOVcc. Conditional move; copies only if the condition is true. Avoids a branch:

Assembly

cmp  rax, rbx
cmovl rax, rcx        ; if rax < rbx (signed), rax = rcx

CMOVcc is widely used by compilers in branchless code. Modern OoO cores execute it as a normal data dependency without a branch.

XCHG. Atomic exchange. With a memory operand, it has implicit lock semantics — the bus is locked for the duration. Common in older lock implementations, mostly superseded by CMPXCHG.

Arithmetic and Logic

ADD, SUB, INC, DEC, NEG. Integer arithmetic. All set flags.

Assembly

add  rax, rbx           ; rax = rax + rbx
sub  rax, 1             ; rax = rax - 1 (or use 'dec rax')
inc  rax                ; rax = rax + 1; sets ZF, SF, OF, but not CF
neg  rax                ; rax = -rax

MUL, IMUL. Multiplication. MUL is unsigned; IMUL is signed. The two-operand and three-operand forms of IMUL are common:

Assembly

imul rax, rbx           ; rax = rax * rbx (signed, low 64 bits)
imul rax, rbx, 7        ; rax = rbx * 7 (signed, low 64 bits)
mul  rbx                ; rdx:rax = rax * rbx (unsigned, 128-bit result)

The single-operand form (MUL rbx) writes a double-width result into rdx:rax, useful for big-integer arithmetic.

DIV, IDIV. Division. The single-operand form takes rdx:rax as the 128-bit dividend and produces a 64-bit quotient (rax) and remainder (rdx).

Assembly

xor  rdx, rdx           ; clear high half of dividend
div  rbx                ; rax = rdx:rax / rbx; rdx = remainder

Division is the slowest common arithmetic operation: 20-100 cycles depending on operand size and value, and not pipelined or only partially pipelined.

AND, OR, XOR, NOT. Bitwise logic.

Assembly

xor  rax, rax           ; rax = 0 (a common zeroing idiom)
and  rax, 0xff          ; rax = low 8 bits

The xor reg, reg zeroing idiom is recognized by every modern x86 decoder as a constant-zero generator with no input dependencies — a special case that avoids serializing on the prior value of the register.

SHL, SHR, SAR, ROL, ROR, RCL, RCR. Shifts and rotates.

Assembly

shl  rax, 3             ; rax <<= 3 (multiply by 8)
sar  rax, 2             ; rax >>= 2 (signed right shift, divide by 4)

SHR is logical (zero-fill); SAR is arithmetic (sign-fill).

BT, BTS, BTR, BTC. Bit test, set, reset, complement. Used for individual bit manipulation in flags or bitmaps.

POPCNT, LZCNT, TZCNT. Population count, count leading zeros, count trailing zeros. Useful for bit-twiddling.

BMI1/BMI2. Various advanced bit-manipulation: BLSR (clear lowest bit), PDEP/PEXT (parallel deposit/extract), SHLX/SHRX (shift without modifying flags), MULX (multiply without flags), and others.

Comparisons and Tests

CMP. Subtract without storing the result; sets flags based on the difference.

Assembly

cmp rax, rbx ; compute rax - rbx; set ZF, SF, OF, CF; discard result

Followed by a conditional branch or conditional move that reads the flags.

TEST. AND without storing; sets flags based on the AND.

Assembly

test rax, rax           ; sets ZF if rax is zero, SF based on high bit
jz   skip               ; jump if rax was zero

The test reg, reg pattern for "is this register zero?" is universal. It is one byte shorter than cmp reg, 0 and avoids the immediate.

Control Flow

JMP. Unconditional jump. Direct (immediate target) or indirect (register/memory target).

Assembly

jmp  label              ; direct
jmp  rax                ; indirect through register
jmp  [rax]              ; indirect through memory

Jcc. Conditional jump. The "cc" is a condition code:

Mnemonic	Meaning	Flags
JE / JZ	equal / zero	ZF=1
JNE / JNZ	not equal	ZF=0
JL / JNGE	less (signed)	SF≠OF
JLE / JNG	less or equal (signed)	ZF=1 or SF≠OF
JG / JNLE	greater (signed)	ZF=0 and SF=OF
JGE / JNL	greater or equal (signed)	SF=OF
JB / JC / JNAE	below (unsigned)	CF=1
JA / JNBE	above (unsigned)	CF=0 and ZF=0
JS	sign (negative)	SF=1
JO	overflow	OF=1

Sixteen conditions in total. Compilers pick the right one based on the comparison's signedness and operator.

CALL / RET. Function call and return.

Assembly

call function    ; push next-instr address, jump to function
ret              ; pop return address, jump to it

LOOP. Decrement rcx and jump if not zero. Mostly historical; modern compilers prefer dec rcx; jnz.

SETcc. Set a byte to 1 or 0 based on a condition; useful for materializing booleans:

Assembly

cmp  rax, rbx
setl al               ; al = 1 if rax < rbx else 0

String Operations

x86 has built-in instructions for memcpy/memset/strcmp-like loops:

MOVS (with rep prefix): copy memory.
STOS (with rep prefix): fill memory with a value.
CMPS (with repe/repne prefix): compare memory.
SCAS (with repe/repne prefix): scan memory.

Assembly

mov   rcx, length
mov   rdi, dest
mov   rsi, source
rep movsb          ; copy rcx bytes from rsi to rdi

These instructions were classic on early x86. Modern implementations have enhanced REP MOVSB / STOSB, special microcoded paths that can be faster than scalar copy loops on large buffers. The compiler's memcpy may use them. For small or fixed-size copies, vectorized SSE/AVX loads/stores are usually faster.

Atomic and Synchronization

LOCK (prefix). Makes the next instruction atomic with respect to other cores. Adds bus-locking semantics for read-modify-write instructions.

Assembly

lock add [rax], 1       ; atomically increment [rax]
lock cmpxchg [rax], rbx ; compare-and-swap

The CAS form (lock cmpxchg) is the cornerstone of lock-free programming on x86: it atomically compares the destination with rax, and if equal, replaces it with the source operand.

PAUSE. A hint instruction used in spin-wait loops: tells the processor that this is a busy-wait, allowing it to back off and reduce wasted execution slots.

MFENCE / LFENCE / SFENCE. Memory fences. MFENCE is a full barrier; SFENCE orders stores; LFENCE orders loads (and acts as a speculation barrier in some contexts).

XACQUIRE / XRELEASE and the TSX family — historical hardware transactional memory; mostly disabled in current chips due to errata.

System Instructions

A small number of privileged or semi-privileged instructions: SYSCALL, SYSRET, RDTSC (read timestamp counter), RDPMC (read performance counter), CPUID (query feature flags), RDMSR, WRMSR, and various others. We will see these in Chapter 34.

02.Calling Conventions

A calling convention specifies how function arguments and return values are passed, and which registers are caller- vs. callee-saved.

System V AMD64 ABI (Linux, macOS, BSD)

The dominant convention on Unix-like systems.

Arguments. First six integer/pointer arguments in registers, in order:

Plain Text

rdi, rsi, rdx, rcx, r8, r9

First eight floating-point arguments in xmm0-xmm7. Additional arguments on the stack.

Return value. Integer/pointer returns in rax (with rdx for second 64 bits if 128-bit return). FP returns in xmm0.

Caller-saved (volatile). rax, rcx, rdx, rsi, rdi, r8-r11, and all xmm registers. The caller must save these before a call if it needs them after.

Callee-saved (non-volatile). rbx, rbp, r12-r15. The callee must preserve these or save and restore them.

Stack. Grows downward. rsp must be aligned to 16 bytes at the call instruction (so 8 mod 16 inside the callee, before pushing rbp). The "red zone" — 128 bytes below rsp — can be used by leaf functions without adjusting rsp.

A typical function prologue/epilogue:

Assembly

function:
    push rbp
    mov  rbp, rsp
    sub  rsp, 32          ; allocate locals
    ; ... body ...
    leave                 ; mov rsp, rbp; pop rbp
    ret

Modern compilers often skip the rbp save and use [rsp + offset] directly, freeing rbp as a general-purpose register. The push rbp / mov rbp, rsp pair is preserved when frame pointers are needed for debugging or unwinding.

Microsoft x64 (Windows)

Windows uses a different convention.

Arguments. First four integer/pointer arguments in: rcx, rdx, r8, r9. First four FP in xmm0-xmm3. Additional arguments on stack. The caller reserves "shadow space" (32 bytes) on the stack for the callee to spill those four register arguments to.

Return value. Integer in rax. FP in xmm0.

Caller-saved. rax, rcx, rdx, r8-r11, xmm0-xmm5.

Callee-saved. rbx, rbp, rdi, rsi, r12-r15, xmm6-xmm15.

The two conventions are incompatible. Code that crosses the boundary (e.g., calling a Windows DLL from Unix-style code, or vice versa) must adapt. Cross-platform libraries often have function-pointer wrappers that translate.

03.Common Idioms and Patterns

A few idioms appear in nearly every compiled binary.

Zeroing a Register

Assembly

xor eax, eax ; rax = 0 (zeroing idiom)

The xor reg, reg form is recognized by the decoder as a zeroing idiom: it produces zero with no input dependency. Using mov rax, 0 would work but is longer (7 bytes vs. 2) and creates a dependency on the immediate.

Note: xor eax, eax zeros all 64 bits because writes to the 32-bit name zero-extend the upper 32 bits.

Negation

Assembly

neg  rax            ; rax = -rax
not  rax            ; rax = ~rax (bitwise NOT)

Absolute Value (branchless)

Assembly

mov  rdx, rax
sar  rdx, 63        ; rdx = sign-extended (-1 if negative, 0 if non-negative)
xor  rax, rdx       ; flip bits if negative
sub  rax, rdx       ; add 1 if negative

Compilers emit this trick for abs(x) on signed ints.

Branchless Min/Max

Assembly

cmp  rax, rbx
cmovg rax, rbx       ; rax = max(rax, rbx)

A conditional move avoids a branch and its potential mispredict cost.

Multiply by Constant

For multiplication by small constants, compilers often use LEA and shifts:

Assembly

; rax * 5
lea rax, [rax + rax*4]  ; rax = rax + 4*rax = 5*rax

For non-trivial divisions by constants, compilers use the Magic Number or Reciprocal Multiplication technique: replace x / c with a multiplication by a precomputed constant followed by a shift. Vastly faster than DIV.

Stack Probing

Functions with large local variables sometimes "probe" the stack — touch each page — to avoid skipping the guard page. Windows and Linux differ in conventions here.

Tail Calls

A tail call (last operation in a function is a call to another function) can be optimized to a jump:

Assembly

; Instead of: call other_func; ret
jmp  other_func

The other function returns directly to the original caller. Saves a stack frame and a return.

04.Compiler Output Walk-Through

Consider this small C function:

int sum_array(const int* a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

A modern compiler with -O2 for x86-64 SysV ABI might produce:

Assembly

sum_array:
    test    rsi, rsi              ; n == 0?
    je      .return_zero
    xor     eax, eax              ; s = 0; clear rcx for loop counter
    xor     ecx, ecx              ; i = 0
.loop:
    add     eax, [rdi + rcx*4]    ; s += a[i]
    inc     rcx
    cmp     rcx, rsi
    jb      .loop
    ret
.return_zero:
    xor     eax, eax              ; s = 0
    ret

Notice:

Argument a is in rdi, n is in rsi (SysV).
Return value is in rax (eax).
The accumulator s is in eax for narrowness; the upper 32 bits of rax are zeroed by writes to eax.
The loop is tight: 4 instructions (add, inc, cmp, jb).
-O3 or -O2 -ftree-vectorize would vectorize this with SSE/AVX, processing 4 or 8 ints per iteration.

Reading compiler output (e.g., via gcc -O2 -S -masm=intel) is one of the best ways to learn x86-64. The compiler's idioms reveal both the ISA's strengths and the optimizer's tricks.

05.Position-Independent Code

Modern executables and shared libraries are position-independent (PIC/PIE): they can be loaded at any virtual address. x86-64 makes this easy via RIP-relative addressing.

Function call (within the same module):

Assembly

call function ; encoded as 32-bit RIP-relative offset

Global variable access (within the same module):

Assembly

mov rax, [rip + global_var]

Function call (across modules, through PLT):

Assembly

call function@plt ; jumps to PLT stub, which goes via GOT to the real address

Global variable access (across modules, through GOT):

Assembly

mov rax, [rip + var@GOTPCREL]   ; load var's address from GOT
mov rbx, [rax]                  ; deref

The GOT (Global Offset Table) and PLT (Procedure Linkage Table) are runtime structures the dynamic linker fills in when loading the binary.

Pre-x86-64 (i.e., 32-bit x86) PIC was much more painful: each function had to compute its own address and use it as a base for relative references. RIP-relative addressing in x86-64 made shared libraries about 30% faster on i386 → x86-64 ports.

06.Thread-Local Storage

Per-thread variables (declared __thread in C, thread_local in C++) are allocated in a TLS region. x86-64 uses the fs or gs segment register's hidden base to address this region:

Assembly

mov rax, fs:[var@TPOFF] ; access thread-local 'var'

The OS (or thread library) sets the fs base for each thread to point at that thread's TLS area. Linux uses fs for user-mode TLS; Windows uses gs. The kernel uses gs for per-CPU data.

This is one of the few places where x86's segmentation has survived in long mode. The fs/gs base addresses are stored in MSRs (FS_BASE, GS_BASE) and can be read or written via the RDFSBASE/WRFSBASE/RDGSBASE/WRGSBASE instructions (added late in x86-64's history; previously you had to use a syscall).

07.Vector Code (Quick Glimpse)

Floating-point and SIMD have their own chapter (35), but here's a taste of how they appear in compiled code.

Scalar double-precision FP:

Assembly

movsd xmm0, qword ptr [rdi]   ; load double from [rdi] into xmm0
addsd xmm0, qword ptr [rsi]   ; add double from [rsi]
movsd qword ptr [rdx], xmm0   ; store double to [rdx]

Vectorized addition of 4 doubles:

Assembly

vmovapd ymm0, [rdi]           ; load 4 doubles (256 bits) from [rdi]
vaddpd  ymm0, ymm0, [rsi]     ; add 4 doubles from [rsi]
vmovapd [rdx], ymm0           ; store 4 doubles

Compilers auto-vectorize loops when alignment, dependence, and size allow. Programmers can use intrinsics (_mm256_add_pd and similar) for explicit control.

08.String and REP-Prefixed Instructions

A distinctive corner of the integer ISA is the string instructions and their REP prefixes. These predate the modern era — they go back to the 8086 — but they remain in active use because modern hardware optimizes them aggressively.

The basic string instructions operate on memory addressed by rsi (source) and rdi (destination), advancing or retreating the index registers by the operand size:

MOVS{B,W,D,Q} — copy [rsi] to [rdi], advance both.
STOS{B,W,D,Q} — store al/ax/eax/rax to [rdi], advance.
LODS{B,W,D,Q} — load [rsi] into al/ax/eax/rax, advance.
CMPS{B,W,D,Q} — compare [rsi] to [rdi], set flags, advance both.
SCAS{B,W,D,Q} — compare al/ax/eax/rax to [rdi], set flags, advance.

The direction (advance vs. retreat) is governed by the direction flag DF: 0 advances forward, 1 retreats. CLD clears it; STD sets it. Most code keeps DF clear; the SysV ABI requires it on function entry.

The REP/REPE/REPNE prefixes turn each into a loop:

Assembly

    mov rcx, rdx          ; count
    rep movsb             ; copy rdx bytes from rsi to rdi

This single line copies a buffer. On a modern Intel or AMD core, fast string operations (FSO) and enhanced REP MOVSB (ERMS, plus newer FSRM — Fast Short REP MOV — and FZRM) make rep movsb competitive with hand-tuned SIMD memcpy for medium-to-large copies; the microcode dispatches to specialized hardware that streams data through wide load/store paths, automatically handling alignment and prefetching. Both glibc and the Microsoft CRT use rep movsb as the inner loop of memcpy on processors that report ERMS in CPUID.

The equivalent rep stosb is the standard memset inner loop. repne scasb searches for a byte in memory (the heart of strlen, although modern implementations use SSE/AVX comparisons that are faster on long strings). repe cmpsb is the heart of memcmp.

A related family is the string-comparison SIMD instructions of SSE 4.2: PCMPESTRI, PCMPISTRI, PCMPESTRM, PCMPISTRM. Each takes two 16-byte vectors and compares them according to a small immediate-encoded operation (find any matching byte, find any byte in a range, find a substring), returning the position of the first match in rcx or a mask in xmm0. These instructions are used in some strchr/strstr/strspn implementations and in JSON and HTML parsers; the win over byte-by-byte loops on long strings is substantial.

The broader pattern is that x86's most ancient instructions have been kept alive and made fast because the use cases (string copies, comparisons, fills) are so common that microcoding them into highly-tuned hardware paths pays off. A simple rep movsb today executes at 32 or 64 bytes per cycle on modern cores, far better than a software loop could achieve.

09.Privileged vs. Unprivileged

User-mode code (ring 3) can run nearly all of the integer, FP, and SIMD instructions described above. A few instructions are privileged:

Anything that loads or modifies control registers (cr0-cr8).
I/O instructions (IN, OUT) when the I/O privilege level is restrictive.
HLT, INVD, WBINVD, INVLPG, LGDT, LIDT, LTR, etc.
RDMSR, WRMSR (some MSRs are accessible from user mode but most are not).
CLI, STI (interrupt flag — sometimes user-accessible if I/O privilege allows).
SYSCALL, SYSRET are designed for transitions; user mode uses SYSCALL but never SYSRET.

User mode encountering a privileged instruction triggers a #GP (general protection) exception, which the OS converts to a SIGSEGV or equivalent.

10.Practical Tools

A few tools every x86-64 programmer should know:

objdump -d -M intel binary — disassemble a binary in Intel syntax.

gcc -S -O2 -masm=intel src.c — produce assembly from C.

Godbolt Compiler Explorer (godbolt.org) — interactive web tool to see compiler output for many languages and compilers. Indispensable for learning.

perf annotate (Linux) — annotate binaries with sample counts from a perf record profile, showing hot instructions.

Intel SDM Volume 2 — the canonical instruction reference. Heavy reading but the definitive source.

Felix Cloutier's x86 reference (felixcloutier.com/x86) — a more navigable web version of the SDM.

11.Summary

x86-64's programming model is a 16-register, two-operand integer ISA with rich addressing modes, RIP-relative addressing for position independence, integrated SIMD via SSE/AVX, and TSO memory ordering. Calling conventions (SysV on Unix, Microsoft on Windows) define how arguments, return values, and saved registers are passed. Common idioms — xor reg, reg for zeroing, branchless CMOVcc, LEA for arithmetic, RIP-relative for PIC — recur throughout compiled code.

Reading compiler output is the best way to internalize the ISA. Each idiom reveals a piece of how x86-64 fits together: the flag-driven branches, the implicit dependencies, the RISC-like internal structure beneath a CISC-like surface.

The next chapter steps up to the system level: paging and virtual memory, system calls and interrupts, control registers and MSRs, the boot sequence, and how the OS kernel wields all this machinery.

Book mode

	mov rax, rbx ; register to register
	mov rax, [rbx] ; load from memory
	mov [rbx], rax ; store to memory
	mov rax, 0x1234 ; immediate to register
	mov rax, [rip + label] ; PC-relative load

	push rax ; rsp -= 8; [rsp] = rax
	pop rax ; rax = [rsp]; rsp += 8

	movzx rax, byte ptr [rbx] ; load byte, zero-extend to 64 bits
	movsx rax, word ptr [rbx] ; load 16-bit, sign-extend to 64 bits

	cmp rax, rbx
	cmovl rax, rcx ; if rax < rbx (signed), rax = rcx

	add rax, rbx ; rax = rax + rbx
	sub rax, 1 ; rax = rax - 1 (or use 'dec rax')
	inc rax ; rax = rax + 1; sets ZF, SF, OF, but not CF
	neg rax ; rax = -rax

	imul rax, rbx ; rax = rax * rbx (signed, low 64 bits)
	imul rax, rbx, 7 ; rax = rbx * 7 (signed, low 64 bits)
	mul rbx ; rdx:rax = rax * rbx (unsigned, 128-bit result)

	xor rdx, rdx ; clear high half of dividend
	div rbx ; rax = rdx:rax / rbx; rdx = remainder

	xor rax, rax ; rax = 0 (a common zeroing idiom)
	and rax, 0xff ; rax = low 8 bits

	shl rax, 3 ; rax <<= 3 (multiply by 8)
	sar rax, 2 ; rax >>= 2 (signed right shift, divide by 4)

	test rax, rax ; sets ZF if rax is zero, SF based on high bit
	jz skip ; jump if rax was zero

	jmp label ; direct
	jmp rax ; indirect through register
	jmp [rax] ; indirect through memory

	call function ; push next-instr address, jump to function
	ret ; pop return address, jump to it

	mov rcx, length
	mov rdi, dest
	mov rsi, source
	rep movsb ; copy rcx bytes from rsi to rdi

	lock add [rax], 1 ; atomically increment [rax]
	lock cmpxchg [rax], rbx ; compare-and-swap

	function:
	push rbp
	mov rbp, rsp
	sub rsp, 32 ; allocate locals
	; ... body ...
	leave ; mov rsp, rbp; pop rbp
	ret

	mov rdx, rax
	sar rdx, 63 ; rdx = sign-extended (-1 if negative, 0 if non-negative)
	xor rax, rdx ; flip bits if negative
	sub rax, rdx ; add 1 if negative

	int sum_array(const int* a, size_t n) {
	int s = 0;
	for (size_t i = 0; i < n; i++)
	s += a[i];
	return s;
	}

	mov rax, [rip + var@GOTPCREL] ; load var's address from GOT
	mov rbx, [rax] ; deref

	movsd xmm0, qword ptr [rdi] ; load double from [rdi] into xmm0
	addsd xmm0, qword ptr [rsi] ; add double from [rsi]
	movsd qword ptr [rdx], xmm0 ; store double to [rdx]

	vmovapd ymm0, [rdi] ; load 4 doubles (256 bits) from [rdi]
	vaddpd ymm0, ymm0, [rsi] ; add 4 doubles from [rsi]
	vmovapd [rdx], ymm0 ; store 4 doubles

	mov rcx, rdx ; count
	rep movsb ; copy rdx bytes from rsi to rdi