Part II·Instruction Set Architecture·Chapter 14 of 62

Calling Conventions and ABIs

May 16, 2026·31 min read·intermediate

A function does not exist in isolation. It is called by other functions, calls others itself, and exchanges arguments and return values with them. For this to work — especially across compilation units, languages, and library boundaries — every party has to agree on the rules: where the arguments live, where the return value goes, who is responsible for saving which registers, how the stack is laid out, and how the called function's prologue and epilogue interact with the caller. These rules are collectively called the calling convention, and the broader set of binary-level rules that govern interoperation is the Application Binary Interface, or ABI.

The ABI is the contract that turns a pile of separately-compiled object files into a working program. Calling conventions are at its heart, but the ABI also covers data layout (struct alignment, padding, bit-field ordering, enum size), exception handling, name mangling, thread-local storage, dynamic linking conventions, and many other details. This chapter focuses on the parts most directly visible at the instruction level: how function calls work, register-saving discipline, stack frames, alignment, and the prologue/epilogue conventions. The deeper data-layout details that the ABI specifies appear in the final section.

01.Function Arguments

The first question any calling convention must answer is where the arguments live. Three broad strategies have been used historically, and modern ABIs combine them.

On the stack. The caller pushes arguments onto the stack before the call; the callee reads them from there. Simple and unbounded (any number of arguments fit) but slow, because the caller writes every argument to memory and the callee reads it back.

In registers. The caller places arguments in agreed-on registers; the callee reads them from there. Fast (no memory traffic for register-passed arguments) but limited by the number of registers reserved for the purpose.

Hybrid. The first few arguments go in registers; any beyond that overflow to the stack. Modern ABIs all use this style, with the cutoff chosen to balance register pressure against the cost of memory traffic.

The System V AMD64 ABI

The dominant ABI on Linux, BSD, and macOS for x86-64 is the System V AMD64 ABI. It passes the first six integer or pointer arguments in registers rdi, rsi, rdx, rcx, r8, r9 (in that order), and the first eight floating-point arguments in xmm0 through xmm7. Additional arguments overflow to the stack, pushed right-to-left so that the leftmost extra argument is at the lowest address.

int foo(int a, int b, int c, int d, int e, int f, int g, int h);
//      a → edi, b → esi, c → edx, d → ecx, e → r8d, f → r9d
//      g and h on the stack
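
The assignment rule can be sketched as a small model. This is a simplified illustration that handles only scalar arguments; the helper name sysv_assign and the 'i'/'f' encoding are hypothetical, and registers are reported by their 64-bit names (edi is the low half of rdi).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of System V AMD64 argument assignment, scalars only.
   kinds[i] is 'i' (integer or pointer) or 'f' (float or double);
   where[i] receives a register name or "stack". */
static const char *const INT_REGS[6] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
static const char *const FP_REGS[8]  = {"xmm0", "xmm1", "xmm2", "xmm3",
                                        "xmm4", "xmm5", "xmm6", "xmm7"};

void sysv_assign(const char *kinds, const char *where[])
{
    size_t next_int = 0, next_fp = 0;   /* the two classes are counted separately */

    for (size_t i = 0; kinds[i] != '\0'; i++) {
        assert(kinds[i] == 'i' || kinds[i] == 'f');
        if (kinds[i] == 'i' && next_int < 6)
            where[i] = INT_REGS[next_int++];
        else if (kinds[i] == 'f' && next_fp < 8)
            where[i] = FP_REGS[next_fp++];
        else
            where[i] = "stack";         /* registers of that class are exhausted */
    }
}
```

Calling sysv_assign("iiiiiiii", where) reproduces the foo example: the first six entries name rdi through r9 and the last two are "stack". Because integer and floating-point arguments draw from separate pools, a signature such as (int, double, int) uses rdi, xmm0, and rsi.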

The Microsoft Windows x64 ABI uses a different convention: only four integer registers (rcx, rdx, r8, r9) and four FP registers (xmm0–xmm3), with mandatory shadow space of 32 bytes on the stack reserved by the caller for the callee to spill those four register arguments if it wishes. The two conventions are incompatible; programs cross-compiling to Windows from Linux have to be rebuilt against the Windows ABI.

AArch64 (AAPCS64)

ARM's 64-bit calling convention, the AAPCS64, passes the first eight integer or pointer arguments in x0–x7, and the first eight floating-point arguments in v0–v7. Additional arguments go on the stack.

The AArch64 design with eight register-passed arguments fits its 31 general-purpose registers nicely and matches the typical function fan-in.

RISC-V

RISC-V's standard calling convention passes the first eight integer arguments in a0–a7 (registers x10–x17), and the first eight floating-point arguments in fa0–fa7. Additional arguments overflow to the stack.

The pattern across modern RISC ABIs is similar: a generous register window for the common case (a small number of arguments), with stack overflow as a safety valve.

Argument Sizes

What if an argument is larger than a register? Several possibilities exist.

Small structs and pairs. A struct that fits in two registers may be passed in two registers; the System V AMD64 ABI has elaborate rules for classifying structs into integer, floating-point, or memory categories and assigning their fields to registers accordingly. AArch64 has similar rules.

Large structs. A struct too big to fit in registers is passed by reference: the caller stores the struct on its own stack and passes a pointer to it. Or the struct is copied onto the stack as part of the argument list.

Variadic arguments. Functions like printf take a variable number of arguments. The ABI specifies how the callee can walk through them using the va_list machinery. Most ABIs pass variadic arguments with the same register-then-stack scheme as ordinary ones, plus ABI-specific additions: System V AMD64 has the caller set al to an upper bound on the number of vector registers used, while Windows x64 duplicates each register-passed floating-point argument into the corresponding integer register, because the callee's variable-argument handler does not know the type.

These details vary enough between ABIs that producing a correct calling sequence for an arbitrary signature is a non-trivial compiler task. Programmers writing inline assembly or hand-tuned routines must follow the ABI's rules to the letter.

02.Return Values

The mirror image of argument passing: where does the return value live?

The conventions are simpler.

Single integer or pointer returns in a designated register: rax on x86-64, x0 on AArch64, a0 on RISC-V.
Pair of integers or struct of two registers returns in two registers: rax/rdx on System V AMD64, x0/x1 on AArch64, a0/a1 on RISC-V.
Floating-point returns in xmm0 (System V AMD64), v0 (AArch64), or fa0 (RISC-V).
Floating-point pair returns in xmm0/xmm1 or the equivalent.
Larger struct or aggregate returns by hidden pointer: the caller allocates space for the return value and passes its address as a hidden argument (in the first argument register on System V AMD64 and RISC-V, in the dedicated register x8 on AArch64), and the callee writes the return value through that pointer. The function appears to take an extra argument that the C source never declared.

The hidden-pointer convention is responsible for many compiler-output puzzles. A function declared as struct big_thing foo(void) in source may compile to one that takes one explicit argument (the hidden return pointer) and returns nothing in the conventional sense; on System V AMD64 it merely hands the same pointer back in rax.
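
The transformation can be written out by hand. The sketch below is illustrative only: the struct contents and the names make_big and make_big_lowered are assumptions, and the second function shows roughly the shape a compiler gives the first, not its literal output.

```c
#include <assert.h>
#include <stddef.h>

struct big_thing { long v[8]; };   /* too large to return in a register pair */

/* What the source says: return the aggregate by value. */
struct big_thing make_big(long seed)
{
    struct big_thing t;
    for (int i = 0; i < 8; i++)
        t.v[i] = seed + i;
    return t;
}

/* Roughly what the ABI turns it into: the caller owns the storage and
   passes its address as an extra argument the source never declared. */
void make_big_lowered(struct big_thing *ret, long seed)
{
    assert(ret != NULL);
    for (int i = 0; i < 8; i++)
        ret->v[i] = seed + i;
}
```

A caller of make_big_lowered declares a struct big_thing local and passes its address, which mirrors what the compiler-generated caller of make_big does behind the scenes.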

03.Stack Frames

When a function is running, its local variables, its saved register values, and its administrative bookkeeping live in a region of memory called the function's stack frame. The stack itself grows downward (toward lower addresses) on every modern ABI, so a new frame is allocated by subtracting the frame size from the stack pointer.

A typical stack frame contains:

The return address (placed there by the call instruction on x86 or saved from the link register on RISC).
The saved frame pointer of the caller (optional, but commonly used).
Saved values of any callee-saved registers the function intends to use.
Space for the function's local variables.
A small spill area for register spills the compiler may need.
Any outgoing arguments that overflow to the stack for calls this function makes.

The arrangement varies by ABI but the rough picture is universal:

Figure: Stack frame layout: caller frame, overflow arguments, return address, saved frame pointer, callee-saved registers, locals, and outgoing arguments, with rbp and rsp marked

LaTeX

\begin{tikzpicture}[font=\small, line cap=round]
  % Origin (0,0) at top-left. High addresses at top, low addresses at bottom.
  \node[anchor=east] at (-0.3, -0.3) {high addresses};
  \node[anchor=east] at (-0.3, -6.6) {low addresses};
  \draw[thick] (0, -0.6) rectangle (6, 0);   \node at (3, -0.3) {caller's frame};
  \draw[thick] (0, -1.2) rectangle (6, -0.6); \node at (3, -0.9) {...};
  \draw[thick] (0, -1.8) rectangle (6, -1.2); \node at (3, -1.5) {argument N (overflow from caller)};
  \draw[thick] (0, -2.4) rectangle (6, -1.8); \node at (3, -2.1) {argument 7};
  \draw[thick] (0, -3.0) rectangle (6, -2.4); \node at (3, -2.7) {return address (pushed by call on x86)};
  \draw[thick] (0, -3.6) rectangle (6, -3.0); \node at (3, -3.3) {saved rbp of caller};
  \draw[thick] (0, -4.2) rectangle (6, -3.6); \node at (3, -3.9) {callee-saved registers};
  \draw[thick] (0, -4.8) rectangle (6, -4.2); \node at (3, -4.5) {...};
  \draw[thick] (0, -5.4) rectangle (6, -4.8); \node at (3, -5.1) {local variables};
  \draw[thick] (0, -6.0) rectangle (6, -5.4); \node at (3, -5.7) {spill area};
  \draw[thick] (0, -6.6) rectangle (6, -6.0); \node at (3, -6.3) {outgoing arguments to nested calls};
  \node[anchor=west] at (6.2, -3.3) {$\leftarrow$ frame pointer (rbp)};
  \node[anchor=west] at (6.2, -6.3) {$\leftarrow$ stack pointer (rsp)};
\end{tikzpicture}

On x86-64 with a frame pointer, rbp is set up to point at the saved rbp slot, and locals are addressed at negative offsets from rbp ([rbp-8], [rbp-16], etc.). On AArch64, the frame pointer is x29, and locals are accessed similarly (though AArch64 ABIs commonly skip the frame pointer for leaf functions). On RISC-V, the frame pointer is s0 (also called fp).

Frame Pointer or No Frame Pointer

A function can be compiled either with or without a frame pointer.

With a frame pointer, every function maintains rbp (or its equivalent) at a fixed offset from its locals. Stack-walking tools (debuggers, profilers) can traverse the call chain just by following the chain of saved frame pointers. The cost is one extra register reserved across the function's lifetime and a few extra instructions in the prologue and epilogue.

Without a frame pointer (the -fomit-frame-pointer mode), the function uses rsp directly and adjusts offsets for any local stack changes. The frame-pointer register is freed up for general use. Stack walking now has to consult unwind information (DWARF call-frame tables) to compute frame sizes; the walk itself is slower, but the running code pays nothing for it.

Modern compilers default to omitting the frame pointer for optimized builds. Recent OS-level work — for example, Meta and Microsoft's pushes for system-wide profiling tools — has motivated re-enabling the frame pointer in some distributions to make stack sampling cheap, but this is a tradeoff between profiling speed and runtime efficiency.
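
What a frame-pointer-based stack walker does can be modelled in a few lines. The sketch below uses synthetic records rather than a live stack; the struct and function names are hypothetical, and the two-word layout matches the x86-64 and AArch64 frame records described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record mirroring the two words at the frame pointer:
   the caller's saved frame pointer, with the return address next to it. */
struct frame_record {
    const struct frame_record *caller_fp;   /* saved rbp or x29 */
    const void *return_address;
};

/* Collects up to max return addresses, innermost first; returns the count. */
size_t walk_frames(const struct frame_record *fp, const void *out[], size_t max)
{
    size_t n = 0;

    assert(out != NULL || max == 0);
    while (fp != NULL && n < max) {
        out[n++] = fp->return_address;
        fp = fp->caller_fp;                 /* hop to the caller's record */
    }
    return n;
}
```

Each step is two loads, which is why sampling profilers favour this scheme; the DWARF alternative has to look up and interpret a table entry for every frame.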

Red Zone

The System V AMD64 ABI defines a 128-byte red zone below the current stack pointer. Leaf functions (those that do not call other functions and do not handle signals) may use this region for local variables without bothering to decrement rsp. Signal handlers and the kernel are required to leave this region untouched.

The red zone saves a stack-pointer adjustment in tiny leaf functions. It is absent from ABIs that need the kernel and signal handlers to be free to clobber the area below rsp (e.g., the Windows x64 ABI has no red zone).

04.Caller-Saved and Callee-Saved Registers

A function call sits at a point in time when many registers may hold values that the caller cares about. After the call returns, the caller wants those values still there. But the callee will need registers of its own to do its work. Who is responsible for preserving what?

Every ABI partitions the registers into two classes.

Caller-saved (or "scratch") registers. The caller is responsible for saving these if it cares about their value across the call. The callee may freely clobber them. From the callee's point of view: free to use, no obligation to preserve.

Callee-saved (or "preserved") registers. The callee is responsible for preserving these. If the callee uses one, it must save the original value somewhere (typically on its stack frame) at the start and restore it at the end. From the caller's point of view: guaranteed to survive the call.

The partition is a tradeoff. If too many registers are caller-saved, callers do a lot of saving and restoring around every call; if too many are callee-saved, callees do a lot of saving and restoring even when the caller did not have anything in those registers. In practice, ABIs split roughly in half.

System V AMD64 ABI Partition

Class	Registers
Caller-saved	`rax`, `rcx`, `rdx`, `rdi`, `rsi`, `r8`, `r9`, `r10`, `r11`, all `xmm0`–`xmm15`
Callee-saved	`rbx`, `rbp`, `r12`, `r13`, `r14`, `r15`, `rsp`

The argument registers (rdi, rsi, rdx, rcx, r8, r9) are caller-saved, which makes sense: they are by definition holding values the caller computed for the call, not values it expects to have after.

AArch64 (AAPCS64) Partition

Class	Registers
Caller-saved	`x0`–`x18`, `v0`–`v7`, `v16`–`v31`
Callee-saved	`x19`–`x28`, `x29` (FP), `v8`–`v15` (low 64 bits)
Special	`x30` (link register, overwritten by every call), `sp`

Note the partial preservation of v8–v15: only the low 64 bits are preserved across calls, even though the registers are 128 bits wide. The high half is caller-saved.

RISC-V Partition

Class	Registers
Caller-saved (temporaries)	`t0`–`t6`, `a0`–`a7`, `ft0`–`ft11`, `fa0`–`fa7`
Callee-saved (saved)	`s0`–`s11`, `fs0`–`fs11`
Special	`ra` (return address, caller-saved by convention), `sp`, `gp`, `tp`

RISC-V's register naming is unusually pedagogical: registers literally called s are saved (callee-saved) and registers called t are temporaries (caller-saved).

The compiler chooses which registers a function uses based on this partition. If a function only does a small amount of work that fits in a few registers, it prefers caller-saved (free to use, no save/restore overhead). If a function holds a value across nested calls, it has to use either a callee-saved register (and pay the prologue/epilogue cost once) or repeatedly spill and reload a caller-saved one (cheaper per use but more total memory traffic). Compilers make this tradeoff as part of register allocation.
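
The tradeoff can be made concrete with a toy cost model. This is an assumption for illustration, not any real compiler's heuristic: it counts only memory operations and assumes a caller-saved value is spilled and reloaded around every call it must survive.

```c
#include <assert.h>

enum reg_class { CALLER_SAVED, CALLEE_SAVED };

/* Toy model: memory operations needed to keep one value live
   across `calls` nested calls. */
unsigned memory_ops(enum reg_class cls, unsigned calls)
{
    if (cls == CALLEE_SAVED)
        return 2;          /* one save in the prologue, one restore in the epilogue */
    return 2 * calls;      /* one spill and one reload around every call */
}

/* Picks the class with fewer memory operations; ties go to caller-saved. */
enum reg_class cheaper_class(unsigned calls)
{
    return memory_ops(CALLER_SAVED, calls) <= memory_ops(CALLEE_SAVED, calls)
               ? CALLER_SAVED
               : CALLEE_SAVED;
}
```

Under this model a value that crosses no calls is free in a caller-saved register, while one that crosses three calls costs six memory operations there against two in a callee-saved register.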

05.Stack Alignment

Modern ABIs require the stack pointer to be aligned at function entry. The required alignment varies but is typically larger than the natural word size, to accommodate vector loads and stores.

System V AMD64: rsp must be 16-byte aligned just before a call instruction. Since call itself pushes 8 bytes (the return address), the callee on entry sees rsp at 16k+8. A prologue that goes on to make calls therefore moves rsp by a total of 16n+8 bytes, counting both its pushes and its sub, to restore 16-byte alignment.
AArch64: sp must be 16-byte aligned whenever it is used as the base address of a memory access.
RISC-V: sp must be 16-byte aligned (with the exception of the ILP32E embedded ABI, which requires only 4-byte alignment).

Misaligned stack at function entry is a real source of bugs, especially in hand-written assembly that calls C functions. A common symptom: a function that uses SSE or NEON loads with explicit alignment requirements faults the first time it accesses the stack. The fix is to ensure the prologue establishes proper alignment.

The compiler handles alignment automatically when it generates prologues. But programmers writing inline assembly or naked functions must do it themselves.
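
For System V AMD64 the arithmetic can be worked through as follows. The helper name is hypothetical; it assumes rsp is at 16k+8 on entry and that every push is 8 bytes.

```c
#include <assert.h>

/* Hypothetical helper: bytes to subtract from rsp after `pushes` 8-byte
   pushes so that `locals` bytes fit and rsp is 16-byte aligned again. */
unsigned sysv_sub_amount(unsigned pushes, unsigned locals)
{
    unsigned used    = 8 + 8 * pushes;          /* return address plus pushed registers */
    unsigned total   = used + locals;
    unsigned aligned = (total + 15u) & ~15u;    /* round up to a multiple of 16 */

    assert(aligned >= total);
    return aligned - used;
}
```

With three pushes and 32 bytes of locals the result is 32, since 8 + 24 + 32 = 64 is already a multiple of 16. With only two pushes the same locals need sub rsp, 40, because 8 + 16 + 32 = 56 has to be rounded up to 64.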

06.Procedure Prologues and Epilogues

The prologue of a function is the small block of instructions at its entry that sets up its stack frame. The epilogue is the block at its exit that tears the frame down. Together, they enforce the calling convention's rules and make the function safe to call.

A typical full prologue on x86-64 (System V):

Assembly

foo:
    push    rbp                  ; save caller's frame pointer
    mov     rbp, rsp             ; establish our frame pointer
    push    rbx                  ; save the callee-saved registers we'll use
    push    r12
    sub     rsp, 32              ; allocate space for locals
    ; ... function body ...

And the matching epilogue:

Assembly

    add     rsp, 32              ; deallocate locals
    pop     r12                  ; restore callee-saved registers (in reverse)
    pop     rbx
    pop     rbp                  ; restore caller's frame pointer
    ret                          ; pop return address into rip

A simpler prologue, when no callee-saved registers are needed and no locals are needed:

Assembly

foo:
    ; (nothing)
    ; ... function body ...
    ret

Such a function is a leaf with no stack frame at all. Its only stack-related action is the implicit push from the caller's call, undone by the matching ret.

On AArch64, the prologue tends to be more elaborate because the link register has to be saved explicitly when the function makes any calls of its own:

Assembly

foo:
    stp     x29, x30, [sp, #-32]!    ; save x29 (FP) and x30 (LR), pre-decrement sp by 32
    mov     x29, sp                  ; establish frame pointer
    stp     x19, x20, [sp, #16]      ; save callee-saved registers
    ; ... function body ...
    ldp     x19, x20, [sp, #16]      ; restore callee-saved registers
    ldp     x29, x30, [sp], #32      ; restore x29, x30 and post-increment sp
    ret                              ; branch to x30

The STP/LDP instructions save and restore pairs of registers — an AArch64 idiom that makes prologues compact.

A leaf AArch64 function avoids saving the link register, since it makes no calls and x30 is unmodified:

Assembly

foo:
    ; ... function body using only caller-saved regs ...
    ret

On RISC-V, a function prologue looks similar to AArch64's but with explicit addi sp, sp, -N and individual sd instructions for each saved register:

Assembly

foo:
    addi    sp, sp, -32
    sd      ra, 24(sp)               # save return address
    sd      s0, 16(sp)               # save s0 (frame pointer)
    addi    s0, sp, 32               # establish frame pointer
    sd      s1, 8(sp)                # save callee-saved register
    # ... function body ...
    ld      s1, 8(sp)
    ld      s0, 16(sp)
    ld      ra, 24(sp)
    addi    sp, sp, 32
    ret

The structure is the same across ISAs: allocate stack, save what needs saving, do the work, restore, deallocate, return. The exact instruction sequences differ but the pattern is universal.

Tail Calls

A tail call is a call that is the last action of the calling function. Because the caller has nothing left to do after the call returns, the callee can reuse the caller's stack frame entirely. The compiler emits a jmp (rather than a call) to the target, after first cleaning up the caller's frame and arranging for arguments to be in the right places.

Tail-call optimization is essential to functional languages that express loops as recursion, and it is a common micro-optimization in performance-sensitive C and Rust code as well. Most calling conventions tolerate tail calls; the compiler ensures that the callee's prologue and the caller's already-completed setup are compatible.
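
The difference between a call in tail position and one that is not shows up clearly in a pair of C functions. The names are illustrative, and whether the jump is actually emitted depends on the compiler and optimization level; C does not guarantee it.

```c
#include <assert.h>

/* The recursive call is the last action, so a compiler that performs
   tail-call optimization can turn it into a jump that reuses this frame. */
unsigned long sum_to(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);   /* tail position: nothing left to do afterwards */
}

/* Not a tail call: the addition happens after the callee returns,
   so this frame has to stay alive across the call. */
unsigned long sum_to_plain(unsigned long n)
{
    if (n == 0)
        return 0;
    return n + sum_to_plain(n - 1);
}
```

Both compute the same sum, but only the first has a recursive call that can be compiled to a jump as written, which keeps stack use constant regardless of n.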

07.Variadic Functions

A few functions — printf, scanf, execl, open (in some signatures) — take a variable number of arguments of unknown types. The ABI has to specify how the callee can walk through whatever the caller provided, and the answer is one of the more complicated parts of any modern calling convention.

The core problem is that on a register-passing ABI, the callee cannot, in general, distinguish a register that holds an argument from a register that holds garbage. The C standard's va_list/va_arg machinery solves the problem with cooperation from the compiler: the prologue of a variadic function spills all the argument-passing registers to a known location (the register save area), and va_arg returns successive values either from that area or from the stack overflow region as the type indicates.

The System V AMD64 ABI is representative of the complexity. A variadic function's va_list is a small struct containing four fields:

typedef struct {
    unsigned int gp_offset;         // byte offset of the next general-purpose argument in reg_save_area
    unsigned int fp_offset;         // byte offset of the next floating-point argument in reg_save_area
    void        *overflow_arg_area; // pointer to the next stack-passed (overflow) argument
    void        *reg_save_area;     // pointer to the register save area in the variadic function's own frame
} va_list[1];

The prologue of a variadic function spills the six integer argument registers (rdi, rsi, rdx, rcx, r8, r9; six 8-byte slots) and xmm0–xmm7 (eight 16-byte slots) into a 176-byte register save area on its own stack, sets up the va_list to point at it, and proceeds. va_arg of an integer or pointer type reads from reg_save_area + gp_offset and advances the offset by 8; once gp_offset reaches 48, subsequent reads come from overflow_arg_area. Floating-point arguments use the parallel fp_offset machinery, which runs from 48 to 176 in steps of 16. The whole arrangement is intricate enough that programmers rarely look at the generated code; the compiler emits it correctly and the standard library's vprintf and friends rely on it.
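
From the source side, all of this hides behind the standard <stdarg.h> macros. The function below is a hypothetical example: the tag string is how the callee learns each argument's type, playing the role a format string plays for printf.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example: sums one value per character in `kinds`,
   where 'i' means the next argument is an int and 'd' a double. */
double sum_tagged(const char *kinds, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, kinds);
    for (const char *k = kinds; *k != '\0'; k++) {
        assert(*k == 'i' || *k == 'd');
        if (*k == 'i')
            total += va_arg(ap, int);      /* integer path (gp_offset on System V AMD64) */
        else
            total += va_arg(ap, double);   /* floating-point path (fp_offset on System V AMD64) */
    }
    va_end(ap);
    return total;
}
```

A call such as sum_tagged("iid", 1, 2, 0.5) mixes both register classes. On System V AMD64, passing eight int values after the tag string exhausts the integer argument registers, so the later va_arg reads are served from the stack overflow area without any change to the source.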

AArch64's AAPCS64 uses a similar structure with three pointers and two offsets. RISC-V's variadic ABI is the simplest of the three: variadic arguments, floating-point ones included, are passed in the integer registers a0–a7 with overflow on the stack, and the callee spills the argument registers to a contiguous region whether it ends up using them or not. Microsoft's x64 variadic ABI requires every floating-point argument passed in a register to be duplicated into the corresponding integer register, so that the callee does not need to know the type when reading variadic arguments; this is a major simplification at the cost of an extra register move per floating-point argument.

A practical trap is that variadic functions are not always interchangeable with non-variadic ones at the call site. A call made without a correct prototype in scope may omit what the variadic convention requires (on System V AMD64, for example, setting al before the call) or place floating-point arguments where the callee does not look for them; this is one of the reasons C++ rejects unprototyped function declarations entirely.

08.Position-Independent Code, GOT, and PLT

The ABI also dictates how a function reaches non-local data and code. Calling another function in the same translation unit is a direct PC-relative branch and presents no problem; calling a function in another shared library, or accessing a global variable that may live in another library, requires runtime indirection.

The Global Offset Table (GOT) is a per-shared-object table of pointers to global data and functions resolved by the dynamic linker at load time (or lazily on first use). When a PIC function wants to access an external global, it loads the address from the appropriate GOT slot and dereferences. The GOT slot itself is referenced PC-relative, so the access works regardless of where the library is loaded.

The Procedure Linkage Table (PLT) plays the analogous role for function calls. A call to an external function is in fact a PC-relative call to the function's PLT stub, which is a small piece of code that loads the resolved address from the GOT and jumps to it. The first time the stub runs, the GOT slot points back into the stub itself, at the instruction just after the indirect jump; that code pushes a relocation index and enters the dynamic linker, which resolves the symbol, patches the GOT slot with the real address, and jumps to the function. Subsequent calls go through the patched slot straight to the target.

A simplified x86-64 PLT entry looks like:

Assembly

func@plt:
    jmp     [rip + func@GOTPCREL]   ; first call: jumps into the resolver stub
                                    ; later calls: jumps to the resolved function
    push    n                       ; relocation index
    jmp     plt0                    ; common resolver entry

Lazy binding has historically been the default, because resolving every symbol at startup made shared-library load slow. Modern hardened builds prefer eager binding with the RELRO protection (-z relro -z now), which resolves all symbols at startup and then marks the GOT read-only, removing it as an attack surface.

The PIC machinery is invisible to source code but very visible in disassembly. A typical optimized C function call to a libc routine compiles to a single PC-relative call func@PLT; the PLT entry, the GOT slot, the relocations that wire them up, and the dynamic linker's resolver all sit in the background.
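
Lazy binding can be imitated in plain C with a function pointer that patches itself on first use. This is a toy model under stated assumptions: got_slot stands in for a GOT entry, square_plt for its PLT stub, and every name here is hypothetical.

```c
#include <assert.h>

/* The "real" function that lives in another library. */
static int square_impl(int x) { return x * x; }

static int resolver(int x);
static int (*got_slot)(int) = resolver;   /* initially points at the resolver */
static int resolve_count = 0;

/* Stand-in for the dynamic linker's resolver. */
static int resolver(int x)
{
    resolve_count++;
    got_slot = square_impl;   /* patch the slot with the real address */
    return got_slot(x);       /* then continue to the target */
}

/* Stand-in for the PLT stub: every call goes through the slot. */
int square_plt(int x)
{
    assert(got_slot != 0);
    return got_slot(x);
}

int times_resolved(void) { return resolve_count; }
```

The first call to square_plt pays for the resolution; every later call is a single indirect jump. Eager binding corresponds to filling got_slot with square_impl before the program starts, after which the slot could be made read-only.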

09.Thread-Local Storage

A C declaration __thread int counter; (or C11's thread_local) creates a variable that has a separate instance per thread. The ABI must specify how the program reaches the thread-local instance from any thread that runs the code, without any explicit thread argument.

The answer is a thread pointer, a register or special architectural mechanism that points at the current thread's TLS area. On x86-64 it is the fs segment base (on Linux the word at fs:0 holds the thread pointer itself, and the executable's TLS variables sit at negative offsets from it); on AArch64 it is the system register tpidr_el0; on RISC-V it is the GPR tp (x4). The kernel sets up the thread pointer when it creates a new thread, and every thread sees its own.

Four standard TLS access models exist, in increasing order of generality and decreasing efficiency.

Local-Exec. The variable is defined in the main executable and accessed only from the main executable. Its offset from the thread pointer is a link-time constant, so the compiler emits a fixed offset from the thread pointer; a single instruction on x86-64.

Initial-Exec. The variable is in the main executable or a library loaded at startup, but the access is from PIC code. The compiler emits a load of the offset from a GOT-like table, then an addition to the thread pointer; two instructions.

Local-Dynamic. The variable is in a library that may be loaded with dlopen after startup, but the access is from within the same library that defined it. A runtime helper resolves the library's TLS module, then an offset is added.

General-Dynamic. The fully general case: the variable is in some library, accessed from somewhere that may not know which one. A runtime helper takes a per-thread per-module index and returns the address. This is the most expensive but most flexible model.

The compiler chooses the most efficient model that the program's structure and link options permit, and the dynamic linker provides the helpers (__tls_get_addr on Linux). The end result is that counter++ from C compiles to anywhere from one instruction to a function call, depending on what the link-time and run-time environments allow.
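
In source code the choice of model is usually left to the toolchain, though GCC and Clang accept a tls_model attribute to request one explicitly. The sketch below assumes one of those compilers; the variable and function names are illustrative.

```c
#include <assert.h>

/* One instance per thread; the compiler and linker pick the access model. */
static _Thread_local int counter;

/* GCC/Clang extension: request a specific model. "initial-exec" is only
   appropriate when the defining object is loaded at program startup. */
static _Thread_local int fast_counter
    __attribute__((tls_model("initial-exec")));

/* Each function touches only the calling thread's copy. */
int bump(void)      { return ++counter; }
int bump_fast(void) { return ++fast_counter; }
```

Compiling this with and without -fPIC and inspecting the disassembly of bump is a quick way to see the access sequence change from a thread-pointer-relative access to a call to __tls_get_addr.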

The TLS area itself is a per-thread block allocated at thread creation, containing a copy of every TLS variable defined in every loaded shared object, plus thread-control fields used by the runtime (thread ID, errno location, stack guard, cancellation state). Its layout is also part of the ABI; getting it wrong means that a program built against one libc cannot run with another.

10.Stack Unwinding and Exception Handling

Calls and returns are simple when execution flows linearly through them. They become much more complicated when control jumps out of a function unexpectedly — a C++ exception thrown several frames deep, a Rust panic!, a pthread_cancel, a longjmp. Each of these has to walk back up the stack, run any required cleanup at each frame, and resume execution somewhere known.

The machinery that does this is stack unwinding, and the ABI specifies it carefully because it has to work across compilers and across libraries.

The data structure at the heart of unwinding is the call-frame information (CFI): a description, for every range of code in the program, of how to recover the previous frame's state — how to find the caller's PC, how to find the saved registers, how big this frame is. The CFI is encoded in DWARF format and stored in the .eh_frame section of every Linux ELF binary; even stripped binaries retain it because the C++ runtime cannot work without it. Microsoft uses a different format (.pdata/.xdata) but the essential information is the same.

A C++ throw works as follows. The throw runtime allocates the exception object and then walks the stack in two phases. In the search phase, it starts from the current PC, consults the unwind tables to identify each function on the call chain, and asks whether that function has a try block whose catch matches the thrown type; nothing is modified, and if no frame matches, the runtime calls std::terminate. In the cleanup phase, it walks the same frames again, this time running the cleanup actions each function has registered (destructors of local objects, for example) and computing the previous frame's state from the unwind data, until it reaches the frame found in the first phase and resumes execution at the catch.

The unwind tables also encode the language-specific data area (LSDA), a small data structure produced per function that lists its try ranges, its catch types, and the addresses of its cleanup landing pads. The runtime is largely language-agnostic; the LSDA tells it what to do at the language level.
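
The Itanium C++ ABI used on Linux performs this walk in two passes: a search pass that only looks for a handler, then a cleanup pass that runs destructors up to it. A toy model, with each frame reduced to the two facts the runtime would read from its LSDA (all names hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for what the unwinder learns from one frame's LSDA. */
struct toy_frame {
    int catches_type;   /* type id this frame can catch, or -1 for none */
    int cleanups_run;   /* incremented when the frame's cleanup runs */
};

/* frames[0] is the innermost (throwing) frame. Returns the index of the
   frame that catches `type`, or -1 if no handler exists. */
int toy_throw(struct toy_frame frames[], size_t count, int type)
{
    size_t handler = count;

    assert(frames != NULL || count == 0);

    /* Phase 1: search outward for a matching handler, modifying nothing. */
    for (size_t i = 0; i < count; i++) {
        if (frames[i].catches_type == type) {
            handler = i;
            break;
        }
    }
    if (handler == count)
        return -1;      /* a real C++ runtime would call std::terminate here */

    /* Phase 2: run cleanups in every frame between the throw and the handler. */
    for (size_t i = 0; i < handler; i++)
        frames[i].cleanups_run++;

    return (int)handler;
}
```

Searching first means that an uncaught exception leaves the stack untouched, so a debugger or core dump still shows the state at the point of the throw.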

A function compiled with -fno-exceptions produces no LSDA but still produces CFI, because CFI is also used for stack tracing in debuggers and profilers. The combination has the practical effect that even pure-C programs need their toolchain to emit unwind information; modern build systems do so by default.

The ABI's promise is that an exception thrown in one library can propagate cleanly through frames belonging to another, even if they were compiled by different versions of the same compiler. The cost is the entire DWARF infrastructure described above, embedded in every binary on a modern Linux system.

11.Stack Protection and Hardening

A running stack is a juicy target for memory-corruption attacks: a buffer overflow in a local array can overwrite the saved return address, redirecting the function's ret to attacker-controlled code. Modern ABIs include several mechanisms specifically to defend against this class of attack, and each leaves its fingerprint in the prologue and epilogue.

Stack canaries, introduced by StackGuard and now ubiquitous, have the prologue place a small random value (the canary) in each protected frame, between the local variables and the saved frame pointer and return address, so that a linear overflow out of a local buffer must overwrite the canary before it reaches the return address. The epilogue checks that the canary is intact before returning; if a buffer overflow has overwritten it, the function calls __stack_chk_fail and the program aborts. The compiler flag -fstack-protector-strong selects the heuristic for which functions get a canary (typically those with stack arrays or address-taken locals); -fstack-protector-all applies it everywhere, at modest cost. On x86-64 Linux the canary is loaded from a TLS slot that the runtime sets up at process startup; some other targets read it from a global variable, __stack_chk_guard.

Non-executable stack (NX, XD, or XN bit on the page table) marks the stack pages as non-executable, so that a buffer overflow that injects shellcode onto the stack cannot then jump to it. ROP attacks (return-oriented programming) work around this by reusing existing code gadgets, but the bar is raised significantly. The ELF marker .note.GNU-stack records the executable's preference.

Shadow stacks, recently introduced as Intel's CET (Control-flow Enforcement Technology) and ARM's GCS (Guarded Control Stack), maintain a parallel stack that holds only return addresses, kept in memory the program cannot ordinarily write to. Every call pushes the return address to both the regular stack and the shadow stack; every ret checks that they match. Buffer-overflow attacks that overwrite the regular stack's saved PC are caught at the next return. The hardware support is present on recent x86-64 (Tiger Lake and later) and AArch64 (FEAT_GCS) chips, and major operating systems are gradually enabling it for system binaries.

Pointer authentication (AArch64 PAC) cryptographically signs return addresses and other pointers using a per-process secret key, with the signature embedded in the unused high bits of 64-bit pointers. The paciasp instruction in the function prologue signs the return address before saving it; autiasp in the epilogue authenticates and strips the signature before the return. Forging a pointer is computationally infeasible without the key. The mechanism imposes low overhead and has been adopted enthusiastically by Apple, whose system binaries on Apple silicon are built with PAC enabled.

Branch Target Identification (AArch64 BTI) and Intel's IBT (Indirect Branch Tracking) require indirect branches to land on instructions specifically marked as valid targets; landing anywhere else raises a fault. Combined with shadow stacks and PAC, the result is a strong reduction in the available attack surface for control-flow hijacks.

None of these mechanisms is invisible to the ABI. A function with PAC has a different prologue from one without; a binary compiled for shadow stacks uses different call/ret semantics; a stack canary requires a TLS slot the loader has to set up. Different platforms and distributions enable different combinations as defaults, and it is part of the ABI's job to document which.
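
The canary check described above can be modelled with a struct that lays out a toy frame in address order. This is an illustration only: a real canary is planted and checked by compiler-generated prologue and epilogue code, and the names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy frame, lowest address first: buffer, then canary, then return address. */
struct toy_stack {
    char     buf[16];
    uint64_t canary;
    uint64_t return_address;
};

/* Copies len bytes to the start of buf with no check against buf's own size,
   then performs the epilogue check. Returns 1 if the canary survived. */
int copy_and_check(struct toy_stack *f, const void *src, size_t len,
                   uint64_t expected_canary)
{
    assert(f != NULL && len <= sizeof *f);    /* stay inside the toy frame */
    f->canary = expected_canary;              /* prologue: plant the canary */
    memcpy((unsigned char *)f + offsetof(struct toy_stack, buf), src, len);
    return f->canary == expected_canary;      /* epilogue: verify before returning */
}
```

A copy of 16 bytes or fewer leaves the canary intact. A 24-byte copy runs over it and is detected, before the bytes that would have reached return_address can do any harm.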

12.ABI-Level Data Layout

The ABI also dictates how data is laid out in memory. This is the part most visible to programmers writing C structs and trying to interoperate across languages or compilers.

Integer Sizes

The C standard does not fix the size of int, long, long long, or pointers; it leaves them to the implementation. The ABI nails them down. The two most common conventions on 64-bit systems are:

LP64 (used by Linux, BSD, macOS, AIX, and most Unix): long is 64 bits, pointers are 64 bits, int is 32 bits, long long is 64 bits.
LLP64 (used by Windows): long long is 64 bits, pointers are 64 bits, int and long are both 32 bits.

This divergence — Windows keeping long as 32 bits — is a frequent source of portability bugs in C code that assumed long would scale with the pointer size. Modern code prefers fixed-width types like int64_t from <stdint.h> to avoid the issue.
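
A program can inspect which data model it was compiled under. The sketch below is a minimal illustration; the DataModel struct and the function names are hypothetical, and it recognizes only the two models named above.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Widths, in bits, of the types the data models are named after. */
typedef struct {
    int int_bits, long_bits, long_long_bits, pointer_bits;
} DataModel;

typedef enum { MODEL_LP64, MODEL_LLP64, MODEL_OTHER } ModelKind;

/* Measure the widths the current compiler and ABI actually use. */
DataModel probe_data_model(void) {
    DataModel m = {
        (int)(sizeof(int) * CHAR_BIT),
        (int)(sizeof(long) * CHAR_BIT),
        (int)(sizeof(long long) * CHAR_BIT),
        (int)(sizeof(void *) * CHAR_BIT),
    };
    return m;
}

/* Classify a set of widths as LP64, LLP64, or something else. */
ModelKind classify_data_model(DataModel m) {
    if (m.int_bits == 32 && m.long_long_bits == 64 && m.pointer_bits == 64) {
        if (m.long_bits == 64) return MODEL_LP64;
        if (m.long_bits == 32) return MODEL_LLP64;
    }
    return MODEL_OTHER;
}

/* Fixed-width types do not vary with the data model. */
_Static_assert(sizeof(int64_t) * CHAR_BIT == 64, "int64_t is 64 bits everywhere");
```

Calling classify_data_model(probe_data_model()) reports the model of the current build. Code that stores a pointer-sized integer in a long works under LP64 and silently truncates under LLP64, which is the bug that int64_t and intptr_t avoid.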

Struct Layout

A C struct is laid out in declared order, with each field at the lowest address that satisfies its alignment requirement. Padding bytes are inserted before fields that need higher alignment than the current offset provides, and trailing padding is inserted to round the struct's size up to a multiple of the alignment of its strictest member.

struct Example {
    char  a;       // 1 byte at offset 0
    // 3 bytes of padding at offsets 1-3
    int   b;       // 4 bytes at offset 4
    char  c;       // 1 byte at offset 8
    // 7 bytes of padding at offsets 9-15
    long  d;       // 8 bytes at offset 16
};
// total size: 24 bytes; alignment: 8 (LP64; under LLP64, long is 4 bytes, d sits at offset 12, and the total is 16)

The ABI specifies the natural alignment of each primitive type (typically equal to its size) and the rules above. C compilers follow these rules, and Rust's repr(C), FFI bindings, and serialization code that copies structs directly all key off the same layout.

A struct's alignment is the strictest alignment of any member. The compiler ensures every instance of the struct, whether on the stack, in static storage, or on the heap, is allocated at an address satisfying that alignment.
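
The layout rules reduce to a short algorithm: round the running offset up to each field's alignment, place the field, then round the final size up to the strictest alignment seen. The following is a sketch of that procedure with hypothetical names (FieldType, StructLayout, layout_struct); it assumes power-of-two alignments and ignores bit fields and packing directives.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t size, align; } FieldType;     /* align: power of two */
typedef struct { size_t size, align; } StructLayout;

/* Round offset up to the next multiple of align (a power of two). */
size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

/* Fill offsets[i] for each field; return the total size and alignment. */
StructLayout layout_struct(const FieldType *fields, size_t n, size_t *offsets) {
    StructLayout out = { 0, 1 };
    for (size_t i = 0; i < n; i++) {
        size_t a = fields[i].align;
        assert(a != 0 && (a & (a - 1)) == 0);
        out.size = align_up(out.size, a);   /* padding before the field */
        offsets[i] = out.size;
        out.size += fields[i].size;
        if (a > out.align) out.align = a;   /* track the strictest member */
    }
    out.size = align_up(out.size, out.align);   /* trailing padding */
    return out;
}
```

Fed the four fields of struct Example as (size, align) pairs (1,1), (4,4), (1,1), (8,8), the function produces offsets 0, 4, 8, and 16, a total size of 24, and an alignment of 8, matching the annotated listing.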

#pragma pack(1) and similar directives let the programmer override the rules, packing structs without padding. This is needed for binary file formats and network protocols but produces slow access for misaligned fields and (on architectures that disallow misaligned access) outright faults.
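
The effect of packing is easy to observe by declaring the same fields twice. The struct names and the parse_packed helper below are hypothetical; the push/pop form of the pragma is accepted by GCC, Clang, and MSVC. Fixed-width types are used so the sizes do not depend on the data model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct Natural {        /* laid out by the normal ABI rules */
    uint8_t  tag;
    uint32_t length;
    uint8_t  flags;
    uint64_t id;
};

#pragma pack(push, 1)
struct Packed {         /* same fields, no padding */
    uint8_t  tag;
    uint32_t length;
    uint8_t  flags;
    uint64_t id;
};
#pragma pack(pop)

_Static_assert(sizeof(struct Packed) == 14, "1 + 4 + 1 + 8 bytes, no padding");

/* Copy a record out of a byte buffer that may sit at any address.
   Fields keep the host byte order; no endianness conversion is done. */
struct Packed parse_packed(const unsigned char *wire) {
    struct Packed p;
    memcpy(&p, wire, sizeof p);
    return p;
}
```

Accessing fields through the packed struct type is safe because the compiler knows they may be misaligned and emits suitable loads. The risk arises when code takes the address of a packed field and passes it around as an ordinary uint64_t pointer.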

Bit Fields

C bit fields — fields like unsigned int x : 3; — have implementation-defined order and packing. The ABI specifies the order (typically least-significant bit first on little-endian systems, most-significant first on big-endian) and how the compiler combines bit fields into storage units. This is one of the least-portable corners of C; programs that need defined bit layout typically use explicit bit operations rather than bit fields.
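
The explicit alternative is a pair of shift-and-mask helpers, which give the same bit layout on every compiler. This is a minimal sketch; get_bits and set_bits are hypothetical names, and bit 0 is taken to be the least-significant bit of a 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set (1 <= width <= 32). */
uint32_t low_mask(unsigned width) {
    return (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
}

/* Extract `width` bits starting at bit `shift`. */
uint32_t get_bits(uint32_t word, unsigned shift, unsigned width) {
    assert(width >= 1 && width <= 32 && shift <= 32 - width);
    return (word >> shift) & low_mask(width);
}

/* Return `word` with that field replaced by `value`;
   bits of `value` beyond `width` are discarded. */
uint32_t set_bits(uint32_t word, unsigned shift, unsigned width, uint32_t value) {
    assert(width >= 1 && width <= 32 && shift <= 32 - width);
    uint32_t mask = low_mask(width);
    return (word & ~(mask << shift)) | ((value & mask) << shift);
}
```

A 3-bit field at bit 4, for instance, is written with set_bits(word, 4, 3, value) and read back with get_bits(word, 4, 3), regardless of how the platform's compiler would have ordered an equivalent bit field.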

Enum Sizes

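The size of an enum is also ABI-specified. Some ABIs make all enums the size of int; others allow the size to depend on the values used (C++ unscoped enums without a fixed underlying type behave this way, as do C compilers run with an option such as GCC's -fshort-enums). Scoped C++ enums default to int unless the declaration names another underlying type. Mismatches between ABIs can cause subtle bugs at language boundaries.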
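
One common way to sidestep the question is to keep enum types out of exported interfaces and pass a fixed-width integer instead. The sketch below assumes a hypothetical Color enum and helper names; it shows the pattern, not any particular library's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical enum used only inside this module. */
typedef enum { COLOR_RED = 0, COLOR_GREEN = 1, COLOR_BLUE = 2 } Color;

/* The exported representation is int32_t, whose size is the same
   under every ABI. */
int32_t color_to_wire(Color c) {
    return (int32_t)c;
}

/* Validate before converting back; returns false for unknown values
   and leaves *out untouched in that case. */
bool color_from_wire(int32_t raw, Color *out) {
    assert(out);
    if (raw < COLOR_RED || raw > COLOR_BLUE) return false;
    *out = (Color)raw;
    return true;
}
```

The range check matters as much as the fixed width: a value produced by code built against a newer version of the enum may not correspond to any enumerator the receiver knows.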

Name Mangling

C++, with its overloading and namespaces, encodes type and scope information into symbol names. The encoding — name mangling — is part of the C++ ABI. The Itanium C++ ABI is used on Linux, BSD, and macOS; Microsoft Visual C++ uses its own. The two are mutually unintelligible, so C++ libraries cannot generally be linked across them.

For example, the function int foo(int, double) in namespace bar mangles to _ZN3bar3fooEid under Itanium and to ?foo@bar@@YAHHN@Z under MSVC. Tools like c++filt translate mangled names back for human consumption.

C, by contrast, does not mangle: the symbol for a C function foo is just foo (some platforms prepend an underscore, giving _foo). This simplicity is why C has remained the lingua franca of inter-language interfaces, and why C++ programmers writing code that has to be called from other languages wrap it in extern "C" blocks to suppress mangling.
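
In practice the extern "C" wrapper usually lives in a header guarded by __cplusplus, so that the same file is valid for both C and C++ compilers. The sketch below inlines such a header for a hypothetical function geometry_area; the names are illustrative only.

```c
#include <assert.h>

/* Contents of a hypothetical header, geometry.h, usable from C and C++. */
#ifdef __cplusplus
extern "C" {    /* C++ compilers: give these declarations C linkage */
#endif

int geometry_area(int width, int height);

#ifdef __cplusplus
}
#endif

/* Implementation. Whether this file is compiled as C or as C++, the
   exported symbol is the unmangled C name. */
int geometry_area(int width, int height) {
    assert(width >= 0 && height >= 0);
    return width * height;
}
```

A C compiler never defines __cplusplus and so sees only the plain declaration; a C++ compiler sees the extern "C" block and emits the C symbol name instead of an Itanium or MSVC mangled one.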

Other ABI Specifications

A complete ABI also specifies:

Exception unwinding — how to walk the stack when an exception is thrown, what handlers each frame registers, and what cleanup actions to run. On ELF platforms this is encoded in DWARF unwind tables in .eh_frame sections.
Thread-local storage — how the program accesses thread-local variables, including the layout of the TLS area and the register that holds the thread pointer (the fs segment base on x86-64, tpidr_el0 on AArch64).
Position-independent code — how a shared library's code accesses its data without absolute addresses, using PC-relative addressing or the GOT.
Atomic operation conventions — which atomic instructions are used, and what memory-ordering primitives are available.
Compiler intrinsics and built-in symbols — names like __stack_chk_fail (stack-canary failure), __cxa_atexit (C++ exit handler), or memcpy that the ABI specifies as available from the runtime.

The ABI is, in short, a long list of conventions that no single document fully captures but that every compiler, linker, and runtime must implement consistently for the resulting program to work.

13.Summary

Calling conventions are the agreement that lets functions interoperate. They specify where arguments live (in registers when small enough; on the stack when not), where return values go (in designated registers, or by hidden pointer for large aggregates), and which registers the caller versus the callee is responsible for preserving. Stack frames are the per-function regions of memory that hold locals, saved registers, and outgoing arguments; prologues and epilogues set them up and tear them down according to the convention. Stack alignment requirements (typically 16 bytes) ensure that vector instructions and other alignment-sensitive operations work. Variadic functions extend the scheme with per-ABI machinery for spilling and walking arguments of unknown count and type, and tail calls let a callee reuse the caller's frame entirely when the caller has nothing left to do.

The ABI extends these conventions into a full contract: position-independent code with the GOT and PLT for cross-library access; thread-local storage models that resolve thread-local variables through a thread pointer; stack unwinding tables (DWARF .eh_frame on ELF systems, .pdata/.xdata on Windows) that let exceptions propagate across libraries; data layout (integer sizes, struct padding, bit fields, enums); name mangling; dynamic-linking idioms. Modern ABIs additionally specify stack-hardening features (canaries, non-executable stack, shadow stacks, AArch64 PAC and BTI, x86 CET), each with its own marks on the prologue and epilogue. The ABI is the binary-level glue that makes separately-compiled code interoperate. Different platforms have different ABIs (System V vs Microsoft on x86-64, AAPCS64 on AArch64, the standard RISC-V ABI), and code compiled for one will not work with another without recompilation.

With the ISA, instruction categories, machine code, and now ABIs all in place, we have a complete picture of the contract between hardware and software at the instruction level. Chapter 15 turns to the cases where ordinary execution does not flow as expected: exceptions, interrupts, and traps, and the system-level mechanisms that handle them.
