Part II·Instruction Set Architecture·Chapter 13 of 62

Machine Code and Assembly

May 16, 2026·30 min read·intermediate

A processor does not, in any meaningful sense, execute "instructions" — it executes bit patterns. The previous chapters described what those bit patterns mean, but a working programmer rarely sees them in raw binary form. Between the bits the processor consumes and the C or Rust source the developer writes lies a short tower of representations: assembly language, object files, executables, and the tools that translate among them. This chapter walks through that tower from the bottom up.

The chapter has two related aims. The first is to demystify the compilation and linking pipeline, so that when something breaks (a missing symbol, a relocation overflow, an unexpected disassembly) the programmer knows what is going on. The second is to give a vocabulary for talking about machine code — mnemonics, operands, opcodes, encodings, relocations, sections — that the rest of the book and the rest of the systems-software world use heavily.

01.Mnemonics and Operands

Machine code is a sequence of bytes that the processor decodes and executes. Assembly language is a textual representation of those bytes, in which each instruction is written as a mnemonic (a short name) followed by zero or more operands. The assembler — a program — translates assembly to machine code; the disassembler reverses the process.

A simple assembly statement on AArch64:

Assembly

add x0, x1, x2

add is the mnemonic. x0, x1, x2 are operands. The assembler turns this into a 32-bit machine-code word whose bit fields encode "add register x1 to register x2 and put the result in x0."

Different ISAs have different conventions for ordering operands. RISC-V and AArch64 use destination-first order: add dest, src1, src2. x86 has two main syntaxes: AT&T syntax, used by Unix tools, with destination last (add %rbx, %rax means rax += rbx); and Intel syntax, used in Microsoft's documentation and many other places, with destination first (add rax, rbx means the same). The two syntaxes differ in punctuation as well: AT&T prefixes registers with % and immediates with $, while Intel uses bare names.

The same load in both x86 syntaxes, followed by the closest single-instruction AArch64 form:

Assembly

; Intel syntax
mov    rax, [rbx + rcx*8 + 16]

# AT&T syntax
movq   16(%rbx, %rcx, 8), %rax

# AArch64
ldr    x0, [x1, x2, lsl #3]   // x1 = base, x2 = index, scaled by 8; the extra +16 needs a separate add

The assembly statement is a human representation. It is not what the CPU executes. Two distinct assembly statements may produce the same machine code (different syntaxes for the same instruction), and a single mnemonic may map to many different encodings depending on operand types and sizes.

Mnemonic Suffixes

Many ISAs use suffixes on the mnemonic to indicate operand size or type. AArch64 uses the operand register's name to convey size: add x0, x1, x2 is a 64-bit add, add w0, w1, w2 is a 32-bit add (the lower halves of the same registers). x86 in AT&T syntax uses suffixes: addb, addw, addl, addq for byte, word (16), long (32), and quadword (64) operations. Intel syntax usually relies on operand sizes inferred from register names. RISC-V uses separate mnemonics: add (XLEN), addw (32-bit, sign-extending the result to 64 in RV64).

Floating-point and SIMD instructions add more suffixes: fadd.d (double-precision add) on RISC-V, fadd s0, s1, s2 (single-precision) on AArch64, addss/addsd (scalar single, scalar double) on x86.

Operand Types

A typical instruction's operands fall into a few categories:

Register operands: named registers (x0, r5, rax, xmm0).
Immediate operands: constant values, distinguished syntactically (e.g., #42 in AArch64, $42 in AT&T, bare 42 in Intel and RISC-V).
Memory operands: an effective-address expression, usually written in some form of brackets or parentheses ([x1, x2, lsl #3], (%rbx, %rcx, 8), 0(a0)).
Labels: symbolic references to other locations in the program. The assembler resolves them to PC-relative offsets or absolute addresses as appropriate.
Modifiers: condition codes, shifts, sign-extension specifiers, and so on (b.eq, lsl #3, sxtw).

Memory operand syntax is one of the genuinely hard things about reading assembly: it varies so much across ISAs and syntaxes that a programmer who learns one needs explicit re-orientation when reading another.

Pseudo-Instructions

Most assemblers support pseudo-instructions: shortcut mnemonics that expand into one or more real instructions. They are conveniences for the human, not new operations.

Assembly

# RISC-V pseudo-instructions and what they expand to
li     a0, 42             # → addi a0, zero, 42        (small)
li     a0, 0x12345678     # → lui a0, 0x12345; addi a0, a0, 0x678  (medium)
mv     a0, a1             # → addi a0, a1, 0
ret                       # → jalr zero, ra, 0
nop                       # → addi zero, zero, 0
neg    a0, a1             # → sub a0, zero, a1
not    a0, a1             # → xori a0, a1, -1

Pseudo-instructions hide inconveniences that arise from the underlying ISA. RISC-V has no dedicated mov, nop, or ret; the assembler synthesizes them from addi and jalr. The programmer writes the convenient form; the disassembler may show the underlying form, leading to occasional confusion when reading raw output.

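x86 and AArch64 rely on assembler-synthesized pseudo-instructions far less, because their ISAs already include the conveniences directly or, in AArch64's case, as architecturally defined aliases (the register form of mov is an alias of orr). RISC-V's deliberate minimalism is what makes pseudo-instructions so common in its assembly.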
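The li expansion shown above can be sketched in a few lines of Python. This is an illustrative model, not the algorithm of any particular assembler: the function name expand_li is hypothetical, and it assumes a 32-bit constant. The one subtle step is that addi sign-extends its 12-bit immediate, so the upper part passed to lui has to be adjusted when the low 12 bits are 0x800 or above.

```python
def expand_li(rd: str, value: int) -> list[str]:
    """Expand `li rd, value` (32-bit constant) into real RV32I instructions."""
    value &= 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    if -2048 <= signed < 2048:
        # Small case: the constant fits addi's 12-bit signed immediate.
        return [f"addi {rd}, zero, {signed}"]
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000  # addi sign-extends, so the low part becomes negative
    # Upper 20 bits, compensated for the sign of the low part.
    hi = ((value - lo) >> 12) & 0xFFFFF
    out = [f"lui {rd}, {hi:#x}"]
    if lo != 0:
        out.append(f"addi {rd}, {rd}, {lo}")
    return out
```

For 0x12345678 this yields lui a0, 0x12345 followed by addi a0, a0, 1656 (1656 is 0x678). For 0x12345FFF it yields lui a0, 0x12346 and addi a0, a0, -1, which shows the adjustment at work.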

02.Labels and Directives

Real assembly source contains more than just instructions. Labels name locations in the program; directives instruct the assembler to lay out data, switch sections, define symbols, and so on.

A simple complete assembly file (RISC-V):

Assembly

    .section .rodata
hello_msg:
    .ascii "Hello, world\n"
    .byte 0
    .section .text
    .globl  _start
_start:
    li      a7, 64           # Linux syscall number for write
    li      a0, 1            # fd = stdout
    la      a1, hello_msg    # buffer
    li      a2, 13           # length
    ecall                    # syscall
    li      a7, 93           # syscall number for exit
    li      a0, 0            # status
    ecall                    # syscall (without it, execution runs off the end of .text)

Several elements deserve explanation.

Labels are identifiers followed by a colon. They name the address at which the next data or instruction is placed; here they are hello_msg and _start. Other parts of the program (instructions, data initializers, the linker) refer to these labels by name; the assembler and linker resolve the references to actual addresses.

Directives start with a period and are interpreted by the assembler rather than encoded into machine code. .section switches the section into which subsequent bytes are placed. .ascii and .byte emit literal data. .globl marks a symbol as visible outside the current file.

Comments typically start with #, ;, or // depending on the assembler.

Sections are named regions of the output object file. The most common are .text for code, .data for initialized writable data, .rodata for read-only data, and .bss for zero-initialized data. The linker collects sections of the same name from many object files and places them in the executable's memory layout.

A few more directives appear regularly.

.align N: pad with zeros (or NOPs) until the next byte falls on an aligned boundary. Depending on the target, GNU as reads N either as a byte count or as a power of two; .balign and .p2align remove the ambiguity.
.word, .dword, .long: emit literal multi-byte integers.
.zero N: emit N zero bytes.
.equ NAME, value: define a symbol with a constant value.
.macro / .endm: define an assembly-level macro.

Different assemblers (GNU as, Microsoft MASM, NASM, and others) use different directive names, but the concepts carry over.

03.Encoding and Decoding

The assembler's central job is encoding: turning each instruction's mnemonic and operands into the right bit pattern, then emitting those bits into the output. The disassembler's job is the reverse.

Consider the AArch64 instruction add x0, x1, x2. The encoding format for the ADD (shifted register) instruction is:

Plain Text

| sf | 0 | 0 | 0 | 1 | 0 | 1 | 1 | shift | 0 |  Rm  | imm6 |  Rn  |  Rd  |
  1    1   1   1   1   1   1   1     2     1     5      6      5      5    = 32 bits

To encode add x0, x1, x2:

sf = 1 (64-bit operation, since x0/x1/x2 are 64-bit registers).
The fixed bits 0001011 (bits 30 to 24) and the single 0 at bit 21 are part of the opcode.
shift = 00 (no shift on the second operand).
Rm = 00010 (register x2, encoded as 2).
imm6 = 000000 (no shift amount).
Rn = 00001 (register x1).
Rd = 00000 (register x0).

Putting it together: 1 0001011 00 0 00010 000000 00001 00000, which is 0x8B020020 in hex.

A disassembler reading the bytes 20 00 02 8B (little-endian) reverses the process: split into fields, recognize the opcode, identify the addressing form, and emit add x0, x1, x2.
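The field-by-field procedure can be written as a small Python sketch. The function names are hypothetical, and the decoder handles only this one format; it also ignores the special naming of register 31. The field positions follow the A64 ADD (shifted register) layout, with the fixed 0 at bit 21.

```python
SHIFT_NAMES = ["lsl", "lsr", "asr"]

def encode_add_shifted(rd, rn, rm, shift=0, amount=0, is64=True):
    """Encode AArch64 ADD (shifted register) into a 32-bit word."""
    word = int(is64) << 31          # sf: 1 for x registers, 0 for w registers
    word |= 0b0001011 << 24         # fixed opcode bits 30..24
    word |= (shift & 0b11) << 22    # shift type
    # bit 21 stays 0 for this format
    word |= (rm & 0x1F) << 16
    word |= (amount & 0x3F) << 10   # imm6: shift amount
    word |= (rn & 0x1F) << 5
    word |= rd & 0x1F
    return word

def decode_add_shifted(word):
    """Decode a word known to be ADD (shifted register); reject anything else."""
    if (word >> 24) & 0x7F != 0b0001011 or (word >> 21) & 1:
        raise ValueError("not ADD (shifted register)")
    shift = (word >> 22) & 0b11
    if shift == 0b11:
        raise ValueError("reserved shift type")
    p = "x" if word >> 31 else "w"
    rd, rn, rm = word & 0x1F, (word >> 5) & 0x1F, (word >> 16) & 0x1F
    amount = (word >> 10) & 0x3F
    text = f"add {p}{rd}, {p}{rn}, {p}{rm}"
    if amount:
        text += f", {SHIFT_NAMES[shift]} #{amount}"
    return text
```

encode_add_shifted(0, 1, 2) returns 0x8B020020, and decode_add_shifted(0x8B020020) returns the string add x0, x1, x2. A table-driven assembler does the same thing with one such pattern per instruction form.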

The same logic, with different formats, applies to every instruction. A modern assembler has a table of encoding patterns for each mnemonic; encoding an instruction is a matter of selecting the right pattern based on operand types and filling in the bit fields.

For x86, with its variable-length instructions, the process is more elaborate. The assembler has to:

Pick an encoding form based on operand types (register-register, register-memory, with or without immediate, etc.).
Choose prefix bytes (operand-size override, REX, VEX, etc.) as needed.
Build the opcode bytes.
Build the ModR/M byte for register and addressing-mode fields.
Build the SIB byte if the addressing mode requires it.
Append the displacement and immediate, if any.

For most instructions, several encodings are possible. For example, mov rax, 1 can use a 7-byte form with a sign-extended 32-bit immediate or a 10-byte form with a full 64-bit immediate, and some assemblers go further and emit the 5-byte mov eax, 1, which has the same effect because writing eax zero-extends into rax. The assembler typically picks the shortest form that works.
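As a hedged illustration of encoding selection, the sketch below lists the byte sequences that can load a constant into rax and picks the shortest. The function names are hypothetical; the opcodes used are B8 (mov r32, imm32), REX.W C7 /0 (mov r/m64, sign-extended imm32), and REX.W B8 (mov r64, imm64). Whether a given assembler actually chooses the 5-byte form depends on its optimization settings.

```python
import struct

def mov_rax_imm_encodings(imm):
    """Return the candidate encodings of `mov rax, imm` as {description: bytes}."""
    forms = {}
    if 0 <= imm <= 0xFFFFFFFF:
        # mov eax, imm32: a 32-bit write zero-extends into rax.
        forms["mov eax, imm32"] = b"\xB8" + struct.pack("<I", imm)
    if -(1 << 31) <= imm < (1 << 31):
        # REX.W + C7 /0: ModR/M 0xC0 = mod 11, reg 000, rm 000 (rax).
        forms["mov r/m64, simm32"] = b"\x48\xC7\xC0" + struct.pack("<i", imm)
    if -(1 << 63) <= imm < (1 << 64):
        # REX.W + B8: full 64-bit immediate.
        forms["mov r64, imm64"] = b"\x48\xB8" + struct.pack("<Q", imm & 0xFFFFFFFFFFFFFFFF)
    return forms

def shortest_mov_rax(imm):
    """Pick the shortest encoding that represents the constant exactly."""
    return min(mov_rax_imm_encodings(imm).values(), key=len)
```

For the constant 1 the shortest choice is b8 01 00 00 00; for -1 it is the 7-byte 48 c7 c0 ff ff ff ff; for 2**32 only the 10-byte form fits.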

Endianness and Byte Order

A 32-bit instruction word is stored in memory as four bytes. The order in which those bytes appear is determined by the architecture's endianness.

In little-endian byte order, the least-significant byte appears at the lowest address. In big-endian order, the most-significant byte appears at the lowest address. The 32-bit instruction 0x8B020020 in little-endian memory is laid out as 20 00 02 8B; in big-endian it is laid out as 8B 02 00 20.

x86 is little-endian. AArch64 and RISC-V can be configured for big-endian data accesses but default to little-endian, and both always fetch instructions in little-endian order. Older mainframe and network architectures used big-endian, and a vestige remains in the network protocols (so-called network byte order is big-endian). Knowing the endianness matters when reading hex dumps of binaries; otherwise the byte sequence looks scrambled.
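The two layouts are easy to check with Python's built-in integer conversions. The helper name instruction_bytes is only for illustration.

```python
def instruction_bytes(word: int, byteorder: str) -> bytes:
    """Lay out a 32-bit instruction word in memory, lowest address first."""
    return word.to_bytes(4, byteorder)

little = instruction_bytes(0x8B020020, "little")  # bytes 20 00 02 8b
big = instruction_bytes(0x8B020020, "big")        # bytes 8b 02 00 20

# Reading a hex dump back requires knowing which order was used.
assert int.from_bytes(little, "little") == int.from_bytes(big, "big")
```

Interpreting the little-endian bytes as big-endian would give 0x2000028B, which is the scrambled look that an unexpected byte order produces in a hex dump.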

04.Object Files

The output of the assembler is not directly executable. It is an object file: a structured binary that contains the assembled instructions, the data, and metadata describing how the object will combine with others to form a final executable.

The dominant object-file format on Linux and most Unix-like systems is ELF (Executable and Linkable Format). On Windows it is PE/COFF (Portable Executable / Common Object File Format). On macOS it is Mach-O. They differ in details but share the same conceptual structure.

An ELF object file consists of:

A file header identifying the format, target architecture, and locations of other tables.
A section header table listing each section by name, type, size, and offset within the file.
The sections themselves: .text containing instructions, .data containing initialized data, .rodata containing read-only data, .bss (logically present but with no actual content in the file — it is zero-filled at load time), and metadata sections.
A symbol table (.symtab) listing each named symbol (function, variable, label) defined or referenced by the file, along with its location, size, type, and visibility.
A string table (.strtab) holding the names of symbols, sections, and other strings.
A relocation table (.rela.text, .rela.data, etc.) listing places in the sections where addresses are not yet known and have to be patched up later.

You can inspect an ELF object file with readelf or objdump:

Plain Text

$ readelf -S hello.o
There are 14 section headers, starting at offset 0x540:

Section Headers:
  [Nr] Name              Type             Address           Offset
      Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
      0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
      0000000000000034  0000000000000000  AX       0     0     4
  [ 2] .rela.text        RELA             0000000000000000  000003f8
      0000000000000048  0000000000000018   I      11     1     8
  [ 3] .data             PROGBITS         0000000000000000  00000074
      0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .bss              NOBITS           0000000000000000  00000074
      0000000000000000  0000000000000000  WA       0     0     1
  [ 5] .rodata           PROGBITS         0000000000000000  00000074
      000000000000000e  0000000000000000   A       0     0     1
  ...

The flags A (allocate), X (execute), and W (write) describe how the section is used at runtime; I (info link) tells the linker that the section's Info field holds a section index.

The structure is regular: the file is a self-describing collection of named, typed regions, each with offsets pointing into the file's bytes. Tools like ld (the linker), gdb (the debugger), objdump (the disassembler/dumper), and the dynamic loader all read this same structure.
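To show how little machinery reading this structure requires, here is a minimal sketch that parses the ELF64 file header with Python's struct module. It assumes a 64-bit little-endian file, and the sample header values at the end (machine number, section count, offsets) are hypothetical, chosen to resemble the readelf output above.

```python
import struct

# ELF64 file header, little-endian: e_ident[16], e_type, e_machine, e_version,
# e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
# e_shentsize, e_shnum, e_shstrndx (64 bytes in total).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")

def parse_elf64_header(data: bytes) -> dict:
    fields = ELF64_HEADER.unpack_from(data)
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    return {
        "type": fields[1],        # 1 = relocatable object (ET_REL)
        "machine": fields[2],     # e.g. 62 = x86-64, 183 = AArch64, 243 = RISC-V
        "shoff": fields[6],       # file offset of the section header table
        "shentsize": fields[11],  # size of one section header
        "shnum": fields[12],      # number of section headers
        "shstrndx": fields[13],   # index of the section-name string table
    }

# Hypothetical header: a relocatable object with 14 section headers at 0x540.
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
sample = ELF64_HEADER.pack(ident, 1, 62, 1, 0, 0, 0x540, 0, 64, 0, 0, 64, 14, 13)
header = parse_elf64_header(sample)
```

On a real object file the same function can be fed the file's contents, for example the bytes read from hello.o; walking the section header table from shoff is the natural next step.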

05.Relocation

When the assembler emits a call or a jmp to a label that is in another file (an external function, say), it does not know the final address of the target. It cannot encode the branch's offset, because that depends on where the linker will place the target. The assembler's solution is to emit an instruction with a placeholder offset (zero, typically) and record a relocation entry that says: "At byte offset N in this section, an address of type T pointing to symbol S needs to be patched in later."

A simple example. Suppose main.c has:

extern int helper(int);
int main(void) { return helper(7); }

After assembly, main.o contains something like:

Plain Text

  ...                                  # function prologue
  10:   mov    edi, 7                  # load argument
  15:   call   0x1a <main+0x1a>        # call to helper, but the address is unknown
  1a:   ...                            # function epilogue

The call instruction at offset 0x15 has its 4-byte displacement set to zero (or to a placeholder). The relocation table contains an entry like:

Code

Offset: 0x16
Type:   R_X86_64_PC32          (PC-relative 32-bit)
Symbol: helper
Addend: -4

This entry says: at offset 0x16 (just past the call opcode), patch in a 32-bit PC-relative displacement to the symbol helper, with an addend of $-4$. The addend is needed because the processor measures the displacement from the end of the instruction, which lies four bytes past the start of the displacement field being patched.

The linker, when it eventually resolves helper's address, computes:

$\text{displacement} = \text{address}(\text{helper}) - \text{address}(\text{call instruction}) - \text{instruction size}$

and writes it into the four bytes at 0x16. The instruction now correctly calls helper.
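The linker's arithmetic can be sketched as follows. The usual statement of the PC-relative rule is S + A - P, where S is the symbol's address, A the addend, and P the address of the field being patched; with A = -4 this equals the formula above. The function name and the concrete addresses are hypothetical.

```python
import struct

def apply_pc32(section: bytearray, section_addr: int, offset: int,
               symbol_addr: int, addend: int) -> int:
    """Patch a 32-bit PC-relative field in place and return the value written."""
    place = section_addr + offset            # P: address of the displacement field
    value = symbol_addr + addend - place     # S + A - P
    if not -(1 << 31) <= value < (1 << 31):
        raise OverflowError("relocation truncated to fit")
    struct.pack_into("<i", section, offset, value)
    return value

# Hypothetical layout: .text placed at 0x401000, helper at 0x401040.
text = bytearray(0x20)
text[0x15] = 0xE8                            # call opcode, displacement still zero
disp = apply_pc32(text, 0x401000, 0x16, 0x401040, -4)
```

Here the call ends at 0x40101a, and 0x40101a + 0x26 is 0x401040, so the patched displacement of 0x26 lands exactly on helper.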

Each architecture has its own set of relocation types, corresponding to the different forms in which addresses appear in instructions: PC-relative 8-bit, PC-relative 32-bit, absolute 64-bit, the 20-bit immediate of a RISC-V auipc, the 19- or 26-bit branch offsets of AArch64, and so on. The relocation type tells the linker exactly which bits of the instruction to overwrite and how to compute the value.

Relocations also handle data references. A global variable's address embedded in the program's data, or a function pointer in a vtable, are relocations of an absolute or PC-relative kind. The same machinery resolves them.

A common error message, "relocation truncated to fit", appears when the linker tries to write a value that does not fit in the target field. For example, an AArch64 conditional branch has a 19-bit offset counted in 4-byte instructions, so it cannot reach a target more than ±1 MiB away. If the program is large enough that a branch target is farther than that, the linker has to insert a trampoline (a small stub that uses a longer instruction sequence) or emit an error.
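The check behind that error message is a plain range test. The sketch below, with hypothetical function names, models a 19-bit signed field that counts 4-byte instructions, as AArch64 conditional branches do; the scaling by four is what turns 19 bits into a reach of 1 MiB in each direction.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement field of that width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def imm19_branch_reachable(branch_addr: int, target_addr: int) -> bool:
    """Can a branch with a 19-bit word offset at branch_addr reach target_addr?"""
    delta = target_addr - branch_addr
    # Targets must be instruction-aligned; the field stores delta / 4.
    return delta % 4 == 0 and fits_signed(delta >> 2, 19)
```

A linker that finds this test failing either redirects the branch to a nearby trampoline or reports the truncation.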

06.Linking

The linker is the program that combines one or more object files into a final executable or shared library. Its job has several distinct phases.

Symbol resolution. The linker reads each input object's symbol table and builds a single global symbol table. Each undefined symbol (a reference without a definition) must be matched to a definition somewhere; each defined symbol may not be defined in two places (or, if it is, one must be marked weak or the linker reports a conflict).

Section merging. The linker concatenates sections of the same name from all input objects, producing one .text, one .rodata, one .data, etc. The relative offsets of symbols within their sections are preserved.

Layout. The linker chooses where in the output address space each section will live. For an executable, this means assigning a virtual address to each section's start, respecting alignment requirements and the operating system's expectations.

Relocation processing. With every symbol's address now known, the linker walks every relocation entry and patches the corresponding bytes in the merged sections.

Output writing. The linker writes the final executable file, with its own header structure, program-segment table, and the merged-and-relocated sections.
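The first three phases can be modelled in a few lines. This is a toy sketch under strong simplifying assumptions: each object contributes only a .text section, alignment is ignored, and the dictionary layout used for objects is invented for the example.

```python
def toy_link(objects, base=0x400000):
    """Merge .text sections in order and assign every symbol a final address.

    Each object is a dict: {"name", "text_size", "defines": {symbol: offset},
    "refs": [symbol, ...]}.
    """
    placed, symbols = {}, {}
    addr = base
    for obj in objects:                      # section merging and layout
        placed[obj["name"]] = addr
        for sym, off in obj["defines"].items():
            if sym in symbols:
                raise ValueError(f"multiple definition of {sym}")
            symbols[sym] = addr + off        # offset within section is preserved
        addr += obj["text_size"]
    for obj in objects:                      # symbol resolution
        for sym in obj["refs"]:
            if sym not in symbols:
                raise ValueError(f"undefined reference to {sym}")
    return placed, symbols

main_o = {"name": "main.o", "text_size": 0x20, "defines": {"main": 0}, "refs": ["helper"]}
helper_o = {"name": "helper.o", "text_size": 0x10, "defines": {"helper": 0}, "refs": []}
placed, symbols = toy_link([main_o, helper_o])
```

With these inputs main lands at 0x400000 and helper at 0x400020. Relocation processing would then walk each object's relocation entries using the symbols table.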

A typical command line that produces an executable from object files:

Plain Text

$ gcc -o myprog main.o helper.o -lm

Behind the scenes, gcc runs the linker ld with arguments that include the object files, the C runtime startup (crt1.o and friends), and the C library and math library. The result is myprog, a fully linked ELF executable with all internal references resolved.

Static and Dynamic Linking

Two styles of linking exist.

Static linking copies the needed code from libraries into the executable. After linking, the executable contains everything it needs to run; it does not depend on libraries at runtime. The cost is size (the same library code may be duplicated in many executables) and updatability (a library bug fix requires re-linking every executable).

Dynamic linking records references to libraries in the executable, deferring the actual binding to runtime. The library code lives in a separate file (a shared library: .so on Linux, .dll on Windows, .dylib on macOS). When the program runs, the dynamic linker (often called the loader, named ld.so or ld-linux.so on Linux) loads the libraries, resolves the references, and patches in the addresses.

Dynamic linking saves memory (one copy of libc serves all programs), simplifies updates (replace the shared library and every program benefits), and allows plugin architectures. It also introduces complexity (versioning, symbol-resolution conflicts, runtime overhead of resolution) and security concerns (a compromised shared library affects every program using it).

Most modern systems use a hybrid: dynamic linking by default, with static linking available when self-contained executables are desirable (e.g., for distribution).

The mechanism that makes dynamic linking efficient is lazy resolution through a Procedure Linkage Table (PLT) and Global Offset Table (GOT). The PLT contains stubs that, on first call, invoke the dynamic linker to resolve the symbol; subsequent calls go directly to the resolved address. The details are operating-system-specific but the essential idea — defer until needed — is universal.
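The defer-until-needed idea can be simulated without any real loader. In this sketch, which is a model rather than how ld.so is implemented, a dictionary stands in for the GOT, the call method plays the PLT stub, and the library argument is a hypothetical stand-in for the loaded shared libraries.

```python
class LazyBinder:
    """Toy model of PLT/GOT lazy binding."""

    def __init__(self, library):
        self.library = library   # symbol name -> callable (the "shared library")
        self.got = {}            # filled in on first use
        self.resolutions = 0     # how many times the resolver ran

    def call(self, name, *args):
        # PLT stub: jump through the GOT slot if it has been filled in.
        target = self.got.get(name)
        if target is None:
            target = self._resolve(name)
        return target(*args)

    def _resolve(self, name):
        # Dynamic linker: look the symbol up once and cache it in the GOT.
        self.resolutions += 1
        target = self.library[name]  # KeyError models an unresolved symbol
        self.got[name] = target
        return target

binder = LazyBinder({"strlen": len})
first = binder.call("strlen", "hello")
second = binder.call("strlen", "hi")
```

After both calls the resolver has run only once; symbols that are never called are never looked up at all, which is where the startup-time saving comes from.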

07.Symbol Visibility, Versioning, and Weak Symbols

The simple picture of "the linker resolves a reference to a definition" hides a surprisingly elaborate set of attributes that real symbol tables carry. Each one exists to solve a specific historical problem, and each one occasionally surprises programmers who run into its consequences.

Visibility controls whether a defined symbol can be referenced from outside the shared library that contains it. The four standard visibilities defined by the ELF specification (other formats offer coarser equivalents) are default, protected, hidden, and internal. A default symbol is exported from the library and can be interposed by any other library that defines a symbol of the same name; this is the powerful but expensive default that lets a program override malloc by linking in libtcmalloc. Protected symbols are exported but cannot be interposed; references from inside the defining library always bind to the local definition. Hidden symbols are not exported at all, and internal symbols additionally promise that no pointer to them ever escapes the library. Each step from default to internal lets the compiler and linker produce smaller, faster code, because the calls become direct rather than going through the GOT and PLT.

Most projects today annotate their public API with a visibility macro and compile with -fvisibility=hidden, exposing only what is meant to be exposed. The result is faster startup, smaller GOTs, and fewer accidental ABI promises.

Linkage distinguishes strong and weak symbols. A strong definition is the normal kind: defining the same symbol strongly in two object files is an error. A weak definition is one that the linker chooses only if no strong definition is available. The classical use is to provide a default implementation that an application can override:

__attribute__((weak)) void debug_hook(void) { /* default: do nothing */ }

If the application defines its own debug_hook, the linker uses it; if not, the weak default applies. The same mechanism implements many of the C library's optional features and is essential to the way C++ inline functions and template instantiations are deduplicated across translation units.
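The strong-versus-weak rule is simple enough to state as code. The sketch below is a simplified model with hypothetical names; it follows the common convention that the first weak definition wins when no strong one exists.

```python
def resolve_definitions(definitions):
    """Pick one definition per symbol.

    definitions: iterable of (symbol, binding, origin) with binding
    "strong" or "weak". Returns {symbol: origin}.
    """
    chosen = {}
    for sym, binding, origin in definitions:
        current = chosen.get(sym)
        if current is None or (current[0] == "weak" and binding == "strong"):
            chosen[sym] = (binding, origin)   # first sighting, or strong beats weak
        elif current[0] == "strong" and binding == "strong":
            raise ValueError(f"multiple definition of {sym}")
        # Otherwise a weak definition arrived after another one: keep the earlier.
    return {sym: origin for sym, (_, origin) in chosen.items()}
```

Feeding it a weak debug_hook from a library object and a strong one from the application selects the application's version, regardless of the order in which they appear.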

Symbol versioning, an extension introduced by Sun and adopted by GNU/Linux, lets a single shared library export multiple versions of the same symbol simultaneously. A new version of glibc can introduce an incompatible new behaviour for realpath while keeping the old one available, by exporting both realpath@GLIBC_2.0 and realpath@@GLIBC_2.3 (the @@ denotes the default). Old binaries link against the old version; newly built ones pick up the new. The result is that GNU/Linux can ship a single libc.so.6 that runs binaries spanning more than two decades of releases. Other ABIs (notably musl, macOS, and Windows) take simpler paths and pay the cost in occasional incompatibility.

These mechanisms together are why objdump -T and readelf --dyn-syms produce so much more information per symbol than readelf --syms: the dynamic-symbol view encodes visibility, binding, version, and a hash chain index, all consulted on every dynamic symbol resolution.

08.Position-Independent Code, ASLR, and PIE

The linker we have described so far chose final addresses for sections at link time. The resulting executable runs only at those addresses, which is fine for a system that loads every executable at the same place but is fatal to any kind of address-space randomization or shared-library sharing.

Position-independent code (PIC) is code written so that it works correctly regardless of the address at which it is loaded. Every reference within PIC is either PC-relative (so the absolute address need not be known) or routed through a runtime-resolved indirection (the GOT for data, the PLT for functions). x86-64, AArch64, and RISC-V all provide PC-relative addressing modes that make PIC nearly as fast as fixed-position code; on older 32-bit x86 the cost was higher because the ISA had no PC-relative load.

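A shared library on ELF and Mach-O systems is normally built as PIC, so that the same in-memory copy can be mapped at different addresses in different processes without requiring per-process relocation of its code. The compiler flag is -fPIC; the linker output is the familiar .so or .dylib. Windows DLLs reach the same goal by a different route, with a preferred base address and load-time base relocations.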

Position-independent executables (PIE), an option since the late 2000s, apply the same machinery to the main executable. The output is structurally a shared library that the kernel can load at any address. The motivation is address-space layout randomization (ASLR): the kernel chooses the load addresses of the executable, the libraries, the stack, and the heap from a randomized distribution at every process start. Many memory-corruption exploits rely on knowing absolute addresses; ASLR raises the cost of crafting a working exploit substantially. The kernel can only move code that is position-independent, so a non-PIE executable keeps its main .text and data at fixed addresses even when its libraries, stack, and heap are randomized. Modern Linux distributions, all current macOS and Windows versions, and every mobile OS now ship PIE-by-default toolchains.

A related hardening measure is relocation read-only (RELRO): the dynamic linker applies relocations at startup and then marks the relocated tables read-only, so that subsequent overflow bugs cannot rewrite function pointers in them. Partial RELRO protects most relocated data but leaves the lazily bound part of the GOT writable; full RELRO also resolves every function up front so that the whole GOT can be made read-only. The combination full RELRO + PIE + stack protector + non-executable stack is the modern baseline for a hardened binary.

09.Debug Information

Machine code carries no record of the source program that produced it. To make a binary debuggable, profilable, or stack-traceable, the toolchain emits a separate stream of debug information and embeds it in the object file alongside the code.

The dominant format on Unix is DWARF. A DWARF stream describes:

The mapping from instruction addresses back to source-file lines and columns, so that a debugger can show the source line corresponding to the current PC.
The mapping from source-language variables to their storage locations — a register, an offset from the frame pointer, a constant, or a more elaborate expression — separately for every range of instructions over which the storage is valid (compiler optimization moves variables around routinely).
The types of every variable and function, in enough detail that a debugger can print structs, follow pointers, and respect the source language's type system.
A description of every function's stack frame, expressed in a small bytecode language called call-frame information (CFI), enabling stack unwinding even through optimized code without frame pointers.
The lexical scopes, inlined-function call sites, and other source-level structures that make stepping behave intuitively.

The debug data is verbose. A typical optimized binary has DWARF sections several times larger than the code itself. Tools mitigate this in several ways: objcopy --only-keep-debug copies the sections into a separate .debug file, after which strip --strip-debug removes them from the shipped binary; dwz compresses the data and gdb-add-index adds a lookup index; the newer split DWARF scheme (-gsplit-dwarf) emits the bulky parts to separate .dwo files that the linker does not consume.

Windows uses a separate format, PDB, with similar capabilities and a different on-disk layout. macOS uses dSYM bundles that wrap DWARF in a Mach-O container.

The .eh_frame section deserves a separate mention because it is the only DWARF-format data that is not discarded from a stripped binary on Linux. It contains the call-frame information needed to unwind the stack during a C++ exception throw or a pthread_cancel, and the runtime cannot do without it. Stripping .eh_frame from a C++ program produces a binary that crashes the moment an exception is thrown.

10.Static Libraries and Archives

Before the modern era of shared libraries, code reuse meant archives: a single file containing many .o files, used as a kind of source pool that the linker drew from on demand.

The canonical Unix archive is the .a file produced by the ar command. Its format is trivial — a magic header followed by a sequence of file headers and the .o payloads concatenated — and the linker treats it as a search pool: when an undefined symbol remains after processing the explicit object files, the linker scans the archive's symbol index for a member that defines the symbol and pulls just that member into the link. Other members of the archive that are not needed never appear in the output.

The consequence of this on-demand behaviour is that link order matters for archives. Suppose main.o calls foo, liba.a defines foo and references bar, and libb.a defines bar. The link line gcc main.o liba.a libb.a works: the linker pulls the member defining foo out of liba.a, sees the new undefined bar, scans libb.a, and finds it. The reverse order gcc main.o libb.a liba.a fails: libb.a is scanned before any reference to bar exists, so nothing is pulled out of it; then liba.a introduces a reference that no later archive can satisfy. The fix is either to list libb.a again at the end (libb.a liba.a libb.a) or to wrap the archives in the --start-group/--end-group linker options, which re-scan the group until no new symbols are resolved.
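The left-to-right scan can be simulated directly. The sketch below is a simplified model of a traditional Unix linker's archive handling, with an invented input representation; it returns the symbols still undefined after the link, and it assumes main.o references foo.

```python
def unresolved_after_link(inputs):
    """Process inputs left to right and return the symbols left undefined.

    inputs: list of ("object", (defines, refs)) or
            ("archive", [(defines, refs), ...]) with sets of symbol names.
    """
    defined, undefined = set(), set()

    def add(defines, refs):
        defined.update(defines)
        undefined.update(refs)
        undefined.difference_update(defined)

    for kind, payload in inputs:
        if kind == "object":
            add(*payload)                    # explicit objects are always linked
            continue
        members = list(payload)
        progress = True
        while progress:                      # rescan this archive until stable
            progress = False
            for member in list(members):
                defines, refs = member
                if undefined & set(defines): # pull only members that are needed
                    add(defines, refs)
                    members.remove(member)
                    progress = True
    return sorted(undefined)

main_o = ("object", ({"main"}, {"foo"}))
liba = ("archive", [({"foo"}, {"bar"})])
libb = ("archive", [({"bar"}, set())])
```

unresolved_after_link([main_o, liba, libb]) returns an empty list, while the order [main_o, libb, liba] leaves bar unresolved, and repeating libb at the end fixes it again.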

Thin archives (ar --thin in GNU ar, which also accepts the older T modifier) hold only references to the original .o files rather than copies, which speeds up large builds at the cost of fragility if the originals move. Whole-archive linking (-Wl,--whole-archive) tells the linker to include every member of an archive whether or not its symbols are referenced; this is sometimes needed when the archive contains code that registers itself through static constructors.

11.Link-Time Optimization

The traditional pipeline compiles each translation unit independently and asks the linker only to glue together pre-generated machine code. This wastes a great deal of optimization opportunity: the compiler cannot inline a function defined in another translation unit, cannot specialize a generic function based on its callers in another file, and cannot eliminate an unused global variable that another file might (but in fact does not) reference.

Link-time optimization (LTO) changes that. Instead of emitting machine code, the compiler with -flto writes a serialized form of its intermediate representation — LLVM bitcode for Clang, GIMPLE for GCC — into the object file. The linker, when invoked with the same flag, reads the IR back from every input, reconstructs a whole-program view, and runs the optimizer once more across the merged module. The resulting code can be substantially smaller and faster than what a per-file build produces; cross-translation-unit inlining is the single most valuable optimization that LTO enables.

The cost is link time, which can grow by an order of magnitude on large projects. ThinLTO, implemented in Clang/LLVM, is a compromise: each object file's IR is summarized into a small index, and the linker consults the indices to identify cross-module inlining opportunities, importing only the IR functions that benefit. GCC addresses the same problem differently, by partitioning the merged program and optimizing the partitions in parallel. ThinLTO scales to very large code bases (Chromium, Firefox, the Linux kernel) at modest cost.

A related feature is profile-guided optimization (PGO): instrument the binary, run a representative workload, feed the resulting profile back into the compiler. The optimizer then has accurate branch frequencies, call counts, and value distributions to guide its decisions. PGO and LTO compose, and together they typically yield the highest-performance builds available from a modern toolchain.

From the machine-code perspective, LTO and PGO are invisible: the output is still a regular ELF or Mach-O file with regular machine instructions. Their existence is part of the toolchain story rather than the runtime story, but they are common enough in performance-sensitive software that any reader of optimized assembly will encounter their fingerprints — unexpectedly inlined functions, missing weak references, hot blocks aligned and laid out adjacent to their callers — sooner rather than later.

12.Disassembly

Disassembly is the inverse of assembly: turn machine-code bytes back into mnemonic instructions. It is a routine activity for any low-level programmer, used to inspect compiler output, debug crashes, reverse-engineer binaries, and understand performance.

The fundamental challenge is that, especially on variable-length ISAs like x86, the boundaries between instructions are not marked in the bytes themselves. The disassembler has to start at a known instruction boundary and walk forward, decoding each instruction in turn. If it starts at the wrong byte (in the middle of an instruction), it produces garbage. ELF and other object formats record the start of code regions and (sometimes) the boundaries of functions, helping the disassembler align correctly.

Two common disassembly modes exist.

Linear disassembly walks the bytes of .text from start to end, decoding each instruction. Simple and reasonably fast. Fails when data is interspersed with code (jump tables, embedded constants), because the disassembler does not know to skip the data.

Recursive disassembly follows control flow. The disassembler decodes the entry point, then follows every branch, call, and jump it sees, queueing new starting points to decode from. Bytes never reached are left undecoded. More accurate but harder to implement, especially with indirect branches whose targets are not statically known.

Tools like objdump -d use linear disassembly with simple heuristics; tools like Ghidra and IDA Pro use sophisticated recursive disassembly with extensive heuristics for handling jump tables, switch statements, and obfuscated code.
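
The difference between the two modes shows up as soon as data sits between instructions. The sketch below is a toy, not a model of any real tool: the four-opcode ISA (NOP, LOADI imm8, JMP rel8, RET), the sample byte stream, and the function names are all assumptions invented for this example. The stream is a jump over two data bytes, followed by NOP and RET.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy ISA: one-byte opcodes, optional one-byte operand. */
enum { OP_NOP = 0x01, OP_LOADI = 0x02, OP_JMP = 0x03, OP_RET = 0x04 };

/* offset 0: JMP +2 | offsets 2-3: data | offset 4: NOP | offset 5: RET */
enum { SAMPLE_LEN = 6 };
const uint8_t SAMPLE[SAMPLE_LEN] = {0x03, 0x02, 0x05, 0x02, 0x01, 0x04};

/* Instruction length at code[off]; 0 if invalid or truncated. */
size_t insn_len(const uint8_t *code, size_t n, size_t off) {
    assert(off < n);
    size_t len;
    switch (code[off]) {
    case OP_NOP: case OP_RET:   len = 1; break;
    case OP_LOADI: case OP_JMP: len = 2; break;
    default: return 0;
    }
    return off + len <= n ? len : 0;
}

/* Linear sweep: decode from the first byte to the last, setting
 * is_insn[off] = 1 at every offset treated as an instruction start. */
void linear_sweep(const uint8_t *code, size_t n, uint8_t *is_insn) {
    memset(is_insn, 0, n);
    for (size_t off = 0; off < n; ) {
        size_t len = insn_len(code, n, off);
        if (len == 0) { off++; continue; }   /* skip one undecodable byte */
        is_insn[off] = 1;
        off += len;
    }
}

/* Recursive traversal: decode only what control flow can reach from entry. */
void recursive_traversal(const uint8_t *code, size_t n, size_t entry,
                         uint8_t *is_insn) {
    size_t work[64];                         /* pending start offsets */
    size_t top = 0;
    memset(is_insn, 0, n);
    work[top++] = entry;
    while (top > 0) {
        size_t off = work[--top];
        while (off < n && !is_insn[off]) {
            size_t len = insn_len(code, n, off);
            if (len == 0) break;
            is_insn[off] = 1;
            if (code[off] == OP_RET) break;  /* control flow ends here */
            if (code[off] == OP_JMP) {
                /* target is relative to the end of the jump instruction */
                ptrdiff_t target = (ptrdiff_t)(off + len) + (int8_t)code[off + 1];
                if (target >= 0 && (size_t)target < n && top < 64) {
                    work[top++] = (size_t)target;
                }
                break;                       /* unconditional: no fall-through */
            }
            off += len;
        }
    }
}
```

On SAMPLE, the linear sweep marks offsets 0, 3, and 5: it decodes the data byte at offset 3 as a LOADI, which swallows the real NOP at offset 4. The recursive traversal marks offsets 0, 4, and 5 and never touches the data. A conditional branch would push both its target and its fall-through offset onto the worklist; an indirect branch is where this simple scheme runs out of information.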

A typical disassembly:

Plain Text

$ objdump -dr main.o
main.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
  0:   55                      push   %rbp
  1:   48 89 e5                mov    %rsp,%rbp
  4:   bf 07 00 00 00          mov    $0x7,%edi
  9:   e8 00 00 00 00          call   e <main+0xe>
                        a: R_X86_64_PLT32       helper-0x4
  e:   5d                      pop    %rbp
  f:   c3                      ret

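The leftmost column is the offset from the start of the section. The middle column is the raw bytes. The right column is the decoded mnemonic and operands. The line a: R_X86_64_PLT32 helper-0x4 is the relocation entry: the four bytes starting at offset a (the call's displacement field, currently all zero) will be patched by the linker with the PC-relative distance to helper. The -0x4 addend compensates for the fact that the processor measures that distance from the end of the call instruction, four bytes past the start of the field. Until the patch is applied, the zero displacement makes the call appear to target the next instruction, which is why objdump prints e <main+0xe>.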
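
The patch itself is a small piece of arithmetic. For a 32-bit PC-relative relocation the x86-64 psABI gives the stored value as S + A - P, where S is the symbol's address, A the addend, and P the address of the field being patched (for R_X86_64_PLT32 the linker may substitute the address of a PLT entry for S). A minimal sketch, assuming the call is resolved directly to helper; the function names pcrel32 and patch_le32 are hypothetical, as are the load addresses used with them afterwards:

```c
#include <assert.h>
#include <stdint.h>

/* Value stored for a 32-bit PC-relative relocation: S + A - P. */
int32_t pcrel32(uint64_t sym_addr, int64_t addend, uint64_t place) {
    int64_t value = (int64_t)(sym_addr - place) + addend;
    /* A real linker reports a relocation overflow here instead of aborting. */
    assert(value >= INT32_MIN && value <= INT32_MAX);
    return (int32_t)value;
}

/* Store the value into the instruction's displacement field, little-endian. */
void patch_le32(uint8_t *field, int32_t value) {
    uint32_t u = (uint32_t)value;
    for (int i = 0; i < 4; i++) {
        field[i] = (uint8_t)(u >> (8 * i));
    }
}
```

Suppose the linker places main at 0x401000 and helper at 0x401020 (made-up addresses). The field sits at P = 0x40100a, so pcrel32(0x401020, -4, 0x40100a) yields 0x12, and the patched instruction reads e8 12 00 00 00. At run time the processor adds 0x12 to the address of the next instruction, 0x40100e, and lands on 0x401020. If helper precedes main the result is negative: with helper at 0x401000 and the field at 0x40101a, the displacement is -0x1e, stored as e2 ff ff ff.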

Reading disassembly is a skill. With practice, programmers can scan compiler output to verify that an optimization happened, identify where a crash occurred, or hand-tune a hot loop. Everyone working on systems software learns at least enough to read the output for their target architecture.

13.A Quick End-to-End Example

To make the whole pipeline concrete, consider a tiny C file:

C

// hello.c
int helper(int x) { return x + 1; }
int main(void) { return helper(41); }

The pipeline transforms it through several stages.

Compilation turns the C source into assembly.

Plain Text

$ gcc -S -O0 hello.c

produces hello.s with assembly that resembles:

Assembly

helper:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)
    movl    -4(%rbp), %eax
    addl    $1, %eax
    popq    %rbp
    ret
main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $41, %edi
    call    helper
    popq    %rbp
    ret

Assembly turns the assembly into an object file.

Plain Text

$ gcc -c hello.s -o hello.o

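produces hello.o, whose symbol table we can inspect (the output is abridged, and exact offsets and sizes vary with compiler version and flags):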

Plain Text

$ readelf -s hello.o
  ...
  Symbols:
        Value          Size Type    Bind   Vis      Ndx Name
  ...
        0000000000000000   16 FUNC    GLOBAL DEFAULT    1 helper
        0000000000000010   17 FUNC    GLOBAL DEFAULT    1 main

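The call helper in main still carries a relocation, even though helper is defined in the same file: helper is a global symbol, so the assembler leaves the call's displacement field for the linker to fill in once the final layout and symbol binding are known.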

Linking turns the object file into an executable.

Plain Text

$ gcc hello.o -o hello

produces hello, an ELF executable. The linker has resolved helper's address (using its position in the merged .text section), patched the call instruction's displacement, and added the C runtime startup code that calls main and handles its return value.

Execution is the kernel loading the executable, the dynamic linker loading any shared libraries (such as libc), and the program running. The CPU fetches the instruction bytes, decodes them, and performs the operations they encode.

Plain Text

$ ./hello
$ echo $?
42

The whole pipeline — source code, assembly, object file, relocation, linking, dynamic loading, execution — has run, with each stage transforming the program's representation while preserving its meaning.

14.Summary

Machine code is bytes; assembly language is a textual representation of those bytes; an assembler turns one into the other. Real programs go through several stages between source and execution: source to assembly, assembly to object files, object files through linking to executables, executables through dynamic loading to running processes. Each stage relies on well-defined formats and conventions (ELF, PE/COFF, and Mach-O for object files and executables; ABI-defined calling conventions and section names), and the tools that operate on them (assemblers, linkers, dynamic loaders, disassemblers) share that vocabulary.

The central technical mechanism that connects the stages is relocation: a record of where in a binary an address needs to be patched once it becomes known. Assemblers emit relocations; linkers consume them. Symbols carry visibility, linkage strength, and version attributes that determine how the linker resolves cross-module references and how the dynamic loader binds them at runtime. Position-independent code, used universally for shared libraries and increasingly for executables themselves, defers absolute addressing to the loader and underpins both library sharing and address-space randomization. Debug information (DWARF on Unix, including macOS, where it is typically packaged in dSYM bundles; PDB on Windows) maps machine addresses back to source lines, variable locations, and stack-frame layouts; even stripped binaries retain .eh_frame because exception unwinding cannot work without it. Static archives, link-time optimization, and profile-guided optimization extend the toolchain's reach across translation-unit boundaries. Dynamic linking defers some relocations to runtime and uses indirection through the PLT and GOT to keep the cost manageable. Disassembly, the inverse process of decoding bytes back into mnemonics, is a routine activity in systems work, supported by tools that range from simple linear decoders to elaborate recursive disassemblers.

Chapter 14 turns to the protocols that make all this code interoperate: the calling conventions and ABIs by which functions agree on how to pass arguments, return results, and share registers and stack space.
