Part I: Foundations

The Von Neumann Machine

May 16, 2026·35 min read·beginner

Part I built up the raw materials: bits, gates, building blocks, and the synchronous discipline that holds them together. This chapter takes the first step toward assembling those materials into something recognizable as a computer. We are not yet ready to design a working processor (that will occupy the next several chapters), but we can sketch the organization in which a processor sits. The framework is roughly eighty years old, and it has proved astonishingly durable. Almost every general-purpose machine you will ever meet, from a microcontroller in a thermostat to a thousand-core data-center processor, is some variation on the same theme.

The theme has a name: the von Neumann machine. Its central idea — that instructions and data should live in a single memory and that a small loop of fetch and execute drives everything — was sketched in the 1945 EDVAC report and is still the spine of modern computing. We will examine that idea, look at the most important variant of it (the Harvard architecture and its descendants), and then trace how instructions and data flow through the resulting machine over the wires that physically connect its pieces.

01. CPU, Memory, and I/O

At the highest level of organization, a computer divides into three subsystems: the processor, the memory, and the input/output. Every working machine has these three, no matter how small or large, and they always interact in roughly the same way.

The central processing unit, or CPU, is the active part. It is the piece that actually carries out instructions. Inside the CPU live the registers, the arithmetic and logic unit, the control logic, and the various specialized blocks (caches, branch predictors, and so on) that we will meet in later chapters. The defining property of the CPU is that it contains state that changes on every clock cycle and logic that decides what to do next. Everything else in the machine is, from the CPU's perspective, either a place to fetch values from or a place to send values to.

Memory is the passive bulk store. It holds both the program — the sequence of instructions the CPU will execute — and the data those instructions operate on. Memory is organized as a linear array of locations, each identified by a numeric address. Given an address, the memory will return the value stored there (a read) or replace the value with a new one supplied by the CPU (a write). Memory does no thinking of its own; it simply responds to requests. The smallest addressable unit is almost always a byte, but the CPU can typically request a halfword, word, or doubleword at a single aligned address, and the memory subsystem will deliver all the bytes at once.
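
The read/write contract described above can be sketched as a flat byte array. This is only an illustration: the class name `Memory`, the little-endian byte order, and the choice to reject misaligned accesses outright are assumptions made for the example, not properties of any particular machine.

```python
class Memory:
    """A flat, byte-addressable store that only responds to requests."""

    def __init__(self, size_bytes):
        self.cells = bytearray(size_bytes)  # every location starts at zero

    def read(self, address, width=1):
        """Return the `width`-byte value stored at an aligned address."""
        self._check(address, width)
        return int.from_bytes(self.cells[address:address + width], "little")

    def write(self, address, value, width=1):
        """Replace the `width`-byte value stored at an aligned address."""
        self._check(address, width)
        self.cells[address:address + width] = value.to_bytes(width, "little")

    def _check(self, address, width):
        if width not in (1, 2, 4, 8):            # byte, halfword, word, doubleword
            raise ValueError("width must be 1, 2, 4, or 8 bytes")
        if address % width != 0:                 # assumption: misalignment is an error
            raise ValueError("misaligned access")
        if address < 0 or address + width > len(self.cells):
            raise IndexError("address out of range")


mem = Memory(64)
mem.write(8, 0x11223344, width=4)   # one word-sized write ...
print(hex(mem.read(8, width=4)))    # ... read back as a word: 0x11223344
print(hex(mem.read(8, width=1)))    # lowest-addressed byte (little-endian): 0x44
```

The store itself makes no decisions; every action is triggered by a request that names an address and a width.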

The input/output subsystem, or I/O, is everything else. It is the bridge between the digital world inside the machine and the physical world outside. Keyboards, displays, network interfaces, disks, sensors, motors, USB devices, audio chips — all of them are I/O. From the CPU's point of view, an I/O device looks much like memory: a set of locations that can be read from or written to. The difference is that reading or writing one of those locations causes something to happen in the outside world, and the values returned by reads may change spontaneously, in response to events the CPU did not initiate.

A useful first-cut block diagram of any computer therefore looks like this:

Figure: High-level computer block diagram: a CPU above, memory and I/O below, all connected by a shared bus that carries instructions and data

LaTeX

\begin{tikzpicture}[font=\small, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2cm, minimum height=0.9cm}]
  % Origin (0,0) at top-left.
  \node[blk] (cpu) at (3, -0.5) {CPU};
  \node[blk] (mem) at (1, -3) {Memory};
  \node[blk] (io)  at (5, -3) {I/O};
  \node[font=\footnotesize] at (3, -1.7) {bus / interconnect};
  \draw (cpu.south) -- (3, -2);
  \draw (1, -2) -- (5, -2);
  \draw (mem.north) -- (1, -2);
  \draw (io.north) -- (5, -2);
\end{tikzpicture}

The CPU is the master. The memory and I/O are slaves. The bus or interconnect — which we will examine in detail later in this chapter — is the medium over which they exchange addresses, data, and control signals.

A few remarks before we move on. First, the boundary between these three subsystems is not always crisp. A modern processor chip typically integrates not only the CPU but also memory controllers, large caches, and a growing fraction of the I/O subsystem onto the same piece of silicon. The separation is conceptual rather than physical. Second, the relative speeds of the three are wildly mismatched. A CPU can execute several instructions per nanosecond. Main memory takes tens of nanoseconds to respond. A disk takes microseconds at best, and milliseconds for spinning media. A network round-trip across the planet takes hundreds of milliseconds. Almost every interesting performance technique we will study exists to paper over these speed differences in one way or another.

02. The Von Neumann Model

The von Neumann model is the specific way the three subsystems above are arranged in nearly every modern computer. Its essence can be stated in a few sentences.

There is one memory. It holds both the program and the data, in the same address space, in the same format. The CPU reads instructions from this memory and executes them. The data the instructions operate on lives in the same memory and is reached the same way. The CPU contains a small register, the program counter, that points to the address of the next instruction to fetch. After each instruction is executed, the program counter is updated — usually by simply incrementing it, but sometimes by loading a branch target — and the cycle repeats.

This sounds almost trivial, but in 1945 it was a radical proposal. Earlier machines had separate stores for programs and for data, and they often required physical reconfiguration to load a new program. The von Neumann model unifies the two stores and reduces "load a new program" to "copy bits into memory." We touched on the consequences in Chapter 1: programs become data, compilers and operating systems become possible, the machine becomes truly general-purpose.

A first sketch of a von Neumann machine looks like this:

Figure: Von Neumann machine: PC, IR, ALU, register file, and control unit inside the CPU, sharing a single memory that holds instructions and data together

LaTeX

\begin{tikzpicture}[font=\small, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=1.4cm, minimum height=0.7cm}]
  % CPU box outer rectangle
  \draw[thick] (0, 0) rectangle (6.4, -3.4);
  \node[anchor=west] at (0.1, -0.2) {CPU};
  % PC, IR, ALU
  \node[blk] (pc) at (1.2, -1) {PC};
  \node[blk] (ir) at (3.2, -1) {IR};
  \node[blk] (alu) at (5.2, -1) {ALU};
  % register file (wide)
  \node[blk, minimum width=4.6cm] (rf) at (3.2, -2.2) {register file};
  \node at (3.2, -3) {control unit};
  % Memory below
  \node[blk, minimum width=5.6cm, minimum height=1.2cm, align=center] (mem) at (3.2, -5) {Memory\\(instructions and data, mixed)};
  \draw[<->] (3.2, -3.4) -- (mem.north) node[midway, right] {address / data / control};
\end{tikzpicture}

The names will become familiar in the next chapter. For now: PC is the program counter, IR is the instruction register that holds the instruction currently being executed, ALU is the arithmetic and logic unit, and the register file is a small bank of fast storage internal to the CPU. The lines from CPU to memory carry addresses, data, and control signals.

Two quirks of the model deserve attention.

The first is that there is nothing in the bits to mark whether a given location holds an instruction or a piece of data. The CPU treats a location as an instruction when it fetches from there using the program counter, and as data when it reads from there in response to a load instruction. The same memory location can be both, at different moments. This is a feature — it makes self-modifying code, dynamic loaders, and just-in-time compilers possible — but it is also a security headache, since an attacker who can write data into a memory region the CPU will later execute can hijack the program. We will return to this in the chapters on memory protection and security.

The second quirk has come to be known as the von Neumann bottleneck. Because instructions and data share a single memory and a single path between CPU and memory, every cycle the CPU has to choose between fetching the next instruction and reading or writing data. The total bandwidth of the path limits how fast the machine can run, regardless of how clever the CPU itself is. John Backus coined the phrase in a famous 1977 lecture, and the bottleneck has motivated a great deal of architectural innovation since — caches, prefetchers, multiple memory channels, separate instruction and data paths — all aimed at keeping the CPU fed without simply making the single memory faster.

The phrase "stored-program computer" is sometimes used as a synonym for von Neumann machine. It is more precise to say that every stored-program computer descends from von Neumann's idea, but not every stored-program computer puts instructions and data in the same memory in the strictest sense. The most important variant, the Harvard architecture, is taken up later in this chapter.

03. CPU Organization Styles

The von Neumann model fixes the relationship between the CPU, the memory, and the program counter, but it leaves open one important question: where do operands live while a computation is in progress? Different answers to that question produce strikingly different machines, and the history of computer architecture is in large part the history of trying these answers in turn. Four broad styles are worth recognizing.

In an accumulator machine, the CPU has a single special register — the accumulator — that one operand of every arithmetic instruction is implicitly drawn from and the result implicitly written back to. The other operand comes from memory. An instruction like add at address 0x100 means "add the contents of address 0x100 to the accumulator and leave the result in the accumulator." Accumulator machines were the standard form of the earliest computers, including the IBM 701 and the PDP-8, because they minimized the number of bits needed to encode an instruction — the accumulator did not have to be named explicitly. Their drawback is that nontrivial expressions require constant traffic to memory to spill and reload intermediate values.

In a stack machine, operands live on a stack, and arithmetic instructions implicitly pop their operands from the top of the stack and push the result back on. An expression like $A + B \times C$ is encoded as a sequence such as push A; push B; push C; mul; add, with the operations consuming and producing stack entries as they go. Stack machines have an unusually compact instruction encoding and a particularly clean compilation path from expression trees, which is one reason they were favoured by early Pascal and Forth implementations and by the original Burroughs B5000. The Java Virtual Machine and the WebAssembly virtual machine are both stack machines at the bytecode level. As physical hardware, however, stack machines have languished: every instruction depends on the result of the previous one, which is bad for pipelining and parallel execution.
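
The push/mul/add sequence above can be traced with a few lines of code. This is a sketch only: the textual instruction format and the function name `run_stack_program` are invented for the illustration, and the variable values are arbitrary.

```python
def run_stack_program(program, variables):
    """Evaluate a list of stack-machine instructions and return the final top of stack."""
    stack = []
    for instruction in program:
        op, *args = instruction.split()
        if op == "push":
            stack.append(variables[args[0]])      # operand named explicitly
        elif op == "add":
            right, left = stack.pop(), stack.pop()  # operands are implicit
            stack.append(left + right)
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction {instruction!r}")
    return stack.pop()

program = ["push A", "push B", "push C", "mul", "add"]   # A + B * C
result = run_stack_program(program, {"A": 2, "B": 3, "C": 4})
print(result)   # 14
```

Only the push instructions name an operand. The arithmetic instructions carry no operand fields at all, which is where the compact encoding comes from.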

In a memory–memory machine, arithmetic instructions name two or three memory locations directly: add the contents of $A$ and $B$ and store at $C$ might be a single instruction. The VAX-11 of the 1970s and 1980s was the most fully developed example. Memory–memory designs reduce the program size and can hide registers from the programmer entirely, but they tie the throughput of every instruction to the speed of memory. That is catastrophic when memory is much slower than the CPU, which is the very gap that caches exist to hide.

In a load–store, or register–register, machine, arithmetic instructions operate only on values held in a bank of fast general-purpose registers. Memory is reached only by load and store instructions that move values between registers and memory. This style was pioneered by the CDC 6600 and the IBM 801 and codified by the early RISC processors of the 1980s; it is the form taken by RISC-V, ARM AArch64, and the internal core of modern x86-64 processors (whose CISC veneer is decoded into load–store-style internal operations, as we will see in Chapter 27). Load–store organization is the dominant choice today because it cleanly separates fast on-chip operations from slow memory traffic, makes pipelining straightforward, and gives the compiler explicit control over which values are kept on-chip.

The von Neumann model is silent about the choice among these styles. Any of them can be built on top of a single unified memory and a fetch–execute loop. The choice is about where the working set of a computation lives during the computation, not about how programs are stored. Each style appears again in Part III when we introduce the instruction set architecture proper; the reason to introduce them now is that it would be misleading to present the von Neumann CPU of Chapter 7 — a clear load–store register machine — as if it were the only possible shape.

04. Address Spaces and Memory Maps

The von Neumann picture has so far described memory as a single linear array of bytes indexed by an address. From the perspective of a running program, this is exactly the right abstraction; the address space is the set of all addresses the program can name, and the program reads or writes the value at any address by issuing a load or a store. What sits behind those addresses, however, is rarely a single homogeneous block of RAM. The collection of mappings from addresses to actual storage and devices is called the memory map of the system.

A typical memory map for a small bare-metal computer might look like this:

Plain Text

0x0000_0000 .. 0x0000_FFFF   boot ROM (read-only)
0x0001_0000 .. 0x000F_FFFF   reserved
0x1000_0000 .. 0x1FFF_FFFF   main DRAM
0x4000_0000 .. 0x4000_0FFF   UART registers (memory-mapped I/O)
0x4000_1000 .. 0x4000_1FFF   timer registers
0x4000_2000 .. 0x4FFF_FFFF   other devices
0xFFFF_0000 .. 0xFFFF_FFFF   high vector / debug

A load from an address in the boot-ROM range returns whatever the ROM holds at that offset. A load from the DRAM range goes through the memory controller. A load from the UART range returns the current value of one of the serial port's registers, possibly with a side effect such as clearing a receive-status bit. The CPU itself sees only addresses; the interconnect and address decoder, which we will meet later in this chapter, route each transaction to the right destination.
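
The routing step can be sketched as a table lookup. The region boundaries below are copied from the example map above; the function name `decode` and the decision to raise an error for unmapped addresses are assumptions of this sketch, since real systems differ in how they respond to accesses that hit no device.

```python
# (first address, last address, region name), taken from the example memory map.
MEMORY_MAP = [
    (0x0000_0000, 0x0000_FFFF, "boot ROM"),
    (0x0001_0000, 0x000F_FFFF, "reserved"),
    (0x1000_0000, 0x1FFF_FFFF, "main DRAM"),
    (0x4000_0000, 0x4000_0FFF, "UART registers"),
    (0x4000_1000, 0x4000_1FFF, "timer registers"),
    (0x4000_2000, 0x4FFF_FFFF, "other devices"),
    (0xFFFF_0000, 0xFFFF_FFFF, "high vector / debug"),
]

def decode(address):
    """Return (region name, offset within that region) for an address."""
    for first, last, name in MEMORY_MAP:
        if first <= address <= last:
            return name, address - first
    # Assumption: an access that matches no region is reported as an error.
    raise LookupError(f"no device at address {address:#010x}")

print(decode(0x1000_0040))   # ('main DRAM', 64)
print(decode(0x4000_0004))   # ('UART registers', 4)
```

A hardware decoder does the same comparison in parallel on the high-order address bits rather than walking a list, but the mapping it implements is the same.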

A few features of the memory map deserve naming up front, because they will recur in every later chapter.

The address space has a defined width, set by the ISA. A 32-bit ISA has a $2^{32}$-byte (4 GiB) address space; a 64-bit ISA in principle has a $2^{64}$-byte address space, although in practice processors implement only the lower 48 or 57 bits, leaving the rest reserved. Smaller microcontrollers may implement an even narrower range: 24, 20, or even 16 address bits.

Most addresses correspond to byte-addressable memory, but a load or store of a wider value (a halfword, word, or doubleword) is expected to use an aligned address — a multiple of the access size, as we discussed in Chapter 2. Accesses that violate alignment either fault or run more slowly, depending on the architecture.

Different regions of the map have different attributes: cacheable or non-cacheable, executable or non-executable, readable or writable, ordered strictly or weakly. The CPU does not assume these attributes; the operating system, through page tables we will meet in Chapter 19, attaches them to ranges of the map. A CPU executing code in a region that has been marked non-executable will trap rather than run the bytes there.

When the operating system supports multiple processes, each process sees its own address space, called a virtual address space, that is separately mapped onto the underlying physical map. The same virtual address in two different processes points at two different physical locations, except where the OS has explicitly arranged sharing. The architectural support for this — the memory management unit, the page tables, the translation lookaside buffer — is also a Chapter 19 topic. For now, the relevant fact is that the simple, flat "single memory" of the von Neumann picture is, in any real system, a structured map with at least the partitions sketched above.

The layout within a single program's address space is also conventional. A typical user-mode process on a Unix-like operating system has roughly the following arrangement:

Plain Text

low addresses
  .text       program code (instructions, read-only, executable)
  .rodata     read-only data (string literals, constants)
  .data       initialized writable data
  .bss        zero-initialized writable data
  heap        grows upward, allocated via malloc / mmap
  ...         (large gap)
  shared libs dynamically loaded code and data
  ...         (large gap)
  stack       grows downward, holds activation records
high addresses
  kernel      mapped in, but only accessible in privileged mode

We will return to this layout when we discuss calling conventions in Chapter 14, virtual memory in Chapter 19, and the boundary between user and kernel in Chapter 46. The point of mentioning it now is simply that even the smallest program lives in a structured address space, and the von Neumann assumption that "there is a memory" hides several layers of agreement between the hardware, the operating system, the compiler, and the program loader.

05. The Memory Wall

The von Neumann bottleneck, as Backus described it, was about the bandwidth of the path between CPU and memory: only one transaction at a time can squeeze through. A related but more recent concern goes by the name memory wall, and it is about the latency of that path rather than its bandwidth.

The story is one of two technologies improving at very different rates. From the mid-1980s through the early 2000s, processor clock speeds and instruction-issue rates roughly doubled every 18 months, in line with Moore's law and the architectural ideas (pipelining, superscalar execution, out-of-order issue) that we will meet in Part V. Main-memory latency, measured in nanoseconds, improved by less than a factor of two over the same period. The result is that a single main-memory access, which once took only a handful of CPU cycles, today takes several hundred. A modern processor running at 4 GHz with a memory latency of 80 nanoseconds is waiting 320 cycles per cache miss — long enough to have completed hundreds of arithmetic operations had they been available.

A simple piece of arithmetic illustrates the consequence. Suppose a program has an instruction stream in which 5% of instructions miss the last-level cache and stall for the full main-memory latency. Even with the rest of the program running at one instruction per cycle, the stall contribution alone is $0.05 \times 320 = 16$ cycles per instruction on average. The cache misses, not the instructions, dominate the runtime.
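
The same arithmetic can be written as two small functions, which makes it easy to try other clock rates, latencies, and miss rates. The function names are illustrative, and the model is the simple one used in the text: every miss stalls for the full main-memory latency and nothing overlaps.

```python
def miss_penalty_cycles(clock_ghz, latency_ns):
    """Cycles spent waiting on one main-memory access (GHz x ns = cycles)."""
    return clock_ghz * latency_ns

def effective_cpi(base_cpi, miss_fraction, penalty_cycles):
    """Average cycles per instruction once memory stalls are added in."""
    return base_cpi + miss_fraction * penalty_cycles

penalty = miss_penalty_cycles(clock_ghz=4, latency_ns=80)
cpi = effective_cpi(base_cpi=1.0, miss_fraction=0.05, penalty_cycles=penalty)
print(f"miss penalty: {penalty} cycles")          # 320 cycles
print(f"stall per instruction: {0.05 * penalty:.1f} cycles")
print(f"effective CPI: {cpi:.1f}")
```

With the figures from the text, the base cost of one cycle per instruction is a small fraction of the total; the stall term accounts for nearly all of it.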

This growing gap is the fundamental reason for the elaborate memory hierarchy that Part IV is devoted to: caches at multiple levels, prefetchers, write buffers, non-blocking caches, speculation, and out-of-order execution. Almost all of these techniques can be read as attempts to hide main memory's latency from the CPU, by issuing requests early or by finding other useful work to do during the wait. The von Neumann model was not designed with this gap in mind; the model still works, but every modern implementation of it has had to develop a sophisticated answer to the question "what does the CPU do while it waits?"

It is worth flagging here that the memory wall is not a fixed feature of physics. Specific technologies push back against it: high-bandwidth memory (HBM) raises bandwidth by stacking DRAM dies vertically next to the processor, although it does little for the latency of an individual access; non-volatile memories with new electrical properties promise persistence and density at latencies approaching DRAM's, at the cost of write endurance; near-memory and in-memory computing relocate operations into the memory itself to avoid the round trip. None of them eliminates the gap, but each one shifts the balance, and a working architect needs to understand the gap as the headline constraint that every memory technique either accepts or attacks.

06. Non-Von Neumann Architectures

For completeness, it is worth knowing that the von Neumann model is not the only possible organizing idea, even if it is by far the most important one in practice. Several rivals have been seriously developed, and although none has displaced the dominant model, each remains influential in particular niches.

Dataflow architectures abandon the program counter altogether. Instead of a sequence of instructions executed in a defined order, a program is a graph in which nodes represent operations and edges represent data dependencies. An operation fires whenever its inputs are available, regardless of any global control flow. The MIT Tagged-Token Dataflow Machine, the Manchester Dataflow Computer, and the Monsoon machine of the 1980s explored this idea seriously. Pure dataflow has remained outside the mainstream, but its influence is large: the out-of-order execution engines of every modern high-performance CPU (Chapter 25) use dataflow scheduling internally, even though their architecture remains von Neumann.
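
The firing rule can be illustrated with a toy interpreter. This is a sketch of the idea only, not a model of any of the machines named above; the graph representation (a dictionary from node name to a function and its input names) and the function name `run_dataflow` are assumptions of the example.

```python
import operator

def run_dataflow(nodes, inputs):
    """Fire every node whose inputs are available until none remain.

    nodes:  name -> (function, list of input names)
    inputs: name -> value for the graph's external inputs
    """
    values = dict(inputs)
    pending = dict(nodes)
    firing_order = []
    while pending:
        ready = [name for name, (_, deps) in pending.items()
                 if all(dep in values for dep in deps)]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        for name in ready:                        # no program counter picks the order
            function, deps = pending.pop(name)
            values[name] = function(*(values[dep] for dep in deps))
            firing_order.append(name)
    return values, firing_order

graph = {
    "sum": (operator.add, ["product", "a"]),      # listed first, but fires last
    "product": (operator.mul, ["b", "c"]),
}
values, order = run_dataflow(graph, {"a": 2, "b": 3, "c": 4})
print(values["sum"], order)   # 14 ['product', 'sum']
```

The order in which nodes are written down has no effect on the result; only the data dependencies determine when each operation runs.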

Systolic arrays, proposed by H. T. Kung in the late 1970s, organize computation as a regular grid of small processing elements through which data flows in synchronized waves. Each element does a small piece of work and passes results to its neighbours. Systolic structures are extremely efficient for regular numerical kernels such as matrix multiplication and convolution and are the architectural ancestor of modern neural-network accelerators including Google's Tensor Processing Unit.

Cellular automata and other massively parallel grid models were studied in the 1980s and 1990s as candidates for general computation. They have remained theoretical curiosities for the most part, although related ideas resurface in modern reconfigurable hardware (Part XII).

Quantum computers depart from the model in a much deeper way: the machine's state is no longer a string of classical bits but a vector of quantum amplitudes, and the basic operations are unitary transformations rather than logic gates. Quantum machines remain a research and early-commercial frontier, and they will not make further appearances in this book.

The persistence of the von Neumann model in the face of these alternatives is not an accident. The model's combination of flexibility (any program can be loaded as data), simplicity (one fetch–execute loop drives everything), and compatibility (an enormous existing software ecosystem written for it) has turned out to be very hard to beat. The non–von Neumann ideas that survive in mainstream hardware survive as accelerators attached to a von Neumann host — GPUs, tensor units, network processors — not as replacements for the host itself.

07. Harvard and Modified Harvard Architectures

The Harvard architecture, named after the Harvard Mark I relay computer of the 1940s, takes a different approach. It uses two physically separate memories: one for instructions and one for data. Each has its own address space, its own data path, and often its own width and timing. The CPU can fetch an instruction and read or write data in the same cycle, because there are two independent memory ports.

Figure: Harvard architecture: the CPU has physically separate instruction and data memories, each with its own port, so a fetch and a data access can happen in the same cycle

LaTeX

\begin{tikzpicture}[font=\small, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2cm, minimum height=0.8cm}]
  \node[blk, minimum width=6cm] (cpu) at (3, -0.5) {CPU};
  \node[blk] (im) at (1, -2.5) {I-mem};
  \node[blk] (dm) at (5, -2.5) {D-mem};
  \draw (1, -1) -- (im.north);
  \draw (5, -1) -- (dm.north);
\end{tikzpicture}

The advantage is bandwidth. With separate paths, the von Neumann bottleneck disappears: instruction fetch never competes with data access. The drawbacks are equally clear. There is now twice as much memory hardware to provide. A program cannot be written into the instruction memory by an ordinary store instruction, because stores go to the data memory; some explicit mechanism is required to load programs in the first place. And the address space the programmer sees is split, with programs and data referring to numerically overlapping but physically distinct locations.

For these reasons, pure Harvard architectures are rare in general-purpose computing. They are common, however, in embedded microcontrollers, where the program is fixed in flash memory and the data is in RAM, and where the bandwidth advantage is decisive at low clock rates. The PIC and AVR microcontroller families, ubiquitous in cheap embedded designs, are pure Harvard machines. So are many digital signal processors, where the predictable bandwidth of separate paths matches the regular access patterns of signal-processing algorithms.

The compromise that almost every modern general-purpose CPU uses is called the modified Harvard architecture. From the programmer's point of view, the machine is von Neumann: there is a single, unified address space, and a store instruction can write to any address, including ones from which the CPU will later fetch instructions. From the hardware's point of view, however, the CPU has separate first-level instruction and data caches, each with its own port. As long as the working set fits in the caches, the CPU enjoys the bandwidth of a Harvard machine, fetching an instruction and accessing data simultaneously without contention. When the caches miss, both go to the same unified main memory below them.

Figure: Modified Harvard architecture: split L1 I-cache and D-cache give the CPU two ports, while a unified main memory below the caches serves both on a miss

LaTeX

\begin{tikzpicture}[font=\small, line cap=round,
  blk/.style={draw, thick, fill=white, minimum width=2cm, minimum height=0.8cm}]
  \node[blk, minimum width=6cm] (cpu) at (3, -0.5) {CPU};
  \node[blk] (ic) at (1, -2.2) {I-cache};
  \node[blk] (dc) at (5, -2.2) {D-cache};
  \node[blk, minimum width=4cm] (mem) at (3, -4) {unified memory};
  \draw (1, -1) -- (ic.north);
  \draw (5, -1) -- (dc.north);
  \draw (ic.south) -- (1, -3.2) -- (3, -3.2) -- (mem.north);
  \draw (dc.south) -- (5, -3.2) -- (3, -3.2);
\end{tikzpicture}

This arrangement is the best of both worlds. The programmer's mental model stays simple (instructions and data live together), while the implementation gets the parallelism of separate paths where it matters most. The main place where the split shows through is in the rules for self-modifying code: on many architectures, writing new instructions into memory requires an explicit synchronization of the instruction cache, because the data store goes to the D-cache and the I-cache does not see it. ISAs with this requirement provide instructions for exactly this purpose, such as FENCE.I on RISC-V or the cache-maintenance sequence ending in isync on PowerPC.
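
The stale-instruction hazard can be shown with a toy model of split caches. The class and method names are hypothetical, the caches are reduced to dictionaries, and the model assumes the I-cache is not kept coherent with stores by hardware, which is true of some architectures and not of others.

```python
class SplitCacheMachine:
    """Toy model: one unified memory behind separate I- and D-caches."""

    def __init__(self, memory):
        self.memory = dict(memory)   # address -> value (unified main memory)
        self.icache = {}             # filled by instruction fetches
        self.dcache = {}             # written by data stores

    def fetch(self, address):
        if address not in self.icache:            # I-cache miss: fill from memory
            self.icache[address] = self.memory[address]
        return self.icache[address]

    def store(self, address, value):
        self.dcache[address] = value              # the write lands in the D-cache only

    def sync_instruction_stream(self):
        """Stand-in for an instruction-fence operation (hypothetical name)."""
        self.memory.update(self.dcache)           # write stored data back to memory
        self.icache.clear()                       # drop possibly stale instructions


machine = SplitCacheMachine({0x100: "old instruction"})
machine.fetch(0x100)                              # warms the I-cache
machine.store(0x100, "new instruction")           # self-modifying write
stale = machine.fetch(0x100)                      # I-cache still holds the old bytes
machine.sync_instruction_stream()
fresh = machine.fetch(0x100)                      # refetched after the fence
print(stale, "->", fresh)
```

Until the synchronization step runs, the fetch path and the store path disagree about what address 0x100 contains, which is exactly the situation the fence instruction exists to resolve.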

Some modern designs go further and introduce a third path for atomics or for non-cacheable I/O accesses, or split the L1 cache into multiple banks for higher bandwidth. The principle is unchanged. The modified Harvard architecture is the practical compromise that almost every CPU on your desk, in your pocket, or in a server farm follows today.

08. Instruction and Data Flow

To understand why these arrangements matter, it helps to walk through the flow of information as a simple operation executes. Consider an operation that adds two values from memory and stores the result back. In a load–store architecture this typically takes four instructions:

Plain Text

load   r1, [addr_a]    ; bring value at addr_a into register r1
load   r2, [addr_b]    ; bring value at addr_b into register r2
add    r3, r1, r2      ; r3 = r1 + r2
store  r3, [addr_c]    ; write r3 to memory at addr_c

The CPU executes this short sequence by performing several distinct kinds of memory traffic, each touching a different combination of subsystems.

The instruction fetch path is the first kind of flow. Each of the four instructions above has to be read from memory before it can be executed. The address used for the fetch comes from the program counter, and the value returned goes into the instruction register. In a von Neumann machine these fetches share the bus with data accesses; in a modified Harvard machine they go through the instruction cache.

The data fetch and data write paths handle the loads and stores. The two load instructions issue read transactions to the data memory at addresses addr_a and addr_b. The CPU waits for the values to arrive, places them in registers r1 and r2, and proceeds. The store instruction issues a write transaction to addr_c, sending the value of r3 along with the address. From the memory's point of view, an instruction fetch and a data fetch look identical; only the use the CPU makes of the returned bits is different.
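
A small interpreter can make these flows visible by logging every memory transaction the four-instruction sequence generates. Several things here are assumptions of the sketch: instructions are taken to be 4 bytes long and to start at address 0x100, the program is held in a Python list rather than in the same memory as the data (purely for brevity), and the operand values are arbitrary.

```python
def run(program, data, pc_base=0x100, instr_bytes=4):
    """Execute a load-store program and log each memory transaction."""
    regs = {}
    traffic = []
    for index, (op, *args) in enumerate(program):
        traffic.append(("fetch", pc_base + instr_bytes * index))  # instruction fetch
        if op == "load":
            reg, addr = args
            regs[reg] = data[addr]
            traffic.append(("read", addr))        # data read
        elif op == "add":
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]   # registers only, no memory traffic
        elif op == "store":
            reg, addr = args
            data[addr] = regs[reg]
            traffic.append(("write", addr))       # data write
        else:
            raise ValueError(f"unknown operation {op!r}")
    return traffic

program = [
    ("load", "r1", "addr_a"),
    ("load", "r2", "addr_b"),
    ("add", "r3", "r1", "r2"),
    ("store", "r3", "addr_c"),
]
data = {"addr_a": 20, "addr_b": 22, "addr_c": 0}
traffic = run(program, data)
for kind, where in traffic:
    print(kind, hex(where) if isinstance(where, int) else where)
print("result:", data["addr_c"])   # result: 42
```

Four of the logged transactions are instruction fetches and three are data accesses; the add contributes a fetch but no data traffic.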

A complete trace of memory traffic for the four-instruction sequence would look approximately like this:

Plain Text

cycle  bus operation               participant
-----  --------------------------  ------------------
  1    fetch instr at PC=0x100     I-side / unified
  2    read data at addr_a         D-side
  3    fetch instr at PC=0x104     I-side / unified
  4    read data at addr_b         D-side
  5    fetch instr at PC=0x108     I-side / unified
  6    (no memory access; ALU)     internal
  7    fetch instr at PC=0x10C     I-side / unified
  8    write data at addr_c        D-side

In a strict von Neumann machine, the seven memory transactions all compete for a single shared path. Even with no other traffic, the machine spends most of its cycles waiting for memory rather than doing arithmetic. In a modified Harvard machine, the I-side and D-side transactions can overlap: the data read in cycle 2 can proceed alongside the instruction fetch in cycle 3, and likewise cycles 4 and 5. The net effect, multiplied across millions of instructions per second, is enormous.
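
A rough count of path usage for this trace shows the size of the effect. The sketch assumes every transaction occupies its path for one slot, and the split-path figure is a best-case bound that ignores the dependency between an instruction's fetch and its own data access.

```python
# The trace from the table above: "I" = instruction side, "D" = data side,
# "-" = internal step with no memory access.
TRACE = [
    ("I", "fetch instr at PC=0x100"),
    ("D", "read data at addr_a"),
    ("I", "fetch instr at PC=0x104"),
    ("D", "read data at addr_b"),
    ("I", "fetch instr at PC=0x108"),
    ("-", "(no memory access; ALU)"),
    ("I", "fetch instr at PC=0x10C"),
    ("D", "write data at addr_c"),
]

def port_usage(trace):
    """Count how many transactions use the instruction side and the data side."""
    i_side = sum(1 for port, _ in trace if port == "I")
    d_side = sum(1 for port, _ in trace if port == "D")
    return i_side, d_side

i_side, d_side = port_usage(TRACE)
shared_slots = i_side + d_side       # one path: every transaction waits its turn
split_slots = max(i_side, d_side)    # two paths: limited by the busier side
print(f"shared path: {shared_slots} slots, split paths: at least {split_slots} slots")
```

On this trace the shared path needs 7 slots, while two independent paths need no fewer than 4, the number of instruction fetches.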

Several other refinements come into the picture in real machines, and we will meet them later, but it is worth previewing them here so that the rest of Part II makes sense.

The register file is the smallest and fastest tier of storage. Loads and stores move data between memory and registers; arithmetic operations like add operate on registers directly, with no memory traffic at all. Cycle 6 above is internal precisely because the addition reads r1 and r2 from the register file and writes r3 back to it without going to the bus.

The memory hierarchy — caches at one or more levels, then main memory, then storage — exists to make the average data access far faster than a trip to main memory. Locality of reference, which we will discuss in Part IV, ensures that most loads and stores are answered out of fast caches and only occasionally reach DRAM.

Pipelining, the subject of Chapter 22, allows the CPU to overlap the fetch of a later instruction with the execution of an earlier one, so that the cycles in the trace above run concurrently rather than sequentially. Out-of-order execution, in Chapter 25, goes further and allows independent instructions to complete in any order that respects their data dependencies.

For all of these refinements, however, the basic skeleton remains the von Neumann (or modified Harvard) one. There is a CPU; there is memory holding both instructions and data; there is a path between them; and there is a fetch–execute loop that drives everything. Every later chapter is, in some sense, a description of how to make this loop run faster without breaking its semantics.

09.Buses and Interconnects

The path between CPU and memory, and between CPU and I/O, is the bus or interconnect. The two terms are not strictly synonymous in modern usage, but the older term will help us start.

What a bus is

A bus is a shared set of wires that carries information between multiple devices. A typical bus has three groups of wires, called the address bus, the data bus, and the control bus.

The address bus carries the address of the location being accessed. Its width determines the size of the address space the CPU can reach. A 32-wire address bus, with each address naming one byte, reaches $2^{32}$ bytes, that is, 4 gibibytes.
The data bus carries the value being read or written. Its width is one of the defining features of the machine; an 8-wire data bus moves one byte per cycle, a 64-wire data bus moves eight.
The control bus carries the assorted signals that say what kind of transaction is occurring: read or write, ready, valid, byte enables, interrupt requests, and so on.
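The arithmetic behind these widths is short enough to write down. The helper names below are invented for illustration, and the sketch assumes a byte-addressable memory and a data bus whose width is a whole number of bytes.

```python
GIB = 2 ** 30  # one gibibyte, in bytes


def address_space_bytes(address_wires):
    """Bytes reachable when each distinct address names one byte."""
    return 2 ** address_wires


def bytes_per_transfer(data_wires):
    """Bytes moved by one transfer on a data bus of the given width."""
    if data_wires % 8 != 0:
        raise ValueError("this sketch assumes a whole number of bytes")
    return data_wires // 8


print(address_space_bytes(32) // GIB)                  # 4
print(bytes_per_transfer(8), bytes_per_transfer(64))   # 1 8
```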

A simple read transaction proceeds something like this:

Plain Text

1. CPU drives the target address onto the address bus.
2. CPU asserts the read control line.
3. Memory decodes the address, retrieves the value, and drives it onto the data bus.
4. Memory asserts a "ready" or "ack" signal.
5. CPU latches the value off the data bus.
6. Both sides release the bus for the next transaction.

Writes are the same but with the CPU driving the data bus and the memory consuming it.
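The read and write sequences can be acted out in software. The following is a minimal sketch, assuming a single CPU and a single memory on the bus and ignoring timing entirely. The classes `Bus` and `Memory` and the functions `cpu_read` and `cpu_write` are hypothetical names for the example.

```python
class Bus:
    """The three wire groups: address, data, and a few control lines."""

    def __init__(self):
        self.release()

    def release(self):
        self.address = None
        self.data = None
        self.read = False
        self.write = False
        self.ready = False


class Memory:
    def __init__(self, contents):
        self.cells = dict(contents)

    def respond(self, bus):
        if bus.read:
            bus.data = self.cells[bus.address]  # memory drives the data bus
        elif bus.write:
            self.cells[bus.address] = bus.data  # memory consumes the data bus
        else:
            return
        bus.ready = True                        # memory signals completion


def cpu_read(bus, memory, address):
    bus.address = address   # CPU drives the address bus
    bus.read = True         # CPU asserts the read control line
    memory.respond(bus)     # memory decodes, drives data, asserts ready
    assert bus.ready
    value = bus.data        # CPU latches the value
    bus.release()           # both sides release the bus
    return value


def cpu_write(bus, memory, address, value):
    bus.address = address   # CPU drives the address bus
    bus.data = value        # for a write, the CPU drives the data bus too
    bus.write = True        # CPU asserts the write control line
    memory.respond(bus)
    assert bus.ready
    bus.release()


bus = Bus()
memory = Memory({0x100: 7})
cpu_write(bus, memory, 0x104, 99)
print(cpu_read(bus, memory, 0x100), cpu_read(bus, memory, 0x104))  # 7 99
```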

The shared nature of a bus is its great strength and its great weakness. The strength is that any number of devices can attach to it without changing the basic protocol. The weakness is that only one device can drive each set of wires at a time. If two devices both want to put values on the data bus simultaneously, the bits collide; the result is electrical garbage, and possibly damaged drivers. Bus designs therefore include rules (arbitration protocols, tristate drivers that can be turned off when not active, and clear transaction phases) to ensure that at most one device drives the bus at a time.
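The one-driver rule is easy to model. In the sketch below, `HIGH_Z` stands for a tristate driver that has been turned off, `resolve` reports what the shared wires carry, and `grant` is a fixed-priority arbiter. Fixed priority is only one of several possible arbitration schemes, chosen here as an assumption because it is the shortest to write.

```python
HIGH_Z = None  # a tristate driver that is switched off drives nothing


def resolve(drivers):
    """Return the value on the shared wires, given each device's output.

    drivers maps a device name to the value it drives, or HIGH_Z.
    """
    active = {name: v for name, v in drivers.items() if v is not HIGH_Z}
    if len(active) > 1:
        raise RuntimeError(f"bus contention between {sorted(active)}")
    if not active:
        return HIGH_Z  # nobody is driving; the bus floats
    return next(iter(active.values()))


def grant(requests, priority):
    """Fixed-priority arbiter: pick the first requester in priority order."""
    for device in priority:
        if device in requests:
            return device
    return None


print(resolve({"cpu": HIGH_Z, "memory": 0x2A, "dma": HIGH_Z}))  # 42
print(grant({"dma", "cpu"}, priority=["cpu", "dma"]))           # cpu
```

Calling `resolve({"cpu": 1, "dma": 2})` raises `RuntimeError`, the software stand-in for two drivers fighting over the same wires.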

Synchronous and asynchronous buses

A synchronous bus has a clock signal among its control lines, and all transactions occur on defined cycles of that clock. The advantage is simplicity: every device knows exactly when to sample and when to drive. The disadvantage is that the clock must run at the speed of the slowest device, or the slowest device must use a "wait state" mechanism to stretch the bus cycle.

An asynchronous bus has no clock; instead, each transaction proceeds by an explicit handshake. The initiator asserts a "request" signal; the responder, when ready, asserts an "acknowledge" signal; the initiator drops "request"; the responder drops "acknowledge"; and the cycle is complete. Asynchronous buses adapt naturally to devices of different speeds but are slower per transaction because of the back-and-forth signaling.
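The four-phase handshake just described can be traced step by step. This sketch assumes that each loop iteration is one event in which a single signal may change in reaction to the other; there is no clock, only ordering. The parameter `responder_delay` is a hypothetical knob for how long the responder stays busy before acknowledging.

```python
def four_phase_handshake(responder_delay=0):
    """Return the sequence of (request, acknowledge) states for one transfer."""
    req, ack = False, False
    started = False
    busy_for = responder_delay
    trace = [(req, ack)]
    while True:
        if not started:
            req, started = True, True  # initiator asserts request
        elif req and not ack:
            if busy_for > 0:
                busy_for -= 1          # responder is not ready yet
            else:
                ack = True             # responder asserts acknowledge
        elif req and ack:
            req = False                # initiator drops request
        elif ack:
            ack = False                # responder drops acknowledge
        else:
            break                      # both low again: transfer complete
        trace.append((req, ack))
    return trace


for state in four_phase_handshake():
    print(state)
# (False, False)
# (True, False)
# (True, True)
# (False, True)
# (False, False)
```

A slow responder simply lengthens the trace: `four_phase_handshake(responder_delay=2)` holds the `(True, False)` state for two extra steps, with no change to the initiator.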

Most contemporary on-chip and chip-to-chip interfaces are synchronous, often with parameters that allow individual transactions to take different numbers of cycles. The handshakes have not gone away; they have moved into the synchronous protocol, with explicit valid and ready signals on every cycle.

From buses to interconnects

The classical bus picture — one set of shared wires, every device attached to it, one transaction at a time — does not scale. As CPUs got faster and as the number of devices grew, the bus became the bottleneck. A modern processor cannot afford to wait for a single shared path when it has multiple cores, multiple memory channels, multiple I/O controllers, and a graphics processor all wanting attention at once.

The replacement is the interconnect: a network of point-to-point links and switches that allows multiple transactions to be in flight simultaneously. From the software's point of view it still looks like a memory-mapped bus — addresses go in, data comes out — but underneath, a fabric of routers steers transactions to their destinations and lets unrelated traffic pass without interfering.
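The difference in concurrency can be sketched by counting cycles. The model below assumes that every transaction takes one cycle, that a shared bus carries one transaction per cycle, and that a switched fabric can carry any set of transactions whose sources and destinations are all distinct. The device names are placeholders.

```python
def cycles_on_shared_bus(transactions):
    """One transaction at a time, whoever the endpoints are."""
    return len(transactions)


def cycles_on_switched_fabric(transactions):
    """Greedy schedule: per cycle, each source and each destination
    takes part in at most one transaction."""
    pending = list(transactions)
    cycles = 0
    while pending:
        used_sources, used_destinations, leftover = set(), set(), []
        for src, dst in pending:
            if src in used_sources or dst in used_destinations:
                leftover.append((src, dst))  # endpoint busy; retry next cycle
            else:
                used_sources.add(src)
                used_destinations.add(dst)
        pending = leftover
        cycles += 1
    return cycles


# Hypothetical traffic: four requesters, two memory channels.
traffic = [
    ("core0", "dram0"),
    ("core1", "dram1"),
    ("gpu", "dram0"),
    ("nic", "dram1"),
]
print(cycles_on_shared_bus(traffic))       # 4
print(cycles_on_switched_fabric(traffic))  # 2
```

Unrelated traffic (core0 to dram0 and core1 to dram1) proceeds in the same cycle; only transactions that target the same channel still wait for one another.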

There are several important species of interconnect that any practitioner should recognize.

System buses connect the major components of a system. Historically this role was filled by buses such as ISA, PCI, and the front-side bus on earlier Intel processors. Today the role is filled by PCI Express (PCIe), a serial point-to-point fabric that, despite inheriting the PCI name, behaves nothing like a classical bus. PCIe arranges devices into a tree of switches, with each device having a private link to the switch above it, and supports many concurrent transactions.

Memory interconnects connect the CPU to DRAM and other large storage. They have their own protocols (DDR4, DDR5, HBM3, CXL.mem) tuned for the very high bandwidth and tight latency requirements of memory traffic. We will examine them in detail in Chapter 18.

On-chip interconnects are the fabric inside a single die that ties together cores, caches, memory controllers, and accelerators. Standards such as AMBA AXI from Arm, TileLink from the RISC-V community, and various proprietary ring and mesh fabrics from Intel and AMD play this role. AXI is particularly common: it defines five separate channels (read address, read data, write address, write data, write response) so that multiple kinds of transaction can flow concurrently.

Coherent interconnects are a special class that not only carries transactions but also keeps multiple caches consistent with one another. Examples include Arm's ACE (the AXI Coherency Extensions) and the later CHI (Coherent Hub Interface), as well as the CXL.cache protocol of Compute Express Link. These will become important in Chapter 31, when we discuss multicore coherence.

A worked example: AXI

To make the level of detail concrete, here is a simplified view of an AXI read transaction. The protocol has two relevant channels:

The AR (read address) channel carries the address and burst parameters.
The R (read data) channel carries the returned data.

Each channel has valid and ready handshake signals. A transfer happens on a rising clock edge at which both valid (driven by the source) and ready (driven by the destination) are high; if either is low, nothing moves on that edge.
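The valid/ready rule can be exercised with a small cycle-by-cycle model. This sketch assumes a source that keeps valid high, and its payload unchanged, until the current beat is accepted, and a destination whose ready signal follows a given per-cycle pattern. `run_channel` is a hypothetical helper, not part of any AXI tooling.

```python
def run_channel(payloads, ready_pattern):
    """Return a list of (cycle, payload) pairs, one per completed transfer.

    ready_pattern gives the destination's ready signal for each cycle.
    """
    transfers = []
    index = 0  # next payload the source is offering
    for cycle, ready in enumerate(ready_pattern, start=1):
        valid = index < len(payloads)  # source has something to send
        if valid and ready:            # both high on this edge: transfer
            transfers.append((cycle, payloads[index]))
            index += 1
        # Otherwise the source keeps offering the same payload.
    return transfers


beats = ["D0", "D1", "D2", "D3"]

# Destination always ready: one beat per cycle.
print(run_channel(beats, [1, 1, 1, 1]))
# [(1, 'D0'), (2, 'D1'), (3, 'D2'), (4, 'D3')]

# Destination stalls in cycles 2 and 5: the burst stretches, nothing is lost.
print(run_channel(beats, [1, 0, 1, 1, 0, 1]))
# [(1, 'D0'), (3, 'D1'), (4, 'D2'), (6, 'D3')]
```

Back-pressure from the destination delays beats without dropping or reordering them, which is the property that lets each channel run at its own pace.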

Figure: AXI-style read burst waveform: ARVALID and ARREADY handshake the address, then RDATA streams four data beats D0 through D3 while RVALID and RREADY stay high

LaTeX

\begin{tikzpicture}[font=\footnotesize, line cap=round]
  % Origin (0,0) at top-left of waveform area.
  % Signal labels at left (x=0), waveforms span x=2..8
  % y axis: signals are spaced 1.2 vertical units apart, top down; each waveform is 0.6 units tall
  % For each signal: high level = top, low level = top - 0.6
  % Signals top y: clock=0, ARADDR=-1.2, ARVALID=-2.4, ARREADY=-3.6, RDATA=-4.8, RVALID=-6.0, RREADY=-7.2
  \node[anchor=east] at (1.8, -0.3) {clock};
  \node[anchor=east] at (1.8, -1.5) {ARADDR};
  \node[anchor=east] at (1.8, -2.7) {ARVALID};
  \node[anchor=east] at (1.8, -3.9) {ARREADY};
  \node[anchor=east] at (1.8, -5.1) {RDATA};
  \node[anchor=east] at (1.8, -6.3) {RVALID};
  \node[anchor=east] at (1.8, -7.5) {RREADY};
  % Clock: 8 full cycles, period 0.7, from x=2 to x=7.6
  \draw[thick] (2,-0.6) -- (2,0) -- (2.35,0) -- (2.35,-0.6) -- (2.7,-0.6) -- (2.7,0) -- (3.05,0) -- (3.05,-0.6) -- (3.4,-0.6) -- (3.4,0) -- (3.75,0) -- (3.75,-0.6) -- (4.1,-0.6) -- (4.1,0) -- (4.45,0) -- (4.45,-0.6) -- (4.8,-0.6) -- (4.8,0) -- (5.15,0) -- (5.15,-0.6) -- (5.5,-0.6) -- (5.5,0) -- (5.85,0) -- (5.85,-0.6) -- (6.2,-0.6) -- (6.2,0) -- (6.55,0) -- (6.55,-0.6) -- (6.9,-0.6) -- (6.9,0) -- (7.25,0) -- (7.25,-0.6) -- (7.6,-0.6);
  % ARADDR: A on cycle 1 (x=2..2.7), then idle
  \draw[thick] (2,-1.8) -- (2,-1.2) -- (2.7,-1.2) -- (2.7,-1.8) -- (7.6,-1.8);
  \node at (2.35,-1.5) {A};
  % ARVALID: high during cycle 1
  \draw[thick] (2,-3.0) -- (2,-2.4) -- (2.7,-2.4) -- (2.7,-3.0) -- (7.6,-3.0);
  % ARREADY: high during cycle 1
  \draw[thick] (2,-4.2) -- (2,-3.6) -- (2.7,-3.6) -- (2.7,-4.2) -- (7.6,-4.2);
  % RDATA: D0..D3 in cycles 2..5
  \draw[thick] (2,-5.4) -- (2.7,-5.4);
  \draw[thick] (2.7,-5.4) -- (2.7,-4.8) -- (5.5,-4.8) -- (5.5,-5.4) -- (7.6,-5.4);
  \node at (3.05,-5.1) {$D_0$};
  \node at (3.75,-5.1) {$D_1$};
  \node at (4.45,-5.1) {$D_2$};
  \node at (5.15,-5.1) {$D_3$};
  % RVALID: high cycles 2-5
  \draw[thick] (2,-6.6) -- (2.7,-6.6) -- (2.7,-6.0) -- (5.5,-6.0) -- (5.5,-6.6) -- (7.6,-6.6);
  % RREADY: high cycles 2-5
  \draw[thick] (2,-7.8) -- (2.7,-7.8) -- (2.7,-7.2) -- (5.5,-7.2) -- (5.5,-7.8) -- (7.6,-7.8);
\end{tikzpicture}

The point is not to memorize this; the point is to notice how thoroughly the modern interconnect has internalized the discipline of synchronous design. There are no shared bus lines, no tristate drivers, no implicit "wait until ready" rules. Every transfer is an explicit handshake on a clock edge, and many transfers can be in flight on different channels simultaneously.

10.Summary

A computer organizes itself into three subsystems: a CPU that does the work, a memory that holds programs and data, and an I/O subsystem that connects the digital world to the physical one. The von Neumann model, articulated in 1945, places programs and data in a single memory and drives the CPU through a continuous fetch–execute loop. Within this model, different CPU organization styles — accumulator, stack, memory–memory, and load–store — take very different views of where operands live during computation, with load–store organization having won the modern argument. The flat memory of the model is realized in practice as a structured memory map, in which different address ranges correspond to ROM, DRAM, and memory-mapped I/O, with attributes attached by the operating system. The model's great strengths — universality, simplicity, programmability — are paid for by the von Neumann bottleneck, the single path that everything must share, and by the modern memory wall, the growing latency gap between fast processors and slow main memory that motivates almost the entire memory hierarchy. The Harvard architecture splits the path, gaining bandwidth at the cost of complexity, and the modified Harvard architecture used in nearly every modern processor takes the best parts of both: a unified memory that the programmer sees, with separate caches that the hardware uses to feed itself in parallel. Several non–von Neumann architectures — dataflow, systolic, quantum — have been seriously explored, but they survive mostly as accelerators attached to a conventional host. The flow of information through such a machine consists of instruction fetches, data loads, and data stores, all carried by a bus or, in modern systems, by a richer interconnect that allows many transactions to overlap.

In the next chapter we move inside the CPU itself and examine how the registers, the ALU, and the control unit fit together to actually carry out the instructions that flow through this organization.
