Storage and I/O Subsystems
May 16, 2026·32 min read·intermediate
The bottom of the memory hierarchy is the layer the program never sees directly: persistent storage and the I/O subsystems that connect it to the CPU. Files, programs, databases, operating-system images, and everything else that survives a power cycle live in this layer. The technologies are different from DRAM (and from each other), the access patterns are different, the latencies are vastly larger, and the software stack between the program and the device is taller. But the principles that organize the rest of the hierarchy — locality, parallelism, hierarchy, cost-versus-speed tradeoffs — apply here too, in their own forms.
This chapter covers the major storage technologies (HDDs and SSDs), the way they are accessed (block devices, memory-mapped devices, NVMe), the buses that connect them (PCI and PCIe), the transfer mechanisms (DMA and interrupts, drawing on Chapter 9), and the latency picture that ties it all together. It closes Part IV by completing the picture, from registers all the way down to magnetic platters.
01. Block Devices
To the operating system, persistent storage almost always looks like a block device: a large array of fixed-size blocks, addressed by block number, that can be read and written individually. The block size is typically 512 bytes (the historical default for hard disks) or 4096 bytes (the modern Advanced Format default for both SSDs and recent HDDs). A device with a 4 KB block size and 2 TB of capacity has $2^{41} / 2^{12} = 2^{29}$ blocks, roughly 537 million (taking both units as powers of two).
The block-level interface is intentionally narrow. A driver issues commands like:
- Read N blocks starting at block address X into memory at address Y.
- Write N blocks starting at block address X from memory at address Y.
- Flush any cached writes to permanent media.
The device handles the rest: which physical sectors of the platter are involved, which Flash cells, which internal buffers. This abstraction has been remarkably stable across decades of changing technology — the same block-device interface works for tape drives, hard disks, SSDs, optical drives, and network-attached storage — and it is what file systems are built on top of.
A file system, in turn, organizes these blocks into the named directories and files that programs see. A file is, structurally, a sequence of blocks plus some metadata; a directory is a special file listing the names and starting blocks of its contents. The file system maintains data structures (inodes, B-trees, journals) on the block device that describe this organization. When a program opens a file and reads its bytes, the file system translates each byte access into one or more block reads, which the storage stack ultimately converts into device commands.
The block-device interface mismatches application access patterns at two granularities. A program that reads four bytes triggers a 4 KB block read; the rest of the block is cached for possible future use. A program that writes four bytes triggers a read-modify-write cycle: the block is read, the four bytes are updated, the block is written back. The operating system's page cache (also called buffer cache in some systems) holds recently accessed blocks in DRAM, so that small accesses to the same blocks are answered without going to the device. This page cache is, conceptually, a software cache layered on top of the hardware storage hierarchy — analogous to the CPU's data cache, but at the block level.
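The read-modify-write cycle can be sketched with a toy in-memory device. This is a minimal illustration, not a real driver: `ToyBlockDevice` and `write_bytes` are hypothetical names, the 4 KB block size is an assumption, and there is no page cache in front of the device.

```python
BLOCK_SIZE = 4096  # bytes per block (assumed 4 KB device)

class ToyBlockDevice:
    """In-memory stand-in for a block device: whole-block reads and writes only."""
    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.reads = 0
        self.writes = 0

    def read_block(self, lba):
        self.reads += 1
        return self.blocks[lba]

    def write_block(self, lba, data):
        assert len(data) == BLOCK_SIZE
        self.writes += 1
        self.blocks[lba] = bytes(data)

def write_bytes(dev, offset, payload):
    """Byte-granular write built from read-modify-write of every touched block."""
    pos = 0
    while pos < len(payload):
        lba, start = divmod(offset + pos, BLOCK_SIZE)
        chunk = payload[pos:pos + BLOCK_SIZE - start]
        block = bytearray(dev.read_block(lba))    # read the whole block
        block[start:start + len(chunk)] = chunk   # modify a few bytes
        dev.write_block(lba, block)               # write the whole block back
        pos += len(chunk)

dev = ToyBlockDevice(num_blocks=8)
write_bytes(dev, offset=4094, payload=b"ABCD")    # straddles blocks 0 and 1
```

A four-byte write that straddles a block boundary costs two block reads and two block writes here, which is the kind of overhead the page cache absorbs in practice.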
02. HDDs and SSDs
Two technologies dominate modern persistent storage.
Hard Disk Drives
A hard disk drive (HDD) is a stack of rotating magnetic platters with read/write heads that float just above their surfaces. Data is encoded as magnetization patterns on tracks of the platter; the head positions over the right track and the platter's rotation brings the right sector under the head, at which point the head reads or writes the magnetization.
The mechanical nature of the HDD is the source of its characteristic performance behavior. To read a random sector, the head must:
- Seek to the right track. Seek time is typically 5–10 milliseconds for a 7200-RPM consumer drive.
- Wait for the platter to rotate the sector under the head. Rotational latency averages half a revolution, or about 4 ms at 7200 RPM.
- Read the data as the platter passes the head. At 200 MB/s sustained transfer, a 4 KB sector takes 20 µs.
The total for a random access is dominated by the mechanical components: roughly 10 ms, or on the order of 100 random operations per second. Sequential accesses, in contrast, stay on the same or adjacent tracks and largely avoid the seek; modern HDDs deliver roughly 150 to 250 MB/s of sequential bandwidth, but only under sustained sequential workloads.
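The arithmetic behind that figure is easy to reproduce. The sketch below adds the three components for a single request; the function name is hypothetical and the 5 ms seek is simply the low end of the range quoted above.

```python
def hdd_random_access_ms(seek_ms, rpm, transfer_mb_s, request_bytes):
    """Estimate one random access: seek + average rotational wait + transfer."""
    revolution_ms = 60_000 / rpm           # time for one full revolution
    rotational_ms = revolution_ms / 2      # on average, half a revolution
    transfer_ms = request_bytes / (transfer_mb_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms

total_ms = hdd_random_access_ms(seek_ms=5.0, rpm=7200,
                                transfer_mb_s=200, request_bytes=4096)
random_iops = 1000 / total_ms              # requests per second, one at a time
print(f"{total_ms:.2f} ms per access, about {random_iops:.0f} IOPS")
```

The transfer term contributes about 0.02 ms of a roughly 9 ms total, which is why request size barely matters for random HDD workloads until requests become very large.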
This pattern — fast sequential, very slow random — has shaped operating systems for decades. File systems try hard to allocate file blocks contiguously to reduce seek costs. Database engines structure their indices and storage layouts to favor sequential access. Defragmentation, once a regular maintenance task, was an attempt to repair the damage of fragmented allocation.
HDDs are no longer the dominant storage medium for active workloads. They survive in two niches: bulk archival storage where cost per terabyte matters most (the 20+ TB drives in cloud storage farms) and very large databases where the cost per byte is more important than the latency. For everyday computing, SSDs have almost entirely replaced them.
Solid-State Drives
A solid-state drive (SSD) is a Flash memory device packaged with a controller and a block-device interface. Flash, like DRAM, stores bits as charge on isolated transistor gates; unlike DRAM, the charge is retained without power. An SSD has no moving parts, no platters, no heads, and consequently no seek or rotational latency.
Random reads on an SSD complete in tens of microseconds, two to three orders of magnitude faster than an HDD. Sequential reads can reach gigabytes per second on a single device. From a workload-performance perspective, SSDs have largely removed the need to optimize reads for sequential access: with enough requests in flight, random reads approach sequential throughput.
But Flash has its own quirks, and an SSD's controller is non-trivial.
Erase before write. Flash is programmed (bits cleared from 1 to 0) a page at a time, but it can only be erased (reset to 1) in much larger groups: erase blocks, typically 256 KB to several megabytes. Writing a 4 KB chunk requires either finding an already-erased page within an erase block (the common case) or copying the block's live pages elsewhere, erasing the block, and writing the modified version back.
Wear. Each Flash cell can be erased a limited number of times — typically 1,000 to 100,000 cycles, depending on the cell type. Once a cell is worn out, it can no longer reliably store data. The SSD controller spreads writes across the device (wear leveling) so that no cell wears out far ahead of the others, extending the device's life.
Garbage collection. As the device ages, blocks become a mixture of valid pages (containing live data) and invalidated pages (whose data has been overwritten elsewhere). The controller periodically copies the valid pages from a partially-used block to fresh blocks, then erases the original. This garbage collection runs in the background and consumes some bandwidth.
Write amplification. The combination of erase-before-write and garbage collection means that the actual write traffic to Flash is often larger than the logical writes the host issues. A modern SSD's write amplification factor under typical workloads is 1.5 to 3, which has implications for both performance and lifespan.
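The effect of write amplification on lifespan can be estimated with simple arithmetic. The sketch below assumes perfect wear leveling and uses illustrative numbers (a 1 TB drive rated for 3,000 erase cycles); the function names are hypothetical, and real drives publish their own endurance ratings.

```python
def ssd_host_write_budget_tb(capacity_tb, pe_cycles, waf):
    """Host-visible terabytes that can be written before cells hit their erase limit."""
    flash_write_budget_tb = capacity_tb * pe_cycles   # total Flash writes the cells allow
    return flash_write_budget_tb / waf                # amplification consumes part of it

def ssd_lifetime_years(capacity_tb, pe_cycles, waf, host_tb_per_day):
    budget = ssd_host_write_budget_tb(capacity_tb, pe_cycles, waf)
    return budget / host_tb_per_day / 365

budget_low_waf = ssd_host_write_budget_tb(capacity_tb=1, pe_cycles=3000, waf=1.5)
budget_high_waf = ssd_host_write_budget_tb(capacity_tb=1, pe_cycles=3000, waf=3.0)
years_high_waf = ssd_lifetime_years(1, 3000, 3.0, host_tb_per_day=0.1)
```

Doubling the write amplification factor halves the host-visible write budget, which is why techniques that reduce it (such as TRIM, described below) also extend device life.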
The controller hides all of this from the host. The host sees a simple block device: blocks numbered from 0 to N, readable and writable in any order, with consistent low latency. The host does not (and cannot) know which physical Flash pages hold which logical block numbers; the mapping is the controller's business and changes over time.
A few details of Flash technology worth knowing.
SLC, MLC, TLC, QLC. A Flash cell can store one bit (Single-Level Cell), two bits (Multi-Level Cell), three (Triple-Level Cell), or four (Quad-Level Cell). Each step up adds one bit per cell, which doubles the number of charge levels the cell must distinguish, at the cost of speed and endurance. SLC is fast and durable but expensive per bit; QLC is cheap but slow and wears out quickly. Consumer SSDs are typically TLC; high-end enterprise drives use MLC or even SLC; bulk-archival drives use QLC.
3D NAND. Modern Flash chips stack many layers of cells vertically, with hundreds of layers in current devices. This extends the density gains that planar shrinking can no longer provide.
TRIM / discard. The operating system can tell the SSD that certain blocks are no longer in use (e.g., after a file is deleted). This information helps the SSD's garbage collector avoid copying data that is logically dead, reducing write amplification and extending life. Modern file systems issue TRIM commands automatically; older ones did not, and users sometimes ran periodic fstrim jobs to recover SSD performance.
03. Memory-Mapped Devices and the Storage Path
The controller of a modern SSD or HDD is itself a small computer: a CPU running firmware, with its own RAM, its own flash for state, and a high-speed interface to the host. The host communicates with the controller through the I/O mechanisms we covered in Chapter 9: memory-mapped registers, DMA, and interrupts.
A typical I/O command flows like this:
- The driver, running on the host CPU, prepares a command in host RAM. The command describes the desired operation (read or write), the device-internal address (block number), the host RAM address that is the source or destination of the data, and the length.
- The driver writes the command's address to a memory-mapped register on the device. This is the "doorbell": it tells the device that a new command is ready.
- The device's controller reads the command from host RAM via DMA.
- For a read, the device fetches the data from its storage media into its internal buffers, then DMAs it to the host RAM address. For a write, the reverse.
- When the operation completes, the device updates an in-memory completion queue and raises an interrupt to the host.
- The driver's interrupt handler examines the completion, notifies the file system, and returns.
This flow exemplifies the I/O techniques from Chapter 9. The device registers are memory-mapped. The bulk transfer is by DMA. The completion is signaled by an interrupt. The driver and the device cooperate through queues in shared memory.
Modern SSDs go further with submission and completion queues held in host RAM, on which the host and device cooperate using lock-free protocols. The host enqueues new commands; the device dequeues them, processes them, and enqueues completions; the host dequeues completions in turn. The "doorbell" register carries only a queue index, and a single doorbell write can announce a whole batch of queued commands instead of one write per command. This style of interface (standardized for storage by NVMe, discussed shortly) scales to millions of operations per second per device, far more than the simple per-command-doorbell model would support.
04. PCI and PCIe
Storage devices, network cards, GPUs, and most other high-performance peripherals attach to the system through PCI Express (PCIe), a high-speed serial interconnect that has dominated peripheral connectivity since the mid-2000s. PCIe descended from older parallel PCI (Peripheral Component Interconnect) buses, which themselves replaced the original IBM PC's ISA bus. The genealogy matters mostly because the software-visible model — the configuration space, the BAR-based memory mapping, the device-discovery protocol — has been largely preserved across the transition from PCI to PCIe.
A PCIe device is a self-describing piece of hardware. At system boot, the firmware (or the operating system) walks the PCIe topology and queries each device's configuration space, a structured set of registers that includes:
- A vendor ID and device ID, identifying the maker and model.
- A class code, identifying the kind of device (storage, network, display, audio, etc.).
- Base Address Registers (BARs), which the OS programs to assign each device's memory-mapped registers a range in the system's physical address space.
- Capabilities, describing optional features (MSI-X interrupts, power management, error reporting).
After enumeration, the operating system loads the appropriate driver based on the vendor and device IDs (or the class code, for devices that follow a standard interface), maps the device's MMIO regions, configures interrupts, and begins issuing commands.
PCIe's physical layer is a serial point-to-point link with multiple lanes. Each lane is one differential pair in each direction, transmitting data at a rate that has roughly doubled with each generation:
| Generation | Per-lane data rate | x16 link bandwidth |
|---|---|---|
| PCIe 1.0 | 250 MB/s | 4 GB/s |
| PCIe 2.0 | 500 MB/s | 8 GB/s |
| PCIe 3.0 | 985 MB/s | 15.75 GB/s |
| PCIe 4.0 | 1.97 GB/s | 31.5 GB/s |
| PCIe 5.0 | 3.94 GB/s | 63 GB/s |
| PCIe 6.0 | ~7.56 GB/s | ~121 GB/s |
A device negotiates a link width (x1, x2, x4, x8, x16) and a generation with the host. A high-end SSD typically uses an x4 PCIe 4.0 or 5.0 link, giving roughly 8 or 16 GB/s of bandwidth in each direction. A high-end GPU uses an x16 link of the same generation, getting roughly 32 or 64 GB/s.
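The table entries for generations 1 through 5 follow from the raw transfer rate and the line encoding. The sketch below recomputes them; the dictionary and function names are hypothetical, and PCIe 6.0 is left out because its PAM4 signaling and FLIT framing do not reduce to a simple encoding ratio.

```python
# (raw transfer rate in GT/s, encoding efficiency) per generation.
# Generations 1-2 use 8b/10b encoding; generations 3-5 use 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}

def link_bandwidth_gb_s(generation, lanes):
    """Usable bandwidth in one direction, in GB/s, before protocol overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return gt_per_s * efficiency / 8 * lanes   # 8 bits per byte

ssd_gen4_x4 = link_bandwidth_gb_s(4, lanes=4)     # typical NVMe SSD link
gpu_gen5_x16 = link_bandwidth_gb_s(5, lanes=16)   # typical GPU link
```

These are per-direction figures; packet headers and flow control reduce the bandwidth a device actually observes somewhat further.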
PCIe is not just point-to-point; PCIe switches allow tree topologies, and modern processors include integrated root complexes that connect to several lanes' worth of devices either directly or through switches. Server platforms with dozens of devices use external switching extensively; consumer platforms typically have a few direct slots and use the chipset to fan out the rest.
05. NVMe
For decades, the standard interface to storage devices was AHCI (Advanced Host Controller Interface), designed in the era of HDDs. AHCI assumed a single command queue with 32 slots, enough for any HDD's tiny request rate. SSDs overwhelmed it: a fast SSD can complete hundreds of thousands of operations per second, and AHCI's serialization quickly became the bottleneck.
NVMe (Non-Volatile Memory Express) is the modern replacement. Designed from scratch for fast Flash-based storage attached over PCIe, NVMe gives the host:
- Up to 65,535 I/O queues, each with up to 65,536 entries (commonly rounded to "64K").
- Per-CPU-core queues, so that each core can submit and complete commands without locking against other cores.
- MSI-X interrupts, with one vector per queue, so completions can be steered to the same core that issued the command.
- A streamlined command set, with shorter command descriptors than AHCI's.
- Lock-free queue protocols using submission and completion rings in shared memory.
The combination is dramatic. A high-end NVMe SSD can sustain millions of I/O operations per second with latencies on the order of ten microseconds, far beyond what AHCI permitted.
A typical NVMe transaction:
- The driver places a command in the submission queue. The submission queue is a ring buffer in host RAM.
- The driver writes the new tail pointer to the submission-queue doorbell register on the device.
- The device fetches the command from host RAM, processes it (reading or writing Flash), and DMAs data as appropriate.
- The device places a completion entry in the completion queue (also a ring buffer in host RAM).
- The device raises an MSI-X interrupt routed to the originating core.
- The driver's handler processes completions, updates the head pointer of the completion queue, and writes it to the doorbell.
Because each core has its own queue pair, there is no contention between cores. Because one doorbell write can cover a whole batch of queued commands, the per-command overhead is small. NVMe is, in a sense, a re-implementation of the I/O patterns from Chapter 9 with attention to scaling to high request rates.
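The transaction steps above can be modeled in a few lines. This is a toy simulation under simplifying assumptions, not the NVMe wire format: `QueuePair` is a hypothetical class, commands are plain tuples, the completion queue is a list, not a ring with phase bits, and the "device" runs synchronously in the same thread.

```python
class QueuePair:
    """Toy model of one submission/completion queue pair in host RAM."""
    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth     # submission ring
        self.cq = []                 # completion entries, oldest first
        self.sq_tail = 0             # host-owned producer index
        self.sq_head = 0             # device-owned consumer index
        self.sq_tail_doorbell = 0    # device register: last tail the host announced
        self.doorbell_writes = 0

    def submit(self, command):
        """Host: place a command in the ring without notifying the device."""
        next_tail = (self.sq_tail + 1) % self.depth
        if next_tail == self.sq_head:
            raise RuntimeError("submission queue full")
        self.sq[self.sq_tail] = command
        self.sq_tail = next_tail

    def ring_doorbell(self):
        """Host: publish the new tail; one write can cover many commands."""
        self.sq_tail_doorbell = self.sq_tail
        self.doorbell_writes += 1

    def device_process(self, storage):
        """Device: consume every command up to the announced tail."""
        while self.sq_head != self.sq_tail_doorbell:
            cid, op, lba, data = self.sq[self.sq_head]
            if op == "write":
                storage[lba] = data
                result = None
            else:
                result = storage.get(lba)
            self.cq.append((cid, result))
            self.sq_head = (self.sq_head + 1) % self.depth

    def reap(self):
        """Host: drain completions, as an interrupt handler or poll loop would."""
        done, self.cq = self.cq, []
        return done

storage = {}
qp = QueuePair(depth=8)
qp.submit((1, "write", 42, b"hello"))
qp.submit((2, "read", 42, None))
qp.ring_doorbell()                   # a single doorbell write for both commands
qp.device_process(storage)
completions = qp.reap()
```

Each core owning its own `QueuePair` instance is what removes cross-core locking in this model: no index is ever written by more than one party.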
NVMe also runs over fabrics other than PCIe. NVMe over Fabrics (NVMe-oF) lets a host access remote NVMe devices over RDMA, TCP, or Fibre Channel, with the same software stack. Cloud and enterprise storage systems use this extensively; an NVMe-oF target can sit on a separate machine and serve thousands of clients with low latency.
06. DMA and Interrupt-Driven I/O
The mechanisms we discussed in Chapter 9 — DMA, interrupts, memory-mapped registers — are exactly what storage I/O uses. A few observations specific to the storage path are worth making.
DMA dominates. Every meaningful storage transaction is a DMA. The host CPU does not move data byte by byte to or from the device; it sets up the transfer and steps aside.
Scatter-gather DMA is universal. A buffer that the file system wants to read into may be scattered across many physical pages (because the OS allocates memory in 4 KB chunks and the buffer was constructed dynamically). The device's DMA engine accepts a list of physical-address-and-length pairs and walks the list as it transfers, eliminating the need for the host to copy data between contiguous and scattered representations.
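A scatter-gather list is just the sequence of physical extents that back a virtually contiguous buffer. The sketch below builds one from a hypothetical page table (a dict mapping virtual page numbers to physical page numbers) and merges extents that happen to be physically adjacent; the names and the 4 KB page size are assumptions for illustration.

```python
PAGE_SIZE = 4096

def build_sg_list(virt_addr, length, page_table):
    """Return (physical_address, length) pairs covering a virtual buffer."""
    segments = []
    while length > 0:
        vpn, offset = divmod(virt_addr, PAGE_SIZE)
        chunk = min(length, PAGE_SIZE - offset)      # stay within this page
        phys = page_table[vpn] * PAGE_SIZE + offset
        if segments and segments[-1][0] + segments[-1][1] == phys:
            # Physically adjacent to the previous extent: extend it.
            segments[-1] = (segments[-1][0], segments[-1][1] + chunk)
        else:
            segments.append((phys, chunk))
        virt_addr += chunk
        length -= chunk
    return segments

# Virtual pages 16 and 17 map to adjacent physical pages 7 and 8; page 18 does not.
page_table = {16: 7, 17: 8, 18: 3}
sg_list = build_sg_list(virt_addr=16 * PAGE_SIZE + 100, length=10_000,
                        page_table=page_table)
```

A 10,000-byte buffer spanning three virtual pages becomes two extents here, and the device walks that list without the host copying anything.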
IOMMU translation sits between DMA-capable devices and DRAM. The IOMMU translates the addresses the device sees (the I/O virtual addresses the driver gave it) into the physical addresses where the buffer pages actually live. This serves two purposes: it lets drivers operate on virtual addresses (so that buffers do not need to be physically contiguous), and it prevents the device from accessing memory it has no business touching (a security and correctness boundary).
Interrupt coalescing. A device that completes hundreds of thousands of operations per second cannot afford to interrupt the host on every single completion; the interrupt overhead would dominate. NVMe and other modern interfaces support coalescing: the device waits a small amount of time or accumulates a number of completions before raising the interrupt. The driver then processes all of them in one handler invocation. The cost is a small increase in per-operation latency; the benefit is a large reduction in CPU overhead.
Polling for low latency. For the very lowest-latency workloads, drivers may bypass interrupts entirely and poll the completion queue. The CPU spins, checking for new completions, paying its full CPU cost but eliminating interrupt latency. Linux's io_uring interface and DPDK-style network drivers use polling routinely. This is the same polling-versus-interrupts tradeoff we discussed in Chapter 9, applied at the level of high-rate I/O.
07. Storage and Device Latency
The latency picture for the bottom of the hierarchy, completing the table from Chapter 16:
| Layer | Typical access latency | Sustained bandwidth |
|---|---|---|
| L1 cache | ~1 ns | hundreds of GB/s |
| L2 cache | ~3 ns | hundreds of GB/s |
| L3 cache | ~15 ns | tens of GB/s |
| DRAM (local socket) | ~80 ns | tens of GB/s per channel |
| DRAM (remote NUMA socket) | ~120–150 ns | a bit less |
| NVMe SSD (PCIe Gen 4) | ~10 µs | several GB/s |
| SATA SSD | ~80 µs | ~500 MB/s |
| HDD (random) | ~10 ms | ~150–250 MB/s sequential |
| Network (LAN, same datacenter) | ~50 µs RTT | ~10–100 Gb/s |
| Network (cross-region) | tens of ms RTT | bandwidth varies |
| Tape | seconds (mount) + sequential | hundreds of MB/s once spinning |
A few things are striking.
The gap between DRAM and SSD is roughly 100×. Even a fast NVMe drive is much, much slower than DRAM. A program that fits its working set in DRAM runs orders of magnitude faster than one that does not.
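The gap between SSD and HDD is at least as large: for random access, a SATA SSD is roughly 100× faster than an HDD and an NVMe SSD roughly 1000× faster, though the sequential advantage is much smaller (a few times for SATA, a few tens of times for NVMe). This is why SSDs displaced HDDs from active workloads almost entirely.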
Network latency to a same-datacenter machine is lower than HDD random-access latency. This has profound architectural implications: reaching a remote machine's DRAM is faster than reaching a local HDD, and over RDMA, where round trips can drop to a few microseconds, it can rival a local SSD. Modern distributed systems exploit this: a "remote DRAM" tier, accessible over fast networks, sits naturally between local DRAM and local SSD in the hierarchy. Technologies like CXL.mem aim to make this even cleaner, exposing remote memory as cache-coherent across machines.
The full memory hierarchy, then, is much wider than the single-machine picture we have been drawing. From registers to tape, from the CPU die to the planet-spanning network, the hierarchy of speeds and capacities continues. Understanding which level a piece of data lives in, and what it costs to move it between levels, is one of the most important skills in modern systems programming.
08. The Page Cache and Buffered I/O
We have described the storage stack as if every read and write actually traveled to the device. In practice, the operating system maintains a software cache of recently-accessed disk blocks in DRAM, and most reads are served from that cache without ever issuing a device command. Linux calls it the page cache; older systems and other contexts use the term buffer cache. The two have merged on modern Linux into a single structure indexed by file and offset, sized dynamically to consume free DRAM.
A read() system call first checks the page cache. If the requested page is present, the kernel copies it into the user buffer and returns; the device is not touched. If not, the kernel allocates a page, issues a read to the device, waits for completion, copies into the user buffer, and also keeps the page in the cache for next time. The cache is the reason a second cat of a large file is dramatically faster than the first.
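The read path can be sketched as a small cache in front of a slow device. This is a simplified model with plain least-recently-used eviction, which is an assumption made for brevity; the kernel's real replacement policy is more elaborate, and `ToyPageCache` and `slow_read` are hypothetical names.

```python
from collections import OrderedDict

class ToyPageCache:
    """Page-granular read cache with LRU eviction."""
    def __init__(self, capacity_pages, read_from_device):
        self.capacity = capacity_pages
        self.read_from_device = read_from_device   # callable: page number -> bytes
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read_page(self, page_no):
        if page_no in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_no)        # mark as most recently used
            return self.pages[page_no]
        self.misses += 1
        data = self.read_from_device(page_no)      # slow path: device command
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)         # evict least recently used
        return data

device_reads = []

def slow_read(page_no):
    device_reads.append(page_no)
    return bytes([page_no % 256]) * 4096

cache = ToyPageCache(capacity_pages=2, read_from_device=slow_read)
for page in [0, 1, 0, 2, 1]:
    cache.read_page(page)
```

With room for only two pages, the second read of page 0 is a hit, but page 1 has been evicted by the time it is requested again and must be fetched from the device a second time.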
Writes are more complex. By default, a write() system call updates the page cache and returns immediately, leaving the data dirty: modified in memory but not yet on disk. The kernel writes dirty pages back to the device asynchronously, in batches, when convenient (some seconds later, when memory pressure rises, or when the application asks). This write-back caching is enormously beneficial for performance, but it has a critical implication: data that has been written from the application's perspective may not yet be persistent. A power failure at the wrong moment loses the most recent writes.
The fsync(fd) system call (and its relatives fdatasync, sync_file_range) forces dirty pages of a file to be flushed to the device, blocking until the device acknowledges. Databases and other systems with durability requirements call fsync at carefully chosen points, ensuring that a transaction is on stable storage before declaring it committed. Underneath, fsync triggers the device's own caches to flush as well, sending an NVMe Flush or SATA FLUSH CACHE command and waiting for completion. Recent NVMe and SATA SSDs support a per-write FUA (Force Unit Access) flag, which marks an individual write as bypassing the device's volatile cache and avoiding a separate flush.
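From an application, the pattern looks like the sketch below, which uses Python's `os.write` and `os.fsync` wrappers around the corresponding system calls. The helper name `durable_write` and the file name are hypothetical; the sketch also ignores one detail that real systems must handle, namely that a newly created file's directory entry needs its own fsync on the containing directory.

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and return only after the kernel reports it flushed to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)                       # block until dirty pages are on stable storage
    finally:
        os.close(fd)

log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
durable_write(log_path, b"txn 1 committed\n")
```

A database would call something like this before acknowledging a commit; returning before the `os.fsync` call completes would reintroduce the window in which acknowledged data can be lost.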
Direct I/O (O_DIRECT on Linux, FILE_FLAG_NO_BUFFERING on Windows) bypasses the page cache entirely: reads and writes go directly to the device, with the user buffer playing the role of the cache. Databases sometimes use this to avoid double-caching (the database has its own cache; the OS's adds no value and consumes memory). The cost is that the application now must align buffers, manage its own caching, and forfeit the OS's read-ahead and write-back optimizations.
Read-ahead is the page cache's prefetching counterpart. The kernel detects sequential access patterns and proactively reads pages ahead of the application's requests, hiding device latency under useful work. Linux's posix_fadvise(POSIX_FADV_SEQUENTIAL) lets applications hint at access patterns to control the read-ahead aggressiveness.
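A sequential-read hint can be issued through Python's `os.posix_fadvise`, which wraps the same call. In this sketch the helper name and the sample file are hypothetical; the hint is treated as purely advisory, so platforms without the call, or a kernel that rejects it, are simply ignored.

```python
import os
import tempfile

def read_sequentially(path, chunk_size=1 << 20):
    """Read a whole file front to back after hinting sequential access."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # Offset 0 with length 0 applies the advice to the whole file.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass                       # advisory only; carry on without it
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total

sample_path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(sample_path, "wb") as f:
    f.write(b"\0" * (3 << 20))
bytes_read = read_sequentially(sample_path)
```

The hint changes only how aggressively the kernel prefetches; the data returned is identical with or without it.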
The relationship between the page cache and the device's own caches is layered. The application sees a single virtual stack — "the file system" — but data may sit in any combination of page cache, device firmware buffer, NAND program-buffer, and finally NAND cells. Understanding which layer holds the data, and what guarantees each layer makes, is essential for any system that cares about durability.
09. Asynchronous I/O and io_uring
The traditional read/write system calls are synchronous: the calling thread blocks until the I/O completes. For applications that overlap many I/Os — web servers, databases, video processors — the natural pattern is to issue many requests and process completions as they arrive. Several APIs exist for this.
Non-blocking I/O with select/poll/epoll lets a thread monitor many file descriptors and respond as they become readable or writable. This works well for sockets and pipes, where data arrives in chunks; it works poorly for disk files, which are always "readable" in the epoll sense, because the OS will block-read on demand. epoll is the dominant Linux event-loop interface for network code.
POSIX AIO (aio_read, aio_write) allows true asynchronous file I/O: the call returns immediately, and the application polls or receives a signal on completion. It exists on most Unix-like systems but is widely considered unsatisfactory; the Linux implementation, in particular, has historically been a thread-pool emulation rather than a true kernel-async interface.
Linux AIO (io_submit) offered a real kernel-async interface but had significant limitations: it was asynchronous only with O_DIRECT, could still block on metadata operations, and could not handle buffered I/O asynchronously. Few applications used it directly.
io_uring, introduced in Linux 5.1 (2019), is the modern answer. The application creates a pair of ring buffers in memory shared with the kernel (a submission queue and a completion queue), and the kernel processes entries from one and posts results to the other. Queuing a submission is a plain memory write; a single io_uring_enter system call can then hand the kernel a whole batch, and completions are read from the completion queue without a system call. Optionally, a kernel thread polls the submission queue, allowing the application to issue I/O without entering the kernel at all. The interface supports almost every kind of I/O the kernel can do: reads, writes, opens, accepts, sends, even file metadata operations.
The architectural pattern echoes NVMe directly: ring buffers, lock-free communication between two parties, optional polling. It is no coincidence that the highest-performance kernel interface ends up looking like the highest-performance device interface; both are responses to the same scaling pressure, and both demonstrate that the cost of context switches and interrupts becomes the dominant factor once individual operations are fast enough.
Windows has equivalents — I/O completion ports and the more recent IoRing API — with similar designs.
10. RAID and Redundant Storage
Any single disk will eventually fail. Servers, even of the most expensive kind, treat disk failure as routine. RAID (Redundant Array of Independent Disks) combines multiple disks into a logical volume that survives some number of individual failures.
The canonical RAID levels are:
- RAID 0 (striping): data is split across $N$ disks; capacity is $N$ times one disk; bandwidth scales with $N$; no redundancy. A single disk failure loses all data. Used purely for performance.
- RAID 1 (mirroring): data is duplicated across two disks; capacity is one disk; survives single-disk failure; reads can be parallelized across the mirrors. The simplest form of redundancy.
- RAID 5 (striping with single parity): data and one parity block per stripe are spread across $N$ disks; capacity is $N - 1$ disks; survives single-disk failure. Parity must be updated on every write, costing read-modify-write cycles.
- RAID 6 (striping with double parity): two parity blocks per stripe; survives any two simultaneous failures. The standard for large storage arrays where rebuild times are long enough that a second failure during rebuild is plausible.
- RAID 10 (striped mirrors): RAID 0 over RAID 1 pairs. Combines the performance of striping with the redundancy of mirroring at the cost of half the raw capacity.
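The parity used by RAID 5 is a bytewise XOR across the stripe, which is what makes both reconstruction and the small-write update work. The sketch below shows one stripe with three data blocks and one parity block; the four-byte blocks and the helper name `xor_blocks` are illustrative assumptions.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across three data disks plus one parity disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Disk 1 fails: XOR of the surviving data and the parity recovers its contents.
rebuilt = xor_blocks([data[0], data[2], parity])

# Small write to disk 0: read the old data and old parity, then write both back.
new_block = b"ZZZZ"
new_parity = xor_blocks([parity, data[0], new_block])
```

The small-write path touches only the target disk and the parity disk, but it needs two reads before its two writes, which is the read-modify-write cost noted in the RAID 5 entry.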
RAID can be implemented in hardware (a dedicated controller card with battery-backed DRAM cache), in software (Linux's md driver), or in the file system itself (ZFS, Btrfs).
For very large storage systems, parity-based RAID has been replaced by erasure coding, which generalizes the parity idea. An erasure code represents $k$ data blocks as $n$ blocks (with $n > k$); any $k$ of the $n$ blocks suffice to recover the data. Reed–Solomon codes are the most common variant. Erasure coding is the standard for distributed object stores (S3, Azure Blob, HDFS, Ceph): a single object is split into many fragments, scattered across many machines and racks, with enough redundancy that several simultaneous failures can be tolerated. The storage overhead is much lower than triple-replication for the same durability guarantee.
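The overhead comparison is a matter of two ratios. The sketch below assumes a code in which any $k$ of the $n$ fragments suffice, as Reed–Solomon codes provide; the (10, 14) parameters are chosen only for illustration and the function name is hypothetical.

```python
def erasure_code_profile(k, n):
    """Storage overhead and fault tolerance when any k of n fragments suffice."""
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    return {"overhead": n / k, "failures_tolerated": n - k}

# Triple replication keeps three full copies and survives the loss of two.
replication_3x = {"overhead": 3.0, "failures_tolerated": 2}

# Illustrative code: 10 data fragments expanded to 14 stored fragments.
rs_10_14 = erasure_code_profile(k=10, n=14)
```

Under these assumptions the erasure code stores 1.4 bytes per byte of data and survives four lost fragments, against 3 bytes per byte and two lost copies for triple replication. The price is that reconstruction must read $k$ fragments and do arithmetic where replication would copy a replica.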
The architectural lesson is that durability at scale is built from many disks, not from one reliable disk. The file system or storage layer abstracts the redundancy and presents a single logical store; underneath, a great deal of bookkeeping ensures that no individual failure compromises the data.
11. Persistent Memory Programming
If storage-class memory (Chapter 18) places persistent media on the memory bus, the file-system abstractions of this chapter are a bad fit for it. Going through read/write to access something that is already in the CPU's address space is wasteful. A different programming model emerged.
The Linux DAX (Direct Access) mode of file systems like ext4 and xfs lets a file backed by persistent memory be mmap'd in such a way that the application's loads and stores go directly to the persistent media — no page cache, no copy. The programming model is then the same as ordinary memory: pointer dereferences, structure assignments. The new requirement is persistence: the application must ensure that modifications reach the persistence domain (the device) before declaring a transaction durable.
This turns out to be subtle. A store from a CPU does not immediately go to memory; it sits in the cache hierarchy until eviction. To force a cache line to memory (or, more precisely, to the platform's persistence domain, typically a buffer in the memory controller covered by ADR), the application must execute a cache line flush instruction — clwb (cache line write-back, preferred), clflushopt (older), or clflush (oldest, slowest). The flush is followed by a store fence (sfence) to ensure ordering. The combination is wrapped in libraries (Intel's PMDK, Microsoft's PMSDK) that present higher-level transactional and atomic primitives.
The failure model is also different. A power loss in the middle of a store-fence sequence may leave some lines flushed and others not, producing inconsistent state on the persistent media. Persistent-memory libraries provide redo logging, undo logging, or copy-on-write to make this safe. The application interacts with the libraries' atomicity guarantees rather than reasoning about cache flushes directly.
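Why ordering matters can be seen in a toy model with a volatile cache in front of a persistence domain. This is a simulation, not real persistent-memory code: `persist` stands in for a cache-line write-back followed by a fence, the class and function names are hypothetical, and the model ignores the fact that real caches may also evict lines on their own. It shows a simple flush-then-publish ordering, which is a more basic pattern than the logging schemes the libraries implement.

```python
class ToyPersistentMemory:
    """Two-level model: a volatile cache in front of a persistence domain."""
    def __init__(self):
        self.cache = {}   # dirty lines, lost on power failure
        self.media = {}   # persistence domain, survives power failure

    def store(self, addr, value):
        self.cache[addr] = value

    def load(self, addr):
        return self.cache[addr] if addr in self.cache else self.media.get(addr)

    def persist(self, addr):
        """Stand-in for flushing one cache line and then fencing."""
        if addr in self.cache:
            self.media[addr] = self.cache.pop(addr)

    def crash(self):
        self.cache.clear()

def committed_update(pm, value, crash_before_commit=False):
    pm.store("payload", value)
    pm.persist("payload")        # payload is durable before the flag changes
    pm.store("valid", True)
    if crash_before_commit:
        pm.crash()               # power fails before the flag is flushed
        return
    pm.persist("valid")

def recover(pm):
    """After restart, trust the payload only if the valid flag reached the media."""
    return pm.load("payload") if pm.load("valid") else None

torn = ToyPersistentMemory()
committed_update(torn, "balance=100", crash_before_commit=True)
after_torn_update = recover(torn)

clean = ToyPersistentMemory()
committed_update(clean, "balance=100")
clean.crash()
after_clean_update = recover(clean)
```

Because the flag is persisted only after the payload, recovery never sees a valid flag pointing at a payload that did not reach the media.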
With Optane discontinued, persistent-memory programming has lost its most prominent platform. The model survives, however, in academic systems, in research with battery-backed DRAM, and in the developing CXL ecosystem, where persistent CXL.mem devices have been announced. The programming model is also of independent interest as a model of crash-consistent in-memory data structures, which is relevant whenever any system needs to survive an abrupt failure with consistent on-disk state.
12. Zoned Storage and Modern SSD Variations
The NAND-flash translation layer we described earlier is convenient but expensive: it consumes DRAM in the device, garbage-collects in the background, and can produce unpredictable latency spikes. Two technologies push the abstraction back toward the host.
Zoned Namespaces (ZNS), standardized in NVMe 2.0, divides the SSD into large zones (typically hundreds of MB to a few GB) that must be written sequentially. Within a zone, the host writes from beginning to end; to overwrite, the host explicitly resets the zone, after which it is empty and ready for sequential writing again. The host is responsible for placing data in zones and for managing zone lifecycle; the SSD does not need to garbage-collect because every zone is empty, partially written in order, or full, and space is reclaimed only a whole zone at a time.
The wins are substantial: less DRAM in the SSD (no large mapping table), more predictable latency (no background GC at unpredictable times), and lower write amplification (the host can group data with similar lifetimes into the same zone). The cost is a more complex host stack: file systems have to be zone-aware, and applications written for the traditional flat-LBA model do not work without translation.
SMR (Shingled Magnetic Recording) is the HDD analog. SMR HDDs achieve higher density by overlapping write tracks like roof shingles; this makes random writes destructive (writing one track overwrites part of the adjacent track), so host-managed SMR drives are organized in zones and presented as zoned block devices. SMR is widely used in cloud object storage, where the workload is primarily large sequential writes and the host stack can adapt.
A related variation is multi-namespace SSDs, which divide a single device into multiple independent namespaces with separate logical address spaces. Different namespaces can have different protection profiles (some replicated, some not), different overprovisioning levels, or different access controls; in cloud and multi-tenant settings, a single physical SSD can serve several customers in isolation.
The broader architectural pattern is exposing more of the device's structure to the host. As workloads become more demanding and the host's coordination capacity grows, the trend is for the device to be less of a black box and for the host's storage stack to participate more directly in placement and lifecycle decisions.
13. Network and Disaggregated Storage
The latency table earlier in this chapter reveals one of the more dramatic facts of modern computing: a same-datacenter network round-trip (50 µs) is far faster than an HDD random access (10 ms) and only a few times slower than a local NVMe access (10 µs). The implication is that storage need not be physically attached to the machine that uses it; given fast enough networking, a remote machine's storage can be nearly as fast as a local one's.
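The comparison can be made concrete with a few lines of arithmetic. The three latencies are the figures quoted above; the remote-access estimate assumes, as a rough simplification, that the network round-trip simply adds to the device latency.

```python
# Latencies from the text, in microseconds.
NETWORK_RTT_US = 50       # same-datacenter round-trip
HDD_RANDOM_US = 10_000    # HDD random access (10 ms)
NVME_LOCAL_US = 10        # local NVMe access

hdd_vs_network = HDD_RANDOM_US / NETWORK_RTT_US   # how much slower the HDD is than the network
network_vs_nvme = NETWORK_RTT_US / NVME_LOCAL_US  # how much slower the network is than local NVMe

# Rough estimate (assumption): remote NVMe access = device latency + one round-trip.
remote_nvme_us = NVME_LOCAL_US + NETWORK_RTT_US
hdd_vs_remote_nvme = HDD_RANDOM_US / remote_nvme_us

print(f"HDD random access is {hdd_vs_network:.0f}x a network round-trip")
print(f"A network round-trip is {network_vs_nvme:.0f}x a local NVMe access")
print(f"Remote NVMe (~{remote_nvme_us} us) is still ~{hdd_vs_remote_nvme:.0f}x faster than a local HDD")
```

Under these figures, even an SSD reached across the datacenter network beats a locally attached HDD by more than two orders of magnitude on random access.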
iSCSI and Fibre Channel are the older protocols for block-over-network access; they predate today's low-latency Ethernet and RDMA fabrics and remain in use in enterprise storage networks. NFS and SMB are the file-level protocols, exposing remote file systems as if they were local. These work but rarely deliver the performance of local storage, because they were designed in an era of slower networks.
NVMe over Fabrics (NVMe-oF) carries the NVMe protocol over a fast network (RDMA over Ethernet, RDMA over InfiniBand, Fibre Channel, or TCP). On the RDMA transports it can expose a remote SSD with latency on the order of ten microseconds higher than local; the TCP transport adds more. This makes disaggregated storage practical: a fleet of compute servers and a separate fleet of storage servers, connected by a fast network, with each compute server able to use any storage server's drives. Cloud providers run their storage layers on this disaggregated model, and on-premises deployments are increasingly doing the same. The benefit is independent scaling: compute and storage capacity can be added separately, and storage hardware can be upgraded without disturbing compute nodes.
RDMA (Remote Direct Memory Access) is the networking primitive that makes this fast. An RDMA NIC can read or write a remote machine's memory without involving its CPU, using one-sided operations that go through dedicated NIC hardware on each side. Latencies can be as low as about a microsecond on InfiniBand and a few microseconds on RoCE (RDMA over Converged Ethernet). RDMA is used not only for storage but also for distributed databases, machine-learning training, and any other workload where moving data between machines must be cheap.
The architectural shift, then, is that the boundary of the "machine" is becoming fuzzy. Remote DRAM, remote SSD, remote PMEM, even remote computation are all accessible at latencies that, a decade ago, were the exclusive province of local hardware. Software architectures — microservices, distributed databases, disaggregated cloud storage — have organized themselves around this fact. The hierarchy that began in the CPU's register file extends outward through caches, memory, local storage, the local network, and finally the wide-area network, with each level slower and larger than the last and the same locality principles applying at every step.
14. Summary
Persistent storage and I/O subsystems sit at the bottom of the memory hierarchy, holding everything that survives a power cycle. To the operating system, storage devices look like block devices: arrays of fixed-size blocks accessed by number. The file system layers files and directories on top of this abstraction, and on top of that the operating system maintains the page cache — a large software cache of recently accessed blocks in DRAM — so that most reads never reach the device and writes are batched into asynchronous write-back. fsync and FUA bridge the gap between in-memory and durable, and O_DIRECT lets applications opt out when they want to manage caching themselves.
Hard disk drives, with their mechanical seek and rotation, gave decades of operating systems their characteristic optimization patterns — favoring sequential access, fearing random access. Solid-state drives, with their no-moving-parts architecture but their own quirks of erase-before-write and wear, have replaced HDDs for active workloads while introducing new layers of controller-firmware complexity. Modern variations — ZNS, SMR, multi-namespace SSDs — push more device structure back into the host's view in exchange for predictable latency and lower device-side complexity.
The path between CPU and storage device runs through the I/O mechanisms of Chapter 9: memory-mapped registers, DMA, and interrupts, organized via PCIe in modern systems. NVMe, the dominant modern storage interface, replaces older designs with per-core submission and completion queues that scale to millions of operations per second. Matching software interfaces (io_uring, polling drivers, kernel-bypass stacks) shave off the remaining microseconds by cutting system calls and, in their polling modes, avoiding interrupts. RAID and erasure coding give durability against individual disk failures; persistent memory introduces a programming model where loads and stores reach durable media directly, with explicit cache flushes for durability and fences for ordering. NVMe-oF and RDMA carry storage and memory access over the network at latencies competitive with local devices, making disaggregated compute and storage practical at cloud scale.
Storage latency dwarfs DRAM latency, and DRAM latency dwarfs cache latency; the hierarchy spans nine orders of magnitude in time and ten in capacity, and now extends across machines through the local network. The architectural patterns we have built across Part IV — registers, caches, DRAM, virtual memory, storage — together give the program the illusion of a flat, fast, unbounded memory while behind the scenes a layered system juggles bytes among many technologies of different speeds and costs.
This concludes Part IV. Part V turns from the data side of the system to the execution side, examining how a CPU implements an ISA at high performance: pipelining, branch prediction, superscalar execution, out-of-order execution, and the rest of the micro-architectural toolkit.