Part III·Memory and Storage·Chapter 18 of 62

Main Memory and DRAM

May 16, 2026·30 min read·intermediate

When a load misses every cache, it goes to main memory, which on essentially every general-purpose computer means DRAM — Dynamic Random-Access Memory. DRAM is the workhorse memory technology of the last fifty years, and almost every gigabyte of memory you have ever used was a gigabyte of DRAM. It is the memory the operating system and the hardware are talking about when they refer to "RAM": the place where programs and their data live while they run.

DRAM is also a strange and unintuitive piece of hardware. It is built from minimum-area cells that lose their charge if not refreshed, organized into arrays whose access patterns favor long, contiguous bursts, and connected to the processor through controllers that schedule traffic in surprising ways. To a software programmer, DRAM looks like a flat byte-addressable store. To a hardware engineer, it is a layered system whose performance depends critically on access pattern, refresh timing, and channel utilization. This chapter unfolds DRAM from one view to the other.

01.SRAM versus DRAM

We have already met SRAM (Static RAM): the memory technology used for registers, caches, and TLBs. An SRAM cell is built from six transistors arranged as two cross-coupled inverters with two access transistors. The cell is static: as long as power is applied, it holds its state indefinitely. It is fast (sub-nanosecond access in modern processes) and easy to interface to.

Figure: 6T SRAM cell: two cross-coupled inverters store the bit while two access transistors connect the cell to the bit lines BL and BL-prime under control of the word line

\begin{tikzpicture}[font=\footnotesize, line cap=round]
  \node[font=\small] at (3, 0) {SRAM cell (6T)};
  \node at (3, -0.7) {WL (word line)};
  \draw[thick, fill=white] (1.5, -2) rectangle (2.5, -1); \node at (2, -1.5) {inv};
  \draw[thick, fill=white] (3.5, -2) rectangle (4.5, -1); \node at (4, -1.5) {inv};
  \draw[<->] (2.5, -1.5) -- (3.5, -1.5);
  \draw[thick] (0.8, -1.3) -- (1.5, -1.3);
  \node[anchor=east] at (0.8, -1.3) {access};
  \draw[thick] (4.5, -1.3) -- (5.2, -1.3);
  \node[anchor=west] at (5.2, -1.3) {access};
  \node at (0.8, -2.4) {BL};
  \node at (5.2, -2.4) {BL'};
\end{tikzpicture}

SRAM's drawback is area. Six transistors per bit means low density. A 1 Mb SRAM array takes substantial silicon; a 1 GB SRAM would be impractically large.

A DRAM cell, by contrast, uses just one transistor and one capacitor per bit. The capacitor stores the bit as a charge (full = 1, empty = 0); the transistor connects the capacitor to the bit line for reading and writing. The cell is roughly one-tenth the area of an SRAM cell, which is why DRAM gives us gigabytes where SRAM gives us megabytes.

Figure: 1T1C DRAM cell: a single capacitor stores the bit as charge while a single transistor, gated by the word line, connects it to the bit line for reads and writes

\begin{tikzpicture}[font=\footnotesize, line cap=round]
  \node[font=\small] at (2, 0) {DRAM cell (1T1C)};
  \node at (2, -0.6) {WL (word line)};
  \draw[thick, fill=white] (1.5, -1.7) rectangle (2.5, -1);
  \node at (2, -1.35) {transistor};
  \draw[thick, fill=white] (1.5, -2.7) rectangle (2.5, -2);
  \node at (2, -2.35) {capacitor};
  \draw[thick] (2, -1) -- (2, -0.8);
  \draw[thick] (2, -1.7) -- (2, -2);
  \draw[thick] (2, -2.7) -- (2, -3.1);
  \node at (2, -3.4) {BL (bit line)};
\end{tikzpicture}

The single-transistor cell is wonderfully compact, but it has problems. The capacitor is tiny — femtofarads — and leaks charge over time. Even at room temperature, a DRAM cell loses its state in milliseconds if left alone. The cell must therefore be refreshed: read and rewritten periodically to restore its charge. DRAM is dynamic in the sense that, unlike SRAM, it cannot retain its state passively; it must be actively maintained.

Reading a DRAM cell is also destructive. Connecting the cell's tiny capacitor to the much larger capacitance of the bit line drains its charge by sharing. The sense amplifier resolves the tiny voltage shift into a logic level, but the cell's stored value is lost in the process; the sense amplifier writes it back as part of the read operation. Every DRAM read is therefore a read-and-rewrite, which is part of why DRAM is slower than SRAM.
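The charge-sharing step can be put into numbers with a short sketch. The capacitances and supply voltage below are illustrative assumptions rather than values from any datasheet, and the bit line is assumed to be precharged to half the supply voltage before the word line is raised.

```python
def bitline_swing(v_cell, v_precharge, c_cell, c_bitline):
    """Voltage change on the bit line after the cell capacitor is connected.

    Charge is conserved, so the shared node settles at the
    capacitance-weighted average of the two starting voltages.
    """
    v_final = (c_cell * v_cell + c_bitline * v_precharge) / (c_cell + c_bitline)
    return v_final - v_precharge

# Illustrative (assumed) values, not taken from a real device.
VDD = 1.1            # supply voltage in volts
C_CELL = 20e-15      # cell capacitor, 20 fF
C_BITLINE = 200e-15  # bit line, ten times larger

swing_one = bitline_swing(VDD, VDD / 2, C_CELL, C_BITLINE)   # cell stored a 1
swing_zero = bitline_swing(0.0, VDD / 2, C_CELL, C_BITLINE)  # cell stored a 0
print(f"stored 1: {swing_one * 1e3:+.0f} mV")
print(f"stored 0: {swing_zero * 1e3:+.0f} mV")
```

Under these assumptions the bit line moves by only about 50 mV in either direction, and the cell itself ends up near the midpoint voltage. That is why a sense amplifier is needed and why the value must be written back.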

These properties — destructive reads, periodic refresh, very small per-cell area — give DRAM its characteristic shape. The technology rewards big bulk transfers and large arrays; it punishes small random accesses; and it requires a controller that manages the refresh and timing constraints.

A summary comparison:

Property	SRAM	DRAM
Cell size	6 transistors	1 transistor + 1 capacitor
Density	Low	~10× higher
Cost per bit	High	Low
Access latency	< 1 ns	~50 ns (chip), ~80 ns (system)
Volatile?	Yes	Yes
Refresh required?	No	Yes (every ~64 ms)
Used for	Registers, caches	Main memory

02.DRAM Organization

A DRAM chip is built as a hierarchy of structures, from cells up to chips and modules. The hierarchy is worth following carefully, because the organization shapes how DRAM behaves.

Cells, Rows, and Columns

The smallest unit is the cell, holding one bit. Cells are arranged in a two-dimensional grid: many rows and many columns. A row in a modern DRAM might be a few thousand cells wide. A row is also called a page (not to be confused with virtual-memory pages), and the set of sense amplifiers that latches an activated row is called the row buffer.

Reading from DRAM is a two-step process.

The first step is row activation. The row's word line is asserted, connecting all the cells in the row to their respective bit lines. The sense amplifiers detect the small voltage shifts and latch the row into the row buffer. This step takes some time — measured in tens of nanoseconds — and is one of the dominant components of DRAM latency.

The second step is column access. Once the row is in the row buffer, individual columns can be read out by selecting them with the column address. Column accesses are much faster than row activations, because they are essentially reads from a wide latched register.

Figure: DRAM array activation: a selected row is driven onto the bit lines into the row buffer and sense amplifiers, after which columns are read out individually

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round]
  % Rows: horizontal lines from x=2 to x=8
  \node[anchor=east] at (1.9, -0.5) {row 0};      \draw[thick] (2, -0.5) -- (8, -0.5);
  \node[anchor=east] at (1.9, -1.0) {row 1};      \draw[thick] (2, -1.0) -- (8, -1.0);
  \node[anchor=east] at (1.9, -1.5) {row 2};      \draw[thick] (2, -1.5) -- (8, -1.5);
  \node[anchor=east] at (1.9, -2.0) {$\vdots$};
  \node[anchor=east] at (1.9, -2.5) {row 8191};   \draw[thick] (2, -2.5) -- (8, -2.5);
  % Column lines down to row buffer
  \draw[thick] (2, -2.5) -- (2, -3.2);
  \draw[thick] (3, -2.5) -- (3, -3.2);
  \draw[thick] (4, -2.5) -- (4, -3.2);
  \draw[thick] (5, -2.5) -- (5, -3.2);
  \draw[thick] (6, -2.5) -- (6, -3.2);
  \draw[thick] (7, -2.5) -- (7, -3.2);
  \draw[thick] (8, -2.5) -- (8, -3.2);
  % Row buffer
  \draw[thick, fill=white] (1.5, -3.9) rectangle (8.5, -3.2);
  \node at (5, -3.55) {row buffer / sense amps};
  \node at (2, -2.8) {col 0};
  \node at (8, -2.8) {col 8191};
\end{tikzpicture}

This two-step structure has a major performance consequence. Successive accesses to the same row are fast, because the row is already in the row buffer; only the column address changes. Successive accesses to different rows are slow, because each new row requires re-activation. DRAM benefits enormously from spatial locality, just as caches do, but at a different granularity.

The vocabulary that captures this is row buffer hits and row buffer conflicts. A row buffer hit is an access to a row already activated; it returns in just the column-access time (perhaps 10 ns). A row buffer conflict is an access to a different row in the same bank; it requires precharging the current row and activating the new one, doubling or tripling the latency.
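A toy model makes the hit and conflict vocabulary concrete. The sketch below tracks only which row is open in a single bank and labels each access; the 8 KiB row size and 64-byte access size are assumptions chosen for illustration, and no timing is modeled.

```python
from collections import Counter

ROW_BYTES = 8192   # assumed row (page) size
LINE_BYTES = 64    # one cache line per access

class Bank:
    """One DRAM bank with a single row buffer, kept open between accesses."""

    def __init__(self):
        self.open_row = None  # None means the bank is precharged (no row open)

    def access(self, row):
        if self.open_row == row:
            kind = "hit"        # column access only
        elif self.open_row is None:
            kind = "miss"       # activate, then column access
        else:
            kind = "conflict"   # precharge, activate, then column access
        self.open_row = row
        return kind

def classify(addresses):
    """Count hits, misses, and conflicts for a stream of byte addresses."""
    bank = Bank()
    return Counter(bank.access(addr // ROW_BYTES) for addr in addresses)

sequential = classify(range(0, 4 * ROW_BYTES, LINE_BYTES))
row_stride = classify(range(0, 512 * ROW_BYTES, ROW_BYTES))
print("sequential:", dict(sequential))
print("row stride:", dict(row_stride))
```

Both streams issue 512 accesses. The sequential one pays for four row openings and gets 508 hits; the row-sized stride opens a new row every time and never hits.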

Banks

A DRAM chip contains multiple banks, each an independent array of rows and columns with its own row buffer. Banks can operate in parallel: bank 0 can be servicing one access while bank 1 is precharging and bank 2 is activating. Modern DDR4 chips have 16 banks; DDR5 has 32.

Banking is critical for bandwidth. A single bank can deliver only one transaction at a time, and each transaction has substantial latency. By spreading consecutive memory accesses across many banks, the controller can keep all of them busy simultaneously and approach the device's full bandwidth. Modern memory controllers go to considerable lengths to schedule requests so that bank conflicts are minimized.

Ranks

Multiple DRAM chips are mounted on a circuit board called a DIMM (Dual Inline Memory Module). The chips on a DIMM are organized into ranks — groups of chips that operate together in lockstep. A 64-bit-wide rank, for example, might consist of eight chips that each contribute 8 bits to every transaction. A DIMM can have one or two ranks (single-rank or dual-rank DIMMs).

From the controller's perspective, a rank is a unit that can service one transaction at a time. Multiple ranks on the same DIMM can operate independently, similarly to multiple banks but at a coarser granularity. Ranks live on the same channel and share the same data lines, so they cannot drive data simultaneously, but they can perform background operations (precharge, activation) in parallel.

Channels

A memory channel is a complete path from the memory controller to a set of DIMMs: an address bus, a data bus, command lines, the works. A modern desktop processor has 2 channels; a server chip might have 6, 8, or 12. Each channel operates independently, so adding channels straightforwardly multiplies bandwidth.

The aggregate memory bandwidth of a system is roughly

$\text{BW} = N_{\text{channels}} \times \text{channel width} \times \text{transfer rate}.$

A typical 2-channel desktop with DDR5-5600 has

$2 \times 64 \text{ bits} \times 5600 \text{ MT/s} = 716.8 \text{ Gb/s} = 89.6 \text{ GB/s}.$

A 12-channel server with DDR5-4800 reaches around 460 GB/s on paper.
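The peak-bandwidth formula is simple enough to check directly. This is a minimal sketch; the function name is illustrative, and 1 GB is taken as $10^9$ bytes.

```python
def peak_bandwidth_gb_s(channels, width_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a set of identical channels."""
    bytes_per_transfer = width_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9

desktop = peak_bandwidth_gb_s(2, 64, 5600)    # 2-channel DDR5-5600
server = peak_bandwidth_gb_s(12, 64, 4800)    # 12-channel DDR5-4800
print(f"desktop: {desktop:.1f} GB/s")         # 89.6 GB/s
print(f"server:  {server:.1f} GB/s")          # 460.8 GB/s
```

These are ceilings: the figure counts every transfer slot on every channel as carrying useful data.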

Bandwidth scales nicely with channel count, but only if the workload spreads evenly across channels. A program whose accesses all land on a single channel is limited to that channel's bandwidth, no matter how many other channels exist, and one that keeps hitting a single bank is limited further to that bank's throughput.

Putting It Together

A simplified view of the layered structure:

Figure: DRAM hierarchy: the memory controller dispatches to channels, each channel reaches DIMMs that decompose into ranks, chips, banks, rows, and finally cells

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round,
  blk/.style={draw, thick, fill=white, minimum height=0.7cm}]
  \node[blk, minimum width=3cm] (mc) at (1.5, -0.5) {memory controller};
  \node[blk, minimum width=2cm] (c0) at (4.5, -1.7) {channel 0};
  \node[blk, minimum width=2cm] (c1) at (4.5, -2.6) {channel 1};
  \node at (4.5, -3.4) {$\vdots$};
  \draw[->] (mc.east) -- (3, -0.5) -- (3, -1.7) -- (c0.west);
  \draw[->] (mc.east) -- (3, -0.5) -- (3, -2.6) -- (c1.west);
  \node[blk, minimum width=8cm] (d0) at (10, -1.7) {DIMM $\to$ ranks $\to$ chips $\to$ banks $\to$ rows $\to$ cells};
  \node[blk, minimum width=8cm] (d1) at (10, -2.6) {DIMM $\to$ ranks $\to$ chips $\to$ banks $\to$ rows $\to$ cells};
  \draw[->] (c0) -- (d0);
  \draw[->] (c1) -- (d1);
\end{tikzpicture}

Every DRAM access has to be routed through this structure: the controller decides which channel, the channel selects a rank, the rank addresses a chip, the chip selects a bank, the bank activates a row, and finally the column read or write happens. Each layer adds opportunities for parallelism (across channels, ranks, banks) and constraints (timing parameters that must be honored).

03.Memory Channels, Ranks, and Banks: Why Parallelism Matters

A simple way to see why all of this matters is to look at the latency-bandwidth product.

A single bank in a single rank in a single channel can deliver one transaction. A modern transaction is 64 bytes (one cache line); the cycle time of a single bank — the minimum time between independent transactions — is roughly $t_{RC}$ , perhaps 50 ns. So a single bank can deliver about $64 \text{ B} / 50 \text{ ns} = 1.28$ GB/s.

A real channel carrying DDR5-5600 has a peak of about 45 GB/s. To use that bandwidth, the system must keep dozens of transactions in flight simultaneously, distributed across many banks. A single thread issuing one load at a time, with each load dependent on the previous, will see only about 1 GB/s of effective bandwidth, even on a system with 100 GB/s on paper. This is one of the most important practical lessons about modern memory systems: bandwidth is not delivered through latency; it is delivered through parallelism.

A modern out-of-order processor can have a hundred or more memory operations in flight at once, distributed across the cache hierarchy and the memory controller's request queue. Hardware prefetchers add even more, often pulling in lines that the program will need soon. The effect is to keep many banks busy in parallel, achieving close to the channel's full bandwidth on streaming workloads.

The same parallelism is hard to achieve on workloads with long dependency chains. A pointer-chasing benchmark (load A, follow the pointer in A to load B, follow the pointer in B to load C) has at most one outstanding load at a time, so it sees the full DRAM latency on every access. Bandwidth is mostly irrelevant; latency dominates. This is why pointer-heavy data structures (linked lists, trees) often run much slower than array-based structures that implement the same algorithm.
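The relationship between requests in flight and delivered bandwidth follows from Little's law, and a rough sketch shows how steeply it matters. The 80 ns latency and 45 GB/s channel peak are round numbers assumed for illustration, and the model ignores bank conflicts, refresh, and queuing.

```python
LINE_BYTES = 64

def sustained_bandwidth_gb_s(outstanding, latency_ns, peak_gb_s):
    """Little's law: throughput = data in flight / latency, capped at the peak."""
    demand = outstanding * LINE_BYTES / latency_ns  # bytes per ns equals GB/s
    return min(demand, peak_gb_s)

for n in (1, 4, 16, 64):
    bw = sustained_bandwidth_gb_s(n, latency_ns=80, peak_gb_s=45)
    print(f"{n:3d} lines in flight -> {bw:5.1f} GB/s")
```

With one load outstanding, as in pointer chasing, the model gives under 1 GB/s. It takes several dozen concurrent cache-line requests before the channel peak, rather than latency, becomes the limit.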

04.Refresh

Every DRAM cell must be refreshed periodically. The standard interval, set by the JEDEC specifications, is 64 milliseconds at normal operating temperatures (up to 85 °C); each cell must be refreshed at least once in that window or it may lose its data.

Refresh is performed by the memory controller in the background. The controller issues refresh commands at appropriate intervals; each command refreshes a small group of rows in each bank. The refresh budget is divided across the 64 ms window so that, on average, the controller issues one refresh command every $64 \text{ ms} / N$ , where $N$ is the number of refresh commands required to cover all rows.

For a typical DDR4 device, refresh consumes a few percent of the available bandwidth: small but not zero. During a refresh, the affected banks cannot service ordinary requests, so the memory controller has to schedule refreshes around the workload's traffic pattern. Modern devices support fine-granularity and per-bank refresh modes that let the controller issue smaller refresh commands more frequently, reducing the worst-case stall.

In high-temperature operation (above 85 °C, or under specific thermal conditions), the refresh interval shortens to 32 ms because cells leak faster. Server-class systems often run hot enough to fall into this mode, costing additional bandwidth.
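The "few percent" figure can be reproduced with a back-of-the-envelope calculation. The sketch assumes 8192 refresh commands per window (a common DDR4 arrangement) and a refresh command duration of 350 ns, the $t_{RFC}$ value used elsewhere in this chapter; real devices vary with density.

```python
def refresh_overhead(window_ms, n_commands, t_rfc_ns):
    """Return (ns between refresh commands, fraction of time spent refreshing)."""
    t_refi_ns = window_ms * 1e6 / n_commands
    return t_refi_ns, t_rfc_ns / t_refi_ns

N_COMMANDS = 8192   # assumed number of refresh commands per window
T_RFC_NS = 350      # assumed duration of one refresh command

normal = refresh_overhead(64, N_COMMANDS, T_RFC_NS)
hot = refresh_overhead(32, N_COMMANDS, T_RFC_NS)
print(f"64 ms window: one refresh every {normal[0]:.1f} ns, {normal[1]:.2%} busy")
print(f"32 ms window: one refresh every {hot[0]:.1f} ns, {hot[1]:.2%} busy")
```

Under these assumptions the rank is unavailable about 4.5 % of the time in the normal range and about 9 % in the high-temperature range, which is the extra bandwidth cost mentioned above.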

A useful corollary: if power is removed for any significant fraction of the refresh interval, DRAM contents are lost. Persistent memory is not a property of DRAM; it requires SSDs, battery-backed DRAM, or explicit non-volatile memory technologies.

05.Bandwidth and Latency

We have been using the words bandwidth and latency as if they were obvious. In the context of DRAM, both have several distinct meanings worth distinguishing.

Cell latency — the time from when a column read is issued to when the bit appears at the chip's output — is about 10 to 15 ns on modern devices. This is the part of the access that comes from the chip's internal physics.

Row activation latency ( $t_{RCD}$ ) — the time from when a row is activated to when its cells are ready to be read — is also around 10 to 15 ns. If a row needs to be activated before the column read, both delays apply.

Precharge latency ( $t_{RP}$ ) — the time to close an open row before opening a new one — is similarly around 10 to 15 ns.

Random access latency (the time from a request that finds a different row open in its bank to the data being delivered) is the sum of precharge, activation, and column access: typically 40 to 50 ns on the chip, plus another 20 to 30 ns for path delay through the DDR interface, controller, and memory bus. The total is what users see as DRAM latency, around 60 to 100 ns.

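Burst transfer time (the time to transfer a 64-byte burst over the data bus) depends on the data rate and the bus width. At DDR5-4800, a 64-bit-wide bus moves 8 bytes per transfer, so 64 bytes take 8 transfers, or $8 / (4800 \times 10^{6}\ \text{s}^{-1}) \approx 1.67$ ns. (DDR5 DIMMs actually split the data bus into two independent 32-bit subchannels with a burst length of 16, so a 64-byte burst occupies one subchannel for about 3.3 ns.)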

Channel bandwidth — the maximum sustained throughput of a memory channel — is the data rate times the channel width. At DDR5-4800 and 64-bit channels, that is 38.4 GB/s; at DDR5-6400, 51.2 GB/s. Real workloads achieve some fraction of this, depending on access pattern and parallelism.

Sustained bandwidth — what a real workload achieves — is workload-dependent. Streaming reads with good prefetching can hit 80 % to 90 % of the channel maximum. Mixed read/write workloads see less, due to bus turnaround time. Random workloads see much less, because of bank conflicts and unhidden activate-precharge cycles.

A useful piece of mental arithmetic: a modern desktop CPU has roughly 50 GB/s of DRAM bandwidth and roughly 80 ns of DRAM latency. The latency-bandwidth product is

$50 \text{ GB/s} \times 80 \text{ ns} = 4000 \text{ bytes} = 62.5 \text{ cache lines}.$

To use the full bandwidth, the system must keep about 60 cache lines worth of memory operations in flight at once. This is why modern processors have load buffers of dozens of entries, hardware prefetchers running aggressively, and out-of-order execution that can dispatch many memory operations in parallel.

06.Memory Controllers

The memory controller is the piece of hardware that orchestrates all of this. On modern processors it is integrated onto the CPU die — the integrated memory controller — rather than living on a separate northbridge chip as it did until the mid-2000s. Integration cuts latency: requests no longer cross a chip boundary on their way to the memory channel.

The controller's job is to take requests from the cache hierarchy, queue them, schedule them across channels and banks, issue the necessary commands (precharge, activate, read, write, refresh) while honoring DRAM timing parameters, and route the responses back. It is, in effect, a small specialized processor whose program is the DRAM protocol.

Several scheduling policies appear in practice.

First-Come First-Served (FCFS) is the simplest: requests are serviced in arrival order. This is fair and predictable but performs poorly because it ignores the DRAM's row-buffer-locality preference.

First-Ready First-Come-First-Served (FR-FCFS) prioritizes requests that hit in the row buffer over requests that would require a row activation. This dramatically improves throughput by exploiting row-buffer locality, at some cost to fairness.

Open-row versus closed-row policies determine what to do with a row buffer between requests. Open-row policies leave the row open in the hope that another access to it is coming; closed-row policies precharge after each access to be ready for arbitrary new requests. Real controllers use predictive hybrids.

Bank parallelism scheduling distributes incoming requests across banks to maximize concurrency.

Read/write reordering groups reads together and writes together to minimize the bus turnaround time between them, which is a real cost on bidirectional data buses.

Quality-of-service mechanisms in some controllers prioritize traffic from particular requestors (the GPU, real-time-critical I/O, latency-sensitive cores) over others.

Modern memory controllers contain extensive internal state machines and scheduling logic. The work they do is substantial; a poorly designed controller can lose tens of percent of the channel's nominal bandwidth.
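The difference between FCFS and FR-FCFS can be seen in a few lines. The sketch below is a deliberately simplified model, not how any real controller is built: all requests are assumed to be queued up front, only the open row per bank is tracked, and timing is ignored. The `Request` type and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int  # arrival order; smaller is older
    bank: int
    row: int

def pick_fcfs(queue, open_rows):
    """Oldest request first, regardless of row-buffer state."""
    return min(queue, key=lambda r: r.arrival)

def pick_fr_fcfs(queue, open_rows):
    """Oldest row-buffer hit if any exists; otherwise the oldest request."""
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    return min(hits or queue, key=lambda r: r.arrival)

def run(policy, requests):
    """Drain the queue under a policy; return service order and row-hit count."""
    queue = list(requests)
    open_rows = {}  # bank -> currently open row
    order, row_hits = [], 0
    while queue:
        req = policy(queue, open_rows)
        queue.remove(req)
        if open_rows.get(req.bank) == req.row:
            row_hits += 1
        open_rows[req.bank] = req.row
        order.append(req.arrival)
    return order, row_hits

# Two interleaved streams that alternate between rows 5 and 9 of bank 0.
reqs = [Request(i, bank=0, row=(5 if i % 2 == 0 else 9)) for i in range(5)]
print("FCFS:   ", run(pick_fcfs, reqs))
print("FR-FCFS:", run(pick_fr_fcfs, reqs))
```

On this input FCFS ping-pongs between the two rows and gets no hits, while FR-FCFS serves the three row-5 requests back to back and then the two row-9 requests, for three hits. The older row-9 request waits longer as a result, which is the fairness cost mentioned above.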

07.DRAM Timing Parameters

The schedule a memory controller has to honor is encoded in a long list of timing parameters, every one of which is a constraint between two events on the DRAM bus. The parameters are usually written in clock cycles of the memory clock and quoted alongside the data rate as, for example, "DDR4-3200 CL16-18-18-38." The numbers each name a parameter; reading a memory specification is a matter of understanding what the names mean.

The most important parameters, with typical DDR4-3200 values, are:

Parameter	Meaning	Typical (cycles unless noted)
CL ( $t_{CL}$ )	CAS latency: column address to data	16
$t_{RCD}$	Row to column delay (after activate)	18
$t_{RP}$	Row precharge time (after closing a row)	18
$t_{RAS}$	Row active time (minimum time a row is open)	38
$t_{RC}$	Row cycle: $t_{RAS} + t_{RP}$	56
$t_{RFC}$	Refresh cycle time	350 ns
$t_{FAW}$	Four-activate window: at most 4 activates in this span	30
$t_{WTR}$	Write-to-read delay within a bank	~10
$t_{CWL}$	CAS write latency	14

The physical meaning is simple. To read data from a closed row, the controller issues an activate (ACT) to bring the row into the row buffer, which takes $t_{RCD}$ cycles. It then issues a read (RD) targeting a column, and data emerges $t_{CL}$ cycles later. Closing the row takes $t_{RP}$ cycles of precharge before another row can be activated. Between activate and precharge to the same row, $t_{RAS}$ must elapse so that the sense amplifiers have settled.

The latency for a row miss (bank precharged, must activate) is therefore approximately $t_{RCD} + t_{CL}$ , of order 34 cycles or 21 ns at DDR4-3200. A row hit (target row already open in the buffer) is just $t_{CL}$ , of order 16 cycles or 10 ns. A row conflict (different row open in the same bank) is $t_{RP} + t_{RCD} + t_{CL}$ , about 52 cycles or 32 ns, the worst case.

The four-activate window ( $t_{FAW}$ ) is a limit imposed by power: the chip cannot sustain more than four activates in a 30-cycle window without exceeding its current budget. The refresh cycle time ( $t_{RFC}$ ) is the duration of a refresh command, which stalls all banks in a rank for 350 ns or so. These constraints become significant for memory-intensive workloads where the controller would otherwise issue activates faster than the chip can absorb them.

Reading a memory product specification is now possible. "DDR5-6400 CL32-39-39-77" means CAS latency 32 cycles, $t_{RCD}$ and $t_{RP}$ each 39 cycles, $t_{RAS}$ 77 cycles, all at the 3.2 GHz memory clock that drives DDR5-6400. Lower numbers (at the same data rate) mean lower latency for the workloads that hit them.
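Converting a timing string into nanoseconds is mechanical, as the following sketch shows. It assumes the four-number format used above (CL, $t_{RCD}$, $t_{RP}$, $t_{RAS}$) and that the memory clock runs at half the data rate; the function names are illustrative, and the latencies cover only the DRAM device, not the controller or the bus.

```python
def parse_timings(spec):
    """Parse a string such as 'DDR4-3200 CL16-18-18-38'."""
    name, timings = spec.split()
    data_rate = int(name.split("-")[1])   # MT/s
    clock_mhz = data_rate / 2             # two transfers per clock cycle
    cl, t_rcd, t_rp, t_ras = (int(x) for x in timings.replace("CL", "", 1).split("-"))
    return {"tCK_ns": 1000 / clock_mhz, "CL": cl, "tRCD": t_rcd,
            "tRP": t_rp, "tRAS": t_ras}

def access_latencies_ns(spec):
    """Device-level latency for the three row-buffer outcomes, in nanoseconds."""
    t = parse_timings(spec)
    tck = t["tCK_ns"]
    return {
        "hit": t["CL"] * tck,                               # row already open
        "miss": (t["tRCD"] + t["CL"]) * tck,                # bank precharged
        "conflict": (t["tRP"] + t["tRCD"] + t["CL"]) * tck, # other row open
    }

for spec in ("DDR4-3200 CL16-18-18-38", "DDR5-6400 CL32-39-39-77"):
    print(spec, access_latencies_ns(spec))
```

Both parts come out at a 10 ns row hit. The DDR5 part has twice the cycle counts but half the cycle time, so cycle counts can only be compared at the same data rate.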

08.Address Mapping and Bank Interleaving

The memory controller takes a physical address and decomposes it into channel, rank, bank group, bank, row, and column fields. The choice of which address bits drive which field is the address mapping, and it has a substantial effect on performance.

A naïve mapping puts the column bits at the bottom (so consecutive bytes lie in the same row), then the row bits, then the bank, rank, and channel bits at the top. This favors row-buffer locality for sequential access but performs poorly for strided access: a stride of exactly the row size hits the same bank with a different row every time, producing constant row conflicts, and a stride of half the row size gets only one row-buffer hit per activation.

A better mapping interleaves the bank, rank, and channel bits at lower-order positions, so that consecutive cache lines spread across banks and channels naturally. With cache-line-grained channel interleaving, a sequential access stream automatically uses every channel; with bank interleaving above that, it spreads across banks within each channel.

Most modern controllers add an XOR hash on top of this. The bank and channel select bits are computed by XORing several address bits together rather than taking them directly. The hash makes it almost impossible for a program to construct a stride that systematically hits the same bank: any pattern of address bits is scrambled into a roughly uniform spread across banks. The cost of the hash is a few logic gates in the controller's fast path; the benefit is a significant reduction in pathological cases.
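To see the effect of the hash, consider a made-up address layout. Everything about the bit positions below is an assumption for illustration (real controllers use vendor-specific and often undocumented mappings): 64-byte lines, one channel bit, four bank bits, seven column bits, and the row in the remaining upper bits, with the bank and channel selects optionally XORed with low row bits.

```python
def decode(addr, hashed=True):
    """Split a physical address into (channel, bank, row, column).

    Hypothetical layout, low to high: 6 offset bits, 1 channel bit,
    4 bank bits, 7 column bits (cache lines within a row), then row bits.
    """
    line = addr >> 6
    channel = line & 0x1
    bank = (line >> 1) & 0xF
    column = (line >> 5) & 0x7F
    row = line >> 12
    if hashed:
        bank ^= row & 0xF             # fold low row bits into the bank select
        channel ^= (row >> 4) & 0x1   # and one more row bit into the channel
    return channel, bank, row, column

STRIDE = 1 << 18  # advances the row field by one and leaves lower fields unchanged
plain = {decode(i * STRIDE, hashed=False)[1] for i in range(16)}
mixed = {decode(i * STRIDE, hashed=True)[1] for i in range(16)}
print("banks touched without hash:", len(plain))
print("banks touched with hash:   ", len(mixed))
```

Without the hash, the sixteen strided accesses all land in bank 0 with sixteen different rows. With it, they spread across all sixteen banks. Because the row field passes through unchanged and the XOR is invertible, the mapping remains one-to-one.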

The practical reading is that the bandwidth a workload sees from main memory depends not only on the channel count and clock rate but also on how the controller maps the program's access pattern onto the DRAM hierarchy. Workloads with high spatial locality fit any sensible mapping; workloads with awkward strides depend on the hash.

09.NUMA in Multi-Socket and Chiplet Systems

We noted in Chapter 16 that multi-socket servers and chiplet desktops have non-uniform memory access. The DRAM details fill in the picture: each socket (or chiplet) has its own integrated memory controller and its own DIMMs, and accessing another socket's memory requires going over an inter-socket interconnect.

Intel calls this interconnect UPI (Ultra Path Interconnect, the successor to QuickPath); AMD calls it Infinity Fabric; ARM server platforms use CMN mesh interconnects. The protocols are coherent (cache lines fetched from a remote socket participate in the coherence protocol of Chapter 31), and they are layered over physical links of high bandwidth and low latency, though with less bandwidth and more latency than the path to local DRAM. A typical two-socket server might see local DRAM latency of 80 ns and remote DRAM latency of 130 ns, with bandwidth across the interconnect comparable to one or two channels rather than the eight-or-more channels available locally.

AMD's chiplet desktop CPUs add a wrinkle: the memory controller lives on a separate I/O die from the cores, and even "local" memory access traverses the Infinity Fabric. Latency on these systems is consequently a bit higher than monolithic CPUs of the same generation, while gaining the advantages of chiplet manufacturing.

The operating system's role, as discussed in Chapter 16, is to allocate memory near the threads that use it and to migrate threads or pages when needed. Software that ignores NUMA can lose 10–30% on memory-intensive workloads on a two-socket box.

10.HBM, GDDR, and LPDDR Variants

DDR is the memory of laptops, desktops, and servers, but it is not the only DRAM variant in use.

HBM (High Bandwidth Memory) is a stacked DRAM technology: several DRAM dies are stacked vertically, connected by through-silicon vias (TSVs), and packaged on the same silicon interposer as a CPU or GPU. The interface is extremely wide — 1024 bits per stack — and runs at lower clock rates than DDR, achieving very high bandwidth at moderate power. HBM is used in high-end GPUs (Nvidia H100, AMD MI300), accelerators, and a small number of CPUs (Intel's Xeon Max). A single HBM3 stack delivers around 800 GB/s; an H100 with five stacks reaches 3 TB/s.

GDDR (Graphics DDR) is a DRAM variant optimized for the bandwidth-hungry, latency-tolerant access patterns of graphics. GDDR6 runs at much higher clock rates than DDR (16–24 Gbps per pin), uses wider buses (32-bit per chip), and trades latency for bandwidth. Mainstream GPUs use GDDR; the high end has moved to HBM.

LPDDR (Low-Power DDR) targets mobile and embedded systems. It runs at lower voltage, has more aggressive power-management states (deep power-down, partial-array self-refresh), and is often soldered directly onto a system-on-chip rather than installed as a removable module. Apple's M-series chips, most modern smartphones, and many laptops use LPDDR5 or LPDDR5X. The bandwidth per watt is excellent; the cost per bit is higher than DDR.

The takeaway is that "DRAM" is a family of technologies, each tuned for a different point in the bandwidth/power/latency/cost design space. Server and desktop CPUs use DDR; mobile uses LPDDR; GPUs and accelerators use GDDR or HBM.

11.Persistent Memory and Optane

A decade ago, the gap between DRAM and Flash looked like a permanent feature of the hierarchy. A memory technology that was byte-addressable, persistent, and faster than Flash would fit into the gap, and several were developed: phase-change memory (PCM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), ferroelectric RAM (FeRAM).

Intel's Optane, based on 3D XPoint, was the most prominent product to reach the market. Optane DIMMs (Intel called them DCPMM, Data Center Persistent Memory Modules) plugged into DDR4 sockets and offered hundreds of gigabytes per DIMM, latencies of a few hundred nanoseconds (compared to DRAM's 80 ns), and persistence across power loss. The programming model was unusual: applications could map persistent memory directly into their address space and access it byte-by-byte, with explicit cache flushes (the clwb and clflushopt instructions) ensuring data reached the persistence domain before declaring a transaction committed. Hardware support called ADR (Asynchronous DRAM Refresh) guaranteed that data in the memory controller's write queues would be flushed to persistence on power failure.

Intel discontinued Optane in 2022, and no comparable product has filled the niche. The architectural ideas have not gone away, however. CXL.mem (next section) provides a different way to attach lower-tier memory, persistent or not, to a CPU. The persistent-memory programming model is preserved by libraries (Intel's PMDK) and continues to influence the design of databases and storage systems.

12.Rowhammer and DRAM Security

DRAM cells are dense and increasingly so, which means cells are physically close to each other. Repeated activation of one row can disturb the charge in the adjacent rows enough to flip their bits, even though those rows are not being accessed directly. This phenomenon is called Rowhammer, and it was first publicly demonstrated in 2014.

Rowhammer is not a hardware bug; it is a side effect of pushing cell area down in successive DRAM generations. Its security implications, however, are serious. A program with no privileges and no special hardware access can, by hammering a particular memory pattern fast enough, cause bit flips in physical memory it does not own. With careful exploitation, those bit flips can be aimed at sensitive data structures — page table entries, kernel pointers — to gain elevated privileges. A long line of follow-up research has shown that browsers, JavaScript engines, and even network packets can be vehicles for Rowhammer attacks.

Mitigations have been added at multiple levels. DRAM chips themselves implement Target Row Refresh (TRR), which detects suspicious activation patterns and proactively refreshes likely-victim rows. DDR5 added Refresh Management (RFM), a dedicated mechanism to refresh rows under controller control. ECC partially helps: a single bit flip in a SECDED-protected word is corrected silently, raising the bar for an attacker. Operating systems can refuse to allocate pages adjacent to attacker-controlled pages.

The arms race continues. Each new mitigation has been followed by a new attack variant (Half-Double, RowPress, and so on), and DDR5's TRR has been shown to be defeatable. The architectural lesson is that DRAM is not a passive store; its physics are accessible to software in ways that have security consequences. We will revisit some of the broader micro-architectural side channels in Chapter 51.

13.Error Correction

DRAM is, like all very dense technologies, vulnerable to errors. Cosmic rays and other ionizing radiation flip bits at random, at a rate that depends on density, altitude, and process. A modern DRAM chip might experience one bit error per gigabyte per few months under typical conditions — rare, but not negligible at server scale, where a single machine has hundreds of gigabytes and runs for years.

The standard defense is error-correcting code (ECC) memory. ECC adds redundant bits to each data word, computed so that single-bit errors can be corrected and double-bit errors can be detected. The most common code in use is SECDED — Single-Error Correcting, Double-Error Detecting — typically a Hamming code variant that adds 8 bits of redundancy to each 64-bit word. The DIMM is wider as a result: a non-ECC DIMM is 64 bits wide; an ECC DIMM is 72 bits.

The memory controller computes the ECC syndrome on every read; if the syndrome is non-zero, the controller corrects single-bit errors silently and records the event, and reports double-bit errors as uncorrectable. Operating systems log uncorrectable errors and may refuse to use the affected memory.
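The encode and syndrome-check steps can be illustrated with a minimal extended Hamming (72,64) code. This is a sketch under stated assumptions, not the code any particular controller uses: the bit layout (overall parity at index 0, check bits at power-of-two positions) and the function names `encode` and `decode` are chosen here for clarity. Hardware typically uses a parity-check matrix tuned for gate count.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                       # 7 Hamming check bits
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data positions


def encode(data):
    """Encode a 64-bit integer as a list of 72 bits (index 0 = overall parity)."""
    code = [0] * 72
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Check bit p makes the parity of all positions with bit p set even.
        code[p] = sum(code[pos] for pos in range(1, 72) if pos & p) % 2
    code[0] = sum(code) % 2  # ninth... eighth redundant bit: parity of the whole word
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        code[syndrome] ^= 1  # single flip; the syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"  # even number of flips, or bad syndrome
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= code[pos] << i
    return data, status


word = 0xDEADBEEFCAFEF00D
stored = encode(word)

one_flip = list(stored)
one_flip[37] ^= 1
print(decode(one_flip)[1])  # corrected

two_flips = list(stored)
two_flips[5] ^= 1
two_flips[60] ^= 1
print(decode(two_flips)[1])  # uncorrectable
```

The overall parity bit is what separates the two cases: a single flip makes the word's parity odd, while a double flip leaves it even but produces a non-zero syndrome.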

Server systems use ECC universally. Desktop and consumer systems often do not, on cost grounds. Modern DDR5 does include "on-die ECC" inside each chip, which handles bit errors at the cell level; it is separate from system-level ECC, which also covers errors introduced on the bus between the DIMM and the controller.

More elaborate codes are also used. Chipkill ECC can survive the failure of an entire DRAM chip on a DIMM, by spreading each word across many chips and using a code strong enough to recover when one chip's contribution is lost. Advanced server platforms run chipkill or stronger codes by default.

14.DDR Generations and What Changes

The interface to DRAM has gone through several generations: DDR (released around 2000), DDR2, DDR3, DDR4, and DDR5 (current as of this writing, with DDR6 in early development). Each generation roughly doubles the data rate and adds new features. From a system perspective, the interesting things to know are:

Data rate increases with each generation. DDR5 transfers data at up to 6400 MT/s on standard DIMMs, with overclocked parts going much higher. DDR4 topped out around 3200 MT/s.

Voltage decreases. DDR3 was 1.5 V, DDR4 1.2 V, DDR5 1.1 V. Lower voltage means lower power per bit, an important consideration as memory subsystems consume substantial energy.

Channel structure changes. DDR5 splits each 64-bit channel into two independent 32-bit subchannels, each with its own command and address signals, doubling the number of channels the controller can operate independently. This improves parallelism on workloads with many small accesses.

On-die ECC was added in DDR5 to handle the increasing soft-error rate of high-density chips. This is separate from system-level ECC and operates inside each DRAM device.

Bank groups and bank counts have grown, providing more parallelism. DDR4 has 4 bank groups of 4 banks each (16 total); DDR5 has 8 bank groups of 4 banks each (32 total).
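The data rates and channel widths above translate directly into theoretical peak bandwidth: transfers per second times bytes per transfer. The sketch below works through that arithmetic. The function names are illustrative, and the burst lengths (8 for DDR4, 16 for DDR5) are standard values not stated in the text above.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


def bytes_per_burst(bus_width_bits, burst_length):
    """Bytes delivered by one burst on a bus of the given width."""
    return bus_width_bits // 8 * burst_length


print(peak_bandwidth_gbs(3200))      # DDR4-3200, 64-bit channel: 25.6
print(peak_bandwidth_gbs(6400))      # DDR5-6400, both subchannels: 51.2
print(peak_bandwidth_gbs(6400, 32))  # one DDR5 32-bit subchannel: 25.6

# Assumed burst lengths: 8 for DDR4, 16 for DDR5.
print(bytes_per_burst(64, 8))        # 64
print(bytes_per_burst(32, 16))       # 64
```

Both configurations deliver 64 bytes per burst, so a narrower DDR5 subchannel can still fill a 64-byte cache line with a single access. These are peak figures; sustained bandwidth is lower once refresh, row misses, and bus turnarounds are accounted for.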

The architectural effect of these changes is mostly to make the memory subsystem faster and more parallel; the basic structure (cells, rows, banks, ranks, channels) has remained stable for decades. A program written against the abstraction "main memory" sees, generation over generation, the same byte-addressable store, only faster.

15.Summary

Main memory is built from DRAM, a technology of one-transistor-one-capacitor cells whose density is unmatched but whose physics impose substantial constraints: destructive reads, periodic refresh, and access latencies in the tens of nanoseconds. DRAM is organized hierarchically into cells, rows, banks, ranks, channels, and modules; understanding the levels is essential because performance depends on access pattern. Row-buffer locality rewards consecutive accesses to the same row; bank-level parallelism rewards spreading accesses across many banks; channel-level parallelism rewards spreading them across channels. Bandwidth scales with parallelism, not with serial latency.

Memory controllers, integrated onto modern CPU dies, schedule requests across channels and banks, honor the DRAM timing parameters ($t_{RCD}$, $t_{RP}$, $t_{RAS}$, $t_{RFC}$, $t_{FAW}$, and the rest), manage refresh, and try to extract every bit of bandwidth the technology allows. Their address-mapping function, often XOR-hashed, determines how a program's access pattern spreads across the DRAM hierarchy. On multi-socket and chiplet systems, the picture becomes non-uniform: each CPU has its own controllers and channels, and remote-memory access goes over an inter-socket interconnect at higher latency.

DRAM is a family of related technologies. DDR drives desktops and servers; LPDDR drives mobile devices; GDDR and HBM drive GPUs and accelerators. Persistent-memory variants, of which Optane was the most prominent example, briefly filled the gap between DRAM and Flash and are being succeeded by CXL-attached memory tiers. Error-correcting codes protect against the inevitable bit errors at scale; Rowhammer reminds us that DRAM physics has security implications, not just performance ones. DDR generations have steadily increased rates and parallelism while keeping the basic structure stable.

A clear understanding of DRAM is prerequisite to almost everything that follows. Cache misses, memory-bound performance, NUMA effects, the cost of pointer chasing — all are at heart questions about how the program's access pattern interacts with this layered, parallelism-hungry technology. Chapter 19 turns to the layer of abstraction that hides the specifics of physical memory from the programmer: virtual memory.
