Part III·Memory and Storage·Chapter 17 of 62

Part IIIMemory and Storage

Cache Design

May 16, 2026·30 min read·intermediate

A cache is a piece of fast memory that sits between the CPU and a slower memory below it, holding copies of recently used data so that future accesses are fast. The previous chapter argued that…

A cache is a piece of fast memory that sits between the CPU and a slower memory below it, holding copies of recently used data so that future accesses are fast. The previous chapter argued that caches are the linchpin of the memory hierarchy: without them, every load and store would pay the full DRAM latency, and modern processor speeds would be wasted on waiting. This chapter takes the cache apart and looks at how it is built.

There is a surprising amount of structure in a piece of hardware whose job is, on the face of it, "remember what was just used." A cache has to decide where to put incoming data, how to find it again, what to evict when full, what to do on writes, and how to anticipate future accesses. Each of these decisions has been studied for decades and has become a subfield of its own. We will go through them in order.

01.Cache Lines and Blocks

The first design choice is the unit of caching. A cache does not store individual bytes or individual words; it stores cache lines (also called blocks), each typically 64 bytes long. Every transaction between the cache and the level below it moves a whole line.

Why store more than the program asked for? Two reasons.

Spatial locality says that accesses to nearby addresses tend to follow each other. If a program reads byte 100, it is very likely to also read bytes 101, 102, 103, and possibly 104 through 163 in the near future. By bringing in the entire line on the first miss, we save time on the subsequent accesses. The first access pays the latency to fetch from DRAM; the rest are cache hits.

Per-access overhead. Each access to DRAM costs not just the latency to read the bits but the overhead of opening a row, sending an address, and so on. Fetching 64 bytes in a single transaction is much cheaper, per byte, than fetching them one byte at a time. The DRAM itself is designed around this; we will see in Chapter 18 that DRAM bursts are 64 bytes for exactly this reason.

The chosen line size — 64 bytes on essentially every general-purpose CPU since the early 2000s — is a compromise. Larger lines exploit spatial locality more aggressively and amortize per-access overhead better, but they waste cache capacity if the program accesses only a small part of each line, and they make false sharing worse in multi-threaded code (Chapter 31). Smaller lines have the opposite tradeoff: less wasted capacity, more per-access overhead. Sixty-four bytes is the empirical sweet spot for general-purpose code on current technologies.

Within a line, addresses are decomposed into a line address and a byte offset. With 64-byte lines, the bottom 6 bits of any byte address identify the offset within the line, and the remaining bits identify the line:

Figure: Address split into a line address and a 6-bit byte offset, since 64-byte lines use the bottom 6 bits to name the byte within a line

LaTeX

An access to byte address 0x1A37 lives in line 0x1A00 (the upper bits with the bottom 6 zeroed) at offset 0x37. The cache stores the line; the offset is used to extract the requested word from it.

02.Cache Mapping: The Central Question

When a line is brought into the cache, which slot in the cache does it go into? The answer to this question defines the cache's organization, and it is the most important single design decision.

Three policies exist: direct-mapped, set-associative, and fully associative. They differ in how much freedom the cache has when placing a line.

Direct-Mapped Caches

In a direct-mapped cache, each memory line maps to exactly one cache slot. The slot is determined by some bits of the line's address — typically a middle range of bits called the index.

Suppose we have a 32 KB cache with 64-byte lines. That is $32768 / 64 = 512$ slots. We need 9 bits to identify a slot, so the index is 9 bits wide. The address layout becomes:

Figure: Direct-mapped cache address split into tag, 9-bit index, and 6-bit offset for a 32 KB cache with 64-byte lines and 512 slots

LaTeX

\begin{tikzpicture}[font=\small, line cap=round]
  \draw[thick] (0, -0.8) rectangle (4, 0); \node at (2, -0.4) {tag};
  \draw[thick] (4, -0.8) rectangle (6, 0); \node at (5, -0.4) {index (9)};
  \draw[thick] (6, -0.8) rectangle (8, 0); \node at (7, -0.4) {offset (6)};
\end{tikzpicture}

The bottom 6 bits are the offset within the line; the next 9 are the index, which selects the slot; the remaining bits are the tag, which uniquely identifies which line is currently in that slot.

When the CPU issues a load, the cache hardware:

Splits the address into tag, index, and offset.
Looks up the slot named by the index.
Compares the slot's tag with the request's tag.
If they match, it is a hit; the cache returns the bytes at the offset.
If they do not match, it is a miss; the cache evicts whatever is there and fetches the new line.

Figure: Direct-mapped cache lookup: tag and index drawn from the address, the index selects a slot, and a tag compare against the stored tag yields hit or miss

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round]
  % Address fields at top
  \draw[thick] (0, -0.7) rectangle (3, 0); \node at (1.5, -0.35) {tag 0xABC};
  \draw[thick] (3, -0.7) rectangle (5, 0); \node at (4, -0.35) {index 0x1A3};
  \draw[thick] (5, -0.7) rectangle (7, 0); \node at (6, -0.35) {offset 0x37};
  % Slot
  \draw[thick, fill=white] (1, -2.3) rectangle (7, -1.4);
  \node[align=center] at (4, -1.85) {slot 0x1A3\\valid $|$ tag stored $|$ data};
  \draw[->] (4, -0.7) -- (4, -1.4);
  % Compare
  \draw[thick, fill=white] (2, -3.4) rectangle (6, -2.7);
  \node at (4, -3.05) {compare with request tag};
  \draw[->] (4, -2.3) -- (4, -2.7);
  % Hit/miss
  \draw[thick, fill=white] (2.8, -4.4) rectangle (5.2, -3.8);
  \node at (4, -4.1) {hit / miss};
  \draw[->] (4, -3.4) -- (4, -3.8);
\end{tikzpicture}

Direct-mapped caches are simple and fast. The lookup is one comparison, one address decode, one read; the hardware is straightforward. The downside is conflict misses. Two different memory addresses with the same index bits map to the same slot. If the program alternates between them, every access misses, even though the cache as a whole is mostly empty. A loop that strides through memory at exactly the cache size, for example, can cause every access to evict the previous one.

Set-Associative Caches

A set-associative cache is a compromise. The cache is divided into sets, each containing a small number of ways (slots). A line still maps to one specific set (chosen by the index), but within the set it can occupy any way. The number of ways is the associativity of the cache.

A 4-way set-associative cache with 32 KB total and 64-byte lines has $32768 / (64 \times 4) = 128$ sets, so the index is 7 bits wide:

Figure: 4-way set-associative cache address split into tag, 7-bit index, and 6-bit offset, giving 128 sets of 4 ways each at 64-byte lines

LaTeX

\begin{tikzpicture}[font=\small, line cap=round]
  \draw[thick] (0, -0.8) rectangle (4, 0); \node at (2, -0.4) {tag};
  \draw[thick] (4, -0.8) rectangle (6, 0); \node at (5, -0.4) {index (7)};
  \draw[thick] (6, -0.8) rectangle (8, 0); \node at (7, -0.4) {offset (6)};
\end{tikzpicture}

On lookup, the cache:

Splits the address into tag, index, and offset.
Looks up all 4 ways of the set named by the index.
Compares each way's tag with the request's tag.
If any way matches, that way's data is returned (a hit).
If none match, it is a miss; the cache picks one way to evict (according to its replacement policy) and fills it.

Figure: 4-way set-associative lookup: all four ways of the indexed set are compared with the request tag in parallel, selecting the matching way on a hit

LaTeX

\begin{tikzpicture}[font=\footnotesize, >=Stealth, line cap=round]
  \node[font=\small, anchor=east] at (0, -0.5) {set 0x42:};
  % Four ways laid out at fixed coordinates
  \draw[thick, fill=white] (0.2, -1) rectangle (2.9, 0);
  \node at (1.55, -0.3) {valid $|$ tag $|$ data};
  \node at (1.55, -0.7) {way 0};
  \draw[thick, fill=white] (3.2, -1) rectangle (5.9, 0);
  \node at (4.55, -0.3) {valid $|$ tag $|$ data};
  \node at (4.55, -0.7) {way 1};
  \draw[thick, fill=white] (6.2, -1) rectangle (8.9, 0);
  \node at (7.55, -0.3) {valid $|$ tag $|$ data};
  \node at (7.55, -0.7) {way 2};
  \draw[thick, fill=white] (9.2, -1) rectangle (11.9, 0);
  \node at (10.55, -0.3) {valid $|$ tag $|$ data};
  \node at (10.55, -0.7) {way 3};
  % Compare bar
  \draw[thick, fill=white] (1, -2.5) rectangle (11, -1.8);
  \node at (6, -2.15) {compared with request tag in parallel};
  \draw[->] (1.55, -1)  -- (1.55, -1.8);
  \draw[->] (4.55, -1)  -- (4.55, -1.8);
  \draw[->] (7.55, -1)  -- (7.55, -1.8);
  \draw[->] (10.55, -1) -- (10.55, -1.8);
  % Hit/miss
  \draw[thick, fill=white] (3.5, -3.5) rectangle (8.5, -2.9);
  \node at (6, -3.2) {hit / miss + selected way};
  \draw[->] (6, -2.5) -- (6, -2.9);
\end{tikzpicture}

Set-associative caches dramatically reduce conflict misses. Two addresses that map to the same set can both be cached, as long as the set has spare ways. Common associativities are 4, 8, 16. The L1 caches of modern processors are typically 8-way set-associative; L2 and L3 caches are often 16-way or higher.

The cost of associativity is parallelism: each lookup must compare the tag against several ways simultaneously, which means more comparators and more energy per access. Associative tag-checking is one of the main reasons that L1 access latency is several cycles rather than one.

Fully Associative Caches

In a fully associative cache, a line can go into any slot. There is no index; the lookup must compare the tag against every slot.

Fully associative caches give the best hit rate for a given size — they have no conflict misses at all — but the comparison cost grows with size. A fully associative 1024-entry cache requires 1024 simultaneous tag comparisons, which is feasible only for small structures. Fully associative caches are common in TLBs (which we will see in Chapter 19), where the structure has only 32 to 128 entries and hit rate is critical, but they are not used for general data caches except at the very smallest sizes.

A Comparison

Property	Direct-mapped	Set-associative	Fully associative
Lookup speed	Fastest	Moderate	Slowest
Tag comparators	1	$n$	all entries
Conflict misses	Most	Fewer	None
Hit rate (same size)	Lowest	Higher	Highest
Energy per access	Lowest	Moderate	Highest
Used for	Small fast structures	L1, L2, L3	TLBs, victim caches

The right answer for any given cache depends on its size, latency target, and workload. Larger caches, where conflict-miss avoidance pays off more, tend to use higher associativity. Smaller caches, where speed matters most, use less. The L3 of a server-class chip might be 16-way set-associative; the µop cache of a desktop chip might be 8-way; some embedded L1s are direct-mapped.

03.Tags, Indices, and Offsets

The decomposition of an address into tag, index, and offset is worth dwelling on, because it determines the cache's behavior. Two arithmetic facts are crucial.

The offset bits are determined by the line size. A line size of $L$ bytes uses $\log_2 L$ offset bits. For 64-byte lines, that is 6 bits.

The index bits are determined by the number of sets. A cache with $S$ sets uses $\log_2 S$ index bits. The number of sets is $S = C / (L \cdot W)$ , where $C$ is the total cache capacity and $W$ is the associativity. For a 32 KB, 4-way, 64-byte-line cache: $S = 32768 / (64 \times 4) = 128$ , so 7 index bits.

The tag bits are everything left over. For a 48-bit virtual or 40-bit physical address, the tag is $40 - 7 - 6 = 27$ bits in the example above.

The total storage in the cache is data plus tags plus a few status bits per line:

$\text{storage} = S \cdot W \cdot (L \cdot 8 + \text{tag bits} + \text{status bits}).$

For the 32 KB example: $128 \times 4 \times (64 \times 8 + 27 + \text{a few}) \approx 32 \text{ KB} + 1.7 \text{ KB}$ of overhead. A few percent of the total, paid for the bookkeeping that makes the cache work.

04.A Worked Example

Suppose the CPU issues a load from address 0x0000_1A38_4256_F123 on a system with a 32 KB, 4-way set-associative L1 with 64-byte lines.

Plain Text

binary tail: ...0001 0010 0011
offset = bits [5:0]   = 100011  = 0x23  → byte 35 of the line
index  = bits [12:6]  = 0000000 (after right-shift by 6)
                      Let's compute from the actual address: 
                      address >> 6 = 0x68913E6B824
                      index = (address >> 6) & 0x7F  = 0x4 (just one possible value)
tag    = address >> 13

(The exact bit splits depend on the cache layout; the point is the procedure, not these particular numbers.)

The cache hardware:

Computes the index from bits 6–12.
Reads the four ways of set index simultaneously. Each way returns its tag, valid bit, and 64 bytes of data.
Compares the request's tag to each of the four tags.
If exactly one matches and is valid, selects that way's data. The byte at offset 35 within the 64-byte line is forwarded to the load result. This is a hit, completing in a few cycles.
If none matches, a miss occurs. The cache picks a victim way (next section), issues a fetch for the line to the next level of memory, and stalls the load (and any dependent instruction) until the line returns. When it does, the cache fills the slot, marks it valid with the new tag, and forwards the requested byte.

The tag comparison and data fetch can happen in parallel — modern caches read all four ways' data on the chance that one will hit, then select among them with a multiplexer based on the comparison results. This pipelining is one of several tricks that keep hit latency low.

05.Replacement Policies

When a set is full and a new line needs to be inserted, the cache must choose which existing line to evict. This is the replacement policy.

The optimal policy, which would replace the line that will not be needed for the longest time, is impossible: it requires knowing the future. Real policies are heuristics that try to approximate it.

Least Recently Used (LRU) evicts the line that has been unused for the longest. It is a good approximation of optimal in many workloads, because temporal locality says that recently used lines are more likely to be used again. Implementing exact LRU on a 4-way set requires 6 bits of order information (the relative recency of each pair); on a 16-way set the cost balloons, and most "LRU" caches use various approximations called pseudo-LRU.

First In, First Out (FIFO) evicts the line that was inserted longest ago, regardless of how recently used. Simpler than LRU but typically performs worse, because it ignores access patterns.

Random picks a line at random. Performs slightly worse than LRU on average but is trivially cheap and avoids pathological patterns where another policy would consistently make the wrong choice.

Not Recently Used (NRU) is a coarser version of LRU: each line has a "used recently" bit, set on access and periodically cleared. On replacement, the cache picks an unused line. Simple and reasonably effective.

Re-Reference Interval Prediction (RRIP) is a more sophisticated modern policy. Each line carries a small (2- or 3-bit) counter representing how soon it is expected to be used again; on access the counter is set to a low value; on no access it ages upward; on replacement, the highest-counter line is evicted. RRIP variants outperform LRU on workloads with poor temporal locality (such as streaming through large arrays).

The practical effect of these policies is usually small — a few percent of hit rate either way — but at L3 sizes a few percent matters. Most modern processors use some pseudo-LRU or RRIP variant.

06.Virtual and Physical Addressing of Caches

A subtlety we have papered over so far is which kind of address the cache uses for indexing and tag-matching. The CPU issues virtual addresses (Chapter 19); DRAM is reached by physical addresses; the cache sits in between, and its designers must choose at every level whether to look up by virtual address, by physical address, or by some combination of the two. The choice affects the cache's latency, its capacity, and its handling of address-space changes.

Four combinations are possible.

Physically Indexed, Physically Tagged (PIPT) caches do everything in physical address space. The TLB must translate the virtual address before the cache can be looked up, so the TLB sits in series with the cache and contributes to the hit latency. The advantage is correctness without surprises: a physical address has a single, unambiguous mapping to a cache line, and address-space changes (context switches) do not invalidate any cache contents. PIPT is the universal choice for L2 and L3 caches, where the TLB lookup is amortized over a slower cache access anyway.

Virtually Indexed, Virtually Tagged (VIVT) caches do everything in virtual space. No TLB lookup is needed for a hit, so they are fast. The disadvantages are severe: two virtual addresses that map to the same physical address (synonyms) can have separate cache entries that drift out of sync, and a context switch invalidates every entry. VIVT is rarely used in modern designs because the synonym and consistency problems outweigh the speed advantage.

Virtually Indexed, Physically Tagged (VIPT) caches use the virtual address to pick the set in parallel with the TLB lookup, then compare the resulting physical tag against the set's entries once the TLB has produced the physical address. The set lookup and the TLB lookup happen at the same time, hiding the TLB's latency under the cache's. This is the standard arrangement for L1 caches in modern processors.

VIPT requires a careful trick to avoid synonyms. The set-selecting bits of the virtual address must come from below the page boundary, so that they are identical in the virtual and physical addresses (the offset bits, which translation does not change). This constrains the cache geometry: the number of bytes per way (line size times number of sets per way — in fact, this is the cache size divided by the associativity) must be at most the page size. With 4 KB pages, a way can be at most 4 KB, so a 32 KB L1 must be at least 8-way associative. This is one reason every modern L1 is 8-way or higher.

Physically Indexed, Virtually Tagged (PIVT) caches are theoretically possible but rarely used; they combine the disadvantages of both schemes.

The practical reading of all this is that L1 caches are VIPT with a geometry that just fits inside the page size, while L2 and L3 caches are PIPT and pay the (smaller, relatively) cost of an upstream TLB lookup. The VIPT trick lets the L1 hit in 4 cycles even on a system whose TLB takes 1 cycle by itself; without it, the L1 would be that much slower.

07.Non-Blocking Caches and MSHRs

A naïve cache stalls completely when it misses: the access in flight cannot complete until the line returns from below, and any subsequent access has to wait. On a modern CPU, where DRAM latency is around 200 cycles and the program issues a memory operation every cycle or two, a blocking cache would force the pipeline to sit idle most of the time. Real caches do not work this way.

A non-blocking cache continues to service further accesses while a miss is outstanding. A subsequent hit can complete normally, returning data to the CPU; a subsequent miss can be issued to the next level alongside the first. The cache effectively has multiple miss requests in flight simultaneously.

The data structure that tracks outstanding misses is the miss-status holding register (MSHR). Each MSHR entry records one outstanding line fetch: which line is being fetched, which level it was sent to, and which CPU operations are waiting for it. When the line eventually returns, the MSHR's record is consulted to forward the data to every waiting operation; multiple loads to the same line that miss within a short window are served from a single fetch (a property called miss merging or MSHR coalescing).

A modern L1 has perhaps 8 to 16 MSHRs; an L2, several dozen; an L3, hundreds. When all the MSHRs are full, the cache cannot accept any more misses and the pipeline does eventually stall — a condition called MSHR exhaustion. Workloads with very high MLP can exhaust the MSHRs and become bandwidth-bound at a level below the channel's nominal capacity, simply because the structures that track in-flight requests are full.

A closely related feature is hit-under-miss: an access that hits in the cache can complete even while one or more misses are outstanding. Without it, the cache would block all subsequent accesses on a miss; with it, the hit returns its data in the normal cycle count, and only the misses queue up. Miss-under-miss (sometimes counted as a separate level of non-blocking) lets multiple misses be in flight as well; it is the standard configuration for every modern data cache.

Non-blocking behaviour is also why a single load instruction whose latency seems hopelessly long can in practice still execute at a high rate: as long as later instructions are independent, the pipeline keeps moving and the long latency is hidden behind useful work. The cost is the MSHRs themselves and the more elaborate cache controller that drives them, but the win in MLP is decisive.

08.Critical-Word First and Early Restart

A cache miss brings in 64 bytes, but the program usually wants only 8 of them right now — the critical word the load actually requested. Returning the entire line before the load can complete would waste cycles. Two related optimizations attack this.

Critical-word first tells the next-level memory system to deliver the requested word first, ahead of the rest of the line. The line is still filled in entirety, but the requested bytes arrive before the others.

Early restart lets the load complete as soon as the critical word arrives, rather than waiting for the full line fill. The remaining bytes continue arriving in the background and are written to the cache, but the dependent instructions can begin executing immediately.

Together, the two cut a typical fill latency from "line fetch + 8 cycles for the burst" down to "line fetch + the position of the critical word in the burst." Modern memory systems implement the same idea at the DRAM level, where the burst order can be programmed to start at any 8-byte offset within a 64-byte line.

09.Banked and Multi-Ported Caches

A cache that can service one access per cycle is fast enough for an in-order, single-issue CPU. A four-wide superscalar CPU may issue two loads and one store in the same cycle, and a single-ported cache cannot keep up.

The cleanest answer is to give the cache multiple read and write ports. A truly multi-ported SRAM has separate word lines and bit lines for every port, allowing genuinely simultaneous accesses. The cost is large, both in area and in power: each port multiplies the number of bit-line transistors per cell. A two-ported cache is roughly twice the area of a single-ported one of the same capacity, and a four-ported cache is much worse.

The more common compromise is to bank the cache into several independently-accessible sub-arrays, each single-ported. Two accesses in the same cycle that fall into different banks can proceed in parallel; two that fall into the same bank conflict and must be serialized. Modern L1 data caches typically have 8 or more banks, and the front-end logic detects bank conflicts and stalls the offending operation by a cycle. Bank conflicts are uncommon in normal code (because the addressing patterns spread across banks naturally) but can be pathological for code that, for example, strides through memory at exactly the bank stride.

A related arrangement is interleaving at the line level: consecutive cache lines are placed in different banks, so that a sequential access pattern hits all banks in turn. This is the same trick that DRAM uses to spread bandwidth across banks (Chapter 18), applied at the much smaller cache scale.

The net effect is a cache that appears to service multiple accesses per cycle most of the time, with the small probability of a bank conflict adding a small probability of stall. This appearance is what lets a wide superscalar feed itself.

10.Way Prediction and Other Latency Reductions

The set-associative tag check requires reading and comparing all the ways of a set in parallel, which costs energy and can stretch the access time. A few designs use way prediction to avoid this when possible: a small predictor guesses which way will hit on the basis of recent history, only that way's data array is read, and the tag check happens in parallel. On a correct prediction, the access completes with the energy cost of a direct-mapped cache; on a misprediction, the cache replays the access reading all ways.

A related technique is the micro-tag or way-predicting L1: a smaller tag array, indexed by an easy-to-compute function of the address, that returns a single way directly. Apple's M-series cores famously use a variant of this in their large L1 caches.

Victim caches, introduced by Norm Jouppi in the early 1990s, are small fully-associative caches that hold lines recently evicted from the main cache. A miss in the main cache that hits in the victim cache restores the line at little cost, recovering most of the hit rate that a more associative main cache would have provided. Victim caches were popular when associativity was expensive; they are less common now that 8- or 16-way main caches are routine.

These tricks, together with non-blocking behaviour, banking, and VIPT, account for the fact that real L1 caches deliver something close to one or two accesses per cycle at hit latencies of 4 or 5 cycles — numbers that would have been impossible with a naïve implementation.

11.Write Policies

When the CPU writes to memory, the cache has two questions: should the write be reflected to the next level immediately, and what should happen if the address is not currently in the cache?

The first question is write-through versus write-back.

In a write-through cache, every store is immediately written both to the cache and to the next level of memory. The cache and the next level always agree. This is conceptually simple and makes coherence easier in multi-core systems, but it generates a huge amount of traffic to the lower levels — every store, no matter how often the same line is updated, becomes a separate write to (say) L2 or DRAM.

In a write-back cache, the store updates only the cache. The line is marked dirty, indicating that its contents differ from the next level. The dirty line is written back later, only when it is evicted. This is much more efficient — many writes to the same line collapse into a single eventual write-back — at the cost of more bookkeeping (the dirty bit) and more complexity in coherence (cores have to ask each other whether they have a dirty copy).

Modern CPUs use write-back almost universally for L1 and L2 data caches. The traffic savings are substantial, and the complexity is manageable. Some L3s use a slight variant called write-back-with-allocate-on-fill.

The second question is write-allocate versus no-write-allocate, which applies on a write miss.

A write-allocate policy fetches the missing line into the cache before performing the write. The line is then resident, and any subsequent reads or writes hit. The cost is the fetch overhead on the first write.

A no-write-allocate policy writes directly to the next level without bringing the line into the cache. Subsequent reads to the line will miss (until something else brings the line in). This is appropriate for streaming writes, such as filling a large buffer that the program will not read again.

The two choices combine. The most common combination is write-back, write-allocate: dirty lines stay in the cache until evicted, and writes that miss bring the line in first. Streaming workloads can be helped by special "non-temporal" store instructions (MOVNTDQ on x86, STR with non-temporal hints on AArch64) that bypass the cache on writes, behaving as no-write-allocate, write-through.

12.Cache Prefetching

A miss costs many cycles. If the cache could predict that a miss is about to happen and start fetching the line before the program asks for it, the cycles spent waiting could be hidden. This is the idea of prefetching.

There are two kinds.

Software prefetching uses explicit instructions inserted by the compiler or the programmer.

for (int i = 0; i < N; i++) {
    __builtin_prefetch(&a[i + 16]);   // hint: bring a[i+16] into cache
    sum += a[i];
}

The hint tells the cache to start fetching a line that the program will need soon. The compiler picks the prefetch distance — far enough ahead to hide DRAM latency, not so far that the prefetched line is evicted before use. Software prefetching is precise but requires the programmer or compiler to know the access pattern.

Hardware prefetching is the more interesting case. The cache hardware watches the stream of cache misses, looks for patterns, and generates prefetch requests on its own. Several mechanisms are in common use.

A stride prefetcher notices that consecutive misses are at addresses with a constant stride (e.g., 64 bytes apart, exactly one line per access) and generates prefetches for the next several lines in the same direction. This catches the simple case of array iteration.

A stream prefetcher maintains a small table of recent miss addresses and watches for sequential streams (lines fetched in order). When a stream is detected, the prefetcher pulls in lines several ahead of the current position.

A next-line prefetcher is the simplest of all: on a miss to line $L$ , also fetch line $L + 1$ . Cheap and effective for spatial locality, useless for irregular access patterns.

A spatial prefetcher observes patterns at a coarser granularity, such as which lines within a 4 KB region are touched together, and learns to fetch the rest when one is touched.

Modern processors typically have several prefetchers running in parallel, each handling a different access pattern. They interact in complex ways: an aggressive prefetcher can pollute the cache (filling it with data the program never needed), while a conservative one leaves performance on the table. Tuning is a fine art, and CPU vendors invest heavily in it.

13.Hits, Misses, and the Three Cs

Cache misses can be classified into three categories, traditionally called the three Cs:

Compulsory misses (also called cold misses) happen the first time a line is referenced. They cannot be avoided by any cache organization, only reduced by prefetching or by larger lines.

Capacity misses happen because the cache is too small to hold the working set. Even a fully associative cache with optimal replacement would miss; the data simply does not fit. Reducing capacity misses requires a larger cache or a smaller working set.

Conflict misses happen in direct-mapped and set-associative caches because the working set fits but two pieces of it map to the same set, evicting each other. A fully associative cache of the same size would not have these misses. Reducing conflict misses requires more associativity or a different mapping function.

A fourth category sometimes added is coherence misses, which arise in multi-core systems when one core invalidates a line that another core had cached. We will return to coherence in Chapter 31.

The three Cs are useful for diagnosing performance. A program with many compulsory misses is bound by initial fetch bandwidth (and benefits from prefetching). One with many capacity misses needs a smaller working set or more cache. One with many conflict misses suffers from unfortunate addressing patterns and may benefit from data-layout changes (e.g., padding arrays to break up problematic stride relationships).

14.A Quick Tour of Real Caches

A modern processor cache hierarchy might look like this (numbers approximate and changing yearly):

Level	Size	Lines	Associativity	Latency	Notes
L1 instruction	32 KB	64 B	8-way	4 cycles	per-core, read-mostly
L1 data	32–48 KB	64 B	8- or 12-way	4–5 cycles	per-core, read/write
L2 unified	1–2 MB	64 B	8- or 16-way	12–15 cycles	per-core
L3 unified	16–96 MB	64 B	16-way	30–60 cycles	shared across cores

A few details that matter in practice.

The L1 instruction and data caches are split: instructions go to one, data to the other. This is a Modified Harvard arrangement that doubles the cache bandwidth for the common case of an instruction fetch and a data access in the same cycle. Lower levels are unified (instructions and data share), because instruction-cache pressure is generally lower at deeper levels and the extra capacity is more useful as one pool.

The L1 is small because it is latency-critical. Making it larger would increase access latency, which would directly increase the latency of every load. The L2 is larger and slower because it backs up the L1 and is expected to take a few extra cycles when the L1 misses. The L3 is much larger and much slower because it backs up the L2 and serves all cores.

Cache lines are 64 bytes everywhere. This is not arbitrary — it is the result of decades of co-design between CPUs, memory controllers, and DRAM. Changing it would ripple through the entire system.

15.Summary

A cache is a piece of fast memory between the CPU and main memory that exploits locality of reference to make most accesses fast. Caches are built around a few central design choices: the cache line size (typically 64 bytes), the mapping policy (direct-mapped, set-associative, or fully associative), the replacement policy (LRU and its approximations, RRIP), the write policy (write-back with allocate, almost universally), and the prefetching strategy (a mix of hardware and software techniques). The address splits into tag, index, and offset, and the cache uses these to find the right line in the right set. The choice between virtual and physical addressing of indices and tags shapes the L1 in particular: VIPT geometry with a way-size at most one page is the standard arrangement that hides the TLB lookup under the cache lookup. Non-blocking behaviour, MSHRs, hit-under-miss, critical-word-first delivery, banking, and way prediction are the implementation tricks that turn a logically simple structure into one that can sustain high memory-level parallelism at low hit latency.

Performance is measured by hit rate, which the equation $\text{AMAT} = h \cdot t_{\text{cache}} + (1 - h) \cdot t_{\text{mem}}$ converts into average access time. Misses fall into compulsory, capacity, and conflict categories — a useful taxonomy when diagnosing performance problems. Real processors stack three or four cache levels of increasing size and decreasing speed, balancing latency at the top of the hierarchy with capacity at the bottom.

Chapter 18 turns to what happens when the cache misses all the way through and the request reaches DRAM.

Book mode

	\begin{tikzpicture}[font=\small, line cap=round]
	\draw[thick] (0, -0.8) rectangle (4, 0); \node at (2, -0.4) {tag};
	\draw[thick] (4, -0.8) rectangle (6, 0); \node at (5, -0.4) {index (9)};
	\draw[thick] (6, -0.8) rectangle (8, 0); \node at (7, -0.4) {offset (6)};
	\end{tikzpicture}

	binary tail: ...0001 0010 0011

	offset = bits [5:0] = 100011 = 0x23 → byte 35 of the line
	index = bits [12:6] = 0000000 (after right-shift by 6)
	Let's compute from the actual address:
	address >> 6 = 0x68913E6B824
	index = (address >> 6) & 0x7F = 0x4 (just one possible value)
	tag = address >> 13

	for (int i = 0; i < N; i++) {
	__builtin_prefetch(&a[i + 16]); // hint: bring a[i+16] into cache
	sum += a[i];
	}