Part V: ISA Case Studies

RISC-V Extensions

May 16, 2026·23 min read·advanced

This chapter covers the RISC-V unprivileged instruction set and its major extensions. Where Chapter 42 surveyed RISC-V at a strategic level, this chapter is the programmer's-eye view: the base integer ISA in detail, the standard extensions (M, A, F/D, C, B, V), calling conventions, common idioms, and a comparison with the equivalent facilities in x86-64 and AArch64.

The treatment parallels Chapters 33 (x86-64 programming model) and 38 (AArch64 programming model). RISC-V's minimalism makes the chapter shorter on the base ISA but the modular extensions still demand significant detail, particularly the V (vector) extension which is one of the most distinctive parts of modern RISC-V.

01. Base Integer ISA Recap

The 47 instructions of RV32I (Chapter 42) cluster into computational, load/store, control, system, and memory-ordering categories. RV64I adds about a dozen more for 64-bit and 32-bit-on-64-bit operations. Let's walk through the base in concrete examples.

Computational Instructions

Register-register:

Assembly
add a0, a1, a2 # a0 = a1 + a2
sub a0, a1, a2 # a0 = a1 - a2
and a0, a1, a2 # a0 = a1 & a2
or a0, a1, a2 # a0 = a1 | a2
xor a0, a1, a2 # a0 = a1 ^ a2
sll a0, a1, a2 # a0 = a1 << (a2 & 0x3f) [shift left logical]
srl a0, a1, a2 # a0 = a1 >> (a2 & 0x3f) [shift right logical]
sra a0, a1, a2 # a0 = a1 >> (a2 & 0x3f) [shift right arithmetic, sign-extend]
slt a0, a1, a2 # a0 = (a1 < a2) ? 1 : 0 [signed]

Register-immediate:

Assembly
addi a0, a1, 100 # a0 = a1 + 100
andi a0, a1, 0xff # a0 = a1 & 0xff
ori a0, a1, 0x10 # a0 = a1 | 0x10
xori a0, a1, -1 # a0 = ~a1 (xor with -1 flips all bits)
slli a0, a1, 4 # a0 = a1 << 4
srli a0, a1, 4 # logical shift right
srai a0, a1, 4 # arithmetic shift right
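slti a0, a1, 100 # a0 = (a1 < 100) ? 1 : 0 [signed]
sltiu a0, a1, 100 # same with an unsigned compare (the immediate is still sign-extended first)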

The immediate field is 12 bits, sign-extended. So immediate range is -2048 to +2047. For larger constants, use LUI followed by ADDI:

Assembly
lui a0, 0x12345 # a0 = 0x12345000 (upper 20 bits)
addi a0, a0, 0x678 # a0 = 0x12345678

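For any 32-bit constant, lui + addi suffices. Because ADDI sign-extends its 12-bit immediate, a constant whose low 12 bits are 0x800 or above needs the LUI value rounded up by one, so that the (now negative) ADDI immediate lands on the right result; the %hi/%lo assembler operators and the li pseudo-instruction handle this automatically. For arbitrary 64-bit constants, the assembler emits a multi-instruction sequence (or the compiler falls back to a literal-pool load). The pseudo-instruction li a0, 0x123456789abcdef0 is expanded this way by the assembler.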
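The interaction between LUI and the sign-extended ADDI immediate is easy to get wrong by hand, so here is a minimal C sketch of the split. The names li_parts, split_li and join_li are hypothetical helpers invented for this illustration; they are not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two immediates of a lui + addi pair. */
typedef struct {
    uint32_t hi20;  /* operand for lui  (0 .. 0xfffff)  */
    int32_t  lo12;  /* operand for addi (-2048 .. 2047) */
} li_parts;

/* Split a 32-bit constant. ADDI sign-extends its immediate, so when
 * bit 11 of the constant is set the low part is negative and hi20 is
 * rounded up by one to compensate (that is what adding 0x800 does). */
static li_parts split_li(uint32_t value)
{
    li_parts p;
    p.hi20 = ((value + 0x800u) >> 12) & 0xfffffu;
    p.lo12 = (int32_t)(value & 0xfffu);
    if (p.lo12 >= 0x800)
        p.lo12 -= 0x1000;               /* reinterpret as signed 12-bit */
    assert(p.lo12 >= -2048 && p.lo12 <= 2047);
    return p;
}

/* Recombine the two parts the way lui + addi would, modulo 2^32. */
static uint32_t join_li(li_parts p)
{
    return (p.hi20 << 12) + (uint32_t)p.lo12;
}
```

For 0x12345678 the split is the obvious (0x12345, 0x678); for 0x12345fff it is (0x12346, -1), which is the case the naive "upper 20 bits, lower 12 bits" rule gets wrong.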

RV64-Specific Instructions

RV64 adds 32-bit (word-sized) variants of certain operations, all with a W suffix:

Assembly
addw a0, a1, a2 # 32-bit add, sign-extend result to 64 bits
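subw a0, a1, a2 # 32-bit subtract, result sign-extended
addiw a0, a1, 100 # 32-bit add-immediate, result sign-extended
sllw a0, a1, a2 # 32-bit shift; only low 5 bits of shift count used
# Likewise srlw and sraw (register shift count),
# and slliw, srliw, sraiw (immediate shift count).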

These are necessary because RV64's regular ADD is a 64-bit add; the 32-bit-result variants are distinct instructions. The W suffix is the convention.

To use a 32-bit value as the low half of a 64-bit register, the high bits must be either zero or sign-extended. The W instructions sign-extend; if you want zero extension, use a shift-mask:

Assembly
slli a0, a0, 32 # shift left by 32
srli a0, a0, 32 # logical shift right by 32 — clears upper bits

The zext.w pseudo-instruction expands to exactly this SLLI + SRLI pair; when the Zba extension is available, it assembles to a single add.uw instead.
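The difference between the sign-extending W instructions and the zero-extending shift pair can be modelled in a few lines of C. This is a sketch of the architectural behaviour only; the function names sext32, rv_addw and rv_zext_w are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 32-bit value to 64 bits without relying on
 * implementation-defined signed conversions. */
static uint64_t sext32(uint32_t w)
{
    return (w & 0x80000000u) ? (0xffffffff00000000ull | w) : (uint64_t)w;
}

/* Model of addw: add the low 32 bits of both sources (wrapping modulo
 * 2^32), then sign-extend bit 31 of the sum into bits 63..32. */
static uint64_t rv_addw(uint64_t rs1, uint64_t rs2)
{
    uint32_t sum32 = (uint32_t)rs1 + (uint32_t)rs2;
    return sext32(sum32);
}

/* Model of the slli 32 / srli 32 pair: clears bits 63..32. */
static uint64_t rv_zext_w(uint64_t rs)
{
    return (rs << 32) >> 32;
}
```

Note that rv_addw ignores whatever was in the upper halves of its inputs, which is why RV64 compilers can keep int values in registers without re-normalizing them before every 32-bit operation.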

Branches

RISC-V has no flag register; conditional branches compare two registers directly:

Assembly
beq a0, a1, label # branch if a0 == a1
bne a0, a1, label # branch if a0 != a1
blt a0, a1, label # branch if a0 < a1 (signed)
bge a0, a1, label # branch if a0 >= a1 (signed)
bltu a0, a1, label # branch if a0 < a1 (unsigned)
bgeu a0, a1, label # branch if a0 >= a1 (unsigned)

The branch range is ±4 KiB (12-bit signed offset, scaled by 2). For longer ranges, a pseudo-instruction expands to a conditional branch followed by an unconditional jump:

Assembly
beq a0, a1, far_label # pseudo, may expand to:
# bne a0, a1, .L_skip
# j far_label
# .L_skip:

Comparisons with zero are common, so there are pseudo-instructions:

Assembly
beqz a0, label # branch if a0 == 0 (assembles to beq a0, x0, label)
bnez a0, label # branch if a0 != 0
bltz a0, label # branch if a0 < 0
bgez a0, label # branch if a0 >= 0

These exploit x0 (the zero register) cleverly: comparing against x0 is just comparing against 0.

Jumps

Assembly
jal ra, label # jump and link: ra = pc+4; jump to label
# (also jal x0, label = unconditional jump, no link)
jalr ra, a0, 0 # jump and link register: ra = pc+4; jump to a0+0
ret # alias for jalr x0, ra, 0 — return
j label # alias for jal x0, label — unconditional jump
call label # pseudo: auipc ra + jalr ra; the linker may relax it to jal ra, label

The function call convention: jal ra, callee saves the return address in ra and jumps. The callee returns with ret (which is jalr x0, ra, 0).

Loads and Stores

Assembly
lb a0, 0(a1) # load byte signed
lbu a0, 0(a1) # load byte unsigned
lh a0, 0(a1) # load halfword signed (16-bit)
lhu a0, 0(a1) # load halfword unsigned
lw a0, 0(a1) # load word (32-bit), sign-extended in RV64
lwu a0, 0(a1) # load word unsigned (RV64 only)
ld a0, 0(a1) # load doubleword (RV64 only)
sb a0, 0(a1) # store byte
sh a0, 0(a1) # store halfword
sw a0, 0(a1) # store word
sd a0, 0(a1) # store doubleword (RV64 only)

The addressing mode is base register + 12-bit signed immediate. There is no indexed (base+register) addressing — that has to be done with an explicit ADD first:

Assembly
# Loading array[i] where a0=array, a1=i, scale=4:
slli t0, a1, 2 # t0 = i*4
add t0, a0, t0 # t0 = &array[i]
lw a2, 0(t0) # a2 = array[i]

This is more verbose than ARM's ldr w2, [x0, x1, lsl #2] (which fits the same operation in one instruction). RISC-V's choice trades encoding flexibility for simplicity.

The bit-manipulation extension Zba reintroduces some scaled-add operations (SH1ADD, SH2ADD, SH3ADD) that compress this pattern; we'll see them shortly.

PC-Relative Addressing

To form a PC-relative address, RISC-V uses AUIPC (Add Upper Immediate to PC):

Assembly
auipc t0, 0x12345 # t0 = pc + (0x12345 << 12)
addi t0, t0, 0x678 # t0 = pc + 0x12345678

Or, to load a value PC-relative:

Assembly
1: auipc t0, %pcrel_hi(global_var) # high 20 bits of (global_var - pc)
ld a0, %pcrel_lo(1b)(t0) # low 12 bits; 1b refers back to the label on the auipc

The %pcrel_hi and %pcrel_lo markers are linker relocations. The pattern is verbose but mechanical: AUIPC + offset to get the address (or the value via a load).

For compiler-generated PIC, this is the standard pattern. The compiler emits AUIPC + ADDI for taking the address of a global, AUIPC + LD for reading a global, etc.

Memory Ordering

RISC-V uses a weak memory model, RVWMO. The default is that loads and stores can be reordered freely (subject to data dependencies). Synchronization is via the FENCE instruction:

Assembly
fence rw, rw # full memory barrier: orders all rw before all rw
fence r, rw # earlier reads before later reads/writes
fence w, w # store-store fence
fence iorw, iorw # I/O fence (orders memory and I/O)

The operands name which classes (r/w/i/o) come before and after. fence rw, rw is the most common — equivalent to ARM's DMB ISH.

For acquire/release semantics, the A extension's atomics have built-in aq and rl annotations. We discuss those next.

02. The M Extension: Multiply and Divide

The M extension adds 8 instructions for integer multiply and divide:

Assembly
mul a0, a1, a2 # a0 = (a1 * a2) low 64 bits (or 32 in RV32)
mulh a0, a1, a2 # a0 = high 64 bits of signed*signed
mulhsu a0, a1, a2 # a0 = high 64 bits of signed*unsigned
mulhu a0, a1, a2 # a0 = high 64 bits of unsigned*unsigned
div a0, a1, a2 # a0 = a1 / a2 (signed)
divu a0, a1, a2 # a0 = a1 / a2 (unsigned)
rem a0, a1, a2 # a0 = a1 % a2 (signed)
remu a0, a1, a2 # a0 = a1 % a2 (unsigned)

In RV64, there are also W-variants:

Assembly
mulw, divw, divuw, remw, remuw # 32-bit operations

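Division by zero in RISC-V does not trap. Instead, the result is defined: the quotient has all bits set (-1 for DIV, the largest unsigned value for DIVU) and the remainder equals the dividend (x % 0 = x). Signed overflow, the most negative value divided by -1, is defined as well: the quotient is the dividend and the remainder is 0. This is unusual; many architectures either trap or leave the result unspecified. RISC-V's choice removes the need for an arithmetic trap, which simplifies small implementations such as microcontrollers. Software that wants to trap on division by zero must check explicitly.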
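Because C leaves division by zero and INT64_MIN / -1 undefined, code that emulates or reasons about the M extension has to special-case them. The following is a small reference model of the RV64 behaviour; the rv_* function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of RV64 div: never traps. */
static int64_t rv_div(int64_t a, int64_t b)
{
    if (b == 0)
        return -1;                      /* quotient = all bits set */
    if (a == INT64_MIN && b == -1)
        return INT64_MIN;               /* signed overflow: quotient = dividend */
    return a / b;                       /* C99 truncates toward zero, as div does */
}

/* Model of RV64 rem: the sign of the result follows the dividend. */
static int64_t rv_rem(int64_t a, int64_t b)
{
    if (b == 0)
        return a;                       /* remainder = dividend */
    if (a == INT64_MIN && b == -1)
        return 0;                       /* signed overflow: remainder = 0 */
    return a % b;
}

/* Unsigned variants: only division by zero needs special handling. */
static uint64_t rv_divu(uint64_t a, uint64_t b)
{
    return b == 0 ? UINT64_MAX : a / b;
}

static uint64_t rv_remu(uint64_t a, uint64_t b)
{
    return b == 0 ? a : a % b;
}
```

A language runtime that must raise an error on division by zero would test the divisor before the divide, typically with a single bnez.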

03. The A Extension: Atomics

The A extension provides atomic memory operations. Two styles:

Load-Reserved / Store-Conditional (LR/SC).

Assembly
loop:
lr.w t0, (a0) # load-reserved from [a0]
addi t0, t0, 1 # increment
sc.w t1, t0, (a0) # store-conditional: t1 = 0 on success, nonzero on failure
bnez t1, loop # retry if failed

This is the LR/SC pattern (Chapters 30, 31): LR marks the line for monitoring; SC succeeds only if the line has not been written by another agent since the LR.

Atomic Memory Operations (AMOs).

Assembly
amoadd.w t0, t1, (a0) # atomically: t0 = [a0]; [a0] += t1
amoand.w t0, t1, (a0)
amoor.w t0, t1, (a0)
amoxor.w t0, t1, (a0)
amomax.w t0, t1, (a0)
amomin.w t0, t1, (a0)
amomaxu.w, amominu.w
amoswap.w t0, t1, (a0) # atomic exchange

Each AMO is a single instruction performing a read-modify-write. .w is 32-bit; .d is 64-bit (RV64).

Both LR/SC and AMO instructions support acquire and release annotations via the aq and rl bits in the encoding:

Assembly
amoadd.w.aq # AMO with acquire semantics
amoadd.w.rl # AMO with release semantics
amoadd.w.aqrl # AMO with both (sequentially consistent)
lr.w.aq, sc.w.rl # typical mutex pattern

Acquire means: subsequent operations don't move before this. Release means: preceding operations don't move after this. The combined aqrl form is sequentially consistent.

This is more flexible than ARM's choice (LDAR/STLR are full acquire/release; weaker forms aren't directly available). RISC-V lets the programmer (or compiler) specify exactly the ordering needed.

For compare-and-swap (CAS), there is no single CAS instruction in the base A extension. CAS is built from LR/SC:

Assembly
# CAS: if [a0] == t0, store t1 to [a0].
# Returns the value observed at [a0] in t2; the swap happened iff t2 == t0.
cas:
lr.w t2, (a0) # load-reserved current value
bne t2, t0, done # mismatch: leave memory unchanged
sc.w t3, t1, (a0) # try to store the new value
bnez t3, cas # retry if SC failed
done:
ret

The newer Zacas extension adds explicit AMOCAS instructions, mirroring AArch64's CAS. Adoption of Zacas is just emerging.
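From C, these instructions are normally reached through the C11 atomics in <stdatomic.h> rather than written by hand. The sketch below shows the two operations discussed in this section. The wrapper names counter_inc and cas32 are hypothetical, and the comments about lowering describe what a typical RV64 toolchain with the A extension is expected to do; that is an assumption, not something guaranteed by the language.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Atomic increment, returning the previous value.
 * Expected lowering (assumption): a single amoadd.w. */
static int32_t counter_inc(_Atomic int32_t *p)
{
    return atomic_fetch_add_explicit(p, 1, memory_order_relaxed);
}

/* Compare-and-swap. Returns true if *p held *expected and was replaced
 * by desired; otherwise writes the observed value into *expected.
 * Expected lowering (assumption): an lr.w / sc.w retry loop, or
 * amocas.w where Zacas is enabled. */
static bool cas32(_Atomic int32_t *p, int32_t *expected, int32_t desired)
{
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired,
        memory_order_acq_rel,    /* ordering if the swap happens */
        memory_order_acquire);   /* ordering if it does not */
}
```

The memory_order arguments play the role of the aq and rl bits: relaxed requests neither, acquire and release request one each, and acq_rel requests both.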

04. The F and D Extensions: Floating-Point

The F extension adds single-precision FP; the D extension adds double-precision (and implies F).

Registers

32 floating-point registers, f0-f31, with ABI names ft0-ft11 (temporaries), fs0-fs11 (saved), fa0-fa7 (arguments). Each register is 32 bits wide if only F is implemented, 64 bits wide with D.

A control and status register, fcsr, holds rounding mode (3 bits) and exception flags (5 bits, IEEE 754-style: NX, UF, OF, DZ, NV).

Instructions

Most arithmetic instructions exist for both F and D, distinguished by .s (single) and .d (double) suffixes:

Assembly
fadd.s fa0, fa1, fa2 # single-precision add
fadd.d fa0, fa1, fa2 # double-precision add
fsub.s, fmul.s, fdiv.s, fsqrt.s
fmin.s, fmax.s # IEEE 754-style min/max with NaN handling
fmadd.s fa0, fa1, fa2, fa3 # FMA: fa0 = fa1*fa2 + fa3
fnmadd.s, fmsub.s, fnmsub.s
fcvt.s.w fa0, a0 # convert int32 → float
fcvt.w.s a0, fa0 # convert float → int32
fcvt.s.d, fcvt.d.s # between single and double
fcvt.l.s, fcvt.s.l # int64 conversions (RV64)
flt.s a0, fa1, fa2 # FP less than: a0 = (fa1 < fa2) ? 1 : 0
feq.s a0, fa1, fa2 # FP equal
fle.s a0, fa1, fa2 # FP less or equal
fclass.s a0, fa0 # classify FP value (returns bit-mask of classifications)
fmv.x.w a0, fa0 # move FP register's bits to integer register
fmv.w.x fa0, a0 # move integer register's bits to FP register
flw fa0, 0(a0) # load single
fld fa0, 0(a0) # load double
fsw fa0, 0(a0) # store single
fsd fa0, 0(a0) # store double

Each arithmetic instruction takes a 3-bit rounding-mode field, encoded in the instruction. Common values: RNE (round to nearest, ties to even), RTZ (toward zero), RDN (down, toward -inf), RUP (up, toward +inf), RMM (nearest, ties to max magnitude). The default is "use the rounding mode in fcsr".

Comparisons and Branches

There is no dedicated FP branch in the base F/D extensions. Instead, one of the FP compare instructions sets an integer register, and a regular integer branch is used:

Assembly
flt.s t0, fa1, fa2 # t0 = 1 if fa1 < fa2 else 0
bnez t0, fa1_less_label

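That is two instructions, the same count as x86's comiss + jb, except that the comparison result lands in an integer register instead of a flags register. The cost is small on out-of-order cores, since the branch is an ordinary integer branch.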

05. The C Extension: Compressed Instructions

The C extension adds 16-bit encodings for common 32-bit instructions. Examples:

Assembly
c.add a0, a1 # 16-bit form of: add a0, a0, a1
c.li a0, 5 # 16-bit form of: addi a0, x0, 5
c.lwsp a0, 8(sp) # 16-bit form of: lw a0, 8(sp) (c.lw itself takes a base in x8-x15)
c.j label # 16-bit jump
c.beqz a0, label # 16-bit branch-if-zero
c.mv a0, a1 # 16-bit move
c.nop # 16-bit nop

The compressed encoding has constraints: many formats have only 3-bit register fields, which reach just 8 of the 32 registers (x8-x15, the "compressed register set"); immediates are smaller; not all combinations are encodable.

Mixed code (32-bit and 16-bit instructions interleaved) is the norm. The decoder identifies the size from bits 0-1: 11 means 32-bit, anything else means 16-bit.
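The length rule is simple enough to express directly. The sketch below covers only the 16-bit and 32-bit cases described above and ignores the longer encodings the ISA reserves; rv_insn_length and rv_count_insns are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Length in bytes of the instruction whose first 16-bit parcel is given:
 * low two bits 11 means a 32-bit instruction, anything else is 16-bit. */
static unsigned rv_insn_length(uint16_t first_parcel)
{
    return ((first_parcel & 0x3u) == 0x3u) ? 4u : 2u;
}

/* Count the instructions in a buffer of 16-bit parcels, stepping by the
 * decoded length as a fetch unit or disassembler would. A 32-bit
 * instruction appears as two consecutive parcels, low half first. */
static unsigned rv_count_insns(const uint16_t *parcels, unsigned n_parcels)
{
    unsigned count = 0;
    unsigned i = 0;
    assert(parcels != 0 || n_parcels == 0);
    while (i < n_parcels) {
        i += rv_insn_length(parcels[i]) / 2u;
        count++;
    }
    return count;
}
```

Because the decision depends only on the first parcel, a decoder can find instruction boundaries without fully decoding each instruction, although it still has to walk the stream sequentially from a known boundary.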

Code-density gain: typical embedded code drops to ~60-70% of its uncompressed size. Important for microcontrollers with small flash.

06. The B Extension: Bit Manipulation

The B extension is a collection of sub-extensions:

  • Zba (Address generation): SH1ADD, SH2ADD, SH3ADD (shifted-add). sh1add rd, rs1, rs2 computes rd = rs2 + (rs1 << 1): the first source is the one that gets shifted. Useful for array indexing.
  • Zbb (Basic bit manipulation): ANDN, ORN, XNOR, CLZ (count leading zeros), CTZ (count trailing zeros), CPOP (population count, popcount), MIN, MAX (integer min/max), SEXT.B/SEXT.H (sign-extend), ZEXT.H, ROR/ROL (rotate), ORC.B, REV8 (byte reverse).
  • Zbs (Single-bit operations): BSET, BCLR, BINV, BEXT — set, clear, invert, extract a single bit by position.
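  • Zbc (Carry-less multiply): CLMUL, CLMULH, CLMULR. For CRC and GCM. Zbc belongs to the same bit-manipulation family but is not included in the B letter itself, which stands for Zba + Zbb + Zbs.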

Examples:

Assembly
# Compute array[i] address with scaling:
sh3add t0, a1, a0 # t0 = a0 + (a1 << 3) — a[i] for 8-byte elements
ld a2, 0(t0)
# Compute leading zero count:
clz a0, a1 # a0 = count of leading zeros in a1
# Population count:
cpop a0, a1 # a0 = popcount(a1)
# Min/max:
min a0, a1, a2 # signed min
maxu a0, a1, a2 # unsigned max

Adoption: ratified in 2021. Modern application processors (RVA22+ profile) require it. Compilers emit B instructions when targeting RVA22 or later.
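For readers who want precise semantics for the instructions used in the examples, here is a portable C reference model of three of them on RV64. It is written for clarity rather than speed, and the rv_* names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* sh3add rd, rs1, rs2: rd = rs2 + (rs1 << 3). */
static uint64_t rv_sh3add(uint64_t rs1, uint64_t rs2)
{
    return rs2 + (rs1 << 3);
}

/* clz: number of zero bits above the most significant set bit.
 * An all-zero input yields the register width (64 on RV64). */
static unsigned rv_clz(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x & (1ull << 63)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* cpop: number of set bits, clearing the lowest set bit each pass. */
static unsigned rv_cpop(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}
```

Without Zbb, a compiler has to emit a loop or a shift-and-mask sequence much like these for clz and cpop; with Zbb each is a single instruction.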

07. The V Extension: Vectors

The V extension is RISC-V's vector ISA, ratified in 2021 (V 1.0). Like ARM SVE, it is a vector-length-agnostic design: code compiled for V works on implementations with various vector lengths (VLEN), automatically benefiting from wider hardware.

Registers

The V extension adds 32 vector registers, v0-v31. Each register has VLEN bits, where VLEN is implementation-defined (typically 128, 256, 512, or higher). Optionally, registers can be grouped via the LMUL parameter (1, 2, 4, or 8 registers grouped) to provide longer effective vectors.

Several control registers:

  • vstart: first element to process (for resumption after fault).
  • vxsat, vxrm: fixed-point saturation, rounding.
  • vcsr: control and status.
  • vl: current vector length (number of active elements).
  • vtype: vector element width (SEW), grouping (LMUL), tail/mask policy.
  • vlenb: VLEN in bytes (read-only, queryable).

Vector Configuration

Before performing vector operations, you set up the vector type and length:

Assembly
li t0, 8 # element count to process
vsetvli t1, t0, e32, m1, ta, ma # configure: 32-bit elements, LMUL=1, ...

vsetvli does several things at once:

  • Asks for a desired element count (t0).
  • Selects element width (e32 = 32-bit elements).
  • Selects grouping (m1 = LMUL of 1, single register per vector).
  • Sets tail policy (ta = tail-agnostic, undefined values OK in tail).
  • Sets mask policy (ma = mask-agnostic).
  • Returns the actual VL granted (which may be less than requested if VLEN doesn't allow more).

vsetivli is a variant with the count as an immediate.
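How many elements vsetvli can grant follows from three numbers: VLMAX = (VLEN / SEW) × LMUL. The sketch below models that for integer LMUL and uses the simplest compliant grant rule, vl = min(requested, VLMAX). The specification gives implementations some freedom when the request lies between VLMAX and 2 × VLMAX, so treat the min rule as an assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum elements per register group for the given configuration. */
static size_t rvv_vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    assert(sew_bits == 8 || sew_bits == 16 || sew_bits == 32 || sew_bits == 64);
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    assert(vlen_bits >= sew_bits);
    return vlen_bits / sew_bits * lmul;
}

/* Model of the value vsetvli writes to rd (and to vl):
 * avl is the requested element count from rs1. */
static size_t rvv_vsetvli(size_t avl, size_t vlen_bits,
                          size_t sew_bits, size_t lmul)
{
    size_t vlmax = rvv_vlmax(vlen_bits, sew_bits, lmul);
    return avl < vlmax ? avl : vlmax;
}
```

With the request of 8 elements at e32, m1 from the example above, a VLEN=128 implementation grants 4 elements per iteration, while a VLEN=512 implementation grants all 8 at once.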

Vector Arithmetic

Assembly
vadd.vv v0, v1, v2 # vector + vector: v0 = v1 + v2
vadd.vx v0, v1, a0 # vector + scalar: v0 = v1 + a0 (broadcast)
vadd.vi v0, v1, 5 # vector + immediate
vsub.vv, vmul.vv, vdiv.vv
vand.vv, vor.vv, vxor.vv
vsll.vv, vsrl.vv, vsra.vv
vmin.vv, vmax.vv, vminu.vv, vmaxu.vv
vfadd.vv v0, v1, v2 # FP add
vfmul.vv v0, v1, v2
vfmacc.vv v0, v1, v2 # FMA: v0 += v1 * v2
vmacc.vv v0, v1, v2 # integer multiply-accumulate: v0 += v1 * v2

Each instruction acts on the first vl elements. Tail elements (index vl and above) are handled per the tail policy; elements masked off by v0 are handled per the mask policy.

Vector Memory Operations

Assembly
vle32.v v0, (a0) # load vl elements, 32-bit each, contiguous
vse32.v v0, (a0) # store vl elements, 32-bit each
vle8.v v0, (a0) # load bytes
vlse32.v v0, (a0), a1 # strided load (stride = a1)
vluxei32.v v0, (a0), v8 # gather (indices in v8)
vsuxei32.v v0, (a0), v8 # scatter

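Strided and indexed loads/stores cover most access patterns. The width in a unit-stride or strided mnemonic (the 32 in vle32.v) is the element width used for that memory access, independent of the SEW chosen by vsetvli. In the indexed forms (vluxei32.v, vsuxei32.v) the mnemonic width applies to the index vector, and the data elements use SEW.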

Masked Operations

The mask is in v0 (special-cased). Operations can be conditional on v0:

Assembly
vadd.vv v1, v2, v3, v0.t # v1 = v2 + v3 where v0 is true; mask policy applies elsewhere

The v0.t suffix means "use v0 as a mask, only update where v0 is true". This is the predicated-execution pattern, essential for vectorizing conditional code.
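In scalar terms, a masked vector add behaves like the loop below. Two simplifications are assumptions of this sketch: the mask is stored as one byte per element (hardware packs one bit per element into v0), and masked-off elements keep their old value, which is what the undisturbed policy guarantees and one of the outcomes the agnostic policy permits. The name masked_vadd is invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of: vadd.vv vd, vs2, vs1, v0.t
 * Elements whose mask entry is zero are left untouched. */
static void masked_vadd(uint32_t *vd, const uint32_t *vs2,
                        const uint32_t *vs1, const uint8_t *mask,
                        size_t vl)
{
    assert(vd != 0 && vs2 != 0 && vs1 != 0 && mask != 0);
    for (size_t i = 0; i < vl; i++) {
        if (mask[i])
            vd[i] = vs2[i] + vs1[i];    /* wraps modulo 2^32 */
    }
}
```

A vectorizing compiler produces the mask with a vector compare (for example, from the condition of an if inside the loop body) and then predicates the operations in each branch on it.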

Reductions

Assembly
vredsum.vs v0, v1, v2 # sum reduction: v0[0] = v2[0] + sum(v1[0..vl])
vredmax.vs, vredmin.vs
vfredusum.vs, vfredosum.vs, vfredmax.vs # FP versions (unordered sum, ordered sum, max)

A reduction takes a vector source and a scalar source, and writes a scalar result (in v0[0]).
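The operand roles are easier to see in scalar form. This sketch models the integer sum reduction for 32-bit elements and assumes vl is nonzero; the name rvv_redsum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of: vredsum.vs vd, vs2, vs1
 * Returns the value written to element 0 of vd:
 *   vs1[0] + vs2[0] + vs2[1] + ... + vs2[vl-1], wrapping modulo 2^32.
 * vs1_elem0 is the scalar seed taken from element 0 of vs1. */
static uint32_t rvv_redsum(uint32_t vs1_elem0, const uint32_t *vs2, size_t vl)
{
    uint32_t acc = vs1_elem0;
    assert(vl > 0);                     /* this sketch models only vl > 0 */
    for (size_t i = 0; i < vl; i++)
        acc += vs2[i];
    return acc;
}
```

The scalar seed is what lets a strip-mined loop carry a running total from one iteration to the next: each iteration feeds the previous result back in as vs1.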

Vector-Length-Agnostic Loop

The canonical V vector loop:

Assembly
.loop:
vsetvli t0, a2, e32, m1, ta, ma # t0 = min(a2, max VL)
vle32.v v0, (a0) # load t0 elements
vle32.v v1, (a1) # load t0 elements
vadd.vv v2, v0, v1 # add
vse32.v v2, (a3) # store
sub a2, a2, t0 # remaining count
slli t0, t0, 2 # bytes processed = elements * 4
add a0, a0, t0
add a1, a1, t0
add a3, a3, t0
bnez a2, .loop

The loop is length-agnostic: vsetvli grants whatever vector length the implementation supports, and the loop iterates until done. No fixup loop for the tail; the last iteration just gets a smaller VL.

This is structurally similar to ARM SVE, but the configuration is more explicit (vsetvli per loop) and the predication mechanism uses v0 specifically rather than dedicated predicate registers.
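The control flow of the loop can be mirrored in C to see why no separate tail loop is needed. In this sketch vlmax stands in for whatever the hardware would grant at e32, m1, and granted_vl models vsetvli with the simple min rule; both names, and vla_add, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for vsetvli: grant at most vlmax elements. */
static size_t granted_vl(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}

/* c[i] = a[i] + b[i] for i in [0, n), processed in chunks of at most
 * vlmax elements. Returns the number of loop iterations taken. */
static size_t vla_add(uint32_t *c, const uint32_t *a, const uint32_t *b,
                      size_t n, size_t vlmax)
{
    size_t iterations = 0;
    assert(vlmax > 0);
    while (n != 0) {
        size_t vl = granted_vl(n, vlmax);   /* vsetvli t0, a2, ...       */
        for (size_t i = 0; i < vl; i++)     /* vle32.v x2, vadd.vv, vse32.v */
            c[i] = a[i] + b[i];
        n -= vl;                            /* sub a2, a2, t0            */
        a += vl;                            /* pointer bumps (slli + add) */
        b += vl;
        c += vl;
        iterations++;
    }
    return iterations;
}
```

Running the same function with a larger vlmax changes only the iteration count, never the result, which is the property that lets one binary run on implementations with different VLEN.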

Element Width Mixing

A key V capability: in one loop, you can mix element widths. For example, expand 16-bit data into 32-bit accumulators:

Assembly
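vsetvli t0, a2, e16, m1, ta, ma # SEW=16, LMUL=1
vle16.v v0, (a0) # load 16-bit elements
vwmulu.vv v2, v0, v0 # widening multiply (16x16 -> 32), result in v2:v3
vsetvli zero, t0, e32, m2, ta, ma # same vl; now SEW=32, LMUL=2 for the results
vadd.vv v4, v4, v2 # accumulate the 32-bit products in v4:v5

vwmulu.vv is a widening multiply: it reads its sources at the current SEW (16 bits here) and writes results at 2×SEW (32 bits), so the destination occupies twice the register group of the sources (v2:v3 for LMUL=1 sources). The switch to e32, m2 therefore comes after the widening instruction, for the instructions that consume the 32-bit results. This is heavily used in DSP and fixed-point algorithms.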
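A scalar model of the widening multiply makes the width change explicit. The name rvv_wmulu16 is hypothetical. The casts are deliberate: without them C would promote the 16-bit operands to int, and 0xffff × 0xffff would overflow a 32-bit int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of vwmulu.vv with SEW=16: unsigned 16-bit sources,
 * full 32-bit products, one per active element. */
static void rvv_wmulu16(uint32_t *vd, const uint16_t *vs2,
                        const uint16_t *vs1, size_t vl)
{
    assert(vd != 0 && vs2 != 0 && vs1 != 0);
    for (size_t i = 0; i < vl; i++)
        vd[i] = (uint32_t)vs2[i] * (uint32_t)vs1[i];
}
```

Since every product fits in 32 bits, no precision is lost, which is why fixed-point kernels multiply at the narrow width and accumulate at the wide one.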

Comparison with SVE and AVX-512

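| Feature           | RISC-V V              | ARM SVE/SVE2      | AVX-512          |
|-------------------|-----------------------|-------------------|------------------|
| Vector width      | Variable              | Variable          | Fixed 512        |
| Mask register     | v0 (no separate file) | P0-P15 (16)       | k0-k7 (8)        |
| Length config     | vsetvli per loop      | WHILELT predicate | Fixed            |
| Element widths    | 8/16/32/64, mixable   | 8/16/32/64        | 8/16/32/64       |
| Strided access    | Yes                   | Yes               | No (gather only) |
| Gather/scatter    | Yes                   | Yes               | Yes              |
| First-fault loads | Yes                   | Yes               | No               |
| LMUL grouping     | Yes (unique)          | No equivalent     | No               |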

LMUL is unique to RISC-V V: it lets you "group" registers to get longer effective vectors at the cost of register count. For tight loops with few live vectors, LMUL=4 or LMUL=8 effectively gives 4× or 8× the vector length. Compilers and hand-written kernels use LMUL aggressively.

Adoption

V adoption is gradual. RVA23 profile (2024) requires V 1.0 and several Zv* sub-extensions. Implementations:

  • T-Head C910: optional V (some chips have it, some don't); often early non-ratified V 0.7.1.
  • SpacemiT K1 (Banana Pi BPI-F3): V 1.0 + Zvfh.
  • SiFive P870: V 1.0.
  • Tenstorrent's RISC-V cores: V 1.0 (used internally for AI work).
  • Ventana V2: V 1.0.

Most current RISC-V SoCs do not yet have V; for them, scalar code is the only option for portable software. SIMD-style extensions (like P, the packed-SIMD extension that was being discussed) are not part of any deployed RISC-V profile.

08. Calling Convention

The standard RISC-V calling convention (RV64, "lp64d" ABI for FP-equipped systems):

Argument passing.

  • Integer / pointer args: a0-a7 (8 registers).
  • FP args: fa0-fa7 (8 registers).
  • Additional args on stack.
  • Return value in a0 (and a1 for 128-bit). FP return in fa0.

Caller-saved (volatile, "temporary"). ra, t0-t6, a0-a7, ft0-ft11, fa0-fa7.

Callee-saved. sp, s0-s11, fs0-fs11.

Frame setup. sp must be 16-byte aligned at any call. s0/fp is the conventional frame pointer.

A typical prologue and epilogue:

Assembly
function:
addi sp, sp, -32 # allocate frame
sd ra, 24(sp) # save return address
sd s0, 16(sp) # save fp
addi s0, sp, 32 # set new fp
# ... body ...
ld s0, 16(sp) # restore fp
ld ra, 24(sp) # restore return address
addi sp, sp, 32 # release frame
ret

For leaf functions (no calls), the ra save can be skipped, and the prologue may be minimal.

The convention is straightforward — comparable to AArch64's AAPCS64 in spirit, with similar argument register count and similar callee/caller saved partitioning.

09. Common Idioms

Zeroing a register.

Assembly
mv a0, x0 # a0 = 0 (alias for addi a0, x0, 0)
li a0, 0 # same

Negation.

Assembly
neg a0, a1 # a0 = -a1 (alias for sub a0, x0, a1)

Boolean materialization.

Assembly
slt a0, a1, a2 # a0 = (a1 < a2) ? 1 : 0
sltu a0, a1, a2 # unsigned version

Branchless absolute value (signed):

Assembly
srai t0, a0, 63 # t0 = sign mask (-1 if negative, 0 otherwise)
xor a0, a0, t0 # flip bits if negative
sub a0, a0, t0 # add 1 if negative

(Same as the AArch64 / x86-64 trick.)
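The same three steps can be written in C. This sketch returns the magnitude as an unsigned value so that the most negative input has a representable result, and it builds the sign mask from an unsigned shift to stay clear of implementation-defined behaviour; the function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless absolute value, mirroring the srai / xor / sub sequence. */
static uint64_t branchless_abs(int64_t x)
{
    uint64_t ux = (uint64_t)x;
    /* All ones if x is negative, zero otherwise: the value that an
     * arithmetic right shift by 63 would produce. */
    uint64_t mask = (uint64_t)0 - (ux >> 63);
    /* xor flips the bits of a negative value; subtracting the mask
     * (which is -1 in that case) then adds 1, completing the negation. */
    return (ux ^ mask) - mask;
}
```

For non-negative inputs the mask is zero and both operations are no-ops, so the value passes through unchanged.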

Branchless min/max with B extension:

Assembly
min a0, a1, a2 # signed min (with Zbb)
max a0, a1, a2 # signed max

Without Zbb, this is a slt + branch + select sequence.

Loop counter.

Assembly
.loop:
lw a2, 0(a0)
add a3, a3, a2
addi a0, a0, 4
addi a1, a1, -1
bnez a1, .loop

The standard decrement-counter form. Can also be expressed as compare-against-bound:

Assembly
.loop2:
lw a2, 0(a0)
add a3, a3, a2
addi a0, a0, 4
bltu a0, a1, .loop2 # a1 = end pointer; addresses compare as unsigned

10. Compiler Output Walk-Through

Same example as Chapters 33 and 38: array sum in C.

C
#include <stddef.h>

int sum_array(const int* a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

With clang -O2 --target=riscv64-linux-gnu -march=rv64gc:

Assembly
# Representative output; exact registers and scheduling vary by compiler version.
sum_array:
beqz a1, .Lzero # n == 0?
mv a3, a0 # a3 = current element pointer
li a0, 0 # s = 0
slli a4, a1, 2 # n*4 = length in bytes
add a4, a3, a4 # a4 = end pointer
.Lloop:
lw a5, 0(a3) # load a[i]
addw a0, a0, a5 # s += a[i] (32-bit add keeps the int sign-extended)
addi a3, a3, 4 # advance pointer
bne a3, a4, .Lloop
ret
.Lzero:
li a0, 0
ret

Density: 4 instructions in the inner loop (load, add, increment, branch), comparable to AArch64. With C extension, several of these compress to 16-bit instructions, so total bytes is similar to or less than AArch64.

11. Privileged vs. Unprivileged

Application programs run in U-mode (user mode). The base ISA is fully available. Privileged instructions are:

  • CSR (Control and Status Register) access: not available in U-mode unless the specific CSR is marked unprivileged. Some unprivileged CSRs include cycle, time, instret (counter access), fcsr (FP control).
  • WFI (Wait For Interrupt): privileged.
  • MRET, SRET, URET: privileged returns from traps.
  • SFENCE.VMA: TLB flush; privileged.

Privileged operations are covered in Chapter 44.

12. Practical Tools

  • riscv64-linux-gnu-objdump -d binary — disassemble.
  • riscv64-linux-gnu-gcc -S -O2 (or clang -S -O2 --target=riscv64-linux-gnu): compile to assembly.
  • Compiler Explorer (godbolt.org) — supports RISC-V across compilers.
  • qemu-riscv64 — emulation; useful for testing RISC-V code on x86 hosts.
  • RISC-V ISA Manual (Volumes 1 and 2) — the canonical reference, freely available.
  • rvv-intrinsics-doc — the V extension's intrinsics reference.

13. Summary

RISC-V's unprivileged ISA is built around a 47-instruction integer base plus modular standard extensions: M (multiply/divide), A (atomics with explicit acquire/release), F/D (single/double FP), C (compressed instructions for code density), B (bit manipulation), V (vector). The base is genuinely minimal — no flag register, no compare-and-branch fusion in hardware, no flexible addressing modes — pushing complexity into compiler-generated sequences. The compressed extension recovers density; B fills the bit-manipulation gap; V provides modern vector capabilities competitive with SVE.

Compared with x86-64 and AArch64, RISC-V is the cleanest ISA at the architectural level. Whether the cleanliness translates to the best implementations is up to the silicon designers. In day-to-day use, RISC-V code looks like a more spartan version of AArch64 — same RISC structure, fewer baked-in conveniences, more explicit address arithmetic.

The next chapter steps up to the system level: machine mode, supervisor mode, virtual memory (Sv39, Sv48, Sv57), the SBI interface, hypervisor extension, and the RISC-V boot process.
