Part V·ISA Case Studies·Chapter 43 of 62

RISC-V Extensions

May 16, 2026·23 min read·advanced

This chapter covers the RISC-V unprivileged instruction set and its major extensions. Where Chapter 42 surveyed RISC-V at a strategic level, this chapter is the programmer's-eye view: the base integer ISA in detail, the standard extensions (M, A, F/D, C, B, V), calling conventions, common idioms, and a comparison with the equivalent facilities in x86-64 and AArch64.

The treatment parallels Chapters 33 (x86-64 programming model) and 38 (AArch64 programming model). RISC-V's minimalism makes the chapter shorter on the base ISA but the modular extensions still demand significant detail, particularly the V (vector) extension which is one of the most distinctive parts of modern RISC-V.

01.Base Integer ISA Recap

The 47 instructions of RV32I (Chapter 42) cluster into computational, load/store, control, system, and memory-ordering categories. RV64I adds about a dozen more for 64-bit and 32-bit-on-64-bit operations. Let's walk through the base in concrete examples.

Computational Instructions

Register-register:

Assembly

add  a0, a1, a2     # a0 = a1 + a2
sub  a0, a1, a2     # a0 = a1 - a2
and  a0, a1, a2     # a0 = a1 & a2
or   a0, a1, a2     # a0 = a1 | a2
xor  a0, a1, a2     # a0 = a1 ^ a2
sll  a0, a1, a2     # a0 = a1 << (a2 & 0x3f)  [shift left logical; mask is 0x1f on RV32]
srl  a0, a1, a2     # a0 = a1 >> (a2 & 0x3f)  [shift right logical, zero-fill]
sra  a0, a1, a2     # a0 = a1 >> (a2 & 0x3f)  [shift right arithmetic, sign-fill]
slt  a0, a1, a2     # a0 = (a1 < a2) ? 1 : 0   [signed]
sltu a0, a1, a2     # a0 = (a1 < a2) ? 1 : 0   [unsigned]

Register-immediate:

Assembly

addi  a0, a1, 100    # a0 = a1 + 100
andi  a0, a1, 0xff   # a0 = a1 & 0xff
ori   a0, a1, 0x10
xori  a0, a1, -1     # a0 = ~a1 (xor with -1 flips all bits)
slli  a0, a1, 4      # a0 = a1 << 4
srli  a0, a1, 4      # logical shift right
srai  a0, a1, 4      # arithmetic shift right
slti  a0, a1, 100
sltiu a0, a1, 100

The immediate field is 12 bits, sign-extended. So immediate range is -2048 to +2047. For larger constants, use LUI followed by ADDI:

Assembly

lui   a0, 0x12345     # a0 = 0x12345000 (upper 20 bits)
addi  a0, a0, 0x678   # a0 = 0x12345678

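lui + addi builds any 32-bit constant, with one catch: ADDI sign-extends its 12-bit immediate, so when the low 12 bits are 0x800 or above, the LUI immediate must be one higher to compensate (the assembler's %hi and %lo operators handle this). On RV64, LUI also sign-extends its 32-bit result to 64 bits. For arbitrary 64-bit constants, the assembler emits a multi-instruction sequence (or a literal-pool load); the pseudo-instruction li a0, 0x123456789abcdef0 is expanded this way.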
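
The hi/lo split behind a lui + addi pair can be sketched in a few lines of Python. This is an illustrative model, not assembler source: split_hi_lo and lui_addi are hypothetical helper names, and only the 32-bit result is modelled (RV64 would additionally sign-extend it). The detail it has to respect is that ADDI sign-extends its 12-bit immediate, so a low part of 0x800 or more is really a negative number and the upper part must be rounded up.

```python
def split_hi_lo(value):
    """Split a 32-bit constant into (hi20, lo12) for a lui/addi pair."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                      # addi will see this as a negative immediate
    hi = ((value - lo) >> 12) & 0xFFFFF   # rounds hi up when lo is negative
    return hi, lo


def lui_addi(hi, lo):
    """Model the low 32 bits produced by: lui rd, hi ; addi rd, rd, lo."""
    return ((hi << 12) + lo) & 0xFFFFFFFF
```

For 0x12345678 the split is the obvious (0x12345, 0x678); for 0x12345FFF it becomes (0x12346, -1), which is why hand-written constants with a large low part need care.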

RV64-Specific Instructions

RV64 adds 32-bit (word-sized) variants of certain operations, all with a W suffix:

Assembly

addw   a0, a1, a2     # 32-bit add, sign-extend result to 64 bits
subw   a0, a1, a2
addiw  a0, a1, 100
sllw   a0, a1, a2     # 32-bit shift; only low 5 bits of shift count used
srlw, sraw
slliw, srliw, sraiw

These are necessary because RV64's regular ADD is a 64-bit add; the 32-bit-result variants are distinct instructions. The W suffix is the convention.

To use a 32-bit value as the low half of a 64-bit register, the high bits must be either zero or sign-extended. The W instructions sign-extend; if you want zero extension, use a shift-mask:

Assembly

slli  a0, a0, 32     # shift left by 32
srli  a0, a0, 32     # logical shift right by 32 — clears upper bits

The zext.w pseudo-instruction expands to this SLLI + SRLI pair; on cores with Zba it assembles to a single instruction, add.uw rd, rs, x0.
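
The difference between the sign-extending W instructions and the zero-extending shift pair is easy to see in a small Python model of 64-bit registers. The function names (sext32, addw, zext_w) are hypothetical helpers chosen to mirror the mnemonics; this is a sketch of the arithmetic only.

```python
MASK64 = (1 << 64) - 1


def sext32(x):
    """Sign-extend the low 32 bits of x into a 64-bit register value."""
    x &= 0xFFFFFFFF
    return x | 0xFFFFFFFF00000000 if x & 0x80000000 else x


def addw(a, b):
    """Model addw: 32-bit add, result sign-extended to 64 bits."""
    return sext32(a + b)


def zext_w(x):
    """Model slli rd, rs, 32 followed by srli rd, rd, 32."""
    return ((x << 32) & MASK64) >> 32
```

Adding 1 to 0x7FFFFFFF with addw yields 0xFFFFFFFF80000000 in the register; zext_w of that value gives 0x0000000080000000.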

Branches

RISC-V has no flag register; conditional branches compare two registers directly:

Assembly

beq  a0, a1, label    # branch if a0 == a1
bne  a0, a1, label    # branch if a0 != a1
blt  a0, a1, label    # branch if a0 < a1 (signed)
bge  a0, a1, label    # branch if a0 >= a1 (signed)
bltu a0, a1, label    # branch if a0 < a1 (unsigned)
bgeu a0, a1, label    # branch if a0 >= a1 (unsigned)

The branch range is ±4 KiB: the encoding holds a 12-bit signed offset in units of 2 bytes. For targets outside that range, the assembler can rewrite the branch as an inverted conditional branch over an unconditional jump:

Assembly

beq a0, a1, far_label    # pseudo, may expand to:
                          # bne a0, a1, .L_skip
                          # j   far_label
                          # .L_skip:

Comparisons with zero are common, so there are pseudo-instructions:

Assembly

beqz a0, label       # branch if a0 == 0 (assembles to beq a0, x0, label)
bnez a0, label       # branch if a0 != 0
bltz a0, label       # branch if a0 < 0
bgez a0, label       # branch if a0 >= 0

These exploit x0 (the zero register) cleverly: comparing against x0 is just comparing against 0.

Jumps

Assembly

jal  ra, label       # jump and link: ra = pc+4; jump to label
                    # (also jal x0, label = unconditional jump, no link)
jalr ra, a0, 0       # jump and link register: ra = pc+4; jump to a0+0
ret                  # alias for jalr x0, ra, 0 — return
j label              # alias for jal x0, label — unconditional jump
call label           # pseudo: auipc ra + jalr ra; the linker may relax it to jal ra, label

The function call convention: jal ra, callee saves the return address in ra and jumps. The callee returns with ret (which is jalr x0, ra, 0).

Loads and Stores

Assembly

lb   a0, 0(a1)        # load byte signed
lbu  a0, 0(a1)        # load byte unsigned
lh   a0, 0(a1)        # load halfword signed (16-bit)
lhu  a0, 0(a1)        # load halfword unsigned
lw   a0, 0(a1)        # load word (32-bit), sign-extended in RV64
lwu  a0, 0(a1)        # load word unsigned (RV64 only)
ld   a0, 0(a1)        # load doubleword (RV64 only)
sb   a0, 0(a1)        # store byte
sh   a0, 0(a1)        # store halfword
sw   a0, 0(a1)        # store word
sd   a0, 0(a1)        # store doubleword (RV64 only)

The addressing mode is base register + 12-bit signed immediate. There is no indexed (base+register) addressing — that has to be done with an explicit ADD first:

Assembly

# Loading array[i] where a0=array, a1=i, scale=4:
slli  t0, a1, 2       # t0 = i*4
add   t0, a0, t0      # t0 = &array[i]
lw    a2, 0(t0)       # a2 = array[i]

This is more verbose than ARM's ldr w2, [x0, x1, lsl #2] (which fits the same operation in one instruction). RISC-V's choice trades encoding flexibility for simplicity.

The bit-manipulation extension Zba reintroduces some scaled-add operations (SH1ADD, SH2ADD, SH3ADD) that compress this pattern; we'll see them shortly.

PC-Relative Addressing

To form a PC-relative address, RISC-V uses AUIPC (Add Upper Immediate to PC):

Assembly

auipc t0, 0x12345    # t0 = pc + (0x12345 << 12)
addi  t0, t0, 0x678  # t0 = pc + 0x12345678

Or, to load a value PC-relative:

Assembly

1:  auipc t0, %pcrel_hi(global_var)    # high 20 bits of (global_var - pc)
    ld    a0, %pcrel_lo(1b)(t0)        # low 12 bits; the operand names the AUIPC's own label

The %pcrel_hi and %pcrel_lo markers are linker relocations. The pattern is verbose but mechanical: AUIPC + offset to get the address (or the value via a load).

For compiler-generated PIC, this is the standard pattern. The compiler emits AUIPC + ADDI for taking the address of a global, AUIPC + LD for reading a global, etc.

Memory Ordering

RISC-V uses a weak memory model, RVWMO. The default is that loads and stores can be reordered freely (subject to data dependencies). Synchronization is via the FENCE instruction:

Assembly

fence rw, rw      # full memory barrier: orders all rw before all rw
fence r, rw       # earlier reads before later reads/writes
fence w, w        # store-store fence
fence iorw, iorw  # I/O fence (orders memory and I/O)

The operands name which access classes (r = reads, w = writes, i = device input, o = device output) come before and after the fence. fence rw, rw is the most common and is roughly comparable to ARM's DMB ISH.

For acquire/release semantics, the A extension's atomics have built-in aq and rl annotations. We discuss those next.

02.The M Extension: Multiply and Divide

The M extension adds 8 instructions for integer multiply and divide:

Assembly

mul    a0, a1, a2      # a0 = (a1 * a2) low 64 bits  (or 32 in RV32)
mulh   a0, a1, a2      # a0 = high 64 bits of signed*signed
mulhsu a0, a1, a2      # a0 = high 64 bits of signed*unsigned
mulhu  a0, a1, a2      # a0 = high 64 bits of unsigned*unsigned
div    a0, a1, a2      # a0 = a1 / a2  (signed)
divu   a0, a1, a2      # a0 = a1 / a2  (unsigned)
rem    a0, a1, a2      # a0 = a1 % a2  (signed)
remu   a0, a1, a2      # a0 = a1 % a2  (unsigned)

In RV64, there are also W-variants:

Assembly

mulw, divw, divuw, remw, remuw # 32-bit operations

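Division by zero in RISC-V does not trap. Instead, the result is defined: x/0 = -1 (all bits set) and x%0 = x. Signed overflow (the most negative value divided by -1) is also defined: the quotient is the dividend and the remainder is 0. Other architectures differ: x86 traps, while AArch64 returns 0 for division by zero. Not trapping keeps exception logic out of the integer divider, which simplifies small microcontroller cores. Software that wants to trap on division by zero must check explicitly.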
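
The non-trapping division rules can be written down as a short Python model for RV64. This is a sketch with hypothetical function names that mirror the mnemonics; registers are represented as unsigned 64-bit integers. The signed-overflow case (most negative value divided by -1) is included as defined by the ISA manual: the quotient is the dividend and the remainder is 0.

```python
XLEN = 64
MASK = (1 << XLEN) - 1
INT_MIN = -(1 << (XLEN - 1))


def to_signed(x):
    """Interpret a register value as a signed XLEN-bit integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def div(a, b):
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return MASK                       # -1: all bits set
    if a == INT_MIN and b == -1:
        return a & MASK                   # overflow: quotient is the dividend
    q = abs(a) // abs(b)                  # RISC-V rounds toward zero
    return (-q if (a < 0) != (b < 0) else q) & MASK


def rem(a, b):
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return a & MASK                   # remainder is the dividend
    if a == INT_MIN and b == -1:
        return 0
    r = abs(a) % abs(b)                   # sign of result follows the dividend
    return (-r if a < 0 else r) & MASK


def divu(a, b):
    a, b = a & MASK, b & MASK
    return MASK if b == 0 else a // b


def remu(a, b):
    a, b = a & MASK, b & MASK
    return a if b == 0 else a % b
```

A guard such as "beqz a2, handle_div0" before the div is all that software needs when it does want a trap-like path.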

03.The A Extension: Atomics

The A extension provides atomic memory operations. Two styles:

Load-Reserved / Store-Conditional (LR/SC).

Assembly

loop:
    lr.w  t0, (a0)         # load-reserved from [a0]
    addi  t0, t0, 1        # increment
    sc.w  t1, t0, (a0)     # store-conditional: t1 = 0 on success, nonzero on failure
    bnez  t1, loop         # retry if failed

This is the LR/SC pattern (Chapters 30, 31): LR marks the line for monitoring; SC succeeds only if the line has not been written by another agent since the LR.
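
The retry behaviour can be illustrated with a toy Python model. It is a deliberate simplification: ReservationMemory and atomic_increment are hypothetical names, reservations are tracked per address rather than per cache line, and there is no real concurrency. The optional interfere callback stands in for another agent writing between the LR and the SC.

```python
class ReservationMemory:
    """Toy model of LR/SC reservations (not a model of real cache coherence)."""

    def __init__(self):
        self.mem = {}
        self.reserved = None              # address currently reserved, or None

    def lr(self, addr):
        self.reserved = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        """A plain store from another agent; it kills a matching reservation."""
        self.mem[addr] = value
        if self.reserved == addr:
            self.reserved = None

    def sc(self, addr, value):
        """Return 0 on success and 1 on failure, like the rd result of sc.w."""
        ok = self.reserved == addr
        self.reserved = None
        if ok:
            self.mem[addr] = value
        return 0 if ok else 1


def atomic_increment(m, addr, interfere=None):
    """The lr/addi/sc/bnez loop; returns the value seen by the successful LR."""
    while True:
        old = m.lr(addr)
        if interfere is not None:
            interfere()                   # simulate a competing write, once
            interfere = None
        if m.sc(addr, old + 1) == 0:
            return old
```

When the interfering store lands between LR and SC, the first SC fails and the loop re-reads the new value before incrementing it.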

Atomic Memory Operations (AMOs).

Assembly

amoadd.w   t0, t1, (a0)     # atomically: t0 = [a0]; [a0] += t1
amoand.w   t0, t1, (a0)
amoor.w    t0, t1, (a0)
amoxor.w   t0, t1, (a0)
amomax.w   t0, t1, (a0)
amomin.w   t0, t1, (a0)
amomaxu.w, amominu.w
amoswap.w  t0, t1, (a0)     # atomic exchange

Each AMO is a single instruction performing a read-modify-write. .w is 32-bit; .d is 64-bit (RV64).

Both LR/SC and AMO instructions support acquire and release annotations via the aq and rl bits in the encoding:

Assembly

amoadd.w.aq    # AMO with acquire semantics
amoadd.w.rl    # AMO with release semantics
amoadd.w.aqrl  # AMO with both (sequentially consistent)
lr.w.aq, sc.w.rl   # typical mutex pattern

Acquire means: subsequent operations don't move before this. Release means: preceding operations don't move after this. The combined aqrl form is sequentially consistent.

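This gives fine-grained control on every atomic. AArch64 takes a different route, with dedicated acquire/release instructions (LDAR/STLR, plus LDAPR from Armv8.3 for a weaker acquire) rather than annotation bits. RISC-V lets the programmer (or compiler) state the required ordering on each LR, SC, or AMO.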

For compare-and-swap (CAS), there is no single CAS instruction in the base A extension. CAS is built from LR/SC:

Assembly

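# CAS: a0 = address, a1 = expected value, a2 = desired value
# Returns a0 = 0 if the swap happened, 1 if [a0] did not match a1
cas:
    lr.w   t0, (a0)        # load current value, take reservation
    bne    t0, a1, fail    # mismatch: give up
    sc.w   t0, a2, (a0)    # try to store the desired value
    bnez   t0, cas         # retry if SC failed
    li     a0, 0           # success
    ret
fail:
    li     a0, 1           # failure
    ret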

The newer Zacas extension adds explicit AMOCAS instructions, mirroring AArch64's CAS. Adoption of Zacas is just emerging.

04.The F and D Extensions: Floating-Point

The F extension adds single-precision FP; the D extension adds double-precision (and implies F).

Registers

32 floating-point registers, f0-f31, with ABI names ft0-ft11 (temporaries), fs0-fs11 (saved), fa0-fa7 (arguments). Each register is 32 bits wide if only F is implemented, 64 bits wide with D.

A control and status register, fcsr, holds rounding mode (3 bits) and exception flags (5 bits, IEEE 754-style: NX, UF, OF, DZ, NV).

Instructions

Most arithmetic instructions exist for both F and D, distinguished by .s (single) and .d (double) suffixes:

Assembly

fadd.s   fa0, fa1, fa2      # single-precision add
fadd.d   fa0, fa1, fa2      # double-precision add
fsub.s, fmul.s, fdiv.s, fsqrt.s
fmin.s, fmax.s              # IEEE 754-style min/max with NaN handling
fmadd.s  fa0, fa1, fa2, fa3 # FMA: fa0 = fa1*fa2 + fa3
fnmadd.s, fmsub.s, fnmsub.s

fcvt.s.w  fa0, a0           # convert int32 → float
fcvt.w.s  a0, fa0           # convert float → int32
fcvt.s.d, fcvt.d.s          # between single and double
fcvt.l.s, fcvt.s.l          # int64 conversions (RV64)

flt.s    a0, fa1, fa2       # FP less than: a0 = (fa1 < fa2) ? 1 : 0
feq.s    a0, fa1, fa2       # FP equal
fle.s    a0, fa1, fa2       # FP less or equal

fclass.s a0, fa0            # classify FP value (returns bit-mask of classifications)
fmv.x.w  a0, fa0            # move FP register's bits to integer register
fmv.w.x  fa0, a0            # move integer register's bits to FP register

flw  fa0, 0(a0)             # load single
fld  fa0, 0(a0)             # load double
fsw  fa0, 0(a0)             # store single
fsd  fa0, 0(a0)             # store double

Each arithmetic instruction takes a 3-bit rounding-mode field, encoded in the instruction. Common values: RNE (round to nearest, ties to even), RTZ (toward zero), RDN (down, toward -inf), RUP (up, toward +inf), RMM (nearest, ties to max magnitude). The default is "use the rounding mode in fcsr".

Comparisons and Branches

There is no dedicated FP branch in the base F/D extensions. Instead, one of the FP compare instructions sets an integer register, and a regular integer branch is used:

Assembly

flt.s   t0, fa1, fa2        # t0 = 1 if fa1 < fa2 else 0
bnez    t0, fa1_less_label

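As on x86 (comiss + jb), this takes two instructions, but the comparison result lands in an integer register instead of a flags register. The cost is small on out-of-order cores, since the branch is an ordinary integer branch.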

05.The C Extension: Compressed Instructions

The C extension adds 16-bit encodings for common 32-bit instructions. Examples:

Assembly

c.add   a0, a1          # 16-bit form of: add a0, a0, a1
c.li    a0, 5           # 16-bit form of: addi a0, x0, 5
c.lwsp  a0, 8(sp)       # 16-bit form of: lw a0, 8(sp)
c.j     label           # 16-bit jump
c.beqz  a0, label       # 16-bit branch-if-zero
c.mv    a0, a1          # 16-bit move
c.nop                   # 16-bit nop

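The compressed encoding has constraints: many forms can address only 8 of the 32 registers (x8-x15, i.e. s0-s1 and a0-a5, the "compressed register set"); immediates are smaller; not all combinations are encodable. A few forms, such as c.mv, c.add, c.li, and the sp-relative loads and stores, carry a full 5-bit register number.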

Mixed code (32-bit and 16-bit instructions interleaved) is the norm. The decoder identifies the size from bits 0-1: 11 means 32-bit, anything else means 16-bit.
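
The length rule is simple enough to sketch in Python. The helper names are hypothetical, and the model handles only the 16-bit and 32-bit cases described here (the ISA manual reserves further bit patterns for longer encodings, which this sketch ignores).

```python
def instruction_length(first_halfword):
    """Length in bytes, decided by bits 1:0 of the first 16-bit parcel."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2


def split_stream(code):
    """Split little-endian code bytes into one bytes object per instruction."""
    out, pc = [], 0
    while pc < len(code):
        half = int.from_bytes(code[pc:pc + 2], "little")
        n = instruction_length(half)
        out.append(code[pc:pc + n])
        pc += n
    return out
```

For example, the byte stream 01 00 13 05 15 00 82 80 splits into c.nop (0x0001), addi a0, a0, 1 (0x00150513), and the compressed ret (0x8082).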

Code-density gain: the ISA manual cites a typical 25-30% reduction, so code drops to roughly 70-75% of its uncompressed size. Important for microcontrollers with small flash.

06.The B Extension: Bit Manipulation

The B extension is a collection of sub-extensions:

Zba (Address generation): SH1ADD, SH2ADD, SH3ADD (shifted add). SH1ADD rd, rs1, rs2 computes rd = rs2 + (rs1 << 1). Useful for array indexing.
Zbb (Basic bit manipulation): ANDN, ORN, XNOR, CLZ (count leading zeros), CTZ (count trailing zeros), CPOP (population count, popcount), MIN, MAX (integer min/max), SEXT.B/SEXT.H (sign-extend), ZEXT.H, ROR/ROL (rotate), ORC.B, REV8 (byte reverse).
Zbs (Single-bit operations): BSET, BCLR, BINV, BEXT — set, clear, invert, extract a single bit by position.
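Zbc (Carry-less multiply): CLMUL, CLMULH, CLMULR. For CRC and GCM. Ratified alongside the others, but not part of the B bundle (Zba + Zbb + Zbs) or of the RVA22 mandatory set.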

Examples:

Assembly

# Compute array[i] address with scaling:
sh3add  t0, a1, a0     # t0 = a0 + (a1 << 3)  — a[i] for 8-byte elements
ld      a2, 0(t0)
# Compute leading zero count:
clz     a0, a1         # a0 = count of leading zeros in a1
# Population count:
cpop    a0, a1         # a0 = popcount(a1)
# Min/max:
min     a0, a1, a2     # signed min
maxu    a0, a1, a2     # unsigned max

Adoption: ratified in 2021. Modern application processors (RVA22+ profile) require it. Compilers emit B instructions when targeting RVA22 or later.
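
For readers who want to check results by hand, the semantics of a few of these instructions can be modelled in Python for RV64. The function names are hypothetical helpers mirroring the mnemonics; note the operand order of sh3add, where the first source is the one that gets shifted.

```python
MASK64 = (1 << 64) - 1


def sh3add(rs1, rs2):
    """sh3add rd, rs1, rs2: rd = rs2 + (rs1 << 3)."""
    return (rs2 + (rs1 << 3)) & MASK64


def clz(x):
    """Count leading zeros; returns 64 for an input of 0."""
    return 64 - (x & MASK64).bit_length()


def ctz(x):
    """Count trailing zeros; returns 64 for an input of 0."""
    x &= MASK64
    return 64 if x == 0 else (x & -x).bit_length() - 1


def cpop(x):
    """Population count of the 64-bit register value."""
    return bin(x & MASK64).count("1")
```

With a base address of 0x1000 and index 3, sh3add(3, 0x1000) gives 0x1018, the address of element 3 in an array of 8-byte items.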

07.The V Extension: Vectors

The V extension is RISC-V's vector ISA, ratified in 2021 (V 1.0). Like ARM SVE, it is a vector-length-agnostic design: code compiled for V works on implementations with various vector lengths (VLEN), automatically benefiting from wider hardware.

Registers

The V extension adds 32 vector registers, v0-v31. Each register has VLEN bits, where VLEN is implementation-defined (typically 128, 256, 512, or higher). Registers can be grouped via the LMUL parameter (1, 2, 4, or 8 registers per group) to provide longer effective vectors; fractional settings (1/2, 1/4, 1/8) use only part of a register.

Several control registers:

vstart: first element to process (for resumption after fault).
vxsat, vxrm: fixed-point saturation, rounding.
vcsr: control and status.
vl: current vector length (number of active elements).
vtype: vector element width (SEW), grouping (LMUL), tail/mask policy.
vlenb: VLEN in bytes (read-only, queryable).

Vector Configuration

Before performing vector operations, you set up the vector type and length:

Assembly

li     t0, 8                  # element count to process
vsetvli t1, t0, e32, m1, ta, ma  # configure: 32-bit elements, LMUL=1, ...

vsetvli does several things at once:

Asks for a desired element count (t0).
Selects element width (e32 = 32-bit elements).
Selects grouping (m1 = LMUL of 1, single register per vector).
Sets tail policy (ta = tail-agnostic, undefined values OK in tail).
Sets mask policy (ma = mask-agnostic).
Returns the actual VL granted (which may be less than requested if VLEN doesn't allow more).

vsetivli is a variant with the count as an immediate.
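
How the granted length depends on the hardware can be sketched in Python. The helper names are hypothetical. The model uses VLMAX = VLEN / SEW × LMUL and the simplest policy, vl = min(AVL, VLMAX); the V specification permits implementations some latitude when the request lies between VLMAX and 2 × VLMAX, which this sketch does not model.

```python
def vlmax(vlen_bits, sew_bits, lmul=1):
    """Maximum element count for a register group: VLEN / SEW * LMUL."""
    return int(vlen_bits * lmul) // sew_bits


def vsetvli(avl, vlen_bits, sew_bits, lmul=1):
    """Return the vl granted for a requested element count (AVL)."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))
```

Requesting 8 elements of 32 bits yields 4 on a VLEN=128 machine and 8 on a VLEN=256 machine, which is why the returned value, not the request, must drive the loop bookkeeping.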

Vector Arithmetic

Assembly

vadd.vv   v0, v1, v2        # vector + vector: v0 = v1 + v2
vadd.vx   v0, v1, a0        # vector + scalar: v0 = v1 + a0 (broadcast)
vadd.vi   v0, v1, 5         # vector + immediate
vsub.vv, vmul.vv, vdiv.vv
vand.vv, vor.vv, vxor.vv
vsll.vv, vsrl.vv, vsra.vv
vmin.vv, vmax.vv, vminu.vv, vmaxu.vv
vfadd.vv  v0, v1, v2        # FP add
vfmul.vv  v0, v1, v2
vfmacc.vv v0, v1, v2        # FMA: v0 += v1 * v2
vmacc.vv  v0, v1, v2        # integer multiply-accumulate: v0 += v1 * v2

Each instruction acts on the first vl elements. Tail elements (index vl and above) are handled per the tail policy; masked-off elements within the first vl are handled per the mask policy.

Vector Memory Operations

Assembly

vle32.v   v0, (a0)             # load vl elements, 32-bit each, contiguous
vse32.v   v0, (a0)             # store vl elements, 32-bit each
vle8.v    v0, (a0)             # load bytes
vlse32.v  v0, (a0), a1          # strided load (a1 = stride in bytes)
vluxei32.v v0, (a0), v8         # gather (32-bit byte offsets in v8)
vsuxei32.v v0, (a0), v8         # scatter (same indexing)

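Strided and indexed loads/stores cover most access patterns. The element size (e8, e32, etc.) is chosen by vsetvli, but a unit-stride or strided load's mnemonic (vle32.v) carries its own element width, which overrides SEW for that operation. For indexed forms, the width in the mnemonic (vluxei32.v) is the width of the indices; the data elements use SEW.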

Masked Operations

The mask is in v0 (special-cased). Operations can be conditional on v0:

Assembly

vadd.vv   v1, v2, v3, v0.t   # v1 = v2 + v3 where v0 is true; mask policy elsewhere

The v0.t suffix means "use v0 as a mask, only update where v0 is true". This is the predicated-execution pattern, essential for vectorizing conditional code.
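
A masked add can be modelled in Python as follows. This is a sketch with a hypothetical function name; it models the undisturbed policies, where masked-off and tail elements keep their old destination values (with the agnostic policies, hardware may instead overwrite them with all ones).

```python
def vadd_masked(vd_old, vs2, vs1, mask, vl, sew=32):
    """Model vadd.vv vd, vs2, vs1, v0.t with undisturbed mask and tail."""
    limit = (1 << sew) - 1
    out = list(vd_old)                    # start from the old destination
    for i in range(vl):                   # elements at index >= vl are tail
        if mask[i]:                       # only active elements are written
            out[i] = (vs2[i] + vs1[i]) & limit
    return out
```

With mask [1, 0, 1, 1] and vl = 3, only elements 0 and 2 are updated: element 1 is masked off and element 3 is in the tail.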

Reductions

Assembly

vredsum.vs   v0, v1, v2     # sum reduction: v0[0] = v2[0] + sum(v1[0..vl-1])
vredmax.vs, vredmin.vs
vfredusum.vs, vfredosum.vs, vfredmax.vs   # FP versions (unordered and ordered sum)

A reduction takes a vector source and a scalar source (element 0 of a vector register), and writes a scalar result to element 0 of the destination (v0[0] here).
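
The data flow of an integer sum reduction can be sketched in Python. The function name and argument layout are hypothetical; vs2 is the vector source, vs1 supplies the scalar seed in its element 0, and the result wraps at the element width.

```python
def vredsum(vs2, vs1, vl, mask=None, sew=32):
    """Model vredsum.vs vd, vs2, vs1: returns the value written to vd[0]."""
    total = vs1[0]                        # scalar seed from element 0
    for i in range(vl):
        if mask is None or mask[i]:       # masked-off elements are skipped
            total += vs2[i]
    return total & ((1 << sew) - 1)       # wraps at SEW bits
```

Seeding with the previous partial sum is what lets a strip-mined loop accumulate across iterations with one reduction per pass.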

Vector-Length-Agnostic Loop

The canonical V vector loop:

Assembly

.loop:
    vsetvli t0, a2, e32, m1, ta, ma   # t0 = min(a2, max VL)
    vle32.v v0, (a0)                  # load t0 elements
    vle32.v v1, (a1)                  # load t0 elements
    vadd.vv v2, v0, v1                # add
    vse32.v v2, (a3)                  # store
    sub     a2, a2, t0                # remaining count
    slli    t0, t0, 2                 # bytes processed = elements * 4
    add     a0, a0, t0
    add     a1, a1, t0
    add     a3, a3, t0
    bnez    a2, .loop

The loop is length-agnostic: vsetvli grants whatever vector length the implementation supports, and the loop iterates until done. No fixup loop for the tail; the last iteration just gets a smaller VL.
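
The strip-mining structure can be checked with a small Python model. The function name and the trips counter are hypothetical additions for illustration; the model assumes LMUL=1 and the simple vl = min(remaining, VLMAX) policy.

```python
def vector_add(a, b, vlen_bits=128, sew_bits=32):
    """Model of the vsetvli loop: returns (result, number of iterations)."""
    limit = (1 << sew_bits) - 1
    vlmax = vlen_bits // sew_bits         # elements per register at LMUL=1
    out, i, remaining, trips = [], 0, len(a), 0
    while remaining > 0:
        vl = min(remaining, vlmax)        # what vsetvli would grant
        out.extend((x + y) & limit for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl                           # pointer bumps (add a0, a0, t0 ...)
        remaining -= vl                   # sub a2, a2, t0
        trips += 1
    return out, trips
```

Ten elements take three passes (4, 4, 2) at VLEN=128 and a single pass at VLEN=512, with identical results and no separate tail loop.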

This is structurally similar to ARM SVE, but the configuration is more explicit (vsetvli per loop) and the predication mechanism uses v0 specifically rather than dedicated predicate registers.

Element Width Mixing

A key V capability: in one loop, you can mix element widths. For example, expand 16-bit data into 32-bit accumulators:

Assembly

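vsetvli t0, a2, e16, m1, ta, ma   # SEW=16, LMUL=1
vle16.v v0, (a0)                  # load 16-bit elements
vwmulu.vv v2, v0, v0              # widen-multiply (16x16 -> 32, into v2:v3)
vsetvli x0, x0, e32, m2, ta, ma   # same vl, now viewing v2:v3 as 32-bit elements

vwmulu.vv is a widening multiply: it reads SEW-wide (16-bit) inputs and produces 2×SEW (32-bit) outputs, so it must execute while SEW is still 16. The destination is a two-register group (EMUL = 2×LMUL). Switching vtype to e32, m2 afterwards lets ordinary instructions accumulate the 32-bit products. This is heavily used in DSP and fixed-point algorithms.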
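
The value of widening is clearest next to a same-width multiply. The Python sketch below uses hypothetical function names and models unsigned 16-bit inputs: the widening form keeps the full 32-bit product, while the same-width form keeps only the low 16 bits.

```python
def vwmulu(vs2, vs1, vl, sew=16):
    """Widening unsigned multiply: SEW-bit inputs, 2*SEW-bit results."""
    in_mask = (1 << sew) - 1
    return [(vs2[i] & in_mask) * (vs1[i] & in_mask) for i in range(vl)]


def vmul(vs2, vs1, vl, sew=16):
    """Same-width multiply: only the low SEW bits of each product survive."""
    mask = (1 << sew) - 1
    return [(vs2[i] * vs1[i]) & mask for i in range(vl)]
```

Squaring 0xFFFF gives 0xFFFE0001 with the widening form but only 0x0001 with the same-width form.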

Comparison with SVE and AVX-512

Feature	RISC-V V	ARM SVE/SVE2	AVX-512
Vector width	Variable	Variable	Fixed 512
Mask register	v0 (no separate file)	P0-P15 (16)	k0-k7 (8)
Length config	vsetvli per loop	WHILELT predicate	Fixed
Element widths	8/16/32/64, mixable	8/16/32/64	8/16/32/64
Strided access	Yes	Via gather (no dedicated strided load)	No (gather only)
Gather/scatter	Yes	Yes	Yes
First-fault loads	Yes	Yes	No
LMUL grouping	Yes (unique)	No equivalent	No

LMUL is unique to RISC-V V: it lets you "group" registers to get longer effective vectors at the cost of register count. For tight loops with few live vectors, LMUL=4 or LMUL=8 effectively gives 4× or 8× the vector length. Compilers and hand-written kernels use LMUL aggressively.

Adoption

V adoption is gradual. RVA23 profile (2024) requires V 1.0 and several Zv* sub-extensions. Implementations:

T-Head C910: optional V (some chips have it, some don't); often early non-ratified V 0.7.1.
SpacemiT K1 (Banana Pi BPI-F3): V 1.0 + Zvfh.
SiFive P870: V 1.0.
Tenstorrent's RISC-V cores: V 1.0 (used internally for AI work).
Ventana V2: V 1.0.

Most current RISC-V SoCs do not yet have V; for them, scalar code is the only option for portable software. SIMD-style extensions (like P, the packed-SIMD extension that was being discussed) are not part of any deployed RISC-V profile.

08.Calling Convention

The standard RISC-V calling convention (RV64, "lp64d" ABI for FP-equipped systems):

Argument passing.

Integer / pointer args: a0-a7 (8 registers).
FP args: fa0-fa7 (8 registers).
Additional args on stack.
Return value in a0 (and a1 for 128-bit). FP return in fa0.

Caller-saved (volatile, "temporary"). ra, t0-t6, a0-a7, ft0-ft11, fa0-fa7.

Callee-saved. sp, s0-s11, fs0-fs11.

Frame setup. sp must be 16-byte aligned at any call. s0/fp is the conventional frame pointer.

A typical prologue and epilogue:

Assembly

function:
    addi  sp, sp, -32            # allocate frame
    sd    ra, 24(sp)             # save return address
    sd    s0, 16(sp)             # save fp
    addi  s0, sp, 32             # set new fp
    # ... body ...
    ld    s0, 16(sp)             # restore fp
    ld    ra, 24(sp)             # restore return address
    addi  sp, sp, 32             # release frame
    ret

For leaf functions (no calls), the ra save can be skipped, and the prologue may be minimal.

The convention is straightforward — comparable to AArch64's AAPCS64 in spirit, with similar argument register count and similar callee/caller saved partitioning.

09.Common Idioms

Zeroing a register.

Assembly

mv   a0, x0       # a0 = 0 (alias for addi a0, x0, 0)
li   a0, 0        # same

Negation.

Assembly

neg a0, a1 # a0 = -a1 (alias for sub a0, x0, a1)

Boolean materialization.

Assembly

slt  a0, a1, a2   # a0 = (a1 < a2) ? 1 : 0
sltu a0, a1, a2   # unsigned version

Branchless absolute value (signed):

Assembly

srai t0, a0, 63       # t0 = sign mask (-1 if negative, 0 otherwise)
xor  a0, a0, t0       # flip bits if negative
sub  a0, a0, t0       # add 1 if negative

(Same as the AArch64 / x86-64 trick.)
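
The three-instruction sequence can be traced in Python on 64-bit register values. The helper names are hypothetical; srai models the arithmetic right shift, and every result is masked back to 64 bits as the hardware would do.

```python
MASK64 = (1 << 64) - 1


def srai(x, shamt):
    """Arithmetic right shift of a 64-bit register value."""
    x &= MASK64
    if x >> 63:
        x -= 1 << 64                      # reinterpret as negative
    return (x >> shamt) & MASK64


def branchless_abs(a0):
    t0 = srai(a0, 63)                     # all ones if negative, else zero
    a0 = (a0 ^ t0) & MASK64               # flip bits if negative
    return (a0 - t0) & MASK64             # subtracting -1 adds 1
```

One edge case is worth knowing: the most negative value has no positive counterpart, so it maps to itself.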

Branchless min/max with B extension:

Assembly

min  a0, a1, a2     # signed min (with Zbb)
max  a0, a1, a2     # signed max

Without Zbb, this takes a compare and a branch around a move, since the base ISA has no conditional-select instruction.

Loop counter.

Assembly

.loop:
    lw    a2, 0(a0)
    add   a3, a3, a2
    addi  a0, a0, 4
    addi  a1, a1, -1
    bnez  a1, .loop

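The standard decrement-counter form. Can also be expressed as compare-against-bound, with a1 holding the end pointer (one past the last element):

Assembly

.loop2:
    lw    a2, 0(a0)
    add   a3, a3, a2
    addi  a0, a0, 4
    bltu  a0, a1, .loop2      # addresses are unsigned, so use bltu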

10.Compiler Output Walk-Through

Same example as Chapters 33 and 38: array sum in C.

int sum_array(const int* a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

With clang -O2 --target=riscv64-linux-gnu -march=rv64gc:

Assembly

# Representative output; exact registers and scheduling vary by compiler version.
sum_array:
    beqz   a1, .Lzero         # n == 0?
    mv     a3, a0             # a3 = cursor, starts at a
    li     a0, 0              # s = 0
    slli   a4, a1, 2          # n*4 = byte length
    add    a4, a3, a4         # a4 = end pointer
.Lloop:
    lw     a5, 0(a3)          # load a[i]
    add    a0, a0, a5         # s += a[i]
    addi   a3, a3, 4          # advance cursor
    bne    a3, a4, .Lloop
    ret
.Lzero:
    li     a0, 0
    ret

Density: 4 instructions in the inner loop (load, add, increment, branch), comparable to AArch64. Since rv64gc includes the C extension, several of these are emitted as 16-bit instructions, so total bytes are similar to or less than AArch64.

11.Privileged vs. Unprivileged

Application programs run in U-mode (user mode). The base ISA is fully available. The following are privileged or restricted:

CSR (Control and Status Register) access: not available in U-mode unless the specific CSR is marked unprivileged. Some unprivileged CSRs include cycle, time, instret (counter access), fcsr (FP control).
WFI (Wait For Interrupt): privileged.
MRET, SRET: privileged returns from traps. (URET belonged to the user-level interrupt extension N, which was dropped before ratification.)
SFENCE.VMA: TLB flush; privileged.

Privileged operations are covered in Chapter 44.

12.Practical Tools

riscv64-linux-gnu-objdump -d binary: disassemble.
riscv64-linux-gnu-gcc -S -O2, or clang -S -O2 --target=riscv64-linux-gnu: compile to assembly.
Compiler Explorer (godbolt.org): supports RISC-V across compilers.
qemu-riscv64: emulation; useful for testing RISC-V code on x86 hosts.
RISC-V ISA Manual (Volumes 1 and 2): the canonical reference, freely available.
rvv-intrinsic-doc: the V extension's intrinsics reference.

13.Summary

RISC-V's unprivileged ISA is built around a 47-instruction integer base plus modular standard extensions: M (multiply/divide), A (atomics with explicit acquire/release), F/D (single/double FP), C (compressed instructions for code density), B (bit manipulation), V (vector). The base is genuinely minimal: no flag register (branches compare two registers directly) and no flexible addressing modes, which pushes complexity into compiler-generated sequences. The compressed extension recovers density; B fills the bit-manipulation gap; V provides modern vector capabilities competitive with SVE.

Compared with x86-64 and AArch64, RISC-V is the cleanest ISA at the architectural level. Whether the cleanliness translates to the best implementations is up to the silicon designers. In day-to-day use, RISC-V code looks like a more spartan version of AArch64 — same RISC structure, fewer baked-in conveniences, more explicit address arithmetic.

The next chapter steps up to the system level: machine mode, supervisor mode, virtual memory (Sv39, Sv48, Sv57), the SBI interface, hypervisor extension, and the RISC-V boot process.
