== Instruction Cycle Counts

All timings are given assuming perfect bus behaviour (no downstream bus stalls).

=== RV32I

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Integer Register-register
| `add rd, rs1, rs2` | 1 |
| `sub rd, rs1, rs2` | 1 |
| `slt rd, rs1, rs2` | 1 |
| `sltu rd, rs1, rs2` | 1 |
| `and rd, rs1, rs2` | 1 |
| `or rd, rs1, rs2` | 1 |
| `xor rd, rs1, rs2` | 1 |
| `sll rd, rs1, rs2` | 1 |
| `srl rd, rs1, rs2` | 1 |
| `sra rd, rs1, rs2` | 1 |
3+| Integer Register-immediate
| `addi rd, rs1, imm` | 1 | `nop` is a pseudo-op for `addi x0, x0, 0`
| `slti rd, rs1, imm` | 1 |
| `sltiu rd, rs1, imm` | 1 |
| `andi rd, rs1, imm` | 1 |
| `ori rd, rs1, imm` | 1 |
| `xori rd, rs1, imm` | 1 |
| `slli rd, rs1, imm` | 1 |
| `srli rd, rs1, imm` | 1 |
| `srai rd, rs1, imm` | 1 |
3+| Large Immediate
| `lui rd, imm` | 1 |
| `auipc rd, imm` | 1 |
3+| Control Transfer
| `jal rd, label` | 2footnote:unaligned_branch[A jump or branch to a 32-bit instruction which is not 32-bit-aligned requires one additional cycle, because two naturally aligned bus cycles are required to fetch the target instruction.] |
| `jalr rd, rs1, imm` | 2footnote:unaligned_branch[] |
| `beq rs1, rs2, label` | 1 or 2footnote:unaligned_branch[] | 1 if not taken, 2 if taken.
| `bne rs1, rs2, label` | 1 or 2footnote:unaligned_branch[] | 1 if not taken, 2 if taken.
| `blt rs1, rs2, label` | 1 or 2footnote:unaligned_branch[] | 1 if not taken, 2 if taken.
| `bge rs1, rs2, label` | 1 or 2footnote:unaligned_branch[] | 1 if not taken, 2 if taken.
| `bltu rs1, rs2, label` | 1 or 2footnote:unaligned_branch[] | 1 if not taken, 2 if taken.
| `bgeu rs1, rs2, label` | 1 or 2footnote:unaligned_branch[] | 1 if not taken, 2 if taken.
3+| Load and Store
| `lw rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[If an instruction uses load data (from stage 3) in stage 2, a 1-cycle bubble is inserted after the load.
Load-data to store-data dependency does not experience this, because the store data is used in stage 3. However, load-data to store-address (or e.g. load-to-add) does qualify.]
| `lh rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lhu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lb rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lbu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `sw rs2, imm(rs1)` | 1 |
| `sh rs2, imm(rs1)` | 1 |
| `sb rs2, imm(rs1)` | 1 |
|===

=== M Extension

Timings assume the core is configured with `MULDIV_UNROLL = 2` and `MUL_FAST = 1`. That is, the sequential multiply/divide circuit processes two bits per cycle, and a separate dedicated multiplier is present for the `mul` instruction.

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| 32 {times} 32 -> 32 Multiply
| `mul rd, rs1, rs2` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.
3+| 32 {times} 32 -> 64 Multiply, Upper Half
| `mulh rd, rs1, rs2` | 18 to 20 | Depending on sign correction
| `mulhsu rd, rs1, rs2` | 18 to 20 | Depending on sign correction
| `mulhu rd, rs1, rs2` | 18 |
3+| Divide and Remainder
| `div rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `divu rd, rs1, rs2` | 18 |
| `rem rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `remu rd, rs1, rs2` | 18 |
|===

=== A Extension

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Load-Reserved/Store-Conditional
| `lr.w rd, (rs1)` | 1 or 2 | 2 if next instruction is dependentfootnote:data_dependency[], or an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[A pipeline bubble is inserted between `lr.w`/`sc.w` and an immediately-following `lr.w`/`sc.w`/`amo*`, because the AHB5 bus standard does not permit pipelined exclusive accesses. A stall would be inserted between `lr.w` and `sc.w` anyhow, so the local monitor can be updated based on the `lr.w` data phase in time to suppress the `sc.w` address phase.]
| `sc.w rd, rs2, (rs1)` | 1 or 2 | 2 if next instruction is an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[]
3+| Atomic Memory Operations
|`amoswap.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[AMOs are issued as a paired exclusive read and exclusive write on the bus, at the maximum speed of 2 cycles per access, since the bus does not permit pipelining of exclusive reads/writes. If the write phase fails due to the global monitor reporting a lost reservation, the instruction loops at a rate of 4 cycles per loop, until success. If the read reservation is refused by the global monitor, the instruction generates a Store/AMO Fault exception, to avoid an infinite loop.]
|`amoadd.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoxor.w rd, rs2, (rs1)` | 4+ | 4 per attempt.
Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoand.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoor.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomin.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomax.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amominu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomaxu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|===

=== C Extension

All C extension 16-bit instructions are aliases of base RV32I instructions. On Hazard3, they perform identically to their 32-bit counterparts.

A consequence of the C extension is that 32-bit instructions can be non-naturally-aligned. This has no penalty during sequential execution, but branching to a 32-bit instruction that is not 32-bit-aligned carries a 1-cycle penalty, because the instruction fetch is split into two naturally-aligned bus accesses.
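
The alignment penalty described above can be modelled in a few lines. The following is a minimal sketch, not part of Hazard3 itself: the function names `fetch_penalty` and `taken_branch_cycles` are hypothetical, and the base cost of 2 cycles is taken from the `jal`/`jalr`/taken-branch rows of the RV32I table, assuming no bus stalls.

```python
def fetch_penalty(target_addr: int, target_is_32bit: bool) -> int:
    """Extra cycles when a jump or taken branch lands on target_addr.

    Hypothetical helper: models only the rule that a 32-bit target
    instruction which is not 32-bit-aligned needs two bus accesses.
    """
    if target_addr % 2 != 0:
        raise ValueError("instruction addresses must be 16-bit-aligned")
    # A 32-bit instruction at an address that is not a multiple of 4
    # straddles two naturally aligned 32-bit words.
    straddles_two_words = target_is_32bit and target_addr % 4 != 0
    return 1 if straddles_two_words else 0


def taken_branch_cycles(target_addr: int, target_is_32bit: bool) -> int:
    """Cycles for jal, jalr or a taken branch (base cost 2 per the table)."""
    return 2 + fetch_penalty(target_addr, target_is_32bit)


if __name__ == "__main__":
    print(taken_branch_cycles(0x1000, True))   # aligned 32-bit target
    print(taken_branch_cycles(0x1002, True))   # misaligned 32-bit target
    print(taken_branch_cycles(0x1002, False))  # 16-bit target, same address
```

A 16-bit target never incurs the penalty in this model, because it always fits within a single naturally aligned word.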

=== Privileged Instructions (including Zicsr)

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| CSR Access
| `csrrw rd, csr, rs1` | 1 |
| `csrrc rd, csr, rs1` | 1 |
| `csrrs rd, csr, rs1` | 1 |
| `csrrwi rd, csr, imm` | 1 |
| `csrrci rd, csr, imm` | 1 |
| `csrrsi rd, csr, imm` | 1 |
3+| Trap Request
| `ecall` | 3 | Time given is for jumping to `mtvec`
| `ebreak` | 3 | Time given is for jumping to `mtvec`
|===

=== Bit Manipulation

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Zba (address generation)
|`sh1add rd, rs1, rs2` | 1 |
|`sh2add rd, rs1, rs2` | 1 |
|`sh3add rd, rs1, rs2` | 1 |
3+| Zbb (basic bit manipulation)
|`andn rd, rs1, rs2` | 1 |
|`clz rd, rs1` | 1 |
|`cpop rd, rs1` | 1 |
|`ctz rd, rs1` | 1 |
|`max rd, rs1, rs2` | 1 |
|`maxu rd, rs1, rs2` | 1 |
|`min rd, rs1, rs2` | 1 |
|`minu rd, rs1, rs2` | 1 |
|`orc.b rd, rs1` | 1 |
|`orn rd, rs1, rs2` | 1 |
|`rev8 rd, rs1` | 1 |
|`rol rd, rs1, rs2` | 1 |
|`ror rd, rs1, rs2` | 1 |
|`rori rd, rs1, imm` | 1 |
|`sext.b rd, rs1` | 1 |
|`sext.h rd, rs1` | 1 |
|`xnor rd, rs1, rs2` | 1 |
|`zext.h rd, rs1` | 1 |
|`zext.b rd, rs1` | 1 | `zext.b` is a pseudo-op for `andi rd, rs1, 0xff`
3+| Zbc (carry-less multiply)
|`clmul rd, rs1, rs2` | 1 |
|`clmulh rd, rs1, rs2` | 1 |
|`clmulr rd, rs1, rs2` | 1 |
3+| Zbs (single-bit manipulation)
|`bclr rd, rs1, rs2` | 1 |
|`bclri rd, rs1, imm` | 1 |
|`bext rd, rs1, rs2` | 1 |
|`bexti rd, rs1, imm` | 1 |
|`binv rd, rs1, rs2` | 1 |
|`binvi rd, rs1, imm` | 1 |
|`bset rd, rs1, rs2` | 1 |
|`bseti rd, rs1, imm` | 1 |
|===
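
As a rough illustration of how the single-cycle figures combine with the load-use rule from the load/store footnote, the sketch below estimates cycles for a straight-line sequence. It is a simplified model under stated assumptions: the `Instr` type and `estimate_cycles` function are hypothetical, every instruction is taken to cost 1 cycle, only the load-to-next-instruction bubble is modelled, `x0` is not special-cased, and there are no bus stalls. Registers read in stage 2 (ALU operands, load/store addresses) are listed in `stage2_srcs`; store data is left out because the footnote states it is used in stage 3.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Instr:
    """Hypothetical description of one single-cycle instruction."""
    name: str
    rd: Optional[str] = None                    # destination register, if any
    stage2_srcs: FrozenSet[str] = frozenset()   # registers read in stage 2
    is_load: bool = False


def estimate_cycles(seq: Sequence[Instr]) -> int:
    """Sum 1 cycle per instruction plus 1 bubble per load-use hazard."""
    total = 0
    for i, ins in enumerate(seq):
        total += 1
        nxt = seq[i + 1] if i + 1 < len(seq) else None
        # Bubble only if the very next instruction needs the load
        # result in stage 2 (ALU operand or address), per the footnote.
        if ins.is_load and nxt is not None and ins.rd in nxt.stage2_srcs:
            total += 1
    return total


lw = Instr("lw a0, 0(a1)", rd="a0", stage2_srcs=frozenset({"a1"}), is_load=True)
add = Instr("add a2, a0, a3", rd="a2", stage2_srcs=frozenset({"a0", "a3"}))
sw_data = Instr("sw a0, 4(a1)", stage2_srcs=frozenset({"a1"}))   # a0 is store data
sw_addr = Instr("sw a2, 0(a0)", stage2_srcs=frozenset({"a0"}))   # a0 is store address

if __name__ == "__main__":
    print(estimate_cycles([lw, add]))      # load-to-add: bubble
    print(estimate_cycles([lw, sw_data]))  # load-data to store-data: no bubble
    print(estimate_cycles([lw, sw_addr]))  # load-data to store-address: bubble
```

The model deliberately omits taken branches, `mul` dependencies, multi-cycle M-extension instructions and AMOs. Their costs would need to be added from the corresponding tables.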