Hazard3/doc/sections/instruction_timings.adoc

== Instruction Cycle Counts

All timings are given assuming perfect bus behaviour (no downstream bus stalls), and that the core is configured with `MULDIV_UNROLL = 2` and all other configuration options set for maximum performance.

=== RV32I

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Integer Register-register
| `add rd, rs1, rs2` | 1 |
| `sub rd, rs1, rs2` | 1 |
| `slt rd, rs1, rs2` | 1 |
| `sltu rd, rs1, rs2` | 1 |
| `and rd, rs1, rs2` | 1 |
| `or rd, rs1, rs2` | 1 |
| `xor rd, rs1, rs2` | 1 |
| `sll rd, rs1, rs2` | 1 |
| `srl rd, rs1, rs2` | 1 |
| `sra rd, rs1, rs2` | 1 |
3+| Integer Register-immediate
| `addi rd, rs1, imm` | 1 | `nop` is a pseudo-op for `addi x0, x0, 0`
| `slti rd, rs1, imm` | 1 |
| `sltiu rd, rs1, imm` | 1 |
| `andi rd, rs1, imm` | 1 |
| `ori rd, rs1, imm` | 1 |
| `xori rd, rs1, imm` | 1 |
| `slli rd, rs1, imm` | 1 |
| `srli rd, rs1, imm` | 1 |
| `srai rd, rs1, imm` | 1 |
3+| Large Immediate
| `lui rd, imm` | 1 |
| `auipc rd, imm` | 1 |
3+| Control Transfer
| `jal rd, label` | 2footnote:unaligned_branch[A jump or branch to a 32-bit instruction which is not 32-bit-aligned requires one additional cycle, because two naturally aligned bus cycles are required to fetch the target instruction.]|
| `jalr rd, rs1, imm` | 2footnote:unaligned_branch[] |
| `beq rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bne rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `blt rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bge rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bltu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bgeu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
3+| Load and Store
| `lw rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[If an instruction in stage 2 (e.g. an `add`) uses data from stage 3 (e.g. a `lw` result), a 1-cycle bubble is inserted between the pair. A load data -> store data dependency is _not_ an example of this, because data is produced and consumed in stage 3. However, load data -> load address _would_ qualify, as would e.g. `sc.w` -> `beqz`.]
| `lh rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lhu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lb rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lbu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `sw rs2, imm(rs1)` | 1 |
| `sh rs2, imm(rs1)` | 1 |
| `sb rs2, imm(rs1)` | 1 |
|===

=== M Extension

Timings assume the core is configured with `MULDIV_UNROLL = 2` and `MUL_FAST = 1`. I.e. the sequential multiply/divide circuit processes two bits per cycle, and a separate dedicated multiplier is present for the `mul` instruction.


[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| 32 {times} 32 -> 32 Multiply
| `mul rd, rs1, rs2` | 1  |
3+| 32 {times} 32 -> 64 Multiply, Upper Half
| `mulh rd, rs1, rs2` | 1 |
| `mulhsu rd, rs1, rs2` | 1 |
| `mulhu rd, rs1, rs2` | 1 |
3+| Divide and Remainder
| `div rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `divu rd, rs1, rs2` | 18 |
| `rem rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `remu rd, rs1, rs2` | 18 |
|===

=== A Extension

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Load-Reserved/Store-Conditional
| `lr.w rd, (rs1)` | 1 or 2 | 2 if next instruction is dependentfootnote:data_dependency[], an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[A pipeline bubble is inserted between `lr.w`/`sc.w` and an immediately-following `lr.w`/`sc.w`/`amo*`, because the AHB5 bus standard does not permit pipelined exclusive accesses. A stall would be inserted between `lr.w` and `sc.w` anyhow, so the local monitor can be updated based on the `lr.w` data phase in time to suppress the `sc.w` address phase.]
| `sc.w rd, rs2, (rs1)` | 1 or 2 | 2 if next instruction is dependentfootnote:data_dependency[], an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[]
3+| Atomic Memory Operations
|`amoswap.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[AMOs are issued as a paired exclusive read and exclusive write on the bus, at the maximum speed of 2 cycles per access, since the bus does not permit pipelining of exclusive reads/writes. If the write phase fails due to the global monitor reporting a lost reservation, the instruction loops at a rate of 4 cycles per loop, until success. If the read reservation is refused by the global monitor, the instruction generates a Store/AMO Fault exception, to avoid an infinite loop.]
|`amoadd.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoxor.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoand.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoor.w rd, rs2, (rs1)`   | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomin.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomax.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amominu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomaxu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|===

=== C Extension

All C extension 16-bit instructions are aliases of base RV32I instructions. On Hazard3, they perform identically to their 32-bit counterparts.

A consequence of the C extension is that 32-bit instructions can be non-naturally-aligned. This has no penalty during sequential execution, but branching to a 32-bit instruction that is not 32-bit-aligned carries a 1 cycle penalty, because the instruction fetch is cracked into two naturally-aligned bus accesses.

=== Privileged Instructions (including Zicsr)

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| CSR Access
| `csrrw rd, csr, rs1` | 1 |
| `csrrc rd, csr, rs1` | 1 |
| `csrrs rd, csr, rs1` | 1 |
| `csrrwi rd, csr, imm` | 1 |
| `csrrci rd, csr, imm` | 1 |
| `csrrsi rd, csr, imm` | 1 |
3+| Trap Request
| `ecall` | 3 | Time given is for jumping to `mtvec`
| `ebreak` | 3 | Time given is for jumping to `mtvec`
|===

=== Bit Manipulation

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Zba (address generation)
|`sh1add rd, rs1, rs2` | 1 |
|`sh2add rd, rs1, rs2` | 1 |
|`sh3add rd, rs1, rs2` | 1 |
3+| Zbb (basic bit manipulation)
|`andn rd, rs1, rs2`   | 1 |
|`clz rd, rs1`         | 1 |
|`cpop rd, rs1`        | 1 |
|`ctz rd, rs1`         | 1 |
|`max rd, rs1, rs2`    | 1 |
|`maxu rd, rs1, rs2`   | 1 |
|`min rd, rs1, rs2`    | 1 |
|`minu rd, rs1, rs2`   | 1 |
|`orc.b rd, rs1`       | 1 |
|`orn rd, rs1, rs2`    | 1 |
|`rev8 rd, rs1`        | 1 |
|`rol rd, rs1, rs2`    | 1 |
|`ror rd, rs1, rs2`    | 1 |
|`rori rd, rs1, imm`   | 1 |
|`sext.b rd, rs1`      | 1 |
|`sext.h rd, rs1`      | 1 |
|`xnor rd, rs1, rs2`   | 1 |
|`zext.h rd, rs1`      | 1 |
|`zext.b rd, rs1`      | 1 | `zext.b` is a pseudo-op for `andi rd, rs1, 0xff`
3+| Zbc (carry-less multiply)
|`clmul rd, rs1, rs2`  | 1 |
|`clmulh rd, rs1, rs2` | 1 |
|`clmulr rd, rs1, rs2` | 1 |
3+| Zbs (single-bit manipulation)
|`bclr rd, rs1, rs2`   | 1 |
|`bclri rd, rs1, imm`  | 1 |
|`bext rd, rs1, rs2`   | 1 |
|`bexti rd, rs1, imm`  | 1 |
|`binv rd, rs1, rs2`   | 1 |
|`binvi rd, rs1, imm`  | 1 |
|`bset rd, rs1, rs2`   | 1 |
|`bseti rd, rs1, imm`  | 1 |
3+| Zbkb (basic bit manipulation for cryptography)
|`pack rd, rs1, rs2`   | 1 |
|`packh rd, rs1, rs2`  | 1 |
|`brev8 rd, rs1`       | 1 |
|`zip   rd, rs1`       | 1 |
|`unzip rd, rs1`       | 1 |
|===

=== Zcb Extension

Similarly to the C extension, this extension contains 16-bit variants of common 32-bit instructions:

* RV32I base ISA: `lbu`, `lh`, `lhu`, `sb`, `sh`, `zext.b` (alias of `andi`), `not` (alias of `xori`)
* Zbb extension: `sext.b`, `zext.h`, `sext.h`
* M extension: `mul`

They perform identically to their 32-bit counterparts.

=== Zcmp Extension

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
|`cm.push {rlist}, -imm` | 1 + _n_ | _n_ is number of registers in rlist
|`cm.pop {rlist}, imm` | 1 + _n_ | _n_ is number of registers in rlist
|`cm.popret {rlist}, imm` | 3 + _n_ footnote:unaligned_branch[] | _n_ is number of registers in rlist
|`cm.popretz {rlist}, imm` | 4 + _n_ footnote:unaligned_branch[] | _n_ is number of registers in rlist
|`cm.mva01s r1s', r2s'` | 2 |
|`cm.mvsa01 r1s', r2s'` | 2 |
|===


=== Branch Predictor

Hazard3 includes a minimal branch predictor, to accelerate tight loops:

* The instruction frontend remembers the last taken, backward branch
* If the same branch is seen again, it is predicted taken
* All other branches are predicted nontaken
* If a predicted-taken branch is not taken, the predictor state is cleared, and it will be predicted nontaken on its next execution.

Correctly predicted branches execute in one cycle: the frontend is able to stitch together the two nonsequential fetch paths so that they appear sequential. Mispredicted branches incur a penalty cycle, since a nonsequential fetch address must be issued when the branch is executed.