
== Instruction Cycle Counts

All timings are given assuming perfect bus behaviour (no downstream bus stalls).

=== RV32I

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Integer Register-register
| `add rd, rs1, rs2` | 1 |
| `sub rd, rs1, rs2` | 1 |
| `slt rd, rs1, rs2` | 1 |
| `sltu rd, rs1, rs2` | 1 |
| `and rd, rs1, rs2` | 1 |
| `or rd, rs1, rs2` | 1 |
| `xor rd, rs1, rs2` | 1 |
| `sll rd, rs1, rs2` | 1 |
| `srl rd, rs1, rs2` | 1 |
| `sra rd, rs1, rs2` | 1 |
3+| Integer Register-immediate
| `addi rd, rs1, imm` | 1 | `nop` is a pseudo-op for `addi x0, x0, 0`
| `slti rd, rs1, imm` | 1 |
| `sltiu rd, rs1, imm` | 1 |
| `andi rd, rs1, imm` | 1 |
| `ori rd, rs1, imm` | 1 |
| `xori rd, rs1, imm` | 1 |
| `slli rd, rs1, imm` | 1 |
| `srli rd, rs1, imm` | 1 |
| `srai rd, rs1, imm` | 1 |
3+| Large Immediate
| `lui rd, imm` | 1 |
| `auipc rd, imm` | 1 |
3+| Control Transfer
| `jal rd, label` | 2footnote:unaligned_branch[A jump or branch to a 32-bit instruction which is not 32-bit-aligned requires one additional cycle, because two naturally aligned bus cycles are required to fetch the target instruction.]|
| `jalr rd, rs1, imm` | 2footnote:unaligned_branch[] |
| `beq rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bne rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `blt rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bge rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bltu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bgeu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
3+| Load and Store
| `lw rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[If an instruction uses load data (from stage 3) in stage 2, a 1-cycle bubble is inserted after the load. Load-data to store-data dependency does not experience this, because the store data is used in stage 3. However, load-data to store-address (or e.g. load-to-add) does qualify.]
| `lh rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lhu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lb rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lbu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `sw rs2, imm(rs1)` | 1 |
| `sh rs2, imm(rs1)` | 1 |
| `sb rs2, imm(rs1)` | 1 |
|===
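
As a rough illustration of how the load-use rule and the taken-branch cost in the table above add up over a short sequence, the following Python sketch counts cycles for a dynamic instruction trace. The `Insn` record and `count_cycles` function are hypothetical names invented for this sketch and are not part of Hazard3. It models only the RV32I instructions listed above, assumes aligned branch targets and no bus stalls, and expects the caller to list only the registers an instruction reads in stage 2 (so a store lists its address register, not its data register).

```python
from typing import NamedTuple, Optional, Sequence, Tuple

LOADS = {"lw", "lh", "lhu", "lb", "lbu"}
BRANCHES = {"beq", "bne", "blt", "bge", "bltu", "bgeu"}
JUMPS = {"jal", "jalr"}


class Insn(NamedTuple):
    """One executed instruction (hypothetical record type for this sketch)."""
    op: str
    rd: Optional[str] = None            # destination register, if any
    stage2_srcs: Tuple[str, ...] = ()   # registers read in stage 2
    taken: bool = False                 # only meaningful for branches


def count_cycles(trace: Sequence[Insn]) -> int:
    """Sum cycles over a dynamic trace, assuming no bus stalls."""
    cycles = 0
    for i, insn in enumerate(trace):
        if insn.op in JUMPS or (insn.op in BRANCHES and insn.taken):
            cycles += 2   # jump or taken branch, aligned target assumed
        else:
            cycles += 1   # everything else in the RV32I table
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        if insn.op in LOADS and nxt is not None and insn.rd in nxt.stage2_srcs:
            cycles += 1   # load-use bubble after the load
    return cycles


# lw a0, 0(sp) ; add a1, a0, a2   -> load-to-add, bubble inserted
load_use = [Insn("lw", "a0", ("sp",)), Insn("add", "a1", ("a0", "a2"))]
# lw a0, 0(sp) ; sw a0, 4(sp)     -> a0 is store data (stage 3), no bubble
load_store_data = [Insn("lw", "a0", ("sp",)), Insn("sw", None, ("sp",))]

print(count_cycles(load_use))         # 3
print(count_cycles(load_store_data))  # 2
```

The second trace shows the exception called out in the footnote: forwarding load data into store data costs nothing extra, whereas using it as a store address or ALU operand does.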

=== M Extension

Timings assume the core is configured with `MULDIV_UNROLL = 2` and `MUL_FAST = 1`: the sequential multiply/divide circuit processes two bits per cycle, and a separate dedicated multiplier is present for the `mul` instruction.


[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| 32 {times} 32 -> 32 Multiply
| `mul rd, rs1, rs2` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.
3+| 32 {times} 32 -> 64 Multiply, Upper Half
| `mulh rd, rs1, rs2` | 18 to 20 | Depending on sign correction
| `mulhsu rd, rs1, rs2` | 18 to 20 | Depending on sign correction
| `mulhu rd, rs1, rs2` | 18 |
3+| Divide and Remainder
| `div rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `divu rd, rs1, rs2` | 18 |
| `rem rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `remu rd, rs1, rs2` | 18 |
|===
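
The "two bits per cycle" figure can be related to the table above by simple arithmetic. The sketch below is illustrative only: `sequential_iterations` is a hypothetical helper, and it assumes `MULDIV_UNROLL` divides the 32-bit operand width evenly. It does not model sign correction or any fixed start-up or write-back cost.

```python
XLEN = 32  # operand width in bits on RV32


def sequential_iterations(muldiv_unroll: int) -> int:
    """Cycles needed to process all 32 bits at `muldiv_unroll` bits per cycle."""
    if muldiv_unroll <= 0 or XLEN % muldiv_unroll != 0:
        raise ValueError("this sketch assumes MULDIV_UNROLL divides 32 evenly")
    return XLEN // muldiv_unroll


iterations = sequential_iterations(2)
print(iterations)        # 16
print(18 - iterations)   # 2
```

With `MULDIV_UNROLL = 2`, the bit-serial loop accounts for 16 of the 18 cycles listed for `mulhu`, `divu` and `remu`. The remaining 2 cycles are just the difference from the table; this sketch makes no claim about how they are spent or how the total scales with other settings.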

=== A Extension

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Load-Reserved/Store-Conditional
| `lr.w rd, (rs1)` | 1 or 2 | 2 if next instruction is dependentfootnote:data_dependency[], or an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[A pipeline bubble is inserted between `lr.w`/`sc.w` and an immediately-following `lr.w`/`sc.w`/`amo*`, because the AHB5 bus standard does not permit pipelined exclusive accesses. A stall would be inserted between `lr.w` and `sc.w` anyhow, so the local monitor can be updated based on the `lr.w` data phase in time to suppress the `sc.w` address phase.]
| `sc.w rd, rs2, (rs1)` | 1 or 2 | 2 if next instruction is an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[]
3+| Atomic Memory Operations
|`amoswap.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[AMOs are issued as a paired exclusive read and exclusive write on the bus, at the maximum speed of 2 cycles per access, since the bus does not permit pipelining of exclusive reads/writes. If the write phase fails due to the global monitor reporting a lost reservation, the instruction loops at a rate of 4 cycles per loop, until success. If the read reservation is refused by the global monitor, the instruction generates a Store/AMO Fault exception, to avoid an infinite loop.]
|`amoadd.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoxor.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoand.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoor.w rd, rs2, (rs1)`   | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomin.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomax.w rd, rs2, (rs1)`  | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amominu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomaxu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|===
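
The AMO footnote's arithmetic can be written out as a small Python sketch. The function name `amo_cycles` and its `failed_writes` parameter are hypothetical, chosen for illustration. The sketch assumes no bus stalls and does not model the Store/AMO Fault case where the read reservation is refused.

```python
CYCLES_PER_EXCLUSIVE_ACCESS = 2  # exclusive accesses cannot be pipelined
ACCESSES_PER_ATTEMPT = 2         # one exclusive read plus one exclusive write


def amo_cycles(failed_writes: int = 0) -> int:
    """Cycles for one AMO whose write phase fails `failed_writes` times first."""
    if failed_writes < 0:
        raise ValueError("failed_writes must be non-negative")
    attempts = failed_writes + 1  # the final attempt is the one that succeeds
    return attempts * ACCESSES_PER_ATTEMPT * CYCLES_PER_EXCLUSIVE_ACCESS


print(amo_cycles())    # 4
print(amo_cycles(2))   # 12
```

An uncontended AMO therefore costs 4 cycles, and each lost reservation adds another 4.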

=== C Extension

All C extension 16-bit instructions are aliases of base RV32I instructions. On Hazard3, they perform identically to their 32-bit counterparts.

A consequence of the C extension is that 32-bit instructions need not be naturally aligned. This has no penalty during sequential execution, but branching to a 32-bit instruction that is not 32-bit-aligned carries a 1-cycle penalty, because the instruction fetch is split into two naturally aligned bus accesses.
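
The alignment rule above can be sketched in Python as follows. The helper names `is_32bit_encoding` and `taken_branch_cycles` are hypothetical. The base cost of 2 cycles is the figure for `jal`, `jalr` and taken branches from the RV32I table, and no bus stalls are assumed.

```python
def is_32bit_encoding(first_halfword: int) -> bool:
    """RISC-V encodings whose two low bits are 0b11 are not 16-bit compressed."""
    return (first_halfword & 0b11) == 0b11


def taken_branch_cycles(target_addr: int, target_is_32bit: bool) -> int:
    """Cycles for a jump or taken branch, including the unaligned-target penalty."""
    if target_addr % 2 != 0:
        raise ValueError("instruction addresses must be 2-byte aligned")
    base = 2  # jal, jalr, or a taken branch to an aligned target
    # A 32-bit target straddling a word boundary needs two aligned fetches.
    penalty = 1 if target_is_32bit and target_addr % 4 != 0 else 0
    return base + penalty


print(is_32bit_encoding(0x0013))           # True  (low half of `addi x0, x0, 0`)
print(is_32bit_encoding(0x0001))           # False (`c.nop`)
print(taken_branch_cycles(0x1000, True))   # 2
print(taken_branch_cycles(0x1002, True))   # 3
print(taken_branch_cycles(0x1002, False))  # 2
```

Only the combination of a 32-bit target and a halfword-aligned address pays the extra cycle; a 16-bit target at the same address fits in a single aligned fetch.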

=== Privileged Instructions (including Zicsr)

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| CSR Access
| `csrrw rd, csr, rs1` | 1 |
| `csrrc rd, csr, rs1` | 1 |
| `csrrs rd, csr, rs1` | 1 |
| `csrrwi rd, csr, imm` | 1 |
| `csrrci rd, csr, imm` | 1 |
| `csrrsi rd, csr, imm` | 1 |
3+| Trap Request
| `ecall` | 3 | Time given is for jumping to `mtvec`
| `ebreak` | 3 | Time given is for jumping to `mtvec`
|===

=== Bit Manipulation

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Zba (address generation)
|`sh1add rd, rs1, rs2` | 1 |
|`sh2add rd, rs1, rs2` | 1 |
|`sh3add rd, rs1, rs2` | 1 |
3+| Zbb (basic bit manipulation)
|`andn rd, rs1, rs2`   | 1 |
|`clz rd, rs1`         | 1 |
|`cpop rd, rs1`        | 1 |
|`ctz rd, rs1`         | 1 |
|`max rd, rs1, rs2`    | 1 |
|`maxu rd, rs1, rs2`   | 1 |
|`min rd, rs1, rs2`    | 1 |
|`minu rd, rs1, rs2`   | 1 |
|`orc.b rd, rs1`       | 1 |
|`orn rd, rs1, rs2`    | 1 |
|`rev8 rd, rs1`        | 1 |
|`rol rd, rs1, rs2`    | 1 |
|`ror rd, rs1, rs2`    | 1 |
|`rori rd, rs1, imm`   | 1 |
|`sext.b rd, rs1`      | 1 |
|`sext.h rd, rs1`      | 1 |
|`xnor rd, rs1, rs2`   | 1 |
|`zext.h rd, rs1`      | 1 |
|`zext.b rd, rs1`      | 1 | `zext.b` is a pseudo-op for `andi rd, rs1, 0xff`
3+| Zbc (carry-less multiply)
|`clmul rd, rs1, rs2`  | 1 |
|`clmulh rd, rs1, rs2` | 1 |
|`clmulr rd, rs1, rs2` | 1 |
3+| Zbs (single-bit manipulation)
|`bclr rd, rs1, rs2`   | 1 |
|`bclri rd, rs1, imm`  | 1 |
|`bext rd, rs1, rs2`   | 1 |
|`bexti rd, rs1, imm`  | 1 |
|`binv rd, rs1, rs2`   | 1 |
|`binvi rd, rs1, imm`  | 1 |
|`bset rd, rs1, rs2`   | 1 |
|`bseti rd, rs1, imm`  | 1 |
|===