Hazard3/doc/sections/instruction_timings.adoc

== Instruction Cycle Counts

All timings are given assuming perfect bus behaviour (no stalls). Stalling of the `I` bus can delay execution indefinitely, as can stalling of the `D` bus during a load or store.

=== RV32I

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Integer Register-register
| `add rd, rs1, rs2` | 1 |
| `sub rd, rs1, rs2` | 1 |
| `slt rd, rs1, rs2` | 1 |
| `sltu rd, rs1, rs2` | 1 |
| `and rd, rs1, rs2` | 1 |
| `or rd, rs1, rs2` | 1 |
| `xor rd, rs1, rs2` | 1 |
| `sll rd, rs1, rs2` | 1 |
| `srl rd, rs1, rs2` | 1 |
| `sra rd, rs1, rs2` | 1 |
3+| Integer Register-immediate
| `addi rd, rs1, imm` | 1 | `nop` is a pseudo-op for `addi x0, x0, 0`
| `slti rd, rs1, imm` | 1 |
| `sltiu rd, rs1, imm` | 1 |
| `andi rd, rs1, imm` | 1 |
| `ori rd, rs1, imm` | 1 |
| `xori rd, rs1, imm` | 1 |
| `slli rd, rs1, imm` | 1 |
| `srli rd, rs1, imm` | 1 |
| `srai rd, rs1, imm` | 1 |
3+| Large Immediate
| `lui rd, imm` | 1 |
| `auipc rd, imm` | 1 |
3+| Control Transfer
| `jal rd, label` | 2footnote:unaligned_branch[A branch to a 32-bit instruction which is not 32-bit-aligned requires one additional cycle, because two naturally-aligned bus cycles are required to fetch the target instruction.]|
| `jalr rd, rs1, imm` | 2footnote:unaligned_branch[] |
| `beq rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bne rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `blt rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bge rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bltu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
| `bgeu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if nontaken, 2 if taken.
3+| Load and Store
| `lw rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[If an instruction uses load data (from stage 3) in stage 2, a 1-cycle bubble is inserted after the load. Load-data to store-data dependency does not experience this, because the store data is used in stage 3. However, load-data to store-address (or e.g. load-to-add) does qualify.]
| `lh rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lhu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lb rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lbu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `sw rs2, imm(rs1)` | 1 |
| `sh rs2, imm(rs1)` | 1 |
| `sb rs2, imm(rs1)` | 1 |
|===

=== M Extension

Timings assume the core is configured with `MULDIV_UNROLL = 2` and `MUL_FAST = 1`. I.e. the sequential multiply/divide circuit processes two bits per cycle, and a separate dedicated multiplier is present for the `mul` instruction.


[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| 32 {times} 32 -> 32 Multiply
| `mul rd, rs1, rs2` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.
3+| 32 {times} 32 -> 64 Multiply, Upper Half
| `mulh rd, rs1, rs2` | 18 to 20 | Depending on sign correction
| `mulhsu rd, rs1, rs2` | 18 to 20 | Depending on sign correction
| `mulhu rd, rs1, rs2` | 18 |
3+| Divide and Remainder
| `div rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `divu rd, rs1, rs2` | 18 |
| `rem rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `remu rd, rs1, rs2` | 18 |
|===

=== C Extension

All C extension 16-bit instructions on Hazard3 are aliases of base RV32I instructions. They perform identically to their 32-bit counterparts.

A consequence of the C extension is that 32-bit instructions can be non-naturally-aligned. This has no penalty during sequential execution, but branching to a 32-bit instruction that is not 32-bit-aligned carries a 1 cycle penalty, because the instruction fetch is cracked into two naturally-aligned bus accesses.

=== Privileged Instructions (including Zicsr)

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| CSR Access
| `csrrw rd, csr, rs1` | 1 |
| `csrrc rd, csr, rs1` | 1 |
| `csrrs rd, csr, rs1` | 1 |
| `csrrwi rd, csr, imm` | 1 |
| `csrrci rd, csr, imm` | 1 |
| `csrrsi rd, csr, imm` | 1 |
3+| Trap Request
| `ecall` | 3 | Time given is for jumping to `mtvec`
| `ebreak` | 3 | Time given is for jumping to `mtvec`
|===

=== Bit Manipulation

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Zba (address generation)
|`sh1add rd, rs1, rs2` | 1 |
|`sh2add rd, rs1, rs2` | 1 |
|`sh3add rd, rs1, rs2` | 1 |
3+| Zbb (basic bit manipulation)
|`andn rd, rs1, rs2`   | 1 |
|`clz rd, rs1`         | 1 |
|`cpop rd, rs1`        | 1 |
|`ctz rd, rs1`         | 1 |
|`max rd, rs1, rs2`    | 1 |
|`maxu rd, rs1, rs2`   | 1 |
|`min rd, rs1, rs2`    | 1 |
|`minu rd, rs1, rs2`   | 1 |
|`orc.b rd, rs1`       | 1 |
|`orn rd, rs1, rs2`    | 1 |
|`rev8 rd, rs1`        | 1 |
|`rol rd, rs1, rs2`    | 1 |
|`ror rd, rs1, rs2`    | 1 |
|`rori rd, rs1, imm`   | 1 |
|`sext.b rd, rs1`      | 1 |
|`sext.h rd, rs1`      | 1 |
|`xnor rd, rs1, rs2`   | 1 |
|`zext.h rd, rs1`      | 1 |
|`zext.b rd, rs1`      | 1 | `zext.b` is a pseudo-op for `andi rd, rs1, 0xff`
3+| Zbc (carry-less multiply)
|`clmul rd, rs1, rs2`  | 1 |
|`clmulh rd, rs1, rs2` | 1 |
|`clmulr rd, rs1, rs2` | 1 |
3+| Zbs (single-bit manipulation)
|`bclr rd, rs1, rs2`   | 1 |
|`bclri rd, rs1, imm`  | 1 |
|`bext rd, rs1, rs2`   | 1 |
|`bexti rd, rs1, imm`  | 1 |
|`binv rd, rs1, rs2`   | 1 |
|`binvi rd, rs1, imm`  | 1 |
|`bset rd, rs1, rs2`   | 1 |
|`bseti rd, rs1, imm`  | 1 |
|===
Document some IRQ CSRs, and instruction timings 2021-05-31 22:57:05 +08:00			`== Instruction Cycle Counts`

			All timings are given assuming perfect bus behaviour (no stalls). Stalling of the `I` bus can delay execution indefinitely, as can stalling of the `D` bus during a load or store.

			`=== RV32I`

			`[%autowidth.stretch, options="header"]`
			`\|===`
			`\| Instruction \| Cycles \| Note`
			`3+\| Integer Register-register`
			\| `add rd, rs1, rs2` \| 1 \|
			\| `sub rd, rs1, rs2` \| 1 \|
			\| `slt rd, rs1, rs2` \| 1 \|
			\| `sltu rd, rs1, rs2` \| 1 \|
			\| `and rd, rs1, rs2` \| 1 \|
			\| `or rd, rs1, rs2` \| 1 \|
			\| `xor rd, rs1, rs2` \| 1 \|
			\| `sll rd, rs1, rs2` \| 1 \|
			\| `srl rd, rs1, rs2` \| 1 \|
			\| `sra rd, rs1, rs2` \| 1 \|
			`3+\| Integer Register-immediate`
			\| `addi rd, rs1, imm` \| 1 \| `nop` is a pseudo-op for `addi x0, x0, 0`
			\| `slti rd, rs1, imm` \| 1 \|
			\| `sltiu rd, rs1, imm` \| 1 \|
			\| `andi rd, rs1, imm` \| 1 \|
			\| `ori rd, rs1, imm` \| 1 \|
			\| `xori rd, rs1, imm` \| 1 \|
			\| `slli rd, rs1, imm` \| 1 \|
			\| `srli rd, rs1, imm` \| 1 \|
			\| `srai rd, rs1, imm` \| 1 \|
			`3+\| Large Immediate`
			\| `lui rd, imm` \| 1 \|
			\| `auipc rd, imm` \| 1 \|
			`3+\| Control Transfer`
			\| `jal rd, label` \| 2footnote:unaligned_branch[A branch to a 32-bit instruction which is not 32-bit-aligned requires one additional cycle, because two naturally-aligned bus cycles are required to fetch the target instruction.]\|
			\| `jalr rd, rs1, imm` \| 2footnote:unaligned_branch[] \|
			\| `beq rs1, rs2, label`\| 1 or 2footnote:unaligned_branch[] \| 1 if nontaken, 2 if taken.
			\| `bne rs1, rs2, label`\| 1 or 2footnote:unaligned_branch[] \| 1 if nontaken, 2 if taken.
			\| `blt rs1, rs2, label`\| 1 or 2footnote:unaligned_branch[] \| 1 if nontaken, 2 if taken.
			\| `bge rs1, rs2, label`\| 1 or 2footnote:unaligned_branch[] \| 1 if nontaken, 2 if taken.
			\| `bltu rs1, rs2, label`\| 1 or 2footnote:unaligned_branch[] \| 1 if nontaken, 2 if taken.
			\| `bgeu rs1, rs2, label`\| 1 or 2footnote:unaligned_branch[] \| 1 if nontaken, 2 if taken.
			`3+\| Load and Store`
			\| `lw rd, imm(rs1)` \| 1 or 2 \| 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[If an instruction uses load data (from stage 3) in stage 2, a 1-cycle bubble is inserted after the load. Load-data to store-data dependency does not experience this, because the store data is used in stage 3. However, load-data to store-address (or e.g. load-to-add) does qualify.]
			\| `lh rd, imm(rs1)` \| 1 or 2 \| 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
			\| `lhu rd, imm(rs1)` \| 1 or 2 \| 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
			\| `lb rd, imm(rs1)` \| 1 or 2 \| 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
			\| `lbu rd, imm(rs1)` \| 1 or 2 \| 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
			\| `sw rs2, imm(rs1)` \| 1 \|
			\| `sh rs2, imm(rs1)` \| 1 \|
			\| `sb rs2, imm(rs1)` \| 1 \|
			`\|===`

			`=== M Extension`

			Timings assume the core is configured with `MULDIV_UNROLL = 2` and `MUL_FAST = 1`. I.e. the sequential multiply/divide circuit processes two bits per cycle, and a separate dedicated multiplier is present for the `mul` instruction.


			`[%autowidth.stretch, options="header"]`
			`\|===`
			`\| Instruction \| Cycles \| Note`
			`3+\| 32 {times} 32 -> 32 Multiply`
			\| `mul rd, rs1, rs2` \| 1 or 2 \| 1 if next instruction is independent, 2 if dependent.
			`3+\| 32 {times} 32 -> 64 Multiply, Upper Half`
			\| `mulh rd, rs1, rs2` \| 18 to 20 \| Depending on sign correction
			\| `mulhsu rd, rs1, rs2` \| 18 to 20 \| Depending on sign correction
			\| `mulhu rd, rs1, rs2` \| 18 \|
			`3+\| Divide and Remainder`
Regenerate PDF 2021-11-29 00:27:54 +08:00			\| `div rd, rs1, rs2` \| 18 or 19 \| Depending on sign correction
			\| `divu rd, rs1, rs2` \| 18 \|
			\| `rem rd, rs1, rs2` \| 18 or 19 \| Depending on sign correction
			\| `remu rd, rs1, rs2` \| 18 \|
Document some IRQ CSRs, and instruction timings 2021-05-31 22:57:05 +08:00			`\|===`

			`=== C Extension`

			`All C extension 16-bit instructions on Hazard3 are aliases of base RV32I instructions. They perform identically to their 32-bit counterparts.`

			`A consequence of the C extension is that 32-bit instructions can be non-naturally-aligned. This has no penalty during sequential execution, but branching to a 32-bit instruction that is not 32-bit-aligned carries a 1 cycle penalty, because the instruction fetch is cracked into two naturally-aligned bus accesses.`

			`=== Privileged Instructions (including Zicsr)`

			`[%autowidth.stretch, options="header"]`
			`\|===`
			`\| Instruction \| Cycles \| Note`
			`3+\| CSR Access`
			\| `csrrw rd, csr, rs1` \| 1 \|
			\| `csrrc rd, csr, rs1` \| 1 \|
			\| `csrrs rd, csr, rs1` \| 1 \|
			\| `csrrwi rd, csr, imm` \| 1 \|
			\| `csrrci rd, csr, imm` \| 1 \|
			\| `csrrsi rd, csr, imm` \| 1 \|
			`3+\| Trap Request`
			\| `ecall` \| 3 \| Time given is for jumping to `mtvec`
			\| `ebreak` \| 3 \| Time given is for jumping to `mtvec`
			`\|===`
Update docs with bitmanip instructions 2021-11-28 11:16:45 +08:00
			`=== Bit Manipulation`

			`[%autowidth.stretch, options="header"]`
			`\|===`
			`\| Instruction \| Cycles \| Note`
			`3+\| Zba (address generation)`
Regenerate PDF 2021-11-29 00:27:54 +08:00			\|`sh1add rd, rs1, rs2` \| 1 \|
			\|`sh2add rd, rs1, rs2` \| 1 \|
			\|`sh3add rd, rs1, rs2` \| 1 \|
Update docs with bitmanip instructions 2021-11-28 11:16:45 +08:00			`3+\| Zbb (basic bit manipulation)`
Regenerate PDF 2021-11-29 00:27:54 +08:00			\|`andn rd, rs1, rs2` \| 1 \|
			\|`clz rd, rs1` \| 1 \|
			\|`cpop rd, rs1` \| 1 \|
			\|`ctz rd, rs1` \| 1 \|
			\|`max rd, rs1, rs2` \| 1 \|
			\|`maxu rd, rs1, rs2` \| 1 \|
			\|`min rd, rs1, rs2` \| 1 \|
			\|`minu rd, rs1, rs2` \| 1 \|
			\|`orc.b rd, rs1` \| 1 \|
			\|`orn rd, rs1, rs2` \| 1 \|
			\|`rev8 rd, rs1` \| 1 \|
			\|`rol rd, rs1, rs2` \| 1 \|
			\|`ror rd, rs1, rs2` \| 1 \|
			\|`rori rd, rs1, imm` \| 1 \|
			\|`sext.b rd, rs1` \| 1 \|
			\|`sext.h rd, rs1` \| 1 \|
			\|`xnor rd, rs1, rs2` \| 1 \|
			\|`zext.h rd, rs1` \| 1 \|
			\|`zext.b rd, rs1` \| 1 \| `zext.b` is a pseudo-op for `andi rd, rs1, 0xff`
Update docs with bitmanip instructions 2021-11-28 11:16:45 +08:00			`3+\| Zbc (carry-less multiply)`
Regenerate PDF 2021-11-29 00:27:54 +08:00			\|`clmul rd, rs1, rs2` \| 1 \|
			\|`clmulh rd, rs1, rs2` \| 1 \|
			\|`clmulr rd, rs1, rs2` \| 1 \|
Update docs with bitmanip instructions 2021-11-28 11:16:45 +08:00			`3+\| Zbs (single-bit manipulation)`
Regenerate PDF 2021-11-29 00:27:54 +08:00			\|`bclr rd, rs1, rs2` \| 1 \|
			\|`bclri rd, rs1, imm` \| 1 \|
			\|`bext rd, rs1, rs2` \| 1 \|
			\|`bexti rd, rs1, imm` \| 1 \|
			\|`binv rd, rs1, rs2` \| 1 \|
			\|`binvi rd, rs1, imm` \| 1 \|
			\|`bset rd, rs1, rs2` \| 1 \|
			\|`bseti rd, rs1, imm` \| 1 \|
Update docs with bitmanip instructions 2021-11-28 11:16:45 +08:00			`\|===`