== Instruction Cycle Counts

All timings assume perfect bus behaviour (no downstream bus stalls) and a core configured with `MULDIV_UNROLL = 2`, with all other configuration options set for maximum performance.

=== RV32I

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Integer Register-register
| `add rd, rs1, rs2` | 1 |
| `sub rd, rs1, rs2` | 1 |
| `slt rd, rs1, rs2` | 1 |
| `sltu rd, rs1, rs2` | 1 |
| `and rd, rs1, rs2` | 1 |
| `or rd, rs1, rs2` | 1 |
| `xor rd, rs1, rs2` | 1 |
| `sll rd, rs1, rs2` | 1 |
| `srl rd, rs1, rs2` | 1 |
| `sra rd, rs1, rs2` | 1 |
3+| Integer Register-immediate
| `addi rd, rs1, imm` | 1 | `nop` is a pseudo-op for `addi x0, x0, 0`
| `slti rd, rs1, imm` | 1 |
| `sltiu rd, rs1, imm` | 1 |
| `andi rd, rs1, imm` | 1 |
| `ori rd, rs1, imm` | 1 |
| `xori rd, rs1, imm` | 1 |
| `slli rd, rs1, imm` | 1 |
| `srli rd, rs1, imm` | 1 |
| `srai rd, rs1, imm` | 1 |
3+| Large Immediate
| `lui rd, imm` | 1 |
| `auipc rd, imm` | 1 |
3+| Control Transfer
| `jal rd, label` | 2footnote:unaligned_branch[A jump or branch to a 32-bit instruction which is not 32-bit-aligned requires one additional cycle, because two naturally aligned bus cycles are required to fetch the target instruction.]|
| `jalr rd, rs1, imm` | 2footnote:unaligned_branch[] |
| `beq rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bne rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `blt rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bge rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bltu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
| `bgeu rs1, rs2, label`| 1 or 2footnote:unaligned_branch[] | 1 if correctly predicted, 2 if mispredicted.
3+| Load and Store
| `lw rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[If an instruction in stage 2 (e.g. an `add`) uses data from stage 3 (e.g. a `lw` result), a 1-cycle bubble is inserted between the pair. A load data -> store data dependency is _not_ an example of this, because data is produced and consumed in stage 3. However, load data -> load address _would_ qualify, as would e.g. `sc.w` -> `beqz`.]
| `lh rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lhu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lb rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `lbu rd, imm(rs1)` | 1 or 2 | 1 if next instruction is independent, 2 if dependent.footnote:data_dependency[]
| `sw rs2, imm(rs1)` | 1 |
| `sh rs2, imm(rs1)` | 1 |
| `sb rs2, imm(rs1)` | 1 |
|===
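The load-use rule in the notes above can be sketched as a small cycle estimator. This is a simplified model and not a description of the Hazard3 implementation: `Instr` and `estimate_cycles` are hypothetical names introduced for illustration, the model only applies to straight-line sequences of single-cycle ALU, load and store instructions, and it only looks at the instruction immediately following each load.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

LOADS = {"lw", "lh", "lhu", "lb", "lbu"}


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record, used only for this sketch."""
    op: str
    rd: Optional[str] = None          # destination register, if any
    early_srcs: Tuple[str, ...] = ()  # registers read in stage 2 (ALU operands, addresses)
    store_data: Optional[str] = None  # store data register, consumed in stage 3


def estimate_cycles(program: Sequence[Instr]) -> int:
    """Count one cycle per instruction, plus one bubble per load-use pair."""
    total = 0
    for i, ins in enumerate(program):
        total += 1
        if ins.op in LOADS and i + 1 < len(program):
            # A bubble is needed only if the next instruction reads the load
            # result in stage 2. Store data is consumed in stage 3, so
            # store_data is deliberately not checked here.
            if ins.rd in program[i + 1].early_srcs:
                total += 1
    return total


load_then_add = [
    Instr("lw", rd="a0", early_srcs=("a1",)),
    Instr("add", rd="a2", early_srcs=("a0", "a3")),
]
load_then_store = [
    Instr("lw", rd="a0", early_srcs=("a1",)),
    Instr("sw", early_srcs=("a2",), store_data="a0"),
]
print(estimate_cycles(load_then_add))    # 3: load, bubble, add
print(estimate_cycles(load_then_store))  # 2: load data -> store data needs no bubble
```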

=== M Extension

Timings assume the core is configured with `MULDIV_UNROLL = 2` and `MUL_FAST = 1`. That is, the sequential multiply/divide circuit processes two bits per cycle, and a separate dedicated multiplier is present for the `mul` instruction.

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| 32 {times} 32 -> 32 Multiply
| `mul rd, rs1, rs2` | 1 |
3+| 32 {times} 32 -> 64 Multiply, Upper Half
| `mulh rd, rs1, rs2` | 1 |
| `mulhsu rd, rs1, rs2` | 1 |
| `mulhu rd, rs1, rs2` | 1 |
3+| Divide and Remainder
| `div rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `divu rd, rs1, rs2` | 18 |
| `rem rd, rs1, rs2` | 18 or 19 | Depending on sign correction
| `remu rd, rs1, rs2` | 18 |
|===
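As a rough cross-check of the divide timings, the arithmetic below relates the 18-cycle figure to the two-bits-per-cycle circuit. It is a sketch for the documented configuration only. Two things are assumptions and not statements from the table: the two-cycle fixed overhead is simply the difference between the tabulated 18 cycles and 32 / 2 = 16 iterations, and the 19-cycle case is taken to be the one in which sign correction is needed. The names are hypothetical.

```python
XLEN = 32          # operand width in bits
MULDIV_UNROLL = 2  # bits processed per cycle in the documented configuration

ITERATION_CYCLES = XLEN // MULDIV_UNROLL          # 16
DIVU_CYCLES = 18                                  # taken from the table above
OVERHEAD_CYCLES = DIVU_CYCLES - ITERATION_CYCLES  # 2, inferred rather than documented


def div_cycles(signed: bool, sign_correction: bool = False) -> int:
    """Cycle count for div/divu/rem/remu with MULDIV_UNROLL = 2.

    Assumption: signed operations take one extra cycle exactly when sign
    correction is needed. The table only says the count depends on it.
    """
    extra = 1 if (signed and sign_correction) else 0
    return ITERATION_CYCLES + OVERHEAD_CYCLES + extra


print(div_cycles(signed=False))                       # 18 (divu, remu)
print(div_cycles(signed=True, sign_correction=True))  # 19 (div, rem, worst case)
```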

=== A Extension

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Load-Reserved/Store-Conditional
| `lr.w rd, (rs1)` | 1 or 2 | 2 if the next instruction is dependentfootnote:data_dependency[] or is an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[A pipeline bubble is inserted between `lr.w`/`sc.w` and an immediately-following `lr.w`/`sc.w`/`amo*`, because the AHB5 bus standard does not permit pipelined exclusive accesses. A stall would be inserted between `lr.w` and `sc.w` anyhow, so the local monitor can be updated based on the `lr.w` data phase in time to suppress the `sc.w` address phase.]
| `sc.w rd, rs2, (rs1)` | 1 or 2 | 2 if the next instruction is dependentfootnote:data_dependency[] or is an `lr.w`, `sc.w` or `amo*.w`.footnote:exclusive_pipelining[]
3+| Atomic Memory Operations
|`amoswap.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[AMOs are issued as a paired exclusive read and exclusive write on the bus, at the maximum speed of 2 cycles per access, since the bus does not permit pipelining of exclusive reads/writes. If the write phase fails due to the global monitor reporting a lost reservation, the instruction loops at a rate of 4 cycles per loop, until success. If the read reservation is refused by the global monitor, the instruction generates a Store/AMO Fault exception, to avoid an infinite loop.]
|`amoadd.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoxor.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoand.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amoor.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomin.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomax.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amominu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|`amomaxu.w rd, rs2, (rs1)` | 4+ | 4 per attempt. Multiple attempts if reservation is lost.footnote:amo_timing[]
|===
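The "4+" entries can be read as a simple function of how many times the reservation is lost before the write succeeds. The sketch below illustrates that reading only. The function and constant names are hypothetical, and the case where the read reservation is refused (which raises a Store/AMO Fault) is not modelled.

```python
EXCLUSIVE_ACCESS_CYCLES = 2  # exclusive accesses cannot be pipelined on the bus
AMO_CYCLES_PER_ATTEMPT = 2 * EXCLUSIVE_ACCESS_CYCLES  # exclusive read + exclusive write


def amo_cycles(lost_reservations: int = 0) -> int:
    """Cycles for one amo*.w that loses its reservation a given number of times."""
    if lost_reservations < 0:
        raise ValueError("lost_reservations must be non-negative")
    attempts = lost_reservations + 1  # the final attempt is the one that succeeds
    return AMO_CYCLES_PER_ATTEMPT * attempts


print(amo_cycles())   # 4: succeeds on the first attempt
print(amo_cycles(2))  # 12: two failed attempts, then success
```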

=== C Extension

All C extension 16-bit instructions are aliases of base RV32I instructions. On Hazard3, they perform identically to their 32-bit counterparts.

A consequence of the C extension is that 32-bit instructions can be non-naturally-aligned. This has no penalty during sequential execution, but branching to a 32-bit instruction that is not 32-bit-aligned carries a 1-cycle penalty, because the instruction fetch is cracked into two naturally-aligned bus accesses.
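The alignment penalty can be expressed as a check on the jump target. The helper names below are hypothetical and the sketch only restates the rule above, using the 2-cycle `jal` figure from the RV32I table as the base cost.

```python
JAL_BASE_CYCLES = 2  # from the RV32I table


def unaligned_target_penalty(target_addr: int, target_is_32bit: bool) -> int:
    """Return 1 if the jump target is a 32-bit instruction that is not 32-bit-aligned."""
    if target_addr % 2 != 0:
        raise ValueError("instruction addresses must be at least 16-bit-aligned")
    return 1 if (target_is_32bit and target_addr % 4 != 0) else 0


def jal_cycles(target_addr: int, target_is_32bit: bool) -> int:
    """Cycle count for a jal to the given target."""
    return JAL_BASE_CYCLES + unaligned_target_penalty(target_addr, target_is_32bit)


print(jal_cycles(0x1000, target_is_32bit=True))   # 2: aligned 32-bit target
print(jal_cycles(0x1002, target_is_32bit=True))   # 3: target needs two bus accesses
print(jal_cycles(0x1002, target_is_32bit=False))  # 2: 16-bit target fits in one access
```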

=== Privileged Instructions (including Zicsr)

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| CSR Access
| `csrrw rd, csr, rs1` | 1 |
| `csrrc rd, csr, rs1` | 1 |
| `csrrs rd, csr, rs1` | 1 |
| `csrrwi rd, csr, imm` | 1 |
| `csrrci rd, csr, imm` | 1 |
| `csrrsi rd, csr, imm` | 1 |
3+| Trap Request
| `ecall` | 3 | Time given is for jumping to `mtvec`
| `ebreak` | 3 | Time given is for jumping to `mtvec`
|===

=== Bit Manipulation

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
3+| Zba (address generation)
|`sh1add rd, rs1, rs2` | 1 |
|`sh2add rd, rs1, rs2` | 1 |
|`sh3add rd, rs1, rs2` | 1 |
3+| Zbb (basic bit manipulation)
|`andn rd, rs1, rs2` | 1 |
|`clz rd, rs1` | 1 |
|`cpop rd, rs1` | 1 |
|`ctz rd, rs1` | 1 |
|`max rd, rs1, rs2` | 1 |
|`maxu rd, rs1, rs2` | 1 |
|`min rd, rs1, rs2` | 1 |
|`minu rd, rs1, rs2` | 1 |
|`orc.b rd, rs1` | 1 |
|`orn rd, rs1, rs2` | 1 |
|`rev8 rd, rs1` | 1 |
|`rol rd, rs1, rs2` | 1 |
|`ror rd, rs1, rs2` | 1 |
|`rori rd, rs1, imm` | 1 |
|`sext.b rd, rs1` | 1 |
|`sext.h rd, rs1` | 1 |
|`xnor rd, rs1, rs2` | 1 |
|`zext.h rd, rs1` | 1 |
|`zext.b rd, rs1` | 1 | `zext.b` is a pseudo-op for `andi rd, rs1, 0xff`
3+| Zbc (carry-less multiply)
|`clmul rd, rs1, rs2` | 1 |
|`clmulh rd, rs1, rs2` | 1 |
|`clmulr rd, rs1, rs2` | 1 |
3+| Zbs (single-bit manipulation)
|`bclr rd, rs1, rs2` | 1 |
|`bclri rd, rs1, imm` | 1 |
|`bext rd, rs1, rs2` | 1 |
|`bexti rd, rs1, imm` | 1 |
|`binv rd, rs1, rs2` | 1 |
|`binvi rd, rs1, imm` | 1 |
|`bset rd, rs1, rs2` | 1 |
|`bseti rd, rs1, imm` | 1 |
3+| Zbkb (basic bit manipulation for cryptography)
|`pack rd, rs1, rs2` | 1 |
|`packh rd, rs1, rs2` | 1 |
|`brev8 rd, rs1` | 1 |
|`zip rd, rs1` | 1 |
|`unzip rd, rs1` | 1 |
|===

=== Zcb Extension

Like the C extension, this extension contains 16-bit variants of common 32-bit instructions:

* RV32I base ISA: `lbu`, `lh`, `lhu`, `sb`, `sh`, `zext.b` (alias of `andi`), `not` (alias of `xori`)
* Zbb extension: `sext.b`, `zext.h`, `sext.h`
* M extension: `mul`

They perform identically to their 32-bit counterparts.

=== Zcmp Extension

[%autowidth.stretch, options="header"]
|===
| Instruction | Cycles | Note
|`cm.push {rlist}, -imm` | 1 + _n_ | _n_ is number of registers in rlist
|`cm.pop {rlist}, imm` | 1 + _n_ | _n_ is number of registers in rlist
|`cm.popret {rlist}, imm` | 3 + _n_ footnote:unaligned_branch[] | _n_ is number of registers in rlist
|`cm.popretz {rlist}, imm` | 4 + _n_ footnote:unaligned_branch[] | _n_ is number of registers in rlist
|`cm.mva01s r1s', r2s'` | 2 |
|`cm.mvsa01 r1s', r2s'` | 2 |
|===
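The push/pop timings are a fixed cost plus one cycle per register in the list. The table can be restated as a lookup, as sketched below. `zcmp_cycles` and its parameters are hypothetical names, and the optional extra cycle for `cm.popret`/`cm.popretz` follows the unaligned-target footnote attached to those rows.

```python
# Fixed part of each cycle count, taken from the table above.
ZCMP_FIXED_CYCLES = {
    "cm.push": 1,
    "cm.pop": 1,
    "cm.popret": 3,
    "cm.popretz": 4,
}
RETURNING_OPS = ("cm.popret", "cm.popretz")


def zcmp_cycles(op: str, n_regs: int, unaligned_return_target: bool = False) -> int:
    """Cycle count for a Zcmp push/pop with n_regs registers in its rlist."""
    if op not in ZCMP_FIXED_CYCLES:
        raise ValueError(f"not a Zcmp push/pop instruction: {op}")
    if n_regs < 1:
        raise ValueError("rlist must contain at least one register")
    cycles = ZCMP_FIXED_CYCLES[op] + n_regs
    # Only the returning forms jump, so only they can pay the penalty for a
    # 32-bit target that is not 32-bit-aligned.
    if unaligned_return_target and op in RETURNING_OPS:
        cycles += 1
    return cycles


print(zcmp_cycles("cm.push", 3))    # 4
print(zcmp_cycles("cm.popret", 3))  # 6
```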

=== Branch Predictor

Hazard3 includes a minimal branch predictor, to accelerate tight loops:

* The instruction frontend remembers the last taken backward branch.
* If the same branch is seen again, it is predicted taken.
* All other branches are predicted nontaken.
* If a predicted-taken branch is not taken, the predictor state is cleared, and the branch is predicted nontaken on its next execution.

Correctly predicted branches execute in one cycle: the frontend stitches together the two nonsequential fetch paths so that they appear sequential. Mispredicted branches incur a one-cycle penalty, since a nonsequential fetch address must be issued when the branch is executed.
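The predictor rules above can be sketched as a small behavioural model that counts branch cycles. This is an illustration of the stated rules and not the hardware implementation: `BranchPredictorModel`, `remembered_pc` and `loop_branch_cycles` are hypothetical names, the addresses are arbitrary, and the unaligned-target penalty is not modelled.

```python
from typing import Optional


class BranchPredictorModel:
    """Behavioural sketch: remembers the address of the last taken backward branch."""

    def __init__(self) -> None:
        self.remembered_pc: Optional[int] = None

    def execute(self, pc: int, target: int, taken: bool) -> int:
        """Execute one conditional branch and return its cycle count (1 or 2)."""
        predicted_taken = self.remembered_pc == pc
        cycles = 1 if predicted_taken == taken else 2
        if taken and target < pc:
            # Taken backward branch: remember it for next time.
            self.remembered_pc = pc
        elif predicted_taken and not taken:
            # A predicted-taken branch fell through: clear the predictor state.
            self.remembered_pc = None
        return cycles


def loop_branch_cycles(iterations: int) -> int:
    """Total cycles spent on the closing branch of a loop run `iterations` times."""
    model = BranchPredictorModel()
    pc, target = 0x108, 0x100  # backward branch at the bottom of the loop
    return sum(
        model.execute(pc, target, taken=(i < iterations - 1))
        for i in range(iterations)
    )


# 10 iterations: first taken branch mispredicted (2), eight predicted (1 each),
# final fall-through mispredicted (2).
print(loop_branch_cycles(10))  # 12
```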