Decode is now split into a block that depends only on the instruction
bits, and a block that gates critical decode signals based on fetch
faults, invalidity, etc.
Apply a similar transform to the gating of the uop counter update.
CXXRTL performance seems unchanged after removing the event loops, but
Verilator and live-scheduled simulators should improve.
Interrupting the PC-setting step of a cm.popret (only) can sample the
return target as the exception return PC, which causes the stack pointer
adjust to be skipped when returning from the IRQ. Fix this by making the
PC-setting step uninterruptible.

Note the PC-setting step is the instruction we execute first out of the
group of instructions specified in the Zc spec as being atomic with
respect to interrupts. This does not itself imply that the PC-setting
step is uninterruptible; it only requires that when the PC-setting step
retires, all following steps also retire. However, this is not
sufficient given the special-case logic that allows the jr ra PC-setting
step to execute before the final stack adjust as an optimisation.
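
For illustration only, the failure and the fix can be sketched as a toy
Python model. This is not the RTL: the uop names, the addresses, the
frame size and the simplified interrupt rule are all assumptions made up
for this sketch. It only shows why sampling the return target at the
jr ra step loses the stack adjust, and why deferring the IRQ past that
step avoids it.

```python
# Toy model only: uop names, addresses and the IRQ rule are assumptions.
# The jr ra step is hoisted ahead of the final stack adjust, as described above.
UOPS = ["lw ra", "jr ra", "addi sp"]

def popret_with_irq(irq_step, pc_step_interruptible):
    """Return (mepc, sp) seen by the handler for an IRQ arriving at uop irq_step."""
    popret_pc, ra = 0x100, 0x200   # hypothetical instruction and return addresses
    sp, frame = 0x1000, 16         # hypothetical stack pointer and frame size
    step = irq_step
    # The fix: an IRQ arriving at the PC-setting step is held off by one step.
    if UOPS[step] == "jr ra" and not pc_step_interruptible:
        step += 1
    if UOPS[step] == "lw ra":
        # No part of the atomic group has retired: cm.popret restarts later.
        return popret_pc, sp
    if UOPS[step] == "jr ra":
        # The bug: the return target becomes mepc and the stack adjust is lost.
        return ra, sp
    # The PC-setting step has retired, so the rest of the group retires too.
    return ra, sp + frame
```

With these assumed values, popret_with_irq(1, True) models the bug (the
handler returns to the caller with sp still at 0x1000), while
popret_with_irq(1, False) models the fix (sp has been adjusted to 0x1010
before the IRQ is taken).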