The performance advantage of out-of-order processors stems from their ability to extract more instruction-level parallelism (ILP) and memory-level parallelism (MLP) than in-order cores. This is largely the benefit of the dynamic out-of-order schedules they create. The downside of out-of-order scheduling is that it comes at high energy and die-area cost.
We evaluate three recently proposed FIFO-based scheduling techniques found in Load Slice Core (LSC), Delay and Bypass (DnB), and CASINO. They all promise a large part of the performance gain of out-of-order scheduling, but at a much lower cost.
DnB and LSC focus on extracting MLP by iteratively building load slices and giving them prece- dence in the execution order. The dependency analysis technique they employ is called Iterative Backward Dependency Analysis (IBDA). We evaluate implementability, performance, and area of the proposed IBDA as well as proposing three improved implementations of IBDA that require less area and power while providing essentially the same performance. DnB, LSC, and CASINO, the third technique, are all based around the idea of replacing the expensive, content addressable issue queue with cheaper FIFOs.
We implement all these techniques based on BOOM, an open-source, RTL implementation of an out-of-order RISC-V core. We synthesize our designs and evaluate them on a Xilinx ZC707 FPGA. By instantiating our cores as part of a system on chip, we are also able to boot Linux on them.
Our evaluation, using parts of the SPEC CPU2006 benchmark suite, confirms the claims that these techniques come close to the performance of a fully-fledged out-of-order core. LSC and CASINO do this while consuming noticeably fewer resources. DnB comes closest to the perfor- mance of out-of-order cores, but it fails to show area-benefits in our implementation. Additionally, we provide insights into the overheads that the BOOM core has over its smaller sibling, the in-order processor Rocket.
As this form of implementation is much closer to real silicon tapeouts than simulators, it forces us to consider and analyze implementation specifics that can be ignored in high-level simulation. This provides insights into the implementability of the different techniques.
Our work provides a big step towards providing accurate measurements, instead of estimates, of performance, power, and area usage for LSC, DnB, and CASINO.