Design of a Snoop Filter for Snoop Based Cache Coherency Protocols

Multi core architectures has become common in mobile SoCs; not only for CPUs, but also for mobile GPUs. With the introduction of OpenCl for mobile GPU architecture, the SoCs are able to become more powerful than before. Because programs that were executed on the CPU before, can now be executed faster on the GPU. Along with this the need for cache coherence protocols has also been introduced. Snoop based cache coherence protocols inherently leads to extensive coherence traffic on the bus in a multi core system. All this traffic leads to tag lookups in remote data caches. However, recent research shows that these lookups and coherency traffic, are by a large extent unnecessary. In other words a lot of power is wasted by transmitting unnecessary snoop requests over a interconnect. This project has explored one possible solution to reducing these requests: Snoop Filters.Previous research has been done for CPUs with SPLASH and other benchmark suits. This thesis however, will look at coherence transactions and lookups from a GPU perspective. To be able to thoroughly analyze coherence transactions from OpenCl benchmarks, a parameterizable multi core model has been constructed. The model is capable of replaying OpenCl benchmarks after executing on a ARM-MALI T6xx GPU. The results show that similarly to CPU benchmarks, the coherency traffic induced by OpenCl benchmark also end up in cache misses. Recent research also shows that CPU coherency protocols using the MESI states, instead of just MSI, reduces the unwanted coherency traffic. The reduction is so big that much that snoop filters and other coherence limiting approaches were unnecessary. The research done for this thesis has shown that this is not the case for GPUs, as the MESI protocol does not reduce power consumption in a multi core GPU. Because of this, snoop filters based on the CSR(Ranganathan 2012) filter were explored. The analysis in this thesis of the original destination based CSR filter, showed that the filter reduced the unnecessary tag lookups to around 53% for the OpenCl benchmarks. This means a great underlying potential in how the resources are selected according to the address stream. The analysis also showed that a fair deal of the snoop induced transactions also were unnecessary.Based on the filter analysis two new filters were designed: Source-CSRHashed-index CSRAlthough source CSR represents more hardware overhead compared to the destination filter, it is capable of reducing 30% of the snoop transactions. The source filter is also capable of a 53% reduction of the tag lookups. The hashed index filter was inspired by the potential in reducing the tag lookups further than 53\%. The filter was capable of a 56% reduction. Although this is only 3% improvement over the normal filter, the filter performed remarkably for a number of benchmarks. Unfortunately this was not the case for all benchmarks. It shows that dynamic allocation of filter resources is capable of further reduction. The best case scenario would have been to use the original resource selection on some of the benchmarks, and the hashed index system on the others.The source filter was also implemented in Verilog HDL, and formally verified in JasperGold using SystemVerilog. The filter was supposed to be power simulated, but some unknown error in the switching activity conversion halted any further power estimation. A proper conclusion about the power saving potential for the source filter can therefore not be made. This thesis does however include a power estimation methodology in order for the power estimation to be completed in the future.

Utgiver

Institutt for elektronikk og telekommunikasjon