Evaluating Shared Last Level Cache Partitioning Algorithms

Over the past few decades, the development of Dynamic Random-Access Memory (DRAM) has mainly focused on increasing capacity and lowering costs. However, microprocessor development has experienced enormous improvements in latency. This has led to an increasing memory latency-gap, that unaddressed can lead to significant underutilization of available microprocessor resources. To bridge this gap, memory hierarchies including several levels of cache memories have been introduced. Chip Multiprocessors (CMPs) or multi-core architectures commonly share the Last Level Cache (LLC). Sharing allows for destructive interference, as several cores can start to compete for cache space. With CMPs becoming commonplace and as their core count increases, scalable algorithms that partition the LLC among the cores of a CMP are becoming increasingly important.

This thesis describes the implementation of the zcache and the cache partitioning

algorithm Vantage in a simulation framework based on Sniper, a parallel multi-core

simulator. We utilize this simulation framework to establish the performance improvements of Vantage and the cache partitioning algorithms Thread-Aware Dynamic Insertion Policy (TADIP), Dynamic Re-Reference Interval Prediction (DRRIP), Promotion/Insertion Pseudo Partitioning (PIPP), Utility-Based Cache Partitioning (UCP) and Probabilistic Shared Cache Management (PriSM) over a range of architectural configurations, with the conventional Least Recently Used (LRU) algorithm as baseline. Moreover, we identify several root causes that lead to the observed performance differences. We find that Vantage, by using the highly associative zcache, attains the highest performance improvements and is the most scalable cache partitioning algorithm in our evaluation. The scalability and performance of cache partitioning algorithms utilizing the conventional set-associative cache are mainly limited by the restricted associativity that the set-associative cache provides. We further find that although individual improvements in the System Throughput (STP) can reach up to approximately 20% in our evaluation, the overall impact of cache partitioning is minor, improving the STP and the Harmonic Mean of Speedups (HMS) by a maximum of 3% with respect to LRU.

Utgiver

NTNU