Delegated Replies: Alleviating Network Clogging in Heterogeneous Architectures
Peer reviewed, Journal article
Original version: IEEE Symposium on High-Performance Computer Architecture (HPCA). 2022, 1014-1028. 10.1109/HPCA53966.2022.00078
Heterogeneous architectures with latency-sensitive CPU cores and bandwidth-intensive accelerators are attractive because they deliver high performance at favorable cost. These architectures typically have significantly more compute cores than memory nodes. The many bandwidth-intensive accelerators hence overwhelm the few memory nodes, resulting in suboptimal accelerator performance (their bandwidth needs are not met) and poor CPU performance (memory node blocking creates high latencies). We call this phenomenon network clogging. Since network clogging is a widespread issue in heterogeneous architectures, we first investigate whether existing state-of-the-art approaches can address it. We find that the most effective prior approach, called Realistic Probing (RP), is suboptimal because it searches the local caches of other cores for missing data. We propose Delegated Replies, which lets memory nodes speculatively delegate the responsibility of replying to last-level cache hits to the private cache that last accessed the requested cache block, hence avoiding the search that fundamentally limits RP. Moreover, Delegated Replies uses the (typically) under-utilized request network for delegation; it is the reply network links of the memory nodes that commonly clog, because replies include complete cache blocks in addition to metadata. We evaluate Delegated Replies in the context of heterogeneous architectures with latency-sensitive CPU cores and bandwidth-intensive GPU cores and find that it improves GPU (CPU) performance by 14.2% (5.2%) and 25.7% (8.8%) on average compared to RP and our baseline, respectively.
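The delegation idea described in the abstract can be illustrated with a toy traffic model. This is only a sketch under stated assumptions, not the paper's implementation: the class `MemoryNode`, the helper `run`, the flit sizes, and the fallback behavior when the delegate no longer holds the block are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

# Illustrative sizes in flits (assumptions, not taken from the paper).
DATA_FLITS = 5   # a reply carrying a complete cache block plus metadata
CTRL_FLITS = 1   # a delegation message carrying metadata only


@dataclass
class MemoryNode:
    """Toy last-level cache slice that remembers each block's last accessor."""
    last_accessor: dict = field(default_factory=dict)  # block -> core id
    reply_flits: int = 0     # traffic on this node's reply-network link
    request_flits: int = 0   # delegation traffic sent on the request network

    def handle_hit(self, block, requester, private_caches, delegate=True):
        """Serve an LLC hit and return the id of the node that sent the data."""
        owner = self.last_accessor.get(block)
        self.last_accessor[block] = requester
        if delegate and owner is not None and owner != requester:
            # Speculatively hand the reply to the last accessor's private cache.
            self.request_flits += CTRL_FLITS
            if block in private_caches[owner]:
                private_caches[requester].add(block)
                return owner
            # Assumed fallback: the delegate evicted the block,
            # so the memory node sends the data itself.
        self.reply_flits += DATA_FLITS
        private_caches[requester].add(block)
        return "mem"


def run(delegate):
    """Three cores read the same block in turn; return (reply, request) flits."""
    caches = {0: set(), 1: set(), 2: set()}
    node = MemoryNode()
    for core in (0, 1, 2):
        node.handle_hit("A", core, caches, delegate=delegate)
    return node.reply_flits, node.request_flits
```

In this toy trace, `run(True)` returns `(5, 2)` and `run(False)` returns `(15, 0)`: with delegation, only the first access sends a full block over the memory node's reply link, and the two later accesses cost one small message each on the request network. The numbers show where the traffic moves in the model, not the performance gains reported above.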