Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services
Peer reviewed, Journal article
Accepted version

Date
2020
Original version
IEEE Symposium on High-Performance Computer Architecture (HPCA). 2020, 167-179. 10.1109/HPCA47549.2020.00023
Abstract
Many of the important services running in data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tail-latency targets, resulting in an increased total cost of ownership. This paper introduces Twig, a scalable quality-of-service (QoS) aware task manager for latency-critical services colocated on a server system. Twig leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres. We evaluate Twig on a typical data centre server managing four widely used latency-critical services. Our results show that Twig outperforms prior works in reducing energy usage by up to 38% while achieving up to 99% QoS guarantee for latency-critical services.