Happy to share a brief overview of our work on a learning-based platform for resource scheduling in datacenters, which just appeared at ACM CoNEXT 2018.

In the last five years or so, several exciting proposals from the networking research community have advocated scheduling-based approaches (e.g., Shortest Job First) for minimizing user response times in large-scale cloud services such as search (think of the time it takes to get your Google search results or to load your Facebook newsfeed).

A key challenge in minimizing response times, particularly at the tail (e.g., the 99th percentile of response times), is that no single non-learning scheduling policy can optimize performance across all workloads (i.e., distributions of job sizes). This can be a significant concern for datacenter operators, as workloads can change or vary significantly over time. As a step towards addressing this challenge, we have designed a learning-based scheduling framework that is robust to changes in workload. Our hope is that this work can help incentivize greater adoption of scheduling-based approaches in cloud applications.
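
For readers who want to build some intuition for how the choice of policy interacts with the job-size distribution, here is a minimal toy sketch in Python. It is not our framework and does not reproduce any result from the paper. It assumes a single non-preemptive server with Poisson arrivals, and all names in it (`simulate`, `make_workload`, `percentile`) are hypothetical. It simply reports mean and 99th-percentile response times for FIFO and Shortest Job First on one light-tailed and one heavy-tailed workload.

```python
import heapq
import random


def simulate(jobs, policy):
    """Run a single non-preemptive server over jobs = [(arrival, size), ...].

    jobs must be sorted by arrival time. policy is "fifo" or "sjf".
    Returns response times (completion - arrival) in completion order.
    """
    queue = []  # min-heap of (priority key, sequence number, arrival, size)
    responses = []
    clock = 0.0
    i = 0
    while i < len(jobs) or queue:
        if not queue:
            # Server is idle: jump ahead to the next arrival.
            clock = max(clock, jobs[i][0])
        # Admit every job that has arrived by the current time.
        while i < len(jobs) and jobs[i][0] <= clock:
            arrival, size = jobs[i]
            key = size if policy == "sjf" else arrival
            heapq.heappush(queue, (key, i, arrival, size))
            i += 1
        # Serve the job the policy ranks first, to completion.
        _, _, arrival, size = heapq.heappop(queue)
        clock += size
        responses.append(clock - arrival)
    return responses


def make_workload(n, load, size_sampler, mean_size, seed=0):
    """Generate n jobs with Poisson arrivals at the given offered load."""
    rng = random.Random(seed)
    t = 0.0
    jobs = []
    for _ in range(n):
        t += rng.expovariate(load / mean_size)  # inter-arrival time
        jobs.append((t, size_sampler(rng)))
    return jobs


def percentile(values, q):
    """Simple nearest-rank style percentile, with q in [0, 1]."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


if __name__ == "__main__":
    # Two illustrative job-size distributions (assumed, not from the paper).
    workloads = {
        "light-tailed (exponential)": (lambda r: r.expovariate(1.0), 1.0),
        "heavy-tailed (Pareto, alpha=1.5)": (lambda r: r.paretovariate(1.5), 3.0),
    }
    for name, (sampler, mean_size) in workloads.items():
        jobs = make_workload(20000, 0.8, sampler, mean_size, seed=42)
        for policy in ("fifo", "sjf"):
            resp = simulate(jobs, policy)
            mean = sum(resp) / len(resp)
            print(f"{name:34s} {policy:4s} "
                  f"mean={mean:8.2f} p99={percentile(resp, 0.99):8.2f}")
```

Changing the load, the tail index, or the policy set is a quick way to see how sensitive the tail is to the workload, which is the sensitivity that motivates learning the policy rather than fixing it in advance.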

This work was carried out in collaboration with a remarkable group of students (Abdullah Bin Faisal and Hafiz Mohsin) at Tufts and my colleagues (Fahad R. Dogar and Zartash Afzal Uzmi).

December 8, 2018 (Heraklion, Greece)