Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs one of the industry's largest Spark and Hive deployments on top of YARN, and at that scale shuffle leads to hardware failures (burned-out disks) as well as reliability and scalability challenges. Last year, we presented the Zeus service architecture and early results at this forum. Since then we have made great progress: we open-sourced Zeus last year and deployed it to all of our analytics clusters.
In this talk, we describe how we scaled the Zeus service to all Spark workloads at Uber, handling billions of shuffle messages and petabytes of shuffle data. We also cover the strategies we used to roll out Zeus at this scale without service disruption and without users noticing any difference. Finally, we discuss the performance and reliability improvements made in subsequent releases and the further improvements on the horizon for Zeus.