
A Performance Prediction Model for Spark Applications

Apache Spark is a popular open-source distributed-processing framework that enables efficient processing of massive amounts of data. It exposes a large number of parameters that must be tuned to get the best performance, and tuning them manually is a complex and time-consuming task. A robust performance model that predicts application execution time could therefore greatly accelerate the deployment and optimization of big data applications that rely on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications covering the most common workloads to generate a representative dataset of execution times. We then extracted application and data features to build a machine learning-based performance model that predicts the execution time of Spark applications. In our experiments, boosting algorithms achieved better results than the other algorithms evaluated.
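To make the idea of a boosting-based performance model concrete, here is a minimal sketch in plain Python. It is not the model from the paper: the feature names (input size, executor cores, executor memory), the synthetic runtime formula, and the choice of gradient boosting over decision stumps are all assumptions made for illustration only.

```python
import random

def fit_stump(X, residuals):
    """Return (feature, threshold, left_mean, right_mean) minimising squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def fit_boosting(X, y, rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        pred = [p + lr * (lm if row[j] <= t else rm) for row, p in zip(X, pred)]
    return base, lr, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (assumption): runtime grows with input size per core and
# shrinks with executor memory. Real measurements would replace this.
random.seed(0)
X = [[random.uniform(1, 50), random.choice([2, 4, 8]), random.choice([2, 4, 8, 16])]
     for _ in range(80)]  # columns: input_gb, executor_cores, executor_memory_gb
y = [20 + 30 * gb / cores + 40 / mem + random.gauss(0, 1) for gb, cores, mem in X]

model = fit_boosting(X, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
model_mse = mse(y, [predict(model, row) for row in X])
print(f"baseline MSE: {baseline_mse:.1f}, boosted MSE: {model_mse:.1f}")
```

The errors printed here are training errors on synthetic data, so they only show that the boosted model fits better than predicting the mean runtime. Evaluating a real model would need held-out runs of actual Spark applications.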

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

Click here to access the paper in its preprint form.

