Skip to content

A Performance Prediction Model for Spark Applications

Apache Spark is a popular open-source distributed-processing framework that enables efficient processing of massive amounts of data. It has a large number of parameters that need to be tuned to get the best performance. However, tuning these parameters manually is a complex and time-consuming task. Therefore, a robust performance model to predict applications execution time could greatly help in accelerating the deployment and optimization of big data applications relying on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications that cover the most common workloads to generate a representative dataset of execution time. In addition, we extracted application and data features to build a machine learning-based performance model to predict Spark applications execution time. The experiments show that boosting algorithms achieved better results compared to other algorithms.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

Click here to access the paper in its preprint form.

Releated Posts

Investigating a Feature Unlearning Bias Mitigation Technique for Cancer-type Bias in AutoPet Dataset

We proposed a feature unlearning technique to reduce cancer-type bias, which improved segmentation accuracy while promoting fairness across sub-groups, even with limited data.
Read More

Muppet: A Modular and Constructive Decomposition for Perturbation-based Explanation Methods

The topic of explainable AI has recently received attention driven by a growing awareness of the need for transparent and accountable AI. In this paper, we propose a novel methodology to decompose any state-of-the-art perturbation-based explainability approach into four blocks. In addition, we provide Muppet: an open-source Python library for explainable AI.
Read More