Towards Privacy Policy Conceptual Modeling

After GDPR enforcement in May 2018, the problem of implementing privacy by design and staying compliant with regulations has been more prominent than ever for businesses of all sizes, which is evident from frequent cases against companies and significant fines paid due to non-compliance. Consequently, numerous research works have been emerging in this area. Yet, to this moment, no publicly available model can offer a comprehensive representation of privacy policies written in natural language, that is machine-readable, interoperable and suitable for automatic compliance checking. Meanwhile, privacy policies stay one of the main means of communication between a business (Data Controller) and a Data Subject, when it comes to the use of personal data. In this paper, we propose a conceptual model for fine-grained representation of privacy policies. We reuse and adapt existing Semantic Web resources in the spirit of interoperability. We represent our model as an ODRL profile and demonstrate how existing privacy policies can be translated into ODRL-like policies, consisting of deontic rules. We enrich our model with vocabularies for describing personal data processing in great detail, making it suitable for further usage in downstream applications, such as access control tools, to support adoption and implementation of privacy by design. We also demonstrate our model’s capability of handling personal data processing rules in other types of documents, namely data processing agreements, essential for controlling data privacy in a relationship between a Controller and a Processor.

The paper is available online on Springer. Currently, it is unfortunately freely available only to subscribers, but do not hesitate to reach out to us for more information!

Krasnashchok K., Mustapha M., Al Bassit A., Skhiri S. Towards Privacy Policy Conceptual Modeling. In Dobbie G., Frank U., Kappel G., Liddle S.W., Mayr H.C. (eds), Proc. of the 39th International Conference on Conceptual Modeling, LNCS 12400, 2020. Springer, Cham.


Applying Machine Learning Modeling to Enhance Runway Throughput at A Big European Airport

One of the factors limiting busiest airport’s runway throughput capacity is the spacing to be applied between landing aircraft in order to ensure that the runway is vacated when the follower aircraft reaches the runway threshold. Today, because the Controller is not able to always anticipate the runway occupancy time (ROT) of the leader aircraft, significant spacing buffers are added to the minimum required spacing in order to cover all possible cases, which negatively affects the resulting arrival throughput. The present paper shows how a Machine Learning (ML) analysis can support the development of accurate, yet operational, models for ROT prediction depending on all impact parameters. Based on Gradient Boosting Regressors, those ML models make use of flight plan information (such as aircraft type, airline, flight data) and weather information to model the ROT. This paper shows how it can be used operationally to increase runway capacity while maintaining or reducing the risk of delivery of separations below runway occupancy time. The methodology and related benefits are assessed using three years of field measurements gathered at Zurich airport.

You can find the slide here and the paper here.

Guillaume Stempfel, Victor Brossard, Ivan De Visscher, Antoine Bonnefoy, Mohamed Ellejmi,  Vincent Treve ̧ Applying Machine Learning Modeling to Enhance Runway Throughput at A Big European Airport, Proc. of the 10th EASN International Conference on “Innovation in Aviation & Space to the Satisfaction of the European Citizens, Naples, Italy, 2020.

Pruning Random Forest with Orthogonal Matching Trees

In this paper we propose a new method to reduce the size of Breiman’s Random Forests. Given a RandomForest and a target size, our algorithm builds a linear combination of trees which minimizes the training error. Selected trees, as well as weights of the linear combination are obtained by means of the Orthogonal Matching Pursuit algorithm. We test our method on many public benchmark datasets both on regression and binary classification, and we compare it to other pruning techniques. Experiments show that our technique performs significantly better or equally good on many datasets1. We also discuss the benefit and short-coming of learning weights for the pruned forest which lead us to propose to use a non-negative constraint on the OMP weights for better empirical results.

Luc Giffon, Charly Lamothe, Léo Bouscarrat, Paolo Milanesi, Farah Cherfaoui, and Sokol Ko, Pruning Random Forest with Orthogonal Matching Trees, Proc. of CAP 2020.

Click here to access the paper.

Multilingual Enrichment of Disease Biomedical Ontologies

Translating biomedical ontologies is an important challenge, but doing it manually requires much time and money. We study the possibility to use open-source knowledge bases to translate biomedical ontologies. We focus on two aspects: coverage and quality. We look at the coverage of two biomedical ontologies focusing on diseases with respect to Wikidata for 9 European languages (Czech, Dutch, English, French, German, Italian, Polish, Portuguese and Spanish) for both, plus Arabic, Chinese and Russian for the second. We first use direct links between Wikidata and the studied ontologies and then use second-order links by going through other intermediate ontologies. We then compare the quality of the translations obtained thanks to Wikidata with a commercial machine translation tool, here Google Cloud Translation.

Léo Bouscarrat, Antoine Bonnefoy, Cécile Capponi, Carlos Ramisch, Multilingual Enrichment of Disease Biomedical Ontologies, Proc. of MultilingualBIO 2020.

Click here to access the paper.

TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes

Graphs are a fundamental structure that provides an intuitive abstraction for modelling and analyzing complex and highly interconnected data. Given the potential complexity of such data, some approaches proposed extending decision-support systems with multidimensional analysis capabilities over graphs. In this paper, we introduce TopoGraph, an end-to-end framework for building and analyzing graph cubes. TopoGraph extends the existing graph cube models by defining new types of dimensions and measures and organizing them within a multidimensional space that guarantees multidimensional integrity constraints. This results in defining three new types of graph cubes: property graph cubes, topological graph cubes, and graph-structured cubes. Afterwards, we define the algebraic OLAP operations for such novel cubes. We implement and experimentally validate TopoGraph with different types of real-world datasets.


The paper will be published soon in Information Systems Frontiers, and is already available online on Springer. Currently, it is unfortunately available only to subscribers, but do not hesitate to reach out to us for more information!


Amine Ghrab, Oscar Romero, Sabri Skhiri, Esteban Zimányi, TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes, published in Information Systems Frontiers (2020).



A Performance Prediction Model for Spark Applications

Apache Spark is a popular open-source distributed-processing framework that enables efficient processing of massive amounts of data. It has a large number of parameters that need to be tuned to get the best performance. However, tuning these parameters manually is a complex and time-consuming task. Therefore, a robust performance model to predict applications execution time could greatly help in accelerating the deployment and optimization of big data applications relying on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications that cover the most common workloads to generate a representative dataset of execution time. In addition, we extracted application and data features to build a machine learning-based performance model to predict Spark applications execution time. The experiments show that boosting algorithms achieved better results compared to other algorithms.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

Click here to access the paper in its preprint form.

GraphOpt: Framework for Automatic Parameters Tuning of Graph Processing Frameworks

Finding the optimal configuration of a black-box system is a difficult problem that requires a lot of time and human labor. Big data processing frameworks are among the increasingly popular systems whose tuning is a complex and time consuming. The challenge of automatically finding the optimal parameters of big data frameworks attracted a lot of research in recent years. Some of the studies focused on optimizing specific frameworks such as distributed stream processing, or finding the best cloud configurations, while others proposed general services for optimizing any black-box system. In this paper, we introduce a new use case in the domain of automatic parameter tuning: optimizing the parameters of distributed graph processing frameworks. This task is notably difficult given the particular challenges of distributed graph processing that include the graph partitioning and the iterative nature of graph algorithms.

To address this challenge, we designed and implemented GraphOpt: an efficient and scalable black-box optimization framework that automatically tunes distributed graph processing frameworks. GraphOpt implements state-of-the-art optimization algorithms and introduces a new hill-climbing-based search algorithm. These algorithms are used to optimize the performance of two major graph processing frameworks: Giraph and GraphX. Extensive experiments were run on GraphOpt using multiple graph benchmarks to evaluate its performance and show that it provides up to 47.8% improvement compared to random search and an average improvement of up to 5.7%.

Muaz Twaty, Amine Ghrab, Skhiri Sabri: GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks. 2019 IEEE International Conference on Big Data (Big Data) Workshops, Los Angeles, CA, USA.

The paper was published at the third IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2019).

You can access it here in its preprint version.

Do not hesitate to contact our R&D department at to discuss how you can leverage graph processing in your projects.

STRASS: A Light and Effective Method for Extractive Summarization

This paper introduces STRASS: Summarization by TRAnsformation Selection and Scoring. It is an extractive text summarization method which leverages the semantic information in existing sentence embedding spaces. Our method creates an extractive summary by selecting the sentences with the closest embeddings to the document embedding. The model learns a transformation of the document embedding to minimize the similarity between the extractive summary and the ground truth summary. As the transformation is only composed of a dense layer, the training can be done on CPU, therefore, inexpensive. Moreover, inference time is short and linear according to the number of sentences. As a second contribution, we introduce the French CASS dataset, composed of judgments from the French Court of cassation and their corresponding summaries. On this dataset, our results show that our method performs similarly to the state of the art extractive methods with effective training and inferring time.

Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.

Click here to access the paper.

Florence, Italy

LEAD: A Formal Specification For Event Processing

Processing event streams is an increasingly important area for modern businesses aiming to detect and efficiently react to critical situations in near real-time. The need to govern the behaviour of systems where such streams exist has led to the development of numerous Complex Event Processing (CEP) engines, capable of detecting patterns and analyzing event streams. Although current CEP systems provide real-time analysis foundations for a variety of applications, several challenges arise due to languages’ limitations and imprecise semantics, as well as the lack of power to handle big data requirements. In this paper, we discuss such systems, analyzing some of the most sensitive issues in this domain. Further, in this context, we present our contributions expressed in LEAD, a formal specification for processing complex events. LEAD provides an algebra that consists of a set of operators for constructing complex events (patterns), temporally restricting the construction process and choosing among several selection and consumption policies. We show how to build LEAD rules to demonstrate the expressive power of our approach. Furthermore, we introduce a novel approach of interpreting these rules into a logical execution plan, built with temporal prioritized coloured petri nets.

Anas Al Bassit, Skhiri Sabri, LEAD: A Formal Specification For Event Processing, in 13Th ACM international Conference on distributed and event-based systems 2019

Click here to access the paper.

Coherence Regularization for Neural Topic Models

Neural topic models aim to predict the words of a document given the document itself. In such models, perplexity is used as a training criterion, whereas the final quality measure is topic coherence. In this work, we introduce a coherence regularization loss that penalizes incoherent topics during the training of the model. We analyze our approach using coherence and an additional metric – exclusivity, responsible for the uniqueness of the terms in topics. We argue that this combination of metrics is an adequate indicator of the model quality. Our results indicate the effectiveness of our loss and the potential to be used in the future neural topic models.

The paper will be published at the 16th International Symposium on Neural Networks taking place in Moscow. In the meantime, do not hesitate to contact our R&D department at to discuss how you can leverage neural topic models in your projects.

Katsiaryna Krasnashchok, Aymen Cherif, Coherence Regularization for Neural Topic Models. in 16th International Symposium on Neural Networks 2019 (ISNN 2019)

Click here to access the paper.