In this section you will find EURA NOVA’s scientific publications and reports.
For safety-critical systems involving AI components (such as in planes, cars, or healthcare), safety and associated certification tasks are one of the main challenges, which can become costly and difficult to address.
One key aspect is to ensure that the decisions a machine-learning classifier makes are properly calibrated. This Thursday, at the MLSC workshop, our engineer Nicolas presented part of the research work on classifier calibration carried out with our senior data scientist Antoine Bonnefoy.
The Machine Learning in Certified Systems workshop brought together machine learning researchers with international authorities and industry experts to present the main open questions and methods for verification and certification of critical software. The objective was also to define the future research agenda towards the medium-term goal of certifying critical systems involving AI components. The workshop included invited talks, a poster session and panel discussions.
Nicolas talked about improving the calibration of classifiers and its evaluation through the introduction of continuous estimators of related errors.
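As a point of reference for what a calibration error measures, here is a minimal sketch of the classic binned expected calibration error (ECE). It is the standard histogram-based estimator, not the continuous estimators presented in the talk; the function name and the toy numbers are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width confidence bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its confidence/accuracy gap, weighted by its size.
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece


# Toy example (made-up numbers): a classifier that always reports 0.9
# confidence but is right only 3 times out of 5 is over-confident.
toy_ece = expected_calibration_error([0.9] * 5, [True, True, True, False, False])
```

A perfectly calibrated classifier would score 0 on this measure; the result depends on the chosen number of bins, which is one known weakness of binned estimators.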
Watch him present his poster on YouTube.
You can find the poster pdf below!
Molecular pathway databases represent cellular processes in a structured and standardized way. These databases support the community-wide utilization of pathway information in biological research and the computational analysis of high-throughput biochemical data. Although pathway databases are critical in genomics research, the fast progress of biomedical sciences prevents databases from staying up-to-date. Moreover, the compartmentalization of cellular reactions into defined pathways reflects arbitrary choices that might not always be aligned with the needs of the researcher. Today, no tool exists that allows the easy creation of user-defined pathway representations.
Here we present Padhoc, a pipeline for pathway ad hoc reconstruction. Based on a set of user-provided keywords, Padhoc combines natural language processing, database knowledge extraction, orthology search and powerful graph algorithms to create navigable pathways tailored to the user’s needs. We validate Padhoc with a set of well-established Escherichia coli pathways and demonstrate usability to create not-yet-available pathways in model (human) and non-model (sweet orange) organisms.
Salvador Casaní-Galdón, Cecile Pereira, Ana Conesa, Padhoc: a computational pipeline for pathway reconstruction on the fly, Bioinformatics, Volume 36 (2):i795–i803, December 2020.
Radiomics – high-dimensional features extracted from clinical images – is the main approach used to develop predictive models based on 3D Positron Emission Tomography (PET) scans of patients suffering from cancer. Radiomics extraction relies on an accurate segmentation of the tumoral region, which is a time-consuming task subject to inter-observer variability. On the other hand, data-driven approaches such as deep convolutional neural networks (CNN) struggle to achieve good performance on PET images due to the lack of large available PET datasets combined with the size of 3D networks. In this paper, we assemble several public datasets to create a PET dataset of 2800 scans and propose a deep learning architecture named “2Be3-Net” associating a 2D feature extractor with a 3D CNN predictor. First, we take advantage of a 2D pre-trained model to extract feature maps out of 2D PET slices. Then we apply a 3D CNN on top of the concatenation of the previously extracted feature maps to compute patient-wise predictions. Experiments suggest that 2Be3-Net has an improved ability to exploit spatial information compared to 2D-only or 3D-only CNN solutions. We also evaluate our network on the prediction of clinical outcomes of head-and-neck cancer. The proposed pipeline outperforms PET radiomics approaches on the prediction of loco-regional recurrences and overall survival. Innovative deep learning architectures combining a pre-trained network with a 3D CNN could therefore be a great alternative to traditional CNN and radiomics approaches while empowering small and medium-sized datasets.
Ronan Thomas, Elsa Schalck, Damien Fourure, Antoine Bonnefoy and Inaki Cervera-Marzal, 2Be3-Net: Combining 2D and 3D convolutional neural networks for 3D PET scans predictions, Proc. of the 2nd International Conference on Medical Imaging and Computer-Aided Diagnosis, 2021.
After GDPR enforcement in May 2018, the problem of implementing privacy by design and staying compliant with regulations has been more prominent than ever for businesses of all sizes, which is evident from frequent cases against companies and significant fines paid due to non-compliance. Consequently, numerous research works have been emerging in this area. Yet, to this moment, no publicly available model can offer a comprehensive representation of privacy policies written in natural language, that is machine-readable, interoperable and suitable for automatic compliance checking. Meanwhile, privacy policies stay one of the main means of communication between a business (Data Controller) and a Data Subject, when it comes to the use of personal data. In this paper, we propose a conceptual model for fine-grained representation of privacy policies. We reuse and adapt existing Semantic Web resources in the spirit of interoperability. We represent our model as an ODRL profile and demonstrate how existing privacy policies can be translated into ODRL-like policies, consisting of deontic rules. We enrich our model with vocabularies for describing personal data processing in great detail, making it suitable for further usage in downstream applications, such as access control tools, to support adoption and implementation of privacy by design. We also demonstrate our model’s capability of handling personal data processing rules in other types of documents, namely data processing agreements, essential for controlling data privacy in a relationship between a Controller and a Processor.
The paper is available online on Springer. Unfortunately, it is currently available only to subscribers, but do not hesitate to reach out to us for more information!
DOI : https://doi.org/10.1007/978-3-030-62522-1_32
One of the factors limiting the runway throughput capacity of the busiest airports is the spacing to be applied between landing aircraft in order to ensure that the runway is vacated when the follower aircraft reaches the runway threshold. Today, because the Controller is not always able to anticipate the runway occupancy time (ROT) of the leader aircraft, significant spacing buffers are added to the minimum required spacing in order to cover all possible cases, which negatively affects the resulting arrival throughput. The present paper shows how a Machine Learning (ML) analysis can support the development of accurate, yet operational, models for ROT prediction depending on all impact parameters. Based on Gradient Boosting Regressors, those ML models make use of flight plan information (such as aircraft type, airline, flight data) and weather information to model the ROT. This paper shows how they can be used operationally to increase runway capacity while maintaining or reducing the risk of delivery of separations below runway occupancy time. The methodology and related benefits are assessed using three years of field measurements gathered at Zurich airport.
Guillaume Stempfel, Victor Brossard, Ivan De Visscher, Antoine Bonnefoy, Mohamed Ellejmi, Vincent Treve, Applying Machine Learning Modeling to Enhance Runway Throughput at a Big European Airport, Proc. of the 10th EASN International Conference on “Innovation in Aviation & Space to the Satisfaction of the European Citizens”, Naples, Italy, 2020.
In this paper we propose a new method to reduce the size of Breiman’s Random Forests. Given a random forest and a target size, our algorithm builds a linear combination of trees which minimizes the training error. The selected trees, as well as the weights of the linear combination, are obtained by means of the Orthogonal Matching Pursuit (OMP) algorithm. We test our method on many public benchmark datasets, on both regression and binary classification, and we compare it to other pruning techniques. Experiments show that our technique performs significantly better or equally well on many datasets. We also discuss the benefits and shortcomings of learning weights for the pruned forest, which leads us to propose a non-negative constraint on the OMP weights for better empirical results.
Luc Giffon, Charly Lamothe, Léo Bouscarrat, Paolo Milanesi, Farah Cherfaoui, and Sokol Ko, Pruning Random Forest with Orthogonal Matching Trees, Proc. of CAP 2020.
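To make the pruning idea in the abstract above more concrete, here is a minimal, dependency-free sketch of Orthogonal Matching Pursuit applied to a matrix of per-tree predictions. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the trees are represented only by their training predictions, and the non-negative constraint discussed in the paper is not applied.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (no two selected trees are collinear).
    """
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def omp_prune(tree_predictions, targets, n_selected):
    """Greedily pick n_selected trees and least-squares weights for them.

    tree_predictions[j][k] is the prediction of tree j on training sample k.
    """
    if n_selected > len(tree_predictions):
        raise ValueError("cannot select more trees than the forest contains")
    selected, weights = [], []
    residual = list(targets)
    for _ in range(n_selected):
        # 1. Pick the unselected tree most correlated with the current residual.
        def score(j):
            norm = dot(tree_predictions[j], tree_predictions[j]) ** 0.5
            return abs(dot(tree_predictions[j], residual)) / norm if norm else 0.0

        candidates = [j for j in range(len(tree_predictions)) if j not in selected]
        selected.append(max(candidates, key=score))

        # 2. Refit all weights by least squares (normal equations).
        gram = [[dot(tree_predictions[i], tree_predictions[j]) for j in selected]
                for i in selected]
        moments = [dot(tree_predictions[i], targets) for i in selected]
        weights = solve_linear_system(gram, moments)

        # 3. Update the residual of the weighted combination.
        residual = [
            t - sum(w * tree_predictions[j][k] for w, j in zip(weights, selected))
            for k, t in enumerate(targets)
        ]
    return selected, weights


# Toy data (made up): three "trees" and targets equal to 2 * tree 2 + 1 * tree 0.
toy_trees = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]]
toy_selected, toy_weights = omp_prune(toy_trees, [3, 2, 2, 2], n_selected=2)
```

The pruned forest then predicts with the weighted sum of the selected trees only, so the target size directly controls the memory and inference cost.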
Translating biomedical ontologies is an important challenge, but doing it manually requires much time and money. We study the possibility of using open-source knowledge bases to translate biomedical ontologies. We focus on two aspects: coverage and quality. We look at the coverage of two biomedical ontologies focusing on diseases with respect to Wikidata for 9 European languages (Czech, Dutch, English, French, German, Italian, Polish, Portuguese and Spanish) for both, plus Arabic, Chinese and Russian for the second. We first use direct links between Wikidata and the studied ontologies and then use second-order links by going through other intermediate ontologies. We then compare the quality of the translations obtained from Wikidata with those of a commercial machine translation tool, here Google Cloud Translation.
Léo Bouscarrat, Antoine Bonnefoy, Cécile Capponi, Carlos Ramisch, Multilingual Enrichment of Disease Biomedical Ontologies, Proc. of MultilingualBIO 2020.
Graphs are a fundamental structure that provides an intuitive abstraction for modelling and analyzing complex and highly interconnected data. Given the potential complexity of such data, some approaches proposed extending decision-support systems with multidimensional analysis capabilities over graphs. In this paper, we introduce TopoGraph, an end-to-end framework for building and analyzing graph cubes. TopoGraph extends the existing graph cube models by defining new types of dimensions and measures and organizing them within a multidimensional space that guarantees multidimensional integrity constraints. This results in defining three new types of graph cubes: property graph cubes, topological graph cubes, and graph-structured cubes. Afterwards, we define the algebraic OLAP operations for such novel cubes. We implement and experimentally validate TopoGraph with different types of real-world datasets.
The paper will be published soon in Information Systems Frontiers, and is already available online on Springer. Currently, it is unfortunately available only to subscribers, but do not hesitate to reach out to us for more information!
Amine Ghrab, Oscar Romero, Sabri Skhiri, Esteban Zimányi, TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes, published in Information Systems Frontiers (2020).
Apache Spark is a popular open-source distributed-processing framework that enables efficient processing of massive amounts of data. It has a large number of parameters that need to be tuned to get the best performance. However, tuning these parameters manually is a complex and time-consuming task. Therefore, a robust performance model to predict the execution time of applications could greatly help in accelerating the deployment and optimization of big data applications relying on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications that cover the most common workloads to generate a representative dataset of execution times. In addition, we extracted application and data features to build a machine learning-based performance model to predict the execution time of Spark applications. The experiments show that boosting algorithms achieved better results than the other algorithms.
The paper will be published at the Big Data congress 2020 taking place in Hawaii. In the meantime, do not hesitate to contact our R&D department at email@example.com to discuss how you can optimise distributed processing frameworks in your projects.
Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.
Finding the optimal configuration of a black-box system is a difficult problem that requires a lot of time and human labor. Big data processing frameworks are among the increasingly popular systems whose tuning is complex and time-consuming. The challenge of automatically finding the optimal parameters of big data frameworks has attracted a lot of research in recent years. Some of the studies focused on optimizing specific frameworks such as distributed stream processing, or finding the best cloud configurations, while others proposed general services for optimizing any black-box system. In this paper, we introduce a new use case in the domain of automatic parameter tuning: optimizing the parameters of distributed graph processing frameworks. This task is notably difficult given the particular challenges of distributed graph processing, which include graph partitioning and the iterative nature of graph algorithms.
To address this challenge, we designed and implemented GraphOpt: an efficient and scalable black-box optimization framework that automatically tunes distributed graph processing frameworks. GraphOpt implements state-of-the-art optimization algorithms and introduces a new hill-climbing-based search algorithm. These algorithms are used to optimize the performance of two major graph processing frameworks: Giraph and GraphX. Extensive experiments were run on GraphOpt using multiple graph benchmarks to evaluate its performance and show that it provides up to 47.8% improvement compared to random search and an average improvement of up to 5.7%.
The paper was published at the third IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2019).
Do not hesitate to contact our R&D department at firstname.lastname@example.org to discuss how you can leverage graph processing in your projects.
Muaz Twaty, Amine Ghrab, Sabri Skhiri: GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks. 2019 IEEE International Conference on Big Data (Big Data) Workshops, Los Angeles, CA, USA.
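For readers unfamiliar with hill-climbing search, the sketch below shows the generic textbook version applied to black-box parameter tuning. It is not GraphOpt's algorithm: the parameter names (`workers`, `partitions`), the bounds and the synthetic runtime function are all hypothetical stand-ins for an expensive benchmark run.

```python
def hill_climb(cost, start, neighbours, max_steps=1000):
    """Move to the best neighbouring configuration until none improves the cost."""
    current, current_cost = start, cost(start)
    for _ in range(max_steps):
        scored = [(cost(candidate), candidate) for candidate in neighbours(current)]
        if not scored:
            break
        best_cost, best = min(scored)
        if best_cost >= current_cost:
            break  # local optimum: no neighbour is strictly better
        current, current_cost = best, best_cost
    return current, current_cost


def grid_neighbours(config, bounds):
    """Configurations that differ from config by one step in one parameter."""
    result = []
    for i, (low, high) in enumerate(bounds):
        for step in (-1, 1):
            value = config[i] + step
            if low <= value <= high:
                result.append(config[:i] + (value,) + config[i + 1:])
    return result


def synthetic_runtime(config):
    """Hypothetical runtime in seconds; a real tuner would run a benchmark here."""
    workers, partitions = config
    return (workers - 6) ** 2 + (partitions - 12) ** 2 + 30


# Hypothetical search space: 1-16 workers and 1-32 partitions.
BOUNDS = ((1, 16), (1, 32))
best_config, best_runtime = hill_climb(
    synthetic_runtime, (1, 1), lambda c: grid_neighbours(c, BOUNDS)
)
```

On a real system every call to the cost function is a full job execution, which is why the number of evaluated configurations matters as much as the quality of the final one. Plain hill climbing can also stop in a local optimum, a limitation this toy convex cost does not exhibit.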
This document introduces you to the master's theses supervised by our research & development department. Each project offers you the chance to be actively involved in developing solutions that address tomorrow’s challenges in ICT and in implementing them today!
If you are interested in one of our offers, please send your application to email@example.com, including your CV and your motivation for your top three master's thesis subjects (described in the document).
If you are interested in working on a topic that is not in our range of offers, we would be delighted to hear your proposal and invite you to get in touch.
Master thesis subjects and application guidelines are available here: Master Thesis Offers.
This paper introduces STRASS: Summarization by TRAnsformation Selection and Scoring. It is an extractive text summarization method which leverages the semantic information in existing sentence embedding spaces. Our method creates an extractive summary by selecting the sentences with the closest embeddings to the document embedding. The model learns a transformation of the document embedding to maximize the similarity between the extractive summary and the ground truth summary. As the transformation is only composed of a dense layer, the training can be done on CPU and is therefore inexpensive. Moreover, inference time is short and linear in the number of sentences. As a second contribution, we introduce the French CASS dataset, composed of judgments from the French Court of cassation and their corresponding summaries. On this dataset, our results show that our method performs similarly to the state-of-the-art extractive methods, with efficient training and inference times.
Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.
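The selection step described in the abstract above can be sketched as follows. This is a simplified illustration, not the STRASS code: the embeddings are toy 2D vectors, the dense layer is assumed to be purely linear and is set to the identity here instead of being learned, and a fixed top-k rule replaces the paper's actual selection criterion.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def apply_dense_layer(weights, bias, vector):
    """Linear transformation weights @ vector + bias (no activation assumed)."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def select_sentences(sentence_embeddings, document_embedding, k):
    """Indices of the k sentences closest to the document embedding, in text order."""
    ranked = sorted(
        range(len(sentence_embeddings)),
        key=lambda i: cosine_similarity(sentence_embeddings[i], document_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy embeddings (made up); a trained model would supply learned weights.
identity_weights = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
toy_sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
transformed_document = apply_dense_layer(identity_weights, zero_bias, [1.0, 0.0])
toy_summary = select_sentences(toy_sentences, transformed_document, k=2)
```

Each sentence is compared with the document embedding exactly once, which is consistent with the inference cost growing linearly with the number of sentences.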
Processing event streams is an increasingly important area for modern businesses aiming to detect and efficiently react to critical situations in near real-time. The need to govern the behaviour of systems where such streams exist has led to the development of numerous Complex Event Processing (CEP) engines, capable of detecting patterns and analyzing event streams. Although current CEP systems provide real-time analysis foundations for a variety of applications, several challenges arise due to languages’ limitations and imprecise semantics, as well as the lack of power to handle big data requirements. In this paper, we discuss such systems, analyzing some of the most sensitive issues in this domain. Further, in this context, we present our contributions expressed in LEAD, a formal specification for processing complex events. LEAD provides an algebra that consists of a set of operators for constructing complex events (patterns), temporally restricting the construction process and choosing among several selection and consumption policies. We show how to build LEAD rules to demonstrate the expressive power of our approach. Furthermore, we introduce a novel approach of interpreting these rules into a logical execution plan, built with temporal prioritized coloured petri nets.
The paper will be published at the 13th ACM International Conference on Distributed and Event-based Systems, taking place in Germany. In the meantime, do not hesitate to contact our R&D department at firstname.lastname@example.org to discuss how you can leverage complex event processing in your projects.
Anas Al Bassit, Sabri Skhiri, LEAD: A Formal Specification For Event Processing, in 13th ACM International Conference on Distributed and Event-based Systems, 2019.
Neural topic models aim to predict the words of a document given the document itself. In such models, perplexity is used as a training criterion, whereas the final quality measure is topic coherence. In this work, we introduce a coherence regularization loss that penalizes incoherent topics during the training of the model. We analyze our approach using coherence and an additional metric, exclusivity, which measures the uniqueness of the terms in topics. We argue that this combination of metrics is an adequate indicator of the model quality. Our results indicate the effectiveness of our loss and its potential for use in future neural topic models.
The paper will be published at the 16th International Symposium on Neural Networks taking place in Moscow. In the meantime, do not hesitate to contact our R&D department at email@example.com to discuss how you can leverage neural topic models in your projects.
Katsiaryna Krasnashchok, Aymen Cherif, Coherence Regularization for Neural Topic Models, in 16th International Symposium on Neural Networks 2019 (ISNN 2019).
At EURA NOVA, we believe investing in research allows us to continuously become more proficient, to maintain our know-how at the cutting edge of IT, and to share its benefits with our customers. As we look back on the year 2018, we are both proud and happy to announce that our R&D department has released 7 publications this year:
Firstly, our paper “Pairwise Image Ranking with Deep Comparative Network” was published at the 26th European Symposium on Artificial Neural Networks. The paper, written by our Lead R&D engineer Aymen Cherif and Salim Jouili, discusses how using the pairwise ranking model can provide better results for instance-level image retrieval.
Aymen Cherif, Salim Jouili, Pairwise Image Ranking with Deep Comparative Network. ESANN 2018: ES2018-200
Secondly, our R&D engineer Cécile Pereira co-authored a paper published in Bioinformatics in May 2018. The authors propose a novel end-to-end deep learning approach for biomedical NER tasks that leverages local contexts based on n-gram character and word embeddings via a Convolutional Neural Network.
Qile Zhu, Xiaolin Li, Ana Conesa, Cécile Pereira, GRAM-CNN: A deep learning approach with local context for named entity recognition in biomedical text, Bioinformatics – May 2018
In July, our R&D engineer Katsiaryna Krasnashchok was in Melbourne, Australia to attend the ACL conference. She presented her poster on topic modelling. Her paper, co-written with Salim Jouili, indicates that involving more named entities positively influences the overall quality of topics.
Katsiaryna Krasnashchok, Salim Jouili, Improving Topic Quality by Promoting Named Entities in Topic Modeling, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2. 2018
Moreover, our paper “Graph BI & Analytics: Current State and Future Challenges” was accepted for publication and presented at the 20th International Conference on Big Data Analytics and Knowledge Discovery, taking place in Germany in September. The paper presents the state of the art of graph BI & analytics, with a focus on graph warehousing.
Amine Ghrab, Oscar Romero, Salim Jouili, Sabri Skhiri, Graph BI & Analytics: Current State and Future Challenges. DaWaK 2018: 3-18
In September as well, our paper Data Mining and Machine Learning Techniques supporting Time-based Separation Concept Deployment, co-written with Eurocontrol and WaPT, was presented at the 37th Digital Avionics Systems Conference (DASC) in London, U.K. The paper presents two methods to allow air traffic controllers to deliver separation minima accurately and safely, on the basis of time intervals instead of distances.
De Visscher, I.; Stempfel, G.; Rooseleer, F. & Treve, V.; Data mining and Machine Learning techniques supporting Time-Based Separation concept deployment, in 37th Digital Avionics Systems Conference (DASC), pp 594-603, London, UK, September 23-27, 2018
Finally, our engineer Katsiaryna Krasnashchok presented her poster on Hierarchical Attention-Based Neural Topic Model in October at the 6th International Conference on Statistical Language and Speech Processing. At the same conference, our Lead R&D engineer Aymen Cherif and our bootcamper Luca De Petris also presented their poster on LSTM Siamese Network.
Katsiaryna Krasnashchok, Salim Jouili, Hierarchical Attention-Based Neural Topic Model, SLSP 2018
Luca De Petris, Aymen Cherif, LSTM Siamese Network for Question Answering System, SLSP 2018
In July, our R&D engineer Katsiaryna Krasnashchok was in Melbourne, Australia to attend the ACL conference. She presented her poster on topic modelling. Her paper, co-written with Salim Jouili, indicates that involving more named entities positively influences the overall quality of topics.
News-related content has been extensively studied in both topic modeling research and named entity recognition. However, expressive power of named entities and their potential for improving the quality of discovered topics has not received much attention. In this paper, we use named entities as domain-specific terms for news-centric content and present a new weighting model for Latent Dirichlet Allocation. Our experimental results indicate that involving more named entities in topic descriptors positively influences the overall quality of topics, improving their interpretability, specificity and diversity.
Katsiaryna Krasnashchok, Salim Jouili, Improving Topic Quality by Promoting Named Entities in Topic Modeling, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2. 2018.
Our paper “Data Mining and Machine Learning Techniques supporting Time-based Separation Concept Deployment”, co-written with Eurocontrol and WaPT, has been accepted by the 37th Digital Avionics Systems Conference (DASC) in London, U.K.
The paper presents two methods to allow air traffic controllers to deliver separation minima accurately and safely, on the basis of time intervals instead of distances.
Importantly, in strong headwind conditions, the aircraft’s groundspeed during approach decreases, meaning that keeping the distance-based separation method results in lower landing rates. At a time of intensified air traffic, this situation leads to considerable delays at airports with significant costs to operators and travellers.
With the new methods presented in the paper, capacity can increase by up to 14% in strong wind conditions, and by up to 8% in moderate wind conditions.
[EDIT] The paper was presented in September at DASC 2018; you can find the full version below. If you wish to go deeper into the subject, do not hesitate to contact our research department at firstname.lastname@example.org.
The Time-Based Separation (TBS) concept consists in the definition of separation minima for aircraft on the final approach to a runway based on time intervals instead of distances, as applied in Distance-Based Separation (DBS) operations.
TBS allows for dynamic distance separation reductions in strong headwind conditions so as to preserve time spacing across all wind conditions. However, TBS application entails the use of a support tool providing separation distance indicators that depend on the applicable time separation minimum and on the aircraft speed profile, which itself depends on the headwind conditions.
This paper details two methodologies allowing a system to compute those TBS indicators so as to allow Air Traffic Controllers to accurately and safely deliver the TBS minima using a separation delivery support tool. The first approach is based on “analytical” data mining and modelling whereas the second one is based on a Machine Learning (M/L) procedure.
In the framework of the deployment of the TBS concept in Vienna airport (LOWW), those approaches are developed and tested using a database covering one year of traffic and corresponding local meteorological data.
The operation of TBS with indicators computed using either approach leads to a substantial reduction of time separations compared to a DBS strategy. However, given the large uncertainties related both to leader and follower aircraft speed profiles, the buffers could be designed only for the most frequent pairs. With the M/L approach (resp. the “analytical” approach), the capacity benefits related to the application of TBS with a separation support tool are of the order of 8% (resp. 2%) in moderate wind conditions, and up to 14% (resp. 10%) in strong wind conditions.
De Visscher, I.; Stempfel, G.; Rooseleer, F. & Treve, V.; Data mining and Machine Learning techniques supporting Time-Based Separation concept deployment, in 37th Digital Avionics Systems Conference (DASC), pp 594-603, London, UK, September 23-27, 2018
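The arithmetic behind the difference between distance-based and time-based separation can be illustrated with a back-of-the-envelope sketch. It assumes a constant airspeed and headwind along the final approach, whereas the paper works with full speed profiles and safety buffers; every number below is hypothetical and chosen only for the example.

```python
def ground_speed_kt(airspeed_kt, headwind_kt):
    speed = airspeed_kt - headwind_kt
    if speed <= 0:
        raise ValueError("ground speed must be positive")
    return speed


def time_spacing_s(distance_nm, airspeed_kt, headwind_kt):
    """Time needed to fly a fixed distance separation (the DBS situation)."""
    return distance_nm / ground_speed_kt(airspeed_kt, headwind_kt) * 3600.0


def tbs_distance_nm(time_minimum_s, airspeed_kt, headwind_kt):
    """Distance indicator that preserves a fixed time separation (the TBS idea)."""
    return ground_speed_kt(airspeed_kt, headwind_kt) * time_minimum_s / 3600.0


# Hypothetical figures: 4 NM minimum, 140 kt approach airspeed.
dbs_time_light_wind = time_spacing_s(4.0, 140.0, 20.0)   # light headwind
dbs_time_strong_wind = time_spacing_s(4.0, 140.0, 40.0)  # strong headwind

# Under TBS, the time spacing of the light-wind case is kept in strong wind,
# so the distance indicator shrinks below the fixed 4 NM.
tbs_distance_strong_wind = tbs_distance_nm(dbs_time_light_wind, 140.0, 40.0)
```

With a fixed distance, the time between two landings grows as the headwind increases, which lowers the landing rate; keeping the time constant instead lets the distance shrink with the ground speed.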
Our paper “Graph BI & Analytics: Current State and Future Challenges” has been accepted for publication at the 20th International Conference on Big Data Analytics and Knowledge Discovery, taking place in Regensburg, Germany.
The paper presents the state of the art of graph BI & analytics, with a focus on graph warehousing. We survey the topics of graph modelling, management, querying, and processing in graph warehouses. Then we conclude by discussing future research directions for solving complex graph problems, building native graph components and intelligent techniques to assist end-users in building and analysing the graph.
More importantly, the paper calls for the development of intelligent, efficient and industry-grade graph data warehousing systems to support the structure-driven management and analytics of data efficiently. While adopting a template that is similar to the traditional BI systems, the graph BI that is presented here extends current systems with graph analytics capabilities that deliver graph-derived insights.
[EDIT] The paper was presented in September at DaWaK 2018; you can now find the full version below. If you wish to go deeper into the subject, don’t hesitate to contact our research department at email@example.com.
Abstract. In an increasingly competitive market, making well-informed decisions requires the analysis of a wide range of heterogeneous, large and complex data. This paper focuses on the emerging field of graph warehousing. Graphs are widespread structures that yield a great expressive power. They are used for modeling highly complex and interconnected domains, and efficiently solving emerging big data applications. This paper presents the current status and open challenges of graph BI and analytics, and motivates the need for new warehousing frameworks aware of the topological nature of graphs. We survey the topics of graph modeling, management, processing and analysis in graph warehouses. Then we conclude by discussing future research directions and positioning them within a unified architecture of a graph BI & analytics framework.
Amine Ghrab, Oscar Romero, Salim Jouili, Sabri Skhiri, Graph BI & Analytics: Current State and Future Challenges. DaWaK 2018, 3-18
Due to the increasing importance and volume of highly interconnected data, such as in social or information networks, a plethora of graph mining techniques have been designed to enable the analysis of such data. In this work, we focus on the mining of associations between entity features in networks. We model each entity feature as a dimension to be analyzed. Consequently we build our approach on top of the existing graph cube framework which is an extension of the concept of the data cube to networks. Our task is particularly challenging because it requires the analysis of both the initial multidimensional network and all its subsequent aggregate forms. As soon as we deal with a big data situation it is impossible for an analyst to consider manually all the possible views of the network data. The aim of this work is to design an algorithm for the discovery of interesting patterns in large graph cubes. Thus, instead of examining all the possible aggregations manually, the proposed technique leads the analyst to the interesting associations or patterns in the multidimensional network. Furthermore, we study the application of existing algorithms from the frequent itemset mining literature on graph data and propose a mapping between the two settings.
Florian Demesmaeker, Amine Ghrab, Siegfried Nijssen, Sabri Skhiri: Discovering interesting patterns in large graph cubes. 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 2017, pp. 3322-3331.
Iterative-convergent algorithms represent an important family of applications in big data analytics. These are typically run on distributed processing frameworks deployed on a cluster of machines. On the other hand, we are witnessing the move towards data center operating systems (OS), where resources are unified by a resource manager and processing frameworks coexist with each other. In this context, different processing framework job tasks can be scheduled on the same machine and slow down a worker (straggler problem). Existing work has shown that an iteration model with relaxed consistency such as the Stale Synchronous Parallel (SSP) model, while still guaranteeing convergence, is able to cope with stragglers. In this paper we propose a model for the integration of the SSP model on a pipelined distributed processing framework. We then apply SSP on a distributed version of the Frank-Wolfe algorithm. We theoretically show its sparsity bounds and convergence under SSP. Finally, we experimentally show that the Frank-Wolfe algorithm applied on LASSO regression under SSP is able to converge faster than its BSP counterpart, especially under load conditions similar to those encountered in a data center OS.
Nam-Luc Tran, Thomas Peel, Sabri Skhiri, Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism, proceedings of the 2015 IEEE Conference on Big Data, November 2015, Santa Clara, CA, USA.
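To make the abstract above more concrete, here is a minimal single-machine sketch of a Frank-Wolfe iteration for LASSO in its constrained form (least squares over an L1 ball of radius `tau`). It is an illustration only, not the paper's distributed or SSP implementation: the function name, the `2 / (k + 2)` step size, and the duality-gap stopping tolerance `tol` are assumptions chosen because they are common textbook choices.

```python
def frank_wolfe_lasso(A, y, tau, iterations, tol=1e-12):
    """Approximately minimize 0.5 * ||A x - y||^2 subject to ||x||_1 <= tau.

    A is a list of rows, y a list of targets. Hypothetical helper for
    illustration; it runs sequentially on one machine (no SSP, no pipelining).
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols  # start from the origin, which lies inside the L1 ball
    for k in range(iterations):
        # Residual r = A x - y and gradient g = A^T r.
        residual = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
                    for i in range(n_rows)]
        gradient = [sum(A[i][j] * residual[i] for i in range(n_rows))
                    for j in range(n_cols)]
        # Linear minimization over the L1 ball: the minimizer is a signed,
        # scaled coordinate vector on the largest-magnitude gradient entry.
        best = max(range(n_cols), key=lambda j: abs(gradient[j]))
        vertex_value = -tau if gradient[best] > 0 else tau
        # Frank-Wolfe duality gap g . (x - s); stop once it is negligible.
        gap = sum(g * xj for g, xj in zip(gradient, x)) - gradient[best] * vertex_value
        if gap <= tol:
            break
        # Convex combination of the current iterate and the chosen vertex.
        gamma = 2.0 / (k + 2.0)
        x = [(1.0 - gamma) * xj for xj in x]
        x[best] += gamma * vertex_value
    return x
```

Because each iteration moves towards a single coordinate vertex, an iterate started at zero has at most as many non-zero entries as iterations performed, which is the intuition behind sparsity bounds for this family of methods.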
Graphs are widespread structures providing a powerful abstraction for modeling networked data. Large and complex graphs have emerged in various domains such as social networks, bioinformatics, and chemical data. However, current warehousing frameworks are not equipped to handle efficiently the multidimensional modeling and analysis of complex graph data. In this paper, we propose a novel framework for building OLAP cubes from graph data and analyzing the graph topological properties. The framework supports the extraction and design of the candidate multidimensional spaces in property graphs. Besides property graphs, a new database model tailored for multidimensional modeling and enabling the exploration of additional candidate multidimensional spaces is introduced. We present novel techniques for OLAP aggregation of the graph, and discuss the case of dimension hierarchies in graphs.
Furthermore, the architecture and the implementation of our graph warehousing framework are presented and show the effectiveness of our approach.
Amine Ghrab, Oscar Romero, Sabri Skhiri, Alejandro Vaisman, and Esteban Zimányi, A Framework for Building OLAP Cubes on Graphs, proceedings of the 19th East-European Conference on Advances in Databases and Information Systems, Poitiers, France, September 2015.
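As a rough illustration of what OLAP-style aggregation of a graph can look like, the sketch below collapses all nodes that share the same values on a chosen set of dimensions into one aggregate node and counts the edges between the resulting groups. This is a generic simplification, not the framework described in the paper: the function name, the dictionary-based graph representation, and the use of simple counts as measures are all assumptions.

```python
from collections import Counter


def aggregate_graph(nodes, edges, dimensions):
    """Roll a graph up along `dimensions`.

    nodes: dict mapping node id -> dict of attribute values
    edges: iterable of (source id, target id) pairs, treated as directed
    dimensions: tuple of attribute names kept in the aggregate view
    Returns (node_counts, edge_counts), both keyed by tuples of dimension values.
    """
    def group_of(node_id):
        # An aggregate node is identified by its values on the kept dimensions.
        return tuple(nodes[node_id][d] for d in dimensions)

    node_counts = Counter(group_of(n) for n in nodes)
    edge_counts = Counter((group_of(u), group_of(v)) for u, v in edges)
    return node_counts, edge_counts


# Hypothetical toy network with two dimensions per node.
people = {
    1: {"gender": "F", "city": "Brussels"},
    2: {"gender": "M", "city": "Brussels"},
    3: {"gender": "F", "city": "Paris"},
}
links = [(1, 2), (1, 3), (2, 3)]
by_city_nodes, by_city_edges = aggregate_graph(people, links, ("city",))
```

Calling the function with different dimension tuples, for example `("gender",)` or `("gender", "city")`, yields the different views that together make up a cube over the same network.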
We are witnessing the move towards data center operating systems (OS), where resources are unified and processing frameworks coexist with each other. In this context it has been shown that an iteration model with relaxed consistency such as the Stale Synchronous Parallel (SSP) model, while still guaranteeing convergence, is able to cope with the straggler problem for converging iterative algorithms. In this poster we present a model for the integration of the SSP model on a pipelined processing framework. We then apply the SSP on a distributed version of the Frank-Wolfe algorithm and empirically show its convergence under stress situations similar to those encountered in a data center OS.
Thomas Peel, and Nam-Luc Tran, Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism, poster at the Greed is Great ICML’15 Workshop, Lille, France, July 2015
In the context of the recent policies concerning anti-money laundering and counter terrorist financing defined by the Financial Action Task Force Recommendation 16, it is the responsibility of the financial institution to monitor the quality of the information present in wire transfers. To that end we present in this paper an approach to automate the monitoring and the validation of the information contained in interbank transfer messages. The approach is backed by a solution built around an event-driven architecture where the data is processed as a stream and transformed at each stage. This architecture is in line with the latest research in data warehouses with stream data processing. We show that our approach is suitable to the requirements and the standards in the banking industry.
Nam-Luc Tran, Analysis of Interbank Messages for the Enforcement of Financial Regulations, proceedings of Journées francophones sur les Entrepôts de Données et l’Analyse en ligne, Bruxelles, Belgium, April 2015.
Over the past years there has been significant enthusiasm for the development of parallel computing on Graphics Processing Units (GPUs), which have now become powerful and affordable hardware equipping data centers and research clusters. Our earlier research has explored ways to exploit the parallel compute performance of the GPU alongside the CPU in the same cluster. We have proposed a model for processing distributed machine learning tasks leveraging both the CPU and the GPU available on the nodes. Still in this direction, we present in this paper our approach for optimizing the performance of the previously proposed framework. We then further present our approach for integrating this processing model into a more general dataflow graph processing framework by extending it with support for GPU tasks and resources. In addition, we have developed a k-nearest neighbors implementation demonstrating all the features. We then present our model based on flow networks for efficient scheduling on this heterogeneous framework.
Nam-Luc Tran, Sabri Skhiri, Arnaud Schils, and Egar Isaac Hiroshi Leon Saiki, An Approach for Maximizing Performance on Heterogeneous Clusters of CPU and GPU. EURA NOVA technical series.
Graphs are a fundamental structure for modeling many real world domains and applications. They have emerged in various fields such as social, informational and transportation networks. The heterogeneity and dynamicity of these networks pose challenges to traditional techniques for data modeling, storage and analysis.
Managing graph-structured data using native graph structures and algorithms is the key for its efficient analysis. Therefore, the graph should be modeled using nodes and edges, and explored using graph algorithms, such as pattern matching and k-neighborhood.
In this paper, we introduce a novel model for management of graph data. The aim of our model is to provide analysts with a set of simple, well-defined, and adaptable components to perform complex graph modeling and analysis tasks.
Amine Ghrab, Oscar Romero, Sabri Skhiri, and Esteban Zimányi, Analytics-Aware Graph Database Modeling, EURA NOVA technical series.
In the context of processing high volumes of data, the recent developments have led to numerous models and frameworks of distributed processing running on clusters of commodity hardware. On the other side, the Graphics Processing Unit (GPU) has seen much enthusiastic development as a device for general-purpose intensive parallel computation. In this paper we propose a framework which combines both approaches and evaluates the relevance of having nodes in a distributed processing cluster that make use of GPU units for further fine-grained parallel processing. We have engineered parallel and distributed versions of two data mining problems, the naive Bayes classifier and the k-means clustering algorithm, to run on the framework and have evaluated the performance gain. Finally, we also discuss the requirements and perspectives of integrating GPUs in a distributed processing cluster, introducing a fully distributed heterogeneous computing cluster.
Nam-Luc Tran, Quentin Dugauthier, and Sabri Skhiri, A Distributed Data Mining Framework Accelerated with Graphics Processing Units, proceedings of the 2013 International Conference on Cloud Computing and Big Data (CloudCom-Asia), FuZhou, China, December 2013.
The importance of graphs as the fundamental structure underpinning many real world applications is no longer to be proved. Large graphs have emerged in various fields such as biological, social and transportation networks. The sheer volume of these networks poses challenges to traditional techniques for storage and analysis of graph data. In particular, OLAP analysis requires access to large portions of data to extract key information and to feed strategic decision making. OLAP provides multilevel, multiperspective views of the data. Most of the current techniques are optimized for centralized graph processing. A distributed approach providing horizontal scalability is required in order to handle the analysis workload.
In this paper, we focus on applying OLAP analysis on large, distributed graph data. We describe Distributed Graph Cube, our distributed framework for graph-based OLAP cubes computation and aggregation. Experimental results on large, real-world datasets demonstrate that our method significantly outperforms its centralized counterparts. We also evaluate the performance of both Hadoop and Spark for distributed cubes computations.
Benoît Denis, Amine Ghrab, and Sabri Skhiri, A Distributed Approach for Graph-Oriented Multidimensional Analysis, proceedings of the 2013 IEEE International Conference on Big Data, Santa Clara, CA, USA, October 2013.
Diverse applications including cyber security, social networks, protein networks, recommendation systems or citation networks work with inherently graph-structured data. The graphs modeling the data of these applications are large by nature, so processing them efficiently becomes challenging.
In this paper we present imGraph, a graph system that addresses the challenge of efficient processing of large graphs by using a distributed in-memory storage. We use this type of storage to obtain fast random data access which is mostly required for graph exploration. imGraph uses a native graph data model to ease the implementation of graph algorithms. On top of it, we design and implement a traversal engine that achieves high performance by efficient memory access, distribution of the workload, and optimizations on network communications. We run a set of experiments on real graph datasets of different sizes to assess the performance of imGraph in relation to other graph systems. The results show that imGraph gets better performance on traversals on large graphs than its counterparts.
Salim Jouili, and Aldemar Reynaga, imGraph: A distributed in-memory graph database, proceedings of the 2013 ASE/IEEE International Conference on Big Data, Washington D.C., USA, September 2013.
In recent years, more and more companies provide services that can no longer be delivered efficiently using relational databases. As such, these companies are forced to use alternative database models such as XML databases, object-oriented databases, document-oriented databases and, more recently, graph databases. Graph databases have only existed for a few years. Although there have been some comparison attempts, they are mostly focused on certain aspects only.
In this paper, we present a distributed graph database comparison framework and the results we obtained by comparing four important players in the graph databases market: Neo4j, OrientDB, Titan and DEX.
Salim Jouili, and Valentin Vansteenberghe, An empirical comparison of graph databases, proceedings of the 2013 ASE/IEEE International Conference on Big Data, Washington D.C., USA, September 2013.
Graphs are ubiquitous data structures commonly used to represent highly connected data. Many real-world applications, such as social and biological networks, are modeled as graphs. To answer the surge in demand for graph data management, many graph database solutions were developed. These databases are commonly classified as NoSQL graph databases, and they provide better support for graph data management than their relational counterparts. However, each of these databases implements its own operational graph data model, which differs among the products. Further, there is no commonly agreed conceptual model for graph databases.
In this paper, we introduce a novel conceptual model for graph databases. The aim of our model is to provide analysts with a set of simple, well-defined, and adaptable conceptual components to perform rich analysis tasks. These components take into account the evolving aspect of the graph. Our model is analytics-oriented, flexible and incremental, enabling analysis over evolving graph data. The proposed model provides a typing mechanism for the underlying graph, and formally defines the minimal set of data structures and operators needed to analyze the graph.
Amine Ghrab, Sabri Skhiri, Salim Jouili, and Esteban Zimányi, An Analytics-Aware Conceptual Model For Evolving Graphs, proceedings of the 15th International Conference on Data Warehousing and Knowledge Discovery – DaWak 2013, Prague, Czech Republic, August 2013.
Migrating services to the cloud brings all the benefits of elasticity, scalability and cost-cutting. However, migrating services among different cloud infrastructures or outside of the cloud is not an obvious task. In addition, distributing services among multiple cloud providers, or on a hybrid installation requires a custom implementation effort that must be repeated at each infrastructure change. This situation raises the lock-in problem and discourages cloud adoption. Cloud computing open standards were designed to face this situation and to bring interoperability and portability to cloud environments. However, they target isolated resources, and do not take into account the notion of complete services. In this paper, we introduce an extension to OCCI, a cloud computing open standard, in order to support complete service definition and management automation. We support this proposal with an open-source framework for service management through compliant cloud infrastructures.
Amine Ghrab, Sabri Skhiri, Hervé Kœner, and Guy Ledu, Towards A Standards-Based Cloud Service Manager, proceedings of the 3rd International Conference on Cloud Computing and Services Science, CLOSER 2013, Aachen, Germany, May 2013.
The development in computational processing has driven towards distributed processing frameworks performing tasks in parallel setups. The recent advances in Cloud Computing have widely contributed to this tendency. The MapReduce model proposed by Google is one of the most popular despite the well-known limitations inherent to the model which constrain the types of jobs that can be expressed. On the other hand models based on Data Flow Graphs (DFG) for the processing and the definition of the jobs, while more complex to express, are more general and suitable for a wider range of tasks, including iterative and pipelined tasks. In this paper we present AROM, a framework for large scale distributed processing based on DFG to express the jobs and which uses paradigms from functional programming to define the operators. The former leads to more natural handling of pipelined tasks while the latter enhances genericity and reusability of the operators, as shown by our tests on a parallel and pipelined job performing the calculation of PageRank.
Nam-Luc Tran, Sabri Skhiri, Esteban Zimányi, and Arthur Lesuisse. AROM: Processing Big Data With Data Flow Graphs and Functional Programming, proceedings of the 4th IEEE International Conference on Cloud Computing Technology and Science, IEEE CloudCom 2012. IEEE Computer Society Press, Taipei, Taiwan, December 2012.
With the recent growth of graph-based data, large graph processing is becoming more and more important. In order to explore and extract knowledge from such data, graph mining methods, like community detection, are a necessity. Legacy graph processing tools mainly rely on single-machine computational capacity, which cannot process large graphs with billions of nodes. Therefore, the main challenge of new tools and frameworks lies in the development of new paradigms that are scalable, efficient and flexible. In this paper, we review the new paradigms of large graph processing and their applications to graph mining domains using the distributed, shared-nothing approach used for large data by internet players.
Sabri Skhiri, and Salim Jouili, Large Graph Mining: Recent Developments, Challenges and Potential Solutions, presentation during the European Business Intelligence Summer School (eBISS 2012) organized by the Université Libre de Bruxelles and the Ecole Centrale Paris, Brussels, Belgium, July 2012.
The use of trust in recommender systems has been shown to improve the accuracy of rating predictions, especially in the case where a user’s rating significantly differs from the average. Different techniques have been used to incorporate trust into recommender systems, each showing encouraging results. However, the lack of trust information available in public datasets has limited the empirical analysis of these techniques and trust-based recommendation in general, with most analysis limited to a single dataset.
In this paper, we provide a more complete empirical analysis of trust-based recommendation. By making use of a method that infers trust between users in a social graph, we are able to apply trust-based recommendation techniques to three separate datasets. From this, we measure the overall accuracy of each technique in terms of the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE) as well as measuring the prediction coverage of each technique. We thus provide a comparison and analysis of each technique on all three datasets.
Daire O’Doherty, Salim Jouili, and Peter Van Roy, Trust-based recommendation: an empirical analysis, proceedings of the 6th ACM SIGKDD Workshop on Social Network Mining and Analysis SNA-KDD, Beijing, China, ACM, July 2012.
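The three evaluation measures named in the abstract above can be computed as in the following minimal sketch. The function name and the convention that a missing prediction is represented by `None` are assumptions made for illustration; the paper's own evaluation code is not shown here.

```python
import math


def evaluate_predictions(actual, predicted):
    """Return (MAE, RMSE, coverage) for a set of rating predictions.

    actual: dict mapping a (user, item) key -> true rating
    predicted: dict mapping the same keys -> predicted rating, or None when
               the technique could not produce a prediction (assumed convention)
    """
    covered = [(actual[k], predicted[k]) for k in actual
               if predicted.get(k) is not None]
    if not covered:
        raise ValueError("no rating could be predicted")
    # Coverage: fraction of ratings for which a prediction exists.
    coverage = len(covered) / len(actual)
    errors = [p - a for a, p in covered]
    # MAE averages absolute errors; RMSE penalizes large errors more heavily.
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse, coverage
```

Reporting coverage next to MAE and RMSE matters because a technique can look accurate simply by declining to predict the hard cases.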
The emergence of trust as a key link between users in social networks has provided an effective means of enhancing the personalization of online user content. However, the availability of such trust information remains a challenge to the algorithms that use it, as the majority of social networks do not provide a means of explicit trust feedback. This paper presents an investigation into the inference of trust relations between actor pairs of a social network, based solely on the structural information of the bipartite graph typical of most on-line social networks. Using intuition inspired from real life observations, we argue that the popularity of an item in a social graph is inversely related to the level of trust between actor pairs who have rated it. From an existing bipartite social graph, this method computes a new social graph, linking actors together by means of symmetric weighted trust relations. Through a set of experiments performed on a real social network dataset, our method produces statistically significant results, showing strong trust prediction accuracy.
Daire O’Doherty, Salim Jouili, and Peter Van Roy, Towards trust inference in bipartite social networks, proceedings of the 2nd ACM SIGMOD Workshop on Databases and Social Networks, DBSocial 2012, Scottsdale, USA, ACM, June 2012.
In this paper, we introduce a novel method for graph indexing. We propose a hypergraph-based model for graph data sets by allowing cluster overlapping. More precisely, in this representation one graph can be assigned to more than one cluster. Using the concept of the graph median and a given threshold, the proposed algorithm automatically detects the number of classes in the graph database. We consider clusters as hyperedges in our hypergraph model and we index the graph set by the hyperedge centroids. This model is well suited to traversing the data set and efficient for retrieving graphs.
Salim Jouili, and Salvatore Tabbone, Hypergraph-based image retrieval for graph-based representation. Journal of the Pattern Recognition Society, April 2012. © 2012 Elsevier Ltd.
With the emergence of cloud computing, on-demand resource usage is made possible. This allows applications to elastically scale out according to the load. One design pattern that suits this paradigm is the event-driven architecture (EDA) in which messages are sent asynchronously between distributed application instances using message queues. However, existing message queues are only able to scale for a certain number of clients and are not able to scale out elastically. We present the Elastic Queue Service (EQS), an elastic message queue architecture and a scaling algorithm which can be adapted to any message queue in order to make it scale elastically. The EQS architecture is layered onto multiple distributed components and its management components can be integrated with the cloud infrastructure management. We have implemented a prototype of EQS and deployed it on a cloud infrastructure. A series of load tests has validated our elastic scaling algorithm and shows that EQS is able to scale out in order to adapt to an applied load. We then discuss the elastic scaling of the management layers of EQS and their possible integration with the cloud infrastructure management.
Nam-Luc Tran, Sabri Skhiri, and Esteban Zimányi, EQS: An Elastic and Scalable Message Queue for the Cloud, proceedings of the 3rd International IEEE conference on Cloud computing technology and science (IEEE CloudCom 2011), Athens, Greece, November 2011.
SWIFT is a member-owned cooperative providing secure messaging capabilities to the financial services industry. One critical mission of SWIFT is the standardization of the message flows between the industry players. The model-driven approach naturally came as a solution to the management of these message definitions. However, one of the most important challenges that SWIFT has been facing is the global governance of the message repository and the management of each element. Nowadays modeling tools exist but none of them enables the management of the complete life-cycle of the message models. In this paper we present the challenges that SWIFT had to face in the development of a dedicated platform.
Sabri Skhiri, Marc Delbaere, Yves Bontemps, Grégoire de Hemptinne, and Nam-Luc Tran, Governance issues on heavy models in an industrial context. Advances in Conceptual Modeling. Recent Developments and New Directions ER 2011, Brussels, Belgium, November 2011.
The rise of the Internet and the multiplication of data sources have multiplied the number of “Bigdata” storage problems. These data sets are not only very big but also tend to grow very fast, sometimes in a short period. Distributed databases that work well for such data sets need to be not only scalable but also elastic to ensure a fast response to growth in demand of computing power or storage. The goal of this article is to present measurement results that characterize the elasticity of three databases. We have chosen Cassandra, HBase, and mongoDB as three representative popular horizontally scalable NoSQL databases that are in production use. We have made measurements under realistic loads up to 48 nodes, using the Wikipedia database to create our dataset and using the Rackspace cloud infrastructure. We define precisely our methodology and we introduce a new dimensionless measure for elasticity to allow uniform comparisons of different databases at different scales. Our results show clearly that the technical choices taken by the databases have a strong impact on the way they react when new nodes are added to the clusters.
Thibault Dory, Boris Mejías, Peter Van Roy, and Nam-Luc Tran, Measuring Elasticity for Cloud Databases, proceedings of the Cloud Computing 2011 (Second International Conference on Cloud Computing, GRIDs, and Virtualization), Rome, Italy, September 2011.