Graph BI & Analytics: Current State and Future Challenges

Our paper “Graph BI & Analytics: Current State and Future Challenges” has been accepted for publication at the 20th International Conference on Big Data Analytics and Knowledge Discovery, taking place in Regensburg, Germany.

The paper presents the state of the art of graph BI & analytics, with a focus on graph warehousing. We survey the topics of graph modelling, management, querying, and processing in graph warehouses. We conclude by discussing future research directions: solving complex graph problems, building native graph components, and developing intelligent techniques that assist end users in building and analysing the graph.

More importantly, the paper calls for the development of intelligent, efficient, industry-grade graph data warehousing systems that support structure-driven management and analytics of data. While it follows a template similar to that of traditional BI systems, the graph BI presented here extends current systems with graph analytics capabilities that deliver graph-derived insights.

The paper was presented in September at DaWaK 2018; you can now find the full version here. If you wish to go deeper into the subject, don’t hesitate to contact our research department at research@euranova.eu.

Abstract. In an increasingly competitive market, making well-informed decisions requires the analysis of a wide range of heterogeneous, large and complex data. This paper focuses on the emerging field of graph warehousing. Graphs are widespread structures that yield a great expressive power. They are used for modeling highly complex and interconnected domains, and efficiently solving emerging big data applications. This paper presents the current status and open challenges of graph BI and analytics, and motivates the need for new warehousing frameworks aware of the topological nature of graphs. We survey the topics of graph modeling, management, processing and analysis in graph warehouses. Then we conclude by discussing future research directions and positioning them within a unified architecture of a graph BI & analytics framework.

Amine Ghrab, Oscar Romero, Salim Jouili, Sabri Skhiri, Graph BI & Analytics: Current State and Future Challenges. DaWaK 2018, 3-18

Discovering Interesting Patterns in Large Graph Cubes

Due to the increasing importance and volume of highly interconnected data, such as in social or information networks, a plethora of graph mining techniques have been designed to enable the analysis of such data. In this work, we focus on the mining of associations between entity features in networks. We model each entity feature as a dimension to be analyzed. Consequently, we build our approach on top of the existing graph cube framework, which is an extension of the concept of the data cube to networks. Our task is particularly challenging because it requires the analysis of both the initial multidimensional network and all its subsequent aggregate forms. As soon as we deal with a big data situation, it is impossible for an analyst to manually consider all the possible views of the network data. The aim of this work is to design an algorithm for the discovery of interesting patterns in large graph cubes. Thus, instead of examining all the possible aggregations manually, the proposed technique leads the analyst to the interesting associations or patterns in the multidimensional network. Furthermore, we study the application of existing algorithms from the frequent itemset mining literature on graph data and propose a mapping between the two settings.
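To make the idea of mapping graph data to an itemset setting more concrete, here is a minimal sketch under our own assumptions. It is not the mapping proposed in the paper: the names `edge_transactions` and `frequent_itemsets`, the `("src", feature, value)` item encoding, and the brute-force support counting (rather than an optimised miner such as Apriori) are all hypothetical choices made for illustration. Each edge becomes one transaction holding the feature values of its two endpoints, and feature values are assumed to be strings so that items can be sorted.

```python
from collections import Counter
from itertools import combinations


def edge_transactions(nodes, edges, features):
    """Turn each edge (u, v) into a transaction of endpoint feature items.

    Hypothetical encoding: ("src", feature, value) and ("dst", feature, value).
    """
    for u, v in edges:
        items = [("src", f, nodes[u][f]) for f in features]
        items += [("dst", f, nodes[v][f]) for f in features]
        yield frozenset(items)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those meeting min_support.

    Brute-force enumeration, suitable only for small illustrative inputs.
    """
    counts = Counter()
    for transaction in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(transaction), size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Toy network: three people and three directed edges.
people = {1: {"gender": "F"}, 2: {"gender": "M"}, 3: {"gender": "F"}}
links = [(1, 2), (3, 2), (1, 3)]
patterns = frequent_itemsets(edge_transactions(people, links, ("gender",)), 2)
```

In this toy example, an itemset such as "source is F and target is M" is kept when it occurs on at least two edges. Repeating the count on aggregated views of the network is what makes the real problem expensive.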

Florian Demesmaeker, Amine Ghrab, Siegfried Nijssen, Sabri Skhiri: Discovering interesting patterns in large graph cubes. 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 2017, pp. 3322-3331.

Click here to access the paper.

Distributed Frank-Wolfe under pipelined stale synchronous parallelism

Iterative-convergent algorithms represent an important family of applications in big data analytics. These are typically run on distributed processing frameworks deployed on a cluster of machines. On the other hand, we are witnessing the move towards data center operating systems (OS), where resources are unified by a resource manager and processing frameworks coexist with each other. In this context, different processing framework job tasks can be scheduled on the same machine and slow down a worker (straggler problem). Existing work has shown that an iteration model with relaxed consistency such as the Stale Synchronous Parallel (SSP) model, while still guaranteeing convergence, is able to cope with stragglers. In this paper we propose a model for the integration of the SSP model on a pipelined distributed processing framework. We then apply SSP on a distributed version of the Frank-Wolfe algorithm. We theoretically show its sparsity bounds and convergence under SSP. Finally, we experimentally show that the Frank-Wolfe algorithm applied on LASSO regression under SSP is able to converge faster than its BSP counterpart, especially under load conditions similar to those encountered in a data center OS.
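For readers unfamiliar with Frank-Wolfe, the following is a minimal single-machine sketch of the algorithm applied to LASSO in its constrained form (least squares subject to an l1-norm bound). It does not include the distribution, pipelining, or SSP aspects studied in the paper. The function name `frank_wolfe_lasso`, the bound `tau`, and the standard step size 2/(k+2) are our assumptions for illustration.

```python
def frank_wolfe_lasso(A, b, tau, iters=50):
    """Minimise 0.5 * ||A x - b||^2 subject to ||x||_1 <= tau.

    A is a list of rows, b a list of targets. Pure-Python sketch.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols  # the origin is always feasible
    for k in range(iters):
        # Residual r = A x - b and gradient g = A^T r.
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
             for i in range(n_rows)]
        g = [sum(A[i][j] * r[i] for i in range(n_rows))
             for j in range(n_cols)]
        # Linear minimisation over the l1 ball picks one signed vertex:
        # the coordinate with the largest absolute gradient.
        j_star = max(range(n_cols), key=lambda j: abs(g[j]))
        vertex = -tau if g[j_star] > 0 else tau
        # Convex combination of the current iterate and the vertex.
        gamma = 2.0 / (k + 2)
        x = [(1.0 - gamma) * xj for xj in x]
        x[j_star] += gamma * vertex
    return x


# Toy problem: identity design matrix, targets (2, 1), l1 budget of 1.
solution = frank_wolfe_lasso([[1.0, 0.0], [0.0, 1.0]], [2.0, 1.0], tau=1.0)
```

Because each iteration moves towards a single vertex of the l1 ball, the iterate gains at most one new non-zero coordinate per step, which is the intuition behind sparsity guarantees for this family of methods.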

Nam-Luc Tran, Thomas Peel, Sabri Skhiri, Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism, proceedings of the 2015 IEEE Conference on Big Data, November 2015, Santa Clara, CA, USA.

Click here to access the paper in its preprint form.

A framework for building OLAP cubes on graphs

Graphs are widespread structures providing a powerful abstraction for modeling networked data. Large and complex graphs have emerged in various domains such as social networks, bioinformatics, and chemical data. However, current warehousing frameworks are not equipped to handle efficiently the multidimensional modeling and analysis of complex graph data. In this paper, we propose a novel framework for building OLAP cubes from graph data and analyzing the graph topological properties. The framework supports the extraction and design of the candidate multidimensional spaces in property graphs. Besides property graphs, a new database model tailored for multidimensional modeling and enabling the exploration of additional candidate multidimensional spaces is introduced. We present novel techniques for OLAP aggregation of the graph, and discuss the case of dimension hierarchies in graphs.
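As an illustration of what OLAP-style aggregation of a graph can look like, here is a minimal sketch under our own assumptions. It shows only the general idea of grouping nodes by chosen dimensions and counting the edges between groups; it is not the aggregation technique of the paper, and the function name `aggregate_graph` and the data layout are hypothetical.

```python
from collections import Counter


def aggregate_graph(nodes, edges, dims):
    """Summarise a graph along the given node attributes (dimensions).

    nodes: dict mapping node id -> dict of attribute values
    edges: iterable of (source id, target id) pairs
    dims:  tuple of attribute names to group by
    Returns (node_counts, edge_counts), both keyed by dimension values.
    """
    def group_of(node_id):
        return tuple(nodes[node_id][d] for d in dims)

    node_counts = Counter(group_of(n) for n in nodes)
    edge_counts = Counter((group_of(u), group_of(v)) for u, v in edges)
    return node_counts, edge_counts


# Toy social network with two candidate dimensions.
users = {
    1: {"gender": "F", "city": "Brussels"},
    2: {"gender": "M", "city": "Brussels"},
    3: {"gender": "F", "city": "Paris"},
}
follows = [(1, 2), (2, 3), (1, 3)]
by_gender = aggregate_graph(users, follows, ("gender",))
by_city = aggregate_graph(users, follows, ("city",))
```

Each choice of `dims` yields one aggregated view of the network. Enumerating these views over all subsets of dimensions is what gives rise to a cube.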

Furthermore, the architecture and the implementation of our graph warehousing framework are presented and show the effectiveness of our approach.

Amine Ghrab, Oscar Romero, Sabri Skhiri, Alejandro Vaisman, and Esteban Zimanyi, A Framework for Building OLAP Cubes on Graphs, proceedings of the 19th East-European Conference on Advances in Databases and Information Systems, Poitiers, France, September 2015.

Click here to access the paper in its preprint form.

Analysis of interbank messages for the enforcement of financial regulations

In the context of the recent policies concerning anti-money laundering and counter terrorist financing defined by the Financial Action Task Force Recommendation 16, it is the responsibility of the financial institution to monitor the quality of the information present in wire transfers. To that end we present in this paper an approach to automate the monitoring and the validation of the information contained in interbank transfer messages. The approach is backed by a solution built around an event-driven architecture where the data is processed as a stream and transformed at each stage. This architecture is in line with the latest research in data warehouses with stream data processing. We show that our approach is suitable to the requirements and the standards in the banking industry.

Nam-Luc Tran, Analysis of Interbank Messages for the Enforcement of Financial Regulations, proceedings of Journées francophones sur les Entrepôts de Données et l’Analyse en ligne, Bruxelles, Belgium, April 2015.

Click here to access the paper.

An approach for maximizing performance on heterogeneous clusters of CPU and GPU

Over the past years there has been significant enthusiasm for the development of parallel computing on Graphics Processing Units (GPUs), which have now become powerful and affordable hardware equipping data centers and research clusters. Our earlier research has explored ways to exploit the parallel compute performance of the GPU alongside the CPU in the same cluster. We have proposed a model for processing distributed machine learning tasks leveraging both the CPU and the GPU available on the nodes. Still in this direction, we present in this paper our approach for optimizing the performance of the previously proposed framework. We then further present our approach for integrating this processing model into a more general dataflow graph processing framework by extending it with support for GPU tasks and resources. In addition, we have developed a k-nearest neighbors implementation demonstrating all the features. We then present our model based on flow networks for efficient scheduling on this heterogeneous framework.

Nam-Luc Tran, Sabri Skhiri, Arnaud Schils, and Egar Isaac Hiroshi Leon Saiki, An Approach for Maximizing Performance on Heterogeneous Clusters of CPU and GPU. EURA NOVA technical series.

Click here to access the paper.

Analytics-aware graph database modeling

Graphs are a fundamental structure for modeling many real world domains and applications. They have emerged in various fields such as social, informational and transportation networks. The heterogeneity and dynamicity of these networks pose challenges to traditional techniques for data modeling, storage and analysis.

Managing graph-structured data using native graph structures and algorithms is key to its efficient analysis. Therefore, the graph should be modeled using nodes and edges, and explored using graph algorithms, such as pattern matching and k-neighborhood.

In this paper, we introduce a novel model for management of graph data. The aim of our model is to provide analysts with a set of simple, well-defined, and adaptable components to perform complex graph modeling and analysis tasks.
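As a small illustration of the k-neighborhood exploration mentioned in the abstract, here is a minimal sketch based on breadth-first search over an adjacency list. The function name `k_neighborhood` and the dictionary-based graph representation are our assumptions for illustration and are not taken from the paper's model.

```python
from collections import deque


def k_neighborhood(adjacency, source, k):
    """Return the set of nodes reachable from source in at most k hops.

    adjacency: dict mapping node -> iterable of neighbouring nodes
    The source itself is excluded from the result.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node for node, hops in distance.items() if hops > 0}


# Toy chain a -> b -> c -> d.
chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
two_hops = k_neighborhood(chain, "a", 2)
```

Working directly on nodes and edges in this way avoids the repeated self-joins that the same query would need on a relational edge table.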

Amine Ghrab, Oscar Romero, Sabri Skhiri, and Esteban Zimanyi, Analytics-Aware Graph Database Modeling, EURA NOVA technical series.

Click here to access the paper.

A distributed data mining framework accelerated with graphics processing units

In the context of processing high volumes of data, recent developments have led to numerous models and frameworks of distributed processing running on clusters of commodity hardware. On the other hand, the Graphics Processing Unit (GPU) has seen much enthusiastic development as a device for general-purpose intensive parallel computation. In this paper we propose a framework which combines both approaches and evaluates the relevance of having nodes in a distributed processing cluster that make use of GPUs for further fine-grained parallel processing. We have engineered parallel and distributed versions of two data mining problems, the naive Bayes classifier and the k-means clustering algorithm, to run on the framework and have evaluated the performance gain. Finally, we also discuss the requirements and perspectives of integrating GPUs in a distributed processing cluster, introducing a fully distributed heterogeneous computing cluster.

Nam-Luc Tran, Quentin Dugauthier, and Sabri Skhiri, A Distributed Data Mining Framework Accelerated with Graphics Processing Units, proceedings of the 2013 International Conference on Cloud Computing and Big Data (CloudCom-Asia), Fuzhou, China, December 2013.

Click here to access the paper in its preprint form.

A distributed approach for graph-oriented multidimensional analysis

The importance of graphs as the fundamental structure underpinning many real world applications no longer needs to be proved. Large graphs have emerged in various fields such as biological, social and transportation networks. The sheer volume of these networks poses challenges to traditional techniques for storage and analysis of graph data. In particular, OLAP analysis requires access to large portions of data to extract key information and to feed strategic decision making. OLAP provides multilevel, multiperspective views of the data. Most of the current techniques are optimized for centralized graph processing. A distributed approach providing horizontal scalability is required in order to handle the analysis workload.

In this paper, we focus on applying OLAP analysis on large, distributed graph data. We describe Distributed Graph Cube, our distributed framework for graph-based OLAP cube computation and aggregation. Experimental results on large, real-world datasets demonstrate that our method significantly outperforms its centralized counterparts. We also evaluate the performance of both Hadoop and Spark for distributed cube computation.

 

Benoît Denis, Amine Ghrab, and Sabri Skhiri, A Distributed Approach for Graph-Oriented Multidimensional Analysis, proceedings of the 2013 IEEE International Conference on Big Data, Santa Clara, CA, USA, October 2013.

Click here to access the paper.