Large Graph Mining

Data science

August 8, 2012

End of July, I was presenting a lecture to the European Business Intelligence Summer School organized by the Univeristé Libre de Bruxelles and the Ecole Centrale Paris. I presented a lecture on large graph mining. In this post I will quickly introduce this fascinating topic.

Why do we need graphs?

Nowadays new emergent industrial needs lead to deal with structured, linked and heterogeneous data instead of traditional multi-dimensional models. This kind of structured dataset is especially suited to be modelled as a set of objects that can be linked in a numerous ways. The greater expressive power of the graph encourages their use in extremely diverse domains. For example, in biology, the biochemical networks such as the metabolic pathways and the genetic regulation known as the transduction signal networks constitute a significant graph of interactions [1,2]. The graphs are used in chemical data processing in a way that the molecule structure is described as a graph which implies that the molecule catalogs are processed as graph set. In the Internet area, the rise of social networks showed the need to model the social interactions as graphs and how to exploit them, either in the recommendation area [3] or in term of influence management [5] for viral marketing or product adoption. We can find another examples in credit card fraud detection in which transactions are modelled as a bipartite graph of users and vendors. There are many more examples in which the graph structure is becoming a constraint to model objects efficiently.

Therefore, the question to ask is whether we can apply directly legacy data mining algorithms on linked structures. The answer is not directly. The main issues reside in

Defining a “graph-aware” concept of distance & similarity measure
Adapting the mining algorithm to structural nature of the graph
A Scalable processing. Most of the graphs we consider as large are composed by hundred million of Nodes and billion of edges, the processing must be highly scalable

The scalability issue is a real challenge and is much more complicated than just using a graphDB as Neo4J. The storage must be able to store the amount of data and the processing layer must be deeply integrated within the storage for a maximum of optimization. Different approaches have emerged these last years as the high performance computing – In memory approach and the distributed approach based on Bulk Synch. Processing (BSP). I described the last approach.

In this lecture, I first introduce 2 data mining algorithms: (1) page rank a typical graph mining algorithm and (2) a typical clustering algorithm that needs to be re-designed to be graph aware. Afterward I briefly introduce the distributed processing approach and describe Giraph, the open source implementation of Google Pregel. Then, I explain how we can implement the algorithms of the first section using this kind of BSP frameworks.

Graph Data warehouse

The second section of the lecture is dedicated to the Graph Data warehouse challenge. I wanted to spend an entire chapter on this important topic. Indeed, as soon as you can provide efficient graph mining stack, the next request you will receive is “Ok that is great, but can you please store this graph coming from the CDRs, this one coming from our Social Network and the user profile data I got … and please can you merge the same users between those graphs ? and it would be nice if you can mine this aggregated graph for finding the influencers and the churners, then we can retrieve the influencers living near from the potential churners …” As you can see those requests are typical mining and warehouse operations. However, as the graph model does not really fit with relational DB, we have to build a kind of equally functional graph data warehouse. As a result I dedicated one chapter to describe the research challenges in this area.

A special issue containing the revised lecture material of the summer school will be published in the series Lecture Notes in Business Information Processing (LNBIP) from Springer. The presentation deck of the lecture can be downloaded from the EBISS web site.

References

[1] Faust, K., Dupont, P., Callut, J. and van Helden, J. (2010). Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics 26:1211-8

[2] Van Helden, J. Wernisch, L., Gilbert, D., and Wodak, S. (2002). Graph-based analysis of metabolic networks. in Ernst Schering Research Foundation Workshop Volume 38: Bioinformatics and Genome Analysis. Editors: H.-W. Mewes, B. Weiss, H. Seidel Springer-Verlag, Berlin Heidelberg 2002.

[3] Daire O’Doherty, Salim Jouili, Peter Van Roy: Towards trust inference from bipartite social networks. DBSocial 2012: 13-18

[4] The Biocyc Database, http://biocyc.org/

[5] Yaron Singer (2012). How to win friends and influence people, truthfully: Influence maximization mechanisms for Social Networks, in Proceedings of the fifth ACM international conference on Web search and data mining – WSDM ’12 p. 733

Releated Posts

Insights from GTC Paris 2025

25.06.2025 / Engineering / Blog, Event

Among the NVIDIA GTC Paris crowd was our CTO Sabri Skhiri, and from quantum computing breakthroughs to the full-stack AI advancements powering industrial digital twins and robotics, there is a lot to share! Explore with Sabri GTC 2025 trends, keynotes, and what it means for businesses looking to innovate.

Development & Evaluation of Automated Tumour Monitoring by Image Registration Based on 3D (PET/CT) Images

23.05.2025 / Engineering / Academic collaborations, Papers

Tumor tracking in PET/CT is essential for monitoring cancer progression and guiding treatment strategies. Traditionally, nuclear physicians manually track tumors, focusing on the five largest ones (PERCIST criteria), which is both time-consuming and imprecise. Automated tumor tracking can allow matching of the numerous metastatic lesions across scans, enhancing tumor change monitoring.

Large Graph Mining

Why do we need graphs?

Graph Data warehouse

References

Releated Posts

Insights from GTC Paris 2025

Development & Evaluation of Automated Tumour Monitoring by Image Registration Based on 3D (PET/CT) Images

Recent Posts

Insights from GTC Paris 2025

Development & Evaluation of Automated Tumour Monitoring by Image Registration Based on 3D (PET/CT) Images

Insights from Data & AI Tech Summit Warsaw 2025

Insights From Flink Forward 2024

Tracks

Mjolnir

Rune

Vadgelmir

Yggdrasil

Field of expertises

Data architecture

Data governance

Data science

Engineering

Academic collaboration

SERVE

Expertise

CRAFT

digazu

CONTACT

Belgium

France

Tunisia

CAREER

Job Offers

Social media