End of July, I was presenting a lecture to the European Business Intelligence Summer School organized by the Univeristé Libre de Bruxelles and the Ecole Centrale Paris. I presented a lecture on large graph mining. In this post I will quickly introduce this fascinating topic.
Why do we need graphs?
Nowadays new emergent industrial needs lead to deal with structured, linked and heterogeneous data instead of traditional multi-dimensional models. This kind of structured dataset is especially suited to be modelled as a set of objects that can be linked in a numerous ways. The greater expressive power of the graph encourages their use in extremely diverse domains. For example, in biology, the biochemical networks such as the metabolic pathways and the genetic regulation known as the transduction signal networks constitute a significant graph of interactions [1,2]. The graphs are used in chemical data processing in a way that the molecule structure is described as a graph which implies that the molecule catalogs are processed as graph set. In the Internet area, the rise of social networks showed the need to model the social interactions as graphs and how to exploit them, either in the recommendation area [3] or in term of influence management [5] for viral marketing or product adoption. We can find another examples in credit card fraud detection in which transactions are modelled as a bipartite graph of users and vendors. There are many more examples in which the graph structure is becoming a constraint to model objects efficiently.
Therefore, the question to ask is whether we can apply directly legacy data mining algorithms on linked structures. The answer is not directly. The main issues reside in
- Defining a “graph-aware” concept of distance & similarity measure
- Adapting the mining algorithm to structural nature of the graph
- A Scalable processing. Most of the graphs we consider as large are composed by hundred million of Nodes and billion of edges, the processing must be highly scalable
The scalability issue is a real challenge and is much more complicated than just using a graphDB as Neo4J. The storage must be able to store the amount of data and the processing layer must be deeply integrated within the storage for a maximum of optimization. Different approaches have emerged these last years as the high performance computing – In memory approach and the distributed approach based on Bulk Synch. Processing (BSP). I described the last approach.
In this lecture, I first introduce 2 data mining algorithms: (1) page rank a typical graph mining algorithm and (2) a typical clustering algorithm that needs to be re-designed to be graph aware. Afterward I briefly introduce the distributed processing approach and describe Giraph, the open source implementation of Google Pregel. Then, I explain how we can implement the algorithms of the first section using this kind of BSP frameworks.
Graph Data warehouse
The second section of the lecture is dedicated to the Graph Data warehouse challenge. I wanted to spend an entire chapter on this important topic. Indeed, as soon as you can provide efficient graph mining stack, the next request you will receive is “Ok that is great, but can you please store this graph coming from the CDRs, this one coming from our Social Network and the user profile data I got … and please can you merge the same users between those graphs ? and it would be nice if you can mine this aggregated graph for finding the influencers and the churners, then we can retrieve the influencers living near from the potential churners …” As you can see those requests are typical mining and warehouse operations. However, as the graph model does not really fit with relational DB, we have to build a kind of equally functional graph data warehouse. As a result I dedicated one chapter to describe the research challenges in this area.
A special issue containing the revised lecture material of the summer school will be published in the series Lecture Notes in Business Information Processing (LNBIP) from Springer. The presentation deck of the lecture can be downloaded from the EBISS web site.
References
[1] Faust, K., Dupont, P., Callut, J. and van Helden, J. (2010). Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics 26:1211-8
[2] Van Helden, J. Wernisch, L., Gilbert, D., and Wodak, S. (2002). Graph-based analysis of metabolic networks. in Ernst Schering Research Foundation Workshop Volume 38: Bioinformatics and Genome Analysis. Editors: H.-W. Mewes, B. Weiss, H. Seidel Springer-Verlag, Berlin Heidelberg 2002.
[3] Daire O’Doherty, Salim Jouili, Peter Van Roy: Towards trust inference from bipartite social networks. DBSocial 2012: 13-18
[4] The Biocyc Database, http://biocyc.org/
[5] Yaron Singer (2012). How to win friends and influence people, truthfully: Influence maximization mechanisms for Social Networks, in Proceedings of the fifth ACM international conference on Web search and data mining – WSDM ’12 p. 733