In this blog post we briefly describe our new contribution to the big data domain, especially, graph storage and querying. This work was accepted for publication at 2013 ASE/IEEE International Conference on Big Data.
Big Data has to deal with two key issues: the growing size of the datasets and the increase of data complexity. Alternative database models such as graph databases are more and more used to address this second problem. Indeed, graphs can be used to model many interesting problems. In fact, managing graph-structured data has gained significant attention in both industrial and research communities (see, e.g., a survey ). This is due to the emergence of modern applications whose data structure is naturally represented as graphs in areas such as social networks, transportation, biological networks and semantic webs. In fact, the graph structure provides a flexible representation for modeling highly-connected and dynamic data.
The managing and processing of large graphs can be very challenging because the huge amount of data causes a high I/O latency. As data access on graphs generally requires random data access (no locality), there is a high number of I/O operations in traversal processing. Although memory caching can be used to reduce the number of I/O operations, the no locality of data access on graphs reduces the efficiency of those caches on large graphs because the cache contents will be frequently modified so the database engine will have to perform several data reads on disk.
Eura Nova contribution
Having these challenges in mind, we introduce a new graph database system called imGraph. We have considered the random access requirement for large graphs as a key factor on deciding the type of storage. Then, we have designed a graph database where all data is stored in memory so the speed of random access is maximized. However, as large graphs can not be completely loaded in the RAM of a single machine, we designed imGraph as distributed graph database. That is, the vertices and the edges are partitioned into subsets, and each subset is located in the memory of one machine belonging to the involved machines (see the following figure). Furthermore, we implemented on imGraph a graph traversal engine that takes advantage of distributed parallel computing and fast in-memory random access to gain performance.
The next figure shows the flow of messages among the machines of the cluster when a traversal is processed. The flow starts with a traversal request message sent from the client to the traversal manager. Then, the traversal manager sends a search request message to Machine B which in turn sends search request messages to machines A and C. Then, machines A and C process the received request messages, machine A sends an additional search request message to machine C. Later, all the machines send response message to the traversal manager, one response message is sent for each request message processed. Finally, the traversal manager sends a traversal response to the client.
The experiments were performed using four graph datasets: soc-Epinions (75K vertices and 508K edges) – a directed graph representing the who-trust-whom online social network of a general consumer review site Epinions.com; youTube (1.1M vertices and 3M edges) – an undirected graph representing the Youtube social network; ukgraph (4.5M vertices and 66M edges) – a directed graph representing a segment of a snapshot of the web network in UK; and soc-LiveJournal (4.8M vertices and 69M edges) – a directed graph representing LiveJournal (a free online community with almost 10 million members with a significant fraction highly active members).
Next figure shows the results (log-scale) of the traversal tests performed on the four datasets: the average execution time of 200 traversals at maximum 3 hops between two vertices having at least one path. We compare our results achieved in 5 and 10 machines to Neo4j and Titan in 5 and 10 machines.
If we compare the results of imGraph and Neo4J we can observe that Neo4J is faster than our implementation only when the smallest dataset (75K vertices) is tested; when larger datasets are tested, imGraph gets better results than Neo4J (up to x200 faster for ukgraph). We observe that Neo4J’s performance is more negatively affected by the large graphs than imGraph’s performance. When the graph is not too large, Neo4J is able to load a considerable part of it into the cache so the traversal will be done almost in memory. However, if the graph is larger, Neo4J will be able to load in the cache only a small portion of the graph so it will have to read from the disk more frequently.
Regarding Titan and imGraph, both using a distributed storage, we can observe that imGraph performs better (up to x150 faster for soc-LiveJournal) in all the traversal configurations for all the datasets. In this case, imGraph’s distributed parallel computing makes the difference. As Titan traversals run on a single machine there is a considerable amount of network communication to obtain graph data; furthermore, the computation power of Titan is limited to the capacity of a single machine. Another factor that makes imGraph perform better is the faster random access provided by the imGraph’s in-memory storage. Although Cassandra provides a caching functionality, this cache cannot store all data from a large graph so many disk reads will be required.
imGraph is a distributed graph database fully in-memory and optimized for complicated graph operations (e.g. Traversals). It provides very promising results comparing to alternative graphDB solutions, it implements the well-known blueprints-API from tinkerpop.
If you’re interested to know more about imGraph, please read the paper imGraph: A distributed in-memory graph database and/or contact us.
 R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1), 2008.
 Salim Jouili and Aldemar Reynaga, imGraph: A distributed in-memory graph database. To appear in Proceedings of the 2013 ASE/IEEE International Conference on Big Data, Washington D.C., USA, September 2013.