This year we started a research project, named AROM, in collaboration with the Université Libre de Bruxelles. We wanted to evaluate a Data Flow Graph (DFG) processing framework and compare it to a traditional MapReduce (MR) one. In addition, we analyzed high-level data manipulation languages such as PigLatin [1] and investigated whether an MR approach introduces a significant overhead when translating data operations into the physical execution plan. Going further, we analyzed a typical class of data analytics, pipelined jobs, and studied how well DFG suits them in terms of ease of use and performance. Finally, we used functional programming to introduce higher-order functions that simplify the expression of jobs and promote the reusability of operators.
As a result, we ended up with a complete implementation of a Data Flow Graph distributed processing framework, written in Scala and including a PigLatin translator. We called this framework AROM. The project is open source and can be freely downloaded from GitHub [2]. The CloudCom 2012 publication describes the last two analyses.
So let me take the opportunity of this publication to come back briefly to the convergence between DFG processing and stream processing.
Data Flow Graph Processing & Stream Processing
When using this kind of distributed processing framework, the developer does not have to redesign his algorithm or his job as a Map phase and a Reduce phase. Instead, he expresses the complete job as a graph of operations. Notice that this is a generalization of the MapReduce concept: indeed, we can express any MR job as a set of wired Map – Shuffle – Sort – Reduce operations. The name Data Flow Graph comes from the fact that the operations applied to the data flow are defined as a graph.
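To make this generalization concrete, here is a minimal Scala sketch of a word-count job expressed as a graph of wired Map – Shuffle/Sort – Reduce operators. The Node abstraction and its wiring are purely illustrative and are not AROM's actual API:

    // Illustrative only: a word-count MR job expressed as a chain of
    // operators in a graph (hypothetical API, not AROM's actual interface).
    object WordCountDfg {

      // A node of the graph: a named operator over a stream of records.
      case class Node[I, O](name: String, op: Iterator[I] => Iterator[O])

      // Map: line -> (word, 1) pairs.
      val map = Node[String, (String, Int)]("map",
        lines => lines.flatMap(_.split("\\s+").map(w => (w, 1))))

      // Shuffle/sort: group the pairs by key.
      val sort = Node[(String, Int), (String, Seq[Int])]("sort",
        pairs => pairs.toSeq.groupBy(_._1).iterator.map {
          case (w, ps) => (w, ps.map(_._2))
        })

      // Reduce: sum the counts of each word.
      val reduce = Node[(String, Seq[Int]), (String, Int)]("reduce",
        groups => groups.map { case (w, cs) => (w, cs.sum) })

      def main(args: Array[String]): Unit = {
        val input = Iterator("the quick fox", "the lazy fox")
        // Wiring map -> sort -> reduce reproduces the classic MR pipeline.
        reduce.op(sort.op(map.op(input))).foreach(println)
      }
    }

Any MR job fits this shape; a general DFG simply allows more operators and richer wiring than this single chain.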
Stream processing originally comes from the DB world, where the goal was to integrate a high-performance processing framework into the DB layer. The objective was to expose, at the DB level, a set of operation paths to apply to incoming data streams. An interesting outcome of these initiatives are the Complex Event Processing (CEP) engines, which expose a continuous query engine on top of a historical DB. If we look at the research projects which led to the CEP products, we quickly realize that they are designed as specialized stream DBs [7,8].
Later, we saw stream processing frameworks that only expose a set of operation paths to incoming data streams, without necessarily having a DB behind them. This is typically the case of the IBM System S research project [4] that led to the IBM InfoSphere product [5]. However, this kind of architecture was typically an Enterprise-style architecture, using a CORBA bus and distributed C++ workers and emphasizing an SOA approach.
Afterwards, we saw Internet-style designs of stream processing (still without storage integration) arriving, such as Apache S4 coming from Yahoo! [3,6], emphasizing an Event-Driven Architecture (EDA) and distributed computing. In this case the processing is executed by a set of Processing Elements (PEs) that subscribe to events coming from other PEs. Interestingly, some stream processing frameworks evolved into something between DFG and stream processing, such as Twitter Storm. Here the idea is to define a topology of Spouts (data stream emitters, typically a Kestrel queue – the distributed messaging system from Twitter) and Bolts, which are similar to processing elements. A stream of data is then sent to a structured topology of Bolts. Looks like a DFG, doesn't it?
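Before answering that, here is a small Scala sketch of the event-driven style these frameworks share: processing elements subscribe to the events emitted by upstream elements, forming a topology. This is a conceptual model only, not the actual S4 or Storm API:

    // Conceptual model of PEs subscribing to each other's events
    // (illustrative, not the real S4 or Storm API).
    object PeTopology {

      trait Pe {
        private var subscribers = List.empty[Pe]
        def subscribe(pe: Pe): Unit = subscribers ::= pe
        protected def emit(event: String): Unit =
          subscribers.foreach(_.receive(event))
        def receive(event: String): Unit
      }

      // Spout-like element: injects raw events into the topology.
      class Source extends Pe {
        def receive(event: String): Unit = emit(event)
      }

      // Bolt-like element: transforms events and forwards them downstream.
      class Tokenizer extends Pe {
        def receive(event: String): Unit = event.split("\\s+").foreach(emit)
      }

      class Printer extends Pe {
        def receive(event: String): Unit = println(s"word: $event")
      }

      def main(args: Array[String]): Unit = {
        val source = new Source
        val tokenizer = new Tokenizer
        val printer = new Printer
        source.subscribe(tokenizer)   // wire the topology
        tokenizer.subscribe(printer)
        source.receive("storm looks like a dfg")  // push one event through
      }
    }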
Not really actually, and this is where it all becomes really interesting! A DFG defines a graph of operations that will be applied to data stored on a distributed file system (such as HDFS in AROM). In practice, we apply a structured flow of operations (defined as operators) to a data flow extracted by the first step of the graph. In addition, the DFG scheduler has to deal with the distributed data sharding policy in order to execute most of the operations (operators) where the data is located, thereby avoiding as much I/O overhead as possible. Finally, the edges themselves define a complete set of data transfer primitives, such as splitting into n chunks, merging, etc. On the other hand, stream processing frameworks do not read their input from a distributed file system; they use streams of data as input. Notice that they can still use a DFS internally in order to store intermediate results. In addition, they do not necessarily provide primitives at the edge level to define the data behaviour between processing elements.
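To illustrate that last point, here is a Scala sketch of edge-level primitives: the edge, not the operator, decides how records move between nodes. The names SplitN and Merge are illustrative, not AROM's actual API:

    // Illustrative edge primitives: the edge decides how data is routed
    // between graph nodes (names are hypothetical, not AROM's API).
    object EdgePrimitives {

      sealed trait Edge[A] {
        // Route the output of one node to one or more downstream nodes.
        def route(data: Seq[A]): Seq[Seq[A]]
      }

      // Split the upstream output into n round-robin chunks.
      case class SplitN[A](n: Int) extends Edge[A] {
        def route(data: Seq[A]): Seq[Seq[A]] =
          data.zipWithIndex.groupBy(_._2 % n).toSeq.sortBy(_._1)
            .map(_._2.map(_._1))
      }

      // Merge everything into a single downstream stream.
      case class Merge[A]() extends Edge[A] {
        def route(data: Seq[A]): Seq[Seq[A]] = Seq(data)
      }

      def main(args: Array[String]): Unit = {
        val records = (1 to 10).toSeq
        println(SplitN[Int](3).route(records)) // three chunks
        println(Merge[Int]().route(records))   // one merged stream
      }
    }

In a stream framework, by contrast, such routing is typically attached to the subscription of a processing element rather than expressed as a first-class edge of the graph.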
We can see an interesting convergence between technologies emerging from the DB, enterprise and Internet worlds, all having different architectures and even different objectives, but slowly converging toward a common behaviour. This is a significant area and I would need more than a post to investigate the comparison, so I will let you delve into the references and enjoy many great readings during the cool evenings of the coming winter. I will finish by recommending the excellent overview of scalable CEP using stream processing [9].
References
[1] C. Olston et al., "Pig Latin: a not-so-foreign language for data processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099-1110, 2008.
[2] AROM GitHub repository, https://github.com/nltran/arom
[3] Apache S4 project, http://incubator.apache.org/s4/
[4] B. Gedik et al., "SPADE: the System S declarative stream processing engine," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1123-1134, ACM, 2008.
[5] IBM InfoSphere product page, http://www-01.ibm.com/software/data/infosphere/streams/
[6] S4 project – Yahoo! Labs, http://labs.yahoo.com/event/99
[7] S. Chandrasekaran et al., "TelegraphCQ: continuous dataflow processing," in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, p. 668, 2003.
[8] R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma, "Query processing, resource management, and approximation in a data stream management system," in Proceedings of CIDR 2003, January 2003.
[9] A. Margara et al., "Processing flows of information: from data stream to complex event processing," in Proceedings of the 5th ACM International Conference on Distributed Event-Based Systems (DEBS '11), 2011.