The first edition of Flink Forward took place this past October 12th and 13th in Berlin. Flink Forward is a two-day conference exclusively dedicated to Apache Flink, the distributed pipelined batch and stream processing framework. EURA NOVA was present among the speakers of the event. Here is our field report.
Global overview
The conference was an excellent opportunity to see (1) how the Flink framework is evolving and where it is heading, but also (2) what the industrial use cases of Flink are. Flink currently has a clear advantage over Spark in terms of:
– Memory management
– Optimized logical & physical execution plans, especially for data processing and data queries
– A true stream processing API with exactly-once semantics (see the sketch after this list)
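To make the last point concrete, here is a minimal sketch of the DataStream API (our own illustration, not taken from the talks; the socket source, port and checkpoint interval are assumptions). Checkpointing is the mechanism behind Flink's exactly-once state guarantees:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints every 5s: this is what backs the exactly-once guarantee.
        env.enableCheckpointing(5000);

        DataStream<Tuple2<String, Integer>> counts = env
            .socketTextStream("localhost", 9999) // hypothetical source
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)  // partition the stream by word
            .sum(1);   // keep a running count per word

        counts.print();
        env.execute("Streaming word count");
    }
}
```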
Use cases from the big players
Interestingly, almost all the presented use cases leverage the stream processing API for real-time applications. A majority of the speakers, however, explained that they had first chosen Spark but eventually switched to Flink because of memory and scalability issues.
We have seen a range of applications: in telecom with Bouygues (QoE monitoring) and Ericsson (failure detection), in retail with Otto, in travel with Amadeus (online reservation price variation), and in banking with Capital One (anomaly detection, fraud detection and online recommendation).
Tooling Integration
On the tooling side, we had a great demo of Zeppelin, the notebook for interactive queries and visualization. It can be used with both Spark and Flink. Very cool for showing results and quick explorations.
The Evolution of Stream Processing Technologies
As explained before, streaming technologies are clearly evolving to take out-of-order events into account. On the second day, Google gave a really interesting keynote about Google Dataflow and its unified API.
Google Dataflow semantics
Google Dataflow is the evolution of MillWheel (MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013) and FlumeJava (FlumeJava: Easy, Efficient Data-Parallel Pipelines, ACM SIGPLAN 2010) for stream processing. The real technical evolution of Google Dataflow lies in the ability to stream data and define windows over it. The system then automatically derives a “watermark”, a point in time by which most of the events should have been received. This watermark can fall before the end of the window, allowing results to be triggered faster. In addition, developers can register early triggers, which emit results before the watermark and before the end of the window, and late triggers, which fire for late arrivals.
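As a sketch of what these semantics look like in code, here is a windowed pipeline using the open-source Dataflow SDK API (shown with the Apache Beam package names the SDK later adopted; the Pub/Sub topic, window size and durations are our own assumptions, not from the talk):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WatermarkTriggers {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/clicks"))
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
             .triggering(AfterWatermark.pastEndOfWindow()
                 // Early trigger: speculative results every 30s before the watermark.
                 .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                     .plusDelayOf(Duration.standardSeconds(30)))
                 // Late trigger: re-fire for each element arriving after the watermark.
                 .withLateFirings(AfterPane.elementCountAtLeast(1)))
             // Keep window state for 30 minutes so late arrivals can still be counted.
             .withAllowedLateness(Duration.standardMinutes(30))
             .accumulatingFiredPanes())
         .apply(Count.perElement());

        p.run();
    }
}
```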
Why do this? Google uses stream processing for click analytics, gaming, billing, ad exchanges, alarms, fraud detection, abuse detection, data acquisition & transformation, and many other real-time use cases. Each use case has its own requirements in terms of (1) completeness, (2) latency and (3) cost effectiveness. Depending on the use case, some will require full completeness, can tolerate average latency, and are ready to pay the cost for that (such as billing), while abuse detection needs low latency in order to react as soon as possible.
By providing these rich stream processing semantics together with a single, unified DAG API for both stream and batch, Google can cover a wide variety of use cases, and it is up to the developer to pick the right trade-off between completeness, latency and cost by choosing between:
- Batch
- Batch with windows
- Streaming (low latency, but no completeness guarantee)
- Streaming with early/late triggers (latency-efficient, at a lower cost)
- Streaming with retention (completeness and lower cost, but less latency-efficient)
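The early/late-trigger variant is the one sketched above; under the same assumptions (Beam package names, hypothetical window sizes and durations), the two remaining streaming options map to window configurations roughly like these:

```java
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class TriggerSpectrum {

    // Plain streaming: fire once when the watermark passes the end of the
    // window and drop anything that arrives later. Cheapest and fastest,
    // but no completeness guarantee.
    static Window<String> latencyOverCompleteness() {
        return Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes();
    }

    // Streaming with retention: keep the window state around for hours so
    // late events are still folded into the result. Completeness, at the
    // price of buffering state much longer.
    static Window<String> completenessWithRetention() {
        return Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.standardHours(6))
            .accumulatingFiredPanes();
    }
}
```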
The unified API
Google Dataflow provides a single API for building a DAG, i.e. a pipeline of operations. Depending on the type of data entering the pipeline, Dataflow decides to compile and optimize that DAG into either a batch processing job or a stream processing job.
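In practice, the switch is driven by the source: a bounded source (e.g. files) yields a batch job, while an unbounded source (e.g. Pub/Sub) yields a streaming one. A minimal sketch, again with hypothetical bucket and topic names:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.PCollection;

public class UnifiedPipeline {
    public static void main(String[] args) {
        boolean streaming = args.length > 0 && args[0].equals("--streaming");
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        // The downstream DAG is identical; only the source differs. A bounded
        // source results in a batch job, an unbounded source in a streaming job.
        PCollection<String> events = streaming
            ? p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/events"))
            : p.apply(TextIO.read().from("gs://my-bucket/events-*.txt"));

        events.apply(Count.perElement());
        p.run();
    }
}
```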
Conclusion
Most of the use cases we have seen so far focus on the stream processing layer of Flink, which seems (1) much more flexible than Spark Streaming (whose RDD-based micro-batches limit the use of sliding windows) and (2) much more scalable and efficient than Spark (mainly thanks to the pipelined engine and the memory model). Finally, the out-of-order event handling is definitely a killer feature.
Google officially announced that they will support Flink as an official runner for executing Google Dataflow DAGs in a private data center rather than on Google Cloud. That means that the same API will target two different frameworks: one in the cloud and one in a private data center. All of this makes Flink's stream processing a really good choice for a streaming platform.