
Flink Forward 2015


The first edition of Flink Forward took place on October 12th and 13th, 2015, in Berlin. Flink Forward is a two-day conference dedicated exclusively to Apache Flink, the distributed pipelined batch and stream processing framework. EURA NOVA was among the speakers at the event. Here is our field report.

Global overview

The conference was an excellent opportunity to see (1) how the Flink framework is evolving and where the project is heading, but also (2) what the industrial use cases of Flink are. Flink currently has a clear advantage over Spark in terms of:

– Memory management

– Optimized logical & physical execution plans, especially for data processing and data queries

– A true stream processing API with exactly-once semantics

Use cases from the big players

Interestingly, almost all the presented use cases leverage the stream processing API for real-time applications. The majority of the speakers, however, explained that they had first chosen Spark but eventually switched to Flink because of memory and scalability issues.

We saw a range of applications, especially in telecom with Bouygues (QoE) and Ericsson (failure detection), retail with Otto, travel with Amadeus (online reservation price variation), and banking with Capital One (anomaly detection, fraud detection and online recommendation).

Tooling Integration

On the tooling side, we had a great demo of Zeppelin, the notebook for interactive queries and visualization. It can be used with both Spark and Flink, and it is very handy for showing results and for quick explorations.

The Evolution of Stream Processing Technologies

As explained before, we can see a clear evolution of streaming technologies towards handling out-of-order events. Google gave a really interesting keynote on the second day about Google Dataflow and its unified API.

Streaming Patterns: Event Time Based Windows

Google Dataflow semantics

Google Dataflow is the evolution of MillWheel (MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013) and FlumeJava (FlumeJava: Easy, Efficient Data-Parallel Pipelines, ACM SIGPLAN 2010) for stream processing. The real technical evolution of Google Dataflow lies in the ability to stream data and define windows over event time. The system then automatically maintains a “watermark”, a point in time by which most of the events should have been received. The watermark lag can be smaller than the window size, so results can be triggered faster. In addition, it is possible to register early triggers that emit speculative results before the watermark and before the end of the window, and late triggers that update results on late arrivals.
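To make these semantics concrete, here is a minimal, self-contained Python sketch (not the Dataflow or Flink API) of event-time tumbling windows with a watermark and a late trigger. The window size, allowed lateness, and input events are invented for illustration:

```python
from collections import defaultdict

WINDOW = 10    # tumbling window size, in event-time units (illustrative)
LATENESS = 3   # assumed maximum out-of-orderness; drives the watermark

def window_start(ts):
    return ts - ts % WINDOW

def process(events):
    """events: (event_time, value) pairs in arrival order, possibly out of
    event-time order. Yields ("on-time" | "late", window_start, sum)."""
    sums = defaultdict(int)
    fired = set()
    results = []
    watermark = float("-inf")
    for ts, value in events:
        w = window_start(ts)
        sums[w] += value
        if w in fired:
            # late arrival after the on-time firing: emit an updated result
            results.append(("late", w, sums[w]))
        # the watermark trails the largest event time seen by LATENESS
        watermark = max(watermark, ts - LATENESS)
        # fire every window whose end the watermark has now passed
        for ws in sorted(sums):
            if ws not in fired and ws + WINDOW <= watermark:
                fired.add(ws)
                results.append(("on-time", ws, sums[ws]))
    return results

out = process([(1, 1), (4, 1), (12, 1), (2, 1), (15, 1), (25, 1), (3, 1)])
# window [0, 10) fires on time with sum 3, then is corrected by a late event
```

Note how the window `[0, 10)` fires as soon as the watermark passes its end (before all its events have arrived), and the straggler with event time 3 then produces a late, corrected result rather than being dropped.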

Why do this? Google uses stream processing for click analytics, gaming, billing, ad exchanges, alarms, fraud detection, abuse detection, data acquisition & transformation, and many other real-time use cases. Each use case needs to balance (1) completeness, (2) latency and (3) cost effectiveness. But depending on the use case, some will require full completeness with only average latency and are ready to pay the cost for that (such as billing), while abuse detection needs low latency in order to react as soon as possible.

By providing all those rich stream processing semantics, but also a single DAG API for stream and batch, Google can cover a wide variety of use cases, and it is up to the developer to define the right balance of completeness, latency and cost by choosing between:

  1. Batch
  2. Batch with windows
  3. Streaming (good latency but no completeness guarantee)
  4. Streaming with early/late triggers (latency efficient and lower cost)
  5. Streaming with retention (complete, lower cost, but less latency efficient)
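As an illustration of the early-trigger idea in option 4, the following Python sketch (our own toy code, not a real API) emits a speculative partial result for a window every few elements, without waiting for any watermark; the trigger frequency is an invented parameter:

```python
from collections import defaultdict

def early_results(events, window=10, every=2):
    """Emit a speculative (early-trigger) partial sum for a tumbling window
    each time `every` new elements land in it: low latency, no completeness
    guarantee, since later events may still revise the sum."""
    counts = defaultdict(int)
    sums = defaultdict(int)
    out = []
    for ts, value in events:
        w = ts - ts % window
        counts[w] += 1
        sums[w] += value
        if counts[w] % every == 0:
            out.append((w, sums[w]))  # partial result, may be revised later
    return out

partials = early_results([(1, 1), (4, 2), (12, 5), (2, 3), (15, 1)])
```

Each emitted pair is a snapshot of a still-open window, which is exactly the trade Google described: results arrive early and cheaply, at the price of completeness.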

 

The unified API

Google Dataflow provides a single API for building a DAG, i.e. a pipeline of operations. Depending on the type of data entering the pipe, Google Dataflow decides to compile and optimize that DAG into either a batch processing job or a stream processing job.

Conclusion

Most of the use cases we have seen so far focused on the stream processing layer of Flink, which seems (1) much more flexible than Spark Streaming (which limits sliding window usage through the RDD model) and (2) much more scalable and efficient than Spark (mainly due to the pipelining engine and the memory model). Finally, the out-of-order event handling is definitely a killer feature.

Google officially announced that they will support Flink as a runner for executing Google Dataflow DAGs in a private datacenter rather than in the Google cloud. That means the same API will target two different frameworks, one in the cloud and one in a private data center. All of this makes Flink streaming a really good choice as a streaming platform.
