
Flink Forward 2015


The first edition of Flink Forward took place on October 12th and 13th, 2015, in Berlin. Flink Forward is a two-day conference dedicated exclusively to Apache Flink, the distributed pipelined batch and stream processing framework. EURA NOVA was among the speakers at the event. Here is our field report.

Global overview

The conference was an excellent opportunity to see (1) how the Flink framework is evolving and where the project is heading, but also (2) what the industrial use cases of Flink are. Flink currently has a clear advantage over Spark in terms of:

– Memory management

– Optimized logical & physical execution plans, especially for data processing and data queries

– A true stream processing API with exactly-once semantics

Use cases from the big players

Interestingly, almost all the presented use cases leverage the stream processing API for real-time applications. The majority of the speakers, however, explained that they had first chosen Spark but eventually switched to Flink because of memory and scalability issues.

We saw a range of applications, especially in telecom with Bouygues (QoE) and Ericsson (failure detection), retail with Otto, travel with Amadeus (online reservation price variation), and banking with Capital One (anomaly detection, fraud detection and online recommendation).

Tooling Integration

On the tooling side, we had a great demo of Zeppelin, the notebook for interactive queries and visualization. It can be used with both Spark and Flink, and it is very handy for showing results and for quick explorations.

The Evolution of Stream Processing Technologies

As explained before, we can see a clear evolution of streaming technologies towards handling out-of-order events. Google gave a really interesting keynote on the second day about Google Dataflow and its unified API.

Streaming Patterns: Event Time Based Windows

Google Dataflow semantics

Google Dataflow is the evolution of MillWheel (MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013) and FlumeJava (FlumeJava: Easy, Efficient Data-Parallel Pipelines, ACM SIGPLAN 2010) for stream processing. The real technical evolution of Google Dataflow lies in the ability to stream data and define windows over event time. The system then automatically maintains a “watermark”, a point in time by which most of the events should have been received. The watermark lag can be smaller than the window size, so results can be triggered faster. In addition, it is possible to register early triggers that emit speculative results before the watermark and before the end of the window, and late triggers that update results on late arrivals.
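To make these semantics concrete, here is a minimal, self-contained Python sketch (not the Dataflow or Flink API) of event-time tumbling windows with a watermark and a late trigger. The window size, allowed lateness, and input events are invented for illustration:

```python
from collections import defaultdict

WINDOW = 10    # tumbling window size, in event-time units (illustrative)
LATENESS = 3   # assumed maximum out-of-orderness; drives the watermark

def window_start(ts):
    return ts - ts % WINDOW

def process(events):
    """events: (event_time, value) pairs in arrival order, possibly out of
    event-time order. Yields ("on-time" | "late", window_start, sum)."""
    sums = defaultdict(int)
    fired = set()
    results = []
    watermark = float("-inf")
    for ts, value in events:
        w = window_start(ts)
        sums[w] += value
        if w in fired:
            # late arrival after the on-time firing: emit an updated result
            results.append(("late", w, sums[w]))
        # the watermark trails the largest event time seen by LATENESS
        watermark = max(watermark, ts - LATENESS)
        # fire every window whose end the watermark has now passed
        for ws in sorted(sums):
            if ws not in fired and ws + WINDOW <= watermark:
                fired.add(ws)
                results.append(("on-time", ws, sums[ws]))
    return results

out = process([(1, 1), (4, 1), (12, 1), (2, 1), (15, 1), (25, 1), (3, 1)])
# window [0, 10) fires on time with sum 3, then is corrected by a late event
```

Note how the window `[0, 10)` fires as soon as the watermark passes its end (before all its events have arrived), and the straggler with event time 3 then produces a late, corrected result rather than being dropped.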

Why do this? Google uses stream processing for click analytics, gaming, billing, ad exchanges, alarms, fraud detection, abuse detection, data acquisition & transformation, and many other real-time use cases. Each use case needs to balance (1) completeness, (2) latency and (3) cost effectiveness. But depending on the use case, some will require full completeness with only average latency and are ready to pay the cost for that (such as billing), while abuse detection needs low latency in order to react as soon as possible.

By providing all those rich stream processing semantics, but also a single DAG API for stream and batch, Google can cover a wide variety of use cases, and it is up to the developer to define the right balance of completeness, latency and cost by choosing between:

  1. Batch
  2. Batch with windows
  3. Streaming (good latency but no completeness guarantee)
  4. Streaming with early/late triggers (latency efficient and lower cost)
  5. Streaming with retention (complete, lower cost, but less latency efficient)
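As an illustration of the early-trigger idea in option 4, the following Python sketch (our own toy code, not a real API) emits a speculative partial result for a window every few elements, without waiting for any watermark; the trigger frequency is an invented parameter:

```python
from collections import defaultdict

def early_results(events, window=10, every=2):
    """Emit a speculative (early-trigger) partial sum for a tumbling window
    each time `every` new elements land in it: low latency, no completeness
    guarantee, since later events may still revise the sum."""
    counts = defaultdict(int)
    sums = defaultdict(int)
    out = []
    for ts, value in events:
        w = ts - ts % window
        counts[w] += 1
        sums[w] += value
        if counts[w] % every == 0:
            out.append((w, sums[w]))  # partial result, may be revised later
    return out

partials = early_results([(1, 1), (4, 2), (12, 5), (2, 3), (15, 1)])
```

Each emitted pair is a snapshot of a still-open window, which is exactly the trade Google described: results arrive early and cheaply, at the price of completeness.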

 

The unified API

Google Dataflow provides a single API for building a DAG, i.e. a pipeline of operations. Depending on the type of data entering the pipe, Google Dataflow decides to compile and optimize that DAG into either a batch processing job or a stream processing job.

Conclusion

Most of the use cases we have seen so far focused on the stream processing layer of Flink, which seems (1) much more flexible than Spark Streaming (which limits sliding window usage through the RDD model) and (2) much more scalable and efficient than Spark (mainly due to the pipelining engine and the memory model). Finally, the out-of-order event handling is definitely a killer feature.

Google officially announced that they will support Flink as a runner for executing Google Dataflow DAGs in a private datacenter rather than in the Google cloud. That means the same API will target two different frameworks, one in the cloud and one in a private data center. All of this makes Flink streaming a really good choice as a streaming platform.
