
Flink Forward 2015

The first edition of Flink Forward took place on October 12th and 13th in Berlin. Flink Forward is a two-day conference exclusively dedicated to Apache Flink, the distributed pipelined batch and stream processing framework. EURA NOVA was among the speakers at the event. Here is our field report.

Global overview

The conference was an excellent opportunity to see (1) how the Flink framework is evolving and where it is heading, but also (2) what the industrial use cases of Flink are. Flink currently has a clear advantage over Spark in terms of:

– Memory management

– Optimized logical & physical execution plans, especially for data processing and querying

– A real stream processing API with exactly-once semantics (see the sketch below)
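
To give a feel for that API, here is a minimal sketch of a windowed word count with the Flink DataStream API, roughly in the style of the 0.10-era Java API. The host, port and window size are illustrative assumptions; enabling checkpointing is what backs the exactly-once guarantee for operator state.

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;

    public class WindowedWordCount {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Periodic checkpoints back the exactly-once guarantee for operator state.
            env.enableCheckpointing(5000);

            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(new Tuple2<>(word, 1));
                            }
                        }
                    }
                })
                .keyBy(0)                          // partition the stream by word
                .timeWindow(Time.seconds(10))      // 10-second windows
                .sum(1)                            // count occurrences per window
                .print();

            env.execute("Windowed word count");
        }
    }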

Use cases from the big players

Interestingly, almost all the presented use cases leverage the stream processing API for real-time applications. However, the majority of the speakers explained that they had first chosen Spark but eventually switched to Flink because of memory and scalability issues.

We saw a range of applications: in telecom with Bouygues (QoE monitoring) and Ericsson (failure detection), in retail with Otto, in travel with Amadeus (online reservation price variation), and in banking with Capital One (anomaly detection, fraud detection and online recommendation).

Tooling Integration

On the tooling side, we had a great demo of Zeppelin, the notebook for interactive queries and visualization. It can be used with both Spark and Flink, and it is very handy for showing results and doing quick explorations.

The Evolution of Stream Processing Technologies

As explained before, we can see a clear evolution of streaming technologies towards taking out-of-order events into account. Google gave a really interesting keynote on the second day about Google Dataflow and its unified API.

Streaming Patterns: Event Time Based Windows

Google Dataflow semantics

Google Dataflow is the evolution of MillWheel (MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013) and FlumeJava (FlumeJava: Easy, Efficient Data-Parallel Pipelines, PLDI 2010) for stream processing. The real technical evolution of Google Dataflow lies in the ability to stream data and define windows over it. The system then automatically derives a “watermark”, a point in time by which most of the events should have been received. This watermark can arrive before the end of the window, triggering results faster. In addition, developers can register early triggers that emit results before the watermark and before the end of the window, and late triggers that handle late arrivals.
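
As an illustration, here is a minimal sketch in the style of the pre-Beam Dataflow Java SDK; the project, topic name and element types are assumptions for the example. With the default trigger, each window fires once when the watermark passes its end.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.PubsubIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class WatermarkedCounts {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            p.apply(PubsubIO.Read.topic("projects/my-project/topics/clicks")) // unbounded source
             // Group events into 5-minute windows based on event time.
             .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
             // With the default trigger, each window emits its result once the
             // watermark estimates that its events have been received.
             .apply(Count.<String>perElement());

            p.run();
        }
    }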

Why do this? Google uses stream processing for click analytics, gaming, billing, ad exchanges, alarms, fraud detection, abuse detection, data acquisition and transformation, and many other real-time use cases. In each use case, they need to balance (1) completeness, (2) latency and (3) cost effectiveness. But depending on the use case, some require full completeness with only average latency and are ready to pay the cost for it (such as billing), while others, such as abuse detection, need low latency in order to react as soon as possible.

By providing all those rich stream processing semantics behind one unique DAG API for both stream and batch, Google can cover a wide variety of use cases; it is up to the developer to define the right trade-off between completeness, latency and cost by choosing between the following modes (a sketch of the corresponding trigger configuration follows the list):

  1. Batch
  2. Batch with Window
  3. Stream (low latency but no completeness guarantee)
  4. Stream with early/late triggers (latency-efficient at lower cost)
  5. Stream with retention (completeness at lower cost, but less latency-efficient)


The unified API

Google Dataflow provides a single API for building a DAG, i.e. a pipeline of operations. Depending on the type of data entering the pipeline (bounded or unbounded), Google Dataflow decides to compile and optimize that DAG into either a batch processing job or a stream processing job.
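
Concretely, the same pipeline code can be fed either a bounded or an unbounded source, and the runner plans a batch or a streaming job accordingly. Here is a minimal sketch; the file path and topic name are illustrative assumptions.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.PubsubIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class UnifiedPipeline {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // Bounded input: the runner compiles the DAG into a batch job.
            PCollection<String> input =
                p.apply(TextIO.Read.from("gs://my-bucket/clicks-*.txt"));

            // Unbounded input: swapping the source is enough to get a streaming
            // job with the exact same downstream transforms.
            // PCollection<String> input =
            //     p.apply(PubsubIO.Read.topic("projects/my-project/topics/clicks"));

            input.apply(Count.<String>perElement());

            p.run();
        }
    }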

Conclusion

Most of the use cases we have seen so far were focused on the stream processing layer of Flink, which seems (1) much more flexible than Spark Streaming (whose RDD-based micro-batch model limits sliding window usage) and (2) much more scalable and efficient than Spark (mainly thanks to the pipelined execution engine and the memory model). Finally, out-of-order event handling is definitely a killer feature.

Google officially announced that they will support Flink as an official runner for executing Google Dataflow DAGs in a private data center rather than on Google Cloud Platform. This means that the same API will target two different frameworks: one in the cloud and one in a private data center. All of this makes Flink's streaming engine a really good choice for a streaming platform.
