Flink Forward: The Key Takeaways

In early October 2019, six EURA NOVA engineers travelled to Berlin to attend Flink Forward, the conference dedicated to Apache Flink users and the stream processing community.

In this article, they give you their opinion on Ververica’s main announcement, the impact of Ververica’s acquisition by Alibaba, the big trends, and a selection of their favourite talks.

 

Alibaba!

This is the first Flink Forward conference since the acquisition of Ververica (formerly known as data Artisans) by Alibaba, which has for years been one of the largest users of Flink and its second-largest contributor. Our R&D director Sabri Skhiri says: “The only significant impact of this acquisition on the conference is that the venue has moved from the Kulturbrauerei to the Berlin Business Center. There, we could see that the Apache Flink user community has grown significantly, as has its number of contributions to Flink. This edition was a bit more business- and enterprise-oriented than previous ones, although it kept its technical DNA and a lot of technical talks. All in all, it was a very good mix. The Alibaba folks are deeply committed to open source and to creating technology impact. We saw a lot of activity from them, such as the integration of the Blink SQL runner, the Hive integration, and the new scheduling model. In summary, a great event.”

 

First Keynote Announcement

Keynote: Stream Processing and Applications in the Modern Age (Stephan Ewen)

During the first keynote, Ververica took the opportunity to announce the launch of Stateful Functions (statefun.io), an open-source framework built on top of Flink to run stateful serverless functions. It bridges the gap between Function as a Service and stream processing.

Sabri says: “Last year, they announced their streaming ledger, which brings ACID transactions across states to stream processing applications. This year, they announced the launch of Stateful Functions, a framework that reduces the complexity of building and orchestrating stateful applications at scale. In the streaming world, this announcement does not change much. In the microservice community, however, it opens new doors in terms of design patterns, especially in how data feeding and stateful operations can be designed more flexibly.”
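To give a feel for the programming model, here is a minimal sketch of a stateful function, written against the Java SDK as it later landed in the Apache Flink Stateful Functions project. The package names, the GreeterFunction class and the greeting logic are illustrative assumptions; the exact API shown at the conference may have differed slightly.

    import org.apache.flink.statefun.sdk.Context;
    import org.apache.flink.statefun.sdk.StatefulFunction;
    import org.apache.flink.statefun.sdk.annotations.Persisted;
    import org.apache.flink.statefun.sdk.state.PersistedValue;

    public class GreeterFunction implements StatefulFunction {

        // Durable, per-address state managed by the runtime and backed by Flink state.
        @Persisted
        private final PersistedValue<Integer> seenCount =
                PersistedValue.of("seen-count", Integer.class);

        @Override
        public void invoke(Context context, Object input) {
            int seen = seenCount.getOrDefault(0) + 1;
            seenCount.set(seen);

            // Message the function that invoked us; state updates and messaging
            // are handled with exactly-once guarantees by the runtime.
            context.send(context.caller(), "Hello " + input + " #" + seen);
        }
    }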

You can find the video of the presentation here.

 

The Big Trends

1. Unified batch and streaming

A significant trend of this edition was the “Unified Stream and Batch” motto. Our R&D engineer Syrine Ferjaoui says: “Flink currently features two different APIs: the DataSet API for batch processing and the DataStream API for stream processing. In addition, the Table API is already a unified API on top of both (DataSet and DataStream), with declarative-style programming. Now, they are working on a solution to truly unify the batch and streaming APIs.”
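As an illustration of what that unified layer already allows, here is a minimal sketch, assuming the Flink 1.9 Blink planner: the same declarative query can be planned as a streaming or a batch job just by switching the environment settings. The table and column names are made up, and source registration is omitted.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class UnifiedTableApiSketch {
        public static void main(String[] args) {
            // Flip inStreamingMode() to inBatchMode() and the very same query below
            // is planned and executed as a bounded batch job instead.
            EnvironmentSettings settings = EnvironmentSettings.newInstance()
                    .useBlinkPlanner()
                    .inStreamingMode()
                    .build();
            TableEnvironment tEnv = TableEnvironment.create(settings);

            // A table named "clicks" is assumed to have been registered through a
            // connector (Kafka, filesystem, ...) before this query is issued.
            Table topUsers = tEnv.sqlQuery(
                    "SELECT user_id, COUNT(*) AS cnt FROM clicks GROUP BY user_id");
        }
    }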

Sabri adds: “In Flink 1.9, they released the State Processor API, with which state created in a batch job can be used in a streaming application – interesting for bootstrapping or backfilling state. But the community is going further, proposing for Flink 2.0 a single unified data API that will merge DataSet and DataStream while still taking advantage of batch properties to optimise execution.”
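As a concrete illustration of that bootstrapping idea, here is a hedged sketch using the Flink 1.9 State Processor API: a batch DataSet is turned into keyed state and written as a savepoint from which a streaming job with a matching operator uid can start. The Account POJO, file paths, uid and bootstrapper are illustrative assumptions, not code from any talk.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.state.api.BootstrapTransformation;
    import org.apache.flink.state.api.OperatorTransformation;
    import org.apache.flink.state.api.Savepoint;
    import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

    public class BootstrapAccountState {

        // Illustrative POJO: one record per account in the historical batch data.
        public static class Account {
            public int id;
            public double balance;
        }

        // Writes each account balance into keyed state, in the same shape the
        // streaming operator (same uid) expects to find it.
        public static class AccountBootstrapper
                extends KeyedStateBootstrapFunction<Integer, Account> {

            private transient ValueState<Double> balance;

            @Override
            public void open(Configuration parameters) {
                balance = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("balance", Double.class));
            }

            @Override
            public void processElement(Account account, Context ctx) throws Exception {
                balance.update(account.balance);
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Account> accounts = env
                    .readCsvFile("hdfs:///data/accounts.csv")
                    .pojoType(Account.class, "id", "balance");

            BootstrapTransformation<Account> transformation = OperatorTransformation
                    .bootstrapWith(accounts)
                    .keyBy(account -> account.id)
                    .transform(new AccountBootstrapper());

            Savepoint
                    .create(new MemoryStateBackend(), 128)        // 128 = max parallelism
                    .withOperator("accounts-uid", transformation) // must match the streaming operator's uid
                    .write("hdfs:///savepoints/bootstrapped");

            env.execute("Bootstrap account state");
        }
    }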

He continues: “Every talk explored, in one way or another, how this unification can be pushed forward. In the Pulsar talk, for instance, the speakers were considering using Pulsar as a back end to transparently bootstrap state and then switch to streaming, relying on (1) Pulsar’s segment-based storage and (2) Flink’s unified DataStream API.”

 

2. “Enterprise-grade” Flink

Flink is clearly moving toward becoming an “enterprise-grade” technology. Sabri says: “The first signal is that Cloudera has adopted Apache Flink into its Data Platform, and Amazon Kinesis Data Analytics now integrates Flink. Adoption by such big players goes to show that Flink is well on the way to gaining enterprise-grade support. The second signal is the release of the Ververica Platform, which greatly facilitates enterprise-grade operations. Thirdly, the integration of the Hive Metastore through the pluggable catalogue architecture is a significant step towards better governance and metadata management. Finally, there were many talks about lowering the barrier to deploying Flink in production, covering APIs, configuration, memory management, K8S operators, and more.”
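To make the catalogue point concrete, here is a small sketch of how a Hive Metastore is plugged into a Flink 1.9 table environment so that tables already declared in Hive become visible to Flink SQL. The catalog name, default database, hive-site.xml path, Hive version and table name are illustrative.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class HiveCatalogSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build());

            // Point Flink at an existing Hive Metastore via its hive-site.xml directory.
            HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf", "2.3.4");
            tEnv.registerCatalog("myhive", hive);
            tEnv.useCatalog("myhive");

            // Tables already declared in the Metastore can now be queried directly.
            tEnv.sqlQuery("SELECT COUNT(*) FROM some_hive_table");
        }
    }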

 

3. The ML path

Finally, regarding ML/AI, there is still a lot of work to do to close the gap with the Spark ecosystem. However, the Alibaba folks are working hard on this topic and we can already see the first results. Sabri says: “The refactoring of the Flink ML interfaces to work on the Flink Table API is excellent. There is a clear vision of integrating Flink as a data preparation engine for ML and as a serving layer, and the roadmap looks great.”

 

Interesting talks

A personal selection by Charles and Christophe of interesting talks to check out:

For Charles, our data architect:

  • Aljoscha Krettek and Timo Walther, respectively a co-founder of Ververica and a PMC member of Apache Flink, both work on the Flink APIs. They gave a summary of recent contributions to Flink’s Table & SQL APIs. It was a very good overview of the ongoing refactoring and of where the APIs are heading.
  • Roman Grebennikov is a software developer at Findify AB. His talk focused on Flink’s serialisation framework and the common problems around it. He illustrated and explained several ways to optimise Flink jobs by taking care of serialisation, which according to him often represents about 60% of the processing time; a small sketch of typical serialisation tweaks follows just below.
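For readers who want a starting point, here is a small sketch, not taken from the talk, of two common serialisation tweaks: failing fast when Flink would silently fall back to Kryo, and keeping types POJO-friendly so that Flink’s own serialisers are used. MyEvent is an illustrative type.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SerialisationTuning {

        // Illustrative event type: a public no-arg constructor and public fields keep it
        // a valid Flink POJO, so the efficient PojoSerializer is used instead of Kryo.
        public static class MyEvent {
            public String id;
            public long timestamp;
            public MyEvent() {}
        }

        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Fail at job-graph build time instead of silently serialising unknown
            // types with the slower Kryo fallback.
            env.getConfig().disableGenericTypes();

            // If a generic type really is required, registering it keeps its
            // serialised representation compact:
            // env.getConfig().registerKryoType(MyEvent.class);
        }
    }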

For Christophe, our software engineer:

  • Konstantin Knauf is the head of product for the Ververica Platform, based on Apache Flink. He discussed Apache Flink worst practices, sharing anecdotes and hard-learned lessons from adopting distributed stream processing. It was a humorous list of general good practices when working with Flink, from planning and requirements to deployment and maintenance.
  • Aaron Levin and Mike Mintz are software engineers on Stripe’s streaming team. They talked about the many challenges they encountered while writing a specialised dual source. It was a very well-told story about a simple use case with a hard constraint: all-time deduplication of transactions at Stripe (a payment platform). It was funny, insightful, full of lessons learned, and it echoed one of digazu’s features: the history replayer.

 

Kafka Summit: The Key Takeaways

At the beginning of the month, our software engineer Christophe Philemotte was in San Francisco to give a presentation at the Kafka Summit organised by Confluent. The Kafka Summit is one of the main events for data architects, engineers, DevOps specialists, and developers who want to learn about streaming data. In this article, Christophe shares the latest trends from the conference.

 

Main observations

This year, one of the most important takeaways at the conference was that Confluent is working towards building an active database with KSQL.

Christophe details: “KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka. With KSQL, Confluent is embracing streaming SQL and integrating its stack around it. They also aim to offer the interactivity we already have with a classic database. In short, they are moving towards a paradigm of active data and passive queries, where KSQL makes it easy to read, write, and process streaming data in real time, at scale, using SQL-like semantics. Still, KSQL shouldn’t be chosen over Flink, for instance, without proper consideration of its limitations: real checkpointing and savepoints are missing, as is global shuffling; there are still constraints on partitioning in some operators; and there is no global windowing.”
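To make the “active data, passive query” idea more tangible, here is a small Kafka Streams sketch (Kafka Streams being the engine KSQL builds on) of the kind of continuously maintained aggregation that KSQL lets you declare in a single SQL statement. The topic names, application id and broker address are illustrative.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;

    public class PageviewCounts {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-counts");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> pageviews = builder.stream("pageviews");

            // A continuously updated count per key: the "table" stays active and is
            // re-materialised as new events arrive, instead of being computed per query.
            KTable<String, Long> countsPerUser = pageviews
                    .groupByKey()
                    .count(Materialized.as("pageview-counts-store"));

            countsPerUser.toStream()
                    .to("pageview-counts", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }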

While talking about streaming SQL, they also mentioned user-defined functions and machine learning integration. You can find more information on the summit website.

Another interesting point was the shared approaches and themes addressed by different companies. For example, 30% of the talks were about operations. About five talks were dedicated to how to deploy on Kubernetes, and several other speakers mentioned that deploying on Kubernetes was their target. Real-time analytics, integration/ETL/DataOps, and of course data pipelines were also mentioned often.

 

Keynote talks:

During the first keynote talk, Jun Rao, co-founder of Confluent, looked back at Apache Kafka’s past years and at what brought the project to where it is today. Christophe says: “One interesting point was the concept of democratising data. They envision Kafka as a one-stop self-service shop for devs, data scientists, etc. Still, users have to overcome a lot of challenges such as operations, integration, security, or cold storage. Those are challenges that we are solving with digazu.”

You can find the video of the presentation here.

Jay Kreps, the CEO of Confluent and one of the co-creators of Apache Kafka, started his talk by discussing Marc Andreessen’s statement that “software is eating the world”. Christophe adds: “The idea is that software must be integrated into an ecosystem of other software. The users are no longer just humans; in some cases, the software will be used almost exclusively by other software.”

Jay Kreps also talked about the next steps for Apache Kafka. He announced that the next release of KSQL in November will enable users to directly register inputs and outputs through Kafka Connect source and sink connectors. They are also working on better interactivity, allowing users to see results more quickly in the KSQL CLI.

You can find the video of the presentation here.

 

Our Favourite Use Cases:

Kafka on Kubernetes: Keeping It Simple (Nikki Thean)

Nikki Thean is a staff engineer at Etsy, where she helps run Kafka. She talked about Etsy’s cloud migration and how running Kafka on Kubernetes was the best option for them, and not half as complicated as they thought it would be. Christophe explains: “At the DataWorks Summit in Barcelona, the message was that K8S resource management was not yet ready to replace YARN. We now see that K8S is the new YARN for many people, such as Etsy or Confluent Cloud, who use it to deploy their clusters.”

In her talk, Nikki Thean explained how a Kafka-on-K8S setup works. Christophe summarises: “The main lessons from her talk are:

  • We can start simply without an operator.
  • We must pay attention to the Kubernetes liveness and readiness probes. They can make a service more robust and more resilient, since K8S can restart pods when necessary. However, if these probes are not configured carefully, they will kill the brokers unnecessarily.
  • Considering the price of deploying across multiple zones on Google Cloud Platform, a good solution is to deploy at least ZooKeeper (the most critical element of the cluster) across multiple zones. Given its low data volume, this will not be too expensive, and ZooKeeper will still make it possible to identify which Kafka node has the data.”

You can find the video and the slides of the presentation here.

 

Mission-Critical, Real-Time Fault-Detection for NASA’s Deep Space Network using Apache Kafka (Rishi Verma)

Rishi Verma is a manager at the NASA Jet Propulsion Laboratory. He talked about the new software system being deployed by NASA to upgrade its Deep Space Network (DSN), which operates the communication links for NASA’s deep-space missions. Christophe says: “It was a super interesting use case! The DSN Complex Event Processing (DCEP) software assembly is a new system that brings next-generation ‘Big Data’ infrastructure tools into the DSN to do IoT with legacy assets. The objective is to correlate real-time network data with other critical data assets (in their example, an old radio antenna). They collect all the data in Kafka, process it, and then predict signal loss based on weather conditions.”

You can find the video and the slides of the presentation here.

 

 

Cross the Streams Thanks to Kafka and Flink (Christophe Philemotte)

Christophe is the CTO of digazu, the batch and real-time data sharing platform developed by EURA NOVA. In his talk, he explained how you can build a similar data platform, how to plug Flink into the Kafka ecosystem, what the common pitfalls are, and what Flink requires to be deployed on Kubernetes.

Christophe says: “The feedback was positive and I received a lot of questions during the Q&A session and after the talk, notably about Flink vs KSQL vs Spark. Another question I received a lot was when to use the Table, SQL, or DataStream API. My answer was that the Table and SQL APIs are two flavours of the same API: with the Table API you get a LINQ-style experience, while with the SQL API you get a plain SQL experience. Both are perfect for data processing that can be expressed simply in SQL, which covers a lot of cases. The DataStream API is a lower-level API compared with the Table and SQL APIs. It gives more control over what you can do, which also means it requires a thorough understanding of Flink’s core mechanisms. Going for the DataStream API is usually a good choice either when your stream processing cannot be expressed in SQL and requires a specific implementation, or when you need to optimise the processing.”
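To illustrate that distinction, here is a small sketch, not from the talk, of the same simple job expressed both declaratively with the Table/SQL API and at a lower level with the DataStream API. The Login type, field names and logic are made up, and the Flink 1.9 APIs are assumed.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class TableVsDataStreamSketch {

        // Illustrative event type.
        public static class Login {
            public String userId;
            public boolean success;
            public Login() {}
            public Login(String userId, boolean success) {
                this.userId = userId;
                this.success = success;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Login> logins = env.fromElements(
                    new Login("alice", false), new Login("alice", false), new Login("bob", true));

            // Table/SQL flavour: declarative, a good fit whenever the logic fits SQL semantics.
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
            tEnv.registerDataStream("logins", logins, "userId, success");
            Table failures = tEnv.sqlQuery(
                    "SELECT userId, COUNT(*) AS failures FROM logins WHERE success = FALSE GROUP BY userId");
            tEnv.toRetractStream(failures, Row.class).print();

            // DataStream flavour: more code, but full control over state, timers and
            // custom logic (e.g. a KeyedProcessFunction after the keyBy).
            logins.filter(login -> !login.success)
                  .keyBy(login -> login.userId)
                  .print();

            env.execute("Table vs DataStream sketch");
        }
    }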

The sandbox provided with the talk was also very popular.

You can find the video and the slides of the presentation here.

 

Our Favourite User Practice:

 

Please Upgrade Apache Kafka. Now. (Gwen Shapira)

Gwen Shapira is a software engineer at Confluent working on core Apache Kafka. She reviewed all the recent releases and made suggestions on how to de-risk upgrades.

Christophe says: “Gwen Shapira talked about why it is essential to upgrade even though it is risky and time-consuming. She explained that each new release fixes from 30 to 140 bugs, and she listed the improvements you get from upgrading.” Among them:

  • The Apache Kafka team is working on improvements to make replication more reliable; watermarking, for example, has been greatly improved.
  • They are reworking the controller design with a view to eventually removing ZooKeeper.
  • Finally, some releases are critical for specific reasons (e.g. proper IP resolution when working with K8S, JBOD, or EOS).

 

In the second part of her talk, Gwen Shapira made suggestions on how to upgrade as safely as possible. Christophe explains: “She recommended taking good care of backup configuration and documentation. Regarding documentation, she recommended reading the list of notable changes, acting upon text in bold font, and, once you have finished reading, going over it all again!”

Christophe’s last word? “Be sure to check out slide 35: it lists the ways not to upgrade!”

You can find the video and the slides of the presentation here.