Insights From Flink Forward 2024

In October, our CTO Sabri Skhiri attended the Flink Forward conference in Berlin, which marked the 10-year anniversary of Apache Flink. The event brought together experts and enthusiasts in stream processing to discuss the latest advancements, challenges, and future trends. In this article, Sabri delves into some of the keynotes and talks from the conference, highlighting noteworthy insights and innovations shared by Ververica and industry leaders.

The Keynotes

The Present of Stream Processing with Apache Flink

The organisers invited different companies to share how they are leveraging Apache Flink to address real-time challenges and push the boundaries of stream processing: 

UNIPER: They presented a use case in energy trading, where Flink’s real-time stream processing enables low-latency data aggregation, crucial for consolidating trading positions across multiple platforms.

Booking.com: They transitioned from legacy systems such as MCharts and shell scripts to Flink for transaction processing and fraud detection, significantly improving real-time detection (99% increase). Challenges included balancing reliability with scaling needs and evolving business demands. They also built an abstraction layer on top of Flink, offering documentation, workshops, hands-on sessions, and troubleshooting to make the technology more accessible and operationally effective.

Evoura: They highlighted Flink’s diverse applications across various industries, such as predictive maintenance in manufacturing, fraud detection in finance, and real-time network surveillance in telecom. They also emphasised the rise of the streaming lakehouse architecture, where real-time data is stored in dynamic tables within lakehouses, integrating both transactional and analytical worlds using technologies like Iceberg. Finally, shift-left architecture enforces real-time data contracts and lineage, streamlining the transition of data products between systems.

Hivemind: They shared strategies for helping clients move from batch to real-time processing, starting with small PoCs and engaging stakeholders early. Flink’s tight integration with Kafka and Debezium makes real-time deployment easier and lowers risks.

Aiven: They identified the complexity of developing and operationalising Flink jobs as a barrier to adoption, and suggested that lowering these barriers, possibly by enabling development in plain English, could democratise Flink’s use.
 
StreamNative: They discussed Apache Paimon, a lake storage format that, combined with Flink and Kafka or Pulsar, provides a pre-integrated data lakehouse. This setup mirrors what Confluent offers with its streaming services.

The Future of Flink

Convergence of Transactional and Analytical Systems: Flink is moving towards deeper integration between these two worlds, addressing the need for a strong storage backend for streaming analytics.

FLUSS (Flink Unified Streaming Storage): Announced as a new architecture based on Apache Arrow, FLUSS will be an open-source solution aimed at improving streaming lakehouse architectures.

Flink 2.0 Release Today!

At Flink Forward 2024, a major announcement was made about the release of Flink 2.0, marking a significant evolution in Apache Flink’s capabilities. The update introduces a modern, cloud-native state management system, replacing the RocksDB architecture used in previous versions. Flink 2.0 enhances its role in the lakehouse ecosystem by allowing seamless, automatic switching between batch and real-time processing, signalling the maturity of its unified approach. Additionally, Flink now supports AI out-of-the-box, enabling the integration of pre-trained AI models directly into streaming pipelines. This version also simplifies development with the removal of outdated APIs and reinforces its support for OLAP workloads using SQL. The release of Flink 2.0 solidifies its position as a comprehensive solution for real-time and batch processing needs.

Favourite Talks

Spoiler Alert: Revealing the Secrets of Apache Flink 2.0

The talk was delivered by the release manager for Flink 2.0, a senior developer at Alibaba Cloud and a Project Management Committee (PMC) member of Apache Flink. The speaker traced Flink’s timeline from its roots in the Stratosphere research project at TU Berlin and its entry into the Apache incubator in 2014, through the 1.0 release in 2016, to the planning of Flink 2.0 in March 2023. A preview release of Flink 2.0 is available, with a full General Availability (GA) release planned for Q1 2025.

Flink 2.0 focuses on three main pillars:

  1. Streaming
  2. Stream-batch unification
  3. Streaming lakehouse

1. Streaming Enhancement

The speaker emphasised that state management, which includes persistence, rescaling, and exactly-once consistency, remains central to Flink’s architecture, especially for stateful stream processing.

To address challenges such as performance bottlenecks, resource elasticity, and limitations in memory and disk space, Flink 2.0 introduces significant updates to state management:

  • Disaggregated State Management: The local storage acts as a cache, with remote storage being the primary storage. This shift reduces memory and disk space usage, provides resource elasticity through lazy loading, and leads to little or no performance loss, with improvements in compaction and clean-up processes.
  • ForSt DB: In Flink 2.0, RocksDB is replaced by ForSt (“For Streaming”) DB, a disaggregated state store designed for improved scalability and performance in stateful stream processing (a configuration sketch follows below).
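
As a rough illustration, a job could opt into the disaggregated backend through configuration. The option keys below are assumptions based on the Flink 2.0 preview documentation and may differ in the final release:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DisaggregatedStateSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed keys from the 2.0 preview; verify against your Flink version.
        conf.setString("state.backend.type", "forst");
        // Remote object storage is the primary store; local disk is only a cache.
        conf.setString("state.backend.forst.primary-dir", "s3://my-bucket/flink-state");
        conf.setString("execution.checkpointing.interval", "30s");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... define a stateful pipeline as usual; state is lazily loaded from
        // remote storage on access, and rescaling no longer ships local files.
    }
}
```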

2. Stream-Batch Unification

Flink 2.0 reinforces the concept that batch is simply a bounded form of streaming, eliminating the need for separate pipelines (unlike the Lambda architecture). Flink unifies both stream and batch processing, allowing for automatic switching between the two based on efficiency and data freshness requirements.
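
The principle is already visible in the DataStream API’s runtime-mode switch, which Flink 2.0 builds on; a minimal sketch:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedExecutionSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // AUTOMATIC: Flink runs the same pipeline with batch scheduling when all
        // sources are bounded, and with streaming execution otherwise.
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
    }
}
```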
Key features include:
  • A new SQL keyword for materialised tables (sketched after this list), initially supporting Apache Paimon, with plans to extend support to open table formats. This allows Flink to dynamically adapt the physical plan based on data freshness needs.
  • Adaptive Execution: Enhancements such as parallelism inference, broadcast joins, and skew join optimisation ensure efficient execution of jobs.
  • Apache Celeborn Integration: Supports hybrid shuffle, improving batch execution performance.
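
A minimal sketch of the materialised-table keyword, assuming the FLIP-435 syntax and a catalog that supports it (such as Paimon); table and column names are illustrative:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MaterialisedTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // FRESHNESS declares how stale the table may become; the planner decides
        // whether a continuous streaming job or periodic batch refreshes meet it.
        tEnv.executeSql(
                "CREATE MATERIALIZED TABLE dwd_orders "
                + "FRESHNESS = INTERVAL '1' MINUTE "
                + "AS SELECT order_id, amount, region FROM orders");
    }
}
```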
Flink 2.0 has benchmarked highly in the TPC tests, showing comparable performance to Apache Spark in batch processing.
 
3. Streaming Lakehouse

Flink 2.0’s advancements in stream-batch unification pave the way for full streaming lakehouse support, particularly through its integration with Apache Paimon. This architecture enables real-time data processing within a lakehouse environment.
Notable improvements include:
  • Lookup Join: Flink uses a hash lookup table to locate data in Paimon tables efficiently (see the SQL sketch after this list).
  • Compaction Management: Enhances table maintenance and supports more read-friendly operations.
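
For reference, a lookup join in Flink SQL is expressed with the FOR SYSTEM_TIME AS OF clause. Continuing the TableEnvironment from the sketch above, with illustrative table and column names:

```java
// Enrich each order with the current product price from a Paimon table.
tEnv.executeSql(
        "SELECT o.order_id, o.quantity, p.price "
        + "FROM orders AS o "
        + "JOIN products FOR SYSTEM_TIME AS OF o.proc_time AS p "
        + "ON o.product_id = p.id");
```
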
Breaking Changes

The talk concluded with a reference to several breaking changes in Flink 2.0, which were shown in a slide during the presentation. These changes will affect existing workflows, and users should prepare for migration accordingly.

Programming API

  • DataSet API, Scala DataStream API, SourceFunction/SinkFunction/SinkV1, and TableSource/TableSink are removed (a migration sketch for sources follows below).
  • Some deprecated classes and methods in DataStream API & Table API are removed.
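
As an illustration of the source migration, a pipeline that used the removed SourceFunction-based Kafka consumer moves to the unified Source API; broker, topic, and job names are placeholders:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceMigrationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Old style (removed in 2.0): env.addSource(new FlinkKafkaConsumer<>(...));
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");
        stream.print();
        env.execute("source-migration-sketch");
    }
}
```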

Configuration

  • flink-conf.yaml is no longer supported; use config.yaml instead. A migration tool is provided.
  • Some deprecated configuration options are removed.

REST API

  • Some deprecated fields are removed.

Flink Client & SQL Client

  • Some deprecated options are removed.

State: Compatibility is not guaranteed between Flink 1.x and 2.0.

Java: Java 8 is no longer supported.

Deployment: Per-job mode is removed in Flink 2.0.

Conclusion

Flink 2.0 represents a significant milestone for the platform, particularly in how it handles state management, stream-batch unification, and its support for streaming lakehouses. These advancements position Flink as a leading technology for real-time, scalable data processing across various industries, and as an open-source challenger to the Confluent cloud platform. The final release is expected in early 2025.

Automate Apache Flink Tuning for Highly Elastic Scaling

The talk focused on automating Apache Flink tuning for elastic scaling, particularly in AI use cases. The speaker began by outlining the dataflow requirements for both AI model training and inference, emphasising the unique challenges of scaling AI pipelines, especially given the cost-sensitive scaling of AI instances.

Key Requirements for AI Use Cases

AI Model Training:

  1. Fault Tolerance: Ensuring reliable model training even in the presence of failures.
  2. High Throughput: Handling large volumes of data quickly during online model training (stream mining).
  3. Horizontal Scaling: Efficient scaling to meet the demands of training large AI models.
  4. Data Source Integration: Feeding data from various sources into the training process.
  5. Data Quality Assurance: Ensuring that only high-quality data is used from training data lakes.

AI Model Inference:

  1. Fault Tolerance: Similar to training, ensuring inference remains resilient to failures.
  2. High Throughput: The ability to process large amounts of inference requests quickly.
  3. Horizontal Scaling: Managing scaling efficiently while keeping costs in check, particularly given the high cost of General-Purpose Compute (GPC) licences.

Challenges in Manual Scaling & Automated Scaling

Manual resource allocation is difficult to fine-tune in real-time AI use cases. As AI workloads are resource-intensive and require high scalability, manual management can lead to inefficiencies or misallocations of computational resources.

Introduction to Automated Scaling

To address these challenges, the speaker introduced automated scaling mechanisms provided by the Ververica platform, which includes Autopilot Mode and Scheduled Mode.

  • Autopilot Mode: Two options are available:
    • Stable Mode: Focuses on reliability with fewer restarts.
    • Adaptive Mode: Optimises for latency and resource usage, making real-time adjustments to the computational resources allocated.
  • Hot Update Capability: Allows reconfiguration of jobs without stopping or redeploying them, enhancing flexibility in real-time tuning.

Main Dimensions of Autopilot

The speaker elaborated on the core dimensions of Autopilot that drive Flink’s elastic scaling (a simplified decision-rule sketch follows the list):

  1. Detectors:
    1. Delay Tolerance: Measures the delay between source data consumption and the rest of the pipeline, triggering scaling based on the acceptable delay.
    2. Slot Usage for Scaling: Increases or decreases parallelism based on the processing time percentage at a vertex node.
    3. Memory Scale-down: Triggers memory adjustments at specified intervals.
  2. Parallelism Scaling: Defines minimum and maximum parallelism limits, allowing either unrestricted scaling or constrained within specified ranges.
  3. Resource Limits: Specifies upper limits for compute units (CUs) and memory usage.
  4. Cooldown Interval: Sets a time interval between consecutive Autopilot triggers to avoid unnecessary scaling fluctuations.
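
To make the mechanics concrete, here is a deliberately simplified sketch of the kind of decision rule such detectors apply. All names, thresholds, and the doubling policy are illustrative assumptions, not Ververica’s actual implementation:

```java
// Simplified autoscaling decision rule combining a delay-tolerance detector,
// parallelism bounds, and a cooldown interval. Everything here is illustrative.
public final class DelayDetectorSketch {
    private final long delayToleranceMs = 60_000;   // acceptable source delay
    private final long cooldownMs = 300_000;        // min gap between rescalings
    private final int minParallelism = 1;
    private final int maxParallelism = 128;
    private int parallelism = 4;
    private long lastRescaleAtMs = 0;

    /** Called periodically with fresh metrics from the running job. */
    public void evaluate(long nowMs, long observedDelayMs, double slotBusyRatio) {
        if (nowMs - lastRescaleAtMs < cooldownMs) {
            return; // cooldown: avoid oscillating on short traffic spikes
        }
        if (observedDelayMs > delayToleranceMs && parallelism < maxParallelism) {
            parallelism = Math.min(parallelism * 2, maxParallelism); // scale up
            lastRescaleAtMs = nowMs;
        } else if (slotBusyRatio < 0.3 && parallelism > minParallelism) {
            parallelism = Math.max(parallelism / 2, minParallelism); // scale down
            lastRescaleAtMs = nowMs;
        }
    }
}
```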

The speaker acknowledged that sudden traffic spikes and slow traffic drifts both pose challenges for Autopilot. Where traffic patterns are predictable, Scheduled Mode is the better fit, as it allows planned scaling based on known business demand; Autopilot Mode remains the right tool for reacting to unforeseen, real-time spikes.

Conclusion

This talk provided an in-depth look at how automated scaling in Flink can optimise resource allocation in highly elastic environments, particularly for AI use cases that require both scalability and cost-efficiency. There are ongoing challenges, especially in managing high selectivity rate operators, where upstream and downstream operators have vastly different scaling needs. This issue is being actively researched in the GEPICIAD project in collaboration with UCLouvain, supported by Euranova. You can find the evaluation streambed here.

Noteworthy Observations

Flink CDC

Flink CDC (Change Data Capture) enables real-time data stream processing by capturing and monitoring changes—such as inserts, updates, and deletes—within databases. It allows Flink to process these changes as continuous events, making it highly beneficial for use cases like real-time analytics, ETL processes, and data replication across distributed systems. Notably, Flink CDC eliminates the need for Kafka and Kafka Connect for database integration, simplifying the pipeline significantly. This makes it a promising feature for organizations looking to maintain real-time data flows without the complexity of additional middleware. The active community around Flink CDC suggests that more connectors and enhanced features will continue to be released, expanding its utility across various database systems.
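
A minimal sketch of a CDC pipeline on the DataStream API, using the MySQL connector from the Flink CDC project (package names follow the com.ververica artifacts used before the project’s donation to Apache; hosts and credentials are placeholders):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class CdcSketch {
    public static void main(String[] args) throws Exception {
        // Captures inserts, updates, and deletes from MySQL as a change stream,
        // with no Kafka or Kafka Connect in between.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("shop")
                .tableList("shop.orders")
                .username("flink")
                .password("secret")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mysql-cdc")
           .print();
        env.execute("cdc-sketch");
    }
}
```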


Forrester on Streaming for AI

The speaker from Forrester emphasised the importance of integrating streaming data into AI processes, particularly in the context of three types of AI: predictive AI, generative AI, and agentic AI (which supports entire business processes). The core argument was that agents perform better when they have rich context from real-time data, which often comes from multiple applications. Forrester stressed that real-time event processing and state management are critical to building AI systems that can handle complex, dynamic environments. They also highlighted that the governance of streaming data is increasingly important due to the surge in event volumes, necessitating robust management frameworks. Forrester’s analysis of stream processing frameworks identified key factors such as state management, scalability, failure management, and pattern matching as essential for building effective streaming architectures.

Restate: the ultimate, highly scalable microservice framework for the transactional world?

As discussed in my data and AI architecture seminar, stream processing has evolved through three waves: Lambda Architecture, real-time analytics, and event-driven architectures. In the final wave, we witnessed the emergence of StateFun, built on Apache Flink. However, a founder and former CEO of Data Artisans (now Ververica) argued that stream processing primarily focuses on analytics and event aggregation, while StateFun does not provide a scalable, high-throughput solution for durable transactions and event-specific business logic. This need led to the creation of Restate.

Restate is a platform designed for building resilient, event-driven applications, offering features like durable execution, state machines, and asynchronous task handling. It supports multiple programming languages (e.g., JavaScript, Python, Rust) and allows developers to orchestrate workflows as code, handle retries and failures automatically, and manage stateful event processing. Restate can be deployed in various environments, such as Kubernetes, serverless, or containers, and provides a fully managed cloud option as well as self-hosting capabilities.

Copilot LLM as GUI for Data Exploration & Query?

The concept behind this is that a business user could ask natural language questions, such as “What are our current best-selling products in Belgium?” and the system would automatically translate that into a Flink SQL query, deploy it, and retrieve the results. This is exactly what Airy is proposing. It transforms English into the query language, making data exploration highly accessible for non-technical users. By eliminating the need for SQL expertise, this tool empowers users to interact with data dynamically and efficiently, leveraging the underlying power of Flink for real-time analytics.

You can see below the high-level architecture of this very interesting copilot.
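
A minimal sketch of the loop such a copilot runs, assuming a hypothetical LlmClient interface (the product’s real API is not public here) and a deliberately naive guardrail; see the challenges below:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class NlToFlinkSqlSketch {
    /** Hypothetical stand-in for whatever completion API the copilot calls. */
    interface LlmClient {
        String complete(String prompt);
    }

    static String toSql(LlmClient llm, String schema, String question) {
        String sql = llm.complete("Translate into one Flink SQL SELECT over schema: "
                + schema + " Question: " + question);
        // Naive guardrail: only read-only statements may reach the cluster.
        if (!sql.trim().toUpperCase().startsWith("SELECT")) {
            throw new IllegalArgumentException("Refusing non-SELECT statement: " + sql);
        }
        return sql;
    }

    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql("CREATE TABLE sales (product STRING, qty INT, country STRING) "
                + "WITH ('connector' = 'datagen', 'number-of-rows' = '100')");
        // Stubbed LLM answer; a real copilot would call a model here.
        LlmClient llm = prompt -> "SELECT product, SUM(qty) FROM sales "
                + "WHERE country = 'BE' GROUP BY product";
        tEnv.executeSql(toSql(llm, "sales(product STRING, qty INT, country STRING)",
                "What are our current best-selling products in Belgium?")).print();
    }
}
```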

The current challenges?

  • Hallucinations: LLMs sometimes generate incorrect or fabricated responses (hallucinations), which can lead to faulty business insights if the model misinterprets a query or produces invalid SQL. Ensuring accuracy through tighter model integration with datasets is necessary to minimise these errors.
  • Privacy: Handling sensitive business data, such as customer or financial information, raises privacy concerns. Strict data governance, including anonymisation and role-based access control, is essential to prevent the exposure of confidential information through LLM-generated queries.
  • Security: Allowing LLMs to autonomously generate queries introduces security risks, such as query injection or unintended system vulnerabilities. Validating and securing queries before execution can mitigate these risks, along with monitoring for any unauthorised or harmful actions.

Personal note: When asked about managing data access policies from the Copilot, the response was along the lines of, “That responsibility should lie with the underlying data platform, not us.” I respectfully disagree with this perspective.