The Kafka Summit, an annual event dedicated to Apache Kafka and its ecosystem, brought together industry experts, developers, and enthusiasts to discuss the latest advancements and practical applications of event streaming and microservices. In this article, our CTO Sabri Skhiri will delve into some of the keynotes and talks that took place during the summit, highlighting the noteworthy insights and innovations shared by industry leaders.
The Keynote - The Confluent strategic move continues
The keynote speech presented several pivotal updates and strategic shifts for Confluent and the broader Kafka community. The announcement that the community has submitted over 1000 Kafka Improvement Proposals (KIPs) underscores the ongoing efforts to enhance Kafka’s speed, reliability, and functionality. It’s noteworthy that Kafka’s adoption has surged, with over 150,000 organisations now relying on it in their production environments, highlighting its critical role in modern data infrastructure. But the best announcements were about to come…
The universal data product
The keynote emphasised Confluent’s ambition to extend its influence beyond its traditional realm of real-time and fast data processing. By tackling the persistent issue of data entropy—a concept revisited from Jay Kreps’s 2016 narrative—the focus has shifted towards simplifying the increasingly complex and organically grown data infrastructures. This complexity, described as a “data mess,” has been identified as a significant source of developer pain. The proposed paradigm shift from this pain to “Data Products” signifies a move towards empowering developers, prioritising ease and efficiency over mere cost savings.
Addressing the data mess requires a transformative approach, evolving from point-to-point systems to a model that supports multiple subscribers and shifts the burden of extracting and organising data from consumers to producers. However, a notable challenge is the division between the operational and analytical domains within organisations. Currently, data products predominantly reside within the analytical realm, detached from operational processes. The envisioned solution is the development of “universal data products” that amalgamate a Kafka topic, schema, and owner, making these products discoverable, consumable, and governed through a clear contract. This approach aims to integrate these products seamlessly across both operational and analytical estates through a comprehensive data platform.
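To make the “topic + schema + owner” triplet concrete, here is a minimal sketch of the contract side of such a data product: registering an Avro schema against a topic’s subject in Schema Registry. The registry URL, subject name, and schema are illustrative assumptions, not the keynote’s example; Confluent’s governance features (ownership, tags, quality rules) build on top of exactly this kind of subject/schema binding.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: registering an Avro schema as the "contract" of a hypothetical
// "orders" topic via the Schema Registry REST API. URL, subject name and
// schema are placeholders.
public class RegisterOrderContract {
    public static void main(String[] args) throws Exception {
        String avro = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":"
                + "[{\"name\":\"order_id\",\"type\":\"string\"},"
                + "{\"name\":\"amount\",\"type\":\"double\"}]}";
        // The registry expects the schema itself wrapped as a JSON string.
        String body = "{\"schema\":\"" + avro.replace("\"", "\\\"") + "\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/subjects/orders-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Schema Registry answered: " + response.body()); // e.g. {"id":1}
    }
}
```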
This data platform is designed to offer a suite of capabilities, including real-time and reusable streams, extensive connectivity options, robust governance for security and data quality, and the facility for contextual data enrichment. By transforming data topics into universal data products, Confluent envisions enabling the creation of versatile data products that cater to both operational and analytical needs.
Reflecting on the strategic direction unveiled at the summit, it’s clear that Confluent is intensifying its market presence. Following last year’s strategic expansions—including Flink integration, the launch of a data governance package, and a cloud management platform—Confluent now positions itself as the quintessential data platform. This platform is portrayed as a holistic solution capable of transforming disorganised data infrastructures into streamlined, product-oriented architectures.
The support of the cloud platform
A highlight of the keynote was the introduction of Kora, Confluent’s scalable cloud management platform, which signifies a leap in Kafka’s global delivery capabilities. Kora’s performance, being 16 times faster than Apache Kafka on open messaging benchmarks and its scalability up to 30Gb/s, positions it as a formidable solution, especially for sectors requiring high reliability, such as financial institutions. Moreover, Kora’s pay-as-you-go pricing model, announced at $0.005 per GB, emphasises its cost-effectiveness, catering to a broad spectrum of organisational needs.
Did you say data lakehouse?
Kafka has long been established as a cornerstone in the operational estate, proficiently handling real-time data streams and events. However, the analytical realm traditionally revolves around the concept of shared data tables, prompting a need for a more unified approach to data storage and analysis. During the keynote, Jay Kreps highlighted Apache Iceberg as a standard for centralising and managing data within the analytical domain. This proposition paves the way for a more cohesive data environment, where Kafka’s operational strengths could be seamlessly integrated with the analytical capabilities of Apache Iceberg.
The interesting proposition of a true integration between Apache Kafka, which excels in handling streams and operational data, and Apache Iceberg, focused on analytical data tables, was explored. Such an integration would not only bridge the gap between operational and analytical data but also encompass stream and governance aspects. The announcement that, within the Confluent cloud platform, every stream topic can now be represented as a table in Apache Iceberg marks a significant step towards this integration. This feature enables direct mapping of the continuous stream of events managed by Kafka into structured tables within Iceberg, embodying the stream/table duality concept familiar to the streaming community. Consequently, these stream tables can be efficiently recorded in Apache Iceberg, while tables within Iceberg can be correspondingly loaded into Kafka topics. This is a true implementation of a uniform batch/real-time processing stack.
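To illustrate the stream/table duality behind this announcement, here is a sketch built from the open-source pieces that Confluent’s managed feature abstracts away: a Flink job exposing a Kafka topic as a dynamic table and continuously materialising it into an Iceberg table. This is not TableFlow itself; topic, catalog, and table names are hypothetical, and the connector options vary with Flink and Iceberg versions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Sketch of the stream/table duality: mirror a Kafka topic into an Iceberg
// table with Flink SQL. Names and connector options are illustrative only.
public class TopicToIcebergSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Expose the Kafka topic "orders" as a dynamic table.
        tEnv.executeSql(
            "CREATE TABLE orders_stream (order_id STRING, amount DOUBLE, ts TIMESTAMP(3)) WITH (" +
            " 'connector' = 'kafka', 'topic' = 'orders'," +
            " 'properties.bootstrap.servers' = 'localhost:9092'," +
            " 'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");

        // Register an Iceberg catalog (a local Hadoop catalog keeps the sketch simple).
        tEnv.executeSql(
            "CREATE CATALOG lake WITH (" +
            " 'type' = 'iceberg', 'catalog-type' = 'hadoop'," +
            " 'warehouse' = 'file:///tmp/warehouse')");
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS lake.analytics");
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS lake.analytics.orders " +
            "(order_id STRING, amount DOUBLE, ts TIMESTAMP(3))");

        // Continuously materialise the stream into the Iceberg table.
        tEnv.executeSql("INSERT INTO lake.analytics.orders SELECT * FROM orders_stream");
    }
}
```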
This dual capability introduces a groundbreaking approach to data management, effectively creating a “data lakehouse in a box.” By enabling the connection of various data sources, integrating them in both batch and real-time, storing them in Iceberg tables within a data lake, and processing them in real-time to build and distribute data products, this approach offers an all-encompassing solution. It not only streamlines the storage and analysis of data but also enhances the development and dissemination of data products. This integration achieves a truly unified data lakehouse, offering a comprehensive and efficient framework for handling the full spectrum of data operations and analytics. The layer behind this integration is TableFlow: it allows us to connect a source (say, Snowflake) through Kafka Connect, store it in Iceberg with a simple click, and create data products with Flink actions or Flink SQL.
With this new announcement, the Confluent cloud platform is the only managed data platform and data lakehouse in the cloud that is truly both batch and real-time. As a result, the most important takeaway of this keynote is that the Confluent cloud platform is now a real-time data lakehouse powered by Kafka Connect, Kafka, Iceberg and Apache Flink. Most essentially, though, whether processing is real-time or batch becomes a secondary concern; what comes first is the ease of integrating data and properly organising the data architecture into data products and a medallion architecture without coding or deploying anything.
Real-time data product - here is Apache Flink
The emphasis on real-time data processing remains paramount for various reasons, including the need for fresh and online data availability. Apache Flink plays a crucial role in the Kafka ecosystem, especially when it comes to processing and distributing data products at scale. The keynote highlighted Flink’s importance not only for traditional AI applications but also for innovative uses such as Retrieval Augmented Generation (RAG), which can leverage Apache Flink to contextualise and extract data residing at rest. Confluent has fully integrated Flink within its platform, offering features like “one-click Flink actions” and ensuring that it is production-ready with a 99.9% reliability. Confluent’s cloud service for Apache Flink has now reached General Availability (GA) across major cloud providers, including Azure, GCP, and AWS, marking a significant milestone in its deployment capabilities.
In discussing the role of Apache Flink within the concept of universal data products, the keynote speaker employed a compelling analogy comparing the creation of data products to shoe manufacturing. Just as consumers are not expected to construct their shoes from raw materials, in the data realm, a set of experts determines how to craft data products from available data “ingredients.” Apache Flink facilitates the creation of these data products, both in real-time and batch processing, through the use of SQL queries and User-Defined Functions (UDFs). Confluent has announced the early access launch of UDFs on its cloud platform, enhancing the flexibility and power of data processing capabilities.
Flink SQL is presented as a key solution for building data products effortlessly, without the need for extensive coding. The introduction of the “Flink action” feature simplifies the process further by offering a selection of predefined actions from which users can choose. For those needing more customisation, the ability to define and integrate custom UDFs into SQL queries was showcased. The demonstration highlighted how simple it is to transform data by reading from input topics, applying Flink queries and UDFs, and then writing back to an output topic. The simplicity and power of this process were striking. Additionally, the integration of infrastructure as code practices, specifically through the use of Terraform for deploying Apache Flink applications, was shown.
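As a rough idea of what such a pipeline looks like outside the managed UI, here is a sketch of the same pattern on self-managed Flink: a custom UDF registered and applied in SQL, reading an input topic and writing a cleaned “data product” to an output topic. Topic names and the email-masking UDF are hypothetical; on Confluent Cloud the UDF would be uploaded to the platform and the Terraform deployment shown in the talk would replace the hand-written job.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

// Minimal sketch of the "Flink SQL + UDF" pattern: read an input topic,
// apply a custom function, write the result to an output topic.
public class MaskedOrdersDataProduct {

    // A user-defined scalar function that masks an email address.
    public static class MaskEmail extends ScalarFunction {
        public String eval(String email) {
            if (email == null || !email.contains("@")) return email;
            return email.charAt(0) + "***" + email.substring(email.indexOf('@'));
        }
    }

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.createTemporarySystemFunction("MASK_EMAIL", MaskEmail.class);

        tEnv.executeSql(
            "CREATE TABLE orders_in (order_id STRING, customer_email STRING, amount DOUBLE) WITH (" +
            " 'connector' = 'kafka', 'topic' = 'orders'," +
            " 'properties.bootstrap.servers' = 'localhost:9092'," +
            " 'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");

        tEnv.executeSql(
            "CREATE TABLE orders_out (order_id STRING, customer_email STRING, amount DOUBLE) WITH (" +
            " 'connector' = 'kafka', 'topic' = 'orders_masked'," +
            " 'properties.bootstrap.servers' = 'localhost:9092', 'format' = 'json')");

        // The "data product": the same stream, cleaned by the UDF.
        tEnv.executeSql(
            "INSERT INTO orders_out " +
            "SELECT order_id, MASK_EMAIL(customer_email), amount FROM orders_in");
    }
}
```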
Data governance - the mandatory feature for a data platform
A significant emphasis was placed on the advancements in data governance. This focus aligns with Confluent’s ambition to establish the Kafka ecosystem as a central platform for data management, recognising that robust data governance is a critical component for achieving this goal. The presentation included a detailed demonstration of the newly introduced data governance features, reflecting their importance in the broader strategy.
Key features demonstrated included Stream Quality, which focuses on data validation and the establishment of data quality rules. Currently, this is implemented primarily through regular expressions applied to topic schemas, ensuring that data meets predefined standards of quality before it is processed or analysed. The Stream Catalogue feature was introduced to manage technical and business metadata, including sensitive information classification, thereby enhancing data security and compliance. Another notable feature, Stream Lineage, was showcased through a compelling demonstration. It provides comprehensive visibility into the data flow within the ecosystem, displaying detailed information about each topic, including its metadata and ownership, facilitating traceability and accountability.
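As an illustration of the kind of rule Stream Quality enforces, the sketch below applies a regular expression to a field before the record reaches the topic, routing violations to a dead-letter topic. This is only a client-side illustration with made-up topic names and pattern; Confluent enforces such rules centrally through its governance layer rather than in producer code.

```java
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Illustration of a regex-based quality rule: valid records go to the data
// product topic, invalid ones to a dead-letter topic. Names are hypothetical.
public class QualityGate {
    private static final Pattern ACCOUNT_ID = Pattern.compile("^[A-Z]{2}\\d{14}$");

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String accountId = "BE12345678901234";
            // Route according to whether the value satisfies the quality rule.
            String target = ACCOUNT_ID.matcher(accountId).matches()
                    ? "payments" : "payments.dlq";
            producer.send(new ProducerRecord<>(target, accountId, "{\"amount\": 42.0}"));
        }
    }
}
```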
Furthermore, the keynote addressed the definition of data ownership, encompassing roles, responsibilities, and associated processes. A key aspect of this is the integration of data access request processes within the governance framework, streamlining how data access is managed and controlled. This feature underscores Confluent’s commitment to not only improving the functionality and efficiency of data management but also ensuring that data governance is seamlessly incorporated into the ecosystem, reinforcing security, compliance, and operational integrity.
Favourite Talks
Restate - Stephan Ewen’s startup
Restate is designed to simplify the development of distributed applications and microservices by combining the ease of traditional RPC-based applications with the scalability and resilience of event-driven architectures. It introduces a novel approach for creating durable, scalable, and fault-tolerant applications without requiring developers to adopt a new programming paradigm. Restate’s core features include a lightweight event broker written in Rust, the ability to build applications with durable async/await patterns, and a distributed log for event and command management. This system aims to reduce the complexity typically associated with distributed applications, making them more robust and easier to develop. For more details, you can read the full explanation on their blog: “Why We Built Restate” or the website https://restate.dev/
The presentation was delivered by Stephan Ewen, CEO and co-founder of Restate and a co-founder of Data Artisans, the original company behind Apache Flink before its acquisition by Alibaba. Ewen’s talk offered an in-depth exploration of Restate’s user code and high-level architecture, showcasing its similarities to StateFun but with a greater emphasis on developer ease. The presentation demonstrated how Restate simplifies the creation of RPC-like microservices, enhancing stream processing’s reactivity and scalability, making it notably straightforward to implement.
More information here. At the time of writing this document, the demo was online at https://kafka2024.restate.dev/
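To give a feel for the “durable async/await” idea without reproducing the actual Restate SDK, here is a purely conceptual sketch: each side-effecting step journals its result under a stable key, so a retried invocation replays completed steps instead of re-executing them. Restate persists this journal in its replicated log and drives retries itself; the in-memory map and the payment example below are illustrative assumptions only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.Supplier;

// Conceptual sketch only, NOT the Restate SDK: the heart of durable execution
// is journaling each side-effecting step so a retry replays completed steps.
public class DurableHandlerSketch {
    private final Map<String, Object> journal = new HashMap<>();

    @SuppressWarnings("unchecked")
    private <T> T step(String key, Supplier<T> sideEffect) {
        // Replay the recorded result if the step already ran in a previous attempt.
        return (T) journal.computeIfAbsent(key, k -> sideEffect.get());
    }

    public String handlePayment(String orderId) {
        String paymentId = step(orderId + "/charge",
                () -> UUID.randomUUID().toString());          // e.g. call the payment provider
        step(orderId + "/email",
                () -> { System.out.println("receipt for " + paymentId); return true; });
        return paymentId;                                      // identical result on every retry
    }

    public static void main(String[] args) {
        DurableHandlerSketch handler = new DurableHandlerSketch();
        String first = handler.handlePayment("order-1");
        String retry = handler.handlePayment("order-1");       // simulated retry: steps replayed
        System.out.println(first.equals(retry));               // prints true
    }
}
```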
The editor’s note: Restate can be seen as a modern implementation of the event-driven SOA as envisioned by David Luckham in the early 2010s, eliminating the need for diverse technology stacks or Complex Event Processing (CEP) for transaction state tracking. This framework revolutionises the approach by distributing, checkpointing, and maintaining state across a stream-based application, not within the event itself. This paradigm shift opens new possibilities for real-time, scalable microservices, aligning with the principles of event-driven SOA without its traditional complexities. Finally, Restate can also be used as a real-time service broker between a Kafka cluster and a set of microservices.
OpenLineage
OpenLineage (https://openlineage.io/) is an open platform dedicated to the collection and analysis of data lineage, enabling the consistent collection of lineage metadata for a deeper understanding of how data is produced and used. It offers an open standard for lineage data collection, supports metadata about datasets, jobs, and runs, and integrates with various data pipeline tools. OpenLineage aims to aid in identifying the root causes of complex issues and understanding the impact of changes through a standard API for capturing lineage events. Beyond that, it aims to provide full data-governance lineage on the origins of a data flow.
The speaker emphasised the importance of data lineage for several reasons consistent with data governance principles, including privacy concerns, understanding the downstream impacts of changes such as renaming columns, tracing data origins for better comprehension, and enhancing cost efficiency. OpenLineage introduces an object model comprising runs, jobs, and datasets, facilitating lineage tracking across various technology stacks like Spark, Flink, DBT, Airflow, Dagster, Egeria, Google Cloud, and Keboola. This model is detailed in the OpenLineage documentation, which outlines how it supports lineage tracing in diverse environments. For more detailed information, see the object-model specification (https://openlineage.io/docs/spec/object-model/).
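For a concrete sense of the run/job/dataset model, here is a sketch that posts a minimal, hand-built lineage event to a backend such as Marquez. The endpoint, namespaces, job name, and producer URI are assumptions for illustration; in practice the openlineage-java client or one of the framework integrations (Spark, Flink, Airflow, …) emits these events automatically.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.OffsetDateTime;
import java.util.UUID;

// A minimal lineage event following the OpenLineage run/job/dataset model,
// posted to a lineage backend. Endpoint and names are placeholders.
public class EmitLineageEvent {
    public static void main(String[] args) throws Exception {
        String event = "{"
            + "\"eventType\":\"COMPLETE\","
            + "\"eventTime\":\"" + OffsetDateTime.now() + "\","
            + "\"run\":{\"runId\":\"" + UUID.randomUUID() + "\"},"
            + "\"job\":{\"namespace\":\"streaming\",\"name\":\"orders_enrichment\"},"
            + "\"inputs\":[{\"namespace\":\"kafka://broker:9092\",\"name\":\"orders\"}],"
            + "\"outputs\":[{\"namespace\":\"kafka://broker:9092\",\"name\":\"orders_enriched\"}],"
            + "\"producer\":\"https://example.com/my-lineage-emitter\","
            + "\"schemaURL\":\"https://openlineage.io/spec/1-0-5/OpenLineage.json\"}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:5000/api/v1/lineage"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(event))
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Lineage backend returned HTTP " + resp.statusCode());
    }
}
```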
In conclusion, OpenLineage with the Marquez integration provides great cross-platform lineage for streaming and analytics applications. The GitHub demo looks great.
AsyncAPI v3
The AsyncAPI specification enables detailed descriptions of asynchronous APIs, including not just topic content but also cluster binding information. From this specification, community-developed tools have emerged. For example, Microcks allows for mocking Kafka instances with corresponding topics easily. Additionally, the AsyncAPI Generator facilitates the creation of boilerplate code, such as generating complete Maven projects for Java, ready for packaging and execution. These tools were discussed in the context of AsyncAPI version 3, highlighting their utility in streamlining the development of asynchronous applications from a topic model.
Note that Confluent integrates AsyncAPI with its Schema Registry. As a result, it is possible to export or import AsyncAPI YAML files.
The talk was given by IBMers contributing to AsyncAPI, who work on IBM Event Endpoint Management. Here is how it relates to Kafka:
- Event discovery: Imagine a scenario where Andre, a developer, needs to find events relevant to his project. IBM Event Endpoint Management can be used to document and publish these events in a Developer Portal. This way, Andre can discover the events offered by Kafka clusters (managed by IBM or otherwise), understand what they represent, and find contact information for the event owner.
- Self-service access: Once Andre discovers the relevant events, IBM Event Endpoint Management allows him to get self-service access to them through the Event Gateway [1]. This streamlines the process and avoids the need for manual intervention.
- Security: Shavon, the owner of the events (which are described using AsyncAPI), can define access controls and protections using IBM Event Endpoint Management. This ensures that only authorised users like Andre can access the events.
This tool is part of Cloud Pak for Data. More information can be found here.
Migration to KRaft - the new ZooKeeper
Kafka, in conjunction with ZooKeeper, manages a variety of crucial metadata and operational mechanics to ensure distributed coordination and high availability. ZooKeeper’s role includes but is not limited to:
- Cluster Coordination and Management: ZooKeeper tracks Kafka cluster states, including brokers and consumer groups. It plays a critical role in managing cluster membership, ensuring that all brokers are synchronised and that consumer groups are effectively coordinated.
- Controller Election: ZooKeeper is responsible for electing a controller among the brokers. The controller is a key broker responsible for managing the leader/follower relationships for partitions. In the event of a controller failure, ZooKeeper orchestrates the election of a new controller.
- Topic and Configuration Management: ZooKeeper stores metadata about Kafka topics, such as topic names, the number of partitions per topic, and configurations for each topic. This includes settings like replication factors and retention policies (see the sketch after this list for how clients read these settings through the Admin API).
- Access Control Lists (ACLs): ZooKeeper manages ACLs, determining who can read from or write to topics, thus playing a key role in the security and governance of Kafka data.
- Quotas Management: It handles client quotas, regulating how much data a client can produce or consume, thereby preventing any single client from overloading the Kafka cluster.
- Consumer Offsets and Metadata: For Kafka versions prior to 0.10, ZooKeeper also stored consumer offsets, which track the last message read by a consumer in each partition. However, newer Kafka versions manage this within Kafka itself, moving away from storing such information in ZooKeeper.
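Whichever backend holds them (ZooKeeper znodes before the migration, the KRaft metadata log after), these per-topic settings are surfaced to clients through the Kafka Admin API rather than read from ZooKeeper directly. A small sketch, with a placeholder broker address and topic name:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

// Reading the per-topic configuration (retention, cleanup policy, ...) that the
// controller persists in ZooKeeper or, after migration, in the KRaft metadata log.
public class ShowTopicConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            Map<ConfigResource, Config> configs =
                admin.describeConfigs(Collections.singleton(topic)).all().get();
            configs.get(topic).entries().forEach(e ->
                System.out.println(e.name() + " = " + e.value()));
        }
    }
}
```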
While Kafka’s dependency on ZooKeeper has been a core aspect of its architecture, providing reliable distributed coordination, the community has been moving towards a ZooKeeper-less setup starting with Kafka version 2.8.0. This shift aims to simplify Kafka’s architecture by internalising metadata management and reducing operational complexity.
Issues with ZooKeeper that required this migration
- Scalability and Performance Limitations: The scalability of Kafka when using ZooKeeper was hindered due to the linear increase in metadata change propagation times and the need to persist updated metadata, both of which scaled with the number of topic partitions involved. This made it challenging to efficiently manage large-scale Kafka deployments. ZooKeeper-based failover is O(number of partitions) and must push the entire state to all brokers in a very large RPC.
- Operational Complexity: Operating Kafka with ZooKeeper introduced additional complexity in securing, upgrading, and debugging the system. Simplifying the architecture by removing ZooKeeper helps improve the system’s longevity and stability.
- Metadata Management Inefficiencies: ZooKeeper’s design led to inefficiencies in metadata management. For example, it imposed size limits on Znodes, had a maximum number of watchers, and required extra validation checks for brokers, potentially leading to divergent metadata views among brokers.
What does KRaft bring?
- Enhanced Metadata Log Management: By implementing a metadata log directly within Kafka (using the KRaft protocol), metadata change propagation could be improved through replication of the metadata changelog by brokers, eliminating worries about metadata divergence and version inconsistencies. This shift allows for better performance through batched asynchronous log I/Os and ensures that the system’s metadata is consistently ordered and versioned.
- Improved Controller Election and Management: Transitioning to KRaft enables a quorum of controllers instead of a single controller, facilitating near-instantaneous controller failover and reducing the unavailability window during leader elections. This is achieved through leveraging Kafka’s leader epochs to ensure a single leader within an epoch, enhancing the overall reliability and efficiency of leader election and metadata consensus (a way to inspect this controller quorum through the Admin API is sketched after this list).
- Simplified System Architecture: KRaft consolidates responsibility for metadata management into Kafka itself, simplifying the architecture and enabling a single security and management model for the entire system. This consolidation improves stability, makes the system easier to monitor, administer, and support, and allows for the scaling of clusters to millions of partitions without additional complexity.
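The controller quorum is directly observable from a client. The sketch below uses the Admin API’s metadata-quorum call (added in recent Kafka releases; exact availability depends on your client version), with a placeholder broker address; its output makes the “quorum of controllers” visible as one leader plus follower voters replicating the metadata log.

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.QuorumInfo;

// Inspecting the KRaft controller quorum through the Admin API.
public class ShowMetadataQuorum {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("Leader: " + quorum.leaderId());
            quorum.voters().forEach(v ->
                System.out.println("Voter " + v.replicaId()
                        + " at log end offset " + v.logEndOffset()));
        }
    }
}
```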
Migration to KRaft
The migration to KRaft is designed to maintain all cluster metadata as a centralised source of truth, ensuring zero downtime and facilitating a straightforward cluster upgrade. The process is operator-driven, relying on configuration changes and node restarts for execution. A key requirement of the migration is fault tolerance, particularly for the controller, which must be able to fail at any point during the migration without resulting in metadata loss or inconsistencies.
During the migration, there will be a period where the system continues to write metadata to ZooKeeper asynchronously while beginning to use KRaft. This dual-write approach allows for a fallback to ZooKeeper in case of any issues, enhancing the migration’s reliability and safety.
Regarding fault tolerance, the migration incorporates the concept of metadata transactions. Specifically, KIP-868 plays a crucial role by enabling the Kafka controller to transfer all existing metadata from ZooKeeper into a single atomic transaction within the KRaft metadata log. This approach ensures that metadata integrity is preserved throughout the migration process, even in the event of failures.
These mechanisms combined aim to provide a seamless transition to KRaft, minimising risks and ensuring the stability and integrity of the Kafka cluster throughout the migration process.
Data mesh at NORD/LB
In the case study on implementing a data mesh at NORD/LB, the key focus is on establishing strong data governance before enabling data exchange between departments. This approach addresses challenges at the data platform level, particularly the need for a common language to avoid the complexities of point-to-point translations. Kafka is used for real-time data ingestion, emphasising the decoupling of data transport from the application and integrating data quality measures directly onto the topics, while the data warehouse tables are created with ksqlDB. Ownership is maintained by the producing departments, with self-service data access facilitated through a Kafka API catalogue and topic-based data discovery. Additionally, data governance extends to Kafka topics, requiring careful modelling by the business units, including naming conventions, schema enforcement, and cloud event formatting. Metadata management is highlighted as crucial for this architecture. On top of ksqlDB, they added a virtualisation layer as the access layer.
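As a flavour of that ksqlDB step, the sketch below submits two statements over ksqlDB’s REST interface: one declaring a stream over a governed topic and one deriving a warehouse-style table from it. The statements, topic, and server address are illustrative assumptions and obviously not NORD/LB’s actual models.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: deriving a warehouse-style table from a governed topic with ksqlDB.
// Statements, topic and server address are hypothetical.
public class CreatePaymentsTable {
    public static void main(String[] args) throws Exception {
        String ksql =
            "CREATE STREAM payments (account_id VARCHAR, amount DOUBLE) " +
            "  WITH (KAFKA_TOPIC='payments', VALUE_FORMAT='JSON'); " +
            "CREATE TABLE payments_by_account AS " +
            "  SELECT account_id, SUM(amount) AS total " +
            "  FROM payments GROUP BY account_id EMIT CHANGES;";

        String body = "{\"ksql\": \"" + ksql.replace("\"", "\\\"")
                + "\", \"streamsProperties\": {}}";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8088/ksql"))
            .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + ": " + resp.body());
    }
}
```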
The presentation outlines the cultural shift towards a business-driven data integration approach using Kafka and streaming data, exploring the creation and assembly of data products with input from non-IT colleagues, and discusses the advantages and disadvantages of different methods for building data products within the context of NORD/LB’s data mesh architecture.