Our Research Director Is Co-chair at DEBS 2021 [Call for Paper]

Congratulations to our research director Sabri Skhiri on his appointment as industry co-chair of the international conference on distributed and event-based systems.

Event

MDPI Data Journal, special issue – Paper Submission Opening

After the success of five international workshops co-located at IEEE Big Data, the MDPI Data Journal is dedicating a special issue to real-time stream analytics, stream mining, CER/CEP and stream data management in big data.

Event

Reinforcement Learning Course at ENSI

Reinforcement learning is one of the most active research areas in artificial intelligence and applies to a wide range of use cases in different sectors. To provide students with the skills needed in a transforming AI landscape, the ENSI school invited us to give a course on the subject.

Event

Talking Graph Analytics With Students

Last Saturday, our Tunisian team Safa, Ichraf Hamza and Amine took part in the ENSI (Ecole Nationale des Sciences de l’Informatique) virtual forum to share their experience and meet the students! Our graph specialist Amine Ghrab talked to students about the power of graph analytics.

Academic collaborations Event

Our engineer Amine Ghrab presented his PhD public defense on the BI on Graph Project

Last Thursday, our engineer Amine Ghrab presented the BI on Graph project during his PhD public defense. Amine did an amazing job at the edge between Industry & Academia. Amine’s thesis was done in collaboration with the CODE/WIT Lab of the Université Libre de Bruxelles and the Universitat Politècnica de Catalunya, with the support of Prof. Oscar Romero & Prof. Esteban Zimanyi!

In his PhD thesis, Amine defined how BI environments can be enriched with Graph Data structures. Over the past decade, business and social environments have become increasingly complex and interconnected. As a result, graphs have emerged as a widespread abstraction tool at the core of the information infrastructure that supports these environments. In particular, the integration of graphs into data warehouse systems has appeared as a way to extend current information systems with graphs management and analysis capabilities. Going forward, Amine redefined the concepts of multidimensional cube on graph and showed how it can open new doors for data analysts. Finally, he showed how a graph data warehouse architecture can be defined.

Congratulation for your achievements!

You can find below a list of related publications:

Event

Thirty-Fourth AAAI Conference On Artificial Intelligence: A Summary

Two weeks ago, our young research engineers Hounaida Zemzem and Rania Saidi were in New York for the Thirty-Fourth AAAI Conference On Artificial Intelligence. The conference promotes research in artificial intelligence and fosters scientific exchange between researchers, practitioners, scientists, students, and engineers in AI and its affiliated disciplines. Rania and Hounaida attended dozens of technical paper presentations, workshops, and tutorials on their favourite research areas: reinforcement learning for Hounaida and graph theory for Rania. What were the big trends and their favourite talks? Let’s find out with them!

The Big Trends:

Rania says: “The conference focused mostly on advanced AI topics such as graph theory, NLP, Online Learning, Neural Nets Theory and Knowledge Representation. It also looked into real-world applications such as online advertising, email marketing, health care, recommender systems, etc.”

Hounaida adds: “I thought it was very successful given the large number of attendees as well as the quality of the accepted papers (7737 submissions were reviewed and 1,591 accepted). The talks showed the power of AI to tackle problems or improve situations in various domains.”

Favourite talks and tutorials

Hounaida explains: “Several of the sessions I attended were very insightful. My favourite talk was given by Mohammad Ghavamzadeh, an AI researcher at Facebook. He gave a tutorial on Exploration-Exploitation in Reinforcement Learning. The tutorial by William Yeoh, assistant professor at Washington University in St. Louis, was also amazing. He talked about Multi-Agent Distributed Constrained Optimization. Both their talks were clear and funny.”

Rania’s feedback? “One of my favourite talks was given by Yolanda Gil, the president of the Association for the Advancement of Artificial Intelligence (AAAI). She gave a personal perspective on AI and its watershed moments, demonstrated the utility of AI in addressing future challenges, and insisted on the fact that AI is now necessary to science. I also learned a lot about the state of the art in graph theory. The tutorial given by Yao Ma, Wei jin, Lingfu Wu and Tengfei Ma was really interesting. They explained Graph Neural Networks: Models and Applications. Finally, the tutorial presented by Chengxi Zang and Fei Wang about Differential Deep Learning on Graphs and its Applications was excellent. Both were really inspiring and generated a lot of ideas about how to continue to expand my research in the field! ”

Favourite papers

A personal selection by Rania & Hounaida of interesting papers to check out :

For Hounaida:

Generalizable Resource Allocation in Stream Processing via DRL, by Xiang Ni, Jing Li, Mo Yu, Wang Zhou, and Kun-Lung Wu. This paper considers the problem of resource allocation in stream processing, where continuous data flows must be processed in real-time in a large distributed system.
Scaling All-Goals Updates in Reinforcement Learning Using Convolutional Neural Networks, by Fabio Pardo, Vitaly Levdik, and Petar Kormushev. The authors propose to use convolutional network outputs (Q-values) to generate several sub-goals at once. And this, in order to better guide the agents.
From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning, by George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. The paper tackles the problem of constructing abstract representations for planning in high-dimensional, continuous environments.

For Rania:

Optimizing Reachability Sets in Temporal Graphs by Delaying, by Argyrios Deligkas and Igor Potapov.
Learning Hierarchy aware knowledge Graph Embeddings for Link Prediction, by Zhanqiu Zhang, Jianyu Cai, Yongdong Zhang, and Jie Wang. The authors propose a novel knowledge graph embedding model which maps entities into the polar coordinate system reflecting hierarchy.
Multi-View Multiple Clustering using Deep Matrix Factorization, by Shaowei Wei, 1Jun Wang, Guoxian Yu, Carlotta Domeniconi, and Xiangliang Zhang. The paper introduces a solution to discover multiple clusterings. It gradually factorizes multi-view data matrices into representational subspaces layer-by-layer and generates one clustering in each layer.

Final thoughts

After attending their first conference as Euranovians, what will Rania & Hounaida remember? Hounaida concludes: “Going to New York for the AAAI-20 Conference as one of the ENX data scientists was an amazing experience. I met many brilliant and sharp international experts in various fields. I enjoyed the one-week talks with so many special events, offline discussions, and the night strolls!”

Blog Event

Schloss Dagstuhl: Where Computer Science Meets

Which direction stream and complex event processing is going to take? Last week, the world’s best-known international researchers met in Schloss Dagstuhl, Germany, to present and discuss their research. Among the members were present Avigdor Gal, Professor at the Israel Institute of Technology, Alessandro Margara, Assistant Professor at the Polytechnic University of Milan, or Till Rohrmann, engineering lead at Veverica.

Invited to talk about the requirements and needs from the industry, our R&D director Sabri Skhiri explains: “The seminar brought together world-class computer scientists and practitioners working on complex event recognition, distributed systems, databases, stream reasoning and artificial intelligence. Our objective was to disseminate the recent foundational results in each of these isolated fields among all participants, to identify the open problems that need to be resolved, and to establish new research collaborations among these fields”.

What were the big trends and intakes gathered by those brilliant minds? Let’s find out with Sabri!

The Big Trends

This seminar is a bit particular as it does not show any trends but rather gives a picture of all the communities working on CER in a way or another. I was fascinated by the diversity of researchers. I did not expect to see such a rich variety of fields: knowledge representation, spatial reasoning, logic-based reasoning, data management, learning-based approaches, event-driven processing, process mining, database theory, stream mining,… According to me, the composite event recognition models that are the best at recognising complex events would include:

Data flow model
Ontology-based and reasoning model
Symbolic reasoning model
Automata-based model

We also identified common challenges across these models and communities. The three priority topics areas we identified are:

Expressivity: composability & hierarchies
Evaluation strategy, parallelization and distribution
Uncertainty management

Favourite Talk

Kurt Rothermel from TU Stuttgart – Time-sensitive Complex Event Processing

My first reaction to load shedding was: “It is useless since customers do not want to lose any event, that is why so much effort is spent today on exactly once semantics…“. However, there is a trend today in stream processing, which is the trade-off between cost, latency, and correctness. Tyler Akidau described this challenge as a choice between one of three propositions: fast and correct, cheap and correct, or fast and cheap. Tyler was talking about streaming but that rule applies in the same way in a CEP context. The load shedding strategy directly falls in the third proposition. In this perspective, the work of Kurt is highly relevant.

Favourite Tutorial

Jacopo Urbani & Fredrik Heintz – Stream Reasoning

Concretely, stream reasoning is incremental reasoning over rapidly changing information. The tutorial opened new perspectives on stream processing for me. It tried to answer a very interesting question: how can you provide reasoning about context from streams of data? I definitely come from the database and event-based systems communities and I did not know at all that stream reasoning was so mature. This community has been evolving from having a continuous version of SPARKQL to a complete distributed stream reasoning semantics. It is interesting to see that the work we have done in the LEAD algebra and semantics is deeply inspired by this community. However, we have never used any reasoning logic on top of LEAD. But after a few hours of the tutorial, I realise that (1) reasoning can be used for query rewriting and optimisation (2) it is worth evaluating at least BigSR, the LARS implementation on Flink.

Avigdor Gal & Ruben Mayer – Distributed and Event-Based Systems

Avidgor is a kind of pop star for the stream processing and distributed systems community, or at least for me! The papers he published about a probabilistic CEP engine with late arrival and event uncertainty were visionary.

The speakers started by explaining the basics of stream processing then went deeper into the event recognition language and architecture. They detailed pub/sub applied to event recognition and explained the data flow model, which consists of a single unified data processing model where the stream and batch paradigms are the same. This last part was based on Tyler Akidau’s paper.

A second part of the talk focused on elasticity on streams. Stream fission puts operators among different categories:

Firstly, key-based operators, that is a group by operation (as in SQL)
Secondly, window-based operators enable to split processing that needs to have multiple event types correlated with different keys within the same operator
Finally, pane-based operators enable a split-merge strategy where you distribute and merge the result.

Interestingly, Avigdor presented his work about late-arrival processing from a probabilistic viewpoint and not from the watermark perspective. Usually, modern stream processing frameworks use watermarks in order to take into account events that arrive later. Avigdor presented a probabilistic approach to this issue.

What are late-arrival events?

Imagine we want to count the number of cars entering a road segment every three minutes: we have a “tumbling window” every 3 minutes. If an event (ie a car) arrives at 2’55 second in the window but is stuck somewhere in the network for 6 sec, it is called a late-arrival event. The processing time (the time at which the CEP processes the event) is delayed compared to the event time (the time on which the event really occurs).

Note that for CEP, there is clearly a trade-off between timeliness and accuracy, because the slack time will increase the delay to deliver your result but will increase your accuracy. There is always a tradeoff between cost, latency and correctness, and usually, you can only pick two among the three.

Fun fact: If you need to explain what is event time & processing time to your mother (yeah, don’t underestimate the power of this kind of discussion at Christmas dinner), the best way is to take the Star Wars analogy. From an event time perspective (which is the time at which the story really happened) you should follow episode 1, 2, 3,4, 5, 6, 7,8, 9. But if you take the processing time (the time on which we received the episode), it is 4, 5, 6, 1, 2, 3, 7, 8, 9. Isn’t it great ?!

Final Thoughts

CER has been explored from many viewpoints. However, never in the research history was there a meeting gathering representatives of these communities. This was the objective of this seminar. Having all these people in a castle in the middle of nowhere was a blast! I had very passionate discussions during meals but also during the night at the library with the most brilliant brains on stream and CEP. On the other hand, I still had some fun discussions about comparing Star Trek DIscovery and Picard! Finally, the most important things I will remember after this seminar… are the endless ping pong games with Till Rohrmann and Alessandro Margara :-).

Blog Event

Fourth Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last December, Eura Nova’s research center held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0.

The workshop was a real success with many interesting questions and comments. If you could not attend, our R&D engineer Syrine Ferjaoui brought back important elements from the presentations for you.

First keynote speaker:

First of all, the workshop started with the keynote of Matteo Merli, PMC member at Apache Pulsar. His talk “Messaging and Streaming” explained how Pulsar can be a unified infrastructure that supports messaging and streaming.

Matteo introduced messaging as events that are being created and streaming as analysing events that just happened. These are two different processing concepts but they need a single infrastructure. He then explained the architecture view of Pulsar, which has separate layers between the brokers and the bookies (BookKeeper instances that handle persistent storage of messages). This means that brokers and bookies can be added independently, traffic can be shifted very quickly across brokers, and new bookies will ramp up on traffic quickly. This segmented distribution makes the architecture of Pulsar more flexible and dynamic.

Pulsar has other interesting features such as durability, low latency, high throughput, high availability, unified messaging model, high scalability, native computing, … The roadmap includes working on Pulsar storage API to allow direct access to data stored in Pulsar and to retrieve and process data more efficiently. They are also working on higher-level messaging features.”

Second keynote speaker:

The second keynote was given by John Roesler, a Kafka committer at Confluent. He talked about Kafka Streams and the evolution of streaming paradigms.

To design software, we, developers, used to separate the application logic from the database. To scale the database capacity, we then started to use a search index to do ETL jobs and query the database in a fast and optimal way. However, this created bugs in the software, added data consistency issues, and created more complexity in the system. Later, we started to use HDFS for a more flexible design. While enabling replication and distributed storage, this solution added more latency and supported batch processing only. It did not meet the needs of real-time processing use cases.

At this point, streaming helped a lot. The next step was to add a streaming platform that reads from sources, does some computation, and sinks the result somewhere else. The KafkaStreams design is a set of multiple lambda stateful functions, which makes it a good fit for a microservices architecture. With Kafka Streams’ new updates, the app logic is linked to a relational database with ACID guarantees.

Finally, John Roesler considers that “software is a fractal”, a never-ending pattern: a software architecture is complex and even when we zoom into a single component, it is still complex. But for the Kafka Streams’ design, when we zoom out, it looks like a set of services interacting and connected to each other and this simplifies the aforementioned designs.

John concluded by mentioning open problems that can be dealt with in stream processing, including semantics, observability, operability, and maintainability.

Workshop Invited Speakers:

After the keynotes, 8 selected papers were presented, covering mainly these 6 topics: (1) Stream Processing for IoT, (2) Serverless and HPC (High Performance Computing), (3) Collaborative Streaming, (4) Stream Mining, (5) Image Mining and (6) Real-time Machine Learning. Some papers are not yet available, as they will be published in the proceeding of the IEEE Big Data Conference. In the meantime, do not hesitate to contact our R&D department at [email protected] to discuss how you can leverage stream processing in your projects.

Scalable and Reliable Multi-Dimensional Aggregation of Sensor Data Streams (Sören Henning, Wilhelm Hasselbring)

Sören and Wilhelm are engineers in the Software Engineering Group from Kiel University. They propose a stream processing architecture which allows for aggregating sensors in hierarchical groups, supports multiple hierarchies in parallel, provides reconfiguration at runtime, and preserves the scalability and reliability qualities of streaming.

Performance Characterization and Modeling of Serverless and HPC Streaming Applications (Andre Luckow, Shantenu Jha)

Andre Luckow, head of Blockchain and Emerging Technologies at BMW Group, and Shantenu Jha, associate professor at Rutgers University, presented StreamInsight, which provides insight into the performance of streaming applications and infrastructure, their selection, configuration, and scaling behaviour.

Collaborative Streaming: Trust Requirements for Price Sharing (Tobias Grubenmann, Daniele Dell’Aglio, Abraham Bernstein)

The paper is written by Tobias Grubenmann, researcher at The University of Hong Kong, in collaboration with Daniele Dell’Aglio and Abraham Bernstein, researchers at the University of Zurich. They present the Collaborative Stream Processing (CSP), a model where the costs, which are set exogenously by providers, are shared between multiple consumers, the collaborators. For this, they identify the important requirements for CSP to establish trust between the collaborators and propose a CSP algorithm adhering to these requirements.

Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining (Tengyue Li, Simon Fong, and Raymond Wong)

Tengyue Li and Simon Fong (researcher and associate professor at the University of Macau, China) and Raymond Wong (associate professor at UNSW Sydney) worked on the Kennard-Stone Balance algorithm used as a new data conversion method. Training a prediction model effectively using big data streams poses certain challenges in machine learning. In this paper, the authors apply the Kennard-Stone algorithm on time-series to extract a meaningful representation of big data streams, which improves the performance of a machine learning model.

Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures (Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch)

Shawndra Hill (Microsoft), Anthony Colas (University of Florida), H. Andrew Schwartz (State University of New York at Stony Brook) and Gordon Burtch (University of Minnesota) explained their work on the interactions between TV content and online behaviours such as response to digital advertising. They developed AdMiner, a tool that can track online activity around a brand and provide actionable insights into ad campaigns.

MLK Smart Corridor: An Urban Testbed for Smart City Applications (Austin Harris, Jose Stovall, and Mina Sartipi)

Austin Harris, Jose Stovall, and Mina Sartipi (researchers and CUIP director at the University of Tennessee at Chattanooga) have helped to create Chattanooga’s smart corridor, used to test new technologies and generate data-driven outcomes. In their talk, they presented the corridor, used as a test bed for research in smart city developments in a real-world environment. The wireless communication infrastructure and network of sensors in combination with data analytics provide a means of monitoring and controlling city resources and infrastructure in real time.

Image Mining for Real Time Quality Assurance in Rapid Prototyping (Sebastian Trinks and Carsten Felde)

Sebastian Trinks and Carsten Felde (TU Bergakademie Freiberg) presented how image mining can help avoiding errors and low quality of printed prototypes in real time. This can result in saving resources and increasing efficiency when developing new products.

Real-Time Machine Learning Competition on Data Streams at the IEEE Big Data 2019 (Dihia Boulegane)

This year, IEEE Big Data held the Real-time Machine Learning Competition on Data Streams. As the competition is focused on streaming, its online platform required a specific infrastructure that meets data stream mining requirements. Dihia Boulegane is a Ph.D. student at Télécom ParisTech working in collaboration with Orange Labs on machine learning for IoT networks monitoring. She was in charge of implementing the streaming engine of the dedicated platform of the competition. Dihia explained its components, the technologies used, and the challenges met to build the platform. At the end, the platform was able to provide multiple streams to multiple users, to receive multiple streams, to process them and to provide the leader board and live results.

Special thanks to our keynote guests, Matteo Merli and John Roesler, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

Event

IEEE Big Data 2019 – A Summary

At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Los Angeles to attend IEEE Big Data Conference. It is one of the most influential academic gatherings in distributed machine learning. This year, it featured 879 authors, shortlisted from 2009 applicants. They came from 28 countries and presented 210 papers. Back in Belgium, Sabri and Syrine give you their opinion on the event itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.

The Big Trends

Sabri says: “The main trends were deep learning, NLP, privacy-preserving approaches, GAN, graph mining and stream mining. In my view, the level of the papers was quite good. Authors are becoming ever more skilled in data science, maths and algorithms. This goes to show that to be a good data scientist, you need an extensive set of advanced skills. Interestingly, there was almost nothing about distributed computing! This is a big move compared to the previous editions. The only presentations that had something to do with distributed systems were about optimisation strategies, an area similar to what our ECCO team researches. The Big Data Conference focuses on data science; it does not really look into its scalability. Distributed computing topics tend to be dealt with at conferences like DEBS, VLDB, USENIX, SIGMOD, etc. As a result, this conference is an amazing place to see hundreds of data science use cases with, most of the time, an interesting contribution.”

The Keynotes

The keynotes were focused on data science as well. We even heard the term “Big Data Science”.

Keynote 1: Responsible Data Science by Lise Getoor – Professor at UC Santa Cruz

Syrine says: “The first keynote was my favourite. Lise started by comparing machine learning to a black box. The goal was to unpack the box and invite people to use data science and to use it wisely. To autonomise ethical decision-making, we should move away from maximising AI systems autonomy and move toward human-centric systems. To do this, we should make sure that human-centric systems have three qualities: (1) be knowledge-based, (2) be data-driven, and (3) support human values. Achieving responsible data science requires both machine-learning and ethics.”

Keynote 2: DataCommons “Google for Data” by Ramanathan Guha – Google

Guha presented DataCommons, a project started by Google to combine data from different open sources. Syrine explains: “Google’s DataCommons project allows users to pretend that the Web is one website, enabling developers to pretend all this data is in one database. The long-term vision of Google is to aggregate all data from publicly available sources (Medicare, Wikidata, sequence data, Landsat, CDC, Census…) into a single Open Knowledge Graph. The goal is to reduce or eliminate the data download-clean-store process. Instead, users can access and use already cleaned data in the cloud. Data can be public or private (internet & intranet). This will avoid repeated data wrangling and ease the burden of data storage, indexing, etc.”

The Tutorials

This year, IEEE Big Data held nine tutorials. Our R&D director explains: “At this type of events, tutorials are always a good way to learn a complete state of the art in a couple of hours. I particularly appreciated the tutorial on “Taming Unstructured Big data: Automated Information Extraction for Massive Text” by the team of the famous Jiawei Han (he is a kind of pop star in data mining and the father of Graph Cube). I found out that many papers about named entity relations were published in the past two years. The idea is to be able to extract supervised, semi-supervised, and unsupervised relations between entities: for instance, discovering that “Trump” is “President of” “USA”. They also propose new approaches to integrate knowledge bases such as DBPedia or YAGO to infer new unknown relations from a corpus. This is just amazing!”

Syrine adds: “The tutorial on NewSQL principles, systems, and current trends was interesting as it explained why we should consider using NoSQL/NewSQL to deal with data interconnections and very high scalability. After attending this tutorial, I was motivated to order this book about Principles of Distributed Database Systems. For fans of deep learning, the tutorial “Deep Learning on Big Data with Multi-Node GPU Jobs” covers a lot about large-scale GPU-based deep-learning systems. If you missed the conference, all resources can be found on this link.”

The Workshops

The EURA NOVA research centre organised the fourth workshop on Real-time and Stream Analytics in Big Data, at the 2019 IEEE conference on Big Data. We were really happy to welcome Matteo Merli from Apache Pulsar and John Roesler from Confluent as keynotes speakers. Thank you to them and to all the attendees and speakers! They had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event will soon be available on our website. The slides of the keynotes are available here:

Favourite Papers

A personal selection of interesting papers:

Subspace Clustering with Active Learning (Hankui Peng)

The paper tackles a really interesting problematic faced by a lot of data scientists. Introducing active learning is a cool idea and so is the way they used a mathematical trick to make their approach feasible.

High Impact Customer Acquisition & Retention Modelling – A Scalable Data Mashup Approach (Kajanan Sangaralingam, Nisha Verma, Aravind Ravi, Su Won Bae)

Su Won Bae, from Mobilewalla, presented how they can define a complete customer acquisition model by mixing their data with their customer data (in this case, a worldwide leader in food delivery). Sabri says: “The quality of data science models highly depends on the data they can train on. I am convinced we will go in the same direction as Mobilewalla in the future to have richer models. However, mixing data must be done with care as it may raise some privacy issues; our purpose has to have legal ground.”

Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings (Ahmed El-Kishky, Frank Xu, Aston Zhang, Jiawei Han)

The speaker presented MorphMine, a method for unsupervised morpheme segmentation. It can generate morpheme candidates that are filtered out using entropy to select the best morphemes from a corpus. Then, these morphemes can be used to highly improve the word embedding model and the downstream machine learning tasks.

Blog Event

Flink Forward: The Key Takeaways

Early October 2019, 6 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.

In this article, they will give you their opinion about Ververica’s’ main announcement, the impact of Ververica acquisition by Alibaba, the big trends, and a selection of their favourite talks.

Alibaba!

This is the first Flink Forward conference since the acquisition of Ververica (formerly known as data Artisans) by Alibaba, which has been one of the largest users of Flink and second-largest contributor for years. Our R&D director Sabri Skhiri says: “The only significant impact of this acquisition on the conference is that the venue is now at the Berlin Business Center instead of the Kulturbrauerei. There, we could see that the Apache Flink user’s community has grown significantly as well as their commits on Flink. This edition was a bit more business and enterprise-oriented than previous ones, although it still had its technical DNA and a lot of technical talks. All in all, this was a very good mix. Alibaba folks are deeply committed to open source and creating technology impact. We saw a lot of activities from them such as the integration of the Blink SQL runner, the hive integration or the new scheduling model. In summary, a great event.”

First Keynote Announcement

Keynote: Stream Processing and Applications in the Modern Age (Stephan Ewen)

During the first keynote, Ververica took the opportunity to announce the launch of Stateful Functions (statefun.io), an open-source framework built on top of Flink to run stateful serverless functions. It bridges the gap between Function as a Service and stream processing.

Sabri says: ”Last year, they announced their streaming ledger that brings ACID transactions between states to stream processing applications. This year, they announced the launch of Stateful Functions, a framework that reduces the complexity of building and orchestrating stateful applications at scale. In the streaming world, this announcement does not change a lot of things. However, in the microservice community, this opens new doors in terms of design patterns, especially in the way data feeding and stateful operations can be designed more flexibly.”

You can find the video of the presentation here.

The Big Trends

1. Unified batch and streaming

A significant trend of this edition is the “Unified Stream and Batch” moto. Our R&D engineer Syrine Ferjaoui says: “Flink currently features different APIs, the DataSet API for batch processing and the DataStream API for stream processing. In addition, the Table API is already a unified API on top of both (DataSet and DataStream) with declarative-style programming. Now, they are working on a solution to unify truly the batch and streaming APIs.”

Sabri adds: “In Flink 1.9, they released the State API with which a state created in batch can be used in a stream application – interesting for bootstrapping/backfilling states. But the community is going further by proposing in Flink 2.0 a unique Data API that will merge DataSet and DataStream while still taking advantage of the batch properties to optimise the execution.”

Every talk was exploring in a way or another how this unification can be pushed forward. For instance, in the Pulsar talk, they were thinking about using Pulsar as a back end to transparently bootstrap a state and then switch on stream using (1) pulsar capability in terms of segment storage and (2) unified data stream API in Flink.”

2.”Enterprise-grade” Flink:

Flink is moving clearly toward an “enterprise-grade” technology. Sabri says: “The first signal is that Cloudera adopted Apache Flink into its Data Platform. Also, AWS Kinetics now integrates Flink as a client. Adoption by such big players goes to show that Flink is well on the way to gain enterprise-grade support. The second signal is the release of the Ververica Platform that highly facilitates enterprise-grade operations. Thirdly, the integration of the Hive Metastore with the pluggable catalogue architecture is a significant step towards better governance and metadata management. Finally, there were many talks about lowering the barrier to deploying Flink in prod. The topics included APIs, configuration, memory management, K8S operators, etc.”

3.The ML path

Finally, regarding ML/AI, there is still a lot of work to get over the gap with the Spark ecosystem. However, the Alibaba folks are working hard on this topic and we can already see the first results. Sabri says: “The refactoring of the Flink ML interface to work on Flink Table APIs is excellent. There is an excellent vision of integrating Flink as a data prep engine for ML and serving layer; and the roadmap looks great.”

Interesting talks

A personal selection by Charles & Christophe of interesting talks to check out :

For Charles, our data architect:

Aljoscha Krettek & Timo Walther, respectively a co-founder at Ververica and a PMC member of Apache Flink work on the Flink APIs. They give a summary of recent contributions to Flink’s Table & SQL APIs. It was a very good overview of what is going on in terms of refactoring and where we are going.
Roman Grebennikov is a software developer from Findify AB. His talk focused on Flink serialisation framework and common problems happening around it. He illustrated and explained several ways to optimise Flink jobs by taking care of the serialisation, which in most cases represents about 60% of the processing.

For Christophe, our software engineer:

Konstantin Klauf is the head of product for the Ververica Platform based on Apache Flink. He discussed Apache Flink worst practices by sharing anecdotes and hard-learned lessons of adopting distributed stream processing. It was a humorous list of general good practices when working with Flink from planning, requirement, deployment, and maintenance.
Aaron Levin and Mike Mintz are software engineers in a Stripe’s streaming team. They talked about the many challenges they encountered writing the specialised dual source. This talk was a very well-told story about a simple use case with a high constraint: all-time deduplication of transactions at Stripe (a payment platform‎). It was funny, insightful, full of lessons learned and echoed some of digazu’s features: the history replayer.

Blog Event