Schloss Dagstuhl: Where Computer Science Meets

Which direction stream and complex event processing is going to take? Last week, the world’s best-known international researchers met in Schloss Dagstuhl, Germany,  to present and discuss their research. Among the members were present Avigdor Gal, Professor at the Israel Institute of Technology, Alessandro Margara, Assistant Professor at the Polytechnic University of Milan, or Till Rohrmann, engineering lead at Veverica.

Invited to talk about the requirements and needs from the industry, our R&D director Sabri Skhiri explains: “The seminar brought together world-class computer scientists and practitioners working on complex event recognition, distributed systems, databases, stream reasoning and artificial intelligence. Our objective was to disseminate the recent foundational results in each of these isolated fields among all participants, to identify the open problems that need to be resolved, and to establish new research collaborations among these fields”.

What were the big trends and intakes gathered by those brilliant minds? Let’s find out with Sabri!

 

 

The Big Trends

This seminar is a bit particular as it does not show any trends but rather gives a picture of all the communities working on CER in a way or another. I was fascinated by the diversity of researchers. I  did not expect to see such a rich variety of fields: knowledge representation, spatial reasoning, logic-based reasoning, data management, learning-based approaches, event-driven processing, process mining, database theory, stream mining,… According to me, the composite event recognition models that are the best at recognising complex events would include:

  1. Data flow model
  2. Ontology-based and reasoning model
  3. Symbolic reasoning model
  4. Automata-based model

We also identified common challenges across these models and communities. The three priority topics areas we identified are:

  1. Expressivity: composability & hierarchies
  2. Evaluation strategy, parallelization and distribution
  3. Uncertainty management

 

Favourite Talk

Kurt Rothermel from TU Stuttgart – Time-sensitive Complex Event Processing

My first reaction to load shedding was: “It is useless since customers do not want to lose any event, that is why so much effort is spent today on exactly once semantics…“. However, there is a trend today in stream processing, which is the trade-off between cost, latency, and correctness. Tyler Akidau described this challenge as a choice between one of three propositions: fast and correct, cheap and correct, or fast and cheap.  Tyler was talking about streaming but that rule applies in the same way in a CEP context. The load shedding strategy directly falls in the third proposition. In this perspective, the work of Kurt is highly relevant.

 

Favourite Tutorial

Jacopo Urbani & Fredrik Heintz – Stream Reasoning

Concretely, stream reasoning is incremental reasoning over rapidly changing information. The tutorial opened new perspectives on stream processing for me. It tried to answer a very interesting question: how can you provide reasoning about context from streams of data? I definitely come from the database and event-based systems communities and I did not know at all that stream reasoning was so mature. This community has been evolving from having a continuous version of SPARKQL to a complete distributed stream reasoning semantics. It is interesting to see that the work we have done in the LEAD algebra and semantics is deeply inspired by this community. However, we have never used any reasoning logic on top of LEAD. But after a few hours of the tutorial, I realise that (1) reasoning can be used for query rewriting and optimisation (2) it is worth evaluating at least BigSR,  the LARS implementation on Flink.

 

Avigdor Gal & Ruben Mayer – Distributed and Event-Based Systems

Avidgor is a kind of pop star for the stream processing and distributed systems community, or at least for me! The papers he published about a probabilistic CEP engine with late arrival and event uncertainty were visionary.

The speakers started by explaining the basics of stream processing then went deeper into the event recognition language and architecture. They detailed pub/sub applied to event recognition and explained the data flow model, which consists of a single unified data processing model where the stream and batch paradigms are the same.  This last part was based on Tyler Akidau’s paper.

A second part of the talk focused on elasticity on streams. Stream fission puts operators among different categories:

  • Firstly, key-based operators, that is a group by operation (as in SQL)
  • Secondly, window-based operators enable to split processing that needs to have multiple event types correlated with different keys within the same operator
  • Finally, pane-based operators enable a split-merge strategy where you distribute and merge the result.

Interestingly, Avigdor presented his work about late-arrival processing from a probabilistic viewpoint and not from the watermark perspective. Usually, modern stream processing frameworks use watermarks in order to take into account events that arrive later. Avigdor presented a probabilistic approach to this issue.

 

What are late-arrival events?

Imagine we want to count the number of cars entering a road segment every three minutes: we have a “tumbling window” every 3 minutes. If an event (ie a car) arrives at 2’55 second in the window but is stuck somewhere in the network for 6 sec, it is called a late-arrival event. The processing time (the time at which the CEP processes the event) is delayed compared to the event time (the time on which the event really occurs).

Note that for CEP, there is clearly a trade-off between timeliness and accuracy, because the slack time will increase the delay to deliver your result but will increase your accuracy. There is always a tradeoff between cost, latency and correctness, and usually, you can only pick two among the three.

Fun fact: If you need to explain what is event time & processing time to your mother (yeah, don’t underestimate the power of this kind of discussion at Christmas dinner), the best way is to take the Star Wars analogy. From an event time perspective (which is the time at which the story really happened) you should follow episode 1, 2, 3,4, 5, 6, 7,8, 9. But if you take the processing time (the time on which we received the episode), it is 4, 5, 6, 1, 2, 3, 7, 8, 9.  Isn’t it great ?!

 

Final Thoughts

CER has been explored from many viewpoints. However, never in the research history was there a meeting gathering representatives of these communities. This was the objective of this seminar. Having all these people in a castle in the middle of nowhere was a blast! I had very passionate discussions during meals but also during the night at the library with the most brilliant brains on stream and CEP. On the other hand, I still had some fun discussions about comparing Star Trek DIscovery and Picard! Finally, the most important things I will remember after this seminar… are the endless ping pong games with Till Rohrmann and Alessandro Margara :-).

Throwback To 2019

At EURA NOVA, we believe technology is a catalyst for change. To embrace it, we strive to stay at the edge of knowledge. Investing in research allows us to continuously become more proficient, to maintain our know-how at the cutting edge of IT, to share its benefits with our customers, and to incubate the products of tomorrow. As we look back on the year 2019, we are both proud and happy of the work achieved!

 

Published papers:

We are happy to say that our R&D department has published five peer-reviewed scientific papers last year.

 

  • LEAD: A Formal Specification For Event Processing

 

In June, our R&D engineer Anas presented his work on complex event processing at the 13Th ACM international Conference on distributed and event-based systems, which was taking place in Germany.

Anas Al Bassit, Skhiri Sabri, LEAD: A Formal Specification For Event Processing, in 13Th ACM international Conference on distributed and event-based systems 2019

 

  • Coherence Regularization for Neural Topic Models

 

In July, our R&D engineer Kate presented her paper on neural topic models at the 16th International Symposium on Neural Networks taking place in Moscow.

Katsiaryna Krasnashchok, Aymen Cherif, Coherence Regularization for Neural Topic Models. in 16th International Symposium on Neural Networks 2019 (ISNN 2019)

 

  • STRASS: A Light and Effective Method for Extractive Summarization

 

In August, our PhD student Léo was in Italy to present his paper at the 2019 ACL Student Research Workshop.

Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.

 

  • GraphOpt: Framework for Automatic Parameters Tuning of Graph Processing Frameworks

 

In December, the paper written by our former intern and now full-time colleague Muaz was presented in Los Angeles at the third IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications.

Muaz Twaty, Amine Ghrab, Skhiri Sabri: GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks. 2019 IEEE International Conference on Big Data (Big Data) Workshops, Los Angeles, CA, USA.

 

  • A Performance Prediction Model for Spark Applications

 

In June 2020, our paper written as part of the ECCO research project we have been leading at EURA NOVA will be presented at the Big Data congress 2020 taking place in Hawaii.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

 

IEEE Big Data Workshop

Last December, Eura Nova’s research centre held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0. Special thanks to our keynote guests, Matteo Merli (Apache Pulsar) and John Roesler (Confluent), and all the attendees and speakers!

 

JERICHO, research driving innovations

The mission of the JERICHO research track is to make the latest technologies available to our client, to offer them a competitive edge to play along megacorporations.  After two years of intense work, seven published papers, presentations in international conferences spanning Russia, the United States, Germany, Australia, or Belgium, our Jericho project has come to an end.

And the adventure continues! We are really excited to continue our work on innovative solutions for the next data challenges with our new research track ASGARD.

Our R&D director Sabri Skhiri says: “The costs of data solutions and the lack of data scientists will increase in the next 3 to 5 years and solutions to reduce them will benefit from a large market. In this sense, ASGARD is precisely in the strategy of Eura Nova. ASGARD aims to reduce these costs by automating the most expensive tasks. As the world becomes increasingly digital and reinvents itself, innovation and research are essential in the market.”

 

Academic collaboration

This year, we welcomed nine interns across our three offices. A big kudo to our intern Muaz who successfully finished his master thesis in collaboration with EURA NOVA! The goal of his thesis was to optimise the configuration of distributed graph frameworks. He now joined EURA NOVA to work as a full-time employee.

 

Talks & seminars

This year, the research team had the pleasure to be invited at several international conferences:

  • In February, our research director Sabri Skhiri gave a seminar on modern Stateful Stream Processing at EPT. Our R&D engineer Syrine Ferjaoui also went to Morocco to give a workshop about data architecture at the Annual International Conference on Arab Women In Computing.
  • In March, Sabri was at the World AI Show in Dubaï to talk about successfully deploying AI projects in production. He was also invited to Barcelona Tech to give a Big Data Architecture & Design  seminar.
  • In June, our data privacy officer Nazanin Gifani gave a masterclass on Fairness and Transparency in AI at the DI Summit in Brussels.
  • In September, our R&D project manager Shivom Aggarwal talked at the Arab Future Cities Summit 2019 about deploying AI at industrial scale for smart cities.
  • In October, our software engineer Christophe Philemotte was in San Francisco to talk at the Kafka Summit about crossing the streams thanks to Kafka and Flink.
  • In November, Sabri was invited as a keynote speaker at the 17th International Conference on Service-Oriented Computing to share his experience about the convergence between micro-service, stateful stream processing and function as a service.

 

Summer schools & conferences

This year, Euranovians attended more than 15 prestigious international conferences and summits across the world to remain up to date and grow our network. We investigated the state of the art in streaming, data science, DevOps, computer vision or cloud engineering at conferences such as Flink Forward, Spark AI Summit, Kubecon, IEEE Big Data, DataWorks Summit, Kafka Summit, NeurIPS, RedHat, Elixir LDN or CVPR.

Euranovians brought back what they learned for the rest of the team and the big data community. Find our public summaries, identified trends and review of conferences here:

 

Fourth Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last December, Eura Nova’s research center held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0.

The workshop was a real success with many interesting questions and comments. If you could not attend, our R&D engineer Syrine Ferjaoui brought back important elements from the presentations for you.

 

First keynote speaker:

First of all, the workshop started with the keynote of Matteo Merli, PMC member at Apache Pulsar. His talk “Messaging and Streaming” explained how Pulsar can be a unified infrastructure that supports messaging and streaming.

Matteo introduced messaging as events that are being created and streaming as analysing events that just happened. These are two different processing concepts but they need a single infrastructure. He then explained the architecture view of Pulsar, which has separate layers between the brokers and the bookies (BookKeeper instances that handle persistent storage of messages). This means that brokers and bookies can be added independently, traffic can be shifted very quickly across brokers, and new bookies will ramp up on traffic quickly. This segmented distribution makes the architecture of Pulsar more flexible and dynamic.

Pulsar has other interesting features such as durability, low latency, high throughput, high availability, unified messaging model, high scalability, native computing, … The roadmap includes working on Pulsar storage API to allow direct access to data stored in Pulsar and to retrieve and process data more efficiently. They are also working on higher-level messaging features.”

 

Second keynote speaker:

The second keynote was given by John Roesler, a Kafka committer at Confluent. He talked about Kafka Streams and the evolution of streaming paradigms.

To design software, we, developers, used to separate the application logic from the database. To scale the database capacity, we then started to use a search index to do ETL jobs and query the database in a fast and optimal way. However, this created bugs in the software, added data consistency issues, and created more complexity in the system. Later, we started to use HDFS for a more flexible design. While enabling replication and distributed storage, this solution added more latency and supported batch processing only. It did not meet the needs of real-time processing use cases.

At this point, streaming helped a lot. The next step was to add a streaming platform that reads from sources, does some computation, and sinks the result somewhere else. The KafkaStreams design is a set of multiple lambda stateful functions, which makes it a good fit for a microservices architecture.  With Kafka Streams’ new updates, the app logic is linked to a relational database with ACID guarantees.

Finally, John Roesler considers that “software is a fractal”, a never-ending pattern: a software architecture is complex and even when we zoom into a single component, it is still complex. But for the Kafka Streams’ design, when we zoom out, it looks like a set of services interacting and connected to each other and this simplifies the aforementioned designs.

John concluded by mentioning open problems that can be dealt with in stream processing, including semantics, observability, operability, and maintainability.

 

Workshop Invited Speakers:

After the keynotes, 8 selected papers were presented, covering mainly these 6 topics: (1) Stream Processing for IoT, (2) Serverless and HPC (High Performance Computing), (3) Collaborative Streaming, (4) Stream Mining, (5) Image Mining and (6) Real-time Machine Learning. Some papers are not yet available, as they will be published in the proceeding of the IEEE Big Data Conference. In the meantime, do not hesitate to contact our R&D department at research@euranova.eu to discuss how you can leverage stream processing in your projects.

Sören and Wilhelm are engineers in the Software Engineering Group from Kiel University. They propose a stream processing architecture which allows for aggregating sensors in hierarchical groups, supports multiple hierarchies in parallel, provides reconfiguration at runtime, and preserves the scalability and reliability qualities of streaming.

Andre Luckow, head of Blockchain and Emerging Technologies at BMW Group, and Shantenu Jha, associate professor at Rutgers University, presented StreamInsight, which provides insight into the performance of streaming applications and infrastructure, their selection, configuration, and scaling behaviour.

The paper is written by Tobias Grubenmann, researcher at The University of Hong Kong, in collaboration with Daniele Dell’Aglio and Abraham Bernstein, researchers at the University of Zurich. They present the Collaborative Stream Processing (CSP), a model where the costs, which are set exogenously by providers, are shared between multiple consumers, the collaborators. For this, they identify the important requirements for CSP to establish trust between the collaborators and propose a CSP algorithm adhering to these requirements.

  • Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining (Tengyue Li, Simon Fong, and Raymond Wong)

Tengyue Li and Simon Fong (researcher and associate professor at the University of Macau, China) and Raymond Wong (associate professor at UNSW Sydney) worked on the Kennard-Stone Balance algorithm used as a new data conversion method. Training a prediction model effectively using big data streams poses certain challenges in machine learning. In this paper, the authors apply the Kennard-Stone algorithm on time-series to extract a meaningful representation of big data streams, which improves the performance of a machine learning model.

 

  • Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures (Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch)

Shawndra Hill (Microsoft), Anthony Colas (University of Florida), H. Andrew Schwartz (State University of New York at Stony Brook) and Gordon Burtch (University of Minnesota) explained their work on the interactions between TV content and online behaviours such as response to digital advertising. They developed AdMiner, a tool that can track online activity around a brand and provide actionable insights into ad campaigns.

 

Austin Harris, Jose Stovall, and Mina Sartipi (researchers and CUIP director at the University of Tennessee at Chattanooga) have helped to create Chattanooga’s smart corridor, used to test new technologies and generate data-driven outcomes. In their talk, they presented the corridor, used as a test bed for research in smart city developments in a real-world environment. The wireless communication infrastructure and network of sensors in combination with data analytics provide a means of monitoring and controlling city resources and infrastructure in real time.

 

Sebastian Trinks and Carsten Felde (TU Bergakademie Freiberg) presented how image mining can help avoiding errors and low quality of printed prototypes in real time. This can result in saving resources and increasing efficiency when developing new products.

 

This year, IEEE Big Data held the Real-time Machine Learning Competition on Data Streams. As the competition is focused on streaming, its online platform required a specific infrastructure that meets data stream mining requirements. Dihia Boulegane is a Ph.D. student at Télécom ParisTech working in collaboration with Orange Labs on machine learning for IoT networks monitoring. She was in charge of implementing the streaming engine of the dedicated platform of the competition. Dihia explained its components, the technologies used, and the challenges met to build the platform. At the end, the platform was able to provide multiple streams to multiple users, to receive multiple streams, to process them and to provide the leader board and live results.

 

Special thanks to our keynote guests, Matteo Merli and John Roesler, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

 

Flink Forward: The Key Takeaways

Early October 2019, 6 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.

In this article, they will give you their opinion about Ververica’s’ main announcement, the impact of Ververica acquisition by Alibaba, the big trends, and a selection of their favourite talks.

 

Alibaba!

This is the first Flink Forward conference since the acquisition of Ververica (formerly known as data Artisans) by Alibaba, which has been one of the largest users of Flink and second-largest contributor for years. Our R&D director Sabri Skhiri says: “The only significant impact of this acquisition on the conference is that the venue is now at the Berlin Business Center instead of the Kulturbrauerei. There, we could see that the Apache Flink user’s community has grown significantly as well as their commits on Flink. This edition was a bit more business and enterprise-oriented than previous ones, although it still had its technical DNA and a lot of technical talks. All in all, this was a very good mix. Alibaba folks are deeply committed to open source and creating technology impact. We saw a lot of activities from them such as the integration of the Blink SQL runner, the hive integration or the new scheduling model. In summary, a great event.”

 

First Keynote Announcement

Keynote: Stream Processing and Applications in the Modern Age (Stephan Ewen)

During the first keynote, Ververica took the opportunity to announce the launch of Stateful Functions (statefun.io), an open-source framework built on top of Flink to run stateful serverless functions. It bridges the gap between Function as a Service and stream processing.

Sabri says: ”Last year, they announced their streaming ledger that brings ACID transactions between states to stream processing applications. This year, they announced the launch of Stateful Functions, a framework that reduces the complexity of building and orchestrating stateful applications at scale. In the streaming world, this announcement does not change a lot of things. However, in the microservice community, this opens new doors in terms of design patterns, especially in the way data feeding and stateful operations can be designed more flexibly.”

You can find the video of the presentation here.

 

The Big Trends

1. Unified batch and streaming

A significant trend of this edition is the “Unified Stream and Batch” moto. Our R&D engineer Syrine Ferjaoui says: “Flink currently features different APIs, the DataSet API for batch processing and the DataStream API for stream processing. In addition, the Table API is already a unified API on top of both (DataSet and DataStream) with declarative-style programming. Now, they are working on a solution to unify truly the batch and streaming APIs.”

Sabri adds: “In Flink 1.9, they released the State API with which a state created in batch can be used in a stream application – interesting for bootstrapping/backfilling states. But the community is going further by proposing in Flink 2.0 a unique Data API that will merge DataSet and DataStream while still taking advantage of the batch properties to optimise the execution.”

Every talk was exploring in a way or another how this unification can be pushed forward. For instance, in the Pulsar talk, they were thinking about using Pulsar as a back end to transparently bootstrap a state and then switch on stream using (1) pulsar capability in terms of segment storage and (2) unified data stream API in Flink.”

 

2.”Enterprise-grade” Flink:

Flink is moving clearly toward an “enterprise-grade” technology. Sabri says: “The first signal is that Cloudera adopted Apache Flink into its Data Platform. Also, AWS Kinetics now integrates Flink as a client. Adoption by such big players goes to show that Flink is well on the way to gain enterprise-grade support. The second signal is the release of the Ververica Platform that highly facilitates enterprise-grade operations. Thirdly, the integration of the Hive Metastore with the pluggable catalogue architecture is a significant step towards better governance and metadata management. Finally, there were many talks about lowering the barrier to deploying Flink in prod. The topics included APIs, configuration, memory management, K8S operators, etc.”

 

3.The ML path

Finally, regarding ML/AI, there is still a lot of work to get over the gap with the Spark ecosystem. However, the Alibaba folks are working hard on this topic and we can already see the first results. Sabri says: “The refactoring of the Flink ML interface to work on Flink Table APIs is excellent. There is an excellent vision of integrating Flink as a data prep engine for ML and serving layer; and the roadmap looks great.”

 

Interesting talks

A personal selection by Charles & Christophe of interesting talks to check out :

For Charles, our data architect:

  • Aljoscha Krettek & Timo Walther, respectively a co-founder at Ververica and a PMC member of Apache Flink work on the Flink APIs. They give a summary of recent contributions to Flink’s Table & SQL APIs. It was a very good overview of what is going on in terms of refactoring and where we are going.
  • Roman Grebennikov is a software developer from Findify AB. His talk focused on Flink serialisation framework and common problems happening around it. He illustrated and explained several ways to optimise Flink jobs by taking care of the serialisation, which in most cases represents about 60% of the processing.

For Christophe, our software engineer:

  • Konstantin Klauf is the head of product for the Ververica Platform based on Apache Flink. He discussed Apache Flink worst practices by sharing anecdotes and hard-learned lessons of adopting distributed stream processing. It was a humorous list of general good practices when working with Flink from planning, requirement, deployment, and maintenance.
  • Aaron Levin and Mike Mintz are software engineers in a Stripe’s streaming team. They talked about the many challenges they encountered writing the specialised dual source. This talk was a very well-told story about a simple use case with a high constraint: all-time deduplication of transactions at Stripe (a payment platform‎). It was funny, insightful, full of lessons learned and echoed some of digazu’s features: the history replayer.

 

DEBS 2019: A Summary

Last month, our R&D engineer Anas Albassit and director Sabri Skhiri travelled to Germany to attend and present at DEBS 2019, one of the most specialised conferences in Distributed Event-Based Systems. DEBS has a long history: from active databases to streaming engines, distributed publish-subscribe systems, etc. it has always been the pioneer of distributed and high-performance systems. In this article, Anas and Sabri share with us what they learned there and what struck them as particularly useful.

 

Big Trends

 

This edition focused around streaming language, scheduling, elasticity, distributed event processing, platform and middleware. Our R&D director Sabri Skhiri says: “For someone working in distributed computing and data management, DEBS is one of the major conferences with SIGMOD and VLDB. Even though it is quite small (80 participants vs 1000 for IEEE Big Data), this is a niche conference of experts from a small yet amazingly talented community of researchers. The keynotes were just great, with a good balance between pure research and industry. This conference tackling distributed computing and streaming is heaven for data scientists and architects like us!”

 

 

Keynotes

 

Open Problems in Stream Processing: A Call To Action (Tyler Akidau)

 

Tyler Akidau is the technical lead for the Data Processing Languages & Systems group at Google. He argues that, even though stream processing has gone from niche to mainstream, this is just the beginning. For him, the need for active exploration of new ideas is all the more pressing. Sabri reacts: “The stream has been there for 30 years. We have Spark, Flink, Dataflow, KStream, MSF Trill. But is it all we can do? Is there nothing to do anymore? Tyler Akidau brilliantly presented that stream processing as a field of research is alive and well.”

The talk was mainly about raising opened or partially opened questions in the streaming world.

  • Firstly, the evolution of a pipeline over time may require changes in the persisted state. In that case, how to gracefully update everything online without stopping the working pipeline? How can we auto tune or auto build optimised systems? Do we need to rethink the way we build such systems now?
  • Secondly, Tyler Akidau moved on the importance of SQL in streams and the missing parts to fill there. One of the main topics was the mathematical formalisation of streaming operators, in addition to providing richer standards and clarifying the ambiguities coming from the nature of the streams like out of order events and latency.
  • The next point focused on the tradeoff triangle between latency, costs and correctness. How to figure out what do we need? How to describe correctly every term from a system-behaviour point of view? Is it possible to prioritise different factors depending on the urgency of some task (e.g. streaming v.s batch) automatically?
  • Last but not least, how can we improve stream processing? What kind of database optimisations can be adopted here, and can we think of more optimisations that can only apply on streams?

 

Tyler Akidau concluded by pointing out that even though streaming systems are more capable and robust than ever, they often remain difficult to use, difficult to maintain, and difficult to understand.

[EDIT] Thank you Tyler for reaching out and for sharing your slide with us! They are available on the following link. If you would like to discuss more insights from the talk, do not hesitate to contact our researchers at research@euranova.eu.

 

 

Interesting Papers

 

STRETCH: Scalable and Elastic Deterministic Streaming Analysis with Virtual Shared-Nothing Parallelism (Hannaneh Najdataei)

 

Hannaneh Najdataei, Researcher and PhD Student at the Chalmers University of Technology in Sweden, presented her framework STRETCH.

Anas explains: “The performance of a streaming engine depends on the throughput and latency of stateful analysis. To achieve the best performance, we need to process a large amount of data (i.e. to be scalable) while handling fluctuations in data rate (i.e. to be elastic). Distributed processing requires the ability to parallelise the processing elastically. Optimally, we should reduce the number of parallel operators when the workload decreases and add operators when more resources are needed. For stateful operators, elasticity reconfigurations require to redistribute the states according to the new cluster configuration (i.e. less or more operators). In this case, we need to find a tradeoff between a share-nothing and a share-all state architecture.”

Sabri adds: “The paper proposes STRECH, a virtual share-nothing parallelism concept that does not require state transfer. The idea is that all workers read the same sequence of input tuples through an intra-node streaming framework. What is surprising in this paper is the parallelism model: all workers get the same sequence of tuples to guarantee the deterministic execution of the stream. On the contrary, in streaming, you usually have a distribution of your tuples per key. Still, they have obtained impressive results matching the throughput and latency figures of the front of state-of-the-art solutions, while also achieving fast elastic reconfigurations.”

To know more about the STRETCH framework, you can find the slides of the presentation here and the paper here.

 

Uncertainty-Aware Event Analytics over Distributed Settings (Nikos Giatrakos)

 

Nikos Giatrakos is a PhD researcher from the Technical University of Crete. He presented his work to do uncertainty-aware event analytics. Sabri reacts: “Getting high performance by sampling the input stream and sacrificing a bit of the result precision is the new trend in research. The idea is to parse only some of the events to be able to handle a bigger load, but still controlling the level of uncertainty you have on the result. I see 2 great applications: (1) get approximated results when needed but also (2) proactive detection before events happen.”

While the idea of filtering by controlling the probability of the error is not new, the paper had several novel points:

  • The decomposition of the error into filters from a pattern matching query
  • The conditional probability assertion for a wide range of aggregation functions
  • A central coordinator for calculating the global PDF and then detecting the pattern

To know more about his paper, you can find the slides of the presentation here and the paper here.

 

LEAD: A Formal Specification For Event Processing (Anas Al Bassit)

 

On the fourth day of the conference, our R&D engineer Anas presented his paper proposing a formal specification for CEP language.

Processing event streams is an increasingly important area for modern businesses aiming to detect and efficiently react to critical situations in near real-time. Due to CEP languages’ limitations and imprecise semantics, describing interesting situations remains challenging. In this paper, Anas presents a formal specification for processing complex events. The paper provides an algebra that consists of a set of operators for constructing complex events (patterns), temporally restricting the construction process and choosing among several selection and consumption policies.

To know more about his paper, you can find the slides of the presentation here and the paper here.

 

 

Tutorials

 

Correctness and Consistency of Event-based Systems  (Opher Etzion)

 

The second day of the conference was dedicated to tutorials from experts in the field. Anas gives insights into his favourite training: Correctness & Consistency of Event-Based Systems. He explains: “The speaker was Opher Etzion, one of the pioneers in the domain of event processing. The tutorial lasted for about 4 hours. What is interesting is that the speaker demonstrates with examples that building an event-based system is not trivial. Even more, a lot of existing systems are incorrect and give inconsistent results due to some problems in their semantics. To ensure correctness, you have at least to understand the sources of latencies in your system and ensure fairness between all the agents, in addition to defining a set of policies to tell the system when, how, where and what events you are looking for.”

LEAD: A Formal Specification For Event Processing

Processing event streams is an increasingly important area for modern businesses aiming to detect and efficiently react to critical situations in near real-time. The need to govern the behaviour of systems where such streams exist has led to the development of numerous Complex Event Processing (CEP) engines, capable of detecting patterns and analyzing event streams. Although current CEP systems provide real-time analysis foundations for a variety of applications, several challenges arise due to languages’ limitations and imprecise semantics, as well as the lack of power to handle big data requirements. In this paper, we discuss such systems, analyzing some of the most sensitive issues in this domain. Further, in this context, we present our contributions expressed in LEAD, a formal specification for processing complex events. LEAD provides an algebra that consists of a set of operators for constructing complex events (patterns), temporally restricting the construction process and choosing among several selection and consumption policies. We show how to build LEAD rules to demonstrate the expressive power of our approach. Furthermore, we introduce a novel approach of interpreting these rules into a logical execution plan, built with temporal prioritized coloured petri nets.

Anas Al Bassit, Skhiri Sabri, LEAD: A Formal Specification For Event Processing, in 13Th ACM international Conference on distributed and event-based systems 2019

Click here to access the paper.

DataWorks Summit: the big trends

Last month, four EURA NOVA engineers travelled to Barcelona to attend the Dataworks Summit. The conference is organised by Hortonworks, now known as Cloudera and it is about how to apply open source Big Data technology to accelerate digital transformation initiatives. They came back with a lot to say about the hot topics in AI, machine learning, architecture, the cloud, and the use cases! In this article, they share with us what they learned there and what struck them as particularly useful.

 

Big Trends

 

Data architecture

This year, one of the most important trends at the conference was data management and data architecture. Our R&D director Sabri Skhiri says: “There was a real focus on taking data lakes to their next stage and on making them actionable for AI and machine learning. The notion of data hubs was often mentioned, notably during the keynote speeches by Cloudera, IBM, and Pure Storage. However, most of the vendors of platforms have not been able yet to provide a fully-fledged ecosystem that allows the exploration, governance, and industrialisation of big data”.

AI industrialisation

This brings us to the second motto of the conference: AI industrialisation is a must. Our data engineer Khalil Amdouni explains: “The conference has been migrating towards AI topics. In the past, the conference used to focus mostly on data ingestion and data processing. It has been moving towards data science. Everyone is talking about AI and machine learning and how to put data science models into production. It’s looking into how to move from data exploration to industrialisation; we heard a lot about Cloudera’s Data Science Workbench etc.”

Production environment

The third trend of the conference was the separation between data processing tools and AI frameworks. Khalil explains: “Spark, Cloudera, Kubernetes are now all providing production environments (data science management platforms such as Cloudera Data Science Workbench, the Databricks Runtime ML, Kubeflow…) to integrate with machine learning frameworks such as Tensorflow or Python. Sabri adds: “This is interesting but we should first speak about “productisation”, data science models lifecycles, continuous integration and delivery. There are still a lot of shortcomings, like the fact that you need to centralise all your data in one partition before starting your favourite AI framework”.

Data governance

Another hot topic of the conference was data governance and compliance with regulations. Our R&D director goes on to say: “Everybody is speaking about the importance to be GDPR compliant and is proposing tools like Atlas, Egeria, IBM Infosphere, … but no one says how to actually comply with the GDPR during model deployments or how to deal with access policy management.”

 

 

Favourite Talks

 

Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka

Itai Yaffe presented the journey made by Nielsen’s Marketing Cloud division to provide its customers with real-time analytics tools to profile their target audiences. To achieve its goal, NMC needed to continuously transform its data infrastructure to ingest billions of events per day in a scalable and yet cost-efficient manner.

Sabri says: “The first version of NMC’s architecture includes CSV files and standalone Java applications with an OLAP database to expose the result. To reach their goal, NMC’s teams had to scale the process up to handle 10 times as much data”.

Their first step was to change the architecture: they moved to Kafka to ingest data, they leveraged Spark to stream and to aggregate data, and they used HDFS to store data.

Sabri explains: “The issue here was that they had to manage the statefulness of the Spark applications on HDFS by themselves. In addition, the system was error-prone in case of failure. They tried again and looked into Spark Structured Streaming, then tried to combine Spark Streaming with batch ETLs and finally decided to use Kafka to imitate streaming over their data lake. This evolution made the situation really interesting from a business and architectural point of view. Their business goal is to support decision making with machine learning to deliver reports on campaigns. Over the years, they adapted their architecture to go further and reach that objective”.

Our architect Cyrille Duverne adds: “Their story showed how much effort is required to build a long-term architecture. Tools are not enough; you first need the use cases that lead to an architectural vision. Only then can you choose the tools that will support the vision.  To build this architecture, you need time and people with the right skills”.

To know more about NMC’s journey, you can find the slides of the presentation here.

 

Federated Learning

Chris Wallace is a data scientist at Cloudera Fast Forward Labs. He presented how his team leveraged federated learning to predict maintenance problems when customers of a manufacturer are not willing to share with the manufacturer the details of how their components failed, but want the manufacturer to provide them with a strategy to maintain the faulty parts.

Our architect Cyrille Duverne explains: “In this case, federated learning is a kind of distributed deep learning where you train the model on decentralised data. The main idea is that a network of nodes shares models rather than training data with the server. Each node has the untrained model that they will train using the data they have. Each node then sends a copy of its trained model back to the central server that will take the average and send the new model to the different nodes. The process is repeated until the final version of the model is reached.”

Our data scientist Malian De Ron explains: “I find federated learning very interesting. As data scientists, we can work directly on updating models, but we don’t have access to all the training data. Federated learning can be useful for use cases where the customers want to keep their data anonymous. For example, we work for a financial company that works with a bank. Neither of them is willing to share their data. By using federated learning, the training data could remain in its original location, which could satisfy our customer’s privacy concerns.

To know more about federated learning, you can find the slides of the presentation here.

 

Data governance with Egeria: The industry’s first open metadata standard

John Mertic is the director of program management for ODPi, the Linux Foundation’s Open Data Platform initiative. He talked about their new open metadata standard Egeria, introduced in September. John Mertic explained how the standard supports the free flow of standardised metadata between different technologies and vendor platforms, enabling organisations to locate, manage, and use their data resources more effectively.

Sabri says: ”Companies have 40 years of evolution embedded in their IT systems, resulting in high complexity of data lineage and data silos. In the complex new world of big data and real time, security models have to track data throughout the organisation. This is why data governance and metadata management are hot topics in conferences. Everybody is talking about it and proposes tools such as Egeria, IBM InfoSphere, or Atlas. I talked with IBM InfoSphere people and I had an overview of the Egeria tool. It can be used to federate the IBM InfoSphere Information Governance Catalog, Apache Atlas and even other Egeria cohorts. The IBM Governance Catalog can pull information directly from Egeria and integrate the metadata, the lineage, and even tags from Atlas”.

To know more about Egeria, please find the slides of the presentation here.

 

 

Final Thoughts

 

When working with clients as they make their journey to the new digital world, we noticed recurrent problems in the areas of data access, usage, and governance. In many conferences, we hear stories of companies facing these challenges and making a lot of ad hoc choices but lacking a long-term architectural vision. To crack the challenges, our R&D director Sabri Skhiri designed the Data Architecture Vision (DAV), which later led to digazu.

The Dataworks conference highlighted the need to take data lakes to their next stage. The digazu platform, with its integrated and managed data lake, meets that need. It is a true data hub that integrates real-time and batch dataflows, that collects data from multiple sources, stores it, and distributes it to applications and users across the whole organisation.

Another need mentioned at the conference was that of providing companies with production environments to deploy models. Leveraging ever-increasing amounts of data to provide new services or solve problems requires increasing resources in terms of expertise, time and money. digazu offers a scalable way to keep data pipelines open for business in real time or batches without an army of data experts, lines of code, or complex training.

A third need highlighted at the conference is for companies to reach good data governance. There are already excellent governance tools such as Atlas, Egeria, IBM Infosphere to support the free flow of standardised metadata. digazu opens the door to automated regulatory compliance by providing ready-to-use connectors to data management and governance tools.

To learn more about digazu, visit digazu.com

 

Third Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last month, EURA NOVA research centre organised the third workshop on real-time and stream analytics in big data, collocated with the 2018 IEEE conference on big data in Seattle. The workshop brought together the leading actors in the field including data Artisans, the University of Virginia and Télécom Paris Tech as well as 9 well-known speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in data architecture, stream mining, complex event processing and IoT.

The workshop was a real success, with captivating talks and a lot of interesting questions and comments. If you could not attend the event, our R&D engineer Syrine Ferjaoui has brought back for you the important elements from the keynotes and the presented papers.

 

First keynote speaker:

First of all, the workshop started with the keynote of Fabian Hueske, PMC member at Apache Flink & co-founder of data Artisans. His talk “Unified Processing of Static and Streaming Data with SQL on Apache Flink” presented Flink’s features and its relational unified APIs for batch and streaming data. Fabian Hueske insisted on the importance of unifying stream and batch for 2 major points: the usability and the portability. Flink includes a set of features such as materialised views to speed-up the analytical queries, dynamic tables, updates propagation and processing, continuous queries, approaches to handle time in stream processing, watermarks and queries on infinite sized tables. With all these features, Flink helps its users to build data pipelines with low-latency ETL, stream & batch analytics and to power live dashboards.

Our research director Sabri Skhiri adds: “Apache flink is currently working on a set of connectors. They have already the HDFS sink, the JDBC sink and since they are pushing Flink as the standard technology for data pipelines and materialised views, they want to expand their connectors set.”

 

Second keynote speaker:

Secondly, our research director Sabri Skhiri talked about data management, and stream and real-time analytics. His talk “The challenge of Data Management in the Big Data Era & its underlying Enterprise architecture shift” started with defining data architecture as a global plan depicting how to collect, store, use and manage data to answer the 8 main challenging questions that are essential to building a solid and efficient solution. During his talk, our director considered deriving microservices from data streams as the new wave of architecture and he discussed the Data Architecture Vision (DAV) set throughout 10 years of research and development at EURA NOVA. The DAV later led to the development of digazu, a data engineering platform containing all the different components needed to collect, store, govern, transform, and analyse all the data in the company’s IT environment.

 

Workshop Invited Speakers:

After the keynotes, 9 selected papers have been presented, covering mainly these 4 topics: (1) Data Streaming Architecture, (2) CEP/CER, (3) Stream Mining & (4) IoT Device Integration:

Isah and Zulkernine (Queen’s University, Kingston, Canada) propose a scalable and fault-tolerant data stream ingestion and integration framework that can serve as a reusable component across many feeds of structured and unstructured input data in a given platform. Our R&D engineer Syrine Ferjaoui explains: “The ingestion layer (that integrates Apache NiFi and Kafka) is used to decouple streaming analytics layers (acquire, buffer, pre-process, distribute data streams).  This NiFi-Kafka “NiFKaf” integration takes advantage of the high configuration of NiFi and the addition of several data of consumers provided by Kafka.This way, it supports many data sources, languages and content formats, ensures high throughput and low latency, supports large numbers of data consumers, enables data buffering during temporary spikes in workload and employs a replay mechanism, and is scalable”.

 

The paper by Trinks & Felden (TU Bergakademie Freiberg, Germany) presents Edge Computing which is an extended approach to cloud computing. It describes an architecture scheme that consists of 3 layers: node layer (gadgets, smartphones, embedded systems, sensors), edge layer (routers, switches, small/macro base station) and cloud layer (datacenters, servers, databases, storages). Edge Computing is used to minimise energy consumption, bandwidth, latency and increase safety and privacy level and employs real-time analytics within its architecture.

 

Link prediction refers to the likelihood of a link appearing in the future based on the current status of a graph. The previous works for link prediction such as sketch-based approaches and dynamic attributed networks do not give exact results and cannot handle deletion or modification in the graph nor the large volume of data. The goal of the authors (University of Louisiana, USA) is to design a graph-processing approach for link prediction that ensures real-time prediction and extraction of accurate features from the graph with exact results. Syrine details: “Graph processing can be edge-centric, vertex-centric or neighbourhood centric. This paper proposed two new graph processing frameworks for handling each graph streams: vertex-centric processing & neighbourhood-centric processing. These frameworks are able to predict 100% of the links with an average graph ingestion time between [149.3 – 242.7] ms”.

 

Researchers from the University of New Mexico have developed a robust distributed matching system, called DisPatch. In a scenario where multiple data sources or producers publish data to the Kafka system, DisPatch is the data consumer that matches a pattern with a guaranteed maximum delay after the pattern appears in the stream. Syrine reacts: “Given a time series T of length n, and a query Q of length m, it normally takes O(nm) to calculate the Euclidean distance/correlation between Q and all subsequences of T, but this method calculates the results in O(log(n)) by exploiting the overlaps. As a result, DisPatch guarantees exactness and bounded delay at the same time”.

 

In this paper, the authors (Adobe Research, California, USA) discuss Adobe’s  Identity Graph that provides a comprehensive solution to the challenge posed by fragmentation of identities. Our R&D engineer details: “Identity graph helps in connecting data across channels, domains and devices to solve a fundamental problem in the Digital Marketing domain. The fragmented profiles of a consumer are linked together in order to provide a unified view across devices. This means that an identity graph connects all the known identifiers that correlate with the individual consumer. The researchers built identity relationships by using both online data traffic and offline CRM data logs from customer’s backend systems. To do that, they are using two approaches: deterministic linking and probabilistic linking. They combined them using deterministic as a base and expanding using probabilistic clusters”.

 

The authors (Purdue University, USA) propose a novel fitting algorithm for big data logistic regression by combining Fisher Scoring and IRWLS. Syrine details: “The revised IRWLS algorithms can break the memory barrier and is suitable for streamed computing. It is per row updatable and does not need to load the whole dataset into the memory. This algorithm has a fast convergence speed (usually around 3). The limitation of this method is the structured data with large n (rows) and small p (columns)”.

 

Dynamic Time Warping (DTW) is able to match natural time series with similar shapes, but a different length of patterns. The authors (Linnaeus University, Sweden) described enhancements to the DTW algorithm that allow it to be used efficiently in a streaming scenario. Syrine explains: “Their solution is composed of 3 parts: (1) a very fast implementation of the DTW (2) an append operation for the DTW which works in linear or constant time and (3) an approximation of a sliding window that allows DTW to forget old time steps, improving the processing of “never-ending” streams. In short, DTW encapsulates all data behaviour information in a single value and enables the use of a tiny fraction of data compared to the original sensed data while still obtaining highly accurate results”.

 

There is a rapid emergence of new applications involving mobile wireless sensor networks (MWSN) in the field of Internet of Things (IoT). Although useful, MWSN still carry the restrictions of having limited memory, energy, and computational capacity. At the same time, the amount of data collected in the IoT is exponentially increasing.The authors (Florida International University, USA) propose a Behavior-Based Trend Prediction (BBTP), which is a data abstraction and trend prediction technique, designed to adress the limited memory constraint in addition to providing future trend predictions. Predictions made by BBTP can be employed by real-time decision-making aplications and data monitoring.

 

Lightweight Temporal Compression (LTC) is among the lossy stream compression methods that provide the highest compression rate for the lowest CPU and memory consumption. As such, it is well suited to compress data streams in energy-constrained systems such as connected objects. In this paper, Li, Sarbishei, Nourani and Glatard (Concordia University &  Motsai Research, Canada) investigate the extension of LTC to higher dimensions. Syrine adds: “They described how multi-dimensional LTC compression saves substantial amounts of energy (up to 20%) and is feasible on connected objects. The implementation with Euclidean norm is more intuitive than infinity norm for nD sensors, as well as more CPU & memory intensive and leads to lower compression ratios”.

 

Special thanks to our keynote speaker Fabian Hueske, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

Spark+AI Summit: a summary

A few weeks ago, Sabri Skhiri and Florian Demesmaeker were in London to attend the Spark+AI summit. They came back with a lot to say about the new features of Spark and the presented use cases! In this article, they will give you their opinion about Databricks’ main announcement, the intakes of their favourite talks and training, and what they thought of the new name of the conference.

 

A new name

This year, Spark expanded the summit’s scope and renamed it “Spark + AI Summit”. The goal of Databricks, announced by its co-founder Ali Ghodsi, is to incorporate unified aspects of data and AI.

Florian Demesmaeker, our R&D engineer, explains: “In some of the keynote talks, the speakers talked about use cases where the job of the data engineer is strongly reduced. The data scientists can easily experiment with data, travelling back and forth in time. This means more focus on AI, rather than on the data engineering part that makes all data accessible to the data scientists”.

 

Main announcement

In line with this change of name, Databricks announced the release of a complete data science lifecycle on the cloud.

Sabri Skhiri, our R&D Director, explains “It is interesting to see that the change in the event name is actually very visible in the change of Databricks’ strategy. Their tools are now completely dedicated to stream ETL, and there is a huge focus on integrated data management”.

Databricks’ new features include Databricks Delta which creates data pipeline and provides data views and exploration features. Secondly, the Databricks Runtime ML is a ready-to-use environment providing a set of pre-loaded ML frameworks where the data scientist can play with data. Finally, the MLflow tool allows to simplify the ML models development at enterprise scale.

Our R&D Director precises: “Together, these features provide a complete and unified approach to machine learning lifecycle and pipeline automation. This looks like a very competitive SaaS offer for integrated data management, available on AWS and Azure. However, the metadata management and the security aspect is still the missing piece”.

 

The training day

The first day of the conference was dedicated to training workshops that include a mix of instruction and hands-on exercises to help attendants improve their Apache Spark skills.

Florian gives insights into his favourite training Tuning and Best Practices. He explains: “The aim of the training was to make programmers aware of how Spark works internally, in order to be able to write optimised applications. They presented a few situations, each one showing one relatively slow process. Then they presented a step-by-step procedure to debug the situation and to find the points that could be improved in the current situation. In summary, tips and tricks to adapt to different situations”.

 

Favourite talks

The sessions at the conference covered data engineering and data science contents along with best practices for productionising AI. The talks were divided into roughly two categories: Spark programming and deployment, and applications on top of Spark (AI applications).

Florian Demesmaeker explains: “I attended 28 talks. The keynotes from Databricks were quite interesting, they presented Delta and MLflow. I also enjoyed the talks about tools to optimise the internals of Spark, these provided good technical details. Other talks were about use cases on top of Spark, it was interesting to see what challenges other companies face and how they address them”.

Sabri Skhiri adds: “The talk Learning to Rank Datasets for Search was very inspiring. Oscar Castañeda-Villagrán, a data scientist working at Xoom (a Paypal service) talked about learning to rank R data set. The idea is that we can extract metadata when the data pipeline is arriving in the lake. Going further, you can not only extract metadata but also calculate a kind of judgment relevance score that will be used for bootstrapping the learning to rank process. In this way, a user can search and retrieve the relevant R data set in the lake. A very good idea for the metadata-driven exploration”.

 

 

Early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities. You can read their feedback here.

Flink Forward 2018: What You Want to Know and What You (Will) Need to Know.

Early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.

They came back with a lot to say about the hot topics in stream processing and the presented use cases! In this article, they will give you their opinion about data Artisans’ main announcement, the intakes of their favourite talks, and what they thought makes Flink Forward different from other conferences.

 

First keynote announcement:

During the keynote speech, data Artisans announced that they now bring ACID transactions directly on streaming data with data Artisans Streaming Ledger.

Charles Bonneau, our software architect, says: “This feature allows ACID transactions between multiple operators’ event-processing operations and internal states. This means that streaming applications can now update multiple states in one transaction. For example, an application that transfers money from one bank account to another can finally be implemented using Flink with strong consistency guarantees. Both bank accounts will have their balance updated at the same time as if there was a master data-management state”.

For Sabri Skhiri, our R&D director, this opens the doors to a brand new range of applications, especially in data-driven real-time services but also in streaming data management. He explains: “They are pushing forward the concept of streaming. Now, you could imagine a master data-management state that can be updated by operational streaming applications in real time. This will allow even more complex and advanced use cases of stream processing!”.

 

Favourite talks:

In 2 days, each Euranovian attended about 18 talks and use case presentations, with speakers from tech giants such as IBM, Netflix, Alibaba, and Uber as well as speakers from smaller companies.

Charles explains: “The conclusions are reassuring: most of them face the same issues that we see at our clients’ and our solutions are all valuable. They include a stream-first data architecture, a stream-first data pipeline product, and Flink developers skills. Even though a number of companies are at the very edge of the technology and their issues do not yet require continuous flows of a considerable amount of events, we are ready”.

For our R&D Director Sabri Skhiri, the keynote speech from Lightbend was one of the most interesting ones. He explains: “Viktor Klang, Lightbend deputy CTO, talked about the convergence between microservices and stream processing.  At EURA NOVA, we have been advocating for this convergence for more than a year in our architecture practice. The idea is simple: asynchronous microservices can be designed as stream processing stages. This is fantastic because it makes modern stateful stream processing frameworks the perfect target for implementing reactive microservices. With stateful deployment, exactly once semantics, high availability and ACID access to states, microservices can become stateful streaming apps.”

 

Vision-oriented Flink Conference:

Our colleagues came back with sparkles in their eyes. When we asked them how they felt about the event, Sabri Skhiri explained:

“Very often, this type of conferences tend to be business oriented. They are focused on how to make the framework easy to use and available to as many people as possible. By contrast, this year’s Flink Forward conference was all about innovation and vision. data Artisans shared their vision of what the Flink framework will be within 3 to 5 years and talked about what role stream processing and big data have within this vision.  In fact, almost all the talks were very technical. They were testimonies of big names in the industry, such as Alibaba, Netflix, and ING about problems encountered on the field and how they have been solved, which is often out of the box. The Flink-Alibaba partnership is a sharing one. Alibaba are way ahead with their technology. They keep their lead for 1 year and then they share their work and make their code open source. data Artisans have a great long-term vision of stream processing. I can see a lot of very interesting architecture discussions in the coming months!”

 

Stream Processing Technology:

When most frameworks cannot process considerable streams of live data and provide results in real time, Flink provides a single runtime for the streaming and batch processing while being highly scalable.

Cyrille Duverne, our Lead Data Architect, confirms: “Flink is definitely a real-time processor! We’re speaking about true real time, not only mini batches etc… Plus, the introduction of ACID transaction management in the new version of data Artisans’ Flink distribution creates a good marketing edge”.

Sabri Skirhi and our R&D engineer Florian Demesmaeker were at the Spark Summit this week. Stay tuned for part 2 with their feedback!