Kafka Summit: The Key Takeaways

At the beginning of the month, our software engineer Christophe Philemotte was in San Francisco to make a presentation at the Kafka Summit organised by Confluent. The Kafka Summit is one of the main events for data architects, engineers, DevOps, and developers who want to learn about streaming data. In this article, Christophe shares with you the latest trends from the conference.

Main observations

This year, one of the most important takeaways at the conference was that Confluent is working towards building an active database with KSQL.

Christophe details: “KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka. With KSQL, Confluent is embracing the SQL streaming and the integration of its stack into it. They also aim to have the interactivity we already have with a classic database. In short, they are moving towards this new paradigm of active data and passive query where KSQL would make it easy to read, write, and process streaming data in real-time, at scale, using SQL-like semantics. Still, KSQL shouldn’t be chosen over Flink, for instance, without proper consideration of its limitations. For example, real checkpointing and savepoint are missing, as well as global shuffling. There are still constraints on partitioning in some operators and there is no global windowing.”

While talking about SQL streaming, they also mentioned user-defined function or machine learning integration. Find more information on the summit website.

Another interesting point was the shared approaches and themes that were addressed by different companies. For example, 30% of the talks were about the operations. About 5 talks were dedicated to methods how to deploy on Kubernetes, and several other speakers mentioned that deploying on Kubernetes was their target. Real-time analytics, integrations/ETL/DataOps, and of course data pipelines were also often mentioned.

Keynotes talks:

During the first keynote talk, Jun Rao, the co-founder of Confluent, looked back at Apache Kafka’s last years and what brought them to where they are today. Christophe says: “One interesting point was the concept of democratising data. They envision Kafka as a one-stop self-service shop for devs, data scientists, etc. Still, the users have to overcome a lot of challenges such as operations, integrations, security, or cold storage. Challenges that we are solving with digazu.”

You can find the video of the presentation here.

Jay Kreps, the CEO of Confluent as well as one of the co-creators of Apache Kafka started his talk by discussing the sentence “Software is eating the world”, by Marc Andreessen. Christophe adds: “The idea is that software must be integrated into an ecosystem of other software. The users are no longer just humans. In some cases, the software will be used almost exclusively by other software.”

Jay Kreps also talked about the new steps for Apache Kafka. He announced that the next release of Kafka KSQL in November will enable users to directly register inputs and outputs thanks to Kafka Connect source and sink connectors. They are also working on better interactivity that will allow users to see the results more quickly in the KSQL CLI.

You can find the video of the presentation here.

Our Favourite Use Cases:

Kafka on Kubernetes: Keeping It Simple (Nikki Thean)

Nikki Thean is a staff engineer at Etsy, where she helps deploying Kafka at Etsy. She talked about Etsy’s Cloud Migration and how running Kafka on Kubernetes was the best option for them and was not half as complicated as they thought it had to be. Christophe explains: “At the DataWorks Summit in Barcelona, the message was that K8S resource management was not yet ready to replace YARN. We now see that K8S is the new YARN for many people who are using it to deploy their cluster. For example, Etsy or Confluent Cloud.”

In her talk, Nikki Thean explained how a Kafka-on-K8S setup works. Christophe explains: “The main lessons from her talk are:

We can start simply without an operator.
We must pay attention to the Kubernetes liveness and readiness probes. They can be used to make a service more robust and more resilient since K8S can restart them if necessary. However, if these probes are not configured carefully, they will kill the brokers unnecessarily.
Considering the price to deploy in multiple zones on Google Cloud Platform, a good solution is to deploy at least Zookeeper (the most critical element of the cluster) on multiple zones. Given the low flow of data, it will not be too expensive and Zookeeper will allow identifying which Kafka node has the data.”

You can find the video and the slides of the presentation here.

Mission-Critical, Real-Time Fault-Detection for NASA’s Deep Space Network using Apache Kafka (Rishi Verma)

Rishi Verma is a manager at the NASA Jet Propulsion Laboratory. He talked about the new software system being deployed by NASA to upgrade its Deep Space Network (DSN) that operates spacecraft communication links for NASA deep-space spacecraft missions. Christophe says: “It was a super interesting use case! The DSN Complex Event Processing (DCEP) software assembly is a new software system that brings into the DSN next-generation “Big Data” infrastructural tools to do IoT with their legacy assets. The objective is to correlate real-time network data with other critical data assets (in their example, an old radio antenna). They recover all the data on Kafka, then they process it and then they predict signal loss on the basis of weather conditions.”

You can find the video and the slides of the presentation here.

Cross the Streams Thanks to Kafka and Flink (Christophe Philemotte)

Christophe is the CTO of digazu, the batch and real-time data sharing platform developed by EURA NOVA. In his talk, he explained how you could build a similar data platform and how you could plug Flink into the Kafka ecosystem, as well as what the common pitfalls are and what Flink requires to be deployed on Kubernetes.

Christophe says: “The feedback was positive and I received a lot of questions during the Q&A session and after the talk, notably about Flink vs KSQL vs Spark. Another question that I received a lot is when to use Table, SQL or DataStream API. My answer was that Table and SQL APIs are two different flavours of the same API. The Table API you have a LINQ experience while with the SQL API you have a SQL experience. They are both perfect for data processing that can be expressed simply in SQL. That means in a lot of cases. The DataStream API is a lower-level API compared with the Table and SQL APIs. It gives more control on what you can do, which means it also requires a thorough understanding of Flink core mechanisms. Going for the DataStream API is usually a good choice either when your stream processing cannot be expressed in SQL and requires specific implementation, or when you need to optimise the processing.

The sandbox provided was also very popular.

You can find the video and the slides of the presentation here.

Our Favourite User Practice:

Please Upgrade Apache Kafka. Now. (Gwen Shapira)

Gwen Shapira is a software engineer at Confluent working on core Apache Kafka. She reviewed all the recent releases and made suggestions on how to de-risk upgrades.

Christophe says: “Gwen Shapira talked about why it is essential to upgrade even though it is risky and time-consuming. She explained that each new release fixes from 30 to 140 bugs and listed the improvements you will get from upgrading”. Among them:

The Apache Kafka team is working on improvement to build a reliable replication. For example, watermarking has been improved greatly.
They are working on controller design towards the removal of Zookeeper.
Finally, some releases are critical for specific reasons (e.g. proper resolution of IP when you work with K8S, JBOD, or EOS).

In the second part of her talk, Gwen Shapira made suggestions to upgrade as safely as possible. Christophe explains: “She recommended to take good care of backup configuration and documentation. Regarding documentation, she recommended to read the list of notable changes, to act upon text in bold font, and once you have finished reading, to go over it all again!”

Christophe’s last word? “Be sure to check out slide 35: it lists the ways how not to upgrade!”

You can find the video and the slides of the presentation here.

Blog Event

4th Workshop on Real-time & Stream Analytics in Big Data

EURA NOVA Research centre is proud and excited to organize the fourth workshop on Real-time and Stream analytics in Big Data, collocated with the 2019 IEEE conference on Big Data. The workshop will take place in December in Los Angeles, USA.

Stream processing and real-time analytics in data science have become some of the most important topics of Big Data. To refine new opportunities and use cases required by the industry, we are bringing together experts passioned about the subject.

This year, we are excited to have two amazing keynotes from Confluent KStream and Apache Pulsar:

Matteo Merli is one of the co-founders of Streamlio, he serves as the PMC chair for Apache Pulsar and he’s a member of the Apache BookKeeper PMC. Previously, he spent several years at Yahoo building database replication systems and multi-tenant messaging platforms. Matteo was the co-creator and lead developer for the Pulsar project within Yahoo.
John Roesler is a software engineer at Confluent and a contributor to Apache Kafka, primarily to Kafka Streams. Before that, he spent eight years at Bazaarvoice, on a team designing and building a large-scale streaming database and a high-throughput declarative Stream Processing engine.

If you want to join us, authors from the industry and the academia are invited to contribute to the conference by submitting articles. Check out the workshop website to find all the information you will need. Your paper will be reviewed by a prestigious panel of international experts from both the academic and the industrial worlds.

Event

ACL 2019: Takeaways from the conference

Last month our R&D Project Director Cécile Pereira and our PhD student Léo Bouscarrat travelled to Florence to attend and present to ACL 2019. ACL is one of the biggest conferences in Natural Language Processing. This year all the records were broken with more than 3500 attendees, 660 accepted papers to the main conference, 9 tutorials and more than 20 workshops. All the talks of the main conference were recorded and are accessible online. In this article, Cécile and Léo share with you the latest trends from the conference!

Big trends

A new paradigm in NLP?

This year, ACL’s selection of topics has shown the importance that has taken self-training methods such as BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019). These methods consist of feeding huge models with a vast amount of data and then train them on easy tasks (for example, predict masked words in the original sentence or predict if two sentences are following each other).

These models should be able to learn faster and with less data on a more specific and complex task. With this method, the way to train a model to solve an NLP task has changed. Here is this new paradigm:

Select a pre-trained model (trained with self-training)
Add a layer on the output of this model (it will depend on your task) and fine-tune the model by giving the inputs and outputs of your task
Evaluate your model

Many papers were using this paradigm to achieve state of the art on several tasks (out of the 660 papers of the main conference, 47 have the word BERT in their abstract).

Contextual embeddings, like BERT, take into account the context of the sentence into the embeddings of the words. BERT can be used for a large variety of tasks including but not limited to classification (Reimers et al., Chalkidis et al.), named entity recognition (Arkhipov et al., Emelyanov and Artemova) and question answering (Li et al., Liu et al.).

So it is working. But the remaining question is why?

Several presentations discussed the explainability of BERT (for example Jawahar et al. and Clark et al.). Those papers discuss that, as the different layers are learning different things, the different heads seems to specialize in certain types of words or certain syntactic or semantic task.

The conference highlighted the need for adversarial training and testing as those models are very good to learn bias in the dataset (Niven and Yao, Jiang and Bansal). For those not familiar with the concept, adversarial training and testing consist to train and/or test on an adversarial dataset. This dataset is composed of examples, often generated ones, where the model fails to predict the correct answers. Adversarial training is generally used to verify if the models learn bias in the dataset (like the negation in Niven and Yao). It can also improve the quality of the models.

Improving the experiments in NLP

Several presentations showed that adversarial training can improve the results and robustness of the model (Zhu et al and Jiang and Bansal, Mohit Bansal slides available here).

The meeting was also a moment to discuss the impact of the use of standard splits on benchmark data. Standard split means that, if you want to work on a specific task, you will generally look for the training, validation and test splits used in other publications and use the same.

However, Gorman and Bedrick argue that the use of random split should be preferred. They explain it by trying to reproduce the results of nine part-of-speech taggers on a specific dataset. They reproduced the same rankings on the standard splits. However, when they did it on random splits, the ranks of the taggers, considering the same metric, varied.

This showed that getting a better ranking on a specific split doesn’t mean that you are better in general. Since in some fields of research, the improvements between each paper are small, the use of standard split does not guarantee that a model is really better than another one on the task. Random splits could improve this by adding a notion of variance on the performances.

Domain adaptation

The last trend in NLP consists of using models or embeddings learned on huge datasets of general data from sources such as Wikipedia, books or newspapers.

When you want to work on specialised domains such as the biomedical, legal or financial domains, you need specialised embeddings. However, you generally don’t have enough specialised data to re-train the embeddings or the models.

A solution is to use and modify pre-trained models for your specific task. This is called domain adaptation. There are several ways to do it. For example, Boukkouri et al combined a general embedding and a smaller one learned on their domain. Hu et al fine-tuned a general model on their data. These methods allow using recent models (which needs a lot of data) on some specific domains that do not fit those requirements.

Trendy topics

Machine translation

Machine translation is still a huge topic with no less than 46 papers in the main conference (according to the ACL 2019 chair blog post), an entire two-days workshop dedicated to it and Liang Huang invited talk. Liang Huang is a principal scientist of Baidu Silicon Valley AI Lab who talked about the current state of simultaneous translation and Baidu research’s new approach. They were able to do an English/Chinese translation with 3 seconds of delay only. The demo is available here: https://simultrans-demo.github.io/. One can also notice that the ACL best long paper award was on this topic (Zhang et al.)!

Conversational systems

Conversational systems (also called chatbots) were also a trendy topic, with 52 papers, a workshop, and the invited talk from Pascale Fung.

Pascale Fung is a Professor at the Hong Kong University of Science & Technology. She presented the state of the art of conversational systems. For her, recent advances are going in three directions: learning to memorise, learning to personalise and learning to empathise. She presented her current work on conversational systems that can empathise, showing that improvements have been made but there is still work to do. She ended with questions about the ethics of this sector: how can we build systems that are secure, safe and fair for all?

Knowledge graph

Knowledge graphs are also pretty trendy, they seem to be a good way to add knowledge to models. It can be used for Question-Answering or Conversational systems. The blog post of Michael Galkin makes a review of the most interesting articles in this sector.

Bias in NLP

After recent papers showed that models in NLP are biased (Bolukbasi et al., 2016 ; Caliskan et al, 2017) there is more and more work on what we can do about that, reflected by a session and a workshop during the meeting (https://genderbiasnlp.talp.cat/).

Several works about removing gender bias from models have been previously published. But the work of Gonen and Goldberg explains that, for now, it’s only “Lipstick on a pig”.

We observed two main areas on the topic. Firstly, removing/controlling gender bias in the models (like in automatic translation, Habash et al., Escudé Font et al., Ik Cho et al.). Secondly, measuring bias in the models and society (with articles proposed by sociologists, like Karve et al., Hitti et al., Basta et al., Kurita et al.).

Summarization

There were several papers about summarization (including our own paper https://arxiv.org/abs/1907.07323) which have been summarized by RecitalAI on their GitHub.

Conclusions

ACL was a great place to measure the trends in the NLP field. As models are becoming better, data scientists are applying them to a large variety of topics including automatic translation, search engine, and chatbots.

As the NLP community and topics are becoming bigger and bigger, we hope that this summary of our biased takeaways from the meeting could help you navigate the nearly 700 ACL papers of this year.

Blog Event

DEBS 2019: A Summary

Last month, our R&D engineer Anas Albassit and director Sabri Skhiri travelled to Germany to attend and present at DEBS 2019, one of the most specialised conferences in Distributed Event-Based Systems. DEBS has a long history: from active databases to streaming engines, distributed publish-subscribe systems, etc. it has always been the pioneer of distributed and high-performance systems. In this article, Anas and Sabri share with us what they learned there and what struck them as particularly useful.

Big Trends

This edition focused around streaming language, scheduling, elasticity, distributed event processing, platform and middleware. Our R&D director Sabri Skhiri says: “For someone working in distributed computing and data management, DEBS is one of the major conferences with SIGMOD and VLDB. Even though it is quite small (80 participants vs 1000 for IEEE Big Data), this is a niche conference of experts from a small yet amazingly talented community of researchers. The keynotes were just great, with a good balance between pure research and industry. This conference tackling distributed computing and streaming is heaven for data scientists and architects like us!”

Keynotes

Open Problems in Stream Processing: A Call To Action (Tyler Akidau)

Tyler Akidau is the technical lead for the Data Processing Languages & Systems group at Google. He argues that, even though stream processing has gone from niche to mainstream, this is just the beginning. For him, the need for active exploration of new ideas is all the more pressing. Sabri reacts: “The stream has been there for 30 years. We have Spark, Flink, Dataflow, KStream, MSF Trill. But is it all we can do? Is there nothing to do anymore? Tyler Akidau brilliantly presented that stream processing as a field of research is alive and well.”

The talk was mainly about raising opened or partially opened questions in the streaming world.

Firstly, the evolution of a pipeline over time may require changes in the persisted state. In that case, how to gracefully update everything online without stopping the working pipeline? How can we auto tune or auto build optimised systems? Do we need to rethink the way we build such systems now?
Secondly, Tyler Akidau moved on the importance of SQL in streams and the missing parts to fill there. One of the main topics was the mathematical formalisation of streaming operators, in addition to providing richer standards and clarifying the ambiguities coming from the nature of the streams like out of order events and latency.
The next point focused on the tradeoff triangle between latency, costs and correctness. How to figure out what do we need? How to describe correctly every term from a system-behaviour point of view? Is it possible to prioritise different factors depending on the urgency of some task (e.g. streaming v.s batch) automatically?
Last but not least, how can we improve stream processing? What kind of database optimisations can be adopted here, and can we think of more optimisations that can only apply on streams?

Tyler Akidau concluded by pointing out that even though streaming systems are more capable and robust than ever, they often remain difficult to use, difficult to maintain, and difficult to understand.

[EDIT] Thank you Tyler for reaching out and for sharing your slide with us! They are available on the following link. If you would like to discuss more insights from the talk, do not hesitate to contact our researchers at [email protected].

Interesting Papers

STRETCH: Scalable and Elastic Deterministic Streaming Analysis with Virtual Shared-Nothing Parallelism (Hannaneh Najdataei)

Hannaneh Najdataei, Researcher and PhD Student at the Chalmers University of Technology in Sweden, presented her framework STRETCH.

Anas explains: “The performance of a streaming engine depends on the throughput and latency of stateful analysis. To achieve the best performance, we need to process a large amount of data (i.e. to be scalable) while handling fluctuations in data rate (i.e. to be elastic). Distributed processing requires the ability to parallelise the processing elastically. Optimally, we should reduce the number of parallel operators when the workload decreases and add operators when more resources are needed. For stateful operators, elasticity reconfigurations require to redistribute the states according to the new cluster configuration (i.e. less or more operators). In this case, we need to find a tradeoff between a share-nothing and a share-all state architecture.”

Sabri adds: “The paper proposes STRECH, a virtual share-nothing parallelism concept that does not require state transfer. The idea is that all workers read the same sequence of input tuples through an intra-node streaming framework. What is surprising in this paper is the parallelism model: all workers get the same sequence of tuples to guarantee the deterministic execution of the stream. On the contrary, in streaming, you usually have a distribution of your tuples per key. Still, they have obtained impressive results matching the throughput and latency figures of the front of state-of-the-art solutions, while also achieving fast elastic reconfigurations.”

To know more about the STRETCH framework, you can find the slides of the presentation here and the paper here.

Uncertainty-Aware Event Analytics over Distributed Settings (Nikos Giatrakos)

Nikos Giatrakos is a PhD researcher from the Technical University of Crete. He presented his work to do uncertainty-aware event analytics. Sabri reacts: “Getting high performance by sampling the input stream and sacrificing a bit of the result precision is the new trend in research. The idea is to parse only some of the events to be able to handle a bigger load, but still controlling the level of uncertainty you have on the result. I see 2 great applications: (1) get approximated results when needed but also (2) proactive detection before events happen.”

While the idea of filtering by controlling the probability of the error is not new, the paper had several novel points:

The decomposition of the error into filters from a pattern matching query
The conditional probability assertion for a wide range of aggregation functions
A central coordinator for calculating the global PDF and then detecting the pattern

To know more about his paper, you can find the slides of the presentation here and the paper here.

LEAD: A Formal Specification For Event Processing (Anas Al Bassit)

On the fourth day of the conference, our R&D engineer Anas presented his paper proposing a formal specification for CEP language.

Processing event streams is an increasingly important area for modern businesses aiming to detect and efficiently react to critical situations in near real-time. Due to CEP languages’ limitations and imprecise semantics, describing interesting situations remains challenging. In this paper, Anas presents a formal specification for processing complex events. The paper provides an algebra that consists of a set of operators for constructing complex events (patterns), temporally restricting the construction process and choosing among several selection and consumption policies.

To know more about his paper, you can find the slides of the presentation here and the paper here.

Tutorials

Correctness and Consistency of Event-based Systems (Opher Etzion)

The second day of the conference was dedicated to tutorials from experts in the field. Anas gives insights into his favourite training: Correctness & Consistency of Event-Based Systems. He explains: “The speaker was Opher Etzion, one of the pioneers in the domain of event processing. The tutorial lasted for about 4 hours. What is interesting is that the speaker demonstrates with examples that building an event-based system is not trivial. Even more, a lot of existing systems are incorrect and give inconsistent results due to some problems in their semantics. To ensure correctness, you have at least to understand the sources of latencies in your system and ensure fairness between all the agents, in addition to defining a set of policies to tell the system when, how, where and what events you are looking for.”

Blog Event

DataWorks Summit: the big trends

Last month, four EURA NOVA engineers travelled to Barcelona to attend the Dataworks Summit. The conference is organised by Hortonworks, now known as Cloudera and it is about how to apply open source Big Data technology to accelerate digital transformation initiatives. They came back with a lot to say about the hot topics in AI, machine learning, architecture, the cloud, and the use cases! In this article, they share with us what they learned there and what struck them as particularly useful.

Big Trends

Data architecture

This year, one of the most important trends at the conference was data management and data architecture. Our R&D director Sabri Skhiri says: “There was a real focus on taking data lakes to their next stage and on making them actionable for AI and machine learning. The notion of data hubs was often mentioned, notably during the keynote speeches by Cloudera, IBM, and Pure Storage. However, most of the vendors of platforms have not been able yet to provide a fully-fledged ecosystem that allows the exploration, governance, and industrialisation of big data”.

AI industrialisation

This brings us to the second motto of the conference: AI industrialisation is a must. Our data engineer Khalil Amdouni explains: “The conference has been migrating towards AI topics. In the past, the conference used to focus mostly on data ingestion and data processing. It has been moving towards data science. Everyone is talking about AI and machine learning and how to put data science models into production. It’s looking into how to move from data exploration to industrialisation; we heard a lot about Cloudera’s Data Science Workbench etc.”

Production environment

The third trend of the conference was the separation between data processing tools and AI frameworks. Khalil explains: “Spark, Cloudera, Kubernetes are now all providing production environments (data science management platforms such as Cloudera Data Science Workbench, the Databricks Runtime ML, Kubeflow…) to integrate with machine learning frameworks such as Tensorflow or Python. Sabri adds: “This is interesting but we should first speak about “productisation”, data science models lifecycles, continuous integration and delivery. There are still a lot of shortcomings, like the fact that you need to centralise all your data in one partition before starting your favourite AI framework”.

Data governance

Another hot topic of the conference was data governance and compliance with regulations. Our R&D director goes on to say: “Everybody is speaking about the importance to be GDPR compliant and is proposing tools like Atlas, Egeria, IBM Infosphere, … but no one says how to actually comply with the GDPR during model deployments or how to deal with access policy management.”

Favourite Talks

Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka

Itai Yaffe presented the journey made by Nielsen’s Marketing Cloud division to provide its customers with real-time analytics tools to profile their target audiences. To achieve its goal, NMC needed to continuously transform its data infrastructure to ingest billions of events per day in a scalable and yet cost-efficient manner.

Sabri says: “The first version of NMC’s architecture includes CSV files and standalone Java applications with an OLAP database to expose the result. To reach their goal, NMC’s teams had to scale the process up to handle 10 times as much data”.

Their first step was to change the architecture: they moved to Kafka to ingest data, they leveraged Spark to stream and to aggregate data, and they used HDFS to store data.

Sabri explains: “The issue here was that they had to manage the statefulness of the Spark applications on HDFS by themselves. In addition, the system was error-prone in case of failure. They tried again and looked into Spark Structured Streaming, then tried to combine Spark Streaming with batch ETLs and finally decided to use Kafka to imitate streaming over their data lake. This evolution made the situation really interesting from a business and architectural point of view. Their business goal is to support decision making with machine learning to deliver reports on campaigns. Over the years, they adapted their architecture to go further and reach that objective”.

Our architect Cyrille Duverne adds: “Their story showed how much effort is required to build a long-term architecture. Tools are not enough; you first need the use cases that lead to an architectural vision. Only then can you choose the tools that will support the vision. To build this architecture, you need time and people with the right skills”.

To know more about NMC’s journey, you can find the slides of the presentation here.

Federated Learning

Chris Wallace is a data scientist at Cloudera Fast Forward Labs. He presented how his team leveraged federated learning to predict maintenance problems when customers of a manufacturer are not willing to share with the manufacturer the details of how their components failed, but want the manufacturer to provide them with a strategy to maintain the faulty parts.

Our architect Cyrille Duverne explains: “In this case, federated learning is a kind of distributed deep learning where you train the model on decentralised data. The main idea is that a network of nodes shares models rather than training data with the server. Each node has the untrained model that they will train using the data they have. Each node then sends a copy of its trained model back to the central server that will take the average and send the new model to the different nodes. The process is repeated until the final version of the model is reached.”

Our data scientist Malian De Ron explains: “I find federated learning very interesting. As data scientists, we can work directly on updating models, but we don’t have access to all the training data. Federated learning can be useful for use cases where the customers want to keep their data anonymous. For example, we work for a financial company that works with a bank. Neither of them is willing to share their data. By using federated learning, the training data could remain in its original location, which could satisfy our customer’s privacy concerns.

To know more about federated learning, you can find the slides of the presentation here.

Data governance with Egeria: The industry’s first open metadata standard

John Mertic is the director of program management for ODPi, the Linux Foundation’s Open Data Platform initiative. He talked about their new open metadata standard Egeria, introduced in September. John Mertic explained how the standard supports the free flow of standardised metadata between different technologies and vendor platforms, enabling organisations to locate, manage, and use their data resources more effectively.

Sabri says: ”Companies have 40 years of evolution embedded in their IT systems, resulting in high complexity of data lineage and data silos. In the complex new world of big data and real time, security models have to track data throughout the organisation. This is why data governance and metadata management are hot topics in conferences. Everybody is talking about it and proposes tools such as Egeria, IBM InfoSphere, or Atlas. I talked with IBM InfoSphere people and I had an overview of the Egeria tool. It can be used to federate the IBM InfoSphere Information Governance Catalog, Apache Atlas and even other Egeria cohorts. The IBM Governance Catalog can pull information directly from Egeria and integrate the metadata, the lineage, and even tags from Atlas”.

To know more about Egeria, please find the slides of the presentation here.

Final Thoughts

When working with clients as they make their journey to the new digital world, we noticed recurrent problems in the areas of data access, usage, and governance. In many conferences, we hear stories of companies facing these challenges and making a lot of ad hoc choices but lacking a long-term architectural vision. To crack the challenges, our R&D director Sabri Skhiri designed the Data Architecture Vision (DAV), which later led to digazu.

The Dataworks conference highlighted the need to take data lakes to their next stage. The digazu platform, with its integrated and managed data lake, meets that need. It is a true data hub that integrates real-time and batch dataflows, that collects data from multiple sources, stores it, and distributes it to applications and users across the whole organisation.

Another need mentioned at the conference was that of providing companies with production environments to deploy models. Leveraging ever-increasing amounts of data to provide new services or solve problems requires increasing resources in terms of expertise, time and money. digazu offers a scalable way to keep data pipelines open for business in real time or batches without an army of data experts, lines of code, or complex training.

A third need highlighted at the conference is for companies to reach good data governance. There are already excellent governance tools such as Atlas, Egeria, IBM Infosphere to support the free flow of standardised metadata. digazu opens the door to automated regulatory compliance by providing ready-to-use connectors to data management and governance tools.

To learn more about digazu, visit digazu.com

Blog Event

BOOT CAMP 2019

EURA NOVA is launching an intense 3-month I.T. boot camp starting September 2019.

Event

Third Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last month, EURA NOVA research centre organised the third workshop on real-time and stream analytics in big data, collocated with the 2018 IEEE conference on big data in Seattle. The workshop brought together the leading actors in the field including data Artisans, the University of Virginia and Télécom Paris Tech as well as 9 well-known speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in data architecture, stream mining, complex event processing and IoT.

The workshop was a real success, with captivating talks and a lot of interesting questions and comments. If you could not attend the event, our R&D engineer Syrine Ferjaoui has brought back for you the important elements from the keynotes and the presented papers.

First keynote speaker:

First of all, the workshop started with the keynote of Fabian Hueske, PMC member at Apache Flink & co-founder of data Artisans. His talk “Unified Processing of Static and Streaming Data with SQL on Apache Flink” presented Flink’s features and its relational unified APIs for batch and streaming data. Fabian Hueske insisted on the importance of unifying stream and batch for 2 major points: the usability and the portability. Flink includes a set of features such as materialised views to speed-up the analytical queries, dynamic tables, updates propagation and processing, continuous queries, approaches to handle time in stream processing, watermarks and queries on infinite sized tables. With all these features, Flink helps its users to build data pipelines with low-latency ETL, stream & batch analytics and to power live dashboards.

Our research director Sabri Skhiri adds: “Apache flink is currently working on a set of connectors. They have already the HDFS sink, the JDBC sink and since they are pushing Flink as the standard technology for data pipelines and materialised views, they want to expand their connectors set.”

Second keynote speaker:

Secondly, our research director Sabri Skhiri talked about data management, and stream and real-time analytics. His talk “The challenge of Data Management in the Big Data Era & its underlying Enterprise architecture shift” started with defining data architecture as a global plan depicting how to collect, store, use and manage data to answer the 8 main challenging questions that are essential to building a solid and efficient solution. During his talk, our director considered deriving microservices from data streams as the new wave of architecture and he discussed the Data Architecture Vision (DAV) set throughout 10 years of research and development at EURA NOVA. The DAV later led to the development of digazu, a data engineering platform containing all the different components needed to collect, store, govern, transform, and analyse all the data in the company’s IT environment.

Workshop Invited Speakers:

After the keynotes, 9 selected papers have been presented, covering mainly these 4 topics: (1) Data Streaming Architecture, (2) CEP/CER, (3) Stream Mining & (4) IoT Device Integration:

A Scalable and Robust Framework for Data Stream Ingestion (Isah and Zulkernine):

Isah and Zulkernine (Queen’s University, Kingston, Canada) propose a scalable and fault-tolerant data stream ingestion and integration framework that can serve as a reusable component across many feeds of structured and unstructured input data in a given platform. Our R&D engineer Syrine Ferjaoui explains: “The ingestion layer (that integrates Apache NiFi and Kafka) is used to decouple streaming analytics layers (acquire, buffer, pre-process, distribute data streams). This NiFi-Kafka “NiFKaf” integration takes advantage of the high configuration of NiFi and the addition of several data of consumers provided by Kafka.This way, it supports many data sources, languages and content formats, ensures high throughput and low latency, supports large numbers of data consumers, enables data buffering during temporary spikes in workload and employs a replay mechanism, and is scalable”.

Edge Computing Architecture to Support Real Time Analytic Applications (Trinks & Felden):

The paper by Trinks & Felden (TU Bergakademie Freiberg, Germany) presents Edge Computing which is an extended approach to cloud computing. It describes an architecture scheme that consists of 3 layers: node layer (gadgets, smartphones, embedded systems, sensors), edge layer (routers, switches, small/macro base station) and cloud layer (datacenters, servers, databases, storages). Edge Computing is used to minimise energy consumption, bandwidth, latency and increase safety and privacy level and employs real-time analytics within its architecture.

Distributed Real Time Link Prediction on Graph Streams (Katragadda, Gottumukkala, Pusala, Raghavan, Wojtkiewicz):

Link prediction refers to the likelihood of a link appearing in the future based on the current status of a graph. The previous works for link prediction such as sketch-based approaches and dynamic attributed networks do not give exact results and cannot handle deletion or modification in the graph nor the large volume of data. The goal of the authors (University of Louisiana, USA) is to design a graph-processing approach for link prediction that ensures real-time prediction and extraction of accurate features from the graph with exact results. Syrine details: “Graph processing can be edge-centric, vertex-centric or neighbourhood centric. This paper proposed two new graph processing frameworks for handling each graph streams: vertex-centric processing & neighbourhood-centric processing. These frameworks are able to predict 100% of the links with an average graph ingestion time between [149.3 – 242.7] ms”.

DisPatch: Distributed Pattern Matching over Streaming Time Series (Hamooni and Mueen):

Researchers from the University of New Mexico have developed a robust distributed matching system, called DisPatch. In a scenario where multiple data sources or producers publish data to the Kafka system, DisPatch is the data consumer that matches a pattern with a guaranteed maximum delay after the pattern appears in the stream. Syrine reacts: “Given a time series T of length n, and a query Q of length m, it normally takes O(nm) to calculate the Euclidean distance/correlation between Q and all subsequences of T, but this method calculates the results in O(log(n)) by exploiting the overlaps. As a result, DisPatch guarantees exactness and bounded delay at the same time”.

Using Information in Access Logs for Large Scale User Identity Linkage (Jalali, Krishnamoorthy, and Biswas):

In this paper, the authors (Adobe Research, California, USA) discuss Adobe’s Identity Graph that provides a comprehensive solution to the challenge posed by fragmentation of identities. Our R&D engineer details: “Identity graph helps in connecting data across channels, domains and devices to solve a fundamental problem in the Digital Marketing domain. The fragmented profiles of a consumer are linked together in order to provide a unified view across devices. This means that an identity graph connects all the known identifiers that correlate with the individual consumer. The researchers built identity relationships by using both online data traffic and offline CRM data logs from customer’s backend systems. To do that, they are using two approaches: deterministic linking and probabilistic linking. They combined them using deterministic as a base and expanding using probabilistic clusters”.

Streaming Algorithm for Big Data Logistic Regression (Yang, Wang, Xu & Zhang)

The authors (Purdue University, USA) propose a novel fitting algorithm for big data logistic regression by combining Fisher Scoring and IRWLS. Syrine details: “The revised IRWLS algorithms can break the memory barrier and is suitable for streamed computing. It is per row updatable and does not need to load the whole dataset into the memory. This algorithm has a fast convergence speed (usually around 3). The limitation of this method is the structured data with large n (rows) and small p (columns)”.

Efficient Dynamic Time Warping for Big Data Streams (Martins & Kerren):

Dynamic Time Warping (DTW) is able to match natural time series with similar shapes, but a different length of patterns. The authors (Linnaeus University, Sweden) described enhancements to the DTW algorithm that allow it to be used efficiently in a streaming scenario. Syrine explains: “Their solution is composed of 3 parts: (1) a very fast implementation of the DTW (2) an append operation for the DTW which works in linear or constant time and (3) an approximation of a sliding window that allows DTW to forget old time steps, improving the processing of “never-ending” streams. In short, DTW encapsulates all data behaviour information in a single value and enables the use of a tiny fraction of data compared to the original sensed data while still obtaining highly accurate results”.

Using Candlestick Charting and Dynamic Time Warping for Data Behavior Modeling and Trend Prediction for MWSN in IoT Concepcion (Aleman, Pissinou, Alemany, Kamhoua)

There is a rapid emergence of new applications involving mobile wireless sensor networks (MWSN) in the field of Internet of Things (IoT). Although useful, MWSN still carry the restrictions of having limited memory, energy, and computational capacity. At the same time, the amount of data collected in the IoT is exponentially increasing.The authors (Florida International University, USA) propose a Behavior-Based Trend Prediction (BBTP), which is a data abstraction and trend prediction technique, designed to adress the limited memory constraint in addition to providing future trend predictions. Predictions made by BBTP can be employed by real-time decision-making aplications and data monitoring.

A multi-dimensional extension of the Lightweight Temporal Compression method (Li, Sarbishei, Nourani, and Glatard):

Lightweight Temporal Compression (LTC) is among the lossy stream compression methods that provide the highest compression rate for the lowest CPU and memory consumption. As such, it is well suited to compress data streams in energy-constrained systems such as connected objects. In this paper, Li, Sarbishei, Nourani and Glatard (Concordia University & Motsai Research, Canada) investigate the extension of LTC to higher dimensions. Syrine adds: “They described how multi-dimensional LTC compression saves substantial amounts of energy (up to 20%) and is feasible on connected objects. The implementation with Euclidean norm is more intuitive than infinity norm for nD sensors, as well as more CPU & memory intensive and leads to lower compression ratios”.

Special thanks to our keynote speaker Fabian Hueske, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

Event

IEEE Big Data 2018: a summary

At the beginning of the month, our R&D director Sabri Skirhi and our R&D engineer Syrine Ferjaoui travelled to Seattle to attend IEEE Big Data. The conference is one of the most influent in this domain, gathering more than 1100 attendees, 5 keynotes, 9 tutorials, and 8 daily tracks in parallel. Back in Belgium, our R&D director gives you his opinion on the conference itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.

Favourite Talks

Keynote 1: Decentralized Machine Learning – Google AI

The IEEE Big Data conference started with the inspiring keynote of Blaise Agüera y Arcas, a distinguished researcher at Google AI. Our director details: “The straightforward thesis of the talk is that we can, and we must, use the mobile device for local deep neural network computing. Blaise Agüera explained that since the launch of Tensorflow, Google Brain has built specialised hardware servers to run efficiently deep neural network computing jobs. Nowadays, we find on the market specialised chips that are smaller than a coin of 1 cent and that costs less than a cappuccino. Using them, you can run very efficiently deep neural net computing jobs on mobile at low frequency, low energy and even continuously. For example, the Google camera embeds deep neural nets and does not need to send data to the server side for face or situation detection. But Dr Blaise is going further. He works on reusing the existing techniques in distributed neural net and sharing the learned gradient in a parameter server and sharing them to all device. This is what we call federated learning, and it has impacted many research areas, such as edge computing. The idea of edge computing is to execute light tasks on the edge of the network in order to offload the server/cloud. But here, this is changing the game since the nature of the job is not light anymore. In addition, the concept of federated learning does not try to offload the server but changes the role of the server as a coordinator between edge devices. Secondly, it has impacted neural net compression. The question is then: do we still need to compress networks when we can either distribute the neural net on the server side or have specialised chips on the device side?”

Keynote 2: Big Data for Speech and Language Processing – MSF Research

The second keynote, Xuedong Huang, is a Microsoft Technical Fellow of Microsoft Cloud and AI. He was presenting the latest advances in Speech recognition and Text To Speech (TTS). The key papers behind this technology can be found here and on the research group page. Our director explains: “The first part of the keynote was about the MSF live captioning that will be soon integrated natively in PowerPoint. That is just impressive. Everything that the speaker is saying is capturing by the tool. I personally tested the Translator Android application and it works just fine! The second part of the keynote was focused on the Text To Speech (TTS). The speaker was showing a set of very interesting examples of how voice can be modelled. For instance, if the system learns a model out of hours of discussions, it can apply my voice in Chinese or Arabic or it can learn from a group of person in order to get a better accent and expression”.

The Tutorials

This year, IEEE Big Data organised 9 tutorials. Our R&D director explains: “This is probably what I like the most at an academic conference. A research group presents a complete state-of-the-art review in their domain and usually position their own work in the story. My favourite was Progress in Zeroth Order Optimization and Its Applications to Adversarial Robustness in Deep Learning. It was one of the coolest research topics I have seen so far. They discussed how you can fool a deep neural network in order to get a wrong classification. The idea is great: finding the minimal noise you can add to a picture in order to increase the probability of a wrong classification. In this setting, you don’t know anything about the classifier, but you can submit images and you will get a label. Indeed, that looks like a black box optimisation setting. That is precisely why they use Zeroth order optimisation. The research topic is so cool, you can manage to fool the classifier to make it recognize a piano in an image picturing a bagel! Can you imagine the impact, at the era of the electronic passport, where image recognition starts to be used in the signature process? What if I can find how to fool an algorithm to be classified as someone else with just a few grey pixels on my picture?”

The Workshops

EURA NOVA research centre organised the third workshop on Real-time and Stream analytics in Big Data, collocated with the 2018 IEEE conference on Big Data. Our Research Director Sabri Skhiri talked about data management, and stream and real-time analytics. Thank you to our keynote speaker Fabian Hueske, and all the attendees and speakers! They had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event is available on our website. The slides of the opening session and the slides of the second keynote are available here.

Final Feelings

In the early age of the conference, IEEE Big Data was mainly focused on the big data infrastructure. In the following years, the conference became data science oriented, with a significant increase in the number and the complexity of data science use cases. When we asked how he felt about the event, Sabri explained: “I have been attending this conference since the first occurrence. The most important shift I have seen is really about the content. This year, the infrastructure papers have almost disappeared. On the other hand, the vast majority of the publications are on data science. We can really see that it is becoming a conference for ML practitioners. The side effect is the complexification of the discussed topics. Machine learning notions are supposed to be known, deep neural networks are becoming the norm. Going further, the authors are also good at using distributed frameworks, especially Spark. For them, the infrastructure is not a problem anymore, this is part of the daily job”.

The Papers

A personal selection of interesting papers:

Learning Effective Embeddings for Machine-Generated Emails with Applications to Email Category Prediction: a nice paper from Google, Twitter and Facebook. The main intuition is that the B2C email in your mailbox and especially the sequences of such email can tell a lot about you and your interest and even your next actions. The idea is then to find a relevant embedding for emails. Interesting, especially in the context of B2C emails.
Learning-based Automatic Parameter Tuning for Big Data Analytics Frameworks: they don’t consider job feature and then they need to re-train the performance model for each job. But the approach for the optimization is interesting. Especially, the intuition behind space exploration and space exploitation that has its roots in Reinforcement Learning.
A Reinforcement Learning Based Resource Management Approach for Time-critical Workloads in Distributed Computing Environment: although the paper is not rocket science and even a bit naive, it represents a complete Reinforcement Learning modelling. They give a good illustration of the state, action and reward modelling.
An Integrated Knowledge Graph to Automate GDPR and PCI DSS Compliance: the idea is to parse a law text such as the GDPR regulation and to create a knowledge graph (using triplet RDF representation) and then to be able to ask questions. In the future works, they plan to be able to check the compliance of your data privacy policy with the knowledge graph.

Blog Event

Spark+AI Summit: a summary

A few weeks ago, Sabri Skhiri and Florian Demesmaeker were in London to attend the Spark+AI summit. They came back with a lot to say about the new features of Spark and the presented use cases! In this article, they will give you their opinion about Databricks’ main announcement, the intakes of their favourite talks and training, and what they thought of the new name of the conference.

A new name

This year, Spark expanded the summit’s scope and renamed it “Spark + AI Summit”. The goal of Databricks, announced by its co-founder Ali Ghodsi, is to incorporate unified aspects of data and AI.

Florian Demesmaeker, our R&D engineer, explains: “In some of the keynote talks, the speakers talked about use cases where the job of the data engineer is strongly reduced. The data scientists can easily experiment with data, travelling back and forth in time. This means more focus on AI, rather than on the data engineering part that makes all data accessible to the data scientists”.

Main announcement

In line with this change of name, Databricks announced the release of a complete data science lifecycle on the cloud.

Sabri Skhiri, our R&D Director, explains “It is interesting to see that the change in the event name is actually very visible in the change of Databricks’ strategy. Their tools are now completely dedicated to stream ETL, and there is a huge focus on integrated data management”.

Databricks’ new features include Databricks Delta which creates data pipeline and provides data views and exploration features. Secondly, the Databricks Runtime ML is a ready-to-use environment providing a set of pre-loaded ML frameworks where the data scientist can play with data. Finally, the MLflow tool allows to simplify the ML models development at enterprise scale.

Our R&D Director precises: “Together, these features provide a complete and unified approach to machine learning lifecycle and pipeline automation. This looks like a very competitive SaaS offer for integrated data management, available on AWS and Azure. However, the metadata management and the security aspect is still the missing piece”.

The training day

The first day of the conference was dedicated to training workshops that include a mix of instruction and hands-on exercises to help attendants improve their Apache Spark skills.

Florian gives insights into his favourite training Tuning and Best Practices. He explains: “The aim of the training was to make programmers aware of how Spark works internally, in order to be able to write optimised applications. They presented a few situations, each one showing one relatively slow process. Then they presented a step-by-step procedure to debug the situation and to find the points that could be improved in the current situation. In summary, tips and tricks to adapt to different situations”.

Favourite talks

The sessions at the conference covered data engineering and data science contents along with best practices for productionising AI. The talks were divided into roughly two categories: Spark programming and deployment, and applications on top of Spark (AI applications).

Florian Demesmaeker explains: “I attended 28 talks. The keynotes from Databricks were quite interesting, they presented Delta and MLflow. I also enjoyed the talks about tools to optimise the internals of Spark, these provided good technical details. Other talks were about use cases on top of Spark, it was interesting to see what challenges other companies face and how they address them”.

Sabri Skhiri adds: “The talk Learning to Rank Datasets for Search was very inspiring. Oscar Castañeda-Villagrán, a data scientist working at Xoom (a Paypal service) talked about learning to rank R data set. The idea is that we can extract metadata when the data pipeline is arriving in the lake. Going further, you can not only extract metadata but also calculate a kind of judgment relevance score that will be used for bootstrapping the learning to rank process. In this way, a user can search and retrieve the relevant R data set in the lake. A very good idea for the metadata-driven exploration”.

Early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities. You can read their feedback here.

Blog Event

Flink Forward 2018: What You Want to Know and What You (Will) Need to Know.

Early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.

They came back with a lot to say about the hot topics in stream processing and the presented use cases! In this article, they will give you their opinion about data Artisans’ main announcement, the intakes of their favourite talks, and what they thought makes Flink Forward different from other conferences.

First keynote announcement:

During the keynote speech, data Artisans announced that they now bring ACID transactions directly on streaming data with data Artisans Streaming Ledger.

Charles Bonneau, our software architect, says: “This feature allows ACID transactions between multiple operators’ event-processing operations and internal states. This means that streaming applications can now update multiple states in one transaction. For example, an application that transfers money from one bank account to another can finally be implemented using Flink with strong consistency guarantees. Both bank accounts will have their balance updated at the same time as if there was a master data-management state”.

For Sabri Skhiri, our R&D director, this opens the doors to a brand new range of applications, especially in data-driven real-time services but also in streaming data management. He explains: “They are pushing forward the concept of streaming. Now, you could imagine a master data-management state that can be updated by operational streaming applications in real time. This will allow even more complex and advanced use cases of stream processing!”.

Favourite talks:

In 2 days, each Euranovian attended about 18 talks and use case presentations, with speakers from tech giants such as IBM, Netflix, Alibaba, and Uber as well as speakers from smaller companies.

Charles explains: “The conclusions are reassuring: most of them face the same issues that we see at our clients’ and our solutions are all valuable. They include a stream-first data architecture, a stream-first data pipeline product, and Flink developers skills. Even though a number of companies are at the very edge of the technology and their issues do not yet require continuous flows of a considerable amount of events, we are ready”.

For our R&D Director Sabri Skhiri, the keynote speech from Lightbend was one of the most interesting ones. He explains: “Viktor Klang, Lightbend deputy CTO, talked about the convergence between microservices and stream processing. At EURA NOVA, we have been advocating for this convergence for more than a year in our architecture practice. The idea is simple: asynchronous microservices can be designed as stream processing stages. This is fantastic because it makes modern stateful stream processing frameworks the perfect target for implementing reactive microservices. With stateful deployment, exactly once semantics, high availability and ACID access to states, microservices can become stateful streaming apps.”

Vision-oriented Flink Conference:

Our colleagues came back with sparkles in their eyes. When we asked them how they felt about the event, Sabri Skhiri explained:

“Very often, this type of conferences tend to be business oriented. They are focused on how to make the framework easy to use and available to as many people as possible. By contrast, this year’s Flink Forward conference was all about innovation and vision. data Artisans shared their vision of what the Flink framework will be within 3 to 5 years and talked about what role stream processing and big data have within this vision. In fact, almost all the talks were very technical. They were testimonies of big names in the industry, such as Alibaba, Netflix, and ING about problems encountered on the field and how they have been solved, which is often out of the box. The Flink-Alibaba partnership is a sharing one. Alibaba are way ahead with their technology. They keep their lead for 1 year and then they share their work and make their code open source. data Artisans have a great long-term vision of stream processing. I can see a lot of very interesting architecture discussions in the coming months!”

Stream Processing Technology:

When most frameworks cannot process considerable streams of live data and provide results in real time, Flink provides a single runtime for the streaming and batch processing while being highly scalable.

Cyrille Duverne, our Lead Data Architect, confirms: “Flink is definitely a real-time processor! We’re speaking about true real time, not only mini batches etc… Plus, the introduction of ACID transaction management in the new version of data Artisans’ Flink distribution creates a good marketing edge”.

Sabri Skirhi and our R&D engineer Florian Demesmaeker were at the Spark Summit this week. Stay tuned for part 2 with their feedback!

Blog Event