Fourth Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last December, Eura Nova’s research center held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0.

The workshop was a real success with many interesting questions and comments. If you could not attend, our R&D engineer Syrine Ferjaoui brought back important elements from the presentations for you.

First keynote speaker:

First of all, the workshop started with the keynote of Matteo Merli, PMC member at Apache Pulsar. His talk “Messaging and Streaming” explained how Pulsar can be a unified infrastructure that supports messaging and streaming.

Matteo introduced messaging as events that are being created and streaming as analysing events that just happened. These are two different processing concepts but they need a single infrastructure. He then explained the architecture view of Pulsar, which has separate layers between the brokers and the bookies (BookKeeper instances that handle persistent storage of messages). This means that brokers and bookies can be added independently, traffic can be shifted very quickly across brokers, and new bookies will ramp up on traffic quickly. This segmented distribution makes the architecture of Pulsar more flexible and dynamic.

Pulsar has other interesting features such as durability, low latency, high throughput, high availability, unified messaging model, high scalability, native computing, … The roadmap includes working on Pulsar storage API to allow direct access to data stored in Pulsar and to retrieve and process data more efficiently. They are also working on higher-level messaging features.”

Second keynote speaker:

The second keynote was given by John Roesler, a Kafka committer at Confluent. He talked about Kafka Streams and the evolution of streaming paradigms.

To design software, we, developers, used to separate the application logic from the database. To scale the database capacity, we then started to use a search index to do ETL jobs and query the database in a fast and optimal way. However, this created bugs in the software, added data consistency issues, and created more complexity in the system. Later, we started to use HDFS for a more flexible design. While enabling replication and distributed storage, this solution added more latency and supported batch processing only. It did not meet the needs of real-time processing use cases.

At this point, streaming helped a lot. The next step was to add a streaming platform that reads from sources, does some computation, and sinks the result somewhere else. The KafkaStreams design is a set of multiple lambda stateful functions, which makes it a good fit for a microservices architecture. With Kafka Streams’ new updates, the app logic is linked to a relational database with ACID guarantees.

Finally, John Roesler considers that “software is a fractal”, a never-ending pattern: a software architecture is complex and even when we zoom into a single component, it is still complex. But for the Kafka Streams’ design, when we zoom out, it looks like a set of services interacting and connected to each other and this simplifies the aforementioned designs.

John concluded by mentioning open problems that can be dealt with in stream processing, including semantics, observability, operability, and maintainability.

Workshop Invited Speakers:

After the keynotes, 8 selected papers were presented, covering mainly these 6 topics: (1) Stream Processing for IoT, (2) Serverless and HPC (High Performance Computing), (3) Collaborative Streaming, (4) Stream Mining, (5) Image Mining and (6) Real-time Machine Learning. Some papers are not yet available, as they will be published in the proceeding of the IEEE Big Data Conference. In the meantime, do not hesitate to contact our R&D department at [email protected] to discuss how you can leverage stream processing in your projects.

Scalable and Reliable Multi-Dimensional Aggregation of Sensor Data Streams (Sören Henning, Wilhelm Hasselbring)

Sören and Wilhelm are engineers in the Software Engineering Group from Kiel University. They propose a stream processing architecture which allows for aggregating sensors in hierarchical groups, supports multiple hierarchies in parallel, provides reconfiguration at runtime, and preserves the scalability and reliability qualities of streaming.

Performance Characterization and Modeling of Serverless and HPC Streaming Applications (Andre Luckow, Shantenu Jha)

Andre Luckow, head of Blockchain and Emerging Technologies at BMW Group, and Shantenu Jha, associate professor at Rutgers University, presented StreamInsight, which provides insight into the performance of streaming applications and infrastructure, their selection, configuration, and scaling behaviour.

Collaborative Streaming: Trust Requirements for Price Sharing (Tobias Grubenmann, Daniele Dell’Aglio, Abraham Bernstein)

The paper is written by Tobias Grubenmann, researcher at The University of Hong Kong, in collaboration with Daniele Dell’Aglio and Abraham Bernstein, researchers at the University of Zurich. They present the Collaborative Stream Processing (CSP), a model where the costs, which are set exogenously by providers, are shared between multiple consumers, the collaborators. For this, they identify the important requirements for CSP to establish trust between the collaborators and propose a CSP algorithm adhering to these requirements.

Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining (Tengyue Li, Simon Fong, and Raymond Wong)

Tengyue Li and Simon Fong (researcher and associate professor at the University of Macau, China) and Raymond Wong (associate professor at UNSW Sydney) worked on the Kennard-Stone Balance algorithm used as a new data conversion method. Training a prediction model effectively using big data streams poses certain challenges in machine learning. In this paper, the authors apply the Kennard-Stone algorithm on time-series to extract a meaningful representation of big data streams, which improves the performance of a machine learning model.

Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures (Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch)

Shawndra Hill (Microsoft), Anthony Colas (University of Florida), H. Andrew Schwartz (State University of New York at Stony Brook) and Gordon Burtch (University of Minnesota) explained their work on the interactions between TV content and online behaviours such as response to digital advertising. They developed AdMiner, a tool that can track online activity around a brand and provide actionable insights into ad campaigns.

MLK Smart Corridor: An Urban Testbed for Smart City Applications (Austin Harris, Jose Stovall, and Mina Sartipi)

Austin Harris, Jose Stovall, and Mina Sartipi (researchers and CUIP director at the University of Tennessee at Chattanooga) have helped to create Chattanooga’s smart corridor, used to test new technologies and generate data-driven outcomes. In their talk, they presented the corridor, used as a test bed for research in smart city developments in a real-world environment. The wireless communication infrastructure and network of sensors in combination with data analytics provide a means of monitoring and controlling city resources and infrastructure in real time.

Image Mining for Real Time Quality Assurance in Rapid Prototyping (Sebastian Trinks and Carsten Felde)

Sebastian Trinks and Carsten Felde (TU Bergakademie Freiberg) presented how image mining can help avoiding errors and low quality of printed prototypes in real time. This can result in saving resources and increasing efficiency when developing new products.

Real-Time Machine Learning Competition on Data Streams at the IEEE Big Data 2019 (Dihia Boulegane)

This year, IEEE Big Data held the Real-time Machine Learning Competition on Data Streams. As the competition is focused on streaming, its online platform required a specific infrastructure that meets data stream mining requirements. Dihia Boulegane is a Ph.D. student at Télécom ParisTech working in collaboration with Orange Labs on machine learning for IoT networks monitoring. She was in charge of implementing the streaming engine of the dedicated platform of the competition. Dihia explained its components, the technologies used, and the challenges met to build the platform. At the end, the platform was able to provide multiple streams to multiple users, to receive multiple streams, to process them and to provide the leader board and live results.

Special thanks to our keynote guests, Matteo Merli and John Roesler, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

Event

A Performance Prediction Model for Spark Applications

Apache Spark is a popular open-source distributed-processing framework that enables efficient processing of massive amounts of data. It has a large number of parameters that need to be tuned to get the best performance. However, tuning these parameters manually is a complex and time-consuming task. Therefore, a robust performance model to predict applications execution time could greatly help in accelerating the deployment and optimization of big data applications relying on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications that cover the most common workloads to generate a representative dataset of execution time. In addition, we extracted application and data features to build a machine learning-based performance model to predict Spark applications execution time. The experiments show that boosting algorithms achieved better results compared to other algorithms.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

Click here to access the paper in its preprint form.

Papers

IEEE Big Data 2019 – A Summary

At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Los Angeles to attend IEEE Big Data Conference. It is one of the most influential academic gatherings in distributed machine learning. This year, it featured 879 authors, shortlisted from 2009 applicants. They came from 28 countries and presented 210 papers. Back in Belgium, Sabri and Syrine give you their opinion on the event itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.

The Big Trends

Sabri says: “The main trends were deep learning, NLP, privacy-preserving approaches, GAN, graph mining and stream mining. In my view, the level of the papers was quite good. Authors are becoming ever more skilled in data science, maths and algorithms. This goes to show that to be a good data scientist, you need an extensive set of advanced skills. Interestingly, there was almost nothing about distributed computing! This is a big move compared to the previous editions. The only presentations that had something to do with distributed systems were about optimisation strategies, an area similar to what our ECCO team researches. The Big Data Conference focuses on data science; it does not really look into its scalability. Distributed computing topics tend to be dealt with at conferences like DEBS, VLDB, USENIX, SIGMOD, etc. As a result, this conference is an amazing place to see hundreds of data science use cases with, most of the time, an interesting contribution.”

The Keynotes

The keynotes were focused on data science as well. We even heard the term “Big Data Science”.

Keynote 1: Responsible Data Science by Lise Getoor – Professor at UC Santa Cruz

Syrine says: “The first keynote was my favourite. Lise started by comparing machine learning to a black box. The goal was to unpack the box and invite people to use data science and to use it wisely. To autonomise ethical decision-making, we should move away from maximising AI systems autonomy and move toward human-centric systems. To do this, we should make sure that human-centric systems have three qualities: (1) be knowledge-based, (2) be data-driven, and (3) support human values. Achieving responsible data science requires both machine-learning and ethics.”

Keynote 2: DataCommons “Google for Data” by Ramanathan Guha – Google

Guha presented DataCommons, a project started by Google to combine data from different open sources. Syrine explains: “Google’s DataCommons project allows users to pretend that the Web is one website, enabling developers to pretend all this data is in one database. The long-term vision of Google is to aggregate all data from publicly available sources (Medicare, Wikidata, sequence data, Landsat, CDC, Census…) into a single Open Knowledge Graph. The goal is to reduce or eliminate the data download-clean-store process. Instead, users can access and use already cleaned data in the cloud. Data can be public or private (internet & intranet). This will avoid repeated data wrangling and ease the burden of data storage, indexing, etc.”

The Tutorials

This year, IEEE Big Data held nine tutorials. Our R&D director explains: “At this type of events, tutorials are always a good way to learn a complete state of the art in a couple of hours. I particularly appreciated the tutorial on “Taming Unstructured Big data: Automated Information Extraction for Massive Text” by the team of the famous Jiawei Han (he is a kind of pop star in data mining and the father of Graph Cube). I found out that many papers about named entity relations were published in the past two years. The idea is to be able to extract supervised, semi-supervised, and unsupervised relations between entities: for instance, discovering that “Trump” is “President of” “USA”. They also propose new approaches to integrate knowledge bases such as DBPedia or YAGO to infer new unknown relations from a corpus. This is just amazing!”

Syrine adds: “The tutorial on NewSQL principles, systems, and current trends was interesting as it explained why we should consider using NoSQL/NewSQL to deal with data interconnections and very high scalability. After attending this tutorial, I was motivated to order this book about Principles of Distributed Database Systems. For fans of deep learning, the tutorial “Deep Learning on Big Data with Multi-Node GPU Jobs” covers a lot about large-scale GPU-based deep-learning systems. If you missed the conference, all resources can be found on this link.”

The Workshops

The EURA NOVA research centre organised the fourth workshop on Real-time and Stream Analytics in Big Data, at the 2019 IEEE conference on Big Data. We were really happy to welcome Matteo Merli from Apache Pulsar and John Roesler from Confluent as keynotes speakers. Thank you to them and to all the attendees and speakers! They had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event will soon be available on our website. The slides of the keynotes are available here:

Favourite Papers

A personal selection of interesting papers:

Subspace Clustering with Active Learning (Hankui Peng)

The paper tackles a really interesting problematic faced by a lot of data scientists. Introducing active learning is a cool idea and so is the way they used a mathematical trick to make their approach feasible.

High Impact Customer Acquisition & Retention Modelling – A Scalable Data Mashup Approach (Kajanan Sangaralingam, Nisha Verma, Aravind Ravi, Su Won Bae)

Su Won Bae, from Mobilewalla, presented how they can define a complete customer acquisition model by mixing their data with their customer data (in this case, a worldwide leader in food delivery). Sabri says: “The quality of data science models highly depends on the data they can train on. I am convinced we will go in the same direction as Mobilewalla in the future to have richer models. However, mixing data must be done with care as it may raise some privacy issues; our purpose has to have legal ground.”

Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings (Ahmed El-Kishky, Frank Xu, Aston Zhang, Jiawei Han)

The speaker presented MorphMine, a method for unsupervised morpheme segmentation. It can generate morpheme candidates that are filtered out using entropy to select the best morphemes from a corpus. Then, these morphemes can be used to highly improve the word embedding model and the downstream machine learning tasks.

Blog Event

GraphOpt: Framework for Automatic Parameters Tuning of Graph Processing Frameworks

Finding the optimal configuration of a black-box system is a difficult problem that requires a lot of time and human labor. Big data processing frameworks are among the increasingly popular systems whose tuning is a complex and time consuming. The challenge of automatically finding the optimal parameters of big data frameworks attracted a lot of research in recent years. Some of the studies focused on optimizing specific frameworks such as distributed stream processing, or finding the best cloud configurations, while others proposed general services for optimizing any black-box system. In this paper, we introduce a new use case in the domain of automatic parameter tuning: optimizing the parameters of distributed graph processing frameworks. This task is notably difficult given the particular challenges of distributed graph processing that include the graph partitioning and the iterative nature of graph algorithms.

To address this challenge, we designed and implemented GraphOpt: an efficient and scalable black-box optimization framework that automatically tunes distributed graph processing frameworks. GraphOpt implements state-of-the-art optimization algorithms and introduces a new hill-climbing-based search algorithm. These algorithms are used to optimize the performance of two major graph processing frameworks: Giraph and GraphX. Extensive experiments were run on GraphOpt using multiple graph benchmarks to evaluate its performance and show that it provides up to 47.8% improvement compared to random search and an average improvement of up to 5.7%.

Muaz Twaty, Amine Ghrab, Skhiri Sabri: GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks. 2019 IEEE International Conference on Big Data (Big Data) Workshops, Los Angeles, CA, USA.

The paper was published at the third IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2019).

You can access it here in its preprint version.

Do not hesitate to contact our R&D department at [email protected] to discuss how you can leverage graph processing in your projects.

Papers

Master Thesis 2020

This document introduces you to master thesis supervised by our research & development department. Each project offers you the chance to be actively involved in the development of solutions to address tomorrow’s challenges in ICT and implementing them today!

If you are interested in one of our offers, please send us your application to [email protected], including your CV and motivation regarding your top three master thesis subject (described in the document).

If you are interested in working on a topic that is not in our range of offers, we would be delighted to hear your proposition and invite you get in touch.

Master thesis subjects and application guidelines are available here: Master Thesis Offers.

Academic collaborations Blog

Flink Forward: The Key Takeaways

Early October 2019, 6 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.

In this article, they will give you their opinion about Ververica’s’ main announcement, the impact of Ververica acquisition by Alibaba, the big trends, and a selection of their favourite talks.

Alibaba!

This is the first Flink Forward conference since the acquisition of Ververica (formerly known as data Artisans) by Alibaba, which has been one of the largest users of Flink and second-largest contributor for years. Our R&D director Sabri Skhiri says: “The only significant impact of this acquisition on the conference is that the venue is now at the Berlin Business Center instead of the Kulturbrauerei. There, we could see that the Apache Flink user’s community has grown significantly as well as their commits on Flink. This edition was a bit more business and enterprise-oriented than previous ones, although it still had its technical DNA and a lot of technical talks. All in all, this was a very good mix. Alibaba folks are deeply committed to open source and creating technology impact. We saw a lot of activities from them such as the integration of the Blink SQL runner, the hive integration or the new scheduling model. In summary, a great event.”

First Keynote Announcement

Keynote: Stream Processing and Applications in the Modern Age (Stephan Ewen)

During the first keynote, Ververica took the opportunity to announce the launch of Stateful Functions (statefun.io), an open-source framework built on top of Flink to run stateful serverless functions. It bridges the gap between Function as a Service and stream processing.

Sabri says: ”Last year, they announced their streaming ledger that brings ACID transactions between states to stream processing applications. This year, they announced the launch of Stateful Functions, a framework that reduces the complexity of building and orchestrating stateful applications at scale. In the streaming world, this announcement does not change a lot of things. However, in the microservice community, this opens new doors in terms of design patterns, especially in the way data feeding and stateful operations can be designed more flexibly.”

You can find the video of the presentation here.

The Big Trends

1. Unified batch and streaming

A significant trend of this edition is the “Unified Stream and Batch” moto. Our R&D engineer Syrine Ferjaoui says: “Flink currently features different APIs, the DataSet API for batch processing and the DataStream API for stream processing. In addition, the Table API is already a unified API on top of both (DataSet and DataStream) with declarative-style programming. Now, they are working on a solution to unify truly the batch and streaming APIs.”

Sabri adds: “In Flink 1.9, they released the State API with which a state created in batch can be used in a stream application – interesting for bootstrapping/backfilling states. But the community is going further by proposing in Flink 2.0 a unique Data API that will merge DataSet and DataStream while still taking advantage of the batch properties to optimise the execution.”

Every talk was exploring in a way or another how this unification can be pushed forward. For instance, in the Pulsar talk, they were thinking about using Pulsar as a back end to transparently bootstrap a state and then switch on stream using (1) pulsar capability in terms of segment storage and (2) unified data stream API in Flink.”

2.”Enterprise-grade” Flink:

Flink is moving clearly toward an “enterprise-grade” technology. Sabri says: “The first signal is that Cloudera adopted Apache Flink into its Data Platform. Also, AWS Kinetics now integrates Flink as a client. Adoption by such big players goes to show that Flink is well on the way to gain enterprise-grade support. The second signal is the release of the Ververica Platform that highly facilitates enterprise-grade operations. Thirdly, the integration of the Hive Metastore with the pluggable catalogue architecture is a significant step towards better governance and metadata management. Finally, there were many talks about lowering the barrier to deploying Flink in prod. The topics included APIs, configuration, memory management, K8S operators, etc.”

3.The ML path

Finally, regarding ML/AI, there is still a lot of work to get over the gap with the Spark ecosystem. However, the Alibaba folks are working hard on this topic and we can already see the first results. Sabri says: “The refactoring of the Flink ML interface to work on Flink Table APIs is excellent. There is an excellent vision of integrating Flink as a data prep engine for ML and serving layer; and the roadmap looks great.”

Interesting talks

A personal selection by Charles & Christophe of interesting talks to check out :

For Charles, our data architect:

Aljoscha Krettek & Timo Walther, respectively a co-founder at Ververica and a PMC member of Apache Flink work on the Flink APIs. They give a summary of recent contributions to Flink’s Table & SQL APIs. It was a very good overview of what is going on in terms of refactoring and where we are going.
Roman Grebennikov is a software developer from Findify AB. His talk focused on Flink serialisation framework and common problems happening around it. He illustrated and explained several ways to optimise Flink jobs by taking care of the serialisation, which in most cases represents about 60% of the processing.

For Christophe, our software engineer:

Konstantin Klauf is the head of product for the Ververica Platform based on Apache Flink. He discussed Apache Flink worst practices by sharing anecdotes and hard-learned lessons of adopting distributed stream processing. It was a humorous list of general good practices when working with Flink from planning, requirement, deployment, and maintenance.
Aaron Levin and Mike Mintz are software engineers in a Stripe’s streaming team. They talked about the many challenges they encountered writing the specialised dual source. This talk was a very well-told story about a simple use case with a high constraint: all-time deduplication of transactions at Stripe (a payment platform‎). It was funny, insightful, full of lessons learned and echoed some of digazu’s features: the history replayer.

Blog Event

Kafka Summit: The Key Takeaways

At the beginning of the month, our software engineer Christophe Philemotte was in San Francisco to make a presentation at the Kafka Summit organised by Confluent. The Kafka Summit is one of the main events for data architects, engineers, DevOps, and developers who want to learn about streaming data. In this article, Christophe shares with you the latest trends from the conference.

Main observations

This year, one of the most important takeaways at the conference was that Confluent is working towards building an active database with KSQL.

Christophe details: “KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka. With KSQL, Confluent is embracing the SQL streaming and the integration of its stack into it. They also aim to have the interactivity we already have with a classic database. In short, they are moving towards this new paradigm of active data and passive query where KSQL would make it easy to read, write, and process streaming data in real-time, at scale, using SQL-like semantics. Still, KSQL shouldn’t be chosen over Flink, for instance, without proper consideration of its limitations. For example, real checkpointing and savepoint are missing, as well as global shuffling. There are still constraints on partitioning in some operators and there is no global windowing.”

While talking about SQL streaming, they also mentioned user-defined function or machine learning integration. Find more information on the summit website.

Another interesting point was the shared approaches and themes that were addressed by different companies. For example, 30% of the talks were about the operations. About 5 talks were dedicated to methods how to deploy on Kubernetes, and several other speakers mentioned that deploying on Kubernetes was their target. Real-time analytics, integrations/ETL/DataOps, and of course data pipelines were also often mentioned.

Keynotes talks:

During the first keynote talk, Jun Rao, the co-founder of Confluent, looked back at Apache Kafka’s last years and what brought them to where they are today. Christophe says: “One interesting point was the concept of democratising data. They envision Kafka as a one-stop self-service shop for devs, data scientists, etc. Still, the users have to overcome a lot of challenges such as operations, integrations, security, or cold storage. Challenges that we are solving with digazu.”

You can find the video of the presentation here.

Jay Kreps, the CEO of Confluent as well as one of the co-creators of Apache Kafka started his talk by discussing the sentence “Software is eating the world”, by Marc Andreessen. Christophe adds: “The idea is that software must be integrated into an ecosystem of other software. The users are no longer just humans. In some cases, the software will be used almost exclusively by other software.”

Jay Kreps also talked about the new steps for Apache Kafka. He announced that the next release of Kafka KSQL in November will enable users to directly register inputs and outputs thanks to Kafka Connect source and sink connectors. They are also working on better interactivity that will allow users to see the results more quickly in the KSQL CLI.

You can find the video of the presentation here.

Our Favourite Use Cases:

Kafka on Kubernetes: Keeping It Simple (Nikki Thean)

Nikki Thean is a staff engineer at Etsy, where she helps deploying Kafka at Etsy. She talked about Etsy’s Cloud Migration and how running Kafka on Kubernetes was the best option for them and was not half as complicated as they thought it had to be. Christophe explains: “At the DataWorks Summit in Barcelona, the message was that K8S resource management was not yet ready to replace YARN. We now see that K8S is the new YARN for many people who are using it to deploy their cluster. For example, Etsy or Confluent Cloud.”

In her talk, Nikki Thean explained how a Kafka-on-K8S setup works. Christophe explains: “The main lessons from her talk are:

We can start simply without an operator.
We must pay attention to the Kubernetes liveness and readiness probes. They can be used to make a service more robust and more resilient since K8S can restart them if necessary. However, if these probes are not configured carefully, they will kill the brokers unnecessarily.
Considering the price to deploy in multiple zones on Google Cloud Platform, a good solution is to deploy at least Zookeeper (the most critical element of the cluster) on multiple zones. Given the low flow of data, it will not be too expensive and Zookeeper will allow identifying which Kafka node has the data.”

You can find the video and the slides of the presentation here.

Mission-Critical, Real-Time Fault-Detection for NASA’s Deep Space Network using Apache Kafka (Rishi Verma)

Rishi Verma is a manager at the NASA Jet Propulsion Laboratory. He talked about the new software system being deployed by NASA to upgrade its Deep Space Network (DSN) that operates spacecraft communication links for NASA deep-space spacecraft missions. Christophe says: “It was a super interesting use case! The DSN Complex Event Processing (DCEP) software assembly is a new software system that brings into the DSN next-generation “Big Data” infrastructural tools to do IoT with their legacy assets. The objective is to correlate real-time network data with other critical data assets (in their example, an old radio antenna). They recover all the data on Kafka, then they process it and then they predict signal loss on the basis of weather conditions.”

You can find the video and the slides of the presentation here.

Cross the Streams Thanks to Kafka and Flink (Christophe Philemotte)

Christophe is the CTO of digazu, the batch and real-time data sharing platform developed by EURA NOVA. In his talk, he explained how you could build a similar data platform and how you could plug Flink into the Kafka ecosystem, as well as what the common pitfalls are and what Flink requires to be deployed on Kubernetes.

Christophe says: “The feedback was positive and I received a lot of questions during the Q&A session and after the talk, notably about Flink vs KSQL vs Spark. Another question that I received a lot is when to use Table, SQL or DataStream API. My answer was that Table and SQL APIs are two different flavours of the same API. The Table API you have a LINQ experience while with the SQL API you have a SQL experience. They are both perfect for data processing that can be expressed simply in SQL. That means in a lot of cases. The DataStream API is a lower-level API compared with the Table and SQL APIs. It gives more control on what you can do, which means it also requires a thorough understanding of Flink core mechanisms. Going for the DataStream API is usually a good choice either when your stream processing cannot be expressed in SQL and requires specific implementation, or when you need to optimise the processing.

The sandbox provided was also very popular.

You can find the video and the slides of the presentation here.

Our Favourite User Practice:

Please Upgrade Apache Kafka. Now. (Gwen Shapira)

Gwen Shapira is a software engineer at Confluent working on core Apache Kafka. She reviewed all the recent releases and made suggestions on how to de-risk upgrades.

Christophe says: “Gwen Shapira talked about why it is essential to upgrade even though it is risky and time-consuming. She explained that each new release fixes from 30 to 140 bugs and listed the improvements you will get from upgrading”. Among them:

The Apache Kafka team is working on improvement to build a reliable replication. For example, watermarking has been improved greatly.
They are working on controller design towards the removal of Zookeeper.
Finally, some releases are critical for specific reasons (e.g. proper resolution of IP when you work with K8S, JBOD, or EOS).

In the second part of her talk, Gwen Shapira made suggestions to upgrade as safely as possible. Christophe explains: “She recommended to take good care of backup configuration and documentation. Regarding documentation, she recommended to read the list of notable changes, to act upon text in bold font, and once you have finished reading, to go over it all again!”

Christophe’s last word? “Be sure to check out slide 35: it lists the ways how not to upgrade!”

You can find the video and the slides of the presentation here.

Blog Event

4th Workshop on Real-time & Stream Analytics in Big Data

EURA NOVA Research centre is proud and excited to organize the fourth workshop on Real-time and Stream analytics in Big Data, collocated with the 2019 IEEE conference on Big Data. The workshop will take place in December in Los Angeles, USA.

Stream processing and real-time analytics in data science have become some of the most important topics of Big Data. To refine new opportunities and use cases required by the industry, we are bringing together experts passioned about the subject.

This year, we are excited to have two amazing keynotes from Confluent KStream and Apache Pulsar:

Matteo Merli is one of the co-founders of Streamlio, he serves as the PMC chair for Apache Pulsar and he’s a member of the Apache BookKeeper PMC. Previously, he spent several years at Yahoo building database replication systems and multi-tenant messaging platforms. Matteo was the co-creator and lead developer for the Pulsar project within Yahoo.
John Roesler is a software engineer at Confluent and a contributor to Apache Kafka, primarily to Kafka Streams. Before that, he spent eight years at Bazaarvoice, on a team designing and building a large-scale streaming database and a high-throughput declarative Stream Processing engine.

If you want to join us, authors from the industry and the academia are invited to contribute to the conference by submitting articles. Check out the workshop website to find all the information you will need. Your paper will be reviewed by a prestigious panel of international experts from both the academic and the industrial worlds.

Event

ACL 2019: Takeaways from the conference

Last month our R&D Project Director Cécile Pereira and our PhD student Léo Bouscarrat travelled to Florence to attend and present to ACL 2019. ACL is one of the biggest conferences in Natural Language Processing. This year all the records were broken with more than 3500 attendees, 660 accepted papers to the main conference, 9 tutorials and more than 20 workshops. All the talks of the main conference were recorded and are accessible online. In this article, Cécile and Léo share with you the latest trends from the conference!

Big trends

A new paradigm in NLP?

This year, ACL’s selection of topics has shown the importance that has taken self-training methods such as BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019). These methods consist of feeding huge models with a vast amount of data and then train them on easy tasks (for example, predict masked words in the original sentence or predict if two sentences are following each other).

These models should be able to learn faster and with less data on a more specific and complex task. With this method, the way to train a model to solve an NLP task has changed. Here is this new paradigm:

Select a pre-trained model (trained with self-training)
Add a layer on the output of this model (it will depend on your task) and fine-tune the model by giving the inputs and outputs of your task
Evaluate your model

Many papers were using this paradigm to achieve state of the art on several tasks (out of the 660 papers of the main conference, 47 have the word BERT in their abstract).

Contextual embeddings, like BERT, take into account the context of the sentence into the embeddings of the words. BERT can be used for a large variety of tasks including but not limited to classification (Reimers et al., Chalkidis et al.), named entity recognition (Arkhipov et al., Emelyanov and Artemova) and question answering (Li et al., Liu et al.).

So it is working. But the remaining question is why?

Several presentations discussed the explainability of BERT (for example Jawahar et al. and Clark et al.). Those papers discuss that, as the different layers are learning different things, the different heads seems to specialize in certain types of words or certain syntactic or semantic task.

The conference highlighted the need for adversarial training and testing as those models are very good to learn bias in the dataset (Niven and Yao, Jiang and Bansal). For those not familiar with the concept, adversarial training and testing consist to train and/or test on an adversarial dataset. This dataset is composed of examples, often generated ones, where the model fails to predict the correct answers. Adversarial training is generally used to verify if the models learn bias in the dataset (like the negation in Niven and Yao). It can also improve the quality of the models.

Improving the experiments in NLP

Several presentations showed that adversarial training can improve the results and robustness of the model (Zhu et al and Jiang and Bansal, Mohit Bansal slides available here).

The meeting was also a moment to discuss the impact of the use of standard splits on benchmark data. Standard split means that, if you want to work on a specific task, you will generally look for the training, validation and test splits used in other publications and use the same.

However, Gorman and Bedrick argue that the use of random split should be preferred. They explain it by trying to reproduce the results of nine part-of-speech taggers on a specific dataset. They reproduced the same rankings on the standard splits. However, when they did it on random splits, the ranks of the taggers, considering the same metric, varied.

This showed that getting a better ranking on a specific split doesn’t mean that you are better in general. Since in some fields of research, the improvements between each paper are small, the use of standard split does not guarantee that a model is really better than another one on the task. Random splits could improve this by adding a notion of variance on the performances.

Domain adaptation

The last trend in NLP consists of using models or embeddings learned on huge datasets of general data from sources such as Wikipedia, books or newspapers.

When you want to work on specialised domains such as the biomedical, legal or financial domains, you need specialised embeddings. However, you generally don’t have enough specialised data to re-train the embeddings or the models.

A solution is to use and modify pre-trained models for your specific task. This is called domain adaptation. There are several ways to do it. For example, Boukkouri et al combined a general embedding and a smaller one learned on their domain. Hu et al fine-tuned a general model on their data. These methods allow using recent models (which needs a lot of data) on some specific domains that do not fit those requirements.

Trendy topics

Machine translation

Machine translation is still a huge topic with no less than 46 papers in the main conference (according to the ACL 2019 chair blog post), an entire two-days workshop dedicated to it and Liang Huang invited talk. Liang Huang is a principal scientist of Baidu Silicon Valley AI Lab who talked about the current state of simultaneous translation and Baidu research’s new approach. They were able to do an English/Chinese translation with 3 seconds of delay only. The demo is available here: https://simultrans-demo.github.io/. One can also notice that the ACL best long paper award was on this topic (Zhang et al.)!

Conversational systems

Conversational systems (also called chatbots) were also a trendy topic, with 52 papers, a workshop, and the invited talk from Pascale Fung.

Pascale Fung is a Professor at the Hong Kong University of Science & Technology. She presented the state of the art of conversational systems. For her, recent advances are going in three directions: learning to memorise, learning to personalise and learning to empathise. She presented her current work on conversational systems that can empathise, showing that improvements have been made but there is still work to do. She ended with questions about the ethics of this sector: how can we build systems that are secure, safe and fair for all?

Knowledge graph

Knowledge graphs are also pretty trendy, they seem to be a good way to add knowledge to models. It can be used for Question-Answering or Conversational systems. The blog post of Michael Galkin makes a review of the most interesting articles in this sector.

Bias in NLP

After recent papers showed that models in NLP are biased (Bolukbasi et al., 2016 ; Caliskan et al, 2017) there is more and more work on what we can do about that, reflected by a session and a workshop during the meeting (https://genderbiasnlp.talp.cat/).

Several works about removing gender bias from models have been previously published. But the work of Gonen and Goldberg explains that, for now, it’s only “Lipstick on a pig”.

We observed two main areas on the topic. Firstly, removing/controlling gender bias in the models (like in automatic translation, Habash et al., Escudé Font et al., Ik Cho et al.). Secondly, measuring bias in the models and society (with articles proposed by sociologists, like Karve et al., Hitti et al., Basta et al., Kurita et al.).

Summarization

There were several papers about summarization (including our own paper https://arxiv.org/abs/1907.07323) which have been summarized by RecitalAI on their GitHub.

Conclusions

ACL was a great place to measure the trends in the NLP field. As models are becoming better, data scientists are applying them to a large variety of topics including automatic translation, search engine, and chatbots.

As the NLP community and topics are becoming bigger and bigger, we hope that this summary of our biased takeaways from the meeting could help you navigate the nearly 700 ACL papers of this year.

Blog Event

STRASS: A Light and Effective Method for Extractive Summarization

This paper introduces STRASS: Summarization by TRAnsformation Selection and Scoring. It is an extractive text summarization method which leverages the semantic information in existing sentence embedding spaces. Our method creates an extractive summary by selecting the sentences with the closest embeddings to the document embedding. The model learns a transformation of the document embedding to minimize the similarity between the extractive summary and the ground truth summary. As the transformation is only composed of a dense layer, the training can be done on CPU, therefore, inexpensive. Moreover, inference time is short and linear according to the number of sentences. As a second contribution, we introduce the French CASS dataset, composed of judgments from the French Court of cassation and their corresponding summaries. On this dataset, our results show that our method performs similarly to the state of the art extractive methods with effective training and inferring time.

Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.

Click here to access the paper.

Florence, Italy

Papers