Eura Nova RD
Eura Nova

Activity

In this section you will find EURA NOVA’s latest news and activities.

Academic Programmes Activity

19-11-2020

INTERNSHIPS 2021

This document presents internships supervised by our software engineering department or by our research & development department. Each project is an opportunity to feel both empowered and responsible for your own professional development and for your contribution to the company.

 

If you are interested in one of our offers, please send us your application to career@euranova.eu, including your CV and motivation regarding your top three internship positions (described in the document).

 

If you wish to read the testimonies of students who have done an internship at EURA NOVA, visit our blog, where they share their experiences.

If you are interested in working on a topic that is not in our range of offers, we would be delighted to hear your proposition and invite you to get in touch.

Internship subjects and application guidelines are available here: Internship Offers.


Academic Programmes Activity

19-11-2020

MASTER THESIS & PFE 2021

This document introduces you to master thesis and graduation projects supervised by our research & development department. Each project offers you the chance to be actively involved in the development of solutions to address tomorrow’s challenges in ICT and implementing them today!

If you are interested in one of our offers, please send us your application to career@euranova.eu, including your CV and motivation regarding your favourite master thesis subject (described in the document).

If you are interested in working on a topic that is not in our range of offers, we would be delighted to hear your proposition and invite you to get in touch.

Master thesis subjects and application guidelines are available here: Master Thesis Offers.


Activity

05-11-2020

ECML 2020 – The Keynotes

A few weeks ago, the biggest European conference on machine learning was held: ECML 2020. Our research engineer Nourchène, our R&D consultant Gianmarco, and our data scientist Ronan attended the event from Tunisia, Belgium and Marseille. In this article, they tell you about the different keynote talks they attended. 

Gemma Galdon-Clavell – Algorithmic Auditing: how to open the black-box of ML

Nourchène says: “I loved the talk given by Gemma Galdon-Clavell, in which she addressed the problem of ethics in AI, as computer science engineers do not often question what they are producing from a moral standpoint. In her talk, Gemma points out the importance of the data used to train a machine learning model. Data are provided by humans, but people are not perfect and are likely to make wrong decisions. The model will then learn to behave the same way, so we might end up creating an unethical model. This can lead to two different behaviours: users will either follow the system’s recommendations at any cost or decide not to if they find the decisions unreasonable. The data will then continue to be biased, which creates a sort of deadlock.”

 

Ronan adds: “Algorithms do not produce biases out of nowhere; they reproduce and amplify the biases they find in the data they ingest. As a result, we have to pay attention first to the quality of the data we use. Gemma emphasizes that algorithmic auditing is the key to understanding whether the algorithm meets expectations and whether it complies with regulations. The audit does not only cover the technical part and the way the algorithm was coded. It also focuses on how the problem was approached and the means deployed to solve it.”

 

Nourchène explains: “The speaker suggests that before creating a product, computer science engineers and developers need to ask the following questions: Is the product desirable and what is the problem that it tries to solve? Is it acceptable and does it involve users? Is it legal? Finally, does it use the right data? Gemma also suggests that ethics be taught in engineering schools. I totally agree with that because nowadays technology does not always seek to solve real problems; its goal is rather to make a fortune out of the proposed product.”

 

Max Welling – Amortized and Neural Augmented Inference

Gianmarco says: “My favourite talk was the one given by Max Welling. It clearly showed and unified the underlying theoretical grounds of many superficially different models, without failing to provide real-world applications. More concretely, the talk showed how to develop hybrid amortized methods that combine classical learning, inference and optimization algorithms with learned neural networks, which is of strong interest, especially in physics-related fields.

It provided a comprehensive exposition of the topic of amortized neural inference and, as a consequence, it brought the audience up to date with applications in that regard. Max Welling presented how a learned neural network can augment or correct a classical solution (obtained by means of expert knowledge or classical equations), or conversely, how a neural network can be fed useful information computed by a classical method.”

 

Been Kim – Interpretability for everyone

Gianmarco says: “I was exposed to many new topics and applications I was not familiar with. Talks like Interpretability for everyone that offered more abstract research were the ones that caught my attention the most. The talk presented the latest discoveries and tools in terms of interpretability quantification. It also introduced how to extract interpretability from a black-box end-to-end model, which I find very important for the construction of more robust models and model diagnosis.”

 

Doina Precup – Building Knowledge For AI Agents With Reinforcement Learning

Ronan says: “I really liked the talk given by Doina Precup on how to build knowledge in the field of reinforcement learning. I had only a little knowledge of this field. Thankfully, Doina quickly introduced us to the key concepts of reinforcement learning. She also presented some big successes of RL, described different RL mechanisms and moved on to the problem of using existing knowledge to build a life-long learning agent. Doina concluded her talk with a lot of open and inspiring questions: How can we exploit previously learned knowledge and apply it to new environments not related in any manner to the previous ones? How well is an agent preserving and enhancing its knowledge? These questions might not have definitive answers, or any answers at all, but I found the questions she raised on how we can represent knowledge very relevant and interesting.”

 

Stephan Günnemann – Certifiable Robustness of ML Models for Graphs

Ronan says: “In this technical talk, Stephan presented different methods to assess GNN robustness. To certify the robustness of a GNN, an evaluation of its sensitivity to perturbations needs to be conducted. For example, you can search for a worst-case scenario and verify that the classification margin is still positive in that scenario, which ensures the model is robust. Stephan’s talk was very pleasant to listen to, as he accompanied it with several examples and applications of the methods he presented. Finally, he concluded that ML models for graphs aren’t reliable but that we can apply certificates and robustification principles to provide guarantees for a reliable use of GNNs.”
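To make the "worst-case margin" idea concrete, here is a minimal sketch on a deliberately simpler model than a GNN: a linear multi-class classifier whose input features can each be shifted by at most eps. This is an assumption chosen for illustration, not one of the methods from the talk. The function names and the toy weights are hypothetical.

```python
def worst_case_margins(weights, biases, x, true_class, eps):
    """Lower-bound the margin of the true class over every other class,
    for any perturbation delta with max(|delta_i|) <= eps.

    Assumes a linear model: score_k(x) = weights[k] . x + biases[k].
    """
    w_c, b_c = weights[true_class], biases[true_class]
    margins = {}
    for k, (w_k, b_k) in enumerate(zip(weights, biases)):
        if k == true_class:
            continue
        diff = [wc - wk for wc, wk in zip(w_c, w_k)]
        clean_margin = sum(d * xi for d, xi in zip(diff, x)) + (b_c - b_k)
        # Worst case: every feature moves by eps against the sign of diff,
        # which lowers the margin by eps times the L1 norm of diff.
        margins[k] = clean_margin - eps * sum(abs(d) for d in diff)
    return margins


def is_certified_robust(weights, biases, x, true_class, eps):
    """The prediction is certified if even the worst-case margin is positive."""
    return min(worst_case_margins(weights, biases, x, true_class, eps).values()) > 0


# Toy two-class model (hypothetical values).
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
x = [2.0, 0.5]

small_budget = is_certified_robust(weights, biases, x, true_class=0, eps=0.5)
large_budget = is_certified_robust(weights, biases, x, true_class=0, eps=1.0)
```

With the small budget the worst-case margin stays positive, so the prediction is certified; with the larger budget it becomes negative and no guarantee can be given. For a GNN the worst case cannot be computed in closed form like this, which is why dedicated certification methods are needed.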

 

Watch the talks: 

If you wish to catch up on talks we mentioned or those you missed, all the sessions, paper and presentation recordings are available (for a limited time) from the ECML website.

Gemma Galdon-Clavell

Max Welling

Been Kim

Doina Precup

 

Stephan Günnemann 

Activity

05-11-2020

ECML 2020 – A Summary

A few weeks ago, the biggest European conference on machine learning was held: ECML 2020. Our research engineer Nourchène, our R&D consultant Gianmarco, and our data scientist Ronan attended the event from Tunisia, Belgium and Marseille. What were the big trends and their favourite talks? What did they think of the online remote format? Let’s find out with them!

 

The Big Trends

Overall, the conference was closely in step with the latest trends and needs of the outside world. Gianmarco explains: “The conference was rich in presentations which covered nearly all possible topics in machine learning. However, I had the impression that Graph Neural Networks and Generative Models had a little more presence than other models. Transfer learning was another topic that seemed to be very relevant throughout the conference.”

 

Remote Format For The First Time

Due to the COVID-19 pandemic, the conference was fully virtual. The talks were pre-recorded and made available prior to the conference. The live sessions were dedicated to questions and answers, with a very brief presentation at the beginning of the session. 

Nourchène explains: “The downside was that we had to watch the whole presentation beforehand, otherwise it was difficult to follow the discussion and to interact with the speaker. Fun fact: there was a session where even the moderator was not aware of this Q&A aspect and asked the speaker why the presentation was so short! The good thing is that, since the presentations were pre-recorded, it was possible to watch the presentations from sessions running in parallel.”

Gianmarco adds: “I have not had many remote conferences in my life, but I was genuinely surprised to see how well-organised this one was. The remote framework was very well-designed, the web interface was fully functional, and they took advantage of all the benefits that a remote event can have like re-watchable presentations.”

Kudos to the organising committee for pulling it off!

 

The Keynotes

We wrote a separate article with more details about the keynotes, but here is a teaser:

Gemma Galdon-Clavell – Algorithmic Auditing: how to open the black-box of ML

In her talk, Gemma points out the importance of data used to train a machine learning model. According to her, algorithmic auditing is the key to understanding if the algorithm meets the expectations and if it complies with the regulations. This audit does not only cover the technical part and the way the algorithm was coded. It also focuses on how the problem was approached and the means deployed to solve it. Read our detailed review here

 

Max Welling – Amortized and Neural Augmented Inference

The talk showed and unified the underlying theoretical grounds of many superficially different models, without failing to provide real-world applications. It provided a comprehensive exposition of the topic of amortized neural inference and, as a consequence, it brought the audience up to date with applications in that regard. Read more here.

 

Been Kim – Interpretability for everyone

The talk presented the latest discoveries and tools in terms of interpretability quantification. It also introduced how to extract interpretability from a black-box end-to-end model. Read more in our article.

 

Doina Precup – Building Knowledge For AI Agents With Reinforcement Learning

Doina Precup talked about how to build knowledge in the field of reinforcement learning. She also presented some big successes of RL, described different RL mechanisms and moved on to the problem of using existing knowledge to build a life-long learning agent. Discover more!

 

Stephan Günnemann – Certifiable Robustness of ML Models for Graphs

Stephan presented different methods to assess the robustness of a GNN, which requires evaluating its sensitivity to perturbations. Learn more with Ronan here.

 

Interesting Paper?

Si-An Chen; Voot Tangkaratt; Hsuan-Tien Lin; Masashi Sugiyama – Active deep Q-learning with demonstration

Nourchène says: “The authors presented their paper proposing different groups of techniques for learning from demonstration in Reinforcement Learning, like RL Expert Demonstration (RLED) or Active RL Demonstration (ARLD). These techniques can be used to speed up the learning process of an RL agent. They also propose an uncertainty-based query strategy named Active Deep Q-Network, based on DQN, to dynamically estimate the uncertainty of recent states and use the queried demonstration data.”

 

Favourite tutorial

Learning With Imbalanced Domains and Rare Event Detection

Ronan says: “This tutorial was interesting and well-structured. Imbalanced domains and rare-event prediction concern a lot of domains (financial, medical, data distribution…) and will always remain a centre of attention when designing the appropriate solution to a problem. As a consequence, they will remain a core problem in research. I particularly liked this tutorial as it covered and compared a lot of different approaches: unsupervised (statistical-based, proximity-based, clustering-based), supervised and semi-supervised. As there is no ideal solution that can be applied to every problem, you have to know what exists before choosing the one that best fits your problem. The tutorial also covered different methods to properly evaluate the performance of an algorithm on an imbalanced task.”

 

Conclusion

The conference covered a wide range of machine learning topics in the form of presentations about the latest trends, technologies and applications. As Nourchène says: “It is an optimal platform to stay up-to-date, to widen one’s perspectives and/or dig deeper into a specific topic.”

 

Watch the talks: 

If you wish to catch up on talks we mentioned or those you missed, all the sessions, paper and presentation recordings are available (for a limited time) from the ECML website.

 

Gemma Galdon-Clavell

 

Max Welling 

 

Been Kim

 

Doina Precup

 

Stephan Günnemann

 

Active deep Q-learning with demonstration: Read the paper 

Activity

30-10-2020

Our engineer Amine Ghrab presented his PhD public defense on the BI on Graph Project

Last Thursday, our engineer Amine Ghrab presented the BI on Graph project during his PhD public defense. Amine did an amazing job at the edge between Industry & Academia. Amine’s thesis was done in collaboration with the CODE/WIT Lab of the Université Libre de Bruxelles and the Universitat Politècnica de Catalunya, with the support of Prof. Oscar Romero & Prof. Esteban Zimanyi!

In his PhD thesis, Amine defined how BI environments can be enriched with graph data structures. Over the past decade, business and social environments have become increasingly complex and interconnected. As a result, graphs have emerged as a widespread abstraction tool at the core of the information infrastructure that supports these environments. In particular, the integration of graphs into data warehouse systems has appeared as a way to extend current information systems with graph management and analysis capabilities. Going forward, Amine redefined the concept of a multidimensional cube on graphs and showed how it can open new doors for data analysts. Finally, he showed how a graph data warehouse architecture can be defined.

Congratulations on your achievements!

You can find below a list of related publications:

Activity

24-08-2020

Privacy Policy Classification with XLNet

The popularisation of privacy policies has become an attractive subject of research in recent years, notably after the General Data Protection Regulation came into force in the European Union. While GDPR gives Data Subjects more rights and control over the use of their personal data, length and complexity of privacy policies can still prevent them from exercising those rights. An accepted way to improve the interpretability of privacy policies is through assigning understandable categories to every paragraph or segment in said documents. The current state of the art in privacy policy analysis has established a baseline in multi-label classification on the dataset containing 115 privacy policies, using BERT Transformers. In this paper, we propose a new classification model based on the XLNet. Trained on the same dataset, our model improves the baseline F1 macro and micro averages by 1-3% for both majority vote and union-based gold standards. Moreover, the results reported by our XLNet-based model have been achieved without fine-tuning on domain-specific data, which reduces the training time and complexity, compared to the BERT-based model. To make our method reproducible, we report our hyper-parameters and provide access to all used resources, including code. This work may, therefore, be considered as a first step to establishing a new baseline for privacy policy classification.

 

Majd Mustapha, Katsiaryna Krasnashchok, Anas Al Bassit and Sabri Skhiri, Privacy Policy Classification with XLNet, Proc. of the 15th DPM International Workshop on Data Privacy Management, Surrey, UK, 2020.
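The macro and micro F1 averages mentioned in the abstract above summarise multi-label performance in two different ways: macro averages the per-category F1 scores, while micro pools all decisions before computing a single F1. The sketch below is a plain-Python illustration of that difference. It is not the paper's evaluation code, and the category names and segments are invented for the example.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per text segment."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for true_set, pred_set in zip(y_true, y_pred):
        for label in labels:
            if label in pred_set and label in true_set:
                counts[label][0] += 1
            elif label in pred_set:
                counts[label][1] += 1
            elif label in true_set:
                counts[label][2] += 1
    # Macro: every category weighs the same, however rare it is.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: pool the counts, so frequent categories dominate.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    return micro, macro


# Hypothetical categories and segments, for illustration only.
labels = ["first_party", "third_party", "retention"]
y_true = [{"first_party"}, {"first_party", "third_party"}, {"retention"}, {"first_party"}]
y_pred = [{"first_party"}, {"first_party"}, {"first_party"}, {"first_party"}]

micro, macro = micro_macro_f1(y_true, y_pred, labels)
```

In this toy case a classifier that always predicts the most frequent category gets a micro F1 of about 0.67 but a macro F1 of about 0.29, which is why reporting both averages is informative on a dataset with rare categories.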

Activity

21-08-2020

Towards Privacy Policy Conceptual Modeling

After GDPR enforcement in May 2018, the problem of implementing privacy by design and staying compliant with regulations has been more prominent than ever for businesses of all sizes, which is evident from frequent cases against companies and significant fines paid due to non-compliance. Consequently, numerous research works have been emerging in this area. Yet, to this moment, no publicly available model can offer a comprehensive representation of privacy policies written in natural language, that is machine-readable, interoperable and suitable for automatic compliance checking. Meanwhile, privacy policies stay one of the main means of communication between a business (Data Controller) and a Data Subject, when it comes to the use of personal data. In this paper, we propose a conceptual model for fine-grained representation of privacy policies. We reuse and adapt existing Semantic Web resources in the spirit of interoperability. We represent our model as an ODRL profile and demonstrate how existing privacy policies can be translated into ODRL-like policies, consisting of deontic rules. We enrich our model with vocabularies for describing personal data processing in great detail, making it suitable for further usage in downstream applications, such as access control tools, to support adoption and implementation of privacy by design. We also demonstrate our model’s capability of handling personal data processing rules in other types of documents, namely data processing agreements, essential for controlling data privacy in a relationship between a Controller and a Processor.

 

Katsiaryna Krasnashchok, Majd Mustapha, Anas Al Bassit and Sabri Skhiri,  Towards Privacy Policy Conceptual Modeling, Proc. of the 39th International Conference on Conceptual Modeling, Vienna, Austria, 2020.

Activity

30-03-2020

TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes

Graphs are a fundamental structure that provides an intuitive abstraction for modelling and analyzing complex and highly interconnected data. Given the potential complexity of such data, some approaches proposed extending decision-support systems with multidimensional analysis capabilities over graphs. In this paper, we introduce TopoGraph, an end-to-end framework for building and analyzing graph cubes. TopoGraph extends the existing graph cube models by defining new types of dimensions and measures and organizing them within a multidimensional space that guarantees multidimensional integrity constraints. This results in defining three new types of graph cubes: property graph cubes, topological graph cubes, and graph-structured cubes. Afterwards, we define the algebraic OLAP operations for such novel cubes. We implement and experimentally validate TopoGraph with different types of real-world datasets.
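As a rough intuition for what a graph cube computes, the sketch below rolls a small property graph up along one node attribute: nodes sharing the same attribute value are merged, and edges are counted between the resulting groups. This is a generic illustration of multidimensional aggregation over a graph, not TopoGraph's model or API. The function name, the attributes and the data are all hypothetical.

```python
from collections import Counter


def roll_up(nodes, edges, dimension):
    """Aggregate a property graph along one node attribute.

    nodes: dict mapping node id -> dict of attributes
    edges: iterable of (source id, target id) pairs
    Returns (number of nodes per group, number of edges between groups).
    """
    group_of = {node_id: attrs[dimension] for node_id, attrs in nodes.items()}
    node_measure = Counter(group_of.values())
    edge_measure = Counter((group_of[s], group_of[t]) for s, t in edges)
    return node_measure, edge_measure


# Hypothetical collaboration graph.
nodes = {
    "ann": {"country": "BE", "role": "engineer"},
    "bob": {"country": "BE", "role": "analyst"},
    "chloe": {"country": "FR", "role": "engineer"},
    "dan": {"country": "TN", "role": "analyst"},
}
edges = [("ann", "bob"), ("ann", "chloe"), ("bob", "chloe"), ("chloe", "dan")]

by_country_nodes, by_country_edges = roll_up(nodes, edges, "country")
by_role_nodes, by_role_edges = roll_up(nodes, edges, "role")
```

Choosing "country" or "role" as the dimension gives two different aggregated views of the same graph, much as choosing different dimensions does in a classical OLAP cube.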

 

The paper will be published soon in Information Systems Frontiers, and is already available online on Springer. Currently, it is unfortunately available only to subscribers, but do not hesitate to reach out to us for more information!

 

Amine Ghrab, Oscar Romero, Sabri Skhiri, Esteban Zimányi, TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes, published in Information Systems Frontiers (2020).

 

 

Activity Conferences

27-02-2020

Thirty-Fourth AAAI Conference On Artificial Intelligence: A Summary

Two weeks ago, our young research engineers Hounaida Zemzem and Rania Saidi were in New York for the Thirty-Fourth AAAI Conference On Artificial Intelligence. The conference promotes research in artificial intelligence and fosters scientific exchange between researchers, practitioners, scientists, students, and engineers in AI and its affiliated disciplines. Rania and Hounaida attended dozens of technical paper presentations, workshops, and tutorials on their favourite research areas: reinforcement learning for Hounaida and graph theory for Rania. What were the big trends and their favourite talks? Let’s find out with them!

 

The Big Trends:

Rania says: “The conference focused mostly on advanced AI topics such as graph theory, NLP, Online Learning, Neural Nets Theory and Knowledge Representation. It also looked into real-world applications such as online advertising, email marketing, health care, recommender systems, etc.”

Hounaida adds: “I thought it was very successful given the large number of attendees as well as the quality of the accepted papers (7,737 submissions were reviewed and 1,591 accepted). The talks showed the power of AI to tackle problems or improve situations in various domains.”

 

Favourite talks and tutorials

Hounaida explains: “Several of the sessions I attended were very insightful. My favourite talk was given by Mohammad Ghavamzadeh, an AI researcher at Facebook. He gave a tutorial on Exploration-Exploitation in Reinforcement Learning. The tutorial by William Yeoh, assistant professor at Washington University in St. Louis, was also amazing. He talked about Multi-Agent Distributed Constrained Optimization. Both their talks were clear and funny.”

 

Rania’s feedback? “One of my favourite talks was given by Yolanda Gil, the president of the Association for the Advancement of Artificial Intelligence (AAAI). She gave a personal perspective on AI and its watershed moments, demonstrated the utility of AI in addressing future challenges, and insisted on the fact that AI is now necessary to science. I also learned a lot about the state of the art in graph theory. The tutorial given by Yao Ma, Wei Jin, Lingfei Wu and Tengfei Ma was really interesting. They explained Graph Neural Networks: Models and Applications. Finally, the tutorial presented by Chengxi Zang and Fei Wang about Differential Deep Learning on Graphs and its Applications was excellent. Both were really inspiring and generated a lot of ideas about how to continue to expand my research in the field!”

 

Favourite papers

A personal selection by Rania & Hounaida of interesting papers to check out:

For Hounaida:

 

For Rania:

 

Final thoughts

After attending their first conference as Euranovians, what will Rania & Hounaida remember? Hounaida concludes: “Going to New York for the AAAI-20 Conference as one of the ENX data scientists was an amazing experience. I met many brilliant and sharp international experts in various fields. I enjoyed the one-week talks with so many special events, offline discussions, and the night strolls!”

Activity Conferences

20-02-2020

Schloss Dagstuhl: where computer science meets

Which direction are stream and complex event processing going to take? Last week, the world’s best-known international researchers met in Schloss Dagstuhl, Germany, to present and discuss their research. Among the participants were Avigdor Gal, Professor at the Israel Institute of Technology, Alessandro Margara, Assistant Professor at the Polytechnic University of Milan, and Till Rohrmann, engineering lead at Ververica.

Invited to talk about the requirements and needs from the industry, our R&D director Sabri Skhiri explains: “The seminar brought together world-class computer scientists and practitioners working on complex event recognition, distributed systems, databases, stream reasoning and artificial intelligence. Our objective was to disseminate the recent foundational results in each of these isolated fields among all participants, to identify the open problems that need to be resolved, and to establish new research collaborations among these fields”.

What were the big trends and takeaways gathered by those brilliant minds? Let’s find out with Sabri!

 

 

The Big Trends

This seminar is a bit particular as it does not show any trends but rather gives a picture of all the communities working on CER in one way or another. I was fascinated by the diversity of researchers. I did not expect to see such a rich variety of fields: knowledge representation, spatial reasoning, logic-based reasoning, data management, learning-based approaches, event-driven processing, process mining, database theory, stream mining,… In my view, the composite event recognition models that are the best at recognising complex events would include:

  1. Data flow model
  2. Ontology-based and reasoning model
  3. Symbolic reasoning model
  4. Automata-based model

We also identified common challenges across these models and communities. The three priority topic areas we identified are:

  1. Expressivity: composability & hierarchies
  2. Evaluation strategy, parallelization and distribution
  3. Uncertainty management

 

Favourite Talk

Kurt Rothermel from TU Stuttgart – Time-sensitive Complex Event Processing

My first reaction to load shedding was: “It is useless since customers do not want to lose any event, that is why so much effort is spent today on exactly-once semantics…”. However, there is a trend today in stream processing, which is the trade-off between cost, latency, and correctness. Tyler Akidau described this challenge as a choice between one of three propositions: fast and correct, cheap and correct, or fast and cheap. Tyler was talking about streaming, but that rule applies in the same way in a CEP context. The load shedding strategy falls directly into the third proposition. From this perspective, Kurt’s work is highly relevant.

 

Favourite Tutorial

Jacopo Urbani & Fredrik Heintz – Stream Reasoning

Concretely, stream reasoning is incremental reasoning over rapidly changing information. The tutorial opened new perspectives on stream processing for me. It tried to answer a very interesting question: how can you provide reasoning about context from streams of data? I definitely come from the database and event-based systems communities and I did not know at all that stream reasoning was so mature. This community has been evolving from having a continuous version of SPARQL to a complete distributed stream reasoning semantics. It is interesting to see that the work we have done in the LEAD algebra and semantics is deeply inspired by this community. However, we have never used any reasoning logic on top of LEAD. But after a few hours of the tutorial, I realise that (1) reasoning can be used for query rewriting and optimisation and (2) it is worth evaluating at least BigSR, the LARS implementation on Flink.

 

Avigdor Gal & Ruben Mayer – Distributed and Event-Based Systems

Avigdor is a kind of pop star for the stream processing and distributed systems community, or at least for me! The papers he published about a probabilistic CEP engine with late arrival and event uncertainty were visionary.

The speakers started by explaining the basics of stream processing then went deeper into the event recognition language and architecture. They detailed pub/sub applied to event recognition and explained the data flow model, which consists of a single unified data processing model where the stream and batch paradigms are the same.  This last part was based on Tyler Akidau’s paper.

The second part of the talk focused on elasticity on streams. Stream fission sorts operators into different categories:

  • Firstly, key-based operators, which correspond to a group-by operation (as in SQL)
  • Secondly, window-based operators, which make it possible to split processing that needs to have multiple event types correlated with different keys within the same operator
  • Finally, pane-based operators, which enable a split-merge strategy where you distribute the work and merge the results.
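The first category above, key-based operators, can be sketched as follows: events are routed to parallel operator instances by hashing their key, so that every event for a given key lands on the same instance and a group-by can be computed locally. This is a minimal illustration rather than the mechanism of any particular engine. The function names, the `road-N` keys and the parallelism of 2 are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def instance_for(key, parallelism):
    """Pick an operator instance with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def partition_by_key(events, parallelism):
    """Route (key, value) events so that each key stays on one instance."""
    partitions = defaultdict(list)
    for key, value in events:
        partitions[instance_for(key, parallelism)].append((key, value))
    return partitions


def count_per_key(partition):
    """The group-by each instance runs on its own share of the stream."""
    counts = defaultdict(int)
    for key, _ in partition:
        counts[key] += 1
    return dict(counts)


# Hypothetical events: one per car, keyed by road segment.
events = [("road-1", 1), ("road-2", 1), ("road-1", 1),
          ("road-3", 1), ("road-2", 1), ("road-1", 1)]

partitions = partition_by_key(events, parallelism=2)

# Because keys never span instances, merging the local results is a plain union.
merged = {}
for partition in partitions.values():
    merged.update(count_per_key(partition))
```

Since no key is split across instances, the merged result equals what a single operator would have computed, which is what makes this kind of operator straightforward to scale out.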

Interestingly, Avigdor presented his work about late-arrival processing from a probabilistic viewpoint and not from the watermark perspective. Usually, modern stream processing frameworks use watermarks in order to take into account events that arrive later. Avigdor presented a probabilistic approach to this issue.

 

What are late-arrival events?

Imagine we want to count the number of cars entering a road segment every three minutes: we have a 3-minute “tumbling window”. If an event (i.e. a car) occurs 2 minutes 55 seconds into the window but is stuck somewhere in the network for 6 seconds, it is called a late-arrival event. The processing time (the time at which the CEP engine processes the event) is delayed compared to the event time (the time at which the event really occurs).

Note that for CEP, there is clearly a trade-off between timeliness and accuracy, because the slack time will increase the delay to deliver your result but will increase your accuracy. There is always a trade-off between cost, latency and correctness, and usually, you can only pick two among the three.
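The car-counting example and the slack-time trade-off can be sketched in a few lines. The code below is a simplified illustration: it assumes event time and processing time are measured in seconds on the same clock, and it closes each window a fixed slack after its end, which stands in for the watermark or probabilistic mechanisms discussed in the talk. The function name and the event values are hypothetical.

```python
WINDOW_SECONDS = 180  # three-minute tumbling windows


def count_per_window(events, slack_seconds):
    """Count events per tumbling window, assigned by event time.

    events: list of (event_time, processing_time) pairs, in seconds.
    A window is considered closed slack_seconds after its end; events
    processed after that point are dropped.
    """
    counts = {}
    dropped = 0
    for event_time, processing_time in sorted(events, key=lambda e: e[1]):
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        window_closes_at = window_start + WINDOW_SECONDS + slack_seconds
        if processing_time < window_closes_at:
            counts[window_start] = counts.get(window_start, 0) + 1
        else:
            dropped += 1
    return counts, dropped


# Four cars; the third one occurs at 2'55 (175 s) but is processed 6 s later.
events = [(10, 11), (100, 101), (175, 181), (200, 201)]

no_slack_counts, no_slack_dropped = count_per_window(events, slack_seconds=0)
with_slack_counts, with_slack_dropped = count_per_window(events, slack_seconds=10)
```

Without slack, the first window reports two cars and the late one is lost. With 10 seconds of slack it correctly reports three, but the result can only be emitted at 190 s instead of 180 s: accuracy is bought with latency.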

Fun fact: if you need to explain what event time and processing time are to your mother (yeah, don’t underestimate the power of this kind of discussion at Christmas dinner), the best way is to take the Star Wars analogy. From an event time perspective (which is the time at which the story really happened) you should follow episodes 1, 2, 3, 4, 5, 6, 7, 8, 9. But if you take the processing time (the time at which we received the episode), it is 4, 5, 6, 1, 2, 3, 7, 8, 9. Isn’t it great?!

 

Final Thoughts

CER has been explored from many viewpoints. However, never before in the history of this research had there been a meeting gathering representatives of all these communities. This was the objective of this seminar. Having all these people in a castle in the middle of nowhere was a blast! I had very passionate discussions during meals but also during the night at the library with the most brilliant brains on stream and CEP. On the other hand, I still had some fun discussions comparing Star Trek: Discovery and Picard! Finally, the most important things I will remember after this seminar… are the endless ping pong games with Till Rohrmann and Alessandro Margara :-).

Activity

21-01-2020

Throwback to 2019

At EURA NOVA, we believe technology is a catalyst for change. To embrace it, we strive to stay at the edge of knowledge. Investing in research allows us to continuously become more proficient, to maintain our know-how at the cutting edge of IT, to share its benefits with our customers, and to incubate the products of tomorrow. As we look back on the year 2019, we are both proud of and happy with the work achieved!

 

Published papers:

We are happy to say that our R&D department published five peer-reviewed scientific papers last year.

 

  • LEAD: A Formal Specification For Event Processing

 

In June, our R&D engineer Anas presented his work on complex event processing at the 13th ACM International Conference on Distributed and Event-Based Systems, which took place in Germany.

Anas Al Bassit, Sabri Skhiri, LEAD: A Formal Specification For Event Processing, in 13th ACM International Conference on Distributed and Event-Based Systems, 2019.

 

  • Coherence Regularization for Neural Topic Models

 

In July, our R&D engineer Kate presented her paper on neural topic models at the 16th International Symposium on Neural Networks taking place in Moscow.

Katsiaryna Krasnashchok, Aymen Cherif, Coherence Regularization for Neural Topic Models, in 16th International Symposium on Neural Networks 2019 (ISNN 2019).

 

  • STRASS: A Light and Effective Method for Extractive Summarization

 

In August, our PhD student Léo was in Italy to present his paper at the 2019 ACL Student Research Workshop.

Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.

 

  • GraphOpt: Framework for Automatic Parameters Tuning of Graph Processing Frameworks

 

In December, the paper written by our former intern and now full-time colleague Muaz was presented in Los Angeles at the third IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications.

Muaz Twaty, Amine Ghrab, Skhiri Sabri: GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks. 2019 IEEE International Conference on Big Data (Big Data) Workshops, Los Angeles, CA, USA.

 

  • A Performance Prediction Model for Spark Applications

 

In June 2020, our paper written as part of the ECCO research project we have been leading at EURA NOVA will be presented at the Big Data congress 2020 taking place in Hawaii.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

 

IEEE Big Data Workshop

Last December, Eura Nova’s research centre held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0. Special thanks to our keynote guests, Matteo Merli (Apache Pulsar) and John Roesler (Confluent), and all the attendees and speakers!

 

JERICHO, research driving innovations

The mission of the JERICHO research track is to make the latest technologies available to our clients, to offer them a competitive edge so they can play alongside megacorporations. After two years of intense work, seven published papers, and presentations at international conferences spanning Russia, the United States, Germany, Australia, and Belgium, our JERICHO project has come to an end.

And the adventure continues! We are really excited to continue our work on innovative solutions for the next data challenges with our new research track ASGARD.

Our R&D director Sabri Skhiri says: “The costs of data solutions and the lack of data scientists will increase in the next 3 to 5 years and solutions to reduce them will benefit from a large market. In this sense, ASGARD is precisely in the strategy of Eura Nova. ASGARD aims to reduce these costs by automating the most expensive tasks. As the world becomes increasingly digital and reinvents itself, innovation and research are essential in the market.”

 

Academic collaboration

This year, we welcomed nine interns across our three offices. Big kudos to our intern Muaz, who successfully finished his master thesis in collaboration with EURA NOVA! The goal of his thesis was to optimise the configuration of distributed graph frameworks. He has now joined EURA NOVA as a full-time employee.

 

Talks & seminars

This year, the research team had the pleasure of being invited to several international conferences:

  • In February, our research director Sabri Skhiri gave a seminar on modern Stateful Stream Processing at EPT. Our R&D engineer Syrine Ferjaoui also went to Morocco to give a workshop about data architecture at the Annual International Conference on Arab Women In Computing.
  • In March, Sabri was at the World AI Show in Dubai to talk about successfully deploying AI projects in production. He was also invited to Barcelona Tech to give a Big Data Architecture & Design seminar.
  • In June, our data privacy officer Nazanin Gifani gave a masterclass on Fairness and Transparency in AI at the DI Summit in Brussels.
  • In September, our R&D project manager Shivom Aggarwal talked at the Arab Future Cities Summit 2019 about deploying AI at industrial scale for smart cities.
  • In October, our software engineer Christophe Philemotte was in San Francisco to talk at the Kafka Summit about crossing the streams thanks to Kafka and Flink.
  • In November, Sabri was invited as a keynote speaker at the 17th International Conference on Service-Oriented Computing to share his experience about the convergence between microservices, stateful stream processing and function as a service.

 

Summer schools & conferences

This year, Euranovians attended more than 15 prestigious international conferences and summits across the world to remain up to date and grow our network. We investigated the state of the art in streaming, data science, DevOps, computer vision and cloud engineering at conferences such as Flink Forward, Spark AI Summit, Kubecon, IEEE Big Data, DataWorks Summit, Kafka Summit, NeurIPS, RedHat, Elixir LDN and CVPR.

Euranovians brought back what they learned for the rest of the team and the big data community. Find our public summaries, identified trends and review of conferences here:

 

Activity Conferences

09-01-2020

Fourth Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last December, Eura Nova’s research center held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0.

The workshop was a real success with many interesting questions and comments. If you could not attend, our R&D engineer Syrine Ferjaoui brought back important elements from the presentations for you.

 

First keynote speaker:

The workshop started with the keynote of Matteo Merli, PMC member at Apache Pulsar. His talk “Messaging and Streaming” explained how Pulsar can be a unified infrastructure that supports messaging and streaming.

Matteo introduced messaging as events that are being created and streaming as analysing events that just happened. These are two different processing concepts but they need a single infrastructure. He then explained the architecture view of Pulsar, which has separate layers between the brokers and the bookies (BookKeeper instances that handle persistent storage of messages). This means that brokers and bookies can be added independently, traffic can be shifted very quickly across brokers, and new bookies will ramp up on traffic quickly. This segmented distribution makes the architecture of Pulsar more flexible and dynamic.

Pulsar has other interesting features such as durability, low latency, high throughput, high availability, a unified messaging model, high scalability and native computing. The roadmap includes working on the Pulsar storage API to allow direct access to data stored in Pulsar and to retrieve and process data more efficiently. They are also working on higher-level messaging features.

 

Second keynote speaker:

The second keynote was given by John Roesler, a Kafka committer at Confluent. He talked about Kafka Streams and the evolution of streaming paradigms.

To design software, we, developers, used to separate the application logic from the database. To scale the database capacity, we then started to use a search index to do ETL jobs and query the database in a fast and optimal way. However, this created bugs in the software, added data consistency issues, and created more complexity in the system. Later, we started to use HDFS for a more flexible design. While enabling replication and distributed storage, this solution added more latency and supported batch processing only. It did not meet the needs of real-time processing use cases.

At this point, streaming helped a lot. The next step was to add a streaming platform that reads from sources, does some computation, and sinks the result somewhere else. The Kafka Streams design is a set of multiple stateful lambda functions, which makes it a good fit for a microservices architecture. With Kafka Streams’ new updates, the app logic is linked to a relational database with ACID guarantees.
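
To make the “read from a source, compute, sink the result” pattern concrete, here is a minimal sketch in plain Python. It is not the Kafka Streams API (which is a Java library); the function names `running_count` and `run` are hypothetical and only illustrate a stateful function placed between a source and a sink.

```python
from collections import defaultdict


def running_count(events):
    """Stateful step: emit the updated count for each incoming key."""
    state = defaultdict(int)  # local state, keyed like the input
    for key in events:
        state[key] += 1
        yield key, state[key]


def run(source, sink):
    """Wire source -> stateful function -> sink (hypothetical helper)."""
    for record in running_count(source):
        sink.append(record)
    return sink
```

Calling `run(["a", "b", "a"], [])` pushes one updated count into the sink for every input event. In a real deployment the source and sink would be topics and the state would be kept in a fault-tolerant store; both are simplified to in-memory objects here.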

Finally, John Roesler considers that “software is a fractal”, a never-ending pattern: a software architecture is complex and even when we zoom into a single component, it is still complex. But for the Kafka Streams’ design, when we zoom out, it looks like a set of services interacting and connected to each other and this simplifies the aforementioned designs.

John concluded by mentioning open problems that can be dealt with in stream processing, including semantics, observability, operability, and maintainability.

 

Workshop Invited Speakers:

After the keynotes, 8 selected papers were presented, covering mainly these 6 topics: (1) Stream Processing for IoT, (2) Serverless and HPC (High Performance Computing), (3) Collaborative Streaming, (4) Stream Mining, (5) Image Mining and (6) Real-time Machine Learning. Some papers are not yet available, as they will be published in the proceedings of the IEEE Big Data Conference. In the meantime, do not hesitate to contact our R&D department at research@euranova.eu to discuss how you can leverage stream processing in your projects.

Sören and Wilhelm are engineers in the Software Engineering Group from Kiel University. They propose a stream processing architecture which allows for aggregating sensors in hierarchical groups, supports multiple hierarchies in parallel, provides reconfiguration at runtime, and preserves the scalability and reliability qualities of streaming.
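
As a rough illustration of aggregating sensors in hierarchical groups with several hierarchies evaluated in parallel, here is a minimal batch sketch in Python. It is not the authors' architecture: the sensor ids, the group names and the two hierarchies (`BY_LOCATION`, `BY_TYPE`) are invented for the example, and averaging is an assumed aggregation function.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hierarchies: each maps a sensor id to its path of groups,
# from the broadest group to the narrowest one.
BY_LOCATION = {
    "s1": ("plant", "hall-a"),
    "s2": ("plant", "hall-a"),
    "s3": ("plant", "hall-b"),
}
BY_TYPE = {
    "s1": ("machines", "presses"),
    "s2": ("machines", "lathes"),
    "s3": ("machines", "presses"),
}


def aggregate(readings, hierarchy):
    """Average the readings at every level of one hierarchy."""
    buckets = defaultdict(list)
    for sensor_id, value in readings:
        path = hierarchy.get(sensor_id)
        if path is None:
            continue  # sensor is not part of this hierarchy
        # A reading contributes to each ancestor group on its path.
        for depth in range(1, len(path) + 1):
            buckets[path[:depth]].append(value)
    return {group: mean(values) for group, values in buckets.items()}


def aggregate_all(readings, hierarchies):
    """Evaluate several hierarchies over the same batch of readings."""
    readings = list(readings)
    return {name: aggregate(readings, h) for name, h in hierarchies.items()}
```

Because the hierarchies are plain data passed in as arguments, a different mapping can be supplied for the next batch, which loosely mirrors the idea of reconfiguration at runtime. A streaming implementation would update the group aggregates incrementally instead of recomputing them per batch.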

Andre Luckow, head of Blockchain and Emerging Technologies at BMW Group, and Shantenu Jha, associate professor at Rutgers University, presented StreamInsight, which provides insight into the performance of streaming applications and infrastructure, their selection, configuration, and scaling behaviour.

The paper is written by Tobias Grubenmann, researcher at The University of Hong Kong, in collaboration with Daniele Dell’Aglio and Abraham Bernstein, researchers at the University of Zurich. They present the Collaborative Stream Processing (CSP), a model where the costs, which are set exogenously by providers, are shared between multiple consumers, the collaborators. For this, they identify the important requirements for CSP to establish trust between the collaborators and propose a CSP algorithm adhering to these requirements.

  • Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining (Tengyue Li, Simon Fong, and Raymond Wong)

Tengyue Li and Simon Fong (researcher and associate professor at the University of Macau, China) and Raymond Wong (associate professor at UNSW Sydney) worked on the Kennard-Stone Balance algorithm used as a new data conversion method. Training a prediction model effectively using big data streams poses certain challenges in machine learning. In this paper, the authors apply the Kennard-Stone algorithm on time-series to extract a meaningful representation of big data streams, which improves the performance of a machine learning model.
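
For readers unfamiliar with it, the classical Kennard-Stone procedure selects a subset of samples that covers the data space evenly: it starts with the two most distant points, then repeatedly adds the point that is farthest from everything already selected. The sketch below implements that classical procedure only; it is not the "Kennard-Stone Balance" variant from the paper, and treating each time-series window as a point (a tuple of numbers) is an assumption made for illustration.

```python
import math


def kennard_stone(points, k):
    """Return the indices of k points chosen by classical Kennard-Stone."""
    n = len(points)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of points")

    # Start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]

    # min_dist[i] = distance from point i to its nearest selected point.
    min_dist = [
        min(math.dist(p, points[first]), math.dist(p, points[second]))
        for p in points
    ]

    while len(selected) < k:
        # Add the candidate that is farthest from everything selected so far.
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(points[i], points[nxt]))
    return selected
```

The initial pair search is quadratic in the number of points, so on a real stream this would be applied per window or per chunk rather than to the whole history.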

 

  • Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures (Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch)

Shawndra Hill (Microsoft), Anthony Colas (University of Florida), H. Andrew Schwartz (State University of New York at Stony Brook) and Gordon Burtch (University of Minnesota) explained their work on the interactions between TV content and online behaviours such as response to digital advertising. They developed AdMiner, a tool that can track online activity around a brand and provide actionable insights into ad campaigns.

 

Austin Harris, Jose Stovall, and Mina Sartipi (researchers and CUIP director at the University of Tennessee at Chattanooga) have helped to create Chattanooga’s smart corridor, used to test new technologies and generate data-driven outcomes. In their talk, they presented the corridor, used as a test bed for research in smart city developments in a real-world environment. The wireless communication infrastructure and network of sensors in combination with data analytics provide a means of monitoring and controlling city resources and infrastructure in real time.

 

Sebastian Trinks and Carsten Felde (TU Bergakademie Freiberg) presented how image mining can help avoid errors and low quality in printed prototypes in real time. This can save resources and increase efficiency when developing new products.

 

This year, IEEE Big Data held the Real-time Machine Learning Competition on Data Streams. As the competition is focused on streaming, its online platform required a specific infrastructure that meets data stream mining requirements. Dihia Boulegane is a Ph.D. student at Télécom ParisTech working in collaboration with Orange Labs on machine learning for IoT network monitoring. She was in charge of implementing the streaming engine of the dedicated platform of the competition. Dihia explained its components, the technologies used, and the challenges met while building the platform. In the end, the platform was able to provide multiple streams to multiple users, to receive multiple streams, to process them and to provide the leaderboard and live results.

 

Special thanks to our keynote guests, Matteo Merli and John Roesler, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

 

Activity

06-01-2020

A Performance Prediction Model for Spark Applications

Apache Spark is a popular open-source distributed-processing framework that enables efficient processing of massive amounts of data. It has a large number of parameters that need to be tuned to get the best performance. However, tuning these parameters manually is a complex and time-consuming task. Therefore, a robust performance model to predict applications execution time could greatly help in accelerating the deployment and optimization of big data applications relying on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications that cover the most common workloads to generate a representative dataset of execution time. In addition, we extracted application and data features to build a machine learning-based performance model to predict Spark applications execution time. The experiments show that boosting algorithms achieved better results compared to other algorithms.
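
To give an idea of what a boosting-based performance model looks like, here is a minimal gradient-boosting regressor built from decision stumps in plain Python. It is a generic sketch, not the model, features or hyperparameters used in the paper: the function names and the defaults (`rounds`, `learning_rate`) are assumptions, and each feature vector is expected to be a tuple of numbers such as hypothetical input size and executor count.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split that minimises squared error."""
    best = None
    for f in range(len(xs[0])):
        for t in sorted({x[f] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[f] <= t]
            right = [r for x, r in zip(xs, residuals) if x[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:] if best is not None else None


def fit_boosting(xs, ys, rounds=50, learning_rate=0.3):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean execution time
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        if stump is None:
            break  # no valid split exists (e.g. all feature vectors identical)
        f, t, lm, rm = stump
        stumps.append(stump)
        preds = [
            p + learning_rate * (lm if x[f] <= t else rm)
            for x, p in zip(xs, preds)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Predict the target for one feature vector."""
    base, stumps, lr = model
    return base + sum(lr * (lm if x[f] <= t else rm) for f, t, lm, rm in stumps)
```

Given a list of feature tuples `xs` and measured execution times `ys`, `model = fit_boosting(xs, ys)` trains the ensemble and `predict(model, x)` estimates the time for a new configuration. In practice one would rely on an established library and evaluate on held-out runs rather than on the training data.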

The paper will be published at the Big Data congress 2020 taking place in Hawaii. In the meantime, do not hesitate to contact our R&D department at research@euranova.eu to discuss how you can optimise distributed processing frameworks in your projects.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

Activity Conferences

03-01-2020

IEEE Big Data 2019: a summary

At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Los Angeles to attend the IEEE Big Data Conference. It is one of the most influential academic gatherings in distributed machine learning. This year, it featured 879 authors, shortlisted from 2009 applicants. They came from 28 countries and presented 210 papers. Back in Belgium, Sabri and Syrine give you their opinion on the event itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.

 

The Big Trends

Sabri says: “The main trends were deep learning, NLP, privacy-preserving approaches, GAN, graph mining and stream mining. In my view, the level of the papers was quite good. Authors are becoming ever more skilled in data science, maths and algorithms. This goes to show that to be a good data scientist, you need an extensive set of advanced skills. Interestingly, there was almost nothing about distributed computing! This is a big move compared to the previous editions. The only presentations that had something to do with distributed systems were about optimisation strategies, an area similar to what our ECCO team researches. The Big Data Conference focuses on data science; it does not really look into its scalability.  Distributed computing topics tend to be dealt with at conferences like DEBS, VLDB, USENIX, SIGMOD, etc. As a result, this conference is an amazing place to see hundreds of data science use cases with, most of the time, an interesting contribution.”

 

The Keynotes

 

The keynotes were focused on data science as well. We even heard the term “Big Data Science”.

Keynote 1: Responsible Data Science by Lise Getoor – Professor at UC Santa Cruz

Syrine says: “The first keynote was my favourite. Lise started by comparing machine learning to a black box. The goal was to unpack the box and invite people to use data science and to use it wisely. To autonomise ethical decision-making, we should move away from maximising AI systems autonomy and move toward human-centric systems. To do this, we should make sure that human-centric systems have three qualities: (1) be knowledge-based, (2) be data-driven, and (3) support human values. Achieving responsible data science requires both machine-learning and ethics.”

 

Keynote 2: DataCommons “Google for Data” by Ramanathan Guha – Google

Guha presented DataCommons, a project started by Google to combine data from different open sources. Syrine explains: “Google’s DataCommons project allows users to pretend that the Web is one website, enabling developers to pretend all this data is in one database. The long-term vision of Google is to aggregate all data from publicly available sources (Medicare, Wikidata, sequence data, Landsat, CDC, Census…) into a single Open Knowledge Graph. The goal is to reduce or eliminate the data download-clean-store process. Instead, users can access and use already cleaned data in the cloud. Data can be public or private (internet & intranet). This will avoid repeated data wrangling and ease the burden of data storage, indexing, etc.”

 

The Tutorials

This year, IEEE Big Data held nine tutorials. Our R&D director explains: “At this type of event, tutorials are always a good way to learn a complete state of the art in a couple of hours. I particularly appreciated the tutorial on “Taming Unstructured Big Data: Automated Information Extraction for Massive Text” by the team of the famous Jiawei Han (he is a kind of pop star in data mining and the father of Graph Cube). I found out that many papers about named entity relations were published in the past two years. The idea is to be able to extract supervised, semi-supervised, and unsupervised relations between entities: for instance, discovering that “Trump” is “President of” “USA”. They also propose new approaches to integrate knowledge bases such as DBPedia or YAGO to infer new unknown relations from a corpus. This is just amazing!”

 

Syrine adds: “The tutorial on NewSQL principles, systems, and current trends was interesting as it explained why we should consider using NoSQL/NewSQL to deal with data interconnections and very high scalability. After attending this tutorial, I was motivated to order the book Principles of Distributed Database Systems. For fans of deep learning, the tutorial “Deep Learning on Big Data with Multi-Node GPU Jobs” covers a lot about large-scale GPU-based deep-learning systems. If you missed the conference, all resources can be found at this link.”

 

The Workshops

The EURA NOVA research centre organised the fourth workshop on Real-time and Stream Analytics in Big Data, at the 2019 IEEE Conference on Big Data. We were really happy to welcome Matteo Merli from Apache Pulsar and John Roesler from Confluent as keynote speakers. Thank you to them and to all the attendees and speakers! We had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event will soon be available on our website. The slides of the keynotes are available here:

 

 

Favourite Papers

A personal selection of interesting papers:

The paper tackles a really interesting problem faced by a lot of data scientists. Introducing active learning is a cool idea and so is the way they used a mathematical trick to make their approach feasible.

Su Won Bae, from Mobilewalla, presented how they can define a complete customer acquisition model by mixing their data with their customer data (in this case, a worldwide leader in food delivery). Sabri says: “The quality of data science models highly depends on the data they can train on. I am convinced we will go in the same direction as Mobilewalla in the future to have richer models. However, mixing data must be done with care as it may raise some privacy issues;  our purpose has to have legal ground.”

The speaker presented MorphMine, a method for unsupervised morpheme segmentation. It generates morpheme candidates, which are then filtered using entropy to select the best morphemes from a corpus. These morphemes can then be used to greatly improve the word embedding model and the downstream machine learning tasks.

 

 

Activity Conferences

24-10-2019

Flink Forward: The Key Takeaways

In early October 2019, 6 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.

In this article, they give you their opinion about Ververica’s main announcement, the impact of Ververica’s acquisition by Alibaba, the big trends, and a selection of their favourite talks.

 

Alibaba!

This is the first Flink Forward conference since the acquisition of Ververica (formerly known as data Artisans) by Alibaba, which has been one of the largest users of Flink and second-largest contributor for years. Our R&D director Sabri Skhiri says: “The only significant impact of this acquisition on the conference is that the venue is now at the Berlin Business Center instead of the Kulturbrauerei. There, we could see that the Apache Flink user community has grown significantly as well as their commits on Flink. This edition was a bit more business and enterprise-oriented than previous ones, although it still had its technical DNA and a lot of technical talks. All in all, this was a very good mix. Alibaba folks are deeply committed to open source and creating technology impact. We saw a lot of activities from them such as the integration of the Blink SQL runner, the Hive integration or the new scheduling model. In summary, a great event.”

 

First Keynote Announcement

Keynote: Stream Processing and Applications in the Modern Age (Stephan Ewen)

During the first keynote, Ververica took the opportunity to announce the launch of Stateful Functions (statefun.io), an open-source framework built on top of Flink to run stateful serverless functions. It bridges the gap between Function as a Service and stream processing.

Sabri says: “Last year, they announced their streaming ledger that brings ACID transactions between states to stream processing applications. This year, they announced the launch of Stateful Functions, a framework that reduces the complexity of building and orchestrating stateful applications at scale. In the streaming world, this announcement does not change a lot of things. However, in the microservice community, this opens new doors in terms of design patterns, especially in the way data feeding and stateful operations can be designed more flexibly.”

You can find the video of the presentation here.

 

The Big Trends

1. Unified batch and streaming

A significant trend of this edition is the “Unified Stream and Batch” motto. Our R&D engineer Syrine Ferjaoui says: “Flink currently features different APIs, the DataSet API for batch processing and the DataStream API for stream processing. In addition, the Table API is already a unified API on top of both (DataSet and DataStream) with declarative-style programming. Now, they are working on a solution to truly unify the batch and streaming APIs.”

Sabri adds: “In Flink 1.9, they released the State API with which a state created in batch can be used in a stream application – interesting for bootstrapping/backfilling states. But the community is going further by proposing in Flink 2.0 a unique Data API that will merge DataSet and DataStream while still taking advantage of the batch properties to optimise the execution.”

Every talk was exploring, in one way or another, how this unification can be pushed forward. For instance, in the Pulsar talk, they were thinking about using Pulsar as a back end to transparently bootstrap a state and then switch to streaming using (1) Pulsar’s capability in terms of segment storage and (2) the unified data stream API in Flink.

 

2. “Enterprise-grade” Flink

Flink is clearly moving toward an “enterprise-grade” technology. Sabri says: “The first signal is that Cloudera adopted Apache Flink into its Data Platform. Also, AWS Kinesis now integrates Flink as a client. Adoption by such big players goes to show that Flink is well on the way to gaining enterprise-grade support. The second signal is the release of the Ververica Platform that highly facilitates enterprise-grade operations. Thirdly, the integration of the Hive Metastore with the pluggable catalogue architecture is a significant step towards better governance and metadata management. Finally, there were many talks about lowering the barrier to deploying Flink in prod. The topics included APIs, configuration, memory management, K8S operators, etc.”

 

3. The ML path

Finally, regarding ML/AI, there is still a lot of work to get over the gap with the Spark ecosystem. However, the Alibaba folks are working hard on this topic and we can already see the first results. Sabri says: “The refactoring of the Flink ML interface to work on Flink Table APIs is excellent. There is an excellent vision of integrating Flink as a data prep engine for ML and serving layer; and the roadmap looks great.”

 

Interesting talks

A personal selection by Charles & Christophe of interesting talks to check out:

For Charles, our data architect:

  • Aljoscha Krettek & Timo Walther, respectively a co-founder at Ververica and a PMC member of Apache Flink, work on the Flink APIs. They gave a summary of recent contributions to Flink’s Table & SQL APIs. It was a very good overview of what is going on in terms of refactoring and where we are going.
  • Roman Grebennikov is a software developer from Findify AB. His talk focused on Flink’s serialisation framework and the common problems around it. He illustrated and explained several ways to optimise Flink jobs by taking care of the serialisation, which in most cases represents about 60% of the processing.

For Christophe, our software engineer:

  • Konstantin Knauf is the head of product for the Ververica Platform based on Apache Flink. He discussed Apache Flink worst practices by sharing anecdotes and hard-learned lessons of adopting distributed stream processing. It was a humorous list of general good practices when working with Flink, from planning and requirements to deployment and maintenance.
  • Aaron Levin and Mike Mintz are software engineers on Stripe’s streaming team. They talked about the many challenges they encountered writing the specialised dual source. This talk was a very well-told story about a simple use case with a high constraint: all-time deduplication of transactions at Stripe (a payment platform). It was funny, insightful, full of lessons learned and echoed some of digazu’s features: the history replayer.

 

Activity Conferences

18-10-2019

Kafka Summit: The Key Takeaways

At the beginning of the month, our software engineer Christophe Philemotte was in San Francisco to make a presentation at the Kafka Summit organised by Confluent. The Kafka Summit is one of the main events for data architects, engineers, DevOps, and developers who want to learn about streaming data. In this article, Christophe shares with you the latest trends from the conference.

 

Main observations

This year, one of the most important takeaways at the conference was that Confluent is working towards building an active database with KSQL.

Christophe details: “KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka. With KSQL, Confluent is embracing the SQL streaming and the integration of its stack into it. They also aim to have the interactivity we already have with a classic database. In short, they are moving towards this new paradigm of active data and passive query where KSQL would make it easy to read, write, and process streaming data in real-time, at scale, using SQL-like semantics. Still, KSQL shouldn’t be chosen over Flink, for instance, without proper consideration of its limitations. For example, real checkpointing and savepoint are missing, as well as global shuffling. There are still constraints on partitioning in some operators and there is no global windowing.”

While talking about SQL streaming, they also mentioned user-defined functions and machine learning integration. Find more information on the summit website.

Another interesting point was the shared approaches and themes that were addressed by different companies. For example, 30% of the talks were about operations. About 5 talks were dedicated to how to deploy on Kubernetes, and several other speakers mentioned that deploying on Kubernetes was their target. Real-time analytics, integrations/ETL/DataOps, and of course data pipelines were also often mentioned.

 

Keynote talks:

During the first keynote talk, Jun Rao, the co-founder of Confluent, looked back at Apache Kafka’s last years and what brought them to where they are today. Christophe says: “One interesting point was the concept of democratising data. They envision Kafka as a one-stop self-service shop for devs, data scientists, etc. Still, the users have to overcome a lot of challenges such as operations, integrations, security, or cold storage. Challenges that we are solving with digazu.”

You can find the video of the presentation here.

Jay Kreps, the CEO of Confluent as well as one of the co-creators of Apache Kafka, started his talk by discussing the sentence “Software is eating the world”, by Marc Andreessen. Christophe adds: “The idea is that software must be integrated into an ecosystem of other software. The users are no longer just humans. In some cases, the software will be used almost exclusively by other software.”

Jay Kreps also talked about the new steps for Apache Kafka. He announced that the next release of Kafka KSQL in November will enable users to directly register inputs and outputs thanks to Kafka Connect source and sink connectors. They are also working on better interactivity that will allow users to see the results more quickly in the KSQL CLI.

You can find the video of the presentation here.

 

Our Favourite Use Cases:

Kafka on Kubernetes: Keeping It Simple (Nikki Thean)

Nikki Thean is a staff engineer at Etsy, where she helps deploy Kafka. She talked about Etsy’s cloud migration and how running Kafka on Kubernetes was the best option for them and was not half as complicated as they thought it had to be. Christophe explains: “At the DataWorks Summit in Barcelona, the message was that K8S resource management was not yet ready to replace YARN. We now see that K8S is the new YARN for many people who are using it to deploy their cluster, for example Etsy or Confluent Cloud.”

In her talk, Nikki Thean explained how a Kafka-on-K8S setup works. Christophe explains: “The main lessons from her talk are:

  • We can start simply without an operator.
  • We must pay attention to the Kubernetes liveness and readiness probes. They can be used to make a service more robust and more resilient since K8S can restart them if necessary. However, if these probes are not configured carefully, they will kill the brokers unnecessarily.
  • Considering the price of deploying in multiple zones on Google Cloud Platform, a good solution is to deploy at least Zookeeper (the most critical element of the cluster) across multiple zones. Given the low flow of data, it will not be too expensive, and Zookeeper will make it possible to identify which Kafka node has the data.”
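
To make the point about probes concrete, here is a minimal sketch of the arithmetic behind an over-eager liveness probe. The field names mirror the Kubernetes probe settings `initialDelaySeconds`, `periodSeconds` and `failureThreshold`; the timing formula is a rough approximation that ignores probe timeouts and scheduling jitter, and the numbers (a broker needing 5 minutes to start) are purely hypothetical, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    # Field names mirror the Kubernetes probe settings of the same name.
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def earliest_restart_seconds(self) -> int:
        """Approximate time at which a container that never passes is restarted.

        The first check runs after the initial delay, then one check per period;
        the restart happens after `failure_threshold` failures in a row.
        Probe timeouts and scheduling jitter are ignored (assumption).
        """
        return self.initial_delay_seconds + (self.failure_threshold - 1) * self.period_seconds


def killed_during_startup(probe: Probe, broker_startup_seconds: int) -> bool:
    """True if the liveness probe would give up before the broker is ready."""
    return probe.earliest_restart_seconds() < broker_startup_seconds


# Hypothetical numbers: a broker that needs 5 minutes to recover its logs.
hasty = Probe(initial_delay_seconds=30, period_seconds=10, failure_threshold=3)
patient = Probe(initial_delay_seconds=300, period_seconds=30, failure_threshold=5)

print(killed_during_startup(hasty, broker_startup_seconds=300))    # True
print(killed_during_startup(patient, broker_startup_seconds=300))  # False
```

With the hasty settings, the broker would be restarted after roughly 50 seconds, long before it is ready, and would probably never finish starting.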

You can find the video and the slides of the presentation here.

 

Mission-Critical, Real-Time Fault-Detection for NASA’s Deep Space Network using Apache Kafka (Rishi Verma)

Rishi Verma is a manager at the NASA Jet Propulsion Laboratory. He talked about the new software system being deployed by NASA to upgrade its Deep Space Network (DSN) that operates spacecraft communication links for NASA deep-space spacecraft missions. Christophe says: “It was a super interesting use case! The DSN Complex Event Processing (DCEP) software assembly is a new software system that brings into the DSN next-generation “Big Data” infrastructural tools to do IoT with their legacy assets. The objective is to correlate real-time network data with other critical data assets (in their example, an old radio antenna). They recover all the data on Kafka, then they process it and then they predict signal loss on the basis of weather conditions.”

You can find the video and the slides of the presentation here.

 

 

Cross the Streams Thanks to Kafka and Flink (Christophe Philemotte)

Christophe is the CTO of digazu, the batch and real-time data sharing platform developed by EURA NOVA. In his talk, he explained how you could build a similar data platform and how you could plug Flink into the Kafka ecosystem, as well as what the common pitfalls are and what Flink requires to be deployed on Kubernetes.

Christophe says: “The feedback was positive and I received a lot of questions during the Q&A session and after the talk, notably about Flink vs KSQL vs Spark. Another question that I received a lot is when to use the Table, SQL or DataStream API. My answer was that the Table and SQL APIs are two different flavours of the same API. With the Table API you have a LINQ-like experience, while with the SQL API you have a SQL experience. They are both perfect for data processing that can be expressed simply in SQL, which covers a lot of cases. The DataStream API is a lower-level API compared with the Table and SQL APIs. It gives more control over what you can do, which means it also requires a thorough understanding of Flink’s core mechanisms. Going for the DataStream API is usually a good choice either when your stream processing cannot be expressed in SQL and requires a specific implementation, or when you need to optimise the processing.”

The sandbox provided was also very popular.

You can find the video and the slides of the presentation here.

 

Our Favourite User Practice:

 

Please Upgrade Apache Kafka. Now. (Gwen Shapira)

Gwen Shapira is a software engineer at Confluent working on core Apache Kafka. She reviewed all the recent releases and made suggestions on how to de-risk upgrades.

Christophe says: “Gwen Shapira talked about why it is essential to upgrade even though it is risky and time-consuming. She explained that each new release fixes from 30 to 140 bugs and listed the improvements you will get from upgrading”. Among them:

  • The Apache Kafka team is working on improvements to make replication more reliable. For example, watermarking has been improved greatly.
  • They are working on controller design towards the removal of Zookeeper.
  • Finally, some releases are critical for specific reasons (e.g. proper resolution of IP when you work with K8S, JBOD, or EOS).

 

In the second part of her talk, Gwen Shapira made suggestions to upgrade as safely as possible. Christophe explains: “She recommended taking good care of backup configuration and documentation. Regarding documentation, she recommended reading the list of notable changes, acting upon text in bold font, and once you have finished reading, going over it all again!”

Christophe’s last word? “Be sure to check out slide 35: it lists the ways not to upgrade!”

You can find the video and the slides of the presentation here.

Activity Conferences

27-09-2019

4th Workshop on Real-time & Stream Analytics in Big Data

EURA NOVA Research centre is proud and excited to organize the fourth workshop on Real-time and Stream analytics in Big Data, collocated with the 2019 IEEE conference on Big Data. The workshop will take place in December in Los Angeles, USA.

Stream processing and real-time analytics in data science have become some of the most important topics of Big Data. To refine new opportunities and use cases required by the industry, we are bringing together experts who are passionate about the subject.

This year, we are excited to have two amazing keynotes from Confluent KStream and Apache Pulsar: 

  • Matteo Merli is one of the co-founders of Streamlio. He serves as the PMC chair for Apache Pulsar and is a member of the Apache BookKeeper PMC. Previously, he spent several years at Yahoo building database replication systems and multi-tenant messaging platforms. Matteo was the co-creator and lead developer for the Pulsar project within Yahoo.
  • John Roesler is a software engineer at Confluent and a contributor to Apache Kafka, primarily to Kafka Streams. Before that, he spent eight years at Bazaarvoice, on a team designing and building a large-scale streaming database and a high-throughput declarative Stream Processing engine.

 

If you want to join us, authors from industry and academia are invited to contribute to the conference by submitting articles. Check out the workshop website to find all the information you will need. Your paper will be reviewed by a prestigious panel of international experts from both the academic and the industrial worlds.

 

Activity Conferences

28-08-2019

ACL 2019: Takeaways from the conference

Last month our R&D Project Director Cécile Pereira and our PhD student Léo Bouscarrat travelled to Florence to attend and present at ACL 2019. ACL is one of the biggest conferences in Natural Language Processing. This year all the records were broken with more than 3500 attendees, 660 papers accepted to the main conference, 9 tutorials and more than 20 workshops. All the talks of the main conference were recorded and are accessible online. In this article, Cécile and Léo share with you the latest trends from the conference!

 

 

Big trends

 

A new paradigm in NLP?

This year, ACL’s selection of topics showed how important self-training methods such as BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019) have become. These methods consist of feeding huge models with a vast amount of data and training them on easy tasks (for example, predicting masked words in the original sentence or predicting whether two sentences follow each other).

These models should be able to learn faster and with less data on a more specific and complex task. With this method, the way to train a model to solve an NLP task has changed. Here is this new paradigm:

  1. Select a pre-trained model (trained with self-training)
  2. Add a layer on the output of this model (it will depend on your task) and fine-tune the model by giving the inputs and outputs of your task
  3. Evaluate your model

Many papers were using this paradigm to achieve state-of-the-art results on several tasks (out of the 660 papers of the main conference, 47 have the word BERT in their abstract).

Contextual embeddings, like BERT, incorporate the context of the sentence into the embeddings of the words. BERT can be used for a large variety of tasks including but not limited to classification (Reimers et al., Chalkidis et al.), named entity recognition (Arkhipov et al., Emelyanov and Artemova) and question answering (Li et al., Liu et al.).

So it is working. But the remaining question is why?

Several presentations discussed the explainability of BERT (for example Jawahar et al. and Clark et al.). Those papers argue that, as the different layers learn different things, the different heads seem to specialize in certain types of words or in certain syntactic or semantic tasks.

The conference highlighted the need for adversarial training and testing, as those models are very good at learning biases in the dataset (Niven and Yao, Jiang and Bansal). For those not familiar with the concept, adversarial training and testing consist of training and/or testing on an adversarial dataset. This dataset is composed of examples, often generated ones, on which the model fails to predict the correct answers. Adversarial datasets are generally used to verify whether the models learn bias in the dataset (like the negation in Niven and Yao). They can also improve the quality of the models.

 

Improving the experiments in NLP

Several presentations showed that adversarial training can improve the results and robustness of the model (Zhu et al. and Jiang and Bansal; Mohit Bansal’s slides are available here).

The meeting was also a moment to discuss the impact of using standard splits on benchmark data. A standard split means that, if you want to work on a specific task, you will generally look for the training, validation and test splits used in other publications and use the same ones.

However, Gorman and Bedrick argue that random splits should be preferred. They support this by trying to reproduce the results of nine part-of-speech taggers on a specific dataset. They reproduced the same rankings on the standard splits. However, when they used random splits, the rankings of the taggers, for the same metric, varied.

This showed that getting a better ranking on a specific split doesn’t mean that you are better in general. Since in some fields of research the improvements from one paper to the next are small, the use of standard splits does not guarantee that a model is really better than another one on the task. Random splits could improve this by adding a notion of variance to the reported performance.
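
As a rough illustration of the idea (not the protocol of Gorman and Bedrick), the sketch below scores two toy taggers over several random splits and reports the mean and standard deviation of their accuracy instead of a single number. The tiny corpus, the two taggers and all function names are hypothetical.

```python
import random
import statistics
from collections import Counter, defaultdict


def random_split(items, test_fraction, seed):
    """Shuffle a copy of `items` and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def train_majority(train):
    """Toy tagger: always predicts the most frequent tag seen in training."""
    top = Counter(tag for _, tag in train).most_common(1)[0][0]
    return lambda word: top


def train_lookup(train):
    """Toy tagger: most frequent tag per word, else the overall majority tag."""
    per_word = defaultdict(Counter)
    for word, tag in train:
        per_word[word][tag] += 1
    fallback = train_majority(train)

    def tag(word):
        if word in per_word:
            return per_word[word].most_common(1)[0][0]
        return fallback(word)

    return tag


def accuracy(model, test):
    return sum(model(word) == tag for word, tag in test) / len(test)


def score_over_splits(train_fn, data, n_splits=20, test_fraction=0.2):
    """Train and evaluate once per random split; return (mean, std dev)."""
    scores = []
    for seed in range(n_splits):
        train, test = random_split(data, test_fraction, seed)
        scores.append(accuracy(train_fn(train), test))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical corpus with a few ambiguous words.
corpus = [("the", "DET"), ("a", "DET"), ("dog", "NOUN"), ("cat", "NOUN"),
          ("run", "VERB"), ("run", "NOUN"), ("walk", "VERB"), ("walk", "NOUN"),
          ("fast", "ADV"), ("fast", "ADJ")] * 6

for name, train_fn in [("majority", train_majority), ("lookup", train_lookup)]:
    mean, std = score_over_splits(train_fn, corpus)
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

If the gap between two models is smaller than the spread across splits, a better score on one particular split says little about which model is better.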

 

Domain adaptation

The last trend in NLP consists of using models or embeddings learned on huge datasets of general data from sources such as Wikipedia, books or newspapers.

When you want to work on specialised domains such as the biomedical, legal or financial domains, you need specialised embeddings. However, you generally don’t have enough specialised data to re-train the embeddings or the models.

A solution is to use and modify pre-trained models for your specific task. This is called domain adaptation. There are several ways to do it. For example, Boukkouri et al. combined a general embedding and a smaller one learned on their domain. Hu et al. fine-tuned a general model on their data. These methods make it possible to use recent models (which need a lot of data) in specific domains where that much data is not available.

 

 

Trendy topics

 

Machine translation

Machine translation is still a huge topic with no fewer than 46 papers in the main conference (according to the ACL 2019 chair blog post), an entire two-day workshop dedicated to it and Liang Huang’s invited talk. Liang Huang is a principal scientist at Baidu Silicon Valley AI Lab; he talked about the current state of simultaneous translation and Baidu Research’s new approach. They were able to do an English/Chinese translation with only 3 seconds of delay. The demo is available here: https://simultrans-demo.github.io/. One can also notice that the ACL best long paper award was on this topic (Zhang et al.)!

 

Conversational systems

Conversational systems (also called chatbots) were also a trendy topic, with 52 papers, a workshop, and the invited talk from Pascale Fung.

Pascale Fung is a Professor at the Hong Kong University of Science & Technology. She presented the state of the art of conversational systems. For her, recent advances are going in three directions: learning to memorise, learning to personalise and learning to empathise.  She presented her current work on conversational systems that can empathise, showing that improvements have been made but there is still work to do. She ended with questions about the ethics of this sector: how can we build systems that are secure, safe and fair for all?

 

Knowledge graph

Knowledge graphs are also pretty trendy: they seem to be a good way to add knowledge to models. They can be used for question answering or conversational systems. Michael Galkin’s blog post reviews the most interesting articles in this area.

 

Bias in NLP

After recent papers showed that models in NLP are biased (Bolukbasi et al., 2016; Caliskan et al., 2017), there is more and more work on what we can do about that, as reflected by a session and a workshop during the meeting (https://genderbiasnlp.talp.cat/).

Several works about removing gender bias from models have been previously published. But the work of Gonen and Goldberg explains that, for now, it’s only “Lipstick on a pig”.

We observed two main areas on the topic. Firstly, removing/controlling gender bias in the models (like in automatic translation, Habash et al., Escudé Font et al., Ik Cho et al.). Secondly, measuring bias in the models and society (with articles proposed by sociologists, like Karve et al., Hitti et al., Basta et al., Kurita et al.).

 

Summarization

There were several papers about summarization (including our own paper https://arxiv.org/abs/1907.07323) which have been summarized by RecitalAI on their GitHub.

 

 

Conclusions

ACL was a great place to measure the trends in the NLP field. As models are becoming better, data scientists are applying them to a large variety of topics including automatic translation, search engines, and chatbots.

As the NLP community and its topics are becoming bigger and bigger, we hope that this summary of our biased takeaways from the meeting will help you navigate this year’s nearly 700 ACL papers.

 

Activity

19-07-2019

STRASS: A Light and Effective Method for Extractive Summarization

This paper introduces STRASS: Summarization by TRAnsformation Selection and Scoring. It is an extractive text summarization method which leverages the semantic information in existing sentence embedding spaces. Our method creates an extractive summary by selecting the sentences with the closest embeddings to the document embedding. The model learns a transformation of the document embedding to maximize the similarity between the extractive summary and the ground truth summary. As the transformation is only composed of a dense layer, the training can be done on CPU and is therefore inexpensive. Moreover, inference time is short and linear in the number of sentences. As a second contribution, we introduce the French CASS dataset, composed of judgments from the French Court of cassation and their corresponding summaries. On this dataset, our results show that our method performs similarly to the state-of-the-art extractive methods with efficient training and inference time.
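
The selection step can be sketched as follows. This is a simplified illustration, not the implementation from the paper: the dense layer is passed in as explicit `weights` and `bias` (here the identity, whereas in the method they are learned), the embeddings are made-up 2-dimensional vectors, and picking the top `k` sentences is an assumption made to keep the example short.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transform(doc_embedding, weights, bias):
    """A single dense layer applied to the document embedding."""
    return [sum(w * x for w, x in zip(row, doc_embedding)) + b
            for row, b in zip(weights, bias)]


def select_sentences(sentence_embeddings, doc_embedding, weights, bias, k):
    """Indices of the k sentences closest to the transformed document embedding,
    returned in document order."""
    target = transform(doc_embedding, weights, bias)
    ranked = sorted(range(len(sentence_embeddings)),
                    key=lambda i: cosine(sentence_embeddings[i], target),
                    reverse=True)
    return sorted(ranked[:k])


# Made-up embeddings; identity weights stand in for the learned transformation.
sentences = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.9, 0.1]]
document = [1.0, 0.2]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(select_sentences(sentences, document, identity, [0.0, 0.0], k=2))  # [0, 3]
```

Each sentence is compared once with the transformed document embedding, which is why the cost of the comparison step grows linearly with the number of sentences.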

 

Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.

Florence, Italy

Download file (.pdf)
