NAVI GATIONSEARCH BOX
Join us on LinkedIn Follow us on Twitter
Eura Nova RD
Eura Nova

Activity

In this section you will find EURA NOVA’s latest news and activities.

Activity

15-01-2021

Towards a continuous evaluation of calibration

For safety-critical systems involving AI components (such as in planes, cars, or healthcare), safety and associated certification tasks are one of the main challenges, which can become costly and difficult to address.

One key aspect is to ensure that the decisions a machine-learning classifier makes are properly calibrated. This Thursday, our engineer Nicolas presented at the MLSC workshop part of the research work on classifiers calibration carried out with our senior data scientist Antoine Bonnefoy.

The Machine Learning in Certified Systems workshop brought together machine learning researchers with international authorities and industry experts to present the main open questions and methods for verification and certification of critical software. The objective was also to define the future research agenda towards the medium-term goal of certifying critical systems involving AI components. The workshop included invited talks, a poster session and panel discussions.
Nicolas talked about improving the calibration of classifiers and its evaluation through the introduction of continuous estimators of related errors.

Watch him present his poster presentation on Youtube.

 

You can find the poster pdf below!

 

Download file (.pdf)

Activity

13-01-2021

Reinforcement Learning Course at ENSI

Reinforcement learning is one of the most active research areas in artificial intelligence and applies to a wide range of use cases in different sectors. What makes the technology unique in that it creates autonomous systems that learn from trial-and-error interaction to maximise the total amount of reward it receives while interacting with a complex, uncertain environment.
To provide students with the skills needed in a transforming AI landscape, the ENSI school (National School for Computer Science) invited us to give a training course on the subject. Last week, our research engineer Nourchène gave a 15-hour module aimed at final year engineering students in the Software Engineering programme and the Intelligent Systems master.

During the course, she gave an introduction to reinforcement learning and its main principles :

  • What makes it different from traditional machine learning techniques (supervised and unsupervised ML)
  • Modelling by Markovian processes
  • The three main families of RL algorithms
  • The RL taxonomy with examples of algorithms from each family
  • A practical part with the OpenAI Gym toolkit to familiarise students with the different environments available and test some of them.

 

If you wish to know more about RL, do not hesitate to reach out to research.euranova.eu!

Thank you to the ENSI team for the invitation. It was a pleasure for us to be able to exchange with students. If you are a student interested in the field of Reinforcement Learning, we propose graduation projects on the subject. You can find all the internship offers on our website.

Activity

13-01-2021

Padhoc: a computational pipeline for pathway reconstruction on the fly

Molecular pathway databases represent cellular processes in a structured and standardized way. These databases support the community-wide utilization of pathway information in biological research and the computational analysis of high-throughput biochemical data. Although pathway databases are critical in genomics research, the fast progress of biomedical sciences prevents databases from staying up-to-date. Moreover, the compartmentalization of cellular reactions into defined pathways reflects arbitrary choices that might not always be aligned with the needs of the researcher. Today, no tool exists that allow the easy creation of user-defined pathway representations.

Here we present Padhoc, a pipeline for pathway ad hoc reconstruction. Based on a set of user-provided keywords, Padhoc combines natural language processing, database knowledge extraction, orthology search and powerful graph algorithms to create navigable pathways tailored to the user’s needs. We validate Padhoc with a set of well-established Escherichia coli pathways and demonstrate usability to create not-yet-available pathways in model (human) and non-model (sweet orange) organisms.

Salvador Casaní-Galdón, Cecile Pereira, Ana Conesa, Padhoc: a computational pipeline for pathway reconstruction on the fly, Bioinformatics, Volume 36 (2):i795–i803, December 2020.

DOI : https://doi.org/10.1093/bioinformatics/btaa811

Download file (.pdf)

Activity

13-01-2021

2Be3-Net : Combining 2D and 3D convolutional neural networks for 3D PET scans predictions

Radiomics – high-dimensional features extracted from clinical images – is the main approach used to develop predictive models based on 3D Positron Emission Tomography (PET) scans of patients suffering from cancer. Radiomics extraction relies on an accurate segmentation of the tumoral region, which is a time-consuming task subject to inter-observer variability. On the other hand, data-driven approaches such as deep convolutional neural networks (CNN) struggle to achieve great performances on PET images due to the absence of available large PET datasets combined to the size of 3D networks. In this paper, we assemble several public datasets to create a PET dataset large of 2800 scans and propose a deep learning architecture named “2Be3-Net” associating a 2D feature extractor to a 3D CNN predictor. First, we take advantage of a 2D pre-trained model to extract feature maps out of 2D PET slices. Then we apply a 3D CNN on top of the concatenation of the previously extracted feature maps to compute patient-wise predictions. Experiments suggest that 2Be3-Net has an improved ability to exploit spatial information compared to 2D or 3D only CNN solutions. We also evaluate our network on the prediction of clinical outcomes of head-and-neck cancer. The proposed pipeline outperforms PET radiomics approaches on the prediction of loco-regional recurrences and overall survival. Innovative deep learning architectures combining a pre-trained network with a 3D CNN could therefore be a great alternative to traditional CNN and radiomics approaches while empowering small and medium-sized datasets. 

Ronan Thomas, Elsa Schalck, Damien Fourure, Antoine Bonnefoy and Inaki Cervera-Marzal, 2Be3-Net : Combining 2D and 3D convolutional neural networks for 3D PET scans predictions, Proc. of the 2nd International Conference on Medical Imaging and Computer-Aided Diagnosis, 2021.

Activity

04-12-2020

Talking graph analytics with students

Last Saturday, our Tunisian team Safa, Ichraf Hamza and Amine took part in the ENSI (Ecole Nationale des Sciences de l’Informatique) virtual forum to share their experience and meet the students! 

Our graph specialist Amine Ghrab talked to students about the power of graph analytics. 

Did you know that domains such as social networks, transportation, and biological networks are naturally modelled as graphs? He explained how a multitude of emerging problems can be represented using graph models and are efficiently solved using graph algorithms: 

“Over the past decade, business and social environments have become increasingly complex and interconnected. As a result, graphs have emerged as a widespread abstraction tool at the core of the information infrastructure that supports these environments. In the presentation, I discussed

  • The value of Graphs and their emergence in a multitude of domains
  • The growing graph ecosystem of industrial graph tools
  • Data analytics beyond the euclidean space: with examples of graph querying, mining, and Graph ML 
  • The integration of graphs within established BI systems, where graph warehouses extend current information systems with graphs management and analysis capabilities.”

If you wish to know more about graphs or have access to the slide, do not hesitate to reach out to research.euranova.eu!

Kudo to the ENSI team for the organisation. It was a pleasure for us to be able to exchange with students. If you are a student interested in the field of graphs, Amine proposes a graduation project on the subject. You can find all the internship offers on our website.

 

Academic Programmes Activity

19-11-2020

INTERNSHIPS 2021

This document presents internships supervised by our software engineering department or by our research & development department. Each project is an opportunity to feel both empowered and responsible for your own professional development and for your contribution to the company.

 

If you are interested in one of our offers, please send us your application to career@euranova.eu, including your CV and motivation regarding your top three internship positions (described in the document).

 

If you wish to read the testimonies of students who have done an internship at Eura Nova, visit our blog, or read directly their experiences:

If you are interested in working on a topic that is not in our range of offers, we would be delighted to hear your proposition and invite you to get in touch.

Internship subjects and application guidelines are available here: Internship Offers.

Download file (.pdf)

Academic Programmes Activity

19-11-2020

MASTER THESIS & PFE 2021

This document introduces you to master thesis and graduation projects supervised by our research & development department. Each project offers you the chance to be actively involved in the development of solutions to address tomorrow’s challenges in ICT and implementing them today!

If you are interested in one of our offers, please send us your application to career@euranova.eu, including your CV and motivation regarding your favourite master thesis subject (described in the document).

If you are interested in working on a topic that is not in our range of offers, we would be delighted to hear your proposition and invite you to get in touch.

Master thesis subjects and application guidelines are available here: Master Thesis Offers.

Download file (.pdf)

Activity

05-11-2020

ECML 2020 – The Keynotes

A few weeks ago, the biggest European conference on machine learning was held: ECML 2020. Our research engineer Nourchène, our R&D consultant Gianmarco, and our data scientist Ronan attended the event from Tunisia, Belgium and Marseille. In this article, they tell you about the different keynote talks they attended. 

Gemma Galdon-Clavell – Algorithmic Auditing: how to open the black-box of ML

Nourchène says: “I loved the talk given by Gemma Galdon-Clavell during which she addressed the problem of ethics in AI, as computer science engineers do not often question what they are producing from a moral standpoint. In her talk, Gemma points out the importance of data used to train a machine learning model. Data are provided by humans, but people are not perfect, they are likely to make wrong decisions. The model will then learn to behave the same way. So we might end up creating an unethical model. This can lead to two different behaviours: users either will follow the system’s recommendations at any cost or decide not to if they find the decisions not reasonable. Data will then continue to be biased, which creates a sort of deadlock.”

 

Ronan adds: “Algorithms do not produce biases from anywhere; they reproduce and amplify biases they can find in the data they ingest. As a result, we have to pay attention first to the quality of the data we use. Gemma emphasizes that algorithmic auditing is the key to understanding if the algorithm meets the expectations and if it complies with the regulations. The audit does not only cover the technical part and the way the algorithm was coded. It also focuses on how the problem was approached and the means deployed to solve it.”

 

Nourchène explains: “The speaker suggests that before creating a product, computer science engineers and developers need to ask the following questions: Is the product desirable and what is the problem that it tries to solve? Is it acceptable and does it involve users? Is it legal? Finally, does it use the right data? Gemma also suggests that ethics be taught in engineering schools. I totally agree with that because nowadays technology does not always seek to solve real problems, its goal is rather to make a fortune out of the proposed product.”

 

Max Welling – Amortized and Neural Augmented Inference

Gianmarco says: ‘My favourite talk was the one held by Max Welling. It clearly showed and unified the underlying theoretical grounds of many superficially different models, without failing to provide real-world applications. More concretely, the talk showed how to develop hybrid amortized methods that combine classical learning, inference and optimization algorithms with learned neural networks, which is of strong interest, especially in physics-related fields.

It provided a comprehensive and complete exposition of the topic of amortized neural inference and, as a consequence, it did not fail in bringing the spectator up-to-date with applications in that regard. Max Welling presented how a learned neural network can augment or correct a classical solution (attained by means of expert-knowledge or classical equations), or reversely, how a neural network can be fed useful information computed by a classical method.”

 

Been Kim – Interpretability for everyone

Gianmarco says:  “I was exposed to many new topics and applications I was not familiar with. Talks like Interpretability for everyone that offered more abstract research were the ones that struck my attention the most. The talk presented the latest discoveries and tools in terms of interpretability quantification. It also introduces how to extract interpretability from a black-box end-to-end model, which I find very important for the construction of more robust models and model diagnosis.”

 

Doina Precup – Building Knowledge For AI Agents With Reinforcement Learning

Ronan says: “I really liked the talk given by Doina Precup on how to build knowledge in the field of reinforcement learning. I only had little knowledge of this field. Thankfully, Doina introduced us quickly to the key concepts of reinforcement learning. She also presented us with some big successes of RL, presented different RL mechanisms and went towards the problem of using existing knowledge to build a life-long learning agent. Doina concluded her talk with a lot of open and inspiring questions: How can we exploit previously learned knowledge and apply it to new environments not related in any manner to the previous ones? How well is an agent preserving and enhancing its knowledge? These questions might not have definitive answers or just answers at all but I found very relevant and interesting the interrogations she raises on how we can represent knowledge.

 

Stephan Günnemann about Certifiable Robustness of ML Models for Graphs

Ronan says: In this technical talk, Stephan presented us different methods to assess GNN robustness. To certificate the robustness of a GNN, an evaluation of its sensitivity to perturbations needs to be conducted. For example, you can search for a worst-case scenario, and verify that the margin is positive to ensure the model is robust. Stephan’s talk was very pleasant to listen to, as he accompanied it with several examples and applications of the methods he presented us. Finally, he concluded that ML models for graphs aren’t reliable but that we can apply certificates and robustification principles to provide guarantees for a reliable use of GNNs.

 

Watch the talks: 

If you wish to catch up on talks we mentioned or those you missed, all the sessions, paper and presentation recordings are available (for a limited time) from the ECML website.

Gemma Galdon-Clavell

Max Welling : 

Been Kim

Doina Precup

 

Stephan Günnemann 

Activity

05-11-2020

ECML 2020 – A Summary

A few weeks ago, the biggest European conference on machine learning was held: ECML 2020. Our research engineer Nourchène, our R&D consultant Gianmarco, and our data scientist Ronan attended the event from Tunisia, Belgium and Marseille. What were the big trends and their favourite talks? What did they think of the online remote format? Let’s find out with them!

 

The Big Trends

The overall conference was very well up-to-date with the outside world’s latest trends and needs. Gianmarco explains: “The conference was rich in presentations which covered nearly all possible topics in machine learning. However, I had the impression that Graph Neural Networks and Generative Models had a little more presence than other models. Transfer learning was also another topic that seemed to be very relevant throughout the conference.”

 

Remote Format For The First Time

Due to the COVID-19 pandemic, the conference was fully virtual. The talks were pre-recorded and made available prior to the conference. The live sessions were dedicated to questions and answers, with a very brief presentation at the beginning of the session. 

Nourchène explains: “The downside was that we had to watch the whole presentation beforehand, otherwise it was difficult to follow the discussion and to interact with the speaker. Fun fact: there was a session where even the moderator was not aware of this Q&A aspect and asked the speaker why the presentation was so short! The good thing is that, since the presentations were pre-recorded, it was possible to watch the presentations from sessions running in parallel.”

Gianmarco adds: “I have not had many remote conferences in my life, but I was genuinely surprised to see how well-organised this one was. The remote framework was very well-designed, the web interface was fully functional, and they took advantage of all the benefits that a remote event can have like re-watchable presentations.”

Kudos to the organising committee for pulling it off!

 

The Keynotes

We wrote an article with more details about different keynotes that you can find on this link, but here is a teaser: 

Gemma Galdon-Clavell – Algorithmic Auditing: how to open the black-box of ML

In her talk, Gemma points out the importance of data used to train a machine learning model. According to her, algorithmic auditing is the key to understanding if the algorithm meets the expectations and if it complies with the regulations. This audit does not only cover the technical part and the way the algorithm was coded. It also focuses on how the problem was approached and the means deployed to solve it. Read our detailed review here

 

Max Welling – Amortized and Neural Augmented Inference

The talk showed and unified the underlying theoretical grounds of many superficially different models, without failing to provide real-world applications. It provides a comprehensive and complete exposition of the topic of amortized neural inference and, as a consequence, it did not fail in bringing the spectator up-to-date with applications in that regard. Read more here

 

Been Kim – Interpretability for everyone

The talk presented the latest discoveries and tools in terms of interpretability quantification. It also introduces how to extract interpretability from a black-box end-to-end model. Read more in our article.

 

Doina Precup – Building Knowledge For AI Agents With Reinforcement Learning

Doina Precup talks on how to build knowledge in the field of reinforcement learning. She also presents some big successes of RL, presented different RL mechanisms and went towards the problem of using existing knowledge to build a life-long learning agent. Discover more!

 

Stephan Günnemann – Certifiable Robustness of ML Models for Graphs

Stephan presented different methods to assess GNN robustness: an evaluation of its sensitivity to perturbations needs to be conducted. Learn more with Ronan here.

 

Interesting Paper?

Si-An Chen; Voot Tangkaratt; Hsuan-Tien Lin; Masashi Sugiyama – Active deep Q-learning with demonstration

Nourchène says: “The authors presented their paper proposing different groups of techniques for learning from demonstration in Reinforcement Learning, like RL Expert Demonstration (RLED) or Active RL Demonstration (ARLD). These techniques can be used to fasten the learning process of an RL agent. They also propose an uncertainty-based query strategy named Active Deep Q-Network, based on DQN, to dynamically estimate the uncertainty of recent states and use the queried demonstration data.“

 

Favourite tutorial

Learning With Imbalanced Domains and Rare Event Detection

Ronan says: “This tutorial was interesting and well-structured. Imbalance domains and rare-events prediction concern a lot of domains: financial, medical, data distribution… and will always remain a centre of attention in designing the appropriate solution to a problem. As a consequence, it will remain a core problem in the research. I particularly liked this tutorial as it covered a lot of different approaches: unsupervised (statistical-based, proximity-based, clustering-based), supervised and semi-supervised and compared them. As there is no ideal solution that can be applied to every problem, you have to know what exists before choosing the one that better fits your problem. The tutorial also covered different methods to properly evaluate the performance of an algorithm on an imbalanced task. ”

 

Conclusion

The conference provided a wide range of machine learning topics in the form of presentations about the latest trends, technologies and applications. As Nourchène says:  “it is an optimal platform to stay up-to-date, to widen one’s perspectives and/or dig deeper into a specific topic.

 

Watch the talks: 

If you wish to catch up on talks we mentioned or those you missed, all the sessions, paper and presentation recordings are available (for a limited time) from the ECML website.

 

Gemma Galdon-Clavell

 

Max Welling 

 

Been Kim

 

Doina Precup

 

Stephan Günnemann

 

Active deep Q-learning with demonstration: Read the paper 

Activity

30-10-2020

Our engineer Amine Ghrab presented his PhD public defense on the BI on Graph Project

Last Thursday, our engineer Amine Ghrab presented the BI on Graph project during his PhD public defense. Amine did an amazing job at the edge between Industry & Academia. Amine’s thesis was done in collaboration with the CODE/WIT Lab of the Université Libre de Bruxelles and the Universitat Politècnica de Catalunya, with the support of Prof. Oscar Romero & Prof. Esteban Zimanyi!

In his PhD thesis, Amine defined how BI environments can be enriched with Graph Data structures. Over the past decade, business and social environments have become increasingly complex and interconnected. As a result, graphs have emerged as a widespread abstraction tool at the core of the information infrastructure that supports these environments. In particular, the integration of graphs into data warehouse systems has appeared as a way to extend current information systems with graphs management and analysis capabilitiesGoing forward, Amine redefined the concepts of multidimensional cube on graph and showed how it can open new doors for data analysts. Finally, he showed how a graph data warehouse architecture can be defined.

Congratulation for your achievements!

You can find below a list of related publications:

Activity

24-08-2020

Privacy Policy Classification with XLNet

The popularisation of privacy policies has become an attractive subject of research in recent years, notably after the General Data Protection Regulation came into force in the European Union. While GDPR gives Data Subjects more rights and control over the use of their personal data, length and complexity of privacy policies can still prevent them from exercising those rights. An accepted way to improve the interpretability of privacy policies is through assigning understandable categories to every paragraph or segment in said documents. The current state of the art in privacy policy analysis has established a baseline in multi-label classification on the dataset containing 115 privacy policies, using BERT Transformers. In this paper, we propose a new classification model based on the XLNet. Trained on the same dataset, our model improves the baseline F1 macro and micro averages by 1-3% for both majority vote and union-based gold standards. Moreover, the results reported by our XLNet-based model have been achieved without fine-tuning on domain-specific data, which reduces the training time and complexity, compared to the BERT-based model. To make our method reproducible, we report our hyper-parameters and provide access to all used resources, including code. This work may, therefore, be considered as a first step to establishing a new baseline for privacy policy classification.

 

Majd Mustapha, Katsiaryna Krasnashchok, Anas Al Bassit and Sabri Skhiri, Privacy Policy Classification with XLNet, Proc. of the 15th DPM International Workshop on Data Privacy Management, Surrey, UK, 2020.

Download file (.pdf)

Activity

21-08-2020

Towards Privacy Policy Conceptual Modeling

After GDPR enforcement in May 2018, the problem of implementing privacy by design and staying compliant with regulations has been more prominent than ever for businesses of all sizes, which is evident from frequent cases against companies and significant fines paid due to non-compliance. Consequently, numerous research works have been emerging in this area. Yet, to this moment, no publicly available model can offer a comprehensive representation of privacy policies written in natural language, that is machine-readable, interoperable and suitable for automatic compliance checking. Meanwhile, privacy policies stay one of the main means of communication between a business (Data Controller) and a Data Subject, when it comes to the use of personal data. In this paper, we propose a conceptual model for fine-grained representation of privacy policies. We reuse and adapt existing Semantic Web resources in the spirit of interoperability. We represent our model as an ODRL profile and demonstrate how existing privacy policies can be translated into ODRL-like policies, consisting of deontic rules. We enrich our model with vocabularies for describing personal data processing in great detail, making it suitable for further usage in downstream applications, such as access control tools, to support adoption and implementation of privacy by design. We also demonstrate our model’s capability of handling personal data processing rules in other types of documents, namely data processing agreements, essential for controlling data privacy in a relationship between a Controller and a Processor.

The paper is available online on Springer. Currently, it is unfortunately freely available only to subscribers, but do not hesitate to reach out to us for more information!

Krasnashchok K., Mustapha M., Al Bassit A., Skhiri S. Towards Privacy Policy Conceptual Modeling. In Dobbie G., Frank U., Kappel G., Liddle S.W., Mayr H.C. (eds), Proc. of the 39th International Conference on Conceptual Modeling, LNCS 12400, 2020. Springer, Cham.

DOI : https://doi.org/10.1007/978-3-030-62522-1_32

Activity

30-03-2020

TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes

Graphs are a fundamental structure that provides an intuitive abstraction for modelling and analyzing complex and highly interconnected data. Given the potential complexity of such data, some approaches proposed extending decision-support systems with multidimensional analysis capabilities over graphs. In this paper, we introduce TopoGraph, an end-to-end framework for building and analyzing graph cubes. TopoGraph extends the existing graph cube models by defining new types of dimensions and measures and organizing them within a multidimensional space that guarantees multidimensional integrity constraints. This results in defining three new types of graph cubes: property graph cubes, topological graph cubes, and graph-structured cubes. Afterwards, we define the algebraic OLAP operations for such novel cubes. We implement and experimentally validate TopoGraph with different types of real-world datasets.

 

The paper will be published soon in Information Systems Frontiers, and is already available online on Springer. Currently, it is unfortunately available only to subscribers, but do not hesitate to reach out to us for more information!

 

Amine Ghrab, Oscar Romero, Sabri Skhiri, Esteban Zimányi, TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes, published in Information Systems Frontiers (2020).

 

 

Activity Conferences

27-02-2020

Thirty-Fourth AAAI Conference On Artificial Intelligence: A Summary

Two weeks ago, our young research engineers Hounaida Zemzem and Rania Saidi were in New York for the Thirty-Fourth AAAI Conference On Artificial Intelligence. The conference promotes research in artificial intelligence and fosters scientific exchange between researchers, practitioners, scientists, students, and engineers in AI and its affiliated disciplines. Rania and Hounaida attended dozens of technical paper presentations, workshops, and tutorials on their favourite research areas: reinforcement learning for Hounaida and graph theory for Rania. What were the big trends and their favourite talks? Let’s find out with them!

 

The Big Trends:

Rania says: “The conference focused mostly on advanced AI topics such as graph theory, NLP, Online Learning, Neural Nets Theory and Knowledge Representation. It also looked into real-world applications such as online advertising, email marketing, health care, recommender systems, etc.”

Hounaida adds: “I thought it was very successful given the large number of attendees as well as the quality of the accepted papers (7737 submissions were reviewed and 1,591 accepted). The talks showed the power of AI to tackle problems or improve situations in various domains.”

 

Favourite talks and tutorials

Hounaida explains: “Several of the sessions I attended were very insightful. My favourite talk was given by Mohammad Ghavamzadeh, an AI researcher at Facebook. He gave a tutorial on Exploration-Exploitation in Reinforcement Learning. The tutorial by William Yeoh, assistant professor at Washington University in St. Louis, was also amazing. He talked about Multi-Agent Distributed Constrained Optimization. Both their talks were clear and funny.”

 

Rania’s feedback? “One of my favourite talks was given by Yolanda Gil, the president of the Association for the Advancement of Artificial Intelligence (AAAI). She gave a personal perspective on AI and its watershed moments, demonstrated the utility of AI in addressing future challenges, and insisted on the fact that AI is now necessary to science. I also learned a lot about the state of the art in graph theory. The tutorial given by Yao Ma, Wei jin, Lingfu Wu and Tengfei Ma was really interesting. They explained Graph Neural Networks: Models and Application​s. Finally, the tutorial presented by Chengxi Zang and Fei Wang about Differential Deep Learning on Graphs and its Applications was excellent. Both were really inspiring and generated a lot of ideas about how to continue to expand my research in the field! ”

 

Favourite papers

A personal selection by Rania & Hounaida of interesting papers to check out :

For Hounaida:

 

For Rania:

 

Final thoughts

After attending their first conference as Euranovians, what will Rania & Hounaida remember? Hounaida concludes: “Going to New York for the AAAI-20 Conference as one of the ENX data scientists was an amazing experience. I met many brilliant and sharp international experts in various fields. I enjoyed the one-week talks with so many special events, offline discussions, and the night strolls!”

Activity Conferences

20-02-2020

Schloss Dagstuhl: where computer science meets

Which direction stream and complex event processing is going to take? Last week, the world’s best-known international researchers met in Schloss Dagstuhl, Germany,  to present and discuss their research. Among the members were present Avigdor Gal, Professor at the Israel Institute of Technology, Alessandro Margara, Assistant Professor at the Polytechnic University of Milan, or Till Rohrmann, engineering lead at Veverica.

Invited to talk about the requirements and needs from the industry, our R&D director Sabri Skhiri explains: “The seminar brought together world-class computer scientists and practitioners working on complex event recognition, distributed systems, databases, stream reasoning and artificial intelligence. Our objective was to disseminate the recent foundational results in each of these isolated fields among all participants, to identify the open problems that need to be resolved, and to establish new research collaborations among these fields”.

What were the big trends and intakes gathered by those brilliant minds? Let’s find out with Sabri!

 

 

The Big Trends

This seminar is a bit particular as it does not show any trends but rather gives a picture of all the communities working on CER in a way or another. I was fascinated by the diversity of researchers. I  did not expect to see such a rich variety of fields: knowledge representation, spatial reasoning, logic-based reasoning, data management, learning-based approaches, event-driven processing, process mining, database theory, stream mining,… According to me, the composite event recognition models that are the best at recognising complex events would include:

  1. Data flow model
  2. Ontology-based and reasoning model
  3. Symbolic reasoning model
  4. Automata-based model

We also identified common challenges across these models and communities. The three priority topics areas we identified are:

  1. Expressivity: composability & hierarchies
  2. Evaluation strategy, parallelization and distribution
  3. Uncertainty management

 

Favourite Talk

Kurt Rothermel from TU Stuttgart – Time-sensitive Complex Event Processing

My first reaction to load shedding was: “It is useless since customers do not want to lose any event, that is why so much effort is spent today on exactly once semantics…“. However, there is a trend today in stream processing, which is the trade-off between cost, latency, and correctness. Tyler Akidau described this challenge as a choice between one of three propositions: fast and correct, cheap and correct, or fast and cheap.  Tyler was talking about streaming but that rule applies in the same way in a CEP context. The load shedding strategy directly falls in the third proposition. In this perspective, the work of Kurt is highly relevant.

 

Favourite Tutorial

Jacopo Urbani & Fredrik Heintz – Stream Reasoning

Concretely, stream reasoning is incremental reasoning over rapidly changing information. The tutorial opened new perspectives on stream processing for me. It tried to answer a very interesting question: how can you provide reasoning about context from streams of data? I definitely come from the database and event-based systems communities and I did not know at all that stream reasoning was so mature. This community has been evolving from having a continuous version of SPARKQL to a complete distributed stream reasoning semantics. It is interesting to see that the work we have done in the LEAD algebra and semantics is deeply inspired by this community. However, we have never used any reasoning logic on top of LEAD. But after a few hours of the tutorial, I realise that (1) reasoning can be used for query rewriting and optimisation (2) it is worth evaluating at least BigSR,  the LARS implementation on Flink.

 

Avigdor Gal & Ruben Mayer – Distributed and Event-Based Systems

Avidgor is a kind of pop star for the stream processing and distributed systems community, or at least for me! The papers he published about a probabilistic CEP engine with late arrival and event uncertainty were visionary.

The speakers started by explaining the basics of stream processing then went deeper into the event recognition language and architecture. They detailed pub/sub applied to event recognition and explained the data flow model, which consists of a single unified data processing model where the stream and batch paradigms are the same.  This last part was based on Tyler Akidau’s paper.

A second part of the talk focused on elasticity on streams. Stream fission puts operators among different categories:

  • Firstly, key-based operators, that is a group by operation (as in SQL)
  • Secondly, window-based operators enable to split processing that needs to have multiple event types correlated with different keys within the same operator
  • Finally, pane-based operators enable a split-merge strategy where you distribute and merge the result.

Interestingly, Avigdor presented his work about late-arrival processing from a probabilistic viewpoint and not from the watermark perspective. Usually, modern stream processing frameworks use watermarks in order to take into account events that arrive later. Avigdor presented a probabilistic approach to this issue.

 

What are late-arrival events?

Imagine we want to count the number of cars entering a road segment every three minutes: we have a “tumbling window” every 3 minutes. If an event (ie a car) arrives at 2’55 second in the window but is stuck somewhere in the network for 6 sec, it is called a late-arrival event. The processing time (the time at which the CEP processes the event) is delayed compared to the event time (the time on which the event really occurs).

Note that for CEP, there is clearly a trade-off between timeliness and accuracy, because the slack time will increase the delay to deliver your result but will increase your accuracy. There is always a tradeoff between cost, latency and correctness, and usually, you can only pick two among the three.

Fun fact: If you need to explain what is event time & processing time to your mother (yeah, don’t underestimate the power of this kind of discussion at Christmas dinner), the best way is to take the Star Wars analogy. From an event time perspective (which is the time at which the story really happened) you should follow episode 1, 2, 3,4, 5, 6, 7,8, 9. But if you take the processing time (the time on which we received the episode), it is 4, 5, 6, 1, 2, 3, 7, 8, 9.  Isn’t it great ?!

 

Final Thoughts

CER has been explored from many viewpoints. However, never in the research history was there a meeting gathering representatives of these communities. This was the objective of this seminar. Having all these people in a castle in the middle of nowhere was a blast! I had very passionate discussions during meals but also during the night at the library with the most brilliant brains on stream and CEP. On the other hand, I still had some fun discussions about comparing Star Trek DIscovery and Picard! Finally, the most important things I will remember after this seminar… are the endless ping pong games with Till Rohrmann and Alessandro Margara :-).

Activity

21-01-2020

Throwback to 2019

At EURA NOVA, we believe technology is a catalyst for change. To embrace it, we strive to stay at the edge of knowledge. Investing in research allows us to continuously become more proficient, to maintain our know-how at the cutting edge of IT, to share its benefits with our customers, and to incubate the products of tomorrow. As we look back on the year 2019, we are both proud and happy of the work achieved!

 

Published papers:

We are happy to say that our R&D department has published five peer-reviewed scientific papers last year.

 

  • LEAD: A Formal Specification For Event Processing

 

In June, our R&D engineer Anas presented his work on complex event processing at the 13Th ACM international Conference on distributed and event-based systems, which was taking place in Germany.

Anas Al Bassit, Skhiri Sabri, LEAD: A Formal Specification For Event Processing, in 13Th ACM international Conference on distributed and event-based systems 2019

 

  • Coherence Regularization for Neural Topic Models

 

In July, our R&D engineer Kate presented her paper on neural topic models at the 16th International Symposium on Neural Networks taking place in Moscow.

Katsiaryna Krasnashchok, Aymen Cherif, Coherence Regularization for Neural Topic Models. in 16th International Symposium on Neural Networks 2019 (ISNN 2019)

 

  • STRASS: A Light and Effective Method for Extractive Summarization

 

In August, our PhD student Léo was in Italy to present his paper at the 2019 ACL Student Research Workshop.

Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.

 

  • GraphOpt: Framework for Automatic Parameters Tuning of Graph Processing Frameworks

 

In December, the paper written by our former intern and now full-time colleague Muaz was presented in Los Angeles at the third IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications.

Muaz Twaty, Amine Ghrab, Skhiri Sabri: GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks. 2019 IEEE International Conference on Big Data (Big Data) Workshops, Los Angeles, CA, USA.

 

  • A Performance Prediction Model for Spark Applications

 

In June 2020, our paper written as part of the ECCO research project we have been leading at EURA NOVA will be presented at the Big Data congress 2020 taking place in Hawaii.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

 

IEEE Big Data Workshop

Last December, Eura Nova’s research centre held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0. Special thanks to our keynote guests, Matteo Merli (Apache Pulsar) and John Roesler (Confluent), and all the attendees and speakers!

 

JERICHO, research driving innovations

The mission of the JERICHO research track is to make the latest technologies available to our client, to offer them a competitive edge to play along megacorporations.  After two years of intense work, seven published papers, presentations in international conferences spanning Russia, the United States, Germany, Australia, or Belgium, our Jericho project has come to an end.

And the adventure continues! We are really excited to continue our work on innovative solutions for the next data challenges with our new research track ASGARD.

Our R&D director Sabri Skhiri says: “The costs of data solutions and the lack of data scientists will increase in the next 3 to 5 years and solutions to reduce them will benefit from a large market. In this sense, ASGARD is precisely in the strategy of Eura Nova. ASGARD aims to reduce these costs by automating the most expensive tasks. As the world becomes increasingly digital and reinvents itself, innovation and research are essential in the market.”

 

Academic collaboration

This year, we welcomed nine interns across our three offices. A big kudo to our intern Muaz who successfully finished his master thesis in collaboration with EURA NOVA! The goal of his thesis was to optimise the configuration of distributed graph frameworks. He now joined EURA NOVA to work as a full-time employee.

 

Talks & seminars

This year, the research team had the pleasure to be invited at several international conferences:

  • In February, our research director Sabri Skhiri gave a seminar on modern Stateful Stream Processing at EPT. Our R&D engineer Syrine Ferjaoui also went to Morocco to give a workshop about data architecture at the Annual International Conference on Arab Women In Computing.
  • In March, Sabri was at the World AI Show in Dubaï to talk about successfully deploying AI projects in production. He was also invited to Barcelona Tech to give a Big Data Architecture & Design  seminar.
  • In June, our data privacy officer Nazanin Gifani gave a masterclass on Fairness and Transparency in AI at the DI Summit in Brussels.
  • In September, our R&D project manager Shivom Aggarwal talked at the Arab Future Cities Summit 2019 about deploying AI at industrial scale for smart cities.
  • In October, our software engineer Christophe Philemotte was in San Francisco to talk at the Kafka Summit about crossing the streams thanks to Kafka and Flink.
  • In November, Sabri was invited as a keynote speaker at the 17th International Conference on Service-Oriented Computing to share his experience about the convergence between micro-service, stateful stream processing and function as a service.

 

Summer schools & conferences

This year, Euranovians attended more than 15 prestigious international conferences and summits across the world to remain up to date and grow our network. We investigated the state of the art in streaming, data science, DevOps, computer vision or cloud engineering at conferences such as Flink Forward, Spark AI Summit, Kubecon, IEEE Big Data, DataWorks Summit, Kafka Summit, NeurIPS, RedHat, Elixir LDN or CVPR.

Euranovians brought back what they learned for the rest of the team and the big data community. Find our public summaries, identified trends and review of conferences here:

 

Activity Conferences

09-01-2020

Fourth Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last December, Eura Nova’s research center held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0.

The workshop was a real success with many interesting questions and comments. If you could not attend, our R&D engineer Syrine Ferjaoui brought back important elements from the presentations for you.

 

First keynote speaker:

First of all, the workshop started with the keynote of Matteo Merli, PMC member at Apache Pulsar. His talk “Messaging and Streaming” explained how Pulsar can be a unified infrastructure that supports messaging and streaming.

Matteo introduced messaging as events that are being created and streaming as analysing events that just happened. These are two different processing concepts but they need a single infrastructure. He then explained the architecture view of Pulsar, which has separate layers between the brokers and the bookies (BookKeeper instances that handle persistent storage of messages). This means that brokers and bookies can be added independently, traffic can be shifted very quickly across brokers, and new bookies will ramp up on traffic quickly. This segmented distribution makes the architecture of Pulsar more flexible and dynamic.

Pulsar has other interesting features such as durability, low latency, high throughput, high availability, unified messaging model, high scalability, native computing, … The roadmap includes working on Pulsar storage API to allow direct access to data stored in Pulsar and to retrieve and process data more efficiently. They are also working on higher-level messaging features.”

 

Second keynote speaker:

The second keynote was given by John Roesler, a Kafka committer at Confluent. He talked about Kafka Streams and the evolution of streaming paradigms.

To design software, we, developers, used to separate the application logic from the database. To scale the database capacity, we then started to use a search index to do ETL jobs and query the database in a fast and optimal way. However, this created bugs in the software, added data consistency issues, and created more complexity in the system. Later, we started to use HDFS for a more flexible design. While enabling replication and distributed storage, this solution added more latency and supported batch processing only. It did not meet the needs of real-time processing use cases.

At this point, streaming helped a lot. The next step was to add a streaming platform that reads from sources, does some computation, and sinks the result somewhere else. The KafkaStreams design is a set of multiple lambda stateful functions, which makes it a good fit for a microservices architecture.  With Kafka Streams’ new updates, the app logic is linked to a relational database with ACID guarantees.

Finally, John Roesler considers that “software is a fractal”, a never-ending pattern: a software architecture is complex and even when we zoom into a single component, it is still complex. But for the Kafka Streams’ design, when we zoom out, it looks like a set of services interacting and connected to each other and this simplifies the aforementioned designs.

John concluded by mentioning open problems that can be dealt with in stream processing, including semantics, observability, operability, and maintainability.

 

Workshop Invited Speakers:

After the keynotes, 8 selected papers were presented, covering mainly these 6 topics: (1) Stream Processing for IoT, (2) Serverless and HPC (High Performance Computing), (3) Collaborative Streaming, (4) Stream Mining, (5) Image Mining and (6) Real-time Machine Learning. Some papers are not yet available, as they will be published in the proceeding of the IEEE Big Data Conference. In the meantime, do not hesitate to contact our R&D department at research@euranova.eu to discuss how you can leverage stream processing in your projects.

Sören and Wilhelm are engineers in the Software Engineering Group from Kiel University. They propose a stream processing architecture which allows for aggregating sensors in hierarchical groups, supports multiple hierarchies in parallel, provides reconfiguration at runtime, and preserves the scalability and reliability qualities of streaming.

Andre Luckow, head of Blockchain and Emerging Technologies at BMW Group, and Shantenu Jha, associate professor at Rutgers University, presented StreamInsight, which provides insight into the performance of streaming applications and infrastructure, their selection, configuration, and scaling behaviour.

The paper is written by Tobias Grubenmann, researcher at The University of Hong Kong, in collaboration with Daniele Dell’Aglio and Abraham Bernstein, researchers at the University of Zurich. They present the Collaborative Stream Processing (CSP), a model where the costs, which are set exogenously by providers, are shared between multiple consumers, the collaborators. For this, they identify the important requirements for CSP to establish trust between the collaborators and propose a CSP algorithm adhering to these requirements.

  • Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining (Tengyue Li, Simon Fong, and Raymond Wong)

Tengyue Li and Simon Fong (researcher and associate professor at the University of Macau, China) and Raymond Wong (associate professor at UNSW Sydney) worked on the Kennard-Stone Balance algorithm used as a new data conversion method. Training a prediction model effectively using big data streams poses certain challenges in machine learning. In this paper, the authors apply the Kennard-Stone algorithm on time-series to extract a meaningful representation of big data streams, which improves the performance of a machine learning model.

 

  • Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures (Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch)

Shawndra Hill (Microsoft), Anthony Colas (University of Florida), H. Andrew Schwartz (State University of New York at Stony Brook) and Gordon Burtch (University of Minnesota) explained their work on the interactions between TV content and online behaviours such as response to digital advertising. They developed AdMiner, a tool that can track online activity around a brand and provide actionable insights into ad campaigns.

 

Austin Harris, Jose Stovall, and Mina Sartipi (researchers and CUIP director at the University of Tennessee at Chattanooga) have helped to create Chattanooga’s smart corridor, used to test new technologies and generate data-driven outcomes. In their talk, they presented the corridor, used as a test bed for research in smart city developments in a real-world environment. The wireless communication infrastructure and network of sensors in combination with data analytics provide a means of monitoring and controlling city resources and infrastructure in real time.

 

Sebastian Trinks and Carsten Felde (TU Bergakademie Freiberg) presented how image mining can help avoiding errors and low quality of printed prototypes in real time. This can result in saving resources and increasing efficiency when developing new products.

 

This year, IEEE Big Data held the Real-time Machine Learning Competition on Data Streams. As the competition is focused on streaming, its online platform required a specific infrastructure that meets data stream mining requirements. Dihia Boulegane is a Ph.D. student at Télécom ParisTech working in collaboration with Orange Labs on machine learning for IoT networks monitoring. She was in charge of implementing the streaming engine of the dedicated platform of the competition. Dihia explained its components, the technologies used, and the challenges met to build the platform. At the end, the platform was able to provide multiple streams to multiple users, to receive multiple streams, to process them and to provide the leader board and live results.

 

Special thanks to our keynote guests, Matteo Merli and John Roesler, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

 

Activity

06-01-2020

A Performance Prediction Model for Spark Applications

Apache Spark is a popular open-source distributed-processing framework that enables efficient processing of massive amounts of data. It has a large number of parameters that need to be tuned to get the best performance. However, tuning these parameters manually is a complex and time-consuming task. Therefore, a robust performance model to predict applications execution time could greatly help in accelerating the deployment and optimization of big data applications relying on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications that cover the most common workloads to generate a representative dataset of execution time. In addition, we extracted application and data features to build a machine learning-based performance model to predict Spark applications execution time. The experiments show that boosting algorithms achieved better results compared to other algorithms.

The paper will be published at the Big Data congress 2020 taking place in Hawaii. In the meantime, do not hesitate to contact our R&D department at research@euranova.eu to discuss how you can optimise distributed processing frameworks in your projects.

Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.

Activity Conferences

03-01-2020

IEEE Big Data 2019: a summary

At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Los Angeles to attend IEEE Big Data Conference. It is one of the most influential academic gatherings in distributed machine learning. This year, it featured 879 authors, shortlisted from 2009 applicants. They came from 28 countries and presented 210 papers. Back in Belgium, Sabri and Syrine give you their opinion on the event itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.

 

The Big Trends

Sabri says: “The main trends were deep learning, NLP, privacy-preserving approaches, GAN, graph mining and stream mining. In my view, the level of the papers was quite good. Authors are becoming ever more skilled in data science, maths and algorithms. This goes to show that to be a good data scientist, you need an extensive set of advanced skills. Interestingly, there was almost nothing about distributed computing! This is a big move compared to the previous editions. The only presentations that had something to do with distributed systems were about optimisation strategies, an area similar to what our ECCO team researches. The Big Data Conference focuses on data science; it does not really look into its scalability.  Distributed computing topics tend to be dealt with at conferences like DEBS, VLDB, USENIX, SIGMOD, etc. As a result, this conference is an amazing place to see hundreds of data science use cases with, most of the time, an interesting contribution.”

 

The Keynotes

 

The keynotes were focused on data science as well. We even heard the term “Big Data Science”.

Keynote 1: Responsible Data Science by Lise Getoor – Professor at UC Santa Cruz

Syrine says: “The first keynote was my favourite. Lise started by comparing machine learning to a black box. The goal was to unpack the box and invite people to use data science and to use it wisely. To autonomise ethical decision-making, we should move away from maximising AI systems autonomy and move toward human-centric systems. To do this, we should make sure that human-centric systems have three qualities: (1) be knowledge-based, (2) be data-driven, and (3) support human values. Achieving responsible data science requires both machine-learning and ethics.”

 

Keynote 2: DataCommons “Google for Data” by Ramanathan Guha – Google

Guha presented DataCommons, a project started by Google to combine data from different open sources. Syrine explains: “Google’s DataCommons project allows users to pretend that the Web is one website, enabling developers to pretend all this data is in one database. The long-term vision of Google is to aggregate all data from publicly available sources (Medicare, Wikidata, sequence data, Landsat, CDC, Census…) into a single Open Knowledge Graph. The goal is to ​reduce or eliminate the ​​data download-clean-store​ process. Instead, users can access and use already cleaned data in the cloud. ​Data can be public or private (internet & intranet). This will avoid repeated data wrangling  and ease the burden of data storage, indexing, etc.”

 

The Tutorials

This year, IEEE Big Data held nine tutorials. Our R&D director explains: “At this type of events, tutorials are always a good way to learn a complete state of the art in a couple of hours. I particularly appreciated the tutorial on “Taming Unstructured Big data: Automated Information Extraction for Massive Textby the team of the famous Jiawei Han (he is a kind of pop star in data mining and the father of Graph Cube). I found out that many papers about named entity relations were published in the past two years. The idea is to be able to extract supervised, semi-supervised, and unsupervised relations between entities: for instance, discovering that “Trump” is “President of” “USA”. They also propose new approaches to integrate knowledge bases such as DBPedia or YAGO to infer new unknown relations from a corpus. This is just amazing!”

 

Syrine adds: “The tutorial on NewSQL principles, systems, and current trends was interesting as it explained why we should consider using NoSQL/NewSQL to deal with data interconnections and very high scalability. After attending this tutorial, I was motivated to order this book about Principles of Distributed Database Systems. For fans of deep learning, the tutorial “Deep Learning on Big Data with Multi-Node GPU Jobs” covers a lot about large-scale GPU-based deep-learning systems. If you missed the conference, all resources can be found on this ​link​.”

 

The Workshops

The EURA NOVA research centre organised the fourth workshop on Real-time and Stream Analytics in Big Data, at the 2019 IEEE conference on Big Data. We were really happy to welcome Matteo Merli from Apache Pulsar and John Roesler from Confluent as keynotes speakers. Thank you to them and to all the attendees and speakers! They had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event will soon be available on our website. The slides of the keynotes are available here:

 

 

Favourite Papers

A personal selection of interesting papers:

The paper tackles a really interesting problematic faced by a lot of data scientists. Introducing active learning is a cool idea and so is the way they used a mathematical trick to make their approach feasible.

Su Won Bae, from Mobilewalla, presented how they can define a complete customer acquisition model by mixing their data with their customer data (in this case, a worldwide leader in food delivery). Sabri says: “The quality of data science models highly depends on the data they can train on. I am convinced we will go in the same direction as Mobilewalla in the future to have richer models. However, mixing data must be done with care as it may raise some privacy issues;  our purpose has to have legal ground.”

The speaker presented MorphMine, a method for unsupervised morpheme segmentation.  It can generate morpheme candidates that are filtered out using entropy to select the best morphemes from a corpus. Then, these morphemes can be used to highly improve the word embedding model and the downstream machine learning tasks.

 

 

Page 1 of 712345...Last »