This document presents internships supervised by our software engineering department or by our research & development department. Each project is an opportunity to feel both empowered and responsible for your own professional development and for your contribution to the company.
MASTER THESIS & PFE 2021
This document introduces you to master thesis and graduation projects supervised by our research & development department. Each project offers you the chance to be actively involved in the development of solutions to address tomorrow’s challenges in ICT and implementing them today!
ECML 2020 – The Keynotes
A few weeks ago, the biggest European conference on machine learning was held: ECML 2020. Our research engineer Nourchène, our R&D consultant Gianmarco, and our data scientist Ronan attended the event from Tunisia, Belgium and Marseille. In this article, they tell you about the different keynote talks they attended.
Gemma Galdon-Clavell – Algorithmic Auditing: how to open the black-box of ML
Nourchène says: “I loved the talk given by Gemma Galdon-Clavell during which she addressed the problem of ethics in AI, as computer science engineers do not often question what they are producing from a moral standpoint. In her talk, Gemma points out the importance of data used to train a machine learning model. Data are provided by humans, but people are not perfect, they are likely to make wrong decisions. The model will then learn to behave the same way. So we might end up creating an unethical model. This can lead to two different behaviours: users either will follow the system’s recommendations at any cost or decide not to if they find the decisions not reasonable. Data will then continue to be biased, which creates a sort of deadlock.”
Ronan adds: “Algorithms do not produce biases from anywhere; they reproduce and amplify biases they can find in the data they ingest. As a result, we have to pay attention first to the quality of the data we use. Gemma emphasizes that algorithmic auditing is the key to understanding if the algorithm meets the expectations and if it complies with the regulations. The audit does not only cover the technical part and the way the algorithm was coded. It also focuses on how the problem was approached and the means deployed to solve it.”
Nourchène explains: “The speaker suggests that before creating a product, computer science engineers and developers need to ask the following questions: Is the product desirable and what is the problem that it tries to solve? Is it acceptable and does it involve users? Is it legal? Finally, does it use the right data? Gemma also suggests that ethics be taught in engineering schools. I totally agree with that because nowadays technology does not always seek to solve real problems, its goal is rather to make a fortune out of the proposed product.”
Max Welling – Amortized and Neural Augmented Inference
Gianmarco says: “My favourite talk was the one given by Max Welling. It clearly showed and unified the underlying theoretical grounds of many superficially different models, without failing to provide real-world applications. More concretely, the talk showed how to develop hybrid amortized methods that combine classical learning, inference and optimization algorithms with learned neural networks, which is of strong interest, especially in physics-related fields.
It provided a comprehensive exposition of the topic of amortized neural inference and, as a consequence, brought the audience up to date with applications in that regard. Max Welling presented how a learned neural network can augment or correct a classical solution (obtained by means of expert knowledge or classical equations), or conversely, how a neural network can be fed useful information computed by a classical method.”
Been Kim – Interpretability for everyone
Gianmarco says: “I was exposed to many new topics and applications I was not familiar with. Talks like Interpretability for everyone, which offered more abstract research, were the ones that caught my attention the most. The talk presented the latest discoveries and tools in terms of interpretability quantification. It also introduced how to extract interpretability from a black-box end-to-end model, which I find very important for the construction of more robust models and for model diagnosis.”
Doina Precup – Building Knowledge For AI Agents With Reinforcement Learning
Ronan says: “I really liked the talk given by Doina Precup on how to build knowledge in the field of reinforcement learning. I had only a little knowledge of this field. Thankfully, Doina quickly introduced us to the key concepts of reinforcement learning. She also presented some big successes of RL, described different RL mechanisms and moved on to the problem of using existing knowledge to build a life-long learning agent. Doina concluded her talk with a lot of open and inspiring questions: How can we exploit previously learned knowledge and apply it to new environments not related in any manner to the previous ones? How well is an agent preserving and enhancing its knowledge? These questions might not have definitive answers, or any answers at all, but I found the questions she raised on how we can represent knowledge very relevant and interesting.”
Stephan Günnemann – Certifiable Robustness of ML Models for Graphs
Ronan says: “In this technical talk, Stephan presented different methods to assess GNN robustness. To certify the robustness of a GNN, an evaluation of its sensitivity to perturbations needs to be conducted. For example, you can search for a worst-case scenario and verify that the margin is positive to ensure the model is robust. Stephan’s talk was very pleasant to listen to, as he accompanied it with several examples and applications of the methods he presented. Finally, he concluded that ML models for graphs aren’t reliable but that we can apply certificates and robustification principles to provide guarantees for a reliable use of GNNs.”
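To make the "worst-case margin" idea concrete, here is a minimal sketch. It is not the method from the talk: it swaps the GNN for a plain linear classifier and assumes the attacker may shift every input feature by at most `eps`, because in that simplified setting the worst case has a closed form. The function names and the toy weights are hypothetical.

```python
def worst_case_margin(weights, biases, x, true_class, eps):
    """Smallest margin of the true class over every other class when each
    feature of x may be shifted by at most eps (an L-infinity ball).

    Assumption: a linear classifier with score_c(x) = weights[c] . x + biases[c].
    """
    w_true, b_true = weights[true_class], biases[true_class]
    margins = []
    for c, (w_c, b_c) in enumerate(zip(weights, biases)):
        if c == true_class:
            continue
        diff = [wt - wc for wt, wc in zip(w_true, w_c)]
        clean = sum(d * xi for d, xi in zip(diff, x)) + (b_true - b_c)
        # The attacker can lower this margin by at most eps * ||diff||_1.
        margins.append(clean - eps * sum(abs(d) for d in diff))
    return min(margins)


def is_certified(weights, biases, x, true_class, eps):
    """The prediction is certified if the worst-case margin stays positive."""
    return worst_case_margin(weights, biases, x, true_class, eps) > 0


# Toy two-class model (hypothetical numbers).
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
x = [2.0, 1.0]

print(is_certified(weights, biases, x, true_class=0, eps=0.25))
print(is_certified(weights, biases, x, true_class=0, eps=0.5))
```

With a small perturbation budget the margin stays positive and the prediction is certified; once the budget is large enough to erase the margin, the certificate no longer holds.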
Watch the talks:
If you wish to catch up on talks we mentioned or those you missed, all the sessions, paper and presentation recordings are available (for a limited time) from the ECML website.
Gemma Galdon-Clavell
Max Welling
Been Kim
Doina Precup
Stephan Günnemann
ECML 2020 – A Summary
A few weeks ago, the biggest European conference on machine learning was held: ECML 2020. Our research engineer Nourchène, our R&D consultant Gianmarco, and our data scientist Ronan attended the event from Tunisia, Belgium and Marseille. What were the big trends and their favourite talks? What did they think of the online remote format? Let’s find out with them!
The Big Trends
Overall, the conference was well up to date with the outside world’s latest trends and needs. Gianmarco explains: “The conference was rich in presentations which covered nearly all possible topics in machine learning. However, I had the impression that Graph Neural Networks and Generative Models had a little more presence than other models. Transfer learning was another topic that seemed to be very relevant throughout the conference.”
Remote Format For The First Time
Due to the COVID-19 pandemic, the conference was fully virtual. The talks were pre-recorded and made available prior to the conference. The live sessions were dedicated to questions and answers, with a very brief presentation at the beginning of the session.
Nourchène explains: “The downside was that we had to watch the whole presentation beforehand, otherwise it was difficult to follow the discussion and to interact with the speaker. Fun fact: there was a session where even the moderator was not aware of this Q&A aspect and asked the speaker why the presentation was so short! The good thing is that, since the presentations were pre-recorded, it was possible to watch the presentations from sessions running in parallel.”
Gianmarco adds: “I have not had many remote conferences in my life, but I was genuinely surprised to see how well-organised this one was. The remote framework was very well-designed, the web interface was fully functional, and they took advantage of all the benefits that a remote event can have like re-watchable presentations.”
Kudos to the organising committee for pulling it off!
The Keynotes
We wrote an article with more details about different keynotes that you can find on this link, but here is a teaser:
Gemma Galdon-Clavell – Algorithmic Auditing: how to open the black-box of ML
In her talk, Gemma points out the importance of data used to train a machine learning model. According to her, algorithmic auditing is the key to understanding if the algorithm meets the expectations and if it complies with the regulations. This audit does not only cover the technical part and the way the algorithm was coded. It also focuses on how the problem was approached and the means deployed to solve it. Read our detailed review here.
Max Welling – Amortized and Neural Augmented Inference
The talk showed and unified the underlying theoretical grounds of many superficially different models, without failing to provide real-world applications. It provided a comprehensive exposition of the topic of amortized neural inference and, as a consequence, brought the audience up to date with applications in that regard. Read more here.
Been Kim – Interpretability for everyone
The talk presented the latest discoveries and tools in terms of interpretability quantification. It also introduced how to extract interpretability from a black-box end-to-end model. Read more in our article.
Doina Precup – Building Knowledge For AI Agents With Reinforcement Learning
Doina Precup talked about how to build knowledge in the field of reinforcement learning. She also presented some big successes of RL, described different RL mechanisms and moved on to the problem of using existing knowledge to build a life-long learning agent. Discover more!
Stephan Günnemann – Certifiable Robustness of ML Models for Graphs
Stephan presented different methods to assess GNN robustness, all of which rely on evaluating the model’s sensitivity to perturbations. Learn more with Ronan here.
Interesting Paper?
Si-An Chen; Voot Tangkaratt; Hsuan-Tien Lin; Masashi Sugiyama – Active deep Q-learning with demonstration
Nourchène says: “The authors presented their paper proposing different groups of techniques for learning from demonstration in Reinforcement Learning, like RL Expert Demonstration (RLED) or Active RL Demonstration (ARLD). These techniques can be used to speed up the learning process of an RL agent. They also propose an uncertainty-based query strategy named Active Deep Q-Network, based on DQN, to dynamically estimate the uncertainty of recent states and use the queried demonstration data.”
Favourite tutorial
Learning With Imbalanced Domains and Rare Event Detection
Ronan says: “This tutorial was interesting and well-structured. Imbalanced domains and rare-event prediction concern a lot of domains (financial, medical, data distribution…) and will always remain a centre of attention when designing the appropriate solution to a problem. As a consequence, it will remain a core research problem. I particularly liked this tutorial as it covered and compared a lot of different approaches: unsupervised (statistical-based, proximity-based, clustering-based), supervised and semi-supervised. As there is no ideal solution that can be applied to every problem, you have to know what exists before choosing the one that best fits your problem. The tutorial also covered different methods to properly evaluate the performance of an algorithm on an imbalanced task.”
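As a small illustration of why evaluation needs care on an imbalanced task, the sketch below compares plain accuracy with precision, recall and F1 on a made-up dataset where only 5% of the samples are positive. The choice of metrics and the data are our own assumptions, not necessarily the ones covered in the tutorial.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the rare (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical data: 5 rare events among 100 samples.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a "model" that never predicts the rare event

print(accuracy(y_true, always_negative))             # 0.95
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)
```

A model that never detects a single rare event still reaches 95% accuracy here, which is why class-aware metrics are usually preferred on such tasks.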
Conclusion
The conference provided a wide range of machine learning topics in the form of presentations about the latest trends, technologies and applications. As Nourchène says: “it is an optimal platform to stay up-to-date, to widen one’s perspectives and/or dig deeper into a specific topic.”
Watch the talks:
If you wish to catch up on talks we mentioned or those you missed, all the sessions, paper and presentation recordings are available (for a limited time) from the ECML website.
Gemma Galdon-Clavell
Max Welling
Been Kim
Doina Precup
Stephan Günnemann
Active deep Q-learning with demonstration: Read the paper
Internship & Master Thesis Offer – 2021
Our master thesis and internship offers for the coming year, supervised by our software engineering department or by our research & development department, will be available in the course of November, and will cover the following research topics:
Regarding data privacy:
- Legal entity relations with knowledge graph
- Legal NLP
- Privacy by design
- Topic modeling
- Text summarisation
- …
Regarding data automation
- GAN for multimodal representation
- AutoML
- Optimization methods
- Computer vision
- Graph Embeddings
- …
Regarding data pipelines
- Reinforcement learning
- Optimisation methods
- Stream Processing
- CEP
- Network compression
- …
Regarding data quality
- Denoising technique
- GAN for missing data
- Semi-Supervised learning
- Data cleaning
- Attention Model for Structural dep.
- …
Each project is an opportunity to feel both empowered and responsible for your professional development and to address tomorrow’s challenges in ICT, coached by the Eura Nova crew. The detailed offers will be available mid-November. In the meantime, do not hesitate to contact us at career@euranova.eu with any questions regarding internships and master theses!
As an example, the documents listed below present our 2020 master thesis and internships:
Internships 2020
This document presents internships supervised by our software engineering department or by our research & development department. Each project is an opportunity to feel both empowered and responsible for your own professional development and for your contribution to the company.
If you are interested in one of our offers, please send us your application to career@euranova.eu, including your CV and motivation regarding your top three internship positions (described in the document).
If you wish to read the testimonies of students who have done an internship at EURA NOVA, visit our blog, or read their experiences directly:
If you are interested in working on a topic that is not in our range of offers, we would be delighted to hear your proposition and invite you to get in touch.
Internship subjects and application guidelines are available here: Internship Offers.
Thirty-Fourth AAAI Conference On Artificial Intelligence: A Summary
Two weeks ago, our young research engineers Hounaida Zemzem and Rania Saidi were in New York for the Thirty-Fourth AAAI Conference On Artificial Intelligence. The conference promotes research in artificial intelligence and fosters scientific exchange between researchers, practitioners, scientists, students, and engineers in AI and its affiliated disciplines. Rania and Hounaida attended dozens of technical paper presentations, workshops, and tutorials on their favourite research areas: reinforcement learning for Hounaida and graph theory for Rania. What were the big trends and their favourite talks? Let’s find out with them!
The Big Trends:
Rania says: “The conference focused mostly on advanced AI topics such as graph theory, NLP, Online Learning, Neural Nets Theory and Knowledge Representation. It also looked into real-world applications such as online advertising, email marketing, health care, recommender systems, etc.”
Hounaida adds: “I thought it was very successful given the large number of attendees as well as the quality of the accepted papers (7,737 submissions were reviewed and 1,591 accepted). The talks showed the power of AI to tackle problems or improve situations in various domains.”
Favourite talks and tutorials
Hounaida explains: “Several of the sessions I attended were very insightful. My favourite talk was given by Mohammad Ghavamzadeh, an AI researcher at Facebook. He gave a tutorial on Exploration-Exploitation in Reinforcement Learning. The tutorial by William Yeoh, assistant professor at Washington University in St. Louis, was also amazing. He talked about Multi-Agent Distributed Constrained Optimization. Both their talks were clear and funny.”
Rania’s feedback? “One of my favourite talks was given by Yolanda Gil, the president of the Association for the Advancement of Artificial Intelligence (AAAI). She gave a personal perspective on AI and its watershed moments, demonstrated the utility of AI in addressing future challenges, and insisted on the fact that AI is now necessary to science. I also learned a lot about the state of the art in graph theory. The tutorial given by Yao Ma, Wei Jin, Lingfei Wu and Tengfei Ma, Graph Neural Networks: Models and Applications, was really interesting. Finally, the tutorial presented by Chengxi Zang and Fei Wang about Differential Deep Learning on Graphs and its Applications was excellent. Both were really inspiring and generated a lot of ideas about how to continue to expand my research in the field!”
Favourite papers
A personal selection by Rania & Hounaida of interesting papers to check out:
For Hounaida:
- Generalizable Resource Allocation in Stream Processing via DRL, by Xiang Ni, Jing Li, Mo Yu, Wang Zhou, and Kun-Lung Wu. This paper considers the problem of resource allocation in stream processing, where continuous data flows must be processed in real-time in a large distributed system.
- Scaling All-Goals Updates in Reinforcement Learning Using Convolutional Neural Networks, by Fabio Pardo, Vitaly Levdik, and Petar Kormushev. The authors propose to use convolutional network outputs (Q-values) to generate several sub-goals at once, in order to better guide the agents.
- From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning, by George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. The paper tackles the problem of constructing abstract representations for planning in high-dimensional, continuous environments.
For Rania:
- Optimizing Reachability Sets in Temporal Graphs by Delaying, by Argyrios Deligkas and Igor Potapov.
- Learning Hierarchy-Aware Knowledge Graph Embeddings for Link Prediction, by Zhanqiu Zhang, Jianyu Cai, Yongdong Zhang, and Jie Wang. The authors propose a novel knowledge graph embedding model which maps entities into the polar coordinate system to reflect hierarchy.
- Multi-View Multiple Clustering using Deep Matrix Factorization, by Shaowei Wei, Jun Wang, Guoxian Yu, Carlotta Domeniconi, and Xiangliang Zhang. The paper introduces a solution to discover multiple clusterings. It gradually factorizes multi-view data matrices into representational subspaces layer-by-layer and generates one clustering in each layer.
Final thoughts
After attending their first conference as Euranovians, what will Rania & Hounaida remember? Hounaida concludes: “Going to New York for the AAAI-20 Conference as one of the ENX data scientists was an amazing experience. I met many brilliant and sharp international experts in various fields. I enjoyed the one-week talks with so many special events, offline discussions, and the night strolls!”
Schloss Dagstuhl: Where Computer Science Meets
Which direction are stream and complex event processing going to take? Last week, the world’s best-known international researchers met in Schloss Dagstuhl, Germany, to present and discuss their research. Among the participants were Avigdor Gal, Professor at the Israel Institute of Technology, Alessandro Margara, Assistant Professor at the Polytechnic University of Milan, and Till Rohrmann, engineering lead at Ververica.
Invited to talk about the requirements and needs from the industry, our R&D director Sabri Skhiri explains: “The seminar brought together world-class computer scientists and practitioners working on complex event recognition, distributed systems, databases, stream reasoning and artificial intelligence. Our objective was to disseminate the recent foundational results in each of these isolated fields among all participants, to identify the open problems that need to be resolved, and to establish new research collaborations among these fields”.
What were the big trends and takeaways gathered by those brilliant minds? Let’s find out with Sabri!
The Big Trends
This seminar is a bit particular as it does not show any trends but rather gives a picture of all the communities working on CER in one way or another. I was fascinated by the diversity of researchers. I did not expect to see such a rich variety of fields: knowledge representation, spatial reasoning, logic-based reasoning, data management, learning-based approaches, event-driven processing, process mining, database theory, stream mining… In my view, the composite event recognition models that are the best at recognising complex events would include:
- Data flow model
- Ontology-based and reasoning model
- Symbolic reasoning model
- Automata-based model
We also identified common challenges across these models and communities. The three priority topic areas we identified are:
- Expressivity: composability & hierarchies
- Evaluation strategy, parallelization and distribution
- Uncertainty management
Favourite Talk
Kurt Rothermel from TU Stuttgart – Time-sensitive Complex Event Processing
My first reaction to load shedding was: “It is useless since customers do not want to lose any event, that is why so much effort is spent today on exactly-once semantics…” However, there is a trend today in stream processing, which is the trade-off between cost, latency, and correctness. Tyler Akidau described this challenge as a choice between one of three propositions: fast and correct, cheap and correct, or fast and cheap. Tyler was talking about streaming but that rule applies in the same way in a CEP context. The load shedding strategy, which deliberately drops some events when the system is overloaded, falls directly into the third proposition. From this perspective, the work of Kurt is highly relevant.
Favourite Tutorial
Jacopo Urbani & Fredrik Heintz – Stream Reasoning
Concretely, stream reasoning is incremental reasoning over rapidly changing information. The tutorial opened new perspectives on stream processing for me. It tried to answer a very interesting question: how can you provide reasoning about context from streams of data? I definitely come from the database and event-based systems communities and I did not know at all that stream reasoning was so mature. This community has been evolving from having a continuous version of SPARQL to a complete distributed stream reasoning semantics. It is interesting to see that the work we have done in the LEAD algebra and semantics is deeply inspired by this community. However, we have never used any reasoning logic on top of LEAD. But after a few hours of the tutorial, I realised that (1) reasoning can be used for query rewriting and optimisation, and (2) it is worth evaluating at least BigSR, the LARS implementation on Flink.
Avigdor Gal & Ruben Mayer – Distributed and Event-Based Systems
Avigdor is a kind of pop star for the stream processing and distributed systems community, or at least for me! The papers he published about a probabilistic CEP engine with late arrival and event uncertainty were visionary.
The speakers started by explaining the basics of stream processing then went deeper into the event recognition language and architecture. They detailed pub/sub applied to event recognition and explained the data flow model, which consists of a single unified data processing model where the stream and batch paradigms are the same. This last part was based on Tyler Akidau’s paper.
A second part of the talk focused on elasticity on streams. Stream fission sorts operators into different categories:
- Firstly, key-based operators, which behave like a group-by operation (as in SQL).
- Secondly, window-based operators, which make it possible to split processing that needs to correlate multiple event types with different keys within the same operator.
- Finally, pane-based operators, which enable a split-merge strategy where you distribute the processing and merge the results.
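The key-based category can be sketched in a few lines. This is a toy illustration rather than how any particular engine implements it: events are routed to a parallel instance by hashing their key, each instance keeps its own group-by count, and the partial results can be merged at the end. The `route`, `key_based_count` and `merge` helpers, as well as the road-segment keys, are hypothetical.

```python
import zlib
from collections import Counter


def route(key, parallelism):
    """Pick the parallel instance responsible for a key (stable hash)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def key_based_count(events, parallelism):
    """Group-by count where each key is handled by exactly one instance."""
    partitions = [Counter() for _ in range(parallelism)]
    for key, _payload in events:
        partitions[route(key, parallelism)][key] += 1
    return partitions


def merge(partitions):
    """Combine the per-instance counts into a single result."""
    total = Counter()
    for partial in partitions:
        total.update(partial)
    return total


# Hypothetical stream of (key, payload) events.
events = [("road-A", 1), ("road-B", 1), ("road-A", 1), ("road-C", 1)]
partitions = key_based_count(events, parallelism=3)
print(merge(partitions))
```

Because all events sharing a key land on the same instance, no coordination is needed between instances while counting, which is what makes this category easy to scale out.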
Interestingly, Avigdor presented his work about late-arrival processing from a probabilistic viewpoint and not from the watermark perspective. Usually, modern stream processing frameworks use watermarks in order to take into account events that arrive later. Avigdor presented a probabilistic approach to this issue.
What are late-arrival events?
Imagine we want to count the number of cars entering a road segment every three minutes: we have a “tumbling window” of 3 minutes. If an event (i.e. a car) occurs 2 minutes and 55 seconds into the window but is stuck somewhere in the network for 6 seconds, it is called a late-arrival event. The processing time (the time at which the CEP engine processes the event) is delayed compared to the event time (the time at which the event really occurred).
Note that for CEP, there is clearly a trade-off between timeliness and accuracy, because the slack time will increase the delay to deliver your result but will increase your accuracy. There is always a tradeoff between cost, latency and correctness, and usually, you can only pick two among the three.
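The car-counting example can be sketched as follows. This is a simplified illustration using a fixed slack time, not the probabilistic approach Avigdor presented nor the watermark mechanism of a real framework. Times are in seconds, and the `count_per_window` helper and the sample events are made up.

```python
WINDOW = 180  # tumbling window of 3 minutes, in seconds


def window_start(event_time, size=WINDOW):
    """Start of the tumbling window an event belongs to (by event time)."""
    return (event_time // size) * size


def count_per_window(events, slack, size=WINDOW):
    """Count events per window, waiting `slack` seconds after each window
    ends before emitting its result. Events arriving later are dropped."""
    counts = {}
    dropped = 0
    for event_time, processing_time in events:
        start = window_start(event_time, size)
        if processing_time < start + size + slack:
            counts[start] = counts.get(start, 0) + 1
        else:
            dropped += 1  # the result for this window was already emitted
    return counts, dropped


# (event_time, processing_time); the third car occurs at 2'55" (175 s)
# but is delayed by 6 s in the network.
events = [(10, 11), (100, 102), (175, 181), (200, 201)]

print(count_per_window(events, slack=0))   # ({0: 2, 180: 1}, 1)
print(count_per_window(events, slack=10))  # ({0: 3, 180: 1}, 0)
```

Without slack, the first window's result is available at 180 s but misses the late car; with 10 s of slack, the count is correct but only available at 190 s.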
Fun fact: If you need to explain what event time and processing time are to your mother (yeah, don’t underestimate the power of this kind of discussion at Christmas dinner), the best way is to use the Star Wars analogy. From an event time perspective (the time at which the story really happened), you should follow episodes 1, 2, 3, 4, 5, 6, 7, 8, 9. But if you take the processing time (the time at which we received the episodes), it is 4, 5, 6, 1, 2, 3, 7, 8, 9. Isn’t it great?!
Final Thoughts
CER has been explored from many viewpoints. However, never in the history of this research had there been a meeting gathering representatives of all these communities. This was the objective of this seminar. Having all these people in a castle in the middle of nowhere was a blast! I had very passionate discussions during meals but also during the night at the library with the most brilliant brains on stream and CEP. On the other hand, I still had some fun discussions comparing Star Trek Discovery and Picard! Finally, the most important things I will remember after this seminar… are the endless ping pong games with Till Rohrmann and Alessandro Margara :-).
Throwback To 2019
At EURA NOVA, we believe technology is a catalyst for change. To embrace it, we strive to stay at the edge of knowledge. Investing in research allows us to continuously become more proficient, to maintain our know-how at the cutting edge of IT, to share its benefits with our customers, and to incubate the products of tomorrow. As we look back on the year 2019, we are both proud and happy of the work achieved!
Published papers:
We are happy to say that our R&D department published five peer-reviewed scientific papers last year.
- LEAD: A Formal Specification For Event Processing
In June, our R&D engineer Anas presented his work on complex event processing at the 13th ACM International Conference on Distributed and Event-Based Systems, which was taking place in Germany.
Anas Al Bassit, Sabri Skhiri, LEAD: A Formal Specification For Event Processing, in 13th ACM International Conference on Distributed and Event-Based Systems 2019
- Coherence Regularization for Neural Topic Models
In July, our R&D engineer Kate presented her paper on neural topic models at the 16th International Symposium on Neural Networks taking place in Moscow.
Katsiaryna Krasnashchok, Aymen Cherif, Coherence Regularization for Neural Topic Models, in 16th International Symposium on Neural Networks 2019 (ISNN 2019)
- STRASS: A Light and Effective Method for Extractive Summarization
In August, our PhD student Léo was in Italy to present his paper at the 2019 ACL Student Research Workshop.
Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy.
- GraphOpt: Framework for Automatic Parameters Tuning of Graph Processing Frameworks
In December, the paper written by our former intern and now full-time colleague Muaz was presented in Los Angeles at the third IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications.
Muaz Twaty, Amine Ghrab, Sabri Skhiri, GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks, in 2019 IEEE International Conference on Big Data (Big Data) Workshops, Los Angeles, CA, USA.
- A Performance Prediction Model for Spark Applications
In June 2020, our paper written as part of the ECCO research project we have been leading at EURA NOVA will be presented at the Big Data congress 2020 taking place in Hawaii.
Florian Demesmaeker, Amine Ghrab, Usama Javaid, Ahmed Amir Kanoun, A Performance Prediction Model for Spark Applications, in the proceedings of Big Data congress 2020.
IEEE Big Data Workshop
Last December, Eura Nova’s research centre held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0. Special thanks to our keynote guests, Matteo Merli (Apache Pulsar) and John Roesler (Confluent), and all the attendees and speakers!
JERICHO, research driving innovations
The mission of the JERICHO research track is to make the latest technologies available to our clients, to offer them a competitive edge to play alongside megacorporations. After two years of intense work, seven published papers, and presentations at international conferences spanning Russia, the United States, Germany, Australia, and Belgium, our JERICHO project has come to an end.
And the adventure continues! We are really excited to continue our work on innovative solutions for the next data challenges with our new research track ASGARD.
Our R&D director Sabri Skhiri says: “The costs of data solutions and the lack of data scientists will increase in the next 3 to 5 years and solutions to reduce them will benefit from a large market. In this sense, ASGARD is precisely in the strategy of Eura Nova. ASGARD aims to reduce these costs by automating the most expensive tasks. As the world becomes increasingly digital and reinvents itself, innovation and research are essential in the market.”
Academic collaboration
This year, we welcomed nine interns across our three offices. Big kudos to our intern Muaz, who successfully finished his master thesis in collaboration with EURA NOVA! The goal of his thesis was to optimise the configuration of distributed graph frameworks. He has now joined EURA NOVA as a full-time employee.
Talks & seminars
This year, the research team had the pleasure of being invited to several international conferences:
- In February, our research director Sabri Skhiri gave a seminar on modern Stateful Stream Processing at EPT. Our R&D engineer Syrine Ferjaoui also went to Morocco to give a workshop about data architecture at the Annual International Conference on Arab Women In Computing.
- In March, Sabri was at the World AI Show in Dubai to talk about successfully deploying AI projects in production. He was also invited to Barcelona Tech to give a Big Data Architecture & Design seminar.
- In June, our data privacy officer Nazanin Gifani gave a masterclass on Fairness and Transparency in AI at the DI Summit in Brussels.
- In September, our R&D project manager Shivom Aggarwal talked at the Arab Future Cities Summit 2019 about deploying AI at industrial scale for smart cities.
- In October, our software engineer Christophe Philemotte was in San Francisco to talk at the Kafka Summit about crossing the streams thanks to Kafka and Flink.
- In November, Sabri was invited as a keynote speaker at the 17th International Conference on Service-Oriented Computing to share his experience of the convergence between microservices, stateful stream processing and function as a service.
Summer schools & conferences
This year, Euranovians attended more than 15 prestigious international conferences and summits across the world to remain up to date and grow our network. We investigated the state of the art in streaming, data science, DevOps, computer vision and cloud engineering at conferences such as Flink Forward, Spark AI Summit, KubeCon, IEEE Big Data, DataWorks Summit, Kafka Summit, NeurIPS, RedHat, Elixir LDN and CVPR.
Euranovians brought back what they learned for the rest of the team and the big data community. Find our public summaries, identified trends and conference reviews here:
- Flink Forward (Berlin)
- Spark AI Summit (Amsterdam)
- IEEE Big Data (Los Angeles)
- Kafka Summit (San Francisco)
- DEBS 2019 (Darmstadt)
- ACL 2019 (Florence)
- DataWorks Summit (Barcelona)
IEEE Big Data 2019 – A Summary
At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Los Angeles to attend the IEEE Big Data conference. It is one of the most influential academic gatherings in distributed machine learning. This year, it featured 879 authors, shortlisted from 2009 applicants. They came from 28 countries and presented 210 papers. Back in Belgium, Sabri and Syrine share their opinion on the event itself and the highlights from the keynotes, the tutorials, the workshops and the most interesting papers.
The Big Trends
Sabri says: “The main trends were deep learning, NLP, privacy-preserving approaches, GANs, graph mining and stream mining. In my view, the level of the papers was quite good. Authors are becoming ever more skilled in data science, maths and algorithms. This goes to show that to be a good data scientist, you need an extensive set of advanced skills. Interestingly, there was almost nothing about distributed computing! This is a big shift compared to previous editions. The only presentations that had something to do with distributed systems were about optimisation strategies, an area similar to what our ECCO team researches. The Big Data Conference focuses on data science; it does not really look into its scalability. Distributed computing topics tend to be dealt with at conferences like DEBS, VLDB, USENIX, SIGMOD, etc. As a result, this conference is an amazing place to see hundreds of data science use cases, most of which make an interesting contribution.”
The Keynotes
The keynotes were focused on data science as well. We even heard the term “Big Data Science”.
Keynote 1: Responsible Data Science by Lise Getoor – Professor at UC Santa Cruz
Syrine says: “The first keynote was my favourite. Lise started by comparing machine learning to a black box. The goal was to unpack the box and invite people to use data science, and to use it wisely. To autonomise ethical decision-making, we should move away from maximising the autonomy of AI systems and move toward human-centric systems. To do this, we should make sure that human-centric systems have three qualities: they should (1) be knowledge-based, (2) be data-driven, and (3) support human values. Achieving responsible data science requires both machine learning and ethics.”
Keynote 2: DataCommons “Google for Data” by Ramanathan Guha – Google
Guha presented DataCommons, a project started by Google to combine data from different open sources. Syrine explains: “Google’s DataCommons project allows users to pretend that the Web is one website, enabling developers to pretend all this data is in one database. The long-term vision of Google is to aggregate all data from publicly available sources (Medicare, Wikidata, sequence data, Landsat, CDC, Census…) into a single Open Knowledge Graph. The goal is to reduce or eliminate the data download-clean-store process. Instead, users can access and use already cleaned data in the cloud. Data can be public or private (internet & intranet). This will avoid repeated data wrangling and ease the burden of data storage, indexing, etc.”
The Tutorials
This year, IEEE Big Data held nine tutorials. Our R&D director explains: “At this type of event, tutorials are always a good way to learn a complete state of the art in a couple of hours. I particularly appreciated the tutorial on ‘Taming Unstructured Big Data: Automated Information Extraction for Massive Text’ by the team of the famous Jiawei Han (he is a kind of pop star in data mining and the father of Graph Cube). I found out that many papers about relations between named entities were published in the past two years. The idea is to extract relations between entities using supervised, semi-supervised or unsupervised methods: for instance, discovering that ‘Trump’ is ‘President of’ ‘USA’. They also propose new approaches to integrate knowledge bases such as DBpedia or YAGO to infer new unknown relations from a corpus. This is just amazing!”
Syrine adds: “The tutorial on NewSQL principles, systems, and current trends was interesting as it explained why we should consider using NoSQL/NewSQL to deal with data interconnections and very high scalability. After attending this tutorial, I was motivated to order this book about Principles of Distributed Database Systems. For fans of deep learning, the tutorial ‘Deep Learning on Big Data with Multi-Node GPU Jobs’ covers a lot about large-scale GPU-based deep-learning systems. If you missed the conference, all resources can be found at this link.”
The Workshops
The EURA NOVA research centre organised the fourth workshop on Real-time and Stream Analytics in Big Data, at the 2019 IEEE conference on Big Data. We were really happy to welcome Matteo Merli from Apache Pulsar and John Roesler from Confluent as keynote speakers. Thank you to them and to all the attendees and speakers! They had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event will soon be available on our website. The slides of the keynotes are available here:
Favourite Papers
A personal selection of interesting papers:
- Subspace Clustering with Active Learning (Hankui Peng)
The paper tackles a really interesting problem faced by a lot of data scientists. Introducing active learning is a cool idea, and so is the mathematical trick they used to make their approach feasible.
- High Impact Customer Acquisition & Retention Modelling – A Scalable Data Mashup Approach (Kajanan Sangaralingam, Nisha Verma, Aravind Ravi, Su Won Bae)
Su Won Bae, from Mobilewalla, presented how they can define a complete customer acquisition model by mixing their own data with their customer’s data (in this case, a worldwide leader in food delivery). Sabri says: “The quality of data science models highly depends on the data they can train on. I am convinced we will go in the same direction as Mobilewalla in the future to have richer models. However, mixing data must be done with care as it may raise some privacy issues; the purpose of the processing must have a legal basis.”
- Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings (Ahmed El-Kishky, Frank Xu, Aston Zhang, Jiawei Han)
The speaker presented MorphMine, a method for unsupervised morpheme segmentation. It generates morpheme candidates, which are then filtered using entropy to keep the best morphemes from a corpus. These morphemes can then be used to significantly improve the word embedding model and the downstream machine learning tasks.
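To give a feel for what “filtering candidates using entropy” can look like, here is a minimal sketch in Python. It is not the MorphMine algorithm from the paper: it swaps in a classic, much simpler heuristic (branching entropy), assuming that a word ending is a plausible suffix when many different characters can precede it. The function names, the toy word list and the thresholds are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def candidate_suffixes(words, max_len=4):
    """Collect word-final substrings (hypothetical suffix candidates).

    For each candidate, count which character appears right before it.
    """
    preceding = defaultdict(Counter)
    for word in words:
        # Always leave at least one character in front of the candidate.
        for n in range(1, min(max_len, len(word) - 1) + 1):
            preceding[word[-n:]][word[-n - 1]] += 1
    return preceding


def entropy(counter):
    """Shannon entropy (in bits) of the distribution described by counter."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def filter_by_entropy(preceding, min_count=2, min_entropy=1.0):
    """Keep candidates that are frequent enough and follow varied characters.

    The thresholds are arbitrary illustrative values, not taken from the paper.
    """
    kept = {}
    for suffix, counts in preceding.items():
        if sum(counts.values()) < min_count:
            continue
        score = entropy(counts)
        if score >= min_entropy:
            kept[suffix] = score
    return kept


# Toy corpus, invented for this example.
words = ["walking", "talking", "jumping", "reading",
         "walked", "talked", "jumped", "played"]
suffixes = filter_by_entropy(candidate_suffixes(words))
print(sorted(suffixes))
```

On this toy word list, only “ing” and “ed” survive: endings such as “ng” or “d” are always preceded by the same character, so their entropy is zero and they are discarded.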