Euranova has 3 fundamental pillars: explore, craft and serve. The explore pillar of Euranova is an independent research centre dedicated to data science, software engineering and AI.
Through the exploration of tomorrow's engineering and data science to answer today's problems, our research centre anticipates the challenges that European businesses face. We find solutions to current and future digital challenges with passion, creativity and integrity.
4th Workshop on Real-time & Stream Analytics in Big Data
EURA NOVA Research centre is proud and excited to organize the fourth Workshop on Real-time and Stream Analytics in Big Data, co-located with the 2019 IEEE conference on Big Data. The workshop will take place in December in Los Angeles, USA. Stream processing and real-time analytics have become some of the most important topics in Big Data. To refine the new opportunities and use cases required by the industry, we are bringing together experts who are passionate about the subject.

This year, we are excited to have two amazing keynote speakers, from the Apache Pulsar and Kafka Streams communities:

Matteo Merli is one of the co-founders of Streamlio; he serves as the PMC chair for Apache Pulsar and is a member of the Apache BookKeeper PMC. Previously, he spent several years at Yahoo building database replication systems and multi-tenant messaging platforms. Matteo was the co-creator and lead developer of the Pulsar project within Yahoo.

John Roesler is a software engineer at Confluent and a contributor to Apache Kafka, primarily to Kafka Streams. Before that, he spent eight years at Bazaarvoice on a team designing and building a large-scale streaming database and a high-throughput declarative stream processing engine.

If you want to join us, authors from industry and academia are invited to contribute to the conference by submitting articles. Check out the workshop website to find all the information you will need. Your paper will be reviewed by a prestigious panel of international experts from both the academic and industrial worlds.
ACL 2019: Takeaways from the conference
Last month, our R&D Project Director Cécile Pereira and our PhD student Léo Bouscarrat travelled to Florence to attend and present at ACL 2019. ACL is one of the biggest conferences in natural language processing. This year, all the records were broken, with more than 3500 attendees, 660 papers accepted to the main conference, 9 tutorials and more than 20 workshops. All the talks of the main conference were recorded and are accessible online. In this article, Cécile and Léo share with you the latest trends from the conference!

Big trends

A new paradigm in NLP? This year, ACL's selection of topics showed the importance taken by self-training methods such as BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019). These methods feed huge models with a vast amount of data and train them on simple tasks (for example, predicting masked words in the original sentence, or predicting whether two sentences follow each other). The resulting models are then able to learn faster and with less data on a more specific and complex task. With this method, the way to train a model to solve an NLP task has changed. The new paradigm is:

1. Select a pre-trained model (trained with self-training).
2. Add a layer on top of this model's output (it will depend on your task) and fine-tune the model by giving it the inputs and outputs of your task.
3. Evaluate your model.

Many papers used this paradigm to achieve state of the art on several tasks (out of the 660 papers of the main conference, 47 have the word BERT in their abstract). Contextual embeddings, like BERT, take the context of the sentence into account in the embeddings of the words. BERT can be used for a large variety of NLP tasks.
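To make the paradigm concrete, here is a minimal sketch of the select/fine-tune/evaluate workflow, assuming the Hugging Face transformers library and PyTorch; the model name, toy data and hyperparameters are illustrative, not taken from any conference talk.

```python
# Minimal sketch of the pre-train / fine-tune paradigm described above.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 1. Select a pre-trained model (trained with self-training).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 2. adds a task-specific output layer
)

texts = ["the movie was great", "the movie was terrible"]  # toy task data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# 2. Fine-tune the whole model on the inputs/outputs of the target task.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy data
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# 3. Evaluate: predict on new text.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer(["great film"], return_tensors="pt")).logits
print(logits.argmax(dim=-1))  # predicted class
```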
STRASS: A Light and Effective Method for Extractive Summarization
This paper introduces STRASS: Summarization by TRAnsformation Selection and Scoring. It is an extractive text summarization method which leverages the semantic information in existing sentence embedding spaces. Our method creates an extractive summary by selecting the sentences with the closest embeddings to the document embedding. The model learns a transformation of the document embedding to maximize the similarity between the extractive summary and the ground-truth summary. As the transformation is only composed of a dense layer, the training can be done on CPU and is therefore inexpensive. Moreover, inference time is short and linear in the number of sentences. As a second contribution, we introduce the French CASS dataset, composed of judgments from the French Court of Cassation and their corresponding summaries. On this dataset, our results show that our method performs similarly to state-of-the-art extractive methods, with efficient training and inference times. Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira, STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings, in 2019 ACL Student Research Workshop, Florence, Italy. Click here to access the paper.
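As a rough illustration of the selection step described in the abstract (not the authors' released code), the sketch below scores sentences by cosine similarity to a transformed document embedding and keeps the top ones; the weights W and b stand in for the learned dense-layer transformation.

```python
# Illustrative sketch of embedding-based extractive selection.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def extract_summary(sent_embs, doc_emb, W, b, k=3):
    """sent_embs: (n_sents, d); doc_emb: (d,); W, b: learned dense layer."""
    target = W @ doc_emb + b            # transformed document embedding
    scores = [cosine(s, target) for s in sent_embs]
    top = np.argsort(scores)[::-1][:k]  # indices of the k closest sentences
    return sorted(top)                  # keep original sentence order

# Toy usage with random embeddings. In STRASS, W and b would be trained to
# maximize the similarity between the extracted and the reference summary.
rng = np.random.default_rng(0)
d, n = 8, 5
W, b = np.eye(d), np.zeros(d)
print(extract_summary(rng.normal(size=(n, d)), rng.normal(size=d), W, b, k=2))
```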
DEBS 2019: A Summary
Last month, our R&D engineer Anas Albassit and director Sabri Skhiri travelled to Germany to attend and present at DEBS 2019, one of the most specialised conferences in distributed event-based systems. DEBS has a long history: from active databases to streaming engines and distributed publish-subscribe systems, it has always been the pioneer of distributed and high-performance systems. In this article, Anas and Sabri share with us what they learned there and what struck them as particularly useful.

Big Trends

This edition focused on streaming languages, scheduling, elasticity, distributed event processing, platforms and middleware. Our R&D director Sabri Skhiri says: "For someone working in distributed computing and data management, DEBS is one of the major conferences, alongside SIGMOD and VLDB. Even though it is quite small (80 participants vs 1000 for IEEE Big Data), this is a niche conference of experts from a small yet amazingly talented community of researchers. The keynotes were just great, with a good balance between pure research and industry. This conference tackling distributed computing and streaming is heaven for data scientists and architects like us!"

Keynotes

Open Problems in Stream Processing: A Call To Action (Tyler Akidau). Tyler Akidau is the technical lead for the Data Processing Languages & Systems group at Google. He argues that, even though stream processing has gone from niche to mainstream, this is just the beginning. For him, the need for active exploration of new ideas is all the more pressing. Sabri reacts: "Streams have been around for 30 years. We have Spark, Flink, Dataflow, KStream, MSF Trill. But is that all we can do? Is there nothing left to do? Tyler Akidau brilliantly showed that stream processing as a field of research is alive and well." The talk was mainly about raising open problems and calling the community to action on them.
LEAD: A Formal Specification For Event Processing
Processing event streams is an increasingly important area for modern businesses aiming to detect and react efficiently to critical situations in near real-time. The need to govern the behaviour of systems where such streams exist has led to the development of numerous Complex Event Processing (CEP) engines, capable of detecting patterns and analyzing event streams. Although current CEP systems provide real-time analysis foundations for a variety of applications, several challenges arise due to the limitations and imprecise semantics of their languages, as well as their lack of power to handle big data requirements. In this paper, we discuss such systems, analyzing some of the most sensitive issues in this domain. Further, in this context, we present our contributions expressed in LEAD, a formal specification for processing complex events. LEAD provides an algebra that consists of a set of operators for constructing complex events (patterns), temporally restricting the construction process and choosing among several selection and consumption policies. We show how to build LEAD rules to demonstrate the expressive power of our approach. Furthermore, we introduce a novel approach for interpreting these rules into a logical execution plan, built with temporal prioritized coloured Petri nets. Anas Al Bassit, Sabri Skhiri, LEAD: A Formal Specification For Event Processing, in 13th ACM International Conference on Distributed and Event-Based Systems (DEBS 2019). Click here to access the paper.
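For intuition only, here is a toy sketch of the kind of rule such an algebra describes: detect the pattern SEQ(A, B) within a time window, with a simple selection policy (keep the latest A) and consumption policy (an A is used up once a B arrives). This illustrates general CEP concepts, not LEAD's formal semantics.

```python
# Toy sequence-pattern detection with a temporal restriction.
from collections import namedtuple

Event = namedtuple("Event", ["type", "ts"])

def seq_a_then_b(stream, window=10.0):
    """Yield (a, b) pairs where a B follows an A within `window` seconds."""
    pending_a = None
    for ev in stream:
        if ev.type == "A":
            pending_a = ev                      # selection: keep the latest A
        elif ev.type == "B" and pending_a is not None:
            if ev.ts - pending_a.ts <= window:
                yield (pending_a, ev)
            pending_a = None                    # consumption: A used up (or expired)

events = [Event("A", 0.0), Event("C", 2.0), Event("B", 3.5),
          Event("A", 20.0), Event("B", 35.0)]  # second B falls outside the window
print(list(seq_a_then_b(events)))
# -> [(Event(type='A', ts=0.0), Event(type='B', ts=3.5))]
```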
Coherence Regularization for Neural Topic Models
Neural topic models aim to predict the words of a document given the document itself. In such models, perplexity is used as a training criterion, whereas the final quality measure is topic coherence. In this work, we introduce a coherence regularization loss that penalizes incoherent topics during the training of the model. We analyze our approach using coherence and an additional metric, exclusivity, which captures the uniqueness of the terms in topics. We argue that this combination of metrics is an adequate indicator of model quality. Our results indicate the effectiveness of our loss and its potential to be used in future neural topic models. The paper will be published at the 16th International Symposium on Neural Networks, taking place in Moscow. In the meantime, do not hesitate to contact our R&D department at [email protected] to discuss how you can leverage neural topic models in your projects. Katsiaryna Krasnashchok, Aymen Cherif, Coherence Regularization for Neural Topic Models, in 16th International Symposium on Neural Networks 2019 (ISNN 2019). Click here to access the paper.
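To give a feel for the idea, here is a schematic sketch of adding a coherence-based penalty to a topic model's training loss, assuming PyTorch, a topic-word matrix `beta` and a precomputed NPMI word-pair score matrix; the exact loss formulation in the paper may differ.

```python
# Schematic coherence regularizer for a neural topic model.
import torch

def coherence_penalty(beta, npmi):
    """beta: (K, V) topic-word probabilities; npmi: (V, V) word-pair scores.
    Rewards topics whose probability mass sits on coherent word pairs."""
    per_topic = torch.einsum("kv,vw,kw->k", beta, npmi, beta)
    return -per_topic.mean()  # negative: higher coherence lowers the loss

K, V = 4, 100
logits = torch.randn(K, V, requires_grad=True)   # topic-word logits
beta = torch.softmax(logits, dim=-1)
npmi = torch.randn(V, V)
npmi = (npmi + npmi.T) / 2   # coherence is symmetric in the word pair

recon = torch.tensor(0.0)    # placeholder for the model's perplexity loss
loss = recon + 0.1 * coherence_penalty(beta, npmi)
loss.backward()
print(logits.grad.shape)     # gradients flow into the topic-word logits
```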
DataWorks Summit: the big trends
Last month, four EURA NOVA engineers travelled to Barcelona to attend the DataWorks Summit. The conference is organised by Hortonworks, now part of Cloudera, and is about how to apply open-source big data technology to accelerate digital transformation initiatives. They came back with a lot to say about the hot topics in AI, machine learning, architecture, the cloud, and use cases! In this article, they share with us what they learned there and what struck them as particularly useful.

Big Trends

Data architecture. This year, one of the most important trends at the conference was data management and data architecture. Our R&D director Sabri Skhiri says: "There was a real focus on taking data lakes to their next stage and on making them actionable for AI and machine learning. The notion of data hubs was often mentioned, notably during the keynote speeches by Cloudera, IBM, and Pure Storage. However, most of the platform vendors have not yet been able to provide a fully-fledged ecosystem that allows the exploration, governance, and industrialisation of big data."

AI industrialisation. This brings us to the second motto of the conference: AI industrialisation is a must. Our data engineer Khalil Amdouni explains: "The conference has been migrating towards AI topics. In the past, it used to focus mostly on data ingestion and data processing; it has been moving towards data science. Everyone is talking about AI and machine learning and how to put data science models into production, looking into how to move from data exploration to industrialisation; we heard a lot about Cloudera's Data Science Workbench, etc."

Production environments. The third trend of the conference was the separation between data processing tools and AI frameworks. Khalil explains: "Spark, Cloudera and Kubernetes are now all providing production environments."
BOOT CAMP 2019
EURA NOVA is launching an intense three-month IT boot camp starting in September 2019.
Third Workshop on Real-Time and Stream Analytics in Big Data: key takeaways
Last month, EURA NOVA research centre organised the third Workshop on Real-Time and Stream Analytics in Big Data, co-located with the 2018 IEEE conference on Big Data in Seattle. The workshop brought together leading actors in the field, including data Artisans, the University of Virginia and Télécom ParisTech, as well as 9 well-known speakers from 6 different countries. We received more than 30 submissions, and we are proud to have hosted such interesting presentations of papers in data architecture, stream mining, complex event processing and IoT. The workshop was a real success, with captivating talks and a lot of interesting questions and comments. If you could not attend the event, our R&D engineer Syrine Ferjaoui has brought back for you the important elements from the keynotes and the presented papers.

First keynote: The workshop started with the keynote of Fabian Hueske, PMC member at Apache Flink and co-founder of data Artisans. His talk, "Unified Processing of Static and Streaming Data with SQL on Apache Flink", presented Flink's features and its unified relational APIs for batch and streaming data. Fabian Hueske insisted on the importance of unifying stream and batch for two major reasons: usability and portability. Flink includes a set of features such as materialised views to speed up analytical queries, dynamic tables, update propagation and processing, continuous queries, approaches to handle time in stream processing, watermarks, and queries on infinite-sized tables. With all these features, Flink helps its users build data pipelines with low-latency ETL and stream & batch analytics, and power live dashboards. Our research director Sabri Skhiri adds: "Apache Flink is currently working on a set of connectors. They already have the HDFS sink and the JDBC sink, and they are pushing Flink as the standard for unified stream and batch processing."
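As a plain-Python illustration of the watermark idea mentioned above (not Flink code), the sketch below closes event-time tumbling windows once the watermark, here simply the highest event time seen minus an allowed delay, passes the end of a window; the window size and delay are arbitrary toy values.

```python
# Toy event-time tumbling windows closed by a watermark.
def tumbling_window_counts(events, window=10, max_delay=2):
    """events: iterable of (event_time, key). Emits (window_start, count)
    once the watermark (max event time seen - max_delay) passes the end
    of a window. A minimal toy, ignoring keys and state cleanup."""
    open_windows, max_ts = {}, float("-inf")
    for ts, _key in events:
        start = ts - (ts % window)                     # assign to a window
        open_windows[start] = open_windows.get(start, 0) + 1
        max_ts = max(max_ts, ts)
        watermark = max_ts - max_delay                 # tolerate late events
        for s in sorted(w for w in open_windows if w + window <= watermark):
            yield s, open_windows.pop(s)               # window is complete

stream = [(1, "a"), (4, "b"), (11, "a"), (9, "b"), (14, "a"), (23, "b")]
print(list(tumbling_window_counts(stream)))  # -> [(0, 3), (10, 2)]
```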
7 Publications in 2018
At EURA NOVA, we believe investing in research allows us to continuously become more proficient, to keep our know-how at the cutting edge of IT, and to share its benefits with our customers. As we look back on the year 2018, we are both proud and happy to announce that our R&D department has published 7 publications this year.

Firstly, our paper "Pairwise Image Ranking with Deep Comparative Network" was published at the 26th European Symposium on Artificial Neural Networks. The paper, written by our lead R&D engineer Aymen Cherif and Salim Jouili, discusses how using the pair-wise ranking model can provide better results for instance-level image retrieval. Aymen Cherif, Salim Jouili, Pairwise Image Ranking with Deep Comparative Network. ESANN 2018: ES2018-200.

Secondly, our R&D engineer Cécile Pereira co-authored a paper published in Bioinformatics in May 2018. The authors propose a novel end-to-end deep learning approach for biomedical NER tasks that leverages local contexts through n-gram character and word embeddings via a convolutional neural network. Qile Zhu, Xiaolin Li, Ana Conesa, Cécile Pereira, GRAM-CNN: A deep learning approach with local context for named entity recognition in biomedical text, Bioinformatics, May 2018.

In July, our R&D engineer Katsiaryna Krasnashchok was in Melbourne, Australia to attend the ACL conference, where she presented her poster on topic modelling. Her paper, co-written with Salim Jouili, indicates that involving more named entities positively influences the overall quality of topics. Katsiaryna Krasnashchok, Salim Jouili, Improving Topic Quality by Promoting Named Entities in Topic Modeling, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, 2018.

Moreover, our paper "Graph BI & Analytics: Current State and Future Challenges" was accepted for publication and presented at the 20th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2018).
IEEE Big Data 2018: a summary
At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Seattle to attend IEEE Big Data. The conference is one of the most influential in this domain, gathering more than 1100 attendees, 5 keynotes, 9 tutorials, and 8 daily tracks in parallel. Back in Belgium, our R&D director gives you his opinion on the conference itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.

Favourite Talks

Keynote 1: Decentralized Machine Learning – Google AI. The IEEE Big Data conference started with the inspiring keynote of Blaise Agüera y Arcas, a distinguished researcher at Google AI. Our director details: "The straightforward thesis of the talk is that we can, and we must, use mobile devices for local deep neural network computing. Blaise Agüera explained that since the launch of TensorFlow, Google Brain has built specialised hardware servers to run deep neural network computing jobs efficiently. Nowadays, we find on the market specialised chips that are smaller than a one-cent coin and cost less than a cappuccino. Using them, you can run deep neural net computing jobs very efficiently on mobile, at low frequency and low energy, even continuously. For example, the Google camera embeds deep neural nets and does not need to send data to the server side for face or situation detection. But Dr Blaise goes further: he works on reusing existing techniques from distributed neural networks, aggregating the learned gradients on a parameter server and sharing the updates with all devices. This is what we call federated learning, and it has impacted many research areas, such as edge computing. The idea of edge computing is to execute light tasks on the edge of the network."
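As a minimal, self-contained sketch of the federated-averaging idea described in the keynote: each "device" computes a gradient on its own local data, and only the gradients (never the raw data) are averaged on a parameter server. A linear model keeps the example small; the model, data and learning rate are illustrative assumptions.

```python
# Toy federated averaging: gradients travel, data stays on-device.
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean squared error for a linear model, computed on-device."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(42)
w_true = np.array([2.0, -1.0])
devices = []
for _ in range(5):                       # each device holds its own dataset
    X = rng.normal(size=(20, 2))
    devices.append((X, X @ w_true + 0.1 * rng.normal(size=20)))

w = np.zeros(2)                          # global model on the server
for _round in range(100):                # one communication round per step
    grads = [local_gradient(w, X, y) for X, y in devices]  # on each device
    w -= 0.1 * np.mean(grads, axis=0)    # server averages and applies update

print(w)  # close to w_true, without any raw data leaving the "devices"
```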
Improving Topic Quality by Promoting Named Entities in Topic Modeling
In July, our R&D engineer Katsiaryna Krasnashchok was in Melbourne, Australia to attend the ACL conference, where she presented her poster on topic modelling. Her paper, co-written with Salim Jouili, indicates that involving more named entities positively influences the overall quality of topics. News-related content has been extensively studied in both topic modeling research and named entity recognition. However, the expressive power of named entities and their potential for improving the quality of discovered topics have not received much attention. In this paper, we use named entities as domain-specific terms for news-centric content and present a new weighting model for Latent Dirichlet Allocation. Our experimental results indicate that involving more named entities in topic descriptors positively influences the overall quality of topics, improving their interpretability, specificity and diversity. Katsiaryna Krasnashchok, Salim Jouili, Improving Topic Quality by Promoting Named Entities in Topic Modeling, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, 2018. Click here to access the paper.
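To illustrate the general idea of promoting named entities in LDA (a simplification, not the paper's exact weighting model), the sketch below up-weights entity terms in the document-term matrix before fitting scikit-learn's LDA; the documents, the hard-coded entity list and the flat multiplier are all toy assumptions.

```python
# Toy entity-weighted LDA: boost named-entity columns before fitting.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Apple unveiled the new iPhone in California",
        "Google and Microsoft compete on cloud services",
        "The recipe needs flour, sugar and butter"]
entities = {"apple", "iphone", "california", "google", "microsoft"}  # toy NER output

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray().astype(float)

# Up-weight the columns that correspond to named entities.
boost = np.array([2.0 if t in entities else 1.0
                  for t in vec.get_feature_names_out()])
X_weighted = X * boost

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_weighted)
for k, topic in enumerate(lda.components_):
    top = np.argsort(topic)[::-1][:4]   # entity terms now dominate topic descriptors
    print(k, [vec.get_feature_names_out()[i] for i in top])
```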