At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Los Angeles to attend IEEE Big Data Conference. It is one of the most influential academic gatherings in distributed machine learning. This year, it featured 879 authors, shortlisted from 2009 applicants. They came from 28 countries and presented 210 papers. Back in Belgium, Sabri and Syrine give you their opinion on the event itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.
The Big Trends
Sabri says: “The main trends were deep learning, NLP, privacy-preserving approaches, GAN, graph mining and stream mining. In my view, the level of the papers was quite good. Authors are becoming ever more skilled in data science, maths and algorithms. This goes to show that to be a good data scientist, you need an extensive set of advanced skills. Interestingly, there was almost nothing about distributed computing! This is a big move compared to the previous editions. The only presentations that had something to do with distributed systems were about optimisation strategies, an area similar to what our ECCO team researches. The Big Data Conference focuses on data science; it does not really look into its scalability. Distributed computing topics tend to be dealt with at conferences like DEBS, VLDB, USENIX, SIGMOD, etc. As a result, this conference is an amazing place to see hundreds of data science use cases with, most of the time, an interesting contribution.”
The keynotes were focused on data science as well. We even heard the term “Big Data Science”.
Keynote 1: Responsible Data Science by Lise Getoor – Professor at UC Santa Cruz
Syrine says: “The first keynote was my favourite. Lise started by comparing machine learning to a black box. The goal was to unpack the box and invite people to use data science and to use it wisely. To autonomise ethical decision-making, we should move away from maximising AI systems autonomy and move toward human-centric systems. To do this, we should make sure that human-centric systems have three qualities: (1) be knowledge-based, (2) be data-driven, and (3) support human values. Achieving responsible data science requires both machine-learning and ethics.”
Keynote 2: DataCommons “Google for Data” by Ramanathan Guha – Google
Guha presented DataCommons, a project started by Google to combine data from different open sources. Syrine explains: “Google’s DataCommons project allows users to pretend that the Web is one website, enabling developers to pretend all this data is in one database. The long-term vision of Google is to aggregate all data from publicly available sources (Medicare, Wikidata, sequence data, Landsat, CDC, Census…) into a single Open Knowledge Graph. The goal is to reduce or eliminate the data download-clean-store process. Instead, users can access and use already cleaned data in the cloud. Data can be public or private (internet & intranet). This will avoid repeated data wrangling and ease the burden of data storage, indexing, etc.”
This year, IEEE Big Data held nine tutorials. Our R&D director explains: “At this type of events, tutorials are always a good way to learn a complete state of the art in a couple of hours. I particularly appreciated the tutorial on “Taming Unstructured Big data: Automated Information Extraction for Massive Text” by the team of the famous Jiawei Han (he is a kind of pop star in data mining and the father of Graph Cube). I found out that many papers about named entity relations were published in the past two years. The idea is to be able to extract supervised, semi-supervised, and unsupervised relations between entities: for instance, discovering that “Trump” is “President of” “USA”. They also propose new approaches to integrate knowledge bases such as DBPedia or YAGO to infer new unknown relations from a corpus. This is just amazing!”
Syrine adds: “The tutorial on NewSQL principles, systems, and current trends was interesting as it explained why we should consider using NoSQL/NewSQL to deal with data interconnections and very high scalability. After attending this tutorial, I was motivated to order this book about Principles of Distributed Database Systems. For fans of deep learning, the tutorial “Deep Learning on Big Data with Multi-Node GPU Jobs” covers a lot about large-scale GPU-based deep-learning systems. If you missed the conference, all resources can be found on this link.”
The EURA NOVA research centre organised the fourth workshop on Real-time and Stream Analytics in Big Data, at the 2019 IEEE conference on Big Data. We were really happy to welcome Matteo Merli from Apache Pulsar and John Roesler from Confluent as keynotes speakers. Thank you to them and to all the attendees and speakers! They had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event will soon be available on our website. The slides of the keynotes are available here:
A personal selection of interesting papers:
- Subspace Clustering with Active Learning (Hankui Peng)
The paper tackles a really interesting problematic faced by a lot of data scientists. Introducing active learning is a cool idea and so is the way they used a mathematical trick to make their approach feasible.
- High Impact Customer Acquisition & Retention Modelling – A Scalable Data Mashup Approach (Kajanan Sangaralingam, Nisha Verma, Aravind Ravi, Su Won Bae)
Su Won Bae, from Mobilewalla, presented how they can define a complete customer acquisition model by mixing their data with their customer data (in this case, a worldwide leader in food delivery). Sabri says: “The quality of data science models highly depends on the data they can train on. I am convinced we will go in the same direction as Mobilewalla in the future to have richer models. However, mixing data must be done with care as it may raise some privacy issues; our purpose has to have legal ground.”
- Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings (Ahmed El-Kishky, Frank Xu, Aston Zhang, Jiawei Han)
The speaker presented MorphMine, a method for unsupervised morpheme segmentation. It can generate morpheme candidates that are filtered out using entropy to select the best morphemes from a corpus. Then, these morphemes can be used to highly improve the word embedding model and the downstream machine learning tasks.