Last month our R&D Project Director Cécile Pereira and our PhD student Léo Bouscarrat travelled to Florence to attend and present to ACL 2019. ACL is one of the biggest conferences in Natural Language Processing. This year all the records were broken with more than 3500 attendees, 660 accepted papers to the main conference, 9 tutorials and more than 20 workshops. All the talks of the main conference were recorded and are accessible online. In this article, Cécile and Léo share with you the latest trends from the conference!
A new paradigm in NLP?
This year, ACL’s selection of topics has shown the importance that has taken self-training methods such as BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019). These methods consist of feeding huge models with a vast amount of data and then train them on easy tasks (for example, predict masked words in the original sentence or predict if two sentences are following each other).
These models should be able to learn faster and with less data on a more specific and complex task. With this method, the way to train a model to solve an NLP task has changed. Here is this new paradigm:
- Select a pre-trained model (trained with self-training)
- Add a layer on the output of this model (it will depend on your task) and fine-tune the model by giving the inputs and outputs of your task
- Evaluate your model
Many papers were using this paradigm to achieve state of the art on several tasks (out of the 660 papers of the main conference, 47 have the word BERT in their abstract).
Contextual embeddings, like BERT, take into account the context of the sentence into the embeddings of the words. BERT can be used for a large variety of tasks including but not limited to classification (Reimers et al., Chalkidis et al.), named entity recognition (Arkhipov et al., Emelyanov and Artemova) and question answering (Li et al., Liu et al.).
So it is working. But the remaining question is why?
Several presentations discussed the explainability of BERT (for example Jawahar et al. and Clark et al.). Those papers discuss that, as the different layers are learning different things, the different heads seems to specialize in certain types of words or certain syntactic or semantic task.
The conference highlighted the need for adversarial training and testing as those models are very good to learn bias in the dataset (Niven and Yao, Jiang and Bansal). For those not familiar with the concept, adversarial training and testing consist to train and/or test on an adversarial dataset. This dataset is composed of examples, often generated ones, where the model fails to predict the correct answers. Adversarial training is generally used to verify if the models learn bias in the dataset (like the negation in Niven and Yao). It can also improve the quality of the models.
Improving the experiments in NLP
Several presentations showed that adversarial training can improve the results and robustness of the model (Zhu et al and Jiang and Bansal, Mohit Bansal slides available here).
The meeting was also a moment to discuss the impact of the use of standard splits on benchmark data. Standard split means that, if you want to work on a specific task, you will generally look for the training, validation and test splits used in other publications and use the same.
However, Gorman and Bedrick argue that the use of random split should be preferred. They explain it by trying to reproduce the results of nine part-of-speech taggers on a specific dataset. They reproduced the same rankings on the standard splits. However, when they did it on random splits, the ranks of the taggers, considering the same metric, varied.
This showed that getting a better ranking on a specific split doesn’t mean that you are better in general. Since in some fields of research, the improvements between each paper are small, the use of standard split does not guarantee that a model is really better than another one on the task. Random splits could improve this by adding a notion of variance on the performances.
The last trend in NLP consists of using models or embeddings learned on huge datasets of general data from sources such as Wikipedia, books or newspapers.
When you want to work on specialised domains such as the biomedical, legal or financial domains, you need specialised embeddings. However, you generally don’t have enough specialised data to re-train the embeddings or the models.
A solution is to use and modify pre-trained models for your specific task. This is called domain adaptation. There are several ways to do it. For example, Boukkouri et al combined a general embedding and a smaller one learned on their domain. Hu et al fine-tuned a general model on their data. These methods allow using recent models (which needs a lot of data) on some specific domains that do not fit those requirements.
Machine translation is still a huge topic with no less than 46 papers in the main conference (according to the ACL 2019 chair blog post), an entire two-days workshop dedicated to it and Liang Huang invited talk. Liang Huang is a principal scientist of Baidu Silicon Valley AI Lab who talked about the current state of simultaneous translation and Baidu research’s new approach. They were able to do an English/Chinese translation with 3 seconds of delay only. The demo is available here: https://simultrans-demo.github.io/. One can also notice that the ACL best long paper award was on this topic (Zhang et al.)!
Conversational systems (also called chatbots) were also a trendy topic, with 52 papers, a workshop, and the invited talk from Pascale Fung.
Pascale Fung is a Professor at the Hong Kong University of Science & Technology. She presented the state of the art of conversational systems. For her, recent advances are going in three directions: learning to memorise, learning to personalise and learning to empathise. She presented her current work on conversational systems that can empathise, showing that improvements have been made but there is still work to do. She ended with questions about the ethics of this sector: how can we build systems that are secure, safe and fair for all?
Knowledge graphs are also pretty trendy, they seem to be a good way to add knowledge to models. It can be used for Question-Answering or Conversational systems. The blog post of Michael Galkin makes a review of the most interesting articles in this sector.
Bias in NLP
After recent papers showed that models in NLP are biased (Bolukbasi et al., 2016 ; Caliskan et al, 2017) there is more and more work on what we can do about that, reflected by a session and a workshop during the meeting (https://genderbiasnlp.talp.cat/).
Several works about removing gender bias from models have been previously published. But the work of Gonen and Goldberg explains that, for now, it’s only “Lipstick on a pig”.
We observed two main areas on the topic. Firstly, removing/controlling gender bias in the models (like in automatic translation, Habash et al., Escudé Font et al., Ik Cho et al.). Secondly, measuring bias in the models and society (with articles proposed by sociologists, like Karve et al., Hitti et al., Basta et al., Kurita et al.).
There were several papers about summarization (including our own paper https://arxiv.org/abs/1907.07323) which have been summarized by RecitalAI on their GitHub.
ACL was a great place to measure the trends in the NLP field. As models are becoming better, data scientists are applying them to a large variety of topics including automatic translation, search engine, and chatbots.
As the NLP community and topics are becoming bigger and bigger, we hope that this summary of our biased takeaways from the meeting could help you navigate the nearly 700 ACL papers of this year.