The culmination of years of research in conversational AI, this is the first chatbot to blend a diverse set of conversational skills — including empathy, knowledge, and personality — together in one system.
We achieved this milestone through a new chatbot recipe that includes improved decoding techniques, novel blending of skills, and a model with 9.4 billion parameters, which is 3.6x more than the largest existing system.
Today we’re releasing the complete model, code, and evaluation set-up, so that other AI researchers will be able to reproduce this work and continue to advance conversational AI research.
Conversation is an art that we practice every day — when we’re debating food options, deciding the best movie to watch after dinner, or just discussing current events to broaden our worldview. For decades, AI researchers have been working on building an AI system that can converse as well as humans can: asking and answering a wide range of questions, displaying knowledge, and being empathetic, personable, engaging, serious, or fun, as circumstances dictate. So far, systems have excelled primarily at specialized, preprogrammed tasks, like booking a flight. But truly intelligent, human-level AI systems must effortlessly understand the broader context of the conversation and how specific topics relate to each other.
As the culmination of years of our research, we’re announcing that we’ve built and open-sourced">open-sourced BlenderBot, the largest-ever open-domain chatbot. It outperforms others in terms of engagement and also feels more human, according to human evaluators. This is the first time a chatbot has learned to blend several conversational skills — including the ability to assume a persona, discuss nearly any topic, and show empathy — in natural, 14-turn conversation flows. Today we’re sharing new details of the key ingredients that we used to create our new chatbot.
Some of the best current systems have made progress by training high-capacity neural models with millions or billions of parameters using huge text corpora sourced from the web. Our new recipe incorporates not just large-scale neural models, with up to 9.4 billion parameters — or 3.6x more than the largest existing system — but also equally important techniques for blending skills and detailed generation.
Chatbot recipe: Scale, blending skills, and generation strategies
Common to other natural language processing research today, the first step in creating our chatbot was large-scale training. We pretrained large (up to 9.4 billion) Transformer neural networks on large amounts of conversational data. We used previously available public domain conversations that involved 1.5 billion training examples of extracted conversations. Our neural networks are too large to fit on a single device, so we utilized techniques such as column-wise model parallelism, which allows us to split the neural network into smaller, more manageable pieces while maintaining maximum efficiency. Such careful organization of our neural networks enabled us to handle larger networks than we could previously while maintaining the high efficiency needed to scale to terabyte-size data sets.
While learning at scale is important, it’s not the only ingredient necessary for creating the best possible conversationalist. Learning to mimic the average conversation in large-scale public training sets doesn’t necessarily mean that the agent will learn the traits of the best conversationalists. In fact, if not done carefully, it can make the model imitate poor or even toxic behavior. We recently introduced a novel task called Blended Skill Talk (BST) for training and evaluating these desirable skills. BST consists of the following skills, leveraging our previous research:
Engaging use of personality (PersonaChat)
Engaging use of knowledge (Wizard of Wikipedia)
Display of empathy (Empathetic Dialogues)
Ability to blend all three seamlessly (BST)
Blending these skills is a difficult challenge because systems must be able to switch between different tasks when appropriate, like adjusting tone if a person changes from joking to serious. Our new BST data set provides a way to build systems that blend and exhibit these behaviors. We found that fine-tuning the model with BST has a dramatic effect on human evaluations of the bot’s conversational ability.
Training neural models is typically done by minimizing perplexity, which measures how well models can predict and generate the next word. However, to make sure conversational agents don’t repeat themselves or display other shortcomings, researchers typically use a number of possible generation strategies after the model is trained, including beam search, next token sampling, and n-gram blocking. We find that the length of the agent’s utterances is important in achieving better results with human evaluators. If they’re too short, the responses are dull and communicate a lack of interest; if they’re too long, the chatbot seems to waffle and not listen. Contrary to recent research, which finds that sampling outperforms beam search, we show that a careful choice of search hyperparameters can give strong results by controlling this trade-off. In particular, tuning the minimum beam length gives important control over the “dull versus spicy” spectrum of responses.
Putting our recipe to the test
To evaluate our model, we benchmarked its performance against Google’s latest Meena chatbot through pairwise human evaluations. Since their model has not been released, we used the roughly 100 publicly released and randomized logs for this evaluation. Using the ACUTE-Eval method, human evaluators were shown a series of dialogues between humans paired with each respective chatbot. They were asked:
“Who would you prefer to talk to for a long conversation?” (showing engagingness)
“Which speaker sounds more human?” (showing humanness)
When presented with chats showing Meena in action and chats showing BlenderBot in action, 67 percent of the evaluators said that our model sounds more human, and 75 percent said that they would rather have a long conversation with BlenderBot than with Meena.
Further analysis via human evaluation underscored the importance of both blending skills and choosing a generation strategy that produces nonrepetitive, detailed responses. In an A/B comparison between human-to-human and human-to-BlenderBot conversations to measure engagement, models fine-tuned with BST tasks were preferred 49 percent of the time to humans, while models trained only on public domain conversations were preferred just 36 percent of the time.
Decoding strategies, such as beam blocking and controlling for the minimum beam length, also had a large impact on results. After we removed the minimum beam length constraint, the model’s responses were roughly half the length and the performance of our BST models went down, from 49 percent to 21 percent. These results show that while scaling models is important, there are other, equally important parts of the chatbot recipe.
In this graph, we show how often human evaluators preferred our chatbots to human-to-human chats over time. Since 2018, we’ve improved model performance in this evaluation --- from 23% in 2018 to 49% today.
Over the past few years, w've doubled the performance of our chatbot models through various key model improvements, like Specificity Control, Poly-Encoders, and the recipe described in this blog post. Our latest model’s performance is nearly equal to human-level quality in this specific test setup. This would suggest that we have achieved near human-level performance for this type of evaluation; however, our chatbot still has many weaknesses relative to humans, and finding an evaluation method that better exposes these weaknesses is an open problem and part of our future research agenda.
We’re excited about the progress we’ve made in improving open-domain chatbots. However, we are still far from achieving human-level intelligence in dialogue systems. Though it’s rare, our best models still make mistakes, like contradiction or repetition, and can “hallucinate” knowledge, as is seen in other generative systems. Human evaluations are also generally conducted using relatively brief conversations, and we’d most likely find that sufficiently long conversations would make these issues more apparent.
We’re currently exploring ways to further improve the conversational quality of our models in longer conversations with new architectures and different loss functions. We’re also focused on building stronger classifiers to filter out harmful language in dialogues. And we’ve seen preliminary success in studies to help mitigate gender bias in chatbots.
True progress in the field depends on reproducibility --- the opportunity to build upon the best technology possible. We believe that releasing models is essential to enable full, reliable insights into their capabilities. That’s why we’ve made our state of the art open-domain chatbot publicly available through our dialogue research platform ParlAI. By open-sourcing code for fine-tuning and conducting automatic and human evaluations, we hope that the AI research community can build on this work and collectively push conversational AI forward.