Facebook AI has built and open-sourced BlenderBot, the largest-ever open-domain chatbot. It outperforms others in terms of engagement and also feels more human, according to human evaluators.
The culmination of years of research in conversational AI, this is the first chatbot to blend a diverse set of conversational skills — including empathy, knowledge, and personality — together in one system.
We achieved this milestone through a new chatbot recipe that includes improved decoding techniques, novel blending of skills, and a model with 9.4 billion parameters, which is 3.6x more than the largest existing system.
Today we’re releasing the complete model, code, and evaluation set-up, so that other AI researchers will be able to reproduce this work and continue to advance conversational AI research.
Conversation is an art that we practice every day — when we’re debating food options, deciding the best movie to watch after dinner, or just discussing current events to broaden our worldview. For decades, AI researchers have been working on building an AI system that can converse as well as humans can: asking and answering a wide range of questions, displaying knowledge, and being empathetic, personable, engaging, serious, or fun, as circumstances dictate. So far, systems have excelled primarily at specialized, preprogrammed tasks, like booking a flight. But truly intelligent, human-level AI systems must effortlessly understand the broader context of the conversation and how specific topics relate to each other.
As the culmination of years of our research, we’re announcing that we’ve built and open-sourced">open-sourced BlenderBot, the largest-ever open-domain chatbot. It outperforms others in terms of engagement and also feels more human, according to human evaluators. This is the first time a chatbot has learned to blend several conversational skills — including the ability to assume a persona, discuss nearly any topic, and show empathy — in natural, 14-turn conversation flows. Today we’re sharing new details of the key ingredients that we used to create our new chatbot.
Some of the best current systems have made progress by training high-capacity neural models with millions or billions of parameters using huge text corpora sourced from the web. Our new recipe incorporates not just large-scale neural models, with up to 9.4 billion parameters — or 3.6x more than the largest existing system — but also equally important techniques for blending skills and detailed generation.
Chatbot recipe: Scale, blending skills, and generation strategies
Scale
Common to other natural language processing research today, the first step in creating our chatbot was large-scale training. We pretrained large (up to 9.4 billion) Transformer neural networks on large amounts of conversational data. We used previously available public domain conversations that involved 1.5 billion training examples of extracted conversations. Our neural networks are too large to fit on a single device, so we utilized techniques such as column-wise model parallelism, which allows us to split the neural network into smaller, more manageable pieces while maintaining maximum efficiency. Such careful organization of our neural networks enabled us to handle larger networks than we could previously while maintaining the high efficiency needed to scale to terabyte-size data sets.
Blending skills
While learning at scale is important, it’s not the only ingredient necessary for creating the best possible conversationalist. Learning to mimic the average conversation in large-scale public training sets doesn’t necessarily mean that the agent will learn the traits of the best conversationalists. In fact, if not done carefully, it can make the model imitate poor or even toxic behavior. We recently introduced a novel task called Blended Skill Talk (BST) for training and evaluating these desirable skills. BST consists of the following skills, leveraging our previous research:
Engaging use of personality (PersonaChat)
Engaging use of knowledge (Wizard of Wikipedia)
Display of empathy (Empathetic Dialogues)
Ability to blend all three seamlessly (BST)
Blending these skills is a difficult challenge because systems must be able to switch between different tasks when appropriate, like adjusting tone if a person changes from joking to serious. Our new BST data set provides a way to build systems that blend and exhibit these behaviors. We found that fine-tuning the model with BST has a dramatic effect on human evaluations of the bot’s conversational ability.
Generation strategies
Training neural models is typically done by minimizing perplexity, which measures how well models can predict and generate the next word. However, to make sure conversational agents don’t repeat themselves or display other shortcomings, researchers typically use a number of possible generation strategies after the model is trained, including beam search, next token sampling, and n-gram blocking. We find that the length of the agent’s utterances is important in achieving better results with human evaluators. If they’re too short, the responses are dull and communicate a lack of interest; if they’re too long, the chatbot seems to waffle and not listen. Contrary to recent research, which finds that sampling outperforms beam search, we show that a careful choice of search hyperparameters can give strong results by controlling this trade-off. In particular, tuning the minimum beam length gives important control over the “dull versus spicy” spectrum of responses.