BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Pre-train deep bidirectional representations once, then fine-tune them for a wide range of language understanding tasks

The major limitation of earlier pre-training approaches is that standard language models are unidirectional, which restricts the choice of architectures that can be used during pre-training. Such restrictions are sub-optimal for sentence-level tasks, and they can be especially harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

BERT alleviates this unidirectionality constraint by using a “masked language model” (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The MLM objective randomly masks some of the input tokens, and the model must predict the original tokens from both the left and right context. In addition to the masked language model, BERT uses a “next sentence prediction” (NSP) task that jointly pre-trains text-pair representations by predicting whether one sentence actually follows another in the original text.
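
As a rough illustration of the two objectives, here is a minimal sketch assuming the Hugging Face `transformers` library and the publicly released `bert-base-uncased` checkpoint (neither is part of the original paper). It fills in a `[MASK]` token using bidirectional context, then scores a sentence pair for next-sentence prediction.

```python
# A minimal sketch of BERT's two pre-training objectives, assuming the
# Hugging Face `transformers` library and the `bert-base-uncased` checkpoint.
import torch
from transformers import pipeline, BertTokenizer, BertForNextSentencePrediction

# Masked language modeling: predict the token hidden behind [MASK]
# using context from both the left and the right of the gap.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The man went to the [MASK] to buy milk."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")

# Next sentence prediction: score whether sentence B follows sentence A.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
inputs = tokenizer("The man went to the store.",
                   "He bought a gallon of milk.",
                   return_tensors="pt")
with torch.no_grad():
    logits = nsp_model(**inputs).logits  # index 0 = "is next", 1 = "is not next"
print(torch.softmax(logits, dim=-1))
```

Note that these checkpoints were produced by the full pre-training procedure; the sketch only probes the two objectives, it does not reproduce the pre-training itself.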