6 Jul, 2021

Developments in AI: Language models


There is much excitement in the machine learning community surrounding language models (LMs), neural networks trained to “understand” the intricacies of language, semantics, and grammar. These have revolutionised natural language processing (NLP). In this newsletter we’ll go over what they are, some examples of what they can do, and the ethical implications of their use, which we must consider as a community.

LMs transform sentences into numerical (vector) representations, which are subsequently used as inputs to a more traditional machine learning model, such as a classifier or regressor. They do this by modelling the statistical distribution of words in sentences: they are trained to predict the most likely word at a given position in a sentence, given the surrounding context. The LM does much of the heavy lifting by finding a useful, compact representation of a sentence’s meaning as a fixed-length vector of real numbers.
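To make the "predict the word from its context" idea concrete, here is a minimal sketch in Python. It is a toy count-based model, not a neural network, and the corpus and the names `train` and `predict` are invented for illustration. The training objective has the same flavour as a real LM's: estimate which word is most likely in a gap, given its neighbours.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real LM is trained on gigabytes of text.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat sat on the sofa",
    "a dog slept on the rug",
]

def train(sentences):
    """Count which word appears between each (left, right) pair of neighbours."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with start/end markers so the first and last words have context.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for left, middle, right in zip(words, words[1:], words[2:]):
            counts[(left, right)][middle] += 1
    return counts

def predict(counts, left, right):
    """Return candidate words for the gap with estimated probabilities."""
    candidates = counts.get((left, right), Counter())
    total = sum(candidates.values())
    return {word: n / total for word, n in candidates.most_common()}

model = train(CORPUS)
# Which word is most likely in "the ___ sat"?
print(predict(model, "the", "sat"))
```

In this toy corpus "cat" fills the gap twice and "dog" once, so the model rates "cat" as the likelier word. A neural LM replaces the lookup table with learned vectors, which lets it generalise to contexts it has never seen verbatim.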

Leveraging this approach, the introduction of BERT (Bidirectional Encoder Representations from Transformers) in 2018 by Google researchers constituted a serious paradigm shift, outperforming the previous state-of-the-art LMs on eleven NLP tasks. BERT improved the best score on the GLUE benchmark, which evaluates models’ ‘understanding’ of test sentences, by 7.7 percentage points. The benchmark had previously been dominated by a type of recurrent neural network called an LSTM (Long Short-Term Memory). A lot of the success can be attributed to a powerful new neural network architecture known as the Transformer [1], which has since been widely adopted in other NLP frameworks, computer vision, and time-series modelling. Transformers are now a state-of-the-art neural architecture, bringing performance gains over traditional sequence models thanks to their computational and data efficiency.

A major advantage of using LMs is that only a relatively small amount of labelled data is required to solve a supervised learning problem. Raw, unlabelled data is used to train the LM, for example the text of Wikipedia articles or Reddit posts – you just need a very large corpus of human-written text. Once the LM “understands” language, you can fine-tune it on a specific task with a handful of manually labelled examples to get good results. For example, only a handful of Amazon reviews, labelled by their ‘star’ rating, are required to train a product sentiment classifier if an LM is used. BERT is now used in production for almost every English-language Google search query [2].
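The two-stage recipe can be sketched with a deliberately tiny stand-in. Everything here is an assumption made for illustration: the "pre-training" step just counts word co-occurrences in a made-up unlabelled corpus, and the "fine-tuning" step is a nearest-example classifier over two labelled reviews rather than true gradient-based fine-tuning. The point it illustrates is that the labelled set can be very small, because the representations are learned from unlabelled text.

```python
import math
from collections import Counter, defaultdict

# Step 1 ("pre-training"): learn word vectors from unlabelled text only.
UNLABELLED = [
    "great excellent superb product",
    "excellent superb quality",
    "terrible awful dreadful product",
    "awful dreadful quality",
]

def learn_word_vectors(sentences):
    """Represent each word by counts of the words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += 1
    return vectors

def embed(sentence, vectors):
    """Sentence vector = sum of its word vectors (unknown words are skipped)."""
    total = Counter()
    for word in sentence.split():
        total.update(vectors.get(word, {}))
    return total

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 2 (stand-in for fine-tuning): only two labelled examples.
LABELLED = [("great product", "positive"), ("terrible product", "negative")]

def classify(sentence, vectors, labelled):
    """Return the label of the most similar labelled example."""
    query = embed(sentence, vectors)
    scored = [(cosine(query, embed(text, vectors)), label) for text, label in labelled]
    return max(scored)[1]

vectors = learn_word_vectors(UNLABELLED)
print(classify("superb quality", vectors, LABELLED))
print(classify("dreadful quality", vectors, LABELLED))
```

Neither "superb" nor "dreadful" appears in the labelled examples. The classifier still separates them because the unlabelled corpus placed each word near others of similar sentiment.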

After BERT, which had 340 million parameters and was trained on about 16 GB of text, there was a quick succession of bigger and bigger LMs, with steady improvements in sentence “understanding” metrics as well as in performance on their intended tasks. A few examples include:

  • (2019) XLNet (340 million parameters, 126 GB of data)
  • (2019) MegatronLM (8.3 billion parameters, 174 GB of data)
  • (2020) T-NLG (17 billion parameters, 174 GB of data)
  • (2020) GPT-3 (175 billion parameters, 570 GB of data)
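To put these figures in perspective, the short calculation below compares each model with BERT and estimates how much memory its weights alone would occupy. The parameter and data figures are the ones quoted above. The 4 bytes per parameter is an assumption (32-bit floating point storage), so the results are rough estimates rather than published numbers.

```python
# (name, parameters, training data in GB), as quoted in the text above.
MODELS = [
    ("BERT", 340_000_000, 16),
    ("XLNet", 340_000_000, 126),
    ("MegatronLM", 8_300_000_000, 174),
    ("T-NLG", 17_000_000_000, 174),
    ("GPT-3", 175_000_000_000, 570),
]

BYTES_PER_PARAM = 4  # assumption: each parameter stored as a 32-bit float

def weights_gb(params):
    """Approximate size of the model weights in gigabytes."""
    return params * BYTES_PER_PARAM / 1e9

bert_params = MODELS[0][1]
for name, params, data_gb in MODELS:
    ratio = params / bert_params
    print(f"{name:<11} {ratio:6.1f}x BERT's parameters, "
          f"~{weights_gb(params):6.1f} GB of weights, {data_gb} GB of text")
```

Under that assumption, BERT's weights come to about 1.4 GB, while GPT-3's come to roughly 700 GB, more than 500 times as many parameters in about two years.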

GPT-3 is broadly the state of the art, and is essentially an LM trained on a large share of the text on the public internet. It is an extremely powerful tool and, because it is publicly accessible, has been used for some mind-blowing applications (see here for some examples: https://www.educative.io/blog/top-uses-gpt-3-deep-learning).

GPT-3 based models can also generate code (HTML, CSS, JavaScript) for a website from a one-sentence prompt, write narrative stories, and even play the role of ‘Albert Einstein’ teaching us about astrophysics [3]:

Me: Teach me about black holes.

Albert Einstein: A black hole is formed when a star has burnt all of its fuel. There are two ways this can happen. If the mass of the star is less than a certain limit, known as the Chandrasekhar Limit, then it can end in a Type 1a supernova explosion. When the mass of the star is above that limit, however, it can collapse in on itself. The reason for this is that the star is being pulled in on itself by its own gravity. It is being accelerated due to its own gravity. When a force is applied to something that is being accelerated, its momentum increases. The more massive the object is, the more force it takes to keep it from slowing down.

Me: what is Chandrasekhar Limit?

Albert Einstein: The Chandrasekhar Limit is 1.44 solar masses. An object with a mass less than this limit will not collapse into a black hole. If it is heavier than this limit, then its gravity will not be strong enough to fight against itself and it will collapse into a black hole.

Me: who is Chandrasekhar

Albert Einstein: Chandrasekhar was a scientist who calculated this mass limit.

Me: When did Chandrasekhar die?

Albert Einstein: He died in 1995 at the age of 84.

While this is clearly extremely impressive technology, there are ongoing concerns about the environmental and ethical consequences of such powerful software. Training a BERT model (by now considered a relatively small LM) has been estimated to consume as much energy as a trans-American flight [4]. This is exacerbated by the fact that the model is often trained a few times to trial different hyperparameters. Cloud computing companies generally use some renewable energy sources and/or carbon credit offsetting, but the majority of energy used is non-renewable [5].

Furthermore, and potentially more worrying, Bender et al. [5] note that the datasets used to train massive LMs vastly overrepresent racist, misogynistic, and white-supremacist views, which they suggest is a result of the predominance of this sort of text on the English internet. Machine learning models cannot be separated from their training data, and essentially replicate the patterns observed in training. McGuffie & Newhouse [6] show that it is relatively easy to use GPT-3 to generate large quantities of grammatically coherent racist or extremist text, which can then be used, for example, to swiftly populate forums and message boards with the intent to radicalise human readers.

The AI community has yet to agree on approaches for addressing such problems, but the consensus will likely involve a push towards better curated training data for powerful models. For example, Google have pushed this forward in image-based training data by releasing the ‘More Inclusive Annotations for People’ image dataset. This changes labels of humans within images from (person, man, woman, boy, girl) to (person), with secondary gender labelling of (predominantly feminine, predominantly masculine, or unknown) and age labelling of (young, middle, older, or unknown) [7]. On the NLP side, the ‘Translated Wikipedia Biographies’ dataset aims to provide a mechanism for assessing common gender errors in machine translation, such as an implicit grammatical assumption that ‘doctor’ refers to a man [8].

In this month’s Arabesque AI newsletter, we’ve discussed language modelling and some powerful examples of its use, and highlighted a handful of concerns about it. There’s no doubt that LM technology is extremely powerful and effective at the task it has been trained to perform, but as a community we must be aware of potential ethical caveats, as well as the evolution of real-world dangers.

Dr Tom McAuliffe – with thanks to Dr Isabelle Lorge (both Arabesque AI)


[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In Advances in Neural Information Processing Systems. 2017.

[2] https://searchengineland.com/google-bert-used-on-almost-every-english-query-342193 (accessed 26/06/21)

[3] https://news.ycombinator.com/item?id=23870595 (accessed 26/06/21)

[4] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP.” arXiv preprint arXiv:1906.02243. 2019.

[5] Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610-623. 2021.

[6] McGuffie, Kris, and Alex Newhouse. “The radicalization risks of GPT-3 and advanced neural language models.” arXiv preprint arXiv:2009.06807. 2020.

[7] Schumann, Candice, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. “A Step Toward More Inclusive People Annotations for Fairness.” arXiv preprint arXiv:2105.02317. 2021.

[8] https://ai.googleblog.com/2021/06/a-dataset-for-studying-gender-bias-in.html (accessed 26/06/21)