Transformer models have been boosting the field of NLP for a few years now, and every now and then, new additions make them even more capable. Longformer is one such extension: it makes Transformers usable for long texts.
While they are applied to many tasks - think machine translation, text summarization and named-entity recognition - classic Transformers have always faced difficulties when texts become too long. This results from the self-attention mechanism applied in these models, whose time and memory consumption scales quadratically with sequence length.
Longformer opens up Transformers to long texts by introducing a sparse attention mechanism and combining it with a global, task-specific one. We cover the details of that mechanism in another article. In this tutorial, you're going to work with actual Longformer instances, for a variety of tasks. More specifically, after reading it, you will know...

- How to use Longformer for question answering.
- How to use Longformer (as part of an encoder-decoder architecture) for text summarization.
- How to use Longformer for masked language modeling, i.e. predicting missing text.
Let's take a look! 🚀
Ever since Transformer models were introduced in 2017, they have brought about great change in the world of NLP. With a variety of architectures, such as BERT and GPT, a wide range of language tasks have been improved to sometimes human-level quality... and in addition, with libraries like HuggingFace Transformers, applying them has been significantly democratized.
As a consequence, we can now create pipelines for machine translation, text summarization and named-entity recognition with only a few lines of code.
Classic Transformers - including GPT and BERT - have one problem though: the time and memory complexity of the self-attention function. As you may recall, this function computes attention over queries, keys and values - the matrices \(Q\), \(K\) and \(V\) generated from the input embeddings - and, more specifically, it performs a multiplication of the form \(QK^T\). This multiplication is quadratic: time and memory complexity increase quadratically with sequence length.
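For reference, this is the standard scaled dot-product attention from the original Transformer paper. For a sequence of \(n\) tokens, \(QK^T\) is an \(n \times n\) matrix - which is exactly where the quadratic cost comes from:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]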
The consequence is that when your sequences (and thus your input length) are really long, Transformers cannot process them anymore - simply because too much time or too many resources are required. To mitigate this, classic Transformers and BERT- and GPT-like approaches truncate text and sometimes adapt their architecture to specific tasks.
What we really want, however, is a Transformer that can handle long texts without the need for any significant changes.
That's why Longformer was introduced. It changes the attention mechanism by applying dilated sliding window attention, where attention is computed for each token over a 'window' of tokens around it (possibly with dilation, i.e. gaps in the window). In other words, attention is now mostly local rather than global. To ensure that some global patterns are captured as well (e.g. specific attention to particular tokens), global attention is added on top - but this is task specific. We have covered the details of Longformer in another article, so make sure to head there if you want to understand Longformer in more detail. Let's now take a look at the example text that we will use today, and then move on to the code examples.
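To give you a feel for how this looks in practice, here is a minimal sketch (assuming HuggingFace Transformers and PyTorch are installed, and using a placeholder input text) that runs the base Longformer with sliding window attention everywhere and global attention on the first token only:

import torch
from transformers import LongformerModel, LongformerTokenizer
# Load the base (not fine-tuned) Longformer model and its tokenizer
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
# Tokenize a (placeholder) input text
inputs = tokenizer("Replace this with a long input text.", return_tensors="pt")
# By default, every token gets local (sliding window) attention only.
# The global_attention_mask marks tokens that should attend globally: here, just the first token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)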
To show you that Longformer works with really long texts in a variety of tasks, we'll use some segments from the Wikipedia page about Germany (Wikipedia, 2001). More specifically, we will be using this text:
Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union. Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr.
Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor.
Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites.
To run the code that you will create in the next sections, it is important that you have installed a few things. Make sure to have a Python environment running on your machine (preferably a virtual environment, but a global one works too). Then make sure that HuggingFace Transformers is installed through pip install transformers. As HuggingFace Transformers runs on top of either PyTorch or TensorFlow, install either of the two as well.

Note that the code examples below are built for PyTorch based HuggingFace. They can be adapted to TensorFlow relatively easily, usually by prepending TF to the model class you are importing, e.g. TFAutoModel.
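As a quick, optional sanity check, you can verify that both libraries are importable from Python (a minimal sketch; the exact versions don't matter much, as long as the imports succeed):

import torch
import transformers
# Print the installed versions of both libraries
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)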
Now, we can move forward to showing you how to use Longformer. Specifically, you're going to see code for these tasks:

- Question answering.
- Text summarization.
- Masked language modeling (predicting missing text).
Let's take a look! 🚀
Longformer can be used for question answering tasks. This requires that the pretrained Longformer is fine-tuned so that it is tailored to the task. Today, you're going to use a Longformer model that has been fine-tuned on SQuAD v1: a question answering task using the Stanford Question Answering Dataset (SQuAD).
Creating the code involves the following steps:

- First, we specify the imports. We need PyTorch because we use its argmax function later, so we must import it through import torch. Then, we also need the AutoTokenizer and the AutoModelForQuestionAnswering from HuggingFace transformers.
- We initialize the tokenizer and the model with the valhalla/longformer-base-4096-finetuned-squadv1 model. As you can see, it's the Longformer base model fine-tuned on SQuAD v1. As with any fine-tuned Longformer model, it can support up to 4096 tokens in a sequence.
- text contains the context that is used by Longformer for answering the question. As you can imagine, it's the text that we specified above. For the question, we're interested in the size of Germany's economy by nominal GDP (that Germany has the fourth-largest economy can be read in the text).

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")
# Initialize the model
model = AutoModelForQuestionAnswering.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1")
# Specify text and question
text = """Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union. Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr.Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor.Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites."""
question = "How large is Germany's economy by nominal GDP?"
# Tokenize the input text
encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]
# Get the attention mask
attention_mask = encoding["attention_mask"]
# Get the predictions (start and end logits of the answer span)
outputs = model(input_ids, attention_mask=attention_mask)
start_scores, end_scores = outputs.start_logits, outputs.end_logits
# Convert predictions into answer
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
answer_tokens = all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores) + 1]
answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))
# Print answer
print(answer)
The results:
fourth-largest
Yep indeed, Germany has the fourth-largest economy by nominal GDP. Great! :D
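One caveat: even Longformer is capped at 4096 tokens. If your context might exceed that, you can let the tokenizer truncate the input - a one-line change to the tokenization step above, using the tokenizer's standard truncation arguments:

encoding = tokenizer(question, text, return_tensors="pt", truncation=True, max_length=4096)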
Next up is text summarization. This can also be done with Transformers. Compared to other tasks such as question answering, summarization is a generative task that greatly benefits from a lot of context. That's why, traditionally, sequence-to-sequence architectures have been useful for this purpose.
In the example below, we therefore use a Longformer2RoBERTa architecture, which utilizes Longformer as the encoder segment and RoBERTa as the decoder segment. It was fine-tuned on the CNN/DailyMail dataset, which is a common one in the field of text summarization.
So strictly speaking, this is not a full Longformer model, but Longformer merely plays a part in the whole stack. Nevertheless, it works pretty well, as we shall see!
This is how we build the model:

- First, we specify the imports: the LongformerTokenizer and the EncoderDecoderModel (which is what you'll need for Seq2Seq!).
- We initialize the model with patrickvonplaten/longformer2roberta-cnn_dailymail-fp16, which contains the full Seq2Seq model. However, as the encoder segment is Longformer, we can use the Longformer tokenizer - so we use allenai/longformer-base-4096 there.
- We then feed the article into the tokenizer, return the input ids as PyTorch Tensors, and generate the summary with our model. Once the summary is there, we use the tokenizer again for decoding the output identifiers into readable text. We skip special tokens.

from transformers import LongformerTokenizer, EncoderDecoderModel
# Load model and tokenizer
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
# Specify the article
article = """Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union. Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr.Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor.Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites."""
# Tokenize and summarize
input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
# Get the summary from the output tokens
summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Print summary
print(summary)
The results:
Germany is the second-most populous country in Europe after Russia.
It is the country's second-largest economy and the most populous member state of the European Union.
Germany is also a member of the United Nations, the G7, the OECD and the G20.
Quite a good summary indeed!
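By default, model.generate uses the model's preconfigured generation settings. If you want more control over the summary, you can pass decoding parameters explicitly - a sketch with illustrative (not tuned) values:

# Generate with beam search and an explicit length cap (illustrative values)
output_ids = model.generate(input_ids, max_length=142, num_beams=4, early_stopping=True)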
Next up is Masked Language Modeling using Longformer. Recall that MLM is a technique used for pretraining BERT-style models. When applied, parts of the text are masked, and the goal of the model is to predict the original tokens. If it can do so correctly and at scale, it effectively learns the relationships within text - and because the original tokens are known, masking them generates the supervision signal without requiring labeled data.
Let's see if we can get this to work with Longformer, so that we can apply MLM to longer texts. As you can see below, we apply the mask just after the text starts: officially the Federal Republic of Germany,[e] is a {mask}. That should be country, indeed, so let's see if we can get the model to produce that.
Creating the code involves the following steps:

- HuggingFace Transformers provides a pipeline for Masked Language Modeling: the fill-mask pipeline. We can initialize it with the allenai/longformer-base-4096 model. This is the MLM pretrained base model, which still requires fine-tuning for task-specific behavior. However, because it was pretrained with MLM, we can also use it for MLM and thus for predicting missing text. We thus load the pipeline API from transformers.
- The mlm.tokenizer has a specific mask_token. We simplify things by referring to it as mask.
- We apply the {mask} to where country is written in the original text.
- We then pass the text to our mlm pipeline to obtain the result, which we then print on screen.

from transformers import pipeline
# Initialize MLM pipeline
mlm = pipeline('fill-mask', model='allenai/longformer-base-4096')
# Get mask token
mask = mlm.tokenizer.mask_token
# Get result for particular masked phrase
text = f"""Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a {mask} at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union. Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr.Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor.Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites."""
result = mlm(text)
# Print result
print(result)
When we observe the results (we cut off the text at the masked token; it continues in the real results), we can see that the model is indeed capable of predicting country:
[{'sequence': "Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country
Great!
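The raw result is a list of candidate fills, each with a score. If you only want to inspect the top candidates instead of the full sequences, something like this works (a minimal sketch; token_str and score are keys in the fill-mask pipeline's output):

# Print the three highest-scoring candidate tokens for the mask
for prediction in result[:3]:
    print(prediction["token_str"], prediction["score"])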
In this tutorial, we covered practical aspects of the Longformer Transformer model. Using this model, you can now process really long texts, thanks to a simple change in the attention mechanism compared to the one used in classic Transformers. Put briefly, you have learned...

- How to use Longformer for question answering.
- How to use a Longformer based encoder-decoder model for text summarization.
- How to use Longformer for masked language modeling, i.e. predicting missing text.
I hope that this article was useful to you! If it was, please let me know through the comments 💬 Please do the same if you have any questions or other comments. I'd love to hear from you :)
Thank you for reading MachineCurve today and happy engineering! 😎
HuggingFace. (n.d.). allenai/longformer-base-4096 · Hugging Face. https://huggingface.co/allenai/longformer-base-4096
HuggingFace. (n.d.). valhalla/longformer-base-4096-finetuned-squadv1 · Hugging Face. https://huggingface.co/valhalla/longformer-base-4096-finetuned-squadv1
HuggingFace. (n.d.). patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 · Hugging Face. https://huggingface.co/patrickvonplaten/longformer2roberta-cnn_dailymail-fp16
Wikipedia. (2001, November 9). Germany. Wikipedia, the free encyclopedia. Retrieved March 12, 2021, from https://en.wikipedia.org/wiki/Germany