Transformers have significantly changed the field of NLP, in large part thanks to their self-attention mechanism. But this mechanism is problematic in the sense that its computational and memory costs grow quadratically with sequence length, due to the QK^T product (the matrix product of Queries and Keys) in self-attention. As a consequence, Transformers cannot be trained on very long sequences, because the resource requirements are simply too high. BERT, for example, sets a maximum sequence length of 512 tokens.
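To make the quadratic growth concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. It is not any library's actual implementation; the sequence length n = 512 and head dimension d = 64 are hypothetical values chosen for illustration.

```python
import numpy as np

# Illustrative sketch of scaled dot-product self-attention over n tokens.
# Hypothetical sizes: n = sequence length, d = head dimension.
n, d = 512, 64
Q = np.random.randn(n, d)  # queries
K = np.random.randn(n, d)  # keys
V = np.random.randn(n, d)  # values

# The QK^T score matrix has shape (n, n): this is the quadratic bottleneck.
scores = Q @ K.T / np.sqrt(d)

# Numerically stable softmax over each row of the score matrix.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V  # shape (n, d)

print(scores.shape)  # (512, 512) -- doubling n quadruples this matrix
```

Doubling the sequence length from 512 to 1024 quadruples the size of the score matrix, which is why both compute and memory scale quadratically.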