Large language models need to handle longer and longer sequences, but current methods are either too slow or too limited, making it hard to go beyond a certain length. To help overcome this obstacle, Microsoft is presenting LongNet, a new kind of Transformer that can deal with sequences of more than 1 billion tokens, without losing performance on shorter ones.
The company has published a paper that proposes a new Transformer variant called LongNet. A Transformer is a type of neural network architecture that can process sequential data, such as natural language or speech. Large language models (LLMs) such as GPT-4 from OpenAI, LLaMA from Meta or PaLM 2 from Google are based on a Transformer model that has been trained on extensive text data.
The main innovation of LongNet is dilated attention, which expands the attentive field exponentially as the distance between tokens grows. According to the paper, this gives linear computational complexity and a logarithmic dependency between tokens. The paper shows that LongNet can perform well on both long-sequence modeling and general language tasks, and can be easily integrated with existing Transformer-based optimizations. The paper also discusses the potential applications of LongNet for modeling very long sequences, such as treating a whole corpus or even the entire Internet as a sequence.
AI expert and documenter David Shapiro posted a YouTube video discussing why LongNet is a major breakthrough. To illustrate what being able to see a billion tokens means, Shapiro uses the example of a 3GB image. He explains that humans can take in the whole image and understand it, while also seeing and understanding the minor details within it.
AI is not always as adept at seeing the minor details. It can often see the bigger picture but become lost in the smaller aspects of what it is looking at. A good example is the current large language models that underpin chatbots such as Google Bard or Bing Chat. These powerful AI tools can usually surface information that covers a topic broadly, but they often provide misinformation when getting into smaller details.
What Is Tokenization and Why It Is Crucial for AI
Natural Language Processing (NLP) is the field of AI that deals with understanding and generating human language. But before we can feed text to a computer, we need to break it down into smaller pieces that the computer can handle. This process is called tokenization, and it's one of the most basic and essential steps in NLP.
Tokenization is like cutting a cake into slices: you take a large and complex text and split it into smaller and simpler units, such as words, sentences, or characters. For example, the sentence “I love NLP” can be tokenized into three words: “I”, “love”, and “NLP”.
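The slicing described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how production tokenizers work; the function names `word_tokens`, `char_tokens`, and `sentence_tokens` are hypothetical helpers introduced here, and the regular expressions assume simple English text.

```python
import re

def word_tokens(text):
    # Treat each run of letters, digits, or apostrophes as one word token.
    return re.findall(r"[A-Za-z0-9']+", text)

def char_tokens(text):
    # Every non-whitespace character becomes its own token.
    return [ch for ch in text if not ch.isspace()]

def sentence_tokens(text):
    # Split on whitespace that follows '.', '!' or '?'.
    # Assumption: no abbreviations such as "Dr." appear in the text.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(word_tokens("I love NLP"))               # ['I', 'love', 'NLP']
print(char_tokens("I love"))                   # ['I', 'l', 'o', 'v', 'e']
print(sentence_tokens("I love NLP. Do you?"))  # ['I love NLP.', 'Do you?']
```

The same text yields three tokens at the word level but many more at the character level, which is one reason the choice of tokenization affects how long a sequence a model has to handle.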
But tokenization is not just a simple slicing operation. It's also an art and a science, as different languages and tasks require different ways of tokenizing text. For instance, some languages, like Chinese or Japanese, do not have spaces between words, so we need to use special algorithms to find the word boundaries. Some tasks, like sentiment analysis or text summarization, may benefit from keeping punctuation marks or emoticons as tokens, as they convey important information.
Tokenization is also a key component of both traditional and modern NLP methods. In traditional methods, such as a count vectorizer (a bag-of-words model), we use tokenization to create a numerical representation of text based on the frequency of each token. In modern methods, such as Transformers, we use tokenization to create a sequence of tokens, usually mapped to integer IDs, that can be processed by a neural network.
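The two uses of tokenization described above can be contrasted with a small sketch. It is a simplified stand-in written with the standard library, not the implementation of any particular vectorizer or Transformer tokenizer; `tokenize`, `build_vocab`, `count_vector`, and `token_ids` are hypothetical names, and lowercased whitespace splitting is an assumption made to keep the example short.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return text.lower().split()

def build_vocab(docs):
    # Map every distinct token to an integer ID, in alphabetical order.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def count_vector(text, vocab):
    # Frequency-based view: one count per vocabulary entry, word order is lost.
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]

def token_ids(text, vocab):
    # Sequence view: one ID per token, word order is kept.
    return [vocab[tok] for tok in tokenize(text) if tok in vocab]

docs = ["I love NLP", "I love love cake"]
vocab = build_vocab(docs)

print(vocab)                         # {'cake': 0, 'i': 1, 'love': 2, 'nlp': 3}
print(count_vector(docs[1], vocab))  # [1, 1, 2, 0]
print(token_ids(docs[0], vocab))     # [1, 2, 3]
```

The count vector has a fixed length equal to the vocabulary size, while the ID sequence grows with the text. That growing sequence is what a Transformer has to attend over, which is why sequence length becomes the bottleneck.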
Tokenization is therefore a crucial step in NLP, as it determines how the text is represented and understood by the computer. It's also a fascinating topic, as it reveals the diversity and complexity of natural language.
What Expanding the Number of Tokens Means for AI Development
By expanding the number of tokens, the AI model can essentially see all of the bigger picture while also being able to focus on the smaller details. LongNet's core idea is dilated attention, which widens the span of tokens covered as the distance between tokens grows.
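To make the idea of dilated attention more concrete, here is a rough sketch of the index selection only. It assumes the scheme outlined in the LongNet paper: the sequence is cut into segments of a given length, and within each segment only every r-th position is kept, with several (segment length, dilation) pairs used together. The function names and the specific `configs` values are hypothetical, causal masking is ignored, and further details from the paper (such as how the outputs of the different patterns are combined) are omitted.

```python
def dilated_indices(seq_len, segment_len, dilation):
    # For each segment, keep every `dilation`-th position.
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        segments.append(list(range(start, end, dilation)))
    return segments

def attention_pairs(seq_len, configs):
    # Count query-key pairs when kept positions attend only within
    # their own segment, summed over all (segment_len, dilation) pairs.
    total = 0
    for segment_len, dilation in configs:
        for kept in dilated_indices(seq_len, segment_len, dilation):
            total += len(kept) ** 2
    return total

# Hypothetical settings: short segments are dense, long segments are sparse.
configs = [(4, 1), (8, 2), (16, 4)]

print(dilated_indices(16, 8, 2))     # [[0, 2, 4, 6], [8, 10, 12, 14]]
print(attention_pairs(16, configs))  # 112, versus 16 * 16 = 256 for dense attention
print(attention_pairs(32, configs))  # 224, versus 32 * 32 = 1024 for dense attention
```

In this toy setting, doubling the sequence length doubles the number of attended pairs rather than quadrupling it, which is the intuition behind handling much longer sequences at manageable cost.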
LongNet has several benefits:
- It has linear computational complexity and a logarithmic dependency between tokens;
- It can serve as a distributed trainer for extremely long sequences;
- Its dilated attention is a drop-in replacement for standard attention and can be easily integrated with existing Transformer-based optimizations.
This means LongNet can handle both long-sequence modeling and general language tasks. David Shapiro explains that Microsoft's paper signals a push towards artificial general intelligence (AGI). He points out that being able to take in more tokens means a model can cover massive tasks accurately and in a single pass. Shapiro offers medical research as an example, where the AI could read thousands of journals at once.
Taken to its limit, that is the ability to read the entire internet at once, within seconds. It is also worth noting that LongNet is just the start. As the concept becomes more powerful, Shapiro says, it will eventually be able to see trillions of tokens and one day the entire internet. Once that happens, he argues, the growth will extend beyond human capabilities and the AI could move towards AGI.
LongNet is in the research phase, and Shapiro predicts we may not see its capabilities for at least a year. Even so, with the rapid development of AI, it seems that a hugely powerful intelligence could be closer than many people originally predicted. Some forecasts have put the development of a superintelligence at least 20 years away, while some believe we will never achieve it.