There’s a growing misconception that generative transformer-based Large Language Models (e.g., OpenAI’s GPT-4) and the systems built around them (e.g., OpenAI’s ChatGPT 4o) are approaching “AGI” (Artificial General Intelligence). It’s understandable that people who haven’t delved deeply into the architecture and inner workings of LLMs might think so – the output of the latest systems is very impressive. However, it’s important to understand how these models and systems actually work and what their limitations are.
The basic concepts of LLMs are simpler than many realize: during training, an LLM breaks text data down and learns statistical patterns in that data. It can then generate more words using those learned patterns and probability. Models do not actually understand or comprehend in any human sense, and after training they do not learn from user input beyond the context provided during use (inference), whether by the user or by mechanisms that pull in data from other resources on demand. In other words, they are static, with a hard knowledge cutoff date.
LLMs are also stateless, meaning they do nothing between requests and carry nothing over from one prompt to the next on their own. Finally, they cannot solve math problems through actual calculation or apply real logic; instead, they attempt to answer math and logic questions with the same language-based probability, drawing on text they processed during training. A toy illustration of this word-by-word probability follows.
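To make that concrete, here is a toy sketch (nothing like a real LLM in scale or architecture, and the sentences are invented for illustration): it “trains” by counting which word follows which in a bit of text, then “generates” by picking the most probable next word.

```python
# A toy illustration of language-based probability: count which word follows
# which during "training", then predict the most probable next word. Real LLMs
# use tokens, neural networks, and billions of weights, but the core idea of
# predicting likely next text from patterns in the training data is the same.
from collections import Counter, defaultdict

training_text = (
    "the cat sat on the mat . "
    "the cat sat by the door . "
    "the cat ate the fish ."
)
words = training_text.split()

# "Training": count how often each word follows each other word.
next_word_counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    # "Inference": return the most frequent follower seen during training.
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("cat"))  # "sat" (seen twice, vs. "ate" once)
print(predict_next("the"))  # "cat" (pure frequency, no understanding involved)
```

Nothing here calculates or reasons; the output is simply whatever was most probable in the “training” text, which is also why an LLM asked to do arithmetic is pattern-matching rather than actually computing.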
LLMs may one day form one part of a complete AGI system (which would also need the ability to learn without retraining, take input from multiple sources, and other capabilities that LLMs simply do not have), but on their own they cannot reach AGI status.
I thought it might help to explain how these models are actually trained and how they function:
Training
1) Prior to training a model, a vast amount of data is collected and curated from many sources such as books, news, online repositories like Wikipedia, chat conversations, and much more. This data serves as the foundation for training the model, which is why the resulting pretrained models are also known as foundation models, intended to be further trained, or fine-tuned.
2) The models themselves are static – i.e., they do not learn anything from being used (inference); learning anything new requires additional rounds of training, or fine-tuning. To work around this static nature and the knowledge cutoff, mechanisms like Retrieval-Augmented Generation (RAG), function calling, and vector databases can supply a static model with extra in-context data during inference for a particular prompt, but these technologies do not actually alter the model (see the RAG sketch after this list).
3) During training, this data is broken down into tokens (words, parts of words, or even individual characters/symbols). The relationships between these tokens are learned and stored as numerical values, called “weights” in this context. The total number of weights is often referred to as the model’s parameters (e.g., a model with 8 billion weights may be described as an 8B model). A quick tokenization example is shown after this list.
4) The pretrained model may now be further trained (fine-tuned) for a specific task. This can be done in a number of ways, but the concept is similar to pretraining: the model is fed additional data, such as examples of how it should respond (e.g., how to chat with a user or follow instructions) or domain-specific data on a particular subject.
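To illustrate the tokenization mentioned in step 3, here is a small sketch assuming the tiktoken library (the tokenizer used by OpenAI’s recent models); other model families ship their own tokenizers, but the idea is the same.

```python
# A small tokenization sketch, assuming tiktoken is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4-era models

text = "A cat is a furry mammal."
token_ids = enc.encode(text)

print(token_ids)                             # a list of integer token IDs
print([enc.decode([t]) for t in token_ids])  # the piece of text each token covers
print(enc.decode(token_ids))                 # round-trips back to the original text
```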
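And to illustrate the in-context mechanisms mentioned in step 2, here is a minimal Retrieval-Augmented Generation sketch. A real system would use embeddings and a vector database; a naive keyword-overlap score stands in for retrieval here, and the documents and question are invented. Notice that the retrieved text is simply added to the prompt – the model’s weights are never changed.

```python
# A minimal RAG sketch: "retrieve" the most relevant document, then stuff it
# into the prompt as extra context. The model itself is untouched.
import re

documents = [
    "Acme Corp's return policy allows refunds within 30 days of purchase.",
    "Acme Corp was founded in 2019 and is headquartered in Denver.",
    "The Acme support line is open Monday through Friday, 9am to 5pm.",
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def retrieve(question: str, docs: list[str]) -> str:
    # Naive retrieval: the document sharing the most words with the question wins.
    q = words(question)
    return max(docs, key=lambda d: len(q & words(d)))

def build_prompt(question: str) -> str:
    context = retrieve(question, documents)
    return f"Use the context to answer.\nContext: {context}\nQuestion: {question}"

# The combined context + question is what actually gets sent to the model.
print(build_prompt("What is Acme's return policy for refunds?"))
```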
Now that we have a pretrained model, let’s go on to inference, or the process of an application or end user actually sending requests, or prompts, to the model:
Use (Inference)
1) Once training is complete, we have a pretrained model. At this point, the model can perform basic tasks like text completion (e.g., if you feed it “A cat is” it might respond with “a furry mammal”), but it is generally not yet ready to follow instructions or chat. Note that some models are trained to perform more complex tasks like instruction following and chat during the initial training.
2) The model can then be loaded (e.g., through code or a GUI that uses a library for loading and interacting with models) and queried via prompts. This process is known as inference.
3) Before a prompt is sent to the model, it is broken down into tokens, which are then converted into vectors, just as in training. Unlike the static values stored in the model itself (the weights), these new, per-request values are known as activations.
4) The model uses its learned weights and the activations from the input tokens to generate a response, token by token. It does this by considering the relationship between the weights (stored knowledge) and the activations (user input), selecting the most probable next token, and repeating until the response is complete (a minimal version of this loop is sketched after this list).
5) These tokens are converted back into language and presented to the user.
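Here is a minimal, greedy version of that generation loop, assuming the Hugging Face transformers and torch libraries and the small open GPT-2 model. Any causal language model would work; real systems usually sample from the probabilities rather than always taking the single most probable token, and stop at an end-of-sequence token.

```python
# A minimal, greedy generation loop: prompt -> tokens -> repeated next-token
# prediction -> text. Assumes transformers and torch are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A cat is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # prompt -> tokens

for _ in range(10):  # generate 10 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits                       # a score for every possible next token
    next_id = logits[0, -1].argmax()                           # pick the most probable one
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # append it and repeat

print(tokenizer.decode(input_ids[0]))                          # tokens -> text for the user
```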
Pitfalls
1) Generalized language is bland without your own data and careful prompting: By its nature, AI-generated content is often bland and formulaic, even across different models. This is because models are deliberately designed to choose the most probable text based on their training data, and most models share a majority of the same data sources.
2) Static models, stale data, and date cutoffs: Models are static, meaning that after a model is trained its weights do not learn any new information or change based on use (inference). For a model to permanently learn and retain new information, it must be further trained (fine-tuned); otherwise, it must rely on methods that pull data from external sources at inference time, such as Retrieval-Augmented Generation (RAG), function calling, and agents (discussed further below).
3) Hallucinations: Perhaps the most important and most misunderstood limitation of generative AI is hallucination. Hallucinations occur when the AI seemingly makes up facts or generates nonsensical or contradictory text. This may sound like a bug, but it’s actually a natural consequence of how LLMs generate text – not by recalling stored knowledge verbatim, but by probability and how words relate to each other. The more data an LLM has seen on a particular subject during training, the more likely it is that probability will produce a correct response. Conversely, the less an LLM has seen about a topic, the more likely it is to hallucinate.
This is a problem regardless of use case, but even more so in areas like medicine, psychology, and law. There are many examples of LLMs confidently presenting fabricated case law to lawyers and harmful medical or therapy advice to practitioners and patients alike, so it’s important that all output is thoroughly verified.
4) Garbage In, Garbage Out (GIGO): Errors in training data or user prompts will result in errors in the LLM’s output. For chat-based inference, where a user converses with a model over time, these errors can compound: the model builds on the user’s prompts as well as its own previous flawed responses, increasing the probability that it will generate more incorrect text.
5) Don’t anthropomorphize AI: Except for certain use cases, like roleplay for world building, game development, fiction writing, and entertainment, anthropomorphizing AI can not only affect the quality of the output but also have a negative emotional impact. It’s important to remember that LLMs generate text based on math; they are not sentient, do not have feelings, and only “think” in the manner of a computer, not a living being.
6) LLMs don’t perform math or understand logic: As the name Large Language Model suggests, LLMs are language tools and can’t actually perform calculations or follow logic by themselves. They sometimes seem capable of solving simple math and logic problems because of similar material they have seen during training, but their responses still amount to generating sensible-sounding output from text they’ve seen. This may be addressed by methods like function calling.
For math problems, as an example, a model may be trained to identify a math problem and then call an external tool like Wolfram Alpha to perform the actual calculation, but this too can be flawed because of the complex chain (or pipeline) that must be followed: the model must identify the math problem, pass it to code that reformats it into something the tool can understand, send it along, retrieve the solution from the tool, and then present it to the user (a simplified sketch follows). Like the old game of Telephone, the information may be miscommunicated at any point in the chain, which can result in an incorrect answer.
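Here is a highly simplified sketch of such a pipeline. A regex stands in for the model “identifying” a math problem, and Python’s own arithmetic stands in for an external calculator like Wolfram Alpha; the function names are hypothetical, and a real pipeline has many more places to go wrong.

```python
# A simplified tool-calling pipeline: detect a math expression, hand it to a
# "calculator" instead of letting the model guess, and present the result.
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def extract_math(prompt: str):
    # Steps 1-2: detect a simple "number op number" expression and reformat it.
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)", prompt)
    if match is None:
        return None  # nothing detected; the model would fall back to guessing text
    left, op, right = match.groups()
    return float(left), op, float(right)

def answer(prompt: str) -> str:
    parsed = extract_math(prompt)
    if parsed is None:
        return "No math detected; the LLM would answer from language patterns alone."
    left, op, right = parsed
    result = OPS[op](left, right)     # steps 3-4: call the external tool, get the result
    return f"The answer is {result}"  # step 5: present it back to the user

print(answer("What is 127 * 46?"))    # -> The answer is 5842.0
```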
Conclusion
In short, LLMs are powerful language tools, made more powerful when combined with other tools such as GUIs and applications that handle state management (remembering context between prompts), function calling and agents (to work with external data), and vector databases (for limited memory between sessions). However, while these tools excel at language-specific tasks like summarization, categorization, translation, sentiment analysis, proofreading and editing, they have real limitations when it comes to non-language tasks and will, at most, be one part of the more advanced AI systems of tomorrow.
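As a final illustration of the “state management” mentioned above, here is a minimal chat wrapper, assuming the OpenAI Python client (v1+), an API key in the environment, and the “gpt-4o” model name. The model itself remains stateless; the only “memory” is the history the wrapper re-sends with every turn.

```python
# A minimal sketch of state management around a stateless model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message: str) -> str:
    # Append the user's turn, send the *entire* history, then store the reply.
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="gpt-4o", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Sam."))
print(chat("What is my name?"))  # works only because the history was re-sent
```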