This post builds on a few earlier posts that aimed to demystify Large Language Models (LLMs) and discuss their limitations and how to work around them. Some more technical terms and concepts (such as tokenization) were covered in those posts, so it might be useful to go back to them.
The actual models behind OpenAI’s ChatGPT (e.g., GPT-4o or GPT-3.5 Turbo) and nearly all other LLMs today are made up primarily of two things:
- The language patterns learned during some form of training (e.g., pretraining, fine-tuning)
- The functions and layers that use these patterns, user-provided context, and probability to generate new text
These learned patterns are static (read-only and unchanged by user interaction, even while in memory). What changes is the small amount of user context and an area of memory that holds the temporary calculations used to generate new text (activations).
After generating output, that temporary area of memory for context and activations is wiped and the model returns to an idle state with no retained memory of the previous prompt. This is what we mean by stateless.
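To make that concrete, here is a deliberately tiny toy sketch in Python (not a real inference engine, just an illustration of the shape of the process): the "weights" are a frozen lookup table, and everything computed during a call is thrown away when it returns.

```python
# Toy illustration only, not a real LLM. The point is the shape of the process:
# read-only "weights", temporary per-call working state, nothing kept afterward.

FROZEN_WEIGHTS = {  # stands in for the patterns learned during training (read-only)
    "hello": "Hi there! How can I help?",
    "goodbye": "See you next time!",
}

def generate(prompt: str) -> str:
    # "Activations": temporary working memory that exists only inside this call.
    activations = {"prompt": prompt, "token_count": len(prompt.split())}
    reply = FROZEN_WEIGHTS.get(prompt.lower().strip(), "(a generic pattern-matched reply)")
    # When the function returns, `activations` is discarded and nothing was
    # written back to FROZEN_WEIGHTS, so the next call starts with a blank slate.
    return reply

print(generate("hello"))
print(generate("what did I just say?"))  # no memory of the previous call
```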
Of course, this single-turn operation is somewhat limited and doesn’t allow for an ongoing, multi-turn “conversation” or let a user ask follow-up questions. To follow up, one would have to completely rewrite the initial prompt and submit it again, basically starting from scratch.
So how can ChatGPT actually have a back-and-forth conversation? Because ChatGPT is a system that contains not just a model but an application layer and user interface as well. The model still has the same limitations of being stateless and static, but the application layer and web-based user interface maintain state to give the illusion that the model actually remembers previous prompts in a session.
The process works as follows:
1) The user makes a request to ChatGPT.
2) The application layer passes the user’s request and any additional parameters like sampler settings to the model through an API.
3) The model generates its response and passes it back to the user interface.
4) The user can now submit another prompt with more data, questions, requests, etc.
5) This new request is sent to the model along with the ENTIRE history of prompts and responses, as the sketch below illustrates.
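Here is a minimal sketch of that loop in Python. The call_model function is a hypothetical stand-in for whatever chat API the application layer actually uses (OpenAI’s or any other); the detail that matters is that the full history list is resent on every turn.

```python
# Minimal sketch of the application-layer loop. `call_model` is a hypothetical
# stand-in for a real chat API call; the model behind it remains stateless.

def call_model(messages: list[dict], temperature: float = 0.7) -> str:
    """Pretend API call: a real app would send `messages` plus sampler settings
    (temperature, top_p, ...) to the model and return its generated reply."""
    return f"(model reply to: {messages[-1]['content']!r})"

history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    if user_input.lower() in {"quit", "exit"}:
        break
    history.append({"role": "user", "content": user_input})

    # The ENTIRE history, every prior prompt and reply, goes to the model each
    # turn. That resubmission is what creates the illusion of memory.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    print("Assistant:", reply)
```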
In short, the model is like Dory from the animated movie Finding Nemo: it does not remember anything from previous prompts, but with this method it is presented with, and can consider, the entire history each time, which simulates a conversation. It’s this ability to engage in multi-turn conversations that made ChatGPT even more useful and accessible to general users.
This process does have limitations, the greatest being that no matter how vast the model’s store of generalized language data, it can only process a limited window of user text at a time (known as the context).
As the conversation continues, the context grows, and once the context window is full (say, at 8K tokens), the application must either truncate the earliest input or use a sliding-window approach that summarizes older text before dropping it, so that the remaining history still fits in the model’s context window.
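A crude sketch of that truncation step might look like the following. The token count is a rough four-characters-per-token approximation (a real application would count with the model’s own tokenizer, such as tiktoken for OpenAI models), and the 8K budget is just the example figure above.

```python
# Rough sketch of trimming conversation history to fit a context budget.
# Assumes the same list of {"role": ..., "content": ...} dicts used above.

CONTEXT_BUDGET = 8_000  # tokens, matching the 8K example

def approx_tokens(text: str) -> int:
    # Very rough heuristic (about 4 characters per token); a real system would
    # use the model's own tokenizer to count exactly.
    return max(1, len(text) // 4)

def trim_history(history: list[dict], budget: int = CONTEXT_BUDGET) -> list[dict]:
    """Drop the oldest non-system messages until the history fits the budget.
    (An alternative is to summarize the dropped turns and keep the summary.)"""
    trimmed = list(history)
    while sum(approx_tokens(m["content"]) for m in trimmed) > budget and len(trimmed) > 1:
        trimmed.pop(1)  # keep the system message at index 0, drop the oldest turn
    return trimmed
```

In practice, the trimmed (or summarized) history is what gets passed to the model call in the loop above.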
The second limitation is that as context grows, so do the memory and computing power required to process it. The attention mechanism compares every token in the context with every other token, so the cost grows quadratically rather than linearly with context length, and these requirements can climb quite fast. In short, the model will continue to require more and more RAM and computational power as context grows.
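To see why, consider a back-of-the-envelope estimate. Self-attention compares every token with every other token, so the attention-score matrix alone has n squared entries per head, per layer. The model shape below (32 layers, 32 heads, 2-byte values) is an invented example for illustration, not a measurement of any particular model.

```python
# Back-of-the-envelope estimate of how attention-score memory grows with
# context length. Hypothetical model shape: 32 layers, 32 heads, 2-byte values.

LAYERS, HEADS, BYTES_PER_VALUE = 32, 32, 2

def attention_score_bytes(n_tokens: int) -> int:
    # One n x n score matrix per head per layer, if fully materialized.
    return n_tokens * n_tokens * HEADS * LAYERS * BYTES_PER_VALUE

for n in (2_000, 8_000, 32_000):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>6} tokens -> ~{gib:,.1f} GiB of attention scores")

# Doubling the context quadruples this cost: quadratic, not linear. (Real
# inference engines use tricks like FlashAttention and KV caching to avoid
# materializing the full matrix, but the underlying scaling pressure remains.)
```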
Having covered the problems with models being stateless, we now have to address the fact that they are also static: models only know what they learned during training, and because training can take weeks or even months, their knowledge can go stale very quickly.
Initially, this was addressed by a form of training called fine-tuning, where new data is integrated into a pretrained model. Think of it as a college student going on to another level of education. While often not as effective as training a model from scratch, it has the benefit of being a much faster and less resource-intensive process.
Still, having to pretrain or fine-tune a model regularly is costly and time-consuming, so other strategies were added at the application level, such as Retrieval-Augmented Generation (RAG).
There are several methods ChatGPT and other hosted and local models use today to give a static model access to more current data:
1) Programmatic in-context learning: An application can take data from an external source and inject it into the prompt (for example, pulling raw weather data from a weather service so the model can present it as a natural-language weather report); a sketch of this pattern follows the list.
2) Function calling and agents: The model can recognize that a user’s request requires information from an outside source such as Google, at which point it passes the request to an agent (or tool) that collects the data and feeds it back to the model.
3) Vector databases: As models don’t remember details from session to session, one use of a vector database is to act as long-term memory the application can search and feed back to the model, which is in fact how ChatGPT’s memory feature works.
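As a concrete example of the first method, here is a hedged sketch of programmatic in-context learning: the application fetches external data and injects it into the prompt before calling the model. The fetch_weather and call_model functions are hypothetical stand-ins, not real APIs.

```python
# Sketch of programmatic in-context learning: the application fetches external
# data and injects it into the prompt. `fetch_weather` and `call_model` are
# hypothetical stand-ins, not real APIs.

def fetch_weather(city: str) -> dict:
    # Stand-in for a real weather-service call (e.g., an HTTP request).
    return {"city": city, "temp_c": 21, "condition": "partly cloudy", "wind_kph": 12}

def call_model(prompt: str) -> str:
    # Stand-in for the actual LLM API call.
    return f"(model reply based on a prompt of {len(prompt)} characters)"

def weather_report(city: str) -> str:
    data = fetch_weather(city)  # fresh data the static model could never have learned
    prompt = (
        "Using only the data below, write a short natural-language weather report.\n"
        f"Data: {data}"
    )
    return call_model(prompt)

print(weather_report("Lisbon"))
```

RAG works the same way at heart: the injected text just comes from a vector database, looked up by similarity to the user’s question, rather than from a live service.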
These methods are not limited to ChatGPT and its user interface; developers can write their own software to integrate LLMs into all kinds of systems to automate tasks like translation, summarization, sentiment analysis, and more. We already see this on Google, where AI is used to help with searches, and on Meta’s Facebook, where AI is used to summarize entire comment threads. The opportunities are endless.
So as you can see, using external programs and tools to work around the static, stateless nature of LLMs like those behind ChatGPT makes LLM systems a lot more powerful. But it’s important to remember that the models themselves still have those limitations: they remain static collections of generalized language data that can do only one thing, which is generate new language replies based on stored relationships between parts of language, using probability.