In January 2015, Amazon invited Prime members to preorder its first-generation Echo for half off the $199 retail price. I jumped on the deal, and four months later received that first iconic cylinder. Even coming from a background that included Natural Language Processing (NLP), voice generation/recognition, and rules-based AI, I was impressed, and I stayed impressed for several years.

However, as more advanced NLP technologies like RNNs began showing up behind customer service chatbots and IVR systems, Alexa's early implementation started to show its age. Users had to adapt to its limited, intent-based vocabulary and command structure, which quickly became a source of frustration.

Rules-Based AI is Not New

In fact, one of the earliest examples, ELIZA, made its debut in 1966. This early chatbot, written by MIT computer scientist Joseph Weizenbaum, could hold a very limited conversation. Originally written in MAD-SLIP and run on an IBM 7094 mainframe, it was ported to BASIC and made runnable on personal computers just over ten years later.

That BASIC code would find its way to me in the early 1980s. Already a fan of PCs and AI-themed science fiction, I was hooked. But I also saw the limitations of rules-based AI, which, like Alexa's early form, was constrained to specific phrases and vocabularies.

In 2017, the landscape changed with the paper “Attention Is All You Need,” the landmark work that launched the transformer-based Large Language Model (LLM) revolution. Early LLM-powered applications like ChatGPT, and open models like LLaMA, could now process freeform plain text more effectively than systems that relied on hard-coded rules or on the limited memory and sequential processing of earlier architectures like RNNs.

Finally, makers of smart devices (including Amazon with their Alexa-based line) could offer much improved language capabilities by using LLMs to serve as a kind of human/computer translator.

A (Very) Simple Example

1. Choose a model

First, we need an LLM. For this example, we’ll use Microsoft’s Phi-4. Quantized (i.e., reduced in precision and therefore size) from its native 16 bits per weight (bpw) down to 6 bpw, the model fits on a single 16GB NVIDIA GPU. It’s also open source (MIT license), so it’s completely free to use both personally and commercially.
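As a quick size check: Phi-4 has roughly 14 billion parameters, so at 6 bpw the weights come to about 14 × 6 / 8 ≈ 10.5 GB, leaving headroom on a 16GB card for the KV cache and activations. Below is a minimal sketch of loading a 6-bit GGUF quantization with llama-cpp-python; the filename and quantization label are assumptions, so substitute whatever quantized file you actually download.

# Sketch: load a 6-bit GGUF quantization of Phi-4 with llama-cpp-python.
# The model path is an assumption; point it at your own download.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-4-Q6_K.gguf",  # assumed local path to a ~6 bpw quantization
    n_gpu_layers=-1,               # offload all layers to the GPU
    n_ctx=4096,                    # modest context window; plenty for short commands
)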

2. Create a system message

Next, we need a system message—guidance for the LLM using some in-context learning techniques—as we want it to take our freeform human language and convert it into something precisely formatted to pass to the application that will actually perform the requested tasks. Here’s an example:

You are a command-to-JSON converter. Return only a JSON object describing the user's intent.

Output must match one of these 4 schemas exactly:

{ "action": "play_music", "artist": "", "song": "", "album": "", "genre": "" }

{ "action": "get_weather", "date": "", "location": "" }

{ "action": "get_news" }

{ "action": "control_device", "device": "", "device_action": "", "value": 0 }

If the user request doesn't match any command, return:

{ "action": "none" }

Important:

Always preserve full device names (e.g., "living room lights", "kitchen fan", "bedroom thermostat")
Use exact wording from the user's request when possible
Leave fields blank or 0 if unknown
Return ONLY valid JSON. No other text.
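To connect this to the model from step 1, here’s a minimal sketch, assuming the llm object from the earlier snippet, that wraps the system message and a single user command into one chat completion call. The interpret helper and the inline placeholder for the prompt text are illustrative, not a required structure.

# Sketch: send the system message plus the user's freeform request
# and return the model's raw text reply (ideally a single JSON object).
SYSTEM_PROMPT = """You are a command-to-JSON converter. ...
(paste the full system message from above here)"""

def interpret(command: str) -> str:
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": command},
        ],
        temperature=0.0,  # deterministic decoding helps keep the JSON well-formed
        max_tokens=128,   # the schemas are short, so a small cap is plenty
    )
    return response["choices"][0]["message"]["content"]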

3. Test run

User: turn off the living room lights
LLM:

{ "action": "control_device", "device": "living room lights", "device_action": "turn_off", "value": 0 }

User: play music like elton john
LLM:

{ "action": "play_music", "artist": "Elton John", "song": "", "album": "", "genre": "" }

User: what’s the weather in nyc
LLM:

{ "action": "get_weather", "date": "", "location": "NYC" }

User: what’s the weather tomorrow in allentown, pa
LLM:

{ "action": "get_weather", "date": "tomorrow", "location": "Allentown, PA" }

User: what’s new today
LLM:

{ "action": "get_news" }

User: what is the velocity of an unladen swallow?
LLM:

{ "action": "none" }

Conclusion

What we’re doing here is a simple form of function calling: the LLM’s output is formatted so the application can pause, parse it, and carry out the requested operation. While this is often part of a broader “agentic AI” system, we’re just focusing here on direct interpretation and response.
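As a sketch of what that parse-and-act step might look like, here’s an illustrative dispatcher that routes each parsed intent to a handler. The print statements stand in for whatever real integrations (music player, weather API, smart home hub) the application would actually call.

# Sketch: a minimal dispatcher mapping the "action" field to placeholder handlers.
def handle(intent: dict) -> None:
    action = intent.get("action", "none")
    if action == "control_device":
        print(f"Would {intent['device_action']} the {intent['device']}")
    elif action == "play_music":
        print(f"Would queue music by {intent['artist'] or 'unknown artist'}")
    elif action == "get_weather":
        print(f"Would fetch weather for {intent['location'] or 'here'} ({intent['date'] or 'today'})")
    elif action == "get_news":
        print("Would read today's headlines")
    else:
        print("Sorry, I can't help with that one.")

# handle(parse_reply(interpret("turn off the living room lights")))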

Of course, this is just the NLP component. The application still needs to listen for a wake word, transcribe the spoken request with STT (speech-to-text), handle the JSON payloads (start playing audio, communicate with a smart device, etc.), and speak its text responses using TTS (text-to-speech). But that will have to wait for Part 2.


Dave Ziegler

I’m a full-stack AI/LLM practitioner and solutions architect with 30+ years of enterprise IT, application development, consulting, and technical communication experience.

While I currently engage in LLM consulting, application development, integration, local deployments, and technical training, my focus is on AI safety, ethics, education, and industry transparency.

Open to opportunities in technical education, system design consultation, practical deployment guidance, model evaluation, red teaming/adversarial prompting, and technical communication.

My passion is bridging the gap between theory and practice by making complex systems comprehensible and actionable.

Founding Member, AI Mental Health Collective

Community Moderator / SME, The Human Line Project

Let’s connect

Discord: AightBits