Q*. The real reason why Sam Altman was deposed. Maybe.
Is this all a homage to QAnon? Did everyone suddenly rewatch Star Trek and fall in love with the weird, Doctor Who-like extradimensional character?
Well, although we don’t know for certain, it is more likely a reference to Q-learning, a technique in reinforcement learning. It wouldn’t be the first time Q-learning has come up in cutting-edge AI systems: Google’s DeepMind used a deep-learning variant of it for an Atari game-playing system that could perform at expert levels without being taught the games explicitly. Now, it’s likely being used by other leading AI labs as well.
As with my AI Computing piece, this is a more “technical” post. I’m not going to be deep-diving on algorithmic specifics, but I will explain some of these concepts to build intuition, and then talk about implications.
Also, Amazon now has a generative AI model called Q as well. That one, I think, is just meant to be “Amazon Questions,” so it’s not really related, even though the news came out around the same time.
Wait, what is reinforcement learning?
There are a lot of different types of machine learning/AI. We are now quite familiar with generative AI (“make something that looks like this statistical distribution,” like ChatGPT and other LLMs) and with what used to be “AI” to the general public (“check which statistical distribution this looks like,” say, is this a picture of a cat or not?).
Reinforcement learning (RL) is used when an AI needs to figure out what to do in a new environment dynamically, and there could be path dependencies.
Board games are a good example. An even more fun example is this YouTuber’s project that used reinforcement learning to play the original Pokémon (Red, specifically).
The agent observes the state of the world and has a set of actions it can take. The optimal action changes with the state: the right move in the tutorial isn’t the right move against the final boss. Finally, to teach the agent, you generally give it a reward for taking desirable actions in a given state, or a penalty (which is just a negative reward) for taking undesirable ones.
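In code, that interaction is just a loop. Here’s a minimal sketch in Python, where `env` and `agent` are hypothetical stand-ins for an environment and a learning agent, not any particular library’s API:

```python
# A minimal sketch of the reinforcement learning loop.
# `env` and `agent` are hypothetical objects, not a specific library's API.

def run_episode(env, agent):
    state = env.reset()              # observe the initial state of the world
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(state)                       # pick an action for this state
        next_state, reward, done = env.step(action)     # the world responds
        agent.learn(state, action, reward, next_state)  # rewards shape future behavior
        state = next_state
        total_reward += reward
    return total_reward
```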
This is generalizable and useful for a lot of things. AIs that can play games, sure, but also things like learning (without explicit guidance) how to use a mechanical arm or robotic body.
Back to Q-Learning
Q-learning is a reinforcement learning algorithm. It’s model-free, which means we don’t need an explicit model of how the world works to use it; the agent can figure things out on its own from the rewards it sees in different states.
Q*, in Q-learning, is often used to denote the optimal/converged Q-function, meaning the “correct” values needed to pick the optimal action in every state.
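To make that concrete, here’s a minimal tabular Q-learning agent in Python. It’s purely illustrative (the class, hyperparameter values, and epsilon-greedy exploration are my own choices), but the update in `learn` is the textbook Q-learning rule:

```python
import random
from collections import defaultdict

# A minimal tabular Q-learning sketch (illustrative only).
# Q maps (state, action) pairs to value estimates; with enough exploration,
# the table converges toward the optimal Q-function, Q*.

class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # Q[(state, action)] -> estimated value
        self.actions = actions       # the finite set of available actions
        self.alpha = alpha           # learning rate
        self.gamma = gamma           # discount on future rewards
        self.epsilon = epsilon       # exploration probability

    def act(self, state):
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, state, action, reward, next_state):
        # The core update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a').
        # Note that no model of the environment is needed -- hence "model-free".
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

Plugged into the episode loop sketched earlier, this agent gradually learns which action is best in each state purely from the rewards it receives.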
OpenAI’s Q* supposedly is an algorithm that can do grade-school level math. Other than that, the details are pretty sparse.
So, what does Q-learning have to do with OpenAI’s Q*?
Maybe nothing! If so, you’ve at least learned about a class of machine learning algorithms that will likely have a lot of interesting applications in the future, especially combined with deep learning and simulated data (which, even years ago, could do stuff like self-teach an agent how to fly a quadcopter).
In reality, OpenAI is likely not making up marketing names or trying to be misleading about what’s under the hood of its algorithms. This is, after all, the company that brought us ChatGPT, where “GPT” stands for Generative Pre-Trained Transformer (about as technical a description as you could come up with) and it… chats with you.
If we now apply our newfound understanding to what we know:
Q-learning is an RL algorithm that can learn proper actions without explicit guidance, so long as it is provided actions, states, and rewards
OpenAI’s Q* can do grade-school math, which is actually quite difficult for ChatGPT and similar models to do consistently, because LLMs aren’t really doing math; they’re just predicting what text to say next when someone asks a math question (which isn’t the same thing)
We are completely speculating, but it is pretty reasonable to think that they trained up an AI to figure out, on its own, how to do algebra, systems of equations, or whatever they classify as grade-school math. Basic math like this makes it pretty easy to provide actions (the set of valid steps is finite), states (just the current state of the problem), and rewards (it’s easy to verify whether the answer is correct).
If you throw enough compute and parameters at it, it’s not that crazy to expect that you could get a system that could consistently do grade-school math correctly.
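To show what that framing could look like, here’s a toy sketch of a one-step algebra problem posed in state/action/reward terms. To be clear, this is pure speculation on my part; none of the names or design choices below come from OpenAI:

```python
# Pure speculation: a toy illustration of grade-school math as an RL
# environment. Nothing here is based on OpenAI's actual approach.

class EquationEnv:
    """Solve for x in 'x + b = c' by proposing an integer value for x."""

    def __init__(self, b, c):
        self.b, self.c = b, c

    def reset(self):
        return (self.b, self.c)    # the state is just the problem itself

    def step(self, action):
        # action: the agent's proposed value of x, from a finite set of guesses
        correct = (action + self.b == self.c)
        reward = 1.0 if correct else -1.0  # trivially verifiable reward
        done = True                        # one-shot episode
        return (self.b, self.c), reward, done
```

You could hand something like this to the Q-learner sketched earlier. A real system would presumably use multi-step symbolic actions (rearrange, substitute, simplify) and a deep network instead of a lookup table, but the state/action/reward framing is the same.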
And hey, ChatGPT was a “pre-trained transformer.” It’s kind of clever (and fitting) to call your new, pre-trained model Q* since it is the converged/optimal result after training.
Is it that impressive? Is it the end of humankind?
News articles on OpenAI’s Q* centered on speculation and OpenAI employee comments about why Sam Altman was (briefly) ousted from OpenAI. Supposedly, the board, among other things, worried that this powerful algorithm might lead to real AGI (artificial general intelligence, or self-aware/generalized AI, which is the community’s shorthand for Skynet/Terminator, WALL-E, and everything in between).
Knowing what we know… that seems like a massive stretch. With enough parameters and compute, you could practically learn the entire state-action space for most grade-school math. That’s quite different from higher-level, more abstract math, let alone Fields Medal-type work.
Given the timing (shortly after Sam Altman came back), and on the heels of weeks of articles and speculation that OpenAI had hit a wall—including endless recycled articles about Bill Gates thinking ChatGPT had largely plateaued—this sounds a lot like PR that OpenAI is still in the game. And it’s back, baby, with Sam Altman and this new, sexy algorithm that is super secret but ultra-scary powerful.
We’ll ultimately have to see. Using Q-learning to tackle learning math is not a bad idea at all. It’s just a long way from a proof of concept (which is what this sounds like) to something that would be deeply impactful or revolutionary.
And who knows, maybe it doesn’t actually have anything to do with Q-learning (again, given OpenAI’s naming prowess, I think it probably does, but I don’t know for certain).
Why name things so obviously? And what might it say about defensibility?
So, why did OpenAI make it so clear how they achieved ChatGPT’s impressive results? If it is true that Q* has a lot to do with Q-learning, why did they make it so obvious here?
It may just be the culture of the firm (highly technical, transparent). It may be the weird non-profit, mission-driven hybrid structure they have (it’s for the good of humanity, so they won’t hide their work… though they aren’t fully open-sourcing it either).
However, I’d argue that they don’t lose much. Since OpenAI’s edge doesn’t come from secret, proprietary, or differentiated algorithms, it’s not that hard to guess what’s under the hood once one of these systems is made available to the public. “Defensibility through obscurity” is similar to “security through obscurity” in cybersecurity (basically, having an implementation that is just “different” and obscure and hoping that protects you): it may work for a while, but you don’t want to rely on it as your main defense.
Also, being cynical here, as I’ve argued in my other piece on most AI startups being doomed, OpenAI doesn’t really have much defensibility. The real money-maker, and the reason Microsoft is so closely tied to OpenAI (and how it could flex its muscles in the Sam Altman situation with literally zero formal governance control), is ultimately OpenAI’s reliance on Microsoft and API calls on the Azure platform (whether branded Azure or OpenAI).
In that case, there’s no reason to hide what you’re doing. If anything, you want to advertise it. If it’s just an ad for a cloud provider’s services, you might as well spell out what you’re doing—it’s what the buyer is paying for, after all, which they can get from you… or someone else.