05 Dec 2023

When Computers Beat Us at Our Own Game

You’ve probably seen the Q* rumours surrounding the OpenAI-Sam-Altman debacle. We can’t comment on the accuracy of these rumours, but we can provide some insight by interpreting Q* in the context of reinforcement learning.

It's fun, inspiring and daunting to consider that we may be approaching another one of 'those moments', where the world’s breath catches and we're forced to contemplate a world where computers beat us at our own game.

Words by
Alex Carruthers

This article was originally published as one of our LinkedIn articles: When Computers Beat Us at Our Own Game | LinkedIn

 

What’s a Q* Value?

OpenAI has already used reinforcement learning to supercharge its language models: a technique called Proximal Policy Optimisation (PPO) is used for fine-tuning their models with Reinforcement Learning from Human Feedback (RLHF).
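For a flavour of what PPO actually optimises, here is a minimal sketch of its clipped surrogate objective – illustrative only, with placeholder tensors, and certainly not OpenAI’s training code:

```python
import torch

# PPO's clipped surrogate objective: logp_new/logp_old are the new and old
# policies' log-probabilities of the actions taken; advantages estimates how
# much better than expected those actions turned out (all placeholder tensors).
def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)   # how far the policy has moved
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic minimum, so updates that move the policy too far
    # from the old one gain nothing extra -- the 'proximal' in PPO.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```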

In reinforcement learning,

  1. Q-values are numbers that represent how ‘good’ it is to take a particular action in a particular state. They are estimated by reinforcement learning algorithms, improving iteratively during the learning process. In other words, the Q-values are the algorithm’s best guess at how valuable each available action is.

  2. Q*-values are the true, optimal values. Knowing them means knowing the optimal action, because you just need to select the action with the highest Q*-value. Knowing the Q*-values is therefore equivalent to perfectly solving a reinforcement learning problem, as the agent would always take the optimal action. (A minimal sketch of how Q-values are learned follows below.)
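To make this concrete, here is a minimal tabular Q-learning sketch – our own illustration, with a toy ‘corridor’ environment of our own invention, not anything from OpenAI. As the agent learns, its Q-values improve towards Q*, at which point taking the argmax action in each state is the optimal policy:

```python
import numpy as np

# A toy 'corridor' environment: states 0..4 in a line; action 0 moves left,
# action 1 moves right; reaching state 4 ends the episode with reward 1.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

Q = np.zeros((n_states, n_actions))     # Q[s, a]: how 'good' action a looks in state s

rng = np.random.default_rng(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: usually exploit the current estimates, sometimes
        # explore (ties broken randomly so the untrained agent doesn't get stuck).
        if rng.random() < epsilon:
            action = rng.integers(n_actions)
        else:
            action = rng.choice(np.flatnonzero(Q[state] == Q[state].max()))
        next_state, reward, done = step(state, action)
        # The Q-learning update: nudge Q[s, a] towards the reward plus the
        # discounted value of the best action available afterwards.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

# After training, Q approximates Q*: the optimal policy is argmax over each row.
print(np.argmax(Q, axis=1))  # expect mostly 1s ('move right') on this corridor
```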

 

It isn’t clear how Q-values alone could be applied in the context of LLMs. However, the claim of ‘enhanced reasoning capabilities’ provides us with another clue!

Enhanced Reasoning

When faced with a task, your brain makes decisions about how to approach that task. You do it so intuitively that it may never have occurred to you that you are making decisions about how to make decisions.

Let’s say your task was to summarise a textbook.

  1. One strategy would be to start at the beginning, taking notes as you read one paragraph at a time. Then you would use your notes to create a book summary.

  2. Another strategy might be more outside-in. First, you might read the title, then the contents page, to get an overall impression of the topics covered in the book. Then you might read the introductions and conclusions of all chapters to build some mental scaffolding for the entire book. Finally, you would work through and summarise each chapter, and then the book overall.

 

For a computer, software, maths and algorithms take the stage in place of human intuition. Deciding on a strategy for breaking a problem down involves a series of steps, so let’s break them down from a computer’s perspective:

  1. First, a computer must explore possibilities. Think of each step you take as coming with a range of possibilities for the next step, like branches on a tree, then branches on those branches, and so on.

  2. Then it needs to build this tree: a series of ‘nodes’, each representing what’s called a ‘state’ – such as the state of a game – plus the lines/connections that represent all possible decisions from each state.

  3. The next step is evaluation. A computer needs to assess which strategies deliver outcomes that are ‘good’ or ‘bad’. A ‘good’ outcome is any step that takes the computer closer to achieving its goal, such as winning a game. (A minimal sketch of these three steps follows below.)
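Monte-Carlo tree search (MCTS), which we turn to next, is one concrete algorithm built from exactly these steps. Here is a heavily simplified sketch, assuming a generic game object with hypothetical legal_moves/play/is_terminal/result methods – our stand-ins for illustration, not a real library API:

```python
import math
import random

class Node:
    """One node of the search tree: a game state plus running statistics."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # move -> Node: the branches below this state
        self.visits = 0
        self.total_value = 0.0  # sum of simulation outcomes seen through this node

    def ucb1(self, c=1.4):
        # Balance exploitation (average value) against exploration (rarely tried moves).
        if self.visits == 0:
            return float("inf")
        return (self.total_value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root, n_simulations=1000):
    for _ in range(n_simulations):
        # 1. Explore/select: descend the tree, picking the most promising branch.
        node = root
        while node.children:
            node = max(node.children.values(), key=Node.ucb1)
        # 2. Build: expand the tree with one node per legal move from this state.
        if not node.state.is_terminal():
            for move in node.state.legal_moves():
                node.children[move] = Node(node.state.play(move), parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Evaluate: play random moves to the end of the game, then score it
        #    (a full two-player version would flip the sign at alternating levels).
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        outcome = state.result()  # e.g. +1 for a win, -1 for a loss
        while node is not None:   # backpropagate the result up the path
            node.visits += 1
            node.total_value += outcome
            node = node.parent
    # The most-visited move is the most robust estimate of the best action.
    return max(root.children, key=lambda m: root.children[m].visits)
```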

 

Image: the MCTS algorithm (source: Monte Carlo tree search - Wikipedia)

Image: Lee Sedol

Deepmind have pioneered work on combining Q-values with Monte-Carlo tree search, one of the available methods for making decisions: it involves exploring many possible moves before picking the best one. This was executed most famously in the creation of the superhuman Go program ‘AlphaGo’.

Combining Q-values with Monte-Carlo tree search enabled AlphaGo to search through the results of different actions efficiently.

And, well, the rest is history.
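For the curious, here is a rough sketch of the kind of selection rule AlphaGo used to combine value estimates with the search (the PUCT formula; the constant and names here are our illustrative choices, not Deepmind’s code):

```python
import math

# AlphaGo-style selection: a candidate action's estimated value Q(s, a) is
# combined with an exploration bonus driven by a policy prior P(s, a) and the
# visit counts, so the search concentrates on promising branches early.
def puct_score(q_value, prior, parent_visits, action_visits, c_puct=1.0):
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + action_visits)
    return q_value + exploration

# At each node the search descends via the argmax of this score, which is why
# good Q-value estimates make the tree search so much more efficient.
```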

Chain of Thought Prompting

Language models aren’t great at performing a series of tasks, where they need to break a problem down into smaller chunks and tackle one chunk at a time. In machine learning research, this step-by-step approach is referred to as a ‘chain of thought’.

Higher order problem solving requires chain of thought reasoning. Performing arithmetic, exhibiting ‘common sense’ or carrying out symbolic reasoning tasks are examples of more advanced problem solving – essentially anything beyond an instant regurgitation of text.

Chain of thought prompting is a form of prompt engineering, which assists the LLM by leading it through a series of manageable chunks of a problem. This is currently possible, but it requires a human being capable of conceiving and mentally retaining a strategy, breaking the problem down into logical steps, and ‘driving’ the language model through these steps!
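As an illustration, here is what a manual chain-of-thought prompt might look like. The worked exemplar is the classic one from the chain-of-thought prompting literature; ask_llm is a hypothetical stand-in for whichever chat-completion API you use:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical helper: wire this up to your LLM provider of choice."""
    raise NotImplementedError

# One worked exemplar teaches the model to emit intermediate reasoning steps
# rather than jumping straight to (and often fumbling) a final answer.
prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
   How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.
   The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
   How many apples do they have?
A:"""

print(ask_llm(prompt))  # the model now tends to reason step by step first
```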

Removing the Human Driver

It’s interesting to speculate that OpenAI could be leveraging this search method to enable the language model itself to efficiently perform ‘chain of thought’ reasoning strategies. This would arguably remove the need for a human ‘driver’ in many tasks.

  • Consider how useful basic regurgitative language models are already.
  • Now consider how much more useful they would be if they could reliably perform higher orders of thought.
  • How many human functions could they step in for?

 

Whilst these manual chains of prompts can be deployed for specific use cases, such as breaking maths problems into smaller steps, deploying this approach effectively for broader problem sets is much harder.
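To make the speculation concrete: one way a model could ‘drive itself’ is to treat partial chains of thought as nodes in a search tree, use the model to propose candidate next steps, and use a learned value estimate (something Q-value-like) to rank them. The sketch below uses a simple beam search rather than full MCTS, and propose_steps/score_chain are hypothetical helpers – this is our guess at the shape of the idea, not a description of anything OpenAI has built:

```python
# Hypothetical helpers (assumptions for illustration, not real APIs):
def propose_steps(chain):
    """Ask an LLM for a few candidate next reasoning steps, given the chain so far."""
    raise NotImplementedError

def score_chain(chain):
    """A learned value estimate of how promising a partial chain of thought is."""
    raise NotImplementedError

def reason_by_search(question, beam_width=3, depth=4):
    frontier = [[question]]  # each candidate is a list of reasoning steps
    for _ in range(depth):
        candidates = [chain + [step]
                      for chain in frontier
                      for step in propose_steps(chain)]
        # Keep only the most promising partial chains, much as tree search
        # concentrates its visits on promising branches.
        frontier = sorted(candidates, key=score_chain, reverse=True)[:beam_width]
    return max(frontier, key=score_chain)
```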

(During editing, one of our researchers shared this: a Twitter post from a former Deepmind researcher who moved to OpenAI in July and said that he wants to use the planning capabilities developed for AlphaGo.)

A Step Towards AGI

It’s unknown what exactly would constitute AGI, but this would be a major step towards it.

Higher order planning capabilities would plug a major hole in current LLM capabilities. This form of meta-decision-making would enable an incredible number of extra functions – starting with the simple ability to break a task down by itself and tackle it one step at a time.

ChatGPT has surprised the world and even its own creators with how capable language prediction can be. It caught everyone off guard how much more than blog writing it could do: how well it sat law exams, created business strategies, and how much accurate information was embedded in the model. Naysayers have pointed to ‘hallucinations’ (when the models make things up) as a key debilitating weakness of generative models; but, in truth, it’s incredible that they’re as accurate as they are, considering they haven’t been designed to memorise but rather to predict.

Hints of OpenAI’s Progress

The mention of Q* has excited a good number of people. Perhaps this is with good reason. If the rumours are to be believed, then the inclusion of successfully applied techniques such as Monte-Carlo tree search could mean our collective excitement isn’t in vain.

It’s possible OpenAI are about to shock the world just as Deepmind shocked us all with their 4-1 victory over Lee Sedol – a moment that caught the world’s breath as we were forced to contemplate a world where computers can beat us at our own game.

The techniques pioneered by Deepmind and now purportedly adopted by OpenAI may lead to a ChatGPT-6.5 Turbo that can effectively break down a broad range of complex tasks into more manageable chunks. It’s possible that a large proportion of us, now tapping away at our keyboards and flourishing our ‘uniquely human’ capabilities, may be able to put our feet up a little sooner than anticipated.

Or perhaps it’s just a rumour and we should all get back to work.

Who are Advai?

Advai is a deep tech AI start-up based in the UK. We have spent several years working with UK government and defence to understand and develop tooling for testing and validating AI, deriving KPIs throughout the AI lifecycle so that data scientists, engineers, and decision makers can quantify risks and deploy AI in a safe, responsible, and trustworthy manner.

If you would like to discuss this in more detail, please reach out to [email protected]
