04 Oct 2023

In-between memory and thought: How to wield Large Language Models. Part I.

With so much attention on Large Language Models (LLMs), many organisations are wondering how to take advantage of LLMs.

This is the first in a series of three articles geared towards non-technical business leaders.

We aim to shed light on some of the inner workings of LLMs and point out a few interesting quirks along the way.

Words by

Alex Carruthers

Introduction: Robustness in the Context of LLMs

Trust is paramount in the rapidly evolving world of artificial intelligence. The performance and reliability of Large Language Models (LLMs) such as GPT-4 play a pivotal role in fostering trust and subsequent use. As these complex AI systems gain prominence and their adoption begins to seem workable, the need for Robustness and Resilience becomes a pressing issue. This is especially true for organisations wanting to embed AI into their processes and customer interactions. The prospect of using LLMs might seem daunting to businesses who worry about control, auditability, and liability.

Our goal is to make only one point: given the inevitability of their widespread adoption, you need to understand that robustness is vital for the success of trustworthy LLM systems (not to mention, ‘AI’ in general). Whilst it’s still machine learning, the challenges of aligning LLM system behaviours with your intentions are new, nuanced and hard.

Anyway. Stir up a coffee and let's dive in.

Is it Thought? Is it Memory?

There’s an interesting comparison to make if we consider an LLM like GPT-4 against an advanced search engine – both applications are an empty text box.

There’s something fascinating to learn, both from a) a similar trait, and b) a dissimilarity.

a) The similarity.

There is a similarity to be drawn to a search engine: It’s logical to appreciate that if you need to find something, it’s quicker if you only need to search in the relevant areas. Instead of combing through a vast knowledge bank for the most relevant information, it’s helpful if you know where to look.

The fastest way to sort through, index, search and return information is a deeply mathematical field. In essence, your search can return relevant results faster if it identifies where the information is likely to be and prioritises these areas to search first.

(In contrast, consider the last time you used your computer’s file system search, where it alphabetically combs through all files and painstakingly returns irrelevant results in no logical order. Groan.)

Similarly, an LLM like ChatGPT-4 can take an approach called the ‘Mixture of Experts’ (MoE). The MoE approach consists of multiple smaller slices of the trained model, each of which is only engaged when a prompt (or, in search terms, a ‘query’) is relevant to its expertise. It's akin to a group of specialists, each with a different field of knowledge, working together to solve a problem.

Each expert only contributes when their specific field of knowledge is needed, making the overall system more efficient. A leaked report by George Hotz (spoken about further here) suggests ChatGPT-4 has 8 such ‘Experts’ and perhaps two of these experts are engaged in each query. Intuitively, if you’re going to have multiple experts, choosing which experts to engage is a vital step. This selection is performed separately, by a process that sits over the 8 experts.
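For readers who like to see the idea in code, here is a minimal sketch of 'pick two experts out of eight'. Everything in it is an illustrative assumption: the experts are trivial stand-in functions, the router scores are made up, and the names `route` and `mixture_output` are hypothetical. Real MoE models learn their routing and typically apply it per token inside the network, not once per query, so treat this as a picture of the principle and not as how ChatGPT-4 works.

```python
import math

NUM_EXPERTS = 8  # figure quoted in the leaked report (unconfirmed)
TOP_K = 2        # experts engaged per query, per the same report


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and weight them against each other."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def mixture_output(query_value, experts, router_scores):
    """Run only the chosen experts and blend their outputs by weight."""
    return sum(w * experts[i](query_value)
               for i, w in route(router_scores))


# Hypothetical experts: each is just a different simple function.
experts = [lambda x, k=k: x * (k + 1) for k in range(NUM_EXPERTS)]

# Hypothetical router scores for one query: experts 1 and 4 score highest.
scores = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

for expert_id, weight in route(scores):
    print(f"expert {expert_id} engaged with weight {weight:.2f}")
```

The point to notice is that six of the eight experts are never run for this query, which is where the efficiency comes from.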

b) The dissimilarity.

But here's where it gets interesting – the main difference between these models and search architecture. These models don't "think" (thought) or "remember" (knowledge) like we would expect them to. Consider the search paradigm and let’s highlight an example task – researching a topic and writing a summary paragraph.

A human being converts the goal into a search query (thought).
Machines execute the search through data (knowledge).
A human then converts these search results back into language that meets the summarisation goal (thought).

--> You can see there’s a very clear dividing line that breaks thought apart from knowledge.

An LLM blurs this line.

Here’s how it goes when ChatGPT-4 is given the goal of researching a topic and writing a summary paragraph.

ChatGPT-4 immediately writes a summary paragraph.

--> Large Language Models perform thought and knowledge tasks in one go.

This means that in a very real way their knowledge doesn't stem from thought or memory. It arguably doesn’t involve any thought or any memory! Clearly, it amounts to something new. For the sake of grappling with this bizarre and incredible phenomenon, let’s call it ‘thought and memory’.

A Continuum of Thought and Memory

LLMs are not identifying and regurgitating relevant pieces of information they've found on the internet; they’re producing coherent answers from the knowledge that was imprinted into, in ChatGPT-4’s case, 8 topically distinct patterns of language with embedded knowledge. It's worth quickly understanding the consequential trade-off that LLMs running an MoE approach need to make in combining these two tasks. Architecture decisions, exemplified by the 8 experts in ChatGPT-4, lead to a trade-off between thought and memory.

More memory, less thought:

On the one hand, if ChatGPT-4 were to use all ~1.8 trillion parameters in answering a question, then the chances of it stumbling across a pattern that very closely resembles your question (and its answer) are good.
Answers tend to exist in the surrounding context of a question’s semantic patterns; the easiest example to grasp is a school textbook, where the questions and answers are side by side or indexed. However, in searching ~1.8 trillion parameters, the processing time would be painfully slow and the computation expensive. --> Performing this exhaustive search would be something akin to ‘remembering’.

More thought, less memory:

On the other hand, if ChatGPT-4 identifies the field of your query first and then assigns two relevant networks of knowledge patterns (two ‘experts’) to the task, then the chance of a pattern exactly resembling your question being embedded within one of these two experts is smaller, but the computational demand would be close to, say, ten times lower.
Assuming the relevant ‘Experts’ have been selected, they are then capable of using the patterns of similar knowledge to produce a suitable answer. --> This selective activation of neural architectures would be something closer to ‘thought’.

Ryan Moulton, in his article The Many Ways that Digital Minds Can Know, eloquently describes this trade-off (and is the inspiration behind this analogy). He likens this process to "compression." What we’re calling ‘Thought’ he calls ‘Integration’. What we’re calling ‘Memory’, he’s calling ‘Coverage’. Put simply, models with a broader "coverage" of information pull from a more expansive range of facts and examples to generate responses. This requires greater resources.

If something is computationally easier, requiring fewer resources, then it’s cheaper too. ChatGPT-4 users were getting suspicious that answers were getting ‘dumber’. We’re not saying this is necessarily true, but you can understand how OpenAI might be naturally driven by profitability and engineering trade-offs, optimising towards the cheapest acceptable answer versus the best answer.

This is all to stress one major point: these models don't use every piece of information they've learned; they use patterns in the information they’ve learned, and knowledge of the world is embedded in these patterns.

Breaking the Anthropomorphic Spell

The MoE approach selectively chooses which patterns are best suited to answer a question or solve a problem. The fewer patterns needed, the less computation needed, and the quicker, cheaper and more efficient the LLM can be.

Although they're designed to mimic human-like conversation, and although they seem to be performing some kind of search-like knowledge function, they're fundamentally sophisticated word prediction engines, where knowledge has been embedded within their statistical manipulation of human language (this recent study showed an LLM encoded deep medical knowledge).
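To make 'word prediction engine' a little more concrete, here is a deliberately tiny sketch. It swaps the neural network for a much simpler technique: counting which word most often follows another in a made-up training text. The corpus and the function names `train_bigram` and `predict_next` are assumptions for illustration only. Real LLMs work over tokens, look at far more context and learn far richer patterns, but the sketch shows how a 'fact' can live inside prediction statistics with no lookup table of facts anywhere.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count which word follows which in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    options = follows.get(word.lower())
    if not options:
        return None
    return options.most_common(1)[0][0]


# Made-up training text (an assumption for this sketch).
corpus = ("paris is the capital of france . "
          "rome is the capital of italy . "
          "paris is lovely")

model = train_bigram(corpus)
print(predict_next(model, "capital"))  # the word that most often followed "capital"
print(predict_next(model, "berlin"))   # never seen in training, so no prediction
```

Nothing in the model 'knows' geography; it has only recorded which words tend to follow which. That is the sense in which knowledge is embedded in the statistics of language.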

In psychology, the ‘Anthropomorphic bias’ describes the human tendency to ascribe human-like characteristics where in fact none exist. This is what’s happening when you see a face in the knot of a tree. It is this same bias that makes the experience of LLMs so uncanny, so real, makes it…feel… so believable.

This bizarre research into ‘anomalous tokens’ is a really clear demonstration of an LLM vulnerability, one that breaks the anthropomorphic spell. It demonstrates that words are not words to an LLM; they are converted into tokens.

Patterns are found between tokens, not words. It’s not reading, but it’s searching a neural network of tokens. A hiccup of this approach (amid thousands of other examples in the research) is found when ChatGPT-4 simply doesn’t see the word ‘SmartyHeaderCode’. This shows that LLMs really don’t ‘think’ – let alone think like a human – at all.
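As a rough illustration of what 'converted into tokens' means, the sketch below splits words into known pieces and replaces each piece with a number. The vocabulary, the ids and the greedy longest-match rule are all assumptions chosen for simplicity; real tokenizers build their vocabularies from data and use different splitting rules, and this is not how ChatGPT-4's tokenizer treats any particular word. It only shows that what reaches the model is a list of numbers, not letters or words.

```python
# Hypothetical vocabulary mapping text pieces to token ids.
VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, "read": 5, "ing": 6}
UNKNOWN_ID = -1  # stand-in id for any character the vocabulary cannot cover


def tokenize(word, vocab=VOCAB):
    """Greedily match the longest known piece at each position."""
    text = word.lower()
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining slice first, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            # No known piece starts here: emit the unknown id and move on.
            ids.append(UNKNOWN_ID)
            pos += 1
    return ids


print(tokenize("unbelievable"))  # three pieces: un + believ + able
print(tokenize("Tokens"))        # two pieces: token + s
```

Once text has become a list of ids, any pattern the model learns is a pattern between those ids. A token that was rarely or never seen during training can therefore behave very strangely, however ordinary it looks to a human reader.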

Take the below image as a good visual example. It’s a slight tangent due to the multi-modal nature of this model (it interprets vision and text), but it helps underscore this inhuman aspect of models.

· A picture has been merged with a graphical axis.

· The LLM is asked to interpret it.

· The idiom ‘crossing wires’ seems appropriate because the LLM begins to talk about a trend ‘over time’, which the picture obviously doesn’t show.

Hyper-intelligent machines can make incredibly ‘dumb’ errors like these in the wild, but they can also be intentionally provoked into making them.

Imagine a bad actor taking advantage of a discovered vulnerability by using an equivalent trick to make a sidewalk look like a road to an automated vehicle; or a red light like a green; or a stop sign like a speed limit sign.

Their knowledge doesn't stem from understanding or memory, but from something inhumanly new, something in-between. Understanding this difference when working with them will enable you to best understand their limitations and therefore how to install proper guardrails on their use.

Now that we've understood a bit about how these models work, in our next post we’ll explore how they can be controlled and guided.

Who are Advai?

Advai is a deep tech AI start-up based in the UK that has spent several years working with UK government and defence to understand and develop tooling for testing and validating AI. This tooling allows KPIs to be derived throughout the AI lifecycle, so that data scientists, engineers, and decision makers can quantify risks and deploy AI in a safe, responsible, and trustworthy manner.

If you would like to discuss this in more detail, please reach out to [email protected]