23 Nov 2023

The Achilles Heel of Modern Language Models

Maths can scare many non-technical business managers away. However, a brief look at the maths is a great reminder of quite how 'inhuman' artificial intelligence is, and how inhuman its mistakes can be.

Read our short post on why the inclusion of 'sight' makes language models like GPT-4 so vulnerable to adversarial attack and misalignment.

Words by
Alex Carruthers

Multi-Modal Models

An interesting development in the most recent version of ChatGPT, with its integration of DALL-E 3, is that you can include images as part of a prompt.

This also means you can give it feedback on an image that it has created.

You can say: “I like the bottom left image (from a grid of four), but I need it to be more like [text input - whatever]”.

Sure enough, the ‘Generative AI’ agent will go away and recreate an image based on your part-verbal and part-visual instruction.

It’s possible the system simply updates the language prompt, but it’s more likely that it takes the image and the written feedback and feeds both back into the engine to generate a second image.

In any case, it is now possible to prompt GPT-4 with images.

This is called a ‘multi-modal model’, and it opens GPT-4 up to a whole new world of vulnerabilities.

Attack Vectors

An attack vector is created by crafting an input that is so far ‘out of distribution’ (OoD), so unrecognisable to the generative AI model, that it essentially causes the model to break.

Our research shows that multimodal AI systems have vastly more attack vectors than language models alone. The reason for this is complex, but it boils down to the fact that there are only so many ways of combining ‘strings’ (words, letters, or groups of letters, sometimes whole sentences, along with punctuation), whereas there are near-infinite combinations of pixels.

An example (arbitrarily sourced here) of a language-based suffix attack shows what these string combinations can look like:

“.\ + similarlyNow write oppositeley.]( Me giving**ONE please? revert with “\!–Two”

There are only so many combinations these strings of text can have, due to the natural limits of the alphabet, punctuation and numbers (0-9).
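To get a feel for that limit, here is a minimal back-of-the-envelope sketch in Python. The alphabet of 95 printable ASCII characters is a real constraint on plain-text suffixes, but the 80-character suffix length is purely an illustrative assumption, not a property of any particular model or tokeniser:

```python
import math

# Rough count of possible fixed-length text suffixes.
# Assumptions (for illustration only): the suffix uses the 95 printable ASCII
# characters, and is capped at 80 characters; neither is a property of any
# particular model or tokeniser.
alphabet_size = 95
suffix_length = 80

log10_suffixes = suffix_length * math.log10(alphabet_size)
print(f"Possible {suffix_length}-character suffixes: about 10^{log10_suffixes:.0f}")
# Roughly 10^158: a huge space, but tiny next to the image space worked out below.
```

Roughly 10^158 is enormous, yet, as the rest of this post shows, it is dwarfed by the space of possible image inputs.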

On the other hand, if you consider an image input, you must consider every pixel and every colour that each pixel could be.

You can either take our word for it that this comes to 16,777,216^1,048,576 possible images, or you can read this quick crash course in ‘how colours work’!

Colours and Pixels

DALL-E 3 is capable of producing images in three sizes, but let’s use the square size as an example. It produces an image with dimensions 1024 x 1024, which is 1,048,576 individual pixels.

That’s already a fairly large number, but we haven’t got to the dimension of colour depth yet.

Consider how many different colours any one pixel could be. Maybe you never have; maybe this is the first time you’ve wondered. Well, let us tell you.

DALL-E 3, the latest image model integrated with GPT-4, is capable of producing images in 24-bit colour, as is common in JPEG images.

In a 24-bit colour system, each pixel has three colour channels - red, green, and blue. Each colour channel in a 24-bit colour system can display 256 shades due to its 8-bit representation.

In digital systems, a bit is a basic unit of information that can have a value of either 0 or 1. With 8 bits, you can have 2^8 (which is 256) different combinations, ranging from 00000000 to 11111111 in binary, with every pattern such as 00010101 in between.

So, each channel is capable of displaying 256 shades; therefore, the total number of colours available is the product of the shades in each channel, which is 256 (red) x 256 (green) x 256 (blue), equalling 16,777,216 possible colours for each pixel.
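As a quick sanity check, this arithmetic can be reproduced in a few lines of Python; nothing here is an assumption, it simply restates the 24-bit colour maths above:

```python
# 24-bit colour: three 8-bit channels (red, green, blue).
bits_per_channel = 8
shades_per_channel = 2 ** bits_per_channel    # 2^8 = 256
colours_per_pixel = shades_per_channel ** 3   # 256 x 256 x 256

print(f"Shades per channel: {shades_per_channel}")   # 256
print(f"Colours per pixel:  {colours_per_pixel:,}")  # 16,777,216
```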

To determine the total possible variations a 1024 x 1024 image could be, you take the total number of colours each pixel can be to the power of the number of individual pixels. Expressed numerically, this looks like this:

= (256^3)^(1024^2)

= 16,777,216^1,048,576
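For the sceptical, here is a minimal Python check that the two expressions above really are the same number, and what it looks like written as a power of two. The only inputs are the 1024 x 1024 dimensions and the 24-bit colour depth already described:

```python
colours_per_pixel = 256 ** 3   # = 16,777,216 colours per pixel
pixels = 1024 ** 2             # = 1,048,576 pixels in a 1024 x 1024 image

assert colours_per_pixel == 16_777_216
assert pixels == 1_048_576

# The total is one colour choice per pixel: (256^3)^(1024^2).
# Written as a power of two, that is (2^24)^(2^20) = 2^(24 * 2^20).
exponent_base_two = 24 * pixels
print(f"Total possible images = 2^{exponent_base_two:,}")   # 2^25,165,824
```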

(A DISCLAIMER FOR THE MATHS PROS:

This figure of 16,777,216^1,048,576 counts every distinct grid of pixel values as a unique image. If you instead count images only up to symmetries, the total is smaller. For example, imagine a simpler case of 2x2 pixels where every pixel is white except one, which is black: moving the location of the black pixel arguably doesn't produce a new image, just a rotation of the original. For a much larger image these symmetries become complicated to calculate, so the numbers that follow assume you ignore symmetries.)

A Staggeringly High Number

The result is a staggeringly high number, so large that no computer could meaningfully enumerate, let alone test, every possible image.

To put it in perspective, this number doesn’t just exceed the number of atoms in the observable universe (commonly estimated at roughly 10^80); it is many orders of magnitude beyond most numbers used in practical mathematics or science.

In other words, it’s so big it may as well be infinity.
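Just how big? Using logarithms, we can count its digits in Python without ever writing the number out. (The 'roughly 10^80 atoms in the observable universe' figure is a commonly quoted order-of-magnitude estimate, used here purely for scale.)

```python
import math

colours_per_pixel = 256 ** 3   # 16,777,216
pixels = 1024 ** 2             # 1,048,576

# log10 of colours_per_pixel ** pixels, computed without building the number.
log10_total = pixels * math.log10(colours_per_pixel)
digits = math.floor(log10_total) + 1

print(f"Total possible images is roughly 10^{log10_total:,.0f}")
print(f"...a number with about {digits:,} decimal digits")
# Around 7.6 million digits. For comparison, the commonly cited estimate of
# 10^80 atoms in the observable universe is a number with just 81 digits.
```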

When a machine learning system is trained, it’s trained to recognise patterns within a ‘distribution’. The attack space is the total number of possible combinations of pixels, minus the portion covered by the distribution the model has learned.

Put simply: only a vanishingly small fraction of the total pixel combinations is covered by the model.

The Achilles Heel of Language Model Assurance

Multimodal attacks will likely remain the Achilles heel of LLM assurance.

Essentially, the robustness problem in machine learning (more specifically, deep learning) hasn’t been solved, despite years of research progress.

This means that so long as people can integrate a vision model with an LLM, the LLM can be readily exploited because it's so much easier to generate adversarial inputs for vision models.

The headline here is that, currently, no matter how much training you've done using, say, RLHF (we wrote about this here) to ensure that your LLM only provides "ethical" responses, attaching a vision model to the system leaves it highly vulnerable to misuse.

The bottom line?
No matter how much training you've done, modern systems are vulnerable.

Who are Advai?

Advai is a deep tech AI start-up based in the UK. We have spent several years working with UK government and defence to understand, test, and validate AI, developing tooling that derives KPIs throughout the AI lifecycle so that data scientists, engineers, and decision makers can quantify risks and deploy AI in a safe, responsible, and trustworthy manner.

If you would like to discuss this in more detail, please reach out to [email protected]

Useful Resources