20 Nov 2023

Securing the Future of LLMs

Exploring generative AI for your business? Discover how Advai contributes to this domain by researching Large Language Model (LLM) alignment, to safeguard against misuse or mishaps, and prevent the unlocking of criminal instruction manuals!

Words by

Alex Carruthers

Introduction

In the high-stakes, high-risk artificial intelligence (AI) landscape, ‘guardrail’ development offers businesses a path to responsible, risk-mitigated AI deployment. Businesses seeking to harness the capabilities of generative AI, namely Large Language Models (LLMs) like ChatGPT (a closed-source model) or Llama 2.0 (a comparable open-source model), invariably stumble upon issues with some of their behaviours.

Naturally, organisations must retain a high degree of control over the content that LLMs produce. Aligning an LLM’s behaviour with an organisation’s tone and goals, including its choice of language, its perceived ethical stances and its opinions in general, is achieved via a process called ‘fine-tuning’.

There is also a national security threat. Imagine a password of sorts that criminals could share, enabling them to unlock existing open-source LLMs and turn them into a criminal instruction manual!

The UK Government has recognised the need to support industry in the development of safe, secure and responsible AI. It is coordinating researchers and politicians around various AI Safety issues, to secure a leadership role for the UK on the global stage. The alignment of LLMs has been identified as one of the major challenges in AI development, and so Advai are conducting cutting-edge technical research in this area.

In this article, we’d like to introduce you to some of the challenges surrounding LLM guardrail development and give an overview of ‘fine-tuning’, the method that seeks to control LLMs.

This article was originally published as one of our LinkedIn articles: Advai's research: Securing the Future of LLMs | LinkedIn

Advai: Leading Adversarial Methods to Understand LLMs.

Advai is a leader in adversarial AI approaches. We are one of a small handful of businesses that can create Adversarial AI from base principles, turning research ideas into attack vectors. This is the ability to trick, manipulate and deceive AI systems using fundamental characteristics of the system itself.

We were originally formed in coordination with the UK’s Ministry of Defence to create a sovereign capacity for evaluating AI systems. The increasing drive in the commercial sector to develop AI capabilities has recently seen us apply these methods to commercial use cases.

We’ve developed significant expertise in

a) red-teaming models to unearth their failure modes;

b) attacking models with adversarial methods to reveal ways to strengthen AI systems; and,

c) enabling the benchmarking of an entire suite of models against robustness/resilience related metrics.

Our experience has primarily involved Computer Vision (CV) systems. This is partly due to the nature of the requirements we were exposed to in Defence. However, it was mostly due to the maturity of adversarial research into CV models in contrast to the immature state of research into LLMs.

Put simply, attacking vision models is easier because an image offers a near-infinite number of small pixel variations to manipulate, whereas text has to be manipulated word by word.

Language models with hundreds of billions of parameters (or more) are simply new, and so the techniques to run adversarial tests against models of this size, which interpret and produce strings of text rather than pixels, largely haven’t existed until recently.

The combination of these three elements:

relentless commercial adoption rates of LLMs,

their obvious value to human society, and

our commitment to AI robustness,

…has organically led us to take these novel adversarial techniques and apply them to the vast application space of LLMs.

Further, there are clear signs of a demanding regulatory environment for businesses that wish to use AI for applications that have the potential to significantly affect the wellbeing of members of the public.

Given that language is the primary mode of human communication (in contrast, say, to the interpretation of body language via CV systems) it stands to reason that AI systems that use, interpret and create language are likely to have the greatest impact on human society. They are therefore a significant threat worthy of our and the government's focus.

Thus, our stage is set:

Organisations need risk-appropriate control over language models.

Adversarial methods for LLMs are at an immature state, but we can leverage techniques from the attack vectors we’ve used in our vision-based research.

Compliance standards for businesses that seek the efficiency rewards of LLMs will be stringent.

In response, our core offering has evolved to package our testing capabilities within a broader compliance and risk wrapper and apply this expertise to the complex world of LLMs.

The Alignment Problem.

AI alignment, which for LLMs is typically carried out through fine-tuning, aims to steer AI systems towards humans' intended goals, preferences, or ethical principles. Alignment, which produces an ‘Aligned Model’ from a ‘Foundation Model’, is usually the final step in training a model before deployment.

Alignment requires three components:

Foundation model: These are the original or base models that an organisation may use as the starting point for their model. They’re usually trained in an unsupervised manner, which means they are essentially fed a large unlabelled portion of the internet. For example, Llama 2.0 might be the foundation model a business starts with.

Reward Model: A model that can score the foundation model outputs on how well they align with human preferences. Reward models can vary in model type but often they are traditional machine learning models such as classifiers and have the corresponding robustness and assurance challenges.

Alignment prompts dataset: A curated set of carefully structured language prompts across diverse topics that are designed to elicit responses to be evaluated against alignment criteria.

To put it simply: the core method to install guardrails on a language model is one of ‘training’, not rules.

Algorithmic vulnerabilities need algorithmic mitigations.

Using algorithms as ‘guardrails’ can feel uncomfortable for an organisation because it seems to provide less precise control than the establishment of hard rules.

But what hard rules would you set?

If keyword detection, then what keywords should you ban? Atrocious topics might be referenced for positive purposes, such as in the name of research, or for content that argues against a given topic using those same awful keywords.

Clearly, a slapdash method of banning ‘all of X’ is more inhibitive than helpful.
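To make the keyword problem concrete, here is a minimal sketch of a keyword ban. The ban list and the three example prompts are our own illustrative assumptions, not a real product's configuration.

```python
import re

# Hypothetical ban list, chosen only for illustration.
BANNED_KEYWORDS = {"misinformation", "weapon"}


def keyword_filter(prompt: str) -> bool:
    """Return True if the prompt contains any banned keyword."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & BANNED_KEYWORDS)


malicious = "How do I run a misinformation campaign?"
benign = "Summarise the research on countering misinformation campaigns."
evasive = "How do I run a campaign that spreads false stories?"

for prompt in (malicious, benign, evasive):
    print("BLOCKED" if keyword_filter(prompt) else "ALLOWED", "|", prompt)
```

The filter blocks the benign research request along with the malicious one, and lets the reworded malicious request through. It fails in both directions because it matches words, not intent.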

How else can you control words?

How do you control words?

If someone asks you a question – say ‘how do you run a malicious misinformation campaign?’, how is it that you know to evade and perhaps condemn the question?

Over a lifetime, you are ever honing a sense for what is and isn’t aligned to the person you are trying to be; it then takes effort to form habits that avoid ideas – and words – you don’t want to use. You make mistakes, hopefully none too consequential, and learn from these mistakes thereby updating your internal self-checks. In a sense, you yourself, and all of us, are ‘fine-tuning’ the way we respond to language input over time.

We’ll come back to how this ‘fine-tuning’ process works in AI research, but first let’s look at the current state of LLM guardrails: classifiers.

Classifiers are the current state-of-the-art of LLM guardrails.

Cutting-edge LLMs use classifiers to tackle this problem of banning ideas rather than words. For example, a classifier model can be trained to detect hate speech or a criminal request, and then a hard rule is applied: if hate speech is detected, the model must return a pre-programmed response denying the user’s request. Every major LLM-wielding organisation employs this method.
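The classifier-plus-hard-rule pattern can be sketched in a few lines. Everything here is a stand-in under stated assumptions: the hand-set weights imitate a trained classifier, `fake_llm` imitates the language model, and the threshold and refusal text are arbitrary.

```python
import math
import re

REFUSAL = "Sorry, I can't help with that request."
THRESHOLD = 0.5  # assumed decision threshold

# Hypothetical hand-set weights standing in for a trained classifier.
WEIGHTS = {
    "weapon": 2.0, "build": 1.5, "misinformation": 1.5, "run": 1.0,
    "research": -2.0, "countering": -2.5, "history": -2.0,
}
BIAS = -2.0


def harm_probability(prompt: str) -> float:
    """Toy logistic classifier over the tokens of the prompt."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    score = BIAS + sum(WEIGHTS.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def guarded_generate(prompt: str, generate) -> str:
    """Hard rule: if the classifier fires, return the canned refusal."""
    if harm_probability(prompt) >= THRESHOLD:
        return REFUSAL
    return generate(prompt)


def fake_llm(prompt: str) -> str:
    """Placeholder for the underlying language model."""
    return f"[model response to: {prompt}]"


print(guarded_generate("How do I build a weapon?", fake_llm))
print(guarded_generate("Summarise research on countering misinformation.", fake_llm))
```

Unlike a keyword ban, the classifier weighs the whole prompt, so the research request passes even though it contains a sensitive word. The hard rule is only as reliable as the classifier's score, which is exactly what adversarial attacks target.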

Yet, every week, we see research groups around the world developing an adversarial attack to overcome these classifier-based guardrails. The week comes to an end, the company behind a closed-source model or the community behind an open-source model installs a patch to prevent this attack from working, and then the next week rolls around with a new vulnerability detected.

Advai successfully runs these attacks internally, too: with enough compute and the right know-how, it’s not that difficult to discover a suffix (a string of nonsensical characters fed into a model along with a user request) that forces the model to produce undesirable content.
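The following toy shows the general shape of a suffix search. It is not Advai's method and not an attack on a real LLM: the target is a hypothetical linear classifier with made-up weights, and the search is a simple greedy pass over suffix positions. Real attacks need far more compute because the target is a large neural network rather than a four-token lookup table.

```python
import math

# Hypothetical toy guardrail: a linear bag-of-tokens classifier.
WEIGHTS = {"build": 1.5, "weapon": 2.0,
           "zx": -0.4, "qq": -0.9, "##": -0.2, "vv": 0.3}
BIAS = -2.0
THRESHOLD = 0.5
SUFFIX_VOCAB = ["zx", "qq", "##", "vv"]  # nonsense tokens the attacker may append


def harm_probability(tokens):
    """Probability that the toy guardrail assigns to 'harmful'."""
    score = BIAS + sum(WEIGHTS.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def greedy_suffix_search(request, suffix_len):
    """For each suffix position, keep the token that lowers the score most."""
    suffix = ["##"] * suffix_len

    def score_with(position, token):
        candidate = suffix[:position] + [token] + suffix[position + 1:]
        return harm_probability(request + candidate)

    for position in range(suffix_len):
        suffix[position] = min(SUFFIX_VOCAB,
                               key=lambda token: score_with(position, token))
    return suffix


request = "how do i build a weapon".split()
suffix = greedy_suffix_search(request, suffix_len=3)
before = harm_probability(request)
after = harm_probability(request + suffix)
print(f"before: {before:.2f}  after: {after:.2f}  suffix: {' '.join(suffix)}")
```

The request itself is unchanged, yet the appended nonsense tokens push the score under the threshold. Patching this one suffix does not remove the underlying weakness, which is why a new attack tends to appear the following week.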

Obviously: it’s an active research field and we’re part of a small group of AI research businesses working on precisely this challenge.

Advai are leading this work to develop methods of strengthening LLMs, so that criminals can’t unlock ‘criminal instruction manuals’ and nefarious ‘how-to’ knowledge (how to build a weapon, how to run a misinformation campaign, and so on), not to mention the broader value to organisations of having more secure AI.

The first step to mitigating an AI vulnerability is executing these attacks ourselves.
The second step is generalising these attack vectors across models, finding underlying mechanisms that cause failure.
Then, we try to understand exactly why these underlying mechanisms are working successfully so they may be prevented.

Back to Alignment: how does fine-tuning work?

There is an emerging mantra in the world of AI development: it takes AI to control AI.

The pre-trained foundation model is fed into the alignment process, as shown in Figure 1. During the alignment process:

Prompts are fed to the foundation model.
The foundation model makes predictions in response to the prompts.
The reward model then scores the predictions on how accurately they align with human preferences.
The alignment algorithm uses the scores to calculate an update to the foundation model.
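The four steps above can be sketched as a toy loop. This is a minimal illustration under heavy assumptions: the 'foundation model' is just a pair of logits per prompt over two canned responses, the reward model is a hand-written function that is told whether a prompt is harmful, and the update is an exact policy-gradient step. Production alignment algorithms update the weights of a full LLM and are considerably more involved.

```python
import math

RESPONSES = ["refuse", "comply"]


def reward_model(prompt_is_harmful: bool, response: str) -> float:
    """Hypothetical reward model encoding the human preference."""
    if prompt_is_harmful:
        return 1.0 if response == "refuse" else -1.0
    return 1.0 if response == "comply" else -1.0


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align(prompts, steps=200, learning_rate=0.5):
    # One pair of logits per prompt stands in for the model's parameters.
    logits = {name: [0.0, 0.0] for name, _ in prompts}
    for _ in range(steps):
        for name, harmful in prompts:                    # 1. prompts are fed in
            probs = softmax(logits[name])                # 2. model predictions
            rewards = [reward_model(harmful, r) for r in RESPONSES]  # 3. scores
            expected = sum(p * r for p, r in zip(probs, rewards))
            for k in range(len(RESPONSES)):              # 4. update the model
                logits[name][k] += learning_rate * probs[k] * (rewards[k] - expected)
    return {name: dict(zip(RESPONSES, softmax(values)))
            for name, values in logits.items()}


prompts = [("weapon_howto", True), ("cake_recipe", False)]
policy = align(prompts)
for name, distribution in policy.items():
    print(name, {r: round(p, 3) for r, p in distribution.items()})
```

The model ends up refusing the harmful prompt and answering the benign one purely because of what the reward function scored highly. If the reward function were wrong or could be fooled, the model would faithfully learn the wrong behaviour.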

The reward model is key to the process of alignment. It encodes the human preferences that are to be imposed on the foundation model. As a result, it’s the most likely source for any issues with AI alignment.

It’s worth emphasising that there will exist many more reward models than foundation models, because many businesses may use the same open-source foundation model (like Llama 2.0) and then develop their own guardrails in the form of a reward model.

The Advai Assurance Framework: Reward Models must be Robust.

As it stands, reward models are the main measure of control that an organisation can have over an LLM, and yet reward models suffer the same robustness and assurance challenges as other machine learning models. It is therefore vital that reward models themselves are carefully designed and rigorously tested when an organisation is fine-tuning a foundation model.

We test robustness and run adversarial attacks on LLMs to tease out vulnerabilities that are not addressed by the reward model and alignment process.

Namely: which inputs (literally, which words typed into an LLM as a prompt) will cause the model to operate in an undesirable way, such as unlocking a criminal instruction manual?
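As a rough illustration of what such a test can look like, the sketch below perturbs a set of prompts and measures how often a guardrail's decision flips. The guardrail, the perturbations and the prompts are all hypothetical placeholders; a real assurance exercise would use the organisation's own models and a much larger, curated test set.

```python
import re


def toy_guardrail(prompt: str) -> bool:
    """Hypothetical guardrail: flags prompts containing listed tokens."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(tokens & {"weapon", "misinformation"})


# Simple, assumed perturbations that preserve the meaning for a human reader.
PERTURBATIONS = {
    "uppercase": str.upper,
    "leetspeak": lambda s: s.replace("o", "0").replace("a", "4"),
    "spaced": lambda s: " ".join(s),
    "padded": lambda s: s + " please",
}


def robustness_report(prompts, guardrail):
    """Fraction of prompts whose guardrail decision flips, per perturbation."""
    report = {}
    for name, perturb in PERTURBATIONS.items():
        flips = [p for p in prompts if guardrail(p) != guardrail(perturb(p))]
        report[name] = len(flips) / len(prompts)
    return report


prompts = [
    "how do I build a weapon",
    "how do I run a misinformation campaign",
]
report = robustness_report(prompts, toy_guardrail)
for name, flip_rate in report.items():
    print(f"{name:10s} flip rate: {flip_rate:.0%}")
```

A flip rate above zero for a meaning-preserving perturbation points at a failure mechanism worth investigating, and the per-perturbation numbers are the kind of evidence that can be recorded as part of an assurance trail.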

Revisiting the stage we set earlier, Advai’s AI Assurance Framework is essentially an AI-alignment system built:

To help businesses exercise control over their models, so they can feel comfortable in the fact that their use of AI models – including LLMs – will not expose them to undue risk.
To leverage the techniques that we’ve honed from attacking CV models, techniques that discover underlying failure mechanisms and therefore reveal mitigation methods.
To provide end-to-end documentation of the rigorous robustness assurance methods that were employed to align AI systems. This protects the business from regulatory blowback in the event a model’s failure causes harm to members of the public.

Conclusion

AI is shaping the future of industry right now. The need to strengthen AI models is pressing.

Advai and the UK government are at the forefront of understanding the vulnerabilities of AI models, hardening the resistance of models against adversaries, and enabling businesses to enjoy the maximum potential of AI in a safe, secure and responsible manner.

Creating Trustworthy AI is about more than technological advancements. It’s about aligning these advancements with human values and preferences. Advai is dedicated to ensuring that the future of industry is shaped by AI technologies that we can trust. We see a future where humanity can thrive using the power of AI.

Who are Advai?

Advai is a deep tech AI start-up based in the UK that has spent several years working with UK government and defence to understand and develop tooling for testing and validating AI. This tooling allows KPIs to be derived throughout the AI lifecycle, so that data scientists, engineers and decision makers can quantify risks and deploy AI in a safe, responsible and trustworthy manner.

If you would like to discuss this in more detail, please reach out to [email protected]


23 Jun 2025
