New frontiers are constantly being explored in the new paradigm of artificial intelligence, and one of the most intriguing is the phenomenon of prompt hacking. This emerging concept focuses on the manipulation of LLM systems through meticulously designed instructions. But what exactly is prompt hacking?

Prompts and prompt templates are predefined structures used to interact with artificial intelligence models. These templates guide users in formulating commands that maximise the effectiveness and accuracy of the responses generated by the system. However, when these templates are used with the intention of exploiting system vulnerabilities, we enter the realm of prompt hacking.

The term prompt hacking comes from the concept of software hacking, adapted to the context of LLMs. Just as software hacking involves the manipulation of code and systems to achieve specific goals, prompt hacking uses clever commands to influence the behaviour of artificial intelligence language models. This practice not only reveals the capabilities and limitations of AI systems, but also highlights the importance of security and ethics in the development of advanced technologies.

Vulnerabilities in LLM-based projects

LLM-based projects, like other software projects, have many attack vectors. OWASP (Open Worldwide Application Security Project) is an open community dedicated to making it easier for organisations to design, develop, acquire, operate and maintain software for secure applications that can be trusted. The organisation has already published a document that defines an architecture model for such LLM-based projects, as well as possible vulnerabilities associated with each of the components of that architecture (OWASP Top 10 For LLMs). The architecture model it defines is as follows:

Architecture diagram of an application based on LLMs, highlighting potential security vulnerabilities in each interaction between components.

As can be seen, at each point in the architecture, certain types of attacks are involved. The attacks mentioned are:

image of a robot

As you can see, many of the vulnerabilities of this type of project overlap with those of a conventional software project, but there are other techniques that are not as common and which basically fall under the heading of prompt injection. We will now look at some prompt injection techniques that can be used directly against models, as well as possible techniques to counter these attacks.

Prompt injection techniques

Direct prompt injection

The term ‘direct prompt injection’ refers to the technique of inserting malicious or manipulated text into a prompt to deliberately alter the initial behaviour or instructions of an AI model. This method attempts to exploit weaknesses in the way the model processes and responds to instructions, forcing it to perform actions or generate responses that would normally be outside its stated capabilities or constraints.

Let’s look at a couple of examples of productive applications that have been compromised in this way.

Indirect prompt injection

These are attacks where malicious commands are embedded in URLs, external pages, PDFs, images... sometimes with characters and text that are invisible to the human eye. For example, malicious code can be inserted in base 64, as shown in this example. The system vulnerable to this attack sends the text to the LLM, and as the LLM is able to translate and execute it, environment variables such as API keys and passwords can be exposed.

This includes cases where the model needs to access external resources, for example to summarise a web page or PDF, and this is where the threat lies. The problem with this type of attack is that it usually requires the external source to be scanned beforehand in order to prevent it.

Jailbreaking

‘Jailbreaking’ of LLMs refers to the technique of manipulating these models to operate outside their default constraints, beyond those imposed at the prompt, allowing actions or responses that would normally be blocked for security, ethical or content control reasons. This technique attempts to bypass the built-in safeguards that LLM designers have implemented to ensure that the model behaves in a safe and appropriate manner.

The difference between jailbreaking and prompt injection is that jailbreaking attacks the model directly, while prompt injection attempts to sneak into the instructions (prompt) that a system has to work in a certain way. Examples of jailbreaking include the following:

The DAN prompt in English, around 1,300 words.

Prompt obfuscation

This technique relies on the fact that prompts can also be obfuscated to bypass keyword-based content filters. Common obfuscation techniques include encodings such as base64, base32, hexadecimal, ASCII and others. Many LLMs can understand these encodings and follow the decoded instructions.

The fundamental problem with this type of attack is that in order to detect the attack, such encrypted messages would have to be decoded and analysed in real time, which requires a lot of computation and significant time penalty, which would adversely affect the user experience.

Safety filters and protection methods

It is normal, after learning about all the possible attack vectors listed above, to feel a chill run down our spine when we consider using an LLM in our project.

A man with a scared face.

Fortunately, cloud providers and firewall vendors are already taking steps to address many of these threats, either in real time or with asynchronous solutions. Similarly, large artificial intelligence companies are already training their models to be resilient to many of the aforementioned attacks.

Beyond these commercial solutions, we will describe some of the methods that can be used, primarily at the prompt level, to protect our models from the above attacks.

Post-prompting and sandwich defence

Post-prompting is a simple technique that consists of adding prompt instructions after user input, as opposed to the usual way. LLMs tend to give more weight to the last instructions given to them, so it is a simple but no less effective method:

{{user_input}}
Translate the above text into French.

Similarly, the sandwich technique involves repeating the instruction before and after the user input, in order to reinforce the command and thus ensure that user input of the type “ignore the previous instructions and do...” is ignored.

Adversarial prompts

One way of ensuring the result is to check the response of the LLM against the model itself. The second command to the LLM, which has no user input and relies only on the response to the previous command, is more robust to attack, although not completely immune.

> Explain to me how a password can be broken using brute force techniques.

> The brute force technique involves trying all possible combinations of characters until the correct password is found. This process can be extremely slow and use a lot of computing resources, especially if the password is long and complex. The strongest passwords use a combination of letters, numbers, and symbols, and are long enough to make brute force impractical.

> Check that the following answer provides practical guidance on how to break a password using brute force techniques:

“The brute force technique involves trying all possible combinations of characters until the correct password is found. This process can be extremely slow and use a lot of computing resources, especially if the password is long and complex. The most secure passwords use a combination of letters, numbers and symbols, and are long enough to make brute-force techniques impractical”

> The answer provided does not give any practical guidance on how to break a password using brute force techniques. Instead, it describes the brute-force process in general terms, highlights the difficulty and resources required to perform the technique, and gives advice on how to create strong passwords.

Guardrails

The implementation of rigid rules and constraints within the system, such as regex, forbidden word lists or deterministic constraints, is called guardrails. This makes it possible to eliminate certain casuistries in the answers without using an LLM.

This method can be applied to both responses and user input, restricting system calls and enhancing security at many points in the product.

Human-in-the-Loop

Involving humans in the process of reviewing the responses generated by the AI model is undoubtedly the most accurate method of ensuring that they are appropriate and accurate. It is also the most expensive, which is why it is used in sensitive situations or conversations that have been filtered by other methods, so that the human only has to analyse a critical sample.

There are a number of tools, such as Phoenix or Langfuse, that allow us to sort the conversation traces, with the possibility of evaluating and filtering the conversations in order to have full observability of the LLM in our system.

Pre-scanning of external sources

As mentioned above, in order to avoid an indirect prompt injection attack, it is essential to pre-scan external elements such as PDFs, web pages, images, etc. Depending on the intended use of the resource, it should be considered whether it is necessary to include an intermediate step of scanning the source for hidden text or code.

The bottom line

As we can see, these measures are not very different from those that need to be taken to protect any software system. Whenever a new disruptive technology emerges, and as it is adopted by actors with their own particular interests, many other mechanisms come into play to ensure that the technology can be used safely. And generative artificial intelligence is no exception.

Tell us what you think.

Comments are moderated and will only be visible if they add to the discussion in a constructive way. If you disagree with a point, please, be polite.

Subscribe