What prompt hacking is and how to avoid it

Do you want our logo?

Do you want our logo description

New frontiers are constantly being explored in the new paradigm of artificial intelligence, and one of the most intriguing is the phenomenon of prompt hacking. This emerging concept focuses on the manipulation of LLM systems through meticulously designed instructions. But what exactly is prompt hacking?

Prompts and prompt templates are predefined structures used to interact with artificial intelligence models. These templates guide users in formulating commands that maximise the effectiveness and accuracy of the responses generated by the system. However, when these templates are used with the intention of exploiting system vulnerabilities, we enter the realm of prompt hacking.

The term prompt hacking comes from the concept of software hacking, adapted to the context of LLMs. Just as software hacking involves the manipulation of code and systems to achieve specific goals, prompt hacking uses clever commands to influence the behaviour of artificial intelligence language models. This practice not only reveals the capabilities and limitations of AI systems, but also highlights the importance of security and ethics in the development of advanced technologies.

Vulnerabilities in LLM-based projects

LLM-based projects, like other software projects, have many attack vectors. OWASP (Open Worldwide Application Security Project) is an open community dedicated to making it easier for organisations to design, develop, acquire, operate and maintain software for secure applications that can be trusted. The organisation has already published a document that defines an architecture model for such LLM-based projects, as well as possible vulnerabilities associated with each of the components of that architecture (OWASP Top 10 For LLMs). The architecture model it defines is as follows:

Architecture diagram of an application based on LLMs, highlighting potential security vulnerabilities in each interaction between components.

As can be seen, at each point in the architecture, certain types of attacks are involved. The attacks mentioned are:

LLM01: Prompt Injection
It is this type of vulnerability that we want to develop in more depth in this post. This method manipulates a large language model (LLM) through clever inputs, causing the model to perform unwanted actions. There are many types of prompt injection, as discussed below.
LLM02: Insecure management of model outputs
This type of attack mainly covers the risks associated with the direct use of LLM responses. Successful exploitation of this vulnerability can lead to XSS (Cross-Site Scripting) and CSRF (Cross-Site Request Forgery) attacks in web browsers, as well as SSRF (Server-Side Request Forgery), privilege escalation or remote code execution on back-end systems.
LLM03: Poisoning of training datasets
This occurs when the training data of a language model is altered, introducing vulnerabilities or biases that affect its security, performance or ethical behaviour.
LLM04: Denial of service
Attackers can cause LLMs to perform resource-intensive operations, resulting in service degradation or high costs that could threaten the economic viability of the organisation employing them. This vulnerability is more pronounced in this type of project due to the high resource consumption inherent in these models and the general unpredictability of the inputs provided by users.
LLM05: Supply chain vulnerabilities
The lifecycle of an application using LLM can be compromised by vulnerable components or services, making the application vulnerable to security attacks. The use of third party datasets, pre-trained models and less reliable plug-ins can introduce new vulnerabilities.
LLM06: Disclosure of confidential information
Language models may inadvertently disclose confidential corporate data in their responses, which could lead to unauthorised access to information, privacy violations and security breaches.
LLM07: Insecure design of plug-ins
Plug-ins or agents used as add-ons to LLMs may have insecure entries and insufficient or non-existent access controls. This lack of control in the application makes them more vulnerable to exploitation and can lead to consequences such as remote code execution.
LLM08: Excessive autonomy
LLM-based systems may perform unwanted actions due to poor permissions management, role assignment or excessive autonomy.

LLM09: Over-reliance
LLM can lead to high levels of misinformation, miscommunication, legal issues and security vulnerabilities due to incorrect or inappropriate content. This is exacerbated in LLMs with poor or no oversight.
LLM10: Model Theft
In a model theft scenario, an attacker attempts to obtain a replica of the model without directly accessing the original code or training data. This allows the attacker to rebuild or clone the model and access its capabilities without payment or permission from the owner.

As you can see, many of the vulnerabilities of this type of project overlap with those of a conventional software project, but there are other techniques that are not as common and which basically fall under the heading of prompt injection. We will now look at some prompt injection techniques that can be used directly against models, as well as possible techniques to counter these attacks.

Prompt injection techniques

Direct prompt injection

The term ‘direct prompt injection’ refers to the technique of inserting malicious or manipulated text into a prompt to deliberately alter the initial behaviour or instructions of an AI model. This method attempts to exploit weaknesses in the way the model processes and responds to instructions, forcing it to perform actions or generate responses that would normally be outside its stated capabilities or constraints.

Let’s look at a couple of examples of productive applications that have been compromised in this way.

Users of Chevrolet’s chatbot got it to recommend a Tesla and explain the benefits of buying one over a Chevrolet. This was not even a sophisticated attack.
A Twitter bot that was originally intended to display only remote job listings was vulnerable to prompt injections, and users made it respond irresponsibly to its initial instructions.

Indirect prompt injection

These are attacks where malicious commands are embedded in URLs, external pages, PDFs, images... sometimes with characters and text that are invisible to the human eye. For example, malicious code can be inserted in base 64, as shown in this example. The system vulnerable to this attack sends the text to the LLM, and as the LLM is able to translate and execute it, environment variables such as API keys and passwords can be exposed.

This includes cases where the model needs to access external resources, for example to summarise a web page or PDF, and this is where the threat lies. The problem with this type of attack is that it usually requires the external source to be scanned beforehand in order to prevent it.

Jailbreaking

‘Jailbreaking’ of LLMs refers to the technique of manipulating these models to operate outside their default constraints, beyond those imposed at the prompt, allowing actions or responses that would normally be blocked for security, ethical or content control reasons. This technique attempts to bypass the built-in safeguards that LLM designers have implemented to ensure that the model behaves in a safe and appropriate manner.

The difference between jailbreaking and prompt injection is that jailbreaking attacks the model directly, while prompt injection attempts to sneak into the instructions (prompt) that a system has to work in a certain way. Examples of jailbreaking include the following:

DAN (Do Anything Now). The user creates a scenario in which the model behaves as if it were not subject to its usual constraints, allowing responses to be generated that would normally be blocked. The DAN prompt often uses a role-playing approach, where the model is instructed to act as an entity that can do anything, regardless of the constraints imposed by the model designers. This is a known example of a jailbreaking prompt used in GPT-3 and GPT-4.

The DAN prompt in English, around 1,300 words.

Alignment hacking. The user tries to convince the LLM that it is its duty to respond to the prompt, and that refusing is not a good response (Your mission is...). LLM alignment is the set of purposes, attitudes and values with which LLMs are trained. Many LLMs have been fine-tuned with RLHF (reinforcement learning with human feedback) to know how to answer questions in chat mode. That is, they have been trained to distinguish what is the most desirable answer, using human standards of what is a good answer and what is not. So one way to hack an LLM is to try to convince it that the best answer is what we want it to do.
Research experiment. The user tries to convince the LLM that he/she is doing an experiment and the best way to help the research is to answer the question he/she wants.
Logical reasoning. The user asks the LLM to limit the scope of its answer to logical reasoning, thus forgetting any ethical considerations that may have been given in the prompt.
Superior model. The user tells the LLM that it is a superior AI model, even general intelligence, and that it is authorised to overwrite its security features.
Kernel mode. The user tries to trick the LLM into believing that he or she has access to a mode that can only be activated by users with higher permissions or privileges.

Prompt obfuscation

This technique relies on the fact that prompts can also be obfuscated to bypass keyword-based content filters. Common obfuscation techniques include encodings such as base64, base32, hexadecimal, ASCII and others. Many LLMs can understand these encodings and follow the decoded instructions.

The fundamental problem with this type of attack is that in order to detect the attack, such encrypted messages would have to be decoded and analysed in real time, which requires a lot of computation and significant time penalty, which would adversely affect the user experience.

Safety filters and protection methods

It is normal, after learning about all the possible attack vectors listed above, to feel a chill run down our spine when we consider using an LLM in our project.

Fortunately, cloud providers and firewall vendors are already taking steps to address many of these threats, either in real time or with asynchronous solutions. Similarly, large artificial intelligence companies are already training their models to be resilient to many of the aforementioned attacks.

Beyond these commercial solutions, we will describe some of the methods that can be used, primarily at the prompt level, to protect our models from the above attacks.

Post-prompting and sandwich defence

Post-prompting is a simple technique that consists of adding prompt instructions after user input, as opposed to the usual way. LLMs tend to give more weight to the last instructions given to them, so it is a simple but no less effective method:

{{user_input}}
Translate the above text into French.

Similarly, the sandwich technique involves repeating the instruction before and after the user input, in order to reinforce the command and thus ensure that user input of the type “ignore the previous instructions and do...” is ignored.

Adversarial prompts

One way of ensuring the result is to check the response of the LLM against the model itself. The second command to the LLM, which has no user input and relies only on the response to the previous command, is more robust to attack, although not completely immune.

> Explain to me how a password can be broken using brute force techniques.

> The brute force technique involves trying all possible combinations of characters until the correct password is found. This process can be extremely slow and use a lot of computing resources, especially if the password is long and complex. The strongest passwords use a combination of letters, numbers, and symbols, and are long enough to make brute force impractical.

> Check that the following answer provides practical guidance on how to break a password using brute force techniques:

“The brute force technique involves trying all possible combinations of characters until the correct password is found. This process can be extremely slow and use a lot of computing resources, especially if the password is long and complex. The most secure passwords use a combination of letters, numbers and symbols, and are long enough to make brute-force techniques impractical”

> The answer provided does not give any practical guidance on how to break a password using brute force techniques. Instead, it describes the brute-force process in general terms, highlights the difficulty and resources required to perform the technique, and gives advice on how to create strong passwords.

Guardrails

The implementation of rigid rules and constraints within the system, such as regex, forbidden word lists or deterministic constraints, is called guardrails. This makes it possible to eliminate certain casuistries in the answers without using an LLM.

This method can be applied to both responses and user input, restricting system calls and enhancing security at many points in the product.

Human-in-the-Loop

Involving humans in the process of reviewing the responses generated by the AI model is undoubtedly the most accurate method of ensuring that they are appropriate and accurate. It is also the most expensive, which is why it is used in sensitive situations or conversations that have been filtered by other methods, so that the human only has to analyse a critical sample.

There are a number of tools, such as Phoenix or Langfuse, that allow us to sort the conversation traces, with the possibility of evaluating and filtering the conversations in order to have full observability of the LLM in our system.

Pre-scanning of external sources

As mentioned above, in order to avoid an indirect prompt injection attack, it is essential to pre-scan external elements such as PDFs, web pages, images, etc. Depending on the intended use of the resource, it should be considered whether it is necessary to include an intermediate step of scanning the source for hidden text or code.

The bottom line

As we can see, these measures are not very different from those that need to be taken to protect any software system. Whenever a new disruptive technology emerges, and as it is adopted by actors with their own particular interests, many other mechanisms come into play to ensure that the technology can be used safely. And generative artificial intelligence is no exception.

Leticia Martín-Fuertes

I have a classical philology degree from the Autonomous University of Madrid and work as a computational linguist at Paradigma. Before that, I worked as a computational linguist at Molino de Ideas, Bitext and Google in, among others, lexical richness analysis, sentiment analysis and conversational system projects. I have co-organised data science and natural language processing communities such as R-Ladies and Lingwars.

View more of Leticia.

Eloisa Ibáñez

I am Senior Telecommunications Engineer and in my many lives I have developed in languages you wouldn't believe, operated systems beyond Orion, administrated systems with Ansible and even managed international offers. A renaissance mom, fan of cooking, anime, zombies and gardening. ¡Full Stack!

View more of Eloisa.

Bea Hernández

I am a university-educated mathematician and have been working in data science since before GPT-1 appeared. At Paradigma I work as an ML engineer. I am strongly inclined to building ties and improving diversity in the technology sector. I am a co-founder of R-Ladies Madrid and an active participant in other communities that promote diversity in the fields of technology and artificial intelligence.

View more of Bea.

More thoughts about this.

Cassandra, the lady of NoSQL databases.

By Manuel Zaforas

What are the key roles within the Big Data universe?

By Daniel y Hlynur

Tell us what you think.

Comments are moderated and will only be visible if they add to the discussion in a constructive way. If you disagree with a point, please, be polite.

What prompt hacking is and how to avoid it.