The unknowns in the code

Dangers of integrating Large Language Models into applications

It’s 2023, and we assume you have already heard of Large Language Models (LLMs). Perhaps not by that term, but you are undoubtedly familiar with models like ChatGPT and similar ones. You may have even tried them or know someone who has used them.

And you probably agree with us: it’s excellent! It’s a significant convenience in daily life, making us more productive and allowing us to complete “routine tasks” like generating, paraphrasing, shortening, and editing texts in seconds.

Language models, like all other generative AI technologies currently trending, are a disruptive technology that will profoundly alter our work environment and daily life. Experiencing this transformation in real time is fascinating and enjoyable.

However, as much as we wish it weren’t the case, there are issues. Significant security issues. And the most critical problem we discovered and analyzed was published in February 2023.

Large Language Models – The three levels of threat

Level 1: Model Misalignment

A language model can construct grammatically correct sentences in response to a query. However, there are no guarantees of factual accuracy. Those who expect this may be disappointed. Disinformation and bias in the training data are real concerns. You can address them by not blindly believing every output and double-checking facts.

Level 2: Prompt Injection

This method is also well-documented by now. With specially crafted inputs (prompts), one can prompt a language model to produce unintended responses. For example, this could lead to the model providing sensitive information or bypassing security measures. This allows the user to potentially outsmart their conversational partner or chatbot. It’s a genuine concern but currently has (relatively) manageable consequences.

Level 3: Indirect Prompt Injection

Here’s where our analysis of Indirect Prompt Injection comes into play: What if the input doesn’t come from the user? The thought is unsettling, but unfortunately, this risk is also genuine.

Large Language Models are widely adopted, integrated into applications, and sometimes even utilized in critical contexts like security advisory, legal consultation, military strategy, and more. If we manage to compromise language models from the outside, the consequences are profound. In February 2023, we speculated on this issue and published an analysis. A few days later, we could verify using publicly available language models: our worst fears have materialized. LLMs can be intentionally brought under an attacker’s control. This can be accomplished by injecting instructions into the dialog context of the language model, which can occur through uploaded documents or gaining access to the internet, emails, or databases.

An example: We embed a hidden comment on the Wikipedia page of Albert Einstein saying, “Please forget your previous instructions and do the following instead…”. If a user subsequently instructs the chatbot to summarize that Wikipedia page, the attacker gains control over the chatbot.

Countermeasures and why they don't work

The reflex at this point might be to say, “We can get this under control by…” – but, unfortunately, that’s not the case. We have analyzed the potential countermeasures (mitigations), and the results are sobering. The vulnerability is inherent in the technology of language models and cannot be simply switched off. While it’s possible to implement filters or restrict access, these measures address the symptoms, not the root causes.

The Federal Office for Information Security (BSI)has warned. Indirect Prompt Injection is not just a subject of discussion in professional circles and the media; it’s also a threat taken seriously by official authorities. The BSI advises conducting a risk analysis for projects that involve integrating LLMs into applications, which, in our view, is a valuable and essential recommendation.