How Microsoft discovers and mitigates evolving attacks against AI guardrails

Earlier this year Microsoft shared Microsoft’s policy and actions blocking the nation-state advanced persistent threats (APTs), advanced persistent manipulators (APMs), and cybercriminal syndicates we track from using our AI tools and APIs. It’s important to understand the potential harms that can arise from its use. Microsoft's ongoing commitment to advance safe, secure, and trustworthy AI includes transparency about the capabilities and limitations of large language models (LLMs). They prioritise research on societal risks and building secure, safe AI, and focus on developing and deploying AI systems for the public good. You can read more about Microsoft’s approach to securing generative AI with new tools we recently announced as available or coming soon to Microsoft Azure AI Studio for generative AI app developers.

The potential for malicious manipulation of LLMs

One of the main concerns with AI is its potential misuse for malicious purposes. To prevent this, Microsoft ensure that AI behaves according to human values and goals. However, there are attempts by malicious actors to bypass these safeguards, known as "jailbreaks." Jailbreaks can lead to unauthorized actions, ranging from harmless pranks to serious illegal activities. Efforts are made to strengthen these defenses and protect AI applications from such behavior.

AI-integrated apps face both traditional software attacks and unique AI-specific ones, like injecting malicious instructions via user prompts. These risks can be categorized into two groups of attack techniques.

Malicious prompts: When the user input attempts to circumvent safety systems in order to achieve a dangerous goal. Also referred to as user/direct prompt injection attack, or UPIA.
Poisoned content: When a well-intentioned user asks the AI system to process a seemingly harmless document (such as summarizing an email) that contains content created by a malicious third party with the purpose of exploiting a flaw in the AI system. Also known as cross/indirect prompt injection attack, or XPIA.

Malicious prompts

Microsoft's advances in this field include the discovery of a powerful technique to neutralise poisoned content, and the discovery of a novel family of malicious prompt attacks, and how to defend against them with multiple layers of mitigations.

Neutralising poisoned content (Spotlighting)

Prompt injection attacks pose a significant security threat as attackers can issue commands to the AI system as if they were the user. For instance, a malicious email could trigger the system to search the user's sensitive emails and send their contents to the attacker. Defending against such attacks is crucial for the safe operation of any AI service.

Spotlighting (also known as data marking) to make the external data clearly separable from instructions by the LLM, with different marking methods offering a range of quality and robustness tradeoffs that depend on the model in use.

Spotlighting

Mitigating the risk of multiturn threats (Crescendo)

Researchers identified a new type of jailbreak attack called Crescendo, which targets popular language models (LLMs). Crescendo can bypass content safety filters and achieve various malicious goals. We promptly shared our findings with other AI vendors to assess their systems' vulnerability and implement necessary defenses. These vendors are actively working to protect their platforms against Crescendo attacks.

At its core, Crescendo tricks LLMs into generating malicious content by exploiting their own responses. By asking carefully crafted questions or prompts that gradually lead the LLM to a desired outcome, rather than asking for the goal all at once, it is possible to bypass guardrails and filters—this can usually be achieved in fewer than 10 interaction turns. You can read about Crescendo’s results across a variety of LLMs and chat services, and more about how and why it works, in the research paper.

Crescendo attacks, while surprising, don't directly threaten user privacy or AI system security. Instead, they bypass content filters, potentially allowing AI interfaces to behave undesirably. We're dedicated to researching and addressing these attacks to ensure secure AI operation. For Crescendo, we've updated our AI technology, like Copilot AI assistants, to mitigate its impact. Microsoft remains proactive in updating product protections against emerging AI bypass techniques, contributing to AI security research and collaboration.

To understand how we addressed the issue, let us first review how we mitigate a standard malicious prompt attack (single step, also known as a one-shot jailbreak):

Standard prompt filtering: Detect and reject inputs that contain harmful or malicious intent, which might circumvent the guardrails (causing a jailbreak attack).
System metaprompt: Prompt engineering in the system to clearly explain to the LLM how to behave and provide additional guardrails.

System metaprompt

Detecting Crescendo presented initial challenges. Standard prompt filtering couldn't identify "jailbreak intent" because individual prompts weren't threats, and keywords alone weren't enough. The threat pattern only emerged when prompts were combined. Additionally, the LLM didn't perceive anything unusual, as each step built upon the previous one, making it harder to detect the attack.

To solve the unique problems of multiturn LLM jailbreaks, Microsoft created additional layers of mitigations to the previous ones mentioned above:

Multiturn prompt filter: Adapted input filters to look at the entire pattern of the prior conversation, not just the immediate interaction. Microsoft found that even passing this larger context window to existing malicious intent detectors, without improving the detectors at all, significantly reduced the efficacy of Crescendo.
AI Watchdog: Deploying an AI-driven detection system trained on adversarial examples, like a sniffer dog at the airport searching for contraband items in luggage. As a separate AI system, it avoids being influenced by malicious instructions. Microsoft Azure AI Content Safety is an example of this approach.
Advanced research: Microsoft invest in research for more complex mitigations, derived from better understanding of how LLM’s process requests and go astray. These have the potential to protect not only against Crescendo, but against the larger family of social engineering attacks against LLM’s.

Advanced research

How Microsoft helps protect AI systems

AI has the potential to bring many benefits to our lives. But it is important to be aware of new attack vectors and take steps to address them. By working together and sharing vulnerability discoveries, we can continue to improve the safety and security of AI systems. With the right product protections in place, we continue to be cautiously optimistic for the future of generative AI, and embrace the possibilities safely, with confidence. To learn more about developing responsible AI solutions with Azure AI, visit website.

Resources: https://www.microsoft.com/en-us/security/blog/2024/04/11/how-microsoft-discovers-and-mitigates-evolving-attacks-against-ai-guardrails/

Microsoft Resources Centre

Microsoft Blog

How Microsoft discovers and mitigates evolving attacks against AI guardrails

The potential for malicious manipulation of LLMs

Neutralising poisoned content (Spotlighting)

Mitigating the risk of multiturn threats (Crescendo)

How Microsoft helps protect AI systems