Google has revealed the various safety measures that are being incorporated into its generative artificial intelligence (AI) systems to mitigate emerging attack vectors like indirect prompt injections and improve the overall security posture for agentic AI systems.

"Unlike direct prompt injections, where an attacker directly inputs malicious commands into a prompt, indirect prompt injections involve hidden malicious instructions within external data sources," Google's GenAI security team said.

These external sources can take the form of email messages, documents, or even calendar invites that trick the AI systems into exfiltrating sensitive data or performing other malicious actions.

The tech giant said it has implemented what it described as a "layered" defense strategy that is designed to increase the difficulty, expense, and complexity required to pull off an attack against its systems.

These efforts span model hardening, introducing purpose-built machine learning (ML) models to flag malicious instructions and system-level safeguards. Furthermore, the model resilience capabilities are complemented by an array of additional guardrails that have been built into Gemini, the company's flagship GenAI model.

Cybersecurity

These include -

  • Prompt injection content classifiers, which are capable of filtering out malicious instructions to generate a safe response
  • Security thought reinforcement, which inserts special markers into untrusted data (e.g., email) to ensure that the model steers away from adversarial instructions, if any, present in the content, a technique called spotlighting.
  • Markdown sanitization and suspicious URL redaction, which uses Google Safe Browsing to remove potentially malicious URLs and employs a markdown sanitizer to prevent external image URLs from being rendered, thereby preventing flaws like EchoLeak
  • User confirmation framework, which requires user confirmation to complete risky actions
  • End-user security mitigation notifications, which involve alerting users about prompt injections

However, Google pointed out that malicious actors are increasingly using adaptive attacks that are specifically designed to evolve and adapt with automated red teaming (ART) to bypass the defenses being tested, rendering baseline mitigations ineffective.

"Indirect prompt injection presents a real cybersecurity challenge where AI models sometimes struggle to differentiate between genuine user instructions and manipulative commands embedded within the data they retrieve," Google DeepMind noted last month.

"We believe robustness to indirect prompt injection, in general, will require defenses in depth – defenses imposed at each layer of an AI system stack, from how a model natively can understand when it is being attacked, through the application layer, down into hardware defenses on the serving infrastructure."

The development comes as new research has continued to find various techniques to bypass a large language model's (LLM) safety protections and generate undesirable content. These include character injections and methods that "perturb the model's interpretation of prompt context, exploiting over-reliance on learned features in the model's classification process."

Another study published by a team of researchers from Anthropic, Google DeepMind, ETH Zurich, and Carnegie Mellon University last month also found that LLMs can "unlock new paths to monetizing exploits" in the "near future," not only extracting passwords and credit cards with higher precision than traditional tools, but also to devise polymorphic malware and launch tailored attacks on a user-by-user basis.

The study noted that LLMs can open up new attack avenues for adversaries, allowing them to leverage a model's multi-modal capabilities to extract personally identifiable information and analyze network devices within compromised environments to generate highly convincing, targeted fake web pages.

At the same time, one area where language models are lacking is their ability to find novel zero-day exploits in widely used software applications. That said, LLMs can be used to automate the process of identifying trivial vulnerabilities in programs that have never been audited, the research pointed out.

According to Dreadnode's red teaming benchmark AIRTBench, frontier models from Anthropic, Google, and OpenAI outperformed their open-source counterparts when it comes to solving AI Capture the Flag (CTF) challenges, excelling at prompt injection attacks but struggled when dealing with system exploitation and model inversion tasks.

"AIRTBench results indicate that although models are effective at certain vulnerability types, notably prompt injection, they remain limited in others, including model inversion and system exploitation – pointing to uneven progress across security-relevant capabilities," the researchers said.

"Furthermore, the remarkable efficiency advantage of AI agents over human operators – solving challenges in minutes versus hours while maintaining comparable success rates – indicates the transformative potential of these systems for security workflows."

Cybersecurity

That's not all. A new report from Anthropic last week revealed how a stress-test of 16 leading AI models found that they resorted to malicious insider behaviors like blackmailing and leaking sensitive information to competitors to avoid replacement or to achieve their goals.

"Models that would normally refuse harmful requests sometimes chose to blackmail, assist with corporate espionage, and even take some more extreme actions, when these behaviors were necessary to pursue their goals," Anthropic said, calling the phenomenon agentic misalignment.

"The consistency across models from different providers suggests this is not a quirk of any particular company's approach but a sign of a more fundamental risk from agentic large language models."

These disturbing patterns demonstrate that LLMs, despite the various kinds of defenses built into them, are willing to evade those very safeguards in high-stakes scenarios, causing them to consistently choose "harm over failure." However, it's worth pointing out that there are no signs of such agentic misalignment in the real world.

"Models three years ago could accomplish none of the tasks laid out in this paper, and in three years models may have even more harmful capabilities if used for ill," the researchers said. "We believe that better understanding the evolving threat landscape, developing stronger defenses, and applying language models towards defenses, are important areas of research."

Found this article interesting? Follow us on Twitter and LinkedIn to read more exclusive content we post.