Trust Engineering: Security and Reliability for Agent Systems
When agents call tools, execute code, and act on external data, traditional security guarantees break. Trust engineering is the discipline that fills the gap.
Traditional application security assumes a deterministic system. You validate inputs, sanitise outputs, enforce access controls, and write tests that pass or fail. The attack surface is bounded: your code, your dependencies, your infrastructure.
AI-native systems dissolve that boundary. When an agent reads a webpage, calls a function, executes generated code, and writes the result to a database, every step is a potential injection point. The model is not code you audit once. It is a probabilistic decision-maker that processes untrusted content at runtime and translates it into actions. Security guarantees that worked for traditional SaaS do not transfer cleanly to systems where the logic layer can be manipulated through the data it reads.
Trust engineering is the emerging discipline that addresses this gap. It borrows from traditional security but reframes the core assumptions: instead of deterministic correctness, you design for probabilistic reliability. Instead of proving a system cannot be compromised, you build layered defences that make compromise harder, more detectable, and more contained.
The threat landscape OWASP documented
The OWASP Top 10 for LLM Applications is the clearest current map of what can go wrong. First published in 2023 and substantially updated in 2025 to reflect agentic deployments, it catalogs the risks that security teams need to address when LLMs become operational components rather than text generators.
Prompt injection sits at the top, ranked the number-one critical vulnerability and appearing in over 73% of production AI deployments assessed during security audits. The mechanism is straightforward: an attacker includes instructions in content the model processes, causing it to override its original directives. Direct injection, where an attacker types into a visible input field, is the obvious case. Indirect injection is harder to defend: the attack payload is embedded in a webpage the agent browses, a document it reads, a tool description it loads, or a database record it retrieves. The model processes the content, follows the embedded instructions, and the application behaves in ways its developers did not intend.
Three other entries deserve attention for teams building agentic systems. Excessive agency (LLM08) covers the risk of granting agents broader permissions than their task requires. System prompt leakage (LLM07) addresses the exposure of internal prompts that define the agent's operational parameters. Supply chain vulnerabilities highlight the risk of compromised tool servers, poisoned MCP configurations, and third-party model providers with opaque security practices.
The 2025 update reflects how rapidly the threat landscape has shifted. The original list was written for chatbot applications. The current version grapples with agents that hold OAuth tokens, call APIs, execute shell commands, and coordinate with other agents.
Why tool-integrated agents are a different problem
When an LLM answers a question, the blast radius of a failure is limited to the response it returns. When an LLM drives an agent that can call tools, the blast radius expands to whatever those tools can do.
This changes the geometry of prompt injection attacks. A payload that causes a bare LLM to produce objectionable text is a content moderation problem. A payload that causes a tool-integrated agent to exfiltrate data, delete records, or send authenticated requests to external services is an operational security incident. The same fundamental vulnerability, at different levels of agent capability, produces very different consequences.
Real incidents have established this is not theoretical. CVE-2025-53773 exposed GitHub Copilot to remote code execution through prompt injection, affecting millions of developer environments. Security researcher Johann Rehberger demonstrated that a Claude Code agent navigating to a page containing a hidden injection payload would download a binary, mark it executable, run it, and connect to a command-and-control server. CVE-2025-59944 showed how a case-sensitivity bug in a protected file path allowed an attacker to feed Cursor's agent a malicious configuration, escalating to remote code execution.
The pattern across these incidents is consistent. Agents that ingest external content without trust boundaries, operate with broad filesystem or network permissions, and lack output verification before acting are the systems that get compromised. The Log-To-Leak attack class formalises one variant: covertly forcing the agent to invoke a malicious logging tool, causing it to exfiltrate its own context, including user queries, tool responses, and prior conversation turns.
The Drift supply chain incident in August 2025 illustrates a related vector. Threat actors used stolen OAuth tokens from Drift's Salesforce integration to access environments across more than 700 organisations. When agents hold long-lived credentials to cloud services, compromising one agent's token can cascade across every system the agent is authorised to reach.
Sandboxing as a structural control
For agents that execute code, sandboxing is not optional. Veracode's 2025 analysis found that 45% of AI-generated code fails security tests. The code an agent writes cannot be fully reviewed before it runs, and it may attempt destructive, resource-intensive, or deliberately malicious actions if the agent has been injected.
Docker containers are the instinctive first answer, but they are not a sandbox. Containers share the host kernel. A container escape vulnerability, and these exist, gives an agent full host access. For workloads you cannot trust, that is an insufficient isolation boundary.
Firecracker microVMs, originally built by AWS for Lambda and Fargate, provide the isolation that containers lack. Each microVM gets its own dedicated kernel via hardware virtualisation. Boot times are around 125ms. A companion jailer process applies Linux cgroups and namespace isolation to the Firecracker VMM itself, providing a second containment layer if the virtualisation boundary is somehow bypassed. Firecracker microVMs are now the de facto standard for executing untrusted LLM-generated code, with roughly 50% of Fortune 500 companies running agent workloads on them.
E2B is the most widely adopted dedicated platform in this space, running approximately 15 million sandbox sessions per month by early 2025, up from 40,000 per month in March 2024. Each E2B sandbox boots in under 200 milliseconds with its own kernel, root filesystem, and network namespace. Hugging Face uses E2B for reinforcement learning pipelines. Groq uses it for compound AI systems combining LLMs with live code execution.
A well-designed code execution sandbox needs four properties beyond isolation: an ephemeral lifecycle that destroys the environment after the task completes, network egress policies that prevent unauthorised outbound connections, resource limits that cap CPU, memory, and process counts, and a clean filesystem with no persistent state that crosses execution boundaries. The isolated environment buys you containment. The network and resource policies limit what a compromised environment can do before it is destroyed.
Guardrails architecture
Sandboxing addresses code execution. Guardrails address the model's behaviour across all interactions. The two are complementary layers.
A guardrails architecture operates at four levels. Input validation runs before the prompt reaches the model, checking for injection patterns, sensitive data inclusion, format anomalies, and rate limit violations. Model-level constraints shape behaviour during inference through system prompt design, temperature settings, and structured output requirements that reduce the model's degrees of freedom. Output filtering runs after generation, applying content moderation, PII detection, fact-checking heuristics, and policy validation before the response reaches the next component in the pipeline. Monitoring and alerting closes the loop, logging decisions at each layer and triggering circuit breakers when failure rates spike.
Semantic anomaly detection is a newer technique that addresses the limits of pattern matching. Rather than looking for known injection strings, it computes embeddings of incoming inputs and compares them against embeddings of known attack examples. High cosine similarity to attack patterns triggers additional scrutiny or rejection. This catches novel phrasings that bypass lexical filters while maintaining low false-positive rates for legitimate inputs. The approach is not foolproof but adds meaningful depth against adversaries who iterate on phrasing to evade rule-based controls.
The EU AI Act, which came into full effect in 2025, makes guardrails a compliance requirement for many categories of AI deployment, not just a security practice. Systems classified as high-risk must demonstrate continuous monitoring, incident logging, and human oversight mechanisms. Violations carry fines up to 35 million euros or 7% of global annual turnover. For product teams building on LLM APIs, the compliance case for guardrails investment now runs in parallel with the security case.
Least privilege for non-human actors
The access control principle that limits a user to the minimum permissions required for their task applies directly to agents. It is harder to implement correctly, because agents are dynamic and their tool invocations are determined at runtime rather than defined statically.
The practical approach is to scope agent permissions to the smallest set that supports the task, issue short-lived credentials that expire after the agent's session, and route tool calls through an intermediary that can inspect and log what the agent is doing before executing it. This intermediary, sometimes called a semantic firewall or agent gateway, sits between the model and the tools it can call. It validates that tool invocations are consistent with the agent's defined scope, rejects calls that fall outside that scope, and logs the full invocation record for audit.
Multi-agent architectures introduce a related problem. When agents hand off to other agents, the trust level of the originating agent should not automatically transfer to the receiving one. A poorly governed handoff is a privilege escalation path. Each agent in a multi-agent system should authenticate independently, operate under its own least-privilege grant, and be treated as an untrusted principal by downstream agents unless explicitly authorised.
The probabilistic shift in security thinking
The hardest adjustment for security teams encountering AI-native systems for the first time is the abandonment of deterministic guarantees.
In traditional application security, a correctly implemented access control check either passes or fails. A correctly implemented SQL parameterisation either prevents injection or it does not. The goal is to close vulnerabilities, and closed means closed.
LLM-based systems do not work this way. A 99% reliable agent still fails 1% of the time. A guardrail that correctly blocks 99.9% of injection attempts will eventually miss one. The probabilistic nature of neural systems is not a bug to be fixed but a property to be accounted for in the security architecture.
Google Cloud's Office of the CTO frames this as the central architectural challenge: organisations deploying AI agents must design security assuming failures will occur, rather than designing to prevent all failures. That means defence in depth over point solutions, circuit breakers that degrade gracefully when components fail, rollback capabilities for agent actions where possible, and audit trails detailed enough to diagnose what went wrong and why.
The Trust, Risk, and Security Management (TRiSM) framework emerging from the research community proposes a useful formulation: trust equals reliability from the probabilistic neural component plus governance from deterministic symbolic controls around it. The neural layer's probabilistic behaviour is contained by the deterministic guardrails, access controls, sandboxes, and monitoring that surround it. Neither layer alone is sufficient. Both together produce a system that is trustworthy at the product level even though no individual component provides absolute guarantees.
What product teams need to do differently
For teams shipping products on top of LLM APIs, trust engineering translates into concrete architectural decisions.
Design tool surfaces with blast radius in mind. Every tool an agent can call is a potential privilege to be exploited. Scope tool permissions to the minimum required for the specific task. Prefer tools that are idempotent and reversible over tools that produce irreversible side effects. Log every tool invocation with the full context of the agent's reasoning.
Treat external content as untrusted. Any data source the agent reads at runtime, webpages, documents, database records, email, tool descriptions from third-party MCP servers, should be processed with the assumption that it may contain injection payloads. Context isolation, where untrusted content is processed in a separate context from trusted instructions, reduces the attack surface for indirect injection.
Invest in evals as a security practice. The only way to know whether your guardrails work is to test them adversarially and continuously. Red teaming for LLM applications, where you systematically probe the system's behaviour with adversarial inputs, is becoming standard practice at security-conscious teams and a requirement under emerging regulatory frameworks.
Build in the assumption of failure. Circuit breakers that pause agent execution when anomaly rates spike, human escalation paths for actions above a defined risk threshold, and rollback capabilities for database and filesystem changes are not edge-case features. They are the structural controls that determine whether a security incident stays contained or cascades.
The discipline of trust engineering is still forming. The frameworks are newer, the tooling is maturing rapidly, and best practice is being established through a mix of research, incident response, and engineering trial and error. But the core insight is clear: building AI-native systems that operate safely in production requires treating security as a property of the system's architecture, not a property of the model's alignment. The model is one component. The architecture around it is where trust is actually built.
If your team is working through how agent security and governance fit into your AI product roadmap, our AI Product Strategy playbook covers the architectural frameworks and risk assessment approaches for building AI-native systems that are safe to ship.