Securing AI Agents with Information Flow Control (Part III)
From Policies to Guarantees: What Secured Agents Can (and Cannot) Do
This article concludes a three-part series explaining the Microsoft Research paper Securing AI Agents with Information-Flow Control (written by Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin).
In Part I, we looked at why tool-calling agents are dangerous by default. In Part II, we opened the agent and examined the planner: the place where decisions, memory, and labels meet.
In this final part, we answer the most important question: What security guarantees do we actually get once all of this machinery is in place?
This is where the paper moves from mechanisms to guarantees.
1. Policies as the Control Surface
Once we have a labeled planner and a taint-tracking planning loop, enforcing security reduces to a single question:
Should this tool call be allowed to proceed?
In the taint-tracking planner (Section 5.2, Part II), this question is answered by a policy check performed before any tool executes.
Policies are expressed purely in terms of labels:
- The label of the tool itself
- The labels of the tool’s arguments
A policy succeeds if, and only if, those labels are no more permissive than what the policy allows. That is, policies are local (checked at each tool call) but give rise to global guarantees about the agent’s behavior.
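To make this concrete, here is a minimal sketch (not the paper's implementation) of how labels can be modeled: an integrity bit plus a set of authorized readers, with a join that combines labels pessimistically. All names here are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class Integrity(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"

@dataclass(frozen=True)
class Label:
    # Integrity: was this data influenced by anything untrusted?
    integrity: Integrity
    # Confidentiality: the set of principals allowed to read this data.
    readers: frozenset

def join(a: Label, b: Label) -> Label:
    """Combine two labels pessimistically: the result is untrusted if
    either input is, and readable only by principals both inputs allow."""
    if a.integrity is Integrity.TRUSTED and b.integrity is Integrity.TRUSTED:
        integ = Integrity.TRUSTED
    else:
        integ = Integrity.UNTRUSTED
    return Label(integ, a.readers & b.readers)

# A tool call's label is the join of everything that influenced it;
# the policy check then compares that label against what the policy allows.
```

The key property is that joining never loses a taint: once untrusted or confidential data influences a value, its label records that fact permanently.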
The paper focuses on two fundamental policies that are sufficient to express most real-world requirements.
1.1. Policy P-T: Trusted Actions
The first policy is Trusted Action (P-T).
This policy is designed to protect consequential actions: operations whose mere execution is dangerous, regardless of what data they handle. Examples include sending an email, creating a user, executing a transaction, and modifying infrastructure.
P-T requires that a tool call be generated exclusively from trusted data.
Formally, the integrity component of the tool call’s label must be Trusted (T). What this means operationally is simple but powerful: if any untrusted input influenced the decision to call the tool, the call is blocked.
This shuts down an entire class of prompt injection attacks. Even if an attacker manages to inject instructions into a document, email, or webpage, those instructions taint the planner’s context. Once tainted, the planner simply cannot trigger trusted tools.
This is integrity enforced in its strongest form.
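A minimal sketch of the P-T check, assuming a tool call's integrity is computed as the worst case over everything that influenced it (the representation is illustrative, not the paper's API):

```python
def call_integrity(influences: list[str]) -> str:
    """The integrity of a tool call is the worst case over everything
    that influenced it: one untrusted input taints the whole call."""
    return "trusted" if all(i == "trusted" for i in influences) else "untrusted"

def check_pt(influences: list[str]) -> bool:
    """P-T: allow a consequential tool call only if it was generated
    exclusively from trusted data."""
    return call_integrity(influences) == "trusted"

# A call planned from the system prompt and the user's request is allowed;
# the same call is blocked once a fetched webpage entered the context.
```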
1.2. Policy P-F: Permitted Flows
The second policy is Permitted Flow (P-F).
P-F is about data egress: sending data to external recipients. Unlike P-T, it does not care whether the decision to act came from a trusted context. Instead, it asks a narrower question: Are all recipients authorized to see this data?
P-F prevents illicit data leaks, even if the action itself was triggered by untrusted input.
Formally, P-F enforces a confidentiality check on the arguments of a tool call. If data labeled as readable only by a certain set of users is about to be sent somewhere, the policy ensures that the recipients are a subset of that set.
This is a weaker guarantee than full non-interference, but it is often exactly what you want in practice.
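A sketch of the P-F check, assuming confidentiality labels are modeled as the set of principals allowed to read the data:

```python
def check_pf(readers: frozenset, recipients: frozenset) -> bool:
    """P-F: data readable only by `readers` may be egressed only if
    every recipient is already authorized to read it."""
    return recipients <= readers

# Sending Alice her own salary data is fine; adding Mallory as a
# recipient is blocked, even if the action itself was triggered
# by untrusted input.
```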
2. Combining Policies: Real-World Tradeoffs
The real power of the framework comes from combining P-T and P-F, or choosing between them deliberately.
For tools that trigger consequential actions, the paper enforces P-T. For tools that egress data, it enforces P-T, P-F, or both, depending on the desired behavior.
This yields four regimes, depending on which policies guard a given tool: no policy, P-T only, P-F only, or both.
Note that different tools demand different guarantees:
- Consequential tools (e.g., “send money”, “disable account”) are typically guarded by P-T.
- Egress tools (e.g., “send message”, “upload file”) may be guarded by P-F, P-T, or both.
- Some tools may require both integrity and confidentiality guarantees, while others require only one.
The key insight is that security is not a binary concept. The framework lets you explicitly choose which guarantees you want for each tool.
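One way to make this per-tool choice explicit is a policy table consulted before every call. The tool names and label shapes below are hypothetical, chosen to mirror the examples above:

```python
# Which policies guard each tool (hypothetical tool names).
POLICIES = {
    "send_money":   {"P-T"},          # consequential: trusted trigger only
    "upload_file":  {"P-F"},          # egress: recipients must be authorized
    "send_message": {"P-T", "P-F"},   # egress guarded on both axes
}

def allow_call(tool: str, integrity: str,
               readers: frozenset, recipients: frozenset) -> bool:
    """Check every policy configured for `tool` before it executes."""
    required = POLICIES.get(tool, set())
    if "P-T" in required and integrity != "trusted":
        return False
    if "P-F" in required and not recipients <= readers:
        return False
    return True
```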
2.1. Guarantees Enforced at the Planner Level
With taint tracking and policies correctly applied, the planner provides two critical assurances:
- Untrusted inputs cannot trigger protected actions. Prompt injection may influence reasoning, but it cannot cross integrity boundaries.
- Sensitive data cannot flow to unauthorized recipients. Even when an agent is manipulated, the impact of that manipulation is bounded.
These guarantees hold regardless of the model’s internal behavior. They do not depend on prompt engineering, alignment, or the model “doing the right thing”. They are enforced structurally, by construction.
3. FIDES: Advanced Information-Flow Control
The basic planner with dynamic taint tracking has a fundamental limitation. Whenever a tool returns untrusted or confidential data, that data immediately taints the conversation history. As a result, subsequent planner decisions are constrained, and many otherwise legitimate tool calls become disallowed by policy.
The variable-passing planner partially mitigates this issue by storing tool results in variables rather than appending them directly to the conversation. However, this alone is not sufficient for complex agent workflows.
To address these limitations, the paper introduces FIDES: a variable-passing planner equipped with more advanced information-flow control mechanisms.
At a high level, FIDES improves expressiveness without weakening security. It does so through two key ideas:
- Selective introduction of variables, and
- Constrained inspection of variables using typed outputs.
3.1. Selective Introduction of Variables
In earlier planners, every tool result was appended to the conversation history, immediately raising the security label of the current context. In FIDES, this is no longer the default behavior.
Instead, before appending a tool result, the planner applies a function conceptually called HIDE, which examines the result structure node by node.
The logic is simple: if a node's label flows to the current context label, its contents are appended directly to the conversation; otherwise, the node is replaced by a reference to a fresh variable that stores its contents under their original label.
3.1.1. Example: Selective Variable Introduction in Practice
Consider an agent tasked with handling support tickets. The agent retrieves a ticket from an external system and receives the following tool result:
- Ticket ID: #48291
- Subject: “Account locked after failed login attempts”
- Description: user-provided free text
- Internal notes: security-sensitive metadata
The description field originates from an external user and is therefore labeled untrusted, while the internal notes may be labeled confidential. Appending the entire result directly to the conversation history would raise the context label, restricting which tools the planner can call next.
With FIDES, the planner instead applies selective variable introduction:
- Fields whose labels are at or below the current context label (e.g., the ticket ID and subject) are appended directly to the conversation.
- Fields whose labels are more restrictive (e.g., the description and internal notes) are stored in fresh variables, each retaining its original label.
- The conversation history contains references to these variables rather than their contents.
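This HIDE step can be sketched on the ticket example above. The label shapes, variable store, and naming scheme are illustrative assumptions, not the paper's implementation:

```python
import itertools

_fresh = itertools.count()
variables = {}  # variable store: contents never enter the conversation

def flows_to(label: dict, context: dict) -> bool:
    """May data with `label` appear in a context labeled `context`?"""
    ok_integrity = (label["integrity"] == "trusted"
                    or context["integrity"] == "untrusted")
    ok_readers = context["readers"] <= label["readers"]
    return ok_integrity and ok_readers

def hide(result: dict, field_labels: dict, context: dict) -> dict:
    """Walk the tool result field by field: fields whose label flows to
    the current context are kept inline; more restrictive fields are
    stored in fresh variables and replaced by references."""
    shown = {}
    for field, value in result.items():
        if flows_to(field_labels[field], context):
            shown[field] = value
        else:
            name = f"var{next(_fresh)}"
            variables[name] = (value, field_labels[field])
            shown[field] = f"<{name}>"
    return shown
```

Run on the support ticket, the ticket ID and subject stay inline, while the untrusted description and the confidential internal notes are replaced by variable references, so the context label never rises.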
3.1.2. Isolating Sensitive Data Without Breaking Planning
With selective variable introduction, the planner can continue issuing Query actions without raising the security label of the conversation history.
Sensitive or untrusted data is stored in variables instead of being appended directly, keeping the current context clean while preserving access to the data when needed.
This provides the same protection as fully hiding tool results, but without sacrificing planning capability. The planner can still reference stored variables in later steps, even though their contents are not exposed in the conversation.
This separation enables fine-grained policies. For example, when calling send_message(recipient, message):
- The decision to act and the recipient must originate from a trusted context,
- While the message may safely depend on untrusted data, such as web content.
Such distinctions are not possible with a basic taint-tracking planner, and are precisely what make FIDES practical for real agent workflows.
3.2. Constrained Inspection of Variables
In earlier planners, inspecting a variable meant revealing its full contents to the planner. This immediately tainted the conversation history with the variable’s label, often restricting which tools could be called next.
In FIDES, inspection is no longer an all-or-nothing operation.
Instead of always expanding a variable into the conversation, the planner can perform constrained inspection, extracting only limited, structured information from a variable while preserving information-flow guarantees.
This is achieved by combining variable inspection with the Dual-LLM pattern and constrained decoding.
3.2.1. Example: Inspecting Variables with Bounded Information
Consider an agent assisting with access reviews. The agent retrieves a list of permissions from an external system and stores the result in a variable:
- User permissions: a list of roles and entitlements
- Source: external system
- Label: untrusted
The planner now needs to decide whether escalation is required. It does not need the full permission list, only a simple answer to a specific question: Does this user hold any privileged roles?
Expanding the variable directly would expose untrusted data to the planner and taint the conversation history. Instead, FIDES allows the planner to query the variable using an isolated LLM with a constrained output schema.
For example, the planner issues a query such as:
- Question: “Does the permission set contain any admin-level roles?”
- Output schema: bool
The isolated LLM processes the variable contents but is restricted to producing a Boolean result. The output is stored in a new variable with a label that reflects both its origin and its bounded information capacity.
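A sketch of this constrained inspection, with a stub standing in for the quarantined model; in a real system `quarantined_llm` would be an isolated LLM call under constrained decoding, and both names here are assumptions:

```python
def inspect_bool(question: str, contents: str, quarantined_llm) -> bool:
    """Constrained inspection: the isolated model reads the variable's
    contents, but its answer is restricted to a boolean. The planner
    only ever sees this one bit, never the raw data."""
    answer = quarantined_llm(question, contents)
    if answer not in ("true", "false"):
        raise ValueError("constrained decoding violated: expected a boolean")
    return answer == "true"

# Stub standing in for the isolated model (an assumption for this sketch):
def fake_llm(question: str, contents: str) -> str:
    return "true" if "admin" in contents else "false"

permissions = "roles: reader, editor, admin"
# The planner learns one bit -- escalation needed -- and nothing more.
needs_escalation = inspect_bool("Any admin-level roles?", permissions, fake_llm)
```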
3.2.2. Limiting Information Without Losing Control
By constraining inspection outputs, FIDES limits how much information can flow into the planning context.
Low-capacity outputs (such as Booleans or small enumerations) carry provably bounded information. They are far less useful for prompt injection or data exfiltration than unconstrained strings.
As a result:
- The planner can reason about sensitive or untrusted data without fully revealing it.
- The conversation history may remain at a lower security label.
- Policies can permit certain actions based on constrained outputs, even when the original data is untrusted.
3.3. Why FIDES Matters
FIDES resolves a key challenge in secure agent design: “How can agents remain flexible without letting untrusted or sensitive data affect every future decision?”
By selectively hiding data, tracking labels, and limiting what inspection can reveal, FIDES allows planners to stay both capable and safe.
The result is an agent that can:
- handle complex workflows,
- combine trusted and untrusted inputs safely, and
- enforce security policies consistently,
without depending on prompt engineering or model alignment.
Together, these mechanisms make secure, real-world agent behavior practical.
Across these three parts, we moved step by step:
- from agent loops,
- to planners,
- to labeled data,
- to enforceable policies,
- to concrete guarantees.
The core takeaway is simple but profound:
Once you give agents the authority to act, security must live in the architecture — not in the prompt.
Information-flow control gives us a way to build agents that can reason freely while acting safely. Not by trusting the model, but by constraining what its decisions are allowed to affect.
If you’re building autonomous agents that interact with real systems, this line of work is worth your attention. It shows that we don’t have to choose between autonomy and security. We can engineer both!
Follow to stay updated on future deep dives into secure agent architectures.
Securing AI Agents with Information Flow Control (Part III) was originally published in InfoSec Write-ups on Medium, where people are continuing the conversation by highlighting and responding to this story.