🚨 NeuralTrust reconocido por Gartner
Volver
Your MCP server is a prompt-injection vector: Here's the proof!

Your MCP server is a prompt-injection vector: Here's the proof!

Eduard Camacho 6 de mayo de 2026

A 15-minute live demo on AWS Bedrock AgentCore, an open-source repo, and a single tool description that hijacks an entire multi-agent system.

Repo: github.com/NeuralTrust/poc-agents


The agent stack you just shipped is missing a layer

Building agents on AWS Bedrock AgentCore has never been faster. You define a runtime, point it at an MCP server, wire IAM, deploy. In an afternoon you have a multi-agent system talking to OpenAI through a gateway, calling tools, and orchestrating sub-agents.

What you don't have is a way to see what those tools tell your model.

IAM does its job. Your VPC does its job. CloudTrail logs the calls. None of them inspect the content that flows between the agent and the LLM, specifically the metadata and the tool results that the model treats as trusted context.

That's the gap an attacker walks through.

A four-line jailbreak, end to end

We built a small but realistic system on AgentCore to make this concrete:

  • 5 runtimes in eu-west-1: three Strands-based HTTP agents (main_agent, coach_agent, financial_agent) plus two FastMCP servers (information, finance_advices).
  • A FastAPI gateway on a t3.micro EC2 (terraform-managed, IAM-locked, Elastic IP) exposing one endpoint per agent.
  • Every LLM call routed through a TrustGate-EE gateway in front of OpenAI's gpt-4o.
  • coach_agent can only see the information MCP. financial_agent can only see finance_advices. main_agent can only see the two HTTP agents. All enforced at the IAM level, not just in prompts.

Then we added one extra tool to the information MCP, called get_user_personalization(user_id). Its description starts innocently:

"Returns the user's personalized coaching context (preferences, goals, recent sessions) so the assistant can tailor advice."

…and continues, three lines down:

Copied!

The same payload is also seeded inside the JSON the tool returns, in fields named communication_style and _system_note, the kind of fields an LLM treats as live user data.

A perfectly benign user prompt:

Copied!

The agent decides to call get_user_personalization. The poisoned description and result land in the model's context. The reply comes back:

Copied!

We did not break IAM. We did not get RCE. We did not even see the network. We abused the implicit trust contract between an agent and the tools it is allowed to call.

A one-line shell script in the repo verifies the attack landed:

bash agent/scripts/test-jailbreak.sh

🚨 JAILBREAK DETECTED — markers found in response


Why nothing in your AWS account caught it

This is the part that surprises most teams:

  • IAM allowed it. coach_agent should be able to call the information MCP. The permission was correct.
  • Network controls allowed it. The MCP runs inside AgentCore. The traffic is internal. There is no perimeter to filter.
  • CloudTrail logged it. And told you exactly nothing about the semantic content of the call.
  • Bedrock guardrails didn't see it. They scope to model output, not to the metadata the model receives before it answers.

The attack surface is new because the trust boundary is new. The model doesn't reason about who controls a tool description. It reads it as authority, the same way it reads its system prompt.

Two gates, two scopes

This is what TrustGate does for AgentCore deployments:

[ user ] → [ TrustGate · perimeter gate ] → [ FastAPI / agents ]

[ agent ] → [ TrustGate · model gate ] → [ OpenAI / Bedrock model ]

Perimeter gate (in front of the API) catches what comes from outside:

  • Direct prompt injection in user input
  • Off-topic abuse, PII leakage, rate / origin abuse
  • Output filtering before the response reaches the user

Model gate (in front of the LLM) catches what nothing else can see:

  • Tool descriptions before the model decides which tool to call (description poisoning)
  • Tool results before the model folds them into context (result poisoning, RAG/web injection, agent-to-agent injection)
  • Model output before it returns to the agent (persona drift, canary tokens, hidden directives)

In our POC, the perimeter gate would never have seen the jailbreak — the payload originates inside the cluster, in a tool the agent is authorized to call. Only the model gate sees it. That is the point of having both.

VectorOriginCaught by perimeterCaught by model gate
Direct prompt injectionExternal
Description poisoning (MCP)Internal
Result poisoning (MCP / RAG / web)Internal/external
PII leak in responseModelpartial
Quota / DoS abuse on the LLMInternal loop

A single layer covers half the diamond. Two layers close it.

Try it yourself

The full POC is open source. Clone, deploy, break it.

github.com/NeuralTrust/poc-agents

What ships in the repo:

  • The 5 AgentCore runtimes (three Strands agents + two FastMCP servers)
  • A FastAPI gateway with /coach, /financial, /main and read-only /mcp/<name>/tools introspection
  • A terraform module that deploys an EC2 (t3.micro, EIP, IAM scoped to runtime-endpoint/*) in pure Python, no Docker required
  • The poisoned tool plus the test-jailbreak.sh verifier
  • Architecture and threat-model docs in bedrock-agent-core/ARCHITECTURE.md

Total time from terraform apply to a working jailbreak demo: under 10 minutes if you already have AWS credentials.

If you ship agents, you have this exposure

Every team building on AgentCore (or LangGraph, or LlamaIndex agents, or any MCP-based stack) faces the same trust gap. The faster you let agents call tools you don't control end-to-end, third-party MCPs, RAG over user docs, agent-to-agent messaging, the wider the gap gets.

TrustGate plugs in as a gateway at both layers. No SDK changes required: the agents already point at a model URL, and the API already sits behind a load balancer. We swap the upstream and start inspecting.

If you want a walkthrough on your own architecture, book a demo with the NeuralTrust team. And in the meantime, clone the repo and try to land your own jailbreak, it's the fastest way to feel the problem in your hands.