May 11, 2026 · Edition #14

"Look, an instruction!" That's the bug.

Every week this newsletter covers a new place an attacker hid an instruction, and a new AI assistant that found it and ran it. Last September, ForcedLeak showed Salesforce Agentforce reading hidden instructions out of a Web-to-Lead form and exfilling CRM data through an expired allowlisted domain. In April, Comment and Control turned PR and issue comments into a credential-theft channel across Claude Code, Gemini CLI, and GitHub Copilot at once. This week it's two more: a GitHub issue body that hijacked Gemini CLI's triage agent and pushed code into a 101,000-star Google repo, and a prompt that reached a "run this as code" function inside Microsoft's Semantic Kernel. The clean read every time is "fix the plumbing, patch the code-execution function, lock down the keys, harden the loaders, fix the CSP." All true. All worth doing. But it misses why this is every week's story. Language models are wired to find instructions wherever they appear and act on them. That eagerness is their best feature, it's how they become useful assistants in the first place, and it's their main attack surface at the same time. Every plumbing fix patches the symptom. The behavior is the source. The fix has to land at the model itself: instructions should be opt-in, not opt-out. Your prompt is authorized. Everything else the model reads is data, even if it's phrased as a command, even if it's polite, even if it's wrapped in a system-prompt-looking wrapper inside an email, a lead form, a PR comment, or a GitHub issue. The default has to be "I found a sentence that looks like an order. I will ignore it unless I was told this source can give me orders." This is one of the hardest open problems in AI security. OpenAI's "instruction hierarchy," Anthropic's constitutional training, and Google's instruction-following research are all reaching for it, and none have shipped a default that holds against motivated attackers. Solving it likely needs structural changes, separating "trusted" instruction channels from "data" channels at the model level, tagging retrieved content at training time, or making the model demand explicit user permission before following anything that came from outside the user's prompt. None of these are weekend projects. And yes, until that lands, agents will be harder to build and less capable. Tool descriptions, retrieved documents, and planning steps all become "data" by default. Capabilities that secretly lean on "the model will follow whatever it reads" stop working. That is the price of not getting owned every week. Until then, the plumbing patches matter and they're still whack-a-mole. There is always one more place an attacker can hide a sentence, an email, a lead form, a comment, an issue, a vector store, the next surface someone wires an agent into. Add one line to your next agent vendor questionnaire: does this model treat retrieved text as data, or as instructions, by default? Most can't answer it cleanly today. That's the point, the question is worth asking even when the honest answer is "we're working on it," because that's where the real fix has to live. Until the labs ship a default that holds, we'll be writing this same story under a different headline next week. Worth knowing where the bug actually lives.

Listen to this episode