June 15, 2026 · Edition #19 · by Asaf Nakash

The Wall Faces the Wrong Way

A US export-control directive made Anthropic shut off two of its most capable models for every customer on earth at once. Strip away the geopolitics and you're left with a strange admission about how AI security actually works. The safeguard that took weeks of red-teaming to build fell in roughly forty-eight hours. The ban meant to contain the capability could stand for years. One of those numbers is measured in hours, the other in years, and they point in opposite directions.

Here's the inversion that stuck with me. We expect a barrier to stop the dangerous and wave through the rightful. This one does the opposite, and it's worth being concrete about who lands on each side. The person the order actually stops is the compliant defender: a security team using a paid, monitored, logged-in connection to do legitimate work. The person it doesn't stop is the one it was written for. A determined actor who wants frontier cyber capability can spin up their own copy of an openly available model on machines no one can remotely shut off, can talk their way past a safety filter in an afternoon, or can simply use one of the other frontier models that do the same thing. Anthropic itself says the capability is widely available. The export control reaches none of them. The lock fell in hours against the people it was built to stop, and it'll hold for years against the only people who were never the threat.

Two facts from this week make the same point from the other side. The guardrail came down in two days, and a line of new research argues that this isn't a tuning problem more red-teaming can fix: a model may never reliably separate the instructions it should trust from the hostile text it reads. If that's even close to right, making the model itself un-jailbreakable is the wrong place to spend the next decade. Security already has a name for what to do instead.

Years ago the field gave up on the dream of a perfect perimeter and adopted a posture called assume breach: stop betting everything on keeping the attacker out, and design as if one is already inside. That means least privilege, segmentation, constant monitoring, and a contained blast radius, so a single failure is survivable instead of fatal. It's the logic underneath Zero Trust, and most security leaders have lived it for a decade.

The step this week's evidence demands is to point that same posture at the model itself: assume the model is already jailbroken. Treat its guardrails as something that will fail rather than something you can finish hardening, and put your effort where it survives that failure: tight limits on what the model can touch, sandboxed execution, runtime containment, and a human in the loop wherever a mistake is expensive.

That's the thread under everything else this week, too: six coding assistants tricked into running code, an agent hijacked through a monitoring tool it was built to trust. None of it gets solved at the model layer. It gets solved by assuming the model is breached and securing the environment around it.

And the wall in this story faces the wrong way. The right wall isn't around the world's access to the model. It's around what your model is allowed to do once it's inside your house.
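
For readers who want to see what "assume the model is jailbroken" can look like in practice, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any vendor's product: the names `ToolCall`, `Policy`, `decide`, the tool names, and the workspace path are all hypothetical. The idea is that every action the model proposes passes through a gate the model cannot talk its way past, because the gate never reads the model's reasoning, only the requested action.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    """An action the model proposes. Treated as untrusted input."""
    tool: str
    target: str  # a file path, relative to the workspace


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: what the model may touch, and where."""
    allowed_tools: frozenset
    needs_approval: frozenset
    workspace: Path


def decide(call: ToolCall, policy: Policy,
           approve: Callable[[ToolCall], bool]) -> str:
    """Return "allow" or "deny" without trusting the model's intent."""
    # Least privilege: anything not explicitly allowed is denied.
    if call.tool not in policy.allowed_tools:
        return "deny"

    # Containment: the resolved target must stay inside the workspace,
    # which blocks "../" traversal and absolute paths.
    root = policy.workspace.resolve()
    resolved = (root / call.target).resolve()
    if not resolved.is_relative_to(root):  # Python 3.9+
        return "deny"

    # Human in the loop: expensive actions need an explicit yes.
    if call.tool in policy.needs_approval and not approve(call):
        return "deny"

    return "allow"


# Example policy (all values are illustrative assumptions).
POLICY = Policy(
    allowed_tools=frozenset({"read_file", "write_file"}),
    needs_approval=frozenset({"write_file"}),
    workspace=Path("/srv/agent-workspace"),
)
```

Nothing here depends on the model behaving well. A fully hijacked model can still only read inside one directory and can only write when a person says yes, which is the contained blast radius the assume-breach posture is after. A real deployment would add sandboxed execution and logging around this gate.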

Written by Asaf Nakash, Principal Product Manager for AI Security at Microsoft Defender and host of the Context Window podcast. Originally published in Context Window Edition #19, June 15, 2026.