Sixty percent of the kernel exploits submitted to Google's kCTF reward program in a single year hit one feature. Not a sprawling subsystem with decades of cruft. One interface, barely a few years old: io_uring. Google paid out roughly $1 million in bounties for io_uring bugs alone, per the Google Online Security Blog in June 2023, then did the thing that should make you sit up. It turned the feature off. On ChromeOS, on Android via a seccomp filter, and on its own production servers.
That is the headline. The part that should actually change how you configure a cluster is quieter, and it has nothing to do with any single bug.
The feature works exactly as designed, and that is the problem
io_uring is a ring-buffer interface. A process drops read, write, network accept, and even process-spawn requests into a shared queue, and the kernel picks them up without a system call per operation. That is the whole point. Fewer syscalls means less context switching, which means more IOPS. The benchmark crowd has wanted this for a decade, and the throughput numbers are real.
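A toy model makes the accounting concrete. The sketch below is not the real io_uring ABI; ToyRing, queue, submit, and classic_syscalls are hypothetical names invented for illustration. It only counts how often a batch of operations crosses the user/kernel boundary in each style.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRing:
    """Toy submission/completion ring. Illustrative only, not the kernel ABI."""
    submissions: deque = field(default_factory=deque)
    completions: deque = field(default_factory=deque)
    boundary_crossings: int = 0  # stands in for syscalls a tracer could observe

    def queue(self, op: str) -> None:
        # Queuing is a write into shared memory: no kernel entry happens here.
        self.submissions.append(op)

    def submit(self) -> int:
        # A single crossing hands the whole batch to the "kernel".
        self.boundary_crossings += 1
        handled = len(self.submissions)
        while self.submissions:
            self.completions.append((self.submissions.popleft(), "ok"))
        return handled


def classic_syscalls(ops) -> int:
    # The traditional path: one syscall per operation.
    return len(ops)


ops = ["read", "write", "accept", "read"]
ring = ToyRing()
for op in ops:
    ring.queue(op)
ring.submit()

print("classic crossings:", classic_syscalls(ops))
print("ring crossings:   ", ring.boundary_crossings)
```

In this model four operations cost four crossings the classic way and one through the ring. Real io_uring can go further: with kernel-side submission polling (the IORING_SETUP_SQPOLL flag) the kernel can pick up queued work without the application making a submit call at all.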
Now look at the same fact from the other side of the fence. Almost every Linux runtime-security tool in production was built on one assumption: a process that touches a file or opens a socket has to issue a syscall to do it. Falco hooks syscalls. So do the kprobe and eBPF agents that watch the syscall boundary. Microsoft Defender for Endpoint on Linux leans on the same vantage point. io_uring quietly steps around that boundary, by design, for performance reasons that have nothing to do with hiding.
So you get two clocks running off one mechanism. The performance clock reads: fewer syscalls, more throughput, ship it. The observability clock reads: fewer syscalls means fewer events for your detection stack, and "fewer" slides toward "none" as more of the workload moves into the ring. Both readings are correct at the same time. Most adoption roadmaps only printed the first one.
The rootkit that makes no syscalls
This stopped being a thought experiment in April 2025. ARMO published a proof-of-concept rootkit it called Curing that performs command-and-control, file access, and process execution entirely through io_uring operations, issuing no traditional system calls at all. Per ARMO, io_uring exposes 61 operation types covering file reads and writes, network connect and accept, and process spawning. That is not a narrow primitive. That is a full toolkit for an implant.
The test results are the uncomfortable bit. In ARMO's testing, Falco was "completely blind" because it relies on syscall hooking. Defender for Endpoint missed the activity except where File Integrity Monitoring caught the file change after the fact, which is to say it noticed the burglary by spotting the missing TV. Tetragon could detect it, but only if the operator had already configured policies to hook the specific io_uring operations.
Read that last one twice. A tool that defends you only when you pre-arm it for an attack class you have never heard of is not defending you. It is waiting for you to do its job.
This is a Kubernetes problem before it is anything else
Here is where it gets operationally sharp, because your detection assumptions and your runtime defaults may quietly disagree, and the disagreement is decided by a single field.
The containerd project debated whether to strip io_uring syscalls out of its RuntimeDefault seccomp profile (issue #9048). GKE Autopilot applies the containerd default seccomp profile to every workload, so on Autopilot io_uring is blocked by default. Good. But a self-managed cluster with a permissive profile, or worse a pod running Unconfined, has no such guard. Same tooling, opposite exposure. The difference is one line in a security context that nobody reviewed.
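One way to find that line is to compute the effective seccomp profile for every container in a pod manifest. The sketch below assumes the manifest has already been parsed into a dict (for example from kubectl get pods -o json); effective_seccomp and risky_containers are hypothetical helper names, not part of any Kubernetes client. It relies on the documented rule that a container-level securityContext.seccompProfile overrides the pod-level one.

```python
def _profile_type(security_context):
    """Return seccompProfile.type from a securityContext dict, or None."""
    return ((security_context or {}).get("seccompProfile") or {}).get("type")


def effective_seccomp(pod: dict) -> dict:
    """Map each container name to its effective seccomp profile type.

    A container-level setting wins over the pod-level one. "Unset" means
    neither level set the field.
    """
    spec = pod.get("spec", {})
    pod_level = _profile_type(spec.get("securityContext"))
    result = {}
    for key in ("initContainers", "containers", "ephemeralContainers"):
        for container in spec.get(key) or []:
            own = _profile_type(container.get("securityContext"))
            result[container["name"]] = own or pod_level or "Unset"
    return result


def risky_containers(pod: dict) -> list:
    """Containers that are Unconfined or never had a profile set."""
    return sorted(
        name for name, kind in effective_seccomp(pod).items()
        if kind in ("Unconfined", "Unset")
    )


# Hypothetical manifest: a safe pod-level default, overridden by one container.
example_pod = {
    "spec": {
        "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
        "containers": [
            {"name": "app"},
            {"name": "debug",
             "securityContext": {"seccompProfile": {"type": "Unconfined"}}},
        ],
    }
}
print(risky_containers(example_pod))
```

Treating "Unset" as risky is a deliberately conservative assumption: unless the kubelet has been configured to default to the runtime profile, a pod with no seccompProfile runs unconfined.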
I have seen this pattern bite teams in a way that has nothing to do with io_uring specifically: the "secure default" everyone cites lives in the managed platform, and the moment you hand-roll a node pool to save money or gain control, you inherit the permissive version without anyone deciding to. io_uring is just the latest place that gap shows up.
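Rather than trusting the default everyone cites, you can ask the kernel directly from inside a container. The following is a minimal sketch, assuming Linux and a libc reachable through ctypes; probe_io_uring and classify are hypothetical names. It attempts io_uring_setup (syscall 425) with a zeroed parameter buffer and reports how the kernel answered. The mapping of EPERM to "blocked" reflects the common behavior of seccomp deny rules and of kernel.io_uring_disabled, but a custom filter may return something else, so read the verdict as a hint.

```python
import ctypes
import errno
import os
import sys

SYS_IO_URING_SETUP = 425  # io_uring_setup syscall number


def classify(ret: int, err: int) -> str:
    """Turn an io_uring_setup return value and errno into a coarse verdict."""
    if ret >= 0:
        return "allowed"
    if err == errno.EPERM:
        return "blocked"      # typical of a seccomp deny or io_uring_disabled
    if err == errno.ENOSYS:
        return "unavailable"  # no io_uring in this kernel, or a filter returning ENOSYS
    return "inconclusive: " + errno.errorcode.get(err, str(err))


def probe_io_uring() -> str:
    """Try to create a one-entry ring and report what happened."""
    if not sys.platform.startswith("linux"):
        return "unavailable"
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # Zeroed buffer, deliberately larger than struct io_uring_params.
    params = ctypes.create_string_buffer(256)
    ret = libc.syscall(SYS_IO_URING_SETUP, 1, params)
    err = ctypes.get_errno()
    if ret >= 0:
        os.close(ret)  # the probe succeeded; release the ring's file descriptor
    return classify(ret, err)
```

Running probe_io_uring() from a debug shell in a hand-rolled node pool and again on a managed one is a quick way to see whether the two environments really share the default you think they do.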
Why there is no patch coming
It is tempting to wait for a CVE and a kernel update to make this go away. No such patch exists, and none is coming, because nothing is broken in the bug sense. io_uring is doing what it was designed to do. Per Google's 2023 assessment the component "provides strong exploitation primitives," and it remains actively developed, so the attack surface grows over time rather than shrinking.
The artifact you trusted is the one telling you everything is fine. Deep syscall visibility was Falco's whole pitch, and deep syscall visibility is precisely what io_uring routes around. That is the risk class worth naming: not a vulnerable component, but a trusted sensor pointed at the wrong boundary.
The fix the researchers point to is to move the sensor. KRSI, Kernel Runtime Security Instrumentation, attaches eBPF programs to Linux Security Module hooks. An LSM hook fires on the operation itself, at the point the kernel decides whether to allow it, regardless of whether the request arrived as a syscall or through a ring. Falco has since added io_uring visibility built on this approach. The catch: it is not the historical default, and you have to confirm it is actually switched on rather than assume the version you deployed two years ago grew the capability on its own.
The fair objection, and the honest answer
If io_uring is this dangerous, why is anyone turning it on? Because for trusted, first-party, high-throughput services the performance is genuinely worth it, and a workload that never executes untrusted code carries a far smaller threat model. That objection is correct, and it is exactly Google's own position: io_uring is safe for trusted components and a liability the moment it sits behind untrusted or internet-facing code paths.
The mistake is not enabling io_uring. The mistake is treating it as a neutral default instead of a scoped decision you made on purpose. Enable it where you own the entire stack. Block it where you run other people's code. The failure mode is leaving that choice to whatever the base image shipped.
What to check this week
Work this top to bottom. Every step ties to a signal you can query right now, not a vibe.
- Decide per workload, not once for the fleet. If a service handles untrusted input or runs multi-tenant, default it to no io_uring. If it is a first-party high-IOPS service you fully control, allowing it is defensible. Write the decision down so the next person does not silently flip it.
- Check the kernel knob. Run sysctl kernel.io_uring_disabled (the control landed in Linux 6.6). Value 0 allows io_uring, 1 restricts it to processes with CAP_SYS_ADMIN or membership in the group set by kernel.io_uring_group, and 2 disables it host-wide. If the host runs untrusted workloads and you do not actively need the feature, set it to 2.
- Confirm the seccomp profile is applied, not assumed. In Kubernetes set securityContext.seccompProfile.type: RuntimeDefault, then verify that io_uring_setup (425), io_uring_enter (426), and io_uring_register (427) are actually blocked for the pod. On GKE Autopilot this is on by default. On self-managed nodes, audit specifically for pods running Unconfined, because that is where the hole lives.
- Do not trust a syscall-only detector to see any of this. If you run Falco, confirm you are on a build with io_uring/KRSI support enabled rather than stock syscall hooking. If you run Tetragon, add an explicit TracingPolicy that hooks io_uring operations, because the default policies will not. If your only signal is File Integrity Monitoring catching the aftermath, you are detecting break-ins by inventory.
- Baseline what should never touch the ring. A standard web app or a logging sidecar issuing io_uring calls is itself the anomaly. Alert on io_uring usage from any workload that has no performance reason to want it.
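The kernel-knob check in the list above can be scripted across a fleet. This is a minimal sketch, assuming the standard procfs location for the sysctl; io_uring_knob and MEANINGS are hypothetical names, and the path is a parameter so the function can be pointed at a test file.

```python
from pathlib import Path

# Meanings of kernel.io_uring_disabled, as described in the checklist.
MEANINGS = {
    0: "allowed for all processes",
    1: "restricted to privileged processes",
    2: "disabled host-wide",
}


def io_uring_knob(path: str = "/proc/sys/kernel/io_uring_disabled"):
    """Return (value, meaning) for the sysctl, or (None, reason) if it is missing."""
    try:
        raw = Path(path).read_text().strip()
    except FileNotFoundError:
        # Assumption: a missing file means a pre-6.6 kernel or no io_uring support.
        return None, "knob absent (kernel older than 6.6, or io_uring not built in)"
    value = int(raw)
    return value, MEANINGS.get(value, "unrecognized value")


if __name__ == "__main__":
    value, meaning = io_uring_knob()
    print(f"kernel.io_uring_disabled = {value}: {meaning}")
```

A missing knob is worth flagging on its own: on such a host the sysctl cannot restrict io_uring at all, so the seccomp profile is the only guard left.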
The one-line version for the runbook: io_uring buys IOPS by skipping the boundary your security tools watch. Adopt the speed without moving detection down to the LSM layer and you have not made the system faster, you have made the attacker quieter.
