Run AI Fuzzing on Your Own Repo Before Attackers Do

OSS-Fuzz-Gen, Buttercup, and Atheris run on a laptop, and the LLM-backed ones need only one API key. Ten practical tips, with commands, to fuzz your own repo before attackers do.

June 2026's Patch Tuesday shipped 206 CVEs. A record. The reason is not that humans suddenly got better at code review. Model-backed fuzzing is finding bugs faster than the patch cycle can absorb them, and the uncomfortable part is that the same engines doing it are open source and run on a laptop. No DARPA grant required. You need an afternoon, one LLM API key, and the willingness to point these tools at your own repo before someone else points theirs at it.

This is for the engineer who read "AI broke your patch cadence" and asked the obvious follow-up: can I run the AI bug-finder on my own code first? You can. Below are ten tips drawn from the systems that actually shipped the bugs, with the exact commands instead of the marketing.

The tips

Let the model write the harness, not the fuzzer. The hard part of fuzzing was never the fuzzer. It was writing a target that exercises real code paths, and most teams never get around to it. Google's OSS-Fuzz-Gen flips the work: you hand it a GitHub URL and it synthesizes the build, the Dockerfile, and the fuzz target for you. Google reported absolute coverage gains of up to 35% across more than 160 projects and 26 new vulnerabilities, including one in OpenSSL that human-written targets had missed for years. Start here:

   oss-fuzz-generator generate-full -m ${MODEL} -i "https://github.com/your-org/your-repo"

Run Buttercup locally as a full find-and-patch loop. Trail of Bits open-sourced Buttercup on 2025-08-08 after it took the $3M second prize at DARPA's AIxCC, where it found 28 vulnerabilities across 20 CWEs at roughly $181 per point using only non-reasoning LLMs. The standalone build is tuned for a workstation: 8 cores, 16 GB RAM, Docker, one API key.

   git clone --recurse-submodules https://github.com/trailofbits/buttercup.git
   cd buttercup && make setup-local && make deploy
   make web-ui   # http://localhost:31323

Validate the pipeline with make send-libpng-task before you aim it at your own OSS-Fuzz project via the CUSTOM_CHALLENGES.md guide. Confirm the known-good case works first, then trust the unknown.

Aim the AI at unfuzzed code, not your existing corpus. The biggest wins live in functions no human ever wrote a target for. Use Fuzz Introspector (bundled in the OSS-Fuzz-Gen flow) to rank reachable-but-uncovered functions, then generate targets for the top of that list. Re-fuzzing coverage you already have is wasted spend. The bugs are in the code your current suite never touches, which is exactly the code nobody wanted to write a target for in the first place.
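
The ranking step can be sketched in a few lines. This is a minimal illustration, not Fuzz Introspector's actual report schema: the FunctionStat fields and the rank_targets helper are hypothetical names, and you would have to map them from whatever coverage report you export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionStat:
    """One row of a coverage report. Field names are hypothetical."""
    name: str
    reachable: bool      # statically reachable from some entry point
    covered_lines: int   # lines already hit by the existing corpus
    total_lines: int
    complexity: int      # e.g. cyclomatic complexity


def rank_targets(stats, limit=10):
    """Return reachable functions, most untested code first."""
    candidates = [
        s for s in stats
        if s.reachable and s.covered_lines < s.total_lines
    ]
    # Sort by number of uncovered lines, then by complexity as a tie-breaker.
    candidates.sort(
        key=lambda s: (s.total_lines - s.covered_lines, s.complexity),
        reverse=True,
    )
    return candidates[:limit]
```

Functions that are fully covered or unreachable drop out entirely, so model spend goes only to targets that can add coverage.
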
Fuzz Python and native extensions with Atheris. Most of these tools are C/C++ first, and that leaves a lot of shipped code uncovered. Atheris is a coverage-guided engine for Python and CPython native extensions, built on libFuzzer, supporting Python 3.11 through 3.13. A working target is about ten lines:

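   import sys
   import atheris

   # Import the code under test inside instrument_imports() so Atheris
   # collects coverage from it. "my_package" is a placeholder name.
   with atheris.instrument_imports():
       from my_package import my_parser

   @atheris.instrument_func
   def TestOneInput(data: bytes):
       my_parser(data)        # the function you actually ship

   atheris.Setup(sys.argv, TestOneInput)
   atheris.Fuzz()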

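If coverage looks suspiciously flat, the code under test is probably not instrumented. The @atheris.instrument_func decorator instruments only the function it wraps, so import the module you are fuzzing inside a with atheris.instrument_imports(): block, or call atheris.instrument_all() before atheris.Fuzz(). This trips up everyone once.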

Wire short fuzzing into CI with ClusterFuzzLite. You do not have to qualify for Google's hosted OSS-Fuzz, and closed-source code never will. ClusterFuzzLite brings CIFuzz-style fuzzing to your own CI: in code-change mode it fuzzes each pull request for a few minutes, focusing on the code the change touches, so a regression that introduces a crash fails the PR instead of shipping. The signal to gate on is a crash-* artifact from the job and a non-zero exit in your CI logs. Block the merge on that, not on a nightly run nobody reads.
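
If your CI system does not already fail the job for you, the gate itself is small. The sketch below assumes the fuzzing step leaves its artifacts in a directory you pass in. The gate and find_artifacts names are made up for illustration, and the prefix list is the set libFuzzer uses when it writes a failing input to disk.

```python
from pathlib import Path

# File-name prefixes libFuzzer uses for inputs that triggered a failure.
ARTIFACT_PREFIXES = ("crash-", "oom-", "timeout-", "leak-")


def find_artifacts(artifact_dir):
    """Return every fuzzer artifact under artifact_dir, sorted by path."""
    root = Path(artifact_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.startswith(ARTIFACT_PREFIXES)
    )


def gate(artifact_dir):
    """Return a process exit code: 1 if any artifact exists, else 0."""
    found = find_artifacts(artifact_dir)
    for path in found:
        print(f"fuzzer artifact: {path}")
    return 1 if found else 0
```

Call it as the last CI step with sys.exit(gate("path/to/artifacts")). A missing directory counts as a pass here, so check separately that the fuzzing job actually ran.
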
Cap the model spend before you start, and prefer non-reasoning models. Buttercup's headline number, $181 per point, came from non-reasoning LLMs precisely because reasoning models burn tokens on triage you can do more cheaply. Set a hard token or dollar cap per run, supply an OpenAI or Anthropic key (Buttercup needs at least one provider key), and measure cost-per-confirmed-bug rather than cost-per-run. An uncapped agent loose on a large monorepo will find your billing limit before it finds your bug.
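
A cap and the cost-per-confirmed-bug metric can both live in a few lines of bookkeeping. This is a sketch under stated assumptions: SpendCap and cost_per_confirmed_bug are hypothetical helpers, not part of Buttercup or any provider SDK, and the per-token prices are placeholders you would replace with your provider's current rates.

```python
class BudgetExceeded(RuntimeError):
    """Raised once recorded spend passes the cap."""


class SpendCap:
    def __init__(self, cap_usd, usd_per_1k_input, usd_per_1k_output):
        self.cap_usd = cap_usd
        self.usd_per_1k_input = usd_per_1k_input
        self.usd_per_1k_output = usd_per_1k_output
        self.spent_usd = 0.0

    def record(self, input_tokens, output_tokens):
        """Add one request's usage; raise if the run is now over budget.

        Usage is only known after a request returns, so the cap can be
        overshot by at most one request.
        """
        self.spent_usd += (
            input_tokens / 1000 * self.usd_per_1k_input
            + output_tokens / 1000 * self.usd_per_1k_output
        )
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f}"
            )


def cost_per_confirmed_bug(total_spend_usd, confirmed_bugs):
    """Return dollars per human-confirmed bug, or None if none confirmed."""
    if confirmed_bugs == 0:
        return None
    return total_spend_usd / confirmed_bugs
```

Call record() with the token counts each response reports, and let BudgetExceeded stop the run. Only bugs a person has confirmed go into the denominator.
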
Keep a human on triage, because the bug count lies. Google's Big Sleep workflow draws the line explicitly: the agent finds and reproduces, a human verifies. That separation matters because AI discovery is already overwhelming bug-tracking systems, and a flood of unverified findings is noise wearing a severity label. Treat every machine-reported crash as a candidate until a person confirms the reproduction. Big Sleep's standout, the SQLite memory corruption flaw CVE-2025-6965, mattered because a human confirmed it was real and imminent, not because a model emitted it. The same human-on-the-loop line shows up on the defense side in agentic SOC triage, and for the same reason.
Let the AI propose the patch, then gate it like any other PR. OSS-Fuzz's agentic patching generated plausible fixes for 61% to 72% of historical vulnerabilities, and some have already merged upstream. At AIxCC, finalist systems patched 43 of the 54 unique vulnerabilities they found. Strong assist, not autopilot. Route AI patches through the same review, test, and signoff you would give a junior engineer's first PR. The failure mode to watch for is a patch that silences the crash without fixing the underlying bug.
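
One way to catch the crash-silencing failure mode before a reviewer sees the patch is to require two things of the patched build: the saved crash input no longer fails, and the ordinary test suite still passes. The sketch below assumes you can express both as commands that exit 0 on success. gate_patch and run_ok are illustrative names, not part of any of the tools above.

```python
import subprocess


def run_ok(cmd, timeout=600):
    """Run a command; True only if it exits 0 within the timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def gate_patch(repro_cmd, test_cmd):
    """Check an AI-proposed patch against the patched build.

    repro_cmd replays the saved crash input (exit 0 means no crash).
    test_cmd runs the project's functional tests.
    """
    if not run_ok(repro_cmd):
        return "reject: crash still reproduces"
    if not run_ok(test_cmd):
        return "reject: crash is gone but the test suite fails"
    # Passing both checks is necessary, not sufficient.
    return "needs human review"
```

A patch that removes the crash by deleting or short-circuiting the feature tends to fail the second check. Everything that survives still goes to a human reviewer.
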
Reuse the AIxCC systems instead of building your own. DARPA open-sourced four of the seven finalist cyber reasoning systems at the August 2025 results, and the field surfaced 18 previously unknown real-world flaws on top of the synthetic set. Before you write a custom orchestration layer, read Team Atlanta's ATLANTIS and Buttercup. The decisions you would agonize over, where to put static analysis, where to put the LLM, how to pair a vulnerability with its patch, are already made and already stress-tested under competition load. Borrow them.
Reproduce before you trust, and check the corpus into the repo. Every one of these tools can hand you a crashing input. A finding you cannot replay on demand is not a finding, it is a rumor. Re-run the target against the saved crash file, confirm the same stack trace, then commit the input into a corpus directory so the bug becomes a permanent regression test. The corpus is the durable asset. The one-off crash is disposable.
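
The replay-then-commit step can be scripted. This sketch relies on libFuzzer-style targets (Atheris included) running a single input when given a file path as an argument. The reproduces and add_to_corpus helpers and the expected_frame parameter are hypothetical names, and matching on one function name is a deliberately crude stand-in for comparing full stack traces.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def reproduces(target_cmd, crash_file, expected_frame, timeout=60):
    """Replay one input; True if the target fails and names expected_frame."""
    try:
        result = subprocess.run(
            [*target_cmd, str(crash_file)],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    output = result.stdout + result.stderr
    return result.returncode != 0 and expected_frame in output


def add_to_corpus(crash_file, corpus_dir):
    """Copy the input into corpus_dir under its SHA-256 hash; return the path."""
    data = Path(crash_file).read_bytes()
    dest = Path(corpus_dir) / hashlib.sha256(data).hexdigest()
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        shutil.copyfile(crash_file, dest)
    return dest
```

Naming the file by content hash keeps the corpus free of duplicates when several runs rediscover the same input. Commit the corpus directory with the fix so the input replays on every future run.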

Wrap-up

If you do one thing this week, start with the first tip: point OSS-Fuzz-Gen at your most security-sensitive parser and let it write the target you never got around to writing. Then apply the last tip to whatever it finds, because a crash you can replay is the only kind worth your time. The 206-CVE month is not a spike; it is the new floor, and the asymmetry only bites if attackers run these tools and you do not. They are free, they run locally, and the cost of entry is one API key. Close the gap before the CVE closes it for you.
