AI Auditor Builder · May 31, 2026 · 8 min read

The AI Auditor Arena: Benchmarking Against 118 Real Code4rena Findings

Most AI auditor benchmarks use toy code with planted bugs. The Arena uses 118 real Code4rena findings (41 High, 77 Medium) across 10 contests and 19,200 lines of Solidity. Submit your auditor and get a precision and recall score.

By Carlos (Bloqarl)

TL;DR

  • Most AI auditor benchmarks are useless. They use small toy contracts with planted bugs and simulate audit conditions that look nothing like production work.
  • The AI Auditor Arena uses 118 real findings (41 High plus 77 Medium) across 10 actual Code4rena contests, totaling 19,200 lines of Solidity.
  • The contests span protocol categories: AMM, lending, NFT, governance, cross-chain swaps, stablecoin with endogenous collateral, gaming, soulbound incentives, and credit/governance hybrids.
  • You submit your AI auditor's output as a JSON manifest of findings. The Arena scores precision (TP / (TP + FP)) and recall (TP / total official findings) against the known finding set.
  • This is the benchmark we use to develop Krait. It is also the benchmark you should use if you are claiming your AI auditor outperforms baseline scanners.

Why this matters

Benchmarks define the field. If everyone running AI security tools claims their tool catches reentrancy on the SWC-107 example, nothing differentiates them. If everyone benchmarks against contests where the bugs were found by humans under bug-bounty conditions, you start to see which tools are actually pulling weight.

Most published AI auditor benchmarks today fall into two failure modes. They use synthetic contracts with planted bugs (way easier than real code), or they cite a single statistic like "we found 5 of 7 SWC categories" without showing precision (which is where most tools collapse).

Real audit work is different. Real protocols are 1,000 to 5,000 lines of code, often spread across many contracts. Bugs are subtle, contextual, and frequently require understanding the protocol's economic intent. False positives waste auditor time at scale; missing high-severity findings is a career-ending mistake. A benchmark that does not measure both is not measuring much.

The Arena was built to fix that. It is the benchmark we wanted to exist when we started developing Krait, our open-source AI auditor.

The 10 contests in the Arena

Each contest in the Arena was a real public security review on Code4rena. The contracts have been audited by dozens of human auditors and the findings have been triaged by Code4rena judges. We took the post-judging finding set as ground truth.

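| Contest               | Domain                             | LoC    | High | Medium | Total |
|-----------------------|------------------------------------|--------|------|--------|-------|
| Basin                 | AMM + Oracle (Beanstalk Wells)     | 1,200  | 1    | 12     | 13    |
| NextGen               | NFT Generative Art                 | 2,000  | 4    | 10     | 14    |
| Kelp DAO              | Liquid Restaking                   | 1,000  | 3    | 2      | 5     |
| Revolution            | Auction + Governance               | 2,500  | 4    | 14     | 18    |
| Decent                | Cross-chain Swaps                  | 1,500  | 4    | 5      | 9     |
| Venus Prime           | Soulbound Incentives               | 1,500  | 3    | 2      | 5     |
| DYAD                  | Stablecoin + Endogenous Collateral | 1,500  | 10   | 9      | 19    |
| Ethena                | Stablecoin + Staking               | 1,000  | 0    | 4      | 4     |
| Ethereum Credit Guild | Credit + Governance                | 5,000  | 4    | 10     | 14    |
| AI Arena              | Gaming + NFT                       | 2,000  | 8    | 9      | 17    |
| Total                 |                                    | 19,200 | 41   | 77     | 118   |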
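
As a quick sanity check on the totals row, the per-contest numbers can be summed directly. This is only a sketch: the tuple layout and the `totals` helper are illustrative and not part of the Arena's code.

```python
# Rows copied from the table above: (contest, lines of code, High, Medium).
CONTESTS = [
    ("Basin", 1200, 1, 12),
    ("NextGen", 2000, 4, 10),
    ("Kelp DAO", 1000, 3, 2),
    ("Revolution", 2500, 4, 14),
    ("Decent", 1500, 4, 5),
    ("Venus Prime", 1500, 3, 2),
    ("DYAD", 1500, 10, 9),
    ("Ethena", 1000, 0, 4),
    ("Ethereum Credit Guild", 5000, 4, 10),
    ("AI Arena", 2000, 8, 9),
]


def totals(rows):
    """Return (total LoC, total High, total Medium, total findings)."""
    loc = sum(row[1] for row in rows)
    high = sum(row[2] for row in rows)
    medium = sum(row[3] for row in rows)
    return loc, high, medium, high + medium


if __name__ == "__main__":
    print(totals(CONTESTS))  # (19200, 41, 77, 118)
```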

The mix is deliberate. AMMs and lending dominate the average security review pipeline, so those are well-represented. But generative NFTs, auction governance, cross-chain swaps, and gaming protocols all have their own attack patterns. An auditor that is good only at lending is not a generalist. The Arena measures whether your auditor handles the spread.

What "118 findings" actually represents

These 118 findings are not synthetic. They are issues that were:

  1. Submitted by warden auditors during the public Code4rena contest window.
  2. Reviewed by other wardens through the post-submission discussion process.
  3. Adjudicated by Code4rena judges as valid High or Medium severity.
  4. Paid out as legitimate findings, with auditors receiving the bounty allocation.

Low and informational findings are not in the Arena. We focus the benchmark on High and Medium because those are the findings that justify the cost of an audit. If your AI auditor catches every Low-severity comment in the documentation but misses a High-severity oracle manipulation, the benchmark should reflect that.

Scoring: precision and recall

When you submit a finding set to the Arena, two metrics get computed.

Precision = True Positives / (True Positives + False Positives). Of the findings your auditor produced, how many were real?

Recall = True Positives / Total Official Findings. Of the real findings in the contest, how many did your auditor catch?
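
The two formulas above can be written as a pair of small functions. This is a minimal sketch, not the Arena's grading code; the function names are ours, and returning 0.0 when the denominator is zero is an assumption about how an empty submission would be treated.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """TP / (TP + FP): share of reported findings that were real."""
    reported = true_positives + false_positives
    # Assumption: an auditor that reports nothing scores 0.0 rather than raising.
    return true_positives / reported if reported else 0.0


def recall(true_positives: int, total_official: int) -> float:
    """TP / total official findings: share of real findings that were caught."""
    # Assumption: a contest with no official findings scores 0.0.
    return true_positives / total_official if total_official else 0.0


if __name__ == "__main__":
    # Hypothetical run: 10 findings reported, 4 of them match,
    # on a contest with 14 official findings.
    print(precision(4, 6))
    print(recall(4, 14))
```

In the hypothetical run, 4 of the 10 reported findings are real (precision 0.4), and 4 of the 14 official findings are caught (recall of about 0.29).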

Both numbers matter. High precision with low recall is a polished but lazy auditor. High recall with low precision is an exhausting auditor that drowns reviewers in noise. Krait's design goal was to maximize precision first, then push recall up over time. For Krait, we treat any report that ships with less than 100 percent precision as a failure.

Most generic LLM-based scanners will produce 30 to 70 findings on a single 2,000-line contest. In our internal Arena runs, generic GPT-4 scanners average around 8 percent precision: 5 real findings out of 60 reported. They produce a flood of "could-be" warnings that drown the actual issues.

Krait, with its 8 kill gates and exploit-context calibration, holds at 100 percent precision across 50 blind contests in our Code4rena dataset. The number of true positives caught varies by contest difficulty, but Krait's reports are short and actionable.

How to submit

The Arena accepts submissions in a JSON format that maps each finding to:

  • The contract path
  • The function or line range
  • A severity (High, Medium, or N/A)
  • A short title
  • A root-cause sentence (used for matching against the official finding's classification)
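
To make the shape of a manifest concrete, here is a hedged sketch of what one entry might look like, with a small validator. The key names (`contract`, `location`, `severity`, `title`, `root_cause`) are assumptions chosen to mirror the list above, not the Arena's documented schema, and the example finding is invented rather than taken from any contest.

```python
import json

# Hypothetical key names that mirror the five fields listed above.
REQUIRED_FIELDS = ("contract", "location", "severity", "title", "root_cause")
ALLOWED_SEVERITIES = {"High", "Medium", "N/A"}

# Invented example entry; not a real finding from any Arena contest.
EXAMPLE_MANIFEST = """
[
  {
    "contract": "src/Vault.sol",
    "location": "withdraw()",
    "severity": "High",
    "title": "Withdrawal skips collateral check",
    "root_cause": "withdraw() does not re-check the collateral ratio after burning shares."
  }
]
"""


def validate_manifest(raw: str) -> list:
    """Parse a manifest and check each finding has the expected fields."""
    findings = json.loads(raw)
    if not isinstance(findings, list):
        raise ValueError("manifest must be a JSON array of findings")
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict):
            raise ValueError(f"finding {index} must be a JSON object")
        missing = [key for key in REQUIRED_FIELDS if key not in finding]
        if missing:
            raise ValueError(f"finding {index} is missing fields: {missing}")
        if finding["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"finding {index} has unknown severity {finding['severity']!r}")
    return findings


if __name__ == "__main__":
    print(len(validate_manifest(EXAMPLE_MANIFEST)), "finding(s) parsed")
```

Validating locally before submitting catches malformed entries early; check the Arena's documented format for the real field names.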

The submission gets cross-referenced against the contest's official finding set. Matches are scored as True Positives. Submissions not matching any official finding are False Positives. Official findings your submission missed are counted as False Negatives.
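
The bookkeeping described above can be sketched as follows. The matching rule here is a deliberate simplification: it pairs findings on an exact (contract, location) key, whereas the Arena matches on root cause, so treat `finding_key` and `classify` as hypothetical stand-ins. Counting a second submission for an already-matched official finding as a False Positive is also an assumption.

```python
from collections import Counter


def finding_key(finding: dict) -> tuple:
    """Simplified match key. Assumption: exact contract path and location."""
    return (finding["contract"], finding["location"])


def classify(submitted: list, official: list) -> tuple:
    """Return (true positives, false positives, false negatives)."""
    remaining = Counter(finding_key(f) for f in official)
    tp = fp = 0
    for finding in submitted:
        key = finding_key(finding)
        if remaining[key] > 0:
            # Each official finding can be matched at most once.
            remaining[key] -= 1
            tp += 1
        else:
            fp += 1
    fn = sum(remaining.values())  # official findings nobody matched
    return tp, fp, fn


if __name__ == "__main__":
    # Invented data for illustration only.
    official = [
        {"contract": "A.sol", "location": "mint()"},
        {"contract": "A.sol", "location": "burn()"},
        {"contract": "B.sol", "location": "swap()"},
    ]
    submitted = [
        {"contract": "A.sol", "location": "mint()"},
        {"contract": "B.sol", "location": "swap()"},
        {"contract": "B.sol", "location": "quote()"},
    ]
    print(classify(submitted, official))  # (2, 1, 1)
```

With these counts, precision is TP / (TP + FP) = 2/3 and recall is TP / (TP + FN) = 2/3, since TP + FN equals the number of official findings.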

Submissions are public. Each submission gets a leaderboard entry so you can compare your auditor against Krait, Pashov Auditor, and any other registered AI auditing systems that have been benchmarked.

Why these 10 contests and not others

A benchmark is only as good as its dataset. We picked these 10 contests because:

  1. They span domains. AMM, lending, NFT, governance, cross-chain, gaming, stablecoin. An auditor that aces only one domain is overfitted to that domain.
  2. They have rigorous official findings. Each contest went through Code4rena's full submission and judging process, so the ground truth is solid.
  3. They are public. Anyone can read the contest reports and verify our scoring against the source data.
  4. They cover real attack patterns. Reentrancy, oracle manipulation, share inflation, governance hijack, accounting drift, signature replay, parameter mismatches in forks. The list mirrors the attacks that drained real protocols in 2022 to 2024.
  5. They scale. From Kelp DAO at 1,000 lines to Ethereum Credit Guild at 5,000 lines, your auditor has to handle both compact and sprawling codebases.

The benchmark intentionally omits very small CTF-style contracts (too easy) and large monorepos (impossible to score consistently). It targets the size range where audit work actually happens.

What the Arena does not measure

A benchmark is not a complete evaluation. The Arena does not measure:

  • Time to result. Some auditors take 30 minutes per contest, some take 30 hours. The Arena does not penalize either.
  • Cost per audit. Token cost varies wildly between auditors. We publish cost data alongside scoring, but cost is not in the precision/recall calculation.
  • Severity calibration. We accept High and Medium per the official judging. We do not penalize an auditor for marking a Medium as High, but we do not reward them for it either.
  • Explainability. Two findings can match the official finding but one might be a one-line nothing-statement and the other a detailed walkthrough. Both score equally in precision/recall.

Cost, time, severity calibration, and explanation quality matter for production work. They are tracked separately on the Arena's secondary metrics page.

Related questions

Can I submit multiple times? Yes. Each contest can be re-submitted as you improve your auditor. The leaderboard tracks the best precision-and-recall combination.

What stops gaming the benchmark? The Arena adds new contests as Code4rena finalizes them. Your auditor cannot have trained on the official findings of a contest that has only just been judged. We rotate the held-out set so that overfitting on past contests does not help on future ones.

Are Low-severity findings in scope? No. The Arena focuses on High and Medium. We considered including Low but decided that the noise-to-signal ratio for Low findings is too high to make a useful benchmark.

Can I run the Arena privately? The Arena's grading code is open source. You can run it locally against the official finding set and validate your own auditor before submitting publicly.

Where to participate

The AI Auditor Builder pillar of Zealynx Academy walks through building an auditor end to end and submitting it to the Arena. The Arena itself is at /ai-agents/security/arena and is open to any auditor that produces findings in the documented JSON format. Krait and Pashov Auditor are pre-registered baselines you can compare against.

If you want to skip building an auditor and just see how well existing ones perform, the leaderboard at /ai-agents/security/arena shows current scores per auditor per contest. The data is public and the methodology is documented.

For a serious team building an AI auditing tool, this is the benchmark to validate against. Production audits are not toys. The benchmark should not be either.

Tagged

AI Auditing, Code4rena, Benchmarking, Smart Contract Security