AI Auditor Builder | May 30, 2026 | 9 min read

27 Open-Source AI Audit Tools, 7 Architectural Patterns, One Choice You Probably Haven't Made

AI smart contract auditing tools cluster into 7 architectural patterns. Each has a different failure mode. Most teams adopt one without knowing which they picked.

By Carlos (Bloqarl)

TL;DR

There are 27+ open-source AI auditing tools for smart contracts as of mid-2026. They cluster into 7 architectural patterns: single-pass scanners, multi-phase pipelines, multi-agent systems, checklist-driven, multi-mindset, domain-specific primers, and static-analysis hybrids.
Each pattern has a distinct failure mode. Single-pass misses stateful bugs. Checklist-driven over-reports. Multi-agent burns tokens on coordination overhead. Domain primers fail on protocols outside the domains they cover.
Most teams pick a tool from Hacker News and run with it. They never explicitly choose an architecture, so when the tool fails, they conclude "AI auditing doesn't work" instead of "this architecture doesn't fit this codebase".
The Academy's AI Auditor Builder pillar walks through 5 architectural decisions that define the auditor you actually need: architecture, detection strategy, verification, tool integration, output format. Picking deliberately matters more than picking well.
The right architecture depends on the codebase. Stateful protocols need multi-phase. Boilerplate-heavy codebases need checklist-driven. Novel mechanism designs need multi-mindset. There is no universal best.

Why this matters

If you've watched the AI auditing space for any length of time, you've seen the cycle: a new tool gets posted, hits the front page of Hacker News, gets benchmarked against a Code4rena contest, comes out with a precision/recall number, and then... nothing. Six months later it's abandoned, replaced by the next tool with a slightly different prompt template.

The reason this cycle keeps repeating is that nobody is comparing architectures. Each new tool implicitly picks an architecture, ships it, and gets evaluated against codebases its architecture happens to handle well or poorly. The headline number says nothing about which class of codebase the tool is good for.

Until you can name the architectural patterns and predict which one fits which codebase, you can't make an informed choice. You're picking by vibe.

The goal of this article is to lay out the seven patterns explicitly, with their failure modes, so you can decide which fits your situation. Then we'll point at the Academy module that walks through building one yourself.

The 7 architectural patterns

1. Single-pass scanners

The simplest design. The tool reads the entire codebase (or chunks of it) into one context window, runs one prompt, and returns a list of findings.

Strengths: cheap, fast, simple to implement and debug. Works well for small codebases (under ~5k lines) where the model can hold the whole protocol in context.

Failure mode: stateful bugs. Anything that requires reasoning across multiple transactions, multi-step state evolution, or invariants that span multiple functions. The single-pass prompt sees the code but loses the state machine.

Examples: most tools that wrap a single Solidity audit prompt around GPT-4 fall here.
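
A minimal sketch of the single-pass shape, to make the "one context, one prompt" constraint concrete. The ask_model callable is a hypothetical stand-in for whatever model client you use, and the prompt wording is an assumption for illustration, not taken from any real tool.

```python
from typing import Callable, Dict, List


def build_single_pass_prompt(sources: Dict[str, str]) -> str:
    """Concatenate every source file into one audit prompt."""
    parts = ["You are a smart contract auditor. List every vulnerability, one per line."]
    for path in sorted(sources):
        parts.append(f"// FILE: {path}\n{sources[path]}")
    return "\n\n".join(parts)


def single_pass_scan(sources: Dict[str, str], ask_model: Callable[[str], str]) -> List[str]:
    """One prompt in, one list of findings out. Nothing is carried between calls."""
    reply = ask_model(build_single_pass_prompt(sources))
    # Treat each non-empty reply line as one finding and drop leading bullet dashes.
    return [line.lstrip("- ").strip() for line in reply.splitlines() if line.strip()]
```

Everything the model knows about the protocol has to fit in that one prompt, which is why this shape struggles once reasoning has to span several transactions.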

2. Multi-phase pipelines

The tool runs distinct phases: reconnaissance (understand the protocol), detection (find candidate bugs), verification (filter false positives), reporting (write up findings). Each phase has its own prompt, its own context, its own output schema.

Strengths: stateful reasoning across phases. The detection phase has access to recon's protocol summary. Verification has access to detection's hypotheses. Each phase does one thing well.

Failure mode: brittle prompt-chain. A bad output from phase 1 propagates downstream and corrupts later phases. Hard to debug without phase-by-phase logging.

Examples: Krait (the Zealynx-built reference auditor) uses this pattern with 4 phases and 8 kill gates.
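
The phase structure and its brittleness can be sketched roughly as follows. This is an illustrative skeleton, not Krait's implementation: the phase functions are stubs standing in for model calls, and the per-phase output check (non_empty) is an assumption added here to show one way of stopping a bad phase output before it propagates.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Phase = Callable[[State], State]   # reads accumulated state, returns new keys
Check = Callable[[State], bool]    # validates one phase's output


@dataclass
class PipelineResult:
    state: State
    log: List[Tuple[str, State]] = field(default_factory=list)
    halted_at: str = ""  # name of the phase whose output failed its check


def run_pipeline(phases: List[Tuple[str, Phase, Check]]) -> PipelineResult:
    result = PipelineResult(state={})
    for name, phase, check in phases:
        output = phase(result.state)
        result.log.append((name, output))  # phase-by-phase log for debugging
        if not check(output):
            result.halted_at = name        # stop before bad output propagates
            break
        result.state.update(output)
    return result


def non_empty(output: State) -> bool:
    return bool(output) and all(output.values())


# Stub phases standing in for model calls (hypothetical outputs).
def recon(state: State) -> State:
    return {"summary": "ERC-4626 style vault"}

def detect(state: State) -> State:
    return {"candidates": ["share inflation on first deposit", "unused import"]}

def verify(state: State) -> State:
    return {"findings": [c for c in state["candidates"] if "unused" not in c]}

def report(state: State) -> State:
    return {"report": "\n".join(state["findings"])}


PHASES = [("recon", recon, non_empty), ("detect", detect, non_empty),
          ("verify", verify, non_empty), ("report", report, non_empty)]
```

Calling run_pipeline(PHASES) returns the accumulated state plus a log entry per phase, so a corrupted recon summary shows up in the log instead of surfacing only as a strange final report.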

3. Multi-agent systems

Different "agents" (independent prompt instances) play different roles: an attacker, a defender, a critic, a synthesizer. The agents talk to each other, refining hypotheses through dialogue.

Strengths: novel finding generation through structured disagreement. The attacker proposes, the defender refutes, the critic resolves. This can surface bugs that a single-perspective prompt would miss.

Failure mode: coordination overhead. Agents talking to each other consume tokens fast. Convergence is slow. The "structured disagreement" can degenerate into hallucinated findings if the agents don't have grounding.

Examples: SC-Auditor's Devil's Advocate pattern, several research-grade systems from academic groups.
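
One way to picture the coordination cost is a bounded debate loop. The sketch below is a generic illustration, not SC-Auditor's design: the three agents are hypothetical callables that take the transcript so far and return text, the token count is a crude whitespace estimate, and the round cap and budget are assumed safeguards against slow convergence.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str], str]  # takes the transcript so far, returns a reply


def debate(hypothesis: str, attacker: Agent, defender: Agent, critic: Agent,
           max_rounds: int = 3, token_budget: int = 2000) -> Tuple[str, int]:
    """Return (verdict, tokens_used); verdict is 'accept', 'reject' or 'unresolved'."""
    transcript: List[str] = [f"HYPOTHESIS: {hypothesis}"]
    used = 0
    for _ in range(max_rounds):
        for role, agent in (("ATTACKER", attacker), ("DEFENDER", defender)):
            reply = agent("\n".join(transcript))
            used += len(reply.split())  # crude whitespace token estimate
            transcript.append(f"{role}: {reply}")
        verdict = critic("\n".join(transcript)).strip().lower()
        used += len(verdict.split())
        if verdict in ("accept", "reject"):
            return verdict, used
        if used >= token_budget:
            break  # stop spending once the budget is exhausted
    return "unresolved", used
```

Every round replays the whole transcript to each agent, so cost grows with both round count and transcript length. Returning "unresolved" rather than forcing a verdict is a deliberate choice here: it keeps an ungrounded debate from being reported as a finding.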

4. Checklist-driven (SWC, Solodit registries)

The tool walks through a curated list of vulnerability classes (e.g., the SWC Registry, Solodit's catalog) and checks the codebase against each item.

Strengths: high recall on known patterns. If the bug class is in the checklist, the tool will check for it. Easy to extend (just add to the checklist).

Failure mode: over-reporting. The checklist forces the tool to consider every class, even ones obviously irrelevant to the codebase, which produces false positives. Also: blind to bugs that don't match a checklist item, which is most novel mechanism-design bugs.

Examples: tools that ingest Solodit and prompt the model with each entry as a check.
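
A small sketch of the checklist loop, with a cheap relevance filter added to show one way of limiting over-reporting. The three entries use real SWC Registry identifiers, but the trigger substrings and the check callable (a stand-in for the per-item model call) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ChecklistItem:
    item_id: str
    title: str
    trigger: str  # substring that must appear before the check is considered relevant


CHECKLIST = [
    ChecklistItem("SWC-107", "Reentrancy", ".call{"),
    ChecklistItem("SWC-112", "Delegatecall to untrusted callee", "delegatecall"),
    ChecklistItem("SWC-115", "Authorization through tx.origin", "tx.origin"),
]


def run_checklist(source: str, check: Callable[[ChecklistItem, str], bool],
                  checklist: List[ChecklistItem] = CHECKLIST) -> List[str]:
    findings = []
    for item in checklist:
        if item.trigger not in source:
            continue  # skip classes that cannot apply to this code
        if check(item, source):  # hypothetical model call for this one item
            findings.append(f"{item.item_id}: {item.title}")
    return findings
```

Substring triggers are deliberately crude and can themselves hide real issues (reentrancy through a token callback would not match ".call{"), so treat the filter as a cost control rather than a correctness guarantee.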

5. Multi-mindset (Attacker, Accountant, Spec Auditor, Edge-Case Hunter)

Closely related to multi-agent, but uses sequential prompts with different "personas" rather than concurrent agents. Each persona reads the same code with a different question in mind.

Strengths: catches bugs that fall through any single perspective. The Accountant catches arithmetic invariant breaks. The Spec Auditor catches missed-state-update bugs. The Edge-Case Hunter catches boundary-condition bugs.

Failure mode: cost scales linearly with persona count. Each persona is a full pass. Also: personas can produce overlapping findings that need de-duplication.

Examples: this is the pattern Krait inherited from Pashov Auditor's earlier work and refined.
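
The sequential-persona loop and the de-duplication step can be sketched like this. The persona questions are invented for illustration and the ask callable is a hypothetical stand-in for a model call that returns (location, description) pairs; de-duplication here is exact matching on normalized text, which is much weaker than what a real tool would need.

```python
from typing import Callable, Dict, List, Tuple

Finding = Tuple[str, str]  # (location, description)

PERSONAS: Dict[str, str] = {
    "Attacker": "How would you steal or lock funds?",
    "Accountant": "Which arithmetic or balance invariants can break?",
    "Spec Auditor": "Where does the code diverge from its stated behaviour?",
    "Edge-Case Hunter": "Which boundary inputs (zero, max, empty) misbehave?",
}


def multi_mindset_scan(source: str,
                       ask: Callable[[str, str], List[Finding]]) -> Dict[Finding, List[str]]:
    """Run one full pass per persona and merge identical findings."""
    merged: Dict[Finding, List[str]] = {}
    for persona, question in PERSONAS.items():
        for location, description in ask(question, source):
            key = (location.strip(), description.strip().lower())
            merged.setdefault(key, []).append(persona)  # record who reported it
    return merged
```

The loop makes the linear cost visible: four personas mean four full passes over the same source. Keeping the list of reporting personas per finding gives a rough agreement signal at no extra cost.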

6. Domain-specific primers (DEX, lending, staking, governance)

The tool ships with pre-loaded context for specific protocol types. Auditing a DEX? Load the AMM-specific exploit history (Uniswap V2 share inflation, V3 sandwich attack patterns, etc.) before the audit prompt.

Strengths: dramatically better recall on in-domain protocols. The primer gives the model calibration anchors that generic prompts lack.

Failure mode: brittle outside the domain. A DEX-primed tool reading a stablecoin protocol misses bugs that don't match DEX patterns. Also: requires maintaining a primer per domain, which is a non-trivial editorial task.

Examples: Forefy's context system, Krait's 7 domain primers (DEX, lending, staking, governance, NFT, bridge, cross-chain).
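
A rough sketch of primer selection with an explicit fallback, since the failure mode above is about what happens when no primer fits. The keyword lists, the two-hit threshold, and the primer text are all assumptions made for illustration; neither Forefy nor Krait is claimed to work this way.

```python
from typing import Dict, List, Optional

# Hypothetical keyword hints per domain; a real primer set would be curated.
DOMAIN_KEYWORDS: Dict[str, List[str]] = {
    "dex": ["swap", "reserve", "liquidity"],
    "lending": ["borrow", "collateral", "liquidate"],
    "staking": ["stake", "reward", "unstake"],
}


def detect_domain(source: str, min_hits: int = 2) -> Optional[str]:
    """Return the best-matching domain, or None when nothing matches well enough."""
    text = source.lower()
    scores = {d: sum(k in text for k in kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=lambda d: scores[d])
    return best if scores[best] >= min_hits else None


def build_prompt(source: str, primers: Dict[str, str]) -> str:
    domain = detect_domain(source)
    primer = primers.get(domain, "") if domain else ""
    # Say so explicitly when no primer applies instead of loading the wrong one.
    header = f"[primer: {domain}]\n{primer}\n\n" if primer else "[primer: none]\n\n"
    return header + "Audit the following contract:\n" + source
```

Returning None for weak matches matters more than the scoring itself: a generic prompt is a safer default than a confident primer for the wrong protocol type.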

7. Static-analysis hybrids

The tool runs Slither, Aderyn, or another static analyzer first, then uses an LLM to triage and explain the findings. The LLM doesn't search for new bugs; it just verifies and writes up what the static analyzer flagged.

Strengths: high precision. Static analyzers catch real patterns; the LLM filters their false positives by adding semantic understanding. Fast.

Failure mode: bounded by the static analyzer's coverage. If Slither doesn't have a detector for the bug class, the hybrid won't find it. Misses anything that requires reasoning beyond pattern matching.

Examples: several commercial tools (Olympix, Cyfrin Aderyn-based pipelines) use this. Open-source is rarer.
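
A sketch of the triage half of a hybrid, assuming a report produced by Slither's JSON output option with detector hits under results.detectors and fields named check, impact and description. Verify that layout against the Slither version you run. The is_real callable is a hypothetical stand-in for the LLM judgement on each hit.

```python
import json
from typing import Callable, Dict, List

Hit = Dict[str, str]


def load_detector_hits(report_json: str) -> List[Hit]:
    """Extract the fields needed for triage from a Slither-style JSON report."""
    report = json.loads(report_json)
    hits = report.get("results", {}).get("detectors", [])
    return [
        {
            "check": h.get("check", ""),
            "impact": h.get("impact", ""),
            "description": h.get("description", ""),
        }
        for h in hits
    ]


def triage(hits: List[Hit], is_real: Callable[[Hit], bool]) -> List[Hit]:
    """Keep only the hits the reviewer (here: a stand-in for the LLM) confirms."""
    return [h for h in hits if is_real(h)]
```

The function can only shrink the list it is given, which is the coverage bound in code form: nothing the static analyzer failed to flag can appear in the output.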

How to pick

The decision framework is simple, even if applying it isn't.

Three questions, in order:

Question 1: Is the codebase small (<5k LoC) and self-contained?

Yes → single-pass scanner is fine. Don't over-engineer.
No → continue.

Question 2: Does the protocol have novel mechanism design (e.g., new AMM curve, new lending model)?

Yes → multi-mindset or multi-agent. The Attacker/Accountant/Spec-Auditor/Edge-Case-Hunter pattern catches mechanism bugs that pattern-matchers miss.
No (i.e., it's a fork or follows a known pattern) → continue.

Question 3: Is the codebase a fork of a known protocol with a documented bug history?

Yes → domain primer + multi-phase pipeline. The primer gives the model calibration; the pipeline gives it depth.
No → multi-phase pipeline alone, with checklist as a complement.

Notice how narrow the single-pass case is: outside small, self-contained codebases, single-pass scanners are almost never the right answer for production audit work. They're useful for quick triage, not for deep audits.
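
The three questions can be written down as a small function, which makes the ordering explicit. The function and parameter names are made up for this sketch; the thresholds and recommendations are the ones stated above.

```python
def pick_architecture(loc: int, self_contained: bool,
                      novel_mechanism: bool, known_fork: bool) -> str:
    """Apply the three questions in order and return the suggested pattern."""
    # Question 1: small and self-contained?
    if loc < 5000 and self_contained:
        return "single-pass scanner"
    # Question 2: novel mechanism design?
    if novel_mechanism:
        return "multi-mindset or multi-agent"
    # Question 3: fork of a known protocol with documented bug history?
    if known_fork:
        return "domain primer + multi-phase pipeline"
    return "multi-phase pipeline with checklist as a complement"
```

For example, pick_architecture(30000, False, False, True) returns "domain primer + multi-phase pipeline". Because the questions are evaluated in order, a small self-contained codebase never reaches the later branches.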

The 5 architectural decisions

The Academy's AI Auditor Builder pillar formalizes this as 5 decisions you make explicitly when building your own auditor:

Architecture: single-pass / multi-phase / multi-agent
Detection strategy: checklist / multi-mindset / domain primers
Verification: kill gates / devil's-advocate / confidence scoring
Tool integration: AI-only / static analysis / full stack (static + fuzzing)
Output format: markdown / JSON+markdown / full audit-report suite

Each decision has tradeoffs. There is no universal best. The Builder pillar walks you through each decision with examples, then has you implement them as a Claude Code skill in .claude/skills/security-scan.md. After 12 steps, you have an auditor specifically tuned to the codebases you actually audit.
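
One way to keep the five decisions explicit is to record them as a typed configuration. The sketch below is only an illustration of that idea: the class and field names are invented here, and it is not the format of the Builder pillar's skill file. The option values are the ones listed above.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Architecture(Enum):
    SINGLE_PASS = "single-pass"
    MULTI_PHASE = "multi-phase"
    MULTI_AGENT = "multi-agent"

class Detection(Enum):
    CHECKLIST = "checklist"
    MULTI_MINDSET = "multi-mindset"
    DOMAIN_PRIMERS = "domain primers"

class Verification(Enum):
    KILL_GATES = "kill gates"
    DEVILS_ADVOCATE = "devil's-advocate"
    CONFIDENCE_SCORING = "confidence scoring"

class ToolIntegration(Enum):
    AI_ONLY = "AI-only"
    STATIC_ANALYSIS = "static analysis"
    FULL_STACK = "full stack (static + fuzzing)"

class OutputFormat(Enum):
    MARKDOWN = "markdown"
    JSON_MARKDOWN = "JSON+markdown"
    FULL_SUITE = "full audit-report suite"


@dataclass(frozen=True)
class AuditorConfig:
    architecture: Architecture
    detection: Detection
    verification: Verification
    tools: ToolIntegration
    output: OutputFormat

    def summary(self) -> str:
        """One line that documents every decision that was made."""
        return " | ".join(f"{f.name}={getattr(self, f.name).value}" for f in fields(self))
```

Because every field is required, constructing an AuditorConfig forces each of the five choices to be made on purpose rather than inherited silently.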

Why most teams pick badly

Three patterns we see repeatedly:

Pick by author popularity, not by fit. Whichever tool the loudest researcher built last month is the one teams adopt. Architectural fit gets ignored.
Conflate "AI auditing works" with "this AI tool worked on this codebase". A single-pass scanner that nailed a small CTF gets adopted for a 30k-LoC production protocol where it fundamentally cannot perform. Failure gets attributed to "AI" rather than to the architecture mismatch.
Skip the architectural decision because the tool already made it. When you adopt an off-the-shelf auditor, you adopt its architecture. Most teams never recognize that they're inheriting that decision.

The framing we want to push is: picking deliberately matters more than picking the "best". A multi-phase pipeline with explicit kill gates that fits your codebase, even if simpler than the latest research system, will generally outperform a more sophisticated tool that doesn't.

Where to start building your own

If you've decided you need your own auditor (because no off-the-shelf tool fits your audit work), the Academy's AI Auditor Builder pillar walks through it in 12 steps over roughly 480 minutes. Each step is a real architectural decision with concrete consequences. By step 9, you have a working auditor; by step 12, it's tuned and benchmarked against the AI Auditor Arena's 118-finding dataset.

Total cost to follow the Builder pillar: a Claude Code subscription and ~8 hours of focused work. Output: a security-scan.md skill specifically built for the codebases you audit, with the architectural decisions you made deliberately documented in the skill itself.

That's worth more than adopting whatever's on Hacker News this week.

Tagged

AI Auditing, Smart Contract Security, Open Source
