AI Auditor Builder | June 4, 2026 | 9 min read

Why Domain-Tuned AI Auditors Beat Generic Ones (DEX, Lending, Staking, Governance)

Generic AI auditors plateau in precision. Loading domain primers (DEX, lending, staking, governance) with category-specific exploits and checklists is what closes the gap.

By Carlos (Bloqarl)

TL;DR

Generic AI auditors hit a precision ceiling because they treat every codebase the same way. A DEX has different bug classes than a lending protocol, which has different bug classes than a staking contract.
The fix is domain primers: prompt-injected modules that load category-specific exploit references, attack patterns, and audit checklists into context only when the codebase matches the category.
Krait, the AI auditor built by Zealynx and used as the reference implementation in the Academy's AI Auditor Builder pillar, uses 7 domain-specific primers (DEX, lending, staking, governance, NFT, bridge, derivatives) as part of its eight-kill-gate pipeline.
The trade-off is context window cost: each primer occupies tokens. The win is precision: a DEX-specialized auditor catches DEX-specific bugs (LP donation attacks, slippage manipulation, K-invariant breaks) that a generic auditor either misses or mis-classifies.
This is the single biggest precision multiplier you can add after solving false positives with kill gates.

Why this matters

If you've watched AI security tools benchmarked against Code4rena contests, you've probably noticed a pattern: tools score wildly differently across contests of different protocol types. The same auditor might catch 60% of bugs on a lending contest and 15% on a staking contest. The discrepancy isn't random; it's a domain-knowledge gap.

Generic AI auditors learn from broad smart-contract security principles. They understand reentrancy in the abstract. What they don't understand is how reentrancy specifically manifests in a Compound V2 fork's liquidateBorrow function, or what an "exchange rate" bug looks like in a cToken mint. The domain context is missing.

Loading the domain context fixes this. Not via fine-tuning the model (expensive, slow, doesn't update), but via prompt-time injection of category-specific knowledge. The model still uses its general reasoning capability; you give it the domain map.

If you're building an AI auditor or evaluating one, this is among the highest-yield architectural decisions you'll make.

What a domain primer contains

A domain primer is a prompt module loaded conditionally: "if the codebase under audit looks like a DEX, prepend this content to the detection prompt". Each primer typically contains:

1. Category definition

What makes a codebase fall into this category. For DEX: implements swap/addLiquidity/removeLiquidity patterns, has a constant-product or stable-curve invariant, holds two or more tokens with conserved value relationships. For lending: implements mint/borrow/repay/liquidate patterns, has collateral-factor and liquidation-incentive parameters.

This is how the auditor decides which primers to load. The decision can be heuristic (regex-pattern-match against function names) or LLM-based (ask the model to classify the codebase before audit).
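The heuristic option can be sketched in a few lines. This is a minimal illustration, not Krait's actual classifier: the pattern lists, the `classify` function, and the `min_hits` threshold are all assumptions made for the example, built from the function and parameter names listed in the category definitions above.

```python
import re

# Hypothetical signature patterns per category (assumed for illustration).
# A real primer set would tune these against known codebases.
CATEGORY_PATTERNS = {
    "dex": [
        r"\bfunction\s+swap\w*\s*\(",
        r"\bfunction\s+addLiquidity\w*\s*\(",
        r"\bfunction\s+removeLiquidity\w*\s*\(",
    ],
    "lending": [
        r"\bfunction\s+borrow\w*\s*\(",
        r"\bfunction\s+repay\w*\s*\(",
        r"\bfunction\s+liquidate\w*\s*\(",
        r"\bcollateralFactor\w*",
        r"\bliquidationIncentive\w*",
    ],
}


def classify(source: str, min_hits: int = 2) -> list[str]:
    """Return every category with at least `min_hits` distinct pattern matches."""
    matched = []
    for category, patterns in CATEGORY_PATTERNS.items():
        hits = sum(1 for pattern in patterns if re.search(pattern, source))
        if hits >= min_hits:
            matched.append(category)
    return matched


if __name__ == "__main__":
    pair = "contract Pair { function swap(uint a) external {} function addLiquidity(uint a) external {} }"
    print(classify(pair))
```

Requiring two or more distinct hits keeps a lone `swap` helper in an unrelated contract from triggering the DEX primer. A hybrid codebase simply matches more than one category.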

2. Typical attack patterns

The 5-10 attack patterns most commonly seen in this category. For DEX:

LP donation attacks (covered in the minimum liquidity lock article).
K-invariant manipulation via sandwich attacks.
Fee-on-transfer token incompatibility (covered in the FoT article).
Price oracle manipulation via direct pool reads.
Flash-loan-amplified imbalance attacks.

Each pattern includes the attack's mechanism, the conditions that enable it, and the standard defense.

3. Named real exploits

Concrete historical exploits in this domain, with dollar amounts. For DEX: Uniswap V2 minimum-liquidity history (designed defense, not exploit), bZx ($1M flash-loan, 2020), Sushiswap accounting bugs (~$3M cumulative across forks). For lending: Hundred Finance ($7M), Cream ($130M), Mango ($114M), Inverse ($15.6M), Bonq ($120M).

These act as calibration anchors. We covered why anchor exploits matter in the exploit-context article.

4. Audit checklist

The 10-30 items an experienced auditor checks first when reviewing this category. For lending forks (Compound V2 specifically), this is essentially the checklist we built in the Canto mantissa article: blocksPerYear correctness, oracle scaling, mantissa consistency, token decimals, cross-market interactions.

5. Common parameter ranges

What "normal" looks like for parameters in this category. For lending: closeFactor typically in [0.4, 0.6], liquidationIncentive typically in [1.05, 1.12], kink typically in [0.7, 0.9]. Values outside these ranges aren't necessarily bugs, but they deserve attention.

The auditor uses these to flag parameter outliers, which often correlate with intentional risk-taking (which the protocol team should explicitly justify) or with unintentional configuration errors (which need to be fixed).
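A range check like this is simple to express. The sketch below uses the ranges quoted above and assumes the parameters have already been converted from on-chain mantissas to plain fractions. `TYPICAL_RANGES` and `flag_outliers` are hypothetical names chosen for the example.

```python
# Typical ranges quoted in the text, expressed as fractions (0.5 means 50%).
# Assumption: on-chain mantissa values have already been divided by 1e18.
TYPICAL_RANGES = {
    "closeFactor": (0.4, 0.6),
    "liquidationIncentive": (1.05, 1.12),
    "kink": (0.7, 0.9),
}


def flag_outliers(params: dict[str, float]) -> list[str]:
    """Return one note per known parameter that falls outside its typical range."""
    notes = []
    for name, value in params.items():
        bounds = TYPICAL_RANGES.get(name)
        if bounds is None:
            continue  # no reference range for this parameter
        low, high = bounds
        if not low <= value <= high:
            notes.append(f"{name}={value} is outside the typical range [{low}, {high}]")
    return notes


if __name__ == "__main__":
    for note in flag_outliers({"closeFactor": 0.5, "liquidationIncentive": 1.25, "kink": 0.8}):
        print(note)
```

The output is a prompt for human attention, not a verdict: an outlier note asks the team to justify the value, in line with the point above that out-of-range values aren't necessarily bugs.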

Walkthrough: DEX primer

A DEX primer for a generic constant-product AMM might prepend the following to the audit prompt:

The codebase under audit is a constant-product AMM (e.g., Uniswap V2 fork). The following bug classes appear in this category. Audit specifically for each:

First-deposit share inflation: if the first depositor can mint LP tokens with sub-1000-wei initial liquidity, they can donate a large amount to skew the exchange rate so that subsequent depositors round down to zero shares. Defense: a 1000-wei minimum liquidity lock burned to address(0).

Fee-on-transfer incompatibility: if the code assumes balanceOf(self) == prev + X after transferFrom(user, self, X), the protocol breaks on FoT tokens. Defense: read the balance before and after each transfer and credit only the delta.

Sandwich attack vulnerability: if a swap's amountOutMin parameter is missing or can be zero, MEV bots can sandwich the trade. Defense: require a caller-supplied amountOutMin and enforce amountOutMin > 0.

Price oracle manipulation: if downstream consumers read reserve0/reserve1 directly as a price source, single-block manipulation drains them. Defense: a TWAP built from time-weighted price accumulators.

skim()/sync() misuse: if these functions exist, are they called correctly? If they don't exist, are direct token donations tracked elsewhere?

Real exploits: Hundred Finance ($7M cToken donation, similar pattern), bZx ($1M flash-loan, 2020), Beanstalk ($182M flash-loan governance, 2022).

Now audit the codebase below for these specific patterns.

The model then runs detection against the codebase with this context loaded. The result: the model knows to look for the 1000-wei lock, recognizes its absence as a finding, and reports it with reference to the share-inflation pattern. A generic prompt without a primer tends to produce vague "looks like a DEX" findings that lack this specific detection.
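The first-deposit pattern named in the primer is easy to verify numerically. The sketch below assumes a simplified single-asset share pool with integer division, not Uniswap V2's actual two-token formula. `SharePool` and `victim_shares` are hypothetical names, and the deposit and donation sizes are arbitrary.

```python
MINIMUM_LIQUIDITY = 1000  # same value as the lock described in the primer text


class SharePool:
    """Simplified single-asset share pool (an assumption, not Uniswap V2's math)."""

    def __init__(self, lock_minimum: bool):
        self.lock_minimum = lock_minimum
        self.total_assets = 0
        self.total_shares = 0

    def deposit(self, amount: int) -> int:
        """Mint shares for `amount` and return the shares credited to the depositor."""
        if self.total_shares == 0:
            if self.lock_minimum:
                if amount <= MINIMUM_LIQUIDITY:
                    raise ValueError("initial deposit too small")
                minted = amount - MINIMUM_LIQUIDITY  # the rest is locked forever
            else:
                minted = amount
            self.total_shares = amount  # includes any locked shares
        else:
            # Integer division rounds down, as Solidity does.
            minted = amount * self.total_shares // self.total_assets
            self.total_shares += minted
        self.total_assets += amount
        return minted

    def donate(self, amount: int) -> None:
        """Direct token transfer: assets grow, shares do not."""
        self.total_assets += amount


def victim_shares(lock_minimum: bool) -> int:
    pool = SharePool(lock_minimum)
    # Attacker makes the smallest first deposit the pool allows, then donates.
    pool.deposit(MINIMUM_LIQUIDITY + 1 if lock_minimum else 1)
    pool.donate(10**18)
    return pool.deposit(5 * 10**17)


if __name__ == "__main__":
    print(victim_shares(lock_minimum=False))  # 0: the victim's deposit rounds to nothing
    print(victim_shares(lock_minimum=True))   # 500: the lock keeps shares granular
```

With the lock in place, most of the donated value accrues to the 1000 burned shares rather than to the attacker's single share, which is what makes the attack uneconomical.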

Walkthrough: lending primer

A lending primer for a Compound V2 fork might prepend:

The codebase under audit is a Compound V2 fork. The following bug classes appear in this category. Audit specifically:

Mantissa scaling: oracle prices must be scaled by 1e18 for 18-decimal underlyings, or more generally by 1e18 * 10^(18 - underlyingDecimals). Check getUnderlyingPrice and trace through accountLiquidity. Mismatched scaling produces errors of many orders of magnitude (for example, a factor of 1e12 for a 6-decimal token).

blocksPerYear correctness: the constant must match the deployment chain's block time. Ethereum: 2,102,400. BSC: 10,512,000. Arbitrum: ~31,536,000. If unchanged from Ethereum default on a different chain, interest accrues at the wrong rate.

Per-year vs per-block units: rate parameters (baseRate, multiplier, jumpMultiplier) must be stored as per-block fractions. If stored as per-year, the contract accrues interest blocksPerYear times faster than intended.

cToken donation attack: a 1-wei mint followed by direct token donation skews the exchange rate. Subsequent depositors round to zero shares. Defense: minimum supply, virtual reserves, donation skim.

accountLiquidity oracle source: which oracle is queried? Is it manipulable? Have stale prices been ruled out?

Real exploits in this domain: Hundred Finance ($7M), Mango ($114M), Cream ($130M), Inverse ($15.6M), Bonq ($120M).

Similar structure for staking, governance, NFT, bridge, derivatives primers. Each is roughly 500-2000 tokens of context.
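The blocksPerYear and unit checks in the lending primer reduce to simple arithmetic. The sketch below reproduces the three constants from assumed block times of 15, 3, and 1 seconds (inferred from the figures, not stated in the primer), and uses simple interest without compounding. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def blocks_per_year(block_time_seconds: float) -> int:
    """Number of blocks in a 365-day year for a given average block time."""
    return round(SECONDS_PER_YEAR / block_time_seconds)


def per_block_rate(annual_rate: float, assumed_block_time: float) -> float:
    """Per-block rate a contract derives from an annual rate and its blocksPerYear."""
    return annual_rate / blocks_per_year(assumed_block_time)


def effective_annual_rate(rate_per_block: float, actual_block_time: float) -> float:
    """Simple-interest annual rate actually accrued on the deployment chain."""
    return rate_per_block * blocks_per_year(actual_block_time)


if __name__ == "__main__":
    # Block times assumed here: 15 s, 3 s, 1 s.
    print(blocks_per_year(15), blocks_per_year(3), blocks_per_year(1))

    # A 5% target rate, derived with the 15-second constant, on a 3-second chain.
    rate = per_block_rate(0.05, assumed_block_time=15)
    print(effective_annual_rate(rate, actual_block_time=3))  # about 0.25, i.e. 5x too fast

    # Per-year value mistakenly stored as per-block: accrues blocksPerYear times too fast.
    print(effective_annual_rate(0.05, actual_block_time=15) / 0.05)
```

Both failure modes show up as a ratio: the wrong constant inflates interest by actual blocks per year divided by assumed blocks per year, and the unit mix-up inflates it by blocksPerYear itself.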

Trade-offs: context cost vs precision gain

Every primer occupies tokens. With all 7 primers loaded at 500-2,000 tokens each, the audit prompt grows by roughly 3,500-14,000 tokens. At Claude's pricing, that's a real cost per audit.

The naive solution is "load all primers always". The smart solution is "load only the primers that match the codebase". Krait uses a small classifier pass first ("what category is this codebase?") and loads only matching primers. A DEX-only audit loads the DEX primer; a lending audit loads the lending primer; a hybrid (a lending market with a built-in DEX, e.g., some Compound V3 deployments) loads both.

The classifier is cheap (one short prompt). The savings are large (roughly 3-12k tokens not loaded per audit when a single primer matches).
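Selective loading can be sketched as follows. This is an illustration under stated assumptions rather than Krait's implementation: the `Primer` class and `build_prompt` function are hypothetical, the classifier output is passed in as a plain set of category names, and token counts use a crude four-characters-per-token estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primer:
    category: str
    text: str

    @property
    def approx_tokens(self) -> int:
        # Crude assumption: about 4 characters per token.
        return len(self.text) // 4


def build_prompt(
    primers: list[Primer],
    categories: set[str],
    base_prompt: str,
    source: str,
) -> tuple[str, int]:
    """Prepend only the matching primers; also return the tokens left out."""
    selected = [p for p in primers if p.category in categories]
    skipped_tokens = sum(p.approx_tokens for p in primers if p.category not in categories)
    parts = [p.text for p in selected] + [base_prompt, source]
    return "\n\n".join(parts), skipped_tokens


if __name__ == "__main__":
    primers = [
        Primer("dex", "DEX primer text ..."),
        Primer("lending", "Lending primer text ..."),
        Primer("staking", "Staking primer text ..."),
    ]
    prompt, skipped = build_prompt(primers, {"dex"}, "Audit the code below.", "contract Pair {}")
    print(prompt)
    print("approximate tokens not loaded:", skipped)
```

A hybrid codebase just passes a larger set, for example `{"dex", "lending"}`, and both primers are prepended in the order they appear in the list.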

The win, in precision: domain-tuned auditors typically catch 1.5-3x more findings in their domain than generic auditors. Combined with kill-gate filtering for false-positive control (covered in the Krait kill-gates article), the result is the precision benchmark Krait achieves: 100% across 50 blind contests.

Why this is hard for fine-tuning to replicate

You might ask: why use prompt-time primers instead of fine-tuning the model on each domain?

Three reasons:

1. Updating is expensive

A fine-tuned model is fixed. Adding a new domain or updating an existing one requires another training run. Prompt-time primers can be edited in a text file and immediately deployed.

2. Mixing models is easier

If you want to evaluate the same auditor against three different LLM backends (Claude, GPT-4, Gemini), prompt-time primers transfer. Fine-tuned weights don't.

3. Cost

Fine-tuning for ~7 specialized domains, plus version updates, plus cross-validation, runs into significant compute cost. Prompt primers are nearly free to develop and evolve.

The trade-off is that fine-tuning could theoretically reach higher precision in narrow domains. In practice, the gap between prompt-tuned and fine-tuned auditors is small enough that the engineering cost of fine-tuning isn't worth it for most teams.

Where to see this in Academy

The AI Auditor Builder pillar in Zealynx Academy walks through primer construction in Step 6. Each step builds toward a Krait-style auditor with primers as one of the architectural decisions. The full curriculum lets you build your own auditor and submit it to the Arena, where its precision and recall are measured against a benchmark of 118 Code4rena findings (covered in the AI Auditor Arena article).

If you want to see Krait's actual primer set, the implementation is open source and the primer text files are readable. They're a strong starting point for your own auditor's domain knowledge.

Tagged

AI Auditing, Smart Contract Security, Domain Knowledge
