Why Domain-Tuned AI Auditors Beat Generic Ones (DEX, Lending, Staking, Governance)
Generic AI auditors plateau in precision. Loading domain primers (DEX, lending, staking, governance) with category-specific exploits and checklists is what closes the gap.
TL;DR
- Generic AI auditors hit a precision ceiling because they treat every codebase the same way. A DEX has different bug classes than a lending protocol, which has different bug classes than a staking contract.
- The fix is domain primers: modules injected into the prompt that load category-specific exploit references, attack patterns, and audit checklists into context only when the codebase matches the category.
- Krait, the AI auditor built by Zealynx and used as the reference implementation in the Academy's AI Auditor Builder pillar, uses 7 domain-specific primers (DEX, lending, staking, governance, NFT, bridge, derivatives) as part of its eight-kill-gate pipeline.
- The trade-off is context window cost: each primer occupies tokens. The win is precision: a DEX-specialized auditor catches DEX-specific bugs (LP donation attacks, slippage manipulation, K-invariant breaks) that a generic auditor either misses or mis-classifies.
- This is the single biggest precision multiplier you can add after solving false positives with kill gates.
Why this matters
If you've watched AI security tools benchmarked against Code4rena contests, you've probably noticed a pattern: tools score wildly differently across contests of different protocol types. The same auditor might catch 60% of bugs on a lending contest and 15% on a staking contest. The discrepancy isn't random; it's a domain-knowledge gap.
Generic AI auditors learn from broad smart-contract security principles. They understand reentrancy in the abstract. What they don't understand is how reentrancy specifically manifests in a Compound V2 fork's liquidateBorrow function, or what an "exchange rate" bug looks like in a cToken mint. The domain context is missing.
Loading the domain context fixes this. Not via fine-tuning the model (expensive, slow, doesn't update), but via prompt-time injection of category-specific knowledge. The model still uses its general reasoning capability; you give it the domain map.
If you're building an AI auditor or evaluating one, this is among the highest-yield architectural decisions you'll make.
What a domain primer contains
A domain primer is a prompt module loaded conditionally: "if the codebase under audit looks like a DEX, prepend this content to the detection prompt". Each primer typically contains:
1. Category definition
What makes a codebase fall into this category. For DEX: implements swap/addLiquidity/removeLiquidity patterns, has a constant-product or stable-curve invariant, holds two or more tokens with conserved value relationships. For lending: implements mint/borrow/repay/liquidate patterns, has collateral-factor and liquidation-incentive parameters.
This is how the auditor decides which primers to load. The decision can be heuristic (regex-pattern-match against function names) or LLM-based (ask the model to classify the codebase before audit).
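The heuristic variant of that decision can be sketched in a few lines of Python. This is a minimal illustration, not Krait's classifier: the signature table, the `classify_codebase` name, and the `min_hits` threshold are all assumptions, and the function names are taken from the category definitions above.

```python
import re

# Hypothetical signature table: category -> regexes over Solidity function names.
# Patterns follow the category definitions in the text; tune them for real use.
CATEGORY_SIGNATURES = {
    "dex": [
        r"\bfunction\s+swap\w*\s*\(",
        r"\bfunction\s+addLiquidity\w*\s*\(",
        r"\bfunction\s+removeLiquidity\w*\s*\(",
    ],
    "lending": [
        r"\bfunction\s+mint\w*\s*\(",
        r"\bfunction\s+borrow\w*\s*\(",
        r"\bfunction\s+repay\w*\s*\(",
        r"\bfunction\s+liquidate\w*\s*\(",
    ],
}


def classify_codebase(source: str, min_hits: int = 2) -> list[str]:
    """Return every category with at least `min_hits` distinct patterns present."""
    matched = []
    for category, patterns in CATEGORY_SIGNATURES.items():
        hits = sum(1 for pattern in patterns if re.search(pattern, source))
        if hits >= min_hits:
            matched.append(category)
    return matched
```

A name-based heuristic like this is crude: a Uniswap V2 pair also exposes mint, for example, which is why requiring several distinct matches (or falling back to an LLM classification pass) is the safer design.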
2. Typical attack patterns
The 5-10 attack patterns most commonly seen in this category. For DEX:
- LP donation attacks (covered in the minimum liquidity lock article).
- K-invariant manipulation via sandwich attacks.
- Fee-on-transfer token incompatibility (covered in the FoT article).
- Price oracle manipulation via direct pool reads.
- Flash-loan-amplified imbalance attacks.
Each pattern includes the attack's mechanism, the conditions that enable it, and the standard defense.
3. Named real exploits
Concrete historical exploits in this domain, with dollar amounts. For DEX: Uniswap V2 minimum-liquidity history (designed defense, not exploit), bZx ($1M flash-loan, 2020), Sushiswap accounting bugs (~$3M cumulative across forks). For lending: Hundred Finance ($7M), Cream ($130M), Mango ($114M), Inverse ($15.6M), Bonq ($120M).
These act as calibration anchors. We covered why anchor exploits matter in the exploit-context article.
4. Audit checklist
The 10-30 items an experienced auditor checks first when reviewing this category. For lending forks (Compound V2 specifically), this is essentially the checklist we built in the Canto mantissa article: blocksPerYear correctness, oracle scaling, mantissa consistency, token decimals, cross-market interactions.
5. Common parameter ranges
What "normal" looks like for parameters in this category. For lending: closeFactor typically in [0.4, 0.6], liquidationIncentive typically in [1.05, 1.12], kink typically in [0.7, 0.9]. Values outside these ranges aren't necessarily bugs, but they deserve attention.
The auditor uses these to flag parameter outliers, which often correlate with intentional risk-taking (which the protocol team should explicitly justify) or with unintentional configuration errors (which need to be fixed).
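A parameter-range check of this kind might look like the sketch below. The ranges are the ones quoted above; the `TYPICAL_RANGES` table and `flag_outliers` function are hypothetical names, and the values are assumed to be already converted from on-chain mantissas to plain fractions.

```python
# Typical ranges quoted in the text; treat them as review hints, not hard rules.
TYPICAL_RANGES = {
    "closeFactor": (0.4, 0.6),
    "liquidationIncentive": (1.05, 1.12),
    "kink": (0.7, 0.9),
}


def flag_outliers(params: dict[str, float]) -> list[str]:
    """Return one note per known parameter whose value falls outside its typical range."""
    notes = []
    for name, value in params.items():
        if name not in TYPICAL_RANGES:
            continue  # no reference range for this parameter
        low, high = TYPICAL_RANGES[name]
        if not (low <= value <= high):
            notes.append(f"{name}={value} is outside the typical range [{low}, {high}]")
    return notes
```

The output is a list of notes rather than findings, which matches the point above: an outlier is a prompt for the protocol team to justify the value, not proof of a bug.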
Walkthrough: DEX primer
A DEX primer for a generic constant-product AMM might prepend the following to the audit prompt:
The codebase under audit is a constant-product AMM (e.g., Uniswap V2 fork). The following bug classes appear in this category. Audit specifically for each:
First-deposit share inflation: if the first depositor can mint LP tokens with sub-1000-wei initial liquidity, they can donate a large amount to skew the exchange rate so that subsequent depositors round down to zero shares. Defense: 1000-wei minimum liquidity lock burned to address(0).
Fee-on-transfer incompatibility: if transferFrom(user, self, X) is followed by the assumption balanceOf(self) == prev + X, the protocol breaks on FoT tokens. Defense: compute balance deltas before and after each transfer.
Sandwich attack vulnerability: if a swap's amountOutMin parameter is missing or zero, MEV bots can sandwich the trade. Defense: enforce amountOutMin > 0.
Price oracle manipulation: if downstream consumers read reserve0/reserve1 directly as a price source, single-block manipulation drains them. Defense: TWAP, time-weighted accumulators.
skim()/sync() misuse: if these functions exist, are they called correctly? If they don't exist, are direct token donations tracked elsewhere?
Real exploits: Hundred Finance ($7M cToken donation, similar pattern), bZx ($1M flash-loan, 2020), Beanstalk ($182M flash-loan governance, 2022).
Now audit the codebase below for these specific patterns.
The model then runs detection against the codebase with this context loaded. As a result, it knows to look for the 1000-wei lock, treats its absence as a finding, and reports it with reference to the share-inflation pattern. A generic prompt without a primer tends to produce vague "looks like a DEX" observations instead of this specific detection.
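The first-deposit share-inflation pattern in the primer is easier to see with numbers. The sketch below is a single-asset toy model of pro-rata share minting, not Uniswap V2's two-token math; `SharePool`, `victim_shares`, and the deposit amounts are assumptions chosen for illustration.

```python
MINIMUM_LIQUIDITY = 1000  # share units locked on the first deposit


class SharePool:
    """Toy single-asset pool that mints shares pro rata, rounding down."""

    def __init__(self, lock_minimum: bool):
        self.lock_minimum = lock_minimum
        self.total_assets = 0
        self.total_shares = 0

    def deposit(self, amount: int) -> int:
        if self.total_shares == 0:
            minted = amount
            if self.lock_minimum:
                if amount <= MINIMUM_LIQUIDITY:
                    raise ValueError("first deposit too small")
                self.total_shares += MINIMUM_LIQUIDITY  # locked, owned by nobody
                minted = amount - MINIMUM_LIQUIDITY
        else:
            # Integer division rounds down, which is what the attack exploits.
            minted = amount * self.total_shares // self.total_assets
        self.total_assets += amount
        self.total_shares += minted
        return minted

    def donate(self, amount: int) -> None:
        # Direct transfer: assets grow, share supply does not.
        self.total_assets += amount


def victim_shares(lock_minimum: bool) -> int:
    """Attacker makes the smallest allowed first deposit, donates, then the victim deposits."""
    pool = SharePool(lock_minimum)
    pool.deposit(1001 if lock_minimum else 1)
    pool.donate(10**18)
    return pool.deposit(5 * 10**17)
```

In this model `victim_shares(False)` returns 0, so the victim's whole deposit accrues to the attacker's single share, while `victim_shares(True)` returns 500 out of 1,501 shares, which leaves the attacker holding a tiny fraction of a pool they just donated to.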
Walkthrough: lending primer
A lending primer for a Compound V2 fork might prepend:
The codebase under audit is a Compound V2 fork. The following bug classes appear in this category. Audit specifically:
Mantissa scaling: oracle prices must be returned with mantissa 1e18 (or 1e18 * 10^(18 - underlyingDecimals)). Check getUnderlyingPrice and trace through accountLiquidity. Mismatched scaling produces 18-decimal-place errors.
blocksPerYear correctness: the constant must match the deployment chain's block time. Ethereum: 2,102,400. BSC: 10,512,000. Arbitrum: ~31,536,000. If unchanged from Ethereum default on a different chain, interest accrues at the wrong rate.
Per-year vs per-block units: rate parameters (baseRate, multiplier, jumpMultiplier) must be stored as per-block fractions. If stored as per-year, the contract accrues interest blocksPerYear times faster than intended.
cToken donation attack: a 1-wei mint followed by direct token donation skews the exchange rate. Subsequent depositors round to zero shares. Defense: minimum supply, virtual reserves, donation skim.
accountLiquidity oracle source: which oracle is queried? Is it manipulable? Have stale prices been ruled out?
Real exploits in this domain: Hundred Finance ($7M), Mango ($114M), Cream ($130M), Inverse ($15.6M), Bonq ($120M).
Similar structure for staking, governance, NFT, bridge, derivatives primers. Each is roughly 500-2000 tokens of context.
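The blocksPerYear and per-block unit checks from the lending primer reduce to simple arithmetic, sketched below. The function names are hypothetical, and the block times (15 s for the Ethereum default, 3 s for BSC) are the values implied by the constants quoted in the primer; interest is treated as simple rather than compounding.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def blocks_per_year(block_time_seconds: float) -> int:
    """Number of blocks in a year for a given average block time."""
    return round(SECONDS_PER_YEAR / block_time_seconds)


def per_block_rate(annual_rate: float, assumed_block_time_seconds: float) -> float:
    """Convert an intended annual rate into the per-block fraction the contract stores."""
    return annual_rate / blocks_per_year(assumed_block_time_seconds)


def effective_annual_rate(rate_per_block: float, actual_block_time_seconds: float) -> float:
    """Simple (non-compounding) annual rate actually accrued on the deployment chain."""
    return rate_per_block * blocks_per_year(actual_block_time_seconds)
```

For example, a fork that keeps the 15-second constant but deploys on a 3-second chain turns an intended 5% annual rate into roughly 25%: `effective_annual_rate(per_block_rate(0.05, 15), 3)`.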
Trade-offs: context cost vs precision gain
Every primer occupies tokens. With all 7 primers loaded, the audit prompt grows by roughly 3,500-14,000 tokens (7 primers at 500-2000 tokens each). At Claude's pricing, that's a real cost per audit.
The naive solution is "load all primers always". The smart solution is "load only the primers that match the codebase". Krait uses a small classifier pass first ("what category is this codebase?") and loads only matching primers. A DEX-only audit loads the DEX primer; a lending audit loads the lending primer; a hybrid (a lending market with a built-in DEX, e.g., some Compound V3 deployments) loads both.
The classifier is cheap (one short prompt). The savings are large: when only one of the seven primers matches, roughly 3-12k tokens are not loaded per audit.
The win in detection quality: domain-tuned auditors typically catch 1.5-3x more findings in their domain than generic auditors. Combined with kill-gate filtering for false-positive control (covered in the Krait kill-gates article), the result is the precision benchmark Krait achieves: 100% across 50 blind contests.
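The classify-then-load step described in this section can be sketched as follows. This is an illustration under stated assumptions, not Krait's implementation: the `Primer` dataclass, the `build_prompt` function, and the token estimates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primer:
    category: str
    text: str
    approx_tokens: int  # rough estimate; a real system would use a tokenizer


def build_prompt(
    base_prompt: str, primers: dict[str, Primer], categories: list[str]
) -> tuple[str, int]:
    """Prepend only the matching primers; also report tokens saved by skipping the rest."""
    selected = [primers[c] for c in categories if c in primers]
    skipped_tokens = sum(
        primer.approx_tokens
        for category, primer in primers.items()
        if category not in categories
    )
    sections = [primer.text for primer in selected] + [base_prompt]
    return "\n\n".join(sections), skipped_tokens
```

A hybrid codebase simply passes two categories, so both primers are prepended in the order the classifier returned them.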
Why this is hard for fine-tuning to replicate
You might ask: why use prompt-time primers instead of fine-tuning the model on each domain?
Three reasons:
1. Updating is expensive
A fine-tuned model is fixed. Adding a new domain or updating an existing one requires another training run. Prompt-time primers can be edited in a text file and immediately deployed.
2. Mixing models is easier
If you want to evaluate the same auditor against three different LLM backends (Claude, GPT-4, Gemini), prompt-time primers transfer. Fine-tuned weights don't.
3. Cost
Fine-tuning for ~7 specialized domains, plus version updates, plus cross-validation, runs into significant compute cost. Prompt primers are nearly free to develop and evolve.
The trade-off is that fine-tuning could theoretically reach higher precision in narrow domains. In practice, the gap between prompt-tuned and fine-tuned auditors is small enough that the engineering cost of fine-tuning isn't worth it for most teams.
Related questions
How specific should a primer be? Too specific and it only covers one fork; too general and it's no better than a generic prompt. Krait's primers target protocol categories (DEX-with-constant-product, lending-with-LTV, staking-with-rewards) rather than specific protocols. This way, a Uniswap V2 fork and a SushiSwap fork both load the same DEX primer.
Can primers be community-contributed? Yes, and this is one of the most underused channels for AI security improvements. A primer is a markdown file with a known schema; security researchers can write one for their domain of expertise and submit it. The Krait codebase accepts PRs for primer additions.
Does the model still hallucinate even with a primer? Less, but not zero. Hallucinations cluster on edge cases the primer doesn't cover. Mitigation: continuously update primers as new bug classes emerge in the wild.
How do primers interact with kill gates? Primers shape DETECTION (what bugs to look for). Kill gates shape FILTERING (which detected candidates to keep). Both are part of the same pipeline; primers improve recall, kill gates improve precision.
Can a primer be wrong and cause false positives? Yes. A primer that lists "patterns to flag" without specifying when those patterns are NOT bugs creates false positives. Good primers describe both "what to look for" and "when this looks like a bug but isn't" (e.g., "the 1000-wei minimum liquidity lock is intentional; flag its ABSENCE, not its presence").
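To make the recall/precision distinction from the kill-gates question concrete, here is a minimal sketch of how the two metrics are computed over sets of finding IDs. The function name and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(reported: set[str], true_findings: set[str]) -> tuple[float, float]:
    """Precision = share of reported findings that are real; recall = share of real findings reported."""
    true_positives = len(reported & true_findings)
    # Returning 0.0 for an empty denominator is a convention chosen for this sketch.
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(true_findings) if true_findings else 0.0
    return precision, recall
```

With four real findings, a detection pass that reports three of them plus two false positives scores 0.6 precision and 0.75 recall; filtering out the two false positives raises precision to 1.0 and leaves recall at 0.75, which is the division of labor between primers and kill gates.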
Where to see this in Academy
The AI Auditor Builder pillar in Zealynx Academy walks through primer construction in Step 6. Each step builds toward a Krait-style auditor with primers as one of the architectural decisions. The full curriculum lets you build your own auditor and submit it to the Arena, where its precision and recall are measured against the 118 Code4rena findings benchmark (covered in the AI Auditor Arena article).
If you want to see Krait's actual primer set, the implementation is open source and the primer text files are readable. They're a strong starting point for your own auditor's domain knowledge.