MITIGATION · m-static-analysis
Static analysis on generated code — a pre-execution gate on LLM-emitted artifacts
An agent that can generate and execute code treats code generation as a tool call and code execution as the outcome. If the generated code contains a known-dangerous pattern, no amount of prompt engineering stops it from running once the execute call goes through. Static analysis closes that gap: it scans every code artifact the agent emits against a rule set before execution is permitted, catching the vulnerability patterns the same tooling already catches in human-written code.
At a glance
TL;DR
- Run a SAST/SCA scanner against every code artifact the agent emits before it executes, using the same tooling and rule packs applied to human-authored code.
- LLM-generated code has vulnerability rates comparable to human code (Asare et al. 2023), with a higher incidence of over-permissive defaults and missing input validation.
- Directly addresses OWASP T11 Unexpected RCE and Code Attacks: Code Generation Attacks, Library Injection, and Reasoning Subversion through Code all surface as detectable code patterns.
- Pair with a sandbox (m-gvisor) for runtime defence and a human review gate (m-codegen-review-gate) for high-stakes output; static analysis alone catches known patterns, not novel ones.
How it behaves
What it is
Static analysis is the automated inspection of source code for vulnerability patterns without executing it. A SAST tool parses the code into an abstract syntax tree (AST), applies rules that encode known-dangerous constructs, and returns a structured list of findings. SCA tooling does the same for dependency manifests, checking third-party packages against known-vulnerability databases. Neither technique executes the code under test; that is what makes them safe to run as a pre-execution gate.
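The parse, match, report loop described above can be illustrated with Python's standard `ast` module. This is a minimal sketch, not how any particular SAST product is implemented: the `Finding` record, the `DANGEROUS_CALLS` table, and the rule identifiers are hypothetical names invented for the example, and real rule packs are far broader.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str
    line: int
    message: str


# Hypothetical rule table: dotted call name -> (rule id, severity).
DANGEROUS_CALLS = {
    "eval": ("demo-eval", "HIGH"),
    "os.system": ("demo-os-system", "HIGH"),
    "pickle.loads": ("demo-pickle-loads", "MEDIUM"),
}


def dotted_name(node: ast.AST):
    """Return 'a.b.c' for nested attribute access on a plain name, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def scan_source(source: str) -> list[Finding]:
    """Parse the source into an AST and flag calls listed in DANGEROUS_CALLS."""
    tree = ast.parse(source)  # parsing only; the code under test is never run
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in DANGEROUS_CALLS:
                rule_id, severity = DANGEROUS_CALLS[name]
                findings.append(
                    Finding(rule_id, severity, node.lineno, f"call to {name}()")
                )
    return sorted(findings, key=lambda f: f.line)
```

Because the scanner only calls `ast.parse`, a malicious artifact cannot affect the scanning process by being executed, which is the property that makes this style of check usable as a pre-execution gate.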
The argument for applying the same tooling to LLM-generated code is empirical. Asare et al. 2023 (arxiv:2304.09655) shows that Copilot-generated code has vulnerability rates comparable to human-authored code on the same tasks, with certain categories, including over-permissive defaults and missing input validation, appearing at higher rates. An agent generating and immediately executing code without a scan in between provides no fewer opportunities for exploitation than a human developer who skips peer review.
The gate sits at the edge between the code-generation tool call and the execution environment. The agent produces an artifact; the scanner runs against it; findings above a configured severity threshold block the execute call and return the finding summary to the agent. Findings below the threshold pass through, or are logged and sampled, depending on the policy. The agent does not control the scanner or its rule set.
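The flow above can be sketched as a thin wrapper around the execute call. This is an illustration under stated assumptions rather than a reference implementation: `Finding`, `GateDecision`, `gate`, `execute_if_clean`, the three-level severity scale, and the injected `scanner` and `execute` callables are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumed severity scale; real scanners use their own labels.
SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str
    line: int


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    blocking: tuple  # findings at or above the threshold
    logged: tuple    # findings below the threshold, kept for sampling


def gate(
    artifact: str,
    scanner: Callable[[str], Iterable[Finding]],
    block_at: str = "HIGH",
) -> GateDecision:
    """Split scanner findings into blocking and logged sets."""
    threshold = SEVERITY_RANK[block_at]
    blocking, logged = [], []
    for finding in scanner(artifact):
        # An unrecognised severity label is treated as blocking (fail closed).
        rank = SEVERITY_RANK.get(finding.severity, threshold)
        (blocking if rank >= threshold else logged).append(finding)
    return GateDecision(not blocking, tuple(blocking), tuple(logged))


def execute_if_clean(artifact, scanner, execute, block_at="HIGH"):
    """Forward the artifact to `execute` only when the gate allows it."""
    decision = gate(artifact, scanner, block_at)
    if not decision.allowed:
        # The agent receives a finding summary instead of an execution result.
        summary = [
            f"{f.rule_id} ({f.severity}) at line {f.line}"
            for f in decision.blocking
        ]
        return {"status": "blocked", "findings": summary}
    return {"status": "executed", "result": execute(artifact)}
```

The scanner and threshold are passed in by the pipeline, not chosen by the agent, which mirrors the requirement that the agent does not control the scanner or its rule set.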
Detection signals
- Severity-weighted findings per generated artifact. A rising rate indicates either a shift in the upstream prompt or model behaviour, or an active manipulation campaign targeting the codegen path.
- Findings-by-category distribution. A new dominant category, such as shell injection or credential hardcoding, that was not present in the baseline points to a recurring pattern in the model's output worth investigating at the prompt layer.
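Both signals can be computed from the gate's finding log. The sketch below assumes each artifact's findings are stored as `(category, severity)` pairs. The severity weights, the 30% dominance threshold, and all function names are illustrative choices, not values taken from any standard.

```python
from collections import Counter

# Assumed weights; tune to the deployment's own severity policy.
SEVERITY_WEIGHT = {"LOW": 1, "MEDIUM": 3, "HIGH": 10}


def severity_weighted_rate(findings_per_artifact):
    """Average severity-weighted finding score per generated artifact."""
    if not findings_per_artifact:
        return 0.0
    total = sum(
        SEVERITY_WEIGHT.get(severity, 0)
        for findings in findings_per_artifact
        for _category, severity in findings
    )
    return total / len(findings_per_artifact)


def category_distribution(findings_per_artifact):
    """Share of all findings that falls into each category."""
    counts = Counter(
        category
        for findings in findings_per_artifact
        for category, _severity in findings
    )
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def new_dominant_category(baseline, current, min_share=0.3):
    """Return the top current category if it is absent from the baseline."""
    if not current:
        return None
    top = max(current, key=current.get)
    if current[top] >= min_share and top not in baseline:
        return top
    return None
```

Comparing a rolling window against a stored baseline window is one simple way to turn these numbers into alerts; the window sizes are a policy decision the source does not specify.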
Threats it covers
- T11 Unexpected RCE and Code Attacks
WHY IT HELPS: Unexpected RCE and Code Attacks is the execution of attacker-influenced code by an agent: the agent is prompted or manipulated into generating code that contains malicious or exploitable patterns, and that code then runs inside the trust boundary. Static analysis intercepts the artifact at the codegen-to-execution seam, refusing to forward code that matches known dangerous patterns before it can run.
Principle coverage
Defence-in-Depth stage: Prevent. It advances:
- Sandboxing & Isolation: Static analysis sandboxes the codegen-to-execution transition: every artifact the agent emits must clear a rule-based scan before the execution boundary is crossed, so code carrying known-dangerous patterns is refused at the seam rather than contained after it runs.
- Constrained Generation & Deterministic Guardrails: Static analysis is the pre-execution enforcement layer for generated code: the agent may generate freely, but the artifact cannot cross into execution unless it passes the scanner's rule set, so generation capability does not automatically translate into execution of dangerous patterns.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
All options below are verified against upstream documentation as of 2026-05-30. Choose by language coverage requirement and deployment model.
Semgrep OSS + Pro: Pattern-matching SAST with 35+ language support; the OSS core is free, Pro rules add cross-file taint analysis and 710+ curated checks for Python, 250+ for JS/TS, and 190+ for Java.
Why choose it: Fastest path to a working gate for polyglot agent outputs; the CLI embeds cleanly as a subprocess call; OSS rules cover most production needs without a cloud dependency; Pro adds inter-file data-flow for deeper coverage.
GitHub CodeQL: Query-language analysis that builds a relational database from source code and runs QL queries against it; supports C/C++, C#, Go, Java/Kotlin, JavaScript/TypeScript, Python, Ruby, Rust, and Swift.
Why choose it: Native to GitHub Actions; no infrastructure to operate; multi-repository variant analysis scales across a fleet; free for public repositories, included in GitHub Advanced Security for private ones.
Snyk Code: Developer-first SAST with a DeepCode AI engine; performs data-flow, taint, and inter-file analysis across JavaScript, TypeScript, Python, Java, Go, C#, Ruby, PHP, Kotlin, Swift, Scala, and C/C++; provides fix suggestions in IDE and CI.
Why choose it: Strongest AI-generated fix suggestions among the commercial options; IDE-native feedback reduces friction; a free tier covers most individual and small-team scenarios.
Bandit: AST-based scanner for Python; 47 built-in checks across injection, cryptography, hardcoded credentials, and framework misconfiguration categories; outputs JSON and SARIF; Apache 2.0 licence.
Why choose it: Zero cost, zero cloud dependency; trivially embeds as a subprocess in any Python agent pipeline; a good fast first-pass layer before a heavier polyglot scanner.
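Embedding Bandit as a subprocess might look like the sketch below. It assumes the `bandit` executable is on the PATH and that its JSON report carries a `results` list with `test_id`, `issue_severity`, `issue_confidence`, `line_number`, and `issue_text` fields; that reflects the JSON formatter as commonly documented, so check it against the installed version. The function names, the temporary file name, and the 30-second timeout are hypothetical.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def parse_bandit_report(report_json: str) -> list[dict]:
    """Reduce a Bandit JSON report to the fields a gate needs."""
    report = json.loads(report_json)
    return [
        {
            "test_id": r["test_id"],
            "severity": r["issue_severity"],
            "confidence": r["issue_confidence"],
            "line": r["line_number"],
            "message": r["issue_text"],
        }
        for r in report.get("results", [])
    ]


def scan_with_bandit(source: str, timeout_s: float = 30.0) -> list[dict]:
    """Write the generated code to a temp file and scan it with Bandit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "artifact.py"
        path.write_text(source, encoding="utf-8")
        # check=True is deliberately not used: Bandit is expected to exit
        # non-zero when it reports issues, which is not a tool failure.
        proc = subprocess.run(
            ["bandit", "-f", "json", str(path)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    if not proc.stdout.strip():
        # No report at all means the scan itself failed; fail closed.
        raise RuntimeError(f"bandit produced no report: {proc.stderr.strip()}")
    return parse_bandit_report(proc.stdout)
```

Keeping the parsing step separate from the subprocess call makes the gate logic testable without Bandit installed, and lets a missing or crashed scanner surface as an error instead of an empty finding list.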
SonarQube Community: Self-hosted SAST server with 6,000+ rules across Java, JavaScript/TypeScript, Python, C#, Go, Kotlin, PHP, Ruby, Scala, and IaC formats including Terraform, Dockerfiles, and CloudFormation; integrates with major CI systems.
Why choose it: Best for teams that need a persistent quality-gate dashboard with rule governance and historical trending; the free self-hosted tier is fully functional; taint analysis and branch/PR decoration require paid tiers.
Trade-offs
- Latency: Semgrep on artifacts under 500 lines runs in 1 to 5 seconds; larger multi-file patches or CodeQL database builds take 10 to 60 seconds, making asynchronous gating necessary for most conversational agents.
- False-positive rate rises when generic rule packs are applied to domain-specific generated code; budget time to calibrate severity thresholds or suppress noisy rules before putting the gate into production.
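One way to handle the latency trade-off is to run the scan off the event loop with a deadline and refuse execution if the deadline passes. The sketch below assumes a blocking `scan` callable that returns a list of findings. The function name, the result shape, and the simplification that any finding blocks are illustrative; a real gate would apply its severity threshold here.

```python
import asyncio


async def scan_with_deadline(scan, artifact, deadline_s):
    """Run a blocking scan in a worker thread; fail closed on timeout."""
    try:
        findings = await asyncio.wait_for(
            asyncio.to_thread(scan, artifact), timeout=deadline_s
        )
    except asyncio.TimeoutError:
        # The worker thread is not killed and may keep running; the point
        # is only that execution is never permitted without a scan result.
        return {"status": "timeout", "allowed": False, "findings": []}
    findings = list(findings)
    return {"status": "scanned", "allowed": not findings, "findings": findings}
```

While the scan is awaited, the event loop stays free to stream a progress message to the user or service other sessions, so the conversation does not stall for the full scan duration.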
When NOT to use
- Do not use as a blocking gate on purely declarative or data-only agent outputs such as JSON, YAML, or plain text. No executable code means no relevant signal, only added latency.
- Static analysis is the wrong primary control for semantic attacks: a syntactically valid SQL query that extracts the wrong tenant's data passes every SAST rule; use query parameterisation and policy-layer validation instead.
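The first exclusion above can be expressed as a routing check in front of the gate. This is a minimal sketch: the `Artifact` record, its `will_execute` flag, and the `DATA_ONLY_KINDS` labels are hypothetical, and it assumes the pipeline knows whether an artifact is bound for an execute call.

```python
from dataclasses import dataclass

# Assumed labels for outputs that carry no executable code.
DATA_ONLY_KINDS = {"json", "yaml", "text"}


@dataclass(frozen=True)
class Artifact:
    kind: str           # e.g. "python", "json"
    will_execute: bool  # True if the pipeline routes it to an execute call
    body: str


def needs_scan(artifact: Artifact) -> bool:
    """Skip the gate only for data-only outputs that will not be executed."""
    if artifact.will_execute:
        return True
    # Unknown kinds are scanned by default rather than waved through.
    return artifact.kind.lower() not in DATA_ONLY_KINDS
```

Treating the execution route as decisive, not the file type alone, is a conservative design choice for this sketch: a YAML document handed to an interpreter or workflow runner is still sent to the scanner.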
Limitations
- Catches only known vulnerability patterns encoded in rules; novel attack classes, subtle data-flow bugs, and business-logic errors require dynamic analysis or human review.
- Rule packs were built against human-authored code; LLM-specific patterns such as over-trust of model-supplied strings in eval, or chained tool outputs used as shell arguments, may require custom rules to surface reliably.
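A custom rule for the two LLM-specific patterns named above could start from a sketch like this one, written against Python's standard `ast` module. It is deliberately narrow and rests on assumptions: it only recognises direct `eval`/`exec` calls and calls spelled `subprocess.<function>(..., shell=True)`, it treats any non-literal first argument as suspect, and the rule names are hypothetical. It performs no taint tracking, so aliased imports or indirect calls will slip past it.

```python
import ast


def find_dynamic_sinks(source: str) -> list[tuple[int, str]]:
    """Flag eval/exec and shell=True subprocess calls fed a non-literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant):
            continue  # a plain literal cannot carry model-supplied text
        func = node.func
        if isinstance(func, ast.Name) and func.id in {"eval", "exec"}:
            hits.append((node.lineno, f"dynamic-{func.id}"))
        elif (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "subprocess"
        ):
            shell_true = any(
                kw.arg == "shell"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
                for kw in node.keywords
            )
            if shell_true:
                hits.append((node.lineno, "dynamic-shell-command"))
    return sorted(hits)
```

An f-string that interpolates a tool result into a shell command is reported, while `eval("1+1")` and list-form `subprocess.run([...])` calls are not, which keeps the rule focused on the data-dependent cases generic rule packs may miss.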
Maturity tier reasoning
- SAST and SCA tooling is Tier 1 mature in general software engineering; the novel integration work is inserting the scanner at the codegen-to-execution seam in an agent pipeline, which places the agentic application of the control at Tier 2.
- Not Tier 1 for agentic use, because the seam wiring and severity-threshold calibration for generated code are not yet standardised across agent frameworks and require deliberate engineering per deployment.
- Not Tier 3, because the underlying tooling is production-grade and the integration pattern is well-understood, even if not yet canonical for agent pipelines.
Last verified against upstream docs: 2026-05-30.