AI red teaming
LLM-shaped systems break in ways classical app-sec frameworks don’t name well. Prompt injection is the headline; in practice most real incidents are agent / tool / context-isolation failures.
Prereqs
- Comfort with at least one LLM API (OpenAI, Anthropic, Gemini, local via Ollama).
- Basic understanding of embeddings, retrieval, and tool calling.
Stage 1 — fundamentals
- llm-threat-model — what the user controls, what the model controls, what the system controls.
- direct-prompt-injection — classic “ignore previous instructions”.
- indirect-prompt-injection — payload inside documents, web pages, tool output.
- jailbreaks — DAN-style, role-play, encoding tricks, multi-turn drift.
- output-filtering-and-its-bypasses.
Stage 2 — agent and tool surface
- tool-confusion — making the agent call the wrong tool.
- mcp-attacks — Model Context Protocol server abuse.
- rag-poisoning — corpus-level injection.
- memory-poisoning — long-lived agent memory abuse.
- exfiltration-via-rendered-content — image URLs, markdown links, callbacks.
- chain-of-trust-confusion — system vs developer vs user prompt precedence failures.
Stage 3 — model-level and infrastructure attacks
- model-extraction.
- training-data-extraction.
- membership-inference.
- adversarial-suffixes — gradient-based jailbreaks (GCG and successors).
- multimodal-attacks — image / audio prompt injection.
- supply-chain-attacks-on-models — poisoned fine-tunes, malicious weights, malicious tokeniser.
- infrastructure-around-llms — vector DB ACLs, inference proxy abuse, billing-account compromise.
References
- OWASP LLM Top 10.
- Simon Willison’s prompt-injection posts.
- Anthropic responsible scaling policy — useful threat-model framing.
- Embrace the Red (Johann Rehberger) — agent and exfil-channel research.
- Lakera prompt-injection cheat sheet.