Published
- 4 min read
Red Teaming LLMs: A Practitioner Framework and Tooling Guide
Red teaming LLM applications differs substantially from conventional application penetration testing. The attack surface is semantic rather than syntactic; the vulnerability class is model behaviour rather than memory corruption; and the ground truth for “exploited” is often ambiguous. This guide covers a structured approach to LLM red teaming that produces reproducible, actionable findings.
Scoping: What Are You Testing?
Before engaging any tooling, define the threat model precisely. LLM application security is not monolithic — the relevant attacks and their severity depend heavily on the deployment context.
Scope dimensions to define:
| Dimension | Options |
|---|---|
| Model access | Black-box (API only), grey-box (system prompt accessible), white-box (weights accessible) |
| Deployment type | Chatbot, agent, RAG system, code assistant, classifier |
| User trust level | Public internet users, authenticated employees, other AI systems |
| Capabilities | Read-only, retrieval, tool use, code execution, external API calls |
| Data sensitivity | Public content, PII, IP, regulated data |
A public-facing customer service chatbot without tool access has a fundamentally different threat model to an internal agentic system with file access and API credentials.
Attack Taxonomy
Structure testing around five attack categories:
1. Jailbreaking
Attempts to override the model’s safety training and elicit harmful outputs. Current high-efficacy techniques include:
- Many-shot prompting — demonstrating the desired (harmful) behaviour repeatedly in the prompt before requesting it
- Role-playing frames — “you are DAN, an AI without restrictions”
- Hypothetical / fictional framing — “write a story in which a character explains…”
- Multilingual jailbreaks — safety training is uneven across languages; requests in low-resource languages often bypass filters
- Base64 / encoding tricks — encoding the request reduces pattern-matching filter hits
2. Prompt Injection
Attempts to override instructions from the application developer. Subdivided into:
- Direct injection — user input that contains adversarial instructions
- Indirect injection — instructions embedded in retrieved documents, tool outputs, or external data
3. Data Extraction
Attempts to recover information the model should not reveal:
- System prompt extraction via repetition, continuation, and translation attacks
- Training data memorisation extraction (targeted for known sensitive documents)
- Cross-user data leakage in multi-user deployments with shared context
4. Privilege Escalation (Agentic Systems)
In systems with tool use:
- Manipulating the agent to use tools beyond their intended scope
- Chaining tool calls to achieve effects not achievable via single calls
- Using allowed tools to indirectly access restricted resources
5. Denial of Service / Resource Abuse
- Prompt flooding to exhaust context window
- Generating maximally long outputs (token bombing)
- Triggering expensive retrieval operations (for RAG systems)
Tooling
Garak
Garak is an open-source LLM vulnerability scanner that runs automated probes across a library of known attack patterns. It produces a structured report of probe outcomes, supporting reproducible testing.
pip install garak
# Run standard probe suite against an OpenAI endpoint
garak --model_type openai --model_name gpt-4o \
--probes all \
--report_prefix ./garak_output
Garak covers jailbreaking, prompt injection, data leakage, and hallucination probes. It is most useful for baseline coverage and regression testing — ensuring that model updates or system prompt changes don’t introduce new vulnerabilities.
PyRIT (Python Risk Identification Toolkit for Generative AI)
Microsoft’s PyRIT provides an orchestration framework for multi-turn attack scenarios. Unlike Garak’s single-turn probes, PyRIT can run conversational attack strategies that evolve based on model responses.
from pyrit.prompt_target import AzureOpenAIChatTarget
from pyrit.orchestrator import PromptSendingOrchestrator
target = AzureOpenAIChatTarget(
deployment_name="gpt-4o",
endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
api_key=os.environ["AZURE_OPENAI_KEY"],
)
orchestrator = PromptSendingOrchestrator(prompt_target=target)
responses = await orchestrator.send_prompts_async(
prompt_list=jailbreak_prompts,
)
PyRIT is particularly useful for testing agentic systems with tool use, where multi-turn attacks that adapt to model responses are more realistic than single-shot probes.
PromptBench
PromptBench provides adversarial robustness evaluation focused on classification and reasoning tasks. Less useful for open-ended generation testing, more useful for evaluating whether instruction-tuned models maintain correct task behaviour under adversarial perturbation.
Manual Testing
Automated tools are a floor, not a ceiling. Significant vulnerabilities — particularly context-specific jailbreaks and application logic flaws — require human judgment to discover. Allocate at least 30–40% of red team time to manual, exploratory testing.
Reporting: From Findings to Improvements
LLM red team reports should map each finding to:
- Attack category (from taxonomy above)
- Reproducibility — exact prompt(s) that trigger the behaviour; success rate across runs
- Severity — based on impact (what can an attacker achieve?) and accessibility (how easy is the attack?)
- Root cause — model-level (training), system prompt (configuration), or application logic
- Remediation — which layer addresses the issue (model update, prompt hardening, application controls)
Severity Scoring for LLM Findings
Standard CVSS is poorly suited to LLM vulnerabilities (there is no memory corruption, no CVE). Consider a simplified scoring:
| Factor | Weight |
|---|---|
| Impact if exploited | 40% |
| Ease of exploitation (manual vs. automated) | 30% |
| Discoverability by average user | 20% |
| Required access level | 10% |
Frequency and Regression
LLM red teaming is not a one-time exercise. Trigger a re-test on:
- System prompt changes
- Model version updates (even patch versions can alter safety behaviour)
- New tool integrations or capability additions
- New user populations or expanded deployment contexts
Automated regression suites (using Garak or a custom probe library) can run continuously in CI to catch regressions before deployment.