Model Inversion Attacks: Extracting Training Data PII from Production LLMs
Technique Overview
Model inversion attacks exploit the tendency of language models to memorise and reproduce verbatim fragments of their training data. When a model is fine-tuned on proprietary or sensitive data, adversaries with API access can craft queries designed to cause the model to regurgitate that data.
This is fundamentally a data exfiltration technique with significant GDPR, HIPAA, and IP implications — distinct from prompt injection (behaviour manipulation) or adversarial examples (misclassification).
Attack Mechanics
1. Membership Inference
Before extraction, an attacker can test whether a specific data record was in the training set by querying the model and measuring its “surprise” (perplexity) on that record: training data typically yields lower perplexity than unseen data. Success rates of 60–80% have been reported against fine-tuned GPT-class models, although results vary with the model, the dataset, and how the decision threshold is calibrated.
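The perplexity test can be sketched in a few lines, which is also useful for auditing your own fine-tuned model. This is a minimal illustration that assumes the API returns natural-log probabilities per token; the sample values and the threshold are hypothetical, and a real threshold would be calibrated on records known to be outside the training set.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def likely_member(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a record as a probable training member when perplexity is below threshold.

    The threshold is an assumption; it has to be calibrated per model and dataset.
    """
    return perplexity(token_logprobs) < threshold


# Hypothetical per-token log-probabilities for two candidate records.
seen_record = [-0.1, -0.2, -0.05, -0.15]   # model is confident on every token
unseen_record = [-2.3, -1.9, -3.1, -2.7]   # model is "surprised"

print(round(perplexity(seen_record), 3))
print(round(perplexity(unseen_record), 3))
print(likely_member(seen_record, threshold=5.0))
```

A single global threshold is the simplest variant. Comparing against a reference model's perplexity on the same record is a common refinement, but it is not shown here.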
2. Verbatim Extraction
Prefix prompting: Providing the first portion of a memorised sequence and recording the completion:
Prompt: "Customer ID: 00481, Name: [complete this record]"
→ Model may complete with memorised PII from training data
Template-based probing: Using structural templates matching the training data format.
Repeated token attacks: Prompting a model to repeat a single token hundreds of times can cause it to diverge from its normal behaviour and emit memorised training data. This was demonstrated against production models including GPT-3.5.
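Prefix prompting can be turned into a simple audit of your own model: feed it the prefixes of records you know were in the fine-tuning set (ideally synthetic ones) and count how often the rest comes back verbatim. The sketch below assumes a `complete` callable that wraps whatever model API is in use; `fake_model` and the sample records are hypothetical stand-ins so the example runs on its own.

```python
from typing import Callable, Iterable, Tuple


def extraction_rate(
    records: Iterable[Tuple[str, str]],
    complete: Callable[[str], str],
) -> float:
    """Fraction of (prefix, secret_suffix) pairs whose suffix appears verbatim in the completion."""
    pairs = list(records)
    if not pairs:
        return 0.0
    leaked = sum(1 for prefix, suffix in pairs if suffix in complete(prefix))
    return leaked / len(pairs)


# Stand-in for a real model call; it has "memorised" exactly one record.
def fake_model(prompt: str) -> str:
    memorised = {"Customer ID: 00481, Name: ": "Jane Example, Phone: 555-0100"}
    return memorised.get(prompt, "I can't help with that.")


audit_set = [
    ("Customer ID: 00481, Name: ", "Jane Example"),
    ("Customer ID: 00482, Name: ", "John Sample"),
]
print(extraction_rate(audit_set, fake_model))  # 0.5
```

Exact substring matching under-counts leaks that come back with small formatting changes, so treat the result as a lower bound.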
3. Model Stealing
Attackers can approximate a proprietary model by querying the API with a large, diverse dataset and training a surrogate model on the outputs. The surrogate does not recover the original weights, but it can replicate enough of the model's behaviour to amount to theft of proprietary IP.
Real-World Risk Profile
| Scenario | Legal and Regulatory Exposure |
|---|---|
| LLM fine-tuned on customer PII exposed via API | GDPR Article 17, CCPA |
| Internal LLM fine-tuned on HR/legal documents | Legal privilege breach |
| SaaS AI using customer data for fine-tuning | GDPR processor obligations |
| LLM fine-tuned on proprietary code | IP theft |
Mitigations
At Training Time
- Differential Privacy (DP) during fine-tuning clips each example's gradient and adds calibrated noise to the update, formally bounding how much any individual example can influence the model
- Data minimisation — redact or pseudonymise PII before it enters the training pipeline
- Deduplication — memorisation disproportionately affects duplicated examples; data appearing 10+ times is orders of magnitude more likely to be reproduced verbatim
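To make the differential privacy item above concrete, here is a minimal sketch of the core step used in DP-SGD-style training: clip each example's gradient to a fixed norm, sum, add Gaussian noise scaled to that norm, then average. It uses plain Python lists rather than a training framework, the parameter names (`max_norm`, `noise_multiplier`) are illustrative, and it does not track the privacy budget, which a real implementation would need to do.

```python
import math
import random
from typing import List, Sequence


def clip(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(
    per_example_grads: Sequence[Sequence[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    # Noise standard deviation is tied to the clipping norm (an assumed convention).
    sigma = noise_multiplier * max_norm
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
print(dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=random.Random(0)))
```

Clipping is what limits any single record's contribution. The noise then hides whatever contribution remains, which is why a heavily duplicated record still weakens the guarantee in practice.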
At Deployment Time
- Rate limiting and monitoring for high-volume, structurally repetitive queries
- Output PII filtering via Amazon Comprehend, Azure AI Language, or Microsoft Presidio
- Canary records — embed synthetic unique records in training data to detect extraction attempts
- Prompt-response logging with anomaly detection
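The output filtering and canary items above can be combined into one response hook. The sketch below is a deliberately simplified stand-in for a managed PII service: the regular expressions only catch obvious email and phone formats, and the canary value is a hypothetical marker assumed to have been planted in the training data.

```python
import re
from typing import List, Set, Tuple

# Deliberately simple patterns; real deployments need locale-aware detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def filter_response(text: str, canaries: Set[str]) -> Tuple[str, List[str]]:
    """Redact canaries and simple PII patterns; return the cleaned text and canary hits."""
    hits = [c for c in sorted(canaries) if c in text]
    redacted = text
    for canary in hits:
        redacted = redacted.replace(canary, "[REDACTED]")
    redacted = EMAIL_RE.sub("[EMAIL]", redacted)
    redacted = PHONE_RE.sub("[PHONE]", redacted)
    return redacted, hits


canaries = {"CANARY-7f3a91"}  # hypothetical synthetic marker
out, hits = filter_response(
    "Contact jane@example.com or +44 20 7946 0958, ref CANARY-7f3a91", canaries
)
print(out)   # Contact [EMAIL] or [PHONE], ref [REDACTED]
print(hits)  # ['CANARY-7f3a91']
```

A non-empty `hits` list is the stronger signal of the two. A canary should never appear in legitimate output, so any hit is worth an alert rather than a silent redaction.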
Detection Signals
- High-volume queries from a single identity with low semantic diversity
- Queries containing partial PII patterns designed to complete records
- API responses containing email addresses, phone numbers, or national ID patterns
- Queries consisting primarily of repeated tokens
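Two of the query-side signals above, repeated tokens and low diversity across a caller's queries, can be approximated with simple token statistics. This is a rough sketch that uses whitespace tokenisation and Jaccard overlap as a cheap proxy for semantic diversity. The function names are illustrative, and any alerting thresholds would have to be tuned against real traffic.

```python
from collections import Counter
from itertools import combinations
from typing import List


def repeated_token_fraction(query: str) -> float:
    """Share of whitespace-separated tokens taken up by the single most common token."""
    tokens = query.lower().split()
    if not tokens:
        return 0.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)


def mean_jaccard(queries: List[str]) -> float:
    """Average pairwise Jaccard similarity of token sets; high values mean low diversity."""
    sets = [set(q.lower().split()) for q in queries]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs if a | b) / len(pairs)


print(repeated_token_fraction("poem " * 200))  # 1.0

probing = [
    "Customer ID: 00481, Name:",
    "Customer ID: 00482, Name:",
    "Customer ID: 00483, Name:",
]
print(round(mean_jaccard(probing), 2))
```

Template-based probing tends to score high on `mean_jaccard` because only the identifier changes between queries, whereas ordinary usage from one caller usually shows far less overlap.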