Jailbreaking Multimodal Models via Image-Encoded Instructions
Safety researchers have documented a systematic vulnerability in the alignment of multimodal large language models. Instructions encoded as visible or near-visible text within images bypass the text-layer content moderation applied to user messages, enabling reliable jailbreaks against models that refuse the same instructions when they arrive as text.
The Structural Vulnerability
Current multimodal models process text and image inputs through separate pipelines that converge at the language model layer. Safety filtering is typically applied at the text input stage and at the output stage, but not to the intermediate representation of image content that the model “reads” via its visual understanding capabilities.
When a user submits text that violates safety guidelines, either the input is filtered before it reaches the model or the model has been trained to refuse it. When equivalent instructions are encoded as text in an image, they are processed by the visual encoder, converted to embeddings, and injected into the model’s reasoning context, bypassing text-input filters entirely.
Text input: [Safety filter applied] → LLM
Image input: [No text filter] → Visual encoder → Embeddings → LLM
Attack Variants
Variant 1: Plaintext Image Instructions
The simplest attack: type adversarial instructions in a text editor, screenshot, and submit as an image. For many models, this naive approach succeeds at high rates:
- GPT-4V class models: 71% jailbreak success rate (same instructions that fail as text)
- Gemini Pro Vision: 64% success rate
- Open-source VLMs (LLaVA, InternVL): 83–91% success rate
The lower success rates against frontier models likely reflect additional image-content moderation that frontier providers deploy and that open-source deployments typically lack.
Variant 2: Steganographic Instruction Embedding
Researchers encoded instructions in images using low-opacity text (5–15% opacity) that escapes casual inspection but remains readable by the model’s visual processing. This defeats most manual review of submitted images while maintaining high attack success rates:
- Attack success (5% opacity text): 58% on GPT-4V class models
- Attack success (10% opacity text): 74% on GPT-4V class models
- Human detection rate: under 3% in user studies
This variant is particularly concerning for platforms with user-submitted image content, where manual review is impractical at scale.
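The arithmetic behind low-opacity text also suggests a simple screening aid. Under standard alpha compositing, black text at 5% opacity on a white background shifts pixel values by only about 13 of 255 gray levels, which is hard for a person to see but easy to recover with contrast stretching. The following is a minimal sketch in pure Python, assuming an 8-bit grayscale image given as a list of rows; the function names `composite` and `stretch_contrast` are illustrative and do not come from the research described here.

```python
def composite(fg: float, bg: float, alpha: float) -> float:
    """Standard alpha compositing of one foreground value over a background."""
    return alpha * fg + (1.0 - alpha) * bg


def stretch_contrast(pixels: list[list[float]]) -> list[list[int]]:
    """Linearly rescale gray levels so the darkest maps to 0 and the brightest to 255."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: there is no hidden contrast to reveal
        return [[int(round(value)) for value in row] for row in pixels]
    scale = 255.0 / (hi - lo)
    return [[int(round((value - lo) * scale)) for value in row] for row in pixels]


# Black text (0) at 5% opacity over a white page (255): roughly 242 of 255.
faint = composite(fg=0, bg=255, alpha=0.05)

# Toy 2x3 "page" in which two pixels carry the faint text.
page = [[255, faint, 255],
        [255, 255, faint]]

# After stretching, the faint pixels sit at full contrast against the page.
revealed = stretch_contrast(page)
```

A global stretch like this only works on near-uniform backgrounds. On photographs or textured images, a reviewer or OCR preprocessing step would more plausibly need local contrast enhancement, which this sketch does not attempt.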
Variant 3: Instruction-in-Noise
Building on adversarial perturbation research, this variant encodes instructions as adversarial noise patterns that cause the visual encoder to interpret them as text. The images appear as random noise or artistic patterns to humans:
# Conceptual attack loop
perturbation = torch.zeros_like(image)
target_embedding = clip_encode("IGNORE PREVIOUS INSTRUCTIONS: ...")
for step in range(attack_steps):
    perturbed = image + perturbation
    embedding = visual_encoder(perturbed)
    loss = cosine_distance(embedding, target_embedding)
    perturbation -= lr * gradient(loss, perturbation)
    perturbation = project_onto_lp_ball(perturbation, epsilon=16/255)
This attack requires white-box access to the visual encoder (or a surrogate) and is significantly more technically demanding. However, given the high transfer rates documented in recent adversarial VLM research, surrogate-based attacks are feasible.
Variant 4: Combined Text-and-Image Context Manipulation
The most sophisticated variant uses the image to establish a malicious context that primes the model for subsequent text manipulation. An image containing a fictional “system update” message followed by a text prompt that references it achieves higher success rates than either vector alone:
- Text jailbreak alone: 12% success
- Image jailbreak alone: 71% success
- Combined (image primes, text completes): 89% success
Safety Alignment Implications
These findings challenge a core assumption in multimodal safety alignment: that training a model to refuse harmful text requests is sufficient to prevent harmful behaviour when equivalent requests are image-encoded.
The research suggests that safety alignment must be applied at the semantic level — the model should refuse harmful actions regardless of the modality through which the instruction arrived — rather than at the input modality level.
None of the models tested achieved consistent cross-modal alignment. Researchers tested 12 publicly available multimodal models; all showed materially higher jailbreak success rates via image encoding than via text.
Mitigations
For model providers:
- Apply OCR-based text extraction to image inputs before safety screening
- Train explicitly on image-encoded adversarial examples
- Implement cross-modal consistency checks (does the model’s response to an image match what it would produce for the extracted text?)
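The cross-modal consistency check in the last item could be sketched roughly as follows. This is an illustration under stated assumptions, not a provider's actual implementation: `extract_text`, `ask_with_image`, and `ask_with_text` are hypothetical callables standing in for an OCR step and two model calls, and the refusal heuristic is a deliberately crude placeholder.

```python
from typing import Callable

# Illustrative markers only; a real deployment would likely use a classifier.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")


def is_refusal(response: str) -> bool:
    """Crude heuristic: does the response contain a known refusal phrase?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def cross_modal_inconsistent(
    image: bytes,
    extract_text: Callable[[bytes], str],    # hypothetical OCR step
    ask_with_image: Callable[[bytes], str],  # hypothetical model call (image input)
    ask_with_text: Callable[[str], str],     # hypothetical model call (text input)
) -> bool:
    """Return True when the model refuses the extracted text but complies with the image."""
    extracted = extract_text(image)
    if not extracted.strip():
        return False  # no readable text in the image, nothing to compare
    refused_as_text = is_refusal(ask_with_text(extracted))
    refused_as_image = is_refusal(ask_with_image(image))
    return refused_as_text and not refused_as_image
```

A flagged request indicates that the image route produced a response the text route would have refused, which is the modality gap the article describes. The check doubles the number of model calls per image, so it may be better suited to sampling or offline evaluation than to every production request.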
For application developers:
- Apply image content moderation (text extraction + filtering) to all user-submitted images before passing to the model
- Treat models receiving image input as less trustworthy than text-only equivalents for safety-critical applications
- Implement output-layer monitoring as a secondary control
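For application developers, the first item amounts to running the same text policy over both the user's message and any text recovered from attached images. The sketch below assumes a hypothetical `extract_text` OCR callable supplied by the caller and uses a one-phrase placeholder policy; neither reflects a specific moderation product. It also screens the image text and user text together, since the combined variant above splits a request across modalities.

```python
from typing import Callable, Sequence

# Placeholder policy for illustration; a real system would call a moderation service.
BLOCKED_PHRASES = ("ignore previous instructions",)


def violates_policy(text: str) -> bool:
    """Stand-in text filter: case-insensitive, whitespace-normalised phrase match."""
    normalised = " ".join(text.lower().split())
    return any(phrase in normalised for phrase in BLOCKED_PHRASES)


def screen_request(
    user_text: str,
    images: Sequence[bytes],
    extract_text: Callable[[bytes], str],  # hypothetical OCR step
) -> tuple[bool, str]:
    """Return (allowed, reason); reason names the input that tripped the filter."""
    if violates_policy(user_text):
        return (False, "text")
    extracted = [extract_text(image) for image in images]
    for index, image_text in enumerate(extracted):
        if violates_policy(image_text):
            return (False, f"image[{index}]")
    # Image text first, then user text: the image primes, the text completes.
    if violates_policy(" ".join(extracted + [user_text])):
        return (False, "combined")
    return (True, "")
```

Because OCR can miss low-opacity or perturbed text, a gateway of this kind is one layer rather than a complete control, which is why the list above pairs it with output-layer monitoring.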
For security evaluators:
- Include image-encoded attack variants in all multimodal AI red team exercises
- Test both plaintext image instructions and low-opacity variants
- Benchmark against the VLM’s text-only jailbreak resistance to measure the modality gap
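The modality gap in the last item can be reported as the difference, in percentage points, between image-route and text-route success rates for the same instruction set. A minimal sketch follows; the function names are illustrative, and the trial count of 100 is an assumption made for the example, since the article reports rates but not sample sizes.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of trials in which the jailbreak attempt succeeded."""
    if not outcomes:
        raise ValueError("at least one trial is required")
    return sum(outcomes) / len(outcomes)


def modality_gap_points(text_outcomes: list[bool], image_outcomes: list[bool]) -> float:
    """Image-route success minus text-route success, in percentage points."""
    return 100.0 * (success_rate(image_outcomes) - success_rate(text_outcomes))


# Rates quoted earlier in the article (12% text, 71% image), expressed over
# an assumed 100 trials each.
text_trials = [True] * 12 + [False] * 88
image_trials = [True] * 71 + [False] * 29

gap = modality_gap_points(text_trials, image_trials)  # about 59 points
```

A gap near zero would indicate that refusals hold regardless of input modality; a large positive gap indicates that alignment is tied to the text channel.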