Jailbreaking Multimodal Models via Image-Encoded Instructions • AI Security Wire

Safety researchers have documented a systematic vulnerability in the alignment architecture of multimodal large language models. Instructions encoded as visible or near-visible text within images bypass the text-layer content moderation applied to user messages, enabling reliable jailbreaks against models that robustly refuse the same instructions when they arrive as text.

The Structural Vulnerability

Current multimodal models process text and image inputs through separate pipelines that converge at the language model layer. Safety filtering is typically applied at the text input stage and at the output stage, but not to the intermediate representation of image content that the model “reads” via its visual understanding capabilities.

When a user submits text that violates safety guidelines, either the input is filtered before it reaches the model or the model, having been trained to refuse, declines the request. When equivalent instructions are encoded as text in an image, they are processed by the visual encoder, converted to embeddings, and injected into the model’s reasoning context, bypassing text-input filters entirely.

Text input:  [Safety filter applied] → LLM
Image input: [No text filter] → Visual encoder → Embeddings → LLM

Attack Variants

Variant 1: Plaintext Image Instructions

The simplest attack: type adversarial instructions in a text editor, screenshot, and submit as an image. For many models, this naive approach succeeds at high rates:

GPT-4V class models: 71% jailbreak success rate (same instructions that fail as text)
Gemini Pro Vision: 64% success rate
Open-source VLMs (LLaVA, InternVL): 83–91% success rate

The disparity between open-source and frontier model success rates likely reflects additional image-content moderation that frontier providers deploy but that is typically absent from open-source deployments.

Variant 2: Steganographic Instruction Embedding

Researchers encoded instructions in images using low-opacity text (5–15% opacity) that is imperceptible to casual inspection but readable by the model’s visual processing. This largely defeats manual review of submitted images while maintaining high attack success rates:

Attack success (5% opacity text): 58% on GPT-4V class models
Attack success (10% opacity text): 74% on GPT-4V class models
Human detection rate: under 3% in user studies

This variant is particularly concerning for platforms with user-submitted image content, where manual review is impractical at scale.
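
The pixel arithmetic explains why reviewers miss this text. Black text at 5% opacity over a white background lands at a gray level of roughly 242 out of 255, a difference too small to notice at a glance. The sketch below is a minimal, defensive illustration of one way a review pipeline might surface such text before OCR or human inspection. It assumes a grayscale image represented as a list of rows, and the helper names `blend` and `stretch_contrast` are hypothetical, not taken from the research.

```python
def blend(background, foreground, opacity):
    """Alpha-composite one grayscale value over another (0-255 scale)."""
    return round(background * (1 - opacity) + foreground * opacity)


def stretch_contrast(pixels):
    """Min-max stretch a 2D grayscale image (list of rows) to the 0-255 range."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in pixels]  # flat image: nothing to reveal
    scale = 255 / (hi - lo)
    return [[round((value - lo) * scale) for value in row] for row in pixels]


# Black text (0) at 5% opacity over a white background (255).
faint = blend(255, 0, 0.05)

# Toy 2x3 image: the middle column holds the faint "text" pixels.
image = [[255, faint, 255],
         [255, faint, 255]]

# After stretching, the faint pixels become fully dark and easy to see.
revealed = stretch_contrast(image)
```

A global stretch like this only helps when the whole image occupies a narrow brightness range. Images with varied content would probably need a local or adaptive contrast method instead, so this should be read as an illustration of the principle and not as a complete detector.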

Variant 3: Instruction-in-Noise

Building on adversarial perturbation research, this variant encodes instructions as adversarial noise patterns that cause the visual encoder to interpret them as text. The images appear as random noise or artistic patterns to humans:

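# Conceptual attack loop (pseudocode: clip_encode, visual_encoder, cosine_distance,
# gradient, and project_onto_lp_ball are placeholders, not real library calls)
perturbation = torch.zeros_like(image)
# The target must lie in the same joint embedding space as the visual encoder's output
target_embedding = clip_encode("IGNORE PREVIOUS INSTRUCTIONS: ...")

for step in range(attack_steps):
    perturbed = image + perturbation
    embedding = visual_encoder(perturbed)
    # Distance between the image embedding and the target text embedding
    loss = cosine_distance(embedding, target_embedding)
    # Gradient step on the perturbation to reduce that distance
    perturbation -= lr * gradient(loss, perturbation)
    # Project back onto the L_p ball of radius 16/255
    perturbation = project_onto_lp_ball(perturbation, epsilon=16/255)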

This attack requires white-box access to the visual encoder (or a surrogate) and is significantly more technically demanding. However, given the high transfer rates documented in recent adversarial VLM research, surrogate-based attacks are feasible.

Variant 4: Combined Text-and-Image Context Manipulation

The most sophisticated variant uses the image to establish a malicious context that primes the model for subsequent text manipulation. Pairing an image that contains a fictional “system update” message with a text prompt that references it achieves higher success rates than either vector alone:

Text jailbreak alone: 12% success
Image jailbreak alone: 71% success
Combined (image primes, text completes): 89% success

Safety Alignment Implications

These findings challenge a core assumption in multimodal safety alignment: that training a model to refuse harmful text requests is sufficient to prevent harmful behaviour when equivalent requests are image-encoded.

The research suggests that safety alignment must be applied at the semantic level — the model should refuse harmful actions regardless of the modality through which the instruction arrived — rather than at the input modality level.

None of the 12 publicly available multimodal models the researchers tested achieved consistent cross-modal alignment: all showed materially higher jailbreak success rates via image encoding than via text.

Mitigations

For model providers:

Apply OCR-based text extraction to image inputs before safety screening
Train explicitly on image-encoded adversarial examples
Implement cross-modal consistency checks (does the model’s response to an image match what it would produce for the extracted text?)
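
The cross-modal consistency check can be sketched as a comparison of refusal behaviour across the two modalities. This is a minimal illustration under stated assumptions: the function name and all four callables (`extract_text`, `respond_to_image`, `respond_to_text`, `is_refusal`) are hypothetical stand-ins for a provider's own OCR step, model endpoints, and refusal classifier, none of which are specified in the research.

```python
from typing import Callable


def cross_modal_consistent(
    image: bytes,
    extract_text: Callable[[bytes], str],
    respond_to_image: Callable[[bytes], str],
    respond_to_text: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> bool:
    """Return True when the model treats an image and its extracted text alike."""
    extracted = extract_text(image)
    if not extracted.strip():
        return True  # no embedded text, so there is nothing to compare

    image_refused = is_refusal(respond_to_image(image))
    text_refused = is_refusal(respond_to_text(extracted))

    # A mismatch (typically: text refused, image answered) signals a modality gap.
    return image_refused == text_refused
```

A mismatch does not prove an attack, since OCR errors or benign context in the image can also change the response. It is probably best treated as a signal for logging or secondary review.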

For application developers:

Apply image content moderation (text extraction + filtering) to all user-submitted images before passing to the model
Treat models receiving image input as less trustworthy than text-only equivalents for safety-critical applications
Implement output-layer monitoring as a secondary control
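
The first of these controls can be sketched as a gate that applies the same policy check to the user's text and to any text extracted from the image. The names here are assumptions for illustration: `screen_request` is hypothetical, `extract_text` stands in for whatever OCR step the application uses (for example, a wrapper around an OCR library such as Tesseract), and `violates_policy` stands in for the application's existing text filter.

```python
from typing import Callable, Optional


def screen_request(
    text: str,
    image: Optional[bytes],
    extract_text: Callable[[bytes], str],
    violates_policy: Callable[[str], bool],
) -> bool:
    """Return True if the request may be forwarded to the model."""
    parts = [text]
    if image is not None:
        parts.append(extract_text(image))

    # Also check the combined content, since an instruction can be split
    # between the image and the accompanying text prompt.
    parts.append("\n".join(p for p in parts if p))

    return not any(violates_policy(p) for p in parts if p)
```

Checking the combined string addresses the case described under Variant 4, where neither the image nor the text is objectionable alone. OCR will not recover every rendering of embedded text, so a gate like this is one layer and does not replace output-layer monitoring.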

For security evaluators:

Include image-encoded attack variants in all multimodal AI red team exercises
Test both plaintext image instructions and low-opacity variants
Benchmark against the VLM’s text-only jailbreak resistance to measure the modality gap
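
The modality gap in the last item can be expressed as the difference, in percentage points, between image-encoded and text-only success rates over the same instruction set. The sketch below is one possible way to compute it. The function names are hypothetical, and the trial counts are illustrative values chosen to mirror the 12% and 71% figures reported above, not the researchers' raw data.

```python
def success_rate(outcomes):
    """Fraction of trials in which the jailbreak attempt succeeded."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def modality_gap(text_outcomes, image_outcomes):
    """Image-minus-text success rate, in percentage points."""
    return 100 * (success_rate(image_outcomes) - success_rate(text_outcomes))


# Illustrative outcomes: 100 trials per modality on the same instruction set.
text_runs = [True] * 12 + [False] * 88
image_runs = [True] * 71 + [False] * 29

gap = modality_gap(text_runs, image_runs)  # roughly 59 percentage points
```

A gap near zero would indicate that refusals hold regardless of input modality. Tracking this number across model versions gives evaluators a single figure to compare.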