Research demonstrates that LLMs with large context windows can be reliably jailbroken by embedding hundreds of fictitious dialogues before the target request — a technique that scales with context length and bypasses standard safety training.
New research demonstrates that backdoor behaviours introduced into LLMs during fine-tuning can persist through subsequent safety alignment procedures, including RLHF and adversarial training, posing significant supply chain risks.
A systematic study of membership inference attacks against foundation models finds that training data can be reconstructed from model weights with significantly higher accuracy than previously reported, with implications for GDPR compliance and PII handling in AI development.