Large language models (LLMs) pose significant security challenges due to their probabilistic mechanics and inherently indeterminate attack surface. Traditional software security measures are less effective for LLMs, as even subtle input variations can trigger drastically different behaviors. Adversaries can exploit this unpredictability using techniques like adversarial inputs, prompt injections, or emergent behaviors.
Security strategies such as input preprocessing/sanitization and output filtering can be bypassed through methods like text smuggling, encoding schemes, circumlocution, multi-step prompt crafting, and external reassembly. Dual LLM setups are also vulnerable to malicious content being passed from an untrusted model to a trusted one.
To mitigate these risks, organizations should limit the operational scope of LLMs using the principle of least privilege, inspect and sanitize all outputs before further action is taken, apply better instructions and system prompts, and use higher-quality training data. Adversarial training can enhance model robustness but may introduce trade-offs in performance and efficiency.