/plushcap/analysis/whylabs/whylabs-posts-navigating-threats-detecting-llm-prompt-injections-and-jailbreaks

Navigating Threats: Detecting LLM Prompt Injections and Jailbreaks

What's this blog post about?

This blog post discusses the issue of malicious attacks on language models (LLMs) such as jailbreak attacks and prompt injections. It presents two methods of detecting these attacks using LangKit, an open-source package for feature extraction for LLM and NLP applications. The first method involves comparing incoming user prompts to a set of known jailbreak/prompt injection attacks, while the second method is based on the assumption that under a prompt injection attack, the original prompt will not be followed by the model. Both methods have limitations, but they can help mitigate the issues associated with malicious LLM attacks.

Company
WhyLabs

Date published
Dec. 19, 2023

Author(s)
Felipe Adachi

Word count
1978

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.