Multi Needle in a Haystack
The Multi-Needle + Reasoning benchmark measures how well long-context large language models (LLMs) retrieve multiple facts ("needles") inserted into a long document and reason over them. Results show that retrieval degrades as the number of needles increases from 1 to 10 and as context length increases from 1,000 to 120,000 tokens, and that reasoning over the retrieved needles performs worse than retrieval alone. GPT-4 consistently retrieves needles placed toward the end of the context while missing those near the beginning. Both retrieval and reasoning degrade as context length grows, with reasoning lagging retrieval throughout. The benchmark highlights that retrieval of multiple facts is not guaranteed, especially as needle count and context size increase, and that specific prompt formulations may be needed to improve recall with certain LLMs.
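The core mechanics of a multi-needle test, as described above, are simple: scatter several facts through a long filler text, ask the model to recall them, and score the fraction recovered. A minimal sketch of that setup (the function names and random-insertion strategy here are illustrative, not LangChain's actual harness):

```python
import random

def insert_needles(haystack: str, needles: list[str], seed: int = 0) -> str:
    """Insert each needle at a random sentence boundary in the haystack.

    Needle position matters: the benchmark found needles near the start
    of the context are missed more often than needles near the end.
    """
    random.seed(seed)
    sentences = haystack.split(". ")
    for needle in needles:
        pos = random.randint(0, len(sentences))
        sentences.insert(pos, needle.rstrip("."))
    return ". ".join(sentences)

def retrieval_score(answer: str, needles: list[str]) -> float:
    """Fraction of needles that appear verbatim in the model's answer.

    Real harnesses typically use an LLM grader rather than substring
    matching; exact matching is a simplification for illustration.
    """
    found = sum(1 for n in needles if n.lower() in answer.lower())
    return found / len(needles)

# Example: three needles hidden in filler text.
needles = ["figs", "mangoes", "kiwis"]
haystack = "Filler sentence. " * 50
context = insert_needles(haystack, needles)
```

Scoring the model's answer against the needle list then gives the retrieval metric the post reports; a reasoning variant would instead ask a question that requires combining the needles (e.g. counting them) rather than listing them.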
Company
LangChain
Date published
March 13, 2024
Author(s)
-
Word count
1071
Language
English
Hacker News points
1