Multi Needle in a Haystack
The Multi-Needle + Reasoning benchmark measures how well long-context large language models (LLMs) retrieve multiple facts ("needles") inserted into a long document and reason over them. Results show that retrieval degrades as the number of needles increases from 1 to 10 and as context length increases from 1,000 to 120,000 tokens, and that reasoning over the retrieved needles performs worse than retrieval alone. GPT-4 consistently retrieves needles placed toward the end of the context while missing those near the beginning. Both retrieval and reasoning degrade as context length grows, with reasoning lagging retrieval throughout. The benchmark highlights that retrieval of multiple facts is not guaranteed, especially as needle count and context size increase, and that specific prompt formulations may be needed to improve recall with certain LLMs.
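The core mechanics of a multi-needle test, as described above, are simple: scatter several facts through a long filler text, ask the model to recall them, and score the fraction recovered. A minimal sketch of that setup (the function names and random-insertion strategy here are illustrative, not LangChain's actual harness):

```python
import random

def insert_needles(haystack: str, needles: list[str], seed: int = 0) -> str:
    """Insert each needle at a random sentence boundary in the haystack.

    Needle position matters: the benchmark found needles near the start
    of the context are missed more often than needles near the end.
    """
    random.seed(seed)
    sentences = haystack.split(". ")
    for needle in needles:
        pos = random.randint(0, len(sentences))
        sentences.insert(pos, needle.rstrip("."))
    return ". ".join(sentences)

def retrieval_score(answer: str, needles: list[str]) -> float:
    """Fraction of needles that appear verbatim in the model's answer.

    Real harnesses typically use an LLM grader rather than substring
    matching; exact matching is a simplification for illustration.
    """
    found = sum(1 for n in needles if n.lower() in answer.lower())
    return found / len(needles)

# Example: three needles hidden in filler text.
needles = ["figs", "mangoes", "kiwis"]
haystack = "Filler sentence. " * 50
context = insert_needles(haystack, needles)
```

Scoring the model's answer against the needle list then gives the retrieval metric the post reports; a reasoning variant would instead ask a question that requires combining the needles (e.g. counting them) rather than listing them.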
Company
LangChain
Date published
March 13, 2024
Author(s)
-
Word count
1071
Language
English
Hacker News points
1