
Multi Needle in a Haystack

What's this blog post about?

The Multi-Needle + Reasoning benchmark explores how well long-context large language models (LLMs) retrieve multiple facts ("needles") and reason over them. Results show that retrieval accuracy drops as the number of needles grows, and that reasoning over retrieved needles is harder than retrieval alone. GPT-4 consistently retrieves needles placed toward the end of the context while missing those at the beginning. Two trends are clear: performance degrades as the number of needles increases from 1 to 10, and as the context length increases from 1,000 to 120,000 tokens. Both retrieval and reasoning degrade with longer contexts, with reasoning lagging behind retrieval. The benchmark highlights that retrieval of multiple facts is not guaranteed, especially as the number of needles and context size increase, and that specific prompt formulations may be needed to improve recall with certain LLMs.
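The core mechanics of such a benchmark can be sketched without calling an LLM: scatter needle sentences into filler text, then score what fraction of the needle facts appear in a model's answer. This is a minimal, hypothetical sketch; the helper names (`insert_needles`, `retrieval_score`) and random placement are assumptions for illustration, not the actual benchmark code, which places needles at controlled context depths.

```python
import random

def insert_needles(haystack: str, needles: list[str], seed: int = 0) -> str:
    """Insert each needle sentence at a random position in the haystack.

    Hypothetical helper: the real benchmark controls needle depth rather
    than placing needles uniformly at random.
    """
    rng = random.Random(seed)
    sentences = haystack.split(". ")
    for needle in needles:
        pos = rng.randrange(len(sentences) + 1)
        sentences.insert(pos, needle)
    return ". ".join(sentences)

def retrieval_score(answer: str, needle_keys: list[str]) -> float:
    """Fraction of needle key facts that appear in the model's answer
    (case-insensitive substring match, a simple stand-in for grading)."""
    found = sum(1 for key in needle_keys if key.lower() in answer.lower())
    return found / len(needle_keys)

# Example: build a 3-needle context and score a partial answer.
filler = ". ".join(f"Filler sentence {i}" for i in range(20))
needles = [
    "The secret ingredient is figs",
    "The secret ingredient is prosciutto",
    "The secret ingredient is goat cheese",
]
context = insert_needles(filler, needles, seed=42)
score = retrieval_score("The ingredients are figs and goat cheese.",
                        ["figs", "prosciutto", "goat cheese"])
```

Running the scorer on an answer that names two of the three key facts yields 2/3, mirroring how the benchmark reports partial recall as needle count grows.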

Company
LangChain

Date published
March 13, 2024

Author(s)
-

Word count
1071

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.