
Extraction Benchmarking

What's this blog post about?

The new Chat Extraction dataset measures how well LLMs infer structured information from chat logs. It offers a practical environment for testing common challenges in LLM application development, such as classifying unstructured text, generating machine-readable output, and reasoning over multiple tasks in the presence of distracting information.

Experiments compared several LLMs, including GPT-4, Claude-2, Llama-v2-34b-code-instruct, Llama-v2-chat-70b, and yi-34b-200k-capybara, on generating structured information from chat logs. GPT-4 outperformed the other models on almost all metrics, but none of the prompting strategies offered a significant boost to the structure of the model output. Grammar-based decoding was also tested as a way to reliably generate schema-compliant JSON; it achieved 100% validity on the json_schema correctness metric but did not guarantee the quality of the extracted values themselves.
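To make the json_schema correctness metric concrete, here is a minimal sketch in Python. The schema fields and the use of the jsonschema library are illustrative assumptions, not details from the benchmark itself: the idea is that an output passes only if it parses as JSON and conforms to the expected schema, independent of whether the extracted values are accurate.

import json
from jsonschema import ValidationError, validate

# Hypothetical schema for a chat extraction task (illustrative only).
SCHEMA = {
    "type": "object",
    "properties": {
        "issue_summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
    },
    "required": ["issue_summary", "sentiment"],
}

def is_schema_valid(model_output: str) -> bool:
    # Valid only if the raw output parses as JSON and matches the schema.
    try:
        validate(instance=json.loads(model_output), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(is_schema_valid('{"issue_summary": "login fails", "sentiment": "negative"}'))  # True
print(is_schema_valid("not json"))                                                   # False

A check like this explains why grammar-based decoding can reach 100% validity while still producing poor extractions: the schema constrains the shape of the output, not the correctness of the values.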

Company
LangChain

Date published
Dec. 5, 2023

Author(s)
-

Word count
2264

Language
English

Hacker News points
None found.
