API-Bank: Benchmarking Language Models’ Tool Use
Researchers have developed API-Bank, a benchmark for testing how well large language models (LLMs) use external tools such as APIs to accomplish tasks. The benchmark evaluates three abilities: deciding when to call an API, finding the right tool for the job, and employing multiple APIs to complete a task. GPT-4 outperforms GPT-3.5 Turbo on most of the tests, but both models struggle with tasks requiring multiple rounds of interdependent API calls. The results highlight the potential for LLMs to become more efficient and useful by incorporating external tools, and they show that chaining interdependent API calls still needs improvement.
Company
Deepgram
Date published
Aug. 28, 2023
Author(s)
Brad Nikkel
Word count
2334
Hacker News points
None found.
Language
English