Company:
Date Published:
Author: Amber Roberts
Word count: 1995
Language: English
Hacker News points: None

Summary

This blog post benchmarks OpenAI's GPT models, with and without function calling and explanations, across several performance metrics, focusing on how well each model classifies hallucinated and relevant responses. The results show trade-offs between speed and performance for different LLM application systems. GPT models with function calling tend to have slightly higher latency than LLMs without it, but perform on par with them. For predicting relevance, GPT-4 performs best overall; for hallucinations, GPT-4 identifies them correctly more often than GPT-4-turbo across precision, accuracy, recall, and F1. Adding explanations does not always improve performance. When deciding which LLM to use for an application, benchmarking and experimentation are required, weighing the system's latency alongside the relevant prediction metrics.
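To make the benchmark setup concrete, the sketch below shows one way such an evaluation loop could be wired up: each model is asked, via function calling, to label a response as hallucinated or factual, and precision, accuracy, recall, F1, and latency are computed per model. This is a minimal illustration only; the `record_verdict` tool schema, the prompt wording, and the toy eval set are assumptions and do not come from the original post.

```python
"""Minimal sketch of a hallucination-classification benchmark with function calling.

Assumptions (not from the post): the openai>=1.x Python client with an API key in
the environment, scikit-learn for metrics, and a small labeled eval set of
(context, response, label) examples. Replace EVAL_SET with a real dataset.
"""
import json
import time

from openai import OpenAI
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

client = OpenAI()

# Hypothetical labeled examples; swap in a real evaluation dataset.
EVAL_SET = [
    {"context": "The Eiffel Tower is in Paris.",
     "response": "The Eiffel Tower is located in Berlin.",
     "label": "hallucinated"},
    {"context": "Water boils at 100 C at sea level.",
     "response": "At sea level, water boils at 100 C.",
     "label": "factual"},
]

# A function/tool schema that forces the model to return a structured verdict.
VERDICT_TOOL = {
    "type": "function",
    "function": {
        "name": "record_verdict",
        "description": "Record whether the candidate response is hallucinated or factual.",
        "parameters": {
            "type": "object",
            "properties": {
                "verdict": {"type": "string", "enum": ["hallucinated", "factual"]},
                "explanation": {"type": "string"},
            },
            "required": ["verdict"],
        },
    },
}


def classify(model: str, example: dict) -> tuple[str, float]:
    """Ask one model for a verdict via function calling; return (verdict, latency_s)."""
    prompt = (
        "Given the reference context and a candidate response, decide whether the "
        "response is 'hallucinated' or 'factual'.\n\n"
        f"Context: {example['context']}\nResponse: {example['response']}"
    )
    start = time.perf_counter()
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=[VERDICT_TOOL],
        tool_choice={"type": "function", "function": {"name": "record_verdict"}},
    )
    latency = time.perf_counter() - start
    args = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)
    return args["verdict"], latency


def benchmark(model: str) -> dict:
    """Run the eval set through one model and compute classification metrics plus latency."""
    y_true, y_pred, latencies = [], [], []
    for ex in EVAL_SET:
        verdict, latency = classify(model, ex)
        y_true.append(ex["label"])
        y_pred.append(verdict)
        latencies.append(latency)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label="hallucinated"),
        "recall": recall_score(y_true, y_pred, pos_label="hallucinated"),
        "f1": f1_score(y_true, y_pred, pos_label="hallucinated"),
        "median_latency_s": sorted(latencies)[len(latencies) // 2],
    }


if __name__ == "__main__":
    for model in ("gpt-4", "gpt-4-turbo"):
        print(model, benchmark(model))
```

Comparing the metric dictionaries side by side (and rerunning with the tool arguments omitted, i.e., plain prompting) is the kind of speed-versus-performance comparison the post describes.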