The blog post by Akash Sharma and Sinan Ozdemir explores Vellum's Playground, a tool for finding the right prompt/model combination for a given use case. It compares four leading LLMs from three top AI companies: OpenAI’s GPT-3.5 and GPT-4, Anthropic’s Claude, and Cohere’s Command series. The authors walk through four examples: Text Classification (detecting offensive language), Creative Content Generation with rules and personas, Question Answering and Logical Reasoning, and Code Generation. They evaluate performance and quality along three main metrics: Accuracy, Semantic Text Similarity, and Robustness. The goal is not to declare any of these models a “winner” but to help users judge model quality and performance in a more structured way using Vellum, a developer platform for building production LLM apps.
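To make the Semantic Text Similarity metric concrete, here is a minimal sketch of the idea: score how close a model's output is to a reference answer, then compare models by that score. This is an illustrative lexical proxy (cosine similarity over bag-of-words vectors), not the method the post or Vellum actually uses; production setups typically rely on embedding models instead.

```python
import math
import re
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts.

    A crude stand-in for semantic similarity: real evaluations would
    embed the texts with a sentence-embedding model first.
    """
    tokenize = lambda s: Counter(re.findall(r"\w+", s.lower()))
    va, vb = tokenize(a), tokenize(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical reference answer and two model outputs to rank.
reference = "The capital of France is Paris."
outputs = {
    "model_a": "Paris is the capital of France.",
    "model_b": "France is a country in Europe.",
}
scores = {name: cosine_similarity(reference, text) for name, text in outputs.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Here model_a scores higher because it shares nearly all its words with the reference, which matches the intuition that its answer is semantically closer.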