
Introducing a new evaluation for creative ability in Large Language Models

What's this blog post about?

HumE-1 (Human Evaluation 1) is a new evaluation method for large language models (LLMs) that relies on human ratings to assess how well these models perform creative tasks in ways that matter to us, namely by evoking the intended feelings. LLMs are already used in many fields, such as writing books and articles, assisting legal professionals and healthcare practitioners, and providing mental health support, yet existing benchmarks fail to capture how these models affect our satisfaction and well-being. HumE-1 evaluates LLMs on tasks like writing motivational quotes, interesting facts, funny jokes, beautiful haikus, charming limericks, scary horror stories, appetizing descriptions of food, and persuasive arguments for charity donations. The evaluation uses honest, naturalistic prompts to better reflect real-life scenarios. In the first round of results, Gemini Ultra performed best, followed by GPT-4 Turbo, with both models leaving significant room for improvement.
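To make the general idea concrete, here is a minimal sketch of how human ratings could be aggregated into a per-model score. This is a hypothetical illustration only: the record format, field names, rating scale, and score values are assumptions, not HumE-1's actual methodology or results.

```python
from statistics import mean
from collections import defaultdict

# Hypothetical example: each record is one human rating of a model's output
# on a creative task, scored for how strongly it evoked the target feeling
# (e.g. amusement for jokes, fear for horror stories). Values are placeholders.
ratings = [
    {"model": "gemini-ultra", "task": "funny joke", "score": 72},
    {"model": "gemini-ultra", "task": "scary horror story", "score": 65},
    {"model": "gpt-4-turbo", "task": "funny joke", "score": 68},
    {"model": "gpt-4-turbo", "task": "scary horror story", "score": 61},
]

def aggregate(records):
    """Average human ratings per model across all creative tasks."""
    by_model = defaultdict(list)
    for r in records:
        by_model[r["model"]].append(r["score"])
    return {model: mean(scores) for model, scores in by_model.items()}

if __name__ == "__main__":
    for model, score in sorted(aggregate(ratings).items(),
                               key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {score:.1f}")
```

Running this toy example ranks the models by their average human rating, mirroring the kind of leaderboard comparison described in the post.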

Company
Hume

Date published
Feb. 9, 2024

Author(s)
Jeffrey Brooks, PhD

Word count
1062

Language
English

Hacker News points
None found.
