Every week, it seems like another AI provider releases a state-of-the-art model. These announcements come with impressive benchmarks, but those benchmarks rarely reflect real-world use cases. So how do you know whether a new model is worth deploying in your app?

Start by establishing a baseline for how well your app performs today, using evaluations that score the accuracy or quality of your LLM outputs. The best way to evaluate a new model is to test it against the data your app actually handles in production: pull real logs from your app, organize them into an evaluation dataset, and run both your current model and the candidate against it.

If the results show that the new model outperforms your current one, swap it into production with just a one-line code change. After shipping the new model, keep tabs on its performance on the Monitor page: select Group by model to focus on the model change, and tighten the timeline to the window when you made it.

By testing a new AI model against your actual data and swapping models easily, you'll know for sure whether it's better for your app. The sketches below show what each of these steps can look like in code.
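To make the dataset step concrete, here is a minimal sketch in TypeScript. It assumes your production logs are line-delimited JSON with a prompt, a completion, and an optional user rating; the field names and file paths are placeholders for whatever your logging setup actually emits.

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Hypothetical shape of a production log record; adjust to match your own logs.
interface LogRecord {
  prompt: string;
  completion: string;   // the output your current model produced
  userRating?: number;  // optional thumbs-up/down signal, if you collect one
}

// One evaluation case: an input plus the reference output to score against.
interface EvalCase {
  input: string;
  expected: string;
}

// Read JSONL production logs and keep only well-rated examples as references.
function buildDataset(logPath: string): EvalCase[] {
  const lines = readFileSync(logPath, "utf8").split("\n").filter(Boolean);
  const records: LogRecord[] = lines.map((line) => JSON.parse(line));
  return records
    .filter((r) => (r.userRating ?? 1) > 0) // drop examples users flagged as bad
    .map((r) => ({ input: r.prompt, expected: r.completion }));
}

const dataset = buildDataset("production-logs.jsonl");
writeFileSync(
  "eval-dataset.jsonl",
  dataset.map((c) => JSON.stringify(c)).join("\n"),
);
console.log(`Wrote ${dataset.length} eval cases`);
```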
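And here is one way to run the baseline-versus-candidate comparison over that dataset. The OpenAI-compatible client and the model names are assumptions made purely for illustration, and the word-overlap scorer is a deliberately crude stand-in; in practice you would use whichever accuracy or quality metric fits your task, such as exact-match checks or an LLM-as-judge.

```ts
import { readFileSync } from "node:fs";
import OpenAI from "openai";

interface EvalCase {
  input: string;
  expected: string;
}

// Assumes an OpenAI-compatible API for illustration; substitute whatever client your app uses.
const client = new OpenAI();

async function callModel(model: string, prompt: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return completion.choices[0].message.content ?? "";
}

// Crude word-overlap score in [0, 1]; replace with a metric that fits your task
// (exact match, structured-output checks, an LLM-as-judge, and so on).
function score(output: string, expected: string): number {
  const out = new Set(output.toLowerCase().split(/\s+/));
  const words = expected.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;
  return words.filter((w) => out.has(w)).length / words.length;
}

// Average score of one model across the whole dataset.
async function evaluate(model: string, dataset: EvalCase[]): Promise<number> {
  let total = 0;
  for (const c of dataset) {
    total += score(await callModel(model, c.input), c.expected);
  }
  return total / dataset.length;
}

async function main() {
  const dataset: EvalCase[] = readFileSync("eval-dataset.jsonl", "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  // Model names are placeholders: your current production model and the candidate.
  const baseline = await evaluate("current-model", dataset);
  const candidate = await evaluate("candidate-model", dataset);
  console.log({ baseline, candidate, candidateWins: candidate > baseline });
}

main().catch(console.error);
```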
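If the candidate wins, the production change really can be a single line: only the model identifier in the request changes. A sketch, again assuming an OpenAI-style client and placeholder model names:

```ts
import OpenAI from "openai";

const client = new OpenAI();

// The one-line change: only the model identifier differs between the old
// and new configuration. Names here are illustrative placeholders.
const MODEL = "candidate-model"; // was "current-model"

export async function answer(prompt: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: MODEL,
    messages: [{ role: "user", content: prompt }],
  });
  return completion.choices[0].message.content ?? "";
}
```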