Anyscale Endpoints has made experimentation with LLMs more accessible, allowing researchers to compare the factual accuracy of different models, including open-source LLMs such as Llama 2. In this comparison, Llama-2-70b was almost as strong as gpt-4 on factuality and considerably better than gpt-3.5-turbo. However, Llama-2-7b and Llama-2-13b suffered from severe ordering bias, and gpt-3.5-turbo also showed significant ordering bias. On cost, Llama 2 proved roughly 30 times cheaper than gpt-4 for summarization while delivering a similar level of performance. The experiment highlights both the need to account for ordering bias when using LLMs to judge summaries and the potential benefits of open-source LLMs like Llama 2.
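One common way to detect the ordering bias mentioned above is a position-swap check: present each pair of summaries to the judge model in both orders and count how often the verdict tracks the slot rather than the content. The sketch below is a minimal, hypothetical illustration, not the evaluation harness used in the experiment; `judge` stands in for whatever LLM call returns "A" or "B", and is stubbed here with a deliberately biased toy judge.

```python
from collections import Counter

def position_swap_eval(judge, pairs):
    """Ask `judge` about each (x, y) pair in both presentation orders.

    A judge free of ordering bias should flip its answer when the order
    flips; counting how often it instead sticks with the same *position*
    gives a simple bias rate between 0.0 (unbiased) and 1.0 (fully biased).
    """
    position_picks = Counter()
    for x, y in pairs:
        first = judge(x, y)    # "A" means the first-listed summary won
        second = judge(y, x)   # same pair, order swapped
        position_picks[(first, second)] += 1
    # ("A", "A") or ("B", "B") means the verdict followed the slot, not the content
    biased = position_picks[("A", "A")] + position_picks[("B", "B")]
    return biased / max(1, sum(position_picks.values()))

def always_first(x, y):
    # Toy judge with maximal ordering bias: always prefers the first summary.
    return "A"

rate = position_swap_eval(always_first, [("s1", "s2"), ("s3", "s4")])
# rate == 1.0: every verdict followed position rather than content
```

Running each comparison in both orders doubles the number of judge calls, but it is the simplest way to make ordering bias visible before trusting a model's pairwise rankings.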