This article discusses how to evaluate large language model (LLM) chatbots on their conversational performance. It highlights that LLM chatbot evaluation differs from standard LLM evaluation because each input-output interaction is judged with the prior conversation history as additional context. The article explains two approaches to LLM conversation evaluation: evaluating the entire conversation and evaluating only the last best response. It then introduces four conversational metrics for evaluating entire conversations: role adherence, conversation relevancy, knowledge retention, and conversation completeness. DeepEval, an open-source LLM evaluation framework, is used to implement these metrics in a few lines of code. The article concludes by emphasizing that LLM chatbot evaluation is essential for identifying areas of improvement and building effective conversational agents.
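
As a concrete starting point, below is a minimal sketch of how such conversational metrics can be applied with DeepEval. The exact class and parameter names used here (`ConversationalTestCase`, `LLMTestCase`, `turns`, `chatbot_role`, the four metric classes, and the `evaluate` call) are assumptions based on one version of the library and may differ in newer releases, so treat this as an illustration rather than a definitive reference and consult DeepEval's documentation for the current API.

```python
# Minimal sketch of conversational evaluation with DeepEval.
# NOTE: class and parameter names are assumptions based on one version
# of the library; check the DeepEval docs for the current API.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ConversationalTestCase
from deepeval.metrics import (
    RoleAdherenceMetric,
    ConversationRelevancyMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
)

# Each turn is one input-output interaction; the ordered list forms the
# conversation history that the metrics evaluate as a whole.
turns = [
    LLMTestCase(
        input="Hi, I'd like to book a table for two tonight.",
        actual_output="Of course! What time would you like the reservation?",
    ),
    LLMTestCase(
        input="7 pm, and one of us is vegetarian.",
        actual_output="Got it: a table for two at 7 pm, with vegetarian options noted.",
    ),
]

conversation = ConversationalTestCase(
    chatbot_role="a polite restaurant booking assistant",  # used by RoleAdherenceMetric
    turns=turns,
)

# The four conversational metrics described in the article.
metrics = [
    RoleAdherenceMetric(threshold=0.7),
    ConversationRelevancyMetric(threshold=0.7),
    KnowledgeRetentionMetric(threshold=0.7),
    ConversationCompletenessMetric(threshold=0.7),
]

# Runs every metric against the entire conversation and reports scores.
evaluate(test_cases=[conversation], metrics=metrics)
```

Each metric scores the whole conversation rather than a single response, which is what distinguishes this workflow from regular single-turn LLM evaluation.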