BERTScore is a semantic evaluation metric for natural language processing (NLP) that leverages contextual embeddings from the Bidirectional Encoder Representations from Transformers (BERT) model and its variants, such as RoBERTa and XLNet. It computes cosine similarity between the contextual embeddings of tokens in the candidate and reference texts, so scores reflect semantic similarity rather than surface-level word overlap.

Its token-matching step greedily aligns each candidate token with the most similar reference token to compute precision, and each reference token with the most similar candidate token to compute recall; the two are combined into an F1 score. An optional IDF weighting scheme down-weights common tokens so that rare, content-bearing terms contribute more to the score. Because matching happens in embedding space, BERTScore sidesteps the limitations of exact token matching: paraphrases and synonyms that n-gram metrics would penalize can still score highly.

In practice, BERTScore is widely used to assess machine-generated content, particularly in RAG and GenAI applications, where it offers clear advantages over conventional metrics like BLEU and ROUGE. To get the most out of BERTScore, organizations should follow best practices such as setting up dedicated evaluation pipelines, configuring those pipelines for batch processing, integrating them with existing frameworks, and monitoring data quality.

Galileo's Evaluate and Observe modules support this workflow by providing experimentation frameworks and traceability, respectively, while its GenAI Firewall helps prevent harmful content and hallucinations in summarization models.
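The greedy-matching computation behind precision, recall, and F1 can be sketched with plain NumPy over precomputed token embeddings. This is a toy illustration, not the official implementation: the real metric obtains embeddings from a transformer model, estimates IDF weights from a corpus, and optionally rescales scores against a baseline. All inputs below (the embedding matrices and the optional `idf` vector) are hypothetical.

```python
import numpy as np

def bertscore_like(cand_emb, ref_emb, idf=None):
    """Greedy-matching precision/recall/F1 over token embeddings.

    cand_emb: (m, d) contextual embeddings of the m candidate tokens
              (hypothetical inputs; in practice produced by BERT/RoBERTa).
    ref_emb:  (n, d) embeddings of the n reference tokens.
    idf:      optional (n,) IDF weights for the reference tokens; for
              brevity, weighting is applied here only on the recall side.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # (m, n) pairwise cosine similarity matrix

    # Precision: each candidate token matched to its best reference token.
    precision = sim.max(axis=1).mean()

    # Recall: each reference token matched to its best candidate token,
    # optionally IDF-weighted so rare terms count more.
    per_ref_best = sim.max(axis=0)
    if idf is None:
        recall = per_ref_best.mean()
    else:
        w = np.asarray(idf, dtype=float)
        recall = (per_ref_best * w).sum() / w.sum()

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: a 2-token candidate and reference in a 2-d embedding space.
cand = np.array([[1.0, 0.0], [0.0, 1.0]])
ref = np.array([[1.0, 0.0], [0.6, 0.8]])
p, r, f = bertscore_like(cand, ref)  # → 0.9, 0.9, 0.9
```

Note the asymmetry: precision takes row-wise maxima of the similarity matrix, recall takes column-wise maxima, which is why the two can diverge when candidate and reference differ in length or content.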