Vision-Language Models (VLMs) have gained significant attention since their introduction, building on transformer architectures and large amounts of paired image-text data. Unlike Large Language Models (LLMs), which operate on text alone, VLMs can process both images and text, enabling tasks such as image captioning, instance detection, and visual question answering. The field has progressed rapidly, with models like CLIP and its variants achieving state-of-the-art results on a wide range of benchmarks. However, training a high-quality VLM remains a complex undertaking, requiring careful choices of training objectives, datasets, architectures, and fine-tuning strategies. To use or develop a VLM effectively, one must understand the importance of dataset curation, loss function design, benchmark selection, and evaluation against business metrics. By following best practices and leveraging existing SOTA models, researchers and practitioners can unlock the full potential of VLMs for a variety of applications, including document extraction and understanding.
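
As one illustration of leveraging an existing SOTA model rather than training from scratch, the sketch below runs zero-shot image classification with a pretrained CLIP checkpoint through the Hugging Face transformers library. The checkpoint name, example image URL, and candidate labels are illustrative assumptions, not recommendations from this article.

```python
# Minimal sketch: zero-shot image classification with a pretrained CLIP model
# via Hugging Face transformers. Checkpoint, image URL, and labels are illustrative.
from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; swap in whichever CLIP variant fits your use case.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load an example image (fetched from a URL here; any PIL image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions act as the "labels" for zero-shot classification.
labels = ["a photo of a cat", "a photo of a dog", "a scanned document"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image gives the image's similarity to each candidate caption;
# softmax turns these similarities into a probability distribution over labels.
probs = outputs.logits_per_image.softmax(dim=1)
for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")
```

Because CLIP scores image-text similarity directly, the same pattern extends to retrieval or filtering tasks simply by changing the candidate text prompts, which is often a useful baseline before investing in fine-tuning.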