We have built a monitoring system on Thanos and Prometheus that is fast, reliable, highly available, and scalable. Along the way we learned several lessons: caching improves query performance; downsampling metrics reduces storage requirements; storing only the metrics that matter keeps the dataset in good shape; sharding long-term storage lets us serve large amounts of data efficiently; and manually scaling Prometheus shards gives us both scalability and high availability. Together, these strategies have improved the performance and reliability of our monitoring system, which currently stores 130 TB of metrics and answers a high volume of queries within a few seconds.