
Key metrics for PostgreSQL monitoring

What's this blog post about?

PostgreSQL is a powerful open-source relational database management system (RDBMS) that enables you to store and manage large volumes of data in tables with rows and columns. In order to ensure that your applications can consistently access the data they need, it's crucial to monitor key performance metrics for your PostgreSQL databases.

In this series, we will cover an overview of PostgreSQL monitoring and its key performance metrics, and explain how you can collect these metrics from your PostgreSQL instances using a combination of built-in statistics collector functions and external tools like Prometheus and Grafana. This first part of the series focuses on what each metric means, why it's important to monitor it, and which specific queries or configuration settings you can use to collect it from your PostgreSQL databases, along with examples of how to interpret these metrics in order to detect potential issues with your database performance. In the next part of this series, we will explain how to configure Prometheus and Grafana to collect and visualize these key performance metrics for your PostgreSQL instances. In Part 3, we will show you how to use distributed tracing tools like Jaeger or Zipkin in conjunction with your PostgreSQL monitoring data to help pinpoint the root cause of any issues that arise. By the end of this series, you should have a solid understanding of what each key performance metric means for PostgreSQL, and how you can use them together to effectively monitor and troubleshoot your databases in real time. Let's get started!

## Part 1: Key Performance Metrics for PostgreSQL Monitoring

In this section, we will cover the key performance metrics that are most important to monitor in order to ensure optimal performance for your PostgreSQL databases: what each metric means, why it's important to monitor it, and which queries or configuration settings you can use to collect it. We have organized this section into three main categories: query throughput & performance metrics, concurrent operations performance metrics, and replication & reliability metrics.

### Query Throughput & Performance Metrics

One of the most important aspects of PostgreSQL monitoring is keeping an eye on query throughput and performance metrics, which help give you an idea of what types of queries your database is serving. By tracking these metrics over time, you can identify trends or patterns that may indicate potential issues with your database performance. Here are some key metrics related to query throughput & performance:

- Rows inserted, updated, deleted by queries (per database)
- Tuples updated vs. heap-only tuples (HOT) updated
- Total number of transactions executed (commits + rollbacks)

We will explain each one in more detail below.
#### Rows Inserted, Updated, Deleted by Queries (Per Database)

PostgreSQL tracks several key statistics related to the number of rows that are inserted, updated, and deleted by queries on a per-database basis. These statistics can be accessed using the following SQL query:

```sql
SELECT * FROM pg_stat_database;
```

This will return a table with one row for each database in your PostgreSQL instance, showing you the total number of rows that have been inserted (tup_inserted), updated (tup_updated), and deleted (tup_deleted) by queries on that database.

Monitoring the number of rows inserted, updated, and deleted can help give you an idea of what types of write queries your database is serving. If you see a high rate of updated and deleted rows, you should also keep a close eye on the number of dead rows, since an increase in dead rows indicates a problem with VACUUM processes, which can slow down your queries. A sudden drop in throughput is concerning and could be due to issues like locks on tables and/or rows that need to be accessed in order to make updates. Monitoring write activity along with other database metrics like locks can help you pinpoint the potential source of the throughput issue.

#### Tuples Updated vs. Heap-Only Tuples (HOT) Updated

PostgreSQL will try to optimize updates when it is feasible to do so, through what's known as a heap-only tuple (HOT) update. A HOT update is possible when the transaction does not change any columns that are currently indexed (for example, if you created an index on the column age, but the update only affects the name column, which is not indexed). In comparison with normal updates, a HOT update introduces less I/O load on the database, since it can update the row without having to update its associated index. In general, you want to see more HOT updates than regular updates because they produce less load on the database. If you see significantly more regular updates than HOT updates, it may be because data in indexed columns is being updated frequently; this problem will only get worse as your indexes grow in size and become more difficult to maintain.

#### Total Number of Transactions Executed (Commits + Rollbacks)

PostgreSQL tracks the total number of transactions executed on a per-database basis, which includes both commits and rollbacks. These statistics can also be accessed through pg_stat_database:

```sql
SELECT * FROM pg_stat_database;
```

This will return a table with one row for each database in your PostgreSQL instance, showing you the total number of transactions executed (xact_commit + xact_rollback). Monitoring the total number of transactions executed can help give you an idea of how much activity your databases are handling, while a rising rate of rollbacks relative to commits could indicate issues with data consistency, integrity, or error-prone queries that may require further investigation.

### Concurrent Operations Performance Metrics

Another important aspect of PostgreSQL monitoring is keeping an eye on concurrent operations performance metrics, which help ensure that the database can scale sufficiently to fulfill a high rate of queries. The VACUUM process is one of the most important maintenance tasks related to ensuring successful concurrent operations. Here are some key metrics related to concurrent operations:

- Locks
- Deadlocks (v. 9.2+)
- Dead rows

We will explain each one in more detail below.

#### Locks

PostgreSQL grants locks to certain transactions in order to ensure that data remains consistent across concurrent queries. You can query the pg_locks view to see the active locks on the database, which objects have locks, and which processes are waiting to place locks on objects.
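For example, a query along the following lines (a minimal sketch, joining pg_locks to pg_class; the column aliases are illustrative) counts the current locks on each table, broken down by lock mode:

```sql
-- Count current locks per table, grouped by lock mode.
-- pg_locks.relation is NULL for non-relation locks (e.g., transaction ID locks),
-- so the join restricts the result to table- and index-level locks.
SELECT
    c.relname AS table_name,
    l.mode    AS lock_mode,
    l.granted,
    count(*)  AS lock_count
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
GROUP BY c.relname, l.mode, l.granted
ORDER BY lock_count DESC;
```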
Viewing the number of locks per table, categorized by lock mode, can help ensure that you are able to access data consistently. Some lock modes, such as ACCESS SHARE, are less restrictive than others, like ACCESS EXCLUSIVE (which conflicts with every other type of lock), so it can be helpful to focus on monitoring the more restrictive lock modes. A high rate of locks in your database indicates that active connections could be building up from long-running queries, which can result in queries timing out.

#### Deadlocks

A deadlock occurs when one or more transactions holds an exclusive lock on the same rows or tables that other transactions need in order to proceed. Let's say that transaction A has a row-level lock on row 1, and transaction B has a row-level lock on row 2. Transaction A then tries to update row 2, while transaction B requests a lock on row 1 to update a column value. Each transaction is forced to wait for the other to release its lock before it can proceed, so in order for either transaction to complete, one of them must be rolled back to release the lock on the object that the other transaction needs. PostgreSQL uses a deadlock_timeout setting to determine how long it should wait for a lock before checking whether there is a deadlock. The default is one second, but it's generally not advised to lower this, because checking for deadlocks uses up resources. The documentation advises that you should aim to avoid deadlocks by ensuring that your applications always acquire locks in the same order, to avoid conflicts.

#### Dead Rows

If you have a vacuuming schedule in place (either through autovacuum or some other means), the number of dead rows should not be steadily increasing over time; a steady increase indicates that something is interfering with your VACUUM process. VACUUM processes can get blocked if there is a lock on the table or row that needs to be vacuumed. If you suspect that a VACUUM is stuck, you will need to investigate what is causing the slowdown, as it can lead to slower queries and increase the amount of disk space that PostgreSQL uses. Therefore, it's crucial to monitor the number of dead rows to ensure that your tables are being maintained with regular, periodic VACUUM processes.

### Replication & Reliability Metrics

Many users set up PostgreSQL to replicate WAL changes from each primary server to one or more standby servers, in order to improve performance by directing queries to specific pools of read-only standbys. Replication also makes the database highly available: if the primary server experiences a failure, the database is always prepared to fail over to a standby. Here are some key metrics related to replication & reliability:

- Number of checkpoints requested & scheduled
- Buffers written by checkpoints as a percentage of total buffers written
- Replication delay (seconds)
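As a rough sketch of where these numbers come from (assuming a PostgreSQL version from around the time this article was written, where checkpoint statistics live in the pg_stat_bgwriter view; newer releases may expose them elsewhere), the following queries pull scheduled vs. requested checkpoint counts, the share of buffers written by checkpoints, and an estimate of replication delay in seconds:

```sql
-- Scheduled vs. requested checkpoints, and the share of buffers written
-- by checkpoints rather than by the background writer or backends.
SELECT
    checkpoints_timed,   -- checkpoints triggered on schedule
    checkpoints_req,     -- checkpoints explicitly requested
    100 * buffers_checkpoint::float /
      NULLIF(buffers_checkpoint + buffers_clean + buffers_backend, 0)
      AS pct_buffers_written_by_checkpoints
FROM pg_stat_bgwriter;

-- Approximate replication delay in seconds; run this on a standby server
-- (it returns NULL on a primary).
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
       AS replication_delay_seconds;
```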

Company
Datadog

Date published
Dec. 15, 2017

Author(s)
Emily Chang

Word count
6535

Language
English

Hacker News points
None found.

