Key metrics for Amazon RDS PostgreSQL monitoring

Amazon RDS provides a managed service that enables you to set up, operate, and scale PostgreSQL deployments in the cloud. Monitoring your Amazon RDS PostgreSQL instances is crucial for ensuring optimal performance and availability of your databases. In this series, we'll cover the key metrics to monitor when running PostgreSQL on AWS RDS, including replication delay, resource utilization (CPU, memory, disk I/O, network), connections, and more. In the first part of this series, we discussed how to monitor replication delay in Amazon RDS PostgreSQL instances. In this second part, we'll cover how to track key system-level metrics like CPU, disk, memory, and network usage, as well as other important metrics related to storage, connections, and more.

Resource utilization

Monitoring key system-level metrics like CPU, disk, memory, and network can help you investigate PostgreSQL performance issues and ensure that the database has enough resources to complete its work. RDS also provides the option to enable enhanced monitoring on your instances (excluding the db.m1.small instance class), which can help you gain more visibility into resource usage. Although CloudWatch also provides basic system-level metrics for RDS instances, enhanced metrics are collected by an agent that runs directly on the instance, rather than via the hypervisor, which results in higher-granularity data (collected as frequently as once a second).

Disk I/O metrics

| Metric description | CloudWatch name | Enhanced monitoring name | Metric type |
| --- | --- | --- | --- |
| Read I/O operations per second | ReadIOPS | diskIO.readIOsPS | Resource: Utilization |
| Write I/O operations per second | WriteIOPS | diskIO.writeIOsPS | Resource: Utilization |
| I/O operations waiting for disk access | DiskQueueDepth | diskIO.avgQueueLen | Resource: Saturation |
| Average amount of time per read I/O operation (ms) | ReadLatency | - | Resource: Other |
| Average amount of time per write I/O operation (ms) | WriteLatency | - | Resource: Other |

RDS PostgreSQL users can select from three types of storage (provided through Amazon Elastic Block Store) that cater to varying levels of desired performance: general purpose SSD, provisioned IOPS SSD, and magnetic. Provisioned IOPS is the highest-performance option and delivers up to 40,000 I/O operations per second. RDS provides two CloudWatch metrics, ReadIOPS and WriteIOPS, that correspond to the average number of read and write I/O operations completed per second over each 1-minute interval. However, DiskQueueDepth, which tracks the number of I/O requests waiting in the queue, can often be more informative. If you see a consistently high value for this metric, you may need to convert to another storage type, or scale your storage to help ease the workload. Adding read replica instances may also help reduce I/O pressure on read-heavy database workloads.

Memory metrics

| Metric description | CloudWatch name | Enhanced monitoring name | Metric type |
| --- | --- | --- | --- |
| Available RAM (bytes) | FreeableMemory | memory.free | Resource: Utilization |
| Swap usage (bytes) | SwapUsage | swap.cached | Resource: Saturation |

When PostgreSQL reads or updates data, it first checks for the block in the shared buffer cache, and then in the OS cache, to see if it can serve the request without having to read from disk. If the block is not cached, PostgreSQL will need to access the data from disk, but it will also cache the block in memory so that the next time that data is queried, it won't need to access the disk.
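One way to check how often data is being served from the shared buffer cache rather than disk is to compute a cache hit ratio from PostgreSQL's pg_stat_database statistics view. The following is a minimal sketch in Python using the psycopg2 driver; the endpoint, database, and credentials are hypothetical placeholders for your own instance.

```python
# A minimal sketch: estimate the shared buffer cache hit ratio for the
# current database from pg_stat_database.
import psycopg2

# Hypothetical connection details; substitute your own RDS endpoint,
# database, and credentials.
conn = psycopg2.connect(
    host="my-db.abc123xyz.us-east-1.rds.amazonaws.com",
    dbname="postgres",
    user="monitoring",
    password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT blks_hit::float / NULLIF(blks_hit + blks_read, 0)
          FROM pg_stat_database
         WHERE datname = current_database();
        """
    )
    ratio = cur.fetchone()[0]  # None until any blocks have been requested
    # blks_read counts blocks read from outside shared buffers, so data
    # served by the OS page cache still registers as a "read" here.
    print("shared buffer cache hit ratio:", ratio)
```

As a rough rule of thumb, a ratio that consistently sits well below 1 on a read-heavy workload suggests that the working set does not fit in the shared buffer cache.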
PostgreSQL query performance relies heavily on caching data in the in-memory shared buffer cache, so AWS recommends providing your database instances with enough memory to store all of your most commonly accessed data. Monitoring FreeableMemory and SwapUsage can help ensure that your database has enough RAM to serve queries from memory rather than disk. AWS recommends tracking ReadIOPS to determine whether your data is served mostly from memory; ideally, it should be a low, steady value. If you suspect that your instance needs more memory, you can try scaling up its RAM and observing the ensuing effect on ReadIOPS. If it drops drastically, this indicates that your data was previously being accessed mostly from disk rather than memory, and that you may need to continue allocating more RAM to your instance to optimize query performance.

The shared_buffers parameter determines how much memory the database can use for the shared buffer cache. The PostgreSQL default is typically about 128 MB, but in RDS, the default value of shared_buffers is calculated as a proportion of your database instance's available memory, using the formula DBInstanceClassMemory / 32768. Note that RDS requires you to specify shared_buffers in 8-KB units, while the DBInstanceClassMemory parameter variable is provided in bytes, not kilobytes. For example, an instance class with 8 GiB of memory yields 8,589,934,592 / 32,768 = 262,144 buffers of 8 KB each, or 2 GiB (one quarter of the instance's memory).

Storage metrics

| Metric description | Name | Metric type | Availability |
| --- | --- | --- | --- |
| Available storage space (bytes) | FreeStorageSpace | Resource: Utilization | CloudWatch |
| Disk space used by each table (excluding indexes) | pg_table_size | Resource: Utilization | PostgreSQL: database object management functions |
| Disk space used by indexes in the table | pg_indexes_size | Resource: Utilization | PostgreSQL: database object management functions |
| Number of index scans initiated on this table or index | idx_scan | Resource: Utilization | PostgreSQL: pg_stat_user_tables or pg_stat_user_indexes |

When you create an RDS PostgreSQL database instance, you must provide it with a certain amount of storage (usually in the form of EBS volumes). Monitoring FreeStorageSpace can help you determine if you're running out of space, meaning that you either need to scale up your storage or delete unused or outdated data and logs. Note that you can increase, but not decrease, the amount of storage allocated to an RDS instance.

In addition to tracking available storage space on your database instances, you can also track how much storage space is being used by the various tables and indexes in your database. PostgreSQL collects statistics internally to help you track the size of tables and indexes over time, which is helpful for gauging future changes in query performance. As your tables and indexes grow, queries will take longer and indexes will require more disk space, so eventually you will need to scale up the instance's storage, partition your data, or rethink your indexing strategy. Unexpected growth in table or index size may also point to problems with VACUUMs not running properly, so you should also inspect VACUUM-related metrics to see if they provide other insights. In the next part of this series, we'll show you how to query pg_stat_user_indexes to see if there are any underutilized indexes that you could remove in order to free up storage space and decrease unnecessary load on the database. Indexes become increasingly expensive to maintain as they grow in size, so it may not be worth applying resources to data that isn't queried very often.
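To see where that storage is going, you can combine the pg_table_size and pg_indexes_size functions with the statistics in pg_stat_user_tables. The sketch below (reusing the hypothetical connection details from the earlier example) lists the largest tables along with how often their indexes have been scanned; a large index with a persistently low idx_scan count may be a candidate for removal.

```python
# A sketch: list the largest user tables, how much space their indexes
# consume, and how many index scans they have received.
import psycopg2

# Hypothetical connection string; substitute your own RDS endpoint.
conn = psycopg2.connect(
    "host=my-db.abc123xyz.us-east-1.rds.amazonaws.com "
    "dbname=postgres user=monitoring password=secret"
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT relname,
               pg_table_size(relid)   AS table_bytes,  -- table data, excluding indexes
               pg_indexes_size(relid) AS index_bytes,  -- all indexes on the table
               idx_scan                                -- index scans initiated on the table
          FROM pg_stat_user_tables
         ORDER BY pg_table_size(relid) + pg_indexes_size(relid) DESC
         LIMIT 10;
        """
    )
    for relname, table_bytes, index_bytes, idx_scan in cur.fetchall():
        print(f"{relname}: table={table_bytes} B, indexes={index_bytes} B, idx_scan={idx_scan}")
```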
Network and CPU metrics

| Metric description | CloudWatch name | Enhanced monitoring name | Metric type |
| --- | --- | --- | --- |
| Network traffic to RDS PostgreSQL instance | NetworkReceiveThroughput (bytes/sec) | network.rx (packets) | Resource: Utilization |
| Network traffic from RDS PostgreSQL instance | NetworkTransmitThroughput (bytes/sec) | network.tx (packets) | Resource: Utilization |
| CPU utilization (percent) | CPUUtilization | cpuUtilization.total | Resource: Utilization |

Keeping an eye on the CloudWatch metrics NetworkReceiveThroughput and NetworkTransmitThroughput will help you determine if your instances have enough network bandwidth to serve queries and replicate updates to standby and/or replica instances. This is particularly important if your replica instances are located in a different region.

High CPU utilization is usually not a cause for concern. However, if you notice that CPU has increased significantly without any obvious reason, you can query pg_stat_activity to see if long-running queries are the source of the issue.

Resource utilization metrics to alert on

FreeStorageSpace: This is a critical metric to monitor on your database instances. If you run out of storage, you will not be able to connect to the database instance. As such, AWS recommends setting up an alert that triggers when consumed storage reaches 85 percent of capacity or higher. This will give you enough time to take action by deleting outdated data and logs, removing unused indexes or tables, or adding more storage to the instance.

DiskQueueDepth: High-traffic databases can expect to see queued I/O operations. However, if you see this metric increasing along with any noticeable spikes in read or write latency, you may need to upgrade your storage type to keep up with demand.

ReadLatency and WriteLatency: These two metrics track the latency of read and write I/O operations, and can help you determine if your allocated storage is able to handle the database workload. If latency continues to degrade, you can consult the RDS documentation to see if you could improve performance by upgrading to a higher-performance storage option like provisioned IOPS, which enables RDS instances to process more I/O requests concurrently.

Connections

| Metric description | Name | Metric type | Availability |
| --- | --- | --- | --- |
| Number of open connections | DatabaseConnections | Resource: Utilization | CloudWatch |
| Number of open connections | numbackends | Resource: Utilization | PostgreSQL (pg_stat_database) |
| Percentage of max connections in use | numbackends as a percentage of max_connections in pg_settings | Resource: Utilization | PostgreSQL (pg_stat_database, pg_settings) |
| Client connections waiting on a server connection (PgBouncer) | cl_waiting | Resource: Saturation | PgBouncer |
| Max time a client connection has been waiting to be served (PgBouncer) | maxwait | Resource: Saturation | PgBouncer |

The PostgreSQL primary server process forks a new process every time a client requests a connection. PostgreSQL sets a max_connections limit, which determines the maximum number of connections that can be opened to the server at any one time. By default, RDS sets this parameter in proportion to your database instance class's available memory; the formula varies according to the version of PostgreSQL you're running. RDS also reserves up to three of these connections for system maintenance.
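To track the "percentage of max connections in use" metric from the table above, you can compare numbackends across databases against the max_connections setting. A minimal sketch, again with hypothetical connection details:

```python
# A sketch: compute the percentage of max_connections currently in use,
# summing numbackends across all databases on the instance.
import psycopg2

# Hypothetical connection string; substitute your own RDS endpoint.
conn = psycopg2.connect(
    "host=my-db.abc123xyz.us-east-1.rds.amazonaws.com "
    "dbname=postgres user=monitoring password=secret"
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT sum(numbackends) * 100.0 /
               (SELECT setting::int
                  FROM pg_settings
                 WHERE name = 'max_connections') AS pct_in_use
          FROM pg_stat_database;
        """
    )
    pct = cur.fetchone()[0]
    print(f"connections in use: {pct:.1f}% of max_connections")
```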
If you see the number of open connections consistently approaching the maximum, it can indicate that applications are issuing long-running queries and constantly creating new connections to send other requests, instead of reusing existing connections. Using a connection pool can help ensure that connections are consistently reused as they go idle, instead of requiring the primary/source instance to frequently open and close connections. In high-concurrency environments, a connection pool like PgBouncer can help distribute requests made to your primary instance; the pool serves as a proxy between your applications and RDS PostgreSQL instances. In PostgreSQL 9.6+, you can also set idle_in_transaction_session_timeout, which instructs PostgreSQL to terminate any session that stays idle within an open transaction for longer than this period of time. By default, this value is 0, which means that it is disabled.
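As an illustration, the sketch below (same hypothetical connection details as before) inspects the current value and enables a five-minute timeout for the session. On RDS, an instance-wide default would normally be set in the DB parameter group, since ALTER SYSTEM is not available on managed instances.

```python
# A sketch: inspect and set idle_in_transaction_session_timeout for the
# current session only.
import psycopg2

# Hypothetical connection string; substitute your own RDS endpoint.
conn = psycopg2.connect(
    "host=my-db.abc123xyz.us-east-1.rds.amazonaws.com "
    "dbname=postgres user=app password=secret"
)
with conn, conn.cursor() as cur:
    cur.execute("SHOW idle_in_transaction_session_timeout;")
    print("current timeout:", cur.fetchone()[0])  # '0' means disabled
    # Terminate this session if it sits idle inside an open transaction
    # for more than five minutes.
    cur.execute("SET idle_in_transaction_session_timeout = '5min';")
```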

Next steps in Amazon RDS PostgreSQL monitoring

In this post, we've covered an overview of the key metrics to monitor when running PostgreSQL on RDS. As you scale your Amazon RDS PostgreSQL deployment over time, keeping an eye on these metrics will help you detect and troubleshoot potential issues and keep database operations running smoothly and efficiently. Read the next part of this series to learn how to collect all of these metrics from AWS CloudWatch and from PostgreSQL itself.

Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.