Links

Database Health

OtterTune's Database Health automatically detects the possible issues in your database and provides recommendations to improve your database's performance.

Overview

OtterTune's Database Health contains a collection of specific checks to help diagnose and resolve common database performance problems. It provides real-time analysis of your database's health conditions and suggests actions to fix the issues it identifies.
After selecting a database entry from the dashboard page, you can see the Database Health panel on the Database Overview page. The panel displays up to three checks. To access all the Database Health check results, you can select See all in the panel or the Database Health tab at the top of the Database Overview page.

Database Health Descriptions

The following table describes each Database Health check.
MySQL
Aurora MySQL
PostgreSQL
Aurora PostgreSQL
Name
Description
Buffer Cache Hit Rate
The buffer cache hit rate is the percentage of time that data blocks requested by queries are in memory. A value lower than 95% can indicate inefficient cache management. The calculation is based on data collected from the past 24 hours.
CPU Utilization
The average and the maximum CPU utilization from the past 24 hours. We consider your database over-utilized if the average utilization is above 80%.
Memory Utilization
The average and the maximum memory utilization from the past 24 hours. We consider your database over-utilized if the average utilization is above 80%.
Database Connection Utilization
Database connection utilization is the ratio of active connections to the maximum connections available. If the connection utilization is 0, it can indicate your database is idle. If the value is larger than 80%, it can indicate inefficient connection management. The calculation is based on data collected from the past 24 hours.
Disk IOPS Utilization
Disk IOPS utilization is the ratio of the current number of IOPS that the disk is performing to the total IOPS capacity. For General Purpose SSD storage (gp2), it has the baseline IOPS capacity and burst I/O credits. I/O credits represent the available bandwidth that your storage can use to burst large amounts of I/O when more than the baseline IOPS is needed. If the average IOPS utilization is larger than 90% and the percentage of burst credits available is lower than 10% for gp2 storage, it can indicate the resource is overutilized. If the maximum IOPS utilization is lower than 60% and the storage type is Provisioned IOPS SSD (io1), users can reduce the provisioned IOPS to save cost. The calculation is based on data collected from the past 7 days.
Disk Space Usage
The estimated number of days remaining until your database runs out of disk space is based on the average consumption rate from the past 7 days and the amount of free storage space available. We consider your database unhealthy if we predict your disk space will be depleted within two weeks.
Join Without Indexes Rate
Join without indexes rate represents the average number of times per minute that the database performs a join without using indexes. Performing a join without indexes results in a full table scan, which is expensive for large tables. If there are more than 20 joins without indexes per minute, this may indicate missing table indexes. The calculation is based on data collected from the past 24 hours.
Opened Table Rate
Opened table rate represents the average number of tables per minute that the database opens. When the cache of open tables is not big enough, it might frequently re-open some tables and this will reduce the database's performance. If there are more than 10 opened tables per minute, this may indicate inefficient table cache management. The calculation is based on data collected from the past 24 hours.
Sort Merge Passes Ratio
Sort merge passes ratio suggests how many sorts are external sorts. If the sort buffer size is not enough to hold all the sorted data, the database will write the data to disk files, sort those temporary files and merge them to sort the whole result set. If this happens many times, it can be slow. If the ratio of sort merge passes is larger than 10%, it can indicate inefficient sort buffer management. The calculation is based on data collected from the past 24 hours.
Temporary Disk Data
The server creates internal temporary tables while processing statements in some cases. If the table becomes too large, it will be stored on disk. A percentage of on-disk temporary tables higher than 25% can indicate inefficient memory management for temporary tables. The calculation is based on data collected from the past 24 hours.
Thread Cache Hit Rate
When a client disconnects, the client's threads are put in the thread cache, and new connections will reuse the thread in cache. If the cache is empty, a new thread will be created. The thread cache hit rate lower than 50% can indicate inefficient thread cache management. The calculation is based on data collected from the past 24 hours.
Resource Under-utilization
Your database is considered under-utilized if your average CPU, memory, and IOPS utilizations are all below 10%. The calculation is based on data collected from the past 7 days.
Name
Description
Buffer Cache Hit Rate
The buffer cache hit rate is the percentage of time that data blocks requested by queries are in memory. A value lower than 95% can indicate inefficient cache management. The calculation is based on data collected from the past 24 hours
CPU Utilization
The average and the maximum CPU utilization from the past 24 hours. We consider your database overutilized if the average utilization is above 80%.
Memory Utilization
The average and the maximum memory utilization from the past 24 hours. We consider your database over-utilized if the average utilization is above 80%.
Database Connection Utilization
Database connection utilization is the ratio of active connections to the maximum connections available. If the connection utilization is 0, it can indicate your database is idle. If the value is larger than 80%, it can indicate inefficient connection management. The calculation is based on data collected from the past 24 hours.
Join Without Indexes Rate
Join without indexes rate represents the average number of times per minute that the database performs a join without using indexes. Performing a join without indexes results in a full table scan, which is expensive for large tables. If there are more than 20 joins without indexes per minute, this may indicate missing table indexes. The calculation is based on data collected from the past 24 hours.
Opened Table Rate
(Disabled for Aurora Serverless MySQL)
Opened table rate represents the average number of tables per minute that the database opens. When the cache of open tables is not big enough, it might frequently re-open some tables and this will reduce the database's performance. If there are more than 10 opened tables per minute, this may indicate inefficient table cache management. The calculation is based on data collected from the past 24 hours.
Sort Merge Passes Ratio
Sort merge passes ratio suggests how many sorts are external sorts. If the sort buffer size is not enough to hold all the sorted data, the database will write the data to disk files, sort those temporary files and merge them to sort the whole result set. If this happens many times, it can be slow. If the ratio of sort merge passes is larger than 10%, it can indicate inefficient sort buffer management. The calculation is based on data collected from the past 24 hours.
Temporary Disk Data
The server creates internal temporary tables while processing statements in some cases. If the table becomes too large, it will be stored on disk. A percentage of on-disk temporary tables higher than 25% can indicate inefficient memory management for temporary tables. The calculation is based on data collected from the past 24 hours.
Thread Cache Hit Rate
When a client disconnects, the client's threads are put in the thread cache, and new connections will reuse the thread in cache. If the cache is empty, a new thread will be created. The thread cache hit rate lower than 50% can indicate inefficient thread cache management. The calculation is based on data collected from the past 24 hours.
Local Storage Usage
The estimated number of days remaining until your database runs out of local storage is based on the average consumption rate from the past 7 days and the amount of free local storage available. We consider your database unhealthy if we predict your local storage will be depleted within two weeks. Aurora MySQL uses local storage for error logs, general logs, slow query logs, audit logs, and non-InnoDB temporary tables.",
Resource Under-utilization
Your database is considered under-utilized if your average CPU and memory utilizations are all below 10%. The calculation is based on data collected from the past 7 days.
Name
Description
Buffer Cache Hit Rate
The buffer cache hit rate is the percentage of time that data blocks requested by queries are in memory. A value lower than 95% can indicate inefficient cache management. The calculation is based on data collected from the past 24 hours.
CPU Utilization
The average and the maximum CPU utilization from the past 24 hours. We consider your database overutilized if the average utilization is above 80%.
Memory Utilization
The average and the maximum memory utilization from the past 24 hours. We consider your database over-utilized if the average utilization is above 80%.
Database Connection Utilization
Database connection utilization is the ratio of active connections to the maximum connections available. If the connection utilization is 0, it can indicate your database is idle. If the value is larger than 80%, it can indicate inefficient connection management. The calculation is based on data collected from the past 24 hours.
Dead Tuple Ratio
When a row is deleted from a table, PostgreSQL marks it as dead but does not physically remove it. These dead tuples consume unnecessary storage and can degrade database performance. VACUUM reclaims the storage occupied by dead tuples. If the percentage of dead tuples in your database is higher than 30%, this can indicate inefficient autovacuum management. The calculation is based on data collected from the past 24 hours.
Disk IOPS Utilization
Disk IOPS utilization is the ratio of the current number of IOPS that the disk is performing to the total IOPS capacity. For General Purpose SSD storage (gp2), it has the baseline IOPS capacity and burst I/O credits. I/O credits represent the available bandwidth that your storage can use to burst large amounts of I/O when more than the baseline IOPS is needed. If the average IOPS utilization is larger than 90% and the percentage of burst credits available is lower than 10% for gp2 storage, it can indicate the resource is overutilized. If the maximum IOPS utilization is lower than 60% and the storage type is Provisioned IOPS SSD (io1), users can reduce the provisioned IOPS to save cost. The calculation is based on data collected from the past 7 days.
Disk Space Usage
The estimated number of days remaining until your database runs out of disk space is based on the average consumption rate from the past 7 days and the amount of free storage space available. We consider your database unhealthy if we predict your disk space will be depleted within two weeks.
Index Cache Hit Rate
Ratio of index lookups served from the shared buffer cache. A value lower than 95% can indicate inefficient cache management. The calculation is based on data collected from the past 24 hours.
Index Scan Ratio
An index scan reads through an index to quickly look up the rows that match your filter. A sequential scan reads ALL rows from the table sequentially, which typically takes longer than an index scan. If the percentage of index scans is lower than 90%, this can indicate inefficient index management. The calculation is based on data collected from the past 24 hours.
Requested Checkpoint Ratio
A checkpoint is triggered by either reaching a certain amount of time since the last checkpoint (time-based checkpoint), or exceeding a maximum size threshold for the Write-Ahead Log since the previously requested checkpoint. If the percentage of requested checkpoints is higher than 20%, it can indicate inefficient checkpoint management. The calculation is based on data collected from the past 24 hours.
Temporary Disk Data
Complex queries that include joins, sorting, DISTINCT, etc. can require more memory than the amount available. When this happens, the data is "spilled" to disk and stored in temporary files. If the amount of data written to temporary files exceeds 50MB per minute, this can indicate insufficient memory for complex queries. The calculation is based on data collected from the past 24 hours.
Resource Under-utilization
Your database is considered under-utilized if your average CPU, memory, and IOPS utilizations are all below 10%. The calculation is based on data collected from the past 7 days.
Name
Description
Buffer Cache Hit Rate
The buffer cache hit rate is the percentage of time that data blocks requested by queries are in memory. A value lower than 95% can indicate inefficient cache management. The calculation is based on data collected from the past 24 hours.
CPU Utilization
The average and the maximum CPU utilization from the past 24 hours. We consider your database overutilized if the average utilization is above 80%.
Memory Utilization
The average and the maximum memory utilization from the past 24 hours. We consider your database over-utilized if the average utilization is above 80%.
Database Connection Utilization
Database connection utilization is the ratio of active connections to the maximum connections available. If the connection utilization is 0, it can indicate your database is idle. If the value is larger than 80%, it can indicate inefficient connection management. The calculation is based on data collected from the past 24 hours.
Dead Tuple Ratio
When a row is deleted from a table, PostgreSQL marks it as dead but does not physically remove it. These dead tuples consume unnecessary storage and can degrade database performance. VACUUM reclaims the storage occupied by dead tuples. If the percentage of dead tuples in your database is higher than 30%, this can indicate inefficient autovacuum management. The calculation is based on data collected from the past 24 hours.
Index Cache Hit Rate
Ratio of index lookups served from the shared buffer cache. A value lower than 95% can indicate inefficient cache management. The calculation is based on data collected from the past 24 hours.
Index Scan Ratio
An index scan reads through an index to quickly look up the rows that match your filter. A sequential scan reads ALL rows from the table sequentially, which typically takes longer than an index scan. If the percentage of index scans is lower than 90%, this can indicate inefficient index management. The calculation is based on data collected from the past 24 hours.
Requested Checkpoint Ratio
A checkpoint is triggered by either reaching a certain amount of time since the last checkpoint (time-based checkpoint), or exceeding a maximum size threshold for the Write-Ahead Log since the previously requested checkpoint. If the percentage of requested checkpoints is higher than 20%, it can indicate inefficient checkpoint management. The calculation is based on data collected from the past 24 hours.
Temporary Disk Data
Complex queries that include joins, sorting, DISTINCT, etc. can require more memory than the amount available. When this happens, the data is "spilled" to disk and stored in temporary files. If the amount of data written to temporary files exceeds 50MB per minute, this can indicate insufficient memory for complex queries. The calculation is based on data collected from the past 24 hours.
Local Storage Usage
The estimated number of days remaining until your database runs out of local storage is based on the average consumption rate from the past 7 days and the amount of free local storage available. We consider your database unhealthy if we predict your local storage will be depleted within two weeks. Aurora Postgres uses local storage for error logs and temporary files.
Resource Under-utilization
Your database is considered under-utilized if your average CPU and memory utilizations are all below 10%. The calculation is based on data collected from the past 7 days.

Database Health Load Time

It usually takes 20 to 30 seconds to load database health checks for the first time. Then the cache will accelerate the loading process.

Other Details About Database Health

Some of OtterTune's Database Health checks use data from Amazon CloudWatch and RDS Performance Insights. Other checks require data from the OtterTune Agent (see connection instructions). You can find more details about which checks need OtterTune Agent connectivity in the table below. For Postgres, some checks need table and index data from a database. Users need to specify a database whose tables or indexes are not empty to see the check results.
OtterTune's Database Health checks use fixed ranges of time series metric data. You can find the default time range for each check in the following table. Please note that the checks will use all existing data for newly created databases if the time range exceeds the lifetime of the database.
The following table describes the agent requirements and the time range for each specific Database Health check.
MySQL
Aurora MySQL
PostgreSQL
Aurora PostgreSQL
Name
Agent Required
Time Range
Buffer Cache Hit Rate
No
1 Day
CPU Utilization
No
1 Day
Memory Utilization
No
1 Day
Database Connection Utilization
No
1 Day
Disk IOPS Utilization
No
7 Days
Disk Space Usage
No
7 Days
Join Without Indexes Rate
No
1 Day
Opened Table Rate
No
1 Day
Sort Merge Passes Ratio
No
1 Day
Temporary Disk Data
No
1 Day
Thread Cache Hit Rate
No
1 Day
Resource Under-utilization
No
7 Days
Name
Agent Required
Time Range
Buffer Cache Hit Rate
No
1 Day
CPU Utilization
No
1 Day
Memory Utilization
No
1 Day
Database Connection Utilization
No
1 Day
Join Without Indexes Rate
No
1 Day
Opened Table Rate
No
1 Day
Sort Merge Passes Ratio
No
1 Day
Temporary Disk Data
No
1 Day
Thread Cache Hit Rate
No
1 Day
Local Storage Usage
No
7 Days
Resource Under-utilization
No
7 Days
Name
Agent Required
Time Range
Buffer Cache Hit Rate
No
1 Day
CPU Utilization
No
1 Day
Memory Utilization
No
1 Day
Database Connection Utilization
No
7 Days
Dead Tuple Ratio
Yes
1 Day
Disk IOPS Utilization
No
7 Days
Disk Space Usage
No
1 Day
Index Cache Hit Rate
Yes
1 Day
Index Scan Ratio
Yes
1 Day
Requested Checkpoint Ratio
No
1 Day
Temporary Disk Data
No
1 Day
Resource Under-utilization
No
7 Days
Name
Agent Required
Time Range
Buffer Cache Hit Rate
No
1 Day
CPU Utilization
No
1 Day
Memory Utilization
No
1 Day
Database Connection Utilization
No
1 Day
Dead Tuple Ratio
Yes
1 Day
Index Cache Hit Rate
Yes
1 Day
Index Scan Ratio
Yes
1 Day
Requested Checkpoint Ratio
No
1 Day
Temporary Disk Data
No
1 Day
Local Storage Usage
No
7 Days
Resource Under-utilization
No
7 Days