Database Maintenance: Tracing Production Incidents to Their Root Cause

This article explains that database maintenance should be triggered by workload signals rather than fixed schedules, as scheduled maintenance often misses tables that need it most and leads to production incidents. It provides a diagnostic framework for SQL Server, PostgreSQL, and MySQL that uses wait states (such as PAGEIOLATCH_SH, LCK_M_S, and CXPACKET) as the universal entry point for triaging slow queries and tracing them to root causes like index fragmentation, stale statistics, or lock contention. The article also covers cross-engine buffer cache monitoring and offers a self-assessment scorecard, with a note that silent corruption requires detection-first treatment since it produces no precursor signal.

Database maintenance fails when it runs on a calendar instead of on signal. Fragmentation, stale statistics, log growth, and lock contention are functions of write workload https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver17 , not weekly schedules. Scheduled maintenance skips the tables that need it most, and the resulting incident fires before anyone notices the gap. This article replaces the cron job with a response system. Four observable symptoms I/O degradation, query plan regression, storage pressure, and lock contention each trace back to a specific maintenance root cause, with fixes for SQL Server, PostgreSQL, and MySQL. Silent corruption, the one failure mode that produces no precursor signal, gets its own detection-first treatment. A closing scorecard lets you self-assess. First Response: Wait State Triage Across Engines When a slow query alert fires, the first diagnostic step is the same regardless of engine: check what the query is waiting on. Wait states https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql are the universal entry point for database incident triage. They tell you whether the problem is I/O bound, lock bound, or CPU bound, and that classification determines which section of this article contains your fix. SQL Server wait types PAGEIOLATCH SH https://www.sqlskills.com/help/waits/pageiolatch sh/ means the query is waiting for data pages to be read from disk into the buffer pool. This points to index fragmentation, buffer cache pressure https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql , or storage subsystem saturation. and https://www.sqlskills.com/help/waits/lck m s/ LCK M S indicate row or table-level lock contention from a concurrent transaction or a maintenance operation holding locks. https://www.sqlskills.com/help/waits/lck m x/ LCK M X visible in https://sqlperformance.com/2015/08/sql-performance/more-on-cxpacket-waits-skewed-parallelism CXPACKET sys.dm exec requests signals parallelism skew, which typically traces to stale statistics or a missing index causing the optimizer to choose an expensive parallel plan. PostgreSQL and MySQL equivalents PostgreSQL exposes wait diagnostics through pg stat activity https://www.postgresql.org/docs/current/monitoring-stats.html . The query below is your triage entry point: -- PostgreSQL: active session wait events SELECT pid, wait event type, wait event, state, query FROM pg stat activity WHERE wait event IS NOT NULL AND state = 'idle' AND backend type = 'client backend'; The diagram above maps each value to its target section. One non-obvious case is worth calling out: a NULL wait event while state = 'active' indicates the query is compute-bound the PostgreSQL equivalent of CPU pressure , which can point toward stale statistics or a plan regression rather than I/O. For MySQL, performance schema.events waits current https://dev.mysql.com/doc/refman/8.4/en/performance-schema-wait-tables.html is the source for the values shown in the diagram. Verify performance schema = ON in my.cnf first, as it is disabled by default in some MySQL 5.x builds and carries non-zero overhead; on MySQL 8.0+ it is enabled by default. gives a quicker but less granular view. https://dev.mysql.com/doc/refman/8.4/en/show-processlist.html SHOW PROCESSLIST Once you have identified the wait type, the sections below trace each category to its maintenance root cause and prescribe the fix. For hybrid topologies that span on-prem and cloud-managed instances, ManageEngine OpManager Nexus https://www.manageengine.com/it-operations-management/database-monitoring.html surfaces wait-state and slow-query data across both in a single triage view through its SaaS delivery for managed databases https://www.site24x7.com/help/database-monitoring/ . Symptom: I/O Degradation and Read Amplification A buffer cache hit ratio drifting below the 95-99% range that healthy OLTP workloads maintain is the cross-engine signal that the engine is reading more pages from disk than memory can satisfy. SQL Server practitioners typically treat 90% as a warning and 85% as an action threshold; PostgreSQL and MySQL expose equivalents in pg statio user tables and information schema.INNODB BUFFER POOL STATS or SHOW ENGINE INNODB STATUS . The most common cause is index fragmentation: pages split, B-tree leaves scatter across non-contiguous extents, and one logical read becomes several physical I/Os. Read amplification surfaces as PAGEIOLATCH waits on SQL Server, DataFileRead on PostgreSQL, and elevated innodb data file waits on MySQL. On cloud-managed instances where DMV access is restricted RDS, Azure SQL Managed Instance , OpManager Nexus's SaaS delivery surfaces the same buffer-pool visibility through its agent. Diagnosing index bloat SQL Server: sys.dm db index physical stats https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-db-index-physical-stats-transact-sql is the authoritative source for fragmentation data. The query below returns indexes above 5% fragmentation with more than 1,000 pages the page count filter matters https://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/ because rebuilding very small indexes produces negligible performance improvement : SELECT OBJECT NAME ips.object id AS tbl name, i.name AS idx name, ips.index type desc, ips.avg fragmentation in percent, ips.page count FROM sys.dm db index physical stats DB ID , NULL, NULL, NULL, 'LIMITED' AS ips JOIN sys.indexes i ON ips.object id = i.object id AND ips.index id = i.index id WHERE ips.avg fragmentation in percent 5 AND ips.page count 1000 ORDER BY ips.avg fragmentation in percent DESC; The 'LIMITED' scan mode traverses only the index allocation structure, making it safe and fast on production systems. 'SAMPLED' reads a statistical sample of data pages for more accurate numbers at moderate I/O cost on very large tables or partitioned indexes. 'DETAILED' performs a full scan; reserve it for offline assessment. PostgreSQL: The pg stat user tables view provides the first signal. A dead pct above 10-20% on a high-write table is a common trigger for manual VACUUM this range aligns with practitioner guidance, with the autovacuum default kicking in at 20% : SELECT schemaname, relname, n dead tup, n live tup, round n dead tup::numeric / NULLIF n live tup + n dead tup, 0 100, 2 AS dead pct, last vacuum, last autovacuum FROM pg stat user tables WHERE n live tup 10000 ORDER BY n dead tup DESC LIMIT 20; For index-level bloat physical B-tree bloat that VACUUM does not reclaim , the pgstattuple extension exposes two functions. pgstattuple returns free percent , the wasted-space ratio that is the PostgreSQL equivalent of avg fragmentation in percent : CREATE EXTENSION IF NOT EXISTS pgstattuple; SELECT FROM pgstattuple 'orders created at idx' ; pgstatindex returns the B-tree-specific metrics: leaf fragmentation percentage of leaf pages not in logical order, indicating physical scatter and avg leaf density below 50% suggests the index has many near-empty pages : SELECT FROM pgstatindex 'orders created at idx' ; Both functions perform a full scan of the target relation, so on a multi-hundred-GB index expect runtime and I/O comparable to a sequential read of the entire object — schedule them like any other heavy diagnostic, not in a hot loop. High free percent with low leaf fragmentation may indicate space reclaimable by VACUUM rather than a full rebuild. Values of free percent in the 20-30% range are a widely used trigger for REINDEX https://aws.amazon.com/blogs/database/improve-postgresql-performance-using-the-pgstattuple-extension/ ; consult your workload and current community guidance to calibrate the threshold. MySQL: Query information schema.TABLES for InnoDB tablespace fragmentation: SELECT table schema, table name, round data length / 1024 / 1024, 2 AS data mb, round data free / 1024 / 1024, 2 AS free mb, round data free / data length + index length + data free 100, 2 AS frag pct FROM information schema.TABLES WHERE engine = 'InnoDB' AND data free 0 ORDER BY data free DESC LIMIT 10; This metric is meaningful only with per-table tablespaces innodb file per table = ON , the default since MySQL 5.6 ; on shared-tablespace deployments, data free reflects unused space in the global ibdata file and is repeated identically across every InnoDB row. Tables with frag pct above 20% are commonly treated as candidates for OPTIMIZE TABLE https://dev.mysql.com/doc/refman/8.0/en/optimize-table.html or pt-online-schema-change this threshold is a practitioner guideline rather than a MySQL-documented limit . Remediation by engine and downtime tolerance Microsoft's documentation on index reorganization and rebuild https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes maps fragmentation levels to two SQL Server operations: - 5-30% fragmentation: compacts leaf-level pages incrementally as an online operation. It can be interrupted mid-run without corrupting the index. ALTER INDEX idx name ON tbl name REORGANIZE - Above 30%: recreates the index. Offline by default acquires a schema modification lock that ALTER INDEX idx name ON tbl name REBUILD blocks concurrent access https://learn.microsoft.com/en-us/sql/relational-databases/indexes/guidelines-for-online-index-operations . Add WITH ONLINE = ON on Enterprise edition to keep the index available during the rebuild. Note that even online rebuilds acquire a brief Schema Modification Sch-M lock at the beginning and end of the operation, typically milliseconds, but long enough to cause noticeable waits on extremely high-concurrency workloads. On SQL Server 2017+, combine ONLINE = ON with RESUMABLE = ON and a configurable MAX DURATION to pause and resume long rebuilds: ALTER INDEX idx name ON tbl name REBUILD WITH ONLINE = ON, RESUMABLE = ON, MAX DURATION = 60 . Resume with ALTER INDEX idx name ON tbl name REBUILD WITH RESUME . RESUMABLE = ON requires ONLINE = ON and is Enterprise-edition-only on SQL Server 2017; SQL Server 2019+ also enables it on Standard and Web editions, so verify your edition before scripting against this syntax. The 5% floor matters equally https://www.sqlskills.com/blogs/paul/where-do-the-books-online-index-fragmentation-thresholds-come-from/ . Running REORGANIZE on a 3% fragmented index generates log activity, consumes I/O, and produces no measurable query improvement. For PostgreSQL, VACUUM https://www.postgresql.org/docs/current/sql-vacuum.html reclaims dead tuple storage and updates the visibility map. updates planner statistics. https://www.postgresql.org/docs/current/sql-analyze.html ANALYZE REINDEX rebuilds the B-tree structure https://docs.aws.amazon.com/prescriptive-guidance/latest/postgresql-maintenance-rds-aurora/reindex.html when physical index bloat is confirmed: VACUUM VERBOSE ANALYZE transactions; -- Blocking rebuild requires maintenance window : REINDEX INDEX transactions created at idx; -- Non-blocking rebuild PostgreSQL 12+ : REINDEX INDEX CONCURRENTLY transactions created at idx; REINDEX CONCURRENTLY cannot run inside a transaction block and takes longer than the standard form, but it allows writes to continue during the rebuild. Beyond immediate remediation, VACUUM VERBOSE output is worth reviewing regularly on your heaviest-write tables. It provides dead tuple counts, page recycling data, and cleanup statistics that give indirect signals of table health. PostgreSQL's autovacuum handles routine dead tuple cleanup https://www.postgresql.org/docs/current/routine-vacuuming.html automatically, but under high-velocity delete workloads it can fall behind. The official PostgreSQL documentation on routine vacuuming covers tuning autovacuum vacuum scale factor and autovacuum vacuum threshold for tables where the defaults prove too conservative. For MySQL, OPTIMIZE TABLE defragments the tablespace and rebuilds statistics in a single operation. In MySQL 8.0+, this runs online for regular InnoDB tables with only brief metadata locks at prepare and commit phases, but the full copy can take significant time on large tables: OPTIMIZE TABLE events; ANALYZE TABLE events; Internally, InnoDB maps OPTIMIZE TABLE to ALTER TABLE ... FORCE , rebuilding the clustered index and all secondary indexes. For zero-downtime execution on large tables, pt-online-schema-change https://docs.percona.com/percona-toolkit/pt-online-schema-change.html from Percona Toolkit performs the same rebuild while keeping the original table live: pt-online-schema-change \ --alter "ENGINE=InnoDB" \ --execute \ D=app prod,t=events,h=127.0.0.1,F=$HOME/.my.cnf This maintains a shadow copy and replays writes via triggers throughout the rebuild. The --execute flag is required; without it the tool runs in dry-run mode only. Remediation lookup by symptom severity: | Symptom Severity | Engine | Downtime Tolerance | Recommended Action | |---|---|---|---| | Mild frag < 5% / dead pct < 10% | All | N/A | None | | Moderate 5-30% | SQL Server | Any | ALTER INDEX ... REORGANIZE | | Severe 30% | SQL Server | Required | ALTER INDEX ... REBUILD WITH ONLINE=ON Enterprise | | Severe 30% | SQL Server | Available | ALTER INDEX ... REBUILD | | Elevated dead pct 10% | PostgreSQL | Any | VACUUM ANALYZE | | High bloat free percent 30% | PostgreSQL | Required | REINDEX CONCURRENTLY | | Elevated frag pct 20% | MySQL | Available | OPTIMIZE TABLE | | Elevated frag pct 20% | MySQL | Required | pt-online-schema-change | With fragmentation addressed, the next failure category that produces slow queries is stale statistics, which causes the optimizer to choose a scan where an index seek would be orders of magnitude faster. Symptom: Query Plan Regression The execution plan shows a table scan where an index seek ran yesterday. The optimizer has not changed; the data it relies on has. This is a statistics problem. Diagnosing stale statistics The SQL Server optimizer uses row count estimates and data distribution histograms https://learn.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-ver17 to choose between index seeks and table scans. When those statistics are weeks out of date on a fast-growing table, the optimizer picks a scan where a seek would be dramatically faster. Run UPDATE STATISTICS table name WITH FULLSCAN on any table that receives large batch loads. The WITH SAMPLE https://learn.microsoft.com/en-us/sql/t-sql/statements/update-statistics-transact-sql?view=sql-server-ver17 variant uses a row sampling percentage that can miss skewed distributions on large tables, producing statistics that look current but reflect an unrepresentative subset. To detect indexes suffering from stale statistics or poor plan choices, query sys.dm db index usage stats : SELECT OBJECT NAME object id AS tbl name, index id, user seeks, user scans, user lookups FROM sys.dm db index usage stats WHERE database id = DB ID ORDER BY user scans DESC; Indexes with zero seeks but high scans are candidates for statistics updates or missing index evaluation. PostgreSQL's ANALYZE command https://www.postgresql.org/docs/17/planner-stats.html and MySQL's https://dev.mysql.com/doc/refman/8.4/en/analyze-table.html update planner statistics independently from ANALYZE TABLE VACUUM and OPTIMIZE TABLE respectively. On PostgreSQL, autovacuum runs ANALYZE automatically after a configurable percentage of rows change https://www.postgresql.org/docs/17/runtime-config-autovacuum.html controlled by autovacuum analyze scale factor , default 0.1 or 10% , but that default is too high for large tables https://aws.amazon.com/blogs/database/understanding-autovacuum-in-amazon-rds-for-postgresql-environments/ . A 200-million-row table would need 20 million row changes to trigger autovacuum's ANALYZE pass, by which point the query plan may have been wrong for hours. Lowering autovacuum analyze scale factor to 0.01 or using autovacuum analyze threshold with per-table overrides addresses this. Updating statistics without disruption On SQL Server, UPDATE STATISTICS generally does not block queries it runs with NOLOCK semantics on data reads , though asynchronous statistics updates can cause brief schema lock contention during query compilation in high-workload scenarios. It does invalidate cached execution plans for the affected table: immediately after, SQL Server will recompile plans on next execution, which can briefly spike CPU on systems with many concurrent queries against the updated table. Run during low-traffic windows on heavily queried tables. The choice between FULLSCAN and SAMPLE depends on table size and distribution skew. For tables in the small-to-medium range, FULLSCAN typically completes quickly enough to run during off-peak hours the practical upper bound depends on hardware, but many teams use roughly 100M rows as a rule-of-thumb cutoff . For larger tables, a higher sample percentage such as SAMPLE 20 PERCENT or SAMPLE 30 PERCENT typically provides a better tradeoff between accuracy and duration than the default sample, though the optimal percentage varies by workload. On PostgreSQL, ANALYZE https://www.crunchydata.com/blog/indexes-selectivity-and-statistics reads a configurable sample default default statistics target = 100 , meaning 30,000 rows per column and does not lock the table. Run it manually after any bulk load or partition swap.On MySQL, ANALYZE TABLE is a lightweight operation on InnoDB that reads the index tree's random dive samples. It is a fast operation: in MySQL 8.0+, ANALYZE TABLE uses online DDL semantics, avoiding the full read lock that earlier versions required. Capture EXPLAIN for representative queries before and after to confirm the planner picked up the new statistics. OpManager Nexus automates detection of query plan regression on-prem through historical baseline comparison and anomaly flagging. The same capability extends to cloud-managed databases through its SaaS delivery, where slow query log analysis https://www.site24x7.com/database-monitoring.html drills into queries exceeding a configurable execution-time threshold. The Automated Remediation section below covers how to wire that detection into corrective workflows. Statistics failures are invisible until the query plan degrades. Storage failures are equally silent, until a disk fills and takes the database offline. Symptom: Storage Pressure and Runaway Growth A disk usage alert fires at 85% capacity. The database server has been running for months without anyone checking how fast the log files or tablespaces are growing. The root cause splits into two categories: unmanaged transaction log growth and missing archiving strategy. Both are maintenance failures that monitoring should have caught weeks earlier. Transaction log and WAL management SQL Server: A full recovery model database without regular transaction log backups https://learn.microsoft.com/en-us/sql/relational-databases/logs/troubleshoot-a-full-transaction-log-sql-server-error-9002?view=sql-server-ver17 will grow its log file until the disk fills, and a full data volume is an immediate production outage. To check current log space usage across all databases, run DBCC SQLPERF LOGSPACE ; , which returns log size, space used percentage, and status for every database. For a single database, query sys.databases for the log reuse wait desc column, which tells you exactly why the log cannot be truncated e.g., LOG BACKUP , ACTIVE TRANSACTION . Schedule log backups at an interval matching your Recovery Point Objective RPO : for most OLTP workloads, intervals in the range of 5-30 minutes are commonly used, with tighter intervals for high-transaction systems, though the right frequency is workload-specific. DBCC SHRINKFILE https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-shrinkfile-transact-sql?view=sql-server-ver17 on the log file is a last resort for reclaiming space after an unexpected log growth event. The reason it is a last resort, rather than a routine cleanup tool, is the side effect on Virtual Log Files https://learn.microsoft.com/en-us/sql/relational-databases/logs/manage-the-size-of-the-transaction-log-file?view=sql-server-ver17 VLFs , the internal segments SQL Server divides the transaction log into. Each shrink-then-regrow cycle adds a new VLF, so a log that has been shrunk repeatedly ends up fragmented into many small VLFs instead of a few large ones. That fragmentation degrades sequential log write throughput and increases recovery time https://www.sqlskills.com/blogs/paul/why-you-should-not-shrink-your-data-files/ . The fix is to address the root cause missing log backups, long-running transactions rather than shrinking on a schedule. PostgreSQL: WAL Write-Ahead Log https://www.postgresql.org/docs/current/continuous-archiving.html management serves the same function as SQL Server's transaction log. The archive mode and archive command settings control whether completed WAL segments are shipped to archive storage. Without archiving enabled, WAL segments accumulate https://www.percona.com/blog/five-reasons-why-wal-segments-accumulate-in-the-pg wal-directory-in-postgresql/ in pg wal/ until disk fills. The wal keep size https://pgpedia.info/w/wal keep size.html parameter PostgreSQL 13+, replacing wal keep segments sets a floor for retained WAL data, but does not cap growth. For production systems, configure continuous archiving with archive mode = on and point archive command to your backup infrastructure pgBackRest, Barman, or cloud-native equivalents .To verify archiving is active and current: SELECT FROM pg stat archiver; Check last archived wal timestamp and failed count . A non-zero failed count or a stale last archived time means WAL segments are accumulating. Also: SELECT count , pg size pretty sum size FROM pg ls waldir ; PostgreSQL 10+ shows total WAL directory size. MySQL: Binary logs https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html binlogs serve replication and point-in-time recovery. Without rotation, they grow indefinitely. expire logs days https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-3.html deprecated in MySQL 8.0.3 or binlog expire logs seconds MySQL 8.0+ controls automatic purge. Setting binlog expire logs seconds = 604800 retains seven days of binary logs, which is sufficient for most replication topologies. Run PURGE BINARY LOGS BEFORE NOW - INTERVAL 7 DAY for one-time cleanup. Capacity forecasting with OpManager Nexus Reacting to a disk alert at 85% leaves little room for planned action. OpManager Nexus's AI/ML-based storage forecasting https://www.manageengine.com/network-monitoring/help/forecast-reports.html uses up to 14 days of history to predict when storage will hit 80%, 90%, and 100% https://www.manageengine.com/network-monitoring/storage-capacity-forecasting-planning.html , giving your team a "disk full in N days" signal once it has at least 3 days of data. Its adaptive thresholds https://www.manageengine.com/network-monitoring/help/adaptive-thresholds.html learn baseline behavior so alerts fire on genuine anomalies rather than every batch job, and the Database Tab surfaces individual database size, data and log file utilization, and growth trends. Note:OpManager Nexus's own monitoring data retention configured under Settings General Settings Database Maintenance is independent of your production database storage. Defaults are 7, 30, and 365 days for detailed, hourly, and daily statistics. Use OpManager Nexus's forecast reports to verify your archiving cadence keeps pace with growth: if the forecast shows 80% capacity in 30 days but your archive job runs monthly, increase frequency or provision more storage. Storage pressure is a passive failure that accumulates over time. Lock contention is an active failure: the maintenance operation meant to fix the database becomes the source of the incident. Symptom: Lock Contention from Maintenance Operations A spike in blocked sessions immediately after a scheduled maintenance run is direct evidence that the REBUILD or REORGANIZE collided with production traffic and created lock contention https://www.mssqltips.com/sqlservertip/5880/why-is-index-reorganize-and-update-statistics-causing-sql-server-blocking/ . The maintenance job is supposed to fix performance, but index REBUILDs running without ONLINE = ON during peak traffic or without a maintenance window hold locks that block concurrent queries, turning the fix into the incident. Identifying maintenance-induced blocking Correlating maintenance timing with OpManager Nexus's Sessions Tab is how you distinguish maintenance-induced blocking from application-level contention. If blocked session counts spike within minutes of a maintenance window opening, the maintenance job is the cause. On SQL Server, check sys.dm exec requests https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-requests-transact-sql for sessions with wait type values starting with LCK M , then look up the head-of-chain blocker and inspect its command column for ALTER INDEX or DBCC operations.On PostgreSQL, pg stat activity shows Lock wait events with wait event values like relation or transactionid . If the blocking PID is running REINDEX or VACUUM FULL , that is maintenance-induced contention. For cloud-managed instances where Sessions Tab access is unavailable, OpManager Nexus's SaaS delivery https://www.site24x7.com/database-monitoring.html surfaces lock contention and blocking session counts on its database performance dashboard for the same triage signal. Online and resumable operations The fix is operational: use online operations and schedule them outside peak traffic windows. SQL Server: Use ALTER INDEX ... REBUILD WITH ONLINE = ON, RESUMABLE = ON, MAX DURATION = 60 as described in the I/O Degradation section. The duration is any positive integer in minutes; set it based on your maintenance window. REORGANIZE is always online and interruptible. PostgreSQL: REINDEX INDEX CONCURRENTLY introduced in the I/O Degradation section avoids exclusive locks. VACUUM without FULL does not block reads or writes. MySQL: Standard OPTIMIZE TABLE already runs as online DDL on MySQL 8.0+ introduced in the I/O Degradation section . Reach for pt-online-schema-change when you need finer control over lock duration on very large tables, or when you want triggered shadow-copy semantics that OPTIMIZE TABLE does not offer. The four symptom categories above all produce observable performance signals before they become outages. Corruption is different: it produces no signal until it surfaces as query failures or data loss. Symptom: Silent Corruption and Integrity Failures Because corruption produces no precursor wait events or latency drift, detection is a deliberate scheduled act, not an alert response. Regular integrity checks are the primary detection mechanism, supplemented by storage-level checksums, page verification, and reliable backups. SQL Server: DBCC CHECKDB https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-checkdb-transact-sql?view=sql-server-ver17 catches page corruption, allocation errors, and consistency violations https://techcommunity.microsoft.com/blog/sqlserversupport/sql-server-database-corruption-causes-detection-and-some-details-behind-dbcc-che/4460631 . -- Recommended production form: suppresses informational messages, shows only errors DBCC CHECKDB 'ProductionDB' WITH NO INFOMSGS, ALL ERRORMSGS; For large databases where a full DBCC CHECKDB is too slow for a maintenance window, DBCC CHECKDB ... WITH PHYSICAL ONLY checks page and record header integrity without logical consistency checks and completes significantly faster. Corruption surfaces in the SQL Server error log as messages Msg 823, 824, or 825 https://support.microsoft.com/en-us/help/2015755/how-to-troubleshoot-a-msg-823-error-in-sql-server . To proactively check for known corruption events, query the suspect pages table: SELECT db id, file id, page id, event type, error count, last update date FROM msdb.dbo.suspect pages WHERE event type IN 1, 2, 3 ; Event type 1 = 823/824 errors, 2 = bad checksum, 3 = torn page. A non-empty result requires immediate DBCC CHECKDB and restore planning. Running DBCC CHECKDB as frequently as your maintenance windows allow is the safe path. Many experts recommend daily on all databases; if that is impractical, prioritize critical databases and shorten the interval on large ones using WITH PHYSICAL ONLY . PostgreSQL: The pg amcheck https://www.postgresql.org/docs/14/app-pgamcheck.html utility PostgreSQL 14+ verifies B-tree index integrity by checking that every heap tuple referenced by an index entry actually exists and that index entries are in the correct sort order. The default invocation is fast enough for routine scheduled checks and catches most corruption: pg amcheck mydb After an unexpected crash, storage event, or replication failure, run the thorough variant on critical tables: pg amcheck --heapallindexed --parent-check mydb --heapallindexed performs a deeper check that every heap tuple has a corresponding index entry; --parent-check verifies cross-level B-tree invariants. Both flags increase runtime substantially, so reserve them for incident response or post-event verification rather than the routine schedule. MySQL: mysqlcheck provides table-level integrity verification: mysqlcheck --check --all-databases -u root -p For individual tables, CHECK TABLE table name within the MySQL client performs the same operation. InnoDB tables benefit from CHECK TABLE ... FOR UPGRADE https://dev.mysql.com/doc/refman/9.7/en/check-table.html after major version upgrades to verify storage format compatibility. Running these checks manually is the safety net. The next section shows how to automate the response so the platform acts before the on-call engineer logs in. From Alert to Fix: Automated Remediation Across Engines When the alert fires at 3 AM, having the platform execute the remediation automatically matters far more than knowing the fix. OpManager Nexus's IT Workflow Automation triggers a custom monitoring script when an alert threshold is breached: the script queries the symptom's diagnostic surface fragmentation, dead tuples, log space , evaluates severity, and runs the remediation. SQL Server: wiring remediation into OpManager Nexus OpManager Nexus accepts PowerShell or shell scripts as custom monitors https://www.manageengine.com/network-monitoring/script-monitoring.html Custom Script Monitors require build 12.7 or later . The integration pattern matches the PostgreSQL and MySQL examples below: query sys.dm db index physical stats for fragmentation, branch on the threshold, issue ALTER INDEX REORGANIZE or REBUILD WITH ONLINE = ON accordingly, and emit one log line per action so the run shows up in the monitor's history. Run the script under a service account with at least db ddladmin on the target database; for SQL authentication or cross-domain setups, pull credentials from a secrets store rather than embedding them. PostgreSQL and MySQL shell automation For PostgreSQL, a cron-driven shell script can query pg stat user tables for bloated tables and trigger remediation: bash /usr/bin/env bash PostgreSQL automated vacuum/reindex for tables exceeding dead tuple threshold. Credentials sourced from ~/.pgpass chmod 600 ; export PGPASSFILE if non-default. PGHOST="localhost" PGPORT="5432" PGDATABASE="app prod" PGUSER="maintenance user" export PGPASSFILE="${PGPASSFILE:-$HOME/.pgpass}" DEAD THRESHOLD=15 BLOAT THRESHOLD=30 VACUUM tables with high dead tuple ratio psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -t -A -F'|' -c " SELECT schemaname, relname, round n dead tup::numeric / NULLIF n live tup + n dead tup, 0 100, 2 FROM pg stat user tables WHERE n live tup 10000 AND round n dead tup::numeric / NULLIF n live tup + n dead tup, 0 100, 2 $DEAD THRESHOLD " | while IFS='|' read -r schema table dead pct; do echo "$ date '+%Y-%m-%d %H:%M:%S' | VACUUM ANALYZE ${schema}.${table} | dead pct=${dead pct}%" psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -c "VACUUM ANALYZE ${schema}.${table};" done For MySQL, a similar approach queries information schema.TABLES and triggers OPTIMIZE TABLE . Use a MySQL option file instead of embedding credentials in the script create ~/.my.cnf with client credentials and restrict permissions to 600 : bash /usr/bin/env bash MySQL automated optimize for InnoDB tables exceeding fragmentation threshold MYSQL HOST="localhost" MYSQL DB="app prod" FRAG THRESHOLD=20 mysql --defaults-extra-file="$HOME/.my.cnf" -h "$MYSQL HOST" -N -B -e " SELECT table name, round data free / data length + index length + data free 100, 2 AS frag pct FROM information schema.TABLES WHERE table schema = '${MYSQL DB}' AND engine = 'InnoDB' AND data free 0 AND round data free / data length + index length + data free 100, 2 ${FRAG THRESHOLD} " | while read -r table frag pct; do echo "$ date '+%Y-%m-%d %H:%M:%S' | OPTIMIZE TABLE ${table} | frag pct=${frag pct}%" mysql --defaults-extra-file="$HOME/.my.cnf" -h "$MYSQL HOST" "$MYSQL DB" -e "OPTIMIZE TABLE ${table};" done Schedule either script via cron e.g., 0 3 /opt/scripts/pg maintenance.sh /var/log/db maintenance.log 2 &1 and monitor the log output through OpManager Nexus's custom monitor integration. Cloud-managed database automation For databases running on Amazon RDS, Aurora, or Azure SQL, OpManager Nexus's SaaS delivery provides the cloud-side counterpart of the PowerShell and shell automation patterns above. Its IT Automation module https://www.site24x7.com/help/admin/configuration-profiles/actions.html triggers corrective actions from threshold breaches and anomaly detections, and AI-powered baselines https://www.site24x7.com/anomaly-detection.html replace the manual threshold tuning that self-managed instances require. For RDS specifically, service actions https://www.site24x7.com/help/it-automation/rds-actions.html like start, stop, and reboot with failover are surfaced directly. Engine-specific monitor setup https://www.site24x7.com/help/database-monitoring/ for SQL Server, PostgreSQL, and MySQL is documented separately. Threshold profiles let you apply equivalent alert configurations across dev, staging, and production monitors, so a query that fragments an index under realistic staging load surfaces in slow query detection before it reaches production scale. Maintenance Health Scorecard: Assessing Your Current Posture Instead of running through the diagnostic queries from scratch, use this scorecard to assess your maintenance posture. Each item references the diagnostic approach covered in its corresponding section above. I/O health see: I/O Degradation section - SQL Server: Run the sys.dm db index physical stats query filter the results at 30% fragmentation . Count of indexes returned: - PostgreSQL: Run the pg stat user tables dead tuple query. Tables with dead pct above 10-20% are candidates for immediate attention: - MySQL: Run the information schema.TABLES fragmentation query. Tables with frag pct above 20%: Statistics freshness see: Query Plan Regression section - SQL Server: Check sys.dm db index usage stats for indexes with zero seeks but high scans plan regression or poorly matched index - PostgreSQL: Verify autovacuum analyze scale factor is set below 0.1 for tables above 100 million rows - MySQL: Run ANALYZE TABLE on your top 10 tables by write volume; capture EXPLAIN output for representative queries before and after to confirm planner statistics changed as expected Storage trajectory see: Storage Pressure section - OpManager Nexus forecast report confirms sufficient capacity runway before any threshold crossing: Yes / No - Transaction log backup job SQL Server or WAL archiving PostgreSQL is confirmed running and last backup verified: Yes / No - Binary log rotation MySQL is configured with binlog expire logs seconds set to an explicit value: Yes / No Integrity baseline see: Silent Corruption section - SQL Server: DBCC CHECKDB last run date on critical databases: - PostgreSQL: pg amcheck last run date or equivalent manual check : - MySQL: mysqlcheck --check last run date: Automation coverage see: Automated Remediation section - At least one automated remediation script is deployed, scheduled, and confirmed to be producing output logs: Yes / No - OpManager Nexus alert thresholds are configured and tested for key database health metrics BCHR, disk utilization, blocked sessions : Yes / No - Maintenance windows are scheduled based on monitoring signals, not calendar dates: Yes / No Cross-reference those results against OpManager Nexus's slow query and session data on-prem Performance Tab or SaaS Database Metrics dashboard . If a table in the top results by size also appears as a source of slow query detections, that is your highest-priority maintenance target.