{"slug": "database-maintenance-tracing-production-incidents-to-their-root-cause", "title": "Database Maintenance: Tracing Production Incidents to Their Root Cause", "summary": "This article explains that database maintenance should be triggered by workload signals rather than fixed schedules, as scheduled maintenance often misses tables that need it most and leads to production incidents. It provides a diagnostic framework for SQL Server, PostgreSQL, and MySQL that uses wait states (such as PAGEIOLATCH_SH, LCK_M_S, and CXPACKET) as the universal entry point for triaging slow queries and tracing them to root causes like index fragmentation, stale statistics, or lock contention. The article also covers cross-engine buffer cache monitoring and offers a self-assessment scorecard, with a note that silent corruption requires detection-first treatment since it produces no precursor signal.", "body_md": "Database maintenance fails when it runs on a calendar instead of on signal. [Fragmentation, stale statistics, log growth, and lock contention are functions of write workload](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver17), not weekly schedules. Scheduled maintenance skips the tables that need it most, and the resulting incident fires before anyone notices the gap.\n\nThis article replaces the cron job with a response system. Four observable symptoms (I/O degradation, query plan regression, storage pressure, and lock contention) each trace back to a specific maintenance root cause, with fixes for SQL Server, PostgreSQL, and MySQL. Silent corruption, the one failure mode that produces no precursor signal, gets its own detection-first treatment. A closing scorecard lets you self-assess.\n\n## First Response: Wait State Triage Across Engines\n\nWhen a slow query alert fires, the first diagnostic step is the same regardless of engine: check what the query is waiting on. 
[Wait states](https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql) are the universal entry point for database incident triage. They tell you whether the problem is I/O bound, lock bound, or CPU bound, and that classification determines which section of this article contains your fix.\n\n### SQL Server wait types\n\n[`PAGEIOLATCH_SH`](https://www.sqlskills.com/help/waits/pageiolatch_sh/) means the query is waiting for data pages to be read from disk into the buffer pool. This points to index fragmentation, [buffer cache pressure](https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql), or storage subsystem saturation.\n\n[`LCK_M_S`](https://www.sqlskills.com/help/waits/lck_m_s/) and [`LCK_M_X`](https://www.sqlskills.com/help/waits/lck_m_x/) indicate row or table-level lock contention from a concurrent transaction or a maintenance operation holding locks.\n\n[`CXPACKET`](https://sqlperformance.com/2015/08/sql-performance/more-on-cxpacket-waits-skewed-parallelism) (visible in `sys.dm_exec_requests`) signals parallelism skew, which typically traces to stale statistics or a missing index causing the optimizer to choose an expensive parallel plan.\n\n### PostgreSQL and MySQL equivalents\n\nPostgreSQL exposes wait diagnostics through [`pg_stat_activity`](https://www.postgresql.org/docs/current/monitoring-stats.html). The query below is your triage entry point:\n\n```\n-- PostgreSQL: active session wait events\nSELECT pid, wait_event_type, wait_event, state, query\nFROM pg_stat_activity\nWHERE wait_event IS NOT NULL\n  AND state != 'idle'\n  AND backend_type = 'client backend';\n```\n\nThe diagram above maps each value to its target section. 
One non-obvious case is worth calling out: a `NULL` `wait_event` while `state = 'active'` indicates the query is compute-bound (the PostgreSQL equivalent of CPU pressure), which can point toward stale statistics or a plan regression rather than I/O. The query above filters these sessions out with `wait_event IS NOT NULL`, so drop that predicate when you want to see them.\n\nFor MySQL, [`performance_schema.events_waits_current`](https://dev.mysql.com/doc/refman/8.4/en/performance-schema-wait-tables.html) is the source for the values shown in the diagram. Verify `performance_schema = ON` in `my.cnf` first, as it is disabled by default in some MySQL 5.x builds and carries non-zero overhead; on MySQL 8.0+ it is enabled by default. [`SHOW PROCESSLIST`](https://dev.mysql.com/doc/refman/8.4/en/show-processlist.html) gives a quicker but less granular view.\n\nOnce you have identified the wait type, the sections below trace each category to its maintenance root cause and prescribe the fix. For hybrid topologies that span on-prem and cloud-managed instances, [ManageEngine OpManager Nexus](https://www.manageengine.com/it-operations-management/database-monitoring.html) surfaces wait-state and slow-query data across both in a single triage view through its [SaaS delivery for managed databases](https://www.site24x7.com/help/database-monitoring/).\n\n## Symptom: I/O Degradation and Read Amplification\n\nA buffer cache hit ratio drifting below the 95-99% range that healthy OLTP workloads maintain is the cross-engine signal that more reads are being served from disk than the buffer cache can absorb.\n\nSQL Server practitioners typically treat 90% as a warning and 85% as an action threshold; PostgreSQL and MySQL expose equivalents in `pg_statio_user_tables` and `information_schema.INNODB_BUFFER_POOL_STATS` (or `SHOW ENGINE INNODB STATUS`). The most common cause is index fragmentation: pages split, B-tree leaves scatter across non-contiguous extents, and one logical read becomes several physical I/Os. 
Read amplification surfaces as `PAGEIOLATCH` waits on SQL Server, `DataFileRead` on PostgreSQL, and elevated `innodb_data_file` waits on MySQL.\n\nOn cloud-managed instances where DMV access is restricted (RDS, Azure SQL Managed Instance), OpManager Nexus's SaaS delivery surfaces the same buffer-pool visibility through its agent.\n\n### Diagnosing index bloat\n\n**SQL Server:** [`sys.dm_db_index_physical_stats`](https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-db-index-physical-stats-transact-sql) is the authoritative source for fragmentation data. The query below returns indexes above 5% fragmentation with more than 1,000 pages (the [page count filter matters](https://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/) because rebuilding very small indexes produces negligible performance improvement):\n\n```\nSELECT\n    OBJECT_NAME(ips.object_id) AS tbl_name,\n    i.name AS idx_name,\n    ips.index_type_desc,\n    ips.avg_fragmentation_in_percent,\n    ips.page_count\nFROM sys.dm_db_index_physical_stats(\n    DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips\nJOIN sys.indexes i\n    ON ips.object_id = i.object_id\n    AND ips.index_id = i.index_id\nWHERE ips.avg_fragmentation_in_percent > 5\n    AND ips.page_count > 1000\nORDER BY ips.avg_fragmentation_in_percent DESC;\n```\n\nThe `'LIMITED'` scan mode reads only the pages above the leaf level of each index (and the allocation pages of a heap), which makes it the cheapest mode and the usual choice on production systems. `'SAMPLED'` reads a statistical sample of data pages for more accurate numbers at moderate I/O cost on very large tables or partitioned indexes. `'DETAILED'` performs a full scan; reserve it for offline assessment.\n\n**PostgreSQL:** The `pg_stat_user_tables` view provides the first signal. 
A `dead_pct` above 10-20% on a high-write table is a common trigger for manual VACUUM (this range aligns with practitioner guidance, with the autovacuum default kicking in at 20%):\n\n```\nSELECT schemaname, relname,\n       n_dead_tup,\n       n_live_tup,\n       round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,\n       last_vacuum,\n       last_autovacuum\nFROM pg_stat_user_tables\nWHERE n_live_tup > 10000\nORDER BY n_dead_tup DESC\nLIMIT 20;\n```\n\nFor index-level bloat (physical B-tree bloat that VACUUM does not reclaim), the `pgstattuple` extension exposes two functions. `pgstattuple()` returns `free_percent`, the wasted-space ratio that is the PostgreSQL equivalent of `avg_fragmentation_in_percent`:\n\n```\nCREATE EXTENSION IF NOT EXISTS pgstattuple;\nSELECT * FROM pgstattuple('orders_created_at_idx');\n```\n\n`pgstatindex()` returns the B-tree-specific metrics: `leaf_fragmentation` (percentage of leaf pages not in logical order, indicating physical scatter) and `avg_leaf_density` (below 50% suggests the index has many near-empty pages):\n\n```\nSELECT * FROM pgstatindex('orders_created_at_idx');\n```\n\nBoth functions perform a full scan of the target relation, so on a multi-hundred-GB index expect runtime and I/O comparable to a sequential read of the entire object. Schedule them like any other heavy diagnostic, not in a hot loop.\n\nHigh `free_percent` with low `leaf_fragmentation` may indicate space reclaimable by VACUUM rather than a full rebuild. 
Values of `free_percent` in the 20-30% range are a [widely used trigger for REINDEX](https://aws.amazon.com/blogs/database/improve-postgresql-performance-using-the-pgstattuple-extension/); consult your workload and current community guidance to calibrate the threshold.\n\n**MySQL:** Query `information_schema.TABLES` for InnoDB tablespace fragmentation:\n\n```\nSELECT table_schema, table_name,\n       round(data_length / 1024 / 1024, 2) AS data_mb,\n       round(data_free / 1024 / 1024, 2) AS free_mb,\n       round(data_free / (data_length + index_length + data_free) * 100, 2) AS frag_pct\nFROM information_schema.TABLES\nWHERE engine = 'InnoDB'\n  AND data_free > 0\nORDER BY data_free DESC\nLIMIT 10;\n```\n\nThis metric is meaningful only with per-table tablespaces (`innodb_file_per_table = ON`, the default since MySQL 5.6); on shared-tablespace deployments, `data_free` reflects unused space in the global `ibdata` file and is repeated identically across every InnoDB row.\n\nTables with `frag_pct` above 20% are commonly treated as candidates for [`OPTIMIZE TABLE`](https://dev.mysql.com/doc/refman/8.0/en/optimize-table.html) or `pt-online-schema-change` (this threshold is a practitioner guideline rather than a MySQL-documented limit).\n\n### Remediation by engine and downtime tolerance\n\n[Microsoft's documentation on index reorganization and rebuild](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes) maps fragmentation levels to two SQL Server operations:\n\n- **5-30% fragmentation:** `ALTER INDEX idx_name ON tbl_name REORGANIZE` compacts leaf-level pages incrementally as an online operation. It can be interrupted mid-run without corrupting the index.\n- **Above 30%:** `ALTER INDEX idx_name ON tbl_name REBUILD` recreates the index. 
Offline by default (acquires a schema modification lock that [blocks concurrent access](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/guidelines-for-online-index-operations)). Add `WITH (ONLINE = ON)` on Enterprise edition to keep the index available during the rebuild. Note that even online rebuilds acquire a brief shared (S) lock at the beginning and a brief Schema Modification (Sch-M) lock at the end of the operation, typically milliseconds, but long enough to cause noticeable waits on extremely high-concurrency workloads.\n\nOn SQL Server 2017+, combine `ONLINE = ON` with `RESUMABLE = ON` and a configurable `MAX_DURATION` to pause and resume long rebuilds: `ALTER INDEX idx_name ON tbl_name REBUILD WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60)`. Resume a paused rebuild with `ALTER INDEX idx_name ON tbl_name RESUME`. Because `RESUMABLE = ON` requires `ONLINE = ON`, it carries the same edition requirement as online index operations (Enterprise edition on on-premises SQL Server), so verify your edition before scripting against this syntax.\n\nThe [5% floor matters equally](https://www.sqlskills.com/blogs/paul/where-do-the-books-online-index-fragmentation-thresholds-come-from/). 
Running REORGANIZE on a 3% fragmented index generates log activity, consumes I/O, and produces no measurable query improvement.\n\nFor PostgreSQL, [`VACUUM`](https://www.postgresql.org/docs/current/sql-vacuum.html) reclaims dead tuple storage and updates the visibility map. [`ANALYZE`](https://www.postgresql.org/docs/current/sql-analyze.html) updates planner statistics. `REINDEX` [rebuilds the B-tree structure](https://docs.aws.amazon.com/prescriptive-guidance/latest/postgresql-maintenance-rds-aurora/reindex.html) when physical index bloat is confirmed:\n\n```\nVACUUM VERBOSE ANALYZE transactions;\n\n-- Blocking rebuild (requires maintenance window):\nREINDEX INDEX transactions_created_at_idx;\n\n-- Non-blocking rebuild (PostgreSQL 12+):\nREINDEX INDEX CONCURRENTLY transactions_created_at_idx;\n```\n\n`REINDEX CONCURRENTLY` cannot run inside a transaction block and takes longer than the standard form, but it allows writes to continue during the rebuild. Beyond immediate remediation, `VACUUM VERBOSE` output is worth reviewing regularly on your heaviest-write tables. It provides dead tuple counts, page recycling data, and cleanup statistics that give indirect signals of table health. PostgreSQL's [autovacuum handles routine dead tuple cleanup](https://www.postgresql.org/docs/current/routine-vacuuming.html) automatically, but under high-velocity delete workloads it can fall behind. The official PostgreSQL documentation on routine vacuuming covers tuning `autovacuum_vacuum_scale_factor` and `autovacuum_vacuum_threshold` for tables where the defaults prove too conservative.\n\nFor MySQL, `OPTIMIZE TABLE` defragments the tablespace and rebuilds statistics in a single operation. 
In MySQL 8.0+, this runs online for regular InnoDB tables with only brief metadata locks at prepare and commit phases, but the full copy can take significant time on large tables:\n\n```\nOPTIMIZE TABLE events;\nANALYZE TABLE events;\n```\n\nInternally, InnoDB maps `OPTIMIZE TABLE` to `ALTER TABLE ... FORCE`, rebuilding the clustered index and all secondary indexes. For zero-downtime execution on large tables, [`pt-online-schema-change`](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html) from Percona Toolkit performs the same rebuild while keeping the original table live:\n\n```\npt-online-schema-change \\\n  --alter \"ENGINE=InnoDB\" \\\n  --execute \\\n  D=app_prod,t=events,h=127.0.0.1,F=$HOME/.my.cnf\n```\n\nThis maintains a shadow copy and replays writes via triggers throughout the rebuild. The `--execute` flag is required to alter the table; without it the tool only performs safety checks and exits (use `--dry-run` to rehearse the change without touching the original table).\n\n**Remediation lookup by symptom severity:**\n\n| Symptom Severity | Engine | Downtime Tolerance | Recommended Action |\n|---|---|---|---|\n| Mild (frag < 5% / dead_pct < 10%) | All | N/A | None |\n| Moderate (5-30%) | SQL Server | Any | ALTER INDEX ... REORGANIZE |\n| Severe (> 30%) | SQL Server | None (must stay online) | ALTER INDEX ... REBUILD WITH (ONLINE=ON) [Enterprise] |\n| Severe (> 30%) | SQL Server | Window available | ALTER INDEX ... REBUILD |\n| Elevated (dead_pct > 10%) | PostgreSQL | Any | VACUUM ANALYZE |\n| High bloat (free_percent > 30%) | PostgreSQL | None (must stay online) | REINDEX CONCURRENTLY |\n| Elevated (frag_pct > 20%) | MySQL | Window available | OPTIMIZE TABLE |\n| Elevated (frag_pct > 20%) | MySQL | None (must stay online) | pt-online-schema-change |\n\nWith fragmentation addressed, the next failure category that produces slow queries is stale statistics, which cause the optimizer to choose a scan where an index seek would be orders of magnitude faster.\n\n## Symptom: Query Plan Regression\n\nThe execution plan shows a table scan where an index seek ran yesterday. 
The optimizer has not changed; the data it relies on has. This is a statistics problem.\n\n### Diagnosing stale statistics\n\nThe SQL Server [optimizer uses row count estimates and data distribution histograms](https://learn.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-ver17) to choose between index seeks and table scans. When those statistics are weeks out of date on a fast-growing table, the optimizer picks a scan where a seek would be dramatically faster. Run `UPDATE STATISTICS table_name WITH FULLSCAN` on any table that receives large batch loads. The [`WITH SAMPLE`](https://learn.microsoft.com/en-us/sql/t-sql/statements/update-statistics-transact-sql?view=sql-server-ver17) variant uses a row sampling percentage that can miss skewed distributions on large tables, producing statistics that look current but reflect an unrepresentative subset.\n\nTo detect indexes suffering from stale statistics or poor plan choices, query `sys.dm_db_index_usage_stats`:\n\n```\nSELECT OBJECT_NAME(object_id) AS tbl_name,\n       index_id,\n       user_seeks,\n       user_scans,\n       user_lookups\nFROM sys.dm_db_index_usage_stats\nWHERE database_id = DB_ID()\nORDER BY user_scans DESC;\n```\n\nIndexes with zero seeks but high scans are candidates for statistics updates or missing index evaluation.\n\n[PostgreSQL's ANALYZE command](https://www.postgresql.org/docs/17/planner-stats.html) and [MySQL's `ANALYZE TABLE`](https://dev.mysql.com/doc/refman/8.4/en/analyze-table.html) update planner statistics independently from `VACUUM` and `OPTIMIZE TABLE` respectively. 
On PostgreSQL, autovacuum runs `ANALYZE` automatically after a [configurable percentage of rows change](https://www.postgresql.org/docs/17/runtime-config-autovacuum.html) (controlled by `autovacuum_analyze_scale_factor`, default 0.1 or 10%), but that default is [too high for large tables](https://aws.amazon.com/blogs/database/understanding-autovacuum-in-amazon-rds-for-postgresql-environments/). A 200-million-row table would need 20 million row changes to trigger autovacuum's ANALYZE pass, by which point the query plan may have been wrong for hours. Lowering `autovacuum_analyze_scale_factor` to 0.01 or using `autovacuum_analyze_threshold` with per-table overrides addresses this.\n\n### Updating statistics without disruption\n\nOn SQL Server, `UPDATE STATISTICS` generally does not block queries (it runs with NOLOCK semantics on data reads), though asynchronous statistics updates can cause brief schema lock contention during query compilation in high-workload scenarios. It does invalidate cached execution plans for the affected table: immediately after, SQL Server will recompile plans on next execution, which can briefly spike CPU on systems with many concurrent queries against the updated table. Run during low-traffic windows on heavily queried tables. The choice between `FULLSCAN` and `SAMPLE` depends on table size and distribution skew.\n\nFor tables in the small-to-medium range, `FULLSCAN` typically completes quickly enough to run during off-peak hours (the practical upper bound depends on hardware, but many teams use roughly 100M rows as a rule-of-thumb cutoff). 
For larger tables, a higher sample percentage (such as `SAMPLE 20 PERCENT` or `SAMPLE 30 PERCENT`) typically provides a better tradeoff between accuracy and duration than the default sample, though the optimal percentage varies by workload.\n\nOn PostgreSQL, [`ANALYZE`](https://www.crunchydata.com/blog/indexes-selectivity-and-statistics) reads a configurable sample (with the default `default_statistics_target = 100`, about 30,000 rows per table, or 300 times the target) and does not block reads or writes on the table. Run it manually after any bulk load or partition swap.\n\nOn MySQL, `ANALYZE TABLE` is a lightweight operation on InnoDB that estimates statistics from random dives into the index tree. In MySQL 8.0+, `ANALYZE TABLE` uses online DDL semantics, avoiding the full read lock that earlier versions required. Capture `EXPLAIN` for representative queries before and after to confirm the planner picked up the new statistics.\n\nOpManager Nexus automates detection of query plan regression on-prem through historical baseline comparison and anomaly flagging. The same capability extends to cloud-managed databases through its SaaS delivery, where [slow query log analysis](https://www.site24x7.com/database-monitoring.html) drills into queries exceeding a configurable execution-time threshold. The Automated Remediation section below covers how to wire that detection into corrective workflows.\n\nStatistics failures are invisible until the query plan degrades. Storage failures are equally silent, until a disk fills and takes the database offline.\n\n## Symptom: Storage Pressure and Runaway Growth\n\nA disk usage alert fires at 85% capacity. The database server has been running for months without anyone checking how fast the log files or tablespaces are growing. The root cause splits into two categories: unmanaged transaction log growth and missing archiving strategy. 
Both are maintenance failures that monitoring should have caught weeks earlier.\n\n### Transaction log and WAL management\n\n**SQL Server:** A [full recovery model database without regular transaction log backups](https://learn.microsoft.com/en-us/sql/relational-databases/logs/troubleshoot-a-full-transaction-log-sql-server-error-9002?view=sql-server-ver17) will grow its log file until the disk fills, and a full data volume is an immediate production outage. To check current log space usage across all databases, run `DBCC SQLPERF(LOGSPACE);`, which returns log size, space used percentage, and status for every database. For a single database, query `sys.databases` for the `log_reuse_wait_desc` column, which tells you exactly why the log cannot be truncated (e.g., `LOG_BACKUP`, `ACTIVE_TRANSACTION`). Schedule log backups at an interval matching your Recovery Point Objective (RPO): for most OLTP workloads, intervals in the range of 5-30 minutes are commonly used, with tighter intervals for high-transaction systems, though the right frequency is workload-specific.\n\n[`DBCC SHRINKFILE`](https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-shrinkfile-transact-sql?view=sql-server-ver17) on the log file is a last resort for reclaiming space after an unexpected log growth event. The reason it is a last resort, rather than a routine cleanup tool, is the side effect on [Virtual Log Files](https://learn.microsoft.com/en-us/sql/relational-databases/logs/manage-the-size-of-the-transaction-log-file?view=sql-server-ver17) (VLFs), the internal segments SQL Server divides the transaction log into. Each shrink-then-regrow cycle adds new VLFs, so a log that has been shrunk repeatedly ends up fragmented into many small VLFs instead of a few large ones. That fragmentation degrades sequential log write throughput and [increases recovery time](https://www.sqlskills.com/blogs/paul/why-you-should-not-shrink-your-data-files/). 
The fix is to address the root cause (missing log backups, long-running transactions) rather than shrinking on a schedule.\n\n**PostgreSQL:** [WAL (Write-Ahead Log)](https://www.postgresql.org/docs/current/continuous-archiving.html) management serves the same function as SQL Server's transaction log. The `archive_mode` and `archive_command` settings control whether completed WAL segments are shipped to archive storage. [WAL segments accumulate](https://www.percona.com/blog/five-reasons-why-wal-segments-accumulate-in-the-pg_wal-directory-in-postgresql/) in `pg_wal/` until the disk fills when archiving is enabled but `archive_command` keeps failing, or when something else (such as an inactive replication slot) holds segments back. The [`wal_keep_size`](https://pgpedia.info/w/wal_keep_size.html) parameter (PostgreSQL 13+, replacing `wal_keep_segments`) sets a floor for retained WAL data, but does not cap growth. For production systems, configure continuous archiving with `archive_mode = on` and point `archive_command` to your backup infrastructure (pgBackRest, Barman, or cloud-native equivalents).\n\nTo verify archiving is active and current: `SELECT * FROM pg_stat_archiver;` Check `last_archived_time` and `failed_count`. A non-zero `failed_count` or a stale `last_archived_time` means WAL segments are accumulating. Also: `SELECT count(*), pg_size_pretty(sum(size)) FROM pg_ls_waldir();` (PostgreSQL 10+) shows total WAL directory size.\n\n**MySQL:** [Binary logs](https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html) (binlogs) serve replication and point-in-time recovery. Without rotation, they grow indefinitely. [`expire_logs_days`](https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-3.html) (deprecated in MySQL 8.0.3) or `binlog_expire_logs_seconds` (MySQL 8.0+) controls automatic purge. Setting `binlog_expire_logs_seconds = 604800` retains seven days of binary logs, which is sufficient for most replication topologies. 
Run `PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY` for one-time cleanup.\n\n### Capacity forecasting with OpManager Nexus\n\nReacting to a disk alert at 85% leaves little room for planned action. OpManager Nexus's [AI/ML-based storage forecasting](https://www.manageengine.com/network-monitoring/help/forecast-reports.html) uses up to 14 days of history to predict when [storage will hit 80%, 90%, and 100%](https://www.manageengine.com/network-monitoring/storage-capacity-forecasting-planning.html), giving your team a \"disk full in N days\" signal once it has at least 3 days of data. Its [adaptive thresholds](https://www.manageengine.com/network-monitoring/help/adaptive-thresholds.html) learn baseline behavior so alerts fire on genuine anomalies rather than every batch job, and the Database Tab surfaces individual database size, data and log file utilization, and growth trends.\n\nNote: OpManager Nexus's own monitoring data retention (configured under Settings > General Settings > [Database Maintenance]) is independent of your production database storage. Defaults are 7, 30, and 365 days for detailed, hourly, and daily statistics.\n\nUse OpManager Nexus's forecast reports to verify your archiving cadence keeps pace with growth: if the forecast shows 80% capacity in 30 days but your archive job runs monthly, increase frequency or provision more storage.\n\nStorage pressure is a passive failure that accumulates over time. Lock contention is an active failure: the maintenance operation meant to fix the database becomes the source of the incident.\n\n## Symptom: Lock Contention from Maintenance Operations\n\nA spike in blocked sessions immediately after a scheduled maintenance run is direct evidence that the REBUILD or REORGANIZE collided with production traffic and [created lock contention](https://www.mssqltips.com/sqlservertip/5880/why-is-index-reorganize-and-update-statistics-causing-sql-server-blocking/). 
The maintenance job is supposed to fix performance, but index REBUILDs running without `ONLINE = ON` during peak traffic or without a maintenance window hold locks that block concurrent queries, turning the fix into the incident.\n\n### Identifying maintenance-induced blocking\n\nCorrelating maintenance timing with OpManager Nexus's Sessions Tab is how you distinguish maintenance-induced blocking from application-level contention. If blocked session counts spike within minutes of a maintenance window opening, the maintenance job is the cause. On SQL Server, check [`sys.dm_exec_requests`](https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-requests-transact-sql) for sessions with `wait_type` values starting with `LCK_M_`, then look up the head-of-chain blocker and inspect its `command` column for `ALTER INDEX` or `DBCC` operations.\n\nOn PostgreSQL, `pg_stat_activity` shows `Lock` wait events with `wait_event` values like `relation` or `transactionid`. If the blocking PID is running `REINDEX` or `VACUUM FULL`, that is maintenance-induced contention. For cloud-managed instances where Sessions Tab access is unavailable, OpManager Nexus's [SaaS delivery](https://www.site24x7.com/database-monitoring.html) surfaces lock contention and blocking session counts on its database performance dashboard for the same triage signal.\n\n### Online and resumable operations\n\nThe fix is operational: use online operations and schedule them outside peak traffic windows.\n\n**SQL Server:** Use `ALTER INDEX ... REBUILD WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60)` as described in the I/O Degradation section. The duration is any positive integer in minutes; set it based on your maintenance window. `REORGANIZE` is always online and interruptible.\n\n**PostgreSQL:** `REINDEX INDEX CONCURRENTLY` (introduced in the I/O Degradation section) avoids exclusive locks. 
`VACUUM` without `FULL` does not block reads or writes.\n\n**MySQL:** Standard `OPTIMIZE TABLE` already runs as online DDL on MySQL 8.0+ (introduced in the I/O Degradation section). Reach for `pt-online-schema-change` when you need finer control over lock duration on very large tables, or when you want triggered shadow-copy semantics that `OPTIMIZE TABLE` does not offer.\n\nThe four symptom categories above all produce observable performance signals before they become outages. Corruption is different: it produces no signal until it surfaces as query failures or data loss.\n\n## Symptom: Silent Corruption and Integrity Failures\n\nBecause corruption produces no precursor wait events or latency drift, detection is a deliberate scheduled act, not an alert response. Regular integrity checks are the primary detection mechanism, supplemented by storage-level checksums, page verification, and reliable backups.\n\n**SQL Server:** [`DBCC CHECKDB`](https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-checkdb-transact-sql?view=sql-server-ver17) catches [page corruption, allocation errors, and consistency violations](https://techcommunity.microsoft.com/blog/sqlserversupport/sql-server-database-corruption-causes-detection-and-some-details-behind-dbcc-che/4460631).\n\n```\n-- Recommended production form: suppresses informational messages, shows only errors\nDBCC CHECKDB('ProductionDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;\n```\n\nFor large databases where a full DBCC CHECKDB is too slow for a maintenance window, `DBCC CHECKDB ... WITH PHYSICAL_ONLY` checks page and record header integrity without logical consistency checks and completes significantly faster. Corruption surfaces in the SQL Server error log as [messages Msg 823, 824, or 825](https://support.microsoft.com/en-us/help/2015755/how-to-troubleshoot-a-msg-823-error-in-sql-server). 
To proactively check for known corruption events, query the suspect pages table:\n\n```\n-- msdb.dbo.suspect_pages identifies the database by database_id\nSELECT database_id, file_id, page_id, event_type, error_count, last_update_date\nFROM msdb.dbo.suspect_pages\nWHERE event_type IN (1, 2, 3);\n```\n\nEvent_type 1 = 823/824 errors, 2 = bad checksum, 3 = torn page. A non-empty result requires immediate DBCC CHECKDB and restore planning.\n\nRunning DBCC CHECKDB as frequently as your maintenance windows allow is the safe path. Many experts recommend daily on all databases; if that is impractical, prioritize critical databases and shorten the interval on large ones using `WITH PHYSICAL_ONLY`.\n\n**PostgreSQL:** The [`pg_amcheck`](https://www.postgresql.org/docs/14/app-pgamcheck.html) utility (PostgreSQL 14+) verifies heap tables and B-tree indexes, checking table pages for structural damage and confirming that index entries are in the correct sort order. The default invocation is fast enough for routine scheduled checks and catches most corruption:\n\n```\npg_amcheck mydb\n```\n\nAfter an unexpected crash, storage event, or replication failure, run the thorough variant on critical tables:\n\n```\npg_amcheck --heapallindexed --parent-check mydb\n```\n\n`--heapallindexed` performs a deeper check that every heap tuple has a corresponding index entry; `--parent-check` verifies cross-level B-tree invariants. Both flags increase runtime substantially, so reserve them for incident response or post-event verification rather than the routine schedule.\n\n**MySQL:** `mysqlcheck` provides table-level integrity verification:\n\n```\nmysqlcheck --check --all-databases -u root -p\n```\n\nFor individual tables, `CHECK TABLE table_name` within the MySQL client performs the same operation. InnoDB tables benefit from [`CHECK TABLE ... FOR UPGRADE`](https://dev.mysql.com/doc/refman/9.7/en/check-table.html) after major version upgrades to verify storage format compatibility.\n\nRunning these checks manually is the safety net. 
The next section shows how to automate the response so the platform acts before the on-call engineer logs in.\n\n## From Alert to Fix: Automated Remediation Across Engines\n\nWhen the alert fires at 3 AM, having the platform execute the remediation automatically matters far more than knowing the fix. OpManager Nexus's IT Workflow Automation triggers a custom monitoring script when an alert threshold is breached: the script queries the symptom's diagnostic surface (fragmentation, dead tuples, log space), evaluates severity, and runs the remediation.\n\n### SQL Server: wiring remediation into OpManager Nexus\n\nOpManager Nexus accepts PowerShell or shell scripts as [custom monitors](https://www.manageengine.com/network-monitoring/script-monitoring.html) (Custom Script Monitors require build 12.7 or later). The integration pattern matches the PostgreSQL and MySQL examples below: query `sys.dm_db_index_physical_stats` for fragmentation, branch on the threshold, issue `ALTER INDEX REORGANIZE` or `REBUILD WITH (ONLINE = ON)` accordingly, and emit one log line per action so the run shows up in the monitor's history. 
Run the script under a service account with at least `db_ddladmin` on the target database; for SQL authentication or cross-domain setups, pull credentials from a secrets store rather than embedding them.\n\n### PostgreSQL and MySQL shell automation\n\nFor PostgreSQL, a cron-driven shell script can query `pg_stat_user_tables` for bloated tables and trigger remediation:\n\n``` bash\n#!/usr/bin/env bash\n# PostgreSQL automated VACUUM ANALYZE for tables exceeding the dead tuple threshold.\n# Credentials sourced from ~/.pgpass (chmod 600); export PGPASSFILE if non-default.\nPGHOST=\"localhost\"\nPGPORT=\"5432\"\nPGDATABASE=\"app_prod\"\nPGUSER=\"maintenance_user\"\nexport PGPASSFILE=\"${PGPASSFILE:-$HOME/.pgpass}\"\n\n# Dead tuple percentage above which a table is vacuumed\nDEAD_THRESHOLD=15\n\n# VACUUM tables with high dead tuple ratio.\n# quote_ident() keeps mixed-case or otherwise unusual identifiers valid in the VACUUM statement.\npsql -h \"$PGHOST\" -p \"$PGPORT\" -U \"$PGUSER\" -d \"$PGDATABASE\" -t -A -F'|' -c \"\n  SELECT quote_ident(schemaname), quote_ident(relname), round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2)\n  FROM pg_stat_user_tables\n  WHERE n_live_tup > 10000\n    AND round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) > $DEAD_THRESHOLD\n\" | while IFS='|' read -r schema table dead_pct; do\n  echo \"$(date '+%Y-%m-%d %H:%M:%S') | VACUUM ANALYZE ${schema}.${table} | dead_pct=${dead_pct}%\"\n  psql -h \"$PGHOST\" -p \"$PGPORT\" -U \"$PGUSER\" -d \"$PGDATABASE\" -c \"VACUUM ANALYZE ${schema}.${table};\"\ndone\n```\n\nFor MySQL, a similar approach queries `information_schema.TABLES` and triggers `OPTIMIZE TABLE`. 
Use a MySQL option file instead of embedding credentials in the script (create `~/.my.cnf` with `[client]` credentials and restrict permissions to 600):\n\n``` bash\n#!/usr/bin/env bash\n# MySQL automated optimize for InnoDB tables exceeding fragmentation threshold\nMYSQL_HOST=\"localhost\"\nMYSQL_DB=\"app_prod\"\n\nFRAG_THRESHOLD=20\n\nmysql --defaults-extra-file=\"$HOME/.my.cnf\" -h \"$MYSQL_HOST\" -N -B -e \"\n  SELECT table_name, round(data_free / (data_length + index_length + data_free) * 100, 2) AS frag_pct\n  FROM information_schema.TABLES\n  WHERE table_schema = '${MYSQL_DB}'\n    AND engine = 'InnoDB'\n    AND data_free > 0\n    AND round(data_free / (data_length + index_length + data_free) * 100, 2) > ${FRAG_THRESHOLD}\n\" | while read -r table frag_pct; do\n  echo \"$(date '+%Y-%m-%d %H:%M:%S') | OPTIMIZE TABLE ${table} | frag_pct=${frag_pct}%\"\n  mysql --defaults-extra-file=\"$HOME/.my.cnf\" -h \"$MYSQL_HOST\" \"$MYSQL_DB\" -e \"OPTIMIZE TABLE ${table};\"\ndone\n```\n\nSchedule either script via cron (e.g., `0 3 * * * /opt/scripts/pg_maintenance.sh >> /var/log/db_maintenance.log 2>&1`) and monitor the log output through OpManager Nexus's custom monitor integration.\n\n### Cloud-managed database automation\n\nFor databases running on Amazon RDS, Aurora, or Azure SQL, OpManager Nexus's SaaS delivery provides the cloud-side counterpart of the PowerShell and shell automation patterns above. Its [IT Automation module](https://www.site24x7.com/help/admin/configuration-profiles/actions.html) triggers corrective actions from threshold breaches and anomaly detections, and [AI-powered baselines](https://www.site24x7.com/anomaly-detection.html) replace the manual threshold tuning that self-managed instances require. For RDS specifically, [service actions](https://www.site24x7.com/help/it-automation/rds-actions.html) like start, stop, and reboot with failover are surfaced directly. 
[Engine-specific monitor setup](https://www.site24x7.com/help/database-monitoring/) for SQL Server, PostgreSQL, and MySQL is documented separately. Threshold profiles let you apply equivalent alert configurations across dev, staging, and production monitors, so a query that fragments an index under realistic staging load surfaces in slow query detection before it reaches production scale.\n\n## Maintenance Health Scorecard: Assessing Your Current Posture\n\nInstead of running through the diagnostic queries from scratch, use this scorecard to assess your maintenance posture. Each item references the diagnostic approach covered in its corresponding section above.\n\n**I/O health (see: I/O Degradation section)**\n\n- [ ] SQL Server: Run the `sys.dm_db_index_physical_stats` query (filter the results at 30% fragmentation). Count of indexes returned: ___\n- [ ] PostgreSQL: Run the `pg_stat_user_tables` dead tuple query. Tables with dead_pct above 10-20% are candidates for immediate attention: ___\n- [ ] MySQL: Run the `information_schema.TABLES` fragmentation query. Tables with frag_pct above 20%: ___\n\n**Statistics freshness (see: Query Plan Regression section)**\n\n- [ ] SQL Server: Check `sys.dm_db_index_usage_stats` for indexes with zero seeks but high scans (plan regression or poorly matched index)\n- [ ] PostgreSQL: Verify `autovacuum_analyze_scale_factor` is set below 0.1 for tables above 100 million rows\n- [ ] MySQL: Run `ANALYZE TABLE` on your top 10 tables by write volume; capture `EXPLAIN` output for representative queries before and after to confirm planner statistics changed as expected\n\n**Storage trajectory (see: Storage Pressure section)**\n\n- [ ] OpManager Nexus forecast report confirms sufficient capacity runway before any threshold crossing: Yes / No\n- [ ] Transaction log backup job (SQL Server) or WAL archiving (PostgreSQL) is confirmed running and last backup verified: Yes / No\n- [ ] Binary log rotation (MySQL) is configured with `binlog_expire_logs_seconds` set to an explicit value: Yes / No\n\n**Integrity baseline (see: Silent Corruption section)**\n\n- [ ] SQL Server: `DBCC CHECKDB` last run date on critical databases: ___\n- [ ] PostgreSQL: `pg_amcheck` last run date (or equivalent manual check): ___\n- [ ] MySQL: `mysqlcheck --check` last run date: ___\n\n**Automation coverage (see: Automated Remediation section)**\n\n- [ ] At least one automated remediation script is deployed, scheduled, and confirmed to be producing output logs: Yes / No\n- [ ] OpManager Nexus alert thresholds are configured and tested for key database health metrics (BCHR, disk utilization, blocked sessions): Yes / No\n- [ ] Maintenance windows are scheduled based on monitoring signals, not calendar dates: Yes / No\n\nCross-reference those results against OpManager Nexus's slow query and session data (on-prem Performance Tab or SaaS Database Metrics dashboard). 
If a table in the top results by size also appears as a source of slow query detections, that is your highest-priority maintenance target.", "url": "https://wpnews.pro/news/database-maintenance-tracing-production-incidents-to-their-root-cause", "canonical_source": "https://dev.to/damasosanoja/database-maintenance-tracing-production-incidents-to-their-root-cause-327e", "published_at": "2026-05-22 12:00:38+00:00", "updated_at": "2026-05-22 12:08:58.345596+00:00", "lang": "en", "topics": ["data", "enterprise-software", "developer-tools", "cloud-computing"], "entities": ["SQL Server", "PostgreSQL", "MySQL"], "alternates": {"html": "https://wpnews.pro/news/database-maintenance-tracing-production-incidents-to-their-root-cause", "markdown": "https://wpnews.pro/news/database-maintenance-tracing-production-incidents-to-their-root-cause.md", "text": "https://wpnews.pro/news/database-maintenance-tracing-production-incidents-to-their-root-cause.txt", "jsonld": "https://wpnews.pro/news/database-maintenance-tracing-production-incidents-to-their-root-cause.jsonld"}}