PostgreSQL's Dead Tuple and Vacuum

9/21/2024 2 min read

I detected autovacuum operations that were excessively using CPU in PostgreSQL. Since I had never seen this in databases like Oracle, MySQL, and MSSQL that I've used in the field, I looked into it and found it interesting, so I organized it.

Before learning about Dead Tuple and Vacuum/autovacuum, let me first cover MVCC for easier delivery.

MVCC (Multi-Version Concurrency Control)

MVCC translates to Multi-Version Concurrency Control.

For example, think of Transactions. Transactions can group work into units and Undo/Redo changes before Commit. They implement MVCC to ensure integrity for read/write operations.

An interesting point is that each database implements features slightly differently.

Summary of Main Differences

DBMS	MVCC Basic Operation	Garbage Collection	Performance
PostgreSQL	Creates new versions leaving Dead Tuples	Autovacuum	Excellent
MySQL (InnoDB)	Manages previous versions in Undo Log	Purge Thread	Excellent
Oracle	Manages versions in Undo Segment	SMON	Very Excellent
SQL Server	Provides similar method in Snapshot Isolation	Automatic version management	Flexible

As such, MVCC is implemented similarly across different DBs using different methods and terminology.

Dead Tuples

Dead tuples image PostgreSQL's Dead Tuples are data that accumulates during updates/deletions.

Unlike databases like MySQL that use MVCC + Lock functionality, it reduces conflicts and improves performance when transactions occur simultaneously. Since modification work is possible even while reading data, it's a structure optimized for read performance.

So when is this accumulated data removed?

Autovacuum

Autovacuum image Autovacuum exists to clean up this accumulated data. When Dead Tuples accumulate to a threshold, it executes Garbage Collection for optimization, and this is called Autovacuum.

The Autovacuum daemon runs and automatically executes when thresholds are reached. There's also manual vacuum execution available.

Conclusion

While learning about autovacuum, I found the MVCC implementation and differences across different DBs interesting, so I briefly organized it.