<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <!-- Source: https://kernelmaker.github.io/feed.xml -->
  <generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator>
  <link href="https://siftrss.com/f/kpPjBJjzqr5" rel="self" type="application/atom+xml"/>
  <link href="https://kernelmaker.github.io/" rel="alternate" type="text/html"/>
  <updated>2026-04-16T20:02:11+00:00</updated>
  <id>https://siftrss.com/f/kpPjBJjzqr5</id>
  <title type="html">Zhao Song’s Blog</title>
  <subtitle>Database internals</subtitle>
  <author>
    <name>Zhao Song</name>
  </author>
  <entry>
    <title type="html">Dissecting the MySQL 8.0 Performance Regression on oltp_update_non_index</title>
    <link href="https://kernelmaker.github.io/MySQL-regression-1" rel="alternate" type="text/html" title="Dissecting the MySQL 8.0 Performance Regression on oltp_update_non_index"/>
    <published>2026-04-16T00:00:00+00:00</published>
    <updated>2026-04-16T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/MySQL-regression-1</id>
    <content type="html" xml:base="https://kernelmaker.github.io/MySQL-regression-1"><![CDATA[<p>The performance regression in MySQL 8.0 is well known, but it is still not fully understood. That is because it is not a regression caused by one obvious bottleneck. MySQL 8.0 introduced many new designs and refactored major subsystems, so the gap comes from a combination of configuration defaults, architectural trade-offs, and many small overheads spread across different layers.</p>

<p>So I picked one workload, profiled it carefully, and tried to answer a more practical question: <strong>where exactly does the regression come from, and how much does each part contribute?</strong></p>

<p>In this post, I use <code class="language-plaintext highlighter-rouge">oltp_update_non_index</code>, one of the worst regression cases between MySQL 5.7.44 and 8.0.45, as a starting point. Beginning from a <strong>-32.0% throughput gap</strong>, I narrow it to <strong>-0.2%</strong> by systematically isolating one factor at a time. As expected, the regression is not dominated by a single bottleneck. It is the combined effect of default settings, architectural changes, and dozens of small code-level costs.</p>

<h2 id="1-setup">1. Setup</h2>

<p>Both MySQL versions are compiled with GCC 8.5.0 at <code class="language-plaintext highlighter-rouge">-O3</code>. <code class="language-plaintext highlighter-rouge">mysqld</code> and <code class="language-plaintext highlighter-rouge">sysbench</code> are each pinned to 8 dedicated physical cores with no hyperthreading overlap. The buffer pool is 23 GB. The dataset is a single table with 50 million rows. <code class="language-plaintext highlighter-rouge">innodb_flush_log_at_trx_commit=2</code>, <code class="language-plaintext highlighter-rouge">sync_binlog=0</code>, and the adaptive hash index is OFF. Both versions use identical InnoDB settings wherever the options are comparable. Full configuration details are listed in the appendix.</p>

<p>The benchmark is sysbench <code class="language-plaintext highlighter-rouge">oltp_update_non_index</code>, with 8 threads and 90 seconds per run.</p>

<h2 id="2-narrowing-the-gap-step-by-step">2. Narrowing the gap step by step</h2>

<p>The method is straightforward. I start with MySQL 8.0.45 and 5.7.44 under the same default-like configuration, then use <code class="language-plaintext highlighter-rouge">perf record</code> and <code class="language-plaintext highlighter-rouge">perf report</code> to see where CPU time goes. Once a subsystem becomes the dominant bottleneck, I either adjust the configuration to remove that cost or write a small targeted patch. Then I profile again and repeat.</p>

<p>Each round removes one visible layer of overhead and exposes the next one underneath.</p>

<p>Starting from the baseline and applying changes one by one, I was able to close almost the entire gap:</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th>Description</th>
      <th>5.7 TPS</th>
      <th>8.0 TPS</th>
      <th>Gap</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>—</td>
      <td>Baseline (PFS=1, bin=ON, writer=ON)</td>
      <td>69,818</td>
      <td>47,509</td>
      <td>-32.0%</td>
    </tr>
    <tr>
      <td>1</td>
      <td>innodb_log_writer_threads=OFF</td>
      <td>(69,818)</td>
      <td>52,539</td>
      <td>-24.7%</td>
    </tr>
    <tr>
      <td>2</td>
      <td>performance_schema=0</td>
      <td>72,316</td>
      <td>54,919</td>
      <td>-24.1%</td>
    </tr>
    <tr>
      <td>3</td>
      <td>skip-log-bin</td>
      <td>128,140</td>
      <td>109,044</td>
      <td>-14.9%</td>
    </tr>
    <tr>
      <td>4</td>
      <td>--db-ps-mode=auto</td>
      <td>135,664</td>
      <td>125,119</td>
      <td>-7.8%</td>
    </tr>
    <tr>
      <td>5</td>
      <td>innodb_flush_log_at_trx_commit=0</td>
      <td>149,577</td>
      <td>144,941</td>
      <td>-3.1%</td>
    </tr>
    <tr>
      <td>6</td>
      <td>5 code patches</td>
      <td>(149,577)</td>
      <td><strong>149,273</strong></td>
      <td><strong>-0.2%</strong></td>
    </tr>
  </tbody>
</table>

<p>From <strong>-32.0% to -0.2%</strong>. Below is the breakdown.</p>

<h3 id="step-1-innodb_log_writer_threadsoff">Step 1: <code class="language-plaintext highlighter-rouge">innodb_log_writer_threads=OFF</code></h3>

<p>This is an 8.0-only setting; it does not exist in 5.7.</p>

<p>With <code class="language-plaintext highlighter-rouge">innodb_log_writer_threads=ON</code>, a dedicated log writer thread is responsible for writing the log buffer to disk. The problem appears when <code class="language-plaintext highlighter-rouge">mysqld</code> is pinned to 8 cores and all 8 are already saturated by client threads. In that case, the log writer thread cannot get scheduled quickly enough. Client threads call <code class="language-plaintext highlighter-rouge">log_write_up_to()</code> and spin in <code class="language-plaintext highlighter-rouge">ut_delay()</code> while waiting for the writer to advance the write position, but the writer itself is CPU-starved. That creates a feedback loop: client threads spin longer, occupy more CPU, and make it even harder for the writer to run.</p>

<p>With <code class="language-plaintext highlighter-rouge">innodb_log_writer_threads=OFF</code>, no separate writer thread is needed: the calling thread takes over the writer role inside <code class="language-plaintext highlighter-rouge">log_write_up_to()</code>, eliminating the scheduling dependency.</p>

<h3 id="step-2-performance_schema0">Step 2: <code class="language-plaintext highlighter-rouge">performance_schema=0</code></h3>

<p>Disabling Performance Schema reduces the gap further. MySQL 8.0 changed PFS significantly compared with 5.7, including v2 metadata lock instrumentation, new memory statistics layers, and allocator hook changes. However, I did not isolate the PFS internals deeply enough in this workload to say which part is the main contributor.</p>

<p>So for this step, I can say that PFS matters, but I cannot yet attribute the overhead to a specific internal component.</p>

<h3 id="step-3-skip-log-bin">Step 3: <code class="language-plaintext highlighter-rouge">skip-log-bin</code></h3>

<p>This is the single largest step in absolute TPS gain. Both versions improve dramatically when binary logging is disabled (5.7: +67%, 8.0: +86%). The workload is commit-bound, and every transaction generates one binlog event. Even with <code class="language-plaintext highlighter-rouge">sync_binlog=0</code>, the binlog still adds per-commit CPU cost for event formatting, <code class="language-plaintext highlighter-rouge">Table_map</code> construction, and memory allocation.</p>

<p>MySQL 8.0 benefits more from disabling binlog than 5.7 does (+86% vs +67%), which suggests that the 8.0 binlog path adds extra per-event overhead. Looking at the code, I found several 8.0-only additions that are plausible contributors:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">ColumnFilterOutboundFunctionalIndexes</code></strong>: <code class="language-plaintext highlighter-rouge">is_filter_needed()</code> returns <code class="language-plaintext highlighter-rouge">true</code> unconditionally, so the column filter is installed even for tables without functional indexes.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">ReplicatedColumnsView</code></strong>: allocates a <code class="language-plaintext highlighter-rouge">std::vector&lt;std::unique_ptr&lt;ColumnFilter&gt;&gt;</code> for each <code class="language-plaintext highlighter-rouge">Table_map</code> event.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">init_metadata_fields()</code></strong>: new in 8.0, adding metadata serialization work to every <code class="language-plaintext highlighter-rouge">Table_map</code> event.</li>
</ul>

<p>These are reasonable suspects, but I have not profiled the binlog path in isolation, so I am not claiming that they are the confirmed dominant causes.</p>

<h3 id="step-4---db-ps-modeauto-prepared-statements">Step 4: <code class="language-plaintext highlighter-rouge">--db-ps-mode=auto</code> (prepared statements)</h3>

<p>Switching from <code class="language-plaintext highlighter-rouge">--db-ps-mode=disable</code> to <code class="language-plaintext highlighter-rouge">--db-ps-mode=auto</code> helps by avoiding full parsing on every execution. In this mode, the statement is parsed once, then re-optimized on each execution. The net effect is that prepared statements help 8.0 more than 5.7.</p>

<p>I have not fully decomposed why the text-protocol path in 8.0 is heavier for a simple <code class="language-plaintext highlighter-rouge">UPDATE</code>. It is unlikely to be the grammar itself, because a simple <code class="language-plaintext highlighter-rouge">UPDATE</code> does not exercise features such as CTEs or window functions. More likely, the extra cost comes from surrounding setup work in the lexer, resolver, or optimizer path. That still needs confirmation.</p>

<h3 id="step-5-innodb_flush_log_at_trx_commit0">Step 5: <code class="language-plaintext highlighter-rouge">innodb_flush_log_at_trx_commit=0</code></h3>

<p>Setting <code class="language-plaintext highlighter-rouge">innodb_flush_log_at_trx_commit=0</code> decouples transaction commit from the redo log write. This makes it easier to isolate the cost of 8.0’s lock-free redo log design.</p>

<p>Profiling shows that with <code class="language-plaintext highlighter-rouge">flush=2</code>, the function <code class="language-plaintext highlighter-rouge">ut_delay()</code>, the busy-wait loop inside <code class="language-plaintext highlighter-rouge">log_write_up_to()</code>, consumes <strong>9.59% of total CPU</strong> in 8.0. With <code class="language-plaintext highlighter-rouge">writer_threads=OFF</code>, each committing thread writes to the log buffer through <code class="language-plaintext highlighter-rouge">log_buffer_reserve()</code>, <code class="language-plaintext highlighter-rouge">log_buffer_write()</code>, and <code class="language-plaintext highlighter-rouge">log_buffer_write_completed()</code>, and then spin-waits in <code class="language-plaintext highlighter-rouge">log_write_up_to()</code> for the write to reach disk. This lock-free coordination machinery (<code class="language-plaintext highlighter-rouge">log_buffer_reserve</code> 0.57% + <code class="language-plaintext highlighter-rouge">log_buffer_write_completed</code> 0.33% + <code class="language-plaintext highlighter-rouge">log_wait_for_space_in_log_recent_closed</code> 0.31%) has no equivalent in 5.7, which uses a simpler <code class="language-plaintext highlighter-rouge">log_sys</code> mutex-based design.</p>

<p>One interesting result is that 5.7 also spends a lot of CPU in <code class="language-plaintext highlighter-rouge">ut_delay()</code>, in fact even more than 8.0 (13.64% vs 9.59%). But its redo path per transaction is shorter: take <code class="language-plaintext highlighter-rouge">log_sys</code>, write, release. So even though it spins more, it still completes more useful work per transaction.</p>

<p>This is a real architectural trade-off. The 8.0 redo redesign favors scalability at higher concurrency and under stricter durability requirements. But for this workload, at 8 threads and <code class="language-plaintext highlighter-rouge">flush=2</code>, that trade-off is unfavorable.</p>

<h3 id="step-6-five-code-patches">Step 6: Five code patches</h3>

<p>After aligning the configuration and benchmark settings and decoupling redo with <code class="language-plaintext highlighter-rouge">flush=0</code>, the remaining gap is -3.1%. At this point, profiling (924K perf samples) shows a very flat CPU profile: no single function is above 1.8%. The remaining gap is spread across many small 8.0-specific overheads.</p>

<table>
  <thead>
    <tr>
      <th>Function</th>
      <th>8.0 CPU%</th>
      <th>5.7 CPU%</th>
      <th>Category</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cmp_dtuple_rec_with_match_low</code></td>
      <td>1.18%</td>
      <td>0.44%</td>
      <td>Inline Regression</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">buf_flush_note_modification</code></td>
      <td>0.67%</td>
      <td>0% (inlined)</td>
      <td>Inline Regression</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">THD::store_cached_properties</code></td>
      <td>0.53%</td>
      <td>0% (not present)</td>
      <td>New Overhead</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">fold_condition</code></td>
      <td>0.43%</td>
      <td>0% (not present)</td>
      <td>New Overhead</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">ha_innobase::info_low</code></td>
      <td>0.36%</td>
      <td>0% (lighter)</td>
      <td>Missing Fast-Path</td>
    </tr>
  </tbody>
</table>

<p>I group them into three categories:</p>

<ul>
  <li><strong>[Inline Regression]</strong>: the 8.0 version of a function grew enough that GCC no longer auto-inlines it.</li>
  <li><strong>[New Overhead]</strong>: entirely new code in 8.0 that runs unconditionally, even when the feature behind it is not needed.</li>
  <li><strong>[Missing Fast-Path]</strong>: 8.0 added support for more cases but did not keep a cheap fast path for the common case.</li>
</ul>

<p>Five small targeted patches reduce the remaining gap from -3.1% to <strong>-0.2%</strong> at <code class="language-plaintext highlighter-rouge">flush=0</code>. The patch details are listed in the appendix.</p>

<p><img src="/public/images/2026-04-16/1.png" alt="image-1" /></p>

<p>One important note: at <code class="language-plaintext highlighter-rouge">flush=2</code>, these 5 patches show <strong>no measurable TPS improvement</strong> (124,572 vs 125,119 TPS). The reason is that <code class="language-plaintext highlighter-rouge">ut_delay()</code> in the redo commit path already consumes 9.59% of CPU and acts as a throughput ceiling. The cycles freed by the patches are mostly absorbed by extra spin iterations instead of being converted into more completed transactions.</p>

<p>The patches only become visible at <code class="language-plaintext highlighter-rouge">flush=0</code>, which confirms that they are fixing real overhead that is otherwise masked by the redo bottleneck.</p>

<h2 id="4-going-deeper-how-much-does-each-factor-really-contribute">4. Going deeper: how much does each factor really contribute?</h2>

<p>After completing the step-by-step analysis, I wanted to understand the true contribution of each factor. The cumulative table above is intuitive, but it has a methodological limitation: <strong>the attribution depends on the order in which changes are applied</strong>.</p>

<p>For example, disabling PFS in step 2, after <code class="language-plaintext highlighter-rouge">writer_threads=OFF</code> but before <code class="language-plaintext highlighter-rouge">binlog=OFF</code>, appears to recover only 0.7 percentage points. But when I measure PFS independently, by changing only PFS from the original common baseline, the contribution is actually 2.3 percentage points. In the cumulative sequence, other costs were masking it.</p>

<p>So I repeated the measurements using <strong>independent ablation</strong>: each factor is changed individually from the same common baseline. That prevents one factor from hiding or amplifying another.</p>

<p><strong>Common baseline:</strong> <code class="language-plaintext highlighter-rouge">performance_schema=1</code>, <code class="language-plaintext highlighter-rouge">log-bin=mysql-bin</code>, <code class="language-plaintext highlighter-rouge">innodb_log_writer_threads=ON</code> (8.0), <code class="language-plaintext highlighter-rouge">--db-ps-mode=disable</code>, <code class="language-plaintext highlighter-rouge">innodb_flush_log_at_trx_commit=2</code>.</p>

<table>
  <thead>
    <tr>
      <th>Factor changed</th>
      <th>5.7 TPS</th>
      <th>8.0 TPS</th>
      <th>Gap</th>
      <th>Attribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Baseline (nothing)</td>
      <td>69,818</td>
      <td>47,509</td>
      <td>-32.0%</td>
      <td>—</td>
    </tr>
    <tr>
      <td>A. <code class="language-plaintext highlighter-rouge">writer_threads=OFF</code></td>
      <td>69,818</td>
      <td>52,539</td>
      <td>-24.7%</td>
      <td>7.2 pp</td>
    </tr>
    <tr>
      <td>B. <code class="language-plaintext highlighter-rouge">performance_schema=0</code></td>
      <td>72,316</td>
      <td>50,798</td>
      <td>-29.7%</td>
      <td>2.3 pp</td>
    </tr>
    <tr>
      <td>C. <code class="language-plaintext highlighter-rouge">skip-log-bin</code></td>
      <td>116,588</td>
      <td>88,379</td>
      <td>-24.2%</td>
      <td>7.8 pp</td>
    </tr>
    <tr>
      <td>D. <code class="language-plaintext highlighter-rouge">db-ps-mode=auto</code></td>
      <td>74,406</td>
      <td>53,838</td>
      <td>-27.7%</td>
      <td>4.3 pp</td>
    </tr>
    <tr>
      <td>E. <code class="language-plaintext highlighter-rouge">flush_log_at_trx_commit=0</code></td>
      <td>81,952</td>
      <td>60,906</td>
      <td>-25.7%</td>
      <td>6.3 pp</td>
    </tr>
  </tbody>
</table>

<p>The independent factors sum to <strong>27.9 percentage points</strong>. Adding the code-level overhead (2.9 pp, measured at <code class="language-plaintext highlighter-rouge">flush=0</code>) gives <strong>30.8 pp</strong>. The remaining <strong>1.2 pp</strong> appears to come from interaction effects between factors. For example, binlog overhead amplifies the redo commit path, so removing both together saves slightly more than the sum of removing each in isolation.</p>

<p>Comparing the cumulative and independent views gives a more accurate picture:</p>

<table>
  <thead>
    <tr>
      <th>Factor</th>
      <th>Cumulative</th>
      <th>Independent</th>
      <th>Observation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">writer_threads=OFF</code></td>
      <td>7.2 pp</td>
      <td>7.2 pp</td>
      <td>Same, measured first, so no masking</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">performance_schema=0</code></td>
      <td>0.7 pp</td>
      <td><strong>2.3 pp</strong></td>
      <td>PFS is undercounted 3x in the cumulative sequence</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">skip-log-bin</code></td>
      <td>9.2 pp</td>
      <td><strong>7.8 pp</strong></td>
      <td>Binlog is overcounted in the cumulative sequence</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">db-ps-mode=auto</code></td>
      <td>7.1 pp</td>
      <td><strong>4.3 pp</strong></td>
      <td>SQL layer overhead is overcounted in the cumulative sequence</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">flush_log_at_trx_commit=0</code></td>
      <td>4.7 pp</td>
      <td><strong>6.3 pp</strong></td>
      <td>Redo overhead is undercounted in the cumulative sequence</td>
    </tr>
  </tbody>
</table>

<p>The main lesson is simple: if you want correct attribution, you need to control for interaction between variables. The same benchmarking principle applies to regression analysis itself.</p>

<h2 id="5-summary">5. Summary</h2>

<p>The <strong>-32.0%</strong> regression on <code class="language-plaintext highlighter-rouge">oltp_update_non_index</code> is not caused by one dominant bottleneck. It is the result of several layers of overhead:</p>

<table>
  <thead>
    <tr>
      <th>Factor</th>
      <th>Independent attribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Binary log overhead</td>
      <td>7.8 pp</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">innodb_log_writer_threads</code> CPU starvation</td>
      <td>7.2 pp</td>
    </tr>
    <tr>
      <td>Lock-free redo log architecture</td>
      <td>6.3 pp</td>
    </tr>
    <tr>
      <td>SQL text protocol / parser path</td>
      <td>4.3 pp</td>
    </tr>
    <tr>
      <td>Code-level overhead (5 patches)</td>
      <td>2.9 pp</td>
    </tr>
    <tr>
      <td>Performance Schema instrumentation</td>
      <td>2.3 pp</td>
    </tr>
    <tr>
      <td>Interaction effects</td>
      <td>1.2 pp</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>32.0 pp</strong></td>
    </tr>
  </tbody>
</table>

<p>The largest single factor is the binary log (7.8 pp), followed closely by log writer thread CPU starvation (7.2 pp) and redo log architecture (6.3 pp). Together, these three redo/binlog-related factors account for <strong>21.3 percentage points</strong>, roughly two-thirds of the total regression.</p>

<p>The code-level overhead (2.9 pp) comes from many small additions that are easy to ignore in isolation, an extra function call here, an extra pass there, but they add up. Database development is always about trade-offs. MySQL 8.0 added many useful capabilities, but on a workload running at 125,000+ queries per second, even small per-query costs accumulate quickly.</p>

<h2 id="appendix-a-patch-details">Appendix A: Patch details</h2>

<h3 id="patch-1--cmp_data-always_inline--loop-invariant-hoist">Patch 1:  <code class="language-plaintext highlighter-rouge">cmp_data</code> <code class="language-plaintext highlighter-rouge">ALWAYS_INLINE</code> + loop-invariant hoist</h3>

<ul>
  <li><strong>File:</strong> <code class="language-plaintext highlighter-rouge">storage/innobase/rem/rem0cmp.cc</code></li>
  <li><strong>Problem:</strong> In 8.0, <code class="language-plaintext highlighter-rouge">cmp_data()</code> grew past GCC’s auto-inline threshold because of added multi-value index support (<code class="language-plaintext highlighter-rouge">is_asc</code>, <code class="language-plaintext highlighter-rouge">DATA_MULTI_VALUE</code> assertions, <code class="language-plaintext highlighter-rouge">dfield_is_multi_value()</code> checks). The per-field comparison loop in <code class="language-plaintext highlighter-rouge">cmp_dtuple_rec_with_match_low()</code> also redundantly calls <code class="language-plaintext highlighter-rouge">dict_index_is_ibuf()</code> and checks <code class="language-plaintext highlighter-rouge">dfield_is_multi_value()</code> on every iteration.</li>
  <li><strong>Fix:</strong> Mark <code class="language-plaintext highlighter-rouge">cmp_data()</code> as <code class="language-plaintext highlighter-rouge">ALWAYS_INLINE</code>. Hoist <code class="language-plaintext highlighter-rouge">is_ibuf</code> and <code class="language-plaintext highlighter-rouge">is_mv_index</code> out of the loop.</li>
</ul>

<h3 id="patch-2--buf_flush_note_modification-always_inline">Patch 2:  <code class="language-plaintext highlighter-rouge">buf_flush_note_modification</code> <code class="language-plaintext highlighter-rouge">ALWAYS_INLINE</code></h3>

<ul>
  <li><strong>Files:</strong> <code class="language-plaintext highlighter-rouge">storage/innobase/include/buf0flu.h</code>, <code class="language-plaintext highlighter-rouge">buf0flu.ic</code></li>
  <li><strong>Problem:</strong> Both 5.7 and 8.0 support flush observers in <code class="language-plaintext highlighter-rouge">buf_flush_note_modification()</code>, but the 8.0 version grew enough that GCC no longer auto-inlines it. As a result, 8.0 emits a standalone function call on every dirty-page modification where 5.7 keeps it inlined.</li>
  <li><strong>Fix:</strong> Change <code class="language-plaintext highlighter-rouge">static inline</code> to <code class="language-plaintext highlighter-rouge">static ALWAYS_INLINE</code> in both the declaration and definition.</li>
</ul>

<h3 id="patch-3--server_store_cached_values-no-op">Patch 3:  <code class="language-plaintext highlighter-rouge">server_store_cached_values</code> no-op</h3>

<ul>
  <li><strong>File:</strong> <code class="language-plaintext highlighter-rouge">sql-common/net_serv.cc</code></li>
  <li><strong>Problem:</strong> 8.0 added <code class="language-plaintext highlighter-rouge">server_store_cached_values()</code>, which calls <code class="language-plaintext highlighter-rouge">THD::store_cached_properties(RW_STATUS)</code> on every network I/O path (packet reads, writes, async operations; 10 call sites in <code class="language-plaintext highlighter-rouge">net_serv.cc</code>). This refreshes cached THD properties that are rarely consumed. The mechanism does not exist in 5.7.</li>
  <li><strong>Fix:</strong> Replace the function body with an empty no-op.</li>
</ul>

<h3 id="patch-4--fold_condition-fast-path">Patch 4:  <code class="language-plaintext highlighter-rouge">fold_condition</code> fast path</h3>

<ul>
  <li><strong>File:</strong> <code class="language-plaintext highlighter-rouge">sql/sql_const_folding.cc</code></li>
  <li><strong>Problem:</strong> 8.0 introduced constant folding (<code class="language-plaintext highlighter-rouge">fold_condition</code>) during <code class="language-plaintext highlighter-rouge">JOIN::optimize()</code> → <code class="language-plaintext highlighter-rouge">optimize_cond()</code> → <code class="language-plaintext highlighter-rouge">remove_eq_conds()</code> on every execution, including every prepared statement re-execution. For the common shape <code class="language-plaintext highlighter-rouge">field OP literal_constant</code> , which matches every sysbench query here, the function does no useful work, but still walks the full folding logic. 5.7 has no such pass.</li>
  <li><strong>Fix:</strong> Add an early fast path in <code class="language-plaintext highlighter-rouge">fold_condition()</code> that detects <code class="language-plaintext highlighter-rouge">field OP basic_const</code> or <code class="language-plaintext highlighter-rouge">field OP param</code> and returns <code class="language-plaintext highlighter-rouge">false</code> immediately.</li>
</ul>

<h3 id="patch-5--info_low-fast-path">Patch 5:  <code class="language-plaintext highlighter-rouge">info_low</code> fast path</h3>

<ul>
  <li><strong>File:</strong> <code class="language-plaintext highlighter-rouge">storage/innobase/handler/ha_innodb.cc</code></li>
  <li><strong>Problem:</strong> <code class="language-plaintext highlighter-rouge">ha_innobase::info_low()</code> is called by the optimizer for cost estimation on every <code class="language-plaintext highlighter-rouge">UPDATE</code> and <code class="language-plaintext highlighter-rouge">DELETE</code>. In 8.0, the function became heavier because of additional statistics-related logic. The common case (<code class="language-plaintext highlighter-rouge">HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK</code>) only needs a small amount of information, but still goes through <code class="language-plaintext highlighter-rouge">update_thd()</code>, <code class="language-plaintext highlighter-rouge">op_info</code> writes, and extra helper calls.</li>
  <li><strong>Fix:</strong> Add a fast path at the top that handles <code class="language-plaintext highlighter-rouge">HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK</code> directly and returns immediately.</li>
</ul>

<h2 id="appendix-b-environment-details">Appendix B: Environment details</h2>

<ul>
  <li><strong>CPU:</strong> AMD Ryzen Threadripper PRO 3975WX, 32 cores / 64 threads, single socket</li>
  <li><strong>OS:</strong> RHEL 8.10, kernel 4.18.0-553</li>
  <li><strong>Compiler:</strong> GCC 8.5.0, <code class="language-plaintext highlighter-rouge">-O3 -g -DNDEBUG</code> (RelWithDebInfo)</li>
  <li><strong>MySQL:</strong> 5.7.44 vs 8.0.45 (KernelMaker fork)</li>
  <li><strong>CPU pinning:</strong> <code class="language-plaintext highlighter-rouge">mysqld</code> on cores 16–23, <code class="language-plaintext highlighter-rouge">sysbench</code> on cores 24–31 (physical cores, no hyperthreading overlap)</li>
  <li><strong>Data:</strong> 1 table, 50M rows (<code class="language-plaintext highlighter-rouge">sbtest1</code>), about 12 GB InnoDB tablespace per version</li>
  <li><strong>Benchmark:</strong> sysbench <code class="language-plaintext highlighter-rouge">oltp_update_non_index</code>, 8 threads, 90 seconds, <code class="language-plaintext highlighter-rouge">--report-interval=5</code></li>
</ul>

<h3 id="shared-innodb-configuration">Shared InnoDB configuration</h3>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">innodb_buffer_pool_size</span>      <span class="p">=</span> <span class="s">23G</span>
<span class="py">innodb_buffer_pool_instances</span> <span class="p">=</span> <span class="s">4</span>
<span class="py">innodb_flush_log_at_trx_commit</span> <span class="p">=</span> <span class="s">2</span>
<span class="py">innodb_flush_method</span>          <span class="p">=</span> <span class="s">O_DIRECT_NO_FSYNC</span>
<span class="py">innodb_adaptive_hash_index</span>   <span class="p">=</span> <span class="s">OFF</span>
<span class="py">innodb_io_capacity</span>           <span class="p">=</span> <span class="s">10000</span>
<span class="py">innodb_io_capacity_max</span>       <span class="p">=</span> <span class="s">20000</span>
<span class="py">innodb_page_cleaners</span>         <span class="p">=</span> <span class="s">4</span>
<span class="py">innodb_purge_threads</span>         <span class="p">=</span> <span class="s">4</span>
<span class="py">innodb_log_file_size</span>         <span class="p">=</span> <span class="s">2G</span>
<span class="py">innodb_log_files_in_group</span>    <span class="p">=</span> <span class="s">15</span>
<span class="py">innodb_log_buffer_size</span>       <span class="p">=</span> <span class="s">64M</span>
<span class="py">innodb_max_dirty_pages_pct</span>   <span class="p">=</span> <span class="s">90</span>
<span class="py">innodb_max_dirty_pages_pct_lwm</span> <span class="p">=</span> <span class="s">80</span>
<span class="py">sync_binlog</span>                  <span class="p">=</span> <span class="s">0</span>
</code></pre></div></div>

<p>8.0-only settings: <code class="language-plaintext highlighter-rouge">innodb_dedicated_server=OFF</code>, <code class="language-plaintext highlighter-rouge">innodb_idle_flush_pct=1</code>, <code class="language-plaintext highlighter-rouge">innodb_doublewrite_pages=128</code>, <code class="language-plaintext highlighter-rouge">innodb_use_fdatasync=ON</code>, <code class="language-plaintext highlighter-rouge">default_authentication_plugin=mysql_native_password</code>.</p>

<h3 id="non-patchable-80-overhead-architectural">Non-patchable 8.0 overhead (architectural)</h3>

<p>The following 8.0-specific CPU costs were visible in profiling, but they are not realistically patchable in the same way because they are either correctness-critical or fundamental to the current 8.0 architecture:</p>

<table>
  <thead>
    <tr>
      <th>Function</th>
      <th>CPU%</th>
      <th>Reason</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">locksys::Global_shared_latch_guard</code></td>
      <td>0.84%</td>
      <td>Lock sharding; correctness-critical for concurrent lock operations</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">log_buffer_reserve</code></td>
      <td>0.57%</td>
      <td>Lock-free redo design; replaces 5.7’s <code class="language-plaintext highlighter-rouge">log_sys</code> mutex</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">log_buffer_write</code> + <code class="language-plaintext highlighter-rouge">log_buffer_write_completed</code></td>
      <td>0.67%</td>
      <td>Lock-free redo design</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">log_wait_for_space_in_log_recent_closed</code></td>
      <td>0.31%</td>
      <td>Lock-free redo design</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">PolicyMutex::enter</code> (InnoDB trx fabric)</td>
      <td>1.79%</td>
      <td>Transaction infrastructure mutexes spread across multiple internal subsystems</td>
    </tr>
  </tbody>
</table>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[The performance regression in MySQL 8.0 is well known, but it is still not fully understood. That is because it is not a regression caused by one obvious bottleneck. MySQL 8.0 introduced many new designs and refactored major subsystems, so the gap comes from a combination of configuration defaults, architectural trade-offs, and many small overheads spread across different layers.]]></summary>
  </entry>
  <entry>
    <title type="html">MySQL vs PostgreSQL Internals (Part 2) — MVCC (Multi-version Concurrency Control)</title>
    <link href="https://kernelmaker.github.io/mysql-vs-pg-mvcc" rel="alternate" type="text/html" title="MySQL vs PostgreSQL Internals (Part 2) — MVCC (Multi-version Concurrency Control)"/>
    <published>2026-03-10T00:00:00+00:00</published>
    <updated>2026-03-10T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/mysql-vs-pg-mvcc</id>
    <content type="html" xml:base="https://kernelmaker.github.io/mysql-vs-pg-mvcc"><![CDATA[<p>In the previous <a href="https://kernelmaker.github.io/mysql-vs-pg-bufferpool">post</a>, I took a detailed look at how MySQL and PostgreSQL differ in their buffer pool design and implementation. In this post, I will continue with a detailed comparison of their MVCC implementations.</p>

<h2 id="the-role-of-mvcc">The Role of MVCC</h2>

<p>MVCC (Multi-version Concurrency Control) is a common mechanism used in transactional databases to resolve read–write conflicts. The core idea is that when a transaction modifies data, it does not overwrite the original data directly. Instead, it preserves the previous version while creating a new version of the record. As a result, all historical versions of a record are retained in the database.</p>

<p>The key benefit is that read and write transactions on the same record no longer need to block each other. Even if a write transaction has modified a record but not yet committed, a read transaction can still directly read the version that is visible to it from the historical versions.</p>

<p>The simplified principle is illustrated below:</p>

<p><img src="/public/images/2026-03-10/1.png" alt="image-1" /></p>

<p>The record with primary key PK is modified three times; each modification creates a new version, so all historical versions of the record are retained in the database.</p>

<p>So what is the benefit of retaining all historical versions? Consider the following example:</p>

<p><img src="/public/images/2026-03-10/2.png" alt="image-2" /></p>

<p>Three write transactions A, B, and C modify the same PK sequentially on the timeline, while three read transactions X, Y, and Z interleave with them and read the PK. Without MVCC, read transactions must block until write transactions commit and release locks. With MVCC, when read transaction X attempts to read the PK, the state of the PK is:</p>

<ol>
  <li>Write transaction A inserted PK: ‘aaa’ and has committed</li>
  <li>Write transaction B modified it to PK: ‘bbb’ but has not yet committed</li>
</ol>

<p>Under RC (Read Committed) and RR (Repeatable Read) isolation levels, PK: ‘bbb’ is not visible to transaction X. Because MVCC preserves the old version, transaction X can directly read the visible version PK: ‘aaa’ and ignore the currently running write transaction B. This greatly improves read–write concurrency.</p>
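<p>The example above can be sketched in a toy model (the structures and names here are illustrative only, not actual server code): each version carries the id of the transaction that wrote it, and a reader picks the newest version whose writer had committed when the reader's snapshot was taken.</p>

```python
# Toy MVCC model: versions are tagged with the writer's transaction id;
# a reader sees the newest version whose writer had already committed
# at snapshot time, ignoring uncommitted writers entirely.

def visible_version(versions, committed_at_snapshot):
    """versions: list of (writer_txid, value), oldest first."""
    result = None
    for txid, value in versions:
        if txid in committed_at_snapshot:
            result = value  # keep the newest committed version seen so far
    return result

# Trx A (id 1) committed 'aaa'; Trx B (id 2) wrote 'bbb' but is still active.
versions = [(1, "aaa"), (2, "bbb")]
print(visible_version(versions, committed_at_snapshot={1}))  # aaa
```

Reader X never blocks on writer B: it simply skips B's uncommitted version and returns the visible one.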

<p>As an essential capability of databases, MVCC is supported by both MySQL and PostgreSQL. Fundamentally, they both achieve the behavior described above, but their designs and implementations make different trade-offs.</p>

<p>In the following sections, I will compare their implementations in detail across three aspects:</p>

<ol>
  <li>Organization of multiple versions</li>
  <li>Visibility checks for multiple versions</li>
  <li>Garbage collection of old versions</li>
</ol>

<h1 id="1-organization-of-multiple-versions">1. Organization of Multiple Versions</h1>

<h2 id="postgresql">PostgreSQL</h2>

<p>In PostgreSQL, a tuple and all its historical versions reside in the heap, as shown below:</p>

<p><img src="/public/images/2026-03-10/3.png" alt="image-3" /></p>

<p>The leaf nodes of the nbtree index store the PK fields of the tuple and point to the actual location of the tuple in the heap (TID). Following the Heap TID leads to the tuple, which contains the full data including PK fields and value fields.</p>

<p>Each index tuple points to the <strong>oldest version</strong> in its corresponding HOT chain. When the tuple is modified, the old version is not changed. Instead, a new version is created with the same PK fields but different value fields. The <code class="language-plaintext highlighter-rouge">ctid</code> field of the old tuple points to the location of the next version. The latest version’s <code class="language-plaintext highlighter-rouge">ctid</code> points to itself.</p>

<p>In short, every version of a PostgreSQL tuple is a complete tuple containing all fields. Starting from the index tuple, the version chain is linked from old to new through the <code class="language-plaintext highlighter-rouge">ctid</code> field.</p>

<p>It is important to note that the version chain may become ‘<strong>broken</strong>’, as shown below:</p>

<p><img src="/public/images/2026-03-10/4.png" alt="image-4" /></p>

<p>The PK fields remain unchanged, but the value fields are modified multiple times. Each modification generates a new full version stored in the same heap page as the old version, as shown for versions 1, 2, and 3. Because they reside in the same heap page, advancing through the chain via <code class="language-plaintext highlighter-rouge">ctid</code> is very cheap (no additional heap page lock is required). This is the type of chain that can be efficiently traversed for version lookups, known as a <strong>HOT chain</strong> in PostgreSQL.</p>

<p>A HOT chain requires two conditions:</p>

<ol>
  <li>The new version can fit into the same heap page</li>
  <li>The modified columns do not include any indexed columns</li>
</ol>

<p>If either condition is not met, the HOT chain breaks.</p>
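<p>The two conditions can be expressed as a small predicate (a minimal sketch with made-up names, not PostgreSQL's actual heapam internals):</p>

```python
# Toy check for the two HOT-update conditions (illustrative only):
# the new version must fit on the same heap page, and no indexed
# column may be among the modified columns.

def hot_update_possible(new_tuple_size, free_space_on_page,
                        modified_columns, indexed_columns):
    fits = new_tuple_size <= free_space_on_page
    no_index_change = not (set(modified_columns) & set(indexed_columns))
    return fits and no_index_change

print(hot_update_possible(64, 200, {"value"}, {"pk"}))  # True
print(hot_update_possible(64, 32,  {"value"}, {"pk"}))  # False: page full
print(hot_update_possible(64, 200, {"pk"},    {"pk"}))  # False: indexed column changed
```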

<p>As modifications continue, starting from version 4, the original heap page (heap page 1) can no longer accommodate the new tuple. The new tuple is therefore stored in another page (heap page 2). Although version 3’s <code class="language-plaintext highlighter-rouge">ctid</code> still points to version 4, the HOT chain effectively ends at version 3.</p>

<p>When traversing a HOT chain, the reader will <strong>not follow <code class="language-plaintext highlighter-rouge">ctid</code> across heap pages</strong>. Instead, it stops at the end of the chain. The reason is that both the latest and historical versions of tuples reside in heap pages. If the reader followed <code class="language-plaintext highlighter-rouge">ctid</code> from page 1 to page 2, it would hold a read lock on heap page 1 and attempt to acquire a read lock on heap page 2. Because both pages are heap pages and there is no defined lock ordering between them, another backend might hold the lock on page 2 and attempt to acquire the lock on page 1, leading to a deadlock.</p>

<p>At this point, the HOT chain is considered broken.</p>

<p>How, then, does PostgreSQL transition from version 3 to version 4?</p>

<p>The implementation is to insert a new index tuple into the nbtree index with the same PK fields pointing to version 4, starting a new HOT chain. If a read operation traverses the version chain and finds versions 1, 2, and 3 all invisible, it stops the current HOT chain traversal and returns to the index layer. It then proceeds to the next index tuple (the one pointing to version 4).</p>

<p>This design leads to an interesting and somewhat counterintuitive behavior: <strong>multiple index tuples with identical PK fields may coexist in the nbtree index.</strong></p>

<h2 id="mysql">MySQL</h2>

<p>In MySQL, data is stored directly in the <strong>clustered index (B+Tree)</strong> leaf nodes. This is an important difference from PostgreSQL: MySQL does not have a heap.</p>

<p>The second difference is that the record stored in the clustered index and its historical versions reside in different places. Historical versions are not stored in the clustered index. Instead, old values are stored in <strong>undo records</strong> in the undo space. When needed, historical versions are reconstructed by applying the undo records to the current record.</p>

<p>The third difference is that an undo record does not store a full copy of the record. It only stores the <strong>old values of the columns modified in the operation</strong>.</p>

<p>The fourth difference is the direction of the version chain. In MySQL, the clustered index always stores the <strong>latest version</strong>. Before a record is modified, the old values of the columns being changed (together with the PK fields) are copied to the undo space, and the record is updated <strong>in place</strong>.</p>

<p>As shown below:</p>

<p><img src="/public/images/2026-03-10/5.png" alt="image-5" /></p>

<p>The clustered index record contains two system fields: <code class="language-plaintext highlighter-rouge">TRX_ID</code> and <code class="language-plaintext highlighter-rouge">ROLL_PTR</code>.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">TRX_ID</code> records the transaction ID that last modified the record and is used for MVCC visibility checks.</li>
  <li><code class="language-plaintext highlighter-rouge">ROLL_PTR</code> links the version chain.</li>
</ul>

<p>Similar to PostgreSQL’s <code class="language-plaintext highlighter-rouge">ctid</code>, <code class="language-plaintext highlighter-rouge">ROLL_PTR</code> links versions together, but the direction is opposite:
 <code class="language-plaintext highlighter-rouge">ctid</code> points <strong>from old to new</strong>, while <code class="language-plaintext highlighter-rouge">ROLL_PTR</code> points <strong>from new to old</strong>.</p>

<p>In the figure, the record was modified three times:</p>

<ol>
  <li>Field 2 was modified</li>
  <li>Field 2 was modified again</li>
  <li>Field 3 was modified</li>
</ol>

<p>Therefore, the clustered index record stores the latest version after the three modifications. Through <code class="language-plaintext highlighter-rouge">ROLL_PTR</code>, it points to the previous version stored in the undo space (the version before Field 3 was modified), and so on.</p>
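<p>Because each undo record stores only the old values of the modified columns, an old version is rebuilt by walking the <code class="language-plaintext highlighter-rouge">ROLL_PTR</code> chain from the latest record and applying each delta in turn. A toy sketch (illustrative dictionaries, not InnoDB's actual record format):</p>

```python
# Toy reconstruction of an old version from undo deltas. Each undo
# record holds only the old values of the columns changed by that
# modification; applying deltas newest-first walks the chain backward.

def rebuild_version(latest, undo_chain, steps_back):
    record = dict(latest)
    for delta in undo_chain[:steps_back]:  # newest delta first
        record.update(delta)               # restore the old column values
    return record

latest = {"pk": 1, "f2": "v2''", "f3": "v3'"}
undo_chain = [
    {"f3": "v3"},    # undo of modification 3 (Field 3 changed)
    {"f2": "v2'"},   # undo of modification 2 (Field 2 changed)
    {"f2": "v2"},    # undo of modification 1 (Field 2 changed)
]
print(rebuild_version(latest, undo_chain, 1))  # version before change 3
print(rebuild_version(latest, undo_chain, 3))  # original version
```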

<h2 id="summary">Summary</h2>

<p>The differences in version organization between PostgreSQL and MySQL can be summarized in three contrasts:</p>

<ol>
  <li>Versions mixed together vs latest version and historical versions stored in different spaces</li>
  <li>Old versions contain full tuples vs old versions mainly store the primary key and old values of modified columns</li>
  <li>Version chain ordered from old to new vs from new to old</li>
</ol>

<h1 id="2-visibility-checks-for-multiple-versions">2. Visibility Checks for Multiple Versions</h1>

<p>Once multiple versions exist, the next question is: <strong>how does a read transaction determine which version it should see?</strong></p>

<p>This is the core of MVCC: <strong>visibility checks</strong>.</p>

<p>To determine visibility, the database must establish an order among transactions. Taking the RR isolation level as an example, when a transaction begins, it must know which write transactions are currently active in the system. All modifications produced by those active transactions are invisible to the read transaction. Only modifications from transactions that were already committed at that moment are visible.</p>

<p>Therefore, if the database can define an order among write transactions, it becomes straightforward to perform visibility checks.</p>

<p>Most databases achieve this by using a <strong>globally increasing transaction ID</strong>. When a write transaction is created, it obtains the current maximum transaction ID plus one. This naturally orders write transactions.</p>

<p>Once transaction IDs exist, each data modification can be tagged with the transaction ID that produced it. A read transaction, when created, obtains the list of currently active write transactions. Later, when reading data, it simply compares the transaction ID recorded on the data with this list and applies the visibility rules to determine whether the data is visible.</p>

<p>Both PostgreSQL and MySQL follow this approach.</p>

<h2 id="postgresql-1">PostgreSQL</h2>

<p>As mentioned earlier, transaction IDs are critical. In PostgreSQL, the globally increasing transaction ID is called <strong><code class="language-plaintext highlighter-rouge">nextXid</code></strong>.</p>

<p>Each write transaction obtains the latest value when it starts.</p>

<p><img src="/public/images/2026-03-10/6.png" alt="image-6" /></p>

<p>Transaction A is created first and obtains xid 7. It inserts PK: ‘aaa’. The tuple records this through the <code class="language-plaintext highlighter-rouge">xmin</code> field, which stores the inserter’s transaction ID (7).</p>

<p>After transaction A commits, transaction B is created and obtains xid 11. It updates the record to ‘bbb’. Following the multi-version rule, transaction B does not overwrite the tuple inserted by A. Instead, it creates a new version. The old tuple’s <code class="language-plaintext highlighter-rouge">xmax</code> is set to 11, indicating that transaction 11 has “deleted” this tuple version. The new tuple records <code class="language-plaintext highlighter-rouge">xmin = 11</code>. The old tuple’s <code class="language-plaintext highlighter-rouge">ctid</code> points to the new tuple.</p>

<p>Transaction C proceeds similarly.</p>

<p><img src="/public/images/2026-03-10/7.png" alt="image-7" /></p>

<p>Thus, each PostgreSQL tuple contains two fields recording the related transactions:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">xmin</code> : the inserter</li>
  <li><code class="language-plaintext highlighter-rouge">xmax</code> : the deleter</li>
</ul>

<p>With the global transaction ID (<code class="language-plaintext highlighter-rouge">nextXid</code>) and the transaction tags (<code class="language-plaintext highlighter-rouge">xmin</code>, <code class="language-plaintext highlighter-rouge">xmax</code>) on each tuple, the next requirement is the <strong>snapshot</strong> used by read transactions for visibility checks.</p>

<p><img src="/public/images/2026-03-10/8.png" alt="image-8" /></p>

<p>At the top of the figure are the globally increasing transaction IDs and the currently active write transactions. The next ID to allocate is 16. Among all assigned IDs, transactions ≤7 have already committed. Between 8 and 15, some have committed, and the currently active write transactions are 8, 11, 12, and 14.</p>

<p>If a read transaction starts now, it obtains a snapshot:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">xmin</code> : the smallest active transaction ID (8)</li>
  <li><code class="language-plaintext highlighter-rouge">xmax</code> : the next transaction ID to allocate (16)</li>
  <li><code class="language-plaintext highlighter-rouge">xids[]</code> : the list of active transactions</li>
</ul>

<p>With this snapshot, it can determine whether a tuple’s transaction tag is visible. For example:</p>

<ul>
  <li>If a tuple has <code class="language-plaintext highlighter-rouge">xmin = 7</code>, it is visible to the snapshot.</li>
  <li>If a tuple has <code class="language-plaintext highlighter-rouge">xmin = 14</code>, it is not visible.</li>
</ul>

<p>Now that we know how to determine the visibility of transaction tags, the final question is how to determine whether a tuple itself is visible, given that it has both <code class="language-plaintext highlighter-rouge">xmin</code> and <code class="language-plaintext highlighter-rouge">xmax</code>.</p>

<p>The core principle is:</p>

<blockquote>
  <p>A tuple is visible to a snapshot if its inserter (<code class="language-plaintext highlighter-rouge">xmin</code>) is visible and its deleter (<code class="language-plaintext highlighter-rouge">xmax</code>) is not visible.</p>
</blockquote>

<p><img src="/public/images/2026-03-10/9.png" alt="image-9" /></p>

<p>The process is:</p>

<ol>
  <li>Check <code class="language-plaintext highlighter-rouge">xmin</code>. If <code class="language-plaintext highlighter-rouge">xmin</code> is not visible, the tuple is invisible.</li>
  <li>If <code class="language-plaintext highlighter-rouge">xmin</code> is visible, check <code class="language-plaintext highlighter-rouge">xmax</code>.</li>
  <li>If <code class="language-plaintext highlighter-rouge">xmax</code> is also visible, the tuple has been deleted in the snapshot and is therefore invisible.</li>
  <li>If <code class="language-plaintext highlighter-rouge">xmax</code> is not visible, the tuple is visible.</li>
</ol>
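<p>The two-level check, first the transaction tag against the snapshot and then the <code class="language-plaintext highlighter-rouge">xmin</code>/<code class="language-plaintext highlighter-rouge">xmax</code> rule, can be sketched as follows (a simplification that ignores hint bits, subtransactions, and xid wraparound):</p>

```python
# Toy PostgreSQL-style visibility check (illustrative only).

def xid_visible(xid, snap_xmin, snap_xmax, active_xids):
    if xid < snap_xmin:
        return True            # committed before the snapshot
    if xid >= snap_xmax:
        return False           # allocated after the snapshot
    return xid not in active_xids

def tuple_visible(xmin, xmax, snap_xmin, snap_xmax, active_xids):
    if not xid_visible(xmin, snap_xmin, snap_xmax, active_xids):
        return False           # inserter not visible
    if xmax is None:
        return True            # never deleted
    # visible inserter + invisible deleter => tuple is visible
    return not xid_visible(xmax, snap_xmin, snap_xmax, active_xids)

# Snapshot from the figure: xmin=8, xmax=16, active = {8, 11, 12, 14}
snap = (8, 16, {8, 11, 12, 14})
print(tuple_visible(7, None, *snap))   # True:  inserter 7 committed
print(tuple_visible(14, None, *snap))  # False: inserter 14 still active
print(tuple_visible(7, 9, *snap))      # False: deleted by committed trx 9
```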

<p>Finally, the following figure shows the process of locating a tuple visible to a snapshot starting from the nbtree index:</p>

<p><img src="/public/images/2026-03-10/10.png" alt="image-10" /></p>

<h2 id="mysql-1">MySQL</h2>

<p>MySQL also has a globally increasing transaction ID called <strong><code class="language-plaintext highlighter-rouge">next_trx_id_or_no</code></strong>.</p>

<p><img src="/public/images/2026-03-10/11.png" alt="image-11" /></p>

<p>In the example, three transactions modify the same record three times. Both the clustered index record and the undo records contain a <code class="language-plaintext highlighter-rouge">TRX_ID</code> field. This field is the transaction tag used by MySQL. The <code class="language-plaintext highlighter-rouge">TRX_ID</code> records which transaction created that version of the record.</p>

<p>Unlike PostgreSQL, a record in MySQL has <strong>only one transaction tag</strong>, <code class="language-plaintext highlighter-rouge">TRX_ID</code>, rather than two. The reason will be explained later.</p>

<p>Next, consider the visibility check.</p>

<p><img src="/public/images/2026-03-10/12.png" alt="image-12" /></p>

<p>MySQL’s <strong>ReadView</strong> is extremely similar to PostgreSQL’s snapshot and serves the same purpose. The only difference is that MySQL’s ReadView contains an additional field: <code class="language-plaintext highlighter-rouge">m_creator_trx_id</code>.</p>

<p>This field is necessary because the transaction that creates the ReadView is itself included in <code class="language-plaintext highlighter-rouge">m_ids[]</code> (since it is an active transaction). Without <code class="language-plaintext highlighter-rouge">m_creator_trx_id</code>, the transaction would not be able to see its own modifications. It also handles cases where a read transaction is promoted to a write transaction.</p>

<p>Aside from this, the visibility rules are almost identical.</p>

<p>Given the visibility rules, determining whether a record is visible to a ReadView becomes straightforward:</p>

<p><img src="/public/images/2026-03-10/13.png" alt="image-13" /></p>

<p>The process is simple: check whether the record’s <code class="language-plaintext highlighter-rouge">TRX_ID</code> is visible to the ReadView. Unlike PostgreSQL, MySQL does not need two separate checks for <code class="language-plaintext highlighter-rouge">xmin</code> and <code class="language-plaintext highlighter-rouge">xmax</code>.</p>
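<p>A sketch of the single-tag check, including the <code class="language-plaintext highlighter-rouge">m_creator_trx_id</code> special case (a simplified model of the ReadView idea; the watermark field names follow InnoDB's convention, but this is not the actual server code):</p>

```python
# Toy InnoDB-style ReadView check. m_up_limit_id is the smallest active
# trx id at view creation; m_low_limit_id is the next id to allocate.

def trx_visible(trx_id, m_up_limit_id, m_low_limit_id, m_ids, m_creator_trx_id):
    if trx_id == m_creator_trx_id:
        return True              # a transaction always sees its own changes
    if trx_id < m_up_limit_id:
        return True              # committed before the view was created
    if trx_id >= m_low_limit_id:
        return False             # started after the view was created
    return trx_id not in m_ids   # in between: visible iff not active

# View: smallest active id 8, next id 16, active {8, 11, 12}, creator 11
print(trx_visible(7,  8, 16, {8, 11, 12}, 11))  # True
print(trx_visible(11, 8, 16, {8, 11, 12}, 11))  # True: creator sees itself
print(trx_visible(12, 8, 16, {8, 11, 12}, 11))  # False
```

Without the creator check, trx 11 would find itself in <code class="language-plaintext highlighter-rouge">m_ids[]</code> and judge its own modifications invisible.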

<p>Finally, the process of finding a version visible to a ReadView starting from the B+Tree is shown below:</p>

<p><img src="/public/images/2026-03-10/14.png" alt="image-14" /></p>

<h2 id="summary-1">Summary</h2>

<p>PostgreSQL and MySQL use highly similar visibility mechanisms. The primary difference is that PostgreSQL stores two transaction tags (<code class="language-plaintext highlighter-rouge">xmin</code> and <code class="language-plaintext highlighter-rouge">xmax</code>) on each tuple, requiring two checks. MySQL stores only one (<code class="language-plaintext highlighter-rouge">TRX_ID</code>), requiring only one check.</p>

<p>Why is this the case? The fundamental reason is the <strong>direction of the version chain</strong>:</p>

<ol>
  <li>PostgreSQL’s version chain goes from old to new. In theory, <code class="language-plaintext highlighter-rouge">xmin</code> alone would be sufficient because it records the inserter. However, when traversing the chain, the reader cannot stop immediately after finding a visible insert because the next version might also be visible. The reader must continue until it finds the first version whose insert is invisible. The previous version is then the visible version. This means at least one extra step is required. PostgreSQL therefore stores the insert transaction of the next version as the deleter (<code class="language-plaintext highlighter-rouge">xmax</code>) of the current version, avoiding that extra traversal. Additionally, <code class="language-plaintext highlighter-rouge">xmax</code> is required for DELETE operations where no next version exists.</li>
  <li>MySQL’s version chain goes from new to old. Once the latest version is found to be invisible, the reader simply moves to the previous version until it finds the first visible one. No additional step is required.</li>
</ol>

<h1 id="3-garbage-collection-of-multiple-versions">3. Garbage Collection of Multiple Versions</h1>

<p>The next core problem in MVCC is <strong>garbage collection of historical versions</strong>. Historical versions do not need to be kept forever.</p>

<p>Because global transaction IDs advance linearly, the snapshots (or ReadViews) of read transactions also move forward. A historical version can be safely purged when:</p>

<blockquote>
  <p>No active snapshot or ReadView in the system still needs that version (i.e., all snapshots can already see the newer version that replaced it).</p>
</blockquote>

<p>This is where PostgreSQL and MySQL differ most significantly.</p>

<hr />

<h2 id="postgresql-2">PostgreSQL</h2>

<p>PostgreSQL reclaims historical versions through the <strong>Vacuum backend</strong>.</p>

<p><img src="/public/images/2026-03-10/15.png" alt="image-15" /></p>

<p>PostgreSQL uses <code class="language-plaintext highlighter-rouge">GlobalVisState</code> to track purge boundaries. It contains two variables:</p>

<p><strong>maybe_needed</strong></p>

<p>This is the minimum value among all backend transaction IDs and the <code class="language-plaintext highlighter-rouge">xmin</code> values of their snapshots. Backend transaction IDs must be considered because a backend may have started a write transaction and obtained an xid but not yet created a snapshot. That xid still forms a lower bound that cannot be crossed.</p>

<p>All tuple <code class="language-plaintext highlighter-rouge">xmax</code> values (deleters) are compared against <code class="language-plaintext highlighter-rouge">maybe_needed</code>. If <code class="language-plaintext highlighter-rouge">xmax</code> is smaller than <code class="language-plaintext highlighter-rouge">maybe_needed</code>, the deleter is visible to all backends and snapshots, meaning the tuple is globally deleted and can be safely purged.</p>

<p><strong>definitely_needed</strong></p>

<p>This is the <code class="language-plaintext highlighter-rouge">xmin</code> of the latest snapshot taken by the Vacuum backend. Any tuple whose <code class="language-plaintext highlighter-rouge">xmax</code> is greater than or equal to <code class="language-plaintext highlighter-rouge">definitely_needed</code> is invisible to the Vacuum backend and cannot be purged.</p>

<p>These two values define the continuous upper bound that can be purged and the lower bound that cannot. For tuples whose <code class="language-plaintext highlighter-rouge">xmax</code> falls between these bounds, Vacuum may need to refresh <code class="language-plaintext highlighter-rouge">maybe_needed</code> and re-evaluate, since the snapshot used by Vacuum might be outdated. Because refreshing is expensive, PostgreSQL optimizes this by checking whether <code class="language-plaintext highlighter-rouge">RecentXmin</code> has advanced. If it has not changed, refreshing is skipped.</p>
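<p>The three-way decision can be sketched as follows (illustrative only; the real logic lives in PostgreSQL's GlobalVisState test routines and may refresh the bounds before re-checking):</p>

```python
# Toy sketch of the GlobalVisState purge decision for a tuple's xmax.

def removable(xmax, maybe_needed, definitely_needed):
    if xmax is None:
        return False          # tuple was never deleted: keep
    if xmax < maybe_needed:
        return True           # deleter visible to all backends: purgeable
    if xmax >= definitely_needed:
        return False          # deleter too recent: must keep
    return None               # in between: refresh bounds and re-evaluate

print(removable(5,  10, 20))  # True
print(removable(25, 10, 20))  # False
print(removable(15, 10, 20))  # None (re-evaluate)
```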

<p>With these rules, the workflow of the Vacuum backend is:</p>

<p><img src="/public/images/2026-03-10/16.png" alt="image-16" /></p>

<ol>
  <li>Scan all heap tuples and determine whether they can be purged using <code class="language-plaintext highlighter-rouge">GlobalVisState</code>. Collect purgeable tuples into a set.</li>
  <li>Scan all index tuples and check whether they reference heap tuples in the purge set. If so, delete those index tuples.</li>
  <li>Scan the pages containing the dead tuples collected in step 1 a second time and reclaim them (setting their line pointers to <code class="language-plaintext highlighter-rouge">LP_UNUSED</code>).</li>
</ol>

<p>This process involves extensive scanning of both heap and index structures, which can be expensive. PostgreSQL mitigates this cost with several optimizations:</p>

<ol>
  <li><strong>Visibility map</strong> – allows the first scan to skip pages where all tuples are visible.</li>
  <li><strong>HOT pruning</strong> – during normal reads of heap pages, PostgreSQL opportunistically removes dead tuples through <code class="language-plaintext highlighter-rouge">heap_page_prune()</code>, reducing the workload of Vacuum.</li>
  <li><strong>LP_REDIRECT</strong> – when intermediate versions in a HOT chain are removed, the head line pointer is redirected to the surviving tuple instead of being marked unused, so existing index tuples can still locate the correct tuple without index updates.</li>
</ol>

<h2 id="mysql-2">MySQL</h2>

<p>MySQL takes a different approach.</p>

<p>All undo records (historical versions) are grouped by the transactions that produced them. These transactions are then organized according to their <strong>global commit order</strong> (forming a min-heap).</p>

<p>With this ordering, MySQL can quickly identify the undo records belonging to the <strong>earliest committed transaction</strong>, which are typically the closest candidates for purging.</p>

<p>The purge thread compares the transaction number (<code class="language-plaintext highlighter-rouge">trx_no</code>) of the earliest transaction in the history list with <code class="language-plaintext highlighter-rouge">m_low_limit_no</code> from the purge view.</p>

<ul>
  <li>If <code class="language-plaintext highlighter-rouge">trx_no &lt; m_low_limit_no</code>, all active ReadViews can see this transaction’s commit, so its undo records are no longer needed and can be safely purged.</li>
  <li>Otherwise, it cannot be purged. Since it is the earliest transaction, later ones cannot be purged either, so the purge process stops and waits.</li>
</ul>
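<p>The purge loop reduces to popping a min-heap keyed by commit order until the comparison fails (a toy model of the idea, not InnoDB's actual purge queue):</p>

```python
import heapq

# Toy purge loop: committed transactions' undo records are ordered by
# trx_no in a min-heap; purge while the oldest commit is visible to
# every active ReadView, then stop at the first one that is not.

def purge(history, m_low_limit_no):
    """history: heapified list of (trx_no, undo_records)."""
    purged = []
    while history and history[0][0] < m_low_limit_no:
        trx_no, _undo = heapq.heappop(history)
        purged.append(trx_no)        # all ReadViews see this commit
    return purged                    # oldest remaining trx_no blocks purge

history = [(3, "undo-a"), (5, "undo-b"), (9, "undo-c")]
heapq.heapify(history)
print(purge(history, 6))  # [3, 5] -- trx_no 9 must wait
```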

<p>An important optimization is that transactions are ordered by <strong>commit order rather than creation order</strong>.</p>

<p>Sorting by creation order would be safe because the earliest transaction must be purged first. However, it has a drawback: if the earliest transaction does not commit for a long time, later transactions that have already committed cannot be purged even if they are no longer needed.</p>

<p>For example:</p>

<ol>
  <li>Trx A is created and modifies record R1 from ‘111’ to ‘222’</li>
  <li>Trx B is created and modifies record R2 from ‘aaa’ to ‘bbb’</li>
  <li>Read-only Trx X starts. Since Trx B has not committed, X sees R2 as ‘aaa’</li>
  <li>Trx B commits</li>
  <li>Trx X commits</li>
</ol>

<p>If transactions were ordered by creation time, Trx A would come before Trx B. Because Trx A has not committed, purge would be blocked and Trx B’s undo records could not be purged, even though no ReadView needs them anymore.</p>

<p>By ordering transactions by commit time instead, MySQL can purge Trx B’s undo records immediately after its commit.</p>

<p>This is an important optimization. Notably, <code class="language-plaintext highlighter-rouge">trx_id</code> and <code class="language-plaintext highlighter-rouge">trx_no</code> both come from the same global variable: <code class="language-plaintext highlighter-rouge">next_trx_id_or_no</code>.</p>

<p>The workflow is shown below:</p>

<p><img src="/public/images/2026-03-10/17.png" alt="image-17" /></p>

<p>The purge thread first clones the oldest active ReadView in the system. The <code class="language-plaintext highlighter-rouge">m_low_limit_no</code> in this ReadView represents the smallest <code class="language-plaintext highlighter-rouge">trx_no</code> that was still committing when the view was created. All transactions with smaller <code class="language-plaintext highlighter-rouge">trx_no</code> values have already committed.</p>

<p>In the undo space, committed transactions’ undo records are linked together in the history list in commit order (ascending <code class="language-plaintext highlighter-rouge">trx_no</code>). The purge thread simply compares <code class="language-plaintext highlighter-rouge">m_low_limit_no</code> with the smallest <code class="language-plaintext highlighter-rouge">trx_no</code> in the history list to determine whether purging is possible.</p>

<h1 id="summary-2">Summary</h1>

<p>Garbage collection of historical versions is a major implementation difference between PostgreSQL and MySQL.</p>

<p>In fact, it reflects their different design philosophies. This difference was already visible in the previous post discussing buffer pools.</p>

<p>MySQL tends to favor <strong>precise control and ordered structures</strong>, such as the LRU list and flush list, which allow it to quickly identify the oldest pages that can be evicted or flushed. Similarly, undo purge maintains ordered historical versions so that the oldest purgeable undo records can be quickly located.</p>

<p>PostgreSQL, on the other hand, tends to rely more on <strong>global scanning</strong> mechanisms, both in shared buffers and in Vacuum. In the buffer pool case, the cost of global scanning is relatively low because it scans descriptor arrays in memory. However, Vacuum must scan heap and index disk pages (although visibility maps can skip many all-visible pages). For frequently updated tables, the amount of scanning can still be substantial.</p>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[In the previous post, I took a detailed look at how MySQL and PostgreSQL differ in their buffer pool design and implementation. In this post, I will continue with a detailed comparison of their MVCC implementations.]]></summary>
  </entry>
  <entry>
    <title type="html">MySQL vs PostgreSQL Internals (Part 1) – Buffer Pool</title>
    <link href="https://kernelmaker.github.io/mysql-vs-pg-bufferpool" rel="alternate" type="text/html" title="MySQL vs PostgreSQL Internals (Part 1) – Buffer Pool"/>
    <published>2026-02-16T00:00:00+00:00</published>
    <updated>2026-02-16T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/mysql-vs-pg-bufferpool</id>
<content type="html" xml:base="https://kernelmaker.github.io/mysql-vs-pg-bufferpool"><![CDATA[<p>The debate over “MySQL vs PostgreSQL, which one is better?” has been around for a long time. Both are outstanding representatives of open-source OLTP databases, and I personally don’t think either overwhelmingly dominates the other. Transactional database theory has been stable for decades; both systems are practical implementations built within the same theoretical framework.</p>

<p>The differences mainly come from <strong>different trade-offs made during engineering practice</strong>. I’ve always believed that database development is the art of trade-offs. So I’m planning a series that compares MySQL and PostgreSQL from the perspective of kernel design and implementation, focusing on the different trade-offs they make when pursuing similar goals.</p>

<p>As the first article in this series, I’ll start with the design and implementation differences of the <strong>Buffer Pool</strong>.</p>

<h2 id="comparison-dimensions">Comparison Dimensions</h2>

<p>The Buffer Pool in MySQL and the corresponding module in PostgreSQL (commonly referred to as <strong>Shared Buffers</strong>) are critical subsystems. Their primary job is to cache on-disk data pages in memory to minimize disk I/O as much as possible, and they are therefore a major factor in relational database performance.</p>

<p>In essence, each is a huge hash table:</p>

<ul>
  <li>The key corresponds to a specific on-disk data page.</li>
  <li>The value is a pointer (or index) to the in-memory representation of that page.</li>
</ul>

<p>In the following sections, I compare MySQL and PostgreSQL buffer pool designs from these aspects:</p>

<ol>
  <li><strong>Hash table structure and implementation</strong></li>
  <li><strong>Eviction policy for old pages and its implementation</strong></li>
  <li><strong>Dirty page flushing strategy and its implementation</strong></li>
</ol>

<h2 id="1-hash-table">1. Hash Table</h2>

<h3 id="mysql">MySQL</h3>

<p><img src="/public/images/2026-02-16/1.png" alt="image-1" /></p>

<p>MySQL’s buffer pool is not backed by a single hash table; it uses <strong>multiple</strong> hash tables. As illustrated conceptually:</p>

<ol>
  <li>
    <p>Multiple <code class="language-plaintext highlighter-rouge">buf_pool_t</code> instances shard one large buffer pool. Each <code class="language-plaintext highlighter-rouge">buf_pool_t</code> maintains its own hash table.</p>
  </li>
  <li>
    <p>The hash key is <code class="language-plaintext highlighter-rouge">(space_id, page_no)</code>, identifying a specific page within a data file (tablespace). During lookup:</p>

    <ul>
      <li>First, it computes a hash using <code class="language-plaintext highlighter-rouge">(space_id, page_no &gt;&gt; 6)</code> to locate the corresponding <code class="language-plaintext highlighter-rouge">buf_pool_t</code> instance.</li>
      <li>Why shift <code class="language-plaintext highlighter-rouge">page_no &gt;&gt; 6</code>? Because MySQL tries to place <strong>64 consecutive pages</strong> under the same <code class="language-plaintext highlighter-rouge">space_id</code> into the same <code class="language-plaintext highlighter-rouge">buf_pool_t</code>. This helps in two ways:
        <ul>
          <li>During reads, it enables read-ahead (prefetching contiguous pages).</li>
          <li>During flushing, it increases the chance to flush contiguous dirty pages together, improving I/O utilization.</li>
        </ul>
      </li>
      <li>After locating the <code class="language-plaintext highlighter-rouge">buf_pool_t</code>, it computes a hash over the full key <code class="language-plaintext highlighter-rouge">(space_id, page_no)</code> to find the target cell in that instance’s hash table.
        <ul>
          <li>Pages with the same hash value are chained in that cell.</li>
          <li>The lookup then traverses the chain and compares keys to find the target page.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>
    <p>The hash table stores only pointers to the corresponding page objects (<code class="language-plaintext highlighter-rouge">buf_page_t</code>). The actual <code class="language-plaintext highlighter-rouge">buf_block_t</code> objects and page frames live in a large memory region.</p>

    <p><img src="/public/images/2026-02-16/2.png" alt="image-1" /></p>

    <ul>
      <li>MySQL splits the page memory into multiple <strong>chunks</strong> (<code class="language-plaintext highlighter-rouge">buf_chunk_t</code>).</li>
      <li>Each chunk is a contiguous block of memory.</li>
      <li>The first part stores per-page metadata (<code class="language-plaintext highlighter-rouge">buf_block_t</code>) for the pages in that chunk.</li>
      <li>The second part stores the actual 16KB page frames.</li>
      <li>The mapping between <code class="language-plaintext highlighter-rouge">buf_block_t</code> and the actual page frame is done via the <code class="language-plaintext highlighter-rouge">frame</code> pointer in <code class="language-plaintext highlighter-rouge">buf_block_t</code>.</li>
    </ul>
  </li>
</ol>
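<p>The two-level lookup above can be sketched as a toy Python model (instance and cell counts are illustrative, and names like <code>instance_of</code> are mine, not InnoDB's): first pick the <code>buf_pool_t</code> instance from <code>(space_id, page_no &gt;&gt; 6)</code>, then walk the collision chain keyed on the full <code>(space_id, page_no)</code>.</p>

```python
N_INSTANCES = 4
N_CELLS = 8  # cells per instance hash table (toy size)

# instances[i] is one hash table: cell index -> chain of (key, frame)
instances = [[[] for _ in range(N_CELLS)] for _ in range(N_INSTANCES)]

def instance_of(space_id, page_no):
    # 64 consecutive pages of one tablespace land in the same instance,
    # which helps read-ahead and contiguous flushing
    return hash((space_id, page_no >> 6)) % N_INSTANCES

def put(space_id, page_no, frame):
    pool = instances[instance_of(space_id, page_no)]
    cell = pool[hash((space_id, page_no)) % N_CELLS]
    cell.append(((space_id, page_no), frame))

def get(space_id, page_no):
    pool = instances[instance_of(space_id, page_no)]
    cell = pool[hash((space_id, page_no)) % N_CELLS]
    for key, frame in cell:        # walk the chain, compare full keys
        if key == (space_id, page_no):
            return frame
    return None

put(1, 100, "frame-A")
put(1, 101, "frame-B")
# pages 100 and 101 share page_no >> 6 == 1, so they share an instance
print(instance_of(1, 100) == instance_of(1, 101))  # True
print(get(1, 100))  # frame-A
```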

<h3 id="postgresql">PostgreSQL</h3>

<p><img src="/public/images/2026-02-16/3.png" alt="image-1" /></p>

<p>Conceptually (as illustrated):</p>

<ol>
  <li>
    <p>PostgreSQL also shards the shared buffer mapping, with a similar idea.</p>
  </li>
  <li>
    <p>It first hashes the key <code class="language-plaintext highlighter-rouge">(tablespaceOid, dbOid, relNumber, forkNum, blockNum)</code> to obtain a <strong>bucket number</strong>.</p>
  </li>
  <li>
    <p>Then it uses <code class="language-plaintext highlighter-rouge">bucket_number &gt;&gt; 8</code> to locate the directory entry in the first-level mapping, i.e., the <strong>segment</strong> (<code class="language-plaintext highlighter-rouge">dir</code>).</p>
  </li>
  <li>
    <p>Each segment contains 256 buckets, so after finding the segment, it uses <code class="language-plaintext highlighter-rouge">bucket_number % 256</code> to locate the bucket within the segment.</p>
  </li>
  <li>
    <p>It then traverses the bucket chain, comparing keys one by one to find the page.</p>
  </li>
  <li>
    <p>All page frames are stored in one contiguous memory region, as an array: <code class="language-plaintext highlighter-rouge">BufferBlocks[]</code>.</p>

    <p><img src="/public/images/2026-02-16/4.png" alt="image-1" /></p>

    <ul>
      <li>Each page is 8KB.</li>
      <li>PostgreSQL does <strong>not</strong> split this region into chunks like MySQL does; all pages are stored together.</li>
      <li>Metadata for pages is stored separately in another array: <code class="language-plaintext highlighter-rouge">BufferDescriptors[]</code>.</li>
      <li>Both arrays have the same number of elements, equal to the total number of buffers/pages.</li>
      <li>The indices align one-to-one: it is straightforward to locate the actual page frame from the metadata by index.</li>
      <li>The hash table stores <code class="language-plaintext highlighter-rouge">buf_id</code>, which is the index into both <code class="language-plaintext highlighter-rouge">BufferDescriptors[]</code> and <code class="language-plaintext highlighter-rouge">BufferBlocks[]</code>.</li>
    </ul>
  </li>
</ol>
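<p>The segment/bucket indirection can be sketched as a toy Python model (a simplification of PostgreSQL's dynahash, with an illustrative segment count): <code>bucket_number &gt;&gt; 8</code> selects the segment, the low bits select the bucket within it, and the table stores a <code>buf_id</code>.</p>

```python
SEG_SIZE = 256  # buckets per segment

class SegmentedHash:
    """Toy directory -> segment -> bucket-chain lookup (not dynahash itself)."""
    def __init__(self, n_segments=4):
        self.dir = [[[] for _ in range(SEG_SIZE)] for _ in range(n_segments)]
        self.n_buckets = n_segments * SEG_SIZE

    def _bucket(self, key):
        b = hash(key) % self.n_buckets
        return self.dir[b >> 8][b % SEG_SIZE]  # segment, then bucket

    def insert(self, key, buf_id):
        self._bucket(key).append((key, buf_id))

    def lookup(self, key):
        for k, buf_id in self._bucket(key):    # chain scan, compare keys
            if k == key:
                return buf_id
        return None

h = SegmentedHash()
tag = (1663, 16384, 16385, 0, 42)  # (tablespace, db, rel, fork, block)
h.insert(tag, 7)
print(h.lookup(tag))  # 7 -> index into BufferDescriptors[]/BufferBlocks[]
```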

<p><strong>Summary:</strong> Both MySQL and PostgreSQL implement fairly standard hash-table-based page lookup; there isn’t a fundamental difference there. The biggest difference is that MySQL splits pages into chunks, which makes it easier to dynamically resize the buffer pool by adding/removing chunks.</p>

<h2 id="2-eviction-policy-for-old-pages-aging-and-implementation">2. Eviction Policy for Old Pages (Aging) and Implementation</h2>

<h3 id="mysql-1">MySQL</h3>

<p><img src="/public/images/2026-02-16/5.png" alt="image-1" /></p>

<p>MySQL maintains page aging information in a direct way: pages in the hash table are also linked into an LRU doubly-linked list. Each page’s <code class="language-plaintext highlighter-rouge">buf_page_t::LRU</code> is the list node that links the page into the LRU list.</p>

<ul>
  <li>The LRU head points to the most recently accessed page.</li>
  <li>The LRU tail points to the least recently accessed page.</li>
</ul>

<p>Each time a page is found via hash lookup, MySQL moves the page to the head of the LRU list via <code class="language-plaintext highlighter-rouge">buf_page_t::LRU</code>. Over time, pages that are not accessed drift toward the tail. When memory is insufficient and an old page must be evicted, the tail provides a fast candidate.</p>

<p>Of course, that is the conceptual LRU behavior. MySQL adds an important optimization, because the naive design has a major problem: a table scan floods the LRU with a large number of pages, destroying the existing hot/cold information. To prevent scan workloads from disrupting the LRU, MySQL splits the list.</p>

<p>At a point roughly 37.5% of the list length from the tail, MySQL maintains a <strong>midpoint</strong>:</p>

<ul>
  <li>To the left is the <strong>young</strong> area: the true hot region.</li>
  <li>To the right is the <strong>old</strong> area: a screening region for newly loaded pages.</li>
</ul>

<p>All new pages loaded from disk are initially inserted at the midpoint, i.e., the head of the <strong>old</strong> list. Since it is close to the tail, such pages are more likely to be evicted quickly. If a page is accessed again before it is evicted, MySQL does <strong>not</strong> immediately promote it to the young region. Instead, it records the first access time, and the page’s position stays unchanged. Only when it is accessed again, and the elapsed time since the first access exceeds <code class="language-plaintext highlighter-rouge">innodb_old_blocks_time</code> (default 1 second), will it be promoted to the LRU head (young region). As a result, pages introduced by full table scans typically stay in the old area for less than 1 second and are evicted quickly, without polluting the hot working set in the young region.</p>
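<p>The promotion rule can be sketched in a few lines of Python (a toy model of the behavior described above, not InnoDB code): the first access to an old-area page only records a timestamp; promotion happens on a later access once <code>innodb_old_blocks_time</code> has elapsed.</p>

```python
OLD_BLOCKS_TIME = 1.0  # seconds; models innodb_old_blocks_time

class Page:
    def __init__(self):
        self.first_access = None  # set on first touch in the old area
        self.young = False

def touch(page, now):
    """Apply the old-area promotion rule on each access."""
    if page.young:
        return
    if page.first_access is None:
        page.first_access = now            # record time, keep position
    elif now - page.first_access >= OLD_BLOCKS_TIME:
        page.young = True                  # promote to the young head

p = Page()
touch(p, 0.0)   # a scan touches the page: only records the time
touch(p, 0.2)   # touched again quickly (same scan): stays in old area
print(p.young)  # False -> scan pages age out without polluting young
touch(p, 1.5)   # a genuine re-access later promotes it
print(p.young)  # True
```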

<p>When a user thread needs to read a disk page but the buffer pool is full, it evicts an old page from the LRU tail and uses that frame to load the needed page. But eviction is not that trivial. Below is the concrete eviction procedure when a user thread needs a new page:</p>

<h4 id="first-attempt-n_iterations--0">First attempt (n_iterations == 0)</h4>

<ol>
  <li>First, try the free list. If a free page is found, return it. Otherwise:</li>
  <li>If <code class="language-plaintext highlighter-rouge">try_LRU_scan == true</code>, it indicates a partial LRU scan is allowed. Scan from the tail forward, at most 100 pages.
    <ul>
      <li>If an evictable page is found, reset it and move it to the free list, then <strong>return to step 1 and retry</strong>.</li>
      <li>If no evictable page is found, set <code class="language-plaintext highlighter-rouge">try_LRU_scan = false</code> to tell other user threads that partial LRU scanning is ineffective, so they should skip partial scans and go directly to the single-page flush path.</li>
    </ul>
  </li>
  <li>Notify the page cleaner thread that free pages are insufficient and it should accelerate cleaning.</li>
  <li>Scan forward from the tail.
    <ul>
      <li>If a clean evictable page is found, evict it directly.</li>
      <li>Otherwise, locate the first dirty page that can be flushed; perform a synchronous flush of that single page; then add it to the free list and <strong>proceed to the next attempt</strong>.</li>
    </ul>
  </li>
</ol>

<h4 id="second-attempt-n_iterations--1">Second attempt (n_iterations == 1)</h4>

<ol>
  <li>Same as first attempt step 1.</li>
  <li>Perform a full LRU list scan starting from the tail, searching for an evictable page; if found, move it to the free list and <strong>return to step 1 to retry</strong>. If that fails:</li>
  <li>Same as first attempt step 3.</li>
  <li>Same as first attempt step 4.</li>
</ol>

<h4 id="third-and-subsequent-attempts-n_iterations--1">Third and subsequent attempts (n_iterations &gt; 1)</h4>

<ol>
  <li>Same as first attempt step 1.</li>
  <li>Same as second attempt step 2.</li>
  <li>Same as first attempt step 3.</li>
  <li>Sleep for 10ms.</li>
  <li>Same as first attempt step 4.</li>
</ol>
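<p>The escalation across the three attempt stages can be condensed into one toy Python loop (a sketch of the behavior described above; the dict-based pool and function name are mine, not InnoDB's): free list first, then a bounded or full LRU scan, then a synchronous single-page flush.</p>

```python
def get_free_page(pool, max_attempts=5):
    """Toy model of the free-page escalation loop (not InnoDB code).

    pool: dict with 'free' (list), 'lru' (list of dicts with
    'dirty'/'pinned' flags, tail first), and 'try_lru_scan' (bool).
    """
    for n_iterations in range(max_attempts):
        if pool["free"]:
            return pool["free"].pop()            # step 1: free list
        # step 2: partial scan (<=100 pages) first time, full scan later
        depth = 100 if n_iterations == 0 else len(pool["lru"])
        if n_iterations > 0 or pool["try_lru_scan"]:
            for page in pool["lru"][:depth]:
                if not page["pinned"] and not page["dirty"]:
                    pool["lru"].remove(page)
                    pool["free"].append(page)    # evict -> retry step 1
                    break
            else:
                pool["try_lru_scan"] = False     # tell others to skip it
        if pool["free"]:
            continue
        # step 3: would notify the page cleaner here
        # step 4: synchronous flush of the first flushable dirty page
        for page in pool["lru"]:
            if not page["pinned"] and page["dirty"]:
                page["dirty"] = False            # pretend the flush completed
                pool["lru"].remove(page)
                pool["free"].append(page)
                break
        # (from the third attempt on, InnoDB also sleeps ~10 ms here)
    return pool["free"].pop() if pool["free"] else None

pool = {"free": [], "try_lru_scan": True,
        "lru": [{"dirty": True, "pinned": False},
                {"dirty": False, "pinned": True}]}
page = get_free_page(pool)
print(page is not None)  # True: the dirty page was flushed and reused
```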

<p>One more detail worth mentioning: the LRU scan does not always start from the tail for every thread. Each <code class="language-plaintext highlighter-rouge">buf_pool_t</code> maintains a global scan cursor <code class="language-plaintext highlighter-rouge">lru_scan_itr</code> (type <code class="language-plaintext highlighter-rouge">LRUItr</code>). After a thread finishes scanning, it leaves the cursor at its current position, and the next thread continues scanning from there, avoiding multiple threads repeatedly scanning the same region. Only when the cursor is empty/invalid, or still within the old region (meaning the previous scan did not progress far enough), will it be reset back to the tail. In addition, single-page flushing (step 4) uses another independent cursor <code class="language-plaintext highlighter-rouge">single_scan_itr</code>; these two cursors do not interfere with each other.</p>

<h3 id="postgresql-1">PostgreSQL</h3>

<p><img src="/public/images/2026-02-16/6.png" alt="image-1" /></p>

<p>PostgreSQL does not maintain a global LRU list like MySQL does, but that doesn’t mean it does not perform LRU-style eviction. It simply takes another path.</p>

<p>All page metadata lives in the <code class="language-plaintext highlighter-rouge">BufferDescriptors[]</code> array. Each <code class="language-plaintext highlighter-rouge">BufferDescriptor</code> has two fields representing the current usage state of its corresponding page:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">refcount</code>: how many backends are currently using (pinning) the page</li>
  <li><code class="language-plaintext highlighter-rouge">usage_count</code>: the accumulated number of accesses to the page (capped at 5; when the page is accessed via a ring-buffer strategy, it is only incremented if it is currently 0, so it stays at 1)</li>
</ul>

<p>Whenever a backend accesses a page via the hash table, it increments both <code class="language-plaintext highlighter-rouge">refcount</code> and <code class="language-plaintext highlighter-rouge">usage_count</code>. When the backend is done with the page, it only decrements <code class="language-plaintext highlighter-rouge">refcount</code>. Therefore, <code class="language-plaintext highlighter-rouge">usage_count</code> serves as an approximate LRU weight (but not unbounded: it stops increasing once it reaches 5).</p>

<p>When a backend tries to load a page from disk but finds no free page, it starts a <strong>clock sweep</strong>: it traverses <code class="language-plaintext highlighter-rouge">BufferDescriptors</code> circularly. If a buffer is not currently used by any backend (<code class="language-plaintext highlighter-rouge">refcount == 0</code>), it decrements <code class="language-plaintext highlighter-rouge">usage_count</code> (cooling down the LRU weight) and continues sweeping. Eventually it finds a buffer where both <code class="language-plaintext highlighter-rouge">refcount == 0</code> and <code class="language-plaintext highlighter-rouge">usage_count == 0</code>, and that buffer becomes the victim for eviction.</p>

<p>Of course, this alone is still insufficient to prevent LRU pollution from one-time full scans. PostgreSQL has its own optimization: introducing a <strong>local ring buffer</strong>.</p>

<p>Each backend has its own local ring buffer: essentially a fixed-length array of buffer IDs. A buffer ID points to a page slot in the global <code class="language-plaintext highlighter-rouge">BufferDescriptors</code>. The ring buffer limits how many global buffers the backend consumes at once, so eviction is more likely to happen within the ring buffer itself, reducing pollution of the global shared buffers.</p>

<p>More concretely, suppose a backend is performing a sequential scan and the upper layer marks the operation to use the ring buffer. When reading pages via the hash table:</p>

<ul>
  <li>If the backend’s local ring buffer is not full, it stores the buffer ID into the ring buffer.</li>
  <li>As reading continues, the ring buffer becomes full.</li>
  <li>After it is full, when it needs to read the next page:
    <ul>
      <li>It checks the page at the ring buffer’s current cursor position.</li>
      <li>If that buffer is not used by other backends (<code class="language-plaintext highlighter-rouge">refcount == 0</code> and <code class="language-plaintext highlighter-rouge">usage_count &lt;= 1</code>), it reuses it directly: evict and load the next page into it.</li>
      <li>If that buffer is currently used by other backends, it falls back to searching in <code class="language-plaintext highlighter-rouge">BufferDescriptors</code> for another available buffer to load the next page, and then replaces the current ring entry with the new buffer ID.</li>
    </ul>
  </li>
</ul>

<p>Here you can see the different approaches MySQL and PostgreSQL take for the same scenario. MySQL introduces an “old/young” split in the global LRU list as a general strategy to prevent pollution. PostgreSQL’s ring buffer is essentially also an “old area”, but it relies on higher-level operation tagging: only scan-heavy operations such as VACUUM, sequential scan, bulk insert, etc., will use the ring buffer.</p>

<p>Below is the complete procedure PostgreSQL uses to find a free buffer when a backend needs one:</p>

<ol>
  <li>
    <p>Determine whether to use the ring buffer. If yes, inspect the buffer at the ring’s current cursor position:</p>

    <p>a. If it has not been used before, the ring is not full yet, go to step 2.</p>

    <p>b. Otherwise the ring is full. If the buffer is not used by any backend (<code class="language-plaintext highlighter-rouge">refcount == 0</code> and <code class="language-plaintext highlighter-rouge">usage_count &lt;= 1</code>), it can be reused immediately, return this buffer.</p>

    <p>c. If the buffer is used by other backends, fall back to step 2 to find a buffer from the global pool; after success, replace the current ring entry with the newly found buffer ID.</p>
  </li>
  <li>
    <p>Check the free list. If a buffer is available, return it.</p>
  </li>
  <li>
    <p>Start clock sweep: traverse from <code class="language-plaintext highlighter-rouge">nextVictimBuffer</code> (the current sweep cursor in <code class="language-plaintext highlighter-rouge">BufferDescriptors</code>):</p>

    <ul>
      <li>If <code class="language-plaintext highlighter-rouge">refcount != 0</code>, skip.</li>
      <li>Otherwise, if <code class="language-plaintext highlighter-rouge">usage_count != 0</code>, decrement it (cooling down) and continue.</li>
      <li>Otherwise, the buffer is evictable. If it is not dirty, return it immediately. If it is dirty, flush it and then return it.</li>
      <li>Advance <code class="language-plaintext highlighter-rouge">nextVictimBuffer</code> accordingly.</li>
    </ul>
  </li>
</ol>
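<p>Putting the three steps together, here is a toy Python sketch of the whole flow (ring check, free list, clock sweep); the state layout and function name are illustrative, not PostgreSQL's actual structures.</p>

```python
def get_buffer(state, use_ring):
    """Toy model of PostgreSQL's buffer-allocation flow (a sketch)."""
    descs, ring = state["descs"], state["ring"]
    # 1. ring buffer: if full, try to reuse the slot at the cursor
    if use_ring and len(ring["slots"]) == ring["size"]:
        buf_id = ring["slots"][ring["cursor"]]
        d = descs[buf_id]
        if d["refcount"] == 0 and d["usage_count"] <= 1:
            return buf_id                       # reuse within the ring
    # 2. free list
    if state["free"]:
        buf_id = state["free"].pop()
    else:
        # 3. clock sweep over the descriptor array
        i = state["next_victim"]
        while True:
            d = descs[i]
            if d["refcount"] == 0:
                if d["usage_count"] == 0:
                    buf_id = i                  # would flush here if dirty
                    break
                d["usage_count"] -= 1
            i = (i + 1) % len(descs)
        state["next_victim"] = (i + 1) % len(descs)
    if use_ring:
        if len(ring["slots"]) < ring["size"]:
            ring["slots"].append(buf_id)        # ring still filling up
        else:
            ring["slots"][ring["cursor"]] = buf_id  # replace ring entry
    return buf_id

state = {"descs": [{"refcount": 0, "usage_count": 0} for _ in range(4)],
         "free": [3, 2], "next_victim": 0,
         "ring": {"slots": [], "size": 2, "cursor": 0}}
print(get_buffer(state, use_ring=True))  # 2: from the free list
print(get_buffer(state, use_ring=True))  # 3: from the free list
print(get_buffer(state, use_ring=True))  # 2: reused via the full ring
```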

<p><strong>Summary:</strong> MySQL and PostgreSQL are similar in essence: both are LRU-like. MySQL chooses to implement an explicit LRU list for more precise eviction, at the cost of additional overhead to maintain the list. PostgreSQL uses reference counting plus <code class="language-plaintext highlighter-rouge">usage_count</code> as an approximate LRU, avoiding the locking overhead of maintaining a true LRU list but losing precision. This is the result of different trade-offs. Another notable difference: when a MySQL foreground thread tries to find a free page, it prefers to evict clean old pages first, flushing a dirty page only as a last resort; PostgreSQL’s sweep does not have an explicit priority between dirty and clean pages in the same sense.</p>

<h2 id="3-dirty-page-flushing-strategy-and-implementation">3. Dirty Page Flushing Strategy and Implementation</h2>

<p>Earlier we mentioned that MySQL user threads and PostgreSQL backends may flush a single dirty page when searching for a free page (single-page flush). However, such foreground single-page flushing is only an emergency measure when no free page is available.</p>

<p>For normal bulk flushing, both MySQL and PostgreSQL have dedicated background threads/processes. The goal is to flush dirty pages in advance and evict old pages so that foreground threads can quickly find free pages.</p>

<p>Background flushing has two goals:</p>

<ol>
  <li><strong>LRU flush</strong>: flush old pages in advance based on foreground free-page pressure, reducing foreground wait time for free pages.</li>
  <li><strong>Checkpoint flush</strong>: flush dirty pages associated with the oldest WAL LSN to advance the checkpoint, purge old WAL, and reduce crash recovery time.</li>
</ol>

<h3 id="mysql-2">MySQL</h3>

<p><img src="/public/images/2026-02-16/7.png" alt="image-1" /></p>

<p><img src="/public/images/2026-02-16/8.png" alt="image-1" /></p>

<p>In MySQL (InnoDB), background flushing is performed by <strong>page cleaner</strong> threads, consisting of one coordinator and N workers.</p>

<h4 id="coordinator">Coordinator</h4>

<ol>
  <li>Sleep for ~1 second, or be woken by a foreground thread.</li>
  <li>Check whether work is needed (sync flush / adaptive / idle). If yes:</li>
  <li>Dynamically calculate the number of dirty pages to flush in the next batch: <code class="language-plaintext highlighter-rouge">n_pages</code>.</li>
  <li>Pass <code class="language-plaintext highlighter-rouge">n_pages</code> to all workers and wake them up. Each worker is responsible for one <code class="language-plaintext highlighter-rouge">buf_pool_t</code> slot. The coordinator itself also works as worker 0.</li>
  <li>Wait for all workers to finish.</li>
</ol>

<h4 id="worker">Worker</h4>

<ol>
  <li>Wait to be woken by the coordinator.</li>
  <li>Locate the assigned <code class="language-plaintext highlighter-rouge">buf_pool_t</code> slot.</li>
  <li><strong>LRU flush</strong>: scan from the LRU tail forward, scanning at most <code class="language-plaintext highlighter-rouge">srv_LRU_scan_depth</code> pages.
    <ul>
      <li>If a page is clean and not being used, move it directly from the LRU list to the free list.</li>
      <li>If a page can be flushed, initiate asynchronous I/O; after I/O completes, move it into the free list.</li>
      <li>Stop early if the free list length reaches <code class="language-plaintext highlighter-rouge">srv_LRU_scan_depth</code>.</li>
    </ul>
  </li>
  <li><strong>Checkpoint flush</strong>: scan from the flush list tail forward and flush continuously until:
    <ul>
      <li>the number of flushed pages satisfies the quota assigned by the coordinator, or</li>
      <li>the WAL LSN advances to the target LSN assigned by the coordinator.</li>
    </ul>
  </li>
  <li>Finish and report to the coordinator.</li>
</ol>

<p>Now, step 3 in the coordinator is adaptive: it calculates the flush workload and the target LSN advancement. The logic is as follows:</p>

<h4 id="a-based-on-dirty-page-percentage-get_pct_for_dirty">a. Based on dirty page percentage (<code class="language-plaintext highlighter-rouge">get_pct_for_dirty()</code>)</h4>

<p>Compute <code class="language-plaintext highlighter-rouge">dirty_pct</code>, the percentage of dirty pages in the buffer pool:</p>

<ul>
  <li>If <code class="language-plaintext highlighter-rouge">innodb_max_dirty_pages_pct_lwm</code> (low watermark) is set and <code class="language-plaintext highlighter-rouge">dirty_pct &gt;= lwm</code>, start progressive flushing and return the percentage of <code class="language-plaintext highlighter-rouge">io_capacity</code> as:
<code class="language-plaintext highlighter-rouge">dirty_pct * 100 / (max_dirty_pages_pct + 1)</code></li>
  <li>If no low watermark is set, but <code class="language-plaintext highlighter-rouge">dirty_pct &gt;= innodb_max_dirty_pages_pct</code> (high watermark), flush at 100% <code class="language-plaintext highlighter-rouge">io_capacity</code>.</li>
  <li>Otherwise, do not flush based on dirty ratio (return 0).</li>
</ul>

<h4 id="b-based-on-redo-log-age-get_pct_for_lsnage">b. Based on redo log age (<code class="language-plaintext highlighter-rouge">get_pct_for_lsn(age)</code>)</h4>

<p>Compute checkpoint age:
 <code class="language-plaintext highlighter-rouge">age = current_lsn - oldest_lsn</code></p>

<ul>
  <li>If <code class="language-plaintext highlighter-rouge">age &lt; innodb_adaptive_flushing_lwm</code> (default 10% of redo log capacity), no adaptive flushing needed (return 0).</li>
  <li>If <code class="language-plaintext highlighter-rouge">age</code> exceeds the low watermark:
<code class="language-plaintext highlighter-rouge">age_factor = age * 100 / limit_for_dirty_page_age</code>
Return the percentage of <code class="language-plaintext highlighter-rouge">io_capacity</code> as:
<code class="language-plaintext highlighter-rouge">(max_io_capacity / io_capacity) * age_factor * sqrt(age_factor) / 7.5</code></li>
</ul>

<p>This is a super-linear growth curve: as redo space approaches exhaustion, flushing ramps up aggressively.</p>
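<p>The two heuristics can be sketched directly from the formulas above (a Python sketch using the constants given in the text; default capacities are illustrative):</p>

```python
def pct_for_dirty(dirty_pct, lwm, max_dirty_pct):
    """Sketch of get_pct_for_dirty() as described above (not InnoDB code).
    lwm == 0 means the low watermark is unset."""
    if lwm:
        if dirty_pct >= lwm:
            return dirty_pct * 100 / (max_dirty_pct + 1)  # progressive
        return 0
    return 100 if dirty_pct >= max_dirty_pct else 0       # all or nothing

def pct_for_lsn(age, adaptive_lwm, limit_for_age,
                max_io_capacity=4000, io_capacity=2000):
    """Sketch of get_pct_for_lsn(age): super-linear ramp as redo fills."""
    if age < adaptive_lwm:
        return 0
    age_factor = age * 100 / limit_for_age
    return (max_io_capacity / io_capacity) \
        * age_factor * (age_factor ** 0.5) / 7.5

print(round(pct_for_dirty(12.0, 10.0, 90), 1))                   # 13.2
print(round(pct_for_lsn(5_000_000, 1_000_000, 10_000_000), 1))   # 94.3
print(round(pct_for_lsn(9_000_000, 1_000_000, 10_000_000), 1))   # 227.7
```

Note how the last two outputs illustrate the super-linear curve: less than doubling the age more than doubles the flushing percentage.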

<h4 id="combined-calculation-set_flush_target_by_lsn">Combined calculation (<code class="language-plaintext highlighter-rouge">set_flush_target_by_lsn()</code>)</h4>

<p>Take:
 <code class="language-plaintext highlighter-rouge">pct_total = max(pct_for_dirty, pct_for_lsn)</code></p>

<p>Then compute the target LSN:
 <code class="language-plaintext highlighter-rouge">target_lsn = oldest_lsn + lsn_avg_rate * 3</code>
 (i.e., advance by 3× the recent average redo generation rate; <code class="language-plaintext highlighter-rouge">buf_flush_lsn_scan_factor = 3</code>)</p>

<p>Then traverse each buffer pool instance’s flush list and count the number of pages whose <code class="language-plaintext highlighter-rouge">oldest_modification &lt;= target_lsn</code>. Call this number <code class="language-plaintext highlighter-rouge">pages_for_lsn</code> (pages that must be flushed to advance checkpoint to <code class="language-plaintext highlighter-rouge">target_lsn</code>).</p>

<p>Finally, take the average of three estimates:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n_pages = (PCT_IO(pct_total) + page_avg_rate + pages_for_lsn) / 3
</code></pre></div></div>

<p>Where:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">PCT_IO(pct_total)</code> is the I/O demand estimated from dirty ratio / redo age.</li>
  <li><code class="language-plaintext highlighter-rouge">page_avg_rate</code> is the recent actual average flushing rate (moving average across multiple iterations).</li>
  <li><code class="language-plaintext highlighter-rouge">pages_for_lsn</code> is the precise demand obtained from scanning the flush list.</li>
</ul>

<p>Averaging these three makes the flushing rate smoother and avoids abrupt oscillation. <code class="language-plaintext highlighter-rouge">n_pages</code> is capped by <code class="language-plaintext highlighter-rouge">srv_max_io_capacity</code>.</p>
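<p>The final blend can be sketched as follows (a hedged Python sketch of the averaging step; <code>PCT_IO</code> is modeled as the given percentage of <code>io_capacity</code>, and the capacity values are illustrative defaults):</p>

```python
def page_cleaner_n_pages(pct_for_dirty, pct_for_lsn, page_avg_rate,
                         pages_for_lsn, io_capacity=2000,
                         max_io_capacity=4000):
    """Sketch of the coordinator's batch-size blend (not InnoDB code)."""
    pct_total = max(pct_for_dirty, pct_for_lsn)
    pct_io = io_capacity * pct_total / 100      # models PCT_IO(pct_total)
    n_pages = (pct_io + page_avg_rate + pages_for_lsn) / 3
    return min(int(n_pages), max_io_capacity)   # capped by max capacity

# One spiking estimate (pages_for_lsn) is damped by the 3-way average:
print(page_cleaner_n_pages(40, 10, 900, 6000))     # 2566 = (800+900+6000)/3
print(page_cleaner_n_pages(100, 200, 3000, 9000))  # 4000: hit the cap
```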

<p>If redo pressure is high (<code class="language-plaintext highlighter-rouge">pct_for_lsn &gt; 30</code>), the per-instance flush quota is weighted by how many pages in each instance’s flush list need flushing; otherwise, it is evenly distributed across instances.</p>

<h4 id="sync-flush-mode">Sync Flush mode</h4>

<p>When redo log space is extremely tight (checkpoint cannot keep up with redo generation), <code class="language-plaintext highlighter-rouge">log_sync_flush_lsn()</code> returns non-zero and the coordinator enters sync flush mode:</p>

<ul>
  <li>It no longer sleeps for 1 second; it starts the next iteration immediately.</li>
  <li><code class="language-plaintext highlighter-rouge">n_pages</code> is set directly to <code class="language-plaintext highlighter-rouge">pages_for_lsn</code> (no averaging), with a lower bound of <code class="language-plaintext highlighter-rouge">srv_io_capacity</code>.</li>
  <li>It loops until redo pressure is relieved.</li>
</ul>

<h4 id="idle-flushing">Idle flushing</h4>

<p>When the server is idle (no user activity) and the 1-second sleep times out, the coordinator does not run the adaptive algorithm. Instead, it flushes in the background using <code class="language-plaintext highlighter-rouge">innodb_idle_flush_pct</code> percent of <code class="language-plaintext highlighter-rouge">innodb_io_capacity</code> (default 100%), keeping the buffer pool clean.</p>

<h3 id="postgresql-2">PostgreSQL</h3>

<p>PostgreSQL also has both LRU flush and checkpoint flush, but unlike MySQL’s unified page cleaner, PostgreSQL separates responsibilities:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">bgwriter</code> handles <strong>LRU flush</strong></li>
  <li><code class="language-plaintext highlighter-rouge">checkpointer</code> handles <strong>checkpoint flush</strong></li>
</ul>

<h4 id="1-bgwriter">1. bgwriter</h4>

<p><img src="/public/images/2026-02-16/9.png" alt="image-1" /></p>

<p>The goal of <code class="language-plaintext highlighter-rouge">bgwriter</code> is to predict the upcoming demand for free buffers based on historical and current pressure, and to free enough buffers <strong>before</strong> backends are forced into heavy clock sweep work, i.e., to flush dirty pages that would otherwise be reusable victims.</p>

<p>The overall flow:</p>

<ol>
  <li>
    <p>Collect historical info from clock sweep, including:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">strategy_buf_id</code>: the current backend clock sweep position</li>
      <li><code class="language-plaintext highlighter-rouge">strategy_passes</code>: how many full sweeps have been completed</li>
      <li><code class="language-plaintext highlighter-rouge">recent_alloc</code>: how many buffers have been allocated by backends since the last bgwriter recycle</li>
    </ul>
  </li>
  <li>
    <p>Compare <code class="language-plaintext highlighter-rouge">bgwriter</code>’s current position <code class="language-plaintext highlighter-rouge">next_to_clean</code> with clock sweep’s <code class="language-plaintext highlighter-rouge">strategy_buf_id</code>, and determine how far ahead it is:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">bufs_to_lap</code>: number of buffers bgwriter must scan for <code class="language-plaintext highlighter-rouge">next_to_clean</code> to “lap” (catch up to) <code class="language-plaintext highlighter-rouge">strategy_buf_id</code>.
        <ul>
          <li>Case 1: same pass, bgwriter ahead → <code class="language-plaintext highlighter-rouge">bufs_to_lap</code> is the remaining distance to lap.</li>
          <li>Case 2: same pass, bgwriter behind → set <code class="language-plaintext highlighter-rouge">next_to_clean</code> to <code class="language-plaintext highlighter-rouge">strategy_buf_id</code>, set <code class="language-plaintext highlighter-rouge">bufs_to_lap = NBuffers</code>, effectively reset bgwriter.</li>
          <li>Case 3: bgwriter already one full pass ahead → <code class="language-plaintext highlighter-rouge">bufs_to_lap</code> may be negative, meaning bgwriter has scanned everything it can scan; no need to scan in this round.</li>
        </ul>
      </li>
      <li><code class="language-plaintext highlighter-rouge">bufs_ahead = NBuffers - bufs_to_lap</code> (how many buffers bgwriter is ahead of sweep)</li>
    </ul>
  </li>
  <li>
    <p>Based on the history above, compute how many buffers clock sweep needs to scan to find one free buffer, i.e. <code class="language-plaintext highlighter-rouge">scans_per_alloc</code>. Maintain an exponential moving average:
<code class="language-plaintext highlighter-rouge">smoothed_density += (scans_per_alloc - smoothed_density) / 16;</code></p>
  </li>
  <li>
    <p>Maintain <code class="language-plaintext highlighter-rouge">smoothed_alloc</code> similarly:</p>

    <ul>
      <li>If <code class="language-plaintext highlighter-rouge">smoothed_alloc &lt; recent_alloc</code>, set <code class="language-plaintext highlighter-rouge">smoothed_alloc = recent_alloc</code> (fast attack).</li>
      <li>Otherwise decay slowly using EMA:
<code class="language-plaintext highlighter-rouge">smoothed_alloc += (recent_alloc - smoothed_alloc) / 16;</code> (slow decay)</li>
    </ul>
  </li>
  <li>
    <p>Compute the prediction for the next round:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">upcoming_alloc_est = smoothed_alloc * bgwriter_lru_multiplier</code> (predict upcoming allocations)</li>
      <li>Estimate how many reusable buffers exist in the region bgwriter is ahead:
<code class="language-plaintext highlighter-rouge">reusable_buffers_est = bufs_ahead / smoothed_density</code></li>
      <li>Ensure minimum progress:
<code class="language-plaintext highlighter-rouge">min_scan_buffers = NBuffers / (120s / 200ms)</code>
Then:
<code class="language-plaintext highlighter-rouge">upcoming_alloc_est = max(upcoming_alloc_est, min_scan_buffers + reusable_buffers_est)</code></li>
    </ul>

    <p>This “minimum progress” ensures that even if the system is idle, bgwriter will scan the entire buffer pool in about 120 seconds, continuously cleaning dirty pages.</p>
  </li>
  <li>
    <p>Scan from <code class="language-plaintext highlighter-rouge">next_to_clean</code>. For each buffer, bgwriter only considers buffers with <code class="language-plaintext highlighter-rouge">refcount == 0</code> and <code class="language-plaintext highlighter-rouge">usage_count == 0</code> (truly reusable candidates). It skips buffers in use or recently used. If a candidate is dirty, it flushes it synchronously. Stop scanning when any of these is met:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">bufs_to_lap</code> reaches 0 (caught up to clock sweep)</li>
      <li><code class="language-plaintext highlighter-rouge">reusable_buffers</code> reaches <code class="language-plaintext highlighter-rouge">upcoming_alloc_est</code> (freed enough reusable buffers)</li>
      <li><code class="language-plaintext highlighter-rouge">num_written</code> reaches <code class="language-plaintext highlighter-rouge">bgwriter_lru_maxpages</code> (default 100) to avoid excessive I/O in one round</li>
    </ul>
  </li>
</ol>

<p>After one scan round, bgwriter sleeps for <code class="language-plaintext highlighter-rouge">bgwriter_delay</code> (default 200ms) before next iteration. If <code class="language-plaintext highlighter-rouge">bufs_to_lap == 0</code> and <code class="language-plaintext highlighter-rouge">recent_alloc == 0</code> (no allocation activity), bgwriter enters hibernation and sleeps longer, until a backend needing buffers wakes it via latch.</p>
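<p>Putting the pacing rules above together, here is a minimal sketch in C++. Variable names loosely follow PostgreSQL’s <code class="language-plaintext highlighter-rouge">BgBufferSync()</code>, but the simplifications (fixed 200ms delay, collapsed bookkeeping) are mine, not the actual implementation:</p>

```cpp
#include <algorithm>
#include <cassert>

// Simplified bgwriter pacing state (loosely modeled on BgBufferSync()).
struct BgWriterState {
    double smoothed_density = 10.0; // buffers scanned per allocation (EMA)
    double smoothed_alloc   = 0.0;  // recent allocation rate (EMA)
};

// One pacing step: given this round's observations, estimate how many
// reusable buffers bgwriter should have ready before the next round.
int upcoming_alloc_estimate(BgWriterState& s,
                            int recent_alloc,   // backend allocs since last round
                            int strategy_delta, // clock sweep movement since last round
                            int bufs_ahead,     // how far bgwriter is ahead of sweep
                            int nbuffers,       // NBuffers
                            double lru_multiplier = 2.0) {
    // Scans needed per allocation, smoothed with a 1/16 EMA.
    if (recent_alloc > 0 && strategy_delta > 0) {
        double scans_per_alloc = (double)strategy_delta / recent_alloc;
        s.smoothed_density += (scans_per_alloc - s.smoothed_density) / 16;
    }
    // Allocation rate: fast attack, slow (1/16 EMA) decay.
    if (s.smoothed_alloc < recent_alloc)
        s.smoothed_alloc = recent_alloc;
    else
        s.smoothed_alloc += (recent_alloc - s.smoothed_alloc) / 16;

    // Predict upcoming allocations.
    int est = (int)(s.smoothed_alloc * lru_multiplier);

    // Reusable buffers estimated in the region bgwriter is already ahead.
    int reusable_est = (int)(bufs_ahead / s.smoothed_density);

    // Minimum progress: cover the whole pool in ~120s of 200ms rounds.
    int min_scan = nbuffers / (120000 / 200);

    return std::max(est, min_scan + reusable_est);
}
```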

<h4 id="2-checkpointer">2. checkpointer</h4>

<p><img src="/public/images/2026-02-16/10.png" alt="image-1" /></p>

<p>The goal of <code class="language-plaintext highlighter-rouge">checkpointer</code> is to flush all dirty pages up to a consistency point, forming a checkpoint. This advances WAL recycling and reduces how much WAL must be replayed during crash recovery. Unlike bgwriter, checkpointer does not care whether a page was recently used: it must flush every page that was dirty at checkpoint start.</p>

<p><strong>Trigger conditions:</strong> in the main loop, checkpointer triggers a checkpoint when any of the following occurs:</p>

<ul>
  <li>Time since last checkpoint exceeds <code class="language-plaintext highlighter-rouge">checkpoint_timeout</code> (default 5 minutes)</li>
  <li>WAL volume exceeds <code class="language-plaintext highlighter-rouge">max_wal_size</code> and backends notify checkpointer</li>
  <li>User manually runs <code class="language-plaintext highlighter-rouge">CHECKPOINT</code></li>
  <li>Shutdown checkpoint during server shutdown</li>
</ul>

<p><strong>Detailed procedure:</strong></p>

<ol>
  <li>
    <p><strong>Scan and collect dirty buffers:</strong> traverse all <code class="language-plaintext highlighter-rouge">NBuffers</code> <code class="language-plaintext highlighter-rouge">BufferDescriptors</code>. For each dirty page, set the <code class="language-plaintext highlighter-rouge">BM_CHECKPOINT_NEEDED</code> flag, and collect its identity info into <code class="language-plaintext highlighter-rouge">CkptBufferIds[]</code> (tablespace OID, relation number, fork number, block number, etc.).
Note: only pages that are already dirty at checkpoint start are included. Pages that become dirty during the checkpoint are not included and will be handled in the next checkpoint.</p>
  </li>
  <li>
    <p><strong>Sort:</strong> sort <code class="language-plaintext highlighter-rouge">CkptBufferIds[]</code> by <code class="language-plaintext highlighter-rouge">(tablespace, relation, fork, block)</code>. This clusters pages from the same file and orders them by increasing block number, converting random I/O into more sequential patterns as much as possible.</p>
  </li>
  <li>
    <p><strong>Build tablespace-level progress tracking:</strong> traverse the sorted array and group by tablespace. For each tablespace, build a <code class="language-plaintext highlighter-rouge">CkptTsStatus</code> structure tracking total pages to flush and current progress. Put all tablespaces into a binary heap (min-heap), ordered by flush progress.</p>
  </li>
  <li>
    <p><strong>Balanced flushing across tablespaces:</strong> repeatedly pop the tablespace with the lowest progress from the heap, flush its next dirty page (via <code class="language-plaintext highlighter-rouge">SyncOneBuffer</code>), update its progress, then re-heapify.
The purpose is to spread writes evenly across tablespaces (possibly on different disks), instead of flushing one tablespace completely before another.
Unlike bgwriter, checkpointer calls <code class="language-plaintext highlighter-rouge">SyncOneBuffer</code> with <code class="language-plaintext highlighter-rouge">skip_recently_used = false</code>, meaning it will flush buffers with <code class="language-plaintext highlighter-rouge">BM_CHECKPOINT_NEEDED</code> regardless of recent usage.</p>
  </li>
  <li>
    <p><strong>Write throttling:</strong> after flushing each page, call <code class="language-plaintext highlighter-rouge">CheckpointWriteDelay()</code> to throttle. The goal is to finish flushing within:
<code class="language-plaintext highlighter-rouge">checkpoint_completion_target</code> (default 0.9) × <code class="language-plaintext highlighter-rouge">checkpoint_timeout</code>.
The logic compares:</p>

    <ul>
      <li>flush progress (flushed pages / total),</li>
      <li>elapsed time progress,</li>
      <li>WAL progress.</li>
    </ul>

    <p>If flush progress is ahead of both time progress and WAL progress (<code class="language-plaintext highlighter-rouge">IsCheckpointOnSchedule == true</code>), sleep 100ms. If lagging behind, do not sleep and flush at full speed. In IMMEDIATE mode (e.g., shutdown checkpoint) or under urgent checkpoint requests, do not throttle.</p>

    <p>This spreads checkpoint I/O across the entire checkpoint window and avoids I/O spikes.</p>
  </li>
  <li>
    <p><strong>Writeback coalescing:</strong> if not using <code class="language-plaintext highlighter-rouge">O_DIRECT</code>, similar to bgwriter, use <code class="language-plaintext highlighter-rouge">WritebackContext</code> to collect tags for flushed pages. After accumulating enough, batch-call <code class="language-plaintext highlighter-rouge">IssuePendingWritebacks()</code>, sort and coalesce adjacent blocks, and use <code class="language-plaintext highlighter-rouge">posix_fadvise</code> to hint the kernel to write back OS cache pages to disk. After checkpoint completion, force one more <code class="language-plaintext highlighter-rouge">IssuePendingWritebacks()</code> to ensure all pending writebacks are issued.</p>
  </li>
</ol>
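<p>Steps 3 and 4 above (per-tablespace progress tracking plus balanced flushing via a min-heap) can be sketched as follows. This is a simplified model, not PostgreSQL code: the <code class="language-plaintext highlighter-rouge">CkptTsStatus</code>-like struct is reduced to two counters and <code class="language-plaintext highlighter-rouge">SyncOneBuffer</code> is replaced by appending to an order log:</p>

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Simplified per-tablespace progress (stand-in for CkptTsStatus).
struct TsProgress {
    std::string tablespace;
    int to_flush;       // dirty pages collected for this tablespace
    int flushed = 0;
    double progress() const { return (double)flushed / to_flush; }
};

struct LowerProgressFirst {
    bool operator()(const TsProgress& a, const TsProgress& b) const {
        return a.progress() > b.progress(); // min-heap on flush progress
    }
};

// Repeatedly pop the least-advanced tablespace, flush one of its pages,
// and re-insert it; returns the order in which tablespaces were serviced.
std::vector<std::string> balanced_flush(std::vector<TsProgress> tablespaces) {
    std::priority_queue<TsProgress, std::vector<TsProgress>, LowerProgressFirst>
        heap(tablespaces.begin(), tablespaces.end());
    std::vector<std::string> order;
    while (!heap.empty()) {
        TsProgress ts = heap.top();
        heap.pop();
        ts.flushed++;              // stand-in for SyncOneBuffer()
        order.push_back(ts.tablespace);
        if (ts.flushed < ts.to_flush)
            heap.push(ts);         // re-heapify with updated progress
    }
    return order;
}
```

<p>With one tablespace holding 2 dirty pages and another holding 4, the writes interleave roughly proportionally instead of draining one tablespace first.</p>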

<p><strong>Summary:</strong> Although the implementations differ significantly, both MySQL and PostgreSQL aim to pre-clean pages in the background so that foreground threads can quickly find free pages. PostgreSQL’s bgwriter predicts upcoming buffer allocation demand from foreground activity; MySQL’s page cleaner reacts to dirty page pressure and redo log age.</p>

<p>From an engineering perspective, their differences largely come down to the trade-off between linked lists and arrays:</p>

<ul>
  <li>With linked lists, MySQL can precisely obtain LRU ordering and dirty-page ordering from old to new. This greatly improves precision in eviction and flushing decisions. In particular, for checkpoint flushing, it can directly take the oldest dirty pages from the flush list tail to advance checkpoint quickly. The trade-off is the cost of maintaining those lists.</li>
  <li>PostgreSQL sacrifices some precision and scans arrays instead, avoiding the additional overhead of maintaining linked lists. It is also worth noting that PostgreSQL’s checkpoint flushing emphasizes balanced progress across tablespaces rather than globally prioritizing the oldest dirty pages to advance checkpoint in small steps.</li>
</ul>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[The debate over “MySQL vs PostgreSQL, which one is better?” has been around for a long time. As two outstanding representatives of open-source OLTP databases, I personally don’t think one overwhelmingly dominates the other. Transactional database theory has been stable for decades; both systems are practical implementations built under the same theoretical framework.]]></summary>
  </entry>
  <entry>
    <title type="html">Visualizing MySQL BLOB Internals Directly from MySQL Data Files (.ibd)</title>
    <link href="https://kernelmaker.github.io/blob-ibdninja" rel="alternate" type="text/html" title="Visualizing MySQL BLOB Internals Directly from MySQL Data Files (.ibd)"/>
    <published>2026-02-08T00:00:00+00:00</published>
    <updated>2026-02-08T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/blob-ibdninja</id>
    <content type="html" xml:base="https://kernelmaker.github.io/blob-ibdninja"><![CDATA[<p>In a previous <a href="https://kernelmaker.github.io/mysql-blob">post</a>, I explored how MySQL implements partial updates and multi-versioning for BLOB columns internally.</p>

<p>To better see what actually happens inside the data files, I’ve added a new feature to <a href="https://github.com/KernelMaker/ibdNinja">ibdNinja</a>, an interactive BLOB inspection mode:</p>

<p><strong>--inspect-blob</strong></p>

<p>This feature is designed as an extension of ibdNinja’s existing inspection workflow, allowing you to drill down from high-level structures to the actual BLOB data stored on disk.</p>

<h3 id="how-it-works">How it works:</h3>

<h4 id="step-1">Step 1</h4>
<p>Use ibdNinja’s existing features to parse, extract, and print information from a MySQL <code class="language-plaintext highlighter-rouge">.ibd</code> file at the table, index, page, and record levels.
Once you’ve located a record you want to dive deeper into, note its page number and record number.</p>

<h4 id="step-2">Step 2</h4>
<p>Pass those identifiers to <code class="language-plaintext highlighter-rouge">--inspect-blob</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ibdNinja -f &lt;table.ibd&gt; --inspect-blob &lt;page_no&gt;,&lt;record_no&gt;
</code></pre></div></div>

<p>to start an interactive inspection of the BLOB field in that record.</p>

<p><img src="/public/images/2026-02-08/1.png" alt="image-1" /></p>

<p>As shown above, ibdNinja will:</p>

<ol>
  <li>Traverse the external BLOB page chain</li>
  <li>Reconstruct the version chain introduced by partial updates</li>
  <li>Visualize the complete on-disk layout of the BLOB across all versions</li>
</ol>

<p>From there, you can choose <strong>any version</strong> and:</p>

<ol>
  <li>Hex-print or dump the full value for binary BLOBs (images, raw binary data, etc.)</li>
  <li>Decode JSON BLOBs (MySQL JSON is still a BLOB internally) into readable text, or inspect the raw MySQL-encoded JSON in hex</li>
</ol>

<p>If some historical versions have already been purged, ibdNinja will detect that and clearly report it.</p>

<p>If you’re into MySQL data file internals, or knee-deep in development, debugging, or production issues, give ibdNinja a try, dig under the hood — and consider bug reports part of the feature set.</p>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[In a previous post, I explored how MySQL implements partial updates and multi-versioning for BLOB columns internally.]]></summary>
  </entry>
  <entry>
    <title type="html">A POC on optimizing MySQL’s unique index insertion path</title>
    <link href="https://kernelmaker.github.io/unique-index-poc" rel="alternate" type="text/html" title="A POC on optimizing MySQL’s unique index insertion path"/>
    <published>2026-01-25T00:00:00+00:00</published>
    <updated>2026-01-25T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/unique-index-poc</id>
    <content type="html" xml:base="https://kernelmaker.github.io/unique-index-poc"><![CDATA[<p>A few months ago, I wrote a post about a possible optimization in MySQL’s unique index insertion path. As illustrated there, the idea is to reduce the current 3 B+Tree searches into 1 B+Tree search plus a scan on the leaf page (or leaf level), in order to avoid the overhead of repeatedly traversing the tree.
This weekend, I implemented a quick proof-of-concept on MySQL 8.0.45 and measured the effect.</p>

<p><img src="/public/images/2026-01-25/1.png" alt="image-1" /></p>

<h3 id="1-setup">1. Setup:</h3>

<p>Table with 200K rows, a VARCHAR(700) unique key (latin1), creating a tall B-tree:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TABLE t1 (
 id INT PRIMARY KEY AUTO_INCREMENT,
 uk_col VARCHAR(700) NOT NULL,
 UNIQUE KEY uk_idx (uk_col)
) ENGINE=InnoDB CHARACTER SET latin1;
</code></pre></div></div>

<h3 id="2-test-procedure">2. Test procedure:</h3>

<ul>
  <li>Insert 100 TARGET rows with prefix “TARGET_ROW_”</li>
  <li>Start a blocker transaction (START TRANSACTION WITH CONSISTENT SNAPSHOT) to prevent purge</li>
  <li>Delete the 100 TARGET rows (creates delete-marked records)</li>
  <li>Re-insert the same 100 TARGET rows; this triggers the duplicate-check path, since delete-marked records with the same unique key exist</li>
  <li>Instrument row_ins_sec_index_entry_low() with timing around each B-tree search.</li>
  <li>Run the benchmark twice: once with the original path, reset metrics, then with the optimized path</li>
</ul>

<h3 id="3-results">3. Results:</h3>
<h4 id="original-path-3-b-tree-searches">Original path (3 B-tree searches):</h4>
<ul>
  <li>Search1: ~6,508 ns</li>
  <li>Search2: ~5,649 ns</li>
  <li>Search3: ~2,498 ns</li>
  <li>Total: ~14,656 ns</li>
</ul>

<h4 id="optimized-path-1-b-tree-search--inline-scan">Optimized path (1 B-tree search + inline scan):</h4>
<ul>
  <li>Search1: ~7,272 ns</li>
  <li>Inline: ~3,118 ns</li>
  <li>Total: ~10,390 ns</li>
</ul>

<p><strong>Improvement: ~29.1% reduction in search-path time</strong></p>

<p>This test focuses specifically on the unique index insertion path (row_ins_sec_index_entry_low()), comparing the cost of the original three searches with the optimized “one search + inline scan” approach. In this local scope, the saving is close to 30%, which matches the intuition of collapsing three tree traversals into one.</p>
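<p>As a quick sanity check on the arithmetic, using the rounded per-search numbers listed above:</p>

```cpp
#include <cassert>

// Reduction in search-path time, from the rounded measurements above
// (the listed totals round slightly differently from the raw values).
double searchpath_reduction() {
    double original  = 6508 + 5649 + 2498;  // ~14,655 ns
    double optimized = 7272 + 3118;         // ~10,390 ns
    return 1.0 - optimized / original;      // ~0.291, i.e. ~29.1%
}
```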

<h3 id="4-however-when-evaluating-the-overall-benefit-there-are-a-few-important-considerations">4. However, when evaluating the overall benefit, there are a few important considerations:</h3>
<ul>
  <li>
    <p>In a single-row insert, how large is this part relative to the whole insert path? If its share is small, the end-to-end gain will be diluted. In my tests, when measuring the full insert path, the improvement drops to single-digit percentages.</p>
  </li>
  <li>
    <p>Under concurrent workloads, each of the three B-tree searches holds page latches. This is one of the key factors affecting scalability. Reducing this section by ~30% also shortens latch holding time, so the benefit may be more visible in parallel scenarios.</p>
  </li>
  <li>
    <p>While implementing the POC, I also realized that this optimization is not a silver bullet. There are cases that still need to fall back to the original path, although there are ways to minimize how often that happens.</p>
  </li>
</ul>

<p>These are just the numbers from a quick POC. If this direction turns out to be meaningful, it would still require much more careful design, implementation, and testing.</p>

<p><a href="https://bugs.mysql.com/bug.php?id=118363">Bug #118363</a></p>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[A few months ago, I wrote a post about a possible optimization in MySQL’s unique index insertion path. As illustrated there, the idea is to reduce the current 3 B+Tree searches into 1 B+Tree search plus a scan on the leaf page (or leaf level), in order to avoid the overhead of repeatedly traversing the tree. This weekend, I implemented a quick proof-of-concept on MySQL 8.0.45 and measured the effect.]]></summary>
  </entry>
  <entry>
    <title type="html">MySQL BLOB Internals - Partial Update Implementation and Multi-Versioning</title>
    <link href="https://kernelmaker.github.io/mysql-blob" rel="alternate" type="text/html" title="MySQL BLOB Internals - Partial Update Implementation and Multi-Versioning"/>
    <published>2025-12-01T00:00:00+00:00</published>
    <updated>2025-12-01T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/mysql-blob</id>
    <content type="html" xml:base="https://kernelmaker.github.io/mysql-blob"><![CDATA[<p>In this blog, I would like to introduce the implementation of BLOB and BLOB partial update in MySQL, and explain how the current design works together with the MVCC module to support multi-version control for BLOB columns.</p>

<h1 id="1-background">1. Background</h1>

<p>Before going into the details, I would like to briefly introduce two important concepts that are closely related to this topic.</p>

<h2 id="1-basic-principles-of-mysql-mvcc-multi-version-concurrency-control">1. Basic Principles of MySQL MVCC (Multi-Version Concurrency Control)</h2>

<p>MySQL supports snapshot reads. Each read transaction reads data based on a certain snapshot, so even if other write transactions modify the data during the execution of a read transaction, the read transaction will always see the version it is supposed to see.</p>

<p>The underlying mechanism is that a write transaction directly updates the data in place on the primary key record. However, before the update happens, the old value of the field to be modified is copied into the undo space. At the same time, there is a ROLL_PTR field in the row that points to the exact location in the undo space where the old value (the undo log record) is stored.</p>

<p><img src="/public/images/2025-12-01/1.png" alt="image-1" /></p>

<p>As shown in the figure above, there is a row in the primary key index that contains three fields. Suppose a write transaction is modifying Field 2. It will first copy the original value of Field 2 into the undo space, and then overwrite Field 2 directly in the row. After that, two important system fields of the row are updated:</p>

<ul>
  <li>TRX_ID is set to the ID of the current write transaction and is used later by read transactions to determine visibility.</li>
  <li>ROLL_PTR points to the exact location in the undo space where the old value of the modified field is stored, and is used to reconstruct the previous version of the row when needed.</li>
</ul>

<p>After the update is finished, if a previously existing read transaction reads this row again, it will find, based on the TRX_ID, that the row has been modified by a later write transaction. Therefore, the current version of the row is not visible to this read transaction. It must roll back to the previous version. At this point, it uses the ROLL_PTR to locate the old value in the undo space, applies it to the current row, and thus reconstructs the version that it is supposed to see.</p>
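<p>The visibility walk described above can be sketched as follows. This is a deliberately simplified model with hypothetical struct names: one field per row, one undo record per version, and a bare <code class="language-plaintext highlighter-rouge">trx_id &lt; snapshot</code> visibility rule; real InnoDB read views and undo records are far more involved:</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified structures for illustration only.
struct UndoRec {
    std::string old_value;     // pre-image of the modified field
    long trx_id;               // transaction that wrote that old version
    const UndoRec* roll_ptr;   // next-older version, or nullptr
};

struct Row {
    std::string value;         // current field value (updated in place)
    long trx_id;               // last writer's transaction ID
    const UndoRec* roll_ptr;   // points into the "undo space"
};

// A reader sees a version only if its writer started before the reader's
// snapshot (modeled here as trx_id < snapshot_trx_id).
std::string read_visible(const Row& row, long snapshot_trx_id) {
    if (row.trx_id < snapshot_trx_id)
        return row.value;                 // current version is visible
    std::string value = row.value;
    for (const UndoRec* u = row.roll_ptr; u != nullptr; u = u->roll_ptr) {
        value = u->old_value;             // apply the pre-image (roll back)
        if (u->trx_id < snapshot_trx_id)
            return value;                 // reconstructed version is visible
    }
    return value;                          // oldest version still available
}
```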

<h2 id="2-basic-implementation-of-mysql-blob">2. Basic Implementation of MySQL BLOB</h2>

<p>The primary key record in MySQL contains the values of all fields and is stored in the clustered index. However, BLOB columns are an exception. Since they are usually very large, MySQL stores their data in separate data pages called external pages.</p>

<p>A BLOB value is split into multiple parts and stored sequentially across multiple external pages. These pages are linked together in order, like a linked list. So how does the primary key record locate the corresponding BLOB data stored in those external pages? For each BLOB column, the clustered record stores a reference (lob::ref_t). This ref_t contains some metadata about the column and a pointer to the first external page where the BLOB data starts.</p>

<p><img src="/public/images/2025-12-01/2.png" alt="image-2" /></p>

<p>When reading the row, MySQL first locates the row via the primary key index, then follows this reference to find the external pages and reconstructs the full BLOB value by copying the data from those pages.</p>
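<p>The old design can be sketched as a simple page-chain walk (hypothetical structures for illustration, not InnoDB’s actual types):</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// One external page: a slice of BLOB data plus a link to the next page.
struct ExternalPage {
    std::string data;
    int next_page = -1;        // index of the next piece, -1 = end of chain
};

// lob::ref_t-like reference stored in the clustered record (simplified).
struct BlobRef {
    int first_page;            // where the BLOB data starts
    size_t total_len;          // total BLOB length
};

// Reassemble the full BLOB by following the page chain from the reference.
std::string read_blob(const std::vector<ExternalPage>& pages, BlobRef ref) {
    std::string out;
    for (int p = ref.first_page; p != -1; p = pages[p].next_page)
        out += pages[p].data;
    return out.substr(0, ref.total_len);
}
```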

<p>This is a very straightforward and intuitive design, simple and sufficient. It is also exactly how BLOB was implemented in older versions of MySQL.</p>

<h2 id="3-a-thought-exercise">3. A “Thought Exercise”</h2>

<p>Based on the two points above, here is a question:</p>

<p><strong>How is MVCC implemented for BLOB in MySQL?</strong></p>

<p>The intuitive answer is as follows: the lob::ref_t stored in the primary key record follows the same MVCC rules. Every time a BLOB column is updated, the old BLOB value is read out, modified, and then the entire modified BLOB is written into newly allocated external pages. The corresponding lob::ref_t in the primary key record is overwritten with the new reference. At the same time, following the MVCC mechanism, the old lob::ref_t is copied into the undo space.</p>

<p><img src="/public/images/2025-12-01/3.png" alt="image-3" /></p>

<p>After the modification, the situation looks like this (as shown in the figure): the undo space stores the lob::ref_t that points to the old BLOB value, while the lob::ref_t in the primary key record points to the new value.</p>

<p>This is exactly how older versions of MySQL worked. The next question is:</p>

<p><strong>What are the pros and cons of this design?</strong></p>

<p>The advantage is that the undo log only needs to record the lob::ref_t, and it does not need to store the entire old BLOB value.</p>

<p>The disadvantage is that no matter how small the change to the BLOB is, even if only a single byte is modified, the entire modified BLOB still has to be written into newly allocated external pages. BLOB columns are usually very large, so if each update only changes a very small portion, this design introduces a lot of extra I/O and space overhead.</p>

<p>A typical example is JSON. Internally, MySQL stores JSON as BLOB. Usually, updates to JSON are local and small. However, with the old design, each small partial update still requires reading the entire JSON, modifying a part of it, and then inserting the whole value back again. This is obviously very heavy.</p>

<p>So how to solve this problem? MySQL introduced BLOB partial update to address it.</p>

<h1 id="2-implementation-of-blob-partial-update">2. Implementation of BLOB Partial Update</h1>

<p>MySQL optimized the format of the external pages used to store BLOB data and redesigned the original simple linked-list structure:</p>

<p><img src="/public/images/2025-12-01/4.png" alt="image-4" /></p>

<ol>
  <li>Each external page now has a corresponding index entry.</li>
  <li>These index entries are organized as a linked list and stored in the <strong>BLOB first page</strong>. (If there are too many index entries to fit, they are stored in separate BLOB index pages.)</li>
  <li>Under normal circumstances, these index entries are linked together in order, just like the external pages in the old implementation.</li>
  <li>To support partial updates, MySQL changes the granularity of BLOB updates from the whole BLOB to individual external pages. Only the external pages involved in the current modification are updated. The modified external page is copied into a new page and updated there, while the other external pages remain unchanged.</li>
</ol>

<p>Then the question becomes: <strong>how can MySQL make sure that it can read the correct new and old BLOB values?</strong> The answer is that the new external page and the old external page share the same logical position in the index entry list. In other words, at this specific position in the list, there are now two versions, version 1 and version 2. Which one is used is determined by the version number recorded in the current lob::ref_t. The idea is illustrated in the figure below.</p>

<p><img src="/public/images/2025-12-01/5.png" alt="image-5" /></p>

<p>In summary, MySQL transforms the original external-page linked list into a linked list of index entries. For each index entry in this list, if the corresponding external page is modified, a new version of the index entry is created at the same horizontal position to point to the new version of that external page. Essentially, this introduces multi-versioning for external pages.</p>
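<p>The version-selection idea can be sketched like this (hypothetical types; real index entries live in the BLOB first page and BLOB index pages and carry more metadata than a version number):</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// One version of the external page at a given logical position.
struct IndexEntryVersion {
    int version;               // LOB version that created this page image
    std::string page_data;
};

// One horizontal slot in the index entry list: newest version first.
using IndexEntry = std::vector<IndexEntryVersion>;

// Reconstruct the BLOB as seen by a reader whose lob::ref_t records
// ref_version: at each position, take the newest entry whose version
// is <= ref_version.
std::string read_blob_version(const std::vector<IndexEntry>& entries,
                              int ref_version) {
    std::string out;
    for (const IndexEntry& slot : entries)
        for (const IndexEntryVersion& v : slot)
            if (v.version <= ref_version) { out += v.page_data; break; }
    return out;
}
```

<p>A reader holding an old <code class="language-plaintext highlighter-rouge">lob::ref_t</code> (version 1) skips newer page images and still assembles its original value, while a reader holding version 2 picks up the rewritten page at that position.</p>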

<h2 id="special-case-blob-small-changes">Special Case: BLOB Small Changes</h2>

<p>The implementation described above is not the whole story. MySQL makes a practical trade-off between creating a new index entry (which requires copying the entire external page) and copying only the modified portion into the undo space.</p>

<p>For BLOB small-change scenarios, when the modification to a BLOB is smaller than 100 bytes, MySQL does not create a new index entry and link it into the version chain for that page. Instead, it modifies the page in place. Following MVCC principles, the portion to be modified is first written into the undo space before the in-place update happens.</p>

<p><img src="/public/images/2025-12-01/6.png" alt="image-6" /></p>

<p>It is worth noting that in this case, the lob::ref_t stored in the primary key record does not advance its base version number. It shares the same base as the previous version. When a read transaction needs to read the previous version, it first constructs the latest BLOB value based on the lob::ref_t and the index entry list. Then, following the MVCC logic, it finds that the TRX_ID indicates that this version is not visible. At this point, it follows the ROLL_PTR to the undo space, where the old value of the modified external page is stored. By applying that old data back onto the current value, the complete and correct historical BLOB value can be reconstructed.</p>

<p>In this scenario, the recovery process is a combination of two steps:</p>

<ol>
  <li>First, the version corresponding to the lob::ref_t is reconstructed via the index entry version chain.</li>
  <li>Then, the version visible to the current transaction is reconstructed via the ROLL_PTR chain.</li>
</ol>

<h2 id="index-entry-details">Index Entry Details</h2>

<p>Index entries are the key to the implementation of BLOB partial update. To make them easier to understand, I drew the following diagram to illustrate the logical relationships among index entries. It is a two-dimensional linked list. The horizontal dimension represents the sequential position when assembling the full BLOB value. The vertical dimension represents multiple versions at the same position. Each time the page at that position is modified, a new node is added vertically.</p>

<p><img src="/public/images/2025-12-01/7.png" alt="image-7" /></p>

<p>Of course, this is only a logical model. The physical layout is not organized exactly like this. Each BLOB has a BLOB first page. This page stores a portion of the BLOB data (the initial part) and 10 index entries. Each index entry corresponds to one BLOB data page. When all 10 index entries are used up, a new BLOB index page is allocated, and additional index entries are allocated from there. In reality, the index entries distributed across the BLOB first page and the BLOB index pages are linked together to form the logical structure shown in the diagram above.</p>

<p><img src="/public/images/2025-12-01/8.png" alt="image-8" /></p>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[In this blog, I would like to introduce the implementation of BLOB and BLOB partial update in MySQL, and explain how the current design works together with the MVCC module to support multi-version control for BLOB columns.]]></summary>
  </entry>
  <entry>
    <title type="html">SIMD in Vector Search - “Hand-Tuned SIMD vs Compiler Auto-Vectorization”</title>
    <link href="https://kernelmaker.github.io/simd" rel="alternate" type="text/html" title="SIMD in Vector Search - “Hand-Tuned SIMD vs Compiler Auto-Vectorization”"/>
    <published>2025-09-08T00:00:00+00:00</published>
    <updated>2025-09-08T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/simd</id>
    <content type="html" xml:base="https://kernelmaker.github.io/simd"><![CDATA[<p><strong>SIMD</strong> (Single instruction, multiple data) is often one of the key optimization techniques in vector search. In particular, when computing the distance between two vectors, SIMD can transform what was originally a one-dimensional-at-a-time calculation into 8- or 16-dimensions-at-a-time, significantly improving performance.</p>

<p>Here, as I mentioned in previous posts, MariaDB and pgvector take different approaches:</p>

<ol>
  <li><strong>MariaDB</strong>: directly implements distance functions using SIMD instructions.</li>
  <li><strong>pgvector</strong>: implements distance functions in a naive way and relies on compiler optimization (<code class="language-plaintext highlighter-rouge">-ftree-vectorize</code>) for vectorization.</li>
</ol>

<p>To better understand the benefits of SIMD vectorization, and to compare these two approaches, I ran a series of benchmarks — and <strong>discovered some surprising performance results along the way.</strong></p>

<h2 id="1-test-environment-and-method">1. Test Environment and Method</h2>

<p><strong>Environment</strong></p>

<ol>
  <li>AWS EC2: c5.4xlarge, 16 vCPUs, 32 GiB memory</li>
  <li>Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz</li>
  <li>gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0</li>
</ol>

<p><strong>Method</strong></p>

<ol>
  <li>
    <p>First, I implemented 4 different squared L2 distance (L2sq) functions (i.e., Euclidean distance without the square root):</p>

    <ul>
      <li>Naive L2sq implementation</li>
    </ul>

    <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">double</span> <span class="nf">l2sq_naive_f32</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">b</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">float</span> <span class="n">acc</span> <span class="o">=</span> <span class="mf">0.</span><span class="n">f</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="kt">float</span> <span class="n">d</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="n">acc</span> <span class="o">+=</span> <span class="n">d</span> <span class="o">*</span> <span class="n">d</span><span class="p">;</span> <span class="p">}</span>
  <span class="k">return</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">acc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>    </div>

    <ul>
      <li>Naive high-precision L2sq (converting float to double before computation)</li>
    </ul>

    <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">double</span> <span class="nf">l2sq_naive_f64</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">b</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">double</span> <span class="n">acc</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="kt">double</span> <span class="n">d</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="n">acc</span> <span class="o">+=</span> <span class="n">d</span> <span class="o">*</span> <span class="n">d</span><span class="p">;</span> <span class="p">}</span>
  <span class="k">return</span> <span class="n">acc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>    </div>

    <ul>
      <li>SIMD (AVX2) L2sq implementation, computing 8 dimensions at a time</li>
    </ul>

    <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Reference: simSIMD</span>
<span class="n">SIMSIMD_PUBLIC</span> <span class="kt">void</span> <span class="nf">simsimd_l2sq_f32_haswell</span><span class="p">(</span><span class="n">simsimd_f32_t</span> <span class="k">const</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span>
                                             <span class="n">simsimd_f32_t</span> <span class="k">const</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span>
                                             <span class="n">simsimd_size_t</span> <span class="n">n</span><span class="p">,</span>
                                             <span class="n">simsimd_distance_t</span> <span class="o">*</span><span class="n">result</span><span class="p">)</span> <span class="p">{</span>
   
    <span class="n">__m256</span> <span class="n">d2_vec</span> <span class="o">=</span> <span class="n">_mm256_setzero_ps</span><span class="p">();</span>
    <span class="n">simsimd_size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">8</span> <span class="o">&lt;=</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="mi">8</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">__m256</span> <span class="n">a_vec</span> <span class="o">=</span> <span class="n">_mm256_loadu_ps</span><span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">i</span><span class="p">);</span>
        <span class="n">__m256</span> <span class="n">b_vec</span> <span class="o">=</span> <span class="n">_mm256_loadu_ps</span><span class="p">(</span><span class="n">b</span> <span class="o">+</span> <span class="n">i</span><span class="p">);</span>
        <span class="n">__m256</span> <span class="n">d_vec</span> <span class="o">=</span> <span class="n">_mm256_sub_ps</span><span class="p">(</span><span class="n">a_vec</span><span class="p">,</span> <span class="n">b_vec</span><span class="p">);</span>
        <span class="n">d2_vec</span> <span class="o">=</span> <span class="n">_mm256_fmadd_ps</span><span class="p">(</span><span class="n">d_vec</span><span class="p">,</span> <span class="n">d_vec</span><span class="p">,</span> <span class="n">d2_vec</span><span class="p">);</span>
    <span class="p">}</span>
   
    <span class="n">simsimd_f64_t</span> <span class="n">d2</span> <span class="o">=</span> <span class="n">_simsimd_reduce_f32x8_haswell</span><span class="p">(</span><span class="n">d2_vec</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">float</span> <span class="n">d</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="n">d2</span> <span class="o">+=</span> <span class="n">d</span> <span class="o">*</span> <span class="n">d</span><span class="p">;</span>
    <span class="p">}</span>
   
    <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">d2</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">SIMSIMD_INTERNAL</span> <span class="n">simsimd_f64_t</span> <span class="n">_simsimd_reduce_f32x8_haswell</span><span class="p">(</span><span class="n">__m256</span> <span class="n">vec</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Convert the lower and higher 128-bit lanes of the input vector to double precision</span>
    <span class="n">__m128</span> <span class="n">low_f32</span> <span class="o">=</span> <span class="n">_mm256_castps256_ps128</span><span class="p">(</span><span class="n">vec</span><span class="p">);</span>
    <span class="n">__m128</span> <span class="n">high_f32</span> <span class="o">=</span> <span class="n">_mm256_extractf128_ps</span><span class="p">(</span><span class="n">vec</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
   
    <span class="c1">// Convert single-precision (float) vectors to double-precision (double) vectors</span>
    <span class="n">__m256d</span> <span class="n">low_f64</span> <span class="o">=</span> <span class="n">_mm256_cvtps_pd</span><span class="p">(</span><span class="n">low_f32</span><span class="p">);</span>
    <span class="n">__m256d</span> <span class="n">high_f64</span> <span class="o">=</span> <span class="n">_mm256_cvtps_pd</span><span class="p">(</span><span class="n">high_f32</span><span class="p">);</span>
   
    <span class="c1">// Perform the addition in double-precision</span>
    <span class="n">__m256d</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">_mm256_add_pd</span><span class="p">(</span><span class="n">low_f64</span><span class="p">,</span> <span class="n">high_f64</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">_simsimd_reduce_f64x4_haswell</span><span class="p">(</span><span class="n">sum</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">SIMSIMD_INTERNAL</span> <span class="n">simsimd_f64_t</span> <span class="n">_simsimd_reduce_f64x4_haswell</span><span class="p">(</span><span class="n">__m256d</span> <span class="n">vec</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Reduce the double-precision vector to a scalar</span>
    <span class="c1">// Horizontal add the first and second double-precision values, and third and fourth</span>
    <span class="n">__m128d</span> <span class="n">vec_low</span> <span class="o">=</span> <span class="n">_mm256_castpd256_pd128</span><span class="p">(</span><span class="n">vec</span><span class="p">);</span>
    <span class="n">__m128d</span> <span class="n">vec_high</span> <span class="o">=</span> <span class="n">_mm256_extractf128_pd</span><span class="p">(</span><span class="n">vec</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">__m128d</span> <span class="n">vec128</span> <span class="o">=</span> <span class="n">_mm_add_pd</span><span class="p">(</span><span class="n">vec_low</span><span class="p">,</span> <span class="n">vec_high</span><span class="p">);</span>
   
    <span class="c1">// Horizontal add again to accumulate all four values into one</span>
    <span class="n">vec128</span> <span class="o">=</span> <span class="n">_mm_hadd_pd</span><span class="p">(</span><span class="n">vec128</span><span class="p">,</span> <span class="n">vec128</span><span class="p">);</span>
   
    <span class="c1">// Convert the final sum to a scalar double-precision value and return</span>
    <span class="k">return</span> <span class="n">_mm_cvtsd_f64</span><span class="p">(</span><span class="n">vec128</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div>    </div>

    <ul>
      <li>SIMD (AVX-512) L2sq implementation, computing 16 dimensions at a time</li>
    </ul>

    <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Reference: simSIMD</span>
<span class="n">SIMSIMD_PUBLIC</span> <span class="kt">void</span> <span class="nf">simsimd_l2sq_f32_skylake</span><span class="p">(</span><span class="n">simsimd_f32_t</span> <span class="k">const</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span>
                                             <span class="n">simsimd_f32_t</span> <span class="k">const</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span>
                                             <span class="n">simsimd_size_t</span> <span class="n">n</span><span class="p">,</span>
                                             <span class="n">simsimd_distance_t</span> <span class="o">*</span><span class="n">result</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">__m512</span> <span class="n">d2_vec</span> <span class="o">=</span> <span class="n">_mm512_setzero</span><span class="p">();</span>
    <span class="n">__m512</span> <span class="n">a_vec</span><span class="p">,</span> <span class="n">b_vec</span><span class="p">;</span>
   
<span class="nl">simsimd_l2sq_f32_skylake_cycle:</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">__mmask16</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">__mmask16</span><span class="p">)</span><span class="n">_bzhi_u32</span><span class="p">(</span><span class="mh">0xFFFFFFFF</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
        <span class="n">a_vec</span> <span class="o">=</span> <span class="n">_mm512_maskz_loadu_ps</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>
        <span class="n">b_vec</span> <span class="o">=</span> <span class="n">_mm512_maskz_loadu_ps</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">b</span><span class="p">);</span>
        <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">else</span> <span class="p">{</span>
        <span class="n">a_vec</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
        <span class="n">b_vec</span> <span class="o">=</span> <span class="n">_mm512_loadu_ps</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
        <span class="n">a</span> <span class="o">+=</span> <span class="mi">16</span><span class="p">,</span> <span class="n">b</span> <span class="o">+=</span> <span class="mi">16</span><span class="p">,</span> <span class="n">n</span> <span class="o">-=</span> <span class="mi">16</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">__m512</span> <span class="n">d_vec</span> <span class="o">=</span> <span class="n">_mm512_sub_ps</span><span class="p">(</span><span class="n">a_vec</span><span class="p">,</span> <span class="n">b_vec</span><span class="p">);</span>
    <span class="n">d2_vec</span> <span class="o">=</span> <span class="n">_mm512_fmadd_ps</span><span class="p">(</span><span class="n">d_vec</span><span class="p">,</span> <span class="n">d_vec</span><span class="p">,</span> <span class="n">d2_vec</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">goto</span> <span class="n">simsimd_l2sq_f32_skylake_cycle</span><span class="p">;</span>
   
    <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">_simsimd_reduce_f32x16_skylake</span><span class="p">(</span><span class="n">d2_vec</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">......</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>I generated a dataset of 10,000 float vectors (dimension = 1024, 64B aligned) and one target vector. Then, for the following 5 scenarios, I searched for the vector with the closest L2sq distance to the target. Each distance computation was repeated 16 times (to create a CPU-intensive workload), and each scenario was executed 5 times, taking the median runtime to eliminate random fluctuations:</p>

    <ol>
      <li>SIMD L2sq implementation</li>
      <li>Naive L2sq implementation</li>
      <li>Naive L2sq with compiler vectorization disabled (<code class="language-plaintext highlighter-rouge">-fno-tree-vectorize -fno-builtin -fno-lto -Wno-cpp -Wno-pragmas</code>)</li>
      <li>Naive high-precision L2sq implementation</li>
      <li>Naive high-precision L2sq with compiler vectorization disabled</li>
    </ol>
  </li>
  <li>
    <p>Compile with AVX2 (<code class="language-plaintext highlighter-rouge">-O3 -mavx2 -mfma -mf16c -mbmi2</code>) and run the 5 scenarios.</p>
  </li>
  <li>
    <p>Compile with AVX-512 (<code class="language-plaintext highlighter-rouge">-O3 -mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512cd -mfma -mf16c -mbmi2</code>) and run the 5 scenarios again.</p>
  </li>
</ol>

<h2 id="2-results-and-analysis">2. Results and Analysis</h2>

<p><img src="/public/images/2025-09-08/1.png" alt="image-1" /></p>

<h4 id="expected-results">Expected results:</h4>

<ol>
  <li>
    <p>SIMD L2sq implementations are much faster than others, and AVX-512 outperforms AVX2 since it processes 16 dimensions at once instead of 8.</p>
  </li>
  <li>
    <p>Under AVX2, naive L2sq (178.385ms) is faster than naive high-precision L2sq (183.973ms), because the latter incurs float→double conversion overhead.</p>
  </li>
  <li>
    <p>Under both AVX2 and AVX-512, naive implementations with compiler vectorization disabled perform the worst, since they are forced into scalar execution.</p>
  </li>
</ol>

<h4 id="unexpected-results">Unexpected results</h4>

<p>In addition to the expected results above, some surprising findings appeared:</p>

<ol>
  <li>For naive L2sq, <strong>the AVX-512 build (208.822ms) was actually slower than the AVX2 build (178.385ms).</strong></li>
  <li>With AVX-512, <strong>naive L2sq was slower than naive high-precision L2sq.</strong></li>
</ol>

<p>Both deserve deeper analysis.</p>

<p><strong>(1) Why was naive L2sq with AVX-512 slower than with AVX2?</strong></p>

<p>Although this is a naive implementation, with <code class="language-plaintext highlighter-rouge">-O3</code> we would expect the compiler to auto-vectorize it. However, the code the compiler generated was far slower than our manual SIMD implementation, and the AVX-512 build even performed worse than the AVX2 build.</p>

<p>To investigate further, I used <code class="language-plaintext highlighter-rouge">objdump</code> to examine the AVX2 and AVX-512 binaries for <code class="language-plaintext highlighter-rouge">l2sq_naive_f32()</code>.</p>

<ul>
  <li>
    <p>Under AVX2:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000007090 &lt;_ZL19l2sq_naive_f32PKfS0_m&gt;:
     ... ...
     70b7:       48 c1 ee 03             shr    rsi,0x3
     70bb:       48 c1 e6 05             shl    rsi,0x5
     70bf:       90                      nop
     70c0:       c5 fc 10 24 07          vmovups ymm4,YMMWORD PTR [rdi+rax*1]
     70c5:       c5 dc 5c 0c 01          vsubps ymm1,ymm4,YMMWORD PTR [rcx+rax*1]
     70ca:       48 83 c0 20             add    rax,0x20
     70ce:       c5 f4 59 c9             vmulps ymm1,ymm1,ymm1
       
     70d2:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
     70d6:       c5 f0 c6 d9 55          vshufps xmm3,xmm1,xmm1,0x55
     70db:       c5 f0 c6 d1 ff          vshufps xmm2,xmm1,xmm1,0xff
     70e0:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
     70e4:       c5 f0 15 d9             vunpckhps xmm3,xmm1,xmm1
     70e8:       c4 e3 7d 19 c9 01       vextractf128 xmm1,ymm1,0x1
     70ee:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
     70f2:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
     70f6:       c5 f0 c6 d1 55          vshufps xmm2,xmm1,xmm1,0x55
     70fb:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
     70ff:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
     7103:       c5 f0 15 d1             vunpckhps xmm2,xmm1,xmm1
     7107:       c5 f0 c6 c9 ff          vshufps xmm1,xmm1,xmm1,0xff
     710c:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
     7110:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
     ... ...
</code></pre></div>    </div>

    <p>The compiler did use vector instructions (<code class="language-plaintext highlighter-rouge">vmovups</code>, <code class="language-plaintext highlighter-rouge">vsubps</code>, <code class="language-plaintext highlighter-rouge">vmulps</code>) to compute L2sq in groups of 8 floats. But when folding the 8 results horizontally into <code class="language-plaintext highlighter-rouge">xmm0</code>, it extracted elements using <code class="language-plaintext highlighter-rouge">vshufps</code>, <code class="language-plaintext highlighter-rouge">vunpckhps</code>, <code class="language-plaintext highlighter-rouge">vextractf128</code>, etc., and then added them one by one with scalar <code class="language-plaintext highlighter-rouge">vaddss</code>. Worse, this folding happened <strong>in every iteration</strong>.</p>

    <p><img src="/public/images/2025-09-08/2.png" alt="image-2" /></p>

    <p>This per-iteration horizontal reduction became the bottleneck. Instead, like the manual SIMD implementation, it should have accumulated vector results across the whole loop and performed just one horizontal reduction at the end.</p>
  </li>
  <li>
    <p>Under AVX-512:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    a057:       48 c1 ee 04             shr    rsi,0x4
    a05b:       48 c1 e6 06             shl    rsi,0x6
    a05f:       90                      nop
    a060:       62 f1 7c 48 10 2c 07    vmovups zmm5,ZMMWORD PTR [rdi+rax*1]
    a067:       62 f1 54 48 5c 0c 01    vsubps zmm1,zmm5,ZMMWORD PTR [rcx+rax*1]
    a06e:       48 83 c0 40             add    rax,0x40
    a072:       62 f1 74 48 59 c9       vmulps zmm1,zmm1,zmm1
      
    a078:       c5 f0 c6 e1 55          vshufps xmm4,xmm1,xmm1,0x55
    a07d:       c5 f0 c6 d9 ff          vshufps xmm3,xmm1,xmm1,0xff
    a082:       62 f3 75 28 03 d1 07    valignd ymm2,ymm1,ymm1,0x7
    a089:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
    a08d:       c5 fa 58 c4             vaddss xmm0,xmm0,xmm4
    a091:       c5 f0 15 e1             vunpckhps xmm4,xmm1,xmm1
    a095:       c5 fa 58 c4             vaddss xmm0,xmm0,xmm4
    a099:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a09d:       62 f3 7d 28 19 cb 01    vextractf32x4 xmm3,ymm1,0x1
    a0a4:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0a8:       62 f3 75 28 03 d9 05    valignd ymm3,ymm1,ymm1,0x5
    a0af:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0b3:       62 f3 75 28 03 d9 06    valignd ymm3,ymm1,ymm1,0x6
    a0ba:       62 f3 7d 48 1b c9 01    vextractf32x8 ymm1,zmm1,0x1
    a0c1:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0c5:       c5 f0 c6 d9 55          vshufps xmm3,xmm1,xmm1,0x55
    a0ca:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0ce:       c5 f0 c6 d1 ff          vshufps xmm2,xmm1,xmm1,0xff
    a0d3:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
    a0d7:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0db:       c5 f0 15 d9             vunpckhps xmm3,xmm1,xmm1
    a0df:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0e3:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0e7:       62 f3 7d 28 19 ca 01    vextractf32x4 xmm2,ymm1,0x1
    a0ee:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0f2:       62 f3 75 28 03 d1 05    valignd ymm2,ymm1,ymm1,0x5
    a0f9:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0fd:       62 f3 75 28 03 d1 06    valignd ymm2,ymm1,ymm1,0x6
    a104:       62 f3 75 28 03 c9 07    valignd ymm1,ymm1,ymm1,0x7
    a10b:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a10f:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
</code></pre></div>    </div>

    <p>The first part similarly used vector instructions to compute 16 values at a time. But folding 16 results was even more complex and expensive, involving <code class="language-plaintext highlighter-rouge">vshufps</code>, <code class="language-plaintext highlighter-rouge">valignd</code>, <code class="language-plaintext highlighter-rouge">vunpckhps</code>, <code class="language-plaintext highlighter-rouge">vextractf32x4</code>, <code class="language-plaintext highlighter-rouge">vextractf32x8</code>, etc. This additional complexity canceled out the gains from processing 16 dimensions per iteration, which explains why AVX-512 was slower.</p>
  </li>
</ul>
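
<p>The gap between the two reduction strategies can be illustrated without intrinsics. Below is a hypothetical scalar model (not the actual compiled code) in which an array of 8 floats stands in for a ymm register: the first function folds the 8 lanes into a scalar on every iteration, as the compiler did; the second keeps 8 independent partial sums and folds once at the end, as the manual SIMD code does. Both compute the same sum; only the amount of horizontal-reduction work differs.</p>

```cpp
#include <cstddef>

// Per-iteration folding: after each group of 8, the 8 squared
// differences are immediately summed into a scalar accumulator.
static double l2sq_fold_each_iter(const float* a, const float* b, std::size_t n) {
    float acc = 0.f;
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        float lane[8];
        for (int j = 0; j < 8; ++j) { float d = a[i + j] - b[i + j]; lane[j] = d * d; }
        for (int j = 0; j < 8; ++j) acc += lane[j];  // horizontal fold every iteration
    }
    return static_cast<double>(acc);
}

// Deferred folding: 8 independent partial sums accumulate across the
// whole loop; one horizontal reduction happens at the very end.
static double l2sq_fold_once(const float* a, const float* b, std::size_t n) {
    float lane[8] = {0};
    for (std::size_t i = 0; i + 8 <= n; i += 8)
        for (int j = 0; j < 8; ++j) { float d = a[i + j] - b[i + j]; lane[j] += d * d; }
    float acc = 0.f;
    for (int j = 0; j < 8; ++j) acc += lane[j];      // single fold at the end
    return static_cast<double>(acc);
}
```

<p>Note that the two orders of addition can differ slightly in floating point, which is exactly why the compiler refuses to make this transformation on its own, as discussed in question (3) below.</p>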

<p><strong>(2) Why was naive float L2sq slower than naive high-precision L2sq under AVX-512?</strong></p>

<p>Theoretically, high-precision L2sq should be slower because of its float→double conversions. So why was it faster than the plain float version under AVX-512?</p>

<p>Looking at the disassembly of <code class="language-plaintext highlighter-rouge">l2sq_naive_f64</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>000000000000a280 &lt;_ZL19l2sq_naive_f64PKfS0_m&gt;:
    a280:       f3 0f 1e fa             endbr64
    a284:       48 85 d2                test   rdx,rdx
    a287:       74 37                   je     a2c0 &lt;_ZL19l2sq_naive_f64_oncePKfS0_m+0x40&gt;
    a289:       c5 e0 57 db             vxorps xmm3,xmm3,xmm3
    a28d:       31 c0                   xor    eax,eax
    a28f:       c5 e9 57 d2             vxorpd xmm2,xmm2,xmm2
    a293:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
    a298:       c5 e2 5a 04 87          vcvtss2sd xmm0,xmm3,DWORD PTR [rdi+rax*4]
    a29d:       c5 e2 5a 0c 86          vcvtss2sd xmm1,xmm3,DWORD PTR [rsi+rax*4]
    a2a2:       c5 fb 5c c1             vsubsd xmm0,xmm0,xmm1
    a2a6:       48 83 c0 01             add    rax,0x1
    a2aa:       c5 fb 59 c0             vmulsd xmm0,xmm0,xmm0
    a2ae:       c5 eb 58 d0             vaddsd xmm2,xmm2,xmm0
    a2b2:       48 39 c2                cmp    rdx,rax
    a2b5:       75 e1                   jne    a298 &lt;_ZL19l2sq_naive_f64_oncePKfS0_m+0x18&gt;
    a2b7:       c5 eb 10 c2             vmovsd xmm0,xmm2,xmm2
    a2bb:       c3                      ret
    a2bc:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    a2c0:       c5 e9 57 d2             vxorpd xmm2,xmm2,xmm2
    a2c4:       c5 eb 10 c2             vmovsd xmm0,xmm2,xmm2
    a2c8:       c3                      ret
    a2c9:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
</code></pre></div></div>

<ul>
  <li>The code is much shorter than the float version.</li>
  <li>Although it includes scalar float→double conversions (<code class="language-plaintext highlighter-rouge">vcvtss2sd</code>) and computes one dimension at a time, it avoids the complex and costly 16-element horizontal folding.</li>
</ul>

<p>In other words, even with the conversion overhead, this simple scalar path was still faster than the float version with its per-iteration vector folding. For the double-precision loop, the compiler likely chose the conservative scalar path and skipped vectorization entirely.</p>

<p><strong>(3) How to Improve Naive L2sq for Better Compiler Vectorization?</strong></p>

<p>The per-iteration horizontal folding is likely a consequence of strict IEEE 754 semantics: floating-point addition is not associative, so the compiler must preserve the exact order of additions written in the source. That prevents it from reordering the additions into independent vectorized partial sums.</p>

<p>To relax this, we can explicitly allow reassociation:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">double</span> <span class="nf">l2sq_naive_f32</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span><span class="o">*</span> <span class="n">b</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">acc</span> <span class="o">=</span> <span class="mf">0.</span><span class="n">f</span><span class="p">;</span>
    <span class="cp">#pragma omp simd reduction(+:acc)
</span>    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">float</span> <span class="n">d</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="n">acc</span> <span class="o">+=</span> <span class="n">d</span> <span class="o">*</span> <span class="n">d</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">acc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And compile with <code class="language-plaintext highlighter-rouge">-fopenmp-simd</code> to enable this directive.</p>

<p>Running again shows a significant improvement: compiler auto-vectorization now achieves performance close to manual SIMD implementations. Using <code class="language-plaintext highlighter-rouge">-ffast-math</code> also works.</p>

<p><img src="/public/images/2025-09-08/3.png" alt="image-3" /></p>

<h2 id="3-summary">3. Summary</h2>

<ol>
  <li>SIMD significantly improves distance computation performance.</li>
  <li>Hand-written SIMD implementations perform best.</li>
  <li>For naive implementations, <strong>allowing reassociation</strong> (via <code class="language-plaintext highlighter-rouge">#pragma omp simd reduction(+:acc)</code> or appropriate subsets of <code class="language-plaintext highlighter-rouge">-ffast-math</code>) is the key to approaching hand-written SIMD performance. Under strict IEEE semantics, the compiler conservatively generates per-iteration folding, which creates slow paths where AVX-512 does not necessarily have an advantage.</li>
</ol>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[SIMD (Single instruction, multiple data) is often one of the key optimization techniques in vector search. In particular, when computing the distance between two vectors, SIMD can transform what was originally a one-dimensional-at-a-time calculation into 8- or 16-dimensions-at-a-time, significantly improving performance.]]></summary>
  </entry>
  <entry>
    <title type="html">Is pgvector breaking PostgreSQL’s Repeatable Read isolation?</title>
    <link href="https://kernelmaker.github.io/pgvector_rr" rel="alternate" type="text/html" title="Is pgvector breaking PostgreSQL’s Repeatable Read isolation?"/>
    <published>2025-08-11T00:00:00+00:00</published>
    <updated>2025-08-11T00:00:00+00:00</updated>
    <id>https://kernelmaker.github.io/pgvector_rr</id>
    <content type="html" xml:base="https://kernelmaker.github.io/pgvector_rr"><![CDATA[<p>This thought hit me on the way to work today:
(The table ‘items’ has an HNSW index on the vector column ‘embedding’)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM items ORDER BY embedding &lt;-&gt; '[3,1,2]' LIMIT 5;
……
</code></pre></div></div>

<p><strong>Can we really say this SELECT is repeatable read safe❓</strong></p>

<p>I used to assume pgvector, as a PostgreSQL extension, naturally inherits Postgres’s transactional guarantees — but after thinking it through, that might not be the case.</p>

<h3 id="postgresql-mvcc-relies-on-3-assumptions">PostgreSQL MVCC relies on 3 assumptions:</h3>

<ol>
  <li><strong>Indexes are append-only</strong>: Write operations only insert new index entries — never update or delete them.</li>
  <li><strong>The heap stores version history</strong>: Each row’s versions are retained for snapshot-based visibility checks.</li>
  <li><strong>VACUUM coordinates cleanup</strong>: It purges dead heap tuples and their corresponding index entries together.</li>
</ol>

<p>This works well with native ordered indexes such as nbtree. For example:</p>

<ol>
  <li>A REPEATABLE READ transaction performs the same SELECT twice.</li>
  <li>Between them, a new row B is inserted.</li>
  <li>In the second SELECT, B appears in the index scan but is filtered out after a heap visibility check.</li>
</ol>

<p>So, the query still returns the same results — consistent with REPEATABLE READ.</p>

<p><img src="/public/images/2025-08-11/1.png" alt="image-1" /></p>
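This interplay can be sketched as a toy Python model (a deliberate simplification, not PostgreSQL internals: the "heap" and "xids" below are stand-ins). The index only ever grows, and the snapshot filters out entries whose inserting transaction started after the snapshot was taken:

```python
# Toy model of an append-only index plus snapshot-based visibility checks.
index = []   # append-only list of (key, heap_tid) entries
heap = {}    # heap_tid -> inserting transaction id (xmin), simplified

def insert(key, tid, xid):
    heap[tid] = xid
    index.append((key, tid))   # writes only ever append new index entries

def snapshot_scan(snapshot_xid):
    # An entry is visible only if its inserting xid precedes the snapshot.
    return sorted(k for k, tid in index if heap[tid] < snapshot_xid)

insert("A", 1, xid=100)
snap = 200                     # REPEATABLE READ snapshot taken here
first = snapshot_scan(snap)
insert("B", 2, xid=300)        # concurrent insert after the snapshot
second = snapshot_scan(snap)
assert first == second == ["A"]   # B is reached via the index, then filtered
```

The key point: the concurrent insert changes what the index scan *finds*, but never what the transaction *sees*, because visibility is decided at the heap, and old index entries are never rewritten.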

<h3 id="but-hnsw-behaves-differently">But HNSW behaves differently…</h3>
<p>When inserting a new vector B:</p>

<ol>
  <li>B searches the graph to find neighbors.</li>
  <li>Selected neighbors (say, T) update their neighbor lists to include B.</li>
  <li>If T’s list is full, HNSW re-selects top-k neighbors — possibly evicting an existing node like D.</li>
</ol>

<p>Here’s the issue: T’s neighbor list is modified — breaking assumption #1.
Now, suppose a REPEATABLE READ transaction had previously discovered D via T. In its second identical query, it may no longer reach D, simply because D was evicted from T’s neighbor list. At the same time, the newly inserted B is now reachable — but is correctly rejected due to heap visibility checks.</p>
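The reachability loss can be reproduced with a tiny graph model (a thought-experiment sketch, not pgvector's actual code: one-dimensional "vectors", a cap of M neighbors per node, and the node names T, D, B from the prose):

```python
# Toy HNSW-style layer: each node keeps at most M neighbors by distance.
M = 2
coords = {"T": 0.0, "X": 1.0, "D": 2.0}          # 1-D stand-ins for vectors
graph = {"T": ["X", "D"], "X": ["T"], "D": ["T"]}

def reachable(start):
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

before = reachable("T")        # D is reachable via T's neighbor list

# Insert B close to T: T re-selects its top-M neighbors by distance,
# OVERWRITING the old list in place -- not appending a new versioned entry.
coords["B"] = 0.1
graph["B"] = ["T"]
graph["T"] = sorted(graph["T"] + ["B"],
                    key=lambda n: abs(coords[n] - coords["T"]))[:M]

after = reachable("T")         # D was evicted and is no longer reachable
assert "D" in before and "D" not in after
```

Unlike the btree case, no visibility check can save the second query here: D's heap tuple is still perfectly visible to the snapshot, but the graph walk never discovers it in the first place.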

<h3 id="root-cause">Root cause:</h3>

<ol>
  <li><strong>The HNSW index breaks MVCC’s immutability assumption</strong>: it performs in-place modifications to graph nodes during insertions.</li>
  <li><strong>No versioning in the HNSW index</strong>: there’s no way to preserve historical neighbor lists for concurrent transactions. Even though I prefer pgvector’s low-level, native integration (at the same level as nbtree), MariaDB’s design may provide better transactional isolation here: its HNSW index is implemented as a separate InnoDB table, which naturally supports MVCC, including versioned index “rows.”</li>
</ol>

<p>This question came to mind today — I reached a tentative conclusion through some code review and thought experiments. Haven’t verified this with a test case yet, so feel free to correct me if I’m wrong.</p>

<p>🤔 BTW, lately, I’ve been comparing how vector search is implemented in transactional databases vs dedicated vector databases by reading through their code. It’s exciting to see traditional databases embracing new trends — but what do you think:
Do transactions bring real value to vector search, or are they more of a burden in practice? And what about the other way around?</p>

<h3 id="discussion">Discussion</h3>

<p>This post has sparked some discussion on LinkedIn, with two main points being raised:</p>

<ol>
  <li>HNSW is approximate search by nature, so strict Repeatable Read isn’t required.</li>
  <li>PostgreSQL doesn’t currently guarantee identical results in all cases anyway (e.g., non-unique indexes with <code class="language-plaintext highlighter-rouge">SELECT ... ORDER BY ... LIMIT ...</code>), because different execution plans can produce different result orders.</li>
</ol>

<p>I’m not convinced by either of these arguments:</p>

<ol>
  <li>Approximate search is an inherent trade-off in the vector search domain. It’s unrelated to PostgreSQL’s ACID guarantees, and using vector search shouldn’t be a reason to compromise on them.</li>
  <li>The core issue here isn’t about result <strong>order</strong> — it’s about the result <strong>set</strong> itself. Query plan variability doesn’t explain this away, because even if we strictly control every runtime condition to ensure identical execution plans, HNSW can still produce different result sets (not just differently ordered sets) due to the root cause I described above.</li>
</ol>]]></content>
    <author>
      <name>Zhao Song</name>
    </author>
    <summary type="html"><![CDATA[This thought hit me on the way to work today: (The table ‘items’ has an HNSW index on the vector column ‘embedding’)]]></summary>
  </entry>
</feed>
