Zhao Song’s Blog

InnoDB B+Tree Performance Optimization Proposal - Insert Path Improvements and Concurrent Split Handling

2026-05-10T00:00:00+00:00

1. Bottlenecks in InnoDB B+Tree Inserts

The current InnoDB B+Tree index insert path has three major performance bottlenecks:

Repeated B+Tree descent

A pessimistic insert may go through the B+Tree descent path multiple times:
1. Optimistic insert attempt: btr_cur_search_to_nth_level(BTR_MODIFY_LEAF) -> btr_cur_optimistic_insert
2. Pessimistic insert with page split: btr_cur_search_to_nth_level(BTR_MODIFY_TREE) -> btr_cur_optimistic_insert -> btr_cur_pessimistic_insert
Each path requires btr_cur_search_to_nth_level() to descend the B+Tree again.
BTR_MODIFY_TREE serializes SMO through SX(dict_index_t::lock)

The pessimistic insert path acquires SX(dict_index_t::lock), which means all SMO operations are effectively serialized.
The pessimistic BTR_MODIFY_TREE phase holds X latches on the affected subtree

MySQL introduced SX(dict_index_t::lock) in 8.0. Although it is compatible with the S(dict_index_t::lock) held by readers and optimistic writers, and therefore improves concurrency between SMO and other operations to some extent, the subtree involved in the pessimistic BTR_MODIFY_TREE phase is still held with X latches until the SMO completes.

During this period, no other read or write operation can enter that subtree. This becomes even worse when cascading SMO reaches higher levels of the tree.

To optimize these issues, in principle we need:

A new B+Tree descent, ascent, and page-latching protocol, so that after an optimistic insert fails, we do not need to release the current leaf page latch and restart from the root. Instead, we should be able to start the pessimistic insert process in place.
Pessimistic insert should no longer hold SX(dict_index_t::lock). It should only need S(dict_index_t::lock), so that SMO operations can run in parallel.
SMO should not pre-lock the whole affected subtree. The latch granularity should be reduced. The split should proceed bottom-up, one level at a time: latch, split, and release. This allows the subtree being modified by SMO to remain readable and writable as much as possible.

Combining these three points, we can borrow ideas from B-link Trees:

https://www.csd.uoc.gr/~hy460/pdf/p650-lehman.pdf

and perform a more complete redesign.

2. High-Level Design

1. Introduce a new B-link-style descent path: `blink_search_to_nth_level`

The new search path works as follows:

Acquire S(dict_index_t::lock).
Starting from the root page, acquire an S latch on the current page, get the child page number, release the current page latch, then descend and acquire an S latch on the child page. Repeat this until reaching the level immediately above the target level. During descent, record the page numbers of the internal pages in path[].
After obtaining the target page number at the target level, acquire either an S latch or an X latch on the target page depending on whether the operation is a read or write.

There is no longer a fundamental distinction between BTR_MODIFY_LEAF and BTR_MODIFY_TREE in the descent protocol. The main difference is only the target level.

After descent completes, the operation holds the latch on the target page, and also has the page numbers of all internal pages on the path. These can later be used to optimistically locate the parent page if needed.

2. Latching-order protocol

During descent, latch coupling is not required. We latch one level at a time: parent -> child page number -> release parent -> latch child.
During ascent, latch coupling is also not required.
When moving right on the same level through the B-link high key, latch coupling is required. The protocol is current -> right: acquire the right page latch while still holding the current page latch, then release the current page latch.
Pages are split only to the right, which fits the B-link Tree right-move logic.

3. Full insert flow

Use blink_search_to_nth_level() to descend to the target leaf page L, acquire an X latch on L, and return the internal-page path path[].
Try an optimistic insert. If it succeeds, the insert is complete and returns.
If the optimistic insert fails, continue holding the X latch on L, acquire an X latch on L’s right sibling R, allocate a new page N, move part of L’s records to N, and insert N between L and R.

At the same time, set the INCOMPLETE_SPLIT flag on L, and set L’s high key.

The semantics of INCOMPLETE_SPLIT are:
- concurrent readers and optimistic writers can still enter L;
- they can use the high key to decide whether they need to move right to continue the operation;
- concurrent pessimistic writers are blocked by this flag and return retry, waiting for the previous unfinished cascade to make progress.
After the split is complete, obtain the node_ptr for N, then release all latches on L, N, and R.
Use path[] to optimistically acquire an X latch on the parent page P. If P has changed, descend again to locate the correct P and acquire its X latch.

If P has enough space for an optimistic insert, insert the node_ptr, then reacquire the latch on L, clear L’s INCOMPLETE_SPLIT flag, and return success.
If P cannot accept the node_ptr, perform step 3 at P’s level and split P. Then reacquire the X latch on L and clear L’s INCOMPLETE_SPLIT flag.
Perform step 4 at P’s level and insert the node_ptr of P’s new right page into P’s parent.
Continue this process upward until the cascading split completes.

With this design, we can effectively address the three bottlenecks described above.

I implemented a PoC on MySQL 9.7.0. In a carefully constructed workload with 32 concurrent insert threads that triggers a large number of concurrent page splits — almost every insert event triggers a leaf split — the comparison is:

Version	TPS	Avg latency	P95 latency
MySQL 9.7.0	5,666.22	5.65 ms	17.01 ms
Optimized version	91,523.92	0.35 ms	0.40 ms

This is a 16.2x improvement in TPS, a 16.1x reduction in average latency, and a 42.5x reduction in P95 latency.

I also collected perf samples on the optimized version. The hotspot has shifted from index-lock contention to redo log commit wait. In particular, log_wait_for_write accounts for about 40% of the samples.

This indicates that SMO concurrency inside the B-link-style protocol is no longer the bottleneck.

The improvement is significant, so I believe this proposal is worth further investigation.

3. InnoDB Low-Level Design

The main difficulty of this proposal is that it requires a major redesign of InnoDB core modules. There are also historical design constraints in InnoDB that conflict with this new design.

1. Add high keys to pages

The high key is an important part of B-link Trees. It allows concurrent threads to repair their access path when they encounter a page being split. Therefore, it must be introduced.

In InnoDB, for compatibility reasons, I found that the high byte of the PAGE_DIRECTION field in PAGE_HEADER is effectively unused. PAGE_DIRECTION is a 2-byte field, but InnoDB only uses the low byte to store the last insert direction. The high byte remains 0 before and after upgrade.

Therefore, in my PoC, I use it as a B-link Tree flag byte, reusing the existing page-header space without changing the file format.

The flags are:

bit 0: PAGE_IS_BLINK Indicates whether the page is a B-link page or a normal B+Tree page. It is set when the page is first touched by SMO and is never cleared.
bit 1: PAGE_INCOMPLETE_SPLIT Indicates that the page’s cascade has not yet been propagated to the parent. This is an intermediate state during the B-link split process.

If a page is a B-link page, the high key always exists as the record immediately before supremum.

On internal pages, the high key has the same shape as a normal node_ptr. The difference is that the high key’s child page number is FIL_NULL, meaning it points to +infinity.

This requires adding high-key checks and skip logic in code paths that fetch the next record, so that the high key is treated as an internal reserved record.

2. Split SMO into multiple mtrs

In the current InnoDB implementation, when BTR_MODIFY_TREE descends the B+Tree, the mtr already latches all pages that may need to split along the path. These latches are held until the whole pessimistic insert finishes and the mtr commits.

During this time, the whole affected area cannot be accessed concurrently. Also, dirty-page unlatching in the current mtr model happens together with mtr commit.

For the B-link-style design, we need to split this large mtr into multiple smaller mtrs.

mtr 1

Acquire the index S lock, descend the B+Tree to the target leaf page L, and hold L’s X latch.

If an optimistic insert succeeds, commit the mtr and the insert is complete.

Otherwise, acquire an X latch on L’s right sibling R, allocate a new page, split L, set L’s INCOMPLETE_SPLIT flag and high key, and then start mtr 2.

mtr 2

Transfer the index S lock held by mtr 1 to mtr 2 through a new mtr interface transfer_to, ensuring that the index S lock is held during the entire insert operation.

Then mtr 1 can commit, and the split becomes visible.

mtr 2 continues to hold the index S lock and tries to insert the new page’s node_ptr into the corresponding parent page P.

The design is to locate the parent using the path[] collected during the mtr 1 descent. The cached information includes the parent page number and LSN. After acquiring an X latch on the parent page, we compare the LSN to determine whether the cached parent is still valid. This can avoid descending the tree again. Even when the cache is invalid, the same-level right-move logic can be used to repair the search path.

However, in my current PoC, each cascade step still uses btr_cur_search_to_nth_level() to descend again and locate P.

Then we check whether P can optimistically accept the node_ptr:

If it can, insert the node_ptr directly. Then use the cached page number of L to acquire an X latch on L.

According to the proposed protocol, L has INCOMPLETE_SPLIT set. It can accept reads and optimistic writes, but blocks pessimistic writes. Therefore, L can be safely located by page number.

Then clear L’s INCOMPLETE_SPLIT flag, commit mtr 2, and the insert succeeds.
If P cannot accept the node_ptr, acquire an X latch on P’s right sibling, split P, and similarly set P’s INCOMPLETE_SPLIT flag and high key.

Then obtain the node_ptr of the new page created by splitting P.

After that, acquire the X latch on L using L’s page number and clear L’s INCOMPLETE_SPLIT flag.

Start mtr 3, transfer the index S lock to mtr 3, and then commit mtr 2. At this point, P’s split also becomes visible.

mtr 3 and above

mtr 3 now holds the index S lock and is ready to continue inserting the node_ptr into the upper level. This is the same state as mtr 2. The process repeats until the cascading split completes.

This is the mtr-splitting model.

In principle, each level modification corresponds to one mtr. The index S lock is transferred between mtrs to ensure it is held throughout the whole operation.

For cascading splits, the mtr that splits the parent and the mtr that clears the child’s INCOMPLETE_SPLIT flag must be atomic with respect to each other. Therefore, they need to be in the same mtr.

However, before clearing INCOMPLETE_SPLIT, we do not need to keep holding the child page latch. We only need to reacquire the child page by page number at the end and then clear the flag.

3. Page allocation and the FSP locking problem

After changing SMO from holding index SX to holding index S, the bottleneck moves from the index SX lock to the locks inside FSP page allocation.

Specifically, when allocating a new page from FSP, InnoDB needs to hold SX on the root page and then hold SX on space->lock.

This is the biggest historical constraint I encountered in this optimization.

The current InnoDB BTR_MODIFY_TREE path holds the root page SX or X latch while descending the tree, which makes this latch order valid in the existing design.

However, in the new protocol, allocating a page requires holding the root page latch, which breaks the B+Tree latch-order protocol and may introduce deadlocks.

More seriously, even if we adjust the order to avoid deadlocks, the root SX latch would still block concurrent SMO. Also, the latch is released only when the mtr holding it commits. Even after splitting the mtr, the resulting mtr is still too large for such a global latch and becomes a performance bottleneck.

This is a difficult issue, and I do not currently have a perfect solution. The main problems are:

If we remove the root page dependency from the FSP new-page allocation logic, this may involve data-file format changes and introduce compatibility issues.
If we use a separate alloc_mtr to minimize the new-page allocation phase, there are two problems.

First, in the new insert flow, we allocate a new page while holding a data page latch. This still creates a data-page X -> root-page X order, which may cause deadlocks.

To avoid that, we would need to release the held data page latch before allocation, and then reacquire it after alloc_mtr commits. But this workaround exists only to accommodate historical constraints, and it is not very clean from an engineering perspective. It may also hurt performance.

Second, if alloc_mtr commits early and the later operation fails, the newly allocated page cannot be returned to FSP easily.

Since this is not the main topic of the proposal, my current PoC uses a temporary solution: a B-link allocator thread.

The allocator thread pre-allocates pages from FSP into a pool. The pool stores page numbers only.

The allocator maintains separate watermarks for FSP_LEAF and FSP_NON_LEAF, so that the high consumption rate of leaf pages does not starve the internal-page segment.

During B-link split, all new pages are taken directly from this pool. The allocator monitors the pool watermarks and keeps them at a reasonable level, so that there are normally enough pre-allocated page numbers available.

When a split needs a new page, it simply takes one from the pool. If the pool is empty, it signals the allocator thread and returns DB_BLINK_RETRY. Then the row layer retries the whole insert path. This avoids blocking the latch path on an empty allocation pool.

This decouples the root SX latch required by page allocation from the B+Tree insert path.

Of course, this is only the PoC solution and still needs further refinement.

For example, we need to consider how to reclaim pre-allocated pages after a server crash. One possible solution is to introduce a new redo record to record the pre-allocation state, and then reclaim unconsumed pre-allocated pages during server restart. This should be solvable from an engineering perspective.

Another reason why I think an allocator thread is useful is that, for B-link Trees, if merge is supported in the future, pages created or removed by merge cannot be immediately freed back to FSP.

B-link Tree correctness requires that concurrent operations may still see intermediate SMO states. If a page is freed too early, another operation may follow a stale right-link and reach a freed page, causing correctness problems.

A page can be safely freed only after all mtrs that were active at the moment when the page was removed from the tree have completed. This is also implementable in engineering terms, similar to an MVCC visibility check, but much lighter than full MVCC.

This asynchronous safe-free work can also be handled by the allocator thread after merge support is added.

Merge is also an important part of the design, but I will not expand on it in this proposal.

4. Crash Recovery

When a writer crashes in the middle of a cascading split, a lower-level page may have its INCOMPLETE_SPLIT flag durably written to disk, while the corresponding parent page may not yet contain the node_ptr pointing to the new page.

In my current PoC, during restart, B-link recovery runs as the first step of srv_start_threads_after_ddl_recovery.

It scans the buffer pool in a single thread. For each page with both PAGE_IS_BLINK and PAGE_INCOMPLETE_SPLIT set, it separately acquires X(index). Since this is during recovery and runs single-threaded, there is no concurrent contention.

Then it reuses the runtime split/root-raise primitives to complete the interrupted cascade.

There is an important invariant that simplifies recovery:

The parent node_ptr installation and the clearing of the child page’s INCOMPLETE_SPLIT flag commit in the same mtr.

Therefore, if a flagged page survives crash recovery, it means the corresponding parent must be missing the node_ptr. We do not need an idempotency check for whether the parent already contains the node_ptr.

The current implementation only scans the buffer pool.

There is an edge case where a flagged page has been flushed to disk, the checkpoint has advanced, and recovery misses it. In that case, the impact is mainly performance-related rather than correctness-related, because descent itself does not consume INCOMPLETE_SPLIT.

The high key and right-link can still route the operation to the correct target page.

A complete solution can be implemented in multiple ways, such as runtime repair, or maintaining page numbers of dirty pages that still have PAGE_INCOMPLETE_SPLIT set at flush time.

I implemented a PoC on MySQL 9.7.0 and constructed a workload with severe concurrent splits caused by concurrent inserts.

The workload first loads 2.5 million initial rows, trying to make most leaf pages close to the split boundary. Then it runs 32 concurrent threads to insert 400,000 randomized new rows, triggering a large number of leaf splits.

The final comparison is very good:

The 400,000 inserts take 70 seconds on MySQL 9.7.0.
The optimized version finishes in only 4 seconds.
TPS improves by 16x.
P95 latency decreases from 17.01 ms in MySQL 9.7.0 to 0.40 ms in the optimized version, a 42x improvement.

This proposal is large. It involves major changes to core InnoDB modules and is challenging. Also, this is only the split part. Merge support will definitely be needed later. I plan to put the merge design into a separate follow-up proposal, or as an extension of this proposal.

However, the PoC at least proves the feasibility of this direction and shows significant potential benefits.

Dissecting the MySQL 8.0 Performance Regression on oltp_update_non_index

2026-04-16T00:00:00+00:00

The performance regression in MySQL 8.0 is well known, but it is still not fully understood. That is because it is not a regression caused by one obvious bottleneck. MySQL 8.0 introduced many new designs and refactored major subsystems, so the gap comes from a combination of configuration defaults, architectural trade-offs, and many small overheads spread across different layers.

So I picked one workload, profiled it carefully, and tried to answer a more practical question: where exactly does the regression come from, and how much does each part contribute?

In this post, I use oltp_update_non_index , one of the worst regression cases between MySQL 5.7.44 and 8.0.45 as a starting point. Beginning from a -32.0% throughput gap, I narrow it to -0.2% by systematically isolating one factor at a time. As expected, the regression is not dominated by a single bottleneck. It is the combined effect of default settings, architectural changes, and dozens of small code-level costs.

1. Setup

Both MySQL versions are compiled with GCC 8.5.0 at -O3. mysqld and sysbench are each pinned to 8 dedicated physical cores with no hyperthreading overlap. The buffer pool is 23 GB. The dataset is a single table with 50 million rows. innodb_flush_log_at_trx_commit=2, sync_binlog=0, and the adaptive hash index is OFF. Both versions use identical InnoDB settings wherever the options are comparable. Full configuration details are listed in the appendix.

The benchmark is sysbench oltp_update_non_index, with 8 threads and 90 seconds per run. This is one of the worst regression cases..

2. Narrowing the gap step by step

The method is straightforward. I start with MySQL 8.0.45 and 5.7.44 under the same default-like configuration, then use perf record and perf report to see where CPU time goes. Once a subsystem becomes the dominant bottleneck, I either adjust the configuration to remove that cost or write a small targeted patch. Then I profile again and repeat.

Each round removes one visible layer of overhead and exposes the next one underneath.

Starting from the baseline and applying changes one by one, I was able to close almost the entire gap:

Step	Description	5.7 TPS	8.0 TPS	Gap
	Baseline (PFS=1, bin=ON, writer=ON)	69,818	47,509	-32.0%
1	innodb_log_writer_threads=OFF	(69,818)	52,539	-24.7%
2	performance_schema=0	72,316	54,919	-24.1%
3	skip-log-bin	128,140	109,044	-14.9%
4	–db-ps-mode=auto	135,664	125,119	-7.8%
5	innodb_flush_log_at_trx_commit=0	149,577	144,941	-3.1%
6	5 code patches	(149,577)	149,273	-0.2%

From -32.0% to -0.2%. Below is the breakdown.

Step 1: `innodb_log_writer_threads=OFF`

This is an 8.0-only setting; it does not exist in 5.7.

With innodb_log_writer_threads=ON, a dedicated log writer thread is responsible for writing the log buffer to disk. The problem appears when mysqld is pinned to 8 cores and all 8 are already saturated by client threads. In that case, the log writer thread cannot get scheduled quickly enough. Client threads call log_write_up_to() and spin in ut_delay() while waiting for the writer to advance the write position, but the writer itself is CPU-starved. That creates a feedback loop: client threads spin longer, occupy more CPU, and make it even harder for the writer to run.

With innodb_log_writer_threads=OFF, no separate writer thread is needed, the calling thread takes over the writer role inside log_write_up_to, eliminating the scheduling dependency.

Step 2: `performance_schema=0`

Disabling Performance Schema reduces the gap further. MySQL 8.0 changed PFS significantly compared with 5.7, including v2 metadata lock instrumentation, new memory statistics layers, and allocator hook changes. However, I did not isolate the PFS internals deeply enough in this workload to say which part is the main contributor.

So for this step, I can say that PFS matters, but I cannot yet attribute the overhead to a specific internal component.

Step 3: `skip-log-bin`

This is the single largest step in absolute TPS gain. Both versions improve dramatically when binary logging is disabled (5.7: +67%, 8.0: +86%). The workload is commit-bound, and every transaction generates one binlog event. Even with sync_binlog=0, the binlog still adds per-commit CPU cost for event formatting, Table_map construction, and memory allocation.

MySQL 8.0 benefits more from disabling binlog than 5.7 does (+86% vs +67%), which suggests that the 8.0 binlog path adds extra per-event overhead. Looking at the code, I found several 8.0-only additions that are plausible contributors:

ColumnFilterOutboundFunctionalIndexes: is_filter_needed() returns true unconditionally, so the column filter is installed even for tables without functional indexes.
ReplicatedColumnsView: allocates a std::vector> for each Table_map event.
init_metadata_fields(): new in 8.0, adding metadata serialization work to every Table_map event.

These are reasonable suspects, but I have not profiled the binlog path in isolation, so I am not claiming that they are the confirmed dominant causes.

Step 4: `--db-ps-mode=auto` (prepared statements)

Switching from --db-ps-mode=disable to --db-ps-mode=auto helps by avoiding full parsing on every execution. In this mode, the statement is parsed once, then re-optimized on each execution. The net effect is that prepared statements help 8.0 more than 5.7.

I have not fully decomposed why the text-protocol path in 8.0 is heavier for a simple UPDATE. It is unlikely to be the grammar itself, because a simple UPDATE does not exercise features such as CTEs or window functions. More likely, the extra cost comes from surrounding setup work in the lexer, resolver, or optimizer path. That still needs confirmation.

Step 5: `innodb_flush_log_at_trx_commit=0`

Setting innodb_flush_log_at_trx_commit=0 decouples transaction commit from the redo log write. This makes it easier to isolate the cost of 8.0’s lock-free redo log design.

Profiling shows that with flush=2, the function ut_delay() , the busy-wait loop inside log_write_up_to() , consumes 9.59% of total CPU in 8.0. With writer_threads=OFF, each committing thread writes to the log buffer through log_buffer_reserve(), log_buffer_write(), and log_buffer_write_completed() and then spin-waits in log_write_up_to() for the write to reach disk. This lock-free coordination machinery (log_buffer_reserve 0.57% + log_buffer_write_completed 0.33% + log_wait_for_space_in_log_recent_closed 0.31%) has no equivalent in 5.7, which uses a simpler log_sys mutex-based design.

One interesting result is that 5.7 also spends a lot of CPU in ut_delay(), in fact even more than 8.0 (13.64% vs 9.59%). But its redo path per transaction is shorter: take log_sys, write, release. So even though it spins more, it still completes more useful work per transaction.

This is a real architectural trade-off. The 8.0 redo redesign favors scalability at higher concurrency and under stricter durability requirements. But for this workload, at 8 threads and flush=2, that trade-off is unfavorable.

Step 6: Five code patches

After aligning the configuration and benchmark settings and decoupling redo with flush=0, the remaining gap is -3.1%. At this point, profiling (924K perf samples) shows a very flat CPU profile: no single function is above 1.8%. The remaining gap is spread across many small 8.0-specific overheads.

Function	8.0 CPU%	5.7 CPU%	Category
`cmp_dtuple_rec_with_match_low`	1.18%	0.44%	Inline Regression
`buf_flush_note_modification`	0.67%	0% (inlined)	Inline Regression
`THD::store_cached_properties`	0.53%	0% (not present)	New Overhead
`fold_condition`	0.43%	0% (not present)	New Overhead
`ha_innobase::info_low`	0.36%	0% (lighter)	Missing Fast-Path

I group them into three categories:

[Inline Regression]: the 8.0 version of a function grew enough that GCC no longer auto-inlines it.
[New Overhead]: entirely new code in 8.0 that runs unconditionally, even when the feature behind it is not needed.
[Missing Fast-Path]: 8.0 added support for more cases but did not keep a cheap fast path for the common case.

Five small targeted patches reduce the remaining gap from -3.1% to -0.2% at flush=0. The patch details are listed in the appendix.

One important note: at flush=2, these 5 patches show no measurable TPS improvement (124,572 vs 125,119 TPS). The reason is that ut_delay() in the redo commit path already consumes 9.59% of CPU and acts as a throughput ceiling. The cycles freed by the patches are mostly absorbed by extra spin iterations instead of being converted into more completed transactions.

The patches only become visible at flush=0, which confirms that they are fixing real overhead that is otherwise masked by the redo bottleneck.

4. Going deeper: how much does each factor really contribute?

After completing the step-by-step analysis, I wanted to understand the true contribution of each factor. The cumulative table above is intuitive, but it has a methodological limitation: the attribution depends on the order in which changes are applied.

For example, disabling PFS in step 2, after writer_threads=OFF but before binlog=OFF, appears to recover only 0.7 percentage points. But when I measure PFS independently, by changing only PFS from the original common baseline, the contribution is actually 2.3 percentage points. In the cumulative sequence, other costs were masking it.

So I repeated the measurements using independent ablation: each factor is changed individually from the same common baseline. That prevents one factor from hiding or amplifying another.

Common baseline: performance_schema=1, log-bin=mysql-bin, innodb_log_writer_threads=ON (8.0), --db-ps-mode=disable, innodb_flush_log_at_trx_commit=2.

Factor changed	5.7 TPS	8.0 TPS	Gap	Attribution
Baseline (nothing)	69,818	47,509	-32.0%	—
A. `writer_threads=OFF`	69,818	52,539	-24.7%	7.2 pp
B. `performance_schema=0`	72,316	50,798	-29.7%	2.3 pp
C. `skip-log-bin`	116,588	88,379	-24.2%	7.8 pp
D. `db-ps-mode=auto`	74,406	53,838	-27.7%	4.3 pp
E. `flush_log_at_trx_commit=0`	81,952	60,906	-25.7%	6.3 pp

The independent factors sum to 27.9 percentage points. Adding the code-level overhead (2.9 pp, measured at flush=0) gives 30.8 pp. The remaining 1.2 pp appears to come from interaction effects between factors. For example, binlog overhead amplifies the redo commit path, so removing both together saves slightly more than the sum of removing each in isolation.

Comparing the cumulative and independent views gives a more accurate picture:

Factor	Cumulative	Independent	Observation
`writer_threads=OFF`	7.2 pp	7.2 pp	Same, measured first, so no masking
`performance_schema=0`	0.7 pp	2.3 pp	PFS is undercounted 3x in the cumulative sequence
`skip-log-bin`	9.2 pp	7.8 pp	Binlog is overcounted in the cumulative sequence
`db-ps-mode=auto`	7.1 pp	4.3 pp	SQL layer overhead is overcounted in the cumulative sequence
`flush_log_at_trx_commit=0`	4.7 pp	6.3 pp	Redo overhead is undercounted in the cumulative sequence

The main lesson is simple: if you want correct attribution, you need to control for interaction between variables. The same benchmarking principle applies to regression analysis itself.

5. Summary

The -32.0% regression on oltp_update_non_index is not caused by one dominant bottleneck. It is the result of several layers of overhead:

Factor	Independent attribution
Binary log overhead	7.8 pp
`innodb_log_writer_threads` CPU starvation	7.2 pp
Lock-free redo log architecture	6.3 pp
SQL text protocol / parser path	4.3 pp
Code-level overhead (5 patches)	2.9 pp
Performance Schema instrumentation	2.3 pp
Interaction effects	1.2 pp
Total	32.0 pp

The largest single factor is the binary log (7.8 pp), followed closely by log writer thread CPU starvation (7.2 pp) and redo log architecture (6.3 pp). Together, these three redo/binlog-related factors account for 21.3 percentage points, roughly two-thirds of the total regression.

The code-level overhead (2.9 pp) comes from many small additions that are easy to ignore in isolation, an extra function call here, an extra pass there, but they add up. Database development is always about trade-offs. MySQL 8.0 added many useful capabilities, but on a workload running at 125,000+ queries per second, even small per-query costs accumulate quickly.

Appendix A: Patch details

Patch 1: `cmp_data` `ALWAYS_INLINE` + loop-invariant hoist

File: storage/innobase/rem/rem0cmp.cc
Problem: In 8.0, cmp_data() grew past GCC’s auto-inline threshold because of added multi-value index support (is_asc, DATA_MULTI_VALUE assertions, dfield_is_multi_value() checks). The per-field comparison loop in cmp_dtuple_rec_with_match_low() also redundantly calls dict_index_is_ibuf() and checks dfield_is_multi_value() on every iteration.
Fix: Mark cmp_data() as ALWAYS_INLINE. Hoist is_ibuf and is_mv_index out of the loop.

Patch 2: `buf_flush_note_modification` `ALWAYS_INLINE`

Files: storage/innobase/include/buf0flu.h, buf0flu.ic
Problem: Both 5.7 and 8.0 support flush observers in buf_flush_note_modification(), but the 8.0 version grew enough that GCC no longer auto-inlines it. As a result, 8.0 emits a standalone function call on every dirty-page modification where 5.7 keeps it inlined.
Fix: Change static inline to static ALWAYS_INLINE in both the declaration and definition.

Patch 3: `server_store_cached_values` no-op

File: sql-common/net_serv.cc
Problem: 8.0 added server_store_cached_values(), which calls THD::store_cached_properties(RW_STATUS) on every network I/O path (packet reads, writes, async operations; 10 call sites in net_serv.cc). This refreshes cached THD properties that are rarely consumed. The mechanism does not exist in 5.7.
Fix: Replace the function body with an empty no-op.

Patch 4: `fold_condition` fast path

File: sql/sql_const_folding.cc
Problem: 8.0 introduced constant folding (fold_condition) during JOIN::optimize() → optimize_cond() → remove_eq_conds() on every execution, including every prepared statement re-execution. For the common shape field OP literal_constant , which matches every sysbench query here, the function does no useful work, but still walks the full folding logic. 5.7 has no such pass.
Fix: Add an early fast path in fold_condition() that detects field OP basic_const or field OP param and returns false immediately.

Patch 5: `info_low` fast path

File: storage/innobase/handler/ha_innodb.cc
Problem: ha_innobase::info_low() is called by the optimizer for cost estimation on every UPDATE and DELETE. In 8.0, the function became heavier because of additional statistics-related logic. The common case (HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK) only needs a small amount of information, but still goes through update_thd(), op_info writes, and extra helper calls.
Fix: Add a fast path at the top that handles HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK directly and returns immediately.

Appendix B: Environment details

CPU: AMD Ryzen Threadripper PRO 3975WX, 32 cores / 64 threads, single socket
OS: RHEL 8.10, kernel 4.18.0-553
Compiler: GCC 8.5.0, -O3 -g -DNDEBUG (RelWithDebInfo)
MySQL: 5.7.44 vs 8.0.45 (KernelMaker fork)
CPU pinning: mysqld on cores 16–23, sysbench on cores 24–31 (physical cores, no hyperthreading overlap)
Data: 1 table, 50M rows (sbtest1), about 12 GB InnoDB tablespace per version
Benchmark: sysbench oltp_update_non_index, 8 threads, 90 seconds, --report-interval=5

Shared InnoDB configuration

innodb_buffer_pool_size      = 23G
innodb_buffer_pool_instances = 4
innodb_flush_log_at_trx_commit = 2
innodb_flush_method          = O_DIRECT_NO_FSYNC
innodb_adaptive_hash_index   = OFF
innodb_io_capacity           = 10000
innodb_io_capacity_max       = 20000
innodb_page_cleaners         = 4
innodb_purge_threads         = 4
innodb_log_file_size         = 2G
innodb_log_files_in_group    = 15
innodb_log_buffer_size       = 64M
innodb_max_dirty_pages_pct   = 90
innodb_max_dirty_pages_pct_lwm = 80
sync_binlog                  = 0

8.0-only settings: innodb_dedicated_server=OFF, innodb_idle_flush_pct=1, innodb_doublewrite_pages=128, innodb_use_fdatasync=ON, default_authentication_plugin=mysql_native_password.

Non-patchable 8.0 overhead (architectural)

The following 8.0-specific CPU costs were visible in profiling, but they are not realistically patchable in the same way because they are either correctness-critical or fundamental to the current 8.0 architecture:

Function	CPU%	Reason
`locksys::Global_shared_latch_guard`	0.84%	Lock sharding; correctness-critical for concurrent lock operations
`log_buffer_reserve`	0.57%	Lock-free redo design; replaces 5.7’s `log_sys` mutex
`log_buffer_write` + `log_buffer_write_completed`	0.67%	Lock-free redo design
`log_wait_for_space_in_log_recent_closed`	0.31%	Lock-free redo design
`PolicyMutex::enter` (InnoDB trx fabric)	1.79%	Transaction infrastructure mutexes spread across multiple internal subsystems

MySQL vs PostgreSQL Internals (Part 2) — MVCC (Multi-version Concurrency Control)

2026-03-10T00:00:00+00:00

In the previous post, I took a detailed look at how MySQL and PostgreSQL differ in their buffer pool design and implementation. In this post, I will continue with a detailed comparison of their MVCC implementations.

The Role of MVCC

MVCC (Multi-version Concurrency Control) is a common mechanism used in transactional databases to resolve read–write conflicts. The core idea is that when a transaction modifies data, it does not overwrite the original data directly. Instead, it preserves the previous version while creating a new version of the record. As a result, all historical versions of a record are retained in the database.

The key benefit is that read and write transactions on the same record no longer need to block each other. Even if a write transaction has modified a record but not yet committed, a read transaction can still directly read the version that is visible to it from the historical versions.

The simplified principle is illustrated below:

For three modifications to the record with PK, each modification creates a new version, so all historical versions of the record are retained in the database.

So what is the benefit of retaining all historical versions? Consider the following example:

Three write transactions A, B, and C modify the same PK sequentially on the timeline, while three read transactions X, Y, and Z interleave with them and read the PK. Without MVCC, read transactions must block until write transactions commit and release locks. With MVCC, when read transaction X attempts to read the PK, the state of the PK is:

Write transaction A inserted PK: ‘aaa’ and has committed
Write transaction B modified it to PK: ‘bbb’ but has not yet committed

Under RC (Read Committed) and RR (Repeatable Read) isolation levels, PK: ‘bbb’ is not visible to transaction X. Because MVCC preserves the old version, transaction X can directly read the visible version PK: ‘aaa’ and ignore the currently running write transaction B. This greatly improves read–write concurrency.

As an essential capability of databases, MVCC is supported by both MySQL and PostgreSQL. Fundamentally, they both achieve the behavior described above, but their designs and implementations make different trade-offs.

In the following sections, I will compare their implementations in detail across three aspects:

Organization of multiple versions
Visibility checks for multiple versions
Garbage collection of old versions

1. Organization of Multiple Versions

PostgreSQL

In PostgreSQL, a tuple and all its historical versions reside in the heap, as shown below:

The leaf nodes of the nbtree index store the PK fields of the tuple and point to the actual location of the tuple in the heap (TID). Following the Heap TID leads to the tuple, which contains the full data including PK fields and value fields.

Each index tuple points to the oldest version in its corresponding HOT chain. When the tuple is modified, the old version is not changed. Instead, a new version is created with the same PK fields but different value fields. The ctid field of the old tuple points to the location of the next version. The latest version’s ctid points to itself.

In short, every version of a PostgreSQL tuple is a complete tuple containing all fields. Starting from the index tuple, the version chain is linked from old to new through the ctid field.

It is important to note that the version chain may become ‘broken’, as shown below:

The PK fields remain unchanged, but the value fields are modified multiple times. Each modification generates a new full version stored in the same heap page as the old version, as shown for versions 1, 2, and 3. Because they reside in the same heap page, advancing through the chain via ctid is very cheap (no additional heap page lock is required). This is the type of chain that can be efficiently traversed for version lookups, known as a HOT chain in PostgreSQL.

A HOT chain requires two conditions:

The new version can fit into the same heap page
The modified columns do not include any indexed columns

If either condition is not met, the HOT chain breaks.

As modifications continue, starting from version 4, the original heap page (heap page 1) can no longer accommodate the new tuple. The new tuple is therefore stored in another page (heap page 2). Although version 3’s ctid still points to version 4, the HOT chain effectively ends at version 3.

When traversing a HOT chain, the reader will not follow ctid across heap pages. Instead, it stops at the end of the chain. The reason is that both the latest and historical versions of tuples reside in heap pages. If the reader followed ctid from page 1 to page 2, it would hold a read lock on heap page 1 and attempt to acquire a read lock on heap page 2. Because both pages are heap pages and there is no defined lock ordering between them, another backend might hold the lock on page 2 and attempt to acquire the lock on page 1, leading to a deadlock.

At this point, the HOT chain is considered broken.

How does PostgreSQL transition from version 3 to version 4 then?

The implementation is to insert a new index tuple into the nbtree index with the same PK fields pointing to version 4, starting a new HOT chain. If a read operation traverses the version chain and finds versions 1, 2, and 3 all invisible, it stops the current HOT chain traversal and returns to the index layer. It then proceeds to the next index tuple (the one pointing to version 4).

This design leads to an interesting and somewhat counterintuitive behavior: multiple index tuples with identical PK fields may coexist in the nbtree index.

MySQL

In MySQL, data is stored directly in the clustered index (B+Tree) leaf nodes. This is an important difference from PostgreSQL: MySQL does not have a heap.

The second difference is that the record stored in the clustered index and its historical versions reside in different places. Historical versions are not stored in the clustered index. Instead, old values are stored in undo records in the undo space. When needed, historical versions are reconstructed by applying the undo records to the current record.

The third difference is that an undo record does not store a full copy of the record. It only stores the old values of the columns modified in the operation.

The fourth difference is the direction of the version chain. In MySQL, the clustered index always stores the latest version. Before a record is modified, the old values of the columns being changed (together with the PK fields) are copied to the undo space, and the record is updated in place.

As shown below:

The clustered index record contains two system fields: TRX_ID and ROLL_PTR.

TRX_ID records the transaction ID that last modified the record and is used for MVCC visibility checks.
ROLL_PTR links the version chain.

Similar to PostgreSQL’s ctid, ROLL_PTR links versions together, but the direction is opposite: ctid points from old to new, while ROLL_PTR points from new to old.

In the figure, the record was modified three times:

Field 2 was modified
Field 2 was modified again
Field 3 was modified

Therefore, the clustered index record stores the latest version after the three modifications. Through ROLL_PTR, it points to the previous version stored in the undo space (the version before Field 3 was modified), and so on.

Summary

The differences in version organization between PostgreSQL and MySQL can be summarized in three contrasts:

Versions mixed together vs latest version and historical versions stored in different spaces
Old versions contain full tuples vs old versions mainly store the primary key and old values of modified columns
Version chain ordered from old to new vs from new to old

2. Visibility Checks for Multiple Versions

Once multiple versions exist, the next question is: how does a read transaction determine which version it should see?

This is the core of MVCC: visibility checks.

To determine visibility, the database must establish an order among transactions. Taking the RR isolation level as an example, when a transaction begins, it must know which write transactions are currently active in the system. All modifications produced by those active transactions are invisible to the read transaction. Only modifications from transactions that were already committed at that moment are visible.

Therefore, if the database can define an order among write transactions, it becomes straightforward to perform visibility checks.

Most databases achieve this by using a globally increasing transaction ID. When a write transaction is created, it obtains the current maximum transaction ID plus one. This naturally orders write transactions.

Once transaction IDs exist, each data modification can be tagged with the transaction ID that produced it. A read transaction, when created, obtains the list of currently active write transactions. Later, when reading data, it simply compares the transaction ID recorded on the data with this list and applies the visibility rules to determine whether the data is visible.

Both PostgreSQL and MySQL follow this approach.

PostgreSQL

As mentioned earlier, transaction IDs are critical. In PostgreSQL, the globally increasing transaction ID is called nextXid.

Each write transaction obtains the latest value when it starts.

Transaction A is created first and obtains xid 7. It inserts PK: ‘aaa’. The tuple records this through the xmin field, which stores the inserter’s transaction ID (7).

After transaction A commits, transaction B is created and obtains xid 11. It updates the record to ‘bbb’. Following the multi-version rule, transaction B does not overwrite the tuple inserted by A. Instead, it creates a new version. The old tuple’s xmax is set to 11, indicating that transaction 11 has “deleted” this tuple version. The new tuple records xmin = 11. The old tuple’s ctid points to the new tuple.

Transaction C proceeds similarly.

Thus, each PostgreSQL tuple contains two fields recording the related transactions:

xmin : the inserter
xmax : the deleter

With the global transaction ID (nextXid) and the transaction tags (xmin, xmax) on each tuple, the next requirement is the snapshot used by read transactions for visibility checks.

At the top of the figure are the globally increasing transaction IDs and the currently active write transactions. The next ID to allocate is 16. Among all assigned IDs, transactions ≤7 have already committed. Between 8 and 15, some have committed, and the currently active write transactions are 8, 11, 12, and 14.

If a read transaction starts now, it obtains a snapshot:

xmin : the smallest active transaction ID (8)
xmax : the next transaction ID to allocate (16)
xids[] : the list of active transactions

With this snapshot, it can determine whether a tuple’s transaction tag is visible. For example:

If a tuple has xmin = 7, it is visible to the snapshot.
If a tuple has xmin = 14, it is not visible.

Now that we know how to determine the visibility of transaction tags, the final question is how to determine whether a tuple itself is visible, given that it has both xmin and xmax.

The core principle is:

A tuple is visible to a snapshot if its inserter (xmin) is visible and its deleter (xmax) is not visible.

The process is:

Check xmin. If xmin is not visible, the tuple is invisible.
If xmin is visible, check xmax.
If xmax is also visible, the tuple has been deleted in the snapshot and is therefore invisible.
If xmax is not visible, the tuple is visible.

Finally, the following figure shows the process of locating a tuple visible to a snapshot starting from the nbtree index:

MySQL

MySQL also has a globally increasing transaction ID called next_trx_id_or_no.

In the example, three transactions modify the same record three times. Both the clustered index record and the undo records contain a TRX_ID field. This field is the transaction tag used by MySQL. The TRX_ID records which transaction created that version of the record.

Unlike PostgreSQL, a record in MySQL has only one transaction tag, TRX_ID, rather than two. The reason will be explained later.

Next, consider the visibility check.

MySQL’s ReadView is extremely similar to PostgreSQL’s snapshot and serves the same purpose. The only difference is that MySQL’s ReadView contains an additional field: m_creator_trx_id.

This field is necessary because the transaction that creates the ReadView is itself included in m_ids[] (since it is an active transaction). Without m_creator_trx_id, the transaction would not be able to see its own modifications. It also handles cases where a read transaction is promoted to a write transaction.

Aside from this, the visibility rules are almost identical.

Given the visibility rules, determining whether a record is visible to a ReadView becomes straightforward:

The process is simple: check whether the record’s TRX_ID is visible to the ReadView. Unlike PostgreSQL, MySQL does not need two separate checks for xmin and xmax.

Finally, the process of finding a version visible to a ReadView starting from the B+Tree is shown below:

Summary

PostgreSQL and MySQL use highly similar visibility mechanisms. The primary difference is that PostgreSQL stores two transaction tags (xmin and xmax) on each tuple, requiring two checks. MySQL stores only one (TRX_ID), requiring only one check.

Why is this the case? The fundamental reason is the direction of the version chain:

PostgreSQL’s version chain goes from old to new. In theory, xmin alone would be sufficient because it records the inserter. However, when traversing the chain, the reader cannot stop immediately after finding a visible insert because the next version might also be visible. The reader must continue until it finds the first version whose insert is invisible. The previous version is then the visible version. This means at least one extra step is required. PostgreSQL therefore stores the insert transaction of the next version as the deleter (xmax) of the current version, avoiding that extra traversal. Additionally, xmax is required for DELETE operations where no next version exists.
MySQL’s version chain goes from new to old. Once the latest version is found to be invisible, the reader simply moves to the previous version until it finds the first visible one. No additional step is required.

3. Garbage Collection of Multiple Versions

The next core problem in MVCC is garbage collection of historical versions. Historical versions are not always needed.

Because global transaction IDs advance linearly, the snapshots (or ReadViews) of read transactions also move forward. A historical version can be safely purged when:

No active snapshot or ReadView in the system still needs that version (i.e., all snapshots can already see the newer version that replaced it).

This is where PostgreSQL and MySQL differ most significantly.

PostgreSQL

PostgreSQL reclaims historical versions through the Vacuum backend.

PostgreSQL uses GlobalVisState to track purge boundaries. It contains two variables:

maybe_needed

This is the minimum value among all backend transaction IDs and the xmin values of their snapshots. Backend transaction IDs must be considered because a backend may have started a write transaction and obtained an xid but not yet created a snapshot. That xid still forms a lower bound that cannot be crossed.

All tuple xmax values (deleters) are compared against maybe_needed. If xmax is smaller than maybe_needed, the deleter is visible to all backends and snapshots, meaning the tuple is globally deleted and can be safely purged.

definitely_needed

This is the xmin of the latest snapshot taken by the Vacuum backend. Any tuple whose xmax is greater than or equal to definitely_needed is invisible to the Vacuum backend and cannot be purged.

These two values define the continuous upper bound that can be purged and the lower bound that cannot. For tuples whose xmax falls between these bounds, Vacuum may need to refresh maybe_needed and re-evaluate, since the snapshot used by Vacuum might be outdated. Because refreshing is expensive, PostgreSQL optimizes this by checking whether RecentXmin has advanced. If it has not changed, refreshing is skipped.

With these rules, the workflow of the Vacuum backend is:

Scan all heap tuples and determine whether they can be purged using GlobalVisState. Collect purgable tuples into a set.
Scan all index tuples and check whether they reference heap tuples in the purge set. If so, delete those index tuples.
Scan again all pages containing dead tuples collected in step 1 and reclaim them (setting their line pointers to LP_UNUSED).

This process involves extensive scanning of both heap and index structures, which can be expensive. PostgreSQL mitigates this cost with several optimizations:

Visibility map – allows the first scan to skip pages where all tuples are visible.
HOT pruning – during normal reads of heap pages, PostgreSQL opportunistically removes dead tuples through heap_page_prune(), reducing the workload of Vacuum.
LP_REDIRECT – when intermediate versions in a HOT chain are removed, the head line pointer is redirected to the surviving tuple instead of being marked unused, so existing index tuples can still locate the correct tuple without index updates.

MySQL

MySQL takes a different approach.

All undo records (historical versions) are grouped by the transactions that produced them. These transactions are then organized according to their global commit order (forming a min-heap).

With this ordering, MySQL can quickly identify the undo records belonging to the earliest committed transaction, which are typically the closest candidates for purging.

The purge thread compares the transaction number (trx_no) of the earliest transaction in the history list with m_low_limit_no from the purge view.

If trx_no < m_low_limit_no, all active ReadViews can see this transaction’s commit, so its undo records are no longer needed and can be safely purged.
Otherwise, it cannot be purged. Since it is the earliest transaction, later ones cannot be purged either, so the purge process stops and waits.

An important optimization is that transactions are ordered by commit order rather than creation order.

Sorting by creation order would be safe because the earliest transaction must be purged first. However, it has a drawback: if the earliest transaction does not commit for a long time, later transactions that have already committed cannot be purged even if they are no longer needed.

For example:

Trx A is created and modifies record R1 from ‘111’ to ‘222’
Trx B is created and modifies record R2 from ‘aaa’ to ‘bbb’
Read-only Trx X starts. Since Trx B has not committed, X sees R2 as ‘aaa’
Trx B commits
Trx X commits

If transactions were ordered by creation time, Trx A would come before Trx B. Because Trx A has not committed, purge would be blocked and Trx B’s undo records could not be purged, even though no ReadView needs them anymore.

By ordering transactions by commit time instead, MySQL can purge Trx B’s undo records immediately after its commit.

This is an important optimization. Notably, trx_id and trx_no both come from the same global variable: next_trx_id_or_no.

The workflow is shown below:

The purge thread first clones the oldest active ReadView in the system. The m_low_limit_no in this ReadView represents the smallest trx_no that was still committing when the view was created. All transactions with smaller trx_no values have already committed.

In the undo space, committed transactions’ undo records are linked together in the history list in commit order (ascending trx_no). The purge thread simply compares m_low_limit_no with the smallest trx_no in the history list to determine whether purging is possible.

Summary

Garbage collection of historical versions is a major implementation difference between PostgreSQL and MySQL.

In fact, it reflects their different design philosophies. This difference was already visible in the previous post discussing buffer pools.

MySQL tends to favor precise control and ordered structures, such as the LRU list and flush list, which allow it to quickly identify the oldest pages that can be evicted or flushed. Similarly, undo purge maintains ordered historical versions so that the oldest purgeable undo records can be quickly located.

PostgreSQL, on the other hand, tends to rely more on global scanning mechanisms, both in shared buffers and in Vacuum. In the buffer pool case, the cost of global scanning is relatively low because it scans descriptor arrays in memory. However, Vacuum must scan heap and index disk pages (although visibility maps can skip many all-visible pages). For frequently updated tables, the amount of scanning can still be substantial.

MySQL vs PostgreSQL Internals (Part 1) – Buffer Pool

2026-02-16T00:00:00+00:00

The debate over “MySQL vs PostgreSQL, which one is better?” has been around for a long time. As two outstanding representatives of open-source OLTP databases, I personally don’t think one overwhelmingly dominates the other. Transactional database theory has been stable for decades; both systems are practical implementations built under the same theoretical framework.

The differences mainly come from different trade-offs made during engineering practice. I’ve always believed that database development is the art of trade-offs. So I’m planning a series that compares MySQL and PostgreSQL from the perspective of kernel design and implementation, focusing on the different trade-offs they make when pursuing similar goals.

As the first article in this series, I’ll start with the design and implementation differences of the Buffer Pool.

Comparison Dimensions

The Buffer Pool in MySQL and the corresponding module in PostgreSQL (commonly referred to as Shared Buffers) are critical subsystems. Their primary job is to cache on-disk data pages in memory to minimize disk I/O as much as possible, and they are therefore a major factor in relational database performance.

In essence, it is a huge hash table:

The key corresponds to a specific on-disk data page.
The value is a pointer (or index) to the in-memory representation of that page.

In the following sections, I compare MySQL and PostgreSQL buffer pool designs from these aspects:

Hash table structure and implementation
Eviction policy for old pages and its implementation
Dirty page flushing strategy and its implementation

1. Hash Table

MySQL

MySQL’s buffer pool is not backed by a single hash table, it uses multiple hash tables. As illustrated conceptually:

Multiple buf_pool_t instances shard one large buffer pool. Each buf_pool_t maintains its own hash table.
The hash key is (space_id, page_no), identifying a specific page within a data file (tablespace). During lookup:
- First, it computes a hash using (space_id, page_no >> 6) to locate the corresponding buf_pool_t instance.
- Why shift page_no >> 6? Because MySQL tries to place 64 consecutive pages under the same space_id into the same buf_pool_t. This helps in two ways:
  - During reads, it enables read-ahead (prefetching contiguous pages).
  - During flushing, it increases the chance to flush contiguous dirty pages together, improving I/O utilization.
- After locating the buf_pool_t, it computes a hash over the full key (space_id, page_no) to find the target cell in that instance’s hash table.
  - Pages with the same hash value are chained in that cell.
  - The lookup then traverses the chain and compares keys to find the target page.
The hash table stores only pointers to the corresponding page objects (buf_page_t). The actual buf_block_t objects and page frames live in a large memory region.
- MySQL splits the page memory into multiple chunks (buf_chunk_t).
- Each chunk is a contiguous block of memory.
- The first part stores per-page metadata (buf_block_t) for the pages in that chunk.
- The second part stores the actual 16KB page frames.
- The mapping between buf_block_t and the actual page frame is done via the frame pointer in buf_block_t.

PostgreSQL

Conceptually (as illustrated):

PostgreSQL also shards the shared buffer mapping, with a similar idea.
It first hashes the key (tablespaceOid, dbOid, relNumber, forkNum, blockNum) to obtain a bucket number.
Then it uses bucket_number >> 8 to locate the directory entry in the first-level mapping, i.e., the segment (dir).
Each segment contains 256 buckets, so after finding the segment, it uses bucket_number % 256 to locate the bucket within the segment.
It then traverses the bucket chain, comparing keys one by one to find the page.
All page frames are stored in one contiguous memory region, as an array: BufferBlocks[].
- Each page is 8KB.
- PostgreSQL does not split this region into chunks like MySQL does, all pages are stored together.
- Metadata for pages is stored separately in another array: BufferDescriptors[].
- Both arrays have the same number of elements, equal to the total number of buffers/pages.
- The indices align one-to-one: it is straightforward to locate the actual page frame from the metadata by index.
- The hash table stores buf_id, which is the index into both BufferDescriptors[] and BufferBlocks[].

Summary: Both MySQL and PostgreSQL implement fairly standard hash-table-based page lookup; there isn’t a fundamental difference there. The biggest difference is that MySQL splits pages into chunks, which makes it easier to dynamically resize the buffer pool by adding/removing chunks.

2. Eviction Policy for Old Pages (Aging) and Implementation

MySQL

MySQL maintains page aging information in a direct way: pages in the hash table are also linked into an LRU doubly-linked list. Each page’s buf_page_t::LRU is the list node that links the page into the LRU list.

The LRU head points to the most recently accessed page.
The LRU tail points to the least recently accessed page.

Each time a page is found via hash lookup, MySQL moves the page to the head of the LRU list via buf_page_t::LRU. Over time, pages that are not accessed drift toward the tail. When memory is insufficient and an old page must be evicted, the tail provides a fast candidate.

Of course, that is the conceptual LRU behavior. MySQL adds an important optimization, because the above design has a major problem: if requests perform table scans, a large number of pages enter the LRU and can overwrite/destroy the existing hot/cold information. To avoid scan workloads disrupting the LRU, MySQL splits the LRU list.

Roughly ~37.5% from the tail, it maintains a midpoint:

To the left is the young area: the true hot region.
To the right is the old area: a screening region for newly loaded pages.

All new pages loaded from disk are initially inserted at the midpoint, i.e., the head of the old list. Since it is close to the tail, such pages are more likely to be evicted quickly. If a page is accessed again before it is evicted, MySQL does not immediately promote it to the young region. Instead, it records the first access time, and the page’s position stays unchanged. Only when it is accessed again, and the elapsed time since the first access exceeds innodb_old_blocks_time (default 1 second), will it be promoted to the LRU head (young region). As a result, pages introduced by full table scans typically stay in the old area for less than 1 second and are evicted quickly, without polluting the hot working set in the young region.

When a user thread needs to read a disk page but the buffer pool is full, it evicts an old page from the LRU tail and uses that frame to load the needed page. But eviction is not that trivial. Below is the concrete eviction procedure when a user thread needs a new page:

First attempt (n_iterations == 0)

First, try the free list. If a free page is found, return it. Otherwise:
If try_LRU_scan == true, it indicates a partial LRU scan is allowed. Scan from the tail forward, at most 100 pages.
- If an evictable page is found, reset it and move it to the free list, then return to step 1 and retry.
- If no evictable page is found, set try_LRU_scan = false to tell other user threads that partial LRU scanning is ineffective, so they should skip partial scans and go directly to the single-page flush path.
Notify the page cleaner thread that free pages are insufficient and it should accelerate cleaning.
Scan forward from the tail.
- If a clean evictable page is found, evict it directly.
- Otherwise, locate the first dirty page that can be flushed; perform a synchronous flush of that single page; then add it to the free list and proceed to the next attempt.

Second attempt (n_iterations == 1)

Same as first attempt step 1.
Perform a full LRU list scan starting from the tail, searching for an evictable page; if found, move it to the free list and return to step 1 to retry. If that fails:
Same as first attempt step 3.
Same as first attempt step 4.

Third and subsequent attempts (n_iterations > 1)

Same as first attempt step 1.
Same as second attempt step 2.
Same as first attempt step 3.
Sleep for 10ms.
Same as first attempt step 4.

One more detail worth mentioning: the LRU scan does not always start from the tail for every thread. Each buf_pool_t maintains a global scan cursor lru_scan_itr (type LRUItr). After a thread finishes scanning, it leaves the cursor at its current position, and the next thread continues scanning from there, avoiding multiple threads repeatedly scanning the same region. Only when the cursor is empty/invalid, or still within the old region (meaning the previous scan did not progress far enough), will it be reset back to the tail. In addition, single-page flushing (step 4) uses another independent cursor single_scan_itr; these two cursors do not interfere with each other.

PostgreSQL

PostgreSQL does not maintain a global LRU list like MySQL does, but that doesn’t mean it does not perform LRU-style eviction. It simply takes another path.

All page metadata lives in the BufferDescriptors[] array. Each BufferDescriptor has two fields representing the current usage state of its corresponding page:

refcount: how many backends are currently using (pinning) the page
usage_count: the accumulated number of accesses to the page (capped at 5, When accessed via a ring buffer strategy, it is only incremented if it is currently 0, limiting it to 1)

Whenever a backend accesses a page via the hash table, it increments both refcount and usage_count. When the backend is done with the page, it only decrements refcount. Therefore, usage_count serves as an approximate LRU weight (but not unbounded, it stops increasing once it reaches 5).

When a backend tries to load a page from disk but finds no free page, it starts a clock sweep: it traverses BufferDescriptors circularly. If a buffer is not currently used by any backend (refcount == 0), it decrements usage_count (cooling down the LRU weight) and continues sweeping. Eventually it finds a buffer where both refcount == 0 and usage_count == 0, and that buffer becomes the victim for eviction.

Of course, this alone is still insufficient to prevent LRU pollution from one-time full scans. PostgreSQL has its own optimization: introducing a local ring buffer.

Each backend has its own local ring buffer: essentially a fixed-length array of buffer IDs. A buffer ID points to a page slot in the global BufferDescriptors. The ring buffer limits how many global buffers the backend consumes at once, so eviction is more likely to happen within the ring buffer itself, reducing pollution of the global shared buffers.

More concretely, suppose a backend is performing a sequential scan and the upper layer marks the operation to use the ring buffer. When reading pages via the hash table:

If the backend’s local ring buffer is not full, it stores the buffer ID into the ring buffer.
As reading continues, the ring buffer becomes full.
After it is full, when it needs to read the next page:
- It checks the page at the ring buffer’s current cursor position.
- If that buffer is not used by other backends (refcount == 0 and usage_count <= 1), it reuses it directly: evict and load the next page into it.
- If that buffer is currently used by other backends, it falls back to searching in BufferDescriptors for another available buffer to load the next page, and then replaces the current ring entry with the new buffer ID.

Here you can see the different approaches MySQL and PostgreSQL take for the same scenario. MySQL introduces an “old/young” split in the global LRU list as a general strategy to prevent pollution. PostgreSQL’s ring buffer is essentially also an “old area”, but it relies on higher-level operation tagging: only scan-heavy operations such as VACUUM, sequential scan, bulk insert, etc., will use the ring buffer.

Below is the complete procedure PostgreSQL uses to find a free buffer when a backend needs one:

Determine whether to use the ring buffer. If yes, inspect the buffer at the ring’s current cursor position:

a. If it has not been used before, the ring is not full yet, go to step 2.

b. Otherwise the ring is full. If the buffer is not used by any backend (refcount == 0 and usage_count <= 1), it can be reused immediately, return this buffer.

c. If the buffer is used by other backends, fall back to step 2 to find a buffer from the global pool; after success, replace the current ring entry with the newly found buffer ID.
Check the free list. If a buffer is available, return it.
Start clock sweep: traverse from nextVictimBuffer (the current sweep cursor in BufferDescriptors):
- If refcount != 0, skip.
- Otherwise, if usage_count != 0, decrement it (cooling down) and continue.
- Otherwise, the buffer is evictable. If it is not dirty, return it immediately. If it is dirty, flush it and then return it.
- Advance nextVictimBuffer accordingly.

Summary: MySQL and PostgreSQL are similar in essence: both are LRU-like. MySQL chooses to implement an explicit LRU list for more precise eviction, at the cost of additional overhead to maintain the list. PostgreSQL uses reference counting plus usage_count as an approximate LRU, avoiding the locking overhead of maintaining a true LRU list but losing precision. This is the result of different trade-offs. Another notable difference: when a MySQL foreground thread tries to find a free page, it tends to prefer evicting old pages that are not dirty first; PostgreSQL’s sweep does not have an explicit priority between dirty and clean pages in the same sense.

3. Dirty Page Flushing Strategy and Implementation

Earlier we mentioned that MySQL user threads and PostgreSQL backends may flush a single dirty page when searching for a free page (single-page flush). However, such foreground single-page flushing is only an emergency measure when no free page is available.

For normal bulk flushing, both MySQL and PostgreSQL have dedicated background threads/processes. The goal is to flush dirty pages in advance and evict old pages so that foreground threads can quickly find free pages.

Background flushing has two goals:

LRU flush: flush old pages in advance based on foreground free-page pressure, reducing foreground wait time for free pages.
Checkpoint flush: flush dirty pages associated with the oldest WAL LSN to advance the checkpoint, purge old WAL, and reduce crash recovery time.

MySQL

In MySQL (InnoDB), background flushing is performed by page cleaner threads, consisting of one coordinator and N workers.

Coordinator

Sleep for ~1 second, or be woken by a foreground thread.
Check whether work is needed (sync flush / adaptive / idle). If yes:
Dynamically calculate the number of dirty pages to flush in the next batch: n_pages.
Pass n_pages to all workers and wake them up. Each worker is responsible for one buf_pool_t slot. The coordinator itself also works as worker 0.
Wait for all workers to finish.

Worker

Wait to be woken by the coordinator.
Locate the assigned buf_pool_t slot.
LRU flush: scan from the LRU tail forward, scanning at most srv_LRU_scan_depth pages.
- If a page is clean and not being used, move it directly from the LRU list to the free list.
- If a page can be flushed, initiate asynchronous I/O; after I/O completes, move it into the free list.
- Stop early if the free list length reaches srv_LRU_scan_depth.
Checkpoint flush: scan from the flush list tail forward and flush continuously until:
- the number of flushed pages satisfies the quota assigned by the coordinator, or
- the WAL LSN advances to the target LSN assigned by the coordinator.
Finish and report to the coordinator.

Now, step 3 in the coordinator is adaptive: it calculates the flush workload and the target LSN advancement. The logic is as follows:

a. Based on dirty page percentage (`get_pct_for_dirty()`)

Compute dirty_pct, the percentage of dirty pages in the buffer pool:

If innodb_max_dirty_pages_pct_lwm (low watermark) is set and dirty_pct >= lwm, start progressive flushing and return the percentage of io_capacity as: dirty_pct * 100 / (max_dirty_pages_pct + 1)
If no low watermark is set, but dirty_pct >= innodb_max_dirty_pages_pct (high watermark), flush at 100% io_capacity.
Otherwise, do not flush based on dirty ratio (return 0).

b. Based on redo log age (`get_pct_for_lsn(age)`)

Compute checkpoint age: age = current_lsn - oldest_lsn

If age < innodb_adaptive_flushing_lwm (default 10% of redo log capacity), no adaptive flushing needed (return 0).
If age exceeds the low watermark: age_factor = age * 100 / limit_for_dirty_page_age Return the percentage of io_capacity as: (max_io_capacity / io_capacity) * age_factor * sqrt(age_factor) / 7.5

This is a super-linear growth curve: as redo space approaches exhaustion, flushing ramps up aggressively.

Combined calculation (`set_flush_target_by_lsn()`)

Take: pct_total = max(pct_for_dirty, pct_for_lsn)

Then compute the target LSN: target_lsn = oldest_lsn + lsn_avg_rate * 3 (i.e., advance by 3× the recent average redo generation rate; buf_flush_lsn_scan_factor = 3)

Then traverse each buffer pool instance’s flush list and count the number of pages whose oldest_modification <= target_lsn. Call this number pages_for_lsn (pages that must be flushed to advance checkpoint to target_lsn).

Finally, take the average of three estimates:

n_pages = (PCT_IO(pct_total) + page_avg_rate + pages_for_lsn) / 3

Where:

PCT_IO(pct_total) is the I/O demand estimated from dirty ratio / redo age.
page_avg_rate is the recent actual average flushing rate (moving average across multiple iterations).
pages_for_lsn is the precise demand obtained from scanning the flush list.

Averaging these three makes the flushing rate smoother and avoids abrupt oscillation. n_pages is capped by srv_max_io_capacity.

If redo pressure is high (pct_for_lsn > 30), the per-instance flush quota is weighted by how many pages in each instance’s flush list need flushing; otherwise, it is evenly distributed across instances.

Sync Flush mode

When redo log space is extremely tight (checkpoint cannot keep up with redo generation), log_sync_flush_lsn() returns non-zero and the coordinator enters sync flush mode:

It no longer sleeps for 1 second; it starts the next iteration immediately.
n_pages is set directly to pages_for_lsn (no averaging), with a lower bound of srv_io_capacity.
It loops until redo pressure is relieved.

Idle flushing

When the server is idle (no user activity) and the 1-second sleep times out, the coordinator does not run the adaptive algorithm. Instead, it flushes in the background using innodb_idle_flush_pct percent of innodb_io_capacity (default 100%), keeping the buffer pool clean.

PostgreSQL

PostgreSQL also has both LRU flush and checkpoint flush, but unlike MySQL’s unified page cleaner, PostgreSQL separates responsibilities:

bgwriter handles LRU flush
checkpointer handles checkpoint flush

1. bgwriter

The goal of bgwriter is to predict the upcoming demand for free buffers based on historical and current pressure, and try to free enough buffers before backends are forced into heavy clock sweep work (i.e., flush dirty pages that are otherwise reusable victims).

The overall flow:

Collect historical info from clock sweep, including:
- strategy_buf_id: the current backend clock sweep position
- strategy_passes: how many full sweeps have been completed
- recent_alloc: how many buffers have been allocated by backends since the last bgwriter recycle
Compare bgwriter’s current position next_to_clean with clock sweep’s strategy_buf_id, and determine how far ahead it is:
- bufs_to_lap: number of buffers bgwriter must scan for next_to_clean to “lap” (catch up to) strategy_buf_id.
  - Case 1: same pass, bgwriter ahead → bufs_to_lap is the remaining distance to lap.
  - Case 2: same pass, bgwriter behind → set next_to_clean to strategy_buf_id, set bufs_to_lap = NBuffers, effectively reset bgwriter.
  - Case 3: bgwriter already one full pass ahead → bufs_to_lap may be negative, meaning bgwriter has scanned everything it can scan; no need to scan in this round.
- bufs_ahead = NBuffers - bufs_to_lap (how many buffers bgwriter is ahead of sweep)
Based on the history above, compute how many buffers clock sweep needs to scan to find one free buffer, i.e. scans_per_alloc. Maintain an exponential moving average: smoothed_density += (scans_per_alloc - smoothed_density) / 16;
Maintain smoothed_alloc similarly:
- If smoothed_alloc < recent_alloc, set smoothed_alloc = recent_alloc (fast attack).
- Otherwise decay slowly using EMA: smoothed_alloc += (recent_alloc - smoothed_alloc) / 16; (slow decay)
Compute the prediction for the next round:
- upcoming_alloc_est = smoothed_alloc * bgwriter_lru_multiplier (predict upcoming allocations)
- Estimate how many reusable buffers exist in the region bgwriter is ahead: reusable_buffers_est = bufs_ahead / smoothed_density
- Ensure minimum progress: min_scan_buffers = NBuffers / (120s / 200ms) Then: upcoming_alloc_est = max(upcoming_alloc_est, min_scan_buffers + reusable_buffers_est)
This “minimum progress” ensures that even if the system is idle, bgwriter will scan the entire buffer pool in about 120 seconds, continuously cleaning dirty pages.
Scan from next_to_clean. For each buffer, bgwriter only considers buffers with refcount == 0 and usage_count == 0 (truly reusable candidates). It skips buffers in use or recently used. If a candidate is dirty, it flushes it synchronously. Stop scanning when any of these is met:
- bufs_to_lap reaches 0 (caught up to clock sweep)
- reusable_buffers reaches upcoming_alloc_est (freed enough reusable buffers)
- num_written reaches bgwriter_lru_maxpages (default 100) to avoid excessive I/O in one round

After one scan round, bgwriter sleeps for bgwriter_delay (default 200ms) before next iteration. If bufs_to_lap == 0 and recent_alloc == 0 (no allocation activity), bgwriter enters hibernation and sleeps longer, until a backend needing buffers wakes it via latch.

2. checkpointer

The goal of checkpointer is to flush all dirty pages up to a consistency point, forming a checkpoint. This advances WAL recycling and reduces how much WAL must be replayed during crash recovery. Unlike bgwriter, checkpointer does not care whether a page was recently used, it must flush all pages that were dirty at checkpoint start.

Trigger conditions: in the main loop, checkpointer triggers a checkpoint when any of the following occurs:

Time since last checkpoint exceeds checkpoint_timeout (default 5 minutes)
WAL volume exceeds max_wal_size and backends notify checkpointer
User manually runs CHECKPOINT
Shutdown checkpoint during server shutdown

Detailed procedure:

Scan and collect dirty buffers: traverse all NBuffers BufferDescriptors. For each dirty page, set the BM_CHECKPOINT_NEEDED flag, and collect its identity info into CkptBufferIds[] (tablespace OID, relation number, fork number, block number, etc.). Note: only pages that are already dirty at checkpoint start are included. Pages that become dirty during the checkpoint are not included and will be handled in the next checkpoint.
Sort: sort CkptBufferIds[] by (tablespace, relation, fork, block). This clusters pages from the same file and orders them by increasing block number, converting random I/O into more sequential patterns as much as possible.
Build tablespace-level progress tracking: traverse the sorted array and group by tablespace. For each tablespace, build a CkptTsStatus structure tracking total pages to flush and current progress. Put all tablespaces into a binary heap (min-heap), ordered by flush progress.
Balanced flushing across tablespaces: repeatedly pop the tablespace with the lowest progress from the heap, flush its next dirty page (via SyncOneBuffer), update its progress, then re-heapify. The purpose is to spread writes evenly across tablespaces (possibly on different disks), instead of flushing one tablespace completely before another. Unlike bgwriter, checkpointer calls SyncOneBuffer with skip_recently_used = false, meaning it will flush buffers with BM_CHECKPOINT_NEEDED regardless of recent usage.
Write throttling: after flushing each page, call CheckpointWriteDelay() to throttle. The goal is to finish flushing within: checkpoint_completion_target (default 0.9) × checkpoint_timeout. The logic compares:
- flush progress (flushed pages / total),
- elapsed time progress,
- WAL progress.
- If flush progress is ahead of both time progress and WAL progress (IsCheckpointOnSchedule == true), sleep 100ms.
- If lagging behind, do not sleep and flush at full speed.
- In IMMEDIATE mode (e.g., shutdown checkpoint) or under urgent checkpoint requests, do not throttle.
This spreads checkpoint I/O across the entire checkpoint window and avoids I/O spikes.
Writeback coalescing: if not using O_DIRECT, similar to bgwriter, use WritebackContext to collect tags for flushed pages. After accumulating enough, batch-call IssuePendingWritebacks(), sort and coalesce adjacent blocks, and use posix_fadvise to hint the kernel to write back OS cache pages to disk. After checkpoint completion, force one more IssuePendingWritebacks() to ensure all pending writebacks are issued.

Summary: Although the implementations differ significantly, both MySQL and PostgreSQL aim to pre-clean pages in the background so that foreground threads can quickly find free pages. PostgreSQL’s bgwriter predicts upcoming buffer allocation demand from foreground activity; MySQL’s page cleaner reacts to dirty page pressure and redo log age.

From an engineering perspective, their differences largely come down to the trade-off between linked lists and arrays:

With linked lists, MySQL can precisely obtain LRU ordering and dirty-page ordering from old to new. This greatly improves precision in eviction and flushing decisions. In particular, for checkpoint flushing, it can directly take the oldest dirty pages from the flush list tail to advance checkpoint quickly. The trade-off is the cost of maintaining those lists.
PostgreSQL sacrifices some precision and scans arrays instead, avoiding the additional overhead of maintaining linked lists. It is also worth noting that PostgreSQL’s checkpoint flushing emphasizes balanced progress across tablespaces rather than globally prioritizing the oldest dirty pages to advance checkpoint in small steps.

Visualizing MySQL BLOB Internals Directly from MySQL Data Files (.ibd)

2026-02-08T00:00:00+00:00

In a previous post, I explored how MySQL implements partial updates and multi-versioning for BLOB columns internally.

To better see what actually happens inside the data files, I’ve added a new feature to ibdNinja, an interactive BLOB inspection mode:

--inspect-blob

This feature is designed as a extension of ibdNinja’s existing inspection workflow, allowing you to drill down from high-level structures to the actual BLOB data stored on disk.

How it works:

Step 1

Use ibdNinja’s existing features to parse, extract, and print information from a MySQL .ibd file at the table, index, page, and record levels. Once you’ve located a record you want to dive deeper into, note its page number and record number.

Step 2

Pass those identifiers to --inspect-blob:

ibdNinja -f  --inspect-blob ,

to start an interactive inspection of the BLOB field in that record.

As shown above, ibdNinja will:

Traverse the external BLOB page chain
Reconstruct the version chain introduced by partial updates
Visualize the complete on-disk layout of the BLOB across all versions

From there, you can choose any version and:

Hex-print or dump the full value for binary BLOBs (images, raw binary data, etc.)
Decode JSON BLOBs (MySQL JSON is still a BLOB internally) into readable text, or inspect the raw MySQL-encoded JSON in hex

If some historical versions have already been purged, ibdNinja will detect that and clearly report it.

If you’re into MySQL data file internals, or knee-deep in development, debugging, or production issues, give ibdNinja a try, dig under the hood — and consider bug reports part of the feature set.

A POC on optimizing MySQL’s unique index insertion path

2026-01-25T00:00:00+00:00

A few months ago, I wrote a post about a possible optimization in MySQL’s unique index insertion path. As illustrated there, the idea is to reduce the current 3 B+Tree searches into 1 B+Tree search plus a scan on the leaf page (or leaf level), in order to avoid the overhead of repeatedly traversing the tree. This weekend, I implemented a quick proof-of-concept on MySQL 8.0.45 and measure the effect.

1. Setup:

Table with 200K rows, a VARCHAR(700) unique key (latin1), creating a tall B-tree:

CREATE TABLE t1 (
 id INT PRIMARY KEY AUTO_INCREMENT,
 uk_col VARCHAR(700) NOT NULL,
 UNIQUE KEY uk_idx (uk_col)
) ENGINE=InnoDB CHARACTER SET latin1;

2. Test procedure:

Insert 100 TARGET rows with prefix “TARGET_ROW_”
Start a blocker transaction (START TRANSACTION WITH CONSISTENT SNAPSHOT) to prevent purge
Delete the 100 TARGET rows (creates delete-marked records)
Re-insert the same 100 TARGET rows, this triggers the duplicate-check path, since delete-marked records with the same unique key exist
Instrument row_ins_sec_index_entry_low() with timing around each B-tree search.
Run the benchmark twice: once with the original path, reset metrics, then with the optimized path

3. Results:

Original path (3 B-tree searches):

Search1: ~6,508 ns
Search2: ~5,649 ns
Search3: ~2,498 ns
Total: ~14,656 ns
Optimized path (1 B-tree search + inline scan):
Search1: ~7,272 ns
Inline: ~3,118 ns
Total: ~10,390 ns

Improvement: ~29.1% reduction in search-path time

This test focuses specifically on the unique index insertion path (row_ins_sec_index_entry_low()), comparing the cost of the original three searches with the optimized “one search + inline scan” approach. In this local scope, the saving is close to 30%, which matches the intuition of collapsing three tree traversals into one.

4. However, when evaluating the overall benefit, there are a few important considerations:

In a single-row insert, how large is this part relative to the whole insert path? If its share is small, the end-to-end gain will be diluted. In my tests, when measuring the full insert path, the improvement drops to single-digit percentages.
Under concurrent workloads, each of the three B-tree searches holds page latches. This is one of the key factors affecting scalability. Reducing this section by ~30% also shortens latch holding time, so the benefit may be more visible in parallel scenarios.
While implementing the POC, I also realized that this optimization is not a silver bullet. There are cases that still need to fall back to the original path, although there are ways to minimize how often that happens.

These are just the numbers from a quick POC. If this direction turns out to be meaningful, it would still require much more careful design, implementation, and testing.

Bug #118363

MySQL BLOB Internals - Partial Update Implementation and Multi-Versioning

2025-12-01T00:00:00+00:00

In this blog, I would like to introduce the implementation of BLOB and BLOB partial update in MySQL, and explain how the current design works together with the MVCC module to support multi-version control for BLOB columns.

1. Background

Before going into the details, I would like to briefly introduce two important concepts that are closely related to this topic.

1. Basic Principles of MySQL MVCC (Multi-Version Concurrency Control)

MySQL supports snapshot reads. Each read transaction reads data based on a certain snapshot, so even if other write transactions modify the data during the execution of a read transaction, the read transaction will always see the version it is supposed to see.

The underlying mechanism is that a write transaction directly updates the data in place on the primary key record. However, before the update happens, the old value of the field to be modified is copied into the undo space. At the same time, there is a ROLL_PTR field in the row that points to the exact location in the undo space where the old value (the undo log record) is stored.

As shown in the figure above, there is a row in the primary key index that contains three fields. Suppose a write transaction is modifying Field 2. It will first copy the original value of Field 2 into the undo space, and then overwrite Field 2 directly in the row. After that, two important system fields of the row are updated:

TRX_ID is set to the ID of the current write transaction and is used later by read transactions to determine visibility.
ROLL_PTR points to the exact location in the undo space where the old value of the modified field is stored, and is used to reconstruct the previous version of the row when needed.

After the update is finished, if a previously existing read transaction reads this row again, it will find, based on the TRX_ID, that the row has been modified by a later write transaction. Therefore, the current version of the row is not visible to this read transaction. It must roll back to the previous version. At this point, it uses the ROLL_PTR to locate the old value in the undo space, applies it to the current row, and thus reconstructs the version that it is supposed to see.

2. Basic Implementation of MySQL BLOB

The primary key record in MySQL contains the values of all fields and is stored in the clustered index. However, BLOB columns are an exception. Since they are usually very large, MySQL stores their data in separate data pages called external pages.

A BLOB value is split into multiple parts and stored sequentially across multiple external pages. These pages are linked together in order, like a linked list. So how does the primary key record locate the corresponding BLOB data stored in those external pages? For each BLOB column, the clustered record stores a reference (lob::ref_t). This ref_t contains some metadata about the column and a pointer to the first external page where the BLOB data starts.

When reading the row, MySQL first locates the row via the primary key index, then follows this reference to find the external pages and reconstructs the full BLOB value by copying the data from those pages.

This is a very straightforward and intuitive design, simple and sufficient. It is also exactly how BLOB was implemented in older versions of MySQL.

3. A “Thought Exercise”

Based on the two points above, here is a question:

How is MVCC implemented for BLOB in MySQL?

The intuitive answer is as follows: the lob::ref_t stored in the primary key record follows the same MVCC rules. Every time a BLOB column is updated, the old BLOB value is read out, modified, and then the entire modified BLOB is written into newly allocated external pages. The corresponding lob::ref_t in the primary key record is overwritten with the new reference. At the same time, following the MVCC mechanism, the old lob::ref_t is copied into the undo space.

After the modification, the situation looks like this (as shown in the figure): the undo space stores the lob::ref_t that points to the old BLOB value, while the lob::ref_t in the primary key record points to the new value.

This is exactly how older versions of MySQL worked. The next question is:

What are the pros and cons of this design?

The advantage is that the undo log only needs to record the lob::ref_t, and it does not need to store the entire old BLOB value.

The disadvantage is that no matter how small the change to the BLOB is, even if only a single byte is modified, the entire modified BLOB still has to be written into newly allocated external pages. BLOB columns are usually very large, so if each update only changes a very small portion, this design introduces a lot of extra I/O and space overhead.

A typical example is JSON. Internally, MySQL stores JSON as BLOB. Usually, updates to JSON are local and small. However, with the old design, each small partial update still requires reading the entire JSON, modifying a part of it, and then inserting the whole value back again. This is obviously very heavy.

So how to solve this problem? MySQL introduced BLOB partial update to address it.

2. Implementation of BLOB Partial Update

MySQL optimized the format of the external pages used to store BLOB data and redesigned the original simple linked-list structure:

Each external page now has a corresponding index entry.
These index entries are organized as a linked list and stored in the BLOB first page. (If there are too many index entries to fit, they are stored in separate BLOB index pages.)
Under normal circumstances, these index entries are linked together in order, just like the external pages in the old implementation.
To support partial updates, MySQL changes the granularity of BLOB updates from the whole BLOB to individual external pages. Only the external pages involved in the current modification are updated. The modified external page is copied into a new page and updated there, while the other external pages remain unchanged.

Then the question becomes: how can MySQL make sure that it can read the correct new and old BLOB values? The answer is that the new external page and the old external page share the same logical position in the index entry list. In other words, at this specific position in the list, there are now two versions, version 1 and version 2. Which one is used is determined by the version number recorded in the current lob::ref_t. The idea is illustrated in the figure below.

In summary, MySQL transforms the original external-page linked list into a linked list of index entries. For each index entry in this list, if the corresponding external page is modified, a new version of the index entry is created at the same horizontal position to point to the new version of that external page. Essentially, this introduces multi-versioning for external pages.

Special Case: BLOB Small Changes

The implementation described above is not the whole story. MySQL makes a practical trade-off between creating a new index entry (which requires copying the entire external page) and copying only the modified portion into the undo space.

For BLOB small-change scenarios, when the modification to a blob is smaller than 100 bytes, MySQL does not create a new index entry and link it into the version chain for that page. Instead, it modifies the page in place. Following MVCC principles, the portion to be modified is first written into the undo space before the in-place update happens.

It is worth noting that in this case, the lob::ref_t stored in the primary key record does not advance its base version number. It shares the same base as the previous version. When a read transaction needs to read the previous version, it first constructs the latest BLOB value based on the lob::ref_t and the index entry list. Then, following the MVCC logic, it finds that the TRX_ID indicates that this version is not visible. At this point, it follows the ROLL_PTR to the undo space, where the old value of the modified external page is stored. By applying that old data back onto the current value, the complete and correct historical BLOB value can be reconstructed.

In this scenario, the recovery process is a combination of two steps:

First, the version corresponding to the lob::ref_t is reconstructed via the index entry version chain.
Then, the version visible to the current transaction is reconstructed via the ROLL_PTR chain.

Index Entry Details

Index entries are the key to the implementation of BLOB partial update. To make them easier to understand, I drew the following diagram to illustrate the logical relationships among index entries. It is a two-dimensional linked list. The horizontal dimension represents the sequential position when assembling the full BLOB value. The vertical dimension represents multiple versions at the same position. Each time the page at that position is modified, a new node is added vertically.

Of course, this is only a logical model. The physical layout is not organized exactly like this. Each BLOB has a BLOB first page. This page stores a portion of the BLOB data (the initial part) and 10 index entries. Each index entry corresponds to one BLOB data page. When all 10 index entries are used up, a new BLOB index page is allocated, and additional index entries are allocated from there. In reality, the index entries distributed across the BLOB first page and the BLOB index pages are linked together to form the logical structure shown in the diagram above.

SIMD in Vector Search - “Hand-Tuned SIMD vs Compiler Auto-Vectorization”

2025-09-08T00:00:00+00:00

SIMD (Single instruction, multiple data) is often one of the key optimization techniques in vector search. In particular, when computing the distance between two vectors, SIMD can transform what was originally a one-dimensional-at-a-time calculation into 8- or 16-dimensions-at-a-time, significantly improving performance.

Here, as I mentioned in previous posts, MariaDB and pgvector take different approaches:

MariaDB: directly implements distance functions using SIMD instructions.
pgvector: implements distance functions in a naive way and relies on compiler optimization (-ftree-vectorize) for vectorization.

To better understand the benefits of SIMD vectorization, and to compare these two approaches, I ran a series of benchmarks — and discovered some surprising performance results along the way.

1. Test Environment and Method

Environment

AWS EC2: c5.4xlarge, 16 vCPUs, 32 GiB memory
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

Method

First, I implemented 4 different squared L2 distance (L2sq) functions (i.e., Euclidean distance without the square root):

Naive L2sq implementation

static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) {
  float acc = 0.f;
  for (size_t i = 0; i < n; ++i) { float d = a[i] - b[i]; acc += d * d; }
  return (double)acc;
}

Naive high-precision L2sq (converting float to double before computation)

static inline double l2sq_naive_f64(const float* a, const float* b, size_t n) {
  double acc = 0.0;
  for (size_t i = 0; i < n; ++i) { double d = (double)a[i] - (double)b[i]; acc += d * d; }
  return acc;
}

SIMD (AVX2) L2sq implementation, computing 8 dimensions at a time

// Reference: simSIMD
SIMSIMD_PUBLIC void simsimd_l2sq_f32_haswell(simsimd_f32_t const *a,
                                             simsimd_f32_t const *b,
                                             simsimd_size_t n,
                                             simsimd_distance_t *result) {
   
    __m256 d2_vec = _mm256_setzero_ps();
    simsimd_size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 a_vec = _mm256_loadu_ps(a + i);
        __m256 b_vec = _mm256_loadu_ps(b + i);
        __m256 d_vec = _mm256_sub_ps(a_vec, b_vec);
        d2_vec = _mm256_fmadd_ps(d_vec, d_vec, d2_vec);
    }
   
    simsimd_f64_t d2 = _simsimd_reduce_f32x8_haswell(d2_vec);
    for (; i < n; ++i) {
        float d = a[i] - b[i];
        d2 += d * d;
    }
   
    *result = d2;
}
SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f32x8_haswell(__m256 vec) {
    // Convert the lower and higher 128-bit lanes of the input vector to double precision
    __m128 low_f32 = _mm256_castps256_ps128(vec);
    __m128 high_f32 = _mm256_extractf128_ps(vec, 1);
   
    // Convert single-precision (float) vectors to double-precision (double) vectors
    __m256d low_f64 = _mm256_cvtps_pd(low_f32);
    __m256d high_f64 = _mm256_cvtps_pd(high_f32);
   
    // Perform the addition in double-precision
    __m256d sum = _mm256_add_pd(low_f64, high_f64);
    return _simsimd_reduce_f64x4_haswell(sum);
}
SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f64x4_haswell(__m256d vec) {
    // Reduce the double-precision vector to a scalar
    // Horizontal add the first and second double-precision values, and third and fourth
    __m128d vec_low = _mm256_castpd256_pd128(vec);
    __m128d vec_high = _mm256_extractf128_pd(vec, 1);
    __m128d vec128 = _mm_add_pd(vec_low, vec_high);
   
    // Horizontal add again to accumulate all four values into one
    vec128 = _mm_hadd_pd(vec128, vec128);
   
    // Convert the final sum to a scalar double-precision value and return
    return _mm_cvtsd_f64(vec128);
}

SIMD (AVX-512) L2sq implementation, computing 16 dimensions at a time

// Reference: simSIMD
SIMSIMD_PUBLIC void simsimd_l2sq_f32_skylake(simsimd_f32_t const *a,
                                             simsimd_f32_t const *b,
                                             simsimd_size_t n,
                                             simsimd_distance_t *result) {
    __m512 d2_vec = _mm512_setzero();
    __m512 a_vec, b_vec;
   
simsimd_l2sq_f32_skylake_cycle:
    if (n < 16) {
        __mmask16 mask = (__mmask16)_bzhi_u32(0xFFFFFFFF, n);
        a_vec = _mm512_maskz_loadu_ps(mask, a);
        b_vec = _mm512_maskz_loadu_ps(mask, b);
        n = 0;
    }
    else {
        a_vec = _mm512_loadu_ps(a);
        b_vec = _mm512_loadu_ps(b);
        a += 16, b += 16, n -= 16;
    }
    __m512 d_vec = _mm512_sub_ps(a_vec, b_vec);
    d2_vec = _mm512_fmadd_ps(d_vec, d_vec, d2_vec);
    if (n) goto simsimd_l2sq_f32_skylake_cycle;
   
    *result = _simsimd_reduce_f32x16_skylake(d2_vec);
}
......

I generated a dataset of 10,000 float vectors (dimension = 1024, 64B aligned) and one target vector. Then, for the following 5 scenarios, I searched for the vector with the closest L2sq distance to the target. Each distance computation was repeated 16 times (to create a CPU-intensive workload), and each scenario was executed 5 times, taking the median runtime to eliminate random fluctuations:
1. SIMD L2sq implementation
2. Naive L2sq implementation
3. Naive L2sq with compiler vectorization disabled (-fno-tree-vectorize -fno-builtin -fno-lto -Wno-cpp -Wno-pragmas)
4. Naive high-precision L2sq implementation
5. Naive high-precision L2sq with compiler vectorization disabled
Compile with AVX2 (-O3 -mavx2 -mfma -mf16c -mbmi2) and run the 5 scenarios.
Compile with AVX-512 (-O3 -mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512cd -mfma -mf16c -mbmi2) and run the 5 scenarios again.

2. Results and Analysis

Expected results：

SIMD L2sq implementations are much faster than others, and AVX-512 outperforms AVX2 since it processes 16 dimensions at once instead of 8.
Under AVX2, naive L2sq (178.385ms) is faster than naive high-precision L2sq (183.973ms), because the latter incurs float→double conversion overhead.
Under both AVX2 and AVX-512, naive implementations with compiler vectorization disabled perform the worst, since they are forced into scalar execution.

Unexpected Results

In addition to the expected results above, some surprising findings appeared:

For naive L2sq, AVX-512 performance (208.822ms) was actually slower than AVX2 (178.385ms).
With AVX-512, naive L2sq was slower than naive high-precision L2sq.

Both deserve deeper analysis.

(1) Why was naive L2sq with AVX-512 slower than with AVX2?

Although this was a naive implementation, with -O3 we would expect the compiler to auto-vectorize. However, the vectorized result generated by the compiler was far worse than our manual SIMD implementation, and AVX-512 even performed worse than AVX2.

To investigate further, I used objdump to examine the AVX2 and AVX-512 binaries for l2sq_naive_f32().

Under AVX2:

0000000000007090 <_ZL19l2sq_naive_f32PKfS0_m>:
     ... ...
     70b7:       48 c1 ee 03             shr    rsi,0x3
     70bb:       48 c1 e6 05             shl    rsi,0x5
     70bf:       90                      nop
     70c0:       c5 fc 10 24 07          vmovups ymm4,YMMWORD PTR [rdi+rax*1]
     70c5:       c5 dc 5c 0c 01          vsubps ymm1,ymm4,YMMWORD PTR [rcx+rax*1]
     70ca:       48 83 c0 20             add    rax,0x20
     70ce:       c5 f4 59 c9             vmulps ymm1,ymm1,ymm1
       
     70d2:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
     70d6:       c5 f0 c6 d9 55          vshufps xmm3,xmm1,xmm1,0x55
     70db:       c5 f0 c6 d1 ff          vshufps xmm2,xmm1,xmm1,0xff
     70e0:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
     70e4:       c5 f0 15 d9             vunpckhps xmm3,xmm1,xmm1
     70e8:       c4 e3 7d 19 c9 01       vextractf128 xmm1,ymm1,0x1
     70ee:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
     70f2:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
     70f6:       c5 f0 c6 d1 55          vshufps xmm2,xmm1,xmm1,0x55
     70fb:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
     70ff:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
     7103:       c5 f0 15 d1             vunpckhps xmm2,xmm1,xmm1
     7107:       c5 f0 c6 c9 ff          vshufps xmm1,xmm1,xmm1,0xff
     710c:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
     7110:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
     ... ...

The compiler did use vector instructions (vmovups, vsubps, vmulps) to compute L2sq in groups of 8 floats. But when folding the 8 results horizontally into xmm0, it extracted elements using vshufps, vunpckhps, vextractf128, etc., and then added them one by one with scalar vaddss. Worse, this folding happened in every iteration.

This per-iteration horizontal reduction became the bottleneck. Instead, like the manual SIMD implementation, it should have accumulated vector results across the whole loop and performed just one horizontal reduction at the end.

Under AVX-512:

    a057:       48 c1 ee 04             shr    rsi,0x4
    a05b:       48 c1 e6 06             shl    rsi,0x6
    a05f:       90                      nop
    a060:       62 f1 7c 48 10 2c 07    vmovups zmm5,ZMMWORD PTR [rdi+rax*1]
    a067:       62 f1 54 48 5c 0c 01    vsubps zmm1,zmm5,ZMMWORD PTR [rcx+rax*1]
    a06e:       48 83 c0 40             add    rax,0x40
    a072:       62 f1 74 48 59 c9       vmulps zmm1,zmm1,zmm1
      
    a078:       c5 f0 c6 e1 55          vshufps xmm4,xmm1,xmm1,0x55
    a07d:       c5 f0 c6 d9 ff          vshufps xmm3,xmm1,xmm1,0xff
    a082:       62 f3 75 28 03 d1 07    valignd ymm2,ymm1,ymm1,0x7
    a089:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
    a08d:       c5 fa 58 c4             vaddss xmm0,xmm0,xmm4
    a091:       c5 f0 15 e1             vunpckhps xmm4,xmm1,xmm1
    a095:       c5 fa 58 c4             vaddss xmm0,xmm0,xmm4
    a099:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a09d:       62 f3 7d 28 19 cb 01    vextractf32x4 xmm3,ymm1,0x1
    a0a4:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0a8:       62 f3 75 28 03 d9 05    valignd ymm3,ymm1,ymm1,0x5
    a0af:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0b3:       62 f3 75 28 03 d9 06    valignd ymm3,ymm1,ymm1,0x6
    a0ba:       62 f3 7d 48 1b c9 01    vextractf32x8 ymm1,zmm1,0x1
    a0c1:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0c5:       c5 f0 c6 d9 55          vshufps xmm3,xmm1,xmm1,0x55
    a0ca:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0ce:       c5 f0 c6 d1 ff          vshufps xmm2,xmm1,xmm1,0xff
    a0d3:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1
    a0d7:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0db:       c5 f0 15 d9             vunpckhps xmm3,xmm1,xmm1
    a0df:       c5 fa 58 c3             vaddss xmm0,xmm0,xmm3
    a0e3:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0e7:       62 f3 7d 28 19 ca 01    vextractf32x4 xmm2,ymm1,0x1
    a0ee:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0f2:       62 f3 75 28 03 d1 05    valignd ymm2,ymm1,ymm1,0x5
    a0f9:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a0fd:       62 f3 75 28 03 d1 06    valignd ymm2,ymm1,ymm1,0x6
    a104:       62 f3 75 28 03 c9 07    valignd ymm1,ymm1,ymm1,0x7
    a10b:       c5 fa 58 c2             vaddss xmm0,xmm0,xmm2
    a10f:       c5 fa 58 c1             vaddss xmm0,xmm0,xmm1

The first part similarly used vector instructions to compute 16 values at a time. But folding 16 results was even more complex and expensive, involving vshufps, valignd, vunpckhps, vextractf32x4, vextractf32x8, etc. This additional complexity canceled out the gains from processing 16 dimensions per iteration, which explains why AVX-512 was slower.

(2) Why was naive float L2sq slower than naive high-precision L2sq under AVX-512?

Theoretically, high-precision L2sq should be slower because of float→double conversions. So why was it faster?

Looking at the disassembly of l2sq_naive_f64:

000000000000a280 <_ZL19l2sq_naive_f64PKfS0_m>:
    a280:       f3 0f 1e fa             endbr64
    a284:       48 85 d2                test   rdx,rdx
    a287:       74 37                   je     a2c0 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x40>
    a289:       c5 e0 57 db             vxorps xmm3,xmm3,xmm3
    a28d:       31 c0                   xor    eax,eax
    a28f:       c5 e9 57 d2             vxorpd xmm2,xmm2,xmm2
    a293:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
    a298:       c5 e2 5a 04 87          vcvtss2sd xmm0,xmm3,DWORD PTR [rdi+rax*4]
    a29d:       c5 e2 5a 0c 86          vcvtss2sd xmm1,xmm3,DWORD PTR [rsi+rax*4]
    a2a2:       c5 fb 5c c1             vsubsd xmm0,xmm0,xmm1
    a2a6:       48 83 c0 01             add    rax,0x1
    a2aa:       c5 fb 59 c0             vmulsd xmm0,xmm0,xmm0
    a2ae:       c5 eb 58 d0             vaddsd xmm2,xmm2,xmm0
    a2b2:       48 39 c2                cmp    rdx,rax
    a2b5:       75 e1                   jne    a298 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x18>
    a2b7:       c5 eb 10 c2             vmovsd xmm0,xmm2,xmm2
    a2bb:       c3                      ret
    a2bc:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    a2c0:       c5 e9 57 d2             vxorpd xmm2,xmm2,xmm2
    a2c4:       c5 eb 10 c2             vmovsd xmm0,xmm2,xmm2
    a2c8:       c3                      ret
    a2c9:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

The code is much shorter than the float version.
Although it includes scalar float→double conversions (vcvtss2sd) and computes one dimension at a time, it avoids the complex and costly 16-element horizontal folding.

In other words, even with the conversion overhead, the simpler scalar path was still faster than the float version with vector folding. The compiler likely chose the conservative scalar path here, avoiding vectorization.

(3) How to Improve Naive L2sq for Better Compiler Vectorization?

The reason for horizontal folding is likely that the compiler strictly follows IEEE 754 semantics, preserving the exact order of floating-point additions. This prevents the compiler from reordering additions into vectorized accumulations.

To relax this, we can explicitly allow reassociation:

static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) {
    float acc = 0.f;
    #pragma omp simd reduction(+:acc)
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        acc += d * d;
    }
    return (double)acc;
}

And compile with -fopenmp-simd to enable this directive.

Running again shows a significant improvement: compiler auto-vectorization now achieves performance close to manual SIMD implementations. Using -ffast-math also works.

3. Summary

SIMD significantly improves distance computation performance.
Hand-written SIMD implementations perform best.
For naive implementations, allowing reassociation (via #pragma omp simd reduction(+:acc) or appropriate subsets of -ffast-math) is the key to approaching hand-written SIMD performance. Under strict IEEE semantics, the compiler conservatively generates per-iteration folding, which creates slow paths where AVX-512 does not necessarily have an advantage.

Zhao Song’s Blog

InnoDB B+Tree Performance Optimization Proposal - Insert Path Improvements and Concurrent Split Handling

1. Bottlenecks in InnoDB B+Tree Inserts

2. High-Level Design

1. Introduce a new B-link-style descent path: blink_search_to_nth_level

2. Latching-order protocol

3. Full insert flow

3. InnoDB Low-Level Design

1. Add high keys to pages

2. Split SMO into multiple mtrs

3. Page allocation and the FSP locking problem

4. Crash Recovery

Dissecting the MySQL 8.0 Performance Regression on oltp_update_non_index

1. Setup

2. Narrowing the gap step by step

Step 1: innodb_log_writer_threads=OFF

Step 2: performance_schema=0

Step 3: skip-log-bin

Step 4: --db-ps-mode=auto (prepared statements)

Step 5: innodb_flush_log_at_trx_commit=0

Step 6: Five code patches

4. Going deeper: how much does each factor really contribute?

5. Summary

Appendix A: Patch details

Patch 1: cmp_data ALWAYS_INLINE + loop-invariant hoist

Patch 2: buf_flush_note_modification ALWAYS_INLINE

Patch 3: server_store_cached_values no-op

Patch 4: fold_condition fast path

Patch 5: info_low fast path

Appendix B: Environment details

Shared InnoDB configuration

Non-patchable 8.0 overhead (architectural)

MySQL vs PostgreSQL Internals (Part 2) — MVCC (Multi-version Concurrency Control)

The Role of MVCC

1. Organization of Multiple Versions

PostgreSQL

In PostgreSQL, a tuple and all its historical versions reside in the heap, as shown below:

MySQL

Summary

2. Visibility Checks for Multiple Versions

PostgreSQL

MySQL

Summary

3. Garbage Collection of Multiple Versions

PostgreSQL

MySQL

Summary

MySQL vs PostgreSQL Internals (Part 1) – Buffer Pool

Comparison Dimensions

1. Hash Table

MySQL

PostgreSQL

2. Eviction Policy for Old Pages (Aging) and Implementation

MySQL

First attempt (n_iterations == 0)

Second attempt (n_iterations == 1)

Third and subsequent attempts (n_iterations > 1)

PostgreSQL

3. Dirty Page Flushing Strategy and Implementation

MySQL

Coordinator

Worker

a. Based on dirty page percentage (get_pct_for_dirty())

b. Based on redo log age (get_pct_for_lsn(age))

Combined calculation (set_flush_target_by_lsn())

Sync Flush mode

Idle flushing

PostgreSQL

1. bgwriter

2. checkpointer

Visualizing MySQL BLOB Internals Directly from MySQL Data Files (.ibd)

How it works:

Step 1

Step 2

A POC on optimizing MySQL’s unique index insertion path

1. Setup:

2. Test procedure:

3. Results:

Original path (3 B-tree searches):

Optimized path (1 B-tree search + inline scan):

1. Introduce a new B-link-style descent path: `blink_search_to_nth_level`

Step 1: `innodb_log_writer_threads=OFF`

Step 2: `performance_schema=0`

Step 3: `skip-log-bin`

Step 4: `--db-ps-mode=auto` (prepared statements)

Step 5: `innodb_flush_log_at_trx_commit=0`

Patch 1: `cmp_data` `ALWAYS_INLINE` + loop-invariant hoist

Patch 2: `buf_flush_note_modification` `ALWAYS_INLINE`

Patch 3: `server_store_cached_values` no-op

Patch 4: `fold_condition` fast path

Patch 5: `info_low` fast path

a. Based on dirty page percentage (`get_pct_for_dirty()`)

b. Based on redo log age (`get_pct_for_lsn(age)`)

Combined calculation (`set_flush_target_by_lsn()`)