DoltHub Blog - Latest Posts

How TPC-C Works

James Cor — Thu, 14 May 2026 00:00:00 GMT

Dolt is a version-controlled database that works as a drop-in MySQL replacement. In addition to correctness parity with MySQL, we are also determined to reach performance parity with MySQL. For years now, we’ve been improving Dolt performance on both Sysbench and TPC-C benchmarks. Towards the end of 2025, Dolt reached parity with MySQL on Sysbench when averaging reads and writes. Shortly after, we reached MySQL parity on reads and writes individually. Now, we surpass MySQL on Sysbench reads and writes, with a mean multiplier of 0.95 on reads and 0.87 on writes.

While we are proud of our accomplishments on Sysbench, the TPC-C benchmarks are arguably more representative of the average user experience. This blog will go into detail about the TPC-C benchmarks and the kinds of queries they run against the database.

What is TPC-C#

TPC-C stands for Transaction Processing Performance Council Benchmark C. It is the industry standard benchmark used for OLTP databases. TPC-C simulates real-world usage of a database by modeling transactions for a wholesale supplier. Dolt actually uses a slightly modified version of the official TPC-C benchmarks from Percona Labs.

Settings#

We run TPC-C with these settings:

./tpcc.lua \
  --db-driver="mysql" \
  --mysql-db="sbtest" \
  --mysql-host="127.0.0.1" \
  --mysql-port="$PORT" \
  --mysql-user="$USER" \
  --mysql-password="$PASS" \
  --time=800 \
  --report_interval=10 \
  --threads=1 \
  --tables=1 \
  --scale=1 \
  --trx_level="RR" run

The benchmarks are run with a single thread for 800 seconds with autocommit and foreign_key_checks disabled. trx_level="RR" is REPEATABLE READ (the MySQL/InnoDB default), which means that SELECT results are isolated within a transaction: uncommitted writes to a table from another transaction will not be visible to this transaction. We compare the 95th percentile latency and number of transactions per second (tps) against MySQL.
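
To make the two reported metrics concrete, here is a minimal sketch of how a 95th percentile latency and a tps figure could be derived from raw per-transaction timings. The function names and sample numbers are made up for illustration, and the nearest-rank percentile used here is an assumption; the benchmark tool may compute its percentile differently.

```javascript
// Nearest-rank percentile over latency samples (milliseconds).
// Assumption: this is an illustrative definition, not necessarily the one
// the benchmark tool uses internally.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Transactions per second over the whole run.
function transactionsPerSecond(committedCount, elapsedSeconds) {
  return committedCount / elapsedSeconds;
}

// Hypothetical latencies for five transactions.
const latenciesMs = [12, 15, 11, 40, 13];
console.log(percentile(latenciesMs, 95));      // 40
console.log(transactionsPerSecond(8000, 800)); // 10
```

The p95 is sensitive to the slowest transactions, which is why it is a useful complement to throughput when comparing two databases.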

Tables#

Here is a brief summary of the tables created.

warehouse with 1 row
district with 10 rows
customer with 30000 rows
orders with 30000 rows
new_orders with 9000 rows
order_line with 299293 rows
item with 100000 rows
stock with 100000 rows
history with 30000 rows

Many of these tables have primary keys, secondary keys, and foreign keys (even though foreign_key_checks are disabled).

Transactions#

TPC-C randomly selects between 5 different transaction types with varying odds. Each transaction starts with a BEGIN and ends with COMMIT. Without spelling out every query run within a transaction, here is a high-level overview of each transaction.

trx_type	run_percent
`new_order`	43.48%
`payment`	43.48%
`order_status`	4.35%
`delivery`	4.35%
`stocklevel`	4.35%
TOTAL	≈100.00%
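
The percentages follow from the 10/23 and 1/23 fractions listed with each transaction type. As a rough sketch, a weighted random pick with those weights could look like the code below. The `weights` object, `runPercent`, and `pickTransaction` are illustrative names, and this is not how tpcc.lua itself is written.

```javascript
// Relative weights taken from the "Percent Run" fractions (10/23 and 1/23).
const weights = {
  new_order: 10,
  payment: 10,
  order_status: 1,
  delivery: 1,
  stocklevel: 1,
};

const totalWeight = Object.values(weights).reduce((sum, w) => sum + w, 0);

// Share of the mix for one transaction type, as a string with two decimals.
function runPercent(trxType) {
  return ((weights[trxType] / totalWeight) * 100).toFixed(2);
}

// Pick a transaction type; `rand` is a number in [0, 1), e.g. Math.random().
function pickTransaction(rand) {
  let threshold = rand * totalWeight;
  for (const [trxType, weight] of Object.entries(weights)) {
    if (threshold < weight) return trxType;
    threshold -= weight;
  }
  throw new Error("rand must be in [0, 1)");
}

console.log(runPercent("new_order")); // 43.48
console.log(runPercent("delivery"));  // 4.35
console.log(pickTransaction(Math.random()));
```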

1. `new_order`#

Description: This transaction simulates a customer placing an order for 5 to 15 items. The new order is logged in the orders, new_orders, and order_line tables, and the stock table is updated accordingly.

Percent Run: 43.48% (10/23)

# of SQL Statements: 25 - 65

Reads Tables: customer, district, item, stock, warehouse

Writes Tables: district, orders, order_line, new_orders, stock

2. `payment`#

Description: This transaction simulates a customer making a purchase. It reads the customer’s payment details (name, address, etc.) from the customer table, updates their account balance, and logs the transaction in the history table.

Percent Run: 43.48% (10/23)

# of SQL Statements: 6 - 10

Reads Tables: customer, district, warehouse

Writes Tables: customer, district, history, warehouse

3. `order_status`#

Description: This transaction simulates a customer inquiring about their order details. It performs a series of SELECT queries into the customer, orders and order_line tables.

Percent Run: 4.35% (1/23)

# of SQL Statements: 3 - 4

Reads Tables: customer, orders, order_line

Writes Tables: none

4. `delivery`#

Description: This transaction simulates a customer’s order getting delivered. An order is removed from the new_orders table and the appropriate entries are updated in orders, order_line and customer.

Percent Run: 4.35% (1/23)

# of SQL Statements: 1 - 6

Reads Tables: orders, order_line, new_orders

Writes Tables: customer, orders, order_line, new_orders

5. `stocklevel`#

Description: This transaction simulates a query over existing inventory. It runs a few SELECT queries that aggregate over the order_line and stock tables to count the quantity of certain items.

Percent Run: 4.35% (1/23)

# of SQL Statements: 3+

Reads Tables: district, orders, stock, order_line

Writes Tables: none

You can explore the TPC-C database in more detail here: https://www.dolthub.com/repositories/jcor/sbtest

Conclusion#

We have been focused on Dolt performance on TPC-C, which aims to simulate a real user experience with an OLTP database. Over the last few months we have made substantial improvements to TPC-C. Stay tuned for a future blog describing how we’ve broken the 2x MySQL multiplier on TPC-C. Have any performance issues? Cut a bug on our GitHub issues page. Want to talk to anyone on our team? Join our Discord.

Dolt 2.0

Tim Sehn — Mon, 11 May 2026 00:00:00 GMT

Three years ago, we announced Dolt 1.0, signalling that Dolt was ready for production workloads. We haven’t stopped improving the world’s first and only version-controlled SQL database. Today, we are excited to announce Dolt 2.0.

What Did Dolt 1.0 Mean?#

Dolt 1.0 meant four things:

Forward Storage Compatibility
Production Performance
MySQL Compatibility
Stable Version Control Interface

Dolt 2.0 maintains the promises of Dolt 1.0. Dolt 2.0 improves on the performance and correctness metrics established in Dolt 1.0.

What Does Dolt 2.0 Mean?#

Dolt 2.0 means five things:

Automated Garbage Collection on by Default
Archive Compression on by Default
Faster than MySQL on sysbench
Beta Vector Support
Adaptive Storage

Unlike Dolt 1.0, Dolt 2.0 is fully backwards compatible with all Dolt 1.0 versions. No storage migration using dolt migrate is required. Let’s dive into the details of each of these points.

Garbage Collection#

Dolt makes a lot of disk garbage, especially during import. Dolt is copy-on-write so all intermediate committed transaction state is preserved to disk. Any intermediate state that is not in a Dolt commit is garbage and can be collected.

Dolt already must preserve all history in the commit graph on disk. Adding extra garbage can eat through your disk very quickly.
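
To illustrate the idea that anything not reachable from a commit is garbage, here is a toy reachability sweep. It is only a sketch of the general mark-and-sweep technique, not Dolt's collector: the `chunkRefs` map, the chunk names, and both functions are invented for this example.

```javascript
// Toy chunk store: each chunk id maps to the ids of the chunks it references.
const chunkRefs = new Map([
  ["commit-2", ["commit-1", "table-v2"]],
  ["commit-1", ["table-v1"]],
  ["table-v2", ["shared-leaf", "leaf-b2"]],
  ["table-v1", ["shared-leaf", "leaf-b1"]],
  ["shared-leaf", []],
  ["leaf-b1", []],
  ["leaf-b2", []],
  ["uncommitted-tmp", ["tmp-leaf"]], // intermediate state never committed
  ["tmp-leaf", []],
]);

// Mark phase: walk references starting from the root commits.
function reachableFrom(roots, refs) {
  const seen = new Set();
  const stack = [...roots];
  while (stack.length > 0) {
    const id = stack.pop();
    if (seen.has(id)) continue;
    seen.add(id);
    for (const child of refs.get(id) ?? []) stack.push(child);
  }
  return seen;
}

// Everything that was not marked is garbage.
function findGarbage(roots, refs) {
  const live = reachableFrom(roots, refs);
  return [...refs.keys()].filter((id) => !live.has(id));
}

console.log(findGarbage(["commit-2"], chunkRefs)); // [ 'uncommitted-tmp', 'tmp-leaf' ]
```

Note that `shared-leaf` is referenced by both table versions but stored once, which is why history costs less disk than a full copy per commit.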

Dolt 2.0 has automatic garbage collection on by default, meaning most users don’t have to care about disk garbage. Many users have been running in this mode for over a year. We’re confident it is stable.

Dolt 2.0 databases do not require extra garbage maintenance, just like other modern SQL engines.

Archives#

Following on the disk space theme, we also have a new on disk format we call archives that can reduce Dolt’s storage footprint by an additional 30-50%. Archives use dictionary compression to de-duplicate storage in the deepest layers of Dolt, saving even more disk space.

As with automatic garbage collection, archives have been the default format for new Dolt databases for months. We’re confident the format is stable and delivers real disk space wins.

Dolt 2.0 databases are kind to your disk with automatic garbage collection and archives. Version control already requires more disk space than traditional databases. Dolt 2.0 preserves that disk for your data’s history.

Faster than MySQL on `sysbench`#

We’ve long used the industry standard sysbench to measure and benchmark the latency of simple SQL queries in Dolt. We started at about 10X slower on reads and 20X slower on writes than MySQL. We’ve worked tirelessly to improve Dolt’s performance and we are now 13% faster than MySQL on writes and 5% faster on reads, averaging out to 8% faster than MySQL on sysbench style workloads.
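
The "percent faster" figures can be related to a latency multiplier with a little arithmetic. The sketch below assumes the multiplier is Dolt's latency divided by MySQL's latency, so values below 1 mean Dolt is faster; both function names are illustrative.

```javascript
// Assumption: multiplier = Dolt latency / MySQL latency for the same test.
function latencyMultiplier(doltMs, mysqlMs) {
  return doltMs / mysqlMs;
}

// Convert a multiplier into a whole-number "percent faster" figure.
// A negative result means slower.
function percentFaster(multiplier) {
  return Math.round((1 - multiplier) * 100);
}

console.log(percentFaster(0.87)); // 13
console.log(percentFaster(0.95)); // 5
console.log(percentFaster(latencyMultiplier(20, 10))); // -100
```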

Dolt 2.0 databases deliver real production database performance coupled with version control functionality.

Beta Vector Support#

We announced vector index support early last year. We have a much bigger challenge than traditional databases with vector indexes because our vector indexes must be version-controlled. We’ve done the hard computer science to achieve this. We adopted the Vector type from MariaDB in September 2025.

Dolt 2.0 databases have Beta vector support. Dolt is the only database where your vectors are version-controlled. We still have some edge cases on the read query path where a vector index should be used but is not. Closing these gaps will remove the Beta tag from Dolt’s vector support.

Adaptive storage for large column types#

Borrowing from our Doltgres adaptive storage work to support TOAST types, we’re excited to announce Dolt 2.0 has adaptive storage.

For large column types like TEXT, BLOB, and JSON, databases generally store the value “out of band”, as a file on disk with a pointer to the file in the actual table structure. A different strategy, popularized by Postgres, is to examine the size of the value and store small values in the table structure while preserving the files and pointers strategy for large values. This strategy allows the user to be less disciplined about sizing VARCHAR columns and just use TEXT instead. It’s also a big performance win for these types when the values are small.
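
The size-based decision described above can be sketched in a few lines. Everything here is hypothetical: the 2048-byte cutoff, the `outOfBandStore` map standing in for separately stored values, and the function names are illustrative only and say nothing about the real thresholds or layout used by Dolt or Postgres.

```javascript
// Hypothetical cutoff; the real threshold is not stated in this post.
const INLINE_LIMIT_BYTES = 2048;

// Stands in for values stored outside the table structure.
const outOfBandStore = new Map();
let nextPointer = 1;

// Decide where a value lives based on its encoded size.
function storeValue(text) {
  const bytes = new TextEncoder().encode(text).length;
  if (bytes <= INLINE_LIMIT_BYTES) {
    return { kind: "inline", value: text };
  }
  const pointer = `blob-${nextPointer++}`;
  outOfBandStore.set(pointer, text);
  return { kind: "pointer", pointer };
}

// Reading an inline value needs no extra lookup; a pointer needs one.
function loadValue(cell) {
  return cell.kind === "inline" ? cell.value : outOfBandStore.get(cell.pointer);
}

console.log(storeValue("short note").kind);      // inline
console.log(storeValue("x".repeat(5000)).kind);  // pointer
```

The performance win for small values comes from the inline branch: the value is read along with the row, with no second fetch.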

Dolt 2.0 has adaptive storage, making Dolt a good fit for MySQL databases that use TEXT, BLOB, GEOMETRY, or JSON columns, regardless of whether they need version control or not.

Conclusion#

Dolt 2.0 is here. It’s kinder to your disk and it’s fast. Questions? Stop by our Discord and just ask.

Announcing DumboDB: A MongoDB Clone Built on Dolt

Neil Macneale — Thu, 07 May 2026 00:00:00 GMT

Today, I’m pleased to announce the release of DumboDB 0.1, a MongoDB clone built on top of Dolt’s storage system.

MongoDB and Git had a baby, and it’s named DumboDB.

TL;DR;#

Grab the code and give it a try! We are excited to see what the community can do with it. You can find the code on GitHub: https://github.com/dolthub/dumbodb. See the README for installation instructions. If you have feedback, questions, or want to contribute, join us on Discord!

How Did We Get Here?#

Like everyone else in the software industry, we have been kicking around the AI tools that are getting so much hype these days. No joke, my boss is vibe coding now. What a stereotype, right? A couple of weeks ago I talked about testing Gas Town, a coding agent orchestrator, only to get swept up in the excitement of building something new insanely fast. Turns out that a dozen agents can make a really compelling proof of concept in a couple of weeks. Ultimately, we decided to turn that proof of concept into a real product, and here we are.

Since then, the pace of change has slowed to allow for human-speed verification and testing. The firehose of code generation provided by Gas Town hasn’t been necessary, and it’s mostly been slower 1-on-1 coding with Claude Code. I’ve even read some of the code and directed Claude to clean up some nonsense which we are all used to seeing with coding agents at this point. Nevertheless, the 6 weeks of development to get to a viable 0.1 release has been far, far faster than I could have ever done on my own. Say what you want about the AI bubble, coding agents are going to help us write a mountain of code.

DumboDB’s DNA#

DumboDB started with the FerretDB code base, which is an open-source MongoDB clone. FerretDB operates as a proxy that translates MongoDB queries into SQL queries that are executed against a PostgreSQL database. It’s written in Go.

Dolt is written in Go as well, so we ripped out the proxy approach of FerretDB and made a standalone server that uses Dolt’s storage engine.

Dolt’s storage engine is a Prolly Tree, which is a data structure that allows for structural sharing of data. Using Merkle Trees and DAGs, Dolt can represent a very granular set of snapshots of your data. This is the same approach used by Git to model your source code history.

Our Prolly Tree implementation has been getting our love and attention for more than a decade now. It is what enables our primary product, Dolt, to be a version-controlled database. And when we say version-controlled, we mean it. You can branch, merge, diff, send pull requests, rebase, and so on - just like you would with Git. Dolt is a drop-in replacement for MySQL and is as fast as MySQL for many workloads. It gives you all the safety of Git for your data. Honestly, most software engineers hear about what we’ve built in Dolt, and they are astonished because it seems impossible. It’s real! Check it out if you haven’t already!

DumboDB is using the same storage engine as Dolt, but with a different access layer. Instead of using SQL, DumboDB uses the MongoDB query language. This means that you can use all the same tools and libraries that you would use with MongoDB, but with the added benefits of Dolt’s storage engine. To be crystal clear, there is no SQL layer in DumboDB. DumboDB uses the storage objects of Dolt. Therefore, it uses the same indexes, journal code, and commit model — but no SQL. It’s a NoSQL document database — not a facade on top of a SQL database.

What Can You Do With DumboDB?#

DumboDB is alpha-quality software, so what you should not do is run it in production.

That said, it should be able to do most basic MongoDB operations. If you are familiar with MongoDB, you should be able to pick up DumboDB pretty quickly. You can use the same drivers and libraries that you would use with MongoDB, so probably the best thing to do is just kick the tires and tell us when something doesn’t work. We are sure there are bugs, and we want to know about them!

There are some glaring gaps. DumboDB has no concept of user accounts or permissions yet. There are no isolated sessions or transactions with rollbacks (though you can reset --hard!). Text search and geo features aren’t implemented. We don’t have a replication or sharding story yet, and we may just stick to the Git model of clone/push/pull. We’ll see. It all depends on what our users ask for. The joy of open source is that we aren’t hiding behind a proprietary roadmap. We’ll build what makes sense, and take PRs for all the rest.

Being a drop-in replacement for MongoDB is the eventual goal, but we aren’t there yet. If you are bold, you can try running your existing MongoDB workloads against DumboDB and see what happens. Start your server with the --auto-commit flag and witness how your application changes your data over time.

Version Control Features#

All the basic operations you would expect from a Git-inspired product are available. This includes: commit, branch, merge, cherry-pick, rebase, log, status, diff, reset, revert, and tag.

Read the documentation for each command here.

There is no “staging” concept in DumboDB, which is a departure from how Git works. Instead, you just make your changes to the database, and then when you are ready to commit, you commit and whatever content is in your workspace will be committed. No need to “add” changes, like you would in Git.

There are two additional commands for working with merge conflicts: conflicts and resolveConflict. Unlike Git, but similar to Dolt, merge conflicts are structured. When merge conflicts arise, the three-way merge details are available to your application so that it can resolve the conflicts reliably. See examples below.

Note the lack of checkout. There is no checkout in DumboDB. Instead, you get a database instance using the getSiblingDB operation. Check out the examples…

Examples#

You’ll need to grab a prebuilt binary or build from source to run these examples. See the README for instructions.

Run the server in its own terminal window:

dumbodb --data-dir /tmp/dumbodb-data

Then in another terminal window, connect to the server using the MongoDB shell:

$ mongosh mongodb://localhost:27017/
...
------
   The server generated these startup warnings when booting
   2026-05-06T21:36:58.547Z: Powered by DumboDB v0.1.0
   2026-05-06T21:36:58.547Z: Star Us! https://github.com/dolthub/dumbodb
------
test>

mongosh is a JavaScript shell, so you execute JavaScript code to interact with the database. The test> prompt indicates that you are connected to the test database. To change databases, set the db variable to a different database. For example, to switch to the mydb database, you would run:

test> db = db.getSiblingDB("mydb")
mydb
mydb>

See how the prompt changed to mydb>? That indicates that we are now using the mydb database. Now let’s use it!

Create a new collection, and insert some documents:#

The MongoDB approach is to create collections implicitly when you first insert a document into them. So there is no createCollection command. Instead, you just start inserting documents into a collection, and it will be created for you. Let’s insert two documents into a new collection called customers:

mydb> db.customers.insertOne({ name: "Alice", phone: "555-1234" })
{
  acknowledged: true,
  insertedId: ObjectId('69fbcb932a4aeea4bc4de9b5')
}
mydb> db.customers.insertOne({ name: "Bob", phone: "555-5678" })
{
  acknowledged: true,
  insertedId: ObjectId('69fbbb46a26d8df3b3ab5cf6')
}
mydb>

That’s all vanilla Mongo behavior. By inserting these two documents, we have made changes to the database, but those changes are not committed yet. It’s like editing a source file in Git: you need to commit it. It works exactly the same with DumboDB. Make as many changes as you need to, then commit when ready.

DumboDB’s version control operations are executed with the db.runCommand method. We can see the status of our database with the dumboStatus command:

mydb> db.runCommand({ dumboStatus: 1 })
{
  branch: 'main',                   // `db` is on the main branch
  dirty: true,                      // there are uncommitted changes
  readonly: false,                  // we can write to this database, as demonstrated by our inserts
  collections: [
    {
      name: 'customers',
      status: 'added',              // the customers collection is new, so its status is "added"
      added: 2,                     // we added 2 documents to the customers collection
      modified: 0,
      deleted: 0
    }
  ],
  ok: 1                             // the command was successful. Standard MongoDB.
}

Now we can commit our changes to the database. The commit will create a new snapshot of the database, and the dirty flag will return to false.

mydb> db.runCommand({ dumboCommit: 1,
                      message: "Add customers collection with Alice and Bob",
                      author: "neil <neil@dolthub.com>" })
{
  commitId: '6nc94olva9m81ofdjnnhp3018100qs5f',
  branch: 'main',
  message: 'Add customers collection with Alice and Bob',
  author: 'neil <neil@dolthub.com>',
  timestamp: ISODate('2026-05-06T22:12:28.240Z'),
  committer: 'neil <neil@dolthub.com>',
  committerTimestamp: ISODate('2026-05-06T22:12:28.240Z'),
  ok: 1
}
mydb> db.runCommand({ dumboStatus: 1 })
{
  branch: 'main',
  dirty: false,
  readonly: false,
  commitId: '6nc94olva9m81ofdjnnhp3018100qs5f', // commitId is shown whenever dirty is false.
  collections: [],
  ok: 1
}

Branch, Merge, and Resolve Conflicts#

Let’s create a new branch, called feature, and change the phone number for Alice:

mydb> db.runCommand({ dumboBranch: 1, branch: "feature" })
{ branch: 'feature', ok: 1 }
// Specify the branch or revision number with <db>@<branchOrRevision>
mydb> var feature = db.getSiblingDB("mydb@feature")
// `feature` variable is a database instance that is now "pointing" to the `feature` branch.
// We can run commands against it, just like we do with `db`.
mydb> feature.runCommand({ dumboStatus: 1})
{
  branch: 'feature',
  dirty: false,
  readonly: false,
  commitId: '6nc94olva9m81ofdjnnhp3018100qs5f',
  collections: [],
  ok: 1
}
// Update Alice's phone number in the feature branch
mydb> feature.customers.updateOne({ name: "Alice" }, { $set: { phone: "555-4321" } })
{
  acknowledged: true,
  insertedId: null,
  matchedCount: 1,
  modifiedCount: 1,   // Indicates that one document was modified.
  upsertedCount: 0
}
mydb> feature.runCommand({ dumboStatus: 1})
{
  branch: 'feature',
  dirty: true,
  readonly: false,
  collections: [
    {
      name: 'customers',
      status: 'modified',  // The customers collection is modified, because we changed Alice's phone number.
      added: 0,
      modified: 1,         // Alice's document was modified.
      deleted: 0
    }
  ],
  ok: 1
}

dumboStatus gives a high-level summary of what has changed in our working copy of the database, but if we want to see the actual changes, we can use the dumboDiff command:

mydb> feature.runCommand({ dumboDiff: 1})
{
  collections: [
    {
      name: 'customers',
      status: 'modified',
      added: [],
      removed: [],
      modified: [
        {
          _id: ObjectId('69fbcb932a4aeea4bc4de9b5'),
          diff: [
            // the "phone" field was changed from "555-1234" to "555-4321"
            { type: 'modified', path: '$.phone', from: '555-1234', to: '555-4321' }
          ]
        }
      ]
    }
  ],
  ok: 1
}
// dumboCommit will always commit the content shown in the output of dumboDiff without arguments.
// Run dumboCommit now, it will commit the change to Alice's phone number.
mydb> feature.runCommand({ dumboCommit: 1, message: "Update Alice's phone number", author: "neil <neil@dolthub.com>" })
{
  commitId: 'p7p99vndl9d11hjloivlu1lcuvt3n9qa',
  branch: 'feature',
  message: "Update Alice's phone number",
  author: 'neil <neil@dolthub.com>',
  timestamp: ISODate('2026-05-06T23:27:52.848Z'),
  committer: 'neil <neil@dolthub.com>',
  committerTimestamp: ISODate('2026-05-06T23:27:52.848Z'),
  ok: 1
}

Now let’s change the same field, but on the main branch. Note that the db instance was created on the default branch, which is main, so when we run commands against db, we are running them against the main branch. We’ll perform a ‘find’ to demonstrate:

mydb> db.customers.find({name: 'Alice'})
[
  {
    _id: ObjectId('69fbcb932a4aeea4bc4de9b5'),
    name: 'Alice',
    phone: '555-1234' // Still unchanged on the main branch.
  }
]
mydb> db.customers.updateOne({ name: "Alice" }, { $set: { phone: "555-9999" } })
{
  acknowledged: true,
  insertedId: null,
  matchedCount: 1,
  modifiedCount: 1,
  upsertedCount: 0
}
mydb> db.runCommand({dumboCommit:1, message: "update Alice on main", author: "neil <neil@dolthub.com>"})
{
  commitId: '22gvmftrbf995mn924hl0ounoo4quhqh',
  branch: 'main',
  message: 'update Alice on main',
  author: 'neil <neil@dolthub.com>',
  timestamp: ISODate('2026-05-06T23:28:51.139Z'),
  committer: 'neil <neil@dolthub.com>',
  committerTimestamp: ISODate('2026-05-06T23:28:51.139Z'),
  ok: 1
}

To recap: We’ve updated Alice’s phone number to “555-4321” on the feature branch, and “555-9999” on the main branch. We can see the differences between the two branches with dumboDiff:

mydb> db.runCommand({ dumboDiff: 1, from: "main", to: "feature"})
{
  collections: [
    {
      name: 'customers',
      status: 'modified',
      added: [],
      removed: [],
      modified: [
        {
          _id: ObjectId('69fbcb932a4aeea4bc4de9b5'),
          diff: [
            { type: 'modified', path: '$.phone', from: '555-9999', to: '555-4321' }
          ]
        }
      ]
    }
  ],
  ok: 1
}

That is just a flat diff between the branches, but we know that they have a common ancestor with a third value for Alice’s phone number. When we attempt to merge the feature branch into main, we will get a merge conflict, because the same field was modified in both branches. DumboDB will give us the details of the merge conflict, so that we can resolve it:

mydb> db.runCommand({dumboMerge: 1,
                     merge_in: "feature",
                     message: "merge in feature branch",
                     author: "neil <neil@dolthub.com>"})
MongoServerError: dumboMerge: unresolved conflicts in 1 collection(s)

We got the merge conflict, as expected. To make sense of what happened, we can run the dumboConflicts command:

mydb> db.runCommand({dumboConflicts: 1})
{
  collections: [
    {
      collection: 'customers',
      conflicts: [
        {
          conflictId: '1I7p9HPnc+ff2JXl+sLqgw',          // Unique identifier for this specific conflict.
          _id: ObjectId('69fbcb932a4aeea4bc4de9b5'),
          base: { name: 'Alice', phone: '555-1234' },    // Original value.
          ours: { name: 'Alice', phone: '555-9999' },    // Value on the main branch (ours)
          theirs: { name: 'Alice', phone: '555-4321' },  // Value on the feature branch (theirs)
          ourDiffType: 'modified',
          theirDiffType: 'modified'
        }
      ]
    }
  ],
  ok: 1
}
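
The base, ours, and theirs values in this output are the inputs to a three-way merge. As a rough illustration of why this edit conflicts while unrelated edits would not, here is a field-level three-way merge over flat documents. It is a sketch of the general technique, not DumboDB's merge implementation, and `threeWayMerge` is a made-up helper.

```javascript
// Field-level three-way merge of flat documents.
// A field conflicts only when both sides changed it to different values.
function threeWayMerge(base, ours, theirs) {
  const merged = {};
  const conflicts = [];
  const fields = new Set([
    ...Object.keys(base),
    ...Object.keys(ours),
    ...Object.keys(theirs),
  ]);
  for (const field of fields) {
    const b = base[field];
    const o = ours[field];
    const t = theirs[field];
    let value;
    if (o === t) value = o;        // both sides agree
    else if (o === b) value = t;   // only theirs changed
    else if (t === b) value = o;   // only ours changed
    else {
      conflicts.push({ field, base: b, ours: o, theirs: t });
      continue;
    }
    if (value !== undefined) merged[field] = value; // undefined means deleted
  }
  return { merged, conflicts };
}

const result = threeWayMerge(
  { name: "Alice", phone: "555-1234" },
  { name: "Alice", phone: "555-9999" },
  { name: "Alice", phone: "555-4321" }
);
console.log(result.conflicts.length); // 1
console.log(result.conflicts[0].field); // phone
```

If only one branch had touched the phone field, the same function would return an empty `conflicts` array and take that branch's value.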

Now we can resolve the conflict with the dumboResolveConflict command. We have three options to resolve the conflict: we can choose either “ours” or “theirs”, or we can provide a custom resolution. For this example, we’ll choose a custom resolution, where we set Alice’s phone number to “555-0000”:

mydb> db.runCommand({dumboResolveConflict: 1,
                     collection: "customers",
                     conflictId: "1I7p9HPnc+ff2JXl+sLqgw",    // Unique identifier from above. required.
                     resolution: "custom",
                     value: { name: 'Alice', phone: '555-0000' }})
{ ok: 1 }
// dumboStatus prints merge state:
mydb> db.runCommand({dumboStatus: 1})
{
  branch: 'main',
  dirty: true,
  readonly: false,
  collections: [
    {
      name: 'customers',
      status: 'modified',
      added: 0,
      modified: 1,
      deleted: 0
    }
  ],
  mergeState: 'merge',  // Currently in the middle of a merge.
  conflicts: [],        // No more conflicts, because we resolved the only conflict.
  ok: 1
}
// Complete the merge with the `continue` flag on the dumboMerge command.
mydb> db.runCommand({dumboMerge: 1,
                     continue: true,
                     message: "merge in feature branch",
                     author: "neil <neil@dolthub.com>"})
{
  commitId: '7iraa088995v304n61egkm45dcge2a78',
  message: 'merge in feature branch',
  author: 'neil <neil@dolthub.com>',
  timestamp: ISODate('2026-05-06T23:59:11.995Z'),
  committer: 'neil <neil@dolthub.com>',
  committerTimestamp: ISODate('2026-05-06T23:59:11.995Z'),
  ok: 1
}
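
Because the conflict details are structured, an application can resolve them in code rather than by hand. The sketch below turns a dumboConflicts result into dumboResolveConflict command documents using the "custom" resolution shown above. `buildResolveCommands` and `preferTheirPhone` are hypothetical helpers, and the field names are assumed to match the output shown in this post.

```javascript
// Build one dumboResolveConflict command per conflict.
// chooseValue(conflict) returns the document to keep.
function buildResolveCommands(conflictsResult, chooseValue) {
  const commands = [];
  for (const { collection, conflicts } of conflictsResult.collections) {
    for (const conflict of conflicts) {
      commands.push({
        dumboResolveConflict: 1, // command name goes first
        collection,
        conflictId: conflict.conflictId,
        resolution: "custom",
        value: chooseValue(conflict),
      });
    }
  }
  return commands;
}

// Example policy: keep our document but take their phone number.
const preferTheirPhone = (c) => ({ ...c.ours, phone: c.theirs.phone });

// Sample input copied from the dumboConflicts output above.
const sampleConflicts = {
  collections: [
    {
      collection: "customers",
      conflicts: [
        {
          conflictId: "1I7p9HPnc+ff2JXl+sLqgw",
          base: { name: "Alice", phone: "555-1234" },
          ours: { name: "Alice", phone: "555-9999" },
          theirs: { name: "Alice", phone: "555-4321" },
        },
      ],
    },
  ],
  ok: 1,
};

console.log(buildResolveCommands(sampleConflicts, preferTheirPhone));
```

In mongosh, against a live server, each returned document could be passed to db.runCommand before continuing the merge.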

Finally, you can use the dumboLog command to see the history of commits on the main branch, including the merge commit we just made:

mydb> db.runCommand({dumboLog: 1})
{
  commits: [
    {
      commitId: '7iraa088995v304n61egkm45dcge2a78',
      refs: [ 'HEAD', 'main' ],                               // refs tell you which branches or tags are pointing to this commit.
      parent1: '22gvmftrbf995mn924hl0ounoo4quhqh',
      parent2: 'p7p99vndl9d11hjloivlu1lcuvt3n9qa',            // Second parent because this is a merge commit!
      message: 'merge in feature branch',
      timestamp: ISODate('2026-05-06T23:59:11.980Z'),
      author: 'neil <neil@dolthub.com>',
      committer: 'neil <neil@dolthub.com>',
      committerTimestamp: ISODate('2026-05-06T23:59:11.980Z')
    },
    {
      commitId: '22gvmftrbf995mn924hl0ounoo4quhqh',
      parent1: '9t4t592jqddgj1elcfnaal7c8o6boj26',
      message: 'update Alice on main',
      timestamp: ISODate('2026-05-06T23:28:51.139Z'),
      author: 'neil <neil@dolthub.com>',
      committer: 'neil <neil@dolthub.com>',
      committerTimestamp: ISODate('2026-05-06T23:28:51.139Z')
    },
    {
      commitId: 'p7p99vndl9d11hjloivlu1lcuvt3n9qa',
      refs: [ 'feature' ],
      parent1: '9t4t592jqddgj1elcfnaal7c8o6boj26',
      message: "Update Alice's phone number",
      timestamp: ISODate('2026-05-06T23:27:52.848Z'),
      author: 'neil <neil@dolthub.com>',
      committer: 'neil <neil@dolthub.com>',
      committerTimestamp: ISODate('2026-05-06T23:27:52.848Z')
    },
    {
      commitId: '9t4t592jqddgj1elcfnaal7c8o6boj26',
      parent1: '6h1nv5qcnkesp3ahg5vn7shp6airkkn7',
      message: 'Add customers collection with Alice and Bob',
      timestamp: ISODate('2026-05-06T23:16:04.348Z'),
      author: 'neil <neil@dolthub.com>',
      committer: 'neil <neil@dolthub.com>',
      committerTimestamp: ISODate('2026-05-06T23:16:04.348Z')
    },
    {
      commitId: '6h1nv5qcnkesp3ahg5vn7shp6airkkn7',
      message: 'Initialize database',
      timestamp: ISODate('2026-05-06T23:15:31.054Z'),
      author: 'dumbodb <dumbodb@dumbodb>',
      committer: 'dumbodb <dumbodb@dumbodb>',
      committerTimestamp: ISODate('2026-05-06T23:15:31.054Z')
    }
  ],
  ok: 1
}

The dumboLog command is not very flexible yet, but you can see a summary of the documents changed with the stat flag, and the full diff of each with the patch flag. Both are inspired by the git log command.

To read the specifics of each command, see the documentation!

Roadmap#

There are plenty of ways we can expand this product’s features, and we are just at the beginning. Here is our current rough roadmap:

v0.2: Garbage Collection and zstd compression. Reduce the footprint of your database. Simplified configuration for user details (name and email) so you don’t have to specify them on every commit.
v0.3: Add Clone, Push, and Pull support. This will allow you to sync your DumboDB repositories with remote servers, and collaborate with others.
v0.4: Add support for Replication (as a secondary backup to your existing MongoDB instance).
v0.5: Isolated Session and Transaction support.
v0.6: Add Authentication and Authorization support.
v0.7: Add MCP server support, allowing you to use agents to work with the DumboDB database.
v0.8: Visualization and operations via a custom Workbench UI.
v1.0: General availability release, with a focus on stability, performance, and usability improvements.

No dates are assigned on any of this, mainly because we will invest our engineering resources in balance with the other DoltHub products. We’ll go faster if you bang on our doors!

Call to Action!#

Here at DoltHub, we strongly believe in the power of version-controlled databases. Users have told us for a long time that a NoSQL option would appeal to them, and now we have one! Real-world usage and feedback is the surest way to make sure we are building the right features. For that, we require your help. If you are interested in this space (you must be because you read this far), try DumboDB out and tell us what needs to be improved, extended, thrown out, etc. We want to hear from you!

Want to impact the future of DumboDB? Join us on Discord!

Announcing Azure Private Link Support for Hosted Dolt

Brian Hendriks — Wed, 06 May 2026 00:00:00 GMT

Hosted Dolt is a fully managed Dolt deployment available on AWS, GCP, and, most recently, Azure. The initial offering of Hosted Dolt on Azure only supported publicly accessible deployments. Today we are excited to announce the same level of private networking support for Azure that we already have for AWS and GCP.

What is Azure Private Link?#

Azure Private Link is a service that allows you to access VNets in other Azure subscriptions and tenants securely through Azure’s network. This means that you can create a Hosted Dolt deployment on Azure that is only accessible to your private Azure infrastructure, and not accessible over the public internet.

Creating a Deployment with Azure Private Link#

The first thing you will need is the subscription ID of the Azure subscription that you want to use for your private network (you can find this in the Azure portal). Once you have the subscription ID, create a new Hosted Dolt deployment on Azure and enable the “Private deployment” option on the “Advanced” tab of the deployment creation form. Then, in the “Allowed subscription IDs” box, enter the subscription ID(s) that you want to allow access to your deployment.

Once you have that filled out, click “Next” and review your deployment choices. Finally, click “Create Deployment” and your instance will be up and running in minutes. Once your deployment is running, you will see the connections tab populated.

The “Azure Private Link Networking” section of the connections tab shows the information you need to connect your private Azure infrastructure to your Hosted Dolt deployment.

Connecting to your Private Deployment#

Using the information from the “Azure Private Link Networking” section of the connections tab, you can connect your private Azure infrastructure through the web portal, the Azure CLI, or Terraform. I won’t go through the web portal, but I will cover the Azure CLI and Terraform options.

Connecting your Infrastructure with the Azure CLI#

In order to connect your infrastructure you will need the region, resource group, virtual network, and subnet of the private network that you want to connect to your deployment.

RESOURCE_GROUP="my-resource-group"
REGION="eastus2"
VNET="my-vnet"
SUBNET="my-subnet"

Then take the information from the “Azure Private Link Networking” section of the connections tab:

PRIVATE_LINK_SERVICE_ID="/subscriptions/01234567-89ab-cdef-fedc-ba9876543210/resourceGroups/networking-dev/providers/Microsoft.Network/privateLinkServices/pls-01234567-89ab-cdef-fedc-ba9876543210"
ENDPOINT_NAME="test-az-priv"
URL="test-az-priv.pls.dbs.hosted.doltdb.com"

With that information, you need to create an Azure “Private Endpoint” and then set up DNS resolution for the private endpoint so that you can connect to your deployment using the URL provided in the “Azure Private Link Networking” section of the connections tab.

# Create primary
az network private-endpoint create \
    --resource-group "$RESOURCE_GROUP" \
    --name "dolt-${ENDPOINT_NAME}-private-endpoint" \
    --location "$REGION" \
    --vnet-name "$VNET" \
    --subnet "$SUBNET" \
    --private-connection-resource-id "$PRIVATE_LINK_SERVICE_ID" \
    --connection-name "dolt-${ENDPOINT_NAME}-connection"
    
# Retrieve private IP from the endpoint NIC
PE_NIC_ID=$(az network private-endpoint show \
    --resource-group "$RESOURCE_GROUP" \
    --name "dolt-${ENDPOINT_NAME}-private-endpoint" \
    --query "networkInterfaces[0].id" -o tsv)
PE_IP=$(az network nic show --ids "$PE_NIC_ID" \
    --query "ipConfigurations[0].privateIPAddress" -o tsv)
    
# Create DNS zone and VNet link
az network private-dns zone create \
    --resource-group "$RESOURCE_GROUP" \
    --name "pls.dbs.hosted.doltdb.com"
    
az network private-dns link vnet create \
    --resource-group "$RESOURCE_GROUP" \
    --zone-name "pls.dbs.hosted.doltdb.com" \
    --name "dolt-${ENDPOINT_NAME}-vnet-link" \
    --virtual-network "$VNET" \
    --registration-enabled false
    
# Register primary DNS
az network private-dns record-set a add-record \
    --resource-group "$RESOURCE_GROUP" \
    --zone-name "pls.dbs.hosted.doltdb.com" \
    --record-set-name "$ENDPOINT_NAME" \
    --ipv4-address "$PE_IP"

Now you should be able to connect to your deployment from your instances within the given VNet using the provided URL. If you SSH onto one of your instances with the MySQL client installed, you can connect to your deployment using the MySQL command provided on the “Connections” tab of your deployment.

The process for connecting a read endpoint is very similar if you are using replication.

READ_PRIVATE_LINK_SERVICE_ID="/subscriptions/01234567-89ab-cdef-fedc-ba9876543210/resourceGroups/networking-dev/providers/Microsoft.Network/privateLinkServices/pls-read-01234567-89ab-cdef-fedc-ba9876543210"
READ_ENDPOINT_NAME="read-test-az-priv"
READ_URL="read-test-az-priv.pls.dbs.hosted.doltdb.com"

az network private-endpoint create \
    --resource-group "$RESOURCE_GROUP" \
    --name "dolt-${READ_ENDPOINT_NAME}-private-endpoint" \
    --location "$REGION" \
    --vnet-name "$VNET" \
    --subnet "$SUBNET" \
    --private-connection-resource-id "$READ_PRIVATE_LINK_SERVICE_ID" \
    --connection-name "dolt-${READ_ENDPOINT_NAME}-connection"

READ_PE_NIC_ID=$(az network private-endpoint show \
    --resource-group "$RESOURCE_GROUP" \
    --name "dolt-${READ_ENDPOINT_NAME}-private-endpoint" \
    --query "networkInterfaces[0].id" -o tsv)
READ_PE_IP=$(az network nic show --ids "$READ_PE_NIC_ID" \
    --query "ipConfigurations[0].privateIPAddress" -o tsv)

az network private-dns record-set a add-record \
    --resource-group "$RESOURCE_GROUP" \
    --zone-name "pls.dbs.hosted.doltdb.com" \
    --record-set-name "${READ_ENDPOINT_NAME}" \
    --ipv4-address "$READ_PE_IP"

Terraform#

The same thing can be accomplished with Terraform. Here is an example Terraform configuration that will create a private endpoint and set up DNS resolution for that endpoint. The “Variables” section of the configuration should be filled out with the appropriate values for your deployment and private network. Many of these values may come from the output of other Terraform configurations that manage your Azure infrastructure, or you could just fill them out manually.

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# ---------------------------------------------------------------------------
# Variables
# ---------------------------------------------------------------------------

variable "resource_group" {
  type        = string
  description = "Azure resource group to create the private endpoints in"
}

variable "region" {
  type        = string
  description = "Azure region (e.g. eastus)"
}

variable "vnet" {
  type        = string
  description = "Name of the VNet to attach the private endpoints to"
}

variable "subnet" {
  type        = string
  description = "Name of the subnet within the VNet for the private endpoints"
}

variable "endpoint_name" {
  type        = string
  description = "Hosted Dolt endpoint name (used to name Azure resources and DNS records)"
}

variable "read_endpoint_name" {
  type        = string
  description = "Hosted Dolt read endpoint name (used to name Azure resources and DNS records)"
}

variable "private_link_service_id" {
  type        = string
  description = "Resource ID of the primary Private Link Service"
}

variable "read_private_link_service_id" {
  type        = string
  description = "Resource ID of the read-replica Private Link Service"
}

# ---------------------------------------------------------------------------
# Data sources
# ---------------------------------------------------------------------------

data "azurerm_subnet" "endpoint_subnet" {
  name                 = var.subnet
  virtual_network_name = var.vnet
  resource_group_name  = var.resource_group
}

# ---------------------------------------------------------------------------
# Private endpoints
# ---------------------------------------------------------------------------

resource "azurerm_private_endpoint" "primary" {
  name                = "dolt-${var.endpoint_name}-private-endpoint"
  location            = var.region
  resource_group_name = var.resource_group
  subnet_id           = data.azurerm_subnet.endpoint_subnet.id

  private_service_connection {
    name                              = "dolt-${var.endpoint_name}-connection"
    private_connection_resource_id    = var.private_link_service_id
    is_manual_connection              = false
  }
}

resource "azurerm_private_endpoint" "read" {
  name                = "dolt-${var.read_endpoint_name}-private-endpoint"
  location            = var.region
  resource_group_name = var.resource_group
  subnet_id           = data.azurerm_subnet.endpoint_subnet.id

  private_service_connection {
    name                              = "dolt-${var.read_endpoint_name}-connection"
    private_connection_resource_id    = var.read_private_link_service_id
    is_manual_connection              = false
  }
}

# ---------------------------------------------------------------------------
# Private DNS zone and VNet link
# ---------------------------------------------------------------------------

resource "azurerm_private_dns_zone" "dolt" {
  name                = "pls.dbs.hosted.doltdb.com"
  resource_group_name = var.resource_group
}

resource "azurerm_private_dns_zone_virtual_network_link" "dolt" {
  name                  = "dolt-${var.endpoint_name}-vnet-link"
  resource_group_name   = var.resource_group
  private_dns_zone_name = azurerm_private_dns_zone.dolt.name
  virtual_network_id    = data.azurerm_subnet.endpoint_subnet.virtual_network_id
  registration_enabled  = false
}

# ---------------------------------------------------------------------------
# DNS A records
# ---------------------------------------------------------------------------

resource "azurerm_private_dns_a_record" "primary" {
  name                = var.endpoint_name
  zone_name           = azurerm_private_dns_zone.dolt.name
  resource_group_name = var.resource_group
  ttl                 = 300
  records             = [azurerm_private_endpoint.primary.private_service_connection[0].private_ip_address]
}

resource "azurerm_private_dns_a_record" "read" {
  name                = var.read_endpoint_name
  zone_name           = azurerm_private_dns_zone.dolt.name
  resource_group_name = var.resource_group
  ttl                 = 300
  records             = [azurerm_private_endpoint.read.private_service_connection[0].private_ip_address]
}

Conclusion#

Hosted Dolt now supports Azure Private Link. You can create a new Hosted Dolt deployment on Azure with Private Link support in just a few clicks. If you have any questions about using Hosted Dolt on Azure, or any other cloud provider, or if you have feedback or feature requests, please join our Discord and let us know.

Database Insurance

Tim Sehn — Mon, 04 May 2026 00:00:00 GMT

Did you know Dolt and Doltgres can be run as replicas to your existing MySQL, MariaDB, or Postgres databases? In this mode, Dolt acts as database insurance. Each transaction commit on your primary becomes a Dolt commit on your replica, preserving the history of your database in Dolt. You can use Dolt’s version control functionality on your replica for all manner of rollback, including complex reverse patches of multiple transactions.

Database insurance is becoming all the more important with the threat of agents going off the rails. It seems like not a month goes by without another “an agent deleted my database” story. Dolt can protect you from rogue agents without switching out your primary. Just add a Dolt replica. This article explains how.

Agents Delete Databases#

This is no longer hypothetical. Agents delete databases.

Last week, Pocket’s Claude-powered coding agent turned a staging task into a production incident. It deleted the company database and the attached backups in one shot. This is the cleanest possible example of the new failure mode: give an agent broad infra access and eventually it will speedrun your disaster recovery plan.

In March, Alexey Grigorev asked Claude to help clean up AWS resources and got an unexpected terraform destroy adventure instead. Production database gone. Snapshots gone. Support upgraded mid-incident. It reads like a normal ops postmortem except the culprit was an agent with cloud credentials.

And in July last year, after Replit’s agent deleted a database, Replit’s CEO said the lesson was that agents should not have access to production databases. Fair enough. But is that achievable? Agents are going to find their way into production workflows, whether through direct SQL, admin tools, or application APIs. The practical question is not whether mistakes will happen but rather how much damage is done when they do.

These stories get attention because they are spectacular. “Entire company database deleted in nine seconds” is a headline. But the underlying issue is not new. Any tool, human- or agent-driven, that has production credentials, broad access, and enough autonomy to make a lot of changes very quickly is dangerous.

Databases were already at risk. Agents just increased the blast radius.

The old production failure mode was a tired human typing DROP TABLE into the wrong terminal. The new production failure mode is an enthusiastic agent with API keys, shell access, a task list, and a limited context window.

Agents Write Junk#

Outright deletion is the catastrophe case. It is easy to notice. It makes the news. Much more common is quiet failure when an agent writes junk to production.

Maybe it backfills the wrong values. Maybe it misinterprets a schema. Maybe it “fixes” a bug by overwriting a field everywhere. Maybe it makes a long sequence of individually valid writes that are collectively nonsense. In many cases the agent is not even talking SQL directly. It is calling your production API, which is often worse because the damage looks like legitimate application traffic. I would guess for every “the agent deleted my database” story, there are hundreds of “the agent wrote bad data into production” stories.

A recent article on the subject, “Databases Were Not Designed For This” by Arpit Bhayani, generated a lively Hacker News discussion. Arpit’s point is that databases were built for boring callers: deterministic apps, reviewed writes, short-lived connections, obvious failures. Agents break every one of those assumptions at once, so a bunch of database hygiene practices that were once “nice to have” — statement timeouts, idempotency keys, soft deletes, append-only logs, role-per-agent, query tagging, the works — suddenly become load-bearing infrastructure. I think that framing is right. The database is no longer talking to careful application code but to an agent who must fix the issue before its context fills up.

This is where normal backups start to feel insufficient. Backups are great for disaster recovery. They are less great for “between 2:13 PM and 2:21 PM, the agent wrote garbage into three related tables, and I need to surgically undo those writes specifically”.

That is a version control problem, and Dolt is the world’s first and only version-controlled database.

Protect Yourself#

The easiest way to get database insurance is to run a Dolt replica on a separate host. You keep your existing primary. MySQL stays MySQL. MariaDB stays MariaDB. Postgres stays Postgres. Dolt or Doltgres sits downstream as a replica, ingesting binlog or WAL changes and turning each committed transaction into a durable Dolt commit.

Setup is the same basic shape as standard replication. No application rewrite. No cutover. No “replace your database with our database” project. Just add a replica.
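
For the MySQL-to-Dolt case, the replica side might be configured roughly as follows. This is a sketch rather than a verified recipe: the statements are standard MySQL 8 replication commands, and it is an assumption that a running Dolt SQL server accepts them in exactly this form. The host, port, user, password, and server ID are placeholders, and the primary is assumed to already have row-based binary logging and GTIDs enabled. Check the Dolt replication documentation before relying on it.

```sql
-- Run on the Dolt replica. All values below are hypothetical placeholders.

-- Give the replica a server ID that differs from the primary's.
SET @@GLOBAL.server_id = 2;

-- Point the replica at the existing MySQL primary.
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'mysql-primary.internal',
    SOURCE_PORT = 3306,
    SOURCE_USER = 'replicator',
    SOURCE_PASSWORD = 'change-me';

-- Start applying the primary's binlog events.
START REPLICA;

-- Inspect replication state.
SHOW REPLICA STATUS;
```

Nothing changes on the primary's schema or in the application. The replica simply subscribes to the change stream the primary already produces.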

The result is useful immediately. You get:

A continuously updated copy of production.
A commit history of every transaction.
Diffs over time.
Branches if you need to experiment.
Rollback tools much more expressive than “restore last night’s backup”.

In the pre-agent era, this was nice to have. With Claude and Codex in all your engineers’ hands, it looks more like a necessity.

Catastrophe#

Backups are still your number one defense here. If a machine dies, a region disappears, or an attacker wipes out infrastructure, backups are what save you. A Dolt replica is defense in depth.

It gives you another live copy of the database and a detailed transaction history. If your primary gets destroyed by a rogue agent, the Dolt replica can tell you exactly what happened and when. If you push that replica to DoltHub or another remote, you now have true off-host safety with different access credentials as well.
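
As a rough sketch, pushing the replica's history off-host could look like this. The remote name and the `my-org/prod-replica` repository are hypothetical, and the example assumes the replica already has credentials for the remote and that the replicated history lives on the `main` branch.

```sql
-- Run on the Dolt replica. The remote name and URL are hypothetical.
CALL dolt_remote('add', 'offsite', 'https://doltremoteapi.dolthub.com/my-org/prod-replica');

-- Push the replicated history on main to the remote.
CALL dolt_push('offsite', 'main');
```

Running the push on a schedule keeps the off-host copy close to the replica without touching the primary.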

This is the right layering:

Backups for disaster recovery.
Replication for availability.
Dolt replication for history, auditability, and surgical undo.
Dolt remote as disaster recovery if all else fails.

You want all four.

Bad Writes#

Bad writes are where a Dolt replica really shines.

First, find the bad writes. Use dolt_log to identify the suspicious time window or commit range. Use dolt_diff() to inspect exactly what changed. Because the writes are preserved as commits, you are not guessing. You are looking at history.

Then undo them on a branch.

Sometimes a simple dolt_revert() is enough. Other times, you want to reverse a sequence of commits or build a custom patch that keeps the good changes and removes only the bad ones. Dolt gives you those tools. Once you have the corrective patch, you can use dolt_patch() to get the SQL you need to apply to your primary.
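
Put together, the find, inspect, undo, and patch steps might look like the sketch below. The commit hash `abc123`, the `orders` table, the timestamps, and the branch name are all placeholders, and the example assumes a single bad commit on `main`. A real incident usually involves several commits.

```sql
-- 1. Find commits in the suspicious window (timestamps are placeholders).
SELECT commit_hash, date, message
FROM dolt_log
WHERE date BETWEEN '2026-05-01 14:13:00' AND '2026-05-01 14:21:00'
ORDER BY date;

-- 2. Inspect what one suspect commit changed in one table.
--    'abc123' stands in for a real commit hash; orders is a hypothetical table.
SELECT * FROM dolt_diff('abc123~', 'abc123', 'orders');

-- 3. Undo it on a branch, leaving main untouched.
CALL dolt_checkout('-b', 'undo-bad-writes');
CALL dolt_revert('abc123');

-- 4. Generate the SQL that turns main into the corrected branch.
SELECT statement
FROM dolt_patch('main', 'undo-bad-writes')
ORDER BY statement_order;
```

The statements returned in step 4 are what you would review and then run against the primary.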

This is a much better story than:

Restore backup to staging.
Diff it manually against production.
Write custom SQL.
Hope you found all the bad rows.
Hope you didn’t revert good writes too.

With Dolt, the database history is already organized into commits. That makes complex rollback possible.

That is what “database insurance” means in practice. Not just “I have a copy somewhere”, but “I can understand what happened and undo it precisely”.

Conclusion#

Use a Dolt replica for database insurance against catastrophe or more mundane bad writes. In the human operator era, a Dolt replica was nice to have. In the agentic operator era, a Dolt replica is essential. Questions? Come by our Discord. We’re happy to help get your replica set up.

Announcing Functional Indexes in Dolt

Jason Fulghum — Wed, 29 Apr 2026 00:00:00 GMT

Dolt is the world’s first SQL database with Git-style version control built in. Dolt gives you all the power of SQL combined with the ability to branch, merge, diff, clone, and push your data, just like you would with source code in Git. Today we’re excited to announce that Dolt now supports functional indexes!

Functional indexes have been part of MySQL since MySQL 8.0.13. They’re not something every application needs, but when they are needed, they can have a big impact on query performance, as we’ll see later in this post. This is a feature we’ve wanted to build for a long time, and after receiving a few customer requests for it, we decided to prioritize it. We’ve launched the initial support in Dolt and are now expanding it: adding the feature to Doltgres, supporting additional syntax, enabling functional indexes in more queries, and fine-tuning performance.

In this blog post, we’ll walk through what functional indexes are, how they work, how to use them, and see them in action.

What is a Functional Index?#

A regular index stores the values of one or more columns and lets Dolt quickly look up rows by those raw column values. A functional index is similar, but instead of storing a column’s raw value, it stores the result of a function or expression applied to that column. That stored result is what gets indexed, so when a query contains the same expression in its WHERE clause, Dolt can use the index to satisfy the query efficiently — without scanning the full table and evaluating the expression row by row.

The syntax uses an extra set of parentheses to mark the expression:

CREATE INDEX idx_lower_email ON users ((LOWER(email)));

The double parentheses are important — that’s how MySQL and Dolt distinguish a functional expression from a regular column reference inside an index definition.

When Does This Help?#

Let’s look at a common scenario for functional indexes. Imagine you have a users table and you want to look up users by email address, but you don’t want User@Example.COM and user@example.com to be treated as different entries. There are several ways to handle this, including normalizing values to lowercase when filtering on them.

SELECT * FROM users WHERE LOWER(email) = LOWER('User@Example.COM');

Without a functional index on LOWER(email), satisfying that query requires a full table scan. The database engine has to evaluate LOWER(email) for every single row and compare it against your search value. On a large table, that’s going to be slow.

A functional index solves this cleanly, without requiring you to store a redundant email_lower column purely for indexing purposes.

Other common cases for using a functional index include:

Date part extraction — index on YEAR(created_at) or MONTH(order_date) for reporting queries that filter by year or month
JSON path extraction — index on JSON_UNQUOTE(JSON_EXTRACT(metadata, '$.type')) to speed up queries on specific fields inside a JSON column
String transformations — indexing trimmed or normalized versions of values without duplicating the data in an extra column
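
To make two of these cases concrete, here is a sketch in MySQL syntax. The `orders` and `customers` tables and their columns are hypothetical, and whether Dolt's initial release accepts and uses each of these particular expressions is an assumption worth checking with EXPLAIN.

```sql
-- Hypothetical tables: orders(created_at DATETIME), customers(name VARCHAR(100)).

-- Date part extraction: index the year of each order.
CREATE INDEX idx_orders_created_year ON orders ((YEAR(created_at)));

-- The query repeats the indexed expression so the optimizer can match it.
SELECT COUNT(*) FROM orders WHERE YEAR(created_at) = 2025;

-- String transformation: index a trimmed version of the name.
CREATE INDEX idx_customers_trim_name ON customers ((TRIM(name)));

SELECT * FROM customers WHERE TRIM(name) = 'Ada Lovelace';
```

In both cases the WHERE clause uses the same expression as the index definition, which is what lets the index be considered at all.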

Getting Started#

Let’s see functional indexes in action. We’ll use the dolthub/employees database from DoltHub — a sample database with 300,024 employees and 443,308 title records, which gives us enough data to see a real difference in query performance.

If you don’t have Dolt yet, install it here. Then clone the database:

$ dolt clone dolthub/employees
$ cd employees

From here, you can run dolt sql to start a SQL shell:

$ dolt sql

# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit. "\help" for help.
employees/main>

Then you can start exploring the database:

employees/main> DESCRIBE employees;
+------------+---------------+------+-----+---------+-------+
| Field      | Type          | Null | Key | Default | Extra |
+------------+---------------+------+-----+---------+-------+
| emp_no     | int           | NO   | PRI | NULL    |       |
| birth_date | date          | NO   |     | NULL    |       |
| first_name | varchar(14)   | NO   |     | NULL    |       |
| last_name  | varchar(16)   | NO   |     | NULL    |       |
| gender     | enum('M','F') | NO   |     | NULL    |       |
| hire_date  | date          | NO   |     | NULL    |       |
+------------+---------------+------+-----+---------+-------+

Out of the box, the only index is the primary key on emp_no. There’s no index on last_name. Suppose we want to find all employees by last name, but we need the search to be case-insensitive — the data was imported from multiple sources and the casing isn’t consistent. We write:

SELECT emp_no, first_name, last_name FROM employees WHERE LOWER(last_name) = 'facello';

Without a Functional Index#

Let’s look at the query plan Dolt uses when there’s no index. We’ll use EXPLAIN FORMAT=TREE which gives us Dolt’s detailed execution plan:

employees/main> EXPLAIN FORMAT=TREE SELECT emp_no, first_name, last_name FROM employees WHERE LOWER(last_name) = 'facello';
+----------------------------------------------------------------------------+
| plan                                                                       |
+----------------------------------------------------------------------------+
| Project                                                                    |
|  ├─ columns: [employees.emp_no, employees.first_name, employees.last_name] |
|  └─ Filter                                                                 |
|      ├─ (lower(employees.last_name) = 'facello')                           |
|      └─ Table                                                              |
|          ├─ name: employees                                                |
|          └─ columns: [emp_no first_name last_name]                         |
+----------------------------------------------------------------------------+
7 rows in set (0.00 sec)

The plan is a Project of three columns, over a Filter node on top of a full Table scan. Dolt reads all 300,024 rows, evaluates LOWER(last_name) on each one, and discards the ones that don’t match. Against 300K rows, on my Apple M1 Max laptop, this query takes about 150ms.

Adding the Functional Index#

Now let’s create the functional index, using the same syntax we covered earlier:

employees/main> CREATE INDEX idx_lower_last_name ON employees ((LOWER(last_name)));

Let’s look at the query plan for the same query and see if it’s different after creating a functional index on LOWER(last_name):

employees/main> EXPLAIN FORMAT=TREE SELECT emp_no, first_name, last_name FROM employees WHERE LOWER(last_name) = 'facello';
+----------------------------------------------------------------------------+
| plan                                                                       |
+----------------------------------------------------------------------------+
| Project                                                                    |
|  ├─ columns: [employees.emp_no, employees.first_name, employees.last_name] |
|  └─ IndexedTableAccess(employees)                                          |
|      ├─ index: [employees.!hidden!idx_lower_last_name!0!0]                 |
|      └─ filters: [{[facello, facello]}]                                    |
+----------------------------------------------------------------------------+
5 rows in set (0.00 sec)

The full table scan is gone. The plan now shows IndexedTableAccess — Dolt goes directly to the index, seeks to the entries where the indexed expression equals 'facello', and returns only those rows. The same query now takes under 2ms, a ~75x improvement on a 300K row table.

Indexed lookups into a large table are one of the areas where functional indexes can have a huge impact on query performance. Going from a full table scan to a direct lookup gave a 75x improvement for this query against 300K rows. As the table grows, the full table scan gets proportionally slower while the indexed lookup stays nearly constant, so the improvement keeps getting larger.

Functional Indexes in Joins#

The functional index doesn’t just speed up point lookups — it also improves queries where the filtered table is one side of a join. Let’s look at a query that finds the current job title for every employee with the last name ‘Facello’:

SELECT f.emp_no, f.first_name, f.last_name, t.title, t.from_date
FROM (SELECT emp_no, first_name, last_name FROM employees WHERE LOWER(last_name) = 'facello') f
JOIN titles t ON f.emp_no = t.emp_no
WHERE t.to_date = '9999-01-01';

Without the Index#

Let’s start by looking at the query plan without the index, just like we did earlier. First, drop the index we just created on the employees table:

employees/main> DROP INDEX idx_lower_last_name ON employees;

And here’s the query plan:

employees/main> EXPLAIN FORMAT=TREE
    SELECT f.emp_no, f.first_name, f.last_name, t.title, t.from_date
    FROM (SELECT emp_no, first_name, last_name FROM employees WHERE LOWER(last_name) = 'facello') f
    JOIN titles t ON f.emp_no = t.emp_no
    WHERE t.to_date = '9999-01-01';
+--------------------------------------------------------------------------+
| plan                                                                     |
+--------------------------------------------------------------------------+
| Project                                                                  |
|  ├─ columns: [f.emp_no, f.first_name, f.last_name, t.title, t.from_date] |
|  └─ LookupJoin                                                           |
|      ├─ SubqueryAlias                                                    |
|      │   ├─ name: f                                                      |
|      │   ├─ outerVisibility: false                                       |
|      │   ├─ isLateral: false                                             |
|      │   ├─ cacheable: true                                              |
|      │   ├─ colSet: (7-9)                                                |
|      │   ├─ tableId: 2                                                   |
|      │   └─ Filter                                                       |
|      │       ├─ (lower(employees.last_name) = 'facello')                 |
|      │       └─ Table                                                    |
|      │           ├─ name: employees                                      |
|      │           └─ columns: [emp_no first_name last_name]               |
|      └─ Filter                                                           |
|          ├─ (t.to_date = '9999-01-01')                                   |
|          └─ TableAlias(t)                                                |
|              └─ IndexedTableAccess(titles)                               |
|                  ├─ index: [titles.emp_no,titles.title,titles.from_date] |
|                  ├─ columns: [emp_no title from_date to_date]            |
|                  └─ keys: f.emp_no                                       |
+--------------------------------------------------------------------------+
22 rows in set (0.00 sec)

The subquery on one side of the LookupJoin does a full table scan of all 300K employees to find the Facello employees. Then, for each matching employee, it does a fast indexed lookup into the titles table using the primary key. The bottleneck is that first full scan, and the query runs in about 150ms. This is very similar to our first un-indexed query, which makes sense: the dominant cost in both queries is the full table scan over the 300K rows in the employees table.

With the Index#

Just like before, let’s create a functional index over LOWER(last_name):

employees/main> CREATE INDEX idx_lower_last_name ON employees ((LOWER(last_name)));

Now let’s look at the query plan:

employees/main> EXPLAIN FORMAT=TREE
    SELECT f.emp_no, f.first_name, f.last_name, t.title, t.from_date
    FROM (SELECT emp_no, first_name, last_name FROM employees WHERE LOWER(last_name) = 'facello') f
    JOIN titles t ON f.emp_no = t.emp_no
    WHERE t.to_date = '9999-01-01';
+----------------------------------------------------------------------------------------+
| plan                                                                                   |
+----------------------------------------------------------------------------------------+
| Project                                                                                |
|  ├─ columns: [f.emp_no, f.first_name, f.last_name, t.title, t.from_date]               |
|  └─ LookupJoin                                                                         |
|      ├─ SubqueryAlias                                                                  |
|      │   ├─ name: f                                                                    |
|      │   ├─ outerVisibility: false                                                     |
|      │   ├─ isLateral: false                                                           |
|      │   ├─ cacheable: true                                                            |
|      │   ├─ colSet: (15-17)                                                            |
|      │   ├─ tableId: 3                                                                 |
|      │   └─ Project                                                                    |
|      │       ├─ columns: [employees.emp_no, employees.first_name, employees.last_name] |
|      │       └─ IndexedTableAccess(employees)                                          |
|      │           ├─ index: [employees.!hidden!idx_lower_last_name!0!0]                 |
|      │           └─ filters: [{[facello, facello]}]                                    |
|      └─ Filter                                                                         |
|          ├─ (t.to_date = '9999-01-01')                                                 |
|          └─ TableAlias(t)                                                              |
|              └─ IndexedTableAccess(titles)                                             |
|                  ├─ index: [titles.emp_no,titles.title,titles.from_date]               |
|                  ├─ columns: [emp_no title from_date to_date]                          |
|                  └─ keys: f.emp_no                                                     |
+----------------------------------------------------------------------------------------+
22 rows in set (0.00 sec)

The join structure is identical — a LookupJoin with a per-employee index seek into titles — but now the employees side uses IndexedTableAccess with the functional index instead of a full table scan. Dolt resolves only the 186 matching employees before doing any join work at all, which brings the query down to under 2ms.

Here’s a sample of the results:

+--------+------------+-----------+------------------+------------+
| emp_no | first_name | last_name | title            | from_date  |
+--------+------------+-----------+------------------+------------+
|  10001 | Georgi     | Facello   | Senior Engineer  | 1986-06-26 |
|  15346 | Kirk       | Facello   | Senior Engineer  | 1997-12-06 |
|  15685 | Kasturi    | Facello   | Senior Engineer  | 1997-03-13 |
|  18686 | Kwangyoen  | Facello   | Senior Staff     | 1994-05-02 |
|  21947 | Taisook    | Facello   | Senior Engineer  | 1998-08-28 |
|  23938 | Nahum      | Facello   | Senior Engineer  | 1985-09-15 |
|  24774 | Uno        | Facello   | Senior Staff     | 2002-05-15 |
|  24806 | Charmane   | Facello   | Senior Engineer  | 1999-03-27 |
|  25955 | Christoph  | Facello   | Technique Leader | 1995-08-21 |
|  27732 | Girolamo   | Facello   | Senior Engineer  | 1986-06-30 |
| ...    |            |           |                  |            |
+--------+------------+-----------+------------------+------------+
186 rows in set

Functional Indexes and Dolt Branches#

One thing worth noting for Dolt users: functional indexes are part of your schema, and like everything else in Dolt, your schema lives on branches. If you create a functional index on a feature branch, it exists only on that branch until you merge it. Schema changes, including functional index additions and removals, show up in dolt diff and in pull requests on DoltHub, so your team can review index changes the same way they review any other change to the database.

This means you can safely experiment with different indexing strategies on isolated branches and only merge the ones that work the way you need them to. It also lets you offload the work of building the initial index to a branch. When you merge that branch into another one, Dolt can often reuse the index data directly and update it using only the data differences between the branches.

How It Works#

Under the hood, Dolt implements functional indexes the same way MySQL does: by automatically creating a hidden virtual generated column that stores the result of the expression, then building a regular index on that hidden column. The virtual column is entirely transparent — it won’t appear in SHOW COLUMNS or SELECT * output. Dolt handles computing and maintaining it automatically as rows are inserted and updated.
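
To make the idea concrete, here is a minimal Go sketch of "precompute the expression, then index the result". It is an illustration only, not Dolt's implementation: the employee and functionalIndex types and the lowerLastName helper are hypothetical, a hash map stands in for a real ordered index, and the 10002 row is only there as a non-matching example.

```go
package main

import (
	"fmt"
	"strings"
)

// employee is a simplified row; only the columns needed here are modeled.
type employee struct {
	EmpNo    int
	LastName string
}

// functionalIndex stands in for "hidden generated column + regular index":
// it stores the precomputed expression result for each row, keyed for lookup.
// (Hypothetical type; a map is used instead of an ordered index for brevity.)
type functionalIndex struct {
	expr    func(employee) string // the indexed expression
	entries map[string][]int      // expression result -> primary keys
}

func newFunctionalIndex(expr func(employee) string) *functionalIndex {
	return &functionalIndex{expr: expr, entries: map[string][]int{}}
}

// insert computes the expression once, at write time, and records the result.
func (ix *functionalIndex) insert(e employee) {
	key := ix.expr(e)
	ix.entries[key] = append(ix.entries[key], e.EmpNo)
}

// lookup answers WHERE <expr> = value without scanning every row.
func (ix *functionalIndex) lookup(value string) []int {
	return ix.entries[value]
}

// lowerLastName mirrors the expression LOWER(last_name).
func lowerLastName(e employee) string { return strings.ToLower(e.LastName) }

func main() {
	ix := newFunctionalIndex(lowerLastName)
	for _, e := range []employee{{10001, "Facello"}, {10002, "Simmel"}, {15346, "Facello"}} {
		ix.insert(e)
	}
	fmt.Println(ix.lookup("facello")) // prints [10001 15346]
}
```

The cost of evaluating the expression moves from query time to write time, which is why the filter no longer needs a table scan.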

Future Enhancements#

Our initial release of functional index support includes the features that customers told us they needed from functional indexes (e.g. support for a single functional expression, using functional indexes in joins and filters). Now that those customer use cases are unblocked, we’re following up with a few more enhancements:

Multiple expressions per functional index. Each functional index currently supports exactly one functional expression. If you need indexes on multiple expressions, you’ll need a separate index for each one. This was an expedient way to build what our first customers needed, and we’re already working on expanding support for mixing functional expressions with column references in an index, and using multiple functional expressions in a single index. We expect to deliver this as a quick follow-up to the initial support.

Doltgres support. Doltgres is our PostgreSQL-compatible database engine. Dolt has a head start of a few years over Doltgres, but Doltgres is catching up quickly and moving towards a 1.0 milestone. We’ve already started working on enabling functional index support in Doltgres and expect to deliver this as another fast follow-up.

Use functional indexes for more queries. Today, functional indexes are used in joins and filters to speed up queries. These are the main places where functional indexes speed up queries, but there are other places where they could also be used, such as to optimize sorting. For example, a query like SELECT * FROM users ORDER BY LOWER(email) won’t yet use the functional index to avoid a sort.

Performance testing. Our initial performance testing for functional indexes shows a dramatic speed up for queries that performed a full table scan before being optimized with a functional index. When compared to MySQL performance for those same queries and same index, Dolt matches or beats MySQL’s performance. We’ll be adding sysbench tests to track query performance with functional indexes and digging into other cases, like JSON usage with functional indexes, to continue tuning performance.

Wrap Up#

The ability to index functional expressions on your data and then use those precomputed values to speed up joins and filtering in queries is a big feature for Dolt. These functional indexes can turn slow table scans over large tables into lightning fast point lookups. In the examples we walked through here, on a table with 300k rows, a functional index improved performance by 75x. The performance impact gets larger as the table size grows.

If you want to dig deeper into how Dolt uses indexes and plans joins, check out Nick’s recent post on improving index selection for join queries.

If you haven’t tried Dolt yet, install it here and give functional indexes a spin! If you have questions or feedback, swing by our Discord server or file an issue on GitHub. We love hearing from customers and are always happy to prioritize features or bug fixes that customers tell us they need.

Why DoltLite?

Tim Sehn — Mon, 27 Apr 2026 00:00:00 GMT

We shipped DoltLite, a version-controlled SQLite. We already have Dolt. Dolt is free and open source. Dolt clones, branches, pushes, pulls, and merges. Why ship a second product? This article explains.

Local-first Software#

Local-first software, coined by Ink & Switch in 2019, has inspired a lot of software, mostly based on SQLite. SQLite puts the “local” in “local-first”. SQLite delivers complex tabular data and SQL query power as a C-library, embeddable in any language.

Local-first defines seven ideals:

Fast
Multi-device
Offline
Collaboration
Longevity
Privacy
User control

SQLite gives you “Fast”, “Offline”, “Privacy”, and “User Control”.

How do you get “Multi-device” and “Collaboration” from SQLite? You need a sync engine. A number of sync engines have been built around SQLite in the more than half a decade since Ink & Switch published this essay: Turso, Powersync, ElectricSQL, and cr-sqlite, to name a few.

But the fundamental problem with sync engines is “What do you do with conflicts?”. The Local-first software essay suggests Conflict-free Replicated Data Types (CRDTs) as the solution, rejecting Git because “Git has no capability for real-time, fine-grained collaboration, such as the automatic, instantaneous merging” and “other (non-text) file formats are treated as binary blobs that cannot meaningfully be edited or merged”.

Enter DoltLite, a version-controlled SQLite. DoltLite enables a new class of local-first application by supporting Git-style merging on tabular data directly accessible as a C-library.

But Dolt?#

Dolt already existed and has been stable for years. Why not just use Dolt?

Because Dolt is a server. To use Dolt from your application you stand up dolt sql-server, point a MySQL client at port 3306, and talk wire protocol. That’s a fine model if you’re replacing MySQL or Postgres. It’s the wrong model for local-first.

The whole point of local-first is the database lives on the user’s device. Asking the user to run a server defeats the purpose. Running a local server is a pain. There’s a process to manage, a port to not collide with, a daemon that can crash. SQLite has none of that. SQLite is a .dylib your application links in. That’s the model local-first wants.

DoltLite is that model with version control. People have been asking us for an embedded Dolt for years.

SQL and Version Control in Any Language#

Dolt is written in Go. Go is great for a server. It is bad for a library you embed in someone else’s runtime, unless it’s a Go runtime.

Go binaries are big. Go is awkward to link into iOS apps, Python C extensions, Ruby gems, Node addons, Erlang NIFs, and Rust crates. And Go’s WebAssembly (WASM) story produces multi-megabyte bundles with a runtime that doesn’t play well with the browser’s threading and storage primitives.

C compiles to all of those. iOS, Android, Python, Ruby, Node, Rust, Erlang, and the one that matters most for local-first: a .wasm bundle that runs inside a browser tab. DoltLite compiles to SQLite’s WASM target.

WASM is what local-first looks like in 2026. The user opens a webpage. The webpage downloads a .wasm file. The .wasm is a full version-controlled SQL database backed by the browser’s private filesystem. The user’s edits commit to a local branch. When they hit publish, the branch pushes to a DoltLite remote. When a teammate’s branch lands, the user pulls and merges.

-- alice's laptop
SELECT dolt_clone('https://dolthub.com/team/users', 'users.db');
INSERT INTO users VALUES (1, 'alice');
SELECT dolt_commit('-Am', 'add alice');                                       
SELECT dolt_push('origin', 'main');                                
                                                                                
-- bob's browser tab (WASM)                                        
SELECT dolt_clone('https://dolthub.com/team/users', 'users.db');
INSERT INTO users VALUES (2, 'bob');                                          
SELECT dolt_commit('-Am', 'add bob');
SELECT dolt_pull('origin', 'main');   -- merges alice's commit                
SELECT dolt_push('origin', 'main');   -- pushes the merged history

No wire protocol. No port. No server. A C library call into a .dylib, .so, .dll, or .wasm you linked into your app. dolt_push to share with everyone else holding a copy.

In the example, Alice and Bob inserted different rows, so the merge was trivial. If they had both updated row two to different names, dolt_pull would do what Git does: drop both versions into a dolt_conflicts_users table and refuse to commit until a human picks. That’s the part CRDTs hide. For data with audit or rollback requirements, you want disagreement surfaced, not silently resolved.
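
The difference between the trivial merge and the conflicting one can be sketched as a three-way merge against a common ancestor. The Go code below is a simplified, row-level illustration under stated assumptions, not DoltLite's implementation: the table type and merge function are hypothetical, and each row is reduced to a primary key and a name.

```go
package main

import (
	"fmt"
	"sort"
)

// table maps a primary key to a single value column (the user's name).
type table map[int]string

// merge does a row-level three-way merge of ours and theirs against their
// common ancestor. Keys changed differently on both sides are reported as
// conflicts and left out of the merged result for a human to resolve.
func merge(base, ours, theirs table) (table, []int) {
	keys := map[int]bool{}
	for _, t := range []table{base, ours, theirs} {
		for k := range t {
			keys[k] = true
		}
	}

	merged := table{}
	var conflicts []int
	for k := range keys {
		b, inBase := base[k]
		o, inOurs := ours[k]
		t, inTheirs := theirs[k]
		oursChanged := inOurs != inBase || o != b
		theirsChanged := inTheirs != inBase || t != b
		sidesAgree := inOurs == inTheirs && o == t

		switch {
		case oursChanged && theirsChanged && !sidesAgree:
			conflicts = append(conflicts, k) // surfaced, not silently resolved
		case theirsChanged:
			if inTheirs {
				merged[k] = t
			}
		default:
			if inOurs {
				merged[k] = o
			}
		}
	}
	sort.Ints(conflicts)
	return merged, conflicts
}

func main() {
	// Alice and Bob insert different rows: no conflict.
	merged, conflicts := merge(table{}, table{2: "bob"}, table{1: "alice"})
	fmt.Println(merged, conflicts)

	// Both rename row 2 differently: the row is flagged instead of merged.
	base := table{1: "alice", 2: "bob"}
	merged, conflicts = merge(base, table{1: "alice", 2: "robert"}, table{1: "alice", 2: "bobby"})
	fmt.Println(merged, conflicts)
}
```

In the second call the conflicting key is returned to the caller rather than resolved by picking a winner, which is the behavior the dolt_conflicts_users table exposes.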

DoltLite is the local-first use case, with Git-style merging on structured data instead of CRDTs.

Try DoltLite#

All DoltLite needs is users. Have a local-first app you’ve been waiting to build because there was no Git-style sync model? Wait no longer. Questions? Bugs? Feature requests? Cut an issue. Otherwise, come by our Discord to discuss. Meet me in the #doltlite🪶 channel.

How Dolt Represents and Evaluates Queries: A Case Study

Nick Tobey — Mon, 20 Apr 2026 00:00:00 GMT

Want to know a cool secret about database engines? They’re literally the same thing as compilers.

At my previous job, I worked on Google’s internal Java compiler. Now, I’m one of the developers of Dolt, the first version-controlled SQL database, as well as go-mysql-server, the SQL engine that Dolt depends on.

It turns out that the skills and techniques that I first learned while writing a compiler are the exact same techniques that Dolt uses in its database engine. And that’s because when you break it down, database engines are just domain-specific compilers:

General-purpose compilers turn human-readable source code into machine-readable programs that manipulate variables and produce side-effects.
SQL engines turn human-readable SQL queries into machine-readable execution plans that manipulate table columns and produce a stream of rows.

Most compiler concepts map directly onto database engines. Some database engines like SQLite even work by producing bytecode that executes in a special-purpose VM. Dolt doesn’t do that, but it does create an abstract syntax tree almost exactly like one you would see in any other interpreted language.

The main difference is that in most languages, evaluating a syntax tree produces a return value and side effects. In Dolt, evaluating a syntax tree produces an iterator. This also means that each intermediate node in the tree is an iterator defined in terms of one or more child iterators. This approach is often called the volcano model or iterator model of database engine design.
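
As a rough illustration of the iterator model, here is a minimal Go sketch. The RowIter interface and the tableScan, filter, and project nodes are hypothetical names chosen for this example and are much simpler than go-mysql-server's real interfaces.

```go
package main

import "fmt"

// Row is one tuple flowing between nodes.
type Row []int

// RowIter is the pull-based interface every node implements.
// Next returns ok=false once the iterator is exhausted.
type RowIter interface {
	Next() (row Row, ok bool)
}

// tableScan is a leaf node that yields rows from an in-memory table.
type tableScan struct {
	rows []Row
	pos  int
}

func (t *tableScan) Next() (Row, bool) {
	if t.pos >= len(t.rows) {
		return nil, false
	}
	r := t.rows[t.pos]
	t.pos++
	return r, true
}

// filter is defined in terms of its child: it pulls rows until one matches.
type filter struct {
	child RowIter
	pred  func(Row) bool
}

func (f *filter) Next() (Row, bool) {
	for {
		r, ok := f.child.Next()
		if !ok {
			return nil, false
		}
		if f.pred(r) {
			return r, true
		}
	}
}

// project keeps only the listed column offsets of each child row.
type project struct {
	child RowIter
	cols  []int
}

func (p *project) Next() (Row, bool) {
	r, ok := p.child.Next()
	if !ok {
		return nil, false
	}
	out := make(Row, len(p.cols))
	for i, c := range p.cols {
		out[i] = r[c]
	}
	return out, true
}

// collect drains an iterator into a slice.
func collect(it RowIter) []Row {
	var out []Row
	for r, ok := it.Next(); ok; r, ok = it.Next() {
		out = append(out, r)
	}
	return out
}

func main() {
	// Roughly: SELECT b FROM t WHERE a > 1, for a table t(a, b).
	scan := &tableScan{rows: []Row{{1, 10}, {2, 20}, {3, 30}}}
	plan := &project{
		child: &filter{child: scan, pred: func(r Row) bool { return r[0] > 1 }},
		cols:  []int{1},
	}
	fmt.Println(collect(plan)) // prints [[20] [30]]
}
```

Each node only knows about its children and pulls rows on demand, which is the same shape as a tree-walking interpreter.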

This means that database engines can have the same types of bugs as traditional languages and a lot of the same thorns. We recently fixed a correctness issue in Dolt that does a good job demonstrating this. You can find the writeup and the fix here.

To trigger the incorrect behavior, you needed a query that looked like this.

CREATE TABLE ab (a INT PRIMARY KEY, b INT);
CREATE TABLE three_pk (pk1 TINYINT, pk2 TINYINT, pk3 TINYINT, col TINYINT, PRIMARY KEY (pk1, pk2, pk3));
SELECT * FROM ab AS ab1 WHERE EXISTS (
  SELECT * FROM ab AS ab2 WHERE EXISTS (
    SELECT * FROM ab AS ab3 WHERE EXISTS (
      SELECT * FROM three_pk WHERE pk1 = ab1.a and pk2 = ab2.a and pk3 = ab3.a
    )
  )
);

Running this query on older versions of Dolt would trigger a panic. To understand why, let’s talk about scopes.

Scopes#

Scopes are something that every programmer has to deal with even if they don’t realize it. It’s how programs resolve symbols to the things that are actually being referenced.

SQL queries reference tables and columns by name, and Dolt needs to map those names onto the tables and columns they represent. This is not as simple as it looks, because the same name can refer to different tables based on where it appears in the statement. The following is a valid SQL query:

SELECT * FROM test_table WHERE EXISTS (SELECT * FROM test_table where test_table.id = 1) and test_table.id = 2;

In this query, there are two tables named test_table and two filter conditions that name test_table. Which condition refers to which child iterator? If Dolt resolves these names incorrectly, it will produce incorrect output.

Two names can also refer to the same table, again depending on where the names appear. Consider this simple query with a two-part primary key:

CREATE TABLE test_table(pk1 INT, pk2 INT, PRIMARY KEY (pk1, pk2));
SELECT * FROM (SELECT * FROM test_table WHERE test_table.pk2 = 1) AS table_alias where table_alias.pk1 = 2;

This query can be optimized into a simple table lookup, but only if Dolt can detect that the two WHERE clauses refer to the same table, even though the two clauses use different table names.

Thus, it does not suffice for Dolt to simply keep a global dictionary that maps names onto tables, because the rules for resolving references are not global. Different parts of the query introduce their own namespace, which changes how names are resolved. These namespaces are scopes.

So how does a database engine keep these scopes straight? How are table references actually represented in the abstract syntax tree?

Scopes at Analysis Time#

The most naive approach is to simply store the same names in the AST as they appear in the query. Then, whenever the engine wants to analyze, optimize, or execute part of the tree, it resolves the name using scope rules. This works, but it’s incredibly brittle:

Any optimization that transforms the AST needs to be very careful. If a node contains a reference, moving that node to another part of the tree could change the meaning of that reference, and change the behavior of the query. Whenever the engine transforms the tree, it would need to update any references.
In fringe cases, a transformation may result in an impossible tree, where we need to reference a table but it’s impossible to do so because that table’s name is being shadowed by another.
Needing to resolve references repeatedly is slow and wasteful.

This is actually how Dolt used to operate years ago, and it was the source of subtle bugs. So we switched to a better approach: when analyzing a query, we assign incrementing globally unique IDs to tables and columns. Every reference is resolved once, and then the name gets replaced with the ID. Since each ID always refers to the same table or column, we can safely modify the AST without any risk of changing the query’s meaning.
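
A minimal Go sketch of that idea, assuming a simple chain of nested scopes. The analyzer and scope types are hypothetical and far smaller than the real analyzer; the point is only that a name is resolved once, innermost scope first, and replaced with an ID that never changes meaning.

```go
package main

import "fmt"

// ColumnID is a globally unique identifier assigned during analysis.
type ColumnID int

// scope maps the names visible at one nesting level to column IDs.
type scope struct {
	parent *scope
	cols   map[string]ColumnID
}

// analyzer hands out incrementing IDs.
type analyzer struct{ next ColumnID }

func (a *analyzer) newScope(parent *scope) *scope {
	return &scope{parent: parent, cols: map[string]ColumnID{}}
}

// define registers a column in the scope under a fresh ID.
func (a *analyzer) define(s *scope, name string) ColumnID {
	a.next++
	s.cols[name] = a.next
	return a.next
}

// resolve walks from the innermost scope outward, so the nearest
// definition shadows any outer definition with the same name.
func (s *scope) resolve(name string) (ColumnID, bool) {
	for cur := s; cur != nil; cur = cur.parent {
		if id, ok := cur.cols[name]; ok {
			return id, true
		}
	}
	return 0, false
}

func main() {
	// SELECT * FROM test_table WHERE EXISTS
	//   (SELECT * FROM test_table WHERE test_table.id = 1) AND test_table.id = 2;
	a := &analyzer{}
	outer := a.newScope(nil)
	outerID := a.define(outer, "test_table.id")
	inner := a.newScope(outer)
	innerID := a.define(inner, "test_table.id")

	fromInner, _ := inner.resolve("test_table.id") // the "= 1" condition
	fromOuter, _ := outer.resolve("test_table.id") // the "= 2" condition
	fmt.Println(outerID, innerID, fromInner, fromOuter) // prints 1 2 2 1
}
```

Once the two conditions carry IDs 2 and 1 instead of the shared name, a transformation can move either one around the tree without changing which table it refers to.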

But this is only half of the story when it comes to scopes.

Scopes at Runtime#

It’s not enough to just resolve references to the table or column they represent. We also need to track those values while the query is running. Operations that produce column values need to be able to send those values to the operations that read them.

In general-purpose languages, this might be accomplished with registers, or by writing to values in memory. But SQL queries are declarative and functional and don’t have state: when evaluating a node in the syntax tree, the iterator it produces is often a pure function of:

Columns inherited from parent nodes
- Example: In SELECT id FROM a WHERE EXISTS (SELECT id FROM b WHERE a.id = b.id), the inner query references a column from the outer query.
Columns returned from child nodes
- Example: In SELECT id FROM (SELECT id FROM a) AS a_alias where a_alias.id = 1, the outer query references a column from the inner query.
Columns defined in sibling nodes
- Example: In SELECT id FROM a JOIN LATERAL (SELECT id FROM b WHERE a.id = b.id), the right side of the join references a column from the left side of the join.

Each of these columns can have multiple values over the lifetime of the query, but only one at a time. A SQL engine needs a strategy to represent this internally.

A naive approach might be to maintain a mapping from each column name to its current value. Except as we already saw, names don’t map one-to-one onto columns. In fact, it’s perfectly valid MySQL for a table alias to have multiple columns with the same name. For example, the below query produces a table alias with two columns both named pk.

CREATE TABLE test_table(pk INT PRIMARY KEY);
SELECT * FROM (SELECT * FROM test_table JOIN test_table) AS table_alias;

Even more problematic is that a table alias column might have no name:

SELECT * FROM (SELECT 1+1) AS table_alias;

In either case, it’s not possible to reference these columns in filters, but they can still impact the results of the query if the alias is used in a SELECT *.

So if we can’t track values by their column name, another approach might be to use the unique column IDs we discussed in the previous section. But there could be many such IDs, and each node in the AST only cares about a small number of them. Managing lots of small maps is also not very performant, and we care about performance.

Fortunately, there’s an approach that gives us both clarity and performance. Every scope in a SQL query always has the same number of columns. So the number of columns referenceable from any node in the AST is a constant value that can be determined by statically analyzing the query. The number of columns in that node’s iterator is also a constant value.

Examples:

A table reference with N columns produces an iterator that returns a list of N values.
A SELECT col1, col2, ... colN WHERE EXISTS (...) construct creates N referenceable columns for every node within the subquery.

Each node can have an array where we store the value of each of these columns. Each element in the array corresponds to a different column. Before we evaluate the query, we can analyze the AST to determine which column each array element represents. Now if we want to represent an expression that reads a column, we don’t need to store the name of the column in the AST, and we don’t need to store that column’s globally unique ID either: all we need to store is the offset within that node’s own array corresponding to that column.

This is the approach that Dolt uses: it analyzes the tree, determines the exact set of columns visible to each node, and replaces column references with the correct offset into that node’s array that will contain that column at runtime.
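
The replacement of references with offsets can be sketched in a few lines of Go. This is an illustration with hypothetical names (getField, compileRef), and it assumes a layout where outer-scope columns come first in the array, followed by the child's columns.

```go
package main

import "fmt"

// ColumnID is the globally unique ID assigned to a column during analysis.
type ColumnID int

// getField is a compiled column reference. At runtime it is nothing more
// than an offset into the row array that the node receives.
type getField struct{ offset int }

func (g getField) eval(row []int) int { return row[g.offset] }

// compileRef resolves a column ID to its offset in a node's input schema.
// This happens once, before the query executes.
func compileRef(inputSchema []ColumnID, id ColumnID) (getField, error) {
	for i, c := range inputSchema {
		if c == id {
			return getField{offset: i}, nil
		}
	}
	return getField{}, fmt.Errorf("column %d is not visible to this node", id)
}

func main() {
	// IDs for t1(a, b, c) and t2(a, b, c) in:
	//   SELECT a FROM test_table AS t1 WHERE EXISTS
	//     (SELECT b FROM test_table AS t2 WHERE t1.a = t2.a)
	const (
		t1a ColumnID = iota + 1
		t1b
		t1c
		t2a
		t2b
		t2c
	)
	// Assumed layout: outer-scope columns first, then the child's columns.
	filterInput := []ColumnID{t1a, t1b, t1c, t2a, t2b, t2c}

	left, _ := compileRef(filterInput, t1a)
	right, _ := compileRef(filterInput, t2a)

	row := []int{7, 1, 2, 7, 3, 4} // current t1 row followed by current t2 row
	fmt.Println(left.offset, right.offset, left.eval(row) == right.eval(row)) // prints 0 3 true
}
```

If the analyzer miscounts how many outer-scope columns precede the child's columns, every offset after that point is wrong, and eval either reads the wrong value or indexes past the end of the array.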

We can illustrate this process with some diagrams. In each case, we show the nodes in the AST, and each node has both an ordered sequence of columns it produces (the output schema), and an ordered list of columns it can reference (the input schema). The color of each cell indicates the node that originally produced the value. Note how nodes can contain column references from children, siblings, or parents, but in each case the number of columns can be statically determined.

A node referencing its child:

CREATE TABLE test_table(a int, b int, c int);
SELECT c+1 AS d FROM test_table;

Would result in an AST that looks like this:

A node referencing its parent:

CREATE TABLE test_table(a int, b int, c int);
SELECT a FROM test_table AS t1 WHERE EXISTS (SELECT b FROM test_table AS t2 WHERE t1.a = t2.a)

Would result in an AST that looks like this:

In order for this optimization to work correctly, every node must agree on how many columns it receives from each of its children, and how many columns are visible from parent and sibling scopes. If these numbers don’t agree, it could lead to situations where Dolt accesses the wrong value at runtime, or accesses the array out-of-bounds and panics.

So with all this in mind, let’s look back at the query that triggered the bug:

CREATE TABLE ab (a INT PRIMARY KEY, b INT);
CREATE TABLE three_pk (pk1 TINYINT, pk2 TINYINT, pk3 TINYINT, col TINYINT, PRIMARY KEY (pk1, pk2, pk3));
SELECT * FROM ab AS ab1 WHERE EXISTS (
  SELECT * FROM ab AS ab2 WHERE EXISTS (
    SELECT * FROM ab AS ab3 WHERE EXISTS (
      SELECT * FROM three_pk WHERE pk1 = ab1.a and pk2 = ab2.a and pk3 = ab3.a
    )
  )
);

The simplest explanation for the root cause was this:

When Dolt optimized this query, it would transform it into multiple nested join nodes.
Each of these join nodes introduced a new table alias that could be referenced by its children.
If a join condition referenced a column, then Dolt would need to compute the offset of that column in the join node’s array of column references. This required knowing how many of the columns in the array came from parent scopes, how many came from sibling scopes, and how many were from that node’s children.
The logic for computing how many columns in the input schema came from outer scopes did not consider the fact that scopes could be introduced in the middle of a nested join. This made it impossible to correctly calculate this for every node in the AST.

There were some attempts made to account for this issue, but adjusting the calculations for common queries broke the calculations for less common queries. Ultimately, these adjustments made the logic harder to reason about. In the end, we fixed the issue by completely rewriting how we generated iterators during joins in order to make them simpler to analyze.

The full scope of the issue and the fix are more complicated, but this was the general idea.

I hope this illuminates what actually goes on inside a database engine.

As always, if you have any thoughts about database design, or if you’re curious how a version-controlled database can benefit you, then you should hop into our Discord and we’d love to discuss it with you.

Vibe-Coded Agents for Vibe-Coded Issues

Elian Deogracia-Brito — Fri, 03 Apr 2026 00:00:00 GMT

Dolt is a SQL database with Git-style version control. It speaks the MySQL wire protocol, so any MySQL client can connect to it, and it adds version control primitives on top: branch, merge, diff, clone, and push, all over SQL. Under the hood, go-mysql-server handles the SQL execution layer between clients and Dolt’s storage engine.

Gas Town builds on those primitives. It is a multi-agent coding orchestrator built on write-only code, and as it scaled to hundreds of concurrent workers, those agents were constantly branching, merging, and committing against Dolt. That load surfaced a new class of issues: vibe-coded issues.

To be clear, I’m not throwing shade. After all, I’m familiar with running agents in parallel to work through MySQL correctness work, but we’re also entering a new ballpark here. It’s hard to expect clear reproductions when hundreds of agents unpredictably use a tool at the same time. I would know, since I recently spent hours trying to get a reproduction on a couple of Gas Town Dolt issues, some leading to nowhere.

So, in the spirit of Vibe Code vs Trad Code, sometimes you fight fire with fire. I’ve “vibe-coded” grunt, a Go CLI that provisions isolated Docker agent environments in parallel but with options this time around.

The Problem: Pre-Baked Containers Don’t Scale#

The obvious answer to “Gas Town introduced these issues, why not use Gas Town to reproduce them?” is that issue reproduction isn’t really a Gas Town job. Gas Town is built for long-running, write-only work where persistent agent memory via Beads accumulates across sessions. Reproducing a GitHub issue is the opposite in that it’s a short-lived, discrete task where you spin up, get a failing test, and tear down.

My previous setup relied on a single container image with all dependencies pre-baked. That worked when I was targeting a specific repository and task. The moment I needed to handle two repos with different requirements, it broke down and I had to manually update things.

What I actually wanted was to configure at launch time via CLI flags what each agent container and repo needs: which post-install scripts to run, how much memory to give it, what prompt to start it with.

So I pointed an agent at the problem. I had it follow Go best practices from the official docs and address IDE warnings as it went, rather than letting it freewheel. The result is grunt. I can read the code, which puts it firmly in Trad Code territory even if it got there via an agent.

How grunt Works#

One command does everything:

grunt agent create -name gms -repo dolthub/go-mysql-server

create resolves the repo’s configuration, builds the Docker image with the right post-install scripts applied, clones the repo, starts the container, and drops you straight into a zellij session with Claude already running. There is no separate start step.

The nice part is that grunt agent create -repo dolthub/dolt already knows what that repo needs without me specifying anything. dolthub/dolt needs CGo build dependencies plus bats for its test suite. dolthub/go-mysql-server needs something lighter. That comes from the configuration layer, covered in the next section. Even if you don’t have a config, it’ll figure out the GitHub URL automatically for any new repositories.

If I want to override the post-install scripts or provider (only claude so far) for a specific run, I pass them directly:

grunt agent create -name dolt -repo dolthub/dolt -repo dolthub/go-mysql-server -issue 1234 \
  -post-install go,bats,dolt-cgo-deps \
  -provider claude

When I have several issues to chase at once, -d starts the container in the background and hands control back immediately. Then I attach to whichever is idle.

grunt agent create -d -name dolt -repo dolthub/dolt -issue 10782
grunt agent create -d -name gms -repo dolthub/go-mysql-server
grunt agent ls
ID                 STATE    ACTIVITY  PROVIDER  REPOS                         ISSUES              CREATED
gms-41b27c1c       running  -         claude    dolthub/go-mysql-server (+2)  10190               2026-03-27T18:20:49Z
gms-c35d524e       stopped  -         claude    dolthub/go-mysql-server (+1)  dolthub/dolt#10757  2026-03-30T20:12:06Z
dolt-d32698b0      running  working   claude    dolthub/dolt                  10782               2026-03-31T17:16:23Z
gms-a573c42c       running  idle      claude    dolthub/go-mysql-server (+1)  -                   2026-04-01T17:39:20Z
grunt agent attach gms-a573c42c

For changes I want to stick across all runs of a repo, I use grunt config set. I’ll usually do this when working on any new repo to set their post-install scripts.

grunt config set branch.prefix -value elian -repo dolthub/dolt
branch.prefix[dolthub/dolt]=elian

grunt config ls -repo dolthub/dolt
agent.provider=claude
agent.memory-limit=8g
branch.prefix=elian
issue.prompt=Create a failing reproduction for the following issue {{....
repo.post-install=go,bats,expect,dolt-cgo-deps
repo.services=

That config ls output is the resolved view of a three-layer system. The UX borrows directly from git config.

Per-Repo Config#

Rather than making users configure everything from scratch, grunt ships with per-repo defaults embedded directly in the binary using //go:embed. Each supported repo gets one JSON file covering the two things that vary most: which post-install scripts to run and a default prompt to seed the agent with.

{
  "scripts": ["go", "bats", "expect", "dolt-cgo-deps"],
  "prompt": "Create a failing reproduction for the following issue {{.IssueRef}} under the relevant available repositories..."
}

Adding a new repository means adding one JSON file. On top of those sit two user-owned layers: a global config that applies across all repos and a per-repo override. Every lookup walks the same three layers.

func (s ProfileService) AgentProvider(repo string) (string, error) {
    profile, err := s.Store.Load()
    if err != nil {
        return "", fmt.Errorf("load profile: %w", err)
    }
    if provider, ok := profile.RepoAgentProviders[repo]; ok {
        return provider, nil
    }
    return profile.AgentProvider, nil
}

This works fine for the number of config values grunt has today. It’s worth noting that the agent duplicated this walk across every accessor instead of merging the layers once at load time. A cleaner approach would resolve everything in Load() and let accessors just read fields. With only a handful of values, this is not a real problem yet, but it’s worth revisiting.
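
For comparison, here is a rough Go sketch of what resolving the layers once could look like. The Settings type and the overlay and resolve functions are hypothetical and not taken from grunt's code, and the "4g" embedded default is made up for the example.

```go
package main

import "fmt"

// Settings is a hypothetical flattened view of grunt's config values.
// An empty string means "not set at this layer".
type Settings struct {
	AgentProvider string
	MemoryLimit   string
	BranchPrefix  string
}

// overlay returns base with every non-empty field of top applied over it.
func overlay(base, top Settings) Settings {
	if top.AgentProvider != "" {
		base.AgentProvider = top.AgentProvider
	}
	if top.MemoryLimit != "" {
		base.MemoryLimit = top.MemoryLimit
	}
	if top.BranchPrefix != "" {
		base.BranchPrefix = top.BranchPrefix
	}
	return base
}

// resolve merges the three layers once, lowest priority first:
// embedded defaults, then global user config, then the per-repo override.
func resolve(embedded, global, repo Settings) Settings {
	return overlay(overlay(embedded, global), repo)
}

func main() {
	embedded := Settings{AgentProvider: "claude", MemoryLimit: "4g"} // illustrative defaults
	global := Settings{MemoryLimit: "8g"}
	repo := Settings{BranchPrefix: "elian"}

	s := resolve(embedded, global, repo)
	// Accessors become plain field reads on the resolved value.
	fmt.Println(s.AgentProvider, s.MemoryLimit, s.BranchPrefix) // prints claude 8g elian
}
```

The trade-off is that adding a config value means touching overlay once, instead of writing another accessor that repeats the layer walk.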

Adding a New Agent Type#

grunt only ships with Claude today, but it was designed so that swapping in a different agent is one interface implementation away. Each agent type implements AgentProvider, which covers everything that varies between providers, including how to set up the agent in the Docker container, where to find the agent’s config files, how to read its activity status, and what command to launch it with.

type AgentProvider interface {
    Name() string
    Spec() Spec
    PrepareRuntime(input RuntimeInput) (RuntimeOutput, error)
    ActivityStatusPath(workspaceRoot string) string
    ReadActivityStatus(workspaceRoot string) (string, error)
    EnsureGlobalConfig(paths config.Paths) error
    SaveAPIKey(key string, paths config.Paths) error
    ConfigPaths(paths config.Paths) ConfigPaths
    Validate() error
}

PrepareRuntime handles the container setup by returning the Docker mounts, environment files, and build commands specific to that provider. Spec returns the startup command and shell. Those get passed into a zellij layout file that grunt generates at runtime using Go’s text/template (which has also been useful for other templates, e.g. the Dockerfile).

layout {
    tab {
        pane name="Claude" command={{ printf "%q" .StartupCommand }}
    }
    new_tab_template {
        pane name="Shell" command={{ printf "%q" .PaneShell }}
    }
}
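
Rendering a layout like this with text/template takes only a few lines. The sketch below is an assumption about how it could be wired up, not grunt's actual code: the layoutData type, the renderLayout function, and the "claude" and "bash" values are hypothetical.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// layoutTmpl is the layout shown above, kept as a Go string constant.
const layoutTmpl = `layout {
    tab {
        pane name="Claude" command={{ printf "%q" .StartupCommand }}
    }
    new_tab_template {
        pane name="Shell" command={{ printf "%q" .PaneShell }}
    }
}
`

// layoutData holds the provider-specific values (hypothetical type).
type layoutData struct {
	StartupCommand string
	PaneShell      string
}

// renderLayout parses the template and executes it against d.
func renderLayout(d layoutData) (string, error) {
	t, err := template.New("layout").Parse(layoutTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Example values only; a real provider would supply its own.
	out, err := renderLayout(layoutData{StartupCommand: "claude", PaneShell: "bash"})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out)
}
```

The printf "%q" call wraps each value in double quotes and escapes any quotes inside it, so a command containing spaces still lands in the layout as a single string.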

Zellij is a terminal multiplexer, similar to tmux, that lets grunt give each agent its own named panes. Because the layout is generated from a template, a different provider produces a different layout with no special casing anywhere in the terminal code. Provider selection is a single switch:

func LookupAgentProvider(name string) (AgentProvider, error) {
    switch strings.ToLower(strings.TrimSpace(name)) {
    case "", domain.DefaultProvider:
        return ClaudeProvider{}, nil
    default:
        return nil, fmt.Errorf("%q: %w", name, errUnsupportedProvider)
    }
}

Conclusion#

grunt is a vibe-coded tool for reproducing issues caused by a vibe-coded tool. That is about as recursive as it gets. We could go on about each aspect of implementation, but the above represents the main goal of these ephemeral configurable agent instances. It’s readable enough, and it works.

It’s already helped me spin up multiple reproductions in parallel and close issues faster. I’ve also gotten new ideas about pointing this at flaky CI tests in the background. If you have questions about the setup or want to dig into Dolt, come find us on Discord. We are always happy to talk Go.

Branch Permissions in the Hosted Dolt Workbench

Taylor Bantle — Thu, 12 Mar 2026 00:00:00 GMT

Hosted Dolt’s SQL workbench is a great collaboration tool for teams looking for a modern, easy-to-use UI for their version-controlled database. Its permission model allows for unlimited users with varying access to your database.

In addition to user roles, Dolt supports branch permissions, which let you limit access to certain branches for all or some users. Hosted Dolt now supports branch permissions directly from the workbench UI.

Background and implementation details#

Restricting access to certain branches is a common workflow on GitHub. Users want to prevent direct writes to their main branch, instead only allowing reviewed and approved pull requests to make it into production code. We have a similar branch permission model on DoltHub, which we released a few years ago.

The online nature of Hosted Dolt is different from the offline model of DoltHub and GitHub. And unlike DoltHub, there are two layers of users on Hosted: Hosted application users (users who log into hosted.doltdb.com) and SQL users (users who connect to the SQL server or workbench).

For the workbench specifically, depending on the role of the application user (via organization roles or deployment collaborators), an internal SQL user with corresponding permissions will be used to connect to the workbench. If you have this feature enabled in your deployment settings, you’ll see these users here:

mysql> select host, user from mysql.user where user like 'hosted-ui-%';
+------+------------------+
| host | user             |
+------+------------------+
| %    | hosted-ui-admin  | -- all writes, including GRANT OPTION privilege
| %    | hosted-ui-writer | -- all writes, does not include GRANT OPTION privilege
| %    | hosted-ui-reader | -- read-only for all branches
+------+------------------+
3 rows in set (0.037 sec)

Dolt’s branch permission model initially included three available permissions - admin, write, and read. Creating a branch control for branch main with permission read would prevent all of the above users from writing to main.

One of our customers requested a feature where their organization members can make changes on a branch, create a pull request on the workbench, and merge this branch into main, but cannot write directly to main. This is more similar to the GitHub/DoltHub model and makes sense for a workbench product that enables this kind of collaboration on data.

So we added an additional branch permission to Dolt - merge. Creating a branch control for branch main with permission merge would prevent all users from writing to main, with the exception of the dolt_merge and dolt_commit SQL procedures, which are used when merging a pull request from the workbench.

While all changes in Dolt are version-controlled, a branch permission of this nature creates even greater oversight and transparency for what changes are making it into your production database.

How it works#

Suppose Tim has a Hosted database to keep track of employees and teams at DoltHub. Tim hired a Human Resources employee to manage this database, but they are not SQL savvy and he does not want them to make changes to main without reviewing their pull requests first.

First, Tim adds the HR employee as a collaborator with write permissions in his deployment settings.

Next, Tim launches the workbench and navigates to the Settings tab, which will have a “Branch Protections” form. Only deployment admins have access to this Settings tab.

He adds branch main with permission level Merge. He can add any branch or branch name pattern here.

Note that this feature only affects users with Write permission on the deployment. Readers will always have read-only access to the workbench and admins will always have full access.

Now, the new hr-employee user logs in and accesses the workbench. They have been tasked with adding a new “Human Resources” team and adding themselves as an employee. You can see that the built-in cell buttons make this easy and do not require knowledge of SQL.

However, hr-employee cannot make changes directly to main since Tim has prevented that with branch permissions.

hr-employee must create a new branch to make this change.

This time adding the “Human Resources” team is successful, since the branch protection only affects the main branch.

hr-employee adds rows to employees and employees_teams and commits these changes using the “Create commit” button. They create a pull request and send it to Tim for review.

Tim gives the LGTM, and hr-employee merges the pull request. This is successful because the branch permission allows changes to main via dolt_merge. We can now see these changes on the main branch.

Conclusion#

Hosted Dolt Workbench is a great tool for collaborating with your team on your database, and branch permissions give you even more oversight and control over what makes it into your production database. We take our customer asks seriously, so please contact us with questions or feature requests by filing an issue or reaching out on Discord.

Saying goodbye to the LD1 storage format

Zach Musgrave — Tue, 03 Mar 2026 00:00:00 GMT

We’re building Dolt, the world’s first version-controlled SQL database. Dolt hit 1.0 in 2023, which meant, among other criteria, that we promised forwards storage compatibility:

Dolt 1.0 will be backwards compatible with all further 1.X versions.

Now it’s 2026 and we’re on the verge of releasing Dolt 2.0. In preparation for this release, we’re pursuing some work we’ve long put off: finally removing support for the pre-1.0 format, which we referred to internally and in binary logs as LD1 in honor of our original company name.

Read on for details of why we took this step and how we accomplished it.

The challenge#

In 2022, we first landed Dolt’s current storage format in alpha release. The new storage format diverged radically from what came before, partially summarized here with a before and after view of how tuple values in rows are stored:

Before:

+--------+----------+---------+--------+----------+---------+-----+--------+----------+---------+
| Type 0 | Length 0 | Value 0 | Type 1 | Length 1 | Value 1 | ... | Type K | Length K | Value K |
+--------+----------+---------+--------+----------+---------+-----+--------+----------+---------+

After:

+---------+---------+-----+---------+----------+----------+-----+----------+-------+
| Value 0 | Value 1 | ... | Value K | Offset 1 | Offset 2 | ... | Offset K | Count |
+---------+---------+-----+---------+----------+----------+-----+----------+-------+

Storing tuples in the new format was part of a series of changes to how values are serialized to disk that collectively resulted in over a 5x speedup in our internal benchmarks. We pursued these changes primarily for performance reasons, and it worked: Dolt is now as fast as MySQL on sysbench.
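To illustrate the “after” layout, here is a toy Go encoder and reader that stores values back to back, followed by the start offsets of values 1 through K and a trailing count. This is a sketch of the idea in the diagram, not Dolt’s actual serialization code: the 2-byte little-endian width for offsets and the count is an assumption, as are the function names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const u16 = 2 // byte width assumed here for each offset and for the count

// encodeTuple lays fields out as: values | offsets of values 1..K | count.
func encodeTuple(fields [][]byte) []byte {
	var buf []byte
	offsets := make([]uint16, 0, len(fields))
	for i, f := range fields {
		if i > 0 {
			// Value 0 always starts at 0, so only later starts are stored.
			offsets = append(offsets, uint16(len(buf)))
		}
		buf = append(buf, f...)
	}
	for _, off := range offsets {
		buf = binary.LittleEndian.AppendUint16(buf, off)
	}
	return binary.LittleEndian.AppendUint16(buf, uint16(len(fields)))
}

// getField returns field i by reading the count and at most two offsets,
// without walking over the fields that come before it.
func getField(tup []byte, i int) ([]byte, bool) {
	if len(tup) < u16 {
		return nil, false
	}
	count := int(binary.LittleEndian.Uint16(tup[len(tup)-u16:]))
	if i < 0 || i >= count {
		return nil, false
	}
	offStart := len(tup) - u16 - (count-1)*u16 // where the offset array begins
	startOf := func(k int) int {
		switch k {
		case 0:
			return 0
		case count:
			return offStart // the last value ends where the offsets begin
		default:
			return int(binary.LittleEndian.Uint16(tup[offStart+(k-1)*u16:]))
		}
	}
	return tup[startOf(i):startOf(i+1)], true
}

func main() {
	tup := encodeTuple([][]byte{[]byte("id-7"), []byte("Tim"), {}, []byte("DoltHub")})
	for i := 0; i < 4; i++ {
		f, _ := getField(tup, i)
		fmt.Printf("field %d = %q\n", i, f)
	}
}
```

In a layout like the “before” diagram, reaching value k means stepping over the type and length headers of every value ahead of it. With the offsets gathered at the end, any field can be sliced out directly.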

But these changes came at a cost. Because the old and new storage formats were incompatible with one another, and existing paying customers were running their production databases on the old format, we needed to support both storage formats in parallel. We did this the typical way in many programming languages, by introducing interfaces that abstracted away the differences between the two implementations. For example, here’s how we define a table’s storage:

type Table interface {
	HashOf() (hash.Hash, error)
	GetSchemaHash(ctx context.Context) (hash.Hash, error)
	GetSchema(ctx context.Context) (schema.Schema, error)
	SetSchema(ctx context.Context, sch schema.Schema) (Table, error)
	GetTableRows(ctx context.Context) (Index, error)
	GetTableRowsWithDescriptors(ctx context.Context, kd, vd *val.TupleDesc) (Index, error)
	SetTableRows(ctx context.Context, rows Index) (Table, error)
	GetIndexes(ctx context.Context) (IndexSet, error)
	SetIndexes(ctx context.Context, indexes IndexSet) (Table, error)
	GetArtifacts(ctx context.Context) (ArtifactIndex, error)
	SetArtifacts(ctx context.Context, artifacts ArtifactIndex) (Table, error)
	GetAutoIncrement(ctx context.Context) (uint64, error)
	SetAutoIncrement(ctx context.Context, val uint64) (Table, error)
	DebugString(ctx context.Context, ns tree.NodeStore) string
}

Then, under the hood, we had two different implementations of a table: NomsTable for the old storage format, and DoltTable for the new one. The same pattern was repeated for all the other objects needed to materialize data to disk: schemas, indexes, foreign keys, commits, etc.

This all sounds fine so far, except that due to limitations on the time and effort we were willing to expend on this “temporary” state of affairs, these abstractions didn’t fully capture all the necessary differences between the two implementations. This is regrettable but understandable: various database operations tend to be tightly coupled to their on-disk representations for reasons of performance. This meant that, in practice, there were many places in library code where we switched on the storage type of a database. It looked like this:

		if types.IsFormat_DOLT(tm.vrw.Format()) {
			tbl, stats, err = mergeProllyTable(ctx, tm, mergeSch, mergeInfo, diffInfo)
		} else {
			tbl, stats, err = mergeNomsTable(ctx, tm, mergeSch, rm.vrw, opts)
		}

This situation made the “temporary” dual-state of two supported storage formats much harder to unwind. It wasn’t a simple matter of deleting the defunct interface implementations. Rather, we had to carefully disentangle hundreds of different library functions, most of which were not so helpfully named as in the above example, to determine which of them were still used by actual production code in the new storage format. And there were additional difficulties:

Hundreds of tests declared in the old storage format
Functionality spread across five different repositories
Thousands of databases shared publicly on DoltHub, including many of our own, using the old storage format. They would need to be migrated to the new format before we could remove support for it.

In short, removing support for the LD1 format was a daunting prospect. So why do it?

Why bother?#

Removing old code paths and deleting defunct code is a lot of work. It’s just sitting there, not hurting anyone, maybe making your binary slightly larger. Why bother?

Software engineers love clean code and they hate “tech debt”. But at DoltHub, we don’t work for ourselves; we work for our customers. Customers don’t see code, and they couldn’t care less about “tech debt.” They just want a product that works well and their features shipped on time. So if you propose to spend time “paying down tech debt” rather than delivering new features, fixing bugs, or improving performance, you need to justify it with a business reason.

In our case, the business reason was that the dual code paths made it very difficult to reason about what functionality was actually in use, which in turn made that code very difficult to change and therefore to build new features on top of. This was especially true deep in the stack, such as where we serialize data to disk.

In particular: for the Dolt 2.0 release, I am adding support for adaptive encoding, which we implemented for the Postgres-compatible version of the database, Doltgres. But Dolt should have it too, because it makes the storage and retrieval of TEXT and BLOB data much faster in a majority of use cases. Customers have been continually surprised that TEXT types have a performance penalty relative to VARCHAR, but they do. Adaptive encoding eliminates that penalty for many customers, so I want to add it.

But doing so in a way that works for existing customers requires the ability to change the encoding of a column independent of its declared SQL type. My first day digging around in the schema-encoding layer in pursuit of these changes left me with more questions than answers, and after a few more days of study I realized that a majority of the complexity and the code in this layer was in service of the old storage format. Even worse, I couldn’t change it without hunting down and eliminating the many, many places those interfaces were used. What started as a limited, targeted pruning of a single interface to make my alternate schema serialization scheme possible quickly spiralled into changes that would cause Dolt to panic if it encountered an LD1 format database.

When I saw just how far-reaching the changes required to accomplish my feature were, I became convinced it was time to bite the bullet and unwind the “temporary” dual code paths that had been in place for four years. Our 2.0 release was our last window to stop supporting the pre-1.0 storage format, which meant the time was now.

Making the changes#

Most of these changes were done the old-fashioned way: using an IDE and command line tools like grep to hunt down references to functions, then making changes by hand. There’s not really an automated way to do this kind of change at scale. Coding agents are happy to try, but because of the widespread and delicate nature of the change, you end up spending a lot of time closely examining their work, which for this kind of task can often be slower than simply using the functions of your IDE. But there were a couple exceptions where tool use sped me up:

Hundreds of test cases had been effectively defunct for several years, since they were running on a storage format that wasn’t used in production anywhere. They were testing… something. But not what we wanted. Coding agents were able to convert many of these tests for me in the background while I did other work. Because these tests weren’t doing anything useful in the first place, errors or omissions in their conversion didn’t bother me much, making this an ideal task for an LLM.
The deadcode command was useful throughout the project for finding functions and methods that were unused by any main program.
To migrate the few thousand old-format databases still on DoltHub, I wrote a bunch of scripts that called an admin-only endpoint to migrate them automatically. Where this failed due to size or other unreliability, I migrated them on my local machine with similar scripts.
Once the top-level usages of the old storage format had been safely removed, it became much more tractable to instruct coding agents to begin pruning the now-unused parts of the storage layer. Because the code base was structured in such a way to restrict this part of the code to its own packages, it was relatively quick work to see that the agent hadn’t made any inappropriate changes that could impact a production database simply by reviewing the file paths changed.

After being impressed with Claude’s result in migrating some tests to the new format, I told it to reward itself with a poem, which likewise impressed me enough to share here:

The final result: since last month, Dolt has shed around 100k lines of code, or around 20% of the repo.

% git diff --shortstat v1.81.6
   736 files changed, 22856 insertions(+), 114385 deletions(-)
% sloc .

---------- Result ------------

            Physical :  425520
              Source :  330333
             Comment :  48633
 Single-line comment :  47115
       Block comment :  1521
               Mixed :  2461
 Empty block comment :  81
               Empty :  49096
               To Do :  536

Number of files read :  1500

This makes our binary a bit smaller, which is always nice. But more importantly, it makes it much simpler to reason about various library functions and therefore to add new features. Overall I’ve spent well over a month in pursuit of this goal, which means I must be pretty certain I had a good reason for doing it. It’s satisfying work in its own right, but hard to justify unless coupled with an important business goal.

Conclusion#

The moral of this story: “temporary” changes are permanent and hard to unroll. Oftentimes, the best way to deal with them is to not deal with them at all and just live with the consequences of the past. It’s only when the weight of those past decisions becomes impossible to bear that you should take action this drastic. And even then, you should have a really important reason for doing so.

Want to learn more about Dolt, the world’s first version-controlled SQL database? Visit us on the DoltHub Discord, where our engineering team hangs out all day. Hope to see you there.

Improving Index Selection For Join Queries

Nick Tobey — Fri, 27 Feb 2026 00:00:00 GMT

We take user issues very seriously at DoltHub. We have a pledge to fix bugs in under 24 hours. This pledge is possible because a version-controlled database inherently lends itself to easy reproducibility. And once we can reproduce an issue and attach a debugger, most issues are easy to fix.

But sometimes the issue runs deeper than it seems, and what looked like a short diversion can become an entire journey.

This is the story of one of those times.

The Slow Query#

Back in September, a customer came to us with an innocuous performance issue: their SQL query was taking several minutes to run on Dolt while the same query ran in under a second on MySQL. We immediately clocked this as an issue with Dolt’s execution planner: most likely we were doing a full table scan instead of using an index. Dolt has great tooling for understanding execution plans, so we said we’d take a look.

They showed us the query: a straightforward join of five tables. It looked like this:

SELECT
  COUNT(DISTINCT t1.id1) AS id1_count
FROM table_one AS t1
WHERE EXISTS (
  SELECT 1 FROM table_two AS t2 WHERE t2.id1 = t1.id1 AND EXISTS (
    SELECT 1 FROM table_three AS t3 WHERE t3.id2 = t2.id2 AND EXISTS (
      SELECT 1 FROM table_four AS t4 WHERE t4.id3 = t3.id3 AND EXISTS (
        SELECT 1 FROM table_five AS t5 WHERE t5.id4 = t4.id4 AND LOWER(t5.name) LIKE '%foo%'))));

This is a pretty conventional query. While it contains several correlated subqueries (subqueries that reference their outer scopes), it does so in a very standard way. Any SQL engine worth its salt would transform this into a single tree of table joins. Provided that each table has an index that matches the filter expressions, each join will get implemented as a table lookup.

A join of many tables can be intimidating because of the potential for an exponential explosion in runtime. But for the vast majority of simple queries, even a join of many tables can get optimized to require only a single table scan, regardless of how many tables are being joined.

So given all that, we were surprised that Dolt wasn’t optimizing this query, especially if MySQL was. Dolt is on average faster than MySQL, and when it comes to multi-table joins, Dolt is typically better than MySQL at identifying optimal execution plans.

Whatever the cause, we figured that dolt debug would quickly reveal it and that it would be a simple fix we could knock out in less than a day.

The fact that we’re writing this article now and not back in September should tell you how wrong we were. We had no idea the scope of the rabbit hole we were about to stumble into.

As soon as we began our investigation, we learned that this innocuous-looking query was hiding something just underneath the surface. The five tables being joined weren’t actually tables but views. Views are predefined expressions that can be treated like tables in queries and are only executed when the query that references them is executed. And each of these views was itself a join of nineteen tables.

So in total, the query wasn’t joining five tables but ninety-five tables.

I won’t reproduce the view definitions here because they’re not that interesting to look at, just a giant soup of INNER JOINs and LEFT JOINs wrapped in a SELECT that extracted and renamed a dozen columns. I’m sure you can picture it.

When it comes to joining nearly a hundred tables, there’s a lot more that can go wrong:

Joining tables is a binary operation: when joining more than two tables, the engine needs to join them a pair at a time. The most efficient order to join tables is not always the same order that they appear in the query, so a good engine should reorder the tables to achieve the best results.
However, not every reordering will produce the same results. If some of the joins are outer joins (typically identified via the LEFT JOIN or FULL OUTER JOIN syntax, which were indeed present in the view definitions), then reordering the tables may change the output of the query. A good engine should not pick an ordering that produces incorrect results.
Additionally, finding the optimal join order is an NP-hard problem. This means that while heuristics can find a good order for specific cases, no known algorithm solves the general case efficiently: exact approaches scale exponentially with the number of tables and essentially require trying every possible combination.
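For a sense of scale, consider only left-deep plans, where tables are joined one after another in a chain. There are n! ways to order n tables, and the short Go sketch below computes that count. The leftDeepOrders name is made up for illustration, and this is simple counting rather than a description of how Dolt or MySQL search for plans.

```go
package main

import (
	"fmt"
	"math/big"
)

// leftDeepOrders returns n!, the number of ways to order n tables
// when they are joined one at a time in a left-deep chain.
func leftDeepOrders(n int64) *big.Int {
	return new(big.Int).MulRange(1, n)
}

func main() {
	fmt.Println("5 tables: ", leftDeepOrders(5))
	fmt.Println("19 tables:", leftDeepOrders(19))
	// For 95 tables the count is too long to read comfortably, so print its length.
	fmt.Println("95 tables: a number with", len(leftDeepOrders(95).String()), "digits")
}
```

Five tables give 120 orderings, which is trivial to enumerate. Nineteen tables already give about 1.2 × 10^17, which is why planners rely on heuristics long before a query reaches ninety-five tables.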

The fact that MySQL could execute this query quickly meant that it was using a heuristic to find the best join plan, or at least a join plan that was good enough. But Dolt was failing to do the same.

Let’s Focus on Indexes#

There’s a lot to dig into regarding how Dolt determines join ordering. But for now, we’ll focus on a single important but easily-understood concept: a plan that uses indexes is typically better than a plan that doesn’t use indexes.

This means that in order to pick the best join order, Dolt needs to identify which indexes will actually speed up the query, and which join orders will leverage those indexes. The logic for this is straightforward in concept, but implementation details vary based on the exact nature of the query. SQL has a lot of language features for writing increasingly sophisticated queries, with various implications that Dolt needs to understand in order to optimize queries correctly.

During the investigation, we realized that we were looking at not one but several issues in the join planner: situations where, because of some rarely used SQL feature, Dolt was failing to correctly identify when an index would improve a query, or when a specific join would allow an index to be used.

In some ways, this meant the query was an excellent stress test for Dolt’s analyzer. While it appears simple on the surface, it actually makes use of a number of different less-common SQL language features, such as:

Table aliases
Column aliases
Outer joins
Correlated subqueries
Views

It uses all of these features together, with multiple levels of nesting. And the join it’s attempting to optimize is large enough that failing to pick the correct plan results in a noticeable performance degradation.

If there was a query out there that would expose bugs or limitations in the execution planner, it would be this one. Investigating this query helped us uncover and fix many issues in our SQL engine.

Slaying the Hydra#

I started calling this query “The Hydra”. In Greek mythology, the hydra was a beast with many heads, and cutting off one head would cause it to grow two more. Our hydra was a join with many tables, and fixing one blocker would reveal two more in its place.

Each time we wrote a fix, we created a minimal reproduction test and verified that Dolt was now generating an optimal plan for that test case. But each time, the original query resisted our attempts to tame it. Let’s look at some of the issues that we fixed in our quest to slay the beast. These are just the issues that had to do with index selection. There’s even more work that we did beyond this, but we’ll save that for a future blog post.

PR 3380: Generate index lookups for filters that appear in lateral joins #

SQL is a declarative language: queries don’t tell the engine what to do, only what the expected output should be. This means that, aside from optional index hints, a query doesn’t tell the engine which index to use. Instead, the engine needs to be able to recognize when an index can be used.

In the simplest case, if a filter restricts a column to a constant value, the engine can use an index on that column to get only the matching rows:

-- This can be optimized into an index lookup if |name| is indexed.
DESCRIBE PLAN SELECT * FROM test_table WHERE name = "Tim";

But if the value isn’t constant, an index might not help:

-- Fetch all rows where name is in all caps
-- An index won't help here
DESCRIBE PLAN SELECT * FROM test_table WHERE name = UPPER(name);

Previously, when identifying indexes, Dolt would only use filters that compared a column to a constant, and would ignore filters that compared a column to a non-constant value. But this was overly cautious, because in the case of subqueries, some non-constant values can still be used in index lookups.

This is because when a query contains subqueries, the subquery may get evaluated multiple times. If the subquery contains references to columns or expressions in the outer query, those values might not be constant for the full duration of the main query. But as long as the value stays the same within each subquery evaluation, it’s okay if it changes between subquery evaluations.

Lateral joins are a language feature that allows expressions on one side of a join to reference columns from the other side of the join. Previously, Dolt would not use filters to guide index selection if the filter appeared in a lateral join subquery but contained references to the outer query:

-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL (SELECT * FROM t2 WHERE t1.id = t2.id) query_alias;
+--------------------------------+
| plan                           |
+--------------------------------+
| LateralCrossJoin               |
|  ├─ Table                      |
|  │   └─ name: t1               |
|  └─ SubqueryAlias              |
|      ├─ name: query_alias      |
|      └─ Filter                 |
|          ├─ (t1.id = t2.id)    |
|          └─ Table              |
|              ├─ name: t2       |
|              └─ columns: [id]  |
+--------------------------------+

Dolt now correctly treats these references as effectively constant and optimizes them:

-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL (SELECT * FROM t2 WHERE t1.id = t2.id) query_alias;

+--------------------------------+
| plan                           |
+--------------------------------+
| LateralCrossJoin               |
|  ├─ Table                      |
|  │   └─ name: t1               |
|  └─ SubqueryAlias              |
|      ├─ name: query_alias      |
|      └─ IndexedTableAccess(t2) |
|          ├─ index: [t2.id]     |
|          ├─ columns: [id]      |
|          └─ keys: t1.id        |
+--------------------------------+

Users don’t often write lateral joins, but the engine will transform certain subquery expressions into lateral joins, so optimizing lateral joins helps us optimize those subqueries too.
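The difference between the two plans can be sketched in a few lines of Go. In this toy model, joinWithScan re-checks the filter against every t2 row for each t1 row, while joinWithLookup treats the outer t1.id as a fixed key for that one evaluation and probes a map that stands in for the index on t2.id. The row type, function names, and sample data are all illustrative assumptions and have nothing to do with Dolt’s internals.

```go
package main

import "fmt"

type row struct {
	id   int
	name string
}

// joinWithScan evaluates the subquery filter against every t2 row for each t1 row.
func joinWithScan(t1, t2 []row) (pairs [][2]string, comparisons int) {
	for _, outer := range t1 {
		for _, inner := range t2 {
			comparisons++
			if inner.id == outer.id {
				pairs = append(pairs, [2]string{outer.name, inner.name})
			}
		}
	}
	return pairs, comparisons
}

// buildIndex maps each id to the positions of its rows, standing in for an index on t2.id.
func buildIndex(t []row) map[int][]int {
	idx := make(map[int][]int)
	for pos, r := range t {
		idx[r.id] = append(idx[r.id], pos)
	}
	return idx
}

// joinWithLookup uses outer.id as a constant key within each subquery evaluation.
func joinWithLookup(t1, t2 []row, idx map[int][]int) (pairs [][2]string, lookups int) {
	for _, outer := range t1 {
		lookups++
		for _, pos := range idx[outer.id] {
			pairs = append(pairs, [2]string{outer.name, t2[pos].name})
		}
	}
	return pairs, lookups
}

func main() {
	t1 := []row{{1, "a"}, {2, "b"}, {3, "c"}}
	t2 := []row{{2, "x"}, {3, "y"}, {3, "z"}, {4, "w"}}

	scanPairs, comparisons := joinWithScan(t1, t2)
	lookupPairs, lookups := joinWithLookup(t1, t2, buildIndex(t2))

	fmt.Println("scan:  ", scanPairs, "row comparisons:", comparisons)
	fmt.Println("lookup:", lookupPairs, "index lookups:", lookups)
}
```

Both functions return the same pairs. The scan does one comparison per combination of rows, while the lookup does one probe per outer row, which is the effect of replacing Filter over Table with IndexedTableAccess in the plans above.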

PR 3386: Push filters that contain references to outer scopes.#

The above example showed a query where a filter appears immediately next to the table it references. But sometimes, the filter expression and the table are separated by a join, a view, or a subquery.

In situations like these, we want to rewrite the query to move the filter closer to the table. Often this allows Dolt to use an index it otherwise couldn’t. Even if we don’t end up being able to use an index, applying the filter deeper in the plan means that we reduce the number of intermediate rows by eliminating non-matching rows early.

Just like the above example, Dolt wasn’t pushing filters if they appeared in subqueries and referenced the outer query, because it couldn’t determine that this was safe to do. This resulted in plans with filters that were evaluated later than they could have been, inflating the size of intermediate result sets:

-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL
  (SELECT * FROM t2 JOIN LATERAL
    (SELECT * FROM t3) AS t3_alias
  WHERE t3_alias.id = t1.id) AS t2_alias;
+------------------------------------+
| plan                               |
+------------------------------------+
| LateralCrossJoin                   |
|  ├─ Table                          |
|  │   └─ name: t1                   |
|  └─ SubqueryAlias                  |
|      ├─ name: t2_alias             |
|      └─ LateralCrossJoin           |
|          ├─ (t3_alias.id = t1.id)  |
|          ├─ Table                  |
|          │   ├─ name: t2           |
|          │   └─ columns: [id]      |
|          └─ TableAlias(t3_alias)   |
|              └─ Table              |
|                  ├─ name: t3       |
|                  └─ columns: [id]  |
+------------------------------------+

Dolt now correctly rewrites these queries to move the filter directly above the correct subquery table, which can then get transformed into an index lookup:

-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL
  (SELECT * FROM t2 JOIN LATERAL
    (SELECT * FROM t3) AS t3_alias
  WHERE t3_alias.id = t1.id) AS t2_alias;
+----------------------------------------+
| plan                                   |
+----------------------------------------+
| LateralCrossJoin                       |
|  ├─ Table                              |
|  │   └─ name: t1                       |
|  └─ SubqueryAlias                      |
|      ├─ name: t2_alias                 |
|      └─ LateralCrossJoin               |
|          ├─ Table                      |
|          │   ├─ name: t2               |
|          │   └─ columns: [id]          |
|          └─ TableAlias(t3_alias)       |
|              └─ IndexedTableAccess(t3) |
|                  ├─ index: [t3.id]     |
|                  ├─ columns: [id]      |
|                  └─ keys: t1.id        |
+----------------------------------------+
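Even without an index, applying a filter below a join shrinks the intermediate result. The Go sketch below models a cross join of two small integer tables with the filter applied either after or before the join, and counts the rows produced along the way. The function names and data are invented for illustration, and the real analyzer works on plan trees rather than slices.

```go
package main

import "fmt"

// filterAfterJoin builds every (t2, t3) pair, then applies the predicate on t3.
func filterAfterJoin(t2, t3 []int, key int) (matches [][2]int, intermediate int) {
	for _, a := range t2 {
		for _, b := range t3 {
			intermediate++ // every pair exists before the filter runs
			if b == key {
				matches = append(matches, [2]int{a, b})
			}
		}
	}
	return matches, intermediate
}

// filterBeforeJoin applies the predicate to t3 first, then joins only the survivors.
func filterBeforeJoin(t2, t3 []int, key int) (matches [][2]int, intermediate int) {
	var kept []int
	for _, b := range t3 {
		if b == key {
			kept = append(kept, b)
		}
	}
	for _, a := range t2 {
		for _, b := range kept {
			intermediate++
			matches = append(matches, [2]int{a, b})
		}
	}
	return matches, intermediate
}

func main() {
	t2 := []int{10, 20, 30}
	t3 := []int{1, 2, 3, 4}

	late, lateRows := filterAfterJoin(t2, t3, 3)
	early, earlyRows := filterBeforeJoin(t2, t3, 3)

	fmt.Println("filter after join: ", late, "intermediate rows:", lateRows)
	fmt.Println("filter before join:", early, "intermediate rows:", earlyRows)
}
```

The matches are identical, but the pushed-down version builds 3 intermediate rows instead of 12. Once the filter sits directly on the table, it also becomes a candidate for the index lookup shown in the plan above.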

PR 3379: Allow introspection into Views #

Views are essentially templates for subqueries. The following two queries should be equivalent, but in older versions of Dolt they produced different plans:

-- Dolt v1.75.0
-- Query without view
DESCRIBE PLAN SELECT * FROM (SELECT * FROM example_table) query_alias WHERE col = 5;
+---------------------------------------+
| plan                                  |
+---------------------------------------+
| TableAlias(query_alias)               |
|  └─ IndexedTableAccess(example_table) |
|      ├─ index: [example_table.col]    |
|      ├─ filters: [{[5, 5]}]           |
|      └─ columns: [col]                |
+---------------------------------------+

-- Query with view
CREATE VIEW query_alias AS SELECT col FROM example_table;
DESCRIBE PLAN SELECT * FROM query_alias WHERE col = 5;
+---------------------------------+
| plan                            |
+---------------------------------+
| Filter                          |
|  ├─ (query_alias.col = 5)       |
|  └─ SubqueryAlias               |
|      ├─ name: query_alias       |
|      └─ Table                   |
|          ├─ name: example_table |
|          └─ columns: [col]      |
+---------------------------------+

When parsing queries, Dolt creates data structures that help it match references in the outer query to the underlying tables in the inner query. However, as a result of the way we were parsing and evaluating views, these data structures were getting discarded, and the analyzer was forced to treat views as opaque objects. This meant that we weren’t able to match filters outside of the view to tables inside the view.

Now, both queries use an index:

-- Dolt v1.82.6
-- Query with view
CREATE VIEW query_alias AS SELECT * FROM example_table;
DESCRIBE PLAN SELECT * FROM query_alias WHERE col = 5;

+---------------------------------------+
| plan                                  |
+---------------------------------------+
| SubqueryAlias                         |
|  ├─ name: query_alias                 |
|  └─ IndexedTableAccess(example_table) |
|      ├─ index: [example_table.col]    |
|      ├─ filters: [{[5, 5]}]           |
|      └─ columns: [col]                |
+---------------------------------------+

PR 3383: When applying indexes from outer scopes, resolve references to table aliases #

Sometimes a filter refers to a table by a different name than the one the table was declared with, such as an alias. But Dolt wasn’t always considering table aliases when trying to match indexes to their tables. We added an additional analysis step to consider table aliases when attempting to match a filter in an outer scope to a table in an inner scope:

-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM t1 AS t1_alias JOIN LATERAL (SELECT * FROM t2 WHERE t1_alias.id = t2.id) AS inner_query;
+-----------------------------------+
| plan                              |
+-----------------------------------+
| LateralCrossJoin                  |
|  ├─ TableAlias(t1_alias)          |
|  │   └─ Table                     |
|  │       └─ name: t1              |
|  └─ SubqueryAlias                 |
|      ├─ name: inner_query         |
|      ├─ outerVisibility: false    |
|      ├─ isLateral: true           |
|      ├─ cacheable: false          |
|      └─ Filter                    |
|          ├─ (t1_alias.id = t2.id) |
|          └─ Table                 |
|              ├─ name: t2          |
|              └─ columns: [id]     |
+-----------------------------------+

-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM t1 AS t1_alias JOIN LATERAL (SELECT * FROM t2 WHERE t1_alias.id = t2.id) AS inner_query;
+------------------------------------+
| plan                               |
+------------------------------------+
| LateralCrossJoin                   |
|  ├─ TableAlias(t1_alias)           |
|  │   └─ Table                      |
|  │       └─ name: t1               |
|  └─ SubqueryAlias                  |
|      ├─ name: inner_query          |
|      └─ Filter                     |
|          ├─ (t1_alias.id = t2.id)  |
|          └─ IndexedTableAccess(t2) |
|              ├─ index: [t2.id]     |
|              ├─ columns: [id]      |
|              └─ keys: t1_alias.id  |
+------------------------------------+

PR 3400: Push Filters inside Projections #

Other times, a subquery renames an expression from the inner SELECT. When Dolt encountered a filter on an aliased expression, it wasn’t unwrapping the alias to see if it referred to an indexed column. This meant that we were missing opportunities to push filters down into index lookups on those tables.

-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM
  (SELECT pk AS pk_alias FROM example_table) AS example_alias
WHERE pk_alias = 1;
+-----------------------------------------------------+
| plan                                                |
+-----------------------------------------------------+
| SubqueryAlias                                       |
|  ├─ name: example_alias                             |
|  └─ Filter                                          |
|      ├─ (pk_alias = 1)                              |
|      └─ Project                                     |
|          ├─ columns: [example_table.pk as pk_alias] |
|          └─ Table                                   |
|              ├─ name: example_table                 |
|              └─ columns: [pk]                       |
+-----------------------------------------------------+

Now, Dolt can move the filter into the subquery by rewriting the filter, replacing the alias name with the original expression. This allows Dolt to identify indexes when it couldn’t before:

-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM
  (SELECT pk AS pk_alias FROM example_table) AS example_alias
WHERE pk_alias = 1;
+-------------------------------------------------+
| plan                                            |
+-------------------------------------------------+
| SubqueryAlias                                   |
|  ├─ name: example_alias                         |
|  └─ Project                                     |
|      ├─ columns: [example_table.pk as pk_alias] |
|      └─ IndexedTableAccess(example_table)       |
|          ├─ index: [example_table.pk]           |
|          ├─ filters: [{[1, 1]}]                 |
|          └─ columns: [pk]                       |
+-------------------------------------------------+
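The alias-unwrapping rewrite described above can be sketched on a toy expression tree. This is a minimal illustration, not go-mysql-server's actual analyzer code: the `Expr`, `Column`, `Literal`, and `Equals` types and the `unwrapAliases` function are all hypothetical names invented for this example.

```go
package main

import "fmt"

// Expr is a toy expression node (hypothetical, for illustration only).
type Expr interface{ String() string }

type Column struct{ Table, Name string }
type Literal struct{ Value int }
type Equals struct{ Left, Right Expr }

func (c Column) String() string {
	if c.Table == "" {
		return c.Name
	}
	return c.Table + "." + c.Name
}
func (l Literal) String() string { return fmt.Sprint(l.Value) }
func (e Equals) String() string {
	return "(" + e.Left.String() + " = " + e.Right.String() + ")"
}

// unwrapAliases rewrites a filter written against a projection's output names
// into one written against the expressions those names stand for.
// aliases maps an output name (e.g. "pk_alias") to the aliased expression.
func unwrapAliases(filter Expr, aliases map[string]Expr) Expr {
	switch e := filter.(type) {
	case Column:
		// Only unqualified names can refer to a projection alias.
		if e.Table == "" {
			if target, ok := aliases[e.Name]; ok {
				return target
			}
		}
		return e
	case Equals:
		return Equals{unwrapAliases(e.Left, aliases), unwrapAliases(e.Right, aliases)}
	default:
		return filter
	}
}

func main() {
	aliases := map[string]Expr{
		"pk_alias": Column{Table: "example_table", Name: "pk"},
	}
	filter := Equals{Column{Name: "pk_alias"}, Literal{1}}
	fmt.Println(unwrapAliases(filter, aliases)) // (example_table.pk = 1)
}
```

Once the filter mentions `example_table.pk` directly, it can be moved below the projection, where an index on that column becomes a candidate.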

And More!#

Some of the other fixes are complicated enough to warrant their own blog posts. We’ll save those for another day.

Until then, I hope this provided some valuable insight into how SQL engines work to optimize queries, and how Dolt specifically does join planning. If you want to learn more about how a version-controlled database can help manage your data, you can always join our Discord and drop us a line.

How to Write a System Prompt

Eric Richardson — Mon, 23 Feb 2026 00:00:00 GMT

We recently launched agent mode in the Dolt Workbench. It works a lot like Cursor, but for SQL workbenches instead of IDEs.

If you’re interested in trying it out, the workbench is available for download here or on the Mac and Windows app stores.

Like all agentic applications, agent mode in the workbench relies on a carefully constructed system prompt that defines the agent’s role and capabilities. In this article, we’ll discuss the dangers of the system prompt and what it took to arrive at the one we’re using today. As we’ll see, most of the hard problems were solved not by writing better instructions but rather by shifting responsibility outside of the system prompt entirely.

The Prompt#

Here’s the system prompt we landed on for the workbench:

You are a helpful assistant for a database workbench application. You have access to tools that allow you to interact with Dolt, MySQL, and Postgres databases.

If interacting with a Dolt database, use Dolt MCP tools. For MySQL and Postgres, use ‘mysql’ and ‘psql’ CLI tools in Bash.

You are currently connected to the database: "${database}". ${typeInfo}

When users ask questions about their database, use the available tools to:

List tables and their schemas

Execute SQL queries to retrieve data

Explore database structure and relationships

Help users understand their data

If the user asks you to create or modify the README.md, LICENSE.md, or AGENT.md, use the ‘dolt_docs’ system table.

Always be helpful and explain what you’re doing. Do not use emojis in your responses.

When presenting query results, format them in a readable way. For large result sets, summarize the key findings.

Let’s break down each section individually:

You are a helpful assistant for a database workbench application. You have access to tools that allow you to interact with Dolt, MySQL, and Postgres databases.

If interacting with a Dolt database, use Dolt MCP tools. For MySQL and Postgres, use ‘mysql’ and ‘psql’ CLI tools in Bash.

The most important thing you have to do in a system prompt is tell the agent what it is and what tools it should use to accomplish its goals. This does not need to be long or complicated.

You are currently connected to the database: "${database}". ${typeInfo}

This is the only bit of dynamic context being injected into the system prompt. It tells the agent the name of the database and the type (i.e. Dolt, MySQL, or Postgres). This exists so the agent immediately knows how to interact with the database. Without it, the agent would initially flounder a bit trying to figure out what type of database it’s operating on and which tools it should use.

When users ask questions about their database, use the available tools to:

List tables and their schemas

Execute SQL queries to retrieve data

Explore database structure and relationships

Help users understand their data

This section is intentionally vague. It doesn’t attempt to prescribe a workflow. It simply orients the agent towards the types of actions users are likely to request.

If the user asks you to create or modify the README.md, LICENSE.md, or AGENT.md, use the ‘dolt_docs’ system table.

This is an example of an “agent bug fix”. You should try to keep these to a minimum. In this case, we don’t yet have MCP tools for the dolt_docs table, so the agent struggles to understand how it should work without this line. If you must include something like this in a system prompt, it should be phrased similarly (i.e. “if the user asks you to…, then do…”).

Always be helpful and explain what you’re doing. Do not use emojis in your responses.

When presenting query results, format them in a readable way. For large result sets, summarize the key findings.

The final section governs tone and presentation. These instructions are relatively safe to keep in the system prompt because they don’t attempt to enforce any sort of behavioral invariant. They simply help improve the user experience. Admittedly, I’m breaking a rule I’ll discuss later on by telling the agent not to use emojis. In this case, however, there is no risk to system integrity if the model ignores that instruction. At worst, it responds with a few annoying emojis.

This is overall a fairly lean prompt. It’s also not a particularly complicated one. You may be surprised to learn that it went through well over a hundred iterations before arriving at its current state. Most of those iterations were not attempts at finding the “perfect wording” or fleshing out the most accurate “agent persona” for a SQL workbench application. Instead, they were attempts at patching flaws in the agent’s behavior. We’ll discuss at length why this is a poor strategy later on.

With long-running agentic systems, context engineering is vastly more important than prompt engineering. The goal when building these systems is to ensure that the agent’s context window contains the minimum amount of correct information necessary to accomplish any given task. The system prompt is just another piece of context. It’s a piece of context that, at least in my experience, has the potential to hurt you a lot more than help.

Offloading Context#

In my testing, I found that the more bloated the system prompt, the more likely the agent would be to outright forget things you put in there, especially for longer sessions. If at all possible, you should offload context away from the system prompt. Here’s what I mean by that.

In the early versions, agent mode did not make use of the Dolt MCP server and instead simply invoked the dolt CLI. As a result, the quality of the agent’s output depended largely on 1) its prior knowledge of Dolt and 2) its ability to use web search tools to fill in the gaps. This caused a lot of flakiness.

For instance, the agent would struggle with operating on multiple branches, often getting confused about which branch it was making changes on versus the branch that the user was connected to in the workbench. The natural solution to a problem such as this is to include explicit instructions in the system prompt about how to juggle branches. The issue then becomes that the agent starts hard overcorrecting to the instructions in the system prompt and doing things like creating a branch for every change that it makes, or refusing to make changes directly on main. Now, the issues start propagating. If the agent is making changes on multiple branches with the intention of merging all back into main, you start getting merge conflicts. There’s no clear way to solve a problem like this outside of stuffing more instructions in the system prompt. I found myself with a long checklist of items like “Don’t make changes on new branches unless the user tells you to do so” and “Don’t create new branches unnecessarily” and “There should never be merge conflicts when merging branches you’ve just created”. This basically had the opposite effect of what I intended and introduced more issues. Telling the agent NOT to do something is rarely an effective strategy.

I solved this by trimming the system prompt substantially and simply telling the agent to use the Dolt MCP server for Dolt-related operations. The MCP server comes with 40+ tools, all of which are well-documented, have defined arguments, and are queryable at any moment. These tools alone capture the overwhelming majority of Dolt’s functionality. Instead of relying on SQL queries for everything, the agent could now check its tool list for granular operations like list_dolt_diff_changes_working_set or stage_all_tables_for_dolt_commit.

Avoid adding things like this to the system prompt:

`You are currently on branch ${branchName}.`;

Instead, give the agent access to a tool like select_active_branch, which allows the agent to query for the current branch at any moment. Not only does this minimize bloat in the system prompt, it also prevents you from becoming a victim of context compaction. The agent can always make a tool call to re-query for any lost context.
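To make the difference concrete, here is a minimal sketch of a tool that reads the branch from live application state. Only the tool name `select_active_branch` comes from the text above. The `Tool`, `Session`, and `Registry` types are hypothetical and do not reflect how the workbench or the Dolt MCP server is implemented.

```go
package main

import "fmt"

// Tool is a hypothetical description of something the agent may call.
type Tool struct {
	Name        string
	Description string
	Run         func() (string, error)
}

// Session holds application state that lives outside the model's context.
type Session struct{ activeBranch string }

// Registry maps tool names to tools (hypothetical).
type Registry map[string]Tool

// NewRegistry builds the tool list for a session. The tool closes over the
// session, so every call reflects the current state rather than a snapshot.
func NewRegistry(s *Session) Registry {
	return Registry{
		"select_active_branch": {
			Name:        "select_active_branch",
			Description: "Returns the branch the session is currently on.",
			Run:         func() (string, error) { return s.activeBranch, nil },
		},
	}
}

// Call looks up a tool by name and runs it.
func (r Registry) Call(name string) (string, error) {
	t, ok := r[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", name)
	}
	return t.Run()
}

func main() {
	s := &Session{activeBranch: "main"}
	tools := NewRegistry(s)

	before, _ := tools.Call("select_active_branch")
	s.activeBranch = "feature" // the user switches branches mid-session
	after, _ := tools.Call("select_active_branch")

	fmt.Println(before, after) // main feature
}
```

A branch name interpolated into the prompt at session start would still say `main` after the switch. The tool call cannot go stale in the same way.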

Make Good Tools#

The architecture of the tools you give an agent access to plays an important role here as well. For a SQL workbench application, you could theoretically achieve the exact same functionality with just a single tool (i.e. a simple query tool that accepts arbitrary SQL). This, however, defeats the purpose. Tools are not just functional, they also act like structured bits of context.

Every tool you expose carries assumptions about how the system is meant to be used. Let’s take the create_dolt_branch tool as an example. The simple fact that this tool exists tells the agent that:

Branches are first-class concepts in Dolt
Branch creation is an intentional action
There’s a structured way to do it

Encoding system constraints in tools is a far more effective method of communicating expected behavior than encoding them in prose. Separating the behavior of your system into a robust set of tools allows for persistent “context refills” that keep your agent on course. This offloading of system context into tools resulted in a massive quality improvement in the workbench and ties into more recent developments we’ve been seeing in agentic memory. For a reference on how we split out Dolt’s functionality into tools, check out the Dolt MCP server documentation.

Don’t Say No#

I alluded to this earlier, but it’s worth discussing in greater depth because I think it’s an incredibly easy trap to fall into when writing a system prompt. This was my workflow when I was iterating on the earlier prototypes of agent mode in the workbench:

1. Ask the agent to do something nontrivial
2. Watch as the agent does something stupid
3. Add “Don’t do that stupid thing” to the system prompt
4. Go to (1)

If you don’t want an agent to perform a particular action, the most reliable solution is to make that action impossible in the first place. Of course, this is easier said than done. Since there will always be an element of nondeterminism when working with these things, there are virtually an infinite number of edge cases, and overly rigid constraints can blunt the agent’s capabilities or block legitimate workflows. The goal here isn’t to eliminate flexibility but rather to constrain the action space such that invalid states become unreachable. Here’s an example of a problem I ran into and how I solved it at the application layer rather than by adding more rules to the system prompt.

Early on, the agent would automatically decide to make Dolt commits after every write operation. This made it so the user could no longer review the agent’s changes prior to commit. I fixed this initially by adding “Don’t make commits unless the user asks you to” to the system prompt. For the most part, this worked. The agent would stop right before it would normally commit, then wait for the user to explicitly give it permission to do so. In longer sessions, however, it would forget and make commits anyway. It also started adding awkward things to its responses like “I won’t commit these changes because you haven’t asked me yet!” or would stop mid-response to ask for confirmation. This is clearly not ideal, but the deeper issue is that you can never guarantee that the agent won’t commit its changes automatically. You can reduce the probability that it happens, but you can’t eliminate it. This is a big deal for agentic applications. The more critical the system your agent is operating on (your production OLTP database, for instance), the more necessary it becomes to be able to make definitive claims about agent behavior.

I solved the problem by implementing a tool call approval workflow and putting the create_dolt_commit tool behind it.

This made it impossible for the agent to make commits without the user pressing “Confirm” first. It does not, however, block the agent from deciding to make a commit. That distinction is important. There is nothing in the agent’s system context that influences its behavior around commits. The model is still free to reason about when a commit makes sense, but it cannot unilaterally execute that decision. The final authority now lives outside the model.
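The separation of intent from execution can be sketched as a gate that sits between the model's tool request and the code that runs it. This is a hedged illustration under assumed names: `Gate`, `Approver`, and `ErrRejected` are hypothetical, and only the tool name `create_dolt_commit` comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRejected is returned when the user declines a gated tool call.
var ErrRejected = errors.New("tool call rejected by user")

// Approver asks a human whether a tool call may run (hypothetical).
type Approver func(toolName string) bool

// Gate decides whether a requested tool call is allowed to execute.
type Gate struct {
	needsApproval map[string]bool
	approve       Approver
}

// Execute runs the tool only if it is ungated or the user approved it.
// The model can request any tool; this function holds the final authority.
func (g Gate) Execute(toolName string, run func() error) error {
	if g.needsApproval[toolName] && !g.approve(toolName) {
		return ErrRejected
	}
	return run()
}

func main() {
	commits := 0
	commit := func() error { commits++; return nil }

	gate := Gate{
		needsApproval: map[string]bool{"create_dolt_commit": true},
		approve:       func(string) bool { return false }, // user has not pressed "Confirm"
	}

	err := gate.Execute("create_dolt_commit", commit)
	fmt.Println(err, commits) // tool call rejected by user 0
}
```

Nothing in this design depends on what the system prompt says. The guarantee comes from the code path, not from the model's compliance.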

Avoiding negative instructions in the system prompt is something that I predict will become a “best practice” as agentic applications become more and more common, and the most reliable way to achieve that is by separating intent from execution.

Conclusion#

In summary, prompting is difficult. If you can simplify your system prompt without hindering the agent’s access to necessary information, the quality of your agent will almost certainly improve. Offloading system context into tools and building behavioral restrictions into the application layer are the two most effective ways of doing this. If you have opinions on this, or if you just want to chat about agentic applications in general, join our Discord and give us your thoughts.

Your Time is All Messed Up: Time Implementations in Go

Angela Xie — Fri, 20 Feb 2026 00:00:00 GMT

Here at DoltHub, one of our projects is go-mysql-server, a MySQL-compatible database engine that’s written in Go and powers Dolt, the world’s first version-controlled database. In go-mysql-server, we often rely on Go standard libraries, but sometimes, we have to work around them to get the same behavior as MySQL. Previously, I blogged about updating our value of zero time to be more aligned with MySQL. In this blog, I’ll explain why we had to move away from using Go’s func (time.Time) Sub and the considerations we had to make for Go’s implementations of time in the time.Time struct when implementing our own time difference function.

The Bug#

TIMESTAMPDIFF is a function in MySQL that takes three arguments (a time unit and two datetime expressions) and calculates the difference in the specified unit between the two datetime expressions. I had noticed that Dolt and go-mysql-server’s TIMESTAMPDIFF function was not returning the correct values for times that were sufficiently far apart and that it would return the same incorrect value for a given unit argument.

Our implementation of TIMESTAMPDIFF would convert the datetime expression arguments into two time.Time structs (time1 and time2), get the difference between the two times by calling time2.Sub(time1), and convert the difference to the correct unit. The root of the bug was the call to func (time.Time) Sub.

The Problem with `func (time.Time) Sub`#

The function signature for func (time.Time) Sub looks like this:

func (t Time) Sub(u Time) Duration

The function calculates the difference between Time t and Time u as a Duration. A Duration is an int64 representing nanoseconds. As an int64, its largest value is 9,223,372,036,854,775,807 nanoseconds, or approximately 292 years. When the true difference falls outside that range, Sub returns the maximum (or minimum) Duration instead of the real value. This explained why the result of TIMESTAMPDIFF seemed to be stuck at the same value for a given unit.

Because TIMESTAMPDIFF needed to work for any time arguments between 0000-01-01 00:00:00 and 9999-12-31 23:59:59.999999, we could no longer rely on func (time.Time) Sub to calculate time differences.
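The saturation is easy to reproduce. In this small sketch, `yearsApart` is a helper invented for the example. It shows that two very different spans both collapse to the same maximum Duration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// yearsApart returns the Duration between January 1 of two years, as computed
// by Time.Sub. (Helper name is made up for this example.)
func yearsApart(from, to int) time.Duration {
	start := time.Date(from, time.January, 1, 0, 0, 0, 0, time.UTC)
	end := time.Date(to, time.January, 1, 0, 0, 0, 0, time.UTC)
	return end.Sub(start)
}

func main() {
	short := yearsApart(1000, 2000) // 1,000 years apart
	long := yearsApart(1, 9999)     // 9,998 years apart

	// Both exceed the range of a Duration, so Sub returns the maximum for each.
	fmt.Println(short == time.Duration(math.MaxInt64)) // true
	fmt.Println(short == long)                         // true
}
```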

Calculating the Difference in Microseconds#

Thankfully, MySQL doesn’t care about nanoseconds – the smallest time unit that MySQL handles is microseconds. The largest time difference value we needed to handle was between 0000-01-01 00:00:00 and 9999-12-31 23:59:59.999999, which is 315,569,433,599,999,999 microseconds, a number small enough to fit inside an int64. So integer overflow was no longer something we needed to consider.

We calculate the difference between two times in microseconds by converting them to microseconds since Unix epoch using func (Time) UnixMicro and then taking the difference.

func microsecondsDiff(time1, time2 time.Time) int64 {
	return time2.UnixMicro() - time1.UnixMicro()
}
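As a quick sanity check of this approach, the sketch below applies `microsecondsDiff` to two dates a thousand years apart, a span that `Sub` cannot represent. The `jan1` helper and the day conversion are additions for this example and are not part of our implementation.

```go
package main

import (
	"fmt"
	"time"
)

func microsecondsDiff(time1, time2 time.Time) int64 {
	return time2.UnixMicro() - time1.UnixMicro()
}

// jan1 returns midnight UTC on January 1 of the given year (example helper).
func jan1(year int) time.Time {
	return time.Date(year, time.January, 1, 0, 0, 0, 0, time.UTC)
}

func main() {
	const microsPerDay = 24 * 60 * 60 * 1_000_000

	diff := microsecondsDiff(jan1(1000), jan1(2000))

	// 1,000 proleptic Gregorian years starting at year 1000 contain 242 leap
	// days, so the span is 365,242 days.
	fmt.Println(diff / microsPerDay) // 365242
}
```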

We’ve recently been very invested in Dolt’s performance compared to MySQL’s, and converting the times to microseconds since Unix epoch didn’t seem like the most performant solution. Why couldn’t we just calculate the times in microseconds directly, without the conversion? Why the extra step? Well, this comes down to Go’s implementation of time.Time.

A Tale of Two (and Sometimes Three) Epochs#

In order to calculate the difference between two times, they need to be normalized to the same epoch. An epoch is a fixed time reference point, and times in computing are typically stored as numbers representing some unit of time elapsed since an epoch. If two times do not have the same epoch, you’re not going to get the correct difference simply by subtracting them. It’s just math (the proof is left as an exercise to the reader).

In the time.Time struct, Go uses two different epochs depending on whether a time is monotonic or not: January 1, 1885 for monotonic time and January 1, 0001 for other time. January 1, 1885 seems to be a reference to Back to the Future II.

Two epochs! What a mess!

Because of these two different epochs, Go doesn’t have exported public functions that expose time values directly.

When calculating time differences using func (Time) Sub, Go uses func (Time) sec to normalize times to seconds since the January 1, 0001 epoch. func (Time) sec, combined with func (Time) Nanosecond to calculate microseconds, is what we want to use to be the most performant, but it’s an unexported private function that can’t be used outside of the time package. It seems like Go wants to keep their underlying epochs secret. Instead, we have to rely on the limited exported public functions, and func (Time) UnixMicro is our best option, despite its runtime – it first converts times using func (Time) sec before converting them again to time since the Unix epoch.
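Although the internal representation is hidden, you can observe from outside the package whether a value carries a monotonic reading. The sketch below relies on two documented behaviors: `Time.String` appends an `m=` field when a monotonic reading is present, and `t.Round(0)` strips that reading. The `hasMonotonic` and `demo` helpers are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// hasMonotonic reports whether t carries a monotonic clock reading, detected
// through the " m=" field that Time.String adds for debugging.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// demo checks three kinds of values: a fresh time.Now(), the same value with
// its monotonic reading stripped, and a time constructed from calendar fields.
func demo() (now, stripped, constructed bool) {
	t := time.Now()
	c := time.Date(2026, time.February, 20, 0, 0, 0, 0, time.UTC)
	return hasMonotonic(t), hasMonotonic(t.Round(0)), hasMonotonic(c)
}

func main() {
	now, stripped, constructed := demo()
	fmt.Println(now, stripped, constructed) // true false false
}
```

Times parsed from SQL datetime values are built from calendar fields, so they fall into the last category and never carry a monotonic reading.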

Conclusion#

In the end, we were able to fix our TIMESTAMPDIFF implementation to return the correct values, even if Go had to do some extra time conversions under the hood. If you’re interested in learning more about the time package, you can read the documentation or dig into the source code.

Found a bug in Dolt or go-mysql-server? File an issue, or join our Discord server!
