<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <!-- Source: https://www.etsy.com/codeascraft/rss -->
  <channel>
    <title>Etsy Engineering | Code as Craft</title>
    <link>https://siftrss.com/f/q8X9B8099Wv</link>
    <description>The engineers who make Etsy make our living with a craft we love: software. This is where we'll write about our craft and our collective experience building and running the world's most vibrant handmade marketplace.</description>
    <pubDate>Tue, 07 Apr 2026 06:20:21 -0400</pubDate>
    <language>en-US</language>
    <atom:link href="https://siftrss.com/f/q8X9B8099Wv" rel="self" type="application/rss+xml"/>
    <image title="Etsy Engineering | Code as Craft" link="http://www.etsy.com/codeascraft/rss" url="https://etsy.com/images/blogs/code_as_craft/cac_logo_lavender.png"/>
    <item>
      <title><![CDATA[Migrating Etsy’s database sharding to Vitess]]></title>
      <link>https://www.etsy.com/codeascraft/migrating-etsyas-database-sharding-to-vitess?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
      <dc:creator><![CDATA[Ella Yarmo-Gray]]></dc:creator>
      <pubDate>Thu, 19 Mar 2026 17:03:54 -0400</pubDate>
      <guid isPermaLink="true">https://www.etsy.com/codeascraft/migrating-etsyas-database-sharding-to-vitess?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
      <description><![CDATA[<p>Etsy has maintained a sharded MySQL architecture since around 2010. This database cluster contains most of Etsy’s online data and is made up of ~1,000 tables distributed across ~1,000 shards. Over the last 16 years, it has grown significantly: combined, these tables have over 425 TB of data and receive roughly 1.7 million requests per second. </p>
<p>Etsy engineers access our MySQL data through a proprietary object-relational mapping (ORM). The ORM has a corresponding model for each MySQL table. When a table is sharded, its rows are partitioned among different databases known as shards. Each shard has an identical table schema and contains a distinct subset of all rows. Sharded tables require a unique id field to be designated on the model as the “shardifier id,” which determines the shard for each record.</p>
<p>Shardifier ids are designed to co-locate related data in order to minimize the number of databases we need to rely on for a given query. For example, all records related to a single shop or user are stored on the same shard. The majority of models use shop_id or user_id as the shardifier id, but there are over 30 different options in use. Prior to Vitess, the ORM stored mappings of records to shards in a single, unsharded “index” database. When a record was created, the shard mapping for its id was determined randomly by the ORM and then stored in a table on the index database so it could be retrieved again later. When a query was sent to a sharded table using the ORM, the ORM first queried the index database to retrieve the shard mapping, and then targeted the desired query to the right shard.</p>
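<p>The legacy two-step routing flow above can be sketched as follows: assign a random shard on creation, persist the mapping in the index database, and consult that mapping on every later query. This is a hypothetical stand-in, not Etsy's ORM; all names are illustrative.</p>

```python
import random

NUM_SHARDS = 1000  # illustrative; the real cluster has ~1,000 shards

# Stand-in for the unsharded "index" database: a table mapping each
# shardifier id to its randomly assigned shard.
index_db = {}

def assign_shard(shardifier_id):
    """On record creation, pick a shard at random and persist the mapping."""
    shard = random.randrange(NUM_SHARDS)
    index_db[shardifier_id] = shard
    return shard

def route_query(shardifier_id):
    """On read, fetch the stored mapping first, then target that shard."""
    return index_db[shardifier_id]

shard = assign_shard(42)
assert route_query(42) == shard  # later queries always land on the same shard
```

<p>The extra round trip in <code>route_query</code> is precisely the dependency on the index database that the migration described below removes.</p>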
<p>This shard architecture brought two major benefits: scalability and resilience against host outages. The impact of a database outage was greatly reduced compared to an unsharded database – we only lost 1/1000 of traffic to the distributed database if a shard went down. However, it also had some drawbacks:</p>
<ul>
<li>The process for horizontally scaling was manual, complex, and took months to complete. </li>
<li>We were dependent on a single point of failure – the index database. If the index database went down, it could cause a full site outage since the ORM would no longer be able to route queries to any shard.</li>
<li>Product developers were required to understand and manage sharding when accessing data, which could create friction since it was often confusing and toilsome.</li>
</ul>
<p>As Etsy and the index database grew, we saw an increase in incidents when it was unavailable, and addressing these vulnerabilities became a high priority. <a href="https://vitess.io/docs/24.0/overview/whatisvitess/">Vitess</a>, open-source software that provides an abstraction layer for scaling, deploying, and managing large MySQL clusters, offered a solution to these problems.</p>
<h2>Introducing Vitess</h2>
<p>In 2018, we added Vitess to our database architecture. Prior to this change, the ORM issued queries directly to the MySQL database for a specific shard. Now, each query was sent to Vitess, which would then query MySQL. At first, the ORM told Vitess which shard to query so we could validate that the new components worked in our environment before adding shard logic into Vitess. </p>
<p>With this new infrastructure in place, we were ready to start exploring <a href="https://vitess.io/docs/24.0/reference/features/vindexes/">vindexes</a>, which define sharding strategies within Vitess. Vindexes provide a way to calculate a shard from a column value, akin to how we mapped shardifier ids to shard numbers in our index database. To familiarize ourselves with them, we <a href="https://www.etsy.com/codeascraft/search/Scaling%2520Etsy%2520Payments?ref=codeascraft">scaled several unsharded payments tables</a> by creating a new sharded database cluster that could use Vitess vindexes out-of-the-box. The success of that project paved the way for us to migrate our existing, in-house shard infrastructure to Vitess vindexes.</p>
<h2>Choosing a vindex</h2>
<p>Vitess provides a number of vindexes. We were most interested in those that calculated the shard using an algorithm like the <a href="https://vitess.io/docs/23.0/reference/features/vindexes/#predefined-vindexes">hash vindex</a>, which would remove the reliance on an additional data store for shard lookups. Since the ORM’s shard mappings are random and not algorithmic, using one of these out-of-the-box would require re-sharding all of our data – a process that would be manual and likely take years. Instead, we chose to write custom vindexes that ported our existing shard logic into Vitess so we could test how vindexes worked in our environment without the complexity and risk of moving data.</p>
<p>The first step of creating our custom vindexes was to modify the shard assignment algorithm in the ORM to match the Vitess hash vindex algorithm. After that change, no new shard mappings required an index database lookup, so we could store all of the existing mappings in a read-only database. We chose to use SQLite, which provided low latency reads and a small enough footprint to copy the database directly onto each Vitess server – avoiding the latency and dependency added by a call to an external database. Then we built a custom SQLite lookup vindex, which looks up shard information from a SQLite database.</p>
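<p>The SQLite lookup idea can be sketched with Python's built-in <code>sqlite3</code> module: a read-only table of frozen id-to-shard mappings, copied onto each server and queried locally. The schema and function names here are assumptions for illustration, not Etsy's actual vindex code.</p>

```python
import sqlite3

# In-memory database standing in for the frozen shard-mapping snapshot
# copied onto each Vitess server (schema illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shard_map (shardifier_id INTEGER PRIMARY KEY, shard INTEGER)")
conn.executemany("INSERT INTO shard_map VALUES (?, ?)", [(1, 42), (2, 7)])

def sqlite_lookup_vindex(shardifier_id):
    """Resolve a shardifier id to its shard via a local SQLite read."""
    row = conn.execute(
        "SELECT shard FROM shard_map WHERE shardifier_id = ?", (shardifier_id,)
    ).fetchone()
    return row[0] if row else None

assert sqlite_lookup_vindex(1) == 42
assert sqlite_lookup_vindex(999) is None  # unmapped ids fall through
```

<p>Because the database file lives on the same host as the lookup, reads avoid any network hop to an external store.</p>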
<p>Next, we built a custom hybrid vindex, which allows a table to apply one of two vindexes based on a threshold value – in this case, the SQLite vindex or the hash vindex. We set the threshold to the first id that was created after we changed the shard assignment algorithm. Any id that is above this value would use the hash vindex to look up its shard mapping, and any id that is below it would use the SQLite vindex.</p>
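<p>The hybrid vindex's threshold logic can be sketched as below. Ids created before the algorithm change resolve via a lookup table (standing in for the SQLite vindex); newer ids use an algorithmic function (standing in for the Vitess hash vindex, whose real hash differs). The threshold and mappings are illustrative values, not Etsy's.</p>

```python
NUM_SHARDS = 1000
THRESHOLD_ID = 5_000_000_000  # first id created after the algorithm change

legacy_lookup = {123: 7, 456: 311}  # frozen pre-threshold id -> shard map

def hash_vindex(record_id):
    return record_id % NUM_SHARDS  # placeholder for the real hash function

def hybrid_vindex(record_id):
    if record_id >= THRESHOLD_ID:
        return hash_vindex(record_id)   # newer ids: pure computation
    return legacy_lookup[record_id]     # older ids: lookup path

assert hybrid_vindex(123) == 7  # below threshold: lookup path
assert hybrid_vindex(THRESHOLD_ID + 1) == (THRESHOLD_ID + 1) % NUM_SHARDS
```

<p>The key property is that the threshold makes the choice deterministic per id, so a record never flips between the two resolution paths.</p>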
<h2>Introducing vindexes into our environment</h2>
<p>We were now able to introduce Vitess vindexes into our existing architecture without moving any data. Before doing so, we needed to ensure that all queries currently in use with legacy ORM sharding would still succeed and return the same results with Vitess-managed sharding. Some incompatibility was expected: Vitess requires the shardifier id to be present in the WHERE clause of a SQL query in order to route it to the correct shard, but this was not previously required by the ORM. In fact, since Etsy developers were required to provide this information separate from the SQL query in our legacy architecture, many queries did not include it in the WHERE clause. More detail on how we found and solved query incompatibilities can be found in <a href="https://www.etsy.com/codeascraft/scaling-etsy-payments-with-vitess-part-3--reducing-cutover-risk">Scaling Etsy Payments with Vitess, Part 3</a>.</p>
<p>The ORM contained over a decade of queries with many different ways of accessing MySQL, which made auditing each and every query time-consuming. Building enough context to test changes to a model was also challenging since each table varied greatly in the design and purpose of its data. Given this, we decided to take an incremental approach and introduce vindexes into our codebase one table at a time. This would allow us to test on a smaller scale, ensuring each type of data access pattern would succeed with Vitess by monitoring how vindexes performed with a specific set of queries. We could slowly and safely surface incompatibilities while we made progress on the migration, allowing room for error without risking large-scale outages.</p>
<p>Since we would need to repeat the migration for hundreds of tables, we prioritized creating a clear and repeatable process. <a href="https://www.etsy.com/codeascraft/understanding-the-collective-impact-of-experiments">Etsy’s experimentation framework</a> enabled us to gradually ramp up the change by incrementally increasing the percentage of traffic that was using Vitess vindexes for shard routing for each table. This enabled us to assess impact by comparing the performance of a query when it was using vindexes vs ORM shard routing. If we encountered any issues along the way, it was easy to ramp back down to 0. Minimizing the risk of each model ramp by starting small and having a quick way to roll back helped mitigate the risk of not testing all queries upfront.</p>
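<p>Percentage-based ramping of this kind is commonly implemented with stable hash bucketing, as in the hedged sketch below: each request hashes to a deterministic bucket in [0, 100), and buckets below the ramp percentage take the vindex code path. This illustrates the general technique, not Etsy's actual experimentation framework.</p>

```python
import hashlib

def use_vindex_routing(table, request_id, ramp_percent):
    """Decide the routing path for one request, stable per (table, id)."""
    digest = hashlib.sha256(f"{table}:{request_id}".encode()).digest()
    bucket = digest[0] % 100  # deterministic bucket in [0, 100)
    return bucket < ramp_percent

assert not use_vindex_routing("receipts", 1, 0)  # ramped down: no traffic
assert use_vindex_routing("receipts", 1, 100)    # fully ramped: all traffic
```

<p>Hashing rather than sampling randomly means a given request is routed consistently at a fixed ramp level, which keeps before/after performance comparisons clean.</p>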
<h2>Working through challenges: database transactions</h2>
<p>Early on we ran into a challenge with this approach when handling database transactions. With our in-house shard routing, the ORM treated the shards as many separate databases and queried specific shards directly. With Vitess-managed sharding, the ORM queried the shards as if they were a single database, and they were only treated as separate databases behind the scenes by Vitess. This meant that the two approaches created different connections to the database, which caused some issues for database transactions, since atomicity is only guaranteed per connection. To avoid data integrity issues that could come from breaking transaction guarantees, we required tables written to in a single database transaction to be ramped onto vindexes at the same time so they would use the same connection.</p>
<p>This was simple enough in theory, but ended up being pretty consequential. A small number of tables representing our most complex and critical data models, like receipts, listings, and transactions, makes up a large portion of database traffic to the sharded cluster. As we were preparing one of them to use vindexes, we realized that just 27 models (3% of tables) accounted for ⅓ of our database traffic – and they were all connected via database transactions. Despite our best efforts to minimize risk by ramping table-by-table, some of the riskiest changes were tightly coupled anyway. We collaborated across the company to ramp these 27 models in unison.
<img alt="Percentage of traffic using vindexes" src="https://i.etsystatic.com/inv/3f0fee/7870795065/inv_fullxfull.7870795065_hr9alodj.jpg?version=0" title="Percentage of traffic using vindexes" /></p>
<h2>Reaping the benefits: cross-shard queries</h2>
<p>One major advantage of using Vitess vindexes is the ability to query across shards. By default, any query that does not contain the shardifier id will "scatter" – Vitess will send it to all shards in parallel, sort the results, and return a single result set. This is a powerful tool – one model’s query time was reduced from ~2 seconds to ~20 ms by using cross-shard queries! However, at Etsy’s scale, unintentionally sending an expensive query to all 1,000 shards at once could quickly become an issue. To protect against these scenarios, we disabled scatter queries by default in our environment and provided a way to specify that a query is allowed to scatter via the ORM. </p>
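<p>The scatter behavior just described can be sketched as a parallel fan-out with an opt-in flag, mirroring the disabled-by-default policy. The shard data, predicate, and function names below are illustrative, not the Vitess implementation.</p>

```python
from concurrent.futures import ThreadPoolExecutor

shards = {0: [("b", 2)], 1: [("a", 1)], 2: [("c", 3)]}  # toy shard data

def query_shard(shard_id, predicate):
    return [row for row in shards[shard_id] if predicate(row)]

def scatter_query(predicate, allow_scatter=False):
    """Fan the query out to every shard in parallel, then merge and sort."""
    if not allow_scatter:
        raise RuntimeError("scatter queries are disabled by default")
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: query_shard(s, predicate), shards)
    return sorted(row for part in partials for row in part)

rows = scatter_query(lambda row: row[1] >= 1, allow_scatter=True)
assert rows == [("a", 1), ("b", 2), ("c", 3)]
```

<p>Requiring an explicit flag keeps an accidental full-cluster fan-out from reaching all 1,000 shards at once.</p>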
<p>Utilizing scatter queries brought notable improvements to bulk primary key lookups in the 27 table ramp mentioned above. The ORM’s previous implementation took an array of primary keys, batched them by shard, issued a query to each shard, and returned the combined results. Since Vitess enables us to query multiple shards in a single query, we can skip the batching by shard and include all primary keys in a single query. This greatly reduced the number of queries issued for bulk lookups for some models.
<img alt="Number of bulk lookup queries for one model" src="https://i.etsystatic.com/inv/4da339/7822693132/inv_fullxfull.7822693132_7yjbrc8p.jpg?version=0" title="Number of bulk lookup queries for one model" /><em>Bulk lookup queries for one model during its ramp onto vindexes. Purple lines denote 1%, 10%, 50%, and 100% of traffic to the model using vindexes.</em></p>
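<p>The before/after contrast in bulk lookups can be sketched with a toy three-shard mapping, where <code>shard_of()</code> stands in for the vindex. All names here are illustrative, not Etsy's ORM internals.</p>

```python
from collections import defaultdict

def shard_of(pk):
    return pk % 3  # toy vindex over 3 shards

def bulk_lookup_legacy(pks):
    """Old ORM path: batch keys by shard, then issue one query per shard."""
    batches = defaultdict(list)
    for pk in pks:
        batches[shard_of(pk)].append(pk)
    return [f"SELECT ... WHERE id IN {tuple(batch)}" for batch in batches.values()]

def bulk_lookup_vitess(pks):
    """Vitess path: one query with every key; Vitess fans it out by vindex."""
    return [f"SELECT ... WHERE id IN {tuple(pks)}"]

pks = [1, 2, 3, 4, 5, 6]
assert len(bulk_lookup_legacy(pks)) == 3  # keys span 3 shards -> 3 queries
assert len(bulk_lookup_vitess(pks)) == 1  # always a single query
```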
<h2>Conclusion</h2>
<p>Five years, approximately 2,500 pull requests and 6,000 queries later, we have successfully migrated Etsy’s shard management to Vitess vindexes! Despite the work we put in to streamline the migration process, it was still a challenge to replace the database infrastructure for a codebase of Etsy’s scale and age. As infrastructure engineers, we usually had minimal context around the code we were changing and what might break if things went wrong. However, with coordination and careful testing across Etsy engineering, we were able to meet our goals:</p>
<ul>
<li>Scaling operations are no longer manual and can be performed in a matter of days rather than months.</li>
<li>We removed the index database as a single point of failure for our sharded cluster.</li>
<li>Shard infrastructure is now abstracted away from developers, which makes data modeling and query writing much simpler.</li>
<li>Database performance stayed largely the same as it was prior to the migration, rendering the change invisible to our end users.</li>
<li>We were able to progress incrementally, roll back quickly, and integrate Vitess into our shard architecture without performing a massive data migration.</li>
</ul>
<p>We are excited to use all the new Vitess features that migrating to vindexes has unlocked – for example, re-sharding our data, rebalancing data across the shards, and using Vitess’s MoveTables operation to shard unsharded tables. Swapping out such critical and complex infrastructure with almost no downtime or impact on our users was an extremely challenging task, but it was also extremely gratifying.</p>
<h2>Acknowledgements</h2>
<p>This project was a shared accomplishment of the Data Access Platform team: Jessica Chen, Samantha Drago-Kramer, Hermes Garcia, Sam Kenny, David Leibovic, Kyle Leiby, Benjamin Mariscal, Juan Ortega, Adam Saponara, Wendy Sung, and Stephanie Wu. Thank you to everyone who contributed to the vindex project across engineering!</p>]]></description>
      <content:encoded><![CDATA[<p>Etsy has maintained a sharded MySQL architecture since around 2010. This database cluster contains most of Etsy’s online data and is made up of ~1,000 tables distributed across ~1,000 shards. Over the last 16 years, it has grown significantly: combined, these tables have over 425 TB of data and receive roughly 1.7 million requests per second. </p>
<p>Etsy engineers access our MySQL data through a proprietary object-relational mapping (ORM). The ORM has a corresponding model for each MySQL table. When a table is sharded, its rows are partitioned among different databases known as shards. Each shard has an identical table schema and contains a distinct subset of all rows. Sharded tables require a unique id field to be designated on the model as the “shardifier id,” which determines the shard for each record.</p>
<p>Shardifier ids are designed to co-locate related data in order to minimize the number of databases we need to rely on for a given query. For example, all records related to a single shop or user are stored on the same shard. The majority of models use shop_id or user_id as the shardifier id, but there are over 30 different options in use. Prior to Vitess, the ORM stored mappings of records to shards in a single, unsharded “index” database. When a record was created, the shard mapping for its id was determined randomly by the ORM and then stored in a table on the index database so it could be retrieved again later. When a query was sent to a sharded table using the ORM, the ORM first queried the index database to retrieve the shard mapping, and then targeted the desired query to the right shard.</p>
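<p>The legacy two-step routing flow above can be sketched as follows: assign a random shard on creation, persist the mapping in the index database, and consult that mapping on every later query. This is a hypothetical stand-in, not Etsy's ORM; all names are illustrative.</p>

```python
import random

NUM_SHARDS = 1000  # illustrative; the real cluster has ~1,000 shards

# Stand-in for the unsharded "index" database: a table mapping each
# shardifier id to its randomly assigned shard.
index_db = {}

def assign_shard(shardifier_id):
    """On record creation, pick a shard at random and persist the mapping."""
    shard = random.randrange(NUM_SHARDS)
    index_db[shardifier_id] = shard
    return shard

def route_query(shardifier_id):
    """On read, fetch the stored mapping first, then target that shard."""
    return index_db[shardifier_id]

shard = assign_shard(42)
assert route_query(42) == shard  # later queries always land on the same shard
```

<p>The extra round trip in <code>route_query</code> is precisely the dependency on the index database that the migration described below removes.</p>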
<p>This shard architecture brought two major benefits: scalability and resilience against host outages. The impact of a database outage was greatly reduced compared to an unsharded database – we only lost 1/1000 of traffic to the distributed database if a shard went down. However, it also had some drawbacks:</p>
<ul>
<li>The process for horizontally scaling was manual, complex, and took months to complete. </li>
<li>We were dependent on a single point of failure – the index database. If the index database went down, it could cause a full site outage since the ORM would no longer be able to route queries to any shard.</li>
<li>Product developers were required to understand and manage sharding when accessing data, which could create friction since it was often confusing and toilsome.</li>
</ul>
<p>As Etsy and the index database grew, we saw an increase in incidents when it was unavailable, and addressing these vulnerabilities became a high priority. <a href="https://vitess.io/docs/24.0/overview/whatisvitess/">Vitess</a>, open-source software that provides an abstraction layer for scaling, deploying, and managing large MySQL clusters, offered a solution to these problems.</p>
<h2>Introducing Vitess</h2>
<p>In 2018, we added Vitess to our database architecture. Prior to this change, the ORM issued queries directly to the MySQL database for a specific shard. Now, each query was sent to Vitess, which would then query MySQL. At first, the ORM told Vitess which shard to query so we could validate that the new components worked in our environment before adding shard logic into Vitess. </p>
<p>With this new infrastructure in place, we were ready to start exploring <a href="https://vitess.io/docs/24.0/reference/features/vindexes/">vindexes</a>, which define sharding strategies within Vitess. Vindexes provide a way to calculate a shard from a column value, akin to how we mapped shardifier ids to shard numbers in our index database. To familiarize ourselves with them, we <a href="https://www.etsy.com/codeascraft/search/Scaling%2520Etsy%2520Payments?ref=codeascraft">scaled several unsharded payments tables</a> by creating a new sharded database cluster that could use Vitess vindexes out-of-the-box. The success of that project paved the way for us to migrate our existing, in-house shard infrastructure to Vitess vindexes.</p>
<h2>Choosing a vindex</h2>
<p>Vitess provides a number of vindexes. We were most interested in those that calculated the shard using an algorithm like the <a href="https://vitess.io/docs/23.0/reference/features/vindexes/#predefined-vindexes">hash vindex</a>, which would remove the reliance on an additional data store for shard lookups. Since the ORM’s shard mappings are random and not algorithmic, using one of these out-of-the-box would require re-sharding all of our data – a process that would be manual and likely take years. Instead, we chose to write custom vindexes that ported our existing shard logic into Vitess so we could test how vindexes worked in our environment without the complexity and risk of moving data.</p>
<p>The first step of creating our custom vindexes was to modify the shard assignment algorithm in the ORM to match the Vitess hash vindex algorithm. After that change, no new shard mappings required an index database lookup, so we could store all of the existing mappings in a read-only database. We chose to use SQLite, which provided low latency reads and a small enough footprint to copy the database directly onto each Vitess server – avoiding the latency and dependency added by a call to an external database. Then we built a custom SQLite lookup vindex, which looks up shard information from a SQLite database.</p>
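<p>The SQLite lookup idea can be sketched with Python's built-in <code>sqlite3</code> module: a read-only table of frozen id-to-shard mappings, copied onto each server and queried locally. The schema and function names here are assumptions for illustration, not Etsy's actual vindex code.</p>

```python
import sqlite3

# In-memory database standing in for the frozen shard-mapping snapshot
# copied onto each Vitess server (schema illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shard_map (shardifier_id INTEGER PRIMARY KEY, shard INTEGER)")
conn.executemany("INSERT INTO shard_map VALUES (?, ?)", [(1, 42), (2, 7)])

def sqlite_lookup_vindex(shardifier_id):
    """Resolve a shardifier id to its shard via a local SQLite read."""
    row = conn.execute(
        "SELECT shard FROM shard_map WHERE shardifier_id = ?", (shardifier_id,)
    ).fetchone()
    return row[0] if row else None

assert sqlite_lookup_vindex(1) == 42
assert sqlite_lookup_vindex(999) is None  # unmapped ids fall through
```

<p>Because the database file lives on the same host as the lookup, reads avoid any network hop to an external store.</p>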
<p>Next, we built a custom hybrid vindex, which allows a table to apply one of two vindexes based on a threshold value – in this case, the SQLite vindex or the hash vindex. We set the threshold to the first id that was created after we changed the shard assignment algorithm. Any id that is above this value would use the hash vindex to look up its shard mapping, and any id that is below it would use the SQLite vindex.</p>
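<p>The hybrid vindex's threshold logic can be sketched as below. Ids created before the algorithm change resolve via a lookup table (standing in for the SQLite vindex); newer ids use an algorithmic function (standing in for the Vitess hash vindex, whose real hash differs). The threshold and mappings are illustrative values, not Etsy's.</p>

```python
NUM_SHARDS = 1000
THRESHOLD_ID = 5_000_000_000  # first id created after the algorithm change

legacy_lookup = {123: 7, 456: 311}  # frozen pre-threshold id -> shard map

def hash_vindex(record_id):
    return record_id % NUM_SHARDS  # placeholder for the real hash function

def hybrid_vindex(record_id):
    if record_id >= THRESHOLD_ID:
        return hash_vindex(record_id)   # newer ids: pure computation
    return legacy_lookup[record_id]     # older ids: lookup path

assert hybrid_vindex(123) == 7  # below threshold: lookup path
assert hybrid_vindex(THRESHOLD_ID + 1) == (THRESHOLD_ID + 1) % NUM_SHARDS
```

<p>The key property is that the threshold makes the choice deterministic per id, so a record never flips between the two resolution paths.</p>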
<h2>Introducing vindexes into our environment</h2>
<p>We were now able to introduce Vitess vindexes into our existing architecture without moving any data. Before doing so, we needed to ensure that all queries currently in use with legacy ORM sharding would still succeed and return the same results with Vitess-managed sharding. Some incompatibility was expected: Vitess requires the shardifier id to be present in the WHERE clause of a SQL query in order to route it to the correct shard, but this was not previously required by the ORM. In fact, since Etsy developers were required to provide this information separate from the SQL query in our legacy architecture, many queries did not include it in the WHERE clause. More detail on how we found and solved query incompatibilities can be found in <a href="https://www.etsy.com/codeascraft/scaling-etsy-payments-with-vitess-part-3--reducing-cutover-risk">Scaling Etsy Payments with Vitess, Part 3</a>.</p>
<p>The ORM contained over a decade of queries with many different ways of accessing MySQL, which made auditing each and every query time-consuming. Building enough context to test changes to a model was also challenging since each table varied greatly in the design and purpose of its data. Given this, we decided to take an incremental approach and introduce vindexes into our codebase one table at a time. This would allow us to test on a smaller scale, ensuring each type of data access pattern would succeed with Vitess by monitoring how vindexes performed with a specific set of queries. We could slowly and safely surface incompatibilities while we made progress on the migration, allowing room for error without risking large-scale outages.</p>
<p>Since we would need to repeat the migration for hundreds of tables, we prioritized creating a clear and repeatable process. <a href="https://www.etsy.com/codeascraft/understanding-the-collective-impact-of-experiments">Etsy’s experimentation framework</a> enabled us to gradually ramp up the change by incrementally increasing the percentage of traffic that was using Vitess vindexes for shard routing for each table. This enabled us to assess impact by comparing the performance of a query when it was using vindexes vs ORM shard routing. If we encountered any issues along the way, it was easy to ramp back down to 0. Minimizing the risk of each model ramp by starting small and having a quick way to roll back helped mitigate the risk of not testing all queries upfront.</p>
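<p>Percentage-based ramping of this kind is commonly implemented with stable hash bucketing, as in the hedged sketch below: each request hashes to a deterministic bucket in [0, 100), and buckets below the ramp percentage take the vindex code path. This illustrates the general technique, not Etsy's actual experimentation framework.</p>

```python
import hashlib

def use_vindex_routing(table, request_id, ramp_percent):
    """Decide the routing path for one request, stable per (table, id)."""
    digest = hashlib.sha256(f"{table}:{request_id}".encode()).digest()
    bucket = digest[0] % 100  # deterministic bucket in [0, 100)
    return bucket < ramp_percent

assert not use_vindex_routing("receipts", 1, 0)  # ramped down: no traffic
assert use_vindex_routing("receipts", 1, 100)    # fully ramped: all traffic
```

<p>Hashing rather than sampling randomly means a given request is routed consistently at a fixed ramp level, which keeps before/after performance comparisons clean.</p>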
<h2>Working through challenges: database transactions</h2>
<p>Early on we ran into a challenge with this approach when handling database transactions. With our in-house shard routing, the ORM treated the shards as many separate databases and queried specific shards directly. With Vitess-managed sharding, the ORM queried the shards as if they were a single database, and they were only treated as separate databases behind the scenes by Vitess. This meant that the two approaches created different connections to the database, which caused some issues for database transactions, since atomicity is only guaranteed per connection. To avoid data integrity issues that could come from breaking transaction guarantees, we required tables written to in a single database transaction to be ramped onto vindexes at the same time so they would use the same connection.</p>
<p>This was simple enough in theory, but ended up being pretty consequential. A small number of tables representing our most complex and critical data models, like receipts, listings, and transactions, makes up a large portion of database traffic to the sharded cluster. As we were preparing one of them to use vindexes, we realized that just 27 models (3% of tables) accounted for ⅓ of our database traffic – and they were all connected via database transactions. Despite our best efforts to minimize risk by ramping table-by-table, some of the riskiest changes were tightly coupled anyway. We collaborated across the company to ramp these 27 models in unison.
<img alt="Percentage of traffic using vindexes" src="https://i.etsystatic.com/inv/3f0fee/7870795065/inv_fullxfull.7870795065_hr9alodj.jpg?version=0" title="Percentage of traffic using vindexes" /></p>
<h2>Reaping the benefits: cross-shard queries</h2>
<p>One major advantage of using Vitess vindexes is the ability to query across shards. By default, any query that does not contain the shardifier id will "scatter" – Vitess will send it to all shards in parallel, sort the results, and return a single result set. This is a powerful tool – one model’s query time was reduced from ~2 seconds to ~20 ms by using cross-shard queries! However, at Etsy’s scale, unintentionally sending an expensive query to all 1,000 shards at once could quickly become an issue. To protect against these scenarios, we disabled scatter queries by default in our environment and provided a way to specify that a query is allowed to scatter via the ORM. </p>
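<p>The scatter behavior just described can be sketched as a parallel fan-out with an opt-in flag, mirroring the disabled-by-default policy. The shard data, predicate, and function names below are illustrative, not the Vitess implementation.</p>

```python
from concurrent.futures import ThreadPoolExecutor

shards = {0: [("b", 2)], 1: [("a", 1)], 2: [("c", 3)]}  # toy shard data

def query_shard(shard_id, predicate):
    return [row for row in shards[shard_id] if predicate(row)]

def scatter_query(predicate, allow_scatter=False):
    """Fan the query out to every shard in parallel, then merge and sort."""
    if not allow_scatter:
        raise RuntimeError("scatter queries are disabled by default")
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: query_shard(s, predicate), shards)
    return sorted(row for part in partials for row in part)

rows = scatter_query(lambda row: row[1] >= 1, allow_scatter=True)
assert rows == [("a", 1), ("b", 2), ("c", 3)]
```

<p>Requiring an explicit flag keeps an accidental full-cluster fan-out from reaching all 1,000 shards at once.</p>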
<p>Utilizing scatter queries brought notable improvements to bulk primary key lookups in the 27 table ramp mentioned above. The ORM’s previous implementation took an array of primary keys, batched them by shard, issued a query to each shard, and returned the combined results. Since Vitess enables us to query multiple shards in a single query, we can skip the batching by shard and include all primary keys in a single query. This greatly reduced the number of queries issued for bulk lookups for some models.
<img alt="Number of bulk lookup queries for one model" src="https://i.etsystatic.com/inv/4da339/7822693132/inv_fullxfull.7822693132_7yjbrc8p.jpg?version=0" title="Number of bulk lookup queries for one model" /><em>Bulk lookup queries for one model during its ramp onto vindexes. Purple lines denote 1%, 10%, 50%, and 100% of traffic to the model using vindexes.</em></p>
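<p>The before/after contrast in bulk lookups can be sketched with a toy three-shard mapping, where <code>shard_of()</code> stands in for the vindex. All names here are illustrative, not Etsy's ORM internals.</p>

```python
from collections import defaultdict

def shard_of(pk):
    return pk % 3  # toy vindex over 3 shards

def bulk_lookup_legacy(pks):
    """Old ORM path: batch keys by shard, then issue one query per shard."""
    batches = defaultdict(list)
    for pk in pks:
        batches[shard_of(pk)].append(pk)
    return [f"SELECT ... WHERE id IN {tuple(batch)}" for batch in batches.values()]

def bulk_lookup_vitess(pks):
    """Vitess path: one query with every key; Vitess fans it out by vindex."""
    return [f"SELECT ... WHERE id IN {tuple(pks)}"]

pks = [1, 2, 3, 4, 5, 6]
assert len(bulk_lookup_legacy(pks)) == 3  # keys span 3 shards -> 3 queries
assert len(bulk_lookup_vitess(pks)) == 1  # always a single query
```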
<h2>Conclusion</h2>
<p>Five years, approximately 2,500 pull requests and 6,000 queries later, we have successfully migrated Etsy’s shard management to Vitess vindexes! Despite the work we put in to streamline the migration process, it was still a challenge to replace the database infrastructure for a codebase of Etsy’s scale and age. As infrastructure engineers, we usually had minimal context around the code we were changing and what might break if things went wrong. However, with coordination and careful testing across Etsy engineering, we were able to meet our goals:</p>
<ul>
<li>Scaling operations are no longer manual and can be performed in a matter of days rather than months.</li>
<li>We removed the index database as a single point of failure for our sharded cluster.</li>
<li>Shard infrastructure is now abstracted away from developers, which makes data modeling and query writing much simpler.</li>
<li>Database performance stayed largely the same as it was prior to the migration, rendering the change invisible to our end users.</li>
<li>We were able to progress incrementally, roll back quickly, and integrate Vitess into our shard architecture without performing a massive data migration.</li>
</ul>
<p>We are excited to use all the new Vitess features that migrating to vindexes has unlocked – for example, re-sharding our data, rebalancing data across the shards, and using Vitess’s MoveTables operation to shard unsharded tables. Swapping out such critical and complex infrastructure with almost no downtime or impact on our users was an extremely challenging task, but it was also extremely gratifying.</p>
<h2>Acknowledgements</h2>
<p>This project was a shared accomplishment of the Data Access Platform team: Jessica Chen, Samantha Drago-Kramer, Hermes Garcia, Sam Kenny, David Leibovic, Kyle Leiby, Benjamin Mariscal, Juan Ortega, Adam Saponara, Wendy Sung, and Stephanie Wu. Thank you to everyone who contributed to the vindex project across engineering!</p>]]></content:encoded>
    </item>
  </channel>
</rss>
