<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html" version="2.0">
  <!-- Source: https://medium.com/feed/airtable-eng -->
  <channel>
    <title><![CDATA[The Airtable Engineering Blog - Medium]]></title>
    <description><![CDATA[The Airtable Engineering blog shares stories, learnings, best practices, and more from our journey to build a modular software toolkit. - Medium]]></description>
    <link>https://siftrss.com/f/pWVkjQN1V1</link>
    <image>
      <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
      <title>The Airtable Engineering Blog - Medium</title>
      <link>https://medium.com/airtable-eng?source=rss----103630b30187---4</link>
    </image>
    <generator>Medium</generator>
    <lastBuildDate>Tue, 07 Apr 2026 10:31:29 GMT</lastBuildDate>
    <atom:link href="https://siftrss.com/f/pWVkjQN1V1" rel="self" type="application/rss+xml"/>
    <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
    <atom:link href="http://medium.superfeedr.com" rel="hub"/>
    <item>
      <title><![CDATA[How we reduced archive storage costs by 100x and saved millions]]></title>
      <link>https://medium.com/airtable-eng/how-we-reduced-archive-storage-costs-by-100x-and-saved-millions-21754b5a6c8e?source=rss----103630b30187---4</link>
      <guid isPermaLink="false">https://medium.com/p/21754b5a6c8e</guid>
      <category><![CDATA[infrastructure]]></category>
      <category><![CDATA[database]]></category>
      <category><![CDATA[s3]]></category>
      <category><![CDATA[big-data]]></category>
      <category><![CDATA[mysql]]></category>
      <dc:creator><![CDATA[Matthew Jin]]></dc:creator>
      <pubDate>Wed, 07 Jan 2026 18:42:16 GMT</pubDate>
      <atom:updated>2026-01-10T01:44:06.400Z</atom:updated>
      <content:encoded><![CDATA[<p>In this post, we introduce a new storage system that we built in order to cost-efficiently store log data while providing interactive query latency. We’ll cover our motivations, the architecture, the migration process, and interesting optimizations we made along the way.</p><h3>Archive Data</h3><p>Going into 2024, cost savings was one of the major goals for the storage team. Our AWS MySQL RDS storage footprint was rapidly growing, with petabytes of stored data. Moreover, some of the largest databases in our fleet were approaching the RDS maximum disk space of 64TB, which would render them inoperable. We noticed that our largest dataset in MySQL comprised our “cell history” and “action log” tables (which we will collectively call “archive data”), basically an audit log of the activity within a base. This data services our revision history features and is also often used for internal debugging. For some enterprise customers, we have committed to retaining this data for up to 10 years. We wanted to stop storing this data in MySQL to save money, but we still needed to provide interactive-level query latency and the same durability and availability guarantees as MySQL.</p><p>These are some characteristics of our archive data:</p><ul><li>The vast majority (petabytes, trillions of rows) is old data that is relatively infrequently accessed by application code.</li><li>When it is accessed, the majority of the QPS comes from query patterns that are point selects or range queries used in pagination. The queries are also always filtered by a specific base.</li><li>Old data is read-only, with the exception of when the data is hard deleted.</li><li>The data is primary-keyed by MySQL’s autoincr_id, meaning that it is essentially sorted in insertion order.</li></ul><p>MySQL provides easy query access to this data, but is expensive and thus a poor fit for such cold traffic patterns. 
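</p><p>Concretely, the access patterns above are per-base point selects and keyset-paginated range scans on autoincr_id. Here is a minimal sketch using SQLite and a hypothetical schema (table and column names are illustrative, not Airtable’s actual schema):</p>

```python
import sqlite3

# Toy stand-in for an archive table such as "cell history".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cell_history ("
    " autoincr_id INTEGER PRIMARY KEY,"
    " application TEXT NOT NULL,"  # the base this row belongs to
    " payload TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO cell_history VALUES (?, ?, ?)",
    [(i, f"app{i % 3}", f"edit-{i}") for i in range(1, 101)],
)

# Point select: one row of one base by primary key.
point = conn.execute(
    "SELECT payload FROM cell_history"
    " WHERE application = ? AND autoincr_id = ?",
    ("app1", 1),
).fetchone()

# Keyset-paginated range query: the next 5 rows of a base after a cursor.
page = conn.execute(
    "SELECT autoincr_id FROM cell_history"
    " WHERE application = ? AND autoincr_id > ?"
    " ORDER BY autoincr_id LIMIT 5",
    ("app1", 10),
).fetchall()
```

<p>Every query carries an equality filter on the base, and the range scans walk forward in autoincr_id order, exactly the shape that rewards sorting the archived data by autoincr_id and partitioning it by base.</p><p>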
We formulated an idea to archive and migrate the data out of MySQL into AWS S3 (which is 10x cheaper byte for byte), partition the data by base into individual <a href="https://parquet.apache.org/">Apache Parquet files</a>, and build a new query engine using <a href="https://datafusion.apache.org/">Apache DataFusion</a> to serve S3 Parquet files. Recent cell history and action log rows would continue to be written to and served from MySQL, making this a two-tier storage system.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*67nBSswdKTKrqJS4uvjMNg.png" /><figcaption>Architecture Overview</figcaption></figure><h3>Parquet</h3><p>First, a quick overview of Parquet. It is a columnar file format designed for analytical workloads. Instead of storing data row-by-row like MySQL InnoDB, Parquet stores each column contiguously on disk. Furthermore, Parquet files are organized into <strong>row groups</strong>, each of which contains a horizontal partition of the dataset. A row group contains a chunk of rows for every column, and each column is stored as a separate column chunk.</p><p>The Parquet file format also contains metadata useful for query planning in an engine. <strong>File metadata</strong> provides offsets and size information useful for navigating the Parquet file, while <strong>page header metadata</strong> is stored in-line with the page data and contains useful statistics such as min/max values and bloom filters for the column chunks. 
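</p><p>To illustrate how those statistics pay off, here is a minimal sketch (with made-up row-group statistics) of how a reader can use per-row-group min/max values for a sorted autoincr_id column to decide which byte ranges to download at all:</p>

```python
# Hypothetical per-row-group statistics for the autoincr_id column, as a
# reader would find them in a Parquet file's metadata. Because the file
# is sorted by autoincr_id, the min/max intervals do not overlap.
ROW_GROUPS = [
    {"byte_range": (4, 1_000_000), "min_id": 1, "max_id": 5000},
    {"byte_range": (1_000_000, 2_000_000), "min_id": 5001, "max_id": 10000},
    {"byte_range": (2_000_000, 3_000_000), "min_id": 10001, "max_id": 15000},
]

def byte_ranges_to_fetch(lo, hi):
    """Keep only the row groups whose [min, max] interval intersects the
    query's [lo, hi] interval; everything else is skipped without any
    network request for its data pages."""
    return [
        rg["byte_range"]
        for rg in ROW_GROUPS
        if rg["min_id"] <= hi and rg["max_id"] >= lo
    ]
```

<p>A range query for ids 4,990–5,010 touches only the first two row groups, and a query for ids the file cannot contain touches none.</p><p>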
Modern query engines that support querying Parquet are able to use the file format’s metadata to construct their execution plans and avoid scanning and filtering many row groups entirely.</p><p>Here is a simplified visual representation of the data a Parquet file contains:</p><pre>Parquet File<br>├── RowGroup[0]<br>│ ├── ColumnChunk[colA]<br>│ │ ├── ColumnIndex (min/max stats)<br>│ │ ├── BloomFilterIndex<br>│ │ └── Page data<br>│ └── ColumnChunk[colB]<br>…<br>└── File Metadata</pre><p>In constructing our Parquet files, we did not significantly change the schema from the original MySQL schema. We kept them sorted by autoincr_id because most of our query patterns are point selects or range queries on autoincr_id (with some additional conditions). This ordering allows our query engines to effectively consult metadata and selectively download the relevant byte ranges of data from S3. We also partitioned our Parquet files by base so that our queries (which are all per base) avoid scanning unrelated base data.</p><p>Lastly, we decided to register the S3 locations of these Parquet files in DynamoDB as our own additional layer of metadata. This allows our storage client code to easily register in the query engine which files it needs to query over. It was also important in supporting features like data residency and encrypting data with customer keys.</p><p>As a side note, thanks to the columnar file format, we were able to get significantly higher compression ratios — our final archive data set was 10 times smaller than the original data set in MySQL. Coupled with S3 itself being 10 times cheaper byte for byte, we were actually able to build a system that was 100 times cheaper than MySQL in storage costs!</p><h3>DataFusion</h3><p>In early 2024, we began the project by benchmarking a variety of query engines over Parquet files. 
All these engines speak SQL and are able to query Parquet files from S3.</p><ul><li>AWS Athena</li><li>DuckDB</li><li>StarRocks</li><li>DataFusion</li></ul><p>One of the first engines we tried was AWS Athena. However, Athena’s architecture is more suitable for general OLAP workloads, and it didn’t meet our latency needs for an interactive, user-facing feature. The Athena API expects queries to be made with a StartQueryExecution call followed by a GetQueryExecution poll to retrieve the query result — as a result, we generally had seconds of latency. Also, as a managed AWS service, it offers no isolation between separate bases, something we like to prioritize here at Airtable.</p><p>With DuckDB, we found that the query planner did not always effectively use predicate pushdown, a technique that reduces the amount of data scanned by moving filters into the initial data scan. Some of our query patterns resulted in entire files being downloaded. Simple point select queries on a single autoincr_id worked as expected and achieved subsecond latency, but we found DuckDB generally worse compared to DataFusion. In general, we believe the tool is more appropriate for ad-hoc analytics or if you just want to query Parquet fast. In fact, we leveraged DuckDB frequently for debugging purposes throughout our development process — it was extremely convenient to be able to run a CLI tool to quickly inspect the contents of S3 Parquet files when working through validation issues.</p><p>Here’s a quick table of some benchmarked results on one selected internal Airtable base. By no means is this representative of all query patterns for these various query engines. 
It is simply a small subset of the queries that we have.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kWFgFp6725Z94HrGTfcNWQ.png" /><figcaption>*<em>rough number, as the performance of all queries was averaged in this experiment, and some were cached results</em></figcaption></figure><p>With StarRocks, we found solid performance results comparable to DataFusion, but the operational complexity involved with running a cluster full-time in k8s to serve relatively low-QPS queries for cold storage put us off. Like Athena, it also lacks strong isolation between bases.</p><p>After these investigations, we settled on DataFusion. It is an extensible query engine written in Rust, and we found it to be the best engine at using Parquet’s advanced features to implement queries efficiently. Its extensible nature also proved to be handy as we made query optimizations, which we will discuss more later. As an embedded library, we were also able to embed it into our worker processes, which are already per base, and this has a number of advantages:</p><ul><li>Low operational overhead: Since the engine is embedded, there is no additional service to manage in production. Local development, CI, etc. also did not require much additional setup.</li><li>Strong isolation between bases: Again, since the engine is embedded, our existing architecture of per-base processes provides this guarantee. We did not need to introduce any new mechanism to prevent bases from contending with each other for CPU, RAM, or network bandwidth.</li><li>Strong affinity with requests by base: This allows us to implement effective caching mechanisms, which we will discuss later.</li></ul><h3>Data Migration</h3><p>After we settled on our choice of cost-efficient storage and a query engine, we had to migrate the data out of MySQL. 
We prioritized a process to do a one-time migration of data out of MySQL into S3 to immediately start saving money.</p><p>In order to get a consistent view of the data for export, we chose to leverage <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ExportSnapshot.html">AWS RDS’s snapshotting capabilities</a>, which produce large Parquet files of the entire table. We also prototyped a system to run SQL directly on the databases and write Parquet files, but we did not productionize orchestrating such a system at scale. We preferred AWS’s snapshot capabilities because they run on the backup database instance and do not incur additional load on our production systems.</p><p>However, these are snapshots of massive tables across a host of database shards. All our queries are filtered by specific bases, so we also decided that our Parquet files should be partitioned by base. In order to construct these partitioned serving Parquet files, we spun up a number of <a href="https://flink.apache.org/">Flink</a> jobs that parallelized over all our database shard snapshots, repartitioned these snapshots by base, and dumped them in intermediate S3 directories. Then, we used AWS Step Functions to scan these S3 directories and enqueue the bases into AWS SQS. Lastly, we ran custom “compactor” code that merged these intermediate files together. In this compaction process, we merge-sorted the various files, deduplicated records, and made sure each individual final Parquet file did not exceed 1GB. 
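</p><p>The heart of that compaction step is a k-way merge with deduplication. A minimal sketch, with intermediate files modeled as sorted lists of (autoincr_id, payload) rows (the real compactor reads and writes Parquet):</p>

```python
import heapq

def compact(*sorted_runs):
    """Merge already-sorted runs of (autoincr_id, payload) rows into one
    sorted, deduplicated output, the way the compactor merge-sorts a
    base's intermediate files."""
    merged, last_id = [], None
    for row in heapq.merge(*sorted_runs):  # streaming k-way merge
        if row[0] != last_id:  # drop duplicate ids
            merged.append(row)
            last_id = row[0]
    return merged
```

<p>For example, compacting the runs [(1, "a"), (3, "c")] and [(2, "b"), (3, "c")] yields [(1, "a"), (2, "b"), (3, "c")]. The real pipeline then split the merged output so that no final Parquet file exceeded 1GB.</p><p>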
This was an appropriate serving size, determined during initial benchmarking, that gave a good density of row groups per file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jnrNNVJahujI6wB6HVYpcw.png" /><figcaption><em>Overview of the migration process</em></figcaption></figure><h3>Validation</h3><p>It was an important priority for us to provide a seamless transition for our end users throughout this migration process, so before we launched this new system to serve live traffic, we first had to do some validation to ensure we didn’t corrupt data during the migration or introduce unexpected bugs.</p><p>We previously wrote about our bulk validation process to guard against data corruption in more depth in this <a href="https://medium.com/airtable-eng/live-shard-data-archive-export-and-ingestion-to-starrocks-for-validation-6af555e8b3fe">blog post</a>. TL;DR, we spun up a StarRocks cluster so that we could quickly run validation queries between the serving Parquet files and the RDS snapshots. Fortunately, we found zero cases of data corruption throughout this process — the repartitioner and compactor code had worked flawlessly.</p><p>However, this project introduced a lot of new storage client code. 
To give an idea of the new complexity here, we:</p><ul><li>Wrote a query engine with DataFusion in Rust.</li><li>Integrated the Rust query engine into our Node.js client code with <a href="https://napi.rs/">napi-rs</a>.</li><li>Wrote new client code with logic to combine MySQL and S3 results, identify which S3 files to query, etc.</li><li>Supported existing enterprise features such as encryption with customer-provided keys, regional data residency, hard deletion, etc.</li></ul><p>Bulk validation was a necessary test to ensure our data migration processes did not corrupt data, but it did not validate all the other client code or demonstrate that users would see the exact same revision history in their bases before and after the migration.</p><p>Once we had the query engine built, we began to perform shadow validation on live traffic. Every request would read from MySQL as normal but, in the background, also issue the same request via the new query engine. We caught a number of simple implementation bugs, and also saw a variety of interesting issues, such as:</p><ul><li>Mismatched float precision between JavaScript (our typical client) and Rust’s serde JSON library.</li><li>An interesting case where it looked like we had dropped entire database shards of data! But it turned out that DataFusion was unexpectedly doing a lexicographical instead of numerical sort.</li><li>A crashing SIGABRT issue with async napi-rs and Node.js worker threads.</li><li>Latency performance concerns.</li></ul><p>In the end, we were able to resolve all these bugs prior to launching this to users and deleting the data in MySQL.</p><h3>Performance Optimizations</h3><p>During our staged rollout process, we discovered a number of performance issues around latency. They had a variety of causes: inefficient query plans, bottlenecks in network requests to S3, and sparse filters that resulted in downloading more data than necessary. 
We’ll discuss some of the interesting improvements made here.</p><h4>Caching</h4><p>One particularly interesting performance optimization we made was around caching. As one might expect for a query engine built on top of S3, the bottlenecks are primarily in the network roundtrips to S3. DataFusion essentially converts our SQL statement into a series of S3 GET requests. It fetches Parquet footer metadata, Parquet column chunk metadata, and then uses this information to decide which row groups and which column chunks of the Parquet files actually need to be fetched. We built a tiered caching system to reduce the number of S3 GET requests needed.</p><p>For the first layer, we were able to easily use DataFusion’s built-in cache infrastructure (<a href="https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/struct.CacheManagerConfig.html">CacheManagerConfig</a>) to cache Parquet file metadata and S3 ListObjects calls.</p><p>Next, we cached the rest of the Parquet page header metadata. Together with the built-in file metadata cache, we effectively reduced the constant need to roundtrip to S3 during the query planning process. With pushdown filtering, DataFusion only had to consult the cached metadata to get a good idea of which row group byte ranges it needed to scan. Being an extensible query engine, DataFusion makes it simple to write your own cache implementation. DataFusion provides a default Parquet reader interface (which implements functions like get_metadata, get_bytes, and get_byte_ranges), but it also allows you to substitute your own implementation instead. So, we wrote an implementation that cached metadata results in memory. It was also straightforward to add observability and other instrumentation, and we were able to confirm that we typically see a 99%+ cache hit ratio here. 
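</p><p>The substituted reader boils down to a get-or-fetch cache keyed by file path. A minimal sketch in Python (the real implementation is Rust inside DataFusion’s reader hooks; the fetch callback stands in for an S3 GET):</p>

```python
class CachingMetadataReader:
    """Memoize per-file metadata so repeated planning passes hit memory
    instead of issuing another fetch (an S3 GET in the real system)."""

    def __init__(self, fetch):
        self._fetch = fetch          # fallback loader, e.g. an S3 GET
        self._cache = {}             # path -> metadata
        self.hits = self.misses = 0  # instrumentation for hit ratio

    def get_metadata(self, path):
        if path in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]

# Hypothetical usage: three lookups of the same file cost one fetch.
reader = CachingMetadataReader(lambda path: {"footer_offset": 123})
for _ in range(3):
    meta = reader.get_metadata("s3://archive/base1/part-0.parquet")
```

<p>With every request for a given base landing on the same process, and the same handful of files consulted on every query, steady-state hit ratios climb toward 100%.</p><p>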
This was as expected because of how our DataFusion engine is embedded in a per-base process and how Parquet files are similarly partitioned by base — the request affinity is as good as it gets here.</p><p>Lastly, we also implemented an on-disk cache that preemptively downloads the Parquet files for a base. Similar to the previous cache layer, we were able to implement this by substituting the default provided S3 ObjectStore implementation with a custom implementation that downloaded to and read from the local disk. Caching entire Parquet files on disk involves more work and cost than caching just metadata, though — fortunately, this was only needed for an extremely small number of bases with poorer performance due to large amounts of data and pathological query patterns.</p><p>Overall, we found DataFusion’s extensible nature to be easy to work with, flexible, and the ideal tool to build high-performance query engines.</p><h4>Custom Indexes</h4><p>We previously highlighted that most of our queries are point selects and range queries on autoincr_id, but we also often have additional conditions that filter rows far more efficiently than autoincr_id alone. Some examples include queries that:</p><ul><li>filter by the action type, e.g. looking for actions that updated Airtable column names;</li><li>filter by the row being updated;</li><li>omit updates from our sync feature.</li></ul><p>Some of these additional filters resulted in matching only a tiny fraction of rows, which makes fetching and scanning entire Parquet files wasteful. For example, imagine filtering out sync updates, but having a base that was predominantly made up of sync actions — such naive queries would naturally be slow.</p><p>No database system with varying query patterns is complete without secondary indexes. In order to make these queries more efficient, we built an indexing system leveraging DataFusion that scans through Parquet files and writes indexes as new Parquet files. 
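</p><p>A hypothetical sketch of the idea: a side index maps each action type to the autoincr_id ranges that contain it, so a filtered query can be rewritten to touch only those ranges of the main file (the names and layout here are illustrative, not our actual index format):</p>

```python
# Side index, itself persistable as a small Parquet file:
# action_type -> sorted, non-overlapping (first_id, last_id) ranges.
INDEX = {
    "column_rename": [(100, 120), (5000, 5003)],
    "sync_update": [(1, 4999)],
}

def ranges_to_scan(action_type, lo, hi):
    """Intersect the indexed ranges for an action type with the query's
    autoincr_id range; the main Parquet file is only read there."""
    return [
        (max(first, lo), min(last, hi))
        for first, last in INDEX.get(action_type, [])
        if first <= hi and last >= lo
    ]
```

<p>A query for column renames over ids 0–200 shrinks from a full scan to the single range (100, 120) before the main file is ever touched.</p><p>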
Our DataFusion and surrounding client code is aware of these index files, and queries them first in order to generate a more efficient query on the original Parquet files. It was easy to build this ad-hoc indexing system mainly because our data is read-only. We never have to worry about the data changing and keeping indexes updated in sync, as more sophisticated database systems typically must.</p><h4>Bloom Filters</h4><p>As we previously mentioned, most of our query patterns are similar to point selects or range queries on an autoincr_id, the “primary key” of how our Parquet files are laid out. However, we do have some significantly lower-QPS queries that were point selects on a different, unique identifier. This unique identifier was randomly distributed, so the min/max statistics on Parquet were useless — we’d end up having to fetch and scan every single row group and then apply the filter afterwards. We could address this with the same custom index strategy as above, but we found it simpler to just rely on Parquet’s bloom filter metadata, which DataFusion understands.</p><p>A bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set, with false positives being possible but false negatives impossible. In this case, each column chunk’s metadata for the identifier contained a bloom filter. Because false negatives are impossible, the filters can be used to rule out row groups that definitely don’t need to be fetched.</p><h3>Conclusion</h3><p>Overall, we were able to move petabytes of data out of MySQL, build a system that was 100x cheaper in storage costs, and save millions of dollars per year, all while maintaining interactive query latencies for our users.</p><p>For future work, we have plans to make this system incrementally archive data out of MySQL. 
We prioritized a manual batch archiving process to migrate petabytes of data out of MySQL and start saving money immediately, but there’s still a lot of engineering work to be done to make this a fully automatic system with less operational burden. We envision setting up a CDC system like Flink to handle this, but there are going to be a lot more unsolved and interesting problems around how we handle compacting Parquet files together, rebuilding indexes, managing the operational side of things, etc. Also, our initial implementation targeted our largest dataset, but there are other log-like tables we could onboard to this system as well.</p><p>If this type of work optimizing database query engines and working with petabytes of data sounds exciting to you, apply to Airtable! We’re hiring at <a href="https://airtable.com/careers">https://airtable.com/careers</a>.</p><p><em>Thanks to all past and present Airtablets on storage and across the organization who contributed to this project: Nathan Chou, Aiden Dou, Riley Hockett, Matthew Jin, Daniel Kozlowski, Brian Larson, Keunwoo Lee, Mike Milkin, Gavin Towey, Andrew Wang, Xiaobing Xia, Alex Yao, Brian Zhang, Kun Zhou</em></p><p><em>Apache, Apache Parquet, and Apache DataFusion are trademarks of The Apache Software Foundation.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=21754b5a6c8e" width="1" height="1" alt=""><hr><p><a href="https://medium.com/airtable-eng/how-we-reduced-archive-storage-costs-by-100x-and-saved-millions-21754b5a6c8e">How we reduced archive storage costs by 100x and saved millions</a> was originally published in <a href="https://medium.com/airtable-eng">The Airtable Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[Live Shard Data Archive: Export and Ingestion to StarRocks for Validation]]></title>
      <link>https://medium.com/airtable-eng/live-shard-data-archive-export-and-ingestion-to-starrocks-for-validation-6af555e8b3fe?source=rss----103630b30187---4</link>
      <guid isPermaLink="false">https://medium.com/p/6af555e8b3fe</guid>
      <category><![CDATA[data-ingestion]]></category>
      <category><![CDATA[storage]]></category>
      <category><![CDATA[starrocks]]></category>
      <category><![CDATA[data-validation]]></category>
      <category><![CDATA[data-archiving]]></category>
      <dc:creator><![CDATA[Riley]]></dc:creator>
      <pubDate>Mon, 24 Mar 2025 17:11:09 GMT</pubDate>
      <atom:updated>2025-03-24T17:11:09.730Z</atom:updated>
      <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*SFXlyGVFRyUHP3shoHkqGA.jpeg" /></figure><h3>Overview</h3><p>At Airtable, we store our application or “base-scoped” data on a number of sharded MySQL instances in Amazon’s Relational Database Service (RDS). Each Airtable base is associated with a single one of these sharded instances, and as the base and the data in the base change, we store some append-only data associated with the history of the base. This data powers features such as undo and revision history, allowing us to display record revisions as far back as 3 years for our enterprise customers. As we have grown as a business, these append-only tables have become increasingly large and their content now represents close to half of all the data that we store at Airtable. Additionally, much of this data is very infrequently accessed, but is stored in the same storage layer as all of our base-scoped data for everyday product use, making it expensive.</p><h3>The Project</h3><p>The Live Shard Data Archive (LSDA) project allowed us to shrink the disk volumes of the bulk of our RDS instances by taking this infrequently accessed, append-only data and storing it in a cheaper storage solution, S3. Once it was stored in that cheaper solution, we were able to drop the old data and rebuild the RDS instances to reclaim the space.</p><p>Moving this data to S3 required three major steps. First, we had to archive and transform the data from RDS into S3 such that it could be accessed by our codebase in an efficient and consistent way. Second, we had to validate that this archived data matched the existing source data in RDS, and that our process of archiving did not cause any inconsistencies between the two datasets. Finally, we had to make application code changes to serve this data from S3. 
After these steps, we were able to truncate the data from RDS that we were serving from S3, allowing us to shrink the allocated storage space of these instances, saving a substantial portion of our overall RDS bill. This blog post will focus on the first phase of the second step of that process, data validation.</p><h3>Archiving and Transformation</h3><p>The data archival from RDS into its final shape in S3 was a three-step process. First, we exported snapshots of our databases into S3 as Parquet files. This is a built-in RDS feature. To optimize query latency of the archive, we then repartitioned the snapshot exports by our customers. This was a challenging process due to the amount of data (&gt;1PB) and the number of files involved (&gt;10M). We used Apache Flink to incrementally ingest these files and repartition them into per-customer partitions. Finally, we ran a highly concurrent Kubernetes rewriter job that sorted the archive for each customer, and added the proper index and bloom filters to the rewritten Parquet files to speed up the most common query patterns.</p><h3>Validation overview and considered approaches</h3><p>Validating the data required us to do a row-by-row comparison from the archive to our source data and make sure that for every row, these values were equal. Naively, an easy way to do this would be to read a row from our archive, find that row in RDS, and confirm they are the same. However, we were dealing with almost 1PB of data and close to 2 trillion rows. Additionally, our RDS instances in production serve customer traffic, so saddling them with these additional requests was not really an option, especially at the volume we would require to validate our entire archive. As a result, we decided to use the original, unmodified RDS export as our source data for this validation project. This data was stored in S3, and while that alleviated the problem of querying serving instances, it would simply be too slow to go row by row and validate this data. 
Ultimately, we decided that if we had all of the data in some relational database, we could just join the two tables together, and find any discrepancies in the data that way.</p><h3>Leveraging StarRocks for Data Validation — Airtable Data Infrastructure Team</h3><p>For the data validation project, the Data Infrastructure team helped the Storage team in selecting the best tool to complete the validation work efficiently.</p><h3>Why Use StarRocks to Address the Data Validation Problem?</h3><p>The core of the data validation problem involves performing a large number of join operations between two massive datasets, each containing nearly a trillion rows of data. The primary challenge was executing these computationally intensive join operations efficiently.</p><p>After thorough investigation, StarRocks was chosen due to its exceptional join performance. It can handle these operations with affordable computational costs, whereas other query engines struggle significantly with the same workload.</p><p>To address the problem, we decided to load raw Parquet files from S3 into local tables in StarRocks. 
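</p><p>The join-based comparison can be sketched in miniature, with SQLite standing in for StarRocks and a CRC standing in for the XX_HASH3_64 hash used later in this post (the schema and data here are illustrative):</p>

```python
import sqlite3
import zlib

def row_hash(*cols):
    # Stand-in for XX_HASH3_64(CONCAT_WS(',', <columns>)).
    return zlib.crc32(",".join(map(str, cols)).encode())

conn = sqlite3.connect(":memory:")
for table in ("rds_export", "archive"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER, application TEXT,"
        " hash_value INTEGER, PRIMARY KEY (id, application))"
    )

source = [(1, "app1", "v1"), (2, "app1", "v2"), (3, "app2", "v3")]
archived = [(1, "app1", "v1"), (2, "app1", "CORRUPT"), (3, "app2", "v3")]
conn.executemany("INSERT INTO rds_export VALUES (?, ?, ?)",
                 [(i, a, row_hash(i, a, v)) for i, a, v in source])
conn.executemany("INSERT INTO archive VALUES (?, ?, ?)",
                 [(i, a, row_hash(i, a, v)) for i, a, v in archived])

# A row is a validation failure if it is missing from the archive or
# its hash differs from the source's.
mismatches = conn.execute(
    "SELECT s.id, s.application FROM rds_export s"
    " LEFT JOIN archive a"
    "   ON s.id = a.id AND s.application = a.application"
    " WHERE a.hash_value IS NULL OR a.hash_value != s.hash_value"
).fetchall()
```

<p>Here only row (2, app1), whose payload was corrupted, surfaces as a mismatch. At our scale the same join runs over nearly a trillion hashed rows per side, which is what made join performance the deciding factor.</p><p>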
By leveraging StarRocks’ colocation mechanism, we could efficiently perform the join operations required for data validation.</p><h3>StarRocks Architecture</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/774/0*jox4cUIPf5QB9m1E" /></figure><p>The diagram above illustrates the StarRocks architecture, which can access the following data sources:</p><ol><li><strong>Data Lakes on S3:</strong> Includes Hudi Lake, Delta Lake, Iceberg Lake, and Paimon Lake.</li><li><strong>Native Format on S3:</strong> StarRocks allows the creation of tables that persist data directly in S3 using its native format.</li><li><strong>Raw Parquet, JSON, or CSV Files on S3:</strong> Queries can be executed directly on raw Parquet, JSON, or CSV files stored in S3.</li></ol><p>In our specific scenario, we loaded raw Parquet files from S3 storage into StarRocks’ local tables to perform data validation, as highlighted above.</p><h3>Ingestion Optimization: Enhances data ingestion performance in StarRocks</h3><p>We had to load nearly 1 trillion rows of data from raw Parquet files into StarRocks local tables. The dataset consisted of hundreds of millions of small Parquet files. Without proper optimization and parallelization, ingesting the entire dataset would have taken several months.</p><p>To accelerate the ingestion throughput, we implemented the following optimizations:</p><ol><li><strong>Reduce the Number of Replicas (from 3 to 1):<br></strong>Since this is a one-time validation task, maintaining high availability for production is unnecessary. Reducing the number of replicas significantly decreases the total data volume to be ingested.</li><li><strong>Increase Internal Ingestion Parallelism:<br></strong>As the validation process involves ingestion first, followed by join-based validation, ingestion performance does not affect serving scenarios. 
We increased parallelism by tuning the following parameters:<br> - <strong>pipeline_dop<br> </strong>-<strong> pipeline_sink_dop</strong></li><li><strong>Increase the Number of Buckets per Partition:<br></strong>Given the large data volume, we ensured that each bucket contained no more than 5GB of data. Increasing the number of buckets per partition significantly improves ingestion throughput. Although this may cause compaction to lag behind, it is not a concern in our specific scenario.</li></ol><p>These optimizations collectively help to efficiently handle the massive data ingestion process required for our validation workload.</p><h3>Ingestion</h3><h3>Export</h3><p>Once StarRocks was set up, we needed to get all of the data ingested from both our RDS export and the transformed archive which we planned on serving from. We decided that it was not cost and time efficient to store all ~1PB in StarRocks, so we decided to just hash all of the non-key columns in the table. We ingested two tables as a part of this process, but our examples will focus primarily on just one, _actionLog.</p><h4>Initial solution: Simple table with hashed non-key columns</h4><p>We started with a table schema for each of our tables which looked like this:</p><pre>CREATE TABLE `_rdsExportActionLog` (<br>`id` bigint(20) NOT NULL COMMENT &quot;&quot;,<br>`application` varchar(65533) NOT NULL COMMENT &quot;&quot;,<br>`hash_value` varchar(65533) NULL COMMENT &quot;&quot;<br>) ENGINE=OLAP<br>PRIMARY KEY(`id`, `application`)<br>DISTRIBUTED BY HASH(`id`, `application`)<br>ORDER BY(`application`)<br>PROPERTIES (<br>&quot;replication_num&quot; = &quot;1&quot;,<br>&quot;colocate_with&quot; = &quot;action_log_group&quot;,<br>&quot;in_memory&quot; = &quot;false&quot;,<br>&quot;enable_persistent_index&quot; = &quot;true&quot;,<br>&quot;replicated_storage&quot; = &quot;true&quot;,<br>&quot;compression&quot; = &quot;ZSTD&quot;<br>);</pre><p>And we started to load the data using insert statements like 
this:</p><pre>INSERT INTO `exportActionLog`<br>WITH LABEL ${label}<br>(id, application, hash_value)<br>SELECT id, application, XX_HASH3_64(CONCAT_WS(&#39;,&#39;,<br>&lt;columns&gt;)) as hash_value<br>FROM FILES(<br>&quot;path&quot; = &quot;s3://${bucket}/${folder}*.parquet&quot;,<br>&quot;format&quot; = &quot;parquet&quot;<br>);</pre><h4>Data Distribution and Loading Bottlenecks</h4><p>However, we found that this loading operation took a very long time: almost a full day to load two of our shards. As we added more data and the table got bigger, our ingestion rate slowed further. Odder still, if we increased the local parallelization of this ingestion, i.e. ingested multiple shards at once, we saw almost no performance boost, and when we set this value to more than five, we saw a lot of this:</p><pre>JobId: 14094<br>Label: insert_16604b11-7f2d-11ef-888c-46341e0f370e<br>State: LOADING<br>Progress: ETL:100%; LOAD:99%<br>Type: INSERT</pre><p>You can see here that our load value is 99%; a number of these large loads would reach 99% quickly but then get stuck there, unable to progress past this state. Per the <a href="https://docs.starrocks.io/docs/sql-reference/sql-statements/loading_unloading/SHOW_LOAD/">StarRocks documentation</a>, “When all data is loaded into StarRocks, 99% is returned for the LOAD parameter. Then, loaded data starts taking effect in StarRocks. After the data takes effect, 100% is returned for the LOAD parameter.” Evidently our initial solution was bottlenecked on the data taking effect.</p><h4>Improvement 1: Increase bucket count</h4><p>We were distributing our data by id and application, but we had yet to specify the number of buckets to distribute this data into (<a href="https://docs.starrocks.io/docs/table_design/data_distribution/">more info on StarRocks data distribution</a>).
Our hypothesis was that as we stored more and more data, these buckets grew increasingly large and cumbersome, which led to the slowdown we were seeing. We consulted the StarRocks team, who suggested that for our data volume we specify on the order of 7200 buckets for the smaller of the two tables, so we changed our schema to look like this:</p><pre>CREATE TABLE `exportActionLog` (<br>`id` bigint(20) NOT NULL COMMENT &quot;&quot;,<br>`application` varchar(65533) NOT NULL COMMENT &quot;&quot;,<br>`hash_value` varchar(65533) NULL COMMENT &quot;&quot;<br>) ENGINE=OLAP<br>PRIMARY KEY(`id`, `application`)<br>DISTRIBUTED BY HASH(`id`, `application`) BUCKETS 7200<br>ORDER BY (`application`)<br>PROPERTIES (<br>&quot;replication_num&quot; = &quot;1&quot;,<br>&quot;colocate_with&quot; = &quot;action_log_group&quot;,<br>&quot;in_memory&quot; = &quot;false&quot;,<br>&quot;enable_persistent_index&quot; = &quot;true&quot;,<br>&quot;replicated_storage&quot; = &quot;true&quot;,<br>&quot;compression&quot; = &quot;ZSTD&quot;<br>);</pre><p>However, while we were able to load this data much more quickly than before, we ran into this memory issue:</p><pre>message: &#39;primary key memory usage exceeds the limit. <br>tablet_id: 10367, consumption: 126428346066, limit: 125241246351. <br>Memory stats of top five tablets: 53331(73M)53763(73M)53715(73M)53667(73M)53619(73M): </pre><h4>Improvement 2: Partition the table by shard ID</h4><p>We realized that it would make sense to partition the table by shardId and load it that way. This would allow us to specify a number of buckets for each partition, and the data would be stored in a more efficient manner.
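Sizing the partitions comes down to simple arithmetic; here is a quick TypeScript sketch of the estimate (the roughly-2GB-per-bucket target and the power-of-two buffer are our own working assumptions, not StarRocks guidance):

```typescript
// Estimate buckets per partition for a table partitioned by shard.
// targetGB is how much data we want each bucket to hold.
function bucketsPerShard(totalTB: number, shards: number, targetGB = 2): number {
  const gbPerShard = Math.floor((totalTB * 1024) / shards); // 69GB for actionLog
  return Math.floor(gbPerShard / targetGB); // 34 buckets to host it
}

// Round up to the next power of two to add some buffer headroom.
function withBuffer(buckets: number): number {
  return 2 ** Math.ceil(Math.log2(buckets));
}

const perPartition = withBuffer(bucketsPerShard(10, 148)); // 64
const totalBuckets = perPartition * 148; // 9472 buckets in total
```

With our numbers (10TB hashed, 148 shards) this lands on 64 buckets per partition and 9472 in total, matching the rough math below.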
Using some rough math, we found that:</p><pre>actionLog =&gt; 10TB (Hashed) =&gt; 10 * 1024 / 148 shards = 69GB per shard =&gt; <br>34 buckets to host it =&gt; add some buffer, 64 buckets per partition<br><br>In total: 64 buckets per partition * 148 shards = 9472 buckets</pre><p>We figured that this distribution would also allow us to validate shard by shard, which would help avoid overwhelming the memory of the cluster. In the end we created this table and adjusted our loading statement to pull the shard ID from the S3 file path.</p><pre>CREATE TABLE `exportActionLog` (<br>`id` bigint(20) NOT NULL COMMENT &quot;&quot;,<br>`application` varchar(65533) NOT NULL COMMENT &quot;&quot;,<br>`shard` int(11) NOT NULL COMMENT &quot;&quot;,<br>`hash_value` varchar(65533) NULL COMMENT &quot;&quot;<br>) ENGINE=OLAP<br>PRIMARY KEY(`id`, `application`, `shard`)<br>PARTITION BY (`shard`)<br>DISTRIBUTED BY HASH(`id`, `application`) BUCKETS 64<br>ORDER BY(`application`)<br>PROPERTIES (<br>&quot;replication_num&quot; = &quot;1&quot;,<br>&quot;colocate_with&quot; = &quot;action_log_group_partition_by_shard&quot;,<br>&quot;in_memory&quot; = &quot;false&quot;,<br>&quot;enable_persistent_index&quot; = &quot;true&quot;,<br>&quot;replicated_storage&quot; = &quot;true&quot;,<br>&quot;compression&quot; = &quot;ZSTD&quot;<br>);</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*h0nsHPKJ52oYn6mA" /></figure><p>This drastically sped up our ingestion: the LSDA (Live Shard Data Archive) data was successfully loaded into StarRocks from S3 in under <strong>10 hours</strong>, at an average throughput of approximately <strong>2 billion rows</strong> per minute.</p><h3>Archive</h3><p>Now that the full RDS export, our source-of-truth data for validation, had been ingested into StarRocks, we needed to ingest the archive we were going to serve the data from and run our validation process across the two datasets.
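The validation itself reduces to a colocated join comparing hash values shard by shard. One plausible shape of the generated per-shard query (a sketch, not our production SQL; the table and column names follow the schemas in this post):

```typescript
// Build a per-shard validation query: count export rows whose hash is
// missing or different on the archive side. The exact predicate is
// illustrative; the real validation logic is not shown in this post.
function buildValidationQuery(shardId: number): string {
  return [
    "SELECT COUNT(*) AS mismatches",
    "FROM exportActionLog e",
    "LEFT JOIN _rdsArchiveActionLog a",
    "  ON e.id = a.autoincr_id AND e.application = a.applicationId AND e.shard = a.shardId",
    `WHERE e.shard = ${shardId}`,
    "  AND (a.hash_value IS NULL OR a.hash_value != e.hash_value)",
  ].join("\n");
}
```

Because both tables share a colocation group and partitioning scheme, StarRocks can execute this join locally per bucket.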
Unfortunately, the archive data was not stored in the same format as the export data, which presented an additional set of challenges. Drawing from our experience with the export ingestion, we created the archive tables identically, with both the same number of buckets and the same partitioning strategy by shardId. This also allowed us to keep the export and the archive in the same colocation group, so that we could use <a href="https://docs.starrocks.io/docs/using_starrocks/Colocate_join/">StarRocks’ colocation join</a>.</p><pre>CREATE TABLE `_rdsArchiveActionLog` (<br>`autoincr_id` bigint(20) NOT NULL COMMENT &quot;&quot;,<br>`applicationId` varchar(65533) NOT NULL COMMENT &quot;&quot;,<br>`shardId` int(11) NOT NULL COMMENT &quot;&quot;,<br>`hash_value` varchar(65533) NULL COMMENT &quot;&quot;<br>) ENGINE=OLAP<br>PRIMARY KEY(`autoincr_id`, `applicationId`, `shardId`)<br>PARTITION BY (`shardId`)<br>DISTRIBUTED BY HASH(`autoincr_id`, `applicationId`) BUCKETS 64<br>ORDER BY(`applicationId`)<br>PROPERTIES (<br>&quot;replication_num&quot; = &quot;1&quot;,<br>&quot;colocate_with&quot; = &quot;action_log_group_partition_by_shard&quot;,<br>&quot;in_memory&quot; = &quot;false&quot;,<br>&quot;enable_persistent_index&quot; = &quot;true&quot;,<br>&quot;replicated_storage&quot; = &quot;true&quot;,<br>&quot;compression&quot; = &quot;ZSTD&quot;<br>);</pre><h4>Discrepancies in directory structure between RDS export and archive</h4><p>However, a major difference between our export and our archive was the way each was stored in S3. The export was stored in large directories, each corresponding to an individual shard and partition in StarRocks.
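Because each export directory maps to a shard, pulling the shard ID out of a file's S3 key is a one-line regex. A sketch (the ".../shard_&lt;id&gt;/..." layout is a hypothetical stand-in for the real key format):

```typescript
// Extract the shard ID from an export object key. The "shard_<id>"
// directory naming is a hypothetical stand-in for our real layout.
function shardIdFromExportKey(key: string): number {
  const match = key.match(/shard_(\d+)\//);
  if (!match) throw new Error(`no shard ID in key: ${key}`);
  return Number(match[1]);
}
```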
This made our insert queries to StarRocks quite simple: we could wildcard all of the Parquet files across these directories, and from the directory name we could get the shard ID, which we then inserted into each row and used for partitioning.</p><p>For the archive, our data was stored by application to make serving from S3 in the application simpler. This meant that we had over <strong>6 million</strong> small directories, one per application, that did not necessarily correspond to a given shard; some applications are stored across multiple shards. We also had not stored the source shard information in S3, so unlike with our export, we had no easy way to derive the shardId from a single call to S3. Instead, we had stored the yet-to-be-validated file metadata in DynamoDB. Given this, we created a Global Secondary Index in DynamoDB with the shard ID as the sort key, and then queried all of the files for a given shardId in order to insert them.</p><h4>Grouping inserts in StarRocks using Union</h4><p>Additionally, StarRocks only lets you specify a single path per insert statement, meaning that each file would need its own insert statement. This was in stark contrast to our export, which only had ~160 folders and corresponding insert statements; we now had more than 6 million insert statements for our archive. To solve this problem, we first tried to heavily parallelize the process by running multiple shards at once and multiple processes per shard. (Note: this all runs in Node and TypeScript, so true multithreading was not available, but we could run multiple processes that make progress while others are blocked on I/O.) We tried to run a single shard with 10 parallel processes, but hit this problem:</p><pre>message: &#39;Failed to load data into tablet 14775287, <br>because of too many versions, current/limit: 1006/1000.
<br>You can reduce the loading job concurrency, or increase loading data batch size. <br>If you are loading data with Routine Load, you can increase <br>FE configs routine_load_task_consume_second and max_routine_load_batch_size</pre><p>Essentially, because multiple processes were running insertions at the same time, we were creating new versions of the table faster than StarRocks could compact them. Additionally, the version limit in StarRocks can no longer be raised beyond 1000.</p><p>Given this, we pursued a strategy that would reduce the number of insert queries and hand more of the work off to StarRocks (which at this point was running at low CPU and memory utilization while the process was slow or failing). Since we can only specify a single path per select statement, we created a for loop to generate an insert statement like this:</p><pre>INSERT INTO &lt;table&gt;<br>SELECT * FROM FILE_1<br>UNION ALL<br>SELECT * FROM FILE_2<br>...</pre><p>With this strategy, we were able to run the load for 100 applications at once, speeding things up significantly.
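The generation loop is straightforward; a sketch of the batching helper (names are hypothetical, and the projected columns and FILES() options are simplified):

```typescript
// Collapse one SELECT per Parquet file into a single INSERT using
// UNION ALL, so a batch of files costs one load job instead of many.
function buildBatchedInsert(table: string, s3Paths: string[]): string {
  const selects = s3Paths.map(
    (path) => `SELECT * FROM FILES("path" = "${path}", "format" = "parquet")`
  );
  return `INSERT INTO ${table}\n${selects.join("\nUNION ALL\n")};`;
}
```

Each batch then produces a single table version, keeping us well under the compaction limit while still parallelizing across batches.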
This allowed us to load all ~1 trillion rows on the archive side, from 6 million applications, in about 3 days.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/917/0*KWq7E2UDRos9LdyZ" /></figure><h3>Acknowledgements</h3><p>Thank you to Daniel Kozlowski, Kun Zhou, Matthew Jin and Xiaobing Xia for all of their contributions on this project.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6af555e8b3fe" width="1" height="1" alt=""><hr><p><a href="https://medium.com/airtable-eng/live-shard-data-archive-export-and-ingestion-to-starrocks-for-validation-6af555e8b3fe">Live Shard Data Archive: Export and Ingestion to StarRocks for Validation</a> was originally published in <a href="https://medium.com/airtable-eng">The Airtable Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[Building a Resilient Embedding System for Semantic Search at Airtable]]></title>
      <link>https://medium.com/airtable-eng/building-a-resilient-embedding-system-for-semantic-search-at-airtable-d5fdf27807e2?source=rss----103630b30187---4</link>
      <guid isPermaLink="false">https://medium.com/p/d5fdf27807e2</guid>
      <category><![CDATA[semantic-search]]></category>
      <category><![CDATA[embedding]]></category>
      <category><![CDATA[airtable]]></category>
      <category><![CDATA[search]]></category>
      <category><![CDATA[ai]]></category>
      <dc:creator><![CDATA[Will Powelson]]></dc:creator>
      <pubDate>Wed, 20 Nov 2024 17:27:11 GMT</pubDate>
      <atom:updated>2024-11-20T19:45:46.437Z</atom:updated>
      <content:encoded><![CDATA[<p>When ChatGPT burst into the public eye in 2022, a small team of engineers at Airtable started ideating on the different ways our platform could leverage this new set of capabilities. One idea kept popping up: rich, semantic search over customer data.</p><p>Imagine a marketing team asking, “Can you find past campaigns similar to this one?”, a product management team asking “Can you find engineers whose expertise matches this project?”, or in my case “Can you find past issues (called “escalations” internally) similar to this one?”. Embedding-powered systems can allow teams to quickly identify insights that would normally take hours of manual labor.</p><figure><img alt="a UI showing a suggested past escalation similar to the one the user is viewing." src="https://cdn-images-1.medium.com/max/1024/0*RSzCyXnHyRlkeNe1" /><figcaption>[from our internal issue tracking base]</figcaption></figure><h4>Thinking About Building An Embedding-Powered System</h4><p>Embeddings are at the heart of semantic search. Embeddings are vectors (arrays of numbers): numerical representations of data that capture meaning in a way machines can understand. Concretely, an embedding can tell you that “cat” is similar to “tabby”, but not similar to “car”.</p><p>Building a system around embeddings isn’t trivial. There are multiple dimensions to consider:</p><ul><li><strong>(Focus of this post) </strong><em>Embedding Lifecycle Management<br>- Triggering</em> — As data changes, embeddings need to be (re)generated. We need to durably track the state of generation.<br>- <em>Generation</em> — Generating the embeddings via API calls (over the network, or in-process)<br>- <em>Persistence</em> — Where and how the embeddings are stored<br>- <em>Deletion </em>— Deleting old data when it’s no longer needed<br>- <em>Consistency</em> — How do we maintain consistency between our vector data and primary database?<br>- <em>Migrations</em> — This has several flavors, e.g.
changes to the DB schema, changes to the DB engine (e.g. LanceDB vs OpenSearch), changes to the AI embedding model, or changes to the underlying storage layer’s encryption<br>- <em>Disaster Recovery</em> — What happens if we experience data loss?</li><li><em>Data Preparation<br>- Corpus Choice</em> — What data will we embed?<br>- <em>Configuration — </em>Embedding model choice, chunking strategy, other knobs</li><li><em>Query And Access<br>- Querying</em> — Via semantic search or direct access<br>- <em>Indexing</em> — Data structures for efficient semantic similarity queries<br>- <em>Permissions</em> — Making semantic search respect our permissions model</li><li><em>Operational Concerns<br>- Security</em> — Embeddings <a href="https://www.tonic.ai/blog/sensitive-data-in-text-embeddings-is-recoverable">are sensitive customer data</a> and must be treated accordingly<br>- <em>Cost</em> — How do we keep the cost of generation, persistence, and querying manageable?<br>- <em>Partition Management</em> — If you have many collections, how do you handle scaling the number of partitions?</li></ul><p>Given all that, we needed a system that was not just functional but robust and adaptable.</p><h4>A High-Level Overview of Airtable’s Architecture</h4><p>Airtable offers an app-building platform built on top of a custom, in-memory database backed by MySQL.</p><p>Let’s introduce two useful terms we’ll be using:</p><ol><li><em>MemApp</em> — Our in-memory database; it manages all reads and writes for a base.<br>- The data is ultimately persisted in MySQL</li><li><em>Base — </em>A particular instance of MemApp</li></ol><p>A critical detail: MemApp is a single-writer database.
All writes occur serially, which becomes important when we talk about maintaining consistency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S1pdw-rPWm3Pbx1hIuEFpw.png" /><figcaption>[The Critical Data Flow For Airtable]</figcaption></figure><h4>Building A Data Model</h4><p><em>Fully Or Eventually Consistent?</em></p><p>Our first fundamental design question was this: Should embeddings be stored within MemApp to ensure full consistency with the data? While appealing in theory, this approach had two major drawbacks:</p><p>1. <strong>Cost</strong>: Memory usage is a significant factor in Airtable’s expenses. Storing embeddings in MemApp would be prohibitively expensive because embeddings are often 10x the size of the underlying data.</p><p>2. <strong>Performance</strong>: Achieving strong consistency would require generating embeddings within transactions. This process would be too slow for bulk updates and would limit us to less capable, in-house models rather than leveraging top-tier providers like OpenAI.</p><p>The solution? Embrace eventual consistency, and store and process embeddings outside of MemApp. This decision led us to design a data flow where embeddings are generated asynchronously and stored in a separate vector database — with their state tracked within MemApp.</p><p><em>Data Modeling</em></p><p>We anticipated that AI models, embedding providers, and storage engines would evolve. To future-proof our system, we introduced an abstraction within MemApp called an <strong>embedding config</strong>.
This allows developers to map data from MemApp to a table in a vector database without worrying about the underlying complexities.</p><p>An embedding config has the following primary components:</p><ol><li><em>data subscription</em> — A declarative description of the data a user has told us they want to embed.</li><li><em>embedding strategy</em> — How the data is to be embedded.</li><li><em>storage configuration</em> — Where the data will be stored.</li><li><em>triggering configuration</em> — Once data is out of date, when do we re-generate it?</li></ol><p>We’ll be using the terms <strong>data subscription</strong> and <strong>embedding config</strong> repeatedly in this post.</p><p>Because the system is eventually consistent, we knew we needed to track the state of embedding for every piece of data that a data subscription needs to embed, which leads us to…</p><p><em>Tracking Consistency:</em></p><p>Given that we would be eventually consistent, the latest version of data persisted to our vector database will always lag behind the data in MemApp. We need to be able to determine which data has been embedded and what may be out of date (for filtering stale results or fixing up the vector database). To accomplish this it helps to have an <em>ordering of data versions</em>. Ordered data versions also let us handle out-of-order write operations.</p><p>Since MemApp is a serializable database, its <em>transaction number</em> is perfect for this — a BigInt that increments with each write.
For each piece of data we embed we create an <strong>embedding state:</strong></p><pre>type EmbeddingState = {<br>  // last recorded transaction that caused a write to the vector database<br>  // this may be lower than the actual number in between when a write to<br>  // the vector database occurs and when the update to MemApp occurs<br>  // (those writes to MemApp may also be lost due to system failures)<br>  lastPersistedTransaction: number | null,<br>  // always correct and up to date, we update this transactionally when<br>  // data in the data subscription is updated<br>  lastUpdatedTransaction: number<br>}</pre><p>Being more concrete, we expect updates over the lifetime of an embedding state to look like this:</p><pre>// new data to be embedded enters the data subscription<br>{<br>  lastPersistedTransaction: null,<br>  lastUpdatedTransaction: 2,<br>}<br>// once the data is embedded, we mark it as up to date<br>{<br>  lastPersistedTransaction: 2,<br>  lastUpdatedTransaction: 2,<br>}<br>// data is edited, we mark it as stale so it can be re-embedded<br>{<br>  lastPersistedTransaction: 2,<br>  lastUpdatedTransaction: 5,<br>}</pre><h4>The Simple Life Of An Embedding</h4><p>Here’s a rough sketch of our system when data changes and needs to be re-embedded</p><figure><img alt="Data to embed flows to the embedding generation service. This sends the data to an AI provider to create embeddings, which are then stored in a vector database. Finally, the embedding generation service tells MemApp the vector database was successfully updated." 
src="https://cdn-images-1.medium.com/max/1024/1*jo6Hr_ASiKbhBi62Dd1mIQ.png" /></figure><p>Let’s walk through the flow in practice.</p><p><em>Initialization</em></p><p>Upon creating a new embedding config, MemApp automatically provisions a new vector database table and generates embedding states for each relevant data chunk.</p><p><em>Detection</em></p><p>When data changes, we update the embedding state’s <em>lastUpdatedTransaction</em> to reflect the current transaction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*w02YYKoxVKiwhtQr" /><figcaption>[an example of change detection]</figcaption></figure><p><em>Triggering</em></p><p>Tasks are created to generate the embeddings for each config that saw data in its data subscription updated. Currently we do this within the transaction in which the detection occurs.</p><p><em>Generation</em></p><p>Our embedding service processes the task(s) and generates the embeddings. We have ample retry logic in case this fails; there’s also a separate blog post’s worth of material on how we prevent poorly behaving bases from consuming our global rate limits from AI providers or otherwise tanking quality of service.</p><p><em>Persistence</em></p><p>Once embeddings are generated, we store them in our vector database. We prevent out-of-order writes by making the insertion conditional on the transaction number for the write being greater than the transaction number stored in the vector DB.</p><p>If the update <em>has</em> been outpaced, we silently exit the flow since no more work is necessary.</p><p>We then confirm with MemApp that the write has happened, updating the <em>lastPersistedTransaction</em>. Once again we handle potential out-of-order writes by only ever increasing the value.</p><p><em>Deletion</em></p><p>Deleting individual embedding states will delete the data in the embedding store.
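Both the persistence and deletion paths rely on the same guard: an update only applies if its transaction number beats the one already recorded. A minimal sketch (the EmbeddingState type is from this post; the function itself is a hypothetical illustration):

```typescript
type EmbeddingState = {
  lastPersistedTransaction: number | null;
  lastUpdatedTransaction: number;
};

// Confirm a vector-DB write back to MemApp. lastPersistedTransaction
// only ever increases, so late-arriving confirmations are no-ops.
function confirmPersisted(state: EmbeddingState, txn: number): EmbeddingState {
  const prev = state.lastPersistedTransaction;
  if (prev !== null && txn <= prev) {
    return state; // out-of-order confirmation; keep the newer value
  }
  return { ...state, lastPersistedTransaction: txn };
}
```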
Again we use conditional deletions to handle out-of-order writes.</p><p>Deleting an entire embedding config triggers automatic cleanup of <strong>all</strong> data: we delete the associated embedding states and the vector database table. This reduces storage costs and lets us meet various data retention guarantees. It also gives the system a very nice property — the real meat of this post…</p><h4>Operations In Practice — Migrations And Failures</h4><p>Systems fail and requirements change. We needed a strategy that handled:</p><ul><li><em>Vector Database Failures<br>- Database Corruption</em> — the database exists, but is in a bad state<br>- <em>Catastrophic Data Loss</em> — the database is deleted or unavailable</li><li><em>Vector Database Migrations<br>- Schema Migrations — </em>storing new metadata<em><br>- Database Engine Migration</em> — e.g. LanceDB to Milvus or vice versa<br>- <em>Data Residency Migration — </em>e.g. US to EU<br>- <em>KMS migration </em>— when a customer using <a href="https://support.airtable.com/docs/airtable-enterprise-key-management">our KMS feature</a> rotates their encryption key</li><li><em>Permission Changes<br>- AI Provider Changes </em>— Airtable supports multiple AI providers. What if a user forbids use of OpenAI and only allows models served via Bedrock?<br>Note: Embeddings cannot (and should not) be compared across models, so updates require re-embedding everything.</li><li><em>MemApp Migrations<br>- Deprecating AI Models</em> — We need to update the AI model to one that is still supported<br>- <em>Updating the Embedding Strategy</em> — If we change how we embed data, we re-embed everything to keep comparisons apples-to-apples</li><li><em>MemApp Operations<br>- Updating Data Subscriptions — </em>This changes the data we’re embedding, and so usually requires completely re-embedding the data<br>- <em>Cloning a base</em> — Airtable allows you to clone a database.
We need separate vector stores to keep the data model sensible over time.<br>- <em>Base Snapshot Restoration</em> — Airtable has multi-year-old snapshots that can be deserialized into a base</li><li><em>MemApp Vector Database Synchronization<br>- MemApp corruption — </em>What if we don’t create embedding states for some of our data?<br>- <em>Detection Change Bugs </em>— What if we don’t propagate changes to the vector store because we fail to detect them?<br>- <em>Task Queue Failures</em> — What happens if our task queue violates its “at least once delivery” of updates?</li></ul><p>You can imagine many ways to handle these cases. One pattern I found myself coming up with over and over again was:</p><ol><li>Deleting the old embedding config (cleaning up old data if it exists)</li><li>Creating a new embedding config in the old one’s place (possibly with new settings)</li></ol><p>We call this process <strong>resetting</strong> the embedding config.</p><p>I realized we could handle all of these cases if, any time the system detects that embeddings are, or are about to become, invalid, we reset the embedding config.</p><p>Moving a base from the US to the EU is illustrative.</p><p>When a base moves from the US to the EU, sensitive data related to the base must move to the EU, with nothing remaining in the US. Since embeddings are sensitive, they must be moved as well. To handle this we delete every embedding config in the base; our system then automatically deletes all existing embedding data as part of the normal cleanup process. We then re-create the embedding configs with a configuration to store the data in the EU. Voila! All the embedding data for the base has now been migrated.</p><figure><img alt="picture of stages of migration:" src="https://cdn-images-1.medium.com/max/1024/1*axOuXtFZxN_YpKF4gc1RLg.png" /><figcaption>Example Migration</figcaption></figure><p>You can repeat this exercise for every single case listed above.
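In code, the reset pattern is just those two steps in sequence (the names here are hypothetical; the real cleanup cascades through the deletion path described earlier):

```typescript
// A store abstraction over embedding configs. Deleting a config is
// assumed to cascade: embedding states and the vector table go with it.
interface EmbeddingConfigStore {
  deleteConfig(configId: string): void;
  createConfig(configId: string, settings: Record<string, unknown>): void;
}

// Reset = delete the old config, then recreate it in place,
// possibly with new settings (e.g. a new storage region).
function resetEmbeddingConfig(
  store: EmbeddingConfigStore,
  configId: string,
  newSettings: Record<string, unknown>
): void {
  store.deleteConfig(configId);
  store.createConfig(configId, newSettings);
}
```

Because creation re-provisions the vector table and embedding states, re-embedding then proceeds through the normal initialization flow with no special-case code.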
This approach had a few tradeoffs we had to consider:</p><p><em>Cost</em></p><p>Regenerating embeddings incurs expenses, but these are manageable compared to storage and indexing costs from incremental updates.</p><p><em>Downtime</em></p><p>Resetting embeddings causes temporary unavailability of semantic search features. Given the rarity of resets and the speed of re-embedding (p99.9 under 2 minutes), this was acceptable.</p><p>Handling this kind of downtime is already necessary at the product level, since users can trigger large-scale generations very easily (e.g. cloning a base).</p><p><em>Potential Runaway Resets</em></p><p>We had to implement safeguards to prevent continuous reset loops, which could lead to spiraling costs and prolonged downtime. We used a combination of metrics, alerts, rate limiting, and idempotent requests.</p><h4>Conclusion</h4><p>Our embedding system isn’t perfect, but it’s resilient and adaptable. By accepting eventual consistency and designing for graceful handling of failures and migrations, we’ve built a foundation that supports powerful semantic search capabilities, and we’re continuing to iterate on this system as we find new ways to leverage AI to increase the power of the Airtable platform.</p><p>Some other interesting topics we totally glossed over but may touch on in the future:</p><ol><li>Failures in the embedding generation</li><li>Downtime for MemApp</li><li>Managing global AI rate limits</li><li>Choice of vector indices</li><li>How we apply filters and permissions</li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d5fdf27807e2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/airtable-eng/building-a-resilient-embedding-system-for-semantic-search-at-airtable-d5fdf27807e2">Building a Resilient Embedding System for Semantic Search at Airtable</a> was originally published in <a href="https://medium.com/airtable-eng">The Airtable Engineering Blog</a> on
Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
    </item>
  </channel>
</rss>
