The Airtable Engineering Blog - Medium

Instant Schema Changes in Airtable’s New Database

Chris Pederkoff — Mon, 13 Jul 2026 17:44:42 GMT

In our last post, we explained why we rewrote Airtable’s in-memory database in Rust. This post is about a specific tradeoff that came with it: keeping schema changes fast.

Our legacy JavaScript database could add or remove a column without rewriting any existing rows, so schema changes were very quick. The new database’s row layout is more compact and faster to read, but that comes with a cost: to change the schema we must rewrite every row in the table. For some of our largest customers, that rewrite could turn what should be an instant operation into a noticeably slow one.

In this post, we’ll walk through how we got that speed back: how do you add or remove a column from a database while keeping the operation fast regardless of the size of the table?

The Naïve Approach

Our initial implementation took the simplest route: when adding a column, rewrite every row. Walk through the entire table, insert a default value into each row at the correct position, and commit the changes as transactional row updates. It’s correct and easy to reason about, but the cost scales linearly with the number of rows. In some cases we saw a migration take two minutes, 80% of which was spent rewriting the rows.

Adding a column writes a null value to all existing rows.

Alternatives We Considered

We explored a few options before landing on our new design. Regardless of the add column approach, column deletes were straightforward: mark the column metadata as deleted and skip it on read.

Removing a column causes new rows to write null in the removed spot

Separate storage for new columns. Store newly added columns in separate data structures, keeping the original rows untouched. This hurts data locality: reading a single logical row would require a lookup per new column, so this approach was rejected.

Append-only row extension. Always append new columns to the end of the row and use the row’s length as a proxy for its schema version. If a row has 12 columns and the current schema expects 14, we know the last two columns were added after this row was written, so we can return null for them. This works in the short term, but the number of physical columns only ever grows. When a column is deleted, we can’t reclaim that space, because the length of the row implicitly tells you the row’s schema version. Shortening the row would be ambiguous. Memory usage scales with the cumulative number of columns created since the database was first loaded, not the number currently in use. As an in-memory database, we are particularly sensitive to memory use, so this approach was rejected.

Our Solution: Per-Row Schema Versioning

We landed on an approach also used by MySQL 8.0’s instant ADD/DROP column: per-row schema versioning. With this approach, every row carries a schema version number which was the current version when the row was written.

When a column is added or removed, we increment the current schema version, modify the current schema, and update metadata tables which describe how to interpret rows written at older schema versions. When a row is read, we look up its schema version, determine what translation is needed, and return the row at the current schema version.

Schema changes are now metadata-only operations. Adding a column to a table with a million rows takes the same amount of time as a table with ten rows.

Implementation

At the center of the implementation is a schema translator which contains instructions for updating rows to the newest version.

When given a row and its schema version, the translator knows how to map each column in the current schema to the correct index in the physical row, or to return null if the column was added after the row was written.

One optimization that made a significant difference was only translating the columns actually needed. Consider a query that selects 2 columns from a table with 100 columns. Without this optimization, you’d translate the entire row to the current schema version upfront: 98 columns of wasted work.

To avoid this, we created a proxy that behaves as if it were a row at the current schema version. This struct holds a reference to the original row and a reference to the translation table at the row’s schema version. Code accesses specific values from this proxy as though it is at the current schema version and we use the translation table to return the correct value. The rest of the system doesn’t need to know that the underlying data might be at an older schema version.

The Tradeoffs

This approach optimizes for low latency schema changes, which is a feature our users love. A user adding a column to a large base should get an instant response, every time. In exchange, there’s a small hit to read performance from on-the-fly translation. There are a number of unrealized optimizations we can make here, so we have headroom to revisit this if it ever becomes an issue.

One optimization we plan to make is to migrate rows in a background task. Once all of the rows in a table have been migrated to the latest schema version, we can skip the version check entirely for that table, at least until the next schema migration.

Results

Before (red) and after (green) on a log scale

While read performance took a small hit, schema changes are dramatically faster at higher row counts because they are now a metadata-only operation rather than scaling with table size.

In production, we saw max and p99 schema change latency drop significantly after the deploy, with peaks falling from 60 seconds to 15 seconds:

What’s Next

Per-row schema versioning is one piece of a larger effort to make every operation in Airtable’s database feel instant, regardless of scale. In upcoming posts, we’ll cover more of the internals: search indexing, how MVCC and garbage collection work, our approach to incremental view maintenance, and lessons learned from running Rust inside Node.js in production.

Instant Schema Changes in Airtable’s New Database was originally published in The Airtable Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How we reduced archive storage costs by 100x and saved millions

Matthew Jin — Wed, 07 Jan 2026 18:42:16 GMT

In this post, we introduce a new storage system that we built in order to cost-efficiently store log data while providing interactive query latency. We’ll cover some motivations, architecture, migration process, and interesting optimizations we made along the way.

Archive Data

Going into 2024, cost savings was one of the major goals for the storage team. Our AWS MySQL RDS storage footprint was rapidly growing, with petabytes of stored data. Moreover, some of our largest databases in our fleet were approaching the RDS maximum disk space of 64TB, which would render the databases inoperational. We noticed that our largest dataset in MySQL was our “cell history” and “action log” tables (which we will collectively call “archive data”), basically an audit log for the activity within a base. This is used to service our revision history features, and also often for internal debugging purposes. For some enterprise customers, we have committed to retain this data for up to 10 years. We wanted to stop storing this data in MySQL to save money, but we still needed to provide interactive level query latency and the same level of durability and availability guarantees as MySQL.

These are some characteristics of our archive data:

The vast majority (petabytes, trillions of rows) is old data that is relatively infrequently accessed by application code.
When it is accessed, the majority of the QPS comes from query patterns that are point selects or range queries used in pagination. The queries are also always filtered by a specific base.
Old data is read-only, with the exception of when the data is hard deleted.
The data is primary key’d by MySQL’s autoincr_id, meaning that it is essentially sorted in insertion order.

MySQL provides easy query access to this data, but is expensive and thus a poor fit for such cold traffic patterns. We formulated an idea to archive and migrate the data out of MySQL into AWS S3 (which is 10x cheaper byte for byte), partition the data by base into individual Apache Parquet files, and build a new query engine using Apache DataFusion to serve S3 Parquet files. Recent cell history and action log rows would continue to write to and serve from MySQL, making this a two-tier storage system.

Architecture Overview

Parquet

First of all, a quick overview of Parquet. It is a columnar file format designed for analytical workloads. Instead of storing data row-by-row like MySQL InnoDB, Parquet stores each column contiguously on disk. Furthermore, Parquet files are organized by row groups, each of which contains a horizontal partition of the dataset. A row group contains a chunk of rows for every column, and each column is stored as a separate column chunk.

The Parquet file format also contains metadata useful for query planning in an engine. File metadata provides offsets and size information useful for navigating the Parquet file and page header metadata is stored in-line with the page data, and contains useful statistics such as min/max values and bloom filters of the column chunks. Modern query engines that support querying Parquet are able to use the file format’s metadata to construct its execution plan and avoid scanning and filtering over many row groups entirely.

This is what a simplified visual representation of the type of data Parquet files have:

Parquet File
└── RowGroup[0]
│ ├── ColumnChunk[colA]
│ │ ├── ColumnIndex (min/max stats)
│ │ ├── BloomFilterIndex
│ │ └── Page data
│ └── ColumnChunk[colB]
…
└── File Metadata

In constructing our Parquet files, we did not significantly change the schema from its original MySQL schema. We kept them sorted by autoincr_id because most of our query patterns are point selects or range queries on autoincr_id (and some other additional conditions). This ordering allows our query engines to effectively consult metadata and selectively download the relevant byte range of data from S3. We also partitioned our Parquet files by base so that our queries (which are all per base) avoid scanning unrelated base data.

Lastly, we decided to register the S3 locations of these Parquet files in DynamoDB as our own additional layer of metadata. This allows our storage client code to easily register in the query engine which files it needs to query over. It was also important in supporting features like data residency and encrypting data by customer keys.

As a side node, thanks to the columnar file format, we were able to get significantly higher compression ratios — our final archive data set was also 10 times smaller than the original data set in MySQL. Coupled with S3 itself being 10 times cheaper byte for byte, we were actually able to build a system that was 100 times cheaper than MySQL in storage costs!

DataFusion

In early 2024, we began the project by first benchmarking a variety of query engines over Parquet files. All these engines speak SQL and are able to query Parquet files from S3.

AWS Athena
DuckDb
Starrocks
DataFusion

One of the first engines we tried was AWS Athena. However, Athena’s architecture is more suitable for general OLAP workloads, and we it didn’t meet our latency needs to provide an interactive, user-facing feature. The Athena API expects queries to be made with a StartQueryExecution call followed by a GetQueryExecution poll to retrieve the query result — as a result, we generally had seconds of latency. Also, as a managed service by AWS, there’s no isolation between separate bases, something we like to prioritize here at Airtable.

With DuckDB, we found that the query planner did not always effectively use projection pushdowns, a method to reduce the number of data scanned by first moving filters into the initial data scan. Some of our query patterns resulted in entire files being downloaded. Simple point select query on a single autoincr_id worked as expected and achieved subsecond latency, but we found it generally worse compared to DataFusion. In general, we believe the tool is more appropriate for ad-hoc analytics or if you just want to query Parquet fast. In fact, we leveraged DuckDB frequently for debugging purposes throughout our development process — it was extremely convenient to be able to run a CLI tool to quickly inspect contents of S3 Parquet files when working through validation issues.

Here’s a quick table of some benchmarked results on one select internal Airtable base. By no means is this representative of all query patterns for these various query engines. It is simply a small subset of the queries that we have.

*rough number as the performance of all queries were averaged in this experiment, and some were cached results

With Starrocks, we found solid performance results comparable to DataFusion, but the operational complexity involved with running a cluster full time in k8s to serve relatively low QPS queries for cold storage put us off. Like Athena, it also lacks strong isolation between bases.

After these investigations, we settled on DataFusion. It is an extensible query engine written in Rust, and we found it to be the best engine at using Parquet’s advanced features to implement queries efficiently. Its extensible nature also proved to be handy as we made query optimizations, which we will discuss more later. As an embedded library, we were also able to embed it into our worker processes that are already per base, which has a number of advantages:

Low operational overhead: Since the engine is embedded, there is no additional service to manage in production. Local development, CI, etc. also did not require much additional setup.
Strong isolation between bases: Again, since the engine is embedded, our existing architecture for per base processes provides us this guarantee. We did not need to introduce any new mechanism to prevent bases from contending with each other for CPU, RAM, or network bandwidth.
Strong affinity with requests by base: This allows us to implement effective caching mechanisms, which we will discuss later.

Data Migration

After we settled on our choice of cost efficient storage and a query engine, we had to migrate the data out of MySQL. We prioritized a process to do a one-time migration of data out of MySQL into S3 to immediately start saving money.

In order to get a consistent view of the data for export, we chose to leverage AWS RDS’s snapshotting capabilities, which returns large Parquet files of the entire table. We also prototyped a system to run SQL directly on the databases and write Parquet files, but we did not productionize orchestrating such a system at scale. We preferred AWS’s snapshot capabilities because it runs on the backup database instance and does not incur additional load to our production systems.

However, these are snapshots of massive tables across a host of database shards. All our queries are filtered by specific bases, so we also decided that our Parquet files should be partitioned by base. In order to construct these partitioned serving Parquet files, we spun up a number of Flink jobs that parallelized over all our database shard snapshots, repartitioned these snapshots by base, and dumped them in some intermediate S3 directories. Then, we used AWS step functions to scan these S3 directories and enqueue the bases into AWS SQS. Lastly, from there, we ran some custom “compactor” code that merged these intermediate files together. In this compaction process, we merge-sorted the various files, deduplicated records, and made sure each individual final Parquet file did not exceed 1GB. This was an appropriate serving size we determined during initial benchmarking that had a good density of page groups per file.

Overview of the migration process

Validation

It was an important priority for us to provide a seamless transition for our end users throughout this migration process, so before we launched this new system to serve live traffic, we had to do some validation first to ensure we didn’t corrupt data during the migration or introduce unexpected bugs.

We previously wrote about our bulk validation process to guard against data corruption in more depth in this blog post. TL;DR, we spun up a Starrocks cluster so that we could quickly run validation queries between the serving Parquet files and the RDS snapshots. Fortunately, we found zero cases of data corruption throughout this process — the repartitioner and compactor code had worked flawlessly.

However, this project introduced a lot of new storage client code. To give an idea of the new complexity here, we

Wrote a query engine with DataFusion in Rust.
Integrated the Rust query engine into our Node.js client code with napi-rs.
Wrote new client code with logic to combine MySQL and S3 results, identify which S3 files to query, etc.
Supported existing enterprise features such as encryption with customer provided keys, regional data residency, hard deletion, etc.

Bulk validation was a necessary test to ensure our data migration processes did not corrupt data, but it did not validate all the other client code and demonstrate that users see the exact same revision history in their bases before and after the migration.

Once we had the query engine built, we began to perform shadow validation on live traffic. Every request would read from MySQL like normal, but in the background, also issue the same request via the new query engine. We caught a number of simple implementation bugs, and also saw a variety of interesting issues, such as:

Mismatched float precision between Javascript (our typical client) and Rust’s serde JSON library.
An interesting case where it looks like we dropped entire database shards of data! But it turned out that DataFusion was unexpectedly doing a lexicographical instead of numerical sort.
A crashing SIGABRT issue with async napi-rs and Node.js worker threads.
Latency performance concerns.

In the end, we were able to resolve all these bugs prior to launching this to users and deleting the data in MySQL.

Performance Optimizations

During our staged rollout process, we discovered a number of performance issues around latency. They were caused by a variety of issues, ranging from inefficient query plans, bottlenecks in network requests to S3, or sparse filters resulting in more data downloaded than necessary. We’ll discuss some of the interesting improvements made here.

Caching

One particularly interesting performance optimization we made was around caching. As one can imagine building a query engine off S3, the bottlenecks are primarily in the network roundtrips to S3. DataFusion basically converts our SQL statement into a series of S3 GET statements. It fetches Parquet footer metadata, Parquet column chunk metadata, and then uses this information to decide which page groups and which column chunks of the Parquet files actually need to be fetched. We built a tiered caching system to reduce the number of S3 GET requests needed.

For the first layer, we were able to easily use DataFusion’s built-in cache infrastructure (CacheManagerConfig) to cache Parquet file metadata and S3 ListObjects calls.

Next, we cached the rest of the Parquet page header metadata. Together with the built-in file metadata cache, we effectively reduced the constant need to roundtrip to S3 during the query planning process. With pushdown filtering, DataFusion only had to consult the cached metadata to get a good idea of which page group byte ranges it needed to scan. Being an extensible query engine, DataFusion makes it simple to write your own cache implementation. Typically, DataFusion provides a default parquet reader interface (which basically implements functions like get_metadata, get_bytes, get_byte_ranges), but it also allows you to substitute your own implementation instead. So, we wrote an implementation that cached metadata results in-memory. It was also straightforward to add observability and other instrumentation here, and we were able to confirm that we typically have a 99%+ cache hit ratio here. This was as expected because of how our DataFusion engine is embedded in a per-base process and how Parquet files are similarly partitioned by base — the request affinity is as good as it gets here.

Lastly, we also implemented an on-disk cache that preemptively downloads the Parquet files for a base. Similar to the previous cache layer, we were able to implement this by substituting the default provided S3 ObjectStore implementation with a custom implementation that downloaded to and read from the local disk. Caching entire Parquet files on disk is more work and cost than just metadata though — fortunately, this was only needed for an extremely small number of bases with poorer performance due to large amounts of data and pathological query patterns.

Overall, we found DataFusion’s extensible nature to be easy to work with, flexible, and the ideal tool to build high performance query engines.

Custom Indexes

We previously highlighted that most of our queries are point selects and range queries on autoincr_id, but we also often have additional conditions that filter rows far more efficiently than just autoincr_id. Some examples include queries that

filter by the action type, e.g. looking for actions that updated Airtable column names.
filter by the row being updated
omit updates from our sync feature

Some of these additional filters resulted in matching on a tiny fraction of rows, which makes fetching and scanning entire Parquet files wasteful. For example, imagine filtering out sync updates, but having a base that was predominantly made of sync actions — such naive queries would naturally be slow.

No database system with varying query patterns is complete without secondary indexes. In order to make these queries more efficient, we built an indexing system leveraging DataFusion that scans through Parquet files and writes indexes as new Parquet files. Our DataFusion and surrounding client code is aware of these index files, and queries them first in order to generate a more efficient query on the original Parquet files. It was easy to build this ad-hoc indexing system mainly because of how our data is read-only. We never have to worry about our data changing and also having to update indexes in sync, which more sophisticated database systems typically have to do.

Bloom Filters

As we previously mentioned, most of our query patterns are similar to point selects or range queries on an autoincr_id, the “primary key” of how our Parquet files are laid out. However, we do have some significantly lower QPS queries that were point selects on a different, unique identifier. This unique identifier was randomly distributed, so the min/max statistics on Parquet were useless — we’d end up having to fetch and scan every single page group and then apply the filter afterwards. We could address this with the same custom index strategy as above, but we found it simpler to just rely on Parquet’s bloom filter metadata, which DataFusion understands.

Bloom filters are a space efficient probabilistic data structure that tests whether an element is a member of the set, with false positives being possible but false negatives impossible. In this case, each column chunk metadata for the identifier contained the bloom filter. Because false negatives are impossible, they can be used to filter out some page groups that definitely don’t need to be fetched.

Conclusion

Overall, we were able to move petabytes of data out of MySQL, build a system that was 100x cheaper in storage costs, and save millions of dollars per year, all while maintaining interactive query latencies for our users.

For future work, we have plans to make this system incrementally archive data out of MySQL. We prioritized a manual batch archiving process to migrate petabytes of data out of MySQL and start saving money immediately, but there’s still a lot of engineering work to be done to make this a fully automatic system with less operational burden. We envision setting up a CDC system like Flink to handle this, but there are going to be a lot more unsolved and interesting problems around how we handle compacting Parquet files together, rebuilding indexes, managing the operational side of things, etc. Also, our initial implementation targeted our largest dataset, but there are other log-like tables we could onboard to this system as well.

If this type of work optimizing database query engines and working with petabytes of data sounds exciting to you, apply to Airtable! We’re hiring at https://airtable.com/careers.

Thanks to all past and present Airtablets on storage and across the organization who contributed to this project: Nathan Chou, Aiden Dou, Riley Hockett, Matthew Jin, Daniel Kozlowski, Brian Larson, Keunwoo Lee, Mike Milkin, Gavin Towey, Andrew Wang, Xiaobing Xia, Alex Yao, Brian Zhang, Kun Zhou

Apache, Apache Parquet, and Apache DataFusion are trademarks of The Apache Software Foundation.

How we reduced archive storage costs by 100x and saved millions was originally published in The Airtable Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Live Shard Data Archive: Export and Ingestion to StarRocks for Validation

Riley — Mon, 24 Mar 2025 17:11:09 GMT

Overview

At Airtable, we store our application or “base-scoped” data on a number of sharded MySQL instances in Amazon’s Relational Database Service (RDS). Each Airtable base is associated with a single one of these sharded instances, and as the base and the data in the base changes, we store some append-only data associated with the history of the base. This data powers features such as undo and revision history, allowing us to display record revisions as far back as 3 years ago for our enterprise customers. As we have grown as a business, these append-only tables have become increasingly large and their content now represents close to half of all the data that we store at Airtable. Additionally, much of this data is very infrequently accessed, but is stored in the same storage layer as all of our base-scoped data for every day product use making it expensive.

The Project

The Live Shard Data Archive (LSDA) project allowed us to shrink the disc volumes of the bulk of our RDS instances by taking this infrequently accessed, append-only data and storing it in a cheaper storage solution, S3. Once it was stored in that cheaper solution, we were able to drop the old data, and rebuild the RDS instances to reclaim the space.

Moving this data to S3 required three major steps. First, we had to archive and transform the data from RDS into S3 such that it could be accessed by our codebase in an efficient and consistent way. Second, we had to validate that this archived data matched the existing source data in RDS, and that our process of archiving did not cause any inconsistencies between the two datasets. Finally, we had to make application code changes to serve this data from S3. After these steps, we were able to truncate the data from RDS that we were serving from S3, allowing us to shrink the allocated storage space of these instances, saving a substantial amount of our overall RDS bill. This blog post will focus on the first phase of the second step of that process, data validation.

Archiving and Transformation

The data archival from RDS into its final shape in S3 was a three-step process. First, we exported snapshots of our databases into S3 as parquets. This is a built-in RDS feature. To optimize query latency of the archive, we then repartitioned the snapshot exports by our customers. This was a challenging process due to the amount of data (>1PB) and the number of files involved (>10M). We used Apache Flink to incrementally ingest these files and repartitioned them into per customer partitions. Finally, we ran a highly-concurrent Kubernetes rewriter job that sorted the archive for each customer, and added the proper index and bloom filters to the rewritten parquets to speed up the most common query patterns.

Validation overview and considered approaches

Validating the data required us to do a row by row comparison from the archive to our source data and make sure that for every row, these values were equal. Naively, an easy way to do this would be to read a row from our archive, find that row in RDS, and confirm they are the same. However, we were dealing with almost 1PB of data and close to 2 trillion rows. Additionally, our RDS instances in production serve customer traffic, so saddling them with these additional requests was not really an option, especially at the volume we would require to validate our entire archive. As a result, we decided to use the original, unmodified RDS export as our source data for this validation project. This data was stored in S3, and while that alleviated the problem of querying serving instances, it would simply be too slow to go row by row and validate this data. Ultimately, we decided that if we had all of the data in some relational database, we could just join the two tables together, and find any discrepancies in the data that way.

Leveraging StarRocks for Data Validation — Airtable Data Infrastructure Team

For the data validation project, the Data Infrastructure team helped the Storage team in selecting the best tool to complete the validation work efficiently.

Why Use StarRocks to Address the Data Validation Problem?

The core of the data validation problem involves performing a large number of join operations between two massive datasets, each containing nearly a trillion rows of data. The primary challenge was executing these computationally intensive join operations efficiently.

After thorough investigation, StarRocks was chosen due to its exceptional join performance. It can handle these operations with affordable computational costs, whereas other query engines struggle significantly with the same workload.

To address the problem, we decided to load raw Parquet files from S3 into local tables in StarRocks. By leveraging StarRocks’ colocation mechanism, we could efficiently perform the join operations required for data validation.

StarRocks Architecture

The diagram above illustrates the StarRocks architecture, which can access the following data sources:

Data Lakes on S3: Includes Hudi Lake, Delta Lake, Iceberg Lake, and Paimon Lake.
Native Format on S3: StarRocks allows the creation of tables that persist data directly in S3 using its native format.
Raw Parquet, JSON, or CSV Files on S3: Queries can be executed directly on raw Parquet, JSON, or CSV files stored in S3.

In our specific scenario, we loaded raw Parquet files from S3 storage into StarRocks’ local tables to perform data validation, as highlighted above.

Ingestion Optimization: Enhances data ingestion performance in StarRocks

We had to load nearly 1 trillion rows of data from raw Parquet files into StarRocks local tables. The dataset consisted of hundreds of millions of small Parquet files. Without proper optimization and parallelization, ingesting the entire dataset would have taken several months.

To accelerate the ingestion throughput, we implemented the following optimizations:

Reduce the Number of Replicas (from 3 to 1):
Since this is a one-time validation task, maintaining high availability for production is unnecessary. Reducing the number of replicas significantly decreases the total data volume to be ingested.
Increase Internal Ingestion Parallelism:
As the validation process involves ingestion first, followed by join-based validation, ingestion performance does not affect serving scenarios. We increased parallelism by tuning the following parameters:
- pipeline_dop
- pipeline_sink_dop
Increase the Number of Buckets per Partition:
Given the large data volume, we ensured that each bucket contained no more than 5GB of data. Increasing the number of buckets per partition significantly improves ingestion throughput. Although this may cause compaction to lag behind, it is not a concern in our specific scenario.

These optimizations collectively help to efficiently handle the massive data ingestion process required for our validation workload.

Ingestion

Export

Once StarRocks was set up, we needed to get all of the data ingested from both our RDS export and the transformed archive which we planned on serving from. We decided that it was not cost and time efficient to store all ~1PB in StarRocks, so we decided to just hash all of the non-key columns in the table. We ingested two tables as a part of this process, but our examples will focus primarily on just one, _actionLog.

Initial solution: Simple table with hashed non-key columns

We started with a table schema for each of our tables which looked like this:

CREATE TABLE `_rdsExportActionLog` (
`id` bigint(20) NOT NULL COMMENT "",
`application` varchar(65533) NOT NULL COMMENT "",
`hash_value` varchar(65533) NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`id`, `application`)
DISTRIBUTED BY HASH(`id`, `application`)
ORDER BY(`application`)
PROPERTIES (
"replication_num" = "1",
"colocate_with" = "action_log_group",
"in_memory" = "false",
"enable_persistent_index" = "true",
"replicated_storage" = "true",
"compression" = "ZSTD"
);

And we started to load the data using insert statements like this

INSERT INTO \`exportActionLog\`
WITH LABEL ${label}
(id, application, hash_value)
SELECT id, application, XX_HASH3_64(CONCAT_WS(',',
)) as hash_value
FROM FILES(
"path" = "s3://${bucket}/${folder}*.parquet",
"format" = "parquet",
);

Data distribution and Loading Bottlenecks

However, we found that this loading operation took a really long time, on the order of almost 1 day to load two of our shards. We found that as we added more data and the table got bigger, our ingestion rate slowed further. What was also odd is that if we increased the local parallelization of this ingestion, i.e. ingested multiple shards at once, we didn’t see almost any performance boost, and when we set this value to be more than five, we saw a lot of this:

JobId: 14094
Label: insert_16604b11–7f2d-11ef-888c-46341e0f370e
State: LOADING
Progress: ETL:100%; LOAD:99%
Type: INSERT

You can see here that our load value is 99%, but a number of these large loads would just get stuck at this value and not be able to progress past this state quickly despite getting to 99% quickly. Per the StarRocks documentation, “When all data is loaded into StarRocks, 99% is returned for the LOAD parameter. Then, loaded data starts taking effect in StarRocks. After the data takes effect, 100% is returned for the LOAD parameter.” Evidently we were experiencing some bottlenecks on the data taking effect with our initial solution.

Improvement 1: Increase bucket count

We were distributing our data by id and application, but we had yet to specify the number of buckets to distribute this data into (more info on StarRocks data distribution). Our hypothesis was that as we stored more and more data, these buckets got increasingly larger and more cumbersome which led us to the slow down that we were seeing. We consulted the StarRocks team who suggested that for our data volume, we should look at specifying somewhere on the order of 7200 buckets for the smaller of the two tables, so we changed our schema to look like this:

CREATE TABLE `exportActionLog` (
`id` bigint(20) NOT NULL COMMENT "",
`application` varchar(65533) NOT NULL COMMENT "",
`hash_value` varchar(65533) NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`id`, `application`)
DISTRIBUTED BY HASH(`id`, `application`) BUCKETS 7200
ORDER BY (`application`)
PROPERTIES (
"replication_num" = "1",
"colocate_with" = "action_log_group",
"in_memory" = "false",
"enable_persistent_index" = "true",
"replicated_storage" = "true",
"compression" = "ZSTD"
);

However, while we were able to load this data much more quickly than we were previously, we ran into this memory issue:

message: 'primary key memory usage exceeds the limit. 
tablet_id: 10367, consumption: 126428346066, limit: 125241246351. 
Memory stats of top five tablets: 53331(73M)53763(73M)53715(73M)53667(73M)53619(73M):

Improvement 2: Partition the table by shard ID

We realized that it would make sense to just partition the table by the shardId and try to load it that way. This would allow us to specify a number of buckets for each partition, and they would be stored in a more efficient manner. Using some rough math, we found that:

actionLog => 10TB (Hashed) => 10 * 1024 / 148 shards = 69GB per shard => 
34 buckets to host it => add some buffer, 64 buckets per partition

In total: 64 buckets per partition * 148 shards = 9472 buckets

We figured that using this distribution could also allow us to validate shard by shard which would help to not overwhelm the memory of the cluster. In the end we created this table and adjusted our loading statement to pull the shard ID from the S3 file path.

CREATE TABLE `exportActionLog` (
`id` bigint(20) NOT NULL COMMENT "",
`application` varchar(65533) NOT NULL COMMENT "",
`shard` int(11) NOT NULL COMMENT "",
`hash_value` varchar(65533) NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`id`, `application`, `shard`)
PARTITION BY (`shard`)
DISTRIBUTED BY HASH(`id`, `application`) BUCKETS 64
ORDER BY(`application`)
PROPERTIES (
"replication_num" = "1",
"colocate_with" = "action_log_group_partition_by_shard",
"in_memory" = "false",
"enable_persistent_index" = "true",
"replicated_storage" = "true",
"compression" = "ZSTD"
);

This drastically sped up our ingestion time, LSDA data was successfully loaded into StarRocks from S3 in under 10 hours. The average throughput was approximately 2 billion rows per minute.