Momento

Introducing valkey-lab: Stop Guessing When Your Cache Hits Its Limit

Khawaja Shams — Tue, 26 May 2026 20:14:51 GMT

Pop quiz: how many requests per second can your cache take before it stops meeting your latency SLO?

Chances are good you don’t know the answer, and that’s not a knock on you. It’s a genuinely hard number to come by. The standard tool for this is valkey-benchmark, and it’s great at exactly one thing: you point it at a server, you hammer it with some commands at full speed, and it prints a throughput number at the end. That number tells you the box is alive and roughly how fast it goes flat out.

From a production standpoint, that’s not as useful as it sounds.

How does the p999 hold up under an 80:20 read/write mix when half the requests land on hot keys? What does the tail look like at the rate you actually plan to run? How much headroom do you have before the SLO breaks? Was that latency spike at 10:32 a fluke or the ceiling? A single summary number printed after a sixty-second run can’t answer any of those, because it threw away the useful bits on the way to computing the average.

valkey-lab was built to answer these questions.

It’s a high-performance Valkey and Redis benchmark that uses io_uring for low-overhead asynchronous I/O, per-connection pipelining, and multi-threaded workers. The defaults are deliberately familiar. Run it without any arguments:

valkey-lab

and you get a sixty-second run against localhost:6379 with an 80:20 GET/SET ratio and a million keys. Same shape as the tool you already know, same short flags (-h, -p, -c, -P, -r). The interesting part starts once you begin asking harder questions.

Saturation search: the headroom number, found for you

Here’s how you find your headroom number today, by hand. You run a benchmark at some rate, read the p999, decide it looks healthy, bump the rate, run it again, read it again. You do this five or ten times, squinting at each result, trying to find the rate where the tail skyrockets. Somewhere in that loop you lose track of which run had which config. Eventually you settle on a number you’re “pretty sure” is right and plan capacity around it. valkey-lab does that whole search for you with one command.

valkey-lab saturate --slo-p999 1ms -c 16 -P 32

A synthetic benchmark might say a cache can handle 2M requests per second. But when you add a realistic read/write mix, hot keys, and a warm cache, your p999 suddenly crosses your SLO at 1.2M. Technically speaking, the server is still processing 2M requests per second, but the usable ceiling is significantly lower.

saturate starts issuing requests at whatever is provided in --start-rate (or 1000 if not provided) and multiplies the request rate by the --step factor on every step. The default step is 1.05, so the load compounds over time. Each step holds its rate for a sample window, measures the full percentile spread, and checks it against your SLO. The moment a percentile crosses the line, the ramp stops and reports the last rate that held.
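The ramp described above can be sketched in a few lines of Python. This is a minimal illustration of a geometric rate search, not valkey-lab's actual implementation: the function names, the `max_steps` guard, and the toy latency model are all assumptions made for the example, and `measure_p999_us` stands in for "hold this rate for one sample window and report the p999".

```python
from typing import Callable, Optional


def saturate(measure_p999_us: Callable[[float], float],
             slo_p999_us: float,
             start_rate: float = 1000.0,
             step: float = 1.05,
             max_steps: int = 500) -> Optional[float]:
    """Return the last rate whose p999 held the SLO, or None if the first step failed."""
    last_good = None
    rate = start_rate
    for _ in range(max_steps):  # hypothetical guard so the sketch always terminates
        if measure_p999_us(rate) > slo_p999_us:
            break               # the tail crossed the line: stop ramping
        last_good = rate        # this rate held, remember it
        rate *= step            # compound the load for the next window
    return last_good


# Toy stand-in for a real server (assumption): the tail is flat at 300us
# up to 50k req/s, then climbs steeply with every extra request.
def toy_p999_us(rate: float) -> float:
    return 300.0 if rate <= 50_000 else 300.0 + (rate - 50_000) * 0.5


found = saturate(toy_p999_us, slo_p999_us=1000.0)
print(f"last rate that held the 1ms SLO: {found:,.0f} req/s")
```

Because each step multiplies the rate, the search covers several orders of magnitude in a modest number of windows, at the cost of a resolution of one step factor around the ceiling.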

When a step fails, valkey-lab tells you how it failed, either throughput-limited or latency-exceeded, which helps you tune your clusters more accurately.

Throughput-limited means the server couldn’t sustain the requested rate at all. It topped out below the target. That’s a capacity problem: you need more CPU, more shards, or a different topology.

Latency-exceeded means the server kept up with the rate, but the tail blew past the SLO. The server can sustain the requested rate, but something in the path is introducing tail spikes under load. Could be a hot key, a GC pause, a scheduler stall, network jitter. You fix that by chasing the spike, and adding hardware won’t help.

The right mitigation depends entirely on which failure you hit, and it would be impossible to know which one to pursue if all you had was the throughput number.
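To make the distinction concrete, here is a rough sketch of how a single step could be classified. The `classify_step` function and its `rate_tolerance` parameter are hypothetical names chosen for illustration; the source does not say how valkey-lab decides that a step fell short of its target rate.

```python
def classify_step(target_rate: float,
                  achieved_rate: float,
                  p999_us: float,
                  slo_p999_us: float,
                  rate_tolerance: float = 0.05) -> str:
    """Label one ramp step as 'ok', 'throughput-limited', or 'latency-exceeded'."""
    # The server fell short of the requested rate: a capacity problem.
    if achieved_rate < target_rate * (1.0 - rate_tolerance):
        return "throughput-limited"
    # The rate was sustained, but the tail blew past the SLO.
    if p999_us > slo_p999_us:
        return "latency-exceeded"
    return "ok"


print(classify_step(100_000, 80_000, p999_us=400, slo_p999_us=1000))   # throughput-limited
print(classify_step(100_000, 99_500, p999_us=2500, slo_p999_us=1000))  # latency-exceeded
print(classify_step(100_000, 100_000, p999_us=400, slo_p999_us=1000))  # ok
```

The sketch checks throughput first, so a step that misses both the rate and the SLO is reported as a capacity problem. That ordering is a design choice of this example.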

Averages hide the interesting failures

The next problem surfaces when the benchmark completes. Summary statistics hide the behavior you’re usually trying to find. If your p999 was 312 µs for fifty-nine seconds and 4.2 ms for one second, the run-level p999 still looks fine. The spike is the part you need to focus on.
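A small synthetic example shows how this happens. The numbers below are invented to match the scenario above (10,000 requests per second, one bad second in sixty), and the nearest-rank percentile is just one common definition, not necessarily the one valkey-lab uses.

```python
def p999(samples):
    """Nearest-rank 99.9th percentile of a list of latencies."""
    ordered = sorted(samples)
    rank = (len(ordered) * 999 + 999) // 1000  # ceil(n * 0.999) in integer math
    return ordered[rank - 1]


def one_second(slow_count, slow_us):
    """10,000 requests: most at 200us, `slow_count` of them at `slow_us`."""
    return [200] * (10_000 - slow_count) + [slow_us] * slow_count


# 59 healthy seconds whose p999 is 312us...
seconds = [one_second(11, 312) for _ in range(59)]
# ...and one bad second where 1% of requests take 4.2ms.
seconds.insert(30, one_second(100, 4200))

per_second = [p999(s) for s in seconds]
run_level = p999([latency for s in seconds for latency in s])

print("run-level p999:       ", run_level, "us")        # 312 us
print("worst per-second p999:", max(per_second), "us")  # 4200 us
```

The bad second contributes only 100 slow requests out of 600,000, which is well under the 0.1% that a run-level p999 can see, so the summary reports 312 µs while one full second of users saw 4.2 ms.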

valkey-lab streams one row per second with the full latency spread: every major percentile from p50 to p99.99 plus the max, the error count, and the cache hit rate. A spike that lasts one second appears as one row with a tall tail, a vast improvement over the executive summary at the end of a run. When you need it machine-readable instead, --output json gives you newline-delimited JSON you can pipe straight into something else, and --output quiet collapses the whole run to a single summary line.

Make the benchmark look like your workload

There’s an important gotcha with the saturation number, or any benchmark number. A ceiling is only as good as the load that produced it, and the default load most tools run is unrealistic.

Think about what a stock benchmark actually does. It sends all reads, or close to it, because a 100% GET run posts the biggest number (or it’s the easiest to simulate). It picks keys uniformly at random, so every key is equally cold and nothing is ever hot. It runs flat out, measuring throughput at saturation. And it normally starts against an empty cache. Now think about your production traffic. It’s a read-write mix. It has hot keys, with a small fraction of the keyspace taking most of the requests. And the cache is warm. Every one of those differences takes away from the realism of the benchmark run.

valkey-lab addresses each one of these gaps. Set the real read-write split with -r so you’re measuring the write path your cache actually carries. Turn on --distribution zipf so a small fraction of keys receives most of the traffic, like production systems often do. Uniform access patterns avoid contention and hide the behavior of your actual hot paths.
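To get a feel for how much a Zipf distribution concentrates traffic, the sketch below computes the share of requests that lands on the hottest 1% of keys. The exponent of 1.0 and the keyspace size are assumptions picked for illustration; the source does not state which exponent valkey-lab uses.

```python
def top_share_zipf(num_keys: int, top_fraction: float, exponent: float = 1.0) -> float:
    """Fraction of requests hitting the hottest `top_fraction` of keys under Zipf."""
    # Key at popularity rank r is requested with weight 1 / r**exponent.
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    top_n = max(1, round(num_keys * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


def top_share_uniform(top_fraction: float) -> float:
    """Under uniform access every key is equally likely, so the share equals the fraction."""
    return top_fraction


zipf_share = top_share_zipf(num_keys=100_000, top_fraction=0.01)
uniform_share = top_share_uniform(0.01)

print(f"uniform: hottest 1% of keys get {uniform_share:.0%} of requests")
print(f"zipf:    hottest 1% of keys get {zipf_share:.0%} of requests")
```

Under these assumptions the hottest 1% of keys absorb well over half of all requests, compared with exactly 1% under uniform access. That concentration is what exposes hot-key contention.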

Pin the load with --rate-limit to track latency at the rate you plan to run. And warm the cache with --prefill, or model a read-through cache that fills on miss with --backfill, so a GET benchmark measures hits the way production would.
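The difference between a cold cache, a prefilled one, and a read-through cache that fills on miss can be shown with a toy model. This uses a plain dict in place of a server and is only meant to illustrate the idea behind the two flags, not how valkey-lab implements them.

```python
def run_gets(cache: dict, keys: list, backfill: bool = False) -> float:
    """Issue one GET per key and return the hit rate."""
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
        elif backfill:
            cache[key] = b"value"  # fill on miss, as a read-through cache would
    return hits / len(keys)


keys = [f"key:{i}" for i in range(1000)]

# Empty cache, no fill: every GET is a miss.
cold = run_gets({}, keys)

# Cache populated before the run, in the spirit of --prefill.
prefilled = run_gets({k: b"value" for k in keys}, keys)

# Fill on miss, in the spirit of --backfill: the first pass misses, the second hits.
read_through = {}
first_pass = run_gets(read_through, keys, backfill=True)
second_pass = run_gets(read_through, keys, backfill=True)

print(cold, prefilled, first_pass, second_pass)  # 0.0 1.0 0.0 1.0
```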

valkey-lab --prefill -r 100:0 --distribution zipf -c 16 -P 32

Stack those and the ceiling you measure is a ceiling that meaningfully tracks production. There’s more depth when you need it (warmup tuning, RESP3, pinning workers to cores with --cpu-list, TLS, full TOML configs), but the move that matters is making the four big assumptions match your reality before you trust the number.

Getting the important data from a run

Now that we have realistic benchmark data, we have to make sure it’s useful after the run ends.

--parquet results.parquet saves the full dataset to disk. It stores the full metric set per snapshot: the counters, the gauges, and the latency distributions as actual nanosecond histograms. Combine this with the visualization functionality in valkey-lab, and you have a rich experience that lets you dig into every tiny detail.

valkey-lab --parquet results.parquet
valkey-lab view results.parquet

view opens an interactive dashboard against the file, with a synchronized time axis you can zoom and pan through dimensions like throughput, hit rate, error rate, and latency split out by GET, SET, and combined, all on a log scale. Scrub to the exact second p999 jumped and read every other metric in that same window.

One use case for this is regression testing. Because every run is a Parquet file with the same schema, runs are directly comparable to each other. Benchmark before a Valkey upgrade and after, and the question “did this move my tail latency” is easily answered with a diff. The viewer is one way to read these files, but running your own queries is another easy way to act on changes in performance. DuckDB, pandas, and Polars all read Parquet directly, so a few lines of SQL across a directory of runs is a regression suite for cache performance. Point DuckDB at a folder of recorded runs and let it report total responses and errors per file:

SELECT
  filename,
  max(responses_received) AS total_responses,
  max(request_errors)     AS errors
FROM read_parquet('runs/*.parquet', filename = true)
GROUP BY filename
ORDER BY filename;

That is a before-and-after table for every benchmark you have ever saved, built from data you already recorded.

Another use case for the Parquet output is root cause analysis. A spike on the latency chart tells you when something went wrong, not why. Point view at a Rezolus capture from the server or the client and it overlays system telemetry (CPU utilization, network, scheduler behavior) aligned to the same benchmark timeline. When a p999 spike lines up exactly with a scheduler stall or a network hiccup on the axis above it, you have your answer.

Stop guessing

Back to the pop quiz. The reason it’s so hard to answer is that traditionally the tool you use to measure max RPS reports a summary and throws the important bits away. valkey-lab changes the approach. It remembers the mix, the hot keys, the per-second tail, and records your runs so you can come back to them. The headroom number that used to take an afternoon of manual ramping is now a single command, and it comes with the failure mode attached so you know what to do about it.

valkey-lab is built on top of cachecannon, inheriting its workload generation, saturation search, telemetry collection, and analysis capabilities. It needs Linux for io_uring (kernel 6.0+) and builds with Rust, under your choice of Apache-2.0 or MIT. Here is the whole getting-started path:

cargo install --path . --bin valkey-lab
valkey-lab saturate --slo-p999 1ms

Run that against a Valkey server and see what number comes back. Stop asking “how fast can my cache go” and start asking “how fast can it go before my production workload breaks?” That’s the number you capacity-plan around if you want predictable systems at 3 AM.

Why Snap Was Willing to Fork, and Why They Still Came Back

Allen Helton — Thu, 21 May 2026 19:36:05 GMT

I have no intention of ever forking a database. The amount of bravery and engineering mastery that goes into it scares me to no end. But Snap did. They committed to it so hard that they acquired the company building it, open sourced the entire commercial codebase, and ran 100% of their caching infrastructure on it for years. KeyDB powered Snapchat at a scale most companies can only dream of.

And then they migrated to Valkey anyway.

At Unlocked San Jose, Ovais Khan, Principal Software Engineer at Snap, walked through that migration. As interesting as it was to hear how they did it, it was all the more interesting to hear why. Why it happened, why it wasn’t worth staying on the fork, and why when they came back, they came back to Valkey.

The case for forking in 2019

KeyDB started in 2019 as a project by John Sully and Ben Schermel at EQ Alpha Technology. The premise was simple. Redis ran a single-threaded event loop. Modern servers had 32, 64, 96 cores. To get peak throughput out of a single machine, you had to run a cluster of Redis nodes on it. That was wasteful, and Salvatore Sanfilippo, the creator of Redis, was on record arguing against changing it: “I/O threading is not going to happen in Redis AFAIK, because after much consideration I think it’s a lot of complexity without a good reason.” Simplicity of the codebase was a value he was actively protecting.

KeyDB took the other side of that bet. It added real multithreading, with per-thread event loops and lock-based synchronization on shared state. It also added active-active replication and FLASH storage for cost-efficient large datasets. On the same hardware, it could move several times the operations per second that Redis could.

This is the textbook case for forking. The upstream project had made a deliberate architectural choice. That choice was the right one for them and the wrong one for a certain kind of user (Snap) who needed to push a single node harder. A fork was the only way forward.

By 2021, Snap was running KeyDB across enough of their caching infrastructure to want a permanent stake in it. They acquired the team in May 2022 and brought the formerly commercial KeyDB Pro features into the open source codebase under BSD-3. For about two years after that, all of Snap was running on KeyDB.

What forking buys you

The benefits of forking are easy to articulate when you ship. Snap got features that were important for their specific operating model:

Multithreaded command execution, which let them get more out of every node
Zone-aware read routing, which kept cross-AZ traffic down and cut data transfer costs considerably
Forkless background saves, which made snapshots predictable at high memory
Same-zone replica behavior that reduced timeout blast radius during upgrades

These features weren’t going to make it into Redis on Snap’s timeline. The fork gave them room to build them as soon as they were ready.

As far as forking goes, that’s usually the part written up in blog posts and talked about on the conference circuit. You wanted a feature, the upstream said no, you built it yourself, and now it works. Forking feels like freedom.

What forking costs you

Every change to upstream Redis after the fork point became a decision. Does it get ported over? Rewritten? Skipped? There’s a long tail at the end of whatever decision was made. Porting means you carry merge conflicts forever. Rewriting means you have two implementations of the same idea drifting apart. Skipping means your fork stops being a superset of upstream and starts being something else.

Ovais addressed this specifically in his talk. Snap could not easily move from KeyDB’s Redis 6.2 base to Redis 7.2. The cost of staying current with upstream had become high enough that they were stuck on a flavor of 6.2 while everyone else moved on. That meant they were also stuck without features the broader community had built on top of 7.2.

The same goes for the ecosystem. Every client library, operator, monitoring tool, and benchmark gets tested against upstream first. Your fork either matches upstream behavior closely enough that those tools just work, or it doesn’t, and you start maintaining your own.

While forking might have started off feeling like an accelerator, it quickly became a drag.

The Redis license change

In March 2024, Redis Ltd. changed the Redis license from BSD-3 to a dual SSPL and RSALv2 model. Neither license is OSI-approved. For any company offering Redis as a managed service, this was an immediate problem. AWS, Google Cloud, Oracle, and Ericsson responded by forking the last BSD release, Redis 7.2.4, and donating it to the Linux Foundation. Eight days after the license change, Valkey existed.

Up until then, the case for staying on KeyDB was obvious. The KeyDB team was inside Snap. The codebase was theirs. The performance was what they needed.

But Valkey made them pause. The project had open governance under the Linux Foundation, with a Technical Steering Committee across multiple companies and no single controlling vendor. It was BSD-licensed and would stay that way. Its roadmap included the things Snap had previously forked to get: I/O threading, dual-channel replication, and a path toward features Snap wanted. And every major cloud provider was committing serious engineering effort to it.

The KeyDB story also got more complicated from the inside. In January 2025, John Sully, KeyDB’s original creator, left Snap. His parting note on the KeyDB repository said it plainly:

“When we made KeyDB we wanted to prove that caches should have great performance and I think we succeeded. Now there are many options, including Valkey which is fully open source and based on my testing has matched KeyDB’s performance. I’m not sure what Snap will do with the project, but I think that development effort should move to Valkey moving forward as they have clear momentum and are the most up to date.”

When the person who started the fork tells you the fork is done, the fork is done.

The secret migration back

Snap runs caching at a scale where you can’t just swap a binary. The migration had to be invisible to application teams, comparable in cost, and safe across radically different workload types. Ovais walked through the major decisions that made their migration as easy as possible.

Abstraction layers are key to managing workloads at scale

Snap had built a storage abstraction with a RESP proxy in front of every cluster. Applications never talked to KeyDB directly. They talked to the proxy, which spoke Redis wire protocol back to whatever was running behind it. That layer of indirection made this migration possible. Without it, every application team at Snap would have needed to know about the change. With it, nobody had to.

These layers let them migrate around 30 caches per week. By the time Ovais gave this talk, 70 to 80 percent of workloads were on Valkey.

Do a gap analysis before changing any code

Snap did a feature-by-feature comparison between KeyDB and Valkey before touching anything in production. KeyDB’s multithreading and Valkey’s I/O threading work differently, so they benchmarked carefully to confirm comparable throughput.

Some KeyDB features were blockers and had to be ported to Valkey. Zone awareness was the first one Snap contributed. Replica MOVED behavior during upgrades was another. CPU throttling at high utilization was a third.

A hidden gap that wasn’t found until much later was with MGET. KeyDB supported it across slots, but Valkey does not. So after moving to Valkey, Snap had issues with command parsing pressure in large batching workloads. They quickly ported cross-slot MGET to their internal build, and are working with the core maintainers to get it added upstream.
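For context on why this gap exists: in cluster mode, each key maps to one of 16384 hash slots via CRC16, and a multi-key command such as MGET is only accepted when all of its keys share a slot. The sketch below shows the slot calculation and a client-side grouping of keys by slot. It illustrates the constraint and one common workaround; it is not the server-side change Snap ported, and the function names are made up for the example.

```python
import binascii
from collections import defaultdict


def key_slot(key: str) -> int:
    """Cluster hash slot for a key: CRC16 (XMODEM) of the key, modulo 16384."""
    data = key.encode()
    # Hash tags: if the key contains a non-empty {...} section,
    # only the part between the first '{' and the next '}' is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16-XMODEM variant.
    return binascii.crc_hqx(data, 0) % 16384


def group_by_slot(keys):
    """Split a batch of keys into per-slot groups, one MGET per group."""
    groups = defaultdict(list)
    for key in keys:
        groups[key_slot(key)].append(key)
    return dict(groups)


batch = ["foo", "bar", "{user1000}.following", "{user1000}.followers"]
for slot, keys in sorted(group_by_slot(batch).items()):
    print(slot, keys)
```

A large batch of unrelated keys fans out into many small per-slot commands when split this way, which gives a sense of why losing cross-slot MGET showed up as pressure in batching-heavy workloads.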

Pick a stable version for a base, not a new one

Snap started on Valkey 8.2 RC, ported the features they needed, and immediately ran into crashes at 9 to 10k QPS. The root cause was new TLS offloading work. They rolled back to 8.0.2, ported the necessary fixes onto that, and benchmarked from there. New releases need a baking period, and a migration is the wrong time to find out.

Categorize and prioritize your workloads

Snap divided their caches into three categories: CPU-bound, high-memory, and high-write-rate. Each category needed different validation. CPU-bound workloads were primarily a throughput question. High-memory workloads were really about replication buffer behavior during full syncs, because if the buffer fills before a snapshot completes, you enter a sync loop that never finishes. High-write workloads required tuning replica buffer sizes and primary write throttling, because Valkey’s dual-channel replication puts buffers on replicas rather than primaries. Inside each category, they went lowest-criticality first, highest-criticality last.
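The sync-loop condition for high-memory workloads comes down to simple arithmetic. The sketch below is a deliberately simplified model that only counts writes arriving while the snapshot is produced and ignores transfer and load time; the function names, the headroom factor, and the example numbers are all assumptions for illustration.

```python
def full_sync_fits(buffer_bytes: int, write_bytes_per_sec: int, snapshot_seconds: int) -> bool:
    """True if the writes that arrive during the snapshot fit in the replication buffer."""
    return write_bytes_per_sec * snapshot_seconds <= buffer_bytes


def min_buffer_bytes(write_bytes_per_sec: int, snapshot_seconds: int, headroom: float = 1.5) -> int:
    """Smallest buffer that survives the snapshot, padded by a hypothetical headroom factor."""
    return int(write_bytes_per_sec * snapshot_seconds * headroom)


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed workload: 20 MiB/s of writes and a snapshot that takes 120 seconds,
# so 2400 MiB of writes pile up before the snapshot completes.
print(full_sync_fits(1 * GIB, 20 * MIB, 120))  # False: the buffer overflows and the sync restarts
print(full_sync_fits(4 * GIB, 20 * MIB, 120))  # True: the sync can finish
print(min_buffer_bytes(20 * MIB, 120) / GIB)   # suggested size in GiB with 1.5x headroom
```

If the first check fails, every retry hits the same overflow, which is the never-finishing loop described above.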

Lessons from going full circle

The fork was the right call in 2019. Redis was not going to go multithreaded, and the workloads Snap was running needed it. KeyDB was a solid piece of engineering that pushed the ceiling on what a single Redis-compatible node could do.

The migration back was the right call in 2025 because the conditions that justified the fork had changed. The upstream that resisted features they needed was no longer the upstream they cared about. Valkey’s governance was open. Its roadmap included the work Snap had previously done alone. And every additional year on a Redis 6.2 build was another year of compounding distance from where the ecosystem was going.

Forks are leverage. They are also debt. Be honest with yourself about which one you are accumulating at any given moment. Snap was. They forked when forking gave them speed, and they came back when the fork started to cost more than it earned.

I don’t want you to take away from this that forking is bad. Sometimes it’s the right thing to do. The decision to fork is not permanent, and treating it like it is permanent is how you end up running a five-year-old codebase while your competitors are shipping on a roadmap you helped fund.

When the world moves, move with it.

Happy coding!