Building crackeddb, Yahya Ehsan

two doctors are on call. the rule is that at least one of them has to stay on duty. each doctor independently checks that the other is still on, then takes themselves off. run that concurrently against most databases and you can end up with nobody on call. here it is against two:

target/release/write_skew_demostdout

write skew demo (concurrent)
----------------------------
initial: alice=on, bob=on
T1 (thread A): read both, write alice=off
T2 (thread B): read both, write bob=off
invariant: at least one stays on

crackeddb   T1 ssi_abort=false  T2 ssi_abort=true
crackeddb   alice=off  bob=on   held
rocksdb     T1=ok  T2=ok
rocksdb     alice=off  bob=off  VIOLATED

same scenario, same two transactions, two storage engines. t1 reads both keys and writes alice=off. t2 reads both keys and writes bob=off. the writes touch different keys, the reads overlap. rocksdb commits both and lands on alice=off, bob=off, a state no serial order of the two transactions could produce. crackeddb aborts one and the invariant holds. which one aborts changes between runs because the two threads race. that the invariant holds does not. five of five. source is in crates/bench/src/bin/write_skew_demo.rs.

that rocksdb commits both is not a bug. write skew is a known anomaly under snapshot isolation, and rocksdb's transactiondb does not promise to stop it. its pessimistic locking takes write locks on writes, not on reads, so two transactions that read shared keys and write disjoint ones never conflict in its model. the fix on rocksdb's side is to call get_for_update and take the read locks by hand. that is a real and defensible choice. but the demo shows the part worth sitting with: the anomaly was detectable the whole time. the database declined to detect it and made that the caller's problem.

that decline is the most common move in software. the flaky test you rerun until it passes. the heisenbug closed as cannot reproduce. the shrug that distributed systems are just like that. very little of that is fundamental randomness. most of it is manufactured, information some lower layer had and chose not to surface. crackeddb is a small argument the other way. it detects the anomaly rather than leaving it to the caller, it makes its own bugs reproduce byte for byte from a seed rather than leaving you to chase them, and it pays for both in throughput you can measure. the rest of this post is how it does that, and what the bill comes to.

01 — what it isthe shape, and the one thing it adds

crackeddb is a single node embedded database, roughly the shape of sqlite: it links into your process, it stores key value pairs, there is no server and no network. what it adds to that shape is serializable transactions, the strongest isolation level, enforced by serializable snapshot isolation rather than by locking everything in sight. underneath it is a log structured merge tree, the same family as rocksdb and leveldb. it is written in rust. it is not finished and it is not for production: no replication, no sql, no client library past the rust crate. the interesting part is not what it does. it is the constraints it holds itself to while doing it.

here is the part that is solid, stated plainly so the rest of the post reads as drawing boundaries rather than apologizing. the engine detects serialization anomalies, the write skew above is one, and aborts a transaction to keep them out. that detection rests on a protocol a model checker explored across 24.6 million states without finding a violation. every bug the engine has hit reproduces byte for byte from a seed. the demo at the top is that engine running. everything after this section is either how that was built or where the edges are.

the engine itself is ordinary in its parts: a log structured merge tree for storage, learned indexes over the keys in place of b trees (piecewise linear models following the pgm index), serializable snapshot isolation for transactions. what is less ordinary is the discipline around it, and that is what the rest of the post is about: deterministic simulation so bugs are reproducible, formal specs so the protocols are checkable, and a written record of every decision next to the test that pins it. the throughput numbers come later and they are not flattering. they are not the point. the point is what that discipline costs and what it buys.

02 — the disciplinethree habits that each kill a place a bug can hide

deterministic simulation so failures stop being random, formal specs so the protocol is checked rather than asserted, and a written record so a decision can't quietly rot. three sections, in that order.

the simulator

a flaky test is a test that passes and fails on the same code. the standard response is to rerun it. that response is an admission: you do not know what made it fail and you have decided not to find out. usually you cannot find out because the failure depended on something the test did not control. which thread won a race. what the clock read. which order two writes reached the disk. that state existed at the moment of failure. it was just never captured, so you cannot replay it.

crackeddb captures it. every source of nondeterminism in the engine goes through one trait, Env: the clock, file io, the random number generator, scheduling. no crate above the runtime layer may call std::time, std::fs, std::thread::sleep, or rand directly, and ci fails the build if one does. the trait has two implementations. RealEnv does the real thing in production. SimEnv runs the database single threaded inside a simulator, drives time and scheduling itself, and seeds the rng from one number.

the consequence is that a run is a pure function of its seed. the simulator picks a seed, drives thousands of operations with fault injection, crashes and recovers the database at chosen points, and checks invariants after every step. when an invariant breaks, the seed that broke it replays the identical failure, every time, on any machine. there is no rerunning until it passes. the failure is a value you can hold. and because the schedule is just data, the simulator shrinks it: it drops operations and reruns until it has the smallest schedule that still fails. you do not debug a thousand operation trace, you debug the twelve operations that matter.

“there is no rerunning until it passes. the failure is a value you can hold.”

— on what removing non-determinism actually buys

this is the whole bet of the project stated mechanically. the flakiness was never fundamental. it was information the runtime had and threw away. route that information through one place and keep it, and the heisenbug stops being a heisenbug. it becomes seed 0xDEADBEEF, cycle 185, which is a real bug from this codebase that the four bugs section is about.

the specs

deterministic simulation tells you the code does what it does, reproducibly. it does not tell you that what it does is correct. for that the protocol itself has to be checked, and a test cannot do it, because a test checks the schedules you thought to write. the interesting anomalies live in the schedules you did not.

so the two transaction protocols are specified in tla+, a language for describing what a system is allowed to do, and checked with tlc, a model checker that explores every reachable state of the spec and reports the first one that violates an invariant. specs/MVCC.tla describes begin, read, write, commit, abort, and checks that snapshot isolation holds. specs/SSI.tla describes rw antidependency tracking and the dangerous structure that signals a cycle no serial order could produce, and checks that every schedule which commits is serializable. tlc explored 24,617,893 distinct states of the ssi spec and found no violation. the run took about seven and a half minutes and the log is committed at specs/SSI.check.log, so the number is not something you take on my word. each spec is referenced by name from the rust file that implements it, so the spec and the code that claims to realize it sit one click apart.

Scope

i want to be exact about what this covers, because overclaiming verification is worse than not claiming it. the transaction protocols are checked. the storage engine, the wal, and crash recovery are not specified in tla+. a Storage.tla is on the todo list and not in the repo, and saying otherwise would be a lie the readme does not tell. recovery correctness is covered empirically instead, by an acceptance test that runs a thousand crash and recovery cycles under fault injection and asserts the database comes back consistent every time. that is weaker than a proof and stronger than hope.

the receipts

every architectural decision in crackeddb is written down in DECISIONS.md next to the regression test that pins it. there are 28 of them. the format is boring on purpose: context, the decision, the alternatives rejected and why, the consequences. when a bug is found it gets one too, recording the bug, the fix, and the property the new test checks, so the bug cannot come back unobserved. this is not documentation for its own sake. it is the difference between a system you can reason about and a pile of code that happens to pass tests.

the discipline earned its keep at the least comfortable possible moment, a final audit before publishing. the audit was supposed to confirm the claims. instead it caught several of them being false. a tla+ state count quoted in the docs could not be reproduced from any committed log, so the spec was rerun and the real log committed, which is where the 24.6 million above comes from. a test described as running was actually marked ignore and never executed in ci, so the attribute was removed and it now runs in under seven seconds. the benchmark table reported its abort counts as a sum across three runs while reporting throughput as the median of those same runs, an inconsistency that inflated the headline abort number roughly threefold, so the tables were redone to take every column from a single median run. and two decision records described an integration as shipped that had not shipped. one of those, the most embarrassing, is a bug in its own right and gets the full story in the next section.

“a discipline that only catches other people's mistakes is a marketing claim. one that catches your own, in the document where you were about to tell the world you were careful, is doing its job.”

— the audit that did not survive contact with itself

the honest version is that a verification methodology was pointed at its own paperwork and the paperwork did not survive. i think that is the strongest thing i can say about the project, stronger than any benchmark.

03 — four bugseach one a failing seed, or a false claim

these are the receipts. each one is a real entry in DECISIONS.md, each has a test that fails without the fix, and each is here because of how it was found, not just what it was.

a deleted key came back. (ADR-007, found by the simulator.) during the thousand cycle crash test the simulator reported an invariant break at seed 0xDEADBEEF, cycle 185: key k017 had been deleted, and after recovery a read returned v2968_c184, a value from before the delete. the cause was a type that could not tell two different things apart. the memtable lookup returned Option, and None meant both this key is not here and this key was deleted. a delete writes a tombstone; the tombstone said None; the engine read None as not here and went looking in an older sstable, where it found the stale value the tombstone was supposed to bury. the fix was to make the type carry the distinction it was missing: a LookupResult enum with three cases, Found, Deleted, and NotFound, so the compiler forces every caller to handle a tombstone as its own thing. the whole class of bug is gone, not patched, because the ambiguity that produced it no longer typechecks. the regression test replays the exact seed and cycle.

a transaction could commit in the past. (ADR-023, found by the crash stress test.) the engine keeps two counters. the storage layer has a sequence number for versioning on disk. the transaction manager has its own timestamp for snapshots and conflict detection. in normal running both climb together from one. after a crash they came back out of step: storage recovered its sequence to the highest value in the log, but the transaction manager was built fresh and started again at one. so a transaction could begin at timestamp one, fail to see committed data living at sequence one hundred, and commit at a timestamp earlier than a transaction that had already committed before the crash. time ran backwards across a recovery boundary. the fix seeds the transaction manager from the engine's recovered sequence on open, so the two counters agree about what time it is. the cost was one constructor and two call sites. finding it without a deterministic crash harness would have been a nightmare; with one it was a failing seed.

the recovery decoder could not parse its own log. (ADR-026, found while building it.) the write ahead log holds two kinds of record, the old key value records and the newer transaction records. a key value record begins with an eight byte sequence number that starts at one. a transaction record began with an eight byte transaction id that also starts at one. on recovery the decoder reads the first eight bytes to decide what it is holding, and a transaction with id one and a write at sequence one were byte for byte indistinguishable. misread one as the other and you get lost transactions, corrupted recovery state, silent data loss, the worst kind because nothing announces it. the fix prefixes every transaction record with u64::MAX as a magic number. a sequence counter incrementing by one would need about 584 years at a billion operations a second to reach that value, so it cannot collide with a real sequence. the decoder reads the prefix, sees the impossible number, and knows what it is holding.

Footnote

this one needs an honest footnote, and it is in the adr. the transaction record format is defined, encoded, and decoded correctly, and the collision fix is real and tested. but the commit path does not currently emit those records. it writes batch records instead, through the same fsync, with the same durability and the same recovery guarantees, and the transaction record path sits in the codebase as designed but unwired infrastructure. saying the magic prefix protects the live write path would be an overstatement, so the adr does not. it says exactly this.

an integration was reported done and was not. (ADR-025, found by reading the claim against the code.) a stage of work had closed with four commits that said they integrated the version store with the storage engine. the methods existed, named correctly, with the right signatures. the problem surfaced later while chasing an unrelated double write: the version store was still a pure in memory map. the methods named for writing to the log only mutated a btree in ram. nothing on the transaction path reached disk. a review had confirmed the methods existed and had not confirmed they did anything, which is the difference between checking a surface and checking a system. the redo made the version store actually engine backed. the lesson became a permanent ci gate: a test that commits, reads the log back, and asserts the records are there, which is the one test whose absence let the gap open. it runs on every build now. a method named for persistence that persists nothing passes a review that only reads names.

04 — benchmarksthe unflattering part

now the unflattering part. crackeddb is slower than the mature engines on essentially everything, and the gap is large. before the tables, the one thing that decides whether a comparison is fair: durability. crackeddb does an fsync on every commit. lmdb does too. rocksdb in its default transactiondb config does not, it buffers writes in the os cache and syncs later. sled does not either, it flushes on a 500ms timer. so two of the three competitors are trading per commit durability for speed, and only lmdb is matched to crackeddb on that axis. i point this out at each table because it is the difference between a fair gap and an unfair one.

measured on a hetzner ccx33, eight dedicated epyc cores, 32gb ram, nvme, eight workers, sixty second warmup, two minute measurement, three runs per combination. each row below is the run whose throughput was the median of its three, reported whole, so every cell in a row comes from the same run rather than mixing a median in one column with a sum in another.

tpc-c new order · 1gb

backend	sync / commit	throughput (ops/s)	p50 (us)	p99 (us)	aborts
crackeddb	yes	417	32,479	86,847	4,904
lmdb	yes	20,474	53	69	0
rocksdb	no	42,851	227	671	0
sled	500ms timer	9,642	1,149	1,979	0

read the durability column first. the only durability matched comparison is crackeddb against lmdb, and there crackeddb is about 49 times slower. that gap is not fsync, since both fsync. but it is not a clean measure of isolation overhead either: lmdb is a memory mapped, zero copy b+tree with a single writer, and that read path is structurally faster than any log structured merge tree regardless of isolation level. so the 49x is three things stacked, serializable snapshot isolation, lsm against mmap b+tree, and a maturity gap, and i can only say durability is held constant across it, not the rest. against rocksdb the gap looks like a hundred times, but part of that is rocksdb not syncing per commit, so that number conflates isolation overhead with a durability difference and i am not going to claim it as a clean measurement of either.

the column that explains crackeddb at all is the last one. it is the only engine in the table that aborts anything, and the reasons differ. rocksdb and sled run concurrent writers and do not track read sets, so they never look for the rw antidependency cycles that produce write skew. lmdb is different: it allows only one writer at a time, so two write transactions are physically serialized and that class of conflict cannot form in the first place. it reports zero aborts because it never needs any, by forbidding the concurrency that would create them. crackeddb takes the harder road, concurrent writers with serializability preserved by detection, and the abort column is the bill for that road. the 4,904 is not a failure rate. it is the number of times in this run that the database caught a serializability violation forming and stopped it, the same mechanism that held the doctors invariant at the top, now running under eight workers of contention. the throughput is what that costs per transaction: read set insertion, reader registration, version chain traversal at the snapshot timestamp, dangerous structure check at commit.

at 10gb the shape holds:

tpc-c new order · 10gb

backend	sync / commit	throughput (ops/s)	aborts
crackeddb	yes	557	2,673
rocksdb	no	26,244	0
sled	500ms timer	deadlocked, excluded	n/a

crackeddb's aborts drop from 4,904 to 2,673 because a larger keyspace spreads the same transactions across more warehouses, so per key contention falls. lmdb is missing from this table on purpose. at 10gb the lmdb runs hit MDB_MAP_FULL, which means the map size in our heed adapter was undersized for the dataset, and the three runs ranged from 1.5 to 5,200 ops per second with millions of aborts that are almost certainly a map full retry loop rather than real contention. i do not have a number there i trust, so i am not going to print one. sled is excluded because it deadlocks under sustained concurrent writes, a known upstream issue, spacejam/sled#1134 and #1152, not something our config caused.

ycsb at 1gb, crackeddb against rocksdb. these are throughput medians. rocksdb still defers its sync here, so read the read only row as the clean one and the write rows with that caveat:

ycsb · 1gb · throughput medians (ops/s)

workload	crackeddb	rocksdb	gap
c — 100% reads	130,257	915,064	7×
b — 95% reads	9,861	765,964	78×
a — 50/50	6,199	567,801	92×

workload c is the cleanest read on the design, because nothing commits, so there is no fsync on either side and no durability asymmetry to argue about. the 7x is purely what serializable snapshot isolation costs on a get: read set insertion, reader tracking, version chain traversal. that is the price of detection with the writes taken out. add even a trickle of writes and the gap widens hard, 7x to 78x at five percent writes, because crackeddb's fsync per commit ceiling takes over while rocksdb keeps deferring its sync. so the write side gap is two effects stacked, isolation overhead on top of a durability difference, and i am not going to pretend it is one clean number. group commit, batching commits into shared fsyncs, would close most of the fsync part and none of the isolation part. it is not implemented. i would rather ship the slow honest version with the ceiling labelled than quote a number i have not built.

05 — scopewhat this is not

so that the numbers are read in the right frame, what this is not. not distributed: one node, embedded in your process. not sql: key value with transactions, a query layer is a later problem. not production: no replication, no online backup, no operational tooling, no client beyond the rust crate. not fast: the fsync ceiling is in every write result and the ssi overhead is in every read. there is also a known correctness limit, documented in ADR-028: range scans track their read set, but the protection against phantoms has a gap recorded there rather than papered over. it is an embedded engine that exists to hold a set of constraints honestly, not a thing to put under your application.

06 — closethe whole project, in one move

the whole project fits in one move repeated at every layer. the write skew was detectable, so detect it. the flaky bug was reproducible if you kept the right state, so keep it and replay it from a seed. the protocol was checkable, so check it against twenty four million states instead of asserting it in a comment. the decisions were worth recording, so record them next to the tests that pin them, and when the records turned out to overstate what shipped, fix the records too. none of that is novel and none of it is fast. that is rather the point. correctness here is not a property the database claims. it is a quantity that shows up in the abort column and the ops per second, a bill the design chooses to pay in the open. crackeddb is what that choice looks like when you serialize it into a single node key value store: slower than the alternatives, and able to tell you exactly why, with the receipts in the repo.

repo: github.com/yussypu/database