
2021-09-19

A Dojo for System Design

You can read a hundred articles about eventual consistency, backpressure, and autoscaling, and still freeze the first time a queue actually backs up in production at 2am. The gap isn’t knowledge, it’s reps. And the trouble with practicing distributed systems is that the real components don’t hold still for you — a real message queue, a real autoscaled service, a real DB under load all behave a little differently every run. You can’t isolate the variable you’re trying to learn.

That’s the problem I built Dojo-SDK to solve.

Dojo-SDK logo — a Wing Chun torii gate mark

The name is literal. In a dojo, before you fight a person, you fight a Wing Chun wooden dummy — a fixed object with arms in fixed positions, so every strike lands the same way and you can isolate exactly what your body is doing wrong. Dojo-SDK is that dummy for system design. It gives you simulated versions of the components you’d normally only meet in production — a database, a message queue with a consumer, an auto-scaling microservice, a cron scheduler — wired together through one root object called a Matrix. Everything runs in-process, deterministic, and replayable, so when something breaks you can run it again and get the exact same break.

The Matrix dojo fight — "I know kung fu"

Determinism is the whole point

That determinism is the entire value proposition, not a footnote. A real message broker under load will drop, reorder, or retry messages nondeterministically, depending on network jitter and GC pauses you don’t control. Retries and redelivery are what keep production resilient, but the nondeterminism is terrible for learning, because you can’t tell whether your consumer logic is wrong or you just got an unlucky run. Dojo-SDK’s queue and autoscaler are simulated on purpose: they follow rules you can read, so a failure you provoke today is the same failure you’ll provoke tomorrow. You get to unit-test your understanding of backpressure the same way you’d unit-test a function.
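To make "unit-test your understanding of backpressure" concrete, here is a minimal sketch of a bounded queue driven by fixed ticks. It is a hypothetical stand-in written for this post, not Dojo-SDK's queue: the names `runBackpressureScenario` and `Outcome`, and the tick model itself, are assumptions for illustration.

```typescript
// Hypothetical illustration, NOT the Dojo-SDK API.
// A bounded queue advanced by explicit ticks: no timers, no I/O, no randomness.
type Outcome = { accepted: number[]; rejected: number[]; consumed: number[] };

function runBackpressureScenario(
  capacity: number,
  bursts: number[][],      // items the producer tries to enqueue on each tick
  consumePerTick: number,  // how many items the consumer drains per tick
): Outcome {
  const queue: number[] = [];
  const out: Outcome = { accepted: [], rejected: [], consumed: [] };

  for (const burst of bursts) {
    // Producer phase: a full queue rejects the item (this is the backpressure signal).
    for (const item of burst) {
      if (queue.length < capacity) {
        queue.push(item);
        out.accepted.push(item);
      } else {
        out.rejected.push(item);
      }
    }
    // Consumer phase: drain up to consumePerTick items in FIFO order.
    for (let i = 0; i < consumePerTick && queue.length > 0; i++) {
      out.consumed.push(queue.shift() as number);
    }
  }
  return out;
}

// The same inputs always give the same outcome, so the overflow is reproducible.
const firstRun = runBackpressureScenario(2, [[1, 2, 3], [4, 5], [6]], 1);
const secondRun = runBackpressureScenario(2, [[1, 2, 3], [4, 5], [6]], 1);
console.log(firstRun.rejected); // [ 3, 5 ]
console.log(JSON.stringify(firstRun) === JSON.stringify(secondRun)); // true
```

With a capacity of 2 and a consumer that drains one item per tick, items 3 and 5 are rejected on every run. Change the capacity or the consume rate and you change which items bounce, which is the variable you were trying to isolate.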

I didn’t invent this idea — I borrowed it from people who bet their databases on it. FoundationDB famously built their entire correctness story on deterministic simulation; as their docs put it:

Determinism is crucial in that it allows perfect repeatability of a simulated run, facilitating controlled experiments to home in on issues.

TigerBeetle took the same approach with their VOPR simulator, replacing every source of indeterminism — disk, network, clock — so an entire cluster runs on one core and any bug reproduces from a seed. Antithesis turned it into a product built on “a custom hypervisor that lets us rewind time, try alternate execution paths, and perfectly reproduce every bug.” Dojo-SDK is the toy version of that idea, scaled down from “prove my database is correct” to “help me feel how these pieces move.”
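The "reproduces from a seed" trick is simple enough to sketch in a few lines. What follows is not how FoundationDB, TigerBeetle, or Antithesis implement it, and it is not Dojo-SDK code either. It is a toy, assuming a small seeded PRNG (the widely circulated mulberry32 routine) as the only source of randomness, with hypothetical names `mulberry32` and `simulateDelivery`.

```typescript
// Toy illustration of seed-driven simulation; all names here are hypothetical.
// A small seeded PRNG (mulberry32): the same seed yields the same number stream.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// A fake "network" whose every decision comes from the seeded PRNG,
// never from Math.random(), the wall clock, or real sockets.
function simulateDelivery(seed: number, messages: string[], dropRate: number): string[] {
  const rand = mulberry32(seed);
  const trace: string[] = [];
  for (const msg of messages) {
    trace.push(rand() < dropRate ? `drop ${msg}` : `deliver ${msg}`);
  }
  return trace;
}

const messages = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"];
const traceA = simulateDelivery(42, messages, 0.3);
const traceB = simulateDelivery(42, messages, 0.3);

// Same seed, same trace: a "bad run" can be replayed as many times as you like.
console.log(JSON.stringify(traceA) === JSON.stringify(traceB)); // true
```

The design choice that matters is that randomness is injected rather than ambient. Once every unpredictable decision flows through one seeded source, a failing run is just a number you can write down and replay.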

What it looks like

Here’s roughly what setting up an environment looks like:

const matrix = new Matrix();

// spawn a DB with disk persistency (edit the JSON file live, watch it react)
await matrix.addDB(new DiskPersistencyManager('./.tmp/db.json', true), {
  col: { '618230709af3ade104bee1ff': { a: 100, _id: '618230709af3ade104bee1ff' } },
});

// spawn a message queue + consumer
await matrix.addMQ('queue1', { treat: (item) => console.log('consuming', item) });

// spawn a scheduler ticking every 5s
matrix.addScheduler('*/5 * * * * *', () => log.i('tick'), SchedulerTypes.Recurring);

// spawn a micro-service that autoscales 1 -> 10 instances under load
await matrix.addService('/my-resource', () => new (class extends BaseService {
  async handle(req, res) {
    res.type = ResponseTypes.OK;
    res.body = `You got it!`;
    return res;
  }
})(), 1, 10);

await matrix.request(new RequestX('/my-resource', RequestMethods.GET));

Notice what’s not there: no Docker Compose file, no three terminal windows, no Kafka cluster you have to remember to tear down. One process, one object graph, and every piece — DB, queue, service, scheduler — is a small, readable implementation you can step through in a debugger. The DB persists to a local JSON file that updates in real time, so you can literally hand-edit a record mid-run and watch your service react to it. That’s not something you get to do against a real Postgres instance without a lot more ceremony. It even runs in the browser if you want to poke at it without cloning anything.

The actual practicing happens through a companion repo, Dojo-Recipes: a set of system-design challenges you implement against the SDK. The pattern is simple. Here’s a scenario (rate limiting, a producer/consumer pipeline, a service that needs to scale under bursty traffic), here’s the simulated environment, go build it. Because the simulation is deterministic, you can write a test that says “this exact sequence of events should produce this exact outcome” and actually trust a green checkmark. It’s the same reason people practice on LeetCode instead of only ever interviewing: you want a controlled arena before the real one.
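Here is roughly what an "exact sequence, exact outcome" test can look like for the rate-limiting scenario. This is a sketch of the idea, not a recipe from Dojo-Recipes: `FakeClock` and `TokenBucket` are hypothetical names, and the token-bucket algorithm is my choice for the example rather than anything the SDK prescribes.

```typescript
// Hypothetical sketch; neither class comes from Dojo-SDK or Dojo-Recipes.
// A clock that only moves when the test tells it to.
class FakeClock {
  private nowMs = 0;
  now(): number {
    return this.nowMs;
  }
  advance(ms: number): void {
    this.nowMs += ms;
  }
}

// Token bucket: holds up to `capacity` tokens, refilled at `refillPerSecond`.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly clock: FakeClock;

  constructor(capacity: number, refillPerSecond: number, clock: FakeClock) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.clock = clock;
    this.tokens = capacity;
    this.lastRefill = clock.now();
  }

  tryAcquire(): boolean {
    // Refill based on simulated elapsed time, capped at capacity.
    const now = this.clock.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const clock = new FakeClock();
const bucket = new TokenBucket(2, 1, clock); // burst of 2, refills 1 token per second
const results: boolean[] = [];

results.push(bucket.tryAcquire()); // t=0s: allowed (1 token left)
results.push(bucket.tryAcquire()); // t=0s: allowed (0 tokens left)
results.push(bucket.tryAcquire()); // t=0s: rejected
clock.advance(1000);
results.push(bucket.tryAcquire()); // t=1s: one token refilled, allowed
results.push(bucket.tryAcquire()); // t=1s: rejected

console.log(results); // [ true, true, false, true, false ]
```

Because time only advances when the test calls `advance`, the third and fifth requests are rejected on every run. Against a real clock, the same test would depend on how fast the machine happened to execute it.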

What it isn’t

I’ll be straight about the limits. It’s not a production framework, and it’s not something you’d point real traffic at — the last commits are from 2022, and it never grew the StreamProcessor (a Kafka-like event-stream simulator) or the SQL-flavored DB simulation, both of which were on the roadmap and just… didn’t happen. The DB is NoSQL in shape. It’s a conceptual teaching tool, closer to a kata than a library. And the determinism here is a design convention, not the adversarial fault-injection engine that FoundationDB or TigerBeetle run — Dojo won’t hunt for the pathological interleaving that breaks you; it just won’t move the target while you’re learning.

If you want to learn how autoscaling decisions get made, or what a consumer has to handle when a queue redelivers a message, this gets you there faster than reading about it, faster than reading source for a real broker, and without infrastructure you have to babysit. It doesn’t replace fighting the real thing. It just makes sure that by the time you do, you’ve already taken the hit once.

Which raises the question I still don’t have a clean answer to: when you’re teaching yourself a distributed-systems concept, do you learn more from a deterministic dummy that always hits the same way — or from the messy real thing that surprises you? Where’s the line?
