
The Compiler: Replacing the Interpreter — What We Gained and What It Cost

We recently wrote about squeezing 10× out of the runtime engine through profiling-driven micro-optimizations. Sync fast paths, pre-computed keys, batched materialization — careful work that compounded to a 10× throughput gain on array-heavy workloads.

Then we asked: what if we removed the engine entirely?

Not removed as in “deleted the code.” Removed as in “generated JavaScript so direct that the engine isn’t needed at runtime.” An AOT compiler that takes a .bridge file and emits a standalone async function — no ExecutionTree, no state maps, no wire resolution, no shadow trees. Just await tools["api"](input), direct variable references, and native .map().

The result: a median 5× speedup on top of the already-optimized runtime. One extreme array benchmark hit 13×. Most workloads land between 2–7×. Simple bridges — passthrough, single tool chains — show no improvement at all, and a couple are actually slightly slower.

This is the honest story of building that compiler: the architectural bets, what paid off, what didn’t, and what 2,900 lines of code generation buys you in practice.

The Insight: Per-Request Overhead is the Enemy


After the first optimization round, we profiled the runtime in its optimized state and asked: what is it actually doing per request? Not “what is slow?” — “what exists at all?”

Here’s what the ExecutionTree does for every request, even after all our optimizations:

  1. Trunk key computation — concatenates strings to build "module:type:field:instance" keys, then uses them as Map lookup keys. For a bridge with 5 tools and 15 wires, that’s ~20 string allocations per request.

  2. Wire resolution — for each output field, scans the wire array comparing trunk keys. Our sameTrunk() is allocation-free and fast for small N, but it still runs per field, per request.

  3. State map reads/writes — every resolved value goes into Record<string, any>, and every downstream reference reads from it. That’s hash map get/set for what is fundamentally a local variable assignment.

  4. Topological ordering — the pull-based model means dependency order is discovered implicitly through recursive pullSingle() calls. Beautiful semantically, but it means the engine is re-discovering the execution plan on every request.

  5. Shadow tree creation — for a 1,000-element array, the engine creates 1,000 lightweight clones via Object.create(), each with its own state map.

None of these are bugs. None of them are slow in isolation. But together, they add up to a fixed per-request overhead that scales with bridge complexity — and that overhead is fundamentally architectural. You can’t optimize it away with better code. You can only remove it by not having an interpreter.
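To make the contrast concrete, here is a hedged sketch of what a state-map read costs versus a plain variable. The names (`interpreterStyleRead`, `state`) are illustrative, not the engine's real API:

```javascript
// Illustrative sketch only — trunkKey-style names are invented, not the real engine API.
// The interpreter's per-request pattern: build a string key, then do hash-map lookups.
function interpreterStyleRead(state, module, type, field, instance) {
  const key = `${module}:${type}:${field}:${instance}`; // fresh string allocation
  return state.get(key); // hash + compare on every access
}

const state = new Map();
state.set("_:Query:simple:1", 42);

// What the compiler emits instead: the value is just a local variable.
const _t1 = 42; // a register read in optimized code — no key, no map

console.log(interpreterStyleRead(state, "_", "Query", "simple", 1) === _t1); // true
```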

The compiler (@stackables/bridge-compiler) takes a parsed BridgeDocument and an operation name, and generates a standalone async JavaScript function. It’s a drop-in replacement:

// Before: run through the interpreter
import { executeBridge } from "@stackables/bridge-core";

// After: swap the import — same signature, compiled execution
import { executeBridge } from "@stackables/bridge-compiler";

Same API. Same result shape. The first call compiles the bridge into a new AsyncFunction(...) and caches it in a WeakMap<BridgeDocument, Map<string, Function>>. Subsequent calls hit the cache — zero compilation overhead.

A bridge like this:

bridge Query.catalog {
  with api as src
  with std.str.toUpperCase as upper
  with output as o

  o.title <- src.title
  o.entries <- src.items[] as it {
    .id <- it.id
    .label <- upper:it.name
    .active <- it.status == "active" ? true : false
  }
}

Compiles to this:

// Generated by @stackables/bridge-compiler
export default async function Query_catalog(input, tools, context, __opts) {
  const __BridgePanicError =
    __opts?.__BridgePanicError ??
    class extends Error {
      constructor(m) {
        super(m);
        this.name = "BridgePanicError";
      }
    };
  const __BridgeAbortError =
    __opts?.__BridgeAbortError ??
    class extends Error {
      constructor(m) {
        super(m ?? "Execution aborted by external signal");
        this.name = "BridgeAbortError";
      }
    };
  const __signal = __opts?.signal;
  const __timeoutMs = __opts?.toolTimeoutMs ?? 0;
  const __ctx = { logger: __opts?.logger ?? {}, signal: __signal };
  const __trace = __opts?.__trace;
  async function __call(fn, input, toolName) {
    if (__signal?.aborted) throw new __BridgeAbortError();
    const start = __trace ? performance.now() : 0;
    try {
      const p = fn(input, __ctx);
      let result;
      if (__timeoutMs > 0) {
        let t;
        const timeout = new Promise((_, rej) => {
          t = setTimeout(() => rej(new Error("Tool timeout")), __timeoutMs);
        });
        try {
          result = await Promise.race([p, timeout]);
        } finally {
          clearTimeout(t);
        }
      } else {
        result = await p;
      }
      if (__trace)
        __trace(toolName, start, performance.now(), input, result, null);
      return result;
    } catch (err) {
      if (__trace)
        __trace(toolName, start, performance.now(), input, null, err);
      throw err;
    }
  }
  const _t1 = await __call(tools["api"], {}, "api");
  return {
    title: _t1["title"],
    entries: await (async () => {
      const _src = _t1["items"];
      if (_src == null) return null;
      const _r = [];
      for (const _el0 of _src) {
        const _el_0 = await __call(
          tools["std.str.toUpperCase"],
          { in: _el0?.["name"] },
          "std.str.toUpperCase",
        );
        _r.push({
          id: _el0?.["id"],
          label: _el_0,
          active: _el0?.["status"] === "active" ? true : false,
        });
      }
      return _r;
    })(),
  };
}

Yes, it’s ugly. That’s the point. Nobody reads this code — V8 does. The __call wrapper handles abort signals, tool timeouts, and OpenTelemetry tracing. The error class preamble supports panic and throw control flow. Every tool goes through __call even when the tool function is synchronous (like toUpperCase), because the compiler currently treats all tool calls uniformly as async.

Look past the scaffolding and the interesting part is the body: _t1 is the API call, the for...of loop replaces the runtime’s per-element shadow trees, the comparison inlines to ===, and the pipe becomes a per-element __call. No engine, no state map, no wire resolution.

Building the compiler required making decisions about what to generate. Each decision was a bet on where the performance would come from.

1. Topological sort at compile time

The runtime engine discovers dependency order lazily through recursive pullSingle() calls. The compiler pre-sorts tool calls using Kahn’s algorithm at compile time — a single topological sort over the dependency graph — and emits tool calls in the resolved order.

This means the generated code is a flat sequence of const _t1 = await ...; const _t2 = await ...; — no recursion, no scheduling, no dependency discovery at runtime. V8 loves straight-line code.
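Kahn's algorithm itself is short. A minimal sketch of the compile-time sort — the graph shape and node names here are invented for illustration, not taken from a real bridge:

```javascript
// Minimal Kahn's algorithm. edges[a] lists the tools that depend on tool a.
function topoSort(nodes, edges) {
  const indegree = new Map(nodes.map((n) => [n, 0]));
  for (const deps of Object.values(edges))
    for (const d of deps) indegree.set(d, indegree.get(d) + 1);

  // Start from tools with no dependencies, peel the graph layer by layer.
  const queue = nodes.filter((n) => indegree.get(n) === 0);
  const order = [];
  while (queue.length) {
    const n = queue.shift();
    order.push(n);
    for (const d of edges[n] ?? []) {
      indegree.set(d, indegree.get(d) - 1);
      if (indegree.get(d) === 0) queue.push(d);
    }
  }
  if (order.length !== nodes.length) throw new Error("cycle in dependency graph");
  return order;
}

// "api" feeds both "upper" and "score"; "merge" needs both.
const order = topoSort(
  ["api", "upper", "score", "merge"],
  { api: ["upper", "score"], upper: ["merge"], score: ["merge"] },
);
// order → ["api", "upper", "score", "merge"]
```

The compiler runs this once per operation at compile time, then emits `const _t1 = ...`, `const _t2 = ...` in the resolved order.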

2. Direct variable references instead of state maps


The runtime stores all resolved values in a Record<string, any> state map, keyed by trunk keys like "_:Query:simple:1". Every read is a hash map lookup. Every write is a hash map insertion.

The compiler replaces this with local variables: _t1, _t2, _t3. A variable access in optimized JavaScript is a register read — effectively zero cost. No hashing, no collision chains, no string comparison.

3. Native .map() instead of per-element shadow trees

This was the biggest architectural bet. The runtime creates a shadow tree per array element — a lightweight clone via Object.create() that inherits the parent’s state. For 1,000 elements, that’s 1,000 shadow trees, each with its own state map, each resolving element wires independently.

The compiler replaces this with a single .map() call:

(source?.["items"] ?? []).map((_el0) => ({
  id: _el0?.["item_id"],
  label: _el0?.["item_name"],
}));

No object allocation per element. No state map per element. Just a function call that returns an object literal. V8 can inline this, eliminate the closure allocation, and vectorize the field accesses.

4. Inlined operators instead of tool dispatch

The Bridge language has built-in operators for arithmetic (+, -, *, /), comparisons (==, >=, <), and string operations. In the runtime, these are implemented as tool functions in an internal tool registry, dispatched through the same callTool() path as external tools.

The compiler inlines them as native JavaScript operators:

// Runtime: goes through tool dispatch, state map, wire resolution
// Compiled: emitted as a direct expression
const _t1 = Number(input?.["price"]) * Number(input?.["qty"]);

This is where the 5× speedup on math expressions comes from. The runtime pays the full tool-call overhead (build input object, dispatch, extract output) for what is fundamentally a * b.

5. Direct property access instead of wire resolution


In the runtime, accessing src.items.name means recursive pullSingle() calls — each path segment goes through wire resolution, state map lookups, and dependency tracking. The compiler replaces this with direct JavaScript property access. Bridge’s catch and ?. operators still compile to real JavaScript in the generated code — catch becomes an actual try/catch block, and ?. maps to native optional chaining.

6. await per tool, not isPromise() per wire


The first optimization round introduced MaybePromise<T> to avoid unnecessary await on already-resolved values. This was a big win for the runtime because most values are synchronous between tools.

The compiler takes a simpler approach: it just uses await on every tool call and does nothing special for synchronous intermediate values (which are just variable references). This is actually faster because:

  • Tool calls genuinely return promises (they call external functions)
  • Between tools, all access is synchronous variable reads with no await at all
  • V8’s await on an already-resolved promise is fast (~200ns), but the compiler doesn’t even hit that path for intermediate values

We built a side-by-side benchmark suite that runs identical bridge documents through the runtime interpreter and the compiler, measuring throughput after compile-once / parse-once setup:

| Benchmark | Runtime (ops/sec) | Compiled (ops/sec) | Speedup |
| --- | --- | --- | --- |
| passthrough (no tools) | 702K | 652K | 0.9× |
| simple chain (1 tool) | 539K | 603K | 1.1× |
| 3-tool fan-out | 204K | 516K | 2.5× |
| short-circuit (overdefinition) | 726K | 630K | 0.9× |
| fallback chains (?? / \|\|) | 302K | 524K | 1.7× |
| math expressions | 121K | 638K | 5.3× |
| flat array 10 | 162K | 452K | 2.8× |
| flat array 100 | 25K | 182K | 7.3× |
| flat array 1,000 | 2.6K | 26.8K | 10.1× |
| nested array 5×5 | 45K | 230K | 5.1× |
| nested array 10×10 | 16K | 103K | 6.3× |
| nested array 20×10 | 8.3K | 55.5K | 6.7× |
| array + tool-per-element 10 | 39K | 285K | 7.2× |
| array + tool-per-element 100 | 4.4K | 57K | 13.0× |

Median speedup: 5.3×. The range is 0.9× to 13.0×, with the highest gains on array-heavy workloads where the runtime’s per-element shadow tree overhead dominates.

The pattern is nuanced. Simple bridges — passthrough, single chains, overdefinition short-circuits — show no gain or even a slight regression. The compiler’s setup overhead (function preamble, std scaffolding) costs more than the interpreter overhead it eliminates. You need a bridge that actually does work — array mapping, multiple tool calls, math expressions — before the compiler starts winning.

The sweet spot is mid-complexity: 3+ tools with some array work, where you get a reliable 3–7× improvement. The double-digit numbers (10–13×) only appear on extreme array workloads with 100+ elements, which is real but not the common case.

These numbers are on top of the runtime’s already 10× optimized state. Compared to the original unoptimized engine, the compiled path is roughly 100× faster on array workloads — but the last 5× cost significantly more engineering effort than the first 10×.

The array + tool-per-element benchmark (13× at 100 elements) is the best case — and it’s worth understanding why it’s an outlier, not the norm:

  1. The runtime creates 100 shadow trees via Object.create(), each with its own state map
  2. Each shadow tree resolves element wires, schedules the per-element tool call, builds input, calls the function, extracts output, stores in state map
  3. 100 elements × full resolution overhead per element

The compiler emits a single await Promise.all(items.map(async (el) => { ... })) with direct variable references. No shadow trees, no state maps, no wire resolution — just function calls and object literals. The overhead scales with the number of elements in the runtime, but stays constant in the compiled version.

Math expressions (5.3×) show a similar pattern — the compiler inlines Number(input?.["price"]) * Number(input?.["qty"]) instead of round-tripping through the internal tool dispatch.

But look at the other end of the table: passthrough and short-circuit bridges are slower with the compiler (0.9×). The compiled function has a fixed preamble — importing std tools, setting up the call wrapper — that the runtime doesn’t pay because it resolves lazily. For bridges that barely use the engine, that preamble is pure overhead.

The compiler has full feature parity with the runtime — same API, same semantics, same results. But there’s one environmental constraint:

  • new Function() required. The compiler evaluates generated code via new AsyncFunction(...), which means it doesn’t work in environments that disallow eval — like Cloudflare Workers or Deno Deploy with default CSP. The runtime works everywhere.

1. Interpreters have a floor; compilers have a different floor


No matter how much we optimized the ExecutionTree, it had a structural minimum cost per request: create a context, resolve wires, manage state. The compiler eliminates that floor — but introduces its own: function preamble, std tool bundling, call wrapper setup. The runtime’s floor scales with bridge complexity; the compiler’s floor is roughly constant.

This means the compiler only wins when the bridge does enough work to amortize its fixed overhead. For passthrough bridges, the runtime is actually faster. The crossover point is around 2–3 tool calls — which, fortunately, is where most real bridges live.

2. Compile once, run many is the right caching model


The WeakMap<BridgeDocument, Map<string, Function>> cache means compilation happens exactly once per document lifetime. The WeakMap key on the document object means:

  • No cache invalidation logic needed
  • Garbage collected when the document is released
  • Zero overhead on the hot path (it’s a Map lookup)

We worried about new AsyncFunction() being slow — and it is, relatively (~0.5ms per compilation). But it happens once. For a production service handling thousands of requests per second, that 0.5ms is amortized to essentially zero.
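A hedged sketch of that caching model — compileOperation here is a hypothetical stand-in that returns a trivial function, since the real codegen is the ~2,900-line module described below:

```javascript
// Sketch of the compile-once cache. compileOperation is an invented stand-in;
// the real compiler emits full JavaScript source from the bridge AST.
const AsyncFunction = Object.getPrototypeOf(async function () {}).constructor;

const cache = new WeakMap(); // BridgeDocument -> Map<operationName, Function>

function compileOperation(doc, operation) {
  // Real codegen builds hundreds of lines; here we emit a one-liner.
  return new AsyncFunction("input", "tools", "context", "__opts", "return input;");
}

function getCompiled(doc, operation) {
  let ops = cache.get(doc);
  if (!ops) cache.set(doc, (ops = new Map()));
  let fn = ops.get(operation);
  if (!fn) ops.set(operation, (fn = compileOperation(doc, operation)));
  return fn; // hot path: one WeakMap lookup + one Map lookup, zero compilation
}

const doc = { name: "example" }; // stands in for a parsed BridgeDocument
const first = getCompiled(doc, "Query.catalog");
const second = getCompiled(doc, "Query.catalog");
// first === second — compiled exactly once per document lifetime
```

When `doc` is garbage collected, the WeakMap entry goes with it, which is what makes explicit invalidation unnecessary.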

3. Code generation is simple; feature parity is not


The codegen module is ~2,900 lines. It doesn’t use a code generation framework, templates, or an IR. It builds JavaScript strings directly:

lines.push(
  `  const ${tool.varName} = await __call(tools[${JSON.stringify(tool.toolName)}], ${inputObj});`,
);

String concatenation producing JavaScript source code. It’s not elegant, but it’s correct, testable, and easy to debug — you can console.log(code) and read what it produces.
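As a toy demonstration of the technique — this is not the real emitter, and the tool shapes are invented — string concatenation plus AsyncFunction is the entire pipeline:

```javascript
// Toy codegen: emit source as strings, then evaluate it.
const AsyncFunction = Object.getPrototypeOf(async function () {}).constructor;

function emit(toolName) {
  const lines = [];
  lines.push(`const _t1 = await __call(tools[${JSON.stringify(toolName)}], input);`);
  lines.push(`return { value: _t1 };`);
  return lines.join("\n");
}

const code = emit("api");
// You can console.log(code) and read exactly what will run.
const fn = new AsyncFunction("input", "tools", "__call", code);

const tools = { api: async (input) => input.x * 2 };
const __call = (f, input) => f(input);

fn({ x: 21 }, tools, __call).then((r) => {
  // r.value === 42
});
```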

The topological sort, ToolDef resolution, and wire-to-expression conversion are all straightforward tree walks over the existing AST. We didn’t need to invent new data structures — the AST already contains everything the compiler needs.

But 2,900 lines is a lot of code for a median 5× speedup. Each language feature — ToolDef extends chains, overdefinition bypass, scoped define blocks, break/continue in iterators, OTel tracing, prototype pollution guards, ToolDef-level dependency resolution — added another 50–200 lines of code generation, each with its own edge cases. The first 80% of feature coverage was fun; the last 20% was grind.

4. One test suite, two execution paths

The 1k shared tests are the single most important artifact. A forEachEngine() dual-runner wraps every language test suite and runs it against both execution paths:

forEachEngine("my feature", (run, ctx) => {
  test("basic case", async () => {
    const { data } = await run(bridgeText, "Query.test", input, tools);
    assert.deepStrictEqual(data, expected);
  });
});

When we added a new feature to the compiler, we didn’t have to guess if it matched the runtime — the test told us. When we found a runtime bug through compiler testing, we fixed it in both places simultaneously.
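The real forEachEngine lives in the bridge test suite; a plausible minimal shape of such a dual-runner — with fake engines standing in for the interpreter and the compiler — looks like this:

```javascript
// Hypothetical dual-runner sketch. The engines object here is invented;
// in the real suite the two entries would be the interpreter and the compiler.
function forEachEngine(name, suite, engines) {
  for (const [engineName, execute] of Object.entries(engines)) {
    const run = (bridgeText, operation, input, tools) =>
      execute(bridgeText, operation, input, tools);
    suite(run, { engine: engineName, suiteName: `${name} [${engineName}]` });
  }
}

// Fake "engines": both just echo the input, so the suite body runs twice.
const results = [];
forEachEngine(
  "identity",
  (run, ctx) => {
    results.push(ctx.engine);
  },
  {
    runtime: async (_text, _op, input) => ({ data: input }),
    compiled: async (_text, _op, input) => ({ data: input }),
  },
);
// results → ["runtime", "compiled"]
```

The key property is that the suite body is written once and has no idea which engine it is exercising; any divergence between the two paths fails the same test twice.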

5. LLMs are surprisingly good at code generation… for code generators


An LLM helped write much of the initial codegen — emitting JavaScript from AST nodes is the kind of repetitive, pattern-based work where LLMs excel. The human added the architectural decisions (topological sort, caching model, internal tool inlining) and the LLM filled in the wire-to-expression conversion, fallback chain emission, and array mapping code generation.

The feedback loop was fast: write a test, ask the LLM to make it pass, check the generated JavaScript looks right, run the full suite. We went from “proof of concept that handles pull wires” to “984 tests passing with zero skips” in a series of focused sessions.

6. The compiler lost some runtime optimizations

Section titled “6. The compiler lost some runtime optimizations”

Moving from interpreter to compiler isn’t a pure win. The runtime had optimizations that the compiler’s uniform code generation doesn’t replicate yet.

The most obvious: sync tool detection. The runtime’s MaybePromise<T> path avoids await on tools that return synchronous values — like std.str.toUpperCase, which is a pure function returning a string. The compiled code wraps every tool call in await __call(...), paying async overhead even for a function that never touches a promise. For array workloads with per-element sync tools, this is measurable.

The __call wrapper itself adds overhead: abort signal check, tracing timestamps, timeout Promise.race. The runtime’s hot-path skips most of this for internal tools. The compiler runs every tool through the full wrapper.

These are solvable — the compiler can learn to detect sync tools at compile time, skip the abort check when no signal is provided, inline the call for tools that don’t need tracing. But they’re reminders that a rewrite-from-scratch always re-loses optimizations that accumulated in the old system.
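None of these improvements exist yet. As one hypothetical example of what compile-time sync detection could look like, the emitter could branch on an allowlist of known-synchronous tools (syncTools and emitToolCall below are invented names):

```javascript
// Hypothetical future codegen branch — syncTools is an invented allowlist.
const syncTools = new Set(["std.str.toUpperCase"]);

function emitToolCall(varName, toolName, inputExpr) {
  if (syncTools.has(toolName)) {
    // Sync fast path: direct call, no await, no __call wrapper.
    return `const ${varName} = tools[${JSON.stringify(toolName)}](${inputExpr}, __ctx);`;
  }
  // Default path: full async wrapper with abort/timeout/tracing.
  return `const ${varName} = await __call(tools[${JSON.stringify(toolName)}], ${inputExpr}, ${JSON.stringify(toolName)});`;
}

const syncLine = emitToolCall("_t1", "std.str.toUpperCase", "{ in: _el0 }");
const asyncLine = emitToolCall("_t2", "api", "{}");
// syncLine contains no "await"; asyncLine goes through __call
```

This would recover the runtime's sync fast path for per-element tools without reintroducing an isPromise() check at runtime — the decision moves to compile time.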

7. Performance work has diminishing returns

Section titled “7. Performance work has diminishing returns”

The honest takeaway: we spent roughly the same engineering effort on the compiler (2,900 lines) as we did on the 12 runtime optimizations combined. The runtime optimizations gave us 10×. The compiler gave us a median 5×. The marginal return on engineering investment dropped significantly.

Worse, the compiler’s gains are concentrated in array-heavy workloads that most bridges don’t hit. A typical bridge with 2–3 tool calls and no arrays sees maybe a 2× improvement. Meanwhile, we now have to maintain two execution paths, keep them in sync, and run every test twice.

Is it worth it? For high-throughput scenarios with array mapping — yes, clearly. For the general case — it’s closer to a wash. The compiler is a valid optimization for a specific performance profile, not a universal upgrade.

Step back and look at the full arc:

| Phase | What we did | Array 1,000 ops/sec | vs. original |
| --- | --- | --- | --- |
| Original engine | Unoptimized interpreter | ~258 | 1× |
| After 12 optimizations | Profiling-driven micro-opts | ~2,700 | 10.5× |
| After compiler | AOT code generation | ~26,800 | 104× |

From 258 ops/sec to 26,800 ops/sec. A 104× improvement — but the two phases were very different in efficiency.

The runtime optimizations (12 targeted changes) gave us 10.5× with relatively modest code changes. The compiler (2,900 new lines, a new package, dual test infrastructure) gave us another 10× on this specific benchmark. On typical bridges, the compiler adds 2–5×.

Neither phase alone would have gotten here. The interpreter optimizations taught us what the overhead was — which is exactly the knowledge needed to design a compiler that eliminates it.

Here’s the thing about moving to generated code: as covered above, some of the runtime’s hard-won optimizations — most notably the sync fast path that MaybePromise<T> provided — didn’t come along.

So the cycle starts again. We have a new baseline — generated JavaScript instead of an interpreter — and a new set of low-hanging fruit. Detect sync tools at compile time and call them without await. Use .map() instead of for...of when the loop body is synchronous. Eliminate the __call wrapper for tools that don’t need tracing or timeouts. Each of these is a targeted codegen improvement, the kind of work that compounds.

The first 10× came from profiling the interpreter. The second 10× came from replacing it. The next 3× will come from profiling the generated code. Different technique, same discipline.