nendlabsExperiments

eval runtime, sweeps, evidence

trials

A benchmark is useful only when the run itself is an artifact: planned, leased, resumable, and backed by per-case evidence.

Benchmarks Need an Operating Record

The common failure mode for eval code is not that it cannot call a model. It is that the run cannot explain itself later. A loop can produce a score, but a benchmark campaign needs to preserve what was tested, which variant ran, which transport executed the case, how the output was graded, and which record supports the claim.

The core design move is to treat the benchmark as a local runtime. Datasets, runners, graders, evals, variants, sweeps, results, and reports are distinct objects because they answer different questions about the same campaign. The Commonwealth workspace is deliberately small, but that is what makes the runtime pressure visible: fourteen short-answer cases, two model variants, one exact-answer grader, and a sweep that has to leave a durable account of its work.

The useful artifact is not a leaderboard. The useful artifact is the run directory: manifest, execution plan, events, result files, raw provider artifacts, timing summaries, retry metadata, and terminal state.

typescript
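// defineProject, defineEval, defineVariant, and defineSweep are the workspace's own
// definition helpers (imports omitted here); datasets, graders, and runners are the
// registries described above, defined elsewhere in the project.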
export default defineProject({
  name: "commonwealth-leaders",
  datasets,
  graders,
  runners,
  evals: {
    "history-qa": defineEval({
      dataset: "commonwealth-leaders-smoke",
      runner: "openai",
      grader: "exact-answer",
    }),
  },
  variants: {
    "gpt-5.4-mini": defineVariant({ config: { model: "gpt-5.4-mini" } }),
    "gpt-5.4": defineVariant({ config: { model: "gpt-5.4" } }),
  },
  sweeps: {
    "model-compare": defineSweep({
      evals: ["history-qa"],
      variants: ["gpt-5.4-mini", "gpt-5.4"],
    }),
  },
});
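
The project definition above is the input; the run directory is the output. A rough sketch of what one run leaves behind, with illustrative file names rather than the project's actual layout:

typescript
// Illustrative layout only; the real run directory's names and formats belong to the project.
const runDirectorySketch = {
  "manifest.json": "what was requested: project, sweep, evals, variants",
  "plan.json": "the compiled execution plan and queue order",
  "events.jsonl": "append-only progress events behind live status",
  "results/": "one result record per completed case",
  "raw/": "unmodified provider artifacts",
  "timings.json": "per-item and campaign-level timing summaries",
  "state.json": "retry metadata and terminal state",
} as const;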

A Sweep Compiles Into a Plan

A sweep is not a hidden nested loop. It compiles into a single execution plan with target metadata, queue order, item count, campaign-level concurrency, and runner-owned scheduling constraints. The queue is round-robin across targets, so one slow target does not silently define the shape of the whole campaign.
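
As a sketch, that plan could look roughly like the following; the item fields mirror the queue-building loop below, while the plan-level names are invented for illustration:

typescript
// Hypothetical shape for the persisted plan; field names are illustrative, not the project's.
interface ExecutionPlanSketch {
  createdAtMs: number;
  itemCount: number;
  concurrency: number;                              // campaign-level cap
  lanes: Record<string, { maxConcurrent: number }>; // runner-owned scheduling constraints
  targets: Array<{ key: string; evalName: string; variantName: string }>;
  items: Array<{
    id: string;         // `${targetKey}:${caseIndex}`
    targetKey: string;
    caseIndex: number;
    queueIndex: number; // round-robin order across targets
    enqueuedAtMs: number;
    item: unknown;      // the dataset case itself
  }>;
}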

That plan is the shared object used by single-process execution, local worker fanout, and attached shared-filesystem workers. Workers do not receive a private copy of the work. They claim leased items from disk, write results through the same store, and allow a finalizer to reconcile terminal state when the last work has drained.

This is the line between running an eval and operating an eval. Running an eval asks for outputs. Operating an eval asks for a plan, leases, status, retry state, cooldown state, and a way to resume after interruption.

typescript
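// Interleave cases across targets round-robin: case 0 for every target, then case 1,
// and so on. `targets`, `items`, `queueIndex`, `maxCaseCount`, and `createdAtMs` are
// defined by the surrounding planner code, which is elided here.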
for (let caseIndex = 0; caseIndex < maxCaseCount; caseIndex += 1) {
  for (const target of targets) {
    const item = target.cases[caseIndex];
    if (!item) continue;

    items.push({
      id: `${target.key}:${caseIndex}`,
      targetKey: target.key,
      caseIndex,
      queueIndex,
      enqueuedAtMs: createdAtMs,
      item,
    });
    queueIndex += 1;
  }
}
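
On the worker side, consuming that queue is a loop over the same persisted plan. A minimal sketch, using hypothetical helper names (claimNextLease, executeItem, writeResult) for the shared-filesystem lease and result stores:

typescript
// Hypothetical helpers standing in for the project's lease and result stores.
declare function claimNextLease(workerId: string): Promise<{ item: unknown } | null>;
declare function executeItem(item: unknown): Promise<unknown>;
declare function writeResult(lease: { item: unknown }, result: unknown, meta: { workerId: string }): Promise<void>;

async function workerLoop(workerId: string): Promise<void> {
  for (;;) {
    // Claim one leased item at a time from the shared run directory on disk.
    const lease = await claimNextLease(workerId);
    if (!lease) break; // queue drained; a finalizer reconciles terminal state

    const result = await executeItem(lease.item);
    // Results flow through the same store the single-process path uses.
    await writeResult(lease, result, { workerId });
  }
}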

Runners Own Transport Pressure

The runner boundary is where comparison stops pretending every subject behaves the same. An eval definition should not know how to speak to a model provider, a browser agent, a local command, or a remote service. A runner normalizes the case, applies variant configuration, executes the transport, and returns a normalized result with enough metadata to explain what happened.
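
One way to sketch that contract as a type; the names are illustrative, not the project's exported API:

typescript
// Illustrative runner contract: normalize the case, apply the variant's config,
// execute the transport, and return a normalized result plus explanatory metadata.
interface RunnerSketch<Case, Output> {
  transport: string; // e.g. "openai", "browser", "local-command"
  normalize(rawCase: Case, variantConfig: Record<string, unknown>): NormalizedInvocationSketch;
  execute(invocation: NormalizedInvocationSketch): Promise<{
    output: Output;
    metadata: {
      totalMs: number;
      retryCount: number;
      raw: unknown; // untouched provider artifact for the run directory
    };
  }>;
}

interface NormalizedInvocationSketch {
  input: unknown;
  model?: string;
  outputSchema?: unknown;
}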

That boundary also owns scheduling pressure. Runner lanes can persist provider-specific caps into the execution plan, so transport constraints are part of the runtime rather than CLI folklore. The OpenAI runner also coordinates retry and cooldown behavior across processes through run-scoped state, and it handles structured-output edge cases before the grader sees an answer.
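
The cross-process coordination can be as small as a state record in the run directory. A sketch with invented field names:

typescript
// Invented shape for run-scoped cooldown state; workers check it before dispatching
// and update it when the provider signals rate limiting.
interface CooldownStateSketch {
  provider: string;        // lane this state applies to, e.g. "openai"
  cooldownUntilMs: number; // no new requests before this timestamp
  consecutiveRateLimits: number;
  lastRetryDelayMs: number;
}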

Fair comparison is not transport blindness. Fair comparison means provider behavior is isolated, named, scheduled, and recorded.

typescript
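// ResponsesAPI.Response is the provider's response type; NormalizedInvocation and
// JsonValue are project types. getStructuredOutputText and getStructuredRefusal are
// project helpers (not shown) that return a tagged value, { kind: "text", text } or
// { kind: "refusal", refusal }, or nothing when the response carries no usable content.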
function extractOutput(response: ResponsesAPI.Response, invocation: NormalizedInvocation): JsonValue {
  if (!invocation.outputSchema) {
    return response.output_text;
  }

  const structuredText =
    getStructuredOutputText(response) ?? getStructuredRefusal(response);

  if (!structuredText) {
    const status = response.status ?? "unknown";
    throw new Error(
      `Expected structured JSON output, but the model returned empty text (status=${status}).`,
    );
  }

  if (structuredText.kind === "refusal") {
    throw new Error(`Model refused structured response: ${structuredText.refusal}`);
  }

  return JSON.parse(structuredText.text) as JsonValue;
}

Evidence Lives in Result Records

Reports can be regenerated. Summaries can change. The per-case result is the evidence. Each completed item stores input, output, expected answer when present, grade, eval name, variant name, dataset, runner, transport, queue position, worker identity, timing, usage, runner metadata, and raw provider artifacts.

The latest local model-compare run recorded twenty-eight planned items across two variants, two local workers, campaign concurrency four, and runner lane caps. It finished with twenty-seven passing cases out of twenty-eight. The one miss was useful: an exact-answer grader rejected "Dr. Rajendra Prasad" against "Rajendra Prasad", which is exactly the kind of brittle scoring boundary that an eval system should expose rather than hide.
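
That miss is easy to reproduce with the simplest form of exact matching. A minimal sketch of such a grader, assuming normalization stops at trimming and case-folding; this is not the project's grader:

typescript
// Minimal exact-match grading: any extra token, including an honorific like "Dr.",
// fails the case even when a human reader would accept the answer.
function gradeExactAnswer(output: string, expected: string): { pass: boolean; reason?: string } {
  const normalize = (value: string) => value.trim().toLowerCase();
  if (normalize(output) === normalize(expected)) {
    return { pass: true };
  }
  return { pass: false, reason: `expected "${expected}", received "${output}"` };
}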

The result record keeps the benchmark from collapsing into a number. It gives the score a chain of custody.

typescript
const result: ResultRecord = {
  id: item.id,
  caseId: item.id,
  input: item.input,
  output: runnerResult.output,
  ...(item.expected !== undefined ? { expected: item.expected } : {}),
  grade,
  metadata: {
    eval: target.evalName,
    variant: target.variantName,
    dataset: target.resolved.evalDefinition.dataset,
    runner: target.resolved.evalDefinition.runner,
    transport: target.resolved.runnerDefinition.transport,
    execution: {
      planItemId: workItem.id,
      queueIndex: workItem.queueIndex,
      processId: process.pid,
      totalMs,
      retryCount: requestMetrics.retryCount,
      retryDelayMs: requestMetrics.retryDelayMs,
      ...(worker ? { workerId: worker.workerId } : {}),
    },
  },
};

The Boundary Is Explicit

The honest scope matters. The repository has one tracked commit, so most pivots are visible in design logs and generated run artifacts rather than in a long commit series. The large generated Commonwealth dataset is still a plan; the current evidence is the fourteen-case smoke fixture and the real runs around it.

Some CLI surfaces are intentionally still design rather than implementation: dataset build, report, compare, run show, and result show are not yet part of the working execution runtime. Attached workers are shared-filesystem workers, not a queue service or RPC coordinator.

That boundary is not a weakness of the experiment. It is the point. The work proves a compact operating model: typed benchmark definitions, provider-aware runners, persisted execution plans, leased local workers, resumable state, live status, and per-case evidence. That is enough to make a small benchmark accountable.