Eval harness¶
The tako.eval package runs (orchestrator, dataset) pairs and emits
a JSON report with pass-rate, total attempts, and p50/p95 latency.
Built-in synthetic dataset¶
A 10-task synthetic dataset (a mix of math, factual, and code tasks) ships in-tree to satisfy Phase 3's Definition of Done ("Eval harness runs a 10-task synthetic benchmark and emits a JSON report"):
import asyncio

import tako
from tako.eval import Eval, load_synthetic

async def main():
    # The Fake provider returns the same canned text for every call,
    # so the eval runs fully offline.
    fake = tako.providers.Fake(canned_text="ok hello 42 paris earth 1969 def fn")
    orch = tako.SingleAgent(provider=fake)

    # One attempt per task (k=1) over the built-in 10-task dataset.
    report = await Eval(orch=orch, dataset=load_synthetic(), k=1).run()
    print(report.model_dump_json(indent=2))

asyncio.run(main())
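Run as written, this prints the full JSON report; its fields are described under "Report shape" below.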
CLI¶
python -m tako.eval \
    --orch myproject.fixtures:my_orch \
    --dataset synthetic \
    --k 3 \
    --out report.json
--orch resolves a module:attr spec to a Python object that exposes .run(prompt) -> awaitable. --dataset accepts "synthetic" or a path to a JSONL file whose rows have the shape {id, prompt, expected_substring|expected_regex, max_tokens}.
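As a minimal sketch of the --orch contract, a fixture module such as the hypothetical myproject/fixtures.py below only needs to expose an attribute with an awaitable .run(prompt); the Fake wiring here is illustrative, not prescribed:

# myproject/fixtures.py (hypothetical)
import tako

# Any object with .run(prompt) -> awaitable satisfies the --orch contract;
# here we reuse the built-in SingleAgent with an offline Fake provider.
my_orch = tako.SingleAgent(provider=tako.providers.Fake(canned_text="ok"))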
Custom datasets¶
from tako.eval import Eval, load_jsonl

# Run inside an async function; Eval.run() is awaitable.
dataset = load_jsonl("path/to/eval.jsonl")
report = await Eval(orch=my_orch, dataset=dataset, k=3, concurrency=8).run()
Each task requires at least one of expected_substring or expected_regex. Both may be set; if so, both must match.
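For illustration, two JSONL rows matching that shape; the ids, prompts, and expectations are made up:

{"id": "capital-fr", "prompt": "What is the capital of France?", "expected_substring": "Paris", "max_tokens": 32}
{"id": "stub-fn", "prompt": "Write a Python function stub named fn.", "expected_substring": "def fn", "expected_regex": "def fn\\s*\\(", "max_tokens": 64}

The second row sets both fields, so an attempt passes only if the substring and the regex both match.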
External datasets¶
load_dataset("swe_bench_lite") and load_dataset("gpqa_diamond")
raise NotImplementedError — Phase 4 work. No model weights or
proprietary datasets are committed in-tree.
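If calling code needs to degrade gracefully until then, a guard like the following works; importing load_dataset from tako.eval is an assumption, mirroring the other loaders:

from tako.eval import load_dataset  # assumed location, mirroring load_jsonl

try:
    dataset = load_dataset("swe_bench_lite")
except NotImplementedError:
    dataset = None  # external loaders land in Phase 4; skip or fall back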
Report shape¶
from pydantic import BaseModel

class EvalReport(BaseModel):
    dataset: str
    orchestrator: str
    k: int
    tasks_run: int
    pass_rate: float
    p50_latency_ms: float
    p95_latency_ms: float
    total_attempts: int
    task_results: list[TaskResult]  # per-task detail; see below
Each TaskResult has task_id, attempts, passes, the per-attempt
latencies, and an optional error field if the orchestrator raised.
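A minimal sketch of that model, reconstructed from the description above; the exact name of the latency field (latencies_ms here) is an assumption:

from pydantic import BaseModel

class TaskResult(BaseModel):
    task_id: str
    attempts: int                  # attempts made for this task (up to k)
    passes: int                    # attempts whose output matched
    latencies_ms: list[float]      # per-attempt latency; field name assumed
    error: str | None = None       # set if the orchestrator raised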