Skip to content

Evals — Council vs Single

Council > single model, but proven. The multi-perspective claim must be measurably better via evals, not asserted.

The evals/ directory holds Council-vs-Single comparison data: the same tasks answered by a single model and by the Council, scored and published.

The most convincing evidence comes from tasks where a skeptic/reviewer perspective changes the outcome:

  • Code review: does the Council catch real bugs a single chat misses?
  • Decision memos: does surfaced dissent improve the recommendation?
  • Document triage: does multi-perspective reading reduce false conclusions?

Choosing the most convincing eval set is an open question.

Publishing the eval data makes the central differentiator falsifiable. If the Council doesn’t beat a single model on a task family, that’s visible. A contributor can improve the recipes until it does. This lands as roadmap phase v0.4.