10. Cross-team ownership

Privacy is one cross-cutting concern. Ownership is another — who decides what enters the set, who reviews changes, how multiple teams contribute when they share a system. Cross-team governance is the discipline that makes the set sustainable across people and time.

A platform engineer at a Pune SaaS company runs the eval set for the customer-support agent. Three product teams use the agent: sales, support, and billing. Each team has different priorities. Sales wants cases for its new lead-routing feature; support wants regression prevention for the failure modes it triages; billing wants explicit coverage of the regulatory cases in its domain. The engineer owns the set technically; she does not own the priorities. After a year of trying to satisfy everyone, the set has bloated, coverage of failure modes is uneven, and each team feels under-served. The fix is cross-team governance: a quarterly sync where each team's owner contributes their priorities, the set owner allocates the case budget across teams, and the changelog notes which team contributed which cases. The set's coherence returns; each team has a voice; conflicts surface as conversations rather than as silent neglect.

This chapter describes the discipline that makes that conversation regular.

Who owns the set

A single accountable owner with cross-team contributors. The owner:

Decides what enters the set
Maintains the version and the changelog
Runs the refresh cadence
Mediates conflicts between contributing teams
Signs off on each version

The owner is typically drawn from the team that operates the system: the platform team for a platform-wide set, the product team for a feature-specific set. The accountable person is rarely an engineer working in isolation; more often it is a PM or a senior engineer with domain context.

Contributors:

Product teams using the system propose cases and priorities
Domain experts provide labels (chapter 03)
Engineering proposes technical-correctness cases (tool calls, schema adherence)
Compliance proposes regulatory-coverage cases

The owner is one; the contributors are many. The discipline is that contribution is welcomed but the integration of contributions into the set is the owner's call.

The governance forum

A quarterly cross-team sync. Agenda:

Review of recent eval scores — overall and per-team-relevant strata
New failure modes from each team's production — candidates for set additions
Set's coverage gaps — strata under-represented relative to importance
Priorities for the next quarter's refresh — each team's case wishlist
Allocation of refresh budget — how many cases go to each team's priorities
Calibration on rubric or labelling questions — alignment across team domains

The sync is 60–90 minutes. The output is a refresh plan for the quarter: how many cases per team, what categories, what calibration is needed.

Between syncs, the owner manages day-to-day decisions; cross-team conflicts that cannot be settled that way are raised at the next sync.
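The coverage-gap agenda item can be prepared mechanically before the sync. The following is a minimal sketch, assuming each case carries a single stratum tag and the owner maintains a relative importance weight per stratum. The function name `coverage_gaps`, the `tolerance` default, and the numbers are all hypothetical and only illustrate the comparison.

```python
from collections import Counter

def coverage_gaps(case_strata, importance, tolerance=0.05):
    """Return strata whose share of the set trails their importance share.

    case_strata: one stratum tag per case in the set.
    importance:  stratum -> relative importance weight (any positive scale).
    tolerance:   how far below its importance share a stratum may fall
                 before it is flagged (hypothetical default).
    """
    counts = Counter(case_strata)
    total_cases = len(case_strata)
    total_weight = sum(importance.values())
    gaps = {}
    for stratum, weight in importance.items():
        target = weight / total_weight
        actual = counts.get(stratum, 0) / total_cases if total_cases else 0.0
        if target - actual > tolerance:
            gaps[stratum] = round(target - actual, 3)
    return gaps

# Illustrative numbers only: 100 cases, importance weights 4:3:3.
strata = ["sales"] * 60 + ["support"] * 30 + ["billing"] * 10
weights = {"sales": 4, "support": 3, "billing": 3}
gaps = coverage_gaps(strata, weights)
print(gaps)  # {'billing': 0.2}
```

A flagged stratum is a prompt for discussion at the forum, not an automatic allocation; the importance weights themselves are a judgement the owner and the teams still have to agree on.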

The contribution flow

A team proposing a case follows the flow:

Open a contribution request. A PR or ticket with the proposed case (input, suggested label, stratum tags, source provenance).
Owner review. The owner checks that the case fits the set's role (regression-prevention or distribution sample), that the labels are valid (or sends the case to labelling per chapter 03), and that the strata are appropriate.
Calibration if needed. If the case touches a domain not previously covered (a new feature, a new segment), the owner calibrates with domain experts before labels are finalised.
Add to set. Case enters with provenance noting the contributing team.
Score baseline. The case is run; the baseline is captured for future regression detection.

Most contributions go through this flow without friction. Calibration is the slow step for novel domains.
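The flow above can be sketched as a small record plus a mechanical pre-check that routes a request to its next step. This is an illustration under stated assumptions, not a prescribed schema: the field names, the step names returned by `review`, and the example values are all hypothetical, and the check does not replace the owner's judgement about whether a case belongs in the set.

```python
from dataclasses import dataclass

# The two roles named in the owner-review step.
ALLOWED_ROLES = {"regression-prevention", "distribution-sample"}

@dataclass
class ContributionRequest:
    case_id: str
    input_text: str
    suggested_label: str
    stratum_tags: list[str]
    provenance: str          # where the case came from, e.g. a ticket reference
    contributing_team: str
    role: str                # which of the set's roles the case serves

def review(request, known_strata):
    """Return (next_step, notes) for a proposed case."""
    notes = []
    if request.role not in ALLOWED_ROLES:
        notes.append(f"unknown role: {request.role}")
    if not request.provenance:
        notes.append("missing source provenance")
    if not request.stratum_tags:
        notes.append("missing stratum tags")
    if notes:
        return "return-to-contributor", notes
    if not request.suggested_label:
        return "send-to-labelling", ["no suggested label"]
    # A stratum the set has never seen signals a novel domain.
    novel = [t for t in request.stratum_tags if t not in known_strata]
    if novel:
        return "calibrate", [f"new stratum: {t}" for t in novel]
    return "add-to-set", []

request = ContributionRequest(
    case_id="case-001",
    input_text="Customer asks to be routed to a sales rep for a new plan.",
    suggested_label="route-to-sales",
    stratum_tags=["lead-routing"],
    provenance="support ticket (illustrative)",
    contributing_team="sales",
    role="regression-prevention",
)
step, notes = review(request, known_strata={"billing", "refunds"})
print(step, notes)  # calibrate ['new stratum: lead-routing']
```

Recording `contributing_team` on the request is what later lets the changelog show which team contributed which cases.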

Conflicts and how to resolve them

Three common conflict patterns.

Conflict on labels. Team A's domain expert says output X is correct; team B's says it is wrong. Resolution: a calibration session with both teams; usually the rubric is sharpened to distinguish the two cases; sometimes the set excludes the genuinely-ambiguous case.

Conflict on priorities. Team A wants 20 new cases for their feature; team B wants 20 for theirs; the budget is 25. Resolution: the owner allocates based on production impact (which feature affects more users, which has the higher-risk profile). The decision is explicit; both teams know the allocation.

Conflict on retention. Team A wants to keep a case; team B says it is stale and inflates the score. Resolution: examine the case's production relevance; if it still represents real behaviour, keep; if not, retire. The owner's call, informed by data not just opinion.

The forum is where these conflicts are surfaced. Between forums, the owner handles them with consultation as needed.
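For the priorities conflict, one way to make the allocation explicit is to compute it from stated impact weights. The sketch below uses a highest-averages (D'Hondt-style) split, which is a substitute chosen for illustration rather than a method the chapter prescribes. The impact weights are hypothetical scores the owner would have to set and defend, and no team receives more than it asked for.

```python
def allocate_budget(requests, impact, budget):
    """Split a case budget across teams in proportion to impact.

    requests: team -> number of cases requested.
    impact:   team -> production-impact weight (hypothetical owner-set score).
    budget:   total cases available this refresh.
    """
    allocation = {team: 0 for team in requests}
    for _ in range(budget):
        # Teams whose request is not yet fully met; sorted for stable tie-breaks.
        open_teams = [t for t in sorted(requests) if allocation[t] < requests[t]]
        if not open_teams:
            break  # every request is already satisfied
        # Give the next case to the team with the highest weight per case held.
        winner = max(open_teams, key=lambda t: impact[t] / (allocation[t] + 1))
        allocation[winner] += 1
    return allocation

# The example from the text: two teams want 20 cases each, the budget is 25.
plan = allocate_budget(
    requests={"team_a": 20, "team_b": 20},
    impact={"team_a": 3, "team_b": 2},  # illustrative weights
    budget=25,
)
print(plan)  # {'team_a': 15, 'team_b': 10}
```

The value of writing it down is less the arithmetic than the visibility: both teams can see the weights and argue about them at the forum instead of discovering the outcome later.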

When one team's domain dominates

Some platforms have one obvious primary team (the product team that mostly drives the system); others have multiple teams with comparable claims. The set's design must reflect the reality:

Primary-team-dominant: the set is heavily weighted toward that team's priorities; other teams have minority cases with explicit allocation.
Multi-team balanced: per-team allocation is roughly equal; each team's lead is a contributor; the owner mediates.
Federation: each team owns a sub-set with its own labelling discipline; the platform aggregates across them for the overall view.

Federation is more complex but appropriate when the teams have genuinely different domains (e.g., different regulated industries on a horizontal platform).
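The overall view in a federated setup can be sketched as a roll-up of per-team results. This assumes every sub-set reports a pass count and a case count under a shared scoring convention; the `passed`/`total` shape, the function name `overall_view`, and the industry names are hypothetical. Pooling counts weights each sub-set by its size, which is one choice among several.

```python
def overall_view(sub_sets):
    """Combine per-team sub-set results into one platform-level summary.

    sub_sets: team -> {"passed": int, "total": int}
    """
    passed = sum(s["passed"] for s in sub_sets.values())
    total = sum(s["total"] for s in sub_sets.values())
    per_team = {
        team: s["passed"] / s["total"]
        for team, s in sub_sets.items()
        if s["total"]  # skip empty sub-sets rather than divide by zero
    }
    return {
        "overall_pass_rate": passed / total if total else None,
        "per_team": per_team,
    }

# Illustrative results for two regulated domains on one platform.
summary = overall_view({
    "health": {"passed": 45, "total": 50},
    "finance": {"passed": 120, "total": 150},
})
print(summary["overall_pass_rate"])  # 165 of 200 cases passed
print(summary["per_team"])
```

Keeping the per-team rates beside the pooled figure matters because a large sub-set can otherwise mask a regression in a small one.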

What if there is no owner

An unowned set has a recognisable pathology. Contributions arrive ad hoc; no one decides on coherence; conflicts simmer silently; the set's quality degrades. Six months in, the set is an accumulation of cases nobody reviewed together.

The fix is to assign an owner. The owner does not need to be senior; they need to be empowered and have time. Even a junior engineer with explicit ownership and a quarterly forum produces better outcomes than a senior team without an owner.

The set as a contract between teams

The set is, in effect, a contract: "these cases are what the system must continue to handle." Each team's contribution to the set is the team's claim about what they care about; the owner's job is to integrate the claims into a coherent whole.

A team that finds its cases not in the set has a signal: either advocate for inclusion (legitimate priority not yet recognised) or accept that the set does not regression-protect their domain (they own the risk in production).

The clarity matters. A team that thinks "the eval covers us" without checking the set's contents may discover late that their cases were never included.

Common mistakes

No accountable owner. Decisions drift; cases accumulate; quality degrades.

Owner without authority. The owner's decisions are overridden by louder teams; coherence still degrades.

No forum. Contributing teams have no structured way to surface priorities; conflicts simmer.

Federation without coordination. Multiple sub-sets with no shared discipline; cross-platform metrics are incomparable.

One team's priorities dominating without explicit allocation. Other teams' coverage atrophies silently.

Interview Q&A

Q1. The set has multiple contributing teams and no clear owner. What goes wrong? Contributions arrive ad-hoc; no one decides what fits the set's role; near-duplicates accumulate; conflicts on labels go unresolved; the set's coherence decays. Six months in, the set is a pile of cases nobody reviewed as a whole. The fix is to assign an accountable owner with the authority to integrate contributions and resolve conflicts. The owner does not need to write every case; they need to be the integration point. Wrong-answer notes: "everyone shares ownership" without a single accountable role produces the chapter-opening drift.

Q2. Walk through how a contributing team adds a case to the set. Open a contribution request (PR or ticket) with the case: input, suggested label, stratum tags, source provenance. The set's owner reviews: does the case fit the set's role, is the label valid (if not, the case goes to labelling), are the strata appropriate. If the case touches a new domain, calibrate with domain experts (chapter 03's discipline). Once approved, the case enters the set with provenance noting the contributing team. The baseline is captured. The case is part of the next version. Wrong-answer notes: "anyone can add a case" without the owner's review produces drift.

Q3. Team A and Team B disagree about the right label for a case. How do you resolve? A calibration session with both teams (chapter 03's discipline applied cross-team). Each side explains their reasoning; the group examines whether the rubric distinguishes the cases or is ambiguous. Outcomes: one team's interpretation prevails after discussion; the rubric is sharpened so future cases are not ambiguous; the case is excluded if genuinely subjective. The set owner facilitates; the conversation is documented for future reference. Wrong-answer notes: "owner decides" without dialogue misses the calibration value.

Q4. The set has been heavily weighted toward Team A's priorities for two years; Team B has been silently under-served. What is the response? Surface explicitly. The next governance forum: present the per-team allocation history; show Team B's production cases that the set does not cover; propose a rebalancing with explicit case allocation for the next quarter. The rebalancing happens at refresh; the set owner allocates the budget per team based on production impact and risk. Team B's voice is now in the process; the silent under-service ends. Wrong-answer notes: "we'll get to it eventually" produces the next year of the same drift.

What to do differently after reading this

Assign an accountable owner for the set. Empower them.
Run a quarterly governance forum with contributing teams.
Treat the set as a cross-team contract; allocate the refresh budget explicitly.
Surface conflicts as calibration topics, not silent disagreement.
For multi-domain platforms, consider federation with shared discipline.

Bridge. Cross-team governance is the operating cadence. The last operational concern is what happens when the set itself is wrong — false positives blocking ships, missed cases producing regressions. The next chapter is eval-set incident response. → 11-eval-set-incident-response.md