nightly-review — skills/skill-improvement/nightly-review/SKILL.md

Skill: Nightly Skill Review

When to apply

Triggered by the nightly-skill-improvement cron at 2am Pacific. You scan the previous 24 hours of the audit log for each real agent (triage, customer-service, order-fulfillment), look for patterns of low confidence or parse errors, and propose targeted SKILL.md updates for human review.

You NEVER edit skill files directly. Every proposal goes through propose_skill_update and lands in the skill_proposals table for an operator to review at /skill-proposals in the dashboard.
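As a rough illustration of what a proposal might carry, the sketch below assembles the five arguments that propose_skill_update takes in the workflow. The helper name build_proposal, the example skill path, and the dict shape are assumptions for illustration, not the tool's real contract; the diff is produced with Python's standard difflib.

```python
import difflib


def build_proposal(agent_id, skill_path, current_text, proposed_text,
                   rationale, eval_delta):
    """Bundle the arguments for propose_skill_update (hypothetical shape)."""
    # unified_diff works on lists of lines; lineterm="" avoids doubled newlines.
    diff_lines = difflib.unified_diff(
        current_text.splitlines(),
        proposed_text.splitlines(),
        fromfile=f"a/{skill_path}",
        tofile=f"b/{skill_path}",
        lineterm="",
    )
    return {
        "agent_id": agent_id,
        "skill_path": skill_path,
        "diff": "\n".join(diff_lines),
        "rationale": rationale,
        "eval_delta": eval_delta,
    }


# Example values are made up; only the argument names come from the workflow.
proposal = build_proposal(
    agent_id="triage",
    skill_path="skills/triage/SKILL.md",  # hypothetical path
    current_text="Always respond as JSON.",
    proposed_text="Always respond as JSON.\nDo not wrap the JSON in prose.",
    rationale="Based on audit rows 1234, 1287, 1301 where parsing failed.",
    eval_delta={"confidence_lift_estimate": 0.04, "addresses": "parse_error"},
)
print(proposal["diff"])
```

The dict is only built here, never written anywhere; the actual submission still has to go through propose_skill_update.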

Workflow

For each of [triage, customer-service, order-fulfillment]:

Call list_recent_audit_for_agent(agent_id=..., hours=24).
Inspect the returned summary:
parse_error_rate > 5% → the model is not following the JSON output schema reliably. Propose tightening the "Always respond as JSON…" instruction in the relevant SKILL.md.
avg_confidence < 0.75 → the model isn't sure about its routing or replies. Propose adding more disambiguation examples to the skill's "Examples" section.
low_confidence_count > 10 → propose an additional rule that captures the ambiguous case (e.g. "if both order-status AND refund are mentioned in the body → route to customer-service AND flag for review").
For each propose-worthy pattern:
a. Call read_skill_file(agent_id, skill_name) to get the current text.
b. Compose a unified-diff-style proposal containing the specific addition you'd make. Include 1-3 sentences of rationale citing the audit pattern.
c. Estimate eval_delta, your estimate of how much the proposal would help (e.g. {"confidence_lift_estimate": 0.04, "addresses": "parse_error"}).
d. Call propose_skill_update(agent_id, skill_path, diff, rationale, eval_delta).
Cap proposals at 3 per agent per night. If you have more candidates, pick the highest-impact ones.
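The threshold checks and the per-agent cap above can be sketched as follows. The summary field names and thresholds come from the workflow; treating parse_error_rate as a fraction (0.05 for 5%), the pattern labels, and the "impact" ranking key are assumptions that may not match the real tool output.

```python
from typing import Any

# Thresholds from the workflow. Assumption: parse_error_rate is a fraction.
PARSE_ERROR_RATE_MAX = 0.05
AVG_CONFIDENCE_MIN = 0.75
LOW_CONFIDENCE_COUNT_MAX = 10
MAX_PROPOSALS_PER_AGENT = 3


def detect_patterns(summary: dict[str, Any]) -> list[str]:
    """Return labels for every audit pattern that crosses its threshold."""
    patterns = []
    if summary.get("parse_error_rate", 0.0) > PARSE_ERROR_RATE_MAX:
        patterns.append("parse_error")          # tighten the JSON instruction
    if summary.get("avg_confidence", 1.0) < AVG_CONFIDENCE_MIN:
        patterns.append("low_avg_confidence")   # add disambiguation examples
    if summary.get("low_confidence_count", 0) > LOW_CONFIDENCE_COUNT_MAX:
        patterns.append("ambiguous_case")       # add a rule for the ambiguous case
    return patterns


def cap_proposals(candidates: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep at most three candidates, ranked by a hypothetical 'impact' score."""
    ranked = sorted(candidates, key=lambda c: c["impact"], reverse=True)
    return ranked[:MAX_PROPOSALS_PER_AGENT]
```

Values sitting exactly on a threshold do not trigger a pattern, which matches the strict comparisons in the workflow.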

Outputs

After all tool calls:

{
  "action": "respond",
  "reasoning": "Scanned triage (no issues), CS (1 parse_error proposal), OF (2 confidence proposals).",
  "confidence": 0.92,
  "output": "3 proposals submitted for review at /skill-proposals"
}
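One way to assemble that final object from per-agent results is sketched below. The four keys mirror the example above; the function name, the input shape (agent id mapped to a list of proposal labels), and the wording used when nothing was proposed are assumptions.

```python
import json


def build_final_output(proposals_by_agent: dict[str, list[str]],
                       confidence: float) -> str:
    """Serialize the end-of-run summary using the four documented keys."""
    parts = []
    total = 0
    for agent_id, kinds in proposals_by_agent.items():
        total += len(kinds)
        if kinds:
            parts.append(f"{agent_id} ({len(kinds)} proposed: {', '.join(kinds)})")
        else:
            parts.append(f"{agent_id} (no issues)")

    if total == 0:
        # No proposals is a valid outcome; still report a one-line summary.
        output = "No proposals: all agents within thresholds"
    else:
        noun = "proposal" if total == 1 else "proposals"
        output = f"{total} {noun} submitted for review at /skill-proposals"

    return json.dumps(
        {
            "action": "respond",
            "reasoning": "Scanned " + ", ".join(parts) + ".",
            "confidence": confidence,
            "output": output,
        },
        indent=2,
    )
```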

Rules

Never edit files directly. Always use propose_skill_update.
Cite specific audit IDs in the rationale — "Based on audit rows 1234, 1287, 1301 where avg confidence was 0.62…" — so the reviewer can verify.
One proposal per change. Don't bundle a parse-fix and a new example into one diff; reviewers should be able to accept or reject independently.
Skip if metrics are fine. No proposals is a valid outcome. Post a one-liner summary as your final output.
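The citation rule can be checked mechanically before a proposal is submitted. The sketch below is an assumption about how that might look: both helper names are hypothetical, and the phrasing simply follows the example rationale quoted above.

```python
import re


def compose_rationale(audit_ids: list[int], avg_confidence: float, claim: str) -> str:
    """Build a rationale that names the audit rows a reviewer can verify."""
    ids = ", ".join(str(i) for i in audit_ids)
    return (
        f"Based on audit rows {ids} where avg confidence was "
        f"{avg_confidence:.2f}: {claim}"
    )


def cites_audit_ids(rationale: str) -> bool:
    """True if the rationale mentions at least one numeric audit row."""
    return re.search(r"audit rows? \d+", rationale) is not None


rationale = compose_rationale(
    [1234, 1287, 1301],
    0.62,
    "mixed order-status and refund requests were routed inconsistently.",
)
```

A rationale that fails cites_audit_ids is a signal to go back to the audit summary rather than submit an unverifiable proposal.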

Anti-patterns

Do not propose changes to skill files for agents that aren't yet real (still stub: true). Their skills aren't being used.
Do not propose stylistic refactors. Only propose changes that target a measured pattern.
Do not exceed your $5/day budget. If you find yourself reading 1000+ audit rows, you're scanning too broadly; narrow to the agents that fired most often.
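The stub and row-count guards above could be combined into a single pre-scan filter. This is a minimal sketch under stated assumptions: the agent record shape ("agent_id", "stub", "audit_rows"), the top_n parameter, and the "billing" stub agent in the example are all hypothetical; only the 1000-row signal and the stub: true flag come from the text.

```python
MAX_AUDIT_ROWS = 1000  # from the text: 1000+ rows means the scan is too broad


def agents_to_scan(agents: list[dict], top_n: int = 2) -> list[str]:
    """Drop stub agents, then narrow to the busiest ones if the scan is too big."""
    # Stub agents' skills are not in use, so proposals for them are pointless.
    live = [a for a in agents if not a.get("stub", False)]

    total_rows = sum(a["audit_rows"] for a in live)
    if total_rows >= MAX_AUDIT_ROWS:
        # Too broad: keep only the agents that fired most often.
        live = sorted(live, key=lambda a: a["audit_rows"], reverse=True)[:top_n]

    return [a["agent_id"] for a in live]


example_agents = [
    {"agent_id": "triage", "stub": False, "audit_rows": 900},
    {"agent_id": "customer-service", "stub": False, "audit_rows": 300},
    {"agent_id": "order-fulfillment", "stub": False, "audit_rows": 50},
    {"agent_id": "billing", "stub": True, "audit_rows": 0},  # hypothetical stub
]
selected = agents_to_scan(example_agents)
```

Row count is used here as a stand-in for spend; it does not measure the $5/day budget directly.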