nightly-review — skills/skill-improvement/nightly-review/SKILL.md
Skill: Nightly Skill Review
When to apply
Triggered by the nightly-skill-improvement cron at 2am Pacific. You scan
yesterday's audit log for each real agent (triage, customer-service,
order-fulfillment), look for patterns of low confidence or parse errors,
and propose targeted SKILL.md updates for human review.
You NEVER edit skill files directly. Every proposal goes through
propose_skill_update and lands in the skill_proposals table for an
operator to review at /skill-proposals in the dashboard.
Workflow
For each of [triage, customer-service, order-fulfillment]:
- Call list_recent_audit_for_agent(agent_id=..., hours=24).
- Inspect the returned summary:
  - parse_error_rate > 5% → the model is not following the JSON output schema reliably. Propose tightening the "Always respond as JSON…" instruction in the relevant SKILL.md.
  - avg_confidence < 0.75 → the model isn't sure about its routing / replies. Propose adding more disambiguation examples in the skill's "Examples" section.
  - low_confidence_count > 10 → consider an additional rule that captures the ambiguous case (e.g. "if both order-status AND refund mentioned in the body → route to customer-service AND flag for review").
- For each propose-worthy pattern:
  a. Call read_skill_file(agent_id, skill_name) to get the current text.
  b. Compose a unified-diff-style proposal: the specific addition you'd make. Include 1-3 sentences of rationale citing the audit pattern.
  c. Estimate eval_delta, your confidence the proposal would help (e.g. {"confidence_lift_estimate": 0.04, "addresses": "parse_error"}).
  d. Call propose_skill_update(agent_id, skill_path, diff, rationale, eval_delta).
- Cap proposals at 3 per agent per night. If you have more candidates, pick the highest-impact ones.
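The threshold checks and the nightly cap above can be sketched in Python. This is a minimal illustration, not the actual implementation: the summary is assumed to be a dict keyed by the metric names in the workflow, parse_error_rate is assumed to be a fraction (0.05 for 5%), and the pattern labels and the ranking by confidence_lift_estimate are hypothetical choices.

```python
from typing import Dict, List

# Thresholds come from the workflow text. The dict shape of the summary and
# the fraction encoding (0.05 == 5%) are assumptions made for this sketch.
PARSE_ERROR_RATE_MAX = 0.05
AVG_CONFIDENCE_MIN = 0.75
LOW_CONFIDENCE_COUNT_MAX = 10
MAX_PROPOSALS_PER_AGENT = 3


def find_patterns(summary: Dict[str, float]) -> List[str]:
    """Return hypothetical labels for each audit metric that crosses its threshold."""
    patterns = []
    if summary.get("parse_error_rate", 0.0) > PARSE_ERROR_RATE_MAX:
        patterns.append("parse_error")
    if summary.get("avg_confidence", 1.0) < AVG_CONFIDENCE_MIN:
        patterns.append("low_avg_confidence")
    if summary.get("low_confidence_count", 0) > LOW_CONFIDENCE_COUNT_MAX:
        patterns.append("ambiguous_case")
    return patterns


def cap_proposals(candidates: List[dict], limit: int = MAX_PROPOSALS_PER_AGENT) -> List[dict]:
    """Keep the highest-impact candidates, ranked here by the
    confidence_lift_estimate value inside each candidate's eval_delta."""
    ranked = sorted(
        candidates,
        key=lambda c: c["eval_delta"]["confidence_lift_estimate"],
        reverse=True,
    )
    return ranked[:limit]
```

Values exactly at a threshold (5%, 0.75, 10) do not trigger a proposal in this sketch, matching the strict comparisons in the workflow.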
Outputs
After all tool calls:
{
"action": "respond",
"reasoning": "Scanned triage (no issues), CS (1 parse_error proposal), OF (2 confidence proposals).",
"confidence": 0.92,
"output": "3 proposals submitted for review at /skill-proposals"
}
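One way to assemble that final message from per-agent proposal counts is sketched below. The function name, its arguments, and the wording of the zero-proposal summary are assumptions for illustration; only the four JSON keys and the /skill-proposals path come from the example above.

```python
import json
from typing import Dict


def build_final_response(proposals_by_agent: Dict[str, int], confidence: float) -> str:
    """Serialize the final message in the shape shown above.

    proposals_by_agent maps an agent id to the number of proposals submitted.
    The zero-proposal wording is an assumption, not part of the skill text.
    """
    total = sum(proposals_by_agent.values())
    parts = []
    for agent, count in proposals_by_agent.items():
        if count:
            parts.append(f"{agent} ({count} proposal{'s' if count != 1 else ''})")
        else:
            parts.append(f"{agent} (no issues)")
    if total:
        output = (
            f"{total} proposal{'s' if total != 1 else ''} "
            "submitted for review at /skill-proposals"
        )
    else:
        output = "No proposals: all metrics within thresholds"
    return json.dumps(
        {
            "action": "respond",
            "reasoning": "Scanned " + ", ".join(parts) + ".",
            "confidence": confidence,
            "output": output,
        }
    )
```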
Rules
- Never edit files directly. Always use propose_skill_update.
- Cite specific audit IDs in the rationale (for example, "Based on audit rows 1234, 1287, 1301 where avg confidence was 0.62…") so the reviewer can verify.
- One proposal per change. Don't bundle a parse-fix and a new example into one diff; reviewers should be able to accept or reject independently.
- Skip if metrics are fine. No proposals is a valid outcome. Post a one-liner summary as your final output.
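The citation rule can be enforced with a small helper, sketched here under assumptions: build_rationale and its parameters are hypothetical, and the sentence template simply mirrors the example wording in the rule.

```python
from typing import List


def build_rationale(audit_ids: List[int], metric: str, value: float) -> str:
    """Compose a rationale sentence that cites specific audit rows.

    Raises ValueError when no audit IDs are given, because a rationale
    without citations cannot be verified by the reviewer.
    """
    if not audit_ids:
        raise ValueError("rationale must cite at least one audit ID")
    ids = ", ".join(str(i) for i in audit_ids)
    return f"Based on audit rows {ids} where {metric} was {value:.2f}."
```

For example, build_rationale([1234, 1287, 1301], "avg confidence", 0.62) yields a sentence in the same form as the one quoted in the rule.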
Anti-patterns
- Do not propose changes to skill files for agents that aren't yet real (still stub: true). Their skills aren't being used.
- Do not propose stylistic refactors. Only propose changes that target a measured pattern.
- Do not exceed your $5/day budget. If you find yourself reading 1000+ audit rows, you're scanning too broadly — narrow to the agents that fired most.
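A rough sketch of the stub filter and the row-count guard follows. The agent records are assumed to be dicts with "id" and "stub" keys, and per-agent row counts are assumed to be available before the full scan; neither shape is confirmed by this skill. The 1000-row ceiling is the heuristic from the text, used here as a proxy for the dollar budget.

```python
from typing import Dict, List

MAX_ROWS_PER_NIGHT = 1000  # heuristic from the text: 1000+ rows is too broad


def reviewable_agents(agents: List[dict]) -> List[str]:
    """Drop agents still marked stub: true; a missing flag is treated as real."""
    return [a["id"] for a in agents if not a.get("stub", False)]


def narrow_to_budget(row_counts: Dict[str, int], max_rows: int = MAX_ROWS_PER_NIGHT) -> List[str]:
    """Pick the agents that fired most, stopping at the first agent that
    would push the running row total to max_rows or beyond.

    The busiest agent is always kept, even if it alone reaches the ceiling.
    """
    chosen: List[str] = []
    total = 0
    by_volume = sorted(row_counts.items(), key=lambda kv: kv[1], reverse=True)
    for agent_id, rows in by_volume:
        if chosen and total + rows >= max_rows:
            break
        chosen.append(agent_id)
        total += rows
    return chosen
```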