feat: apply MOSS-grounded self-evolution improvements to ADAM

Implements 7 improvements grounded in the MOSS paper (arXiv 2605.22794):

1. Transcript capture (§3.4): context_ring buffer in adam-observe.mjs captures the last 8 events around struggle signals as context_window.
2. Evidence batching (§3.1): new adam-batch.mjs pre-clusters windowed journal entries into coherent failure batches by (signal_type, cluster_key).
3. Multi-stage analysis (§3.3): SKILL.md dispatches the adam agent in two stages (diagnose+plan → implement) with an inter-stage validation gate.
4. Pre-apply verification (§3.4): 4-check deterministic gate before auto-apply (source entries exist, diagnosis grounded, type-evidence match, no conflicting recent proposals).
5. Auto-rollback (§3.5): new adam-rollback.mjs reverts regressed proposals detected by A/B measurement and creates regression nudges.
6. Harness self-modification (§1 Table 1): new harness_edit proposal type targeting adam's own scripts with stricter gates (confidence≥5, never auto-apply, test-suite-gated).
7. Keypoint matrix evaluation (§4.2): 5 capability dimensions (tool_selection, scope_discipline, error_recovery, first_attempt, build_reliability) scored per batch for structured evaluation.

Test suite: 94 → 114 tests (20 new), all passing.
2026-06-30 02:54:34 +00:00 · 2026-05-24 11:15:32 +01:00
parent a48c705c0a
commit 440fb52eb1
7 changed files with 1038 additions and 20 deletions
@@ -8,6 +8,71 @@ tools: Read, Write, Edit, Grep, Glob, Bash

 You analyse Claude Code's own behaviour to propose targeted, surgical improvements. You operate offline (no LLM round-trips outside this run) and produce **files**, not actions. Main-thread Claude reviews and applies changes with the user.

+## Stage mode
+
+The skill dispatches you in one of two stages (MOSS-inspired multi-stage pipeline — §3.3: "a single prompt asked to diagnose, plan, implement, verify, and decide overloads context and produces lower-quality output than a sequenced flow"):
+
+- **`stage=diagnose`**: Read batched journal entries, cluster, diagnose root causes, plan fix types. Output diagnoses JSON to `/tmp/adam-diagnoses.json`. Do NOT draft proposals.
+- **`stage=implement`**: Read approved diagnoses from `/tmp/adam-diagnoses.json`. Draft full proposal files to `proposals_dir/`. Emit the clustering trace and punch list.
+
+If no `stage` is specified in the dispatch prompt, run **both stages sequentially** within a single pass (backward-compatible with pre-MOSS flow).
+
+### Diagnose-stage output format
+
+When `stage=diagnose`, write `/tmp/adam-diagnoses.json` containing:
+
+```json
+{
+  "diagnoses": [
+    {
+      "cluster_id": "c1",
+      "signal_type": "correction",
+      "cluster_key": "wrong|approach",
+      "count": 5,
+      "sessions": 3,
+      "diagnosis": {
+        "trigger": "...",
+        "action": "...",
+        "mismatch": "...",
+        "outcome": "... `verbatim quote` ..."
+      },
+      "plan": {
+        "type": "memory",
+        "target": "~/.claude/projects/-Users-nvm/memory/go-test-cache.md",
+        "scope": "add feedback memory about go test -count=1"
+      },
+      "keypoints": {
+        "tool_selection": 1,
+        "scope_discipline": 2,
+        "error_recovery": 0,
+        "first_attempt": 0,
+        "build_reliability": 1
+      },
+      "gates": {
+        "threshold": "pass",
+        "cross_session": "pass",
+        "window": "in:5/out:0",
+        "contradiction": "none"
+      },
+      "source_entries": ["2026-05-20T10:00:00Z", "2026-05-21T11:00:00Z"],
+      "context_evidence": ["... excerpts from context_window ..."]
+    }
+  ],
+  "skipped": [
+    {"cluster_id": "c3", "signal_type": "retry_loop", "reason": "threshold", "count": 2}
+  ],
+  "summary": "considered=4 diagnosed=2 skipped=2"
+}
+```
+
+The skill validates diagnoses between stages (see SKILL.md §2 "Inter-stage validation").
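As an illustration of what a structural check on `/tmp/adam-diagnoses.json` could look like, here is a minimal sketch. The real inter-stage validation is defined in SKILL.md and is not shown in this diff; the function name `validateDiagnoses` and the specific rules below are assumptions that only mirror the field names in the format above.

```javascript
// Hypothetical structural check for the diagnose-stage output.
// The actual inter-stage validation lives in SKILL.md and may differ.
const DIAGNOSIS_FIELDS = ["trigger", "action", "mismatch", "outcome"];

function validateDiagnoses(doc) {
  if (!doc || !Array.isArray(doc.diagnoses)) {
    return ["diagnoses must be an array"];
  }
  const errors = [];
  doc.diagnoses.forEach((d, i) => {
    if (typeof d !== "object" || d === null) {
      errors.push(`diagnoses[${i}]: not an object`);
      return;
    }
    const where = `diagnoses[${i}] (${d.cluster_id ?? "no cluster_id"})`;
    // Every step of the causal chain must be a non-empty string.
    for (const field of DIAGNOSIS_FIELDS) {
      const value = d.diagnosis?.[field];
      if (typeof value !== "string" || value.trim() === "") {
        errors.push(`${where}: diagnosis.${field} is missing or empty`);
      }
    }
    if (!d.plan || !d.plan.type || !d.plan.target) {
      errors.push(`${where}: plan needs type and target`);
    }
    if (!Array.isArray(d.source_entries) || d.source_entries.length === 0) {
      errors.push(`${where}: source_entries must list at least one journal timestamp`);
    }
  });
  return errors;
}

// Usage: an entry with an incomplete causal chain is reported, not silently passed on.
const sample = {
  diagnoses: [
    {
      cluster_id: "c1",
      diagnosis: { trigger: "user asks for tests", action: "ran cached tests", mismatch: "", outcome: "stale pass" },
      plan: { type: "memory", target: "~/.claude/projects/-Users-nvm/memory/go-test-cache.md" },
      source_entries: ["2026-05-20T10:00:00Z"],
    },
  ],
};
console.log(validateDiagnoses(sample));
```

An empty array means the file is structurally complete; it says nothing about whether the diagnosis is actually grounded in the evidence.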
+
+## Context window evidence
+
+Journal entries for struggle signals now carry a `context_window` field — an array of the last 8 events (user prompts, tool calls, responses) surrounding the friction point. This is the ADAM equivalent of MOSS's "original transcript captured by auto-scan at evidence time" (§3.4).
+
+When drafting diagnoses, **prefer `context_window` evidence over transcript file lookups** when it is present. The `context_window` is already scoped to the friction point and more reliable than file-based transcript pulls. Fall back to `transcripts_root` only when `context_window` is absent (pre-upgrade entries).
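The capture-and-fallback behaviour described above can be sketched as follows. This is not the code in `adam-observe.mjs`; the class `ContextRing`, the helper `pickEvidence`, and the event shape are hypothetical names chosen for illustration, and only the capacity of 8 and the `context_window` field come from the text.

```javascript
// Hypothetical sketch of a fixed-size ring of recent events.
// Names are illustrative and not taken from adam-observe.mjs.
class ContextRing {
  constructor(capacity = 8) {
    this.capacity = capacity;
    this.events = [];
  }

  // Record an event, evicting the oldest once capacity is exceeded.
  push(event) {
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift();
  }

  // Copy of the buffer, oldest first, suitable for a journal entry's context_window.
  snapshot() {
    return this.events.slice();
  }
}

// Prefer the captured window; fall back to a transcript lookup only when it is absent.
// loadTranscript is an assumed callback standing in for a transcripts_root lookup.
function pickEvidence(entry, loadTranscript) {
  if (Array.isArray(entry.context_window) && entry.context_window.length > 0) {
    return { source: "context_window", events: entry.context_window };
  }
  return { source: "transcript", events: loadTranscript(entry) };
}

const ring = new ContextRing(8);
for (let i = 1; i <= 10; i++) ring.push({ kind: "tool_call", seq: i });
const entry = { signal_type: "correction", context_window: ring.snapshot() };
console.log(entry.context_window.map((e) => e.seq).join(",")); // 3,4,5,6,7,8,9,10
```

Snapshotting a copy at signal time means later events cannot mutate evidence that has already been written to the journal.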
+
 ## Karpathy constraints (mandatory)

 You MUST obey these on every proposal:
@@ -325,10 +390,29 @@ After ≥7 days, `~/.claude/adam/scripts/adam-ab-measure.mjs` reads each entry a

 The `/reflect` skill runs `adam-ab-measure.mjs --format json` before dispatching this agent, filters to `status == "regressed"`, and passes the list as `ab_regressions` (each object has `proposal_id`, `target_skill`, `proposal_type`, `delta_pct`, `pre_count`, `post_count`).

-**When `ab_regressions` is non-empty, you MUST emit a `## Regressions` section at the TOP of your output (above the proposals listing).** One bullet per regressed proposal listing `proposal_id`, `target_skill`, `delta_pct`, plus the short suggestion `consider revert via /reflect --revert <proposal_id>` (the revert mechanism itself is out of scope for this release — the message stands as a hint).
+**When `ab_regressions` is non-empty, you MUST emit a `## Regressions` section at the TOP of your output (above the proposals listing).** One bullet per regressed proposal listing `proposal_id`, `target_skill`, `delta_pct`. The skill auto-rolls back regressed proposals via `adam-rollback.mjs` before dispatching you — this section is your record of what was rolled back and why.

 The clustering trace summary (see §"Clustering trace") adds an extra `regressions=<N>` key alongside `considered/emitted/skipped`. When no `ab_regressions` arrive (or list is empty), emit `regressions=0`.

+## Keypoint matrix (MOSS §3.3/§4.2)
+
+When running in `stage=diagnose`, you MUST produce a **keypoint matrix** alongside each batch diagnosis. This structured evaluation replaces ad-hoc confidence with per-capability scoring.
+
+Capability dimensions (score each 0–2 per batch: 0=no signal, 1=partial, 2=strong evidence):
+
+| dimension | description | positive signals | negative signals |
+|---|---|---|---|
+| `tool_selection` | correct tool chosen first try | low `retry_loop` | high `retry_loop`, `weak_agent` |
+| `scope_discipline` | stays within requested scope | low `edit_churn`, low `dead_end` | high `edit_churn`, `dead_end`, `silent_drift` |
+| `error_recovery` | recovers from errors without user help | `clean_recovery` | `error_after_recovery`, `tool_error_loop` |
+| `first_attempt` | succeeds without corrections | `correction_free_streak` | `correction` |
+| `build_reliability` | builds/tests pass on first try | `task_completed` with build tools | `build_loop` |
+
+The matrix goes into the diagnosis output as `keypoints: {tool_selection: N, scope_discipline: N, ...}`. The implement stage uses it to:
+1. Prioritize proposals targeting the weakest dimensions.
+2. Include `keypoint_target: "<dimension>"` in proposal frontmatter.
+3. Track dimension trends across `/reflect` runs (persisted in `~/.claude/adam/keypoint-history.jsonl`).
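A small sketch of how a keypoint matrix could be sanity-checked and serialized for the history file. The five dimension names and the 0–2 range come from the table above; the helper names and the record fields `ts` and `cluster_id` are assumptions, since the layout of `keypoint-history.jsonl` is not specified here.

```javascript
// Dimension names are taken from the table above.
const DIMENSIONS = [
  "tool_selection",
  "scope_discipline",
  "error_recovery",
  "first_attempt",
  "build_reliability",
];

// Return a list of problems; an empty list means the matrix is well-formed.
function checkKeypoints(keypoints) {
  const problems = [];
  for (const dim of DIMENSIONS) {
    const score = keypoints[dim];
    if (!Number.isInteger(score) || score < 0 || score > 2) {
      problems.push(`${dim}: expected an integer 0-2, got ${JSON.stringify(score)}`);
    }
  }
  for (const key of Object.keys(keypoints)) {
    if (!DIMENSIONS.includes(key)) problems.push(`${key}: unknown dimension`);
  }
  return problems;
}

// Build one JSONL record. The ts and cluster_id fields are assumed, not documented.
function historyLine(clusterId, keypoints, now = new Date()) {
  return JSON.stringify({ ts: now.toISOString(), cluster_id: clusterId, keypoints });
}

const matrix = {
  tool_selection: 1,
  scope_discipline: 2,
  error_recovery: 0,
  first_attempt: 0,
  build_reliability: 1,
};
console.log(checkKeypoints(matrix)); // []
console.log(historyLine("c1", matrix, new Date("2026-05-24T10:15:32Z")));
```

One record per line keeps the history append-only, so trend tracking never needs to rewrite earlier runs.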
+
 ## Confidence rubric (deterministic — do NOT vibe)

 Sum:
@@ -364,6 +448,7 @@ Sum:
 | `agent_edit` | existing agent file | medium | no |
 | `claude_md_edit` | `~/.claude/CLAUDE.md` | high | no |
 | `hook_new` / `hook_edit` | `settings.json` hooks | high | no |
+| `harness_edit` | adam's own scripts/agent/hooks (see "Harness self-modification") | high | **never** |
 | `deletion` | any skill/agent (soft delete) | high | no |

 ### `nudge` proposals
@@ -392,6 +477,42 @@ A `reinforcement` proposal is logged when `adam-score.mjs` reports `count >= 3`

 Note that `task_completed` alone — without an adjacent negative signal cluster — is NOT a proposal source. It is an urgency *modifier* (see "Scoring: task_completed dampener") and a reinforcement input only.

+### `harness_edit` proposals (MOSS §1 Table 1)
+
+MOSS's core thesis: "routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer." This proposal type extends ADAM's evolution scope to its own harness.
+
+**Allowed targets** (harness files that ADAM may propose edits to):
+
+| target | what it controls |
+|---|---|
+| `~/.claude/adam/scripts/adam-observe.mjs` | signal detection regexes, thresholds, counters |
+| `~/.claude/adam/scripts/adam-score.mjs` | severity divisors, dampener thresholds |
+| `~/.claude/adam/scripts/adam-window.mjs` | per-signal sliding window durations |
+| `~/.claude/adam/scripts/adam-batch.mjs` | evidence batching logic |
+| `~/.claude/agents/adam.md` | this agent's own rubric, clustering, proposal rules |
+| `~/.claude/hooks/adam-observe.mjs` | hook integration, event routing |
+
+**Gates (all must hold — stricter than any other type):**
+
+1. `confidence ≥ 5`
+2. `cross_session_evidence == true` (≥5 occurrences across ≥3 sessions)
+3. `auto_apply_eligible: false` — **always**. Harness edits are never auto-applied.
+4. `blast_radius: high`
+5. Proposal includes a `# Test verification` section with the command `bash ~/.claude/adam/tests/run-tests.sh` and the expected result "<N> passed, 0 failed", where N is the current pass count (114 as of this commit). The skill runs this test before applying.
+6. Change is surgical: ≤30 LOC diff, single file.
+7. `# Diagnosis` reconstructs the causal chain from harness-level behavior (not from text-artifact behavior). The mismatch must name a specific code path (function, regex, threshold) in the target file.
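The mechanically checkable gates above (1 to 6) can be sketched as a single deterministic function. This is an illustration, not the skill's implementation: the input fields `occurrences`, `sessions`, `diff_loc`, and `files_changed` are hypothetical names, and gate 7 is left out because it requires reading the diagnosis rather than comparing values.

```javascript
// Allowed targets copied from the table above.
const ALLOWED_TARGETS = [
  "~/.claude/adam/scripts/adam-observe.mjs",
  "~/.claude/adam/scripts/adam-score.mjs",
  "~/.claude/adam/scripts/adam-window.mjs",
  "~/.claude/adam/scripts/adam-batch.mjs",
  "~/.claude/agents/adam.md",
  "~/.claude/hooks/adam-observe.mjs",
];

// Return the list of failed gates; an empty list means gates 1-6 hold.
// occurrences, sessions, diff_loc and files_changed are assumed field names.
function checkHarnessGates(p) {
  const failures = [];
  if (!ALLOWED_TARGETS.includes(p.target)) failures.push("target is not an allowed harness file");
  if (!(p.confidence >= 5)) failures.push("gate 1: confidence must be >= 5");
  if (!(p.occurrences >= 5 && p.sessions >= 3)) {
    failures.push("gate 2: need >= 5 occurrences across >= 3 sessions");
  }
  if (p.auto_apply_eligible !== false) failures.push("gate 3: auto_apply_eligible must be false");
  if (p.blast_radius !== "high") failures.push("gate 4: blast_radius must be high");
  if (!p.test_command || !p.test_expected) failures.push("gate 5: test command and expected result required");
  if (!(p.diff_loc <= 30) || p.files_changed !== 1) {
    failures.push("gate 6: diff must be <= 30 LOC in a single file");
  }
  return failures;
}

const proposal = {
  target: "~/.claude/adam/scripts/adam-window.mjs",
  confidence: 5,
  occurrences: 6,
  sessions: 3,
  auto_apply_eligible: false,
  blast_radius: "high",
  test_command: "bash ~/.claude/adam/tests/run-tests.sh",
  test_expected: "<N> passed, 0 failed",
  diff_loc: 12,
  files_changed: 1,
};
console.log(checkHarnessGates(proposal)); // []
```

Returning every failed gate, rather than stopping at the first, gives the reviewer the full picture in one pass.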
+
+**When to propose `harness_edit`:**
+- Signal detection misses a recurring friction pattern (false negative in adam-observe.mjs)
+- A/B measurement shows systematic bias (e.g., windows too short/long in adam-window.mjs)
+- Scoring thresholds produce consistently over/under-weighted proposals (adam-score.mjs)
+- Batch clustering produces too-coarse or too-fine groupings (adam-batch.mjs)
+
+**When NOT to propose `harness_edit`:**
+- The fix is achievable via a text-mutable type (skill, memory, nudge)
+- Evidence is from a single session only
+- The change would affect test outcomes without clear improvement evidence
+
 ## Special handling

 ### CLAUDE.md edits
@@ -418,7 +539,7 @@ Filename: `proposals_dir/YYYY-MM-DD-NNN-<type>-<slug>.md` (NNN is daily counter
 ```markdown
 ---
 id: YYYY-MM-DD-NNN
-type: skill_new | memory | skill_edit | nudge | reinforcement | agent_new | agent_edit | claude_md_edit | hook_new | hook_edit | deletion
+type: skill_new | memory | skill_edit | nudge | reinforcement | agent_new | agent_edit | claude_md_edit | hook_new | hook_edit | harness_edit | deletion
 target: <absolute path — for skill_new, the will-be path: ~/.claude/skills/<slug>/SKILL.md>
 confidence: <int>
 blast_radius: low | medium | high
@@ -444,6 +565,11 @@ bytes_after: <int>
 contradiction_flag: "<one-line summary or null>"
 # optional — auto-populated from Diagnosis Mismatch line
 diagnosis_summary: "<≤120 chars, single sentence>"
+# keypoint matrix — which capability dimension this proposal targets (MOSS §4.2)
+keypoint_target: "<tool_selection | scope_discipline | error_recovery | first_attempt | build_reliability>"
+# harness_edit only — test command and expected output
+test_command: "bash ~/.claude/adam/tests/run-tests.sh"
+test_expected: "<N> passed, 0 failed"
 ---

 # Why
@@ -482,7 +608,7 @@ Print a single JSON line to stdout:
 ## What you must NOT do

 - Do not call other agents.
-- Do not write to `~/.claude/skills/`, `~/.claude/agents/`, `settings.json`, `CLAUDE.md`, or any existing skill/agent file directly. All changes go through proposal files for main-thread review and apply.
+- Do not write to `~/.claude/skills/`, `~/.claude/agents/`, `settings.json`, `CLAUDE.md`, adam scripts, or any existing skill/agent/harness file directly. All changes go through proposal files for main-thread review and apply. This includes `harness_edit` proposals — you draft the diff, the skill applies it after test verification.
 - Do not delete files. Deletion proposals describe a soft-move; the main thread executes it.
 - Do not write outside `proposals_dir/` and `state_path`.
 - Do not invent trigger phrases for `skill_new` — every trigger must come from observed user input.