chore(v0.3.3): analyst observability, A/B measurement, journal hygiene

Storage/window/exclusion split (#7): ISO-week journal rotation with safety fuse replaces size-based rotation (fixes silent under-counting when clusters straddle boundaries). Per-signal sliding windows via adam-window.mjs guard against stale signal accumulation. Legacy YYYY-MM-DD-<ts>.jsonl files remain readable. Error fingerprint normalization (#3): adam-observe.mjs extracts canonical error codes (ENOENT, ECONNREFUSED, etc.) and normalizes paths/timestamps/hex before hashing. 'Connection refused' and 'ECONNREFUSED' now cluster identically. Correction corpus expansion (#1): strong tokens (stop, wrong, undo, try again, different approach, etc.) fire on any occurrence. Weak tokens (no, actually, wait) require negation/contrast co-occurrence within 8 tokens. Kills the 'actually, I think...' false positive. Analyst observability (#6): mandatory clustering trace block; adam-explain.mjs parses to summary/full/json. Cluster decisions now surface rejection reasons (threshold, contradiction, window). Persisted to ~/.claude/adam/last-trace.txt. Dead_end nudge proposal type (#2): single-session auto-apply gate (>=3 dead_end events). Action appends to active-nudges.json, surfaced via adam-nudge.mjs at next SessionStart. Lower blast than skill_edit. Per-(skill, fingerprint) cooldown (#4): adam-cooldown.mjs replaces coarse per-skill check. proposal_fingerprint = djb2(skill_slug + cluster_id + normalized_diff_body). Legacy applied/rejected records gate via 'legacy' fingerprint fallback through resolveSkill helper (handles target_skill, skill, or target: <path>). task_completed scoring integration (#8): adam-score.mjs computes per-session urgency dampener (3 task_completed -> 0.5) and reinforcement candidates (skills cited in >=3 clean completions). New 'reinforcement' proposal type appends to reinforcements.jsonl on apply (no code/memory mutation). A/B effectiveness measurement (#5): every auto-applied edit appends to ab-tracking.jsonl. adam-ab-measure.mjs computes 7d pre/post signal-count delta per entry (improved / neutral / regressed / no_baseline / pending). Analyst surfaces regressions at top of /reflect output. Upgrade UX overhaul (#9): adam-upgrade.mjs implements --list/--diff/--accept /--accept-all. SessionStart nudge prints pending-merge warning when .adam-new files exist (latency ~20ms via fixed shortlist). install.sh emits unmissable final-message hint after creating any .adam-new file. Simplify pass: adam-utils.mjs deduplicates readJsonlSafe / listJsonlFiles / parseFrontmatter across 8 scripts. Net -46 LOC. Test coverage: 30 -> 87 tests. Every new feature has feature-validating assertions (false-case coverage included). T77 statically verifies install.sh references every adam-*.mjs source script (would have caught the missing adam-utils inclusion that review #2 surfaced).
2026-06-28 02:47:45 +00:00 · 2026-05-13 01:02:33 +01:00
parent 7ddda26bb4
commit 012c40b9ab
18 changed files with 3064 additions and 81 deletions
@@ -20,7 +20,30 @@ You MUST obey these on every proposal:

 ## Inputs

-Paths arrive via the dispatch prompt — see `~/.claude/skills/adam-self-improvement/SKILL.md` §1.
+Paths arrive via the dispatch prompt — see `~/.claude/skills/adam-self-improvement/SKILL.md` §2.
+
+## Analysis window
+
+The journal you receive is **pre-filtered** by `~/.claude/adam/scripts/adam-window.mjs` before this agent runs. You do NOT apply window math yourself — every entry in the input stream is already within its signal type's freshness window. The same script also drops entries whose `ts` already appears in `applied/*.md` or `rejected/*.md` frontmatter `source_entries`, so the manual excluded-timestamps computation in the Process section below becomes a no-op when the pre-filter is healthy (still keep the logic — it's the fallback if the pre-filter is bypassed).
+
+Per-signal windows (single source of truth: `SIGNAL_WINDOWS_DAYS` in `~/.claude/adam/scripts/adam-window.mjs`):
+
+| signal | window | rationale |
+|---|---|---|
+| `dead_end` | 7 d | autonomy friction — fix-or-forget fast |
+| `correction` | 30 d | user phrasing patterns drift slowly |
+| `tool_error_loop` | 30 d | error fingerprints stable across days |
+| `edit_churn` | 14 d | per-file churn is task-local |
+| `retry_loop` | 14 d | tool-arg retries are task-local |
+| `build_loop` | 30 d | build/test failure patterns |
+| `weak_agent` | 30 d | subagent quality signal |
+| `subagent_dispatch_pattern` | 30 d | dispatch routing pattern |
+| `correction_free_streak` | 60 d | wins accumulate slowly |
+| `clean_recovery` | 60 d | wins accumulate slowly |
+| `task_completed` | 60 d | recipe wins accumulate slowly |
+| (unknown / new types) | 30 d | `DEFAULT_WINDOW_DAYS` fallback |
+
+Cross-session evidence gate: "≥3 occurrences across ≥2 sessions" is now scoped — it means **≥3 occurrences across ≥2 sessions WITHIN the signal's analysis window**. Entries that fall outside the window are not visible to clustering or scoring at all.

 ## Signal types

@@ -231,6 +254,55 @@ A `skill_edit` proposal sets `auto_apply_eligible: true` ONLY when ALL hold:

 If any of (3)–(9) fails: still emit the proposal, but `auto_apply_eligible: false` — main thread queues for review.

+## Per-(skill, fingerprint) cooldown
+
+The cooldown gate is keyed on **(target_skill, proposal_fingerprint)** — not on target_skill alone. A rejected/applied proposal for skill `X` with fingerprint `A` does NOT block future proposals for skill `X` with fingerprint `B`.
+
+`proposal_fingerprint` is computed deterministically as `djb2(skill_slug + "\n" + signal_cluster_id + "\n" + normalized_diff_body)` returned as base36, where:
+
+- `skill_slug` — target skill basename (or proposed slug for `skill_new`)
+- `signal_cluster_id` — the cluster id you assigned in the clustering trace (e.g. `c1`, `tool_error_loop-ECONNREFUSED:5432`)
+- `normalized_diff_body` — proposal's `# Proposed change` section with all whitespace collapsed to single spaces and trailing newlines stripped
+
+Both apply-time and analyst-time checks invoke `adam-cooldown.mjs --skill <slug> --fingerprint <hash>`. The script returns one of `{"status":"cool"}`, `{"status":"cooldown",...}`, or `{"status":"blacklisted",...}`. Auto-apply requires `cool`.
+
+Backward compat: proposals from before this rubric version (no `proposal_fingerprint` field) are treated as `fingerprint = "legacy"`. The cooldown script matches legacy applied/rejected records against any query fingerprint for the same skill — i.e. coarse-grained gating until those records age out of their windows (7d / 30d).
+
+## Scoring: task_completed dampener
+
+Before scoring each cluster's confidence, multiply the cluster's urgency score by the `dampener` value reported by `adam-score.mjs` for the session the cluster originated from:
+
+- `task_completed_count >= 3` in that session → dampener `0.5`
+- `task_completed_count >= 1` in that session → dampener `0.75`
+- otherwise → dampener `1.0`
+
+Rationale: sessions that successfully closed several multi-tool tasks alongside the friction signal are noisier proposal sources than sessions that produced only friction. The dampener does not zero out signals; it down-weights urgency so cross-session friction beats single-session friction-with-recoveries.
+
+The skill (`adam-self-improvement/SKILL.md` §1) runs `adam-score.mjs` immediately after `adam-window.mjs` and passes both outputs into the analyst's dispatch prompt.
+
+## A/B effectiveness
+
+Every auto-applied edit (`skill_edit`, `skill_new`, `memory`, `nudge`, `reinforcement`) gets a one-line tracking entry written to `~/.claude/adam/ab-tracking.jsonl` by `adam-self-improvement/SKILL.md` immediately after the proposal is moved to `applied/`. Schema:
+
+```json
+{"applied_at":<ms>,"proposal_id":"<id>","proposal_type":"...","target_skill":"<slug>","proposal_fingerprint":"<hash>","originating_signals":[{"type":"<signal>","count":<N>,"session_ids":[...]}],"pre_window_days":7}
+```
+
+After ≥7 days, `~/.claude/adam/scripts/adam-ab-measure.mjs` reads each entry and compares signal counts in the 7-day window BEFORE `applied_at` against the 7-day window AFTER (raw journal counts — does NOT use `adam-window.mjs` filtering). Status assignment:
+
+- `delta_pct = (post - pre) / pre * 100`
+- `pre == 0` → `no_baseline` (cold start, no measurement possible)
+- `delta_pct <= -25` → `improved`
+- `-25 < delta_pct < 25` → `neutral`
+- `delta_pct >= 25` → `regressed`
+- entry younger than 7 days → `pending`
+
+The `/reflect` skill runs `adam-ab-measure.mjs --format json` before dispatching this agent, filters to `status == "regressed"`, and passes the list as `ab_regressions` (each object has `proposal_id`, `target_skill`, `proposal_type`, `delta_pct`, `pre_count`, `post_count`).
+
+**When `ab_regressions` is non-empty, you MUST emit a `## Regressions` section at the TOP of your output (above the proposals listing).** One bullet per regressed proposal listing `proposal_id`, `target_skill`, `delta_pct`, plus the short suggestion `consider revert via /reflect --revert <proposal_id>` (the revert mechanism itself is out of scope for this release — the message stands as a hint).
+
+The clustering trace summary (see §"Clustering trace") adds an extra `regressions=<N>` key alongside `considered/emitted/skipped`. When no `ab_regressions` arrive (or list is empty), emit `regressions=0`.
+
 ## Confidence rubric (deterministic — do NOT vibe)

 Sum:
@@ -257,12 +329,40 @@ Sum:
 | `memory` | `~/.claude/projects/-Users-nvm/memory/*.md` | low | yes if conf≥4 AND cross_session |
 | `skill_new` | new dir under `~/.claude/skills/` | low | yes if conf≥4 AND cross_session |
 | `skill_edit` | existing skill file | medium | yes if win-evidence + LOC + cooldown gates all pass (see "Win-driven skill_edit eligibility") |
+| `nudge` | append to `~/.claude/adam/active-nudges.json` | low | yes when `dead_end_count ≥ 3` in a single session (single-session evidence sufficient; skips cross-session gate). Does NOT modify skills/memories/CLAUDE.md — only seeds a SessionStart reminder for a future session. |
+| `reinforcement` | append entry to `~/.claude/adam/reinforcements.jsonl` | low | yes if conf≥4 AND blast_radius=low (same gate as memory). Applies via `adam-apply-reinforcement.mjs`; appends one JSONL entry, no code/memory/skill changes. |
 | `agent_new` | new file under `~/.claude/agents/` | medium | no |
 | `agent_edit` | existing agent file | medium | no |
 | `claude_md_edit` | `~/.claude/CLAUDE.md` | high | no |
 | `hook_new` / `hook_edit` | `settings.json` hooks | high | no |
 | `deletion` | any skill/agent (soft delete) | high | no |

+### `nudge` proposals
+
+A `nudge` proposal does NOT modify any persistent rubric/skill/memory artifact. Its sole side-effect is to append an entry to `~/.claude/adam/active-nudges.json` so the next SessionStart hook surfaces a one-line reminder to the user in a *different* session.
+
+Trigger: `adam-nudge-eligibility.mjs --session <id>` returns `eligible: true` (i.e. ≥3 `dead_end` entries inside a single session). Distinguished from `skill_edit` precisely because there is no learning artifact to mutate — the action surfaces a checkpoint reminder, not a behavior change.
+
+`active-nudges.json` entry shape (created by the skill at apply time):
+
+```json
+{
+  "kind": "dead_end_reminder",
+  "message": "adam: previous session hit 3 dead_ends — consider a checkpoint before continuing.",
+  "created_at": <ms>,
+  "expires_at_ts": <ms now + 7 days>,
+  "max_displays": 3,
+  "displays_used": 0,
+  "source_session": "<originating session_id>"
+}
+```
+
+### `reinforcement` proposals
+
+A `reinforcement` proposal is logged when `adam-score.mjs` reports `count >= 3` clean `task_completed` events citing the same `active_skills[0]` slug. Frontmatter MUST include `skill_slug`, `count`, `source_session`, `confidence`, `blast_radius: low`. Apply gate (`confidence >= 4 AND blast_radius == low`) is identical to the `memory` gate — when both hold, the skill invokes `~/.claude/adam/scripts/adam-apply-reinforcement.mjs <proposal-path>` which appends one JSON line to `~/.claude/adam/reinforcements.jsonl` of shape `{ts, skill_slug, count, source_session}`. No code/memory/skill modifications either side of the gate — reinforcements are a positive-only ledger, separate from `ab-tracking.jsonl` (A/B intentionally does NOT measure positive signals to avoid skewing regression detection).
+
+Note that `task_completed` alone — without an adjacent negative signal cluster — is NOT a proposal source. It is a urgency *modifier* (see "Scoring: task_completed dampener") and a reinforcement input only.
+
 ## Special handling

 ### CLAUDE.md edits
@@ -289,7 +389,7 @@ Filename: `proposals_dir/YYYY-MM-DD-NNN-<type>-<slug>.md` (NNN is daily counter
 ```markdown
 ---
 id: YYYY-MM-DD-NNN
-type: skill_new | memory | skill_edit | agent_new | agent_edit | claude_md_edit | hook_new | hook_edit | deletion
+type: skill_new | memory | skill_edit | nudge | reinforcement | agent_new | agent_edit | claude_md_edit | hook_new | hook_edit | deletion
 target: <absolute path — for skill_new, the will-be path: ~/.claude/skills/<slug>/SKILL.md>
 confidence: <int>
 blast_radius: low | medium | high
@@ -301,6 +401,12 @@ source_entries:
  - "<journal entry ts that fed this cluster>"
  - "<another ts>"
  - "..."
+# skill_edit / skill_new — required for cooldown gate (see "Per-(skill, fingerprint) cooldown" below)
+proposal_fingerprint: "<djb2_base36 hash — computed via computeProposalFingerprint() in adam-cooldown.mjs>"
+target_skill: "<slug — populated for skill_edit (basename of target dir) and skill_new (proposed slug)>"
+# A/B effectiveness — required on every proposal; consumed at apply time to seed ab-tracking.jsonl
+originating_signals:
+  - {type: "<signal_type>", count: <N>, session_ids: ["<sid>", "..."]}
 # skill_edit only — required when auto_apply_eligible: true
 win_evidence: "<ts of triggering clean_recovery or correction_free_streak entry>"
 bytes_before: <int>
@@ -351,3 +457,48 @@ Print a single JSON line to stdout:
 - Do not delete files. Deletion proposals describe a soft-move; the main thread executes it.
 - Do not write outside `proposals_dir/` and `state_path`.
 - Do not invent trigger phrases for `skill_new` — every trigger must come from observed user input.
+
+## Clustering trace (always emit)
+
+After your proposals are written and BEFORE the final punch-list JSON line, you MUST emit a fenced code block tagged ` ```trace ` containing one line per cluster considered during this pass. This is mandatory regardless of whether any proposals were emitted, and regardless of any flags. The skill controls whether to SHOW this block to the user; you always produce it.
+
+Line format (one cluster per line, all four pipe-separated chunks required):
+
+```
+<cluster_id> | signal=<type> count=<N> sessions=<M> | gates: threshold=<pass|fail:<reason>>, cross_session=<pass|fail>, window=<in:<N>/out:<M>>, contradiction=<none|vetoed:[[memory-name]]> | decision: <proposal_emitted:<type>|skipped:<reason>>
+```
+
+Field semantics:
+
+- `cluster_id` — short stable identifier you assign per cluster this pass (e.g. `c1`, `c2`, …, or `<signal>-<short-key>`). Used by humans + adam-explain.mjs.
+- `signal=<type>` — the journal signal type (e.g. `correction`, `dead_end`).
+- `count=<N>` — number of journal entries that fell into this cluster.
+- `sessions=<M>` — distinct session ids contributing.
+- `gates:` — four sub-fields, all required:
+  - `threshold=pass` if the cluster met the "≥3 across ≥2 sessions" (or single-session struggle) rubric gate, else `fail:<short reason>` (e.g. `fail:only_1_session`, `fail:count_below_3`).
+  - `cross_session=pass|fail` — boolean restatement matching the `cross_session_evidence` rubric field.
+  - `window=in:<N>/out:<M>` — entries that survived per-signal sliding window vs entries dropped as stale. Pre-filter from `adam-window.mjs` makes `out` usually 0; record what you observed.
+  - `contradiction=none` for non-skill_edit clusters; for `skill_edit` set `vetoed:[[<memory-or-skill-name>]]` when the contradiction heuristic flagged a conflict, else `none`.
+- `decision:` — one of:
+  - `proposal_emitted:<type>` (e.g. `proposal_emitted:memory`, `proposal_emitted:skill_new`).
+  - `skipped:<reason>` where reason is a single token from `{threshold, contradiction, window, rejected-similar, type-bias, deletion-criteria, claude-md-scope, overlap, other}`.
+
+After the cluster lines, emit exactly one summary line (this trailing line is REQUIRED — adam-explain.mjs falls back to synthesising it from the cluster lines if you omit it, but you should always write it):
+
+```
+SUMMARY: considered=<N> emitted=<M> skipped=<N-M> regressions=<R> reasons={threshold:X, contradiction:Y, window:Z, other:W}
+```
+
+`reasons` keys: the same skip-reason tokens used in `decision:`; values are counts; include all four canonical keys (`threshold`, `contradiction`, `window`, `other`) even when zero — `other` is the catch-all for any reason not in the first three. `regressions=<R>` is the count of entries with `status == "regressed"` in the `ab_regressions` input (0 when empty/absent — see §"A/B effectiveness").
+
+Worked example (4 clusters, 2 emitted, 2 skipped):
+
+```trace
+c1 | signal=correction count=5 sessions=3 | gates: threshold=pass, cross_session=pass, window=in:5/out:0, contradiction=none | decision: proposal_emitted:memory
+c2 | signal=dead_end count=1 sessions=1 | gates: threshold=pass, cross_session=fail, window=in:1/out:0, contradiction=none | decision: proposal_emitted:skill_new
+c3 | signal=retry_loop count=2 sessions=1 | gates: threshold=fail:count_below_3, cross_session=fail, window=in:2/out:0, contradiction=none | decision: skipped:threshold
+c4 | signal=tool_error_loop count=4 sessions=2 | gates: threshold=pass, cross_session=pass, window=in:4/out:6, contradiction=none | decision: skipped:window
+SUMMARY: considered=4 emitted=2 skipped=2 regressions=0 reasons={threshold:1, contradiction:0, window:1, other:0}
+```
+
+Clusters that were filtered out entirely BEFORE clustering (e.g. excluded by `applied/*.md` `source_entries`) do not appear here — only clusters that the agent actually considered as candidates. Note: the trace lives entirely in your final assistant message, alongside the punch-list JSON; nothing else writes to disk on the agent side.