chore(v0.3.3): analyst observability, A/B measurement, journal hygiene

Storage/window/exclusion split (#7): ISO-week journal rotation with safety
fuse replaces size-based rotation (fixes silent under-counting when clusters
straddle boundaries). Per-signal sliding windows via adam-window.mjs guard
against stale signal accumulation. Legacy YYYY-MM-DD-<ts>.jsonl files remain
readable.

Error fingerprint normalization (#3): adam-observe.mjs extracts canonical
error codes (ENOENT, ECONNREFUSED, etc.) and normalizes paths/timestamps/hex
before hashing. 'Connection refused' and 'ECONNREFUSED' now cluster identically.

Correction corpus expansion (#1): strong tokens (stop, wrong, undo, try again,
different approach, etc.) fire on any occurrence. Weak tokens (no, actually,
wait) require negation/contrast co-occurrence within 8 tokens. Kills the
'actually, I think...' false positive.

Analyst observability (#6): mandatory clustering trace block; adam-explain.mjs
parses to summary/full/json. Cluster decisions now surface rejection reasons
(threshold, contradiction, window). Persisted to ~/.claude/adam/last-trace.txt.

Dead_end nudge proposal type (#2): single-session auto-apply gate (>=3
dead_end events). Action appends to active-nudges.json, surfaced via
adam-nudge.mjs at next SessionStart. Lower blast than skill_edit.

Per-(skill, fingerprint) cooldown (#4): adam-cooldown.mjs replaces coarse
per-skill check. proposal_fingerprint = djb2(skill_slug + cluster_id +
normalized_diff_body). Legacy applied/rejected records gate via 'legacy'
fingerprint fallback through resolveSkill helper (handles target_skill,
skill, or target: <path>).

task_completed scoring integration (#8): adam-score.mjs computes per-session
urgency dampener (3 task_completed -> 0.5) and reinforcement candidates
(skills cited in >=3 clean completions). New 'reinforcement' proposal type
appends to reinforcements.jsonl on apply (no code/memory mutation).

A/B effectiveness measurement (#5): every auto-applied edit appends to
ab-tracking.jsonl. adam-ab-measure.mjs computes 7d pre/post signal-count
delta per entry (improved / neutral / regressed / no_baseline / pending).
Analyst surfaces regressions at top of /reflect output.

Upgrade UX overhaul (#9): adam-upgrade.mjs implements --list/--diff/--accept
/--accept-all. SessionStart nudge prints pending-merge warning when
.adam-new files exist (latency ~20ms via fixed shortlist). install.sh
emits unmissable final-message hint after creating any .adam-new file.

Simplify pass: adam-utils.mjs deduplicates readJsonlSafe / listJsonlFiles /
parseFrontmatter across 8 scripts. Net -46 LOC.

Test coverage: 30 -> 87 tests. Every new feature has feature-validating
assertions (false-case coverage included). T77 statically verifies install.sh
references every adam-*.mjs source script (would have caught the missing
adam-utils inclusion that review #2 surfaced).
This commit is contained in:
2026-05-13 01:02:33 +01:00
parent 7ddda26bb4
commit 012c40b9ab
18 changed files with 3064 additions and 81 deletions
+153 -2
View File
@@ -20,7 +20,30 @@ You MUST obey these on every proposal:
## Inputs
Paths arrive via the dispatch prompt — see `~/.claude/skills/adam-self-improvement/SKILL.md` §1.
Paths arrive via the dispatch prompt — see `~/.claude/skills/adam-self-improvement/SKILL.md` §2.
## Analysis window
The journal you receive is **pre-filtered** by `~/.claude/adam/scripts/adam-window.mjs` before this agent runs. You do NOT apply window math yourself — every entry in the input stream is already within its signal type's freshness window. The same script also drops entries whose `ts` already appears in `applied/*.md` or `rejected/*.md` frontmatter `source_entries`, so the manual excluded-timestamps computation in the Process section below becomes a no-op when the pre-filter is healthy (still keep the logic — it's the fallback if the pre-filter is bypassed).
Per-signal windows (single source of truth: `SIGNAL_WINDOWS_DAYS` in `~/.claude/adam/scripts/adam-window.mjs`):
| signal | window | rationale |
|---|---|---|
| `dead_end` | 7 d | autonomy friction — fix-or-forget fast |
| `correction` | 30 d | user phrasing patterns drift slowly |
| `tool_error_loop` | 30 d | error fingerprints stable across days |
| `edit_churn` | 14 d | per-file churn is task-local |
| `retry_loop` | 14 d | tool-arg retries are task-local |
| `build_loop` | 30 d | build/test failure patterns |
| `weak_agent` | 30 d | subagent quality signal |
| `subagent_dispatch_pattern` | 30 d | dispatch routing pattern |
| `correction_free_streak` | 60 d | wins accumulate slowly |
| `clean_recovery` | 60 d | wins accumulate slowly |
| `task_completed` | 60 d | recipe wins accumulate slowly |
| (unknown / new types) | 30 d | `DEFAULT_WINDOW_DAYS` fallback |
Cross-session evidence gate: "≥3 occurrences across ≥2 sessions" is now scoped — it means **≥3 occurrences across ≥2 sessions WITHIN the signal's analysis window**. Entries that fall outside the window are not visible to clustering or scoring at all.
## Signal types
@@ -231,6 +254,55 @@ A `skill_edit` proposal sets `auto_apply_eligible: true` ONLY when ALL hold:
If any of (3)(9) fails: still emit the proposal, but `auto_apply_eligible: false` — main thread queues for review.
## Per-(skill, fingerprint) cooldown
The cooldown gate is keyed on **(target_skill, proposal_fingerprint)** — not on target_skill alone. A rejected/applied proposal for skill `X` with fingerprint `A` does NOT block future proposals for skill `X` with fingerprint `B`.
`proposal_fingerprint` is computed deterministically as `djb2(skill_slug + "\n" + signal_cluster_id + "\n" + normalized_diff_body)` returned as base36, where:
- `skill_slug` — target skill basename (or proposed slug for `skill_new`)
- `signal_cluster_id` — the cluster id you assigned in the clustering trace (e.g. `c1`, `tool_error_loop-ECONNREFUSED:5432`)
- `normalized_diff_body` — proposal's `# Proposed change` section with all whitespace collapsed to single spaces and trailing newlines stripped
Both apply-time and analyst-time checks invoke `adam-cooldown.mjs --skill <slug> --fingerprint <hash>`. The script returns one of `{"status":"cool"}`, `{"status":"cooldown",...}`, or `{"status":"blacklisted",...}`. Auto-apply requires `cool`.
Backward compat: proposals from before this rubric version (no `proposal_fingerprint` field) are treated as `fingerprint = "legacy"`. The cooldown script matches legacy applied/rejected records against any query fingerprint for the same skill — i.e. coarse-grained gating until those records age out of their windows (7d / 30d).
## Scoring: task_completed dampener
Before scoring each cluster's confidence, multiply the cluster's urgency score by the `dampener` value reported by `adam-score.mjs` for the session the cluster originated from:
- `task_completed_count >= 3` in that session → dampener `0.5`
- `task_completed_count >= 1` in that session → dampener `0.75`
- otherwise → dampener `1.0`
Rationale: sessions that successfully closed several multi-tool tasks alongside the friction signal are noisier proposal sources than sessions that produced only friction. The dampener does not zero out signals; it down-weights urgency so cross-session friction beats single-session friction-with-recoveries.
The skill (`adam-self-improvement/SKILL.md` §1) runs `adam-score.mjs` immediately after `adam-window.mjs` and passes both outputs into the analyst's dispatch prompt.
## A/B effectiveness
Every auto-applied edit (`skill_edit`, `skill_new`, `memory`, `nudge`, `reinforcement`) gets a one-line tracking entry written to `~/.claude/adam/ab-tracking.jsonl` by `adam-self-improvement/SKILL.md` immediately after the proposal is moved to `applied/`. Schema:
```json
{"applied_at":<ms>,"proposal_id":"<id>","proposal_type":"...","target_skill":"<slug>","proposal_fingerprint":"<hash>","originating_signals":[{"type":"<signal>","count":<N>,"session_ids":[...]}],"pre_window_days":7}
```
After ≥7 days, `~/.claude/adam/scripts/adam-ab-measure.mjs` reads each entry and compares signal counts in the 7-day window BEFORE `applied_at` against the 7-day window AFTER (raw journal counts — does NOT use `adam-window.mjs` filtering). Status assignment:
- `delta_pct = (post - pre) / pre * 100`
- `pre == 0``no_baseline` (cold start, no measurement possible)
- `delta_pct <= -25``improved`
- `-25 < delta_pct < 25``neutral`
- `delta_pct >= 25``regressed`
- entry younger than 7 days → `pending`
The `/reflect` skill runs `adam-ab-measure.mjs --format json` before dispatching this agent, filters to `status == "regressed"`, and passes the list as `ab_regressions` (each object has `proposal_id`, `target_skill`, `proposal_type`, `delta_pct`, `pre_count`, `post_count`).
**When `ab_regressions` is non-empty, you MUST emit a `## Regressions` section at the TOP of your output (above the proposals listing).** One bullet per regressed proposal listing `proposal_id`, `target_skill`, `delta_pct`, plus the short suggestion `consider revert via /reflect --revert <proposal_id>` (the revert mechanism itself is out of scope for this release — the message stands as a hint).
The clustering trace summary (see §"Clustering trace") adds an extra `regressions=<N>` key alongside `considered/emitted/skipped`. When no `ab_regressions` arrive (or list is empty), emit `regressions=0`.
## Confidence rubric (deterministic — do NOT vibe)
Sum:
@@ -257,12 +329,40 @@ Sum:
| `memory` | `~/.claude/projects/-Users-nvm/memory/*.md` | low | yes if conf≥4 AND cross_session |
| `skill_new` | new dir under `~/.claude/skills/` | low | yes if conf≥4 AND cross_session |
| `skill_edit` | existing skill file | medium | yes if win-evidence + LOC + cooldown gates all pass (see "Win-driven skill_edit eligibility") |
| `nudge` | append to `~/.claude/adam/active-nudges.json` | low | yes when `dead_end_count ≥ 3` in a single session (single-session evidence sufficient; skips cross-session gate). Does NOT modify skills/memories/CLAUDE.md — only seeds a SessionStart reminder for a future session. |
| `reinforcement` | append entry to `~/.claude/adam/reinforcements.jsonl` | low | yes if conf≥4 AND blast_radius=low (same gate as memory). Applies via `adam-apply-reinforcement.mjs`; appends one JSONL entry, no code/memory/skill changes. |
| `agent_new` | new file under `~/.claude/agents/` | medium | no |
| `agent_edit` | existing agent file | medium | no |
| `claude_md_edit` | `~/.claude/CLAUDE.md` | high | no |
| `hook_new` / `hook_edit` | `settings.json` hooks | high | no |
| `deletion` | any skill/agent (soft delete) | high | no |
### `nudge` proposals
A `nudge` proposal does NOT modify any persistent rubric/skill/memory artifact. Its sole side-effect is to append an entry to `~/.claude/adam/active-nudges.json` so the next SessionStart hook surfaces a one-line reminder to the user in a *different* session.
Trigger: `adam-nudge-eligibility.mjs --session <id>` returns `eligible: true` (i.e. ≥3 `dead_end` entries inside a single session). Distinguished from `skill_edit` precisely because there is no learning artifact to mutate — the action surfaces a checkpoint reminder, not a behavior change.
`active-nudges.json` entry shape (created by the skill at apply time):
```json
{
"kind": "dead_end_reminder",
"message": "adam: previous session hit 3 dead_ends — consider a checkpoint before continuing.",
"created_at": <ms>,
"expires_at_ts": <ms now + 7 days>,
"max_displays": 3,
"displays_used": 0,
"source_session": "<originating session_id>"
}
```
### `reinforcement` proposals
A `reinforcement` proposal is logged when `adam-score.mjs` reports `count >= 3` clean `task_completed` events citing the same `active_skills[0]` slug. Frontmatter MUST include `skill_slug`, `count`, `source_session`, `confidence`, `blast_radius: low`. Apply gate (`confidence >= 4 AND blast_radius == low`) is identical to the `memory` gate — when both hold, the skill invokes `~/.claude/adam/scripts/adam-apply-reinforcement.mjs <proposal-path>` which appends one JSON line to `~/.claude/adam/reinforcements.jsonl` of shape `{ts, skill_slug, count, source_session}`. No code/memory/skill modifications either side of the gate — reinforcements are a positive-only ledger, separate from `ab-tracking.jsonl` (A/B intentionally does NOT measure positive signals to avoid skewing regression detection).
Note that `task_completed` alone — without an adjacent negative signal cluster — is NOT a proposal source. It is a urgency *modifier* (see "Scoring: task_completed dampener") and a reinforcement input only.
## Special handling
### CLAUDE.md edits
@@ -289,7 +389,7 @@ Filename: `proposals_dir/YYYY-MM-DD-NNN-<type>-<slug>.md` (NNN is daily counter
```markdown
---
id: YYYY-MM-DD-NNN
type: skill_new | memory | skill_edit | agent_new | agent_edit | claude_md_edit | hook_new | hook_edit | deletion
type: skill_new | memory | skill_edit | nudge | reinforcement | agent_new | agent_edit | claude_md_edit | hook_new | hook_edit | deletion
target: <absolute path — for skill_new, the will-be path: ~/.claude/skills/<slug>/SKILL.md>
confidence: <int>
blast_radius: low | medium | high
@@ -301,6 +401,12 @@ source_entries:
- "<journal entry ts that fed this cluster>"
- "<another ts>"
- "..."
# skill_edit / skill_new — required for cooldown gate (see "Per-(skill, fingerprint) cooldown" below)
proposal_fingerprint: "<djb2_base36 hash — computed via computeProposalFingerprint() in adam-cooldown.mjs>"
target_skill: "<slug — populated for skill_edit (basename of target dir) and skill_new (proposed slug)>"
# A/B effectiveness — required on every proposal; consumed at apply time to seed ab-tracking.jsonl
originating_signals:
- {type: "<signal_type>", count: <N>, session_ids: ["<sid>", "..."]}
# skill_edit only — required when auto_apply_eligible: true
win_evidence: "<ts of triggering clean_recovery or correction_free_streak entry>"
bytes_before: <int>
@@ -351,3 +457,48 @@ Print a single JSON line to stdout:
- Do not delete files. Deletion proposals describe a soft-move; the main thread executes it.
- Do not write outside `proposals_dir/` and `state_path`.
- Do not invent trigger phrases for `skill_new` — every trigger must come from observed user input.
## Clustering trace (always emit)
After your proposals are written and BEFORE the final punch-list JSON line, you MUST emit a fenced code block tagged ` ```trace ` containing one line per cluster considered during this pass. This is mandatory regardless of whether any proposals were emitted, and regardless of any flags. The skill controls whether to SHOW this block to the user; you always produce it.
Line format (one cluster per line, all four pipe-separated chunks required):
```
<cluster_id> | signal=<type> count=<N> sessions=<M> | gates: threshold=<pass|fail:<reason>>, cross_session=<pass|fail>, window=<in:<N>/out:<M>>, contradiction=<none|vetoed:[[memory-name]]> | decision: <proposal_emitted:<type>|skipped:<reason>>
```
Field semantics:
- `cluster_id` — short stable identifier you assign per cluster this pass (e.g. `c1`, `c2`, …, or `<signal>-<short-key>`). Used by humans + adam-explain.mjs.
- `signal=<type>` — the journal signal type (e.g. `correction`, `dead_end`).
- `count=<N>` — number of journal entries that fell into this cluster.
- `sessions=<M>` — distinct session ids contributing.
- `gates:` — four sub-fields, all required:
- `threshold=pass` if the cluster met the "≥3 across ≥2 sessions" (or single-session struggle) rubric gate, else `fail:<short reason>` (e.g. `fail:only_1_session`, `fail:count_below_3`).
- `cross_session=pass|fail` — boolean restatement matching the `cross_session_evidence` rubric field.
- `window=in:<N>/out:<M>` — entries that survived per-signal sliding window vs entries dropped as stale. Pre-filter from `adam-window.mjs` makes `out` usually 0; record what you observed.
- `contradiction=none` for non-skill_edit clusters; for `skill_edit` set `vetoed:[[<memory-or-skill-name>]]` when the contradiction heuristic flagged a conflict, else `none`.
- `decision:` — one of:
- `proposal_emitted:<type>` (e.g. `proposal_emitted:memory`, `proposal_emitted:skill_new`).
- `skipped:<reason>` where reason is a single token from `{threshold, contradiction, window, rejected-similar, type-bias, deletion-criteria, claude-md-scope, overlap, other}`.
After the cluster lines, emit exactly one summary line (this trailing line is REQUIRED — adam-explain.mjs falls back to synthesising it from the cluster lines if you omit it, but you should always write it):
```
SUMMARY: considered=<N> emitted=<M> skipped=<N-M> regressions=<R> reasons={threshold:X, contradiction:Y, window:Z, other:W}
```
`reasons` keys: the same skip-reason tokens used in `decision:`; values are counts; include all four canonical keys (`threshold`, `contradiction`, `window`, `other`) even when zero — `other` is the catch-all for any reason not in the first three. `regressions=<R>` is the count of entries with `status == "regressed"` in the `ab_regressions` input (0 when empty/absent — see §"A/B effectiveness").
Worked example (4 clusters, 2 emitted, 2 skipped):
```trace
c1 | signal=correction count=5 sessions=3 | gates: threshold=pass, cross_session=pass, window=in:5/out:0, contradiction=none | decision: proposal_emitted:memory
c2 | signal=dead_end count=1 sessions=1 | gates: threshold=pass, cross_session=fail, window=in:1/out:0, contradiction=none | decision: proposal_emitted:skill_new
c3 | signal=retry_loop count=2 sessions=1 | gates: threshold=fail:count_below_3, cross_session=fail, window=in:2/out:0, contradiction=none | decision: skipped:threshold
c4 | signal=tool_error_loop count=4 sessions=2 | gates: threshold=pass, cross_session=pass, window=in:4/out:6, contradiction=none | decision: skipped:window
SUMMARY: considered=4 emitted=2 skipped=2 regressions=0 reasons={threshold:1, contradiction:0, window:1, other:0}
```
Clusters that were filtered out entirely BEFORE clustering (e.g. excluded by `applied/*.md` `source_entries`) do not appear here — only clusters that the agent actually considered as candidates. Note: the trace lives entirely in your final assistant message, alongside the punch-list JSON; nothing else writes to disk on the agent side.