feat(v0.6.0): review hardening — live active_skills clustering, computable fingerprints

Full codebase review (multi-agent, adversarially verified) surfaced several documented-but-dead mechanisms and doc/code drift. Fixes: - adam-observe: struggle signals now emit `active_skills`, so silent_drift's primary cluster key AND §5b skill-attribution sub-clustering (+1 rubric bonus) actually fire — both were silently dead (no struggle signal carried the field). - adam-cooldown: new `--compute` CLI deterministically derives proposal_fingerprint. The exported computeProposalFingerprint() was never called and the analyst was told to hand-compute a djb2 hash it cannot reproduce. Spec now mandates a *stable* cluster id so fingerprints reproduce across /reflect runs. Removed one dead normalization line. - spec: reinforcement proposals excluded from A/B tracking — agents/adam.md contradicted itself (:376 included, :476 excluded); SKILL.md aligned. - adam-nudge: PENDING_CHECK_PATHS now mirrors the full install set (adam-utils / adam-batch / adam-rollback were missing). - adam-explain: synthesized clustering summary carries `regressions: 0` (structural consistency with parsed summaries). - docs: test-count drift (87/94 -> 126) and "350-line hook" (-> ~600) fixed; adam-score header documents severity_sum/severity_by_type; adam-batch §4 reference corrected. Tests: +12 assertions (114 -> 126), all green. New regression tests cover the active_skills fix and --compute, plus boundary gaps the review flagged: retry_loop/weak_agent thresholds, A/B exact +/-25% deltas, cooldown 30d blacklist edge.
2026-07-05 03:25:38 +00:00 · 2026-05-29 01:57:44 +01:00
parent 2d9257922f
commit 4b36d6c09e
10 changed files with 219 additions and 20 deletions
@@ -13,7 +13,7 @@ Watches the friction in your coding sessions, clusters the signals via an LLM an

 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
 [![Version](https://img.shields.io/github/v/release/lukaszraczylo/claude-adam?label=version&color=blue)](https://github.com/lukaszraczylo/claude-adam/releases)
-[![Tests](https://img.shields.io/badge/tests-114%20passing-brightgreen.svg)](./adam/tests/run-tests.sh)
+[![Tests](https://img.shields.io/badge/tests-126%20passing-brightgreen.svg)](./adam/tests/run-tests.sh)
 [![Node](https://img.shields.io/badge/node-22%2B-339933.svg)](https://nodejs.org)
 [![Platform](https://img.shields.io/badge/platform-macOS%20%7C%20Linux-lightgrey.svg)]()

@@ -54,7 +54,7 @@ The installer copies files into `~/.claude/`, offers to merge ADAM's hook entrie
 Then:

 ```sh
-bash ~/.claude/adam/tests/run-tests.sh   # expect: 87 passed, 0 failed
+bash ~/.claude/adam/tests/run-tests.sh   # expect: 126 passed, 0 failed
 # … start a fresh Claude Code session …
 /reflect                                  # walks the proposal queue
 /reflect --explain                        # also shows the analyst's clustering trace
@@ -63,8 +63,8 @@ bash ~/.claude/adam/tests/run-tests.sh   # expect: 87 passed, 0 failed
 Pin a release for reproducibility:

 ```sh
-curl -fsSL https://raw.githubusercontent.com/lukaszraczylo/claude-adam/v0.5.0/install.sh \
-  | VERSION=v0.5.0 bash
+curl -fsSL https://raw.githubusercontent.com/lukaszraczylo/claude-adam/v0.6.0/install.sh \
+  | VERSION=v0.6.0 bash
 ```

 ## How it works
@@ -114,7 +114,7 @@ flowchart TB
    class TRACE trace
 ```

-The observation layer is a 350-line Node hook. Pure regex, counters, ring buffers — no LLM in the hot path. Signals append one JSONL line per detection to `~/.claude/adam/journal.jsonl`.
+The observation layer is a ~600-line Node hook. Pure regex, counters, ring buffers — no LLM in the hot path. Signals append one JSONL line per detection to `~/.claude/adam/journal.jsonl`.

 The analysis layer is an LLM subagent invoked by `/reflect`. Before the analyst runs, three deterministic pre-processors filter and enrich the journal: `adam-window.mjs` drops stale entries per per-signal age, `adam-score.mjs` computes per-session urgency dampeners + reinforcement candidates, and `adam-ab-measure.mjs` checks whether previously auto-applied edits actually reduced their originating signal.

@@ -247,11 +247,12 @@ Or pass `--explain` to `/reflect` to render the full trace inline.
    │   ├── adam-apply-reinforcement.mjs    # reinforcement proposal apply
    │   ├── adam-upgrade.mjs                # .adam-new file UX (list/diff/accept)
    │   └── adam-archive.mjs                # post-apply journal cleanup
-    └── tests/run-tests.sh            # 87 isolated tests; never touches live state
+    └── tests/run-tests.sh            # 126 isolated tests; never touches live state
 ```

 ## What's new

+- **v0.6.0** — review hardening. Struggle signals now emit `active_skills`, so `silent_drift`'s primary cluster key and the §5b skill-attribution sub-clustering (+1 rubric bonus) actually fire (both were silently dead). `proposal_fingerprint` is now deterministically computable via `adam-cooldown.mjs --compute` instead of asking the LLM analyst to hand-compute a djb2 hash; spec now mandates a *stable* cluster id so fingerprints reproduce across runs. `reinforcement` proposals are correctly excluded from A/B tracking (the spec previously contradicted itself). `adam-nudge.mjs` pending-upgrade check now mirrors the full install set (`adam-utils`/`adam-batch`/`adam-rollback` were missing). Doc/test-count drift corrected. 126 tests (up from 114).
 - **v0.5.0** — MOSS-grounded self-evolution (arXiv 2605.22794). Transcript capture: `context_window` field on struggle signals captures 8 surrounding events for evidence-based diagnosis. Two-stage analysis pipeline: diagnose+plan → inter-stage validation → implement (§3.3). Evidence batching via `adam-batch.mjs`: pre-clusters journal into coherent failure batches (§3.1). Pre-apply verification: 4-check deterministic gate before auto-apply (§3.4). Auto-rollback via `adam-rollback.mjs`: reverts regressed proposals detected by A/B measurement, creates regression nudges (§3.5). Harness self-modification: new `harness_edit` proposal type lets ADAM propose edits to its own scripts with test-suite-gated apply (§1 Table 1). Keypoint matrix: 5 capability dimensions scored per batch for structured evaluation (§4.2). 114 tests (up from 94).
 - **v0.4.0** — expanded struggle detection: `silent_drift` (5 consecutive read-only tools), `error_after_recovery` (same error fingerprint returns after clean recovery); severity-sum scoring with per-type divisors; extended `STRUGGLE_TYPES` set. 94 tests (up from 87).
 - **v0.3.3** — analyst observability, A/B measurement, journal hygiene. ISO-week journal rotation replaces 5MB size-based (fixes silent cluster-straddling under-count); per-signal sliding windows via `adam-window.mjs`; error fingerprint normalisation; correction corpus expanded + weak-token co-occurrence requirement (kills the `"actually, I think..."` false positive); mandatory clustering trace + `adam-explain.mjs`; new `nudge` and `reinforcement` proposal types; per-(skill, fingerprint) cooldown via `adam-cooldown.mjs`; `task_completed` scoring (dampener + reinforcement); A/B effectiveness measurement; upgrade UX overhaul (`adam-upgrade.mjs --list/--diff/--accept`); shared `adam-utils.mjs`. 87 tests (up from 30).