diff --git a/README.md b/README.md index 3f43dc6..995d536 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ Watches the friction in your coding sessions, clusters the signals via an LLM an [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) [![Version](https://img.shields.io/github/v/release/lukaszraczylo/claude-adam?label=version&color=blue)](https://github.com/lukaszraczylo/claude-adam/releases) -[![Tests](https://img.shields.io/badge/tests-132%20passing-brightgreen.svg)](./adam/tests/run-tests.sh) +[![Tests](https://img.shields.io/badge/tests-134%20passing-brightgreen.svg)](./adam/tests/run-tests.sh) [![Node](https://img.shields.io/badge/node-22%2B-339933.svg)](https://nodejs.org) [![Platform](https://img.shields.io/badge/platform-macOS%20%7C%20Linux-lightgrey.svg)]() @@ -54,7 +54,7 @@ The installer copies files into `~/.claude/`, offers to merge ADAM's hook entrie Then: ```sh -bash ~/.claude/adam/tests/run-tests.sh # expect: 132 passed, 0 failed +bash ~/.claude/adam/tests/run-tests.sh # expect: 134 passed, 0 failed # … start a fresh Claude Code session … /reflect # walks the proposal queue /reflect --explain # also shows the analyst's clustering trace @@ -248,11 +248,12 @@ Or pass `--explain` to `/reflect` to render the full trace inline. │ ├── adam-apply-reinforcement.mjs # reinforcement proposal apply │ ├── adam-upgrade.mjs # .adam-new file UX (list/diff/accept) │ └── adam-archive.mjs # post-apply journal cleanup - └── tests/run-tests.sh # 132 isolated tests; never touches live state + └── tests/run-tests.sh # 134 isolated tests; never touches live state ``` ## What's new +- **v0.6.2** — two fixes surfaced by running ADAM's loop on a large real journal. **(1) A/B volume normalization** (`adam-ab-measure.mjs`): regressions are now measured on the signal's *share* of total activity (rate = count / window-total), not raw count — so a generally busier journal after an apply no longer masquerades as a regression. Falls back to raw delta when the signal is the only activity in the window (preserves prior behavior + tests); output adds `raw_delta_pct`, `pre_total`, `post_total`, `normalized` for transparency. **(2) Memory frontmatter schema** (`agents/adam.md`, `SKILL.md`): the drafting protocol now emits the live auto-memory shape — `name` = slug + a `metadata: {node_type, type, originSessionId}` block — instead of flat `type:`/`originSessionId:`, so auto-applied memories load and categorize correctly. 134 tests (up from 132). - **v0.6.1** — new `file_reread` signal (MOSS §1 harness self-modification, proposed and approved through ADAM's own `/reflect` loop). Consecutive Reads of the same file at different `offset`/`limit` escaped `retry_loop`'s arg-hash dedup and leaked into `tool_error_loop`; `file_reread` now catches them (same file ≥3× in the 10-event window, offset-agnostic, guarded against double-counting byte-identical reads). Fully wired: detection (`adam-observe.mjs`), 14-day window (`adam-window.mjs`), severity divisor 3 (`adam-score.mjs`), file-basename clustering (`adam-batch.mjs`), and the analyst rubric/spec. 132 tests (up from 126). - **v0.6.0** — review hardening. Struggle signals now emit `active_skills`, so `silent_drift`'s primary cluster key and the §5b skill-attribution sub-clustering (+1 rubric bonus) actually fire (both were silently dead). `proposal_fingerprint` is now deterministically computable via `adam-cooldown.mjs --compute` instead of asking the LLM analyst to hand-compute a djb2 hash; spec now mandates a *stable* cluster id so fingerprints reproduce across runs. `reinforcement` proposals are correctly excluded from A/B tracking (the spec previously contradicted itself). `adam-nudge.mjs` pending-upgrade check now mirrors the full install set (`adam-utils`/`adam-batch`/`adam-rollback` were missing). Doc/test-count drift corrected. 126 tests (up from 114). - **v0.5.0** — MOSS-grounded self-evolution (arXiv 2605.22794). Transcript capture: `context_window` field on struggle signals captures 8 surrounding events for evidence-based diagnosis. Two-stage analysis pipeline: diagnose+plan → inter-stage validation → implement (§3.3). Evidence batching via `adam-batch.mjs`: pre-clusters journal into coherent failure batches (§3.1). Pre-apply verification: 4-check deterministic gate before auto-apply (§3.4). Auto-rollback via `adam-rollback.mjs`: reverts regressed proposals detected by A/B measurement, creates regression nudges (§3.5). Harness self-modification: new `harness_edit` proposal type lets ADAM propose edits to its own scripts with test-suite-gated apply (§1 Table 1). Keypoint matrix: 5 capability dimensions scored per batch for structured evaluation (§4.2). 114 tests (up from 94). diff --git a/adam/scripts/adam-ab-measure.mjs b/adam/scripts/adam-ab-measure.mjs index ab9b7d7..4bb4596 100755 --- a/adam/scripts/adam-ab-measure.mjs +++ b/adam/scripts/adam-ab-measure.mjs @@ -3,11 +3,19 @@ // // Reads ~/.claude/adam/ab-tracking.jsonl (one line per auto-apply event, // written by adam-self-improvement/SKILL.md), then for each entry old enough -// (>= --min-age-days; default 7) compares signal counts in the 7-day window -// BEFORE applied_at against the 7-day window AFTER applied_at across the +// (>= --min-age-days; default 7) compares the originating signal in the 7-day +// window BEFORE applied_at against the 7-day window AFTER applied_at across the // full journal corpus (active + rotated). Surfaces regressions so /reflect // can flag proposals that made things worse. // +// Volume normalization: when the windows contain other (non-originating) +// activity, the delta is computed on the signal's SHARE of total activity +// (rate = count / total), not its raw count — so a generally busier journal +// after apply does not masquerade as a regression. When the signal is the only +// activity in the windows, it falls back to the raw-count delta. Output carries +// both `delta_pct` (drives status) and `raw_delta_pct` + `normalized` for +// transparency. +// // CLI: // adam-ab-measure.mjs [--home ] [--format json|table] [--min-age-days N] // @@ -92,31 +100,60 @@ export function computeDeltas(entries, journal, opts = {}) { const preStart = appliedAt - windowDays * DAY_MS; const postEnd = appliedAt + windowDays * DAY_MS; + // preCount/postCount = originating-signal occurrences; preTotal/postTotal = + // ALL journal entries in the window (the activity denominator). let preCount = 0; let postCount = 0; + let preTotal = 0; + let postTotal = 0; for (const je of journal || []) { if (!je || typeof je !== "object") continue; - if (!sigSet.has(je.type)) continue; const t = tsMs(je); if (Number.isNaN(t)) continue; - if (t >= preStart && t < appliedAt) preCount++; - else if (t >= appliedAt && t < postEnd) postCount++; + const inPre = t >= preStart && t < appliedAt; + const inPost = t >= appliedAt && t < postEnd; + if (!inPre && !inPost) continue; + if (inPre) preTotal++; else postTotal++; + if (!sigSet.has(je.type)) continue; + if (inPre) preCount++; else postCount++; } let status; let deltaPct; + let rawDeltaPct = null; + let normalized = false; if (preCount === 0) { status = "no_baseline"; deltaPct = null; } else { - deltaPct = ((postCount - preCount) / preCount) * 100; + rawDeltaPct = Math.round(((postCount - preCount) / preCount) * 10000) / 100; + // Volume normalization: when the windows contain non-originating activity, + // compare the signal's SHARE of activity (rate), not its absolute count — + // otherwise a generally busier post-window masquerades as a regression. + // No background (signal IS the only activity) → fall back to raw delta, + // preserving prior behavior. + const hasBackground = (preTotal - preCount) + (postTotal - postCount) > 0; + if (hasBackground && postTotal > 0) { + const preRate = preCount / preTotal; // preTotal >= preCount > 0 + const postRate = postCount / postTotal; + deltaPct = ((postRate - preRate) / preRate) * 100; + normalized = true; + } else { + deltaPct = ((postCount - preCount) / preCount) * 100; + } // Round to 2 dp for stable comparison + presentation. deltaPct = Math.round(deltaPct * 100) / 100; if (deltaPct <= IMPROVED_PCT) status = "improved"; else if (deltaPct >= REGRESSED_PCT) status = "regressed"; else status = "neutral"; } - out.push({ ...base, pre_count: preCount, post_count: postCount, delta_pct: deltaPct, status }); + out.push({ + ...base, + pre_count: preCount, post_count: postCount, + pre_total: preTotal, post_total: postTotal, + raw_delta_pct: rawDeltaPct, normalized, + delta_pct: deltaPct, status, + }); } return out; } diff --git a/adam/tests/run-tests.sh b/adam/tests/run-tests.sh index 3534346..788bfac 100755 --- a/adam/tests/run-tests.sh +++ b/adam/tests/run-tests.sh @@ -2014,6 +2014,50 @@ done assert_grep "$ROOT/journal.jsonl" '"type":"retry_loop"' "3x byte-identical reads emit retry_loop" assert_no_grep "$ROOT/journal.jsonl" '"type":"file_reread"' "byte-identical reads NOT double-counted as file_reread (sameToolArgs>=RETRY guard)" +# --- Test 112: A/B volume normalization — busier journal does NOT fake a regression --- +echo "Test 112: A/B volume-normalized (raw +200% but flat share → neutral)" +reset_state +applied_at_ms=$(node -e 'console.log(Date.now() - 14*86400000)') +cat > "$ROOT/ab-tracking.jsonl" < "$ROOT/journal.jsonl" +# pre window: 2 correction + 8 dead_end (rate 0.2) +for i in 1 2; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.2)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done +for i in 1 2 3 4 5 6 7 8; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done +# post window: 6 correction + 24 dead_end (rate 0.2 — share unchanged, raw count +200%) +for i in $(seq 1 6); do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done +for i in $(seq 1 24); do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.05)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done +out=$(ABMEASURE_RUN --format json 2>/dev/null) +if echo "$out" | node -e 'let b="";process.stdin.on("data",d=>b+=d).on("end",()=>{const a=JSON.parse(b);const e=a.find(x=>x.proposal_id==="ab-vol-001");process.exit(e&&e.normalized===true&&e.raw_delta_pct===200&&e.status==="neutral"?0:1)})'; then + echo " PASS: volume growth normalized → neutral (raw +200%)"; PASS=$((PASS+1)) +else + echo " FAIL: volume normalization wrong (got: $out)"; FAIL=$((FAIL+1)) +fi +rm -f "$ROOT/ab-tracking.jsonl" + +# --- Test 113: A/B genuine rate regression still flagged --- +echo "Test 113: A/B genuine share increase → regressed" +reset_state +applied_at_ms=$(node -e 'console.log(Date.now() - 14*86400000)') +cat > "$ROOT/ab-tracking.jsonl" < "$ROOT/journal.jsonl" +# pre: 2 correction + 8 dead_end (rate 0.2) +for i in 1 2; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.2)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done +for i in 1 2 3 4 5 6 7 8; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done +# post: 6 correction + 6 dead_end (rate 0.5 — share up → genuine regression) +for i in $(seq 1 6); do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done +for i in 1 2 3 4 5 6; do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.07)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done +out=$(ABMEASURE_RUN --format json 2>/dev/null) +if echo "$out" | node -e 'let b="";process.stdin.on("data",d=>b+=d).on("end",()=>{const a=JSON.parse(b);const e=a.find(x=>x.proposal_id==="ab-vol-002");process.exit(e&&e.normalized===true&&e.status==="regressed"?0:1)})'; then + echo " PASS: genuine share increase → regressed"; PASS=$((PASS+1)) +else + echo " FAIL: genuine regression missed (got: $out)"; FAIL=$((FAIL+1)) +fi +rm -f "$ROOT/ab-tracking.jsonl" + echo echo "Results: $PASS passed, $FAIL failed" [ "$FAIL" = "0" ] diff --git a/agents/adam.md b/agents/adam.md index 2ff7421..0fd911e 100644 --- a/agents/adam.md +++ b/agents/adam.md @@ -250,10 +250,12 @@ Required structure: ```markdown --- -name: -description: -type: user | feedback | project | reference -originSessionId: +name: +description: "" +metadata: + node_type: memory + type: user | feedback | project | reference + originSessionId: --- - reference: pointer to external system + what's there.> ``` +The frontmatter MUST match the live auto-memory schema exactly: `name` is the +slug (NOT a prose title), and `node_type`, `type`, `originSessionId` live under +a `metadata:` block (verify against an existing file in the target memory dir +before drafting — match its shape). + Constraints: -- Frontmatter fields `name`, `description`, `type` are **required**. Skill enforces this at apply time. -- `originSessionId` is required — must be a `session` value from one of the cluster's journal entries. +- Top-level `name` + `description` and nested `metadata.node_type` (always `memory`) + `metadata.type` are **required**. Skill enforces this at apply time. +- `metadata.originSessionId` is required — must be a `session` value from one of the cluster's journal entries. - ≤50 LOC of body content. Surgical. -- Slug (used in `target` path filename) must not collide with any existing memory file. -- For `type=feedback` and `type=project`, body MUST contain `**Why:**` and `**How to apply:**` lines (CLAUDE.md memory schema). +- `name`/slug (also the `target` path filename) must not collide with any existing memory file. +- For `type: feedback` and `type: project`, body MUST contain `**Why:**` and `**How to apply:**` lines (CLAUDE.md memory schema). ## Diagnosis drafting protocol (required for every proposal) @@ -509,7 +516,7 @@ MOSS's core thesis: "routing, hook ordering, state invariants, and dispatch live 2. `cross_session_evidence == true` (≥5 occurrences across ≥3 sessions) 3. `auto_apply_eligible: false` — **always**. Harness edits are never auto-applied. 4. `blast_radius: high` -5. Proposal includes a `# Test verification` section with the command `bash ~/.claude/adam/tests/run-tests.sh` and the expected result "132 passed, 0 failed" (or current pass count). The skill runs this test before applying. +5. Proposal includes a `# Test verification` section with the command `bash ~/.claude/adam/tests/run-tests.sh` and the expected result "134 passed, 0 failed" (or current pass count). The skill runs this test before applying. 6. Change is surgical: ≤30 LOC diff, single file. 7. `# Diagnosis` reconstructs the causal chain from harness-level behavior (not from text-artifact behavior). The mismatch must name a specific code path (function, regex, threshold) in the target file. diff --git a/skills/adam-self-improvement/SKILL.md b/skills/adam-self-improvement/SKILL.md index 810471d..b1e20dc 100644 --- a/skills/adam-self-improvement/SKILL.md +++ b/skills/adam-self-improvement/SKILL.md @@ -300,7 +300,7 @@ Before writing any proposal: - For `skill_new`: confirm the slug doesn't collide with any existing skill in `~/.claude/skills/`. If it does, refuse and ask user to rename. - For `skill_edit`: confirm the diff is append-only (no `-` lines that remove existing content) and that target SKILL.md exists. When auto-applying, ALSO re-verify the eligibility gate steps in §3 (cooldown, blacklist, byte cap) before any `Edit` call — never trust frontmatter alone. - For `skill_edit` with `auto_apply_eligible: true`: confirm `contradiction_flag` is absent or null in frontmatter. Refuse auto-apply if `contradiction_flag` is set with any non-empty value (treat the agent's flag as a hard veto on auto-apply; user can still manually approve in walk-the-queue if they disagree with the heuristic). -- For `memory`: confirm `# Proposed change` body starts with `---` frontmatter containing required fields `name`, `description`, `type`, `originSessionId`. Refuse if frontmatter missing — agent must redraft per the Memory drafting protocol. +- For `memory`: confirm `# Proposed change` body starts with `---` frontmatter matching the live auto-memory schema — top-level `name` (the slug) + `description`, plus a `metadata:` block with `node_type: memory`, `type`, and `originSessionId`. Cross-check the shape against an existing file in the target memory dir. Refuse if frontmatter is flat (`type:`/`originSessionId:` at top level) or missing the `metadata:` block — agent must redraft per the Memory drafting protocol. - For `harness_edit`: confirm `auto_apply_eligible: false` (never auto-apply). Confirm `confidence ≥ 5`. Confirm `# Test verification` section names the test command. Confirm diff is ≤30 LOC and targets a single allowed harness file (see `agents/adam.md` §"Harness self-modification"). Run test suite before AND after applying — revert on any regression. - Confirm `source_entries` is present in proposal frontmatter as a non-empty list (used for archive). Warn (do not refuse) if missing — legacy proposals from before v0.2.0 won't have it.