fix(v0.6.2): A/B volume normalization + memory frontmatter schema

Two issues surfaced by running ADAM's /reflect loop on a large real journal (4015 entries, 119 sessions) — both caused false/broken auto-apply behavior. 1. A/B over-reported regressions (adam-ab-measure.mjs). Regressions were measured on RAW originating-signal counts pre vs post. On a busy, growing journal almost every signal count rises post-apply regardless of whether the proposal helped — so the loop flagged 9 false "regressions" (and would auto-roll-back good proposals). Now the delta is computed on the signal's SHARE of total activity (rate = count / window-total). Falls back to the raw-count delta when the signal is the only activity in the window (preserves prior behavior + all existing A/B tests). Output adds raw_delta_pct, pre_total, post_total, normalized for transparency. 2. Memory frontmatter drift (agents/adam.md, SKILL.md). The drafting protocol emitted flat `type:`/`originSessionId:` with a prose `name`, but the live auto-memory store uses `name` = slug plus a `metadata: {node_type, type, originSessionId}` block. Auto-applied memories could fail to load/categorize. Protocol + apply-time validation now require the live metadata.* schema and cross-checking against an existing file. Tests: 132 -> 134. New: volume growth (raw +200%) with flat activity-share classifies neutral, not regressed; a genuine share increase still classifies regressed.
2026-06-10 23:29:03 +00:00 · 2026-05-29 12:37:10 +01:00
parent 3a54d7d3e1
commit d929101af4
5 changed files with 109 additions and 20 deletions
@@ -13,7 +13,7 @@ Watches the friction in your coding sessions, clusters the signals via an LLM an

 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
 [![Version](https://img.shields.io/github/v/release/lukaszraczylo/claude-adam?label=version&color=blue)](https://github.com/lukaszraczylo/claude-adam/releases)
-[![Tests](https://img.shields.io/badge/tests-132%20passing-brightgreen.svg)](./adam/tests/run-tests.sh)
+[![Tests](https://img.shields.io/badge/tests-134%20passing-brightgreen.svg)](./adam/tests/run-tests.sh)
 [![Node](https://img.shields.io/badge/node-22%2B-339933.svg)](https://nodejs.org)
 [![Platform](https://img.shields.io/badge/platform-macOS%20%7C%20Linux-lightgrey.svg)]()

@@ -54,7 +54,7 @@ The installer copies files into `~/.claude/`, offers to merge ADAM's hook entrie
 Then:

 ```sh
-bash ~/.claude/adam/tests/run-tests.sh   # expect: 132 passed, 0 failed
+bash ~/.claude/adam/tests/run-tests.sh   # expect: 134 passed, 0 failed
 # … start a fresh Claude Code session …
 /reflect                                  # walks the proposal queue
 /reflect --explain                        # also shows the analyst's clustering trace
@@ -248,11 +248,12 @@ Or pass `--explain` to `/reflect` to render the full trace inline.
    │   ├── adam-apply-reinforcement.mjs    # reinforcement proposal apply
    │   ├── adam-upgrade.mjs                # .adam-new file UX (list/diff/accept)
    │   └── adam-archive.mjs                # post-apply journal cleanup
-    └── tests/run-tests.sh            # 132 isolated tests; never touches live state
+    └── tests/run-tests.sh            # 134 isolated tests; never touches live state
 ```

 ## What's new

+- **v0.6.2** — two fixes surfaced by running ADAM's loop on a large real journal. **(1) A/B volume normalization** (`adam-ab-measure.mjs`): regressions are now measured on the signal's *share* of total activity (rate = count / window-total), not raw count — so a generally busier journal after an apply no longer masquerades as a regression. Falls back to raw delta when the signal is the only activity in the window (preserves prior behavior + tests); output adds `raw_delta_pct`, `pre_total`, `post_total`, `normalized` for transparency. **(2) Memory frontmatter schema** (`agents/adam.md`, `SKILL.md`): the drafting protocol now emits the live auto-memory shape — `name` = slug + a `metadata: {node_type, type, originSessionId}` block — instead of flat `type:`/`originSessionId:`, so auto-applied memories load and categorize correctly. 134 tests (up from 132).
 - **v0.6.1** — new `file_reread` signal (MOSS §1 harness self-modification, proposed and approved through ADAM's own `/reflect` loop). Consecutive Reads of the same file at different `offset`/`limit` escaped `retry_loop`'s arg-hash dedup and leaked into `tool_error_loop`; `file_reread` now catches them (same file ≥3× in the 10-event window, offset-agnostic, guarded against double-counting byte-identical reads). Fully wired: detection (`adam-observe.mjs`), 14-day window (`adam-window.mjs`), severity divisor 3 (`adam-score.mjs`), file-basename clustering (`adam-batch.mjs`), and the analyst rubric/spec. 132 tests (up from 126).
 - **v0.6.0** — review hardening. Struggle signals now emit `active_skills`, so `silent_drift`'s primary cluster key and the §5b skill-attribution sub-clustering (+1 rubric bonus) actually fire (both were silently dead). `proposal_fingerprint` is now deterministically computable via `adam-cooldown.mjs --compute` instead of asking the LLM analyst to hand-compute a djb2 hash; spec now mandates a *stable* cluster id so fingerprints reproduce across runs. `reinforcement` proposals are correctly excluded from A/B tracking (the spec previously contradicted itself). `adam-nudge.mjs` pending-upgrade check now mirrors the full install set (`adam-utils`/`adam-batch`/`adam-rollback` were missing). Doc/test-count drift corrected. 126 tests (up from 114).
 - **v0.5.0** — MOSS-grounded self-evolution (arXiv 2605.22794). Transcript capture: `context_window` field on struggle signals captures 8 surrounding events for evidence-based diagnosis. Two-stage analysis pipeline: diagnose+plan → inter-stage validation → implement (§3.3). Evidence batching via `adam-batch.mjs`: pre-clusters journal into coherent failure batches (§3.1). Pre-apply verification: 4-check deterministic gate before auto-apply (§3.4). Auto-rollback via `adam-rollback.mjs`: reverts regressed proposals detected by A/B measurement, creates regression nudges (§3.5). Harness self-modification: new `harness_edit` proposal type lets ADAM propose edits to its own scripts with test-suite-gated apply (§1 Table 1). Keypoint matrix: 5 capability dimensions scored per batch for structured evaluation (§4.2). 114 tests (up from 94).
@@ -3,11 +3,19 @@
 //
 // Reads ~/.claude/adam/ab-tracking.jsonl (one line per auto-apply event,
 // written by adam-self-improvement/SKILL.md), then for each entry old enough
-// (>= --min-age-days; default 7) compares signal counts in the 7-day window
-// BEFORE applied_at against the 7-day window AFTER applied_at across the
+// (>= --min-age-days; default 7) compares the originating signal in the 7-day
+// window BEFORE applied_at against the 7-day window AFTER applied_at across the
 // full journal corpus (active + rotated). Surfaces regressions so /reflect
 // can flag proposals that made things worse.
 //
+// Volume normalization: when the windows contain other (non-originating)
+// activity, the delta is computed on the signal's SHARE of total activity
+// (rate = count / total), not its raw count — so a generally busier journal
+// after apply does not masquerade as a regression. When the signal is the only
+// activity in the windows, it falls back to the raw-count delta. Output carries
+// both `delta_pct` (drives status) and `raw_delta_pct` + `normalized` for
+// transparency.
+//
 // CLI:
 //   adam-ab-measure.mjs [--home <path>] [--format json|table] [--min-age-days N]
 //
@@ -92,31 +100,60 @@ export function computeDeltas(entries, journal, opts = {}) {

    const preStart = appliedAt - windowDays * DAY_MS;
    const postEnd = appliedAt + windowDays * DAY_MS;
+    // preCount/postCount = originating-signal occurrences; preTotal/postTotal =
+    // ALL journal entries in the window (the activity denominator).
    let preCount = 0;
    let postCount = 0;
+    let preTotal = 0;
+    let postTotal = 0;
    for (const je of journal || []) {
      if (!je || typeof je !== "object") continue;
-      if (!sigSet.has(je.type)) continue;
      const t = tsMs(je);
      if (Number.isNaN(t)) continue;
-      if (t >= preStart && t < appliedAt) preCount++;
-      else if (t >= appliedAt && t < postEnd) postCount++;
+      const inPre = t >= preStart && t < appliedAt;
+      const inPost = t >= appliedAt && t < postEnd;
+      if (!inPre && !inPost) continue;
+      if (inPre) preTotal++; else postTotal++;
+      if (!sigSet.has(je.type)) continue;
+      if (inPre) preCount++; else postCount++;
    }

    let status;
    let deltaPct;
+    let rawDeltaPct = null;
+    let normalized = false;
    if (preCount === 0) {
      status = "no_baseline";
      deltaPct = null;
    } else {
-      deltaPct = ((postCount - preCount) / preCount) * 100;
+      rawDeltaPct = Math.round(((postCount - preCount) / preCount) * 10000) / 100;
+      // Volume normalization: when the windows contain non-originating activity,
+      // compare the signal's SHARE of activity (rate), not its absolute count —
+      // otherwise a generally busier post-window masquerades as a regression.
+      // No background (signal IS the only activity) → fall back to raw delta,
+      // preserving prior behavior.
+      const hasBackground = (preTotal - preCount) + (postTotal - postCount) > 0;
+      if (hasBackground && postTotal > 0) {
+        const preRate = preCount / preTotal;   // preTotal >= preCount > 0
+        const postRate = postCount / postTotal;
+        deltaPct = ((postRate - preRate) / preRate) * 100;
+        normalized = true;
+      } else {
+        deltaPct = ((postCount - preCount) / preCount) * 100;
+      }
      // Round to 2 dp for stable comparison + presentation.
      deltaPct = Math.round(deltaPct * 100) / 100;
      if (deltaPct <= IMPROVED_PCT) status = "improved";
      else if (deltaPct >= REGRESSED_PCT) status = "regressed";
      else status = "neutral";
    }
-    out.push({ ...base, pre_count: preCount, post_count: postCount, delta_pct: deltaPct, status });
+    out.push({
+      ...base,
+      pre_count: preCount, post_count: postCount,
+      pre_total: preTotal, post_total: postTotal,
+      raw_delta_pct: rawDeltaPct, normalized,
+      delta_pct: deltaPct, status,
+    });
  }
  return out;
 }
@@ -2014,6 +2014,50 @@ done
 assert_grep    "$ROOT/journal.jsonl" '"type":"retry_loop"' "3x byte-identical reads emit retry_loop"
 assert_no_grep "$ROOT/journal.jsonl" '"type":"file_reread"' "byte-identical reads NOT double-counted as file_reread (sameToolArgs>=RETRY guard)"

+# --- Test 112: A/B volume normalization — busier journal does NOT fake a regression ---
+echo "Test 112: A/B volume-normalized (raw +200% but flat share → neutral)"
+reset_state
+applied_at_ms=$(node -e 'console.log(Date.now() - 14*86400000)')
+cat > "$ROOT/ab-tracking.jsonl" <<EOF
+{"applied_at":$applied_at_ms,"proposal_id":"ab-vol-001","proposal_type":"memory","target_skill":"vol","proposal_fingerprint":"fpV","originating_signals":[{"type":"correction","count":2,"session_ids":["sV"]}],"pre_window_days":7}
+EOF
+> "$ROOT/journal.jsonl"
+# pre window: 2 correction + 8 dead_end (rate 0.2)
+for i in 1 2; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.2)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done
+for i in 1 2 3 4 5 6 7 8; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done
+# post window: 6 correction + 24 dead_end (rate 0.2 — share unchanged, raw count +200%)
+for i in $(seq 1 6); do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done
+for i in $(seq 1 24); do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.05)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done
+out=$(ABMEASURE_RUN --format json 2>/dev/null)
+if echo "$out" | node -e 'let b="";process.stdin.on("data",d=>b+=d).on("end",()=>{const a=JSON.parse(b);const e=a.find(x=>x.proposal_id==="ab-vol-001");process.exit(e&&e.normalized===true&&e.raw_delta_pct===200&&e.status==="neutral"?0:1)})'; then
+  echo "  PASS: volume growth normalized → neutral (raw +200%)"; PASS=$((PASS+1))
+else
+  echo "  FAIL: volume normalization wrong (got: $out)"; FAIL=$((FAIL+1))
+fi
+rm -f "$ROOT/ab-tracking.jsonl"
+
+# --- Test 113: A/B genuine rate regression still flagged ---
+echo "Test 113: A/B genuine share increase → regressed"
+reset_state
+applied_at_ms=$(node -e 'console.log(Date.now() - 14*86400000)')
+cat > "$ROOT/ab-tracking.jsonl" <<EOF
+{"applied_at":$applied_at_ms,"proposal_id":"ab-vol-002","proposal_type":"memory","target_skill":"vol2","proposal_fingerprint":"fpV2","originating_signals":[{"type":"correction","count":2,"session_ids":["sV2"]}],"pre_window_days":7}
+EOF
+> "$ROOT/journal.jsonl"
+# pre: 2 correction + 8 dead_end (rate 0.2)
+for i in 1 2; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.2)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done
+for i in 1 2 3 4 5 6 7 8; do ts=$(node -e "console.log(new Date(Date.now()-(15+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done
+# post: 6 correction + 6 dead_end (rate 0.5 — share up → genuine regression)
+for i in $(seq 1 6); do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.1)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"correction\"}" >> "$ROOT/journal.jsonl"; done
+for i in 1 2 3 4 5 6; do ts=$(node -e "console.log(new Date(Date.now()-(8+$i*0.07)*86400000).toISOString())"); echo "{\"ts\":\"$ts\",\"session\":\"sV2\",\"type\":\"dead_end\"}" >> "$ROOT/journal.jsonl"; done
+out=$(ABMEASURE_RUN --format json 2>/dev/null)
+if echo "$out" | node -e 'let b="";process.stdin.on("data",d=>b+=d).on("end",()=>{const a=JSON.parse(b);const e=a.find(x=>x.proposal_id==="ab-vol-002");process.exit(e&&e.normalized===true&&e.status==="regressed"?0:1)})'; then
+  echo "  PASS: genuine share increase → regressed"; PASS=$((PASS+1))
+else
+  echo "  FAIL: genuine regression missed (got: $out)"; FAIL=$((FAIL+1))
+fi
+rm -f "$ROOT/ab-tracking.jsonl"
+
 echo
 echo "Results: $PASS passed, $FAIL failed"
 [ "$FAIL" = "0" ]
@@ -250,10 +250,12 @@ Required structure:

 ```markdown
 ---
-name: <human-readable name, ≤80 chars>
-description: <one-line description used to decide future relevance — be specific, ≤200 chars>
-type: user | feedback | project | reference
-originSessionId: <session_id from journal entries that fed this cluster>
+name: <slug — snake_case, MUST equal the target filename without `.md`, e.g. feedback_go_test_cache>
+description: "<one-line used to decide future relevance — be specific, ≤200 chars>"
+metadata:
+  node_type: memory
+  type: user | feedback | project | reference
+  originSessionId: <session_id from journal entries that fed this cluster>
 ---

 <Body content per type, see CLAUDE.md memory schema:
@@ -263,12 +265,17 @@ originSessionId: <session_id from journal entries that fed this cluster>
  - reference: pointer to external system + what's there.>
 ```

+The frontmatter MUST match the live auto-memory schema exactly: `name` is the
+slug (NOT a prose title), and `node_type`, `type`, `originSessionId` live under
+a `metadata:` block (verify against an existing file in the target memory dir
+before drafting — match its shape).
+
 Constraints:
- Frontmatter fields `name`, `description`, `type` are **required**. Skill enforces this at apply time.
- `originSessionId` is required — must be a `session` value from one of the cluster's journal entries.
+- Top-level `name` + `description` and nested `metadata.node_type` (always `memory`) + `metadata.type` are **required**. Skill enforces this at apply time.
+- `metadata.originSessionId` is required — must be a `session` value from one of the cluster's journal entries.
 - ≤50 LOC of body content. Surgical.
- Slug (used in `target` path filename) must not collide with any existing memory file.
- For `type=feedback` and `type=project`, body MUST contain `**Why:**` and `**How to apply:**` lines (CLAUDE.md memory schema).
+- `name`/slug (also the `target` path filename) must not collide with any existing memory file.
+- For `type: feedback` and `type: project`, body MUST contain `**Why:**` and `**How to apply:**` lines (CLAUDE.md memory schema).

 ## Diagnosis drafting protocol (required for every proposal)

@@ -509,7 +516,7 @@ MOSS's core thesis: "routing, hook ordering, state invariants, and dispatch live
 2. `cross_session_evidence == true` (≥5 occurrences across ≥3 sessions)
 3. `auto_apply_eligible: false` — **always**. Harness edits are never auto-applied.
 4. `blast_radius: high`
-5. Proposal includes a `# Test verification` section with the command `bash ~/.claude/adam/tests/run-tests.sh` and the expected result "132 passed, 0 failed" (or current pass count). The skill runs this test before applying.
+5. Proposal includes a `# Test verification` section with the command `bash ~/.claude/adam/tests/run-tests.sh` and the expected result "134 passed, 0 failed" (or current pass count). The skill runs this test before applying.
 6. Change is surgical: ≤30 LOC diff, single file.
 7. `# Diagnosis` reconstructs the causal chain from harness-level behavior (not from text-artifact behavior). The mismatch must name a specific code path (function, regex, threshold) in the target file.

@@ -300,7 +300,7 @@ Before writing any proposal:
 - For `skill_new`: confirm the slug doesn't collide with any existing skill in `~/.claude/skills/`. If it does, refuse and ask user to rename.
 - For `skill_edit`: confirm the diff is append-only (no `-` lines that remove existing content) and that target SKILL.md exists. When auto-applying, ALSO re-verify the eligibility gate steps in §3 (cooldown, blacklist, byte cap) before any `Edit` call — never trust frontmatter alone.
 - For `skill_edit` with `auto_apply_eligible: true`: confirm `contradiction_flag` is absent or null in frontmatter. Refuse auto-apply if `contradiction_flag` is set with any non-empty value (treat the agent's flag as a hard veto on auto-apply; user can still manually approve in walk-the-queue if they disagree with the heuristic).
- For `memory`: confirm `# Proposed change` body starts with `---` frontmatter containing required fields `name`, `description`, `type`, `originSessionId`. Refuse if frontmatter missing — agent must redraft per the Memory drafting protocol.
+- For `memory`: confirm `# Proposed change` body starts with `---` frontmatter matching the live auto-memory schema — top-level `name` (the slug) + `description`, plus a `metadata:` block with `node_type: memory`, `type`, and `originSessionId`. Cross-check the shape against an existing file in the target memory dir. Refuse if frontmatter is flat (`type:`/`originSessionId:` at top level) or missing the `metadata:` block — agent must redraft per the Memory drafting protocol.
 - For `harness_edit`: confirm `auto_apply_eligible: false` (never auto-apply). Confirm `confidence ≥ 5`. Confirm `# Test verification` section names the test command. Confirm diff is ≤30 LOC and targets a single allowed harness file (see `agents/adam.md` §"Harness self-modification"). Run test suite before AND after applying — revert on any regression.
 - Confirm `source_entries` is present in proposal frontmatter as a non-empty list (used for archive). Warn (do not refuse) if missing — legacy proposals from before v0.2.0 won't have it.