10 Commits

Author SHA1 Message Date
lukaszraczylo 4d1276a73f feat(v0.6.5): execution-grounded skill-utility report (adam-skill-utility)
Ranks skills by good:bad outcome co-occurrence (Wilson LB + lift vs
baseline) over the journal's active_skills payloads — the SkillsInjector
(arXiv 2605.29794) execution-grounded utility signal Δ(s), computed from
data already collected, no training.

- reuses adam-score NEGATIVE_SIGNAL_TYPES + entrySeverity (single source of truth)
- registered in install.sh helper-script copy loop
- /reflect pre-step surfaces worst below-baseline skills to the USER as
  advisory (co-occurrence != causation; not fed to the analyst's proposal machinery)
- Test 119 added; full suite 141/141 green
2026-06-02 01:47:40 +01:00
lukaszraczylo c23b09cc09 fix(v0.6.4): rollback removes the proposal's ab-tracking entry
adam-rollback.mjs's docstring always claimed it "removes the ab-tracking entry
(so it doesn't re-trigger)", but executeRollback() never did. Consequence: a
rolled-back proposal kept being re-detected as `regressed` on every subsequent
/reflect, which triggered endless `not_found` rollback attempts (the applied
file is already gone) and noisy ## Regressions sections.

executeRollback now deletes the matching ab-tracking.jsonl row by proposal_id
after the move, preserving all unrelated rows. Surfaced by running ADAM's own
/reflect loop a second time (two zombie regressions: 2026-05-16-002 and
2026-05-22-001).

Tests: 138 -> 140 (rollback purges the entry by id; an unrelated entry is
preserved).
2026-05-29 13:50:38 +01:00
lukaszraczylo d929101af4 fix(v0.6.2): A/B volume normalization + memory frontmatter schema
Two issues surfaced by running ADAM's /reflect loop on a large real journal
(4015 entries, 119 sessions) — both caused false/broken auto-apply behavior.

1. A/B over-reported regressions (adam-ab-measure.mjs).
   Regressions were measured on RAW originating-signal counts pre vs post. On a
   busy, growing journal almost every signal count rises post-apply regardless
   of whether the proposal helped — so the loop flagged 9 false "regressions"
   (and would auto-roll-back good proposals). Now the delta is computed on the
   signal's SHARE of total activity (rate = count / window-total). Falls back to
   the raw-count delta when the signal is the only activity in the window
   (preserves prior behavior + all existing A/B tests). Output adds
   raw_delta_pct, pre_total, post_total, normalized for transparency.

2. Memory frontmatter drift (agents/adam.md, SKILL.md).
   The drafting protocol emitted flat `type:`/`originSessionId:` with a prose
   `name`, but the live auto-memory store uses `name` = slug plus a
   `metadata: {node_type, type, originSessionId}` block. Auto-applied memories
   could fail to load/categorize. Protocol + apply-time validation now require
   the live metadata.* schema and cross-checking against an existing file.

Tests: 132 -> 134. New: volume growth (raw +200%) with flat activity-share
classifies neutral, not regressed; a genuine share increase still classifies
regressed.
2026-05-29 12:37:10 +01:00
lukaszraczylo 3a54d7d3e1 feat(v0.6.1): file_reread signal — catch offset-shifted same-file re-reads
Proposed and approved through ADAM's own /reflect harness_edit loop (MOSS §1):
the analyst surfaced 23 tool_error_loop entries across 4 sessions whose context
windows were really redundant re-reads of one file.

retry_loop keys on argsHash of the full tool_input (including offset/limit), so
consecutive Reads of the SAME file at different offsets escaped dedup and leaked
into tool_error_loop fingerprints. The new file_reread signal catches them:
same file Read >=3x in the 10-event window, offset-agnostic (keyed on file
path), guarded by `sameToolArgs < RETRY_THRESHOLD` so byte-identical reads stay
with retry_loop (no double-count).

Fully wired end-to-end (not a half-dead signal):
- adam-observe.mjs: detection + STRUGGLE_TYPES membership (so it carries
  context_window + active_skills like other struggle signals).
- adam-window.mjs: 14-day sliding window (task-local, like retry_loop).
- adam-score.mjs: severity divisor 3.
- adam-batch.mjs: file-basename clustering.
- agents/adam.md + README: signal tables, clustering rules, rubric, windows.

Tests: 126 -> 132 (file_reread fires on 3x offset-shifted reads, not on 2x;
byte-identical reads route to retry_loop not file_reread; carries context_window).
2026-05-29 11:31:50 +01:00
lukaszraczylo 4b36d6c09e feat(v0.6.0): review hardening — live active_skills clustering, computable fingerprints
Full codebase review (multi-agent, adversarially verified) surfaced several
documented-but-dead mechanisms and doc/code drift. Fixes:

- adam-observe: struggle signals now emit `active_skills`, so silent_drift's
  primary cluster key AND §5b skill-attribution sub-clustering (+1 rubric
  bonus) actually fire — both were silently dead (no struggle signal carried
  the field).
- adam-cooldown: new `--compute` CLI deterministically derives
  proposal_fingerprint. The exported computeProposalFingerprint() was never
  called and the analyst was told to hand-compute a djb2 hash it cannot
  reproduce. Spec now mandates a *stable* cluster id so fingerprints reproduce
  across /reflect runs. Removed one dead normalization line.
- spec: reinforcement proposals excluded from A/B tracking — agents/adam.md
  contradicted itself (:376 included, :476 excluded); SKILL.md aligned.
- adam-nudge: PENDING_CHECK_PATHS now mirrors the full install set
  (adam-utils / adam-batch / adam-rollback were missing).
- adam-explain: synthesized clustering summary carries `regressions: 0`
  (structural consistency with parsed summaries).
- docs: test-count drift (87/94 -> 126) and "350-line hook" (-> ~600) fixed;
  adam-score header documents severity_sum/severity_by_type; adam-batch §4
  reference corrected.

Tests: +12 assertions (114 -> 126), all green. New regression tests cover the
active_skills fix and --compute, plus boundary gaps the review flagged:
retry_loop/weak_agent thresholds, A/B exact +/-25% deltas, cooldown 30d
blacklist edge.
2026-05-29 01:57:44 +01:00
lukaszraczylo 440fb52eb1 feat: apply MOSS-grounded self-evolution improvements to ADAM
Implements 7 improvements grounded in MOSS paper (arXiv 2605.22794):

1. Transcript capture (§3.4): context_ring buffer in adam-observe.mjs
   captures last 8 events around struggle signals as context_window.

2. Evidence batching (§3.1): new adam-batch.mjs pre-clusters windowed
   journal entries into coherent failure batches by (signal_type, cluster_key).

3. Multi-stage analysis (§3.3): SKILL.md dispatches adam agent in two
   stages (diagnose+plan → implement) with inter-stage validation gate.

4. Pre-apply verification (§3.4): 4-check deterministic gate before
   auto-apply (source entries exist, diagnosis grounded, type-evidence
   match, no conflicting recent proposals).

5. Auto-rollback (§3.5): new adam-rollback.mjs reverts regressed proposals
   detected by A/B measurement, creates regression nudges.

6. Harness self-modification (§1 Table 1): new harness_edit proposal type
   targeting adam's own scripts with stricter gates (confidence≥5, never
   auto-apply, test-suite-gated).

7. Keypoint matrix evaluation (§4.2): 5 capability dimensions
   (tool_selection, scope_discipline, error_recovery, first_attempt,
   build_reliability) scored per batch for structured evaluation.

Test suite: 94 → 114 tests (20 new), all passing.
2026-05-24 11:15:32 +01:00
lukaszraczylo a48c705c0a feat(adam): smarter signals & clustering
- New signal types in hooks/adam-observe.mjs:
  - silent_drift: 5 consecutive read-only PostToolUse without an action tool
  - error_after_recovery: same error fingerprint returns within 5 events of clean_recovery
- Severity-weighted scoring in adam/scripts/adam-score.mjs:
  - SEVERITY_DIVISORS exported per struggle signal type
  - Per-session severity_sum + severity_by_type added to JSON output
- Skill-attribution clustering in agents/adam.md:
  - Sub-cluster struggle signals on active_skills[0]
  - New struggle-driven skill_edit variant (always queues, never auto-applies)
- Rubric updates:
  - +1 for cluster severity-sum >= 10, additional +1 for >= 32
  - +1 for skill-attributed sub-cluster naming an existing skill
  - silent_drift + error_after_recovery added to struggle signal list
- Window: silent_drift 14d, error_after_recovery 30d
- Tests: 94 passing (78-82 new)

Backward compat: entries without count default to severity 1. Existing
win-driven skill_edit gate untouched. No journal migration.
2026-05-13 19:21:59 +01:00
lukaszraczylo 012c40b9ab chore(v0.3.3): analyst observability, A/B measurement, journal hygiene
Storage/window/exclusion split (#7): ISO-week journal rotation with safety
fuse replaces size-based rotation (fixes silent under-counting when clusters
straddle boundaries). Per-signal sliding windows via adam-window.mjs guard
against stale signal accumulation. Legacy YYYY-MM-DD-<ts>.jsonl files remain
readable.

Error fingerprint normalization (#3): adam-observe.mjs extracts canonical
error codes (ENOENT, ECONNREFUSED, etc.) and normalizes paths/timestamps/hex
before hashing. 'Connection refused' and 'ECONNREFUSED' now cluster identically.

Correction corpus expansion (#1): strong tokens (stop, wrong, undo, try again,
different approach, etc.) fire on any occurrence. Weak tokens (no, actually,
wait) require negation/contrast co-occurrence within 8 tokens. Kills the
'actually, I think...' false positive.

Analyst observability (#6): mandatory clustering trace block; adam-explain.mjs
parses to summary/full/json. Cluster decisions now surface rejection reasons
(threshold, contradiction, window). Persisted to ~/.claude/adam/last-trace.txt.

Dead_end nudge proposal type (#2): single-session auto-apply gate (>=3
dead_end events). Action appends to active-nudges.json, surfaced via
adam-nudge.mjs at next SessionStart. Lower blast than skill_edit.

Per-(skill, fingerprint) cooldown (#4): adam-cooldown.mjs replaces coarse
per-skill check. proposal_fingerprint = djb2(skill_slug + cluster_id +
normalized_diff_body). Legacy applied/rejected records gate via 'legacy'
fingerprint fallback through resolveSkill helper (handles target_skill,
skill, or target: <path>).

task_completed scoring integration (#8): adam-score.mjs computes per-session
urgency dampener (3 task_completed -> 0.5) and reinforcement candidates
(skills cited in >=3 clean completions). New 'reinforcement' proposal type
appends to reinforcements.jsonl on apply (no code/memory mutation).

A/B effectiveness measurement (#5): every auto-applied edit appends to
ab-tracking.jsonl. adam-ab-measure.mjs computes 7d pre/post signal-count
delta per entry (improved / neutral / regressed / no_baseline / pending).
Analyst surfaces regressions at top of /reflect output.

Upgrade UX overhaul (#9): adam-upgrade.mjs implements --list/--diff/--accept
/--accept-all. SessionStart nudge prints pending-merge warning when
.adam-new files exist (latency ~20ms via fixed shortlist). install.sh
emits unmissable final-message hint after creating any .adam-new file.

Simplify pass: adam-utils.mjs deduplicates readJsonlSafe / listJsonlFiles /
parseFrontmatter across 8 scripts. Net -46 LOC.

Test coverage: 30 -> 87 tests. Every new feature has feature-validating
assertions (false-case coverage included). T77 statically verifies install.sh
references every adam-*.mjs source script (would have caught the missing
adam-utils inclusion that review #2 surfaced).
2026-05-13 01:02:33 +01:00
lukaszraczylo 6d8ff37cb2 v0.3.1: code review pass + DX overhaul
Bug fixes (HIGH):
- adam-observe.mjs: errorFingerprint no longer false-positives when
  toolResponse.is_error === false; ERROR_RE only used as fallback when
  is_error is undefined.
- adam-observe.mjs: resetSessionLocal now clears tool_window so retry_loop
  cannot fire on the first tool of a new session by matching prior session.
- adam-archive.mjs: ts dedup uses Map<ts, count> instead of Set<ts>; two
  journal entries sharing a millisecond are no longer both archived when
  only one is referenced in source_entries.
- adam-nudge.mjs: only counts proposal filenames matching
  /^\d{4}-\d{2}-\d{3}-/ pattern; README/notes in proposals/ no longer bump.
- skills/adam-self-improvement/SKILL.md: contradiction_flag veto now applied
  at apply time (carry-over from earlier review).

Test isolation:
- adam/tests/run-tests.sh: ALWAYS runs against an isolated $HOME under
  mktemp -d. Previously truncated live ~/.claude/adam/journal.jsonl on
  every run — destructive on production state.

Conciseness:
- agents/adam.md: -19 LOC (cuts: vestigial cursor sentence, duplicate
  not-do bullets, blast-radius bullet collapse, Inputs paths delegate to
  SKILL.md, win-cluster-vs-struggle-cluster commentary already enforced
  by cluster-key separation, # Overlap section spec compressed).
- skills/adam-self-improvement/SKILL.md: -4 LOC (framing paragraph, dead
  catch-all bullet for non-eligible types).

Auto-prune script DELETED:
- The cumulative-count primitive cannot distinguish "never used" from
  "used before tracking began"; mtime gate is meaningless for installed
  files. Auto-prune deferred to v0.4 with a per-key lastSeen schema.

Cross-platform:
- macOS (BSD coreutils) and Linux (Alpine, glibc + musl) verified.
- All scripts use portable forms (stat -f || stat -c, mktemp -d -t).
- README documents platform support explicitly.

DX overhaul:
- install.sh: hardened — supports `curl | bash` via auto-clone,
  --version=vX.Y.Z pinning, --yes / --dry-run flags, jq-based
  settings.json merge with diff prompt and backup, conservative file
  copy that detects local mtime drift and writes <file>.adam-new
  instead of clobbering, idempotent across re-runs.
- adam-uninstall.sh: NEW. Soft-archives ~/.claude/adam/ to .bak.<ts>/
  by default; --purge to delete; --yes for non-interactive; jq-based
  settings.json cleanup with diff prompt.
- README.md: curl one-liner install + version-pinned variant at top,
  What's New section through v0.3.1, upgrade-safe data files callout,
  uninstaller documentation, platform support note, expanded rubric
  showing skill_edit gate.

Test count: 27 passed, 0 failed (was 27 — no regression).
2026-05-10 21:33:17 +01:00
lukaszraczylo 7962e85578 v0.2.0: drop cursor, add source_entries lifecycle, mandate memory frontmatter
Lifecycle redesign:
- Each proposal records source_entries: [<ts>...] in frontmatter listing
  the journal timestamps that fed its cluster.
- After apply/reject, skill calls adam/scripts/adam-archive.mjs which moves
  matching entries from journal.jsonl to journal/actioned-<id>.jsonl.
- Agent reads applied/ + rejected/ frontmatter on each /reflect, builds an
  excluded-timestamps set, skips any leftover already-actioned entries.
- cursor field in state.json is vestigial; agent ignores it.

Effect: journal stays bounded by active observations. Rule changes
re-evaluate the remainder without manual rewind. Race-safer for parallel
sessions on shared state.json (no cursor write contention).

Memory drafting:
- agents/adam.md adds 'Memory drafting protocol' parallel to Skill drafting.
- Memory proposals MUST contain auto-memory frontmatter (name, description,
  type, originSessionId) in '# Proposed change' body.
- Skill enforces frontmatter check at apply time; refuses if missing.

Tests: 18 -> 21. Two new tests for adam-archive happy path + no-op.

Migration: existing applied proposals lack source_entries. Their backing
journal entries archived as a one-time bulk migration; legacy proposals
annotated with migration note.
2026-05-10 04:29:49 +01:00