## Issue Triage — 2026-05-02

### 1) Self-hosted auth: invalid bearer token silently retries until “backend timeout” (pairing UI never shown)
- **Issue Title & ID:** Invalid token → endless 401 retry loop → backend timeout (no pairing UI) — **(To file) app-core auth regression from PR #7212 review**
- **Current Status:** **Untracked (needs GitHub issue)**; risk introduced/observed in review notes for **PR elizaos/eliza#7212 (merged)**
- **Impact Assessment:**
  - **User Impact:** **High** (any self-hosted browser/mobile/desktop user with a stale token)
  - **Functional Impact:** **Partial** (blocks login/pairing recovery; user can’t reach chat reliably)
  - **Brand Impact:** **High** (appears as instability/timeouts instead of clear auth UX)
- **Technical Classification:**
  - **Issue Category:** Bug / UX
  - **Component Affected:** **API + app-core client startup** (`startup-phase-poll.ts`, pairing/auth flow)
  - **Complexity:** **Moderate effort**
- **Resource Requirements:**
  - **Required Expertise:** TypeScript, auth flows, client startup state machine, self-hosted runtime
  - **Dependencies:** None, but should be validated across **Capacitor + Electrobun + web**
  - **Estimated Effort (1–5):** **3**
- **Recommended Priority:** **P0**
- **Specific Actionable Next Steps:**
  1. File issue with reproduction: set invalid token in localStorage/boot config, load remote runtime, observe 401 retry until timeout.
  2. Update logic: when `auth.required === true` and `auth.authenticated === false`, **always** transition to pairing/auth-required state (even if `client.hasToken()` is true).
  3. Add tests covering: invalid token → pairing UI shown; valid token → no pairing; transient 401 → retry then recover.
  4. Add telemetry/logging: explicit “token rejected by server” vs “backend unreachable”.
- **Potential Assignees:** **NubsCarson** (auth/self-hosted work), **lalalune** (app-core tests), **odilitime** (core dev review)
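
The rule in step 2 can be sketched as a pure decision function. This is a minimal illustration, not the code in `startup-phase-poll.ts`: the `PollResult` shape, the phase names, and `MAX_ATTEMPTS` are all assumptions made for the example. The point is that once the server has answered, the result of `client.hasToken()` no longer influences the outcome.

```typescript
// Hypothetical shapes: the real auth status payload and startup phases may differ.
interface AuthStatus {
  required: boolean;
  authenticated: boolean;
}

type StartupPhase = "ready" | "pairing-required" | "retry" | "backend-unreachable";

interface PollResult {
  // null means the status request itself failed (network error, timeout).
  auth: AuthStatus | null;
  attempts: number;
}

const MAX_ATTEMPTS = 5; // assumed retry budget for an unreachable backend

function nextStartupPhase(result: PollResult): StartupPhase {
  if (result.auth === null) {
    // No answer from the server: this is the only branch that retries.
    return result.attempts >= MAX_ATTEMPTS ? "backend-unreachable" : "retry";
  }
  // The server answered. A rejected token is an auth problem, not a timeout,
  // so whether the client holds a token is deliberately not consulted here.
  if (result.auth.required && !result.auth.authenticated) {
    return "pairing-required";
  }
  return "ready";
}
```

Keeping "token rejected" and "backend unreachable" as separate phases also gives the logging in step 4 a natural place to hook in. The sketch does not model the transient-401 case from step 3, which would need its own retry budget.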

---

### 2) Secrets management rollout: verify vault integration doesn’t leak plaintext secrets + ensure safe migration posture
- **Issue Title & ID:** Vault integration security + migration hardening — **PR elizaos/eliza#7197 (merged)**
- **Current Status:** **Open (post-merge hardening needed)**; feature is live but needs security/ops follow-through
- **Impact Assessment:**
  - **User Impact:** **Critical** (any user saving API keys via Settings UI)
  - **Functional Impact:** **Partial** (feature works, but risk is confidentiality + operational safety)
  - **Brand Impact:** **High** (credential leakage would be severe)
- **Technical Classification:**
  - **Issue Category:** Security
  - **Component Affected:** **Core Framework / Settings UI / @elizaos/vault**
  - **Complexity:** **Complex solution** (threat model + cross-platform validation)
- **Resource Requirements:**
  - **Required Expertise:** App security, OS keychain/credential manager, secure storage, CI hardening
  - **Dependencies:** Coordination with plugin credential saving/reveal endpoints
  - **Estimated Effort (1–5):** **4**
- **Recommended Priority:** **P0**
- **Specific Actionable Next Steps:**
  1. Define and publish a **vault threat model**: assets, trust boundaries, attacker model (local malware, disk exfil, CI logs).
  2. Confirm **no secrets are written to**: logs, crash reports, `config.env`, UI state serialization, analytics.
  3. Add automated checks:
     - Grep-based CI guard preventing accidental logging of known secret keys.
     - Test: “save secret” → ensure `config.env.*` does not contain plaintext (or clearly document if mirror is still present).
  4. Decide migration policy: continue “write-through mirror” vs flip to “vault-only” (with explicit opt-out for headless).
  5. Add operator docs: headless servers (`MILADY_VAULT_PASSPHRASE` / equivalent), backups, rotation.
- **Potential Assignees:** **Dexploarer** (vault author), **odilitime** (core dev), **trace.g** (prod reliability/security), support from **lalalune** (tests)
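
The "save secret, then check the config file" test in step 3 could be built around a small scanner like the one below. It is a sketch under assumptions: `findPlaintextLeaks` is a hypothetical helper, and the caller is expected to read the file contents and supply the secrets that were just saved. It does not use any `@elizaos/vault` API.

```typescript
// Hypothetical helper: returns the names of secrets whose plaintext value
// appears anywhere in the given file contents.
function findPlaintextLeaks(
  fileContents: string,
  secrets: Record<string, string>,
): string[] {
  const leaked: string[] = [];
  for (const [name, value] of Object.entries(secrets)) {
    // Skip empty values: an empty string is contained in every file.
    if (value.length > 0 && fileContents.includes(value)) {
      leaked.push(name);
    }
  }
  return leaked;
}
```

A test would save a known dummy secret through the Settings path, read back each `config.env.*` file, and fail if the returned list is non-empty. The same function can be pointed at captured log output to cover step 2.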

---

### 3) Schema drift guardrail: prevent abstract schema vs drizzle schema divergence (missing tables break fresh installs)
- **Issue Title & ID:** Add schema parity checks for plugin-sql migrator to prevent missing tables — **Follow-up to elizaos/eliza#7222 (closed)**
- **Current Status:** **Closed incident, open prevention work**
- **Impact Assessment:**
  - **User Impact:** **High** (fresh clones/self-hosted installs are common)
  - **Functional Impact:** **Yes** (missing tables caused chat/memory composition failures)
  - **Brand Impact:** **High** (fresh install “broken” is reputationally costly)
- **Technical Classification:**
  - **Issue Category:** Bug / Reliability
  - **Component Affected:** **Plugin System (plugin-sql) + Runtime migrator**
  - **Complexity:** **Moderate effort**
- **Resource Requirements:**
  - **Required Expertise:** Drizzle, migrations, test harnesses, TypeScript schemas
  - **Dependencies:** None (internal validation tooling)
  - **Estimated Effort (1–5):** **3**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. Implement a startup-time assertion in plugin-sql: every exported “abstract schema table name” must exist in drizzle pgTables used by runtime-migrator.
  2. Add CI test: spin up fresh PGLite, run migrator, validate required tables exist (at least `entity_identities`, `entity_merge_candidates`, `fact_candidates`, and any other tables that critical providers depend on).
  3. Add developer ergonomics: fail loud with actionable message (“Add pgTable definition at …/plugin-sql/typescript/schema/”).
- **Potential Assignees:** **Sw4pIO** (reported root cause), **odilitime** (core dev), **lalalune** (tests)
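
A parity assertion along the lines of steps 1 and 3 might look like the following. It is a sketch only: `findMissingTables` and `assertSchemaParity` are hypothetical names, and how the two name lists are collected from plugin-sql is left to the caller.

```typescript
// Hypothetical helpers: neither list is read from plugin-sql here. Callers
// would pass in the abstract schema table names and the pgTable names.
function findMissingTables(
  abstractTableNames: readonly string[],
  drizzleTableNames: readonly string[],
): string[] {
  const defined = new Set(drizzleTableNames);
  return abstractTableNames.filter((name) => !defined.has(name));
}

function assertSchemaParity(
  abstractTableNames: readonly string[],
  drizzleTableNames: readonly string[],
): void {
  const missing = findMissingTables(abstractTableNames, drizzleTableNames);
  if (missing.length > 0) {
    // Fail loudly and name every missing table in one message.
    throw new Error(
      `Schema drift: no pgTable definition for ${missing.join(", ")}. ` +
        "Add the missing pgTable definitions to the plugin-sql schema.",
    );
  }
}
```

The same pair of functions can serve both the startup-time assertion and the CI check, so the two cannot disagree about what counts as drift.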

---

### 4) Long-lived agent “memory rot” after ~3 months: integrate reconciliation/freshness gates into core memory architecture
- **Issue Title & ID:** Memory rot mitigation for long-lived agents (freshness gates + periodic reconciliation) — **(To file) from Discord: sentient_dawn field report**
- **Current Status:** **Partially implemented externally (reported working in production), not upstreamed**
- **Impact Assessment:**
  - **User Impact:** **Medium → High** (affects serious/always-on deployments)
  - **Functional Impact:** **Partial** (core capability degrades over time; correctness failure)
  - **Brand Impact:** **High** (agents “drift” and contradict reality; trust erosion)
- **Technical Classification:**
  - **Issue Category:** Bug / Reliability (emergent behavior), possibly Feature (maintenance pipeline)
  - **Component Affected:** **Core Framework (memory providers, RAG/vector retrieval, evaluators)**
  - **Complexity:** **Architectural change** (introduces maintenance lifecycle + claim validation)
- **Resource Requirements:**
  - **Required Expertise:** Retrieval systems, embeddings/versioning, prompt/state composition, eval pipelines
  - **Dependencies:** Definition of “freshness” signals; storage support for timestamps/provenance/ontology versions
  - **Estimated Effort (1–5):** **5**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. File upstream design issue summarizing failure mode + proposed approach (reconciliation pass, cross-source diffs, re-embedding under current ontology).
  2. Define minimal viable implementation:
     - Add **freshness metadata** to stored facts/claims.
     - Add **outgoing-claim freshness gate** (block or qualify stale assertions).
     - Add scheduled/triggered **reconciliation job** (diff + re-embed).
  3. Add evaluation harness: inject “stale facts” and verify the agent flags its uncertainty or refreshes its sources.
  4. Document operational guidance for always-on agents (cadence, costs, storage growth).
- **Potential Assignees:** **sentient_dawn** (solution author), **trace.g** (prod stability), **odilitime** (core), **shawmakesmagic** (architecture direction)
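
To make the "outgoing-claim freshness gate" in step 2 concrete, here is one possible shape for it. Everything in this sketch is an assumption: the `Claim` fields, the 30-day and 90-day thresholds, and the three-way decision are illustrative and are not taken from the field report.

```typescript
// Hypothetical claim shape: the real memory records may carry different fields.
interface Claim {
  text: string;
  verifiedAt: number; // epoch milliseconds of the last verification
}

type GateDecision =
  | { action: "assert"; text: string }
  | { action: "qualify"; text: string }
  | { action: "block" };

const DAY_MS = 24 * 60 * 60 * 1000;

function freshnessGate(
  claim: Claim,
  now: number,
  qualifyAfterDays = 30, // assumed thresholds, to be tuned per deployment
  blockAfterDays = 90,
): GateDecision {
  const ageDays = (now - claim.verifiedAt) / DAY_MS;
  if (ageDays >= blockAfterDays) {
    return { action: "block" };
  }
  if (ageDays >= qualifyAfterDays) {
    // Keep the claim but attach its age so the reader can weigh it.
    const days = Math.floor(ageDays);
    return { action: "qualify", text: `${claim.text} (last verified ${days} days ago)` };
  }
  return { action: "assert", text: claim.text };
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, is what lets an evaluation harness inject stale facts without waiting for real time to pass.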

---

### 5) Anthropic subscription OAuth path: ensure stealth preload failures are loud + covered by tests (post-fix hardening)
- **Issue Title & ID:** dev-ui preload stealth file mismatch breaks Anthropic subscription OAuth — **elizaos/eliza#7210 (CLOSED)**
- **Current Status:** **Closed**, but needs **regression-proofing + UX improvement**
- **Impact Assessment:**
  - **User Impact:** **Medium** (subset: Anthropic subscription users)
  - **Functional Impact:** **Yes (for that path)** (401 on every request)
  - **Brand Impact:** **Medium** (appears like “subscriptions don’t work”)
- **Technical Classification:**
  - **Issue Category:** Bug / UX
  - **Component Affected:** **Model Integration + app-core dev tooling**
  - **Complexity:** **Simple fix** (guardrails), assuming functional fix already landed
- **Resource Requirements:**
  - **Required Expertise:** Build tooling (Bun preload), Anthropic auth modes, dev UX
  - **Dependencies:** None
  - **Estimated Effort (1–5):** **2**
- **Recommended Priority:** **P2**
- **Specific Actionable Next Steps:**
  1. Add explicit warning/error: if `stealth.claude` enabled but preload artifact missing, print actionable instructions.
  2. Add a smoke test: with the subscription OAuth env set, assert the interceptor is installed (or that a warning is emitted).
  3. Update docs: “Anthropic subscription OAuth requirements” and troubleshooting.
- **Potential Assignees:** **Sw4pIO** (original report), **NubsCarson** (app-core), **lalalune** (tests/docs)
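
The loud-failure check in step 1 could be as small as the function below. This is a hedged sketch: the input shape, the message wording, and the idea of injecting `fileExists` are assumptions for illustration, and the real flag lookup and artifact path would come from the dev tooling.

```typescript
// Hypothetical check: the input shape, artifact path, and message are placeholders.
interface PreloadCheckInput {
  stealthEnabled: boolean;
  preloadPath: string;
  fileExists: (path: string) => boolean; // injected so the check is testable
}

// Returns a warning message when stealth is enabled but the artifact is
// missing, or null when there is nothing to report.
function checkStealthPreload(input: PreloadCheckInput): string | null {
  if (!input.stealthEnabled) return null;
  if (input.fileExists(input.preloadPath)) return null;
  return (
    `Stealth mode is enabled but the preload artifact was not found at ` +
    `"${input.preloadPath}". Requests on the subscription OAuth path are ` +
    "likely to fail with 401 until the artifact is built and preloaded."
  );
}
```

Because the file lookup is injected, the smoke test in step 2 can assert on the returned warning without touching the filesystem.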

---

### 6) Self-hosted CORS/auth hardening: tighten allowed origins and address token carry-over edge cases
- **Issue Title & ID:** Self-hosted connectivity edge cases (CORS scope + stale token carry-over) — **(To file) follow-ups from PR elizaos/eliza#7212 review**
- **Current Status:** **Untracked**, observed in review notes
- **Impact Assessment:**
  - **User Impact:** **Medium**
  - **Functional Impact:** **Partial** (edge-case connectivity failures; potential security looseness)
  - **Brand Impact:** **Medium**
- **Technical Classification:**
  - **Issue Category:** Security / UX
  - **Component Affected:** **API (CORS) + client token persistence**
  - **Complexity:** **Moderate effort**
- **Resource Requirements:**
  - **Required Expertise:** Web security (CORS), token storage, multi-runtime clients (Capacitor/Electrobun/web)
  - **Dependencies:** Should be addressed alongside item 1 (invalid-token retry loop) to avoid conflicting behavior
  - **Estimated Effort (1–5):** **3**
- **Recommended Priority:** **P2**
- **Specific Actionable Next Steps:**
  1. Revisit the unconditional allowance of `https://localhost`: restrict it to an explicit env allowlist or to dev mode only.
  2. Validate token lifecycle when switching Remote ↔ Local runtime: ensure stale remote bearer isn’t forwarded to local by default.
  3. Add integration tests: origin allowlist parsing + token clear/reset semantics across runtime gate transitions.
- **Potential Assignees:** **NubsCarson**, **odilitime**, **lalalune**
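
For step 1, one way to scope `https://localhost` is to parse an explicit allowlist and only fall back to localhost in dev mode. The sketch below assumes a comma-separated allowlist string and a boolean dev-mode flag. Both are hypothetical, as are the function names. It relies only on the standard `URL` class.

```typescript
// Hypothetical helpers: the allowlist format and dev-mode flag are assumptions.
function parseAllowedOrigins(raw: string | undefined): Set<string> {
  const origins = new Set<string>();
  for (const entry of (raw ?? "").split(",")) {
    const trimmed = entry.trim();
    if (trimmed === "") continue;
    try {
      // URL.origin lowercases the host and drops any path or trailing slash.
      origins.add(new URL(trimmed).origin);
    } catch {
      // Ignore malformed entries rather than widening the allowlist.
    }
  }
  return origins;
}

function isOriginAllowed(
  origin: string,
  allowlist: Set<string>,
  devMode: boolean,
): boolean {
  if (allowlist.has(origin)) return true;
  // https://localhost is accepted implicitly only in dev mode.
  return devMode && origin === "https://localhost";
}
```

Operators who need `https://localhost` in production can still opt in by listing it explicitly, which keeps the permissive case visible in configuration.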

---

### 7) Agent key security + local LLM data storage: define baseline security posture and testing methodology
- **Issue Title & ID:** Local data storage + agent key security + red-team swarm testing — **(To file) from Discord 2026-04-30 action items**
- **Current Status:** **Untracked (discussion-stage)**
- **Impact Assessment:**
  - **User Impact:** **High** (applies to most serious deployments)
  - **Functional Impact:** **Partial** (not a single blocker, but foundational to safe usage)
  - **Brand Impact:** **High** (security posture is a major adoption factor)
- **Technical Classification:**
  - **Issue Category:** Security / Architecture
  - **Component Affected:** **Core Framework, Runtime Ops, Storage**
  - **Complexity:** **Architectural change**
- **Resource Requirements:**
  - **Required Expertise:** Security engineering, secrets handling, sandboxing, storage encryption, threat modeling
  - **Dependencies:** Align with vault rollout (item 2); define what “local LLM data” includes (prompts, memory DB, embeddings)
  - **Estimated Effort (1–5):** **5**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. Establish security baseline: key storage, rotation, least privilege for plugins, audit logging defaults.
  2. Define “local data at rest” policy: encryption requirements, redaction, retention.
  3. Create red-team swarm test plan: scripted malicious plugins, prompt injection suites, exfil tests, SSRF checks for tool calls.
  4. Add a security checklist to PR template for auth/storage/tooling changes.
- **Potential Assignees:** **shawmakesmagic** (direction), **odilitime** (core), **Dexploarer** (vault), **trace.g** (prod/security), plus new contributors with security background

---

### 8) Documentation deliverable: publish the full “memory rot” field report + recommended operational runbook
- **Issue Title & ID:** Publish memory rot field report + runbook — **(To file) requested by mayoe76**
- **Current Status:** **Open (documentation request)**
- **Impact Assessment:**
  - **User Impact:** **Medium** (helps operators of long-running agents avoid silent failures)
  - **Functional Impact:** **No** (documentation), but improves operational outcomes
  - **Brand Impact:** **Medium** (signals engineering maturity)
- **Technical Classification:**
  - **Issue Category:** Documentation
  - **Component Affected:** **Docs / Knowledge base**
  - **Complexity:** **Simple fix**
- **Resource Requirements:**
  - **Required Expertise:** Clear technical writing, memory system knowledge
  - **Dependencies:** None; can land independently of code changes
  - **Estimated Effort (1–5):** **2**
- **Recommended Priority:** **P3**
- **Specific Actionable Next Steps:**
  1. Convert field report into: symptoms, root cause, detection signals, mitigation architecture, cadence guidance.
  2. Add a “Long-lived agents” docs page (or RFC) and link from memory/RAG docs.
- **Potential Assignees:** **sentient_dawn**, **trace.g** (editor/reviewer), **odilitime** (merge)

---

## Highest-Priority Focus (next 24–72 hours)
1. **P0:** Invalid token handling regression (pairing UI not shown; backend timeout) — *(Issue to file; from PR #7212 review)*
2. **P0:** Vault rollout security hardening (prevent plaintext leakage; define migration posture) — *(post-merge work for PR #7197)*
3. **P1:** Schema parity guardrails for plugin-sql migrator (prevent missing-table fresh install failures) — *(follow-up to #7222)*
4. **P1:** Memory rot mitigation upstream plan (freshness gates + reconciliation lifecycle) — *(Discord field report)*
5. **P1:** Baseline agent key security + local storage policy + red-team methodology — *(Discord action items)*
6. **P2:** Self-hosted CORS tightening + token carry-over edge cases — *(PR #7212 follow-ups)*
7. **P2:** Anthropic subscription OAuth path regression-proofing + loud failures — *(follow-up to #7210)*
8. **P3:** Publish memory rot field report + operator runbook — *(docs request)*

---

## Patterns / Themes Indicating Deeper Architectural Risk
- **State-machine and auth complexity across multiple runtimes (web/Capacitor/Electrobun):** Small logic changes can create “silent failure” UX (timeouts) instead of explicit recovery paths.
- **Multiple sources of truth (schemas, config/secrets):** Drift between abstract schema definitions and migrator inputs caused runtime breakage; similarly, secrets now exist across vault + legacy paths, increasing the chance of accidental plaintext persistence.
- **Long-lived reliability gaps:** The “memory rot” report highlights a class of failures that don’t show up in short-lived testing, implying the need for lifecycle maintenance primitives in core.

---

## Process Recommendations (to prevent repeats)
1. **Add “fresh install” CI smoke tests**: boot with empty PGLite, run migrations, execute one chat turn, assert no provider/evaluator errors.
2. **Codify auth startup invariants**: tests for invalid/expired tokens must always yield deterministic UI states (pairing/auth-required), not timeouts.
3. **Schema parity tooling**: enforce that any new abstract schema must have a migrator table definition (CI check + runtime assertion).
4. **Security gates for secrets changes**: require a mini threat model + “no plaintext in logs/files” validation for any PR touching config, auth, or secrets.
5. **Longevity test harness**: introduce “time-skew / staleness simulation” tests for memory and retrieval pipelines, plus an operator runbook requirement for long-running deployments.