## Issue Triage — 2026-05-11

### 1) Cloud app-scoped chat returns 500 for auth failures (API-key callers)  
- **Issue Title & ID:** `cloud: /api/v1/apps/:appId/chat masks AuthenticationError/ForbiddenError as 500 (should be 401/403)` — **TBD (create issue; surfaced in PR #7376 review)**  
- **Current Status:** **Untracked bug** (code shipped/merged; needs verification on `develop` and in production)  
- **Impact Assessment:**
  - **User Impact:** **High** (all API-key integrators + generated apps using API keys)
  - **Functional Impact:** **Partial** (endpoint works, but incorrect error semantics break client retry logic and hide the root cause)
  - **Brand Impact:** **High** (looks like unstable Cloud API; confusing to paying customers)
- **Technical Classification:**
  - **Category:** Bug / UX (API correctness)
  - **Component:** Cloud API (auth middleware + app chat route)
  - **Complexity:** **Moderate effort** (route-level error handling + tests)
- **Resource Requirements:**
  - **Required Expertise:** Cloud API auth patterns (Hono/Workers), error taxonomy, integration tests
  - **Dependencies:** None, but should align with existing global middleware behavior for API-key requests
  - **Estimated Effort (1-5):** **3**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. Refactor `/api/v1/apps/:appId/chat` to avoid throwing auth errors inside `Promise.all`, or explicitly catch `AuthenticationError` / `ForbiddenError` and map them to 401 / 403.
  2. Add unit/integration test cases for: invalid API key, expired API key, cross-org API key, missing cookie auth.
  3. Confirm whether clients (SDK / generated apps) rely on 401/403 to choose between retry and fail-fast; document the response codes.
- **Potential Assignees:** **standujar** (Cloud auth), **NubsCarson** (Cloud apps/domains author), **0xSolace** (cloud stabilization)
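
The explicit mapping in step 1 could look roughly like the sketch below. It is an illustration only: the two error classes are local stand-ins for the Cloud API's own `AuthenticationError` / `ForbiddenError` (their real shape is not shown in the review), the `kind` field is an assumed discriminant, and `toErrorResponse` / `runWithAuthMapping` are hypothetical helper names.

```typescript
// Local stand-ins for the Cloud API's error classes (assumed shape).
// The `kind` field is an assumed discriminant, so the check below does not
// depend on prototype chains.
class AuthenticationError extends Error {
  readonly kind = "authentication";
}
class ForbiddenError extends Error {
  readonly kind = "forbidden";
}

interface ErrorResponse {
  status: 401 | 403 | 500;
  body: { error: string };
}

// Read the discriminant from an unknown thrown value, if it has one.
function errorKind(err: unknown): string | undefined {
  if (typeof err === "object" && err !== null && "kind" in err) {
    const kind = (err as { kind: unknown }).kind;
    return typeof kind === "string" ? kind : undefined;
  }
  return undefined;
}

// Hypothetical helper: known auth errors become 401/403, anything else 500.
function toErrorResponse(err: unknown): ErrorResponse {
  switch (errorKind(err)) {
    case "authentication":
      return { status: 401, body: { error: "unauthorized" } };
    case "forbidden":
      return { status: 403, body: { error: "forbidden" } };
    default:
      // Unexpected failures stay 500 and do not leak internal details.
      return { status: 500, body: { error: "internal_error" } };
  }
}

// Hypothetical wrapper: run route work (including any Promise.all) inside
// try/catch so a rejected auth check is mapped instead of masked.
async function runWithAuthMapping<T>(
  work: () => Promise<T>,
): Promise<{ ok: true; value: T } | { ok: false; response: ErrorResponse }> {
  try {
    return { ok: true, value: await work() };
  } catch (err) {
    return { ok: false, response: toErrorResponse(err) };
  }
}
```

The route handler would return `response.status` and `response.body` when `ok` is false. Whether this belongs at route level or in the global middleware is the alignment question noted under Dependencies.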

---

### 2) Cloud managed domain sync never sets `verified=true` after Cloudflare provisioning  
- **Issue Title & ID:** `cloud: domain /sync can set status=active but leaves verified=false (breaks CORS origin allowlist)` — **TBD (create issue; surfaced in PR #7376 review)**  
- **Current Status:** **Untracked bug** (likely present post-merge; requires validation)  
- **Impact Assessment:**
  - **User Impact:** **High** (any monetized/custom-domain app attempting to go live)
  - **Functional Impact:** **Yes** (CORS origin list remains empty → app chat/web flows fail cross-origin)
  - **Brand Impact:** **High** (custom domain feature appears “broken” after purchase)
- **Technical Classification:**
  - **Category:** Bug
  - **Component:** Cloud API (managed domains, CORS/origin verification)
  - **Complexity:** **Simple fix** (set verified flag on successful zone/provisioning confirmation) + tests
- **Resource Requirements:**
  - **Required Expertise:** Cloudflare domain lifecycle, DB state transitions, CORS origin list computation
  - **Dependencies:** Requires clear definition of “verified” vs “active” (what external signals qualify)
  - **Estimated Effort (1-5):** **2**
- **Recommended Priority:** **P0** (blocks core monetized-domain functionality and can strand paying users)
- **Specific Actionable Next Steps:**
  1. Patch sync route to set `verified: true` when Cloudflare status indicates provisioned/active and DNS/zone checks pass.
  2. Add regression tests: purchase pending → sync → verified becomes true; verified origins list includes domain.
  3. Run a staging smoke test for a real domain lifecycle (buy/check/status/sync/verify) and confirm CORS works.
- **Potential Assignees:** **NubsCarson**, **standujar**
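
As a sketch of the state transition in step 1, the sync logic can be expressed as a pure function that is easy to regression-test. Everything here is hypothetical: the `ManagedDomain` and `SyncSignal` shapes, the status names, and the rule that `verified` is only ever raised during sync are assumptions pending the "verified" vs "active" definition listed under Dependencies.

```typescript
type DomainStatus = "pending" | "active" | "failed";

// Assumed shape of a stored managed-domain row.
interface ManagedDomain {
  hostname: string;
  status: DomainStatus;
  verified: boolean;
}

// Hypothetical summary of what the sync route learned from Cloudflare.
interface SyncSignal {
  providerStatus: DomainStatus;
  dnsCheckPassed: boolean;
}

// Pure transition: `verified` becomes true only when the provider reports
// "active" AND the DNS/zone check passed. This sketch never lowers it.
function applySync(domain: ManagedDomain, signal: SyncSignal): ManagedDomain {
  const nowVerified =
    signal.providerStatus === "active" && signal.dnsCheckPassed;
  return {
    hostname: domain.hostname,
    status: signal.providerStatus,
    verified: domain.verified || nowVerified,
  };
}

// Only verified domains contribute to the CORS origin allowlist.
function verifiedOrigins(domains: ManagedDomain[]): string[] {
  return domains
    .filter((d) => d.verified)
    .map((d) => "https://" + d.hostname);
}
```

With this shape, the regression test in step 2 reduces to calling `applySync` on a pending domain and asserting that `verifiedOrigins` then includes it.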

---

### 3) Cloud chat credit reconciliation failure modes (free inference / overcharge / lost response)  
- **Issue Title & ID:** `cloud: app chat streaming/non-streaming reconciliation can refund after content delivered or charge without returning response` — **TBD (create issue; surfaced in PR #7376 review)**  
- **Current Status:** **Untracked bug / suspected billing vulnerability** (needs confirmation + containment)  
- **Impact Assessment:**
  - **User Impact:** **Critical** (affects billing correctness for all monetized app chat)
  - **Functional Impact:** **Yes** (can return 500 after successful provider response; or mis-bill)
  - **Brand Impact:** **Critical** (billing trust + potential abuse)
- **Technical Classification:**
  - **Category:** Bug / Security (economic abuse) / Reliability
  - **Component:** Cloud API billing + provider proxy + streaming pipeline
  - **Complexity:** **Complex solution** (needs robust transaction semantics and idempotency)
- **Resource Requirements:**
  - **Required Expertise:** Streaming semantics, billing ledgers, idempotent reconciliation design, observability
  - **Dependencies:** Clarify desired invariants:
    - If content delivered → never full-refund silently
    - If provider succeeded → user should receive response even if reconciliation fails (or receive deterministic refund + audit)
  - **Estimated Effort (1-5):** **5**
- **Recommended Priority:** **P0**
- **Specific Actionable Next Steps:**
  1. Implement a reconciliation state machine with explicit phases (reserve → deliver → finalize) and immutable audit logging.
  2. Streaming path: once any bytes have been sent, a reconciliation failure must not trigger an “actual cost 0” refund; instead mark the reservation `finalize_failed` for async retry or safe partial handling.
  3. Non-streaming path: if a provider response was obtained, return it even if reconciliation fails; attempt a refund (or enqueue a finalize job) and emit alerts.
  4. Add integration tests with forced DB failures at each phase (reserve ok/finalize fail; provider fail; finalize retry).
  5. Add metrics + alerts: reconcile failure rate, refund-after-delivery count, charge-with-error count.
- **Potential Assignees:** **NubsCarson** (route author), **standujar** (Cloud reliability/auth), plus someone familiar with billing internals (suggest: **0xSolace**)
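
The reserve → deliver → finalize phases in step 1 can be sketched as a small state machine. This is a minimal in-memory illustration under stated assumptions, not the Cloud billing implementation: the `Reservation` shape, the phase names, and the `persistCharge` callback (a stand-in for the real ledger write) are all hypothetical.

```typescript
type Phase =
  | "reserved"
  | "delivered"
  | "finalized"
  | "finalize_failed"
  | "refunded";

interface Reservation {
  id: string;
  reservedCredits: number;
  phase: Phase;
  // Append-only trail of phases; a real system would persist this.
  audit: readonly Phase[];
}

function reserve(id: string, reservedCredits: number): Reservation {
  return { id, reservedCredits, phase: "reserved", audit: ["reserved"] };
}

// Record every transition in the audit trail.
function advance(r: Reservation, next: Phase): Reservation {
  return {
    id: r.id,
    reservedCredits: r.reservedCredits,
    phase: next,
    audit: r.audit.concat(next),
  };
}

// Call as soon as any response content has reached the client.
function markDelivered(r: Reservation): Reservation {
  if (r.phase !== "reserved") {
    throw new Error("cannot deliver from phase " + r.phase);
  }
  return advance(r, "delivered");
}

// Settle the reservation. `persistCharge` stands in for the ledger write
// and may throw (e.g. a DB failure).
function settle(
  r: Reservation,
  actualCost: number,
  persistCharge: (id: string, cost: number) => void,
): Reservation {
  if (r.phase === "reserved") {
    // Nothing was delivered, so a full refund is safe.
    return advance(r, "refunded");
  }
  if (r.phase !== "delivered") {
    throw new Error("cannot settle from phase " + r.phase);
  }
  try {
    persistCharge(r.id, actualCost);
    return advance(r, "finalized");
  } catch {
    // Content was delivered: never refund silently; park for async retry.
    return advance(r, "finalize_failed");
  }
}
```

The property the tests in step 4 would assert is that a reservation which has passed through `delivered` can never reach `refunded` because of a finalize error.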

---

### 4) Container control-plane internal auth can become a no-op if token env var missing  
- **Issue Title & ID:** `cloud: requireInternalToken bypass when CONTAINER_CONTROL_PLANE_TOKEN is unset` — **TBD (create issue; surfaced in PR #7376 review)**  
- **Current Status:** **Untracked security risk** (configuration-dependent)  
- **Impact Assessment:**
  - **User Impact:** **Medium → High** (depends on exposure; could be critical if endpoint is reachable)
  - **Functional Impact:** **Partial** (system may “work” but without intended security boundary)
  - **Brand Impact:** **High** (infra security posture)
- **Technical Classification:**
  - **Category:** Security
  - **Component:** Cloud services (container-control-plane service + API forwarder)
  - **Complexity:** **Moderate effort** (fail-closed auth + deployment guardrails)
- **Resource Requirements:**
  - **Required Expertise:** Service-to-service auth, deployment config, threat modeling
  - **Dependencies:** Deployment environment expectations (is the control-plane network-private always?)
  - **Estimated Effort (1-5):** **3**
- **Recommended Priority:** **P0** if reachable externally; otherwise **P1** with immediate hardening
- **Specific Actionable Next Steps:**
  1. Change `requireInternalToken` to **fail closed** (error if token is unset) in non-dev environments.
  2. Add startup validation: refuse to boot control-plane without required secrets in staging/prod.
  3. Add tests for missing token behavior; update deployment docs and `.env.example`.
  4. Verify network exposure (ingress rules) and rotate any shared secrets if uncertainty exists.
- **Potential Assignees:** **standujar**, **NubsCarson**, (security-minded reviewer) **Dexploarer**
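
A fail-closed version of the check in step 1 might look like the sketch below. It assumes the environment is exposed as a plain key/value object and that `NODE_ENV === "development"` is how dev is detected; both are assumptions, as are the `checkInternalToken` name and the choice of 503 for a missing secret. A production version would also compare tokens in constant time.

```typescript
type Env = Record<string, string | undefined>;

type TokenCheck =
  | { ok: true }
  | { ok: false; status: 401 | 503; reason: string };

// Hypothetical replacement for the decision logic in requireInternalToken.
function checkInternalToken(
  env: Env,
  presented: string | undefined,
): TokenCheck {
  const expected = env["CONTAINER_CONTROL_PLANE_TOKEN"];
  const isDev = env["NODE_ENV"] === "development"; // assumed dev signal

  if (!expected) {
    // A missing secret is a misconfiguration, not "auth disabled".
    if (isDev) return { ok: true };
    return {
      ok: false,
      status: 503,
      reason: "internal token not configured",
    };
  }

  // Plain comparison for brevity; use a constant-time compare in practice.
  if (!presented || presented !== expected) {
    return { ok: false, status: 401, reason: "invalid internal token" };
  }
  return { ok: true };
}
```

The middleware would reject the request whenever `ok` is false, so an unset token in staging or production blocks traffic instead of silently allowing it.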

---

### 5) Slack connector can silently drop inbound messages if `users.info` fails  
- **Issue Title & ID:** `plugin-slack: missing try/catch around getUser() causes message loss on Slack API errors` — **TBD (create issue; surfaced in PR #7375 review)**  
- **Current Status:** **Untracked bug** (plugin migrated + merged; needs patch ASAP)  
- **Impact Assessment:**
  - **User Impact:** **High** (Slack is a major connector; rate limits and transient failures are common)
  - **Functional Impact:** **Yes** (drops messages; no memory written; no agent reply)
  - **Brand Impact:** **High** (connector appears unreliable; “agent ignores messages”)
- **Technical Classification:**
  - **Category:** Bug / Reliability
  - **Component:** Plugin System → `@elizaos/plugin-slack` service event handlers
  - **Complexity:** **Simple fix** (guard call + fallback behavior) + regression test
- **Resource Requirements:**
  - **Required Expertise:** Slack Bolt event handling, error handling patterns, connector integration tests
  - **Dependencies:** Decide fallback identity behavior when user lookup fails (use userId only, mark unknown handle, retry async, etc.)
  - **Estimated Effort (1-5):** **2**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. Wrap `getUser()` calls in `handleMessage` and `handleAppMention` with try/catch.
  2. On failure: continue processing the message with a minimal identity (userId only) and log a structured warning (include the error code, and retry-after if present).
  3. Add test that simulates Slack API failure and asserts message is still stored + emitted to runtime.
  4. Audit other Slack API calls on critical ingress path for similar unguarded throws.
- **Potential Assignees:** **2-A-M** (connectors/monorepo integration), **0xSolace** (stability), **standujar** (reliability review)
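
Steps 1 and 2 could be sketched as a guarded lookup that always returns an identity. The `GetUser` signature and the `SlackIdentity` shape are assumptions made for illustration; the plugin's actual `getUser()` return type is not shown in the review, and `resolveIdentity` is a hypothetical helper name.

```typescript
// Assumed minimal result of the plugin's user lookup.
type GetUser = (userId: string) => Promise<{ name: string }>;

interface SlackIdentity {
  userId: string;
  displayName: string | null; // null when the lookup failed
  lookupFailed: boolean;
}

type Warn = (message: string, meta: Record<string, unknown>) => void;

// Hypothetical helper: never throws, so the message handler can keep going.
async function resolveIdentity(
  getUser: GetUser,
  userId: string,
  warn: Warn,
): Promise<SlackIdentity> {
  try {
    const user = await getUser(userId);
    return { userId, displayName: user.name, lookupFailed: false };
  } catch (err) {
    // Log and fall back to the bare userId instead of dropping the message.
    warn("slack user lookup failed; continuing with userId only", {
      userId,
      error: err instanceof Error ? err.message : String(err),
    });
    return { userId, displayName: null, lookupFailed: true };
  }
}
```

`handleMessage` and `handleAppMention` would call this in place of a bare `getUser()` and proceed to store and emit the message regardless of `lookupFailed`.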

---

### 6) Discord report of possible compromise / scam warnings (incident triage + comms)  
- **Issue Title & ID:** `Security: Discord users flagged possible compromise + scam message (needs incident triage)` — **TBD (create security tracking issue)**  
- **Current Status:** **Untriaged security signal** (no details captured; no follow-up in public thread)  
- **Impact Assessment:**
  - **User Impact:** **Medium → Critical** (unknown scope; could be social engineering only or real token compromise)
  - **Functional Impact:** **No** (unless infra or releases are affected)
  - **Brand Impact:** **High** (trust and safety)
- **Technical Classification:**
  - **Category:** Security / Community Ops
  - **Component:** Project operations (Discord, GitHub org, package publishing)
  - **Complexity:** **Moderate effort** (investigation + possible rotations)
- **Resource Requirements:**
  - **Required Expertise:** Incident response, access audit, Discord moderation, GitHub org security
  - **Dependencies:** Need specific indicators: what was “compromised”, where, by whom, and evidence
  - **Estimated Effort (1-5):** **3**
- **Recommended Priority:** **P1** (immediate triage; escalate to P0 if any credible indicator appears)
- **Specific Actionable Next Steps:**
  1. Contact reporting users (**gokumaster64**, **dieantwoord1337**) for specifics: links, screenshots, accounts, timestamps.
  2. Audit: recent GitHub org security logs, npm publish events, token rotations, unusual maintainer access changes.
  3. Post a short security advisory in Discord: how to verify official links, where to report, and remind never to share keys.
  4. If any credible evidence: rotate relevant secrets (CI, publish tokens), invalidate sessions, and document incident timeline.
- **Potential Assignees:** **odilitime** (Community Ops/Moderator), **dankvr** (Core Dev/Moderator), **0xSolace** (infra hygiene)

---

### 7) Documentation gap: real-world operational costs for “Eliza as Twitter bot”  
- **Issue Title & ID:** `Docs: clarify cost drivers and example configs for X/Twitter bot operation` — **TBD (create docs issue)**  
- **Current Status:** **Not tracked** (user asked; partial answer in Discord)  
- **Impact Assessment:**
  - **User Impact:** **Medium** (onboarding friction for a common “bot” use case)
  - **Functional Impact:** **No**
  - **Brand Impact:** **Medium** (reduces “unknown cost” anxiety; improves adoption)
- **Technical Classification:**
  - **Category:** Documentation
  - **Component:** Deployment guides / X connector docs / cost estimation
  - **Complexity:** **Simple fix**
- **Resource Requirements:**
  - **Required Expertise:** Knowledge of provider pricing, rate limits, configuration knobs (reply volume)
  - **Dependencies:** None
  - **Estimated Effort (1-5):** **1**
- **Recommended Priority:** **P3**
- **Specific Actionable Next Steps:**
  1. Add a short doc page: example monthly budgets ($10/mo low-volume vs higher-volume), what settings drive cost (reply count, model choice, embeddings, logging).
  2. Provide a “cost safe mode” config preset (rate limiting, max replies/day).
- **Potential Assignees:** **odilitime** (historical cost context), **2-A-M** (docs in monorepo), **NubsCarson** (Cloud cost framing)

---

## Top priority summary (address immediately: next 24–72 hours)
1. **P0:** Cloud domain sync not setting `verified=true` → breaks CORS / custom domains (TBD; from PR #7376 review)  
2. **P0:** Cloud chat credit reconciliation failure modes → free inference / overcharge / lost response (TBD; from PR #7376 review)  
3. **P0/P1:** Container control-plane internal auth can be bypassed if token env var missing (TBD; from PR #7376 review)  
4. **P1:** Cloud app chat returns 500 instead of 401/403 for auth failures (TBD; from PR #7376 review)  
5. **P1:** Slack plugin inbound messages can be silently dropped when `users.info` fails (TBD; from PR #7375 review)  
6. **P1:** Security signal from Discord (“compromised?” + scam warning) needs incident triage (TBD)

(Secondary)  
7. **P3:** Document Twitter/X bot operational cost drivers and safe presets (TBD)

---

## Patterns / themes indicating deeper architectural risk
- **“Merged PR, but critical path defects remain as review findings”**: Multiple P1/P0-class issues surfaced in automated review notes for large Cloud/connector merges. This suggests insufficient pre-merge gating for critical routes (billing/auth/ingress handlers).
- **State-machine gaps in monetized/paid flows**: Credit reservation + streaming response delivery needs explicit transactional semantics and failure handling; “best effort” reconciliation is not adequate.
- **Connector ingress robustness**: Missing try/catch on third-party API calls (Slack `users.info`) creates silent message drops—classic reliability failure mode for event-driven integrations.
- **Security controls dependent on environment presence**: Auth becoming a no-op when env vars are missing is a systemic hardening issue; production services should fail closed.

---

## Process improvement recommendations
1. **Introduce “Critical Path Gate” checks before merge** for any PR touching:
   - Billing/credits
   - Auth middleware / response code mapping
   - Connector message ingress handlers  
   Require explicit checklist + at least one maintainer review sign-off.
2. **Add fault-injection tests** for Cloud chat monetization:
   - DB failure during finalize
   - Provider success but reconciliation failure
   - Streaming writer already closed then error occurs  
   Treat these as release blockers.
3. **Fail-closed configuration validation** in Cloud services:
   - On boot, assert required env vars exist in staging/prod (internal tokens, secrets, provider keys).
4. **Standardize connector ingress error handling**:
   - A shared helper pattern: `safeThirdPartyCall()` with structured logging and fallback identity behavior.
5. **Security reporting playbook for Discord**:
   - Dedicated channel/form for security reports, minimum info template, and a documented escalation path to core maintainers.
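
Recommendation 3 (fail-closed configuration validation) could be sketched as a boot-time assertion. The list of required variables and the stage names below are placeholders; only `CONTAINER_CONTROL_PLANE_TOKEN` comes from the findings above, and `assertBootConfig` is a hypothetical function name.

```typescript
type Env = Record<string, string | undefined>;
type Stage = "development" | "staging" | "production";

// Placeholder list; the real set of required secrets depends on the service.
const REQUIRED_IN_DEPLOYED_STAGES: string[] = [
  "CONTAINER_CONTROL_PLANE_TOKEN",
];

// Return the names of required variables that are unset or blank.
function missingRequiredEnv(env: Env, required: string[]): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Hypothetical boot check: throw before serving traffic in staging/prod.
function assertBootConfig(env: Env, stage: Stage): void {
  if (stage === "development") return;
  const missing = missingRequiredEnv(env, REQUIRED_IN_DEPLOYED_STAGES);
  if (missing.length > 0) {
    throw new Error(
      "refusing to boot: missing required env vars: " + missing.join(", "),
    );
  }
}
```

Calling `assertBootConfig` once at service startup turns a missing secret into a failed deploy rather than a running service with a disabled security boundary.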