# Issue Triage — 2026-05-13 (elizaOS)

## 1) Plugin registry v2 infrastructure returns 404 (repo + web)
- **Issue Title & ID:** V2 plugin registry unavailable (404 on `elizaos-plugins/registry` + `plugins.elizacloud.ai`) — **(Discord / no GH issue yet)**
- **Current Status:** **Open / Investigating** (acknowledged by core dev; policy fallback discussed)
- **Impact Assessment:**
  - **User Impact:** **High** (all third-party plugin authors attempting v2 submissions; blocks new ecosystem growth)
  - **Functional Impact:** **Partial** (core framework runs, but plugin distribution/submission pipeline is blocked)
  - **Brand Impact:** **High** (404s on “official” endpoints suggest broken infrastructure)
- **Technical Classification:**
  - **Category:** Bug / Infrastructure
  - **Component:** Plugin System / Registry Infrastructure / Web
  - **Complexity:** **Moderate effort** (could be access control, deployment routing, repo visibility, DNS, or artifact publishing)
- **Resource Requirements:**
  - **Required Expertise:** GitHub org/repo administration, CI/CD/deployments, DNS/hosting for `plugins.elizacloud.ai`, registry schema/publishing workflow
  - **Dependencies:** Decision on submission policy (registry vs PR-to-monorepo), BUSL-1.1 handling path
  - **Estimated Effort (1–5):** **3**
- **Recommended Priority:** **P0**
- **Specific Actionable Next Steps:**
  1. **Create a tracking GitHub issue** in `elizaos/eliza` (or the registry repo if it exists) capturing: endpoints returning 404, expected access model, and required SLA.
  2. Verify whether `elizaos-plugins/registry` is **deleted, renamed, private, or access-restricted**; check GitHub org audit log if available.
  3. Check `plugins.elizacloud.ai` deployment: origin health, routing, DNS, TLS, and whether it points to a removed environment.
  4. Publish an interim **“Submission Status”** note: “registry temporarily unavailable; use workaround X” with ETA or explicit “unknown”.
  5. Decide and document policy for **BUSL-1.1 plugins** (cannot PR into monorepo): define a sanctioned submission route.
- **Potential Assignees:** **odilitime** (platform/community ops), **0xSolace** (stability/infra bugfix focus), **NubsCarson** (cloud infra familiarity)
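
As a starting point for steps 2 and 3, the probe results could be bucketed before anyone digs into audit logs. The sketch below is illustrative only: `classifyProbe`, `summarize`, and the diagnosis labels are hypothetical names, not an existing elizaOS tool. One caveat it encodes: an anonymous request to a private GitHub repository also returns 404, so a 404 by itself cannot separate "deleted" from "private".

```typescript
// Hypothetical helper: maps the HTTP status of an unauthenticated probe
// to a coarse diagnosis. Labels are illustrative assumptions.
type ProbeDiagnosis =
  | "up"
  | "redirected" // e.g. a renamed repo or a moved site
  | "missing-or-private" // deleted, or hidden from anonymous callers
  | "access-restricted" // caller is recognized as lacking access
  | "origin-error" // the host or a proxy in front of it is failing
  | "unexpected";

function classifyProbe(status: number): ProbeDiagnosis {
  if (status >= 200 && status < 300) return "up";
  if (status >= 300 && status < 400) return "redirected";
  if (status === 404 || status === 410) return "missing-or-private";
  if (status === 401 || status === 403) return "access-restricted";
  if (status >= 500 && status < 600) return "origin-error";
  return "unexpected";
}

interface ProbeResult {
  target: string;
  status: number;
  diagnosis: ProbeDiagnosis;
}

// Attaches a diagnosis to each raw probe so the tracking issue can
// record one line per endpoint.
function summarize(
  probes: Array<{ target: string; status: number }>,
): ProbeResult[] {
  return probes.map((p) => ({ ...p, diagnosis: classifyProbe(p.status) }));
}
```

The status codes themselves would be gathered separately, for example with an HTTP client configured not to follow redirects, so that a rename shows up as a 3xx rather than being silently resolved.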

---

## 2) Discord security: scammer activity and reports of “compromised admin accounts”
- **Issue Title & ID:** Discord server security incidents (scammers / potential account compromise) — **(Discord / no GH issue yet)**
- **Current Status:** **Open** (reported by members; core dev notes ongoing scammers)
- **Impact Assessment:**
  - **User Impact:** **Critical** (community members at risk; phishing/scams scale quickly)
  - **Functional Impact:** **No** (doesn’t break runtime, but breaks community support channel integrity)
  - **Brand Impact:** **High** (trust/safety reputational risk)
- **Technical Classification:**
  - **Category:** Security / Community Ops
  - **Component:** Discord / Community Infrastructure
  - **Complexity:** **Moderate effort** (requires policy + moderation + bot tooling)
- **Resource Requirements:**
  - **Required Expertise:** Discord moderation/admin, incident response, bot configuration (anti-spam/anti-phish), access control hygiene
  - **Dependencies:** Clear role model (“no admins” claim vs observed roles), moderator coverage/timezones
  - **Estimated Effort (1–5):** **3**
- **Recommended Priority:** **P0**
- **Specific Actionable Next Steps:**
  1. Establish an **incident channel + runbook** (what happened, affected users, mitigation, comms).
  2. Audit roles/permissions: confirm who has elevated permissions; rotate credentials where applicable; enforce **2FA** for privileged roles.
  3. Enable/strengthen anti-scam controls: verification gates, link filtering, DM restrictions, auto-timeouts, suspicious account heuristics.
  4. Post a pinned security notice: no staff will DM first; how to verify links; how to report scams.
  5. Add structured reporting: dedicated “report-scam” form/thread; ensure fast moderator response.
- **Potential Assignees:** **odilitime** (moderation), **shrektwo** (reporting/triage), **scottnuttall_** (helper/mod support), plus any designated Discord moderators

---

## 3) Cloud monetized app chat endpoint: auth errors misclassified as 500 + credit reconciliation exploit/overcharge risk
- **Issue Title & ID:** Cloud `/api/v1/apps/:appId/chat` error handling & billing reconciliation hazards — **PR follow-up: elizaos/eliza#7376**
- **Current Status:** **Open (post-merge risk)** — merged with automated review flagging multiple P1s; needs immediate validation in production/staging
- **Impact Assessment:**
  - **User Impact:** **Critical** (affects all monetized app chat users; auth failures confuse clients; billing inaccuracies directly harm users/project)
  - **Functional Impact:** **Yes** (chat endpoint is core for monetized apps; incorrect status codes and reconciliation failures break reliability)
  - **Brand Impact:** **High** (billing/auth bugs are trust-destroying)
- **Technical Classification:**
  - **Category:** Bug / Security (abuse vector) / Reliability
  - **Component:** Cloud API / Billing / Auth middleware / App chat gateway
  - **Complexity:** **Complex solution** (streaming + billing correctness + error classification)
- **Resource Requirements:**
  - **Required Expertise:** Cloud API (Hono/Next route handlers), auth middleware, billing/credits accounting, streaming response lifecycle
  - **Dependencies:** Monitoring/telemetry for reconcile failures; clear accounting rules for reserve vs actual cost
  - **Estimated Effort (1–5):** **5**
- **Recommended Priority:** **P0**
- **Specific Actionable Next Steps:**
  1. Open a **hotfix issue** and add regression tests asserting that an invalid API key returns **401/403**, never **500**.
  2. Refactor auth validation out of `Promise.all` so that a thrown `AuthenticationError` or `ForbiddenError` maps deterministically to the correct HTTP status code.
  3. Streaming path: ensure **no full refund after content delivered** if reconciliation fails; design idempotent reconcile with “delivered=true” marker.
  4. Non-streaming path: if provider response succeeded but reconcile fails, return the response and queue reconciliation retry; ensure **no overcharge/no lost response**.
  5. Add dashboards/alerts: reconcile failure rate, negative balances, refund spikes, and “charged-with-500” scenarios.
- **Potential Assignees:** **NubsCarson** (author/Cloud domain work), **standujar** (cloud auth stabilization), **0xSolace** (bugfix/stability)
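
Steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual route handler: the error classes are re-declared locally, and `ChatDeps`, `handleChat`, and `statusForError` are hypothetical names. The idea is that authentication is awaited on its own before any parallel work, and a single function owns the error-to-status mapping.

```typescript
// Hypothetical error types; the real classes in the Cloud codebase may differ.
// A string tag is used instead of `instanceof` so the check also survives
// duplicated module copies or down-level transpilation.
class AuthenticationError extends Error {
  readonly kind = "authentication";
}
class ForbiddenError extends Error {
  readonly kind = "forbidden";
}

// Single place where thrown errors become HTTP status codes.
function statusForError(err: unknown): number {
  const kind = (err as { kind?: unknown } | null | undefined)?.kind;
  if (kind === "authentication") return 401;
  if (kind === "forbidden") return 403;
  return 500; // anything unrecognized is treated as a server error
}

// Hypothetical dependencies of the chat route.
interface ChatDeps {
  authenticate(apiKey: string): Promise<string>; // resolves to a caller id
  loadApp(appId: string): Promise<{ id: string }>;
  loadBalance(callerId: string): Promise<number>;
}

interface ChatResult {
  status: number;
  body: string;
}

async function handleChat(
  appId: string,
  apiKey: string,
  deps: ChatDeps,
): Promise<ChatResult> {
  try {
    // Auth runs alone and first, so its failure is never mixed with
    // failures from unrelated parallel lookups.
    const callerId = await deps.authenticate(apiKey);
    const [app, balance] = await Promise.all([
      deps.loadApp(appId),
      deps.loadBalance(callerId),
    ]);
    return { status: 200, body: `app=${app.id} balance=${balance}` };
  } catch (err) {
    return { status: statusForError(err), body: "request failed" };
  }
}
```

A regression test for step 1 would then call `handleChat` with a stubbed `authenticate` that throws `AuthenticationError` and assert that the status is 401.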

---

## 4) Cloud managed domains sync: “verified” never flips true after provisioning (breaks CORS/origin verification)
- **Issue Title & ID:** Domain sync doesn’t mark Cloudflare domains verified — **PR follow-up: elizaos/eliza#7376**
- **Current Status:** **Open (post-merge risk)**
- **Impact Assessment:**
  - **User Impact:** **High** (any app relying on managed domain origins may fail CORS / origin allowlists)
  - **Functional Impact:** **Partial** (apps run but custom domains/origins may remain unusable)
  - **Brand Impact:** **High** (domains “stuck” looks like broken cloud product)
- **Technical Classification:**
  - **Category:** Bug
  - **Component:** Cloud Domains / CORS origin verification
  - **Complexity:** **Moderate effort**
- **Resource Requirements:**
  - **Required Expertise:** Cloud domain lifecycle, Cloudflare status mapping, CORS/origin allowlist generation
  - **Dependencies:** Correct interpretation of “active” vs “verified” in domain provisioning pipeline
  - **Estimated Effort (1–5):** **3**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. Patch `/apps/:id/domains/sync` to set `verified: true` when zone + registration are active and ownership checks pass.
  2. Add a unit/integration test: domain purchased while pending → later sync → becomes verified → appears in verified origins list.
  3. Run a one-time backfill migration/job to fix existing rows stuck as `verified=false` but active.
- **Potential Assignees:** **NubsCarson**, **standujar**
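
The three steps above can be sketched with one shared predicate, so the sync route, the backfill job, and the test all agree on what "verified" means. The row shape is an assumption: only `verified` comes from the issue, and `zoneStatus`, `registrationStatus`, and `ownershipConfirmed` are hypothetical field names standing in for whatever the provisioning pipeline records.

```typescript
// Hypothetical row shape; every field except `verified` is an assumption.
interface ManagedDomain {
  hostname: string;
  zoneStatus: "pending" | "active" | "failed";
  registrationStatus: "pending" | "active" | "failed";
  ownershipConfirmed: boolean;
  verified: boolean;
}

// Shared rule: zone and registration are active and ownership checks pass.
function shouldBeVerified(d: ManagedDomain): boolean {
  return (
    d.zoneStatus === "active" &&
    d.registrationStatus === "active" &&
    d.ownershipConfirmed
  );
}

// Sync step: returns an updated copy. This sketch only promotes
// false -> true; whether sync should ever demote is left open.
function syncDomain(d: ManagedDomain): ManagedDomain {
  return !d.verified && shouldBeVerified(d) ? { ...d, verified: true } : d;
}

// Backfill: rows that are active but still stuck at verified=false.
function backfillCandidates(rows: ManagedDomain[]): ManagedDomain[] {
  return rows.filter((d) => !d.verified && shouldBeVerified(d));
}

// Origins that a CORS allowlist would be built from.
function verifiedOrigins(rows: ManagedDomain[]): string[] {
  return rows.filter((d) => d.verified).map((d) => `https://${d.hostname}`);
}
```

The scenario in step 2 maps onto this directly: a row that is pending at purchase stays unverified after `syncDomain`, and the same row with both statuses `active` comes back verified and shows up in `verifiedOrigins`.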

---

## 5) Container control-plane internal auth can be a no-op if token env var missing
- **Issue Title & ID:** Container control-plane `requireInternalToken` ineffective when env var absent — **PR follow-up: elizaos/eliza#7376**
- **Current Status:** **Open (security hardening needed)**
- **Impact Assessment:**
  - **User Impact:** **Medium → High** (depends on network exposure; could become critical if reachable)
  - **Functional Impact:** **Partial** (service works, but may be improperly protected)
  - **Brand Impact:** **High** (internal control-plane endpoints must be strongly authenticated)
- **Technical Classification:**
  - **Category:** Security
  - **Component:** Cloud container-control-plane service
  - **Complexity:** **Moderate effort**
- **Resource Requirements:**
  - **Required Expertise:** Service-to-service auth, deployment env hygiene, gateway policies, secret management
  - **Dependencies:** Deployment must guarantee token provision; may need infrastructure-level network policy as defense-in-depth
  - **Estimated Effort (1–5):** **3**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. Make missing token **fail closed** (service refuses to start or all internal routes return 503 with explicit error).
  2. Enforce network policy (private network only) + gateway allowlist as secondary control.
  3. Add CI/deploy checks: token presence required for production/staging.
- **Potential Assignees:** **NubsCarson**, **0xSolace**
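
A minimal sketch of the "refuses to start" variant of step 1, assuming the guard is constructed once at boot. The env var name `CONTAINER_CONTROL_PLANE_TOKEN` and the function `createInternalTokenGuard` are hypothetical; the real `requireInternalToken` may be wired differently. The environment is passed in as a plain object so the behavior is easy to test.

```typescript
// Hypothetical env var name; substitute whatever the service actually reads.
const TOKEN_VAR = "CONTAINER_CONTROL_PLANE_TOKEN";

type Env = Record<string, string | undefined>;

// Comparison without an early exit on the first mismatch. In Node,
// `crypto.timingSafeEqual` is the usual choice; this loop keeps the
// sketch free of imports.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// Builds the guard at startup. A missing or blank token throws, so the
// service cannot come up in a state where the check is a no-op.
function createInternalTokenGuard(
  env: Env,
): (presented: string | undefined) => boolean {
  const raw = env[TOKEN_VAR];
  if (raw === undefined || raw.trim() === "") {
    throw new Error(`${TOKEN_VAR} is not set; refusing to start`);
  }
  const expected: string = raw;
  return (presented) =>
    presented !== undefined && safeEqual(presented, expected);
}
```

Throwing at construction time also gives step 3 something concrete to check: a deploy smoke test can build the guard with the target environment and fail the pipeline if it throws.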

---

## 6) Slack connector: unguarded `users.info` can drop incoming messages on Slack API errors
- **Issue Title & ID:** Slack message handling can silently drop messages when user lookup fails — **PR follow-up: elizaos/eliza#7375**
- **Current Status:** **Open (post-merge functional reliability risk)**
- **Impact Assessment:**
  - **User Impact:** **High** (Slack is a primary connector; rate limits/network errors are common)
  - **Functional Impact:** **Yes** (drops inbound messages; no memory stored; no reply)
  - **Brand Impact:** **High** (connector appears flaky/unreliable)
- **Technical Classification:**
  - **Category:** Bug / Reliability
  - **Component:** Plugin System / `plugin-slack` / message ingestion path
  - **Complexity:** **Simple fix → Moderate** (wrap calls, define fallback identity behavior)
- **Resource Requirements:**
  - **Required Expertise:** Slack Bolt/SDK behavior, connector event lifecycle, error handling patterns
  - **Dependencies:** Decide fallback when `users.info` fails (use minimal identity, defer enrichment, retry)
  - **Estimated Effort (1–5):** **2**
- **Recommended Priority:** **P1**
- **Specific Actionable Next Steps:**
  1. Add try/catch around `getUser()` in all event handlers; ensure handler continues with a safe fallback user object.
  2. Log structured warning with Slack error code (`ratelimited`, `user_not_found`, etc.), but **do not throw**.
  3. Add tests simulating Slack API failure: message still processed + memory created + reply sent (where allowed).
- **Potential Assignees:** **2-A-M** (plugin migration work), **0xSolace** (stability), plugin maintainers for connectors
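
Steps 1 and 2 could look roughly like the sketch below. `resolveUser`, `UserLookup`, and the fallback shape are hypothetical and stand in for the connector's `getUser()`; the place where the Slack error code is read (`data.error`, then `code`) is an assumption about the error object and should be checked against the SDK version in use.

```typescript
interface SlackUser {
  id: string;
  displayName: string;
  enriched: boolean; // false when profile data could not be fetched
}

// Hypothetical signature standing in for the connector's `getUser()`.
type UserLookup = (userId: string) => Promise<SlackUser>;

interface LookupWarning {
  event: string;
  userId: string;
  slackError: string;
}

// Pulls a short error code out of whatever was thrown. The `data.error`
// and `code` locations are assumptions about the error shape.
function slackErrorCode(err: unknown): string {
  const e = err as
    | { data?: { error?: unknown }; code?: unknown }
    | null
    | undefined;
  const dataError = e?.data?.error;
  if (typeof dataError === "string") return dataError;
  const code = e?.code;
  if (typeof code === "string") return code;
  return "unknown";
}

// Never throws: on lookup failure it emits a structured warning and
// returns a minimal identity so the message can still be processed.
async function resolveUser(
  userId: string,
  lookup: UserLookup,
  warn: (w: LookupWarning) => void,
): Promise<SlackUser> {
  try {
    return await lookup(userId);
  } catch (err) {
    warn({
      event: "slack_user_lookup_failed",
      userId,
      slackError: slackErrorCode(err),
    });
    return { id: userId, displayName: userId, enriched: false };
  }
}
```

The `enriched: false` flag leaves room for the deferred-enrichment option listed under Dependencies: a later job can retry the lookup for users stored with the fallback identity.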

---

## 7) Plugin submission policy ambiguity (v2 registry vs PR-to-monorepo), especially for BUSL-1.1 plugins
- **Issue Title & ID:** Unclear/blocked submission path for BUSL-1.1 plugins during registry outage — **(Discord / no GH issue yet)**
- **Current Status:** **Open** (policy fallback discussed but not finalized)
- **Impact Assessment:**
  - **User Impact:** **High** (plugin authors blocked even if registry restored, due to licensing constraints)
  - **Functional Impact:** **Partial** (ecosystem contribution pipeline impaired)
  - **Brand Impact:** **Medium → High** (contributors perceive process as unstable)
- **Technical Classification:**
  - **Category:** Documentation / Process
  - **Component:** Plugin System / Governance
  - **Complexity:** **Moderate effort**
- **Resource Requirements:**
  - **Required Expertise:** Maintainer policy, licensing constraints, repo workflows, documentation
  - **Dependencies:** Resolution of registry availability + governance decision
  - **Estimated Effort (1–5):** **2**
- **Recommended Priority:** **P2**
- **Specific Actionable Next Steps:**
  1. Write an interim policy: where BUSL plugins go, what metadata is required, and review criteria.
  2. Publish “current recommended path” in docs and pin in Discord (#coders).
  3. Create templates: registry entry JSON schema, validation tooling, and CI checks.
- **Potential Assignees:** **odilitime**, **pmairca**, documentation contributors

---

## 8) Untriaged security concern mention (“was something compromised?”) without follow-up
- **Issue Title & ID:** Unresolved community report of possible compromise (no details captured) — **(Discord / no GH issue yet)**
- **Current Status:** **Open / Needs clarification**
- **Impact Assessment:**
  - **User Impact:** **Unknown** (insufficient details)
  - **Functional Impact:** **Unknown**
  - **Brand Impact:** **Medium** (unanswered compromise questions increase anxiety)
- **Technical Classification:**
  - **Category:** Security / Triage
  - **Component:** Community / Potential infra
  - **Complexity:** **Simple fix** (collect details) → could become complex if real incident
- **Resource Requirements:**
  - **Required Expertise:** Incident triage, log review, moderation
  - **Dependencies:** Reporter follow-up
  - **Estimated Effort (1–5):** **1**
- **Recommended Priority:** **P2**
- **Specific Actionable Next Steps:**
  1. Ask reporter for specifics (what was seen, where, timestamps, screenshots).
  2. Correlate with Discord scam incidents; check if any GitHub/org access anomalies occurred.
  3. Close as “no evidence” or escalate to incident response with findings.
- **Potential Assignees:** **odilitime**, security-minded maintainers

---

# Conclusion

## 1) Top 5–10 highest priority items to address immediately
1. **(P0)** V2 plugin registry 404 outage (repo + `plugins.elizacloud.ai`) — unblock ecosystem submissions.
2. **(P0)** Discord security incident response (scammers / possible compromised elevated accounts).
3. **(P0)** Cloud monetized app chat endpoint: auth status misclassification + credit reconciliation failure modes (free-inference/refund exploit + overcharge/response loss).
4. **(P1)** Cloud domain sync not setting `verified=true` (breaks CORS/origin verification).
5. **(P1)** Container control-plane internal token check fails open when the env var is missing.
6. **(P1)** Slack connector inbound message drops on Slack API errors (`users.info` unguarded).
7. **(P2)** Plugin submission policy clarification (registry vs monorepo PRs; BUSL-1.1 path).
8. **(P2)** Follow up on vague “compromised?” report to ensure no hidden incident.

## 2) Patterns/themes suggesting deeper architectural problems
- **Critical operational surfaces lack “fail-loud” behavior:** 404 registry endpoints, silent Slack message drops, and ambiguous Discord security posture all point to insufficient hard failure signals and missing runbooks.
- **Post-merge risk in high-stakes Cloud paths:** monetization + auth + credits reconciliation needs stronger pre-merge gating and production-safety checks, especially around streaming lifecycle correctness.
- **Ecosystem pipeline fragility:** the plugin submission process depends on infrastructure that can disappear (404) without a documented fallback or clear licensing-aware pathway.

## 3) Process improvement recommendations
- **Introduce “Production Surface Gate” checks** for Cloud PRs touching billing/auth/streaming:
  - mandatory integration tests for status code mapping (401/403/500),
  - reconciliation invariants (no free inference on failure; no charge-without-response),
  - and explicit rollback/feature-flag strategy.
- **Add “fail-loud” observability requirements** for connectors and infra:
  - connectors must not drop messages silently; require structured error logs + counters,
  - registry availability should be monitored with uptime checks and a public status note.
- **Formalize plugin submission governance**:
  - single source of truth docs,
  - explicit licensing compatibility matrix (monorepo vs registry),
  - and a documented fallback when registry is down.
- **Security hygiene for community ops**:
  - enforce 2FA for privileged Discord roles,
  - publish an incident response playbook,
  - and add standardized reporting/triage workflow.
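
The reconciliation invariants named under the Production Surface Gate (no free inference on failure; no charge-without-response) could be pinned down as a small pure decision function that integration tests exercise directly. This is a sketch under assumptions, not the Cloud billing code: `Reservation`, `decideSettlement`, and `applySettlement` are hypothetical names, a `null` cost stands for "reconciliation failed", and capping the charge at the reserved amount is one possible accounting rule that the project would still need to decide.

```typescript
// Hypothetical record of a credit hold taken before calling the provider.
interface Reservation {
  id: string;
  reserved: number; // credits held up front
  delivered: boolean; // true once any content reached the client
  settled: boolean; // true once reconciliation has been applied
  charged: number; // final charge; meaningful only when settled
}

type Settlement =
  | { action: "charge"; amount: number; refund: number }
  | { action: "refund-all" }
  | { action: "retry-later" }
  | { action: "noop" };

// actualCost === null models a reconciliation failure (cost unknown).
function decideSettlement(
  r: Reservation,
  actualCost: number | null,
): Settlement {
  if (r.settled) return { action: "noop" }; // idempotent on repeat calls
  if (!r.delivered) return { action: "refund-all" }; // no charge-without-response
  if (actualCost === null) return { action: "retry-later" }; // keep the hold: no free inference
  // Assumption: never charge more than was reserved.
  const amount = Math.min(actualCost, r.reserved);
  return { action: "charge", amount, refund: r.reserved - amount };
}

// Applies a decision and returns the updated reservation.
function applySettlement(r: Reservation, s: Settlement): Reservation {
  switch (s.action) {
    case "charge":
      return { ...r, settled: true, charged: s.amount };
    case "refund-all":
      return { ...r, settled: true, charged: 0 };
    case "retry-later":
    case "noop":
      return r;
  }
}
```

Because the decision is a pure function of the reservation and the observed cost, a retry queue can call it repeatedly: a settled reservation always yields `noop`, and a delivered but unreconciled one keeps its hold until the cost is known.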