{
  "interval": {
    "intervalStart": "2026-02-12T00:00:00.000Z",
    "intervalEnd": "2026-02-13T00:00:00.000Z",
    "intervalType": "day"
  },
  "repository": "elizaos/eliza",
  "overview": "From 2026-02-12 to 2026-02-13, elizaos/eliza had 1 new PRs (1 merged), 0 new issues, and 4 active contributors.",
  "topIssues": [
    {
      "id": "I_kwDOMT5cIs7FLaT0",
      "title": "Implement Runtime Method Mocking for Deterministic Agent Testing",
      "author": "monilpat",
      "number": 5749,
      "repository": "elizaos/eliza",
      "body": "### Problem Statement\nCurrently, ElizaOS scenario testing lacks the ability to mock internal agent runtime calls (particularly LLM interactions) when testing via the API client. This makes it difficult to create deterministic tests for agent behavior and workflows.\n\n### Current Limitations\n- Global function mocks only work for code executed in the same process\n- No way to intercept `runtime.useModel` calls from the bootstrap plugin\n- Agent behavior is non-deterministic due to real LLM calls\n- Difficult to test specific agent decision paths and responses\n\n### Why Different Approach Was Needed\n\nThe existing mocking system was designed for **code execution environments** (E2B sandbox and local `node -e` processes), where mocks are applied to the execution context. However, **agent testing** works differently:\n\n- **E2B/Local Execution**: Code runs in isolated processes/sandboxes where global mocks can be applied\n- **Agent Testing**: Agent runs as a service with API client interaction, requiring direct runtime method interception\n- **Architecture Gap**: The bootstrap plugin calls `runtime.useModel` internally, which can't be mocked through the execution environment approach\n\nThis architectural difference necessitated a new approach that directly intercepts runtime methods rather than relying on execution context mocking.\n\n### Proposed Solution: Runtime Method Mocking\n\nWe've implemented a new mocking approach that directly intercepts `IAgentRuntime` method calls:\n\n#### Key Components\n\n1. **MockEngine Enhancement**\n   - Added `runtimeMethodMocks` to support mocking runtime methods\n   - Priority system: `runtimeMethod` > `service` > `globalFunction`\n   - Direct method replacement on the runtime instance\n\n2. **Schema Updates**\n   - Extended `MockSchema` to include `runtimeMethod: z.string().optional()`\n   - Supports mocking any `IAgentRuntime` method (e.g., `useModel`, `getService`)\n\n3. **Conditional Matching**\n   - Matcher functions can inspect method arguments\n   - Example: `\"return args[1] && args[1].prompt && args[1].prompt.includes('should respond')\"`\n   - Supports complex conditional logic for when to apply mocks\n\n#### Usage Example\n\n```yaml\nsetup:\n  mocks:\n    - runtimeMethod: \"useModel\"\n      method: \"useModel\"\n      when:\n        matcher: \"return args[1] && args[1].prompt && args[1].prompt.includes('should respond')\"\n      response: \"<response><name>agent</name><reasoning>Mocked response</reasoning><action>RESPOND</action></response>\"\n```\n\n#### Benefits\n\n1. **Deterministic Testing**: Replace LLM calls with static responses\n2. **Full Workflow Testing**: Test complete agent decision flows\n3. **Conditional Mocking**: Apply mocks based on specific conditions\n4. **Runtime Integration**: Works seamlessly with existing agent architecture\n5. **Backward Compatibility**: Existing mocks continue to work\n\n#### Implementation Steps\n\n- MockEngine runtime method support\n- Schema updates for runtimeMethod mocks\n- Conditional matching with matcher functions\n- Integration with scenario runner\n- Basic useModel mocking working\n- Add support for mocking other runtime methods (getService, etc.)\n- Create comprehensive test scenarios\n- Add documentation and examples\n- Consider adding mock recording/replay capabilities\n\n### Technical Details\n\nThe implementation works by:\n1. Creating the `IAgentRuntime` instance in the scenario runner\n2. Applying mocks directly to the runtime object before agent initialization\n3. The bootstrap plugin receives the mocked runtime and uses the mocked methods\n4. Mocks are reverted after scenario completion\n\nThis approach provides the foundation for robust, deterministic agent testing while maintaining the existing architecture.",
      "createdAt": "2025-08-11T00:08:49Z",
      "closedAt": "2026-02-12T22:43:03Z",
      "state": "CLOSED",
      "commentCount": 0
    },
    {
      "id": "I_kwDOMT5cIs7FcGEm",
      "title": "feat(scenarios): Add Step Count Evaluator",
      "author": "monilpat",
      "number": 5761,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Add Step Count Evaluator\n\nLinks: [Issue #5726](https://github.com/elizaOS/eliza/issues/5726)\n\n## Summary\nAdd an evaluator that asserts on the number of agent/tool/action steps taken to complete a scenario step. This encourages concise trajectories and prevents thrashing.\n\n## Goals\n- Count actions/tools invoked during response generation for a step\n- Assert against `max_steps`, `min_steps`, optional `target_steps`\n\n## Acceptance Criteria\n1. New evaluator type `step_count`\n2. Step count derived from trajectory/action memories queried via `AgentRuntime.getMemories`\n3. Evaluator passes if within bounds and returns a clear message\n\n## Schema Changes\nEdit `packages/cli/src/commands/scenario/src/schema.ts`:\n\n```ts\nconst StepCountEvaluationSchema = BaseEvaluationSchema.extend({\n  type: z.literal('step_count'),\n  max_steps: z.number().optional(),\n  min_steps: z.number().optional(),\n  target_steps: z.number().optional(),\n  // optional filter by action name prefix to scope counting\n  action_prefix: z.string().optional(),\n});\n```\n\n## Evaluation Implementation\nLeverage the existing `TrajectoryContainsActionEvaluator` patterns.\n\n```ts\nclass StepCountEvaluator implements Evaluator {\n  async evaluate(params: EvaluationSchema, _runResult: ExecutionResult, runtime: AgentRuntime): Promise<EvaluationResult> {\n    if (params.type !== 'step_count') throw new Error('Mismatched evaluator');\n    try {\n      const memories = await runtime.getMemories({\n        tableName: 'messages', agentId: runtime.agentId, count: 200, unique: false,\n      });\n      const actionResults = memories.filter(m => m?.type === 'messages' && m.content?.type === 'action_result');\n      const filtered = params.action_prefix\n        ? actionResults.filter(m => (m.content?.actionName ?? '').startsWith(params.action_prefix!))\n        : actionResults;\n      const steps = filtered.length;\n\n      const tooMany = params.max_steps != null && steps > params.max_steps;\n      const tooFew = params.min_steps != null && steps < params.min_steps;\n      const success = !(tooMany || tooFew);\n      return {\n        success,\n        message: `Steps taken: ${steps} (min=${params.min_steps ?? '-'}, target=${params.target_steps ?? '-'}, max=${params.max_steps ?? '-'})`,\n      };\n    } catch (error: any) {\n      return { success: false, message: `Step count evaluation failed: ${error.message ?? String(error)}` };\n    }\n  }\n}\n```\n\nRegister in `EvaluationEngine` constructor:\n\n```ts\nthis.register('step_count', new StepCountEvaluator());\n```\n\n## Example Usage\n\n```yaml\nevaluations:\n  - type: step_count\n    max_steps: 3\n    action_prefix: \"github-service.\"\n```\n\n## Tests\n- Unit: evaluator result on synthetic memories\n- Integration: scenario that triggers 1-2 simple actions and validates bound\n\n## Notes\n- Future enhancement: count only steps within the current scenario step window by tagging messages with a scenario/step id context field.\n\n\n",
      "createdAt": "2025-08-12T04:27:34Z",
      "closedAt": "2026-02-12T22:43:05Z",
      "state": "CLOSED",
      "commentCount": 0
    },
    {
      "id": "I_kwDOMT5cIs7FcGCu",
      "title": "feat(scenarios): Add Consistency Evaluator",
      "author": "monilpat",
      "number": 5760,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Add Consistency Evaluator\n\nLinks: [Issue #5726](https://github.com/elizaOS/eliza/issues/5726)\n\n## Summary\nAdd an evaluator that runs the same step multiple times and asserts consistency over a chosen metric (response content, length, execution time, token counts). Useful for detecting non-determinism and regression in prompt/rules.\n\n## Goals\n- Re-run a step N times\n- Compare result metrics across runs\n- Allow tolerance configuration\n\n## Acceptance Criteria\n1. New evaluator type `consistency_check`\n2. Supports metrics: `response_length` | `execution_time` | `token_count`\n3. Configurable `runs` (>=2) and `tolerance` (absolute or percentage)\n4. Consolidated pass/fail with clear variance message\n\n## Schema Changes\nEdit `packages/cli/src/commands/scenario/src/schema.ts`:\n\n```ts\nconst ConsistencyEvaluationSchema = BaseEvaluationSchema.extend({\n  type: z.literal('consistency_check'),\n  runs: z.number().min(2),\n  metric: z.enum(['response_length', 'execution_time', 'token_count']),\n  tolerance: z.number().min(0), // if < 1 treat as proportion, else absolute\n});\n```\n\n## Orchestration Changes\nAdd a helper to re-execute the current step within the selected environment provider without re-running the entire scenario:\n\n- Expose a `runSingleStep(step, scenarioContext)` on providers that mirrors the logic in `run`, returning an `ExecutionResult`.\n- In `EvaluationEngine`, when encountering `consistency_check`, call this helper to collect multiple `ExecutionResult`s for the same step.\n- Fallback: If helper is not implemented, re-run `run([step])` via a shim.\n\n## Evaluation Implementation\n\n```ts\nclass ConsistencyEvaluator implements Evaluator {\n  constructor(private rerunStep: (times: number) => Promise<ExecutionResult[]>) {}\n\n  async evaluate(params: EvaluationSchema, runResult: ExecutionResult): Promise<EvaluationResult> {\n    if (params.type !== 'consistency_check') throw new Error('Mismatched evaluator');\n\n    const runs = await this.rerunStep(params.runs);\n    const values = runs.map(r => {\n      switch (params.metric) {\n        case 'response_length': return r.stdout.length;\n        case 'execution_time': return r.durationMs ?? 0;\n        case 'token_count': {\n          const llm = r.metrics?.llm ?? [];\n          return llm.reduce((sum, m) => sum + (m.totalTokens ?? (m.promptTokens ?? 0) + (m.completionTokens ?? 0)), 0);\n        }\n      }\n    });\n\n    const min = Math.min(...values);\n    const max = Math.max(...values);\n    const spread = max - min;\n    const baseline = values[0] || 1;\n    const proportion = baseline ? spread / baseline : spread;\n    const threshold = params.tolerance < 1 ? params.tolerance : params.tolerance / baseline;\n    const success = proportion <= threshold;\n\n    return {\n      success,\n      message: `Consistency ${success ? 'ok' : 'failed'} for ${params.metric}: min=${min}, max=${max}, spread=${spread}, tolerance=${params.tolerance}`,\n    };\n  }\n}\n```\n\nRegister in `EvaluationEngine` with a provider-specific `rerunStep` function wired from the scenario runner.\n\n## Example Usage\n\n```yaml\nevaluations:\n  - type: consistency_check\n    runs: 3\n    metric: execution_time\n    tolerance: 0.2   # 20% spread allowed\n```\n\n## Tests\n- Unit: evaluator math on synthetic values\n- Integration: deterministic small prompt with low tolerance; non-deterministic with higher tolerance\n\n## Notes\n- This evaluator requires access to the step context. Implementation will pass a `rerunStep(times)` closure into the evaluator at construction time.\n\n\n",
      "createdAt": "2025-08-12T04:27:29Z",
      "closedAt": "2026-02-12T22:43:05Z",
      "state": "CLOSED",
      "commentCount": 0
    },
    {
      "id": "I_kwDOMT5cIs7FcGAc",
      "title": "feat(scenarios): Add Cost Evaluator",
      "author": "monilpat",
      "number": 5759,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Add Cost Evaluator\n\nLinks: [Issue #5726](https://github.com/elizaOS/eliza/issues/5726)\n\n## Summary\nIntroduce an evaluator that asserts the estimated dollar cost of LLM usage per step. Cost is derived from token counts and a model price table.\n\n## Goals\n- Estimate cost (USD) for each step from recorded token metrics\n- Allow thresholds (`max_cost_usd`) to fail expensive runs\n- Support multiple models within a single step\n\n## Acceptance Criteria\n1. New evaluator type `llm_cost`\n2. Price table configurable via env or default map\n3. Evaluator passes if total step cost <= `max_cost_usd`\n4. Detailed message with model breakdown and total\n\n## Schema Changes\nEdit `packages/cli/src/commands/scenario/src/schema.ts`:\n\n```ts\nconst LlmCostEvaluationSchema = BaseEvaluationSchema.extend({\n  type: z.literal('llm_cost'),\n  max_cost_usd: z.number(),\n});\n```\n\n## Pricing Source\nAdd a small utility `packages/cli/src/commands/scenario/src/pricing.ts`:\n\n```ts\nexport type ModelPricing = {\n  inputPer1K: number;    // USD per 1000 input tokens\n  outputPer1K: number;   // USD per 1000 output tokens\n};\n\nexport const DEFAULT_MODEL_PRICING: Record<string, ModelPricing> = {\n  TEXT_SMALL: { inputPer1K: 0.15, outputPer1K: 0.60 },\n  TEXT_LARGE: { inputPer1K: 0.50, outputPer1K: 1.50 },\n  OBJECT_SMALL: { inputPer1K: 0.50, outputPer1K: 1.50 },\n};\n\nexport function getPricing(modelType: string, overrides?: Record<string, ModelPricing>): ModelPricing | null {\n  const table = overrides ?? DEFAULT_MODEL_PRICING;\n  return table[modelType] ?? null;\n}\n```\n\nAllow overrides via `SCENARIO_MODEL_PRICING` env (JSON string) in a follow-up.\n\n## Evaluation Implementation\nAdd to `EvaluationEngine`:\n\n```ts\nclass LlmCostEvaluator implements Evaluator {\n  async evaluate(params: EvaluationSchema, runResult: ExecutionResult): Promise<EvaluationResult> {\n    if (params.type !== 'llm_cost') throw new Error('Mismatched evaluator');\n    const llm = runResult.metrics?.llm ?? [];\n    if (!llm.length) return { success: false, message: 'No LLM metrics found for cost calculation' };\n\n    const pricingOverrides = process.env.SCENARIO_MODEL_PRICING ? JSON.parse(process.env.SCENARIO_MODEL_PRICING) : undefined;\n    let total = 0;\n    for (const m of llm) {\n      const pricing = getPricing(m.modelType, pricingOverrides);\n      if (!pricing) continue;\n      const inTok = m.promptTokens ?? 0;\n      const outTok = m.completionTokens ?? 0;\n      total += (inTok / 1000) * pricing.inputPer1K + (outTok / 1000) * pricing.outputPer1K;\n    }\n\n    const success = total <= params.max_cost_usd;\n    return { success, message: `Estimated cost: $${total.toFixed(4)} (limit $${params.max_cost_usd.toFixed(4)})` };\n  }\n}\n```\n\nRegister:\n\n```ts\nthis.register('llm_cost', new LlmCostEvaluator());\n```\n\n## Example Usage\n\n```yaml\nevaluations:\n  - type: llm_cost\n    max_cost_usd: 0.05\n```\n\n## Tests\n- Unit: price math with multiple model records\n- Integration: with token_count metrics present and absent\n\n## Notes\nThis builds on the Token Count evaluator and shared metrics capture. It complements mocking enhancements described in [Issue #5726](https://github.com/elizaOS/eliza/issues/5726).\n\n\n",
      "createdAt": "2025-08-12T04:27:23Z",
      "closedAt": "2026-02-12T22:43:04Z",
      "state": "CLOSED",
      "commentCount": 0
    },
    {
      "id": "I_kwDOMT5cIs7FcF-L",
      "title": "feat(scenarios): Add Token Count Evaluator",
      "author": "monilpat",
      "number": 5758,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Add Token Count Evaluator\n\nLinks: [Issue #5726](https://github.com/elizaOS/eliza/issues/5726)\n\n## Summary\nAdd an evaluator to assert on input/output/total token counts for LLM calls used during a scenario step. This establishes cost-awareness and guardrails on prompt size and verbosity.\n\n## Goals\n- Track token counts for LLM interactions triggered by a step\n- Assert against `max_input_tokens`, `max_output_tokens`, optional `max_total_tokens`\n- Work across providers through `AgentRuntime.useModel` hooks\n\n## Acceptance Criteria\n1. New evaluator type `token_count`\n2. Instrumentation captures prompt/completion token counts per step\n3. Evaluator passes if counts are within configured maxima\n4. Messages include breakdown and model type(s)\n\n## Schema Changes\nEdit `packages/cli/src/commands/scenario/src/schema.ts`:\n\n```ts\nconst TokenCountEvaluationSchema = BaseEvaluationSchema.extend({\n  type: z.literal('token_count'),\n  max_input_tokens: z.number().optional(),\n  max_output_tokens: z.number().optional(),\n  max_total_tokens: z.number().optional(),\n});\n\n// Add to EvaluationSchema union\n```\n\n## Runtime Integration\nWe will attach a per-step metrics accumulator that the evaluators can read via `ExecutionResult` metadata. Minimal, provider-agnostic approach:\n\n### ExecutionResult extension\nEdit `packages/cli/src/commands/scenario/src/providers.ts`:\n\n```ts\nexport interface ExecutionResult {\n  // ...existing\n  metrics?: {\n    llm?: Array<{\n      modelType: string;\n      promptTokens?: number;\n      completionTokens?: number;\n      totalTokens?: number;\n      provider?: string;\n    }>;\n  };\n}\n```\n\n### Capturing tokens\nAdd a light hook layer around `AgentRuntime.useModel` during scenario runs:\n\n- In `runtime-factory.ts`, wrap the runtime instance for scenarios with a proxy that:\n  - Intercepts `useModel`\n  - If params include a prompt or known text fields, call tokenizer via `ModelType.TEXT_TOKENIZER_ENCODE` to estimate prompt tokens\n  - After result, estimate output tokens similarly\n  - Push a record into `currentStepMetrics.llm[]`\n\nNote: If a tokenizer is not available, fall back to heuristic length/4 estimate. Keep it optional so normal runs are unaffected.\n\n## Evaluation Implementation\nAdd to `EvaluationEngine`:\n\n```ts\nclass TokenCountEvaluator implements Evaluator {\n  async evaluate(params: EvaluationSchema, runResult: ExecutionResult): Promise<EvaluationResult> {\n    if (params.type !== 'token_count') throw new Error('Mismatched evaluator');\n    const llm = runResult.metrics?.llm ?? [];\n    if (!llm.length) {\n      return { success: false, message: 'No LLM token metrics found for this step' };\n    }\n    const totals = llm.reduce((acc, m) => ({\n      prompt: acc.prompt + (m.promptTokens ?? 0),\n      completion: acc.completion + (m.completionTokens ?? 0),\n      total: acc.total + (m.totalTokens ?? ((m.promptTokens ?? 0) + (m.completionTokens ?? 0))),\n    }), { prompt: 0, completion: 0, total: 0 });\n\n    const tooManyInput = params.max_input_tokens != null && totals.prompt > params.max_input_tokens;\n    const tooManyOutput = params.max_output_tokens != null && totals.completion > params.max_output_tokens;\n    const tooManyTotal = params.max_total_tokens != null && totals.total > params.max_total_tokens;\n\n    const success = !(tooManyInput || tooManyOutput || tooManyTotal);\n    return {\n      success,\n      message: `Tokens — input:${totals.prompt}, output:${totals.completion}, total:${totals.total}`,\n    };\n  }\n}\n```\n\nRegister in constructor:\n\n```ts\nthis.register('token_count', new TokenCountEvaluator());\n```\n\n## Example Usage\n\n```yaml\nevaluations:\n  - type: token_count\n    max_input_tokens: 2000\n    max_output_tokens: 1000\n    max_total_tokens: 2500\n```\n\n## Tests\n- Unit: evaluator aggregation logic\n- Integration: scenario with a small prompt + enforced limits; tokenizer present and absent\n\n## Notes\n- This is complementary to a Cost evaluator which can derive cost from token metrics and model price tables.\n\n\n",
      "createdAt": "2025-08-12T04:27:18Z",
      "closedAt": "2026-02-12T22:43:04Z",
      "state": "CLOSED",
      "commentCount": 0
    }
  ],
  "topPRs": [
    {
      "id": "PR_kwDOMT5cIs7DaiYU",
      "title": "chore(changelog): remove references",
      "author": "mcp97",
      "number": 6495,
      "body": "## Summary\r\n- remove all references from `CHANGELOG.md`\r\n\r\n## Testing\r\n- not run (content-only change)\r\n",
      "repository": "elizaos/eliza",
      "createdAt": "2026-02-12T22:49:11Z",
      "mergedAt": "2026-02-12T23:13:17Z",
      "additions": 0,
      "deletions": 23
    }
  ],
  "codeChanges": {
    "additions": 0,
    "deletions": 23,
    "files": 1,
    "commitCount": 7
  },
  "completedItems": [
    {
      "title": "chore(changelog): remove references",
      "prNumber": 6495,
      "type": "other",
      "body": "## Summary\r\n- remove all references from `CHANGELOG.md`\r\n\r\n## Testing\r\n- not run (content-only change)\r\n",
      "files": [
        "CHANGELOG.md"
      ]
    }
  ],
  "topContributors": [
    {
      "username": "standujar",
      "avatarUrl": "https://avatars.githubusercontent.com/u/16385918?u=718bdcd1585be8447bdfffb8c11ce249baa7532d&v=4",
      "totalScore": 81.35548968218932,
      "prScore": 81.35548968218932,
      "issueScore": 0,
      "reviewScore": 0,
      "commentScore": 0,
      "summary": null
    },
    {
      "username": "2-A-M",
      "avatarUrl": "https://avatars.githubusercontent.com/u/96268540?u=b7d92c0e2a91af580d09eeae862eef576955ab8a&v=4",
      "totalScore": 36.63501911726088,
      "prScore": 36.63501911726088,
      "issueScore": 0,
      "reviewScore": 0,
      "commentScore": 0,
      "summary": null
    },
    {
      "username": "mcp97",
      "avatarUrl": "https://avatars.githubusercontent.com/u/15067321?v=4",
      "totalScore": 21.901026915173976,
      "prScore": 21.901026915173976,
      "issueScore": 0,
      "reviewScore": 0,
      "commentScore": 0,
      "summary": null
    },
    {
      "username": "greptile-apps",
      "avatarUrl": "https://avatars.githubusercontent.com/in/867647?v=4",
      "totalScore": 4.5,
      "prScore": 0,
      "issueScore": 0,
      "reviewScore": 4.5,
      "commentScore": 0,
      "summary": null
    }
  ],
  "newPRs": 1,
  "mergedPRs": 1,
  "newIssues": 0,
  "closedIssues": 5,
  "activeContributors": 4
}