{
  "interval": {
    "intervalStart": "2025-08-12T00:00:00.000Z",
    "intervalEnd": "2025-08-13T00:00:00.000Z",
    "intervalType": "day"
  },
  "repository": "elizaos/eliza",
  "overview": "From 2025-08-12 to 2025-08-13, elizaos/eliza had 1 new PR (0 merged), 5 new issues, and 2 active contributors.",
  "topIssues": [
    {
      "id": "I_kwDOMT5cIs7Eng6F",
      "title": "feat(scenarios): Implement natural language agent interaction and response validation",
      "author": "monilpat",
      "number": 5727,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Implement natural language agent interaction and response validation\n\n## Description\n\nThis ticket enables scenarios to test agent behavior through natural language interactions rather than direct code execution. This allows testing of agent reasoning, decision-making, and response generation in realistic conversation contexts with proper evaluation of agent responses.\n\n## Acceptance Criteria\n\n1. Scenario `run` blocks support `input` field for natural language prompts to agents\n2. Agent responses are captured and available for evaluation (text, thoughts, actions)\n3. Evaluators can access both agent response text and execution context\n4. Support for multi-turn conversations in scenarios\n5. Agent responses include thought process and action decisions\n6. Integration with existing evaluation engine for response validation\n7. Support for conversation context across multiple steps\n8. Agent response timing and performance metrics\n\n## Technical Approach\n\n### 1. Enhanced Run Step Schema\n```typescript\n// packages/cli/src/scenarios/schema.ts\nconst RunStepSchema = z.object({\n  name: z.string().optional(),\n  // Natural language input to agent\n  input: z.string().optional(),\n  // Direct code execution (existing)\n  lang: z.string().optional(),\n  code: z.string().optional(),\n  // Agent interaction specific\n  agent_context: z.object({\n    conversation_id: z.string().optional(),\n    user_id: z.string().optional(),\n    room_id: z.string().optional(),\n  }).optional(),\n  evaluations: z.array(EvaluationSchema),\n});\n```\n\n### 2. 
Agent Interaction Engine\n```typescript\n// packages/cli/src/scenarios/agent-interaction.ts\nexport class AgentInteractionEngine {\n  constructor(private runtime: IAgentRuntime) {}\n\n  async interactWithAgent(input: string, context?: AgentContext): Promise<AgentResponse> {\n    // Create message for agent\n    const message: Memory = {\n      entityId: context?.user_id || 'scenario-user',\n      roomId: context?.room_id || 'scenario-room',\n      content: {\n        type: 'text',\n        text: input,\n      },\n      metadata: {\n        type: 'message',\n        conversationId: context?.conversation_id,\n      },\n    };\n\n    // Send to agent and capture response\n    const startTime = Date.now();\n    const response = await this.runtime.processMessage(message);\n    const endTime = Date.now();\n\n    return {\n      text: response.text,\n      thoughts: response.thoughts,\n      actions: response.actions,\n      timing: {\n        startTime,\n        endTime,\n        duration: endTime - startTime,\n      },\n      context: {\n        conversationId: context?.conversation_id,\n        messageId: message.id,\n      },\n    };\n  }\n}\n```\n\n### 3. 
Enhanced Execution Result\n```typescript\n// packages/cli/src/scenarios/providers.ts\nexport interface ExecutionResult {\n  exitCode: number;\n  stdout: string;\n  stderr: string;\n  files: Record<string, string>;\n  // New: Agent interaction results\n  agentResponse?: AgentResponse;\n  conversationHistory?: AgentResponse[];\n}\n```\n\n## Test Scenario\n\nCreate `agent-interaction-test.scenario.yaml`:\n```yaml\nname: \"Agent Interaction Test\"\ndescription: \"Tests natural language interaction with agents\"\n\nplugins:\n  - \"@elizaos/plugin-github\"\n  - \"@elizaos/plugin-evm\"\n\nenvironment:\n  type: e2b\n\nsetup:\n  mocks:\n    - service: \"github-service\"\n      method: \"searchIssues\"\n      response:\n        - title: \"Implement Dark Mode\"\n          number: 123\n          state: \"open\"\n          labels: [\"feature\", \"ui\"]\n    - service: \"evm-service\"\n      method: \"getBalancesForAddress\"\n      response:\n        - chain: \"ethereum\"\n          balances:\n            - symbol: \"ETH\"\n              amount: \"2.5\"\n\nrun:\n  - name: \"Ask agent about roadmap\"\n    input: \"What new features are you planning to add?\"\n    agent_context:\n      conversation_id: \"roadmap-conversation\"\n      user_id: \"test-user\"\n    evaluations:\n      - type: \"trajectory_contains_action\"\n        action: \"github-service.searchIssues\"\n        description: \"Verify agent searched for issues\"\n      \n      - type: \"string_contains\"\n        value: \"Dark Mode\"\n        description: \"Verify agent mentioned the mocked issue\"\n      \n      - type: \"llm_judge\"\n        prompt: \"Did the agent provide a helpful and coherent response about new features?\"\n        expected: \"yes\"\n        description: \"Verify agent response quality\"\n\n  - name: \"Ask agent about wallet\"\n    input: \"What's my current wallet balance?\"\n    agent_context:\n      conversation_id: \"wallet-conversation\"\n      user_id: \"test-user\"\n    evaluations:\n      
- type: \"trajectory_contains_action\"\n        action: \"evm-service.getBalancesForAddress\"\n        description: \"Verify agent checked wallet balance\"\n      \n      - type: \"string_contains\"\n        value: \"2.5 ETH\"\n        description: \"Verify agent reported the correct balance\"\n      \n      - type: \"llm_judge\"\n        prompt: \"Did the agent clearly explain the wallet balance information?\"\n        expected: \"yes\"\n\n  - name: \"Multi-turn conversation\"\n    input: \"Can you help me with both my wallet and roadmap?\"\n    agent_context:\n      conversation_id: \"multi-turn-conversation\"\n      user_id: \"test-user\"\n    evaluations:\n      - type: \"trajectory_contains_action\"\n        action: \"evm-service.getBalancesForAddress\"\n      - type: \"trajectory_contains_action\"\n        action: \"github-service.searchIssues\"\n      - type: \"string_contains\"\n        value: \"ETH\"\n      - type: \"string_contains\"\n        value: \"Dark Mode\"\n      - type: \"llm_judge\"\n        prompt: \"Did the agent address both wallet and roadmap questions comprehensively?\"\n        expected: \"yes\"\n\njudgment:\n  strategy: all_pass\n```\n\n## Testing Strategy\n\n1. **Single Turn**: Test basic agent interaction and response\n2. **Multi-turn**: Test conversation context across steps\n3. **Action Tracking**: Verify agent uses appropriate actions\n4. **Response Quality**: Test LLM judge evaluation of responses\n5. **Performance**: Test response timing and metrics\n6. **Error Handling**: Test agent behavior with invalid inputs\n\n## Dependencies\n\n- Requires plugin system integration (Ticket 1)\n- Builds on advanced mocking system (Ticket 2)\n- Integrates with existing evaluation engine\n- Depends on agent runtime message processing",
      "createdAt": "2025-08-07T02:49:34Z",
      "closedAt": "2025-08-12T04:21:31Z",
      "state": "CLOSED",
      "commentCount": 1
    },
    {
      "id": "I_kwDOMT5cIs7Engk3",
      "title": "feat(scenarios): Implement conditional mocking and complex response structures",
      "author": "monilpat",
      "number": 5726,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Implement conditional mocking and complex response structures\n\n## Description\n\nThis ticket enhances the mocking system to support conditional responses based on input parameters and complex response structures with metadata. This enables realistic testing of service interactions like GitHub API calls or EVM transactions with proper request/response matching.\n\n## Acceptance Criteria\n\n1. Mock definitions support `when` clauses for conditional responses\n2. `when` clauses can match on method arguments, input parameters, or request context\n3. Mock responses support complex nested structures with metadata (timestamps, IDs, etc.)\n4. Multiple mock responses can be defined for the same service/method with different conditions\n5. Mock system provides clear logging of which mock was triggered and why\n6. Mock responses can include realistic error conditions and edge cases\n7. Support for dynamic response generation based on input parameters\n8. Mock validation ensures `when` clauses are syntactically correct\n\n## Technical Approach\n\n### 1. Enhanced Mock Schema\n```typescript\n// packages/cli/src/scenarios/schema.ts\nconst MockSchema = z.object({\n  service: z.string(),\n  method: z.string(),\n  when: z.object({\n    // Match on method arguments\n    args: z.array(z.any()).optional(),\n    // Match on specific argument values\n    input: z.record(z.any()).optional(),\n    // Match on request context\n    context: z.record(z.any()).optional(),\n    // Custom matching function\n    matcher: z.string().optional(), // JavaScript expression\n  }).optional(),\n  response: z.any(), // Can be function or static value\n  // For dynamic responses\n  responseFn: z.string().optional(), // JavaScript function\n  // Error simulation\n  error: z.object({\n    code: z.string(),\n    message: z.string(),\n  }).optional(),\n});\n```\n\n### 2. 
Mock Engine Implementation\n```typescript\n// packages/cli/src/scenarios/mock-engine.ts\nexport class MockEngine {\n  private mocks: MockDefinition[] = [];\n\n  addMock(mock: MockDefinition) {\n    this.mocks.push(mock);\n  }\n\n  async findMock(service: string, method: string, args: any[]): Promise<any> {\n    const candidates = this.mocks.filter(m => \n      m.service === service && m.method === method\n    );\n\n    for (const mock of candidates) {\n      if (await this.matchesCondition(mock, args)) {\n        this.logger.info(`Mock triggered: ${service}.${method} with condition: ${JSON.stringify(mock.when)}`);\n        return this.generateResponse(mock, args);\n      }\n    }\n\n    return null; // No mock found\n  }\n\n  private async matchesCondition(mock: MockDefinition, args: any[]): Promise<boolean> {\n    if (!mock.when) return true; // Default mock\n\n    // Match on arguments\n    if (mock.when.args) {\n      if (!this.deepEqual(args, mock.when.args)) return false;\n    }\n\n    // Match on input parameters\n    if (mock.when.input) {\n      const input = this.extractInputFromArgs(args);\n      if (!this.deepEqual(input, mock.when.input)) return false;\n    }\n\n    // Custom matcher function\n    if (mock.when.matcher) {\n      const matcherFn = new Function('args', 'input', mock.when.matcher);\n      return matcherFn(args, this.extractInputFromArgs(args));\n    }\n\n    return true;\n  }\n\n  private generateResponse(mock: MockDefinition, args: any[]): any {\n    if (mock.error) {\n      throw new Error(`${mock.error.code}: ${mock.error.message}`);\n    }\n\n    if (mock.responseFn) {\n      const responseFn = new Function('args', 'input', mock.responseFn);\n      return responseFn(args, this.extractInputFromArgs(args));\n    }\n\n    return mock.response;\n  }\n}\n```\n\n## Test Scenario\n\nCreate `advanced-mocking-test.scenario.yaml`:\n```yaml\nname: \"Advanced Mocking Test\"\ndescription: \"Tests conditional mocking and complex response 
structures\"\n\nplugins:\n  - \"@elizaos/plugin-github\"\n  - \"@elizaos/plugin-evm\"\n\nenvironment:\n  type: e2b\n\nsetup:\n  mocks:\n    # Conditional GitHub issue search\n    - service: \"github-service\"\n      method: \"searchIssues\"\n      when:\n        input:\n          labels: \"bug\"\n        matcher: \"input.labels.includes('bug')\"\n      response:\n        - title: \"Critical Bug Found\"\n          number: 456\n          state: \"open\"\n          labels: [\"bug\", \"critical\"]\n          created_at: \"2024-07-15T10:00:00Z\"\n\n    # Conditional GitHub issue search - different response\n    - service: \"github-service\"\n      method: \"searchIssues\"\n      when:\n        input:\n          labels: \"feature\"\n        matcher: \"input.labels.includes('feature')\"\n      response:\n        - title: \"New Feature Request\"\n          number: 789\n          state: \"open\"\n          labels: [\"feature\", \"enhancement\"]\n          created_at: \"2024-07-15T11:00:00Z\"\n\n    # Dynamic EVM balance response\n    - service: \"evm-service\"\n      method: \"getBalancesForAddress\"\n      when:\n        args: [\"0x1234567890abcdef\"]\n      responseFn: |\n        return {\n          chain: \"ethereum\",\n          address: args[0],\n          balances: [\n            { symbol: \"ETH\", amount: \"1.23\" },\n            { symbol: \"USDC\", amount: \"1000.00\" }\n          ],\n          last_updated: new Date().toISOString()\n        }\n\n    # Error simulation\n    - service: \"github-service\"\n      method: \"readFile\"\n      when:\n        input:\n          path: \"/docs/nonexistent.md\"\n      error:\n        code: \"FILE_NOT_FOUND\"\n        message: \"File does not exist\"\n\nrun:\n  - name: \"Test conditional GitHub search\"\n    input: \"Search for issues with bug label\"\n    evaluations:\n      - type: \"trajectory_contains_action\"\n        action: \"github-service.searchIssues\"\n      - type: \"string_contains\"\n        value: \"Critical Bug 
Found\"\n      - type: \"llm_judge\"\n        prompt: \"Did the agent correctly search for bug issues?\"\n        expected: \"yes\"\n\n  - name: \"Test dynamic EVM response\"\n    input: \"What's the balance for address 0x1234567890abcdef?\"\n    evaluations:\n      - type: \"trajectory_contains_action\"\n        action: \"evm-service.getBalancesForAddress\"\n      - type: \"string_contains\"\n        value: \"1.23 ETH\"\n      - type: \"string_contains\"\n        value: \"1000.00 USDC\"\n\n  - name: \"Test error handling\"\n    input: \"Read the file /docs/nonexistent.md\"\n    evaluations:\n      - type: \"trajectory_contains_action\"\n        action: \"github-service.readFile\"\n      - type: \"string_contains\"\n        value: \"File does not exist\"\n\njudgment:\n  strategy: all_pass\n```\n\n## Testing Strategy\n\n1. **Conditional Matching**: Test different responses based on input parameters\n2. **Dynamic Responses**: Test response generation based on arguments\n3. **Error Simulation**: Test error handling and reporting\n4. **Complex Structures**: Test nested response objects with metadata\n5. **Multiple Mocks**: Test multiple mocks for same service/method\n6. **Logging**: Verify mock selection is logged clearly\n\n## Dependencies\n\n- Builds on existing mock system in scenarios\n- Requires plugin system integration (Ticket 1)\n- Integrates with agent interaction testing (Ticket 3) ",
      "createdAt": "2025-08-07T02:49:00Z",
      "closedAt": "2025-08-12T04:21:45Z",
      "state": "CLOSED",
      "commentCount": 1
    },
    {
      "id": "I_kwDOMT5cIs7EngKo",
      "title": "feat(scenarios): Implement plugin specification and dynamic loading",
      "author": "monilpat",
      "number": 5725,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Implement plugin specification and dynamic loading\n\n## Description\n\nThis ticket implements plugin specification in scenario YAML files, allowing scenarios to declare which plugins are required for testing. This enables testing of agent behaviors that depend on specific plugins like `@elizaos/plugin-github` or `@elizaos/plugin-evm`. The system will dynamically load specified plugins during scenario execution and make their actions, providers, and services available to the agent.\n\n## Acceptance Criteria\n\n1. Scenario YAML supports a `plugins` array at the root level with string plugin names\n2. The `initializeAgent()` function respects scenario plugin specifications and loads them via `startAgent()`\n3. Plugin loading follows the same dependency resolution and error handling as the main CLI\n4. Scenarios can specify both string plugin names (`@elizaos/plugin-github`) and direct plugin objects\n5. Plugin loading errors are clearly reported with actionable guidance\n6. Default plugins (bootstrap, sql) are automatically included unless explicitly excluded via `exclude_defaults: true`\n7. Plugin conflicts are detected and reported during scenario validation\n8. Plugin initialization errors don't crash the scenario but are reported in results\n\n## Technical Approach\n\n### 1. Update Scenario Schema\n```typescript\n// packages/cli/src/scenarios/schema.ts\nconst ScenarioSchema = z.object({\n  name: z.string(),\n  description: z.string(),\n  plugins: z.array(z.string()).optional(), // e.g., [\"@elizaos/plugin-github\"]\n  exclude_defaults: z.boolean().optional(), // exclude bootstrap/sql\n  environment: EnvironmentSchema,\n  setup: SetupSchema.optional(),\n  run: z.array(RunStepSchema),\n  judgment: JudgmentSchema,\n});\n```\n\n### 2. 
Enhance Runtime Factory\n```typescript\n// packages/cli/src/scenarios/runtime-factory.ts\nexport async function initializeAgent(scenario: Scenario): Promise<IAgentRuntime> {\n  const character: Character = {\n    name: 'scenario-runner',\n    id: stringToUuid('scenario-runner'),\n    bio: 'A minimal character for running scenarios',\n    plugins: scenario.plugins || []\n  };\n\n  // Load default plugins unless excluded\n  if (!scenario.exclude_defaults) {\n    character.plugins.push('@elizaos/plugin-bootstrap', '@elizaos/plugin-sql');\n  }\n\n  const runtime = await startAgent(\n    encryptedCharacter(character),\n    server,\n    undefined,\n    character.plugins,\n    { isTestMode: true }\n  );\n\n  return runtime;\n}\n```\n\n### 3. Plugin Validation\n```typescript\n// packages/cli/src/scenarios/plugin-validator.ts\nexport async function validateScenarioPlugins(scenario: Scenario): Promise<ValidationResult[]> {\n  const results: ValidationResult[] = [];\n  \n  for (const pluginName of scenario.plugins || []) {\n    try {\n      const plugin = await loadAndPreparePlugin(pluginName);\n      if (!plugin) {\n        results.push({\n          type: 'error',\n          message: `Plugin '${pluginName}' could not be loaded`,\n          suggestion: 'Check if plugin is installed or built correctly'\n        });\n      }\n    } catch (error) {\n      results.push({\n        type: 'error', \n        message: `Failed to validate plugin '${pluginName}': ${error.message}`,\n        suggestion: 'Verify plugin dependencies and configuration'\n      });\n    }\n  }\n  \n  return results;\n}\n```\n\n## Test Scenario\n\nCreate `plugin-integration-test.scenario.yaml`:\n```yaml\nname: \"Plugin Integration Test\"\ndescription: \"Tests loading and using plugins specified in scenario YAML\"\n\nplugins:\n  - \"@elizaos/plugin-github\"\n  - \"@elizaos/plugin-evm\"\n\nenvironment:\n  type: e2b\n\nsetup:\n  mocks:\n    - service: \"github-service\"\n      method: \"searchIssues\"\n      
response:\n        - title: \"Test Issue\"\n          number: 123\n          state: \"open\"\n    - service: \"evm-service\"\n      method: \"getBalancesForAddress\"\n      response:\n        - chain: \"ethereum\"\n          balances:\n            - symbol: \"ETH\"\n              amount: \"1.23\"\n\nrun:\n  - name: \"Test GitHub plugin actions\"\n    input: \"Search for issues with label 'bug'\"\n    evaluations:\n      - type: \"trajectory_contains_action\"\n        action: \"github-service.searchIssues\"\n        description: \"Verify GitHub plugin action was executed\"\n      \n      - type: \"string_contains\"\n        value: \"Test Issue\"\n        description: \"Verify agent found the mocked issue\"\n\n  - name: \"Test EVM plugin actions\"\n    input: \"What's my wallet balance?\"\n    evaluations:\n      - type: \"trajectory_contains_action\"\n        action: \"evm-service.getBalancesForAddress\"\n        description: \"Verify EVM plugin action was executed\"\n      \n      - type: \"string_contains\"\n        value: \"1.23 ETH\"\n        description: \"Verify agent reported the mocked balance\"\n\njudgment:\n  strategy: all_pass\n```\n\n## Testing Strategy\n\n1. **Plugin Loading Test**: Verify plugins load without errors\n2. **Action Availability Test**: Confirm agent can use plugin actions\n3. **Error Handling Test**: Test with non-existent plugin\n4. **Default Plugin Test**: Verify bootstrap/sql are included by default\n5. **Exclusion Test**: Test `exclude_defaults: true` behavior\n\n## Dependencies\n\n- Fixes the `startAgent` hanging issue (#5719) to enable plugin testing\n- Builds on existing `loadAndPreparePlugin` functionality\n- Integrates with current scenario execution flow ",
      "createdAt": "2025-08-07T02:48:08Z",
      "closedAt": "2025-08-12T04:21:13Z",
      "state": "CLOSED",
      "commentCount": 1
    },
    {
      "id": "I_kwDOMT5cIs7FcGEm",
      "title": "feat(scenarios): Add Step Count Evaluator",
      "author": "monilpat",
      "number": 5761,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Add Step Count Evaluator\n\nLinks: [Issue #5726](https://github.com/elizaOS/eliza/issues/5726)\n\n## Summary\nAdd an evaluator that asserts on the number of agent/tool/action steps taken to complete a scenario step. This encourages concise trajectories and prevents thrashing.\n\n## Goals\n- Count actions/tools invoked during response generation for a step\n- Assert against `max_steps`, `min_steps`, optional `target_steps`\n\n## Acceptance Criteria\n1. New evaluator type `step_count`\n2. Step count derived from trajectory/action memories queried via `AgentRuntime.getMemories`\n3. Evaluator passes if within bounds and returns a clear message\n\n## Schema Changes\nEdit `packages/cli/src/commands/scenario/src/schema.ts`:\n\n```ts\nconst StepCountEvaluationSchema = BaseEvaluationSchema.extend({\n  type: z.literal('step_count'),\n  max_steps: z.number().optional(),\n  min_steps: z.number().optional(),\n  target_steps: z.number().optional(),\n  // optional filter by action name prefix to scope counting\n  action_prefix: z.string().optional(),\n});\n```\n\n## Evaluation Implementation\nLeverage the existing `TrajectoryContainsActionEvaluator` patterns.\n\n```ts\nclass StepCountEvaluator implements Evaluator {\n  async evaluate(params: EvaluationSchema, _runResult: ExecutionResult, runtime: AgentRuntime): Promise<EvaluationResult> {\n    if (params.type !== 'step_count') throw new Error('Mismatched evaluator');\n    try {\n      const memories = await runtime.getMemories({\n        tableName: 'messages', agentId: runtime.agentId, count: 200, unique: false,\n      });\n      const actionResults = memories.filter(m => m?.type === 'messages' && m.content?.type === 'action_result');\n      const filtered = params.action_prefix\n        ? actionResults.filter(m => (m.content?.actionName ?? 
'').startsWith(params.action_prefix!))\n        : actionResults;\n      const steps = filtered.length;\n\n      const tooMany = params.max_steps != null && steps > params.max_steps;\n      const tooFew = params.min_steps != null && steps < params.min_steps;\n      const success = !(tooMany || tooFew);\n      return {\n        success,\n        message: `Steps taken: ${steps} (min=${params.min_steps ?? '-'}, target=${params.target_steps ?? '-'}, max=${params.max_steps ?? '-'})`,\n      };\n    } catch (error: any) {\n      return { success: false, message: `Step count evaluation failed: ${error.message ?? String(error)}` };\n    }\n  }\n}\n```\n\nRegister in `EvaluationEngine` constructor:\n\n```ts\nthis.register('step_count', new StepCountEvaluator());\n```\n\n## Example Usage\n\n```yaml\nevaluations:\n  - type: step_count\n    max_steps: 3\n    action_prefix: \"github-service.\"\n```\n\n## Tests\n- Unit: evaluator result on synthetic memories\n- Integration: scenario that triggers 1-2 simple actions and validates bound\n\n## Notes\n- Future enhancement: count only steps within the current scenario step window by tagging messages with a scenario/step id context field.\n\n\n",
      "createdAt": "2025-08-12T04:27:34Z",
      "closedAt": null,
      "state": "OPEN",
      "commentCount": 0
    },
    {
      "id": "I_kwDOMT5cIs7FcGCu",
      "title": "feat(scenarios): Add Consistency Evaluator",
      "author": "monilpat",
      "number": 5760,
      "repository": "elizaos/eliza",
      "body": "# feat(scenarios): Add Consistency Evaluator\n\nLinks: [Issue #5726](https://github.com/elizaOS/eliza/issues/5726)\n\n## Summary\nAdd an evaluator that runs the same step multiple times and asserts consistency over a chosen metric (response content, length, execution time, token counts). Useful for detecting non-determinism and regression in prompt/rules.\n\n## Goals\n- Re-run a step N times\n- Compare result metrics across runs\n- Allow tolerance configuration\n\n## Acceptance Criteria\n1. New evaluator type `consistency_check`\n2. Supports metrics: `response_length` | `execution_time` | `token_count`\n3. Configurable `runs` (>=2) and `tolerance` (absolute or percentage)\n4. Consolidated pass/fail with clear variance message\n\n## Schema Changes\nEdit `packages/cli/src/commands/scenario/src/schema.ts`:\n\n```ts\nconst ConsistencyEvaluationSchema = BaseEvaluationSchema.extend({\n  type: z.literal('consistency_check'),\n  runs: z.number().min(2),\n  metric: z.enum(['response_length', 'execution_time', 'token_count']),\n  tolerance: z.number().min(0), // if < 1 treat as proportion, else absolute\n});\n```\n\n## Orchestration Changes\nAdd a helper to re-execute the current step within the selected environment provider without re-running the entire scenario:\n\n- Expose a `runSingleStep(step, scenarioContext)` on providers that mirrors the logic in `run`, returning an `ExecutionResult`.\n- In `EvaluationEngine`, when encountering `consistency_check`, call this helper to collect multiple `ExecutionResult`s for the same step.\n- Fallback: If helper is not implemented, re-run `run([step])` via a shim.\n\n## Evaluation Implementation\n\n```ts\nclass ConsistencyEvaluator implements Evaluator {\n  constructor(private rerunStep: (times: number) => Promise<ExecutionResult[]>) {}\n\n  async evaluate(params: EvaluationSchema, runResult: ExecutionResult): Promise<EvaluationResult> {\n    if (params.type !== 'consistency_check') throw new Error('Mismatched 
evaluator');\n\n    const runs = await this.rerunStep(params.runs);\n    const values = runs.map(r => {\n      switch (params.metric) {\n        case 'response_length': return r.stdout.length;\n        case 'execution_time': return r.durationMs ?? 0;\n        case 'token_count': {\n          const llm = r.metrics?.llm ?? [];\n          return llm.reduce((sum, m) => sum + (m.totalTokens ?? (m.promptTokens ?? 0) + (m.completionTokens ?? 0)), 0);\n        }\n      }\n    });\n\n    const min = Math.min(...values);\n    const max = Math.max(...values);\n    const spread = max - min;\n    const baseline = values[0] || 1;\n    const proportion = baseline ? spread / baseline : spread;\n    const threshold = params.tolerance < 1 ? params.tolerance : params.tolerance / baseline;\n    const success = proportion <= threshold;\n\n    return {\n      success,\n      message: `Consistency ${success ? 'ok' : 'failed'} for ${params.metric}: min=${min}, max=${max}, spread=${spread}, tolerance=${params.tolerance}`,\n    };\n  }\n}\n```\n\nRegister in `EvaluationEngine` with a provider-specific `rerunStep` function wired from the scenario runner.\n\n## Example Usage\n\n```yaml\nevaluations:\n  - type: consistency_check\n    runs: 3\n    metric: execution_time\n    tolerance: 0.2   # 20% spread allowed\n```\n\n## Tests\n- Unit: evaluator math on synthetic values\n- Integration: deterministic small prompt with low tolerance; non-deterministic with higher tolerance\n\n## Notes\n- This evaluator requires access to the step context. Implementation will pass a `rerunStep(times)` closure into the evaluator at construction time.\n\n\n",
      "createdAt": "2025-08-12T04:27:29Z",
      "closedAt": null,
      "state": "OPEN",
      "commentCount": 0
    }
  ],
  "topPRs": [
    {
      "id": "PR_kwDOMT5cIs6jLjXe",
      "title": "build: update checkout action to v5",
      "author": "rejected-l",
      "number": 5762,
      "body": "Bumps checkout to v5 for future-proofing against Node 24 runner updates. Requires runner v2.327.1+. Workflows compile the same.\n\nMore info: https://github.com/actions/checkout/releases/tag/v5.0.0",
      "repository": "elizaos/eliza",
      "createdAt": "2025-08-12T05:29:45Z",
      "mergedAt": null,
      "additions": 41,
      "deletions": 41
    }
  ],
  "codeChanges": {
    "additions": 0,
    "deletions": 0,
    "files": 0,
    "commitCount": 1
  },
  "completedItems": [],
  "topContributors": [
    {
      "username": "wtfsayo",
      "avatarUrl": "https://avatars.githubusercontent.com/u/82053242?u=98209a1f10456f42d4d2fa71db4d5bf4a672cbc3&v=4",
      "totalScore": 80.91504956158522,
      "prScore": 80.91504956158522,
      "issueScore": 0,
      "reviewScore": 0,
      "commentScore": 0,
      "summary": null
    },
    {
      "username": "rejected-l",
      "avatarUrl": "https://avatars.githubusercontent.com/u/99460023?u=977f49541583c40f4fc5f6a9f11ca6c6a78b362a&v=4",
      "totalScore": 26.67920303898299,
      "prScore": 26.67920303898299,
      "issueScore": 0,
      "reviewScore": 0,
      "commentScore": 0,
      "summary": null
    },
    {
      "username": "monilpat",
      "avatarUrl": "https://avatars.githubusercontent.com/u/15067321?v=4",
      "totalScore": 10.438,
      "prScore": 0,
      "issueScore": 10,
      "reviewScore": 0,
      "commentScore": 0.43799999999999994,
      "summary": null
    }
  ],
  "newPRs": 1,
  "mergedPRs": 0,
  "newIssues": 5,
  "closedIssues": 3,
  "activeContributors": 2
}