## Prerequisites
- An OpenReward account and API key
- An API key for your model provider (Anthropic, OpenAI, Google, or OpenRouter)
- Python 3.10+
## Installation

Some agents require an additional CLI to be installed:
| Agent | Additional requirement |
|---|---|
claude-code | Claude Code v2.1.88+ |
codex | Codex CLI v0.121.0+ |
gemini | Gemini CLI |
resum, react | None (API-only, no CLI needed) |
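The install command itself is not shown on this page; assuming the package is published on PyPI under the name `firehorse` (an assumption, not confirmed here), installation would look like:

```shell
# Package name is assumed; adjust to the actual distribution name
pip install firehorse
```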
## Your First Evaluation
Set your API keys, then run your first evaluation. Results are written to `./results`.
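A minimal first run might look like the following; the `firehorse` entry-point name and the environment-variable names are assumptions, not confirmed by this page:

```shell
# Keys for OpenReward and your model provider (variable names assumed)
export OPENREWARD_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...

# Run the default agent (resum) on a sample environment, writing results to ./results
firehorse --env GeneralReasoning/CTF \
  --model anthropic/claude-sonnet-4-6 \
  --output-dir ./results
```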
## Available Agents
Firehorse ships five agent harnesses, each with a different architecture:

| Agent | Approach | Description |
|---|---|---|
resum | API + context compaction | Default agent. Implements a ReAct loop with 3-layer context compaction for long trials. Multi-provider support. |
claude-code | Subprocess + MCP | Launches Claude Code CLI per trial. Disables filesystem builtins and replaces them with sandboxed MCP tools. Supports extended thinking via --effort. |
codex | Subprocess + MCP | Launches OpenAI Codex CLI. Uses a single bash tool surface. Read-only filesystem sandbox mode. |
gemini | Subprocess + MCP | Launches Google Gemini CLI. Pre-builds tool specs to avoid CLI discovery timeout. |
react | API-direct | Direct LLM API integration with a straightforward reason-act loop. Supports Anthropic, OpenAI, Google, and OpenRouter. No subprocess overhead. |
## Key Options
| Option | Description | Default |
|---|---|---|
--env | Environment name (e.g. GeneralReasoning/CTF) | Required |
--agent | Agent type (resum, claude-code, codex, react, gemini) | resum |
--model | Model identifier with provider prefix (e.g. anthropic/claude-sonnet-4-6, openrouter/deepseek/deepseek-v3.2) | Required |
--effort | Reasoning depth: none, low, medium, high, max, xhigh | Provider default |
--n-concurrent | Number of parallel trials | 1 |
--max-tasks | Limit number of tasks to evaluate | All tasks |
--max-turns | Maximum tool calls per trial | 100 |
--split | Data split to evaluate (train, test, validation) | train |
--variant | Environment variant | Default variant |
--output-dir | Directory for results and trajectory logs | ./output |
--provider-url | Custom model provider endpoint | — |
--secrets | Extra secrets as KEY=VALUE pairs | — |
## Model Identifiers

Models are specified with a provider prefix, e.g. `anthropic/claude-sonnet-4-6` or `openrouter/deepseek/deepseek-v3.2`.

## Effort Levels
The `--effort` flag controls reasoning depth and maps to each provider's native mechanism:
| Effort | Anthropic | OpenAI | Google |
|---|---|---|---|
low | Adaptive thinking (low) | reasoning_effort: low | thinking_level: low |
medium | Adaptive thinking (medium) | reasoning_effort: medium | thinking_level: medium |
high | Adaptive thinking (high) | reasoning_effort: high | thinking_level: high |
max | Max thinking tokens | — | — |
xhigh | — | reasoning_effort: xhigh | — |
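For instance, selecting OpenAI's deepest setting (the `firehorse` entry-point name and the model id are placeholders/assumptions):

```shell
# xhigh is only available for OpenAI models (see the table above)
firehorse --env GeneralReasoning/CTF \
  --agent react \
  --model openai/MODEL_ID \
  --effort xhigh
```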
## Understanding the Output
Firehorse produces three types of output per trial:

- Result JSON — final metrics: total reward, tool call count, token usage, API cost, duration
- Trajectory JSONL — full event log capturing reasoning steps, tool calls, and tool results
- Aggregate summary — statistics across all trials
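As a rough sketch of inspecting a trajectory log (the event names and JSONL shape below are assumptions, not a documented schema), each line is one JSON event that can be filtered with standard tools:

```shell
# Create a toy trajectory file to illustrate the assumed JSONL shape
cat > trajectory.jsonl <<'EOF'
{"type": "reasoning", "text": "inspect the target"}
{"type": "tool_call", "tool": "bash", "input": "ls"}
{"type": "tool_result", "output": "flag.txt"}
EOF

# Count tool-call events in the log
grep -c '"type": "tool_call"' trajectory.jsonl
```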
## Examples
### Evaluate with OpenRouter
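A sketch of an OpenRouter run, using flags from the options table above (the `firehorse` entry-point name is an assumption):

```shell
# react talks to OpenRouter directly via the API; run four trials in parallel
firehorse --env GeneralReasoning/CTF \
  --agent react \
  --model openrouter/deepseek/deepseek-v3.2 \
  --n-concurrent 4
```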
### High-effort Claude Code evaluation
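For example (entry-point name assumed), enabling extended thinking on the claude-code harness via `--effort`:

```shell
firehorse --env GeneralReasoning/CTF \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6 \
  --effort high
```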
### Codex on a custom environment
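Along these lines (entry-point name assumed; the environment, model id, and secret are placeholders):

```shell
firehorse --env YourOrg/YourEnv \
  --agent codex \
  --model openai/MODEL_ID \
  --secrets MY_SECRET=value
```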
## Next Steps
- **Harness Toolsets**: Configure agent-native tool surfaces for your environments
- **Your First Evaluation**: Build a custom evaluation environment
- **Building Agentic Environments**: Create sandbox-based environments for agent tasks
- **Using Toolsets**: Compose reusable tool collections into environments

