Firehorse is a library for running agent harnesses against any OpenReward environment. It works by composing the appropriate harness toolset with the environment, connecting the harness agent, and orchestrating the agent loop.
Prerequisites
- An OpenReward account and API key
- An API key for your model provider (Anthropic, OpenAI, Google, or OpenRouter)
- Python 3.10+
Installation
| Agent | Additional requirement |
|---|---|
| claude-code | Claude Code v2.1.88+ |
| codex | Codex CLI v0.121.0+ |
| gemini | Gemini CLI |
| resum, react | None (API-only, no CLI needed) |
Your First Evaluation
Set your API keys and run the evaluation. The results for this run are saved to ./results.
Available Agents
Firehorse ships five agent harnesses, each with a different architecture:

| Agent | Approach | Description |
|---|---|---|
| resum | API + context compaction | Default agent. Implements a ReAct loop with 3-layer context compaction for long trials. Multi-provider support. |
| claude-code | Subprocess + MCP | Launches Claude Code CLI per trial. Disables filesystem builtins and replaces them with sandboxed MCP tools. Supports extended thinking via --effort. |
| codex | Subprocess + MCP | Launches OpenAI Codex CLI. Uses a single bash tool surface. Read-only filesystem sandbox mode. |
| gemini | Subprocess + MCP | Launches Google Gemini CLI. Pre-builds tool specs to avoid CLI discovery timeout. |
| react | API-direct | Direct LLM API integration with a straightforward reason-act loop. Supports Anthropic, OpenAI, Google, and OpenRouter. No subprocess overhead. |
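The reason-act loop that the resum and react agents are described as running can be illustrated with a toy version. This is a minimal sketch of the general pattern, not Firehorse's implementation: the `run_react_loop` name, the `model` callable, and the action dictionary shape are all hypothetical, and context compaction is left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical action shape (not Firehorse's real message format):
#   {"type": "tool", "tool": <name>, "input": <text>}
#   {"type": "final", "answer": <text>}


def run_react_loop(
    model: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_turns: int = 100,
) -> Tuple[Optional[str], int]:
    """Alternate between asking the model for an action and running a tool.

    Returns (answer, tool_calls); answer is None if the budget runs out.
    """
    history = [{"role": "task", "content": task}]
    tool_calls = 0
    while tool_calls < max_turns:
        action = model(history)  # "reason": choose the next step
        if action["type"] == "final":
            return action["answer"], tool_calls
        observation = tools[action["tool"]](action["input"])  # "act"
        tool_calls += 1
        history.append({"role": "action", "content": action})
        history.append({"role": "observation", "content": observation})
    return None, tool_calls


def scripted_model(history: List[dict]) -> dict:
    # Stand-in for an LLM: call one tool, then answer with its result.
    observations = [h for h in history if h["role"] == "observation"]
    if not observations:
        return {"type": "tool", "tool": "upper", "input": "flag"}
    return {"type": "final", "answer": observations[-1]["content"]}


answer, calls = run_react_loop(scripted_model, {"upper": str.upper}, "shout")
print(answer, calls)  # FLAG 1
```

The loop counts tool calls rather than model calls, which mirrors how --max-turns is described (maximum tool calls per trial).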
Key Options
| Option | Description | Default |
|---|---|---|
| --env | Environment name (e.g. GeneralReasoning/CTF) | Required |
| --agent | Agent type (resum, claude-code, codex, react, gemini) | resum |
| --model | Model identifier with provider prefix (e.g. anthropic/claude-sonnet-4-6, openrouter/deepseek/deepseek-v3.2) | Required |
| --effort | Reasoning depth: none, low, medium, high, max, xhigh | Provider default |
| --n-concurrent | Number of parallel trials | 1 |
| --max-tasks | Limit number of tasks to evaluate | All tasks |
| --max-turns | Maximum tool calls per trial | 100 |
| --split | Data split to evaluate (train, test, validation) | train |
| --variant | Environment variant | Default variant |
| --output-dir | Directory for results and trajectory logs | ./output |
| --provider-url | Custom model provider endpoint | (none) |
| --secrets | Extra secrets as KEY=VALUE pairs | (none) |
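To make the defaults and allowed values in the table concrete, the sketch below models a subset of the options as a dataclass and renders them as a flag list. The `RunOptions` class and `to_flags` method are hypothetical helpers, not part of Firehorse, and the executable name is deliberately omitted because it is not shown on this page.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values copied from the options table above.
AGENTS = {"resum", "claude-code", "codex", "react", "gemini"}
EFFORTS = {"none", "low", "medium", "high", "max", "xhigh"}


@dataclass
class RunOptions:
    """Hypothetical container mirroring some of the documented flags."""

    env: str                          # --env (required)
    model: str                        # --model (required, provider-prefixed)
    agent: str = "resum"              # --agent
    effort: Optional[str] = None      # --effort (None = provider default)
    n_concurrent: int = 1             # --n-concurrent
    max_tasks: Optional[int] = None   # --max-tasks (None = all tasks)
    max_turns: int = 100              # --max-turns
    split: str = "train"              # --split
    output_dir: str = "./output"      # --output-dir

    def to_flags(self) -> List[str]:
        """Validate the options and render them as command-line flags."""
        if self.agent not in AGENTS:
            raise ValueError(f"unknown agent: {self.agent}")
        if self.effort is not None and self.effort not in EFFORTS:
            raise ValueError(f"unknown effort: {self.effort}")
        if "/" not in self.model:
            raise ValueError("model needs a provider prefix")
        flags = [
            "--env", self.env,
            "--agent", self.agent,
            "--model", self.model,
            "--n-concurrent", str(self.n_concurrent),
            "--max-turns", str(self.max_turns),
            "--split", self.split,
            "--output-dir", self.output_dir,
        ]
        # Optional flags are emitted only when explicitly set.
        if self.effort is not None:
            flags += ["--effort", self.effort]
        if self.max_tasks is not None:
            flags += ["--max-tasks", str(self.max_tasks)]
        return flags


opts = RunOptions(env="GeneralReasoning/CTF", model="anthropic/claude-sonnet-4-6")
print(" ".join(opts.to_flags()))
```

Leaving `effort` and `max_tasks` unset keeps the rendered flags aligned with the documented defaults (provider default and all tasks).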
Model Identifiers
Models are specified with a provider prefix, for example anthropic/claude-sonnet-4-6 or openrouter/deepseek/deepseek-v3.2.
Effort Levels
The --effort flag controls reasoning depth and maps to each provider's native mechanism:
| Effort | Anthropic | OpenAI | Google |
|---|---|---|---|
| low | Adaptive thinking (low) | reasoning_effort: low | thinking_level: low |
| medium | Adaptive thinking (medium) | reasoning_effort: medium | thinking_level: medium |
| high | Adaptive thinking (high) | reasoning_effort: high | thinking_level: high |
| max | Max thinking tokens | n/a | n/a |
| xhigh | n/a | reasoning_effort: xhigh | n/a |
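The mapping above can be expressed as a lookup keyed by effort level and provider. This is a sketch only: `split_model` and `native_effort` are hypothetical names, the third provider column is assumed to be Google, and the `openai` and `google` prefixes are assumptions (only `anthropic` and `openrouter` prefixes appear in the examples on this page).

```python
from typing import Optional, Tuple

# Effort table transcribed from the documentation; missing keys mean the
# provider has no mapping for that effort level.
EFFORT_TABLE = {
    "low": {
        "anthropic": "Adaptive thinking (low)",
        "openai": "reasoning_effort: low",
        "google": "thinking_level: low",  # column name assumed
    },
    "medium": {
        "anthropic": "Adaptive thinking (medium)",
        "openai": "reasoning_effort: medium",
        "google": "thinking_level: medium",
    },
    "high": {
        "anthropic": "Adaptive thinking (high)",
        "openai": "reasoning_effort: high",
        "google": "thinking_level: high",
    },
    "max": {"anthropic": "Max thinking tokens"},
    "xhigh": {"openai": "reasoning_effort: xhigh"},
}


def split_model(model_id: str) -> Tuple[str, str]:
    """Split 'provider/model-name' at the first slash only."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected provider/model, got {model_id!r}")
    return provider, name


def native_effort(model_id: str, effort: str) -> Optional[str]:
    """Return the provider-native setting, or None if there is no mapping."""
    provider, _ = split_model(model_id)
    return EFFORT_TABLE.get(effort, {}).get(provider)


print(split_model("openrouter/deepseek/deepseek-v3.2"))
print(native_effort("anthropic/claude-sonnet-4-6", "max"))
```

Splitting at the first slash only keeps nested identifiers such as openrouter/deepseek/deepseek-v3.2 intact after the provider prefix is removed.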
Understanding the Output
Firehorse produces three types of output per trial:
- Result JSON: final metrics (total reward, tool call count, token usage, API cost, duration)
- Trajectory JSONL: full event log capturing reasoning steps, tool calls, and tool results
- Aggregate summary: statistics across all trials
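As a rough illustration of how these outputs relate, the sketch below parses JSONL text and aggregates per-trial results into summary statistics. The field names (`total_reward`, `tool_calls`, `type`) and the sample records are invented for illustration; the actual Firehorse schema is not shown on this page and may differ.

```python
import json
from statistics import mean
from typing import Dict, List


def parse_jsonl(text: str) -> List[dict]:
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize(results: List[dict]) -> Dict[str, float]:
    """Aggregate per-trial results (hypothetical field names)."""
    rewards = [r["total_reward"] for r in results]
    return {
        "trials": len(results),
        "mean_reward": mean(rewards) if rewards else 0.0,
        "total_tool_calls": sum(r["tool_calls"] for r in results),
    }


# Made-up trajectory events and results, for demonstration only.
trajectory_text = """
{"type": "reasoning", "content": "inspect the binary"}
{"type": "tool_call", "tool": "bash", "input": "ls"}
{"type": "tool_result", "output": "flag.txt"}
"""
events = parse_jsonl(trajectory_text)
print(len(events), [e["type"] for e in events])

results = [
    {"total_reward": 1.0, "tool_calls": 12},
    {"total_reward": 0.0, "tool_calls": 30},
]
print(summarize(results))
```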
Examples
Evaluate with OpenRouter
High-effort Claude Code evaluation
Codex on a custom environment
Next Steps
- Harness Toolsets: Configure agent-native tool surfaces for your environments
- Your First Evaluation: Build a custom evaluation environment
- Building Agentic Environments: Create sandbox-based environments for agent tasks
- Using Toolsets: Compose reusable tool collections into environments

