## Prerequisites
- An OpenReward account and API key
- An API key for your model provider (Anthropic, OpenAI, Google, or OpenRouter)
- Python 3.10+
## Installation

Some agents require an additional CLI to be installed:
| Agent | Additional requirement |
|---|---|
claude-code | Claude Code v2.1.88+ |
codex | Codex CLI v0.121.0+ |
gemini | Gemini CLI |
resum, react | None (API-only, no CLI needed) |
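The install command itself is not shown on this page; assuming the package is published on PyPI under the name `firehorse` (an assumption, not confirmed here), installation would look like:

```shell
# Package name is assumed; adjust to the actual distribution name
pip install firehorse
```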
## Your First Evaluation
Set your API keys, then run your first evaluation. Results are written to `./results`.
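A minimal first run might look like the following; the `firehorse` entry-point name and the environment-variable names are assumptions, not confirmed by this page:

```shell
# Keys for OpenReward and your model provider (variable names assumed)
export OPENREWARD_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...

# Run the default agent (resum) on a sample environment, writing results to ./results
firehorse --env GeneralReasoning/CTF \
  --model anthropic/claude-sonnet-4-6 \
  --output-dir ./results
```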
## Available Agents
Firehorse ships five agent harnesses, each with a different architecture:

| Agent | Approach | Description |
|---|---|---|
resum | API + context compaction | Default agent. Implements a ReAct loop with 3-layer context compaction for long trials. Multi-provider support. |
claude-code | Subprocess + MCP | Launches Claude Code CLI per trial. Disables filesystem builtins and replaces them with sandboxed MCP tools. Supports extended thinking via --effort. |
codex | Subprocess + MCP | Launches OpenAI Codex CLI. Uses a single bash tool surface. Read-only filesystem sandbox mode. |
gemini | Subprocess + MCP | Launches Google Gemini CLI. Pre-builds tool specs to avoid CLI discovery timeout. |
react | API-direct | Direct LLM API integration with a straightforward reason-act loop. Supports Anthropic, OpenAI, Google, and OpenRouter. No subprocess overhead. |
## Key Options
| Option | Description | Default |
|---|---|---|
--env | Environment name (e.g. GeneralReasoning/CTF) | Required |
--agent | Agent type (resum, claude-code, codex, react, gemini) | resum |
--model | Model identifier with provider prefix (e.g. anthropic/claude-sonnet-4-6, openrouter/deepseek/deepseek-v3.2) | Required |
--effort | Reasoning depth: none, low, medium, high, max, xhigh | Provider default |
--n-concurrent | Number of parallel trials | 1 |
--max-tasks | Limit number of tasks to evaluate | All tasks |
--max-turns | Maximum tool calls per trial | 100 |
--split | Data split to evaluate (train, test, validation) | train |
--variant | Environment variant | Default variant |
--output-dir | Directory for results and trajectory logs | ./output |
--provider-url | Custom model provider endpoint | — |
--secrets | Extra secrets as KEY=VALUE pairs | — |
## Model Identifiers

Models are specified with a provider prefix, e.g. `anthropic/claude-sonnet-4-6` or `openrouter/deepseek/deepseek-v3.2`.

## Effort Levels
The `--effort` flag controls reasoning depth and maps to each provider's native mechanism:
| Effort | Anthropic | OpenAI | Google |
|---|---|---|---|
low | Adaptive thinking (low) | reasoning_effort: low | thinking_level: low |
medium | Adaptive thinking (medium) | reasoning_effort: medium | thinking_level: medium |
high | Adaptive thinking (high) | reasoning_effort: high | thinking_level: high |
max | Max thinking tokens | — | — |
xhigh | — | reasoning_effort: xhigh | — |
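For instance, selecting OpenAI's deepest setting (the `firehorse` entry-point name and the model id are placeholders/assumptions):

```shell
# xhigh is only available for OpenAI models (see the table above)
firehorse --env GeneralReasoning/CTF \
  --agent react \
  --model openai/MODEL_ID \
  --effort xhigh
```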
## Understanding the Output
Firehorse produces three types of output per trial:

- Result JSON — final metrics: total reward, tool call count, token usage, API cost, duration
- Trajectory JSONL — full event log capturing reasoning steps, tool calls, and tool results
- Aggregate summary — statistics across all trials
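As a rough sketch of inspecting a trajectory log (the event names and JSONL shape below are assumptions, not a documented schema), each line is one JSON event that can be filtered with standard tools:

```shell
# Create a toy trajectory file to illustrate the assumed JSONL shape
cat > trajectory.jsonl <<'EOF'
{"type": "reasoning", "text": "inspect the target"}
{"type": "tool_call", "tool": "bash", "input": "ls"}
{"type": "tool_result", "output": "flag.txt"}
EOF

# Count tool-call events in the log
grep -c '"type": "tool_call"' trajectory.jsonl
```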
## Examples
### Evaluate with OpenRouter
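A sketch of an OpenRouter run, using flags from the options table above (the `firehorse` entry-point name is an assumption):

```shell
# react talks to OpenRouter directly via the API; run four trials in parallel
firehorse --env GeneralReasoning/CTF \
  --agent react \
  --model openrouter/deepseek/deepseek-v3.2 \
  --n-concurrent 4
```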
### High-effort Claude Code evaluation
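For example (entry-point name assumed), enabling extended thinking on the claude-code harness via `--effort`:

```shell
firehorse --env GeneralReasoning/CTF \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6 \
  --effort high
```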
### Codex on a custom environment
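Along these lines (entry-point name assumed; the environment, model id, and secret are placeholders):

```shell
firehorse --env YourOrg/YourEnv \
  --agent codex \
  --model openai/MODEL_ID \
  --secrets MY_SECRET=value
```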
## Next Steps
- **Harness Toolsets**: Configure agent-native tool surfaces for your environments
- **Your First Evaluation**: Build a custom evaluation environment
- **Building Agentic Environments**: Create sandbox-based environments for agent tasks
- **Using Toolsets**: Compose reusable tool collections into environments

