Firehorse is a library for running agent harnesses against any OpenReward environment. It works by composing the appropriate harness toolset with the environment, connecting the harness agent, and orchestrating the agent loop.

Prerequisites

  • An OpenReward account and API key
  • An API key for your model provider (Anthropic, OpenAI, Google, or OpenRouter)
  • Python 3.10+

Installation

pip install firehorse-cli
For specific agent types, you may also need the corresponding CLI tool installed:
| Agent | Additional requirement |
| --- | --- |
| claude-code | Claude Code v2.1.88+ |
| codex | Codex CLI v0.121.0+ |
| gemini | Gemini CLI |
| resum, react | None (API-only, no CLI needed) |

Your First Evaluation

Set your API keys:
export OPENREWARD_API_KEY='your-openreward-api-key'
export ANTHROPIC_API_KEY='your-anthropic-api-key'
Run an evaluation:
firehorse \
  --env GeneralReasoning/terminal-bench-2-verified \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6 \
  --split test \
  --max-tasks 5 \
  --output-dir ./results
This launches Claude Code as a subprocess against the terminal-bench-2-verified environment, running 5 tasks from the test split. Results are written to ./results.

Available Agents

Firehorse ships five agent harnesses, each with a different architecture:
| Agent | Approach | Description |
| --- | --- | --- |
| resum | API + context compaction | Default agent. Implements a ReAct loop with 3-layer context compaction for long trials. Multi-provider support. |
| claude-code | Subprocess + MCP | Launches Claude Code CLI per trial. Disables filesystem builtins and replaces them with sandboxed MCP tools. Supports extended thinking via --effort. |
| codex | Subprocess + MCP | Launches OpenAI Codex CLI. Uses a single bash tool surface. Read-only filesystem sandbox mode. |
| gemini | Subprocess + MCP | Launches Google Gemini CLI. Pre-builds tool specs to avoid CLI discovery timeout. |
| react | API-direct | Direct LLM API integration with a straightforward reason-act loop. Supports Anthropic, OpenAI, Google, and OpenRouter. No subprocess overhead. |
Subprocess + MCP agents (claude-code, codex, gemini) launch the respective CLI as a child process and proxy environment tools via MCP. The agent’s built-in filesystem tools are disabled and replaced with sandbox-backed equivalents. API-direct agents (resum, react) call LLM APIs directly and execute tool calls via the OpenReward session. No local CLI is required.
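The split between the two architectures can be sketched as a simple lookup. This is illustrative only, not Firehorse internals; the labels are shorthand for the two execution paths described above.

```python
# The two agent architectures described above, modeled as a dispatch.
SUBPROCESS_MCP = {"claude-code", "codex", "gemini"}  # CLI child process, tools proxied via MCP
API_DIRECT = {"resum", "react"}                      # direct LLM API calls, tools via the session

def architecture(agent: str) -> str:
    """Classify an agent name into its execution architecture."""
    if agent in SUBPROCESS_MCP:
        return "subprocess+mcp"
    if agent in API_DIRECT:
        return "api-direct"
    raise ValueError(f"unknown agent: {agent}")
```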

Key Options

| Option | Description | Default |
| --- | --- | --- |
| --env | Environment name (e.g. GeneralReasoning/CTF) | Required |
| --agent | Agent type (resum, claude-code, codex, react, gemini) | resum |
| --model | Model identifier with provider prefix (e.g. anthropic/claude-sonnet-4-6, openrouter/deepseek/deepseek-v3.2) | Required |
| --effort | Reasoning depth: none, low, medium, high, max, xhigh | Provider default |
| --n-concurrent | Number of parallel trials | 1 |
| --max-tasks | Limit number of tasks to evaluate | All tasks |
| --max-turns | Maximum tool calls per trial | 100 |
| --split | Data split to evaluate (train, test, validation) | train |
| --variant | Environment variant | Default variant |
| --output-dir | Directory for results and trajectory logs | ./output |
| --provider-url | Custom model provider endpoint | |
| --secrets | Extra secrets as KEY=VALUE pairs | |
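If you script evaluations from Python, assembling the command line from these options is straightforward. A minimal sketch, assuming only the flags listed in the table above (the helper name is hypothetical):

```python
def build_argv(env: str, model: str, **opts) -> list[str]:
    """Assemble a firehorse invocation; keyword names map to --kebab-case flags."""
    argv = ["firehorse", "--env", env, "--model", model]
    for key, value in opts.items():
        argv += [f"--{key.replace('_', '-')}", str(value)]
    return argv
```

Pass the result to `subprocess.run(argv)` to launch the evaluation.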

Model Identifiers

Models are specified with a provider prefix:
# Anthropic
--model anthropic/claude-sonnet-4-6

# OpenAI
--model openai/gpt-5.4

# Google
--model google/gemini-2.5-flash

# OpenRouter (any model)
--model openrouter/deepseek/deepseek-v3.2
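Note that only the first slash separates provider from model, which is what lets OpenRouter identifiers carry their own nested path. A sketch of that parsing rule (illustrative, not Firehorse's code):

```python
def split_model_id(model: str) -> tuple[str, str]:
    """Split 'provider/model' on the first slash only, so identifiers like
    'openrouter/deepseek/deepseek-v3.2' keep their nested model path intact."""
    provider, _, name = model.partition("/")
    if not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name
```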

Effort Levels

The --effort flag controls reasoning depth and maps to each provider’s native mechanism:
| Effort | Anthropic | OpenAI | Google |
| --- | --- | --- | --- |
| low | Adaptive thinking (low) | reasoning_effort: low | thinking_level: low |
| medium | Adaptive thinking (medium) | reasoning_effort: medium | thinking_level: medium |
| high | Adaptive thinking (high) | reasoning_effort: high | thinking_level: high |
| max | Max thinking tokens | | |
| xhigh | | reasoning_effort: xhigh | |
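A sketch of how the table's OpenAI and Google columns could translate an --effort value into the provider-native request field. This is illustrative only; Anthropic's adaptive-thinking mapping is omitted because the table names no concrete parameter for it.

```python
def effort_kwargs(provider: str, effort: str) -> dict:
    """Map an --effort value to the provider-native field named in the table."""
    if provider == "openai" and effort in {"low", "medium", "high", "xhigh"}:
        return {"reasoning_effort": effort}
    if provider == "google" and effort in {"low", "medium", "high"}:
        return {"thinking_level": effort}
    raise ValueError(f"no mapping for {provider}/{effort}")
```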

Understanding the Output

Firehorse produces three types of output per trial:
  • Result JSON — final metrics: total reward, tool call count, token usage, API cost, duration
  • Trajectory JSONL — full event log capturing reasoning steps, tool calls, and tool results
  • Aggregate summary — statistics across all trials
Example output structure:
results/
├── trial_0/
│   ├── result.json          # Final metrics
│   ├── trajectory.jsonl     # Full event log
│   └── rewards.jsonl        # Timestamped reward signals
├── trial_1/
│   └── ...
└── summary.json             # Aggregate statistics
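Given this layout, you can roll your own aggregation over the per-trial result files. A minimal sketch; the `reward` field name is an assumption about the result.json schema, shown for illustration only.

```python
import json
from pathlib import Path

def summarize(results_dir: str) -> dict:
    """Aggregate trial_*/result.json files into simple summary statistics.
    Assumes each result.json contains a numeric 'reward' field (hypothetical)."""
    rewards = []
    for result_file in sorted(Path(results_dir).glob("trial_*/result.json")):
        rewards.append(json.loads(result_file.read_text())["reward"])
    return {
        "n_trials": len(rewards),
        "mean_reward": sum(rewards) / len(rewards) if rewards else 0.0,
    }
```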

Examples

Evaluate with OpenRouter

export OPENROUTER_API_KEY='your-key'

firehorse \
  --env GeneralReasoning/CTF \
  --agent react \
  --model openrouter/moonshotai/kimi-k2.6 \
  --n-concurrent 4 \
  --max-tasks 20

High-effort Claude Code evaluation

export ANTHROPIC_API_KEY='your-key'

firehorse \
  --env Eigent/SETA \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6 \
  --effort high \
  --n-concurrent 8 \
  --split train

Codex on a custom environment

export OPENAI_API_KEY='your-key'

firehorse \
  --env MyOrg/MyEnv \
  --agent codex \
  --model openai/gpt-5.4 \
  --variant hard \
  --max-turns 50

Next Steps

  • Harness Toolsets: Configure agent-native tool surfaces for your environments
  • Your First Evaluation: Build a custom evaluation environment
  • Building Agentic Environments: Create sandbox-based environments for agent tasks
  • Using Toolsets: Compose reusable tool collections into environments