OpenReward provides infrastructure for hosting and running AI agent environments. This infrastructure has two main components:
  • Environments specify tasks, tools, rewards and basic state.
  • Sandboxes (optional) give the agent access to a computer in the environment.
Not all environments require sandboxes, but they are increasingly common as agents are trained on computer-based tasks - a setting often called agentic reinforcement learning.

What are environments?

An environment is a simulated space where an agent can perform a task using tools and resources, and receive rewards (positive or negative) for its actions. In the language of reinforcement learning, the underlying abstraction is a POMDP: a system the agent interacts with, the actions available to the agent, and the rewards the agent receives. On OpenReward, we treat environments as passive FastAPI-style servers that an agent can interact with. This maintains a strict separation of concerns between the agent and the environment, allowing for easier development and more robustness to changes in agent harnesses. The standard for how agents talk to environments is called ORS, which you can read about in full detail here. Briefly, an ORS server provides:
  • Tasks - tasks are the core problems to be solved, including the initial prompts
  • Tools - tools are the actions an agent can take in the environment
  • Splits - splits organise tasks into groups, e.g. for training and evaluation
  • Statefulness - agent actions in a session can affect state
  • Tool Results - including tool feedback, rewards and termination signals
But at its heart, an ORS service is a web service that you can call. You can run these servers locally, or in the cloud. OpenReward is a managed service for hosting ORS environments on the internet.
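The concepts above can be sketched in-process. The sketch below is illustrative only: the class and field names (`Task`, `ToolResult`, `CounterEnvironment`, and so on) are assumptions for the example, not the real ORS schema, and a real environment would expose these over HTTP rather than as Python objects.

```python
from dataclasses import dataclass

# Illustrative in-process sketch of the concepts an ORS server exposes:
# tasks, tools, splits, state, and tool results carrying reward/termination.

@dataclass
class Task:
    task_id: str
    prompt: str   # initial prompt shown to the agent
    split: str    # e.g. "train" or "eval"

@dataclass
class ToolResult:
    feedback: str   # tool output fed back to the agent
    reward: float   # scalar reward for this action
    done: bool      # termination signal

class CounterEnvironment:
    """Stateful toy environment: the agent must raise a counter to 3."""

    def __init__(self):
        self.tasks = [Task("t1", "Increment the counter to 3.", "train")]
        self.count = 0   # session state mutated by agent actions

    def list_tasks(self, split):
        # splits organise tasks into groups
        return [t for t in self.tasks if t.split == split]

    def call_tool(self, name):
        # tools are the actions an agent can take in the environment
        if name == "increment":
            self.count += 1
            done = self.count >= 3
            return ToolResult(f"count={self.count}", 1.0 if done else 0.0, done)
        return ToolResult(f"unknown tool: {name}", 0.0, False)

env = CounterEnvironment()
task = env.list_tasks("train")[0]
results = [env.call_tool("increment") for _ in range(3)]
```

The agent only ever sees prompts and tool results; the environment keeps the state and decides rewards and termination.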

What are sandboxes?

A sandbox is an isolated container for running code. Sandboxes are often used in conjunction with environments when an agent needs access to a computer. For example, a software engineering task would require interacting with a filesystem, writing and executing files, and more. We run agent code in sandboxes, rather than on the environment's own compute, to isolate the agent's actions and prevent them from interfering with the running of the environment. A sandbox provides:
  • Isolated execution environment
  • Configurable resources (CPU, memory)
  • Network isolation options
  • Automatic cleanup after use
In the context of ORS, a sandbox is usually initialised when a new session is created, and torn down when the session ends. But the exact use can vary; for example, some environments may only use a sandbox for a particular tool execution instead of the entire session. ORS environments can be run with any sandbox provider - it is fully interoperable. OpenReward provides our own sandbox solution, but we encourage you to use whatever works best for you - and these docs contain detailed guides on how to use environments with popular providers such as Daytona, E2B and Modal.
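The session-scoped lifecycle can be sketched as follows. This is a local stand-in, not a real provider: a throwaway working directory with subprocess execution plays the role of the sandbox, whereas Daytona, E2B, Modal or OpenReward's own solution would give true container isolation behind an API. The `LocalSandbox` name and its methods are assumptions for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Stand-in sandbox: a temp directory plus subprocess execution,
# mirroring the create-on-session-start / teardown-on-session-end
# lifecycle an ORS environment would drive against a real provider.

class LocalSandbox:
    def __enter__(self):
        # "create" the sandbox when the session starts
        self.root = Path(tempfile.mkdtemp(prefix="sbx_"))
        return self

    def run(self, command: str) -> str:
        # execute a command inside the sandbox's working directory
        out = subprocess.run(
            command, shell=True, cwd=self.root,
            capture_output=True, text=True, timeout=30,
        )
        return out.stdout

    def __exit__(self, *exc):
        # automatic cleanup when the session ends
        shutil.rmtree(self.root, ignore_errors=True)

with LocalSandbox() as sbx:
    sbx.run("echo hello > greeting.txt")
    contents = sbx.run("cat greeting.txt")
root_after = sbx.root   # the directory is gone once the session ends
```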

How environments and sandboxes work together

We have briefly touched on how these two pillars of infrastructure interact; a few common patterns are outlined below.

Pattern 1: Environment-Only

For environments that don't need code execution:
Agent → ORS Environment
Examples include mathematics or multiple-choice question-answering benchmarks where the agent is not given access to a computer.

Pattern 2: Environment + Sandboxes

For environments requiring code execution:
Agent → ORS Environment → Sandbox
Examples include software engineering and knowledge-work benchmarks where the agent reads and writes to a filesystem. The agent connects to an ORS environment, which then initialises a sandbox and utilises it for certain tools in the environment (e.g. bash, write, read, grep).
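Pattern 2 can be sketched as an environment that owns the session and forwards file and shell tools to its sandbox. As before, the sandbox here is a local temp directory standing in for a real provider, and the `Session` class and tool names are illustrative assumptions.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Sketch of Pattern 2: the environment delegates certain tools
# (bash, write, read) to a sandbox scoped to the session.

class Session:
    def __init__(self):
        # a sandbox is initialised when the session is created
        self.sandbox = Path(tempfile.mkdtemp(prefix="session_"))

    def tool(self, name: str, **args) -> str:
        # tools the environment forwards to the sandbox
        if name == "write":
            (self.sandbox / args["path"]).write_text(args["content"])
            return "ok"
        if name == "read":
            return (self.sandbox / args["path"]).read_text()
        if name == "bash":
            out = subprocess.run(
                args["cmd"], shell=True, cwd=self.sandbox,
                capture_output=True, text=True, timeout=30,
            )
            return out.stdout
        raise ValueError(f"unknown tool: {name}")

    def close(self):
        # the sandbox is torn down when the session ends
        shutil.rmtree(self.sandbox, ignore_errors=True)

session = Session()
session.tool("write", path="hello.py", content="print('hi')")
listing = session.tool("bash", cmd="ls")
source = session.tool("read", path="hello.py")
session.close()
```

The agent never touches the sandbox directly; every action is mediated by the environment, which is what keeps the separation of concerns intact.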

Where is data stored?

Data on OpenReward can come from external sources, for example HuggingFace datasets or public buckets, or you can upload assets to the environment and use those. There are two main uses for data in ORS environments:
  • Environment data. This is data that powers the underlying environment: for example, tasks stored in a Parquet or .jsonl file, or information that the environment (but not the agent) has access to, such as ground-truth labels.
  • Sandbox data. This is data that the agent has access to for solving the task. For example, in a Kaggle competition environment, the agent might have access to a train and validation dataset on their machine to build models with.
We have a full set of docs on where data lives here. The key things to note about OpenReward storage are:
  • The hosted environment mounts storage at the location: /orwd_data/
  • Sandboxes can choose the path to mount the storage at
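As a sketch, environment data might be a tasks file read from the storage mount. On a hosted environment that mount is /orwd_data/; to keep the example self-contained we write a stand-in tasks.jsonl to a temp directory instead, and the file name and fields are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Sketch of loading environment-side data (prompts plus ground-truth
# labels the agent never sees) from mounted storage.

mount = Path(tempfile.mkdtemp())          # stands in for /orwd_data/
tasks_file = mount / "tasks.jsonl"
tasks_file.write_text(
    '{"task_id": "t1", "prompt": "2+2=?", "answer": "4"}\n'
    '{"task_id": "t2", "prompt": "3*3=?", "answer": "9"}\n'
)

def load_tasks(path: Path) -> list[dict]:
    # one JSON object per line, as in a typical .jsonl tasks file
    return [json.loads(line) for line in path.read_text().splitlines()]

tasks = load_tasks(tasks_file)
```

The `answer` field is environment data: the environment uses it to score the agent, but only the `prompt` would ever be sent to the agent.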

Deployment Flow

OpenReward deploys ORS servers, and we do this via the following flow:
1. The user writes environment code along with a Dockerfile
2. The user pushes their code to a GitHub repository
3. The user connects an OpenReward environment to their GitHub
4. OpenReward builds an image and deploys an environment server
5. The server automatically scales based on connections
OpenReward does not host code; that stays on GitHub. In essence, it does one job: take the code (along with any data) and host an API endpoint for it. That endpoint can then be used for training, distillation or evaluation.

Getting Started

Environments

Learn about environment servers

Sandboxes

Explore ephemeral execution containers

Storage

Configure cloud storage access

Build Your First Environment

Deploy your first environment