Goals

In this documentation, you will:
  • Make an AIME 2024 evaluation environment.
  • Deploy the environment to OpenReward (optional).
  • Write an evaluation script.
  • Evaluate a model of your choice on this evaluation.

Prerequisites

  • An OpenReward account
  • An OpenReward API key
  • An API key and SDK for your model provider of choice (e.g. OpenAI, Anthropic, Google, OpenRouter)

Setup

Environments in OpenReward are written using the OpenReward Python library. You can install this library using pip or uv:
pip install openreward
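If you prefer uv, the equivalent command (using uv's pip-compatible interface) is:
uv pip install openreward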

Our evaluation: AIME 2024

The American Invitational Mathematics Examination (AIME) is a selective 15-question, 3-hour test given since 1983 to students who rank in the top 5% on the AMC 12 high school mathematics examination. More recently it has become a popular benchmark for the reasoning performance of language models. An example question from the AIME 2024 examination:
Let $x$, $y$ and $z$ be positive real numbers that satisfy the following system of equations:
$$\log_2\left(\frac{x}{yz}\right) = \frac{1}{2}$$
$$\log_2\left(\frac{y}{xz}\right) = \frac{1}{3}$$
$$\log_2\left(\frac{z}{xy}\right) = \frac{1}{4}$$
Then the value of $\left|\log_2(x^4y^3z^2)\right|$ is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.
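For reference, the answer can be derived by hand (this derivation is ours, not part of the dataset): writing $a=\log_2 x$, $b=\log_2 y$, $c=\log_2 z$, the system becomes $a-b-c=\tfrac{1}{2}$, $b-a-c=\tfrac{1}{3}$, $c-a-b=\tfrac{1}{4}$. Summing the three equations gives $a+b+c=-\tfrac{13}{12}$, from which $a=-\tfrac{7}{24}$, $b=-\tfrac{3}{8}$, $c=-\tfrac{5}{12}$, so $\left|\log_2(x^4y^3z^2)\right| = |4a+3b+2c| = \tfrac{25}{8}$ and $m+n=33$.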
We’ll implement this as an evaluation using OpenReward.

Building the evaluation

Initialise a project using the OpenReward CLI:
orwd init aime2024 --template basic
cd aime2024 && ls aime2024
We’ll replace the server.py file with the following:
from math_verify import parse, verify
import pandas as pd
from pydantic import BaseModel

from openreward.environments import Environment, JSONObject, Server, TextBlock, ToolOutput, tool

class AIME2024TaskSpec(BaseModel):
    id: str
    problem: str
    answer: str

class AnswerParams(BaseModel):
    answer: str

# Load the AIME 2024 problems and normalise each row to the AIME2024TaskSpec
# schema: drop the unused ID/Solution columns, add a string id, and stringify
# the answer.
test_tasks = pd.read_parquet("aime_2024_problems.parquet").to_dict(orient="records")

for i, task in enumerate(test_tasks):
    task.pop('ID')
    task.pop('Solution')
    task['id'] = str(i)
    task['Answer'] = str(task['Answer'])

    # Lowercase the remaining column names (e.g. 'Problem' -> 'problem').
    keys_to_change = [key for key in task.keys() if key != "id"]
    for key in keys_to_change:
        if key.lower() != key:
            task[key.lower()] = task.pop(key)


class AIME2024(Environment):
    """
    An environment for the AIME 2024 dataset
    """
    def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
        super().__init__(task_spec)
        self.config = AIME2024TaskSpec.model_validate(task_spec)

    @classmethod
    def list_tasks(cls, split: str) -> list[JSONObject]:
        if split == "train":
            return []
        elif split == "test":
            return test_tasks
        raise ValueError(f"Unknown split: {split}")

    @classmethod
    def list_splits(cls) -> list[str]:
        return ["train", "test"]

    def get_prompt(self) -> list[TextBlock]:
        return [TextBlock(type="text", text=self.config.problem)]

    @tool
    async def answer(self, params: AnswerParams) -> ToolOutput:
        """
        The answer tool can be used to submit your final answer. Note that this finishes the episode.
        """
        gold = parse(self.config.answer)
        answer = parse(params.answer)
        is_correct = verify(gold, answer)

        if is_correct:
            agent_message = "Correct!"
            reward = 1.0
        else:
            agent_message = "Wrong!"
            reward = 0.0

        return ToolOutput(
            blocks=[TextBlock(type="text", text=agent_message)],
            reward=reward,
            finished=True
        )

if __name__ == "__main__":
    Server([AIME2024]).run()
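The answer tool delegates grading to math_verify's parse and verify functions. If you want to sanity-check that grading in isolation, a quick check along these lines works (a sketch, assuming math-verify is installed in your local Python environment):
from math_verify import parse, verify

# parse() normalises an answer string; verify() compares a parsed gold answer
# against a parsed submission.
gold = parse("33")
print(verify(gold, parse("33")))  # expected: True
print(verify(gold, parse("34")))  # expected: False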
We’ll update the requirements.txt:
fastapi>=0.115.12
openreward
pandas
pyarrow
uvicorn>=0.34.3
math-verify[antlr4_13_2]
And update the Dockerfile:
FROM python:3.11-slim

RUN apt update && apt upgrade -y && apt install -y \
    curl

WORKDIR /app

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY server.py .

# Expose port
EXPOSE 8000

RUN curl -L -o aime_2024_problems.parquet \
https://huggingface.co/datasets/Maxwell-Jia/AIME_2024/resolve/main/aime_2024_problems.parquet

# Start the server
CMD ["python", "server.py"]
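If you want to try the image locally before connecting a repository, a standard build-and-run works. The port mapping below is an assumption based on the EXPOSE 8000 line; adjust it if your server binds to a different port:
docker build -t aime2024 .
docker run -p 8000:8000 aime2024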
Now create an environment on OpenReward, upload your code to GitHub, and connect your repository. Alternatively, you can run your evaluations against a locally running server by starting it yourself, i.e. python server.py.

Evaluating a model

Now let's evaluate a model on the new environment.
OpenAI
1. Set your API keys

Make sure you have an API key for OpenAI and OpenReward, and set the environment variables:
export OPENAI_API_KEY='your-openai-api-key-here'
export OPENREWARD_API_KEY='your-openreward-api-key-here'
2. Write the code

Save the following to evaluate.py:
import asyncio
import json
import copy
from typing import cast

from openai import AsyncOpenAI
from openai.types.responses import ResponseInputItemParam
from openai.types.responses.response_input_item_param import FunctionCallOutput
from openai.types.responses.tool_param import ToolParam

from openreward import OpenReward
from openreward.api.environments.client import Environment, Task, ToolSpec


CLIENT = AsyncOpenAI()
NUM_SEEDS = 2
OAI_MODEL = "gpt-4o"

async def process_aime_task(environment: Environment, task: Task, tools: list[ToolSpec] | list[dict]) -> float | None:
    async with environment.session(task=task) as session:
        prompt = await session.get_prompt()
        prompt = "".join([str(block) for block in prompt])

        previous_response_id = None
        latest_input: list[ResponseInputItemParam] = [{"role": "user", "content": prompt}]
        while True:
            response = await CLIENT.responses.create(
                model=OAI_MODEL,
                tools=cast(list[ToolParam], tools),
                input=latest_input,
                previous_response_id=previous_response_id,
                tool_choice="required",
            )
            previous_response_id = response.id

            # parse and execute tool calls
            latest_input = []
            for out in response.output:
                if out.type != "function_call":
                    continue

                tool_result = await session.call_tool(out.name, json.loads(str(out.arguments)))
                if tool_result.finished:
                    print(tool_result.reward)
                    return tool_result.reward
                item: FunctionCallOutput = {
                    "type": "function_call_output", 
                    "call_id": out.call_id,
                    "output": "".join([str(block) for block in tool_result.blocks])
                }
                latest_input.append(item)

async def main():
    or_client = OpenReward()

    environment = or_client.environments.get(name="RJT1990/AIME2024")
    tasks = await environment.list_tasks(split="test")
    tools = await environment.list_tools(format="openai")

    original_tasks: list[Task] = list(tasks)
    tasks_with_seeds: list[Task] = [
        copy.deepcopy(task) for task in original_tasks for _ in range(NUM_SEEDS)
    ]
    
    semaphore = asyncio.Semaphore(10)
    async def run_task(task: Task) -> float | None:
        async with semaphore:
            return await process_aime_task(environment, task, tools)

    results = await asyncio.gather(*[run_task(task) for task in tasks_with_seeds])

    num_samples = len(results)
    num_correct = sum(1 for r in results if r is not None and r == 1)
    pass_at_1 = num_correct / num_samples if num_samples > 0 else 0.0
    print(pass_at_1)

if __name__ == "__main__":
    asyncio.run(main())
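Optionally, you can break the score down per problem rather than only reporting the overall mean. A minimal sketch, appended to the end of main, which relies only on results being ordered like tasks_with_seeds (NUM_SEEDS consecutive entries per problem):
    from collections import defaultdict

    # Group rewards by problem index; seeds for the same problem are adjacent.
    per_problem: dict[int, list[float]] = defaultdict(list)
    for i, reward in enumerate(results):
        per_problem[i // NUM_SEEDS].append(reward if reward is not None else 0.0)

    for problem_index, rewards in sorted(per_problem.items()):
        print(f"problem {problem_index}: {sum(rewards) / len(rewards):.2f}")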
3. Run the code

Now, run the code to evaluate:
python evaluate.py
0.05
So pass@1 comes out at 5.00% for gpt-4o over these seeds.