Goals

In this documentation, you will:
  • Make an AIME 2024 evaluation environment.
  • Deploy the environment to OpenReward (optional).
  • Write an evaluation script.
  • Evaluate a model of your choice on this evaluation.

Prerequisites

  • An OpenReward account
  • An OpenReward API key
  • An API key and SDK for your model provider of choice (e.g. OpenAI, Anthropic, Google, OpenRouter)

Setup

Environments in OpenReward are written using the OpenReward Python library. You can install this library using pip or uv:
pip install openreward
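If you prefer uv, the equivalent command (using uv's pip-compatible interface) is:
uv pip install openreward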

Our evaluation: AIME 2024

The American Invitational Mathematics Examination (AIME) is a selective 15-question, 3-hour test given since 1983 to students who rank in the top 5% on the AMC 12 high school mathematics examination. More recently it has become a popular benchmark for the reasoning performance of language models. An example question from the AIME 2024 examination:
Let $x$, $y$ and $z$ be positive real numbers that satisfy the following system of equations:
$$\log_2\left(\frac{x}{yz}\right) = \frac{1}{2}$$
$$\log_2\left(\frac{y}{xz}\right) = \frac{1}{3}$$
$$\log_2\left(\frac{z}{xy}\right) = \frac{1}{4}$$
Then the value of $\left|\log_2(x^4y^3z^2)\right|$ is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.
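For reference, the answer can be derived by hand (this derivation is ours, not part of the dataset): writing $a=\log_2 x$, $b=\log_2 y$, $c=\log_2 z$, the system becomes $a-b-c=\tfrac{1}{2}$, $b-a-c=\tfrac{1}{3}$, $c-a-b=\tfrac{1}{4}$. Summing the three equations gives $a+b+c=-\tfrac{13}{12}$, from which $a=-\tfrac{7}{24}$, $b=-\tfrac{3}{8}$, $c=-\tfrac{5}{12}$, so $\left|\log_2(x^4y^3z^2)\right| = |4a+3b+2c| = \tfrac{25}{8}$ and $m+n=33$.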
We’ll implement this as an evaluation using OpenReward.

Building the evaluation

Initialise a project using the OpenReward CLI:
orwd init aime2024 --template basic
cd aime2024 && ls aime2024
We’ll replace the server.py file with the following:
from math_verify import parse, verify
import pandas as pd
from pydantic import BaseModel

from openreward.environments import Environment, JSONObject, Server, TextBlock, ToolOutput, tool

class AIME2024TaskSpec(BaseModel):
    id: str
    problem: str
    answer: str

class AnswerParams(BaseModel):
    answer: str

# Load the AIME 2024 problems and normalise each row to the AIME2024TaskSpec
# schema: drop the unused ID/Solution columns, add a string id, and stringify
# the answer.
test_tasks = pd.read_parquet("aime_2024_problems.parquet").to_dict(orient="records")

for i, task in enumerate(test_tasks):
    task.pop('ID')
    task.pop('Solution')
    task['id'] = str(i)
    task['Answer'] = str(task['Answer'])

    # Lowercase the remaining column names (e.g. 'Problem' -> 'problem').
    keys_to_change = [key for key in task.keys() if key != "id"]
    for key in keys_to_change:
        if key.lower() != key:
            task[key.lower()] = task.pop(key)


class AIME2024(Environment):
    """
    An environment for the AIME 2024 dataset
    """
    def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
        super().__init__(task_spec)
        self.config = AIME2024TaskSpec.model_validate(task_spec)

    @classmethod
    def list_tasks(cls, split: str) -> list[JSONObject]:
        if split == "train":
            return []
        elif split == "test":
            return test_tasks
        raise ValueError(f"Unknown split: {split}")

    @classmethod
    def list_splits(cls) -> list[str]:
        return ["train", "test"]

    def get_prompt(self) -> list[TextBlock]:
        return [TextBlock(type="text", text=self.config.problem)]

    @tool
    async def answer(self, params: AnswerParams) -> ToolOutput:
        """
        The answer tool can be used to submit your final answer. Note that this finishes the episode.
        """
        gold = parse(self.config.answer)
        answer = parse(params.answer)
        is_correct = verify(gold, answer)

        if is_correct:
            agent_message = "Correct!"
            reward = 1.0
        else:
            agent_message = "Wrong!"
            reward = 0.0

        return ToolOutput(
            blocks=[TextBlock(type="text", text=agent_message)],
            reward=reward,
            finished=True
        )

if __name__ == "__main__":
    Server([AIME2024]).run()
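The answer tool delegates grading to math_verify's parse and verify functions. If you want to sanity-check that grading in isolation, a quick check along these lines works (a sketch, assuming math-verify is installed in your local Python environment):
from math_verify import parse, verify

# parse() normalises an answer string; verify() compares a parsed gold answer
# against a parsed submission.
gold = parse("33")
print(verify(gold, parse("33")))  # expected: True
print(verify(gold, parse("34")))  # expected: False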
We’ll update the requirements.txt:
fastapi>=0.115.12
openreward
pandas
pyarrow
uvicorn>=0.34.3
math-verify[antlr4_13_2]
And update the Dockerfile:
FROM python:3.11-slim

RUN apt update && apt upgrade -y && apt install -y \
    curl

WORKDIR /app

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY server.py .

# Expose port
EXPOSE 8000

RUN curl -L -o aime_2024_problems.parquet \
https://huggingface.co/datasets/Maxwell-Jia/AIME_2024/resolve/main/aime_2024_problems.parquet

# Start the server
CMD ["python", "server.py"]
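If you want to try the image locally before connecting a repository, a standard build-and-run works. The port mapping below is an assumption based on the EXPOSE 8000 line; adjust it if your server binds to a different port:
docker build -t aime2024 .
docker run -p 8000:8000 aime2024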
Now create an environment on OpenReward, upload your code to GitHub, and connect your repository. Alternatively, you can run your evaluations against a locally running server by starting it yourself, i.e. python server.py.

Evaluating a model

Now let's evaluate a model on the new environment.
OpenAI
1. Set your API keys

Make sure you have an API key for OpenAI and OpenReward, and set the environment variables:
export OPENAI_API_KEY='your-openai-api-key-here'
export OPENREWARD_API_KEY='your-openreward-api-key-here'
2. Write the code

Save the following to evaluate.py:
import asyncio
import json
import copy
from typing import cast

from openai import AsyncOpenAI
from openai.types.responses import ResponseInputItemParam
from openai.types.responses.response_input_item_param import FunctionCallOutput
from openai.types.responses.tool_param import ToolParam

from openreward import OpenReward
from openreward.api.environments.client import Environment, Task, ToolSpec


CLIENT = AsyncOpenAI()
NUM_SEEDS = 2
OAI_MODEL = "gpt-4o"

async def process_aime_task(environment: Environment, task: Task, tools: list[ToolSpec] | list[dict]) -> float | None:
    async with environment.session(task=task) as session:
        prompt = await session.get_prompt()
        prompt = "".join([str(block) for block in prompt])

        previous_response_id = None
        latest_input: list[ResponseInputItemParam] = [{"role": "user", "content": prompt}]
        while True:
            response = await CLIENT.responses.create(
                model=OAI_MODEL,
                tools=cast(list[ToolParam], tools),
                input=latest_input,
                previous_response_id=previous_response_id,
                tool_choice="required",
            )
            previous_response_id = response.id

            # parse and execute tool calls
            latest_input = []
            for out in response.output:
                if out.type != "function_call":
                    continue

                tool_result = await session.call_tool(out.name, json.loads(str(out.arguments)))
                if tool_result.finished:
                    print(tool_result.reward)
                    return tool_result.reward
                item: FunctionCallOutput = {
                    "type": "function_call_output", 
                    "call_id": out.call_id,
                    "output": "".join([str(block) for block in tool_result.blocks])
                }
                latest_input.append(item)

async def main():
    or_client = OpenReward()

    environment = or_client.environments.get(name="RJT1990/AIME2024")
    tasks = await environment.list_tasks(split="test")
    tools = await environment.list_tools(format="openai")

    original_tasks: list[Task] = list(tasks)
    tasks_with_seeds: list[Task] = [
        copy.deepcopy(task) for task in original_tasks for _ in range(NUM_SEEDS)
    ]
    
    semaphore = asyncio.Semaphore(10)
    async def run_task(task: Task) -> float | None:
        async with semaphore:
            return await process_aime_task(environment, task, tools)

    results = await asyncio.gather(*[run_task(task) for task in tasks_with_seeds])

    num_samples = len(results)
    num_correct = sum(1 for r in results if r is not None and r == 1)
    pass_at_1 = num_correct / num_samples if num_samples > 0 else 0.0
    print(pass_at_1)

if __name__ == "__main__":
    asyncio.run(main())
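Optionally, you can break the score down per problem rather than only reporting the overall mean. A minimal sketch, appended to the end of main, which relies only on results being ordered like tasks_with_seeds (NUM_SEEDS consecutive entries per problem):
    from collections import defaultdict

    # Group rewards by problem index; seeds for the same problem are adjacent.
    per_problem: dict[int, list[float]] = defaultdict(list)
    for i, reward in enumerate(results):
        per_problem[i // NUM_SEEDS].append(reward if reward is not None else 0.0)

    for problem_index, rewards in sorted(per_problem.items()):
        print(f"problem {problem_index}: {sum(rewards) / len(rewards):.2f}")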
3. Run the code

Now, run the code to evaluate:
python evaluate.py
0.05
So pass@1 comes out at 5.00% for gpt-4o over these seeds.