Goals
In this documentation, you will:
- Make an AIME 2024 evaluation environment.
- Deploy the environment to OpenReward (optional).
- Write an evaluation script.
- Run a model of your choice on this evaluation.
Prerequisites
- An OpenReward account
- An OpenReward API key
- An API key and SDK for your model provider of choice (e.g. OpenAI, Anthropic, Google, OpenRouter)
Setup
Environments in OpenReward are written using the OpenReward Python library. You can install this library using pip or uv:
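For example (the PyPI package name `openreward` is an assumption here; use whichever name the OpenReward documentation gives):

```bash
# Install the OpenReward Python library (package name assumed to be "openreward")
pip install openreward

# or, if you use uv
uv add openreward
```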
Our evaluation: AIME 2024
The American Invitational Mathematics Examination (AIME) is a selective 15-question, 3-hour test given since 1983 to those who rank in the top 5% on the AMC 12 high school mathematics examination. Historically it has been used to evaluate the reasoning performance of language models. An example question and answer from the AIME 2024 examination:

Let $x$, $y$, and $z$ be positive real numbers that satisfy the following system of equations:

$$\log_2\!\left(\frac{x}{yz}\right) = \frac{1}{2}, \qquad \log_2\!\left(\frac{y}{xz}\right) = \frac{1}{3}, \qquad \log_2\!\left(\frac{z}{xy}\right) = \frac{1}{4}$$

Then the value of $\left|\log_2(x^4y^3z^2)\right|$ is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$. The answer is 033.

We'll implement this as an evaluation using OpenReward.
Building the evaluation
Initialise a project using the OpenReward CLI, then fill in the server.py file with the following:
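Below is a sketch of what server.py could look like. The `openreward` imports and the `Environment`/`serve` interface are assumptions made for illustration (check the library's actual API); the dataset handling and exact-match grading are the portable parts.

```python
# server.py -- a minimal sketch of an AIME 2024 environment.
# NOTE: the `openreward` imports and the Environment/serve interface are
# assumptions for illustration; consult the OpenReward docs for the real API.

import re

# Hypothetical OpenReward server API (assumed names).
from openreward import Environment, serve

# A couple of AIME 2024 problems; a real environment would load all 30
# problems (AIME I and II) from a dataset file.
PROBLEMS = [
    {
        "question": (
            "Let $x$, $y$, and $z$ be positive real numbers that satisfy ... "
            "Find $m+n$."
        ),
        "answer": 33,
    },
    # ... remaining problems ...
]

# AIME answers are integers from 0 to 999; grab the final number in the response.
ANSWER_PATTERN = re.compile(r"(\d{1,3})\s*$")


def extract_answer(completion: str) -> int | None:
    """Pull the final integer answer out of a model response."""
    match = ANSWER_PATTERN.search(completion.strip())
    return int(match.group(1)) if match else None


def score(completion: str, reference: int) -> float:
    """Exact-match grading: 1.0 if the extracted answer equals the reference."""
    return 1.0 if extract_answer(completion) == reference else 0.0


# Hypothetical wiring: expose the problems and the scorer as an environment.
env = Environment(
    name="aime-2024",
    samples=PROBLEMS,
    score_fn=lambda sample, completion: score(completion, sample["answer"]),
)

if __name__ == "__main__":
    serve(env)  # assumed entry point; starts the environment server
```

For a real run you would likely want more robust answer extraction, for example parsing a `\boxed{}` expression or an explicit "Answer:" marker rather than just the trailing number.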
requirements.txt:
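The requirements file only needs the OpenReward library (package name assumed, as above) plus anything else your grading code imports:

```text
openreward
```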
Dockerfile:
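A minimal Dockerfile for the environment server might look like the following; the base image and layout here are assumptions, not a required setup:

```dockerfile
# Minimal container for the environment server (details are assumptions).
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the environment code and start the server.
COPY . .
CMD ["python", "server.py"]
```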
You can test the environment locally by running the server:
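```bash
python server.py
```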
Evaluating a model
Now let's evaluate our model.
OpenAI
Set your API keys
Make sure you have an API key for OpenAI and OpenReward, and set the environment variables:
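For example (the `OPENREWARD_API_KEY` variable name is an assumption; use whatever name the OpenReward SDK expects):

```bash
export OPENAI_API_KEY="sk-..."    # from the OpenAI dashboard
export OPENREWARD_API_KEY="..."   # variable name assumed; check the OpenReward docs
```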

