Goals
- Learn why we often need language model graders
- Understand how to integrate them into an environment
Prerequisites
- An OpenReward account
- An OpenReward API key
- An API key and SDK for your model provider of choice (e.g. OpenAI, Anthropic, Google, OpenRouter)
Introduction
Many tasks can be easily verified with rules-based parsers. For example, in a task where the answer is an integer (e.g. 4) and there is a submit_answer tool, we can simply compare the model answer to the ground truth, e.g. model_answer == ground_truth_answer, and assign a reward accordingly.
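A rules-based check like this might look like the following sketch; the function name and signature are illustrative only, not part of any specific openreward API:

```python
def grade_submission(model_answer: str, ground_truth_answer: int) -> float:
    """Rules-based grader: full reward for an exact integer match, else nothing."""
    try:
        return 1.0 if int(model_answer.strip()) == ground_truth_answer else 0.0
    except ValueError:
        # The model did not submit a parseable integer.
        return 0.0
```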
But other tasks are harder to grade with simple rules. Consider a medical question where the model answer is acetaminophen but the ground truth answer is paracetamol. String matching wouldn't find a match, yet these are simply two names for the same drug.
For this reason, we often use LLM graders to assign reward. We pass the question, the model answer, and the ground truth, along with any other instructions, into a language model and ask it to judge whether the answer is correct. This is more expensive than a rules-based parser, but it is more general: it draws on the knowledge of a language model to assign the reward.
Graders can give a binary response, such as yes or no, but they can also give partial credit. Partial credit is often helpful in reinforcement learning because it lets the model improve even before it can produce a fully correct solution.
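As an illustration, a partial-credit grading prompt might look like the sketch below. The wording and the placeholder names (question, solution, model_response) are assumptions; the tutorial that follows uses a simpler binary grader:

```python
# Hypothetical partial-credit grading prompt; the wording is illustrative only.
PARTIAL_CREDIT_PROMPT = """You are grading an answer to a question.

Question: {question}
Reference answer: {solution}
Model answer: {model_response}

Reply with a single number between 0.0 and 1.0, where 1.0 means the model
answer is fully correct and intermediate values reward partially correct
answers."""
```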
Example
In this tutorial we’ll see how to set up a binary grader using the drug-name example mentioned in the introduction. First, make sure you have the openreward library installed.
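Assuming the package is published on PyPI under the same name, you can install it with pip:

```bash
pip install openreward
```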
Then create a new environment using the basic template.
Replace the train_tasks and test_tasks with the following.
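The exact task schema comes from the template you generated, so treat this as a hypothetical sketch; it simply defines tasks whose ground-truth solution is paracetamol:

```python
# Hypothetical task definitions; the field names follow common conventions and
# may differ from the schema in your generated template.
train_tasks = [
    {
        "question": "Which common over-the-counter drug is used to treat mild pain and fever?",
        "solution": "paracetamol",
    },
]

test_tasks = [
    {
        "question": "Name a first-line antipyretic for treating a simple fever.",
        "solution": "paracetamol",
    },
]
```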
Next, we'll bring in the AsyncOpenAI client and rewrite our __init__.py.
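Here is a minimal sketch of the grader using the AsyncOpenAI client. The helper name grade_answer, the prompt wording, and the choice of gpt-4o-mini are assumptions, and any surrounding template boilerplate is omitted:

```python
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

GRADER_PROMPT = """You are grading an answer to a medical question.

Reference answer: {solution}
Model answer: {model_response}

Reply with a JSON object of the form {{"score": 1}} if the model answer is
equivalent to the reference answer, or {{"score": 0}} otherwise."""


async def grade_answer(model_response: str, solution: str) -> dict:
    """Ask a language model whether model_response matches solution."""
    completion = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": GRADER_PROMPT.format(
                    solution=solution, model_response=model_response
                ),
            }
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```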
The grader takes a model_response and a solution, asks a model to grade it, and then returns a JSON response with a score key.
Next, we can rewrite the answer tool.
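How the tool is registered depends on your template, so the sketch below is hypothetical: a plain async function that calls the grade_answer helper defined above and returns the score as the reward.

```python
async def answer(model_answer: str, solution: str) -> float:
    """Hypothetical answer tool: grade the submission and return the score as reward."""
    # Uses the grade_answer helper sketched in __init__.py above.
    result = await grade_answer(model_response=model_answer, solution=solution)
    return float(result.get("score", 0))
```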
Putting it all together gives us the full server.py; in this example we use OpenAI as the grading provider.
Set your API keys
Make sure you have API keys for OpenReward and OpenAI, and set these as environment variables.
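The OpenAI SDK reads OPENAI_API_KEY by default; the OPENREWARD_API_KEY name below is an assumption, so check your template or the OpenReward docs for the exact variable it expects:

```bash
export OPENREWARD_API_KEY="your-openreward-key"
export OPENAI_API_KEY="your-openai-key"
```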
Advanced Graders
In the previous example, we asked the grader to output a score alone. In some cases we may want the grader to do some limited thinking before outputting this score, or at least to offer transparent reasoning that leads to it. To see how we can do this, let's first rewrite the grader and then update the answer tool so that it reads only the score field.
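The sketch below revisits the earlier hypothetical grader and answer tool: the prompt now asks for a short reasoning field alongside the score, and the answer tool ignores the reasoning when computing the reward. The helper names, prompt wording, and model choice remain assumptions.

```python
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

ADVANCED_GRADER_PROMPT = """You are grading an answer to a medical question.

Reference answer: {solution}
Model answer: {model_response}

Think briefly about whether the model answer is equivalent to the reference
answer, then reply with a JSON object of the form
{{"reasoning": "<one or two sentences>", "score": 0 or 1}}."""


async def grade_answer(model_response: str, solution: str) -> dict:
    """Grade an answer, returning both the grader's reasoning and its score."""
    completion = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": ADVANCED_GRADER_PROMPT.format(
                    solution=solution, model_response=model_response
                ),
            }
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)


async def answer(model_answer: str, solution: str) -> float:
    """The answer tool rewards only the score; the reasoning can be logged."""
    result = await grade_answer(model_response=model_answer, solution=solution)
    return float(result.get("score", 0))
```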
Finally, run quickstart.py to test the environment end to end.
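Assuming the basic template places quickstart.py at the project root, you can run it directly once your API keys are set:

```bash
python quickstart.py
```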

