Goals

  • Learn why we often need language model graders
  • Understand how to integrate them into an environment

Prerequisites

  • An OpenReward account
  • An OpenReward API key
  • An API key and SDK for your model provider of choice (e.g. OpenAI, Anthropic, Google, OpenRouter)

Introduction

Many tasks can be verified easily with rules-based parsers. For example, in a task where the answer is an integer (e.g. 4) and there is a submit_answer tool, we can simply compare the model answer to the ground truth, e.g. model_answer == ground_truth_answer, and assign a reward accordingly. But other tasks are harder to grade with simple rules. Consider a medical question where the model answer is acetaminophen but the ground truth answer is paracetamol. String matching would not find a match, even though these are two names for the same drug.

For this reason, we often use LLM graders to assign reward. We pass the question, the model answer and the ground truth, along with any other instructions, into a language model and ask it to judge whether the answer is correct. This is more expensive than a rules-based parser, but it is more general: it uses the knowledge of a language model to help assign a reward. Graders can give a binary response, such as yes or no, but they can also give partial credit. Partial credit is often helpful in reinforcement learning because it lets the model improve even when it does not produce the full solution.
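To make the drug-name example concrete, here is a minimal sketch of a rules-based exact-match reward (plain Python, nothing OpenReward-specific). It works for the integer case but marks the two equivalent drug names as a mismatch:
def exact_match_reward(model_answer: str, ground_truth_answer: str) -> float:
    # Rules-based grading: 1.0 if the normalised strings match, 0.0 otherwise
    return float(model_answer.strip().lower() == ground_truth_answer.strip().lower())

print(exact_match_reward("4", "4"))                        # 1.0, fine for simple answers
print(exact_match_reward("acetaminophen", "paracetamol"))  # 0.0, misses equivalent drug names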

Example

In this tutorial we’ll see how to set up a binary grader with the drug name example mentioned in the introduction. First, make sure you have the openreward library installed:
pip install openreward
To begin we’ll initialise our project with a basic template:
orwd init medicalenv --template basic
cd medicalenv && ls medicalenv
We’ll replace the train_tasks and test_tasks with the following:
train_tasks = []
test_tasks = [{"id": "test-0", "problem": "What is the drug sold under the brand name Tylenol?", "solution": "acetaminophen"}]
Now we’ll need to define an LLM grader. For this tutorial we’ll use an OpenAI model as the grader, but you can use any provider you like. We’ll import the AsyncOpenAI client and rewrite the __init__ method of our environment:
def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
    super().__init__(task_spec)
    self.config = BasicTaskSpec.model_validate(task_spec)
    self.grader_client = AsyncOpenAI(api_key=secrets["OPENAI_API_KEY"])
Now let’s define a helper function outside our environment class:
async def grade_model_response(client: AsyncOpenAI, model_response: str, solution: str) -> float:
    """
    Grades the model response against the solution. Return a reward.
    """

    prompt = [
        {
            "role": "system",
            "content": "You are an expert grader. If the reference and model answer match or are equivalent, output a score of 1. Otherwise, give a score of 0. Only respond with a JSON object with a single key 'score' and a value between 0 and 1."
        },
        {
            "role": "user",
            "content": f"Reference: {solution}. Model answer: {model_response}"
        }
    ]

    response = await client.chat.completions.create(
        model="gpt-5-mini",
        messages=prompt
    )

    return float(json.loads(response.choices[0].message.content)["score"])
Here we pass in a model_response and a solution, ask a model to grade them, parse the score key from its JSON response, and return the value as a float reward. Next we can rewrite the answer tool:
@tool
async def answer(self, params: AnswerParams) -> ToolOutput:
    """
    The answer tool can be used to submit your final answer. Note that this finishes the episode.
    """
    reward = await grade_model_response(self.grader_client, params.answer, self.config.solution)

    return ToolOutput(
        blocks=[TextBlock(type="text", text=f"Your answer was graded with a score of {reward}")],
        reward=reward,
        finished=True
    )
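Before wiring everything into the full server, you can optionally sanity-check the grader on its own. A minimal sketch, assuming OPENAI_API_KEY is set in your shell and grade_model_response is importable from the module above:
import asyncio
import os

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # "paracetamol" and "acetaminophen" are equivalent, so we expect a score of 1.0
    score = await grade_model_response(client, "paracetamol", "acetaminophen")
    print(score)

asyncio.run(main())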
And now we can test. Our full server.py is:
from pydantic import BaseModel
from openreward.environments import Environment, JSONObject, Server, TextBlock, ToolOutput, tool
from openai import AsyncOpenAI
import json

class BasicTaskSpec(BaseModel):
    """
    Each environment has a list of tasks. A task is a dict which contains information about a particular problem in an environment.

    Examples:
    - A math environment might have a problem and a solution
    """
    id: str
    problem: str
    solution: str

class AnswerParams(BaseModel):
    """
    Each tool takes in arguments, and these are specified using types.

    Examples:
    - An answer tool might have an answer argument
    - A bash tool might have a command argument 
    """
    answer: str


async def grade_model_response(client: AsyncOpenAI, model_response: str, solution: str) -> float:
    """
    Grades the model response against the solution. Return a reward.
    """

    prompt = [
        {
            "role": "system",
            "content": "You are an expert grader. If the reference and model answer match or are equivalent, output a score of 1. Otherwise, give a score of 0. Only respond with a JSON object with a single key 'score' and a value between 0 and 1."
        },
        {
            "role": "user",
            "content": f"Reference: {solution}. Model answer: {model_response}"
        }
    ]

    response = await client.chat.completions.create(
        model="gpt-5-mini",
        messages=prompt
    )

    return float(json.loads(response.choices[0].message.content)["score"])


train_tasks = []
test_tasks = [{"id": "test-0", "problem": "What is the drug sold under the brand name Tylenol?", "solution": "acetaminophen"}]

class BasicEnvironment(Environment):
    """
    A BasicEnvironment showing the main methods needed to define a working environment.
    """
    def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
        super().__init__(task_spec)
        self.config = BasicTaskSpec.model_validate(task_spec)
        self.grader_client = AsyncOpenAI(api_key=secrets["OPENAI_API_KEY"])

    @classmethod
    def list_tasks(cls, split: str) -> list[JSONObject]:
        """
        This method is used to find the available environment tasks for a particular split in the dataset.
        """
        if split == "train":
            return train_tasks
        elif split == "test":
            return test_tasks
        raise ValueError(f"Unknown split: {split}")

    @classmethod
    def list_splits(cls) -> list[str]:
        """
        This method is used to list all the splits in the dataset.
        """
        return ["train", "test"]

    def get_prompt(self) -> list[TextBlock]:
        """
        This method is used to obtain the prompt that should be used when the agent is starting an episode in the environment.
        """
        return [TextBlock(type="text", text=self.config.problem)]

    @tool
    async def answer(self, params: AnswerParams) -> ToolOutput:
        """
        The answer tool can be used to submit your final answer. Note that this finishes the episode.
        """
        reward = await grade_model_response(self.grader_client, params.answer, self.config.solution)

        return ToolOutput(
            blocks=[TextBlock(type="text", text=f"Your answer was graded with a score of {reward}")],
            reward=reward,
            finished=True
        )

if __name__ == "__main__":
    Server([BasicEnvironment]).run()
To sample, run the following code:
1. Set your API keys

Make sure you have API keys for OpenReward and OpenAI, and set these as environment variables:
export OPENAI_API_KEY='your-openai-api-key-here'
export OPENREWARD_API_KEY='your-openreward-api-key-here'
2. Create your code

Save this as quickstart.py:
  from openai import OpenAI
  from openreward import OpenReward
  import json
  import os

  or_client = OpenReward()
  oai_client = OpenAI()
  MODEL_NAME = "gpt-5.2"

  environment = or_client.environments.get(name="medicalenv", base_url="http://localhost:8080")
  tasks = environment.list_tasks(split="test")
  tools = environment.list_tools(format="openai")

  example_task = tasks[0]

  with environment.session(task=example_task, secrets={"OPENAI_API_KEY": os.getenv("OPENAI_API_KEY")}) as session:
      prompt = session.get_prompt()
      input_list = [{"role": "user", "content": prompt[0].text}]
      finished = False
      print(input_list)

      while not finished:
          response = oai_client.responses.create(
              model=MODEL_NAME,
              tools=tools,
              input=input_list
          )
          print(response.output)

          input_list += response.output

          for item in response.output:
              if item.type == "function_call":
                  tool_result = session.call_tool(item.name, json.loads(str(item.arguments)))

                  reward = tool_result.reward
                  finished = tool_result.finished

                  input_list.append({
                      "type": "function_call_output",
                      "call_id": item.call_id,
                      "output": json.dumps({
                          "result": tool_result.blocks[0].text
                      })
                  })

                  print(input_list[-1])

                  if tool_result.finished:
                      finished = True
                      break
3. Run your code

  python quickstart.py
Example output:
[{'role': 'user', 'content': 'What is the drug sold under the brand name Tylenol?'}]
[ResponseOutputMessage(id='msg_0c452008c87c52ac006976148ab2fc81a2b15b31b79ab6e53d', content=[ResponseOutputText(annotations=[], text='Tylenol is the brand name for **acetaminophen** (also called **paracetamol** in many countries).', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')]
[ResponseFunctionToolCall(arguments='{"answer":"Tylenol is the brand name for **acetaminophen** (also called **paracetamol** in many countries)."}', call_id='call_txGlADVzH3YNuBzGZ4tnrjoJ', name='answer', type='function_call', id='fc_0c452008c87c52ac006976148bbba481a2830e9cd56d63799d', status='completed')]
{'type': 'function_call_output', 'call_id': 'call_txGlADVzH3YNuBzGZ4tnrjoJ', 'output': '{"result": "Your answer was graded with a score of 1.0"}'}
You should note that because the grader requires an OpenAI API key, we pass in the key as follows:
with environment.session(task=example_task, secrets={"OPENAI_API_KEY": os.getenv("OPENAI_API_KEY")}) as session:
If you are using a grader from another provider, your code follows the same pattern. You also need to read the corresponding secret server-side, so that the key passed in by the client reaches the grader on the backend:
def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
    super().__init__(task_spec)
    self.config = BasicTaskSpec.model_validate(task_spec)
    self.grader_client = YourGraderClient(api_key=secrets["YOUR_GRADER_KEY"])
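For example, a sketch of the same grader using Anthropic’s SDK might look like the following (the model name claude-sonnet-4-5 and the secret name ANTHROPIC_API_KEY are assumptions; adapt them to your setup):
from anthropic import AsyncAnthropic

async def grade_model_response(client: AsyncAnthropic, model_response: str, solution: str) -> float:
    """
    Grades the model response against the solution using a Claude model. Returns a reward.
    """
    response = await client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name: use any Claude model you have access to
        max_tokens=256,
        system="You are an expert grader. If the reference and model answer match or are equivalent, output a score of 1. Otherwise, give a score of 0. Only respond with a JSON object with a single key 'score' and a value between 0 and 1.",
        messages=[{"role": "user", "content": f"Reference: {solution}. Model answer: {model_response}"}]
    )
    # Claude returns a list of content blocks; the first block holds the text response
    return float(json.loads(response.content[0].text)["score"])
The __init__ above would then construct the client with AsyncAnthropic(api_key=secrets["ANTHROPIC_API_KEY"]), and the client-side session would pass that key in its secrets dict instead of OPENAI_API_KEY.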

Advanced Graders

In the previous example, we asked the grader to output a score alone. In some cases we may want the grader to do some limited thinking before outputting this score, or at least to provide transparent reasoning that leads to its score. To see how we can do this, let’s first rewrite the grader:
import re

async def grade_model_response(client: AsyncOpenAI, model_response: str, solution: str) -> tuple[float, str | None]:
    """
    Grades the model response against the solution. Return a reward.
    """

    prompt = [
        {
            "role": "system",
            "content": "You are an expert grader. If the reference and model answer match or are equivalent, output a score of 1. Otherwise, give a score of 0. You should first provide some reasoning in <reasoning></reasoning> and then afterwrds output your score inside <answer></answer> which contains only a JSON object with a single key 'score' and a value between 0 and 1."
        },
        {
            "role": "user",
            "content": f"Reference: {solution}. Model answer: {model_response}"
        }
    ]

    response = await client.chat.completions.create(
        model="gpt-5-mini",
        messages=prompt
    )

    # The model response should contain <reasoning>...</reasoning> (optional) and <answer>...</answer>
    # Extract the content between <reasoning> and </reasoning> and between <answer> and </answer>
    content = response.choices[0].message.content

    reasoning = None
    reasoning_match = re.search(r"<reasoning>(.*?)</reasoning>", content, re.DOTALL | re.IGNORECASE)
    if reasoning_match:
        reasoning = reasoning_match.group(1).strip()

    answer_json = None
    answer_match = re.search(r"<answer>(.*?)</answer>", content, re.DOTALL | re.IGNORECASE)
    if answer_match:
        answer_text = answer_match.group(1).strip()
        try:
            answer_json = json.loads(answer_text)
        except Exception:
            json_match = re.search(r"\{.*\}", answer_text, re.DOTALL)
            if json_match:
                answer_json = json.loads(json_match.group(0))
            else:
                raise ValueError("Could not extract score JSON from answer tag.")
    else:
        # Fallback to old behavior if <answer> not found
        answer_json = json.loads(content)

    return float(answer_json["score"]), reasoning
And let us also rewrite the answer tool:
@tool
async def answer(self, params: AnswerParams) -> ToolOutput:
    """
    The answer tool can be used to submit your final answer. Note that this finishes the episode.
    """
    reward, reasoning = await grade_model_response(self.grader_client, params.answer, self.config.solution)

    return ToolOutput(
        blocks=[TextBlock(type="text", text=f"Your answer was graded with a score of {reward}. Reasoning: {reasoning}")],
        reward=reward,
        finished=True
    )
Now rerun quickstart.py:
python quickstart.py
[{'role': 'user', 'content': 'What is the drug sold under the brand name Tylenol?'}]
[ResponseFunctionToolCall(arguments='{"answer":"Tylenol is the brand name for **acetaminophen** (also called **paracetamol**)."}', call_id='call_2L1hRzCMF4id67XzrGBDEDju', name='answer', type='function_call', id='fc_0d546a6135c94ead006976166700088193b7223a1e2b0021d9', status='completed')]
{'type': 'function_call_output', 'call_id': 'call_2L1hRzCMF4id67XzrGBDEDju', 'output': '{"result": "Your answer was graded with a score of 1.0. Reasoning: The model answer names Tylenol as the brand name for acetaminophen and notes the synonym paracetamol, which matches the reference term \\"acetaminophen.\\" The content is equivalent."}'}
As we can see we get the same result, but now the grader spends some tokens reasoning before assigning its score. In general, you should test different graders, observe their outputs, and choose the one that best matches human judgement. The direct grader is cheaper in that it spends fewer tokens, but if you need higher grader accuracy you may want to opt for graders like this one. An even more expensive option is to use a reasoning model to judge the output. This is essentially a more extreme version of the grader we have just made, where the model spends many more tokens thinking before giving a reward. These types of graders are also called “generative verifiers”. We can use one of these graders by changing our original grader to the following:
async def grade_model_response(client: AsyncOpenAI, model_response: str, solution: str) -> tuple[float, str]:
    """
    Grades the model response against the solution. Return a reward.
    """

    prompt = [
        {
            "role": "system",
            "content": "You are an expert grader. If the reference and model answer match or are equivalent, output a score of 1. Otherwise, give a score of 0. Only respond with a JSON object with a single key 'score' and a value between 0 and 1."
        },
        {
            "role": "user",
            "content": f"Reference: {solution}. Model answer: {model_response}"
        }
    ]

    response = await client.responses.create(
        model="gpt-5.2",
        reasoning={"effort": "medium"},
        input=prompt
    )

    return float(json.loads(response.output_text)["score"]), ""
python quickstart.py
[{'role': 'user', 'content': 'What is the drug sold under the brand name Tylenol?'}]
[ResponseOutputMessage(id='msg_0b1df4bd5a513b1400697617e0d75481978e84348c273060dc', content=[ResponseOutputText(annotations=[], text='Tylenol is the brand name for **acetaminophen** (also called **paracetamol** in many countries).', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')]
[ResponseFunctionToolCall(arguments='{"answer":"Tylenol is the brand name for **acetaminophen** (also called **paracetamol** in many countries)."}', call_id='call_YwKT2KrCOeqJJCRzB7dFjuXa', name='answer', type='function_call', id='fc_0b1df4bd5a513b1400697617e1d6c0819786ba7acf559fdee6', status='completed')]
{'type': 'function_call_output', 'call_id': 'call_YwKT2KrCOeqJJCRzB7dFjuXa', 'output': '{"result": "Your answer was graded with a score of 1.0. Reasoning: "}'}
Most model providers hide the reasoning, so it won’t be visible in this case. Note that a full reasoning model is not needed to verify a task this simple, but harder verification problems may benefit from one. The Rubrics tutorial gives some examples of these kinds of problems.