
Goals

  • Learn why rubrics are useful for assigning reward
  • Learn how to define rubrics for an environment
  • Understand how to integrate them into an ORS environment

Prerequisites

  • The Using LLM Graders tutorial

Introduction

In the Using LLM Graders tutorial, we saw that we could use language models to assign rewards in cases where matching a model response against a ground-truth answer with string matching is hard. But this still relied on the notion of a “reference answer” to compare against. What about domains where a reference answer isn’t available? Consider a task where we ask a model to write an essay on the causes of World War One. There isn’t a “correct” answer in this case, but a teacher would still need to grade an essay based on the quality of its reasoning, the evidence presented, and more.

A grading rubric is a criterion for scoring a model response. A rubric can be:
  • Binary: for example, did the answer mention a particular event or not? Did it pass a unit test or not?
  • Point-based: for example, it could allocate 0 points for not mentioning an important factor, 0.5 points for mentioning it but not specific details, and 1.0 points for mentioning the factor and the details.
  • Weighted: we might weight one rubric higher than another when considering the total score.
Typically, we calculate the final reward as a weighted sum of rubric scores:

$$r = \frac{\sum_{i=1}^{K} w_i s_i}{\sum_{i=1}^{K} w_i}$$

where $w_i$ is the weight on rubric $i$ and $s_i$ is its score. Multiple rubrics and/or point-based rewards are useful for reinforcement learning because they give a more continuous reward signal that distinguishes between different responses. Rubrics are particularly useful for long-form language model responses. For example, they power the training of DeepResearch-type models. Now let’s see how rubrics work in action.
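To make the formula concrete, here is a minimal sketch of that weighted sum in Python (the scores and weights here are hypothetical; the worked example below uses equal weights of 1.0, so its reward reduces to the mean score):
def weighted_reward(scores: list[float], weights: list[float]) -> float:
    """Combine per-rubric scores into a single reward."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, scores)) / total_weight

# Three rubrics: one binary, one point-based, one weighted double
print(weighted_reward(scores=[1.0, 0.5, 1.0], weights=[1.0, 1.0, 2.0]))  # 0.875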

Worked Example

Let’s make an ORS environment locally. First, if you haven’t already:
pip install openreward
We’ll initialise a new project:
orwd init historyenv --template basic
cd historyenv && ls historyenv
First, we’ll add some additional imports to our server.py file:
import asyncio
import re
from typing import List

import openai
Next we’ll alter train_tasks and test_tasks. We’ll focus on the question we asked earlier in this tutorial:
train_tasks = [
    {
        "id": "wwi-causes",
        "problem": "What were the causes of World War One?",
        "rubrics": WWI_RUBRICS
    }
]

# Empty test split - only train split has tasks
test_tasks = []
We’ll need to define the WWI_RUBRICS referenced above. We’ll use ten separate rubrics, each scored on a 0 / 0.5 / 1.0 point scale:
# WWI Essay Rubrics (10 criteria, each worth 1.0 point)
WWI_RUBRICS = [
    {
        "criterion": "Direct answer + defensible thesis about 'cause'. 0: No clear answer; descriptive list. +0.5: Vague thesis that shifts. +1.0: Clear, arguable thesis defining causation that stays consistent.",
        "points": 1.0
    },
    {
        "criterion": "Chronological command of 1908–14 and July Crisis. 0: Major errors. +0.5: Broad chronology right but key steps missing. +1.0: Accurate tight chronology showing decision turning points.",
        "points": 1.0
    },
    {
        "criterion": "Explains why 1914 (timing problem). 0: Doesn't address why earlier crises didn't produce war. +0.5: Mentions earlier crises without analysis. +1.0: Uses pre-1914 crises analytically to explain escalation.",
        "points": 1.0
    },
    {
        "criterion": "Structural causes: alliances/diplomacy as mechanisms. 0: Alliances as automatic dominoes without explanation. +0.5: Describes alliances but under-specifies mechanism. +1.0: Shows how commitments shaped incentives and constraints.",
        "points": 1.0
    },
    {
        "criterion": "Militarism and war planning. 0: 'Countries liked war' with no institutional explanation. +0.5: Mentions arms race/plans without crisis connection. +1.0: Demonstrates how timetables/plans accelerated escalation.",
        "points": 1.0
    },
    {
        "criterion": "Nationalism and the Balkans. 0: Balkan nationalism as background noise. +0.5: Acknowledges volatility but agency stays with great powers. +1.0: Explains how Balkan politics interacted with great-power calculations.",
        "points": 1.0
    },
    {
        "criterion": "Great-power agency. 0: One-sided blame or vague 'everyone equally'. +0.5: Covers actors unevenly; responsibility asserted not demonstrated. +1.0: Weighs actors with specific motives/constraints; responsibility argued.",
        "points": 1.0
    },
    {
        "criterion": "Historiography. 0: Little engagement or name-dropping. +0.5: Uses interpretations but evaluation is thin. +1.0: Engages competing interpretations as arguments and shows thesis relation.",
        "points": 1.0
    },
    {
        "criterion": "Evidence and scholarship. 0: Sparse/unsuitable evidence. +0.5: Uses works but cherry-picks or doesn't distinguish evidence from interpretation. +1.0: Uses evidence discriminately with acknowledged limits.",
        "points": 1.0
    },
    {
        "criterion": "Argumentative craft. 0: Unclear structure, weak paragraphs. +0.5: Readable and structured but argument gets lost occasionally. +1.0: Coherent architecture, strong paragraphs, professional apparatus.",
        "points": 1.0
    }
]
Next we’ll need a template for the LLM grader:
GRADER_TEMPLATE = """You are an expert history evaluator specializing in World War One historiography. Evaluate the following essay response against a specific criterion.

QUESTION:
{question}

SUBMITTED ESSAY:
{response}

EVALUATION CRITERION:
{criterion}

Instructions:
1. Analyze how well the essay meets this specific criterion
2. The criterion specifies scores: 0, +0.5, or +1.0 - choose the appropriate score
3. Provide a brief explanation (2-3 sentences) justifying your score
4. Consider the quality of argument, evidence, and historical analysis

Output format:
Analysis: [Your 2-3 sentence explanation]
Score: [One of: 0, 0.5, or 1.0]"""
We’ll change the __init__ method to include the grader. We’ll use an OpenAI model as the grader:
def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
    super().__init__(task_spec)
    self.config = BasicTaskSpec.model_validate(task_spec)

    api_key = secrets.get("openai_api_key")
    if not api_key:
        raise ValueError("OpenAI API key required in secrets parameter")

    self.client = openai.AsyncClient(api_key=api_key)

    # Store rubrics for grading
    self.rubrics = self.config.rubrics
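Note that self.config.rubrics assumes the task-spec model exposes a rubrics field. If the basic template’s BasicTaskSpec does not already define one, you can extend it; a minimal sketch, assuming the template uses pydantic models (as the model_validate call suggests):
from typing import List

from pydantic import BaseModel

class BasicTaskSpec(BaseModel):
    id: str
    problem: str
    # Each entry mirrors WWI_RUBRICS: a "criterion" string and a "points" value
    rubrics: List[dict] = []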
And we’ll define some methods for the new answer tool:
    def _parse_score(self, grading_response: str, max_points: float) -> float:
        """Parse score from grading response with fallback logic"""
        # Look for "Score: X" pattern
        match = re.search(r"Score:\s*([\d.]+)", grading_response, re.IGNORECASE)
        if match:
            score = float(match.group(1))
            return max(0.0, min(max_points, score))

        # Fallback: find any decimal number
        numbers = re.findall(r"\b(\d+(?:\.\d+)?)\b", grading_response)
        if numbers:
            score = float(numbers[-1])
            return max(0.0, min(max_points, score))

        return 0.0  # Default if parsing fails
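    # Illustrative behaviour of _parse_score (inputs are hypothetical):
    #   "Analysis: ...\nScore: 0.5" -> 0.5  (matches the "Score:" pattern)
    #   "I would award this 1.0"    -> 1.0  (fallback: last number in the text)
    #   "Score: 7"                  -> 1.0  (clamped to max_points)
    #   "A fine essay."             -> 0.0  (nothing parseable)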

    async def _grade_single_criterion(self, response: str, rubric: dict) -> dict:
        """Grade response against a single criterion"""
        grader_prompt = GRADER_TEMPLATE.format(
            question=self.config.problem,
            response=response,
            criterion=rubric["criterion"]
        )

        # Use gpt-5-mini as the grader; note we do not pass a temperature parameter
        res = await self.client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": grader_prompt}]
        )

        grading_response = res.choices[0].message.content or ""
        score = self._parse_score(grading_response, rubric["points"])

        return {
            "criterion": rubric["criterion"],
            "max_points": rubric["points"],
            "score": score,
            "feedback": grading_response
        }

    async def _grade_all_rubrics(self, response: str) -> List[dict]:
        """Grade response against all 10 rubrics in parallel"""
        grading_tasks = [
            self._grade_single_criterion(response, rubric)
            for rubric in self.rubrics
        ]
        return await asyncio.gather(*grading_tasks)

    @tool
    async def answer(self, params: AnswerParams) -> ToolOutput:
        """
        Submit your essay answer for grading. Your essay will be evaluated against 10 historical rubrics covering thesis, chronology, causation, historiography, evidence, and argumentative craft. Note that this finishes the episode.
        """
        essay = params.answer

        # Grade all rubrics in parallel
        rubric_results = await self._grade_all_rubrics(essay)

        # Calculate total score
        total_score = sum(r["score"] for r in rubric_results)
        total_possible = sum(r["max_points"] for r in rubric_results)

        # Normalize reward to 0-1 scale (sum / 10)
        reward = total_score / total_possible if total_possible > 0 else 0.0

        # Format detailed feedback
        feedback_lines = [
            "",
            f"**Total Score: {total_score:.1f} / {total_possible:.1f}**",
            f"**Normalized Reward: {reward:.3f}**",
            "",
            "## Rubric-by-Rubric Feedback",
            ""
        ]

        for i, result in enumerate(rubric_results, 1):
            feedback_lines.append(f"### Rubric {i}: {result['score']:.1f} / {result['max_points']:.1f}")
            feedback_lines.append(f"**Criterion**: {result['criterion'][:100]}...")
            feedback_lines.append(f"**Feedback**: {result['feedback']}")
            feedback_lines.append("")

        feedback_text = "\n".join(feedback_lines)

        return ToolOutput(
            blocks=[TextBlock(type="text", text=feedback_text)],
            metadata={
                "task_id": self.config.id,
                "total_score": total_score,
                "total_possible": total_possible,
                "normalized_reward": reward,
                "rubric_results": rubric_results
            },
            reward=reward,
            finished=True
        )
Once we’ve made these changes, let’s run the environment. In your terminal run:
python server.py
To sample, run the following code:
Step 1: Set your API keys

Make sure you have API keys for OpenReward and OpenAI, and set these as environment variables:
export OPENAI_API_KEY='your-openai-api-key-here'
export OPENREWARD_API_KEY='your-openreward-api-key-here'
Step 2: Create your code

Save this as test_rubrics.py:
from openai import OpenAI
from openreward import OpenReward
import json
import os

or_client = OpenReward()
oai_client = OpenAI()
MODEL_NAME = "gpt-5.2"

environment = or_client.environments.get(name="historyenv", base_url="http://localhost:8080")
tasks = environment.list_tasks(split="train")
tools = environment.list_tools(format="openai")

example_task = tasks[0]

with environment.session(task=example_task, secrets={"openai_api_key": os.getenv("OPENAI_API_KEY")}) as session:
    prompt = session.get_prompt()
    input_list = [{"role": "user", "content": prompt[0].text}]
    finished = False
    print(input_list)

    while not finished:
        response = oai_client.responses.create(
            model=MODEL_NAME,
            tools=tools,
            input=input_list
        )
        print(response.output)

        input_list += response.output

        for item in response.output:
            if item.type == "function_call":
                tool_result = session.call_tool(item.name, json.loads(str(item.arguments)))

                reward = tool_result.reward
                finished = tool_result.finished

                input_list.append({
                    "type": "function_call_output",
                    "call_id": item.call_id,
                    "output": json.dumps({
                        "result": tool_result.blocks[0].text
                    })
                })

                print(input_list[-1])
                print(reward)

                if tool_result.finished:
                    finished = True
                    break
Step 3: Run your code

python test_rubrics.py
Example output:
[ResponseOutputMessage(id='msg_00b7c91bdcd6757f0069920e0cd92c81a3ad06e75fa90d7c09', content=[ResponseOutputText(annotations=[], text='World War I (1914–1918) was caused by a combination of long-term structural tensions in Europe and a short-term crisis that spiraled into a general war.\n\n## Long-term causes (underlying conditions)\n\n### 1) Alliance systems and bloc politics\nEurope was divided into two major, increasingly rigid camps:\n- **Triple Alliance**: Germany, Austria-Hungary, Italy (Italy later stayed neutral and then joined the other side)\n- **Triple Entente**: Britain, France, Russia\n\nThese alliances were partly defensive, but they encouraged **chain reactions**: a conflict involving one state could quickly pull in its partners.\n\n### 2) Militarism and arms races\nMajor powers expanded their armies and developed detailed war plans:\n- Germany’s **Schlieffen Plan** envisioned a rapid strike against France if war broke out with Russia.\n- Russia, France, and others had their own mobilization timetables.\n- Britain and Germany competed in a major **naval arms race**.\n\nBecause mobilization was seen as critical, crises became “use it or lose it,” making de-escalation harder.\n\n### 3) Nationalism\nNationalism increased both unity and conflict:\n- **French revanchism** after losing Alsace-Lorraine to Germany (1871).\n- **Pan-Slavism** and Serbian nationalism in the Balkans, challenging Austria-Hungary.\n- Nationalist movements inside multi-ethnic empires (especially Austria-Hungary) threatened internal stability.\n\n### 4) Imperial and economic rivalry\nCompetition for colonies, markets, and prestige heightened distrust:\n- Crises such as those over **Morocco** (1905, 1911) worsened relations, particularly between Germany and France/Britain.\n- Economic power shifts (notably Germany’s rapid growth) fed strategic anxiety.\n\n### 5) Instability in the Balkans (“powder keg”)\nThe decline of Ottoman influence and the ambitions of Austria-Hungary and Russia made the Balkans volatile:\n- **Balkan Wars (1912–1913)** intensified territorial disputes and regional militarization.\n- Serbia’s increased power alarmed Austria-Hungary.\n\n## Short-term cause (trigger and escalation)\n\n### Assassination of Archduke Franz Ferdinand (June 28, 1914)\nA Bosnian Serb nationalist linked to networks in Serbia assassinated the heir to Austria-Hungary in Sarajevo. Austria-Hungary used the event to confront Serbia.\n\n### The July Crisis: how a regional conflict became a world war\nKey steps:\n1. **Germany gave Austria-Hungary strong backing** (“blank check”), encouraging a tough stance.\n2. Austria-Hungary issued an **ultimatum** to Serbia; Serbia accepted most terms but not all.\n3. Austria-Hungary **declared war on Serbia** (July 28).\n4. Russia **mobilized** to support Serbia.\n5. Germany **declared war on Russia** (Aug 1) and **France** (Aug 3).\n6. Germany invaded **Belgium** to reach France, leading Britain to **declare war on Germany** (Aug 4) due to treaty commitments and strategic concerns.\n\n## Bottom line\nWorld War I was not caused by a single factor. It resulted from **entangled alliances, militarized planning, nationalism (especially in the Balkans), imperial rivalries, and repeated crises**, with the assassination in 1914 triggering a rapid escalation that leaders could not—or would not—stop.\n\nIf you want, I can compare the “systemic causes” vs. 
“blame/decision” arguments historians debate (e.g., Germany’s role, miscalculation, or shared responsibility).', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')]
[ResponseFunctionToolCall(arguments='{"answer":"World War I (1914–1918) had deep long‑term causes and a short‑term trigger that escalated into a general European war.\\n\\nLong‑term causes\\n1) Alliance systems and bloc politics: By 1914 Europe was divided into two rival groupings—Germany and Austria‑Hungary (with Italy loosely tied in the Triple Alliance) versus France, Russia, and Britain (the Triple Entente). These commitments were meant to deter war but in a crisis they encouraged a chain reaction, turning local disputes into continent‑wide conflict.\\n\\n2) Militarism and arms races: States expanded armies, invested heavily in weapons, and treated mobilization as decisive. War plans (e.g., Germany’s Schlieffen Plan) and tight railway timetables created pressure to mobilize quickly, making leaders fear that delay meant defeat and reducing room for diplomacy once a crisis began.\\n\\n3) Nationalism: Aggressive nationalism sharpened rivalries. France resented the loss of Alsace‑Lorraine (1871). In the Balkans, Serbian nationalism and pan‑Slav sentiment challenged Austria‑Hungary, while nationalist movements within multi‑ethnic empires threatened internal cohesion and pushed leaders toward hardline policies.\\n\\n4) Imperial and economic rivalry: Competition for colonies, markets, and status—seen in crises like Morocco (1905, 1911)—worsened distrust, especially between Germany and the Entente powers.\\n\\n5) Balkan instability: The Ottoman retreat and the Balkan Wars (1912–13) left unresolved territorial and ethnic disputes. Serbia’s rise alarmed Austria‑Hungary; Russia viewed itself as Serbia’s patron, making the region a flashpoint.\\n\\nShort‑term trigger and escalation\\nThe immediate spark was the assassination of Archduke Franz Ferdinand in Sarajevo (28 June 1914) by a Bosnian Serb nationalist connected to Serbian networks. Austria‑Hungary, backed by Germany’s “blank check,” issued a harsh ultimatum to Serbia and then declared war (28 July). Russia mobilized to support Serbia; Germany declared war on Russia (1 Aug) and France (3 Aug). Germany’s invasion of Belgium brought Britain into the war (4 Aug). In weeks, alliance commitments, mobilization schedules, and mutual fears transformed a Balkan crisis into World War I."}', call_id='call_reMjWpi1OS8rfBg0j8pp93fX', name='answer', type='function_call', id='fc_00b7c91bdcd6757f0069920e1b0cfc81a3b5b103f065b9e5a2', status='completed')]
{'type': 'function_call_output', 'call_id': 'call_reMjWpi1OS8rfBg0j8pp93fX', 'output': '{"result": "\\n**Total Score: 6.5 / 10.0**\\n**Normalized Reward: 0.650**\\n\\n## Rubric-by-Rubric Feedback\\n\\n### Rubric 1: 1.0 / 1.0\\n**Criterion**: Direct answer + defensible thesis about \'cause\'. 0: No clear answer; descriptive list. +0.5: Vague t...\\n**Feedback**: Analysis: The essay presents a clear, defensible thesis that World War I resulted from long\\u2011term structural causes (alliances, militarism, nationalism, imperial rivalry, Balkan instability) combined with the short\\u2011term trigger of Franz Ferdinand\\u2019s assassination, and it consistently applies that framework in the body. It goes beyond a mere list by explaining how these factors interacted to produce mobilisation pressures and a chain reaction, so it meets the criterion for a sustained, arguable claim about causation.\\n\\nScore: 1.0\\n\\n### Rubric 2: 0.5 / 1.0\\n**Criterion**: Chronological command of 1908\\u201314 and July Crisis. 0: Major errors. +0.5: Broad chronology right but ...\\n**Feedback**: Analysis: The essay gives a correct broad sequence from the assassination to general war and mentions relevant 1912\\u201313 Balkan instability, but it lacks tighter chronological command of 1908\\u201314 (no mention of the 1908 Bosnian Crisis or the Agadir/1905\\u201311 tensions) and omits key July 1914 turning-point details and dates (e.g. Germany\\u2019s 5 July \\"blank cheque,\\" Austria\\u2019s ultimatum of 23 July, Serbia\\u2019s reply on 25 July, partial vs. general Russian mobilization). Because the overall order is right but important steps and precise timing in the July Crisis are missing, it merits a partial credit. \\nScore: 0.5\\n\\n### Rubric 3: 0.5 / 1.0\\n**Criterion**: Explains why 1914 (timing problem). 0: Doesn\'t address why earlier crises didn\'t produce war. +0.5: ...\\n**Feedback**: Analysis: The essay names pre\\u20111914 crises (the Moroccan incidents and the Balkan Wars) and notes their role in increasing distrust and leaving unresolved disputes, but it does not analyze why those earlier crises were contained or why 1914 produced general war when they had not. It therefore mentions earlier crises without using them analytically to explain the timing of escalation.\\n\\nScore: 0.5\\n\\n### Rubric 4: 0.5 / 1.0\\n**Criterion**: Structural causes: alliances/diplomacy as mechanisms. 0: Alliances as automatic dominoes without exp...\\n**Feedback**: Analysis: The essay clearly describes the opposing alliance blocs and notes that commitments produced a \\u201cchain reaction\\u201d and cites Germany\\u2019s \\u201cblank check,\\u201d but it stops short of explaining how those commitments concretely shaped state incentives and constrained diplomatic choices (e.g. obligation to mobilize, fear of isolation, bargaining leverage and rigidity). Because the mechanism is asserted rather than analysed in detail, the response fits the partial-credit category.  \\nScore: 0.5\\n\\n### Rubric 5: 1.0 / 1.0\\n**Criterion**: Militarism and war planning. 0: \'Countries liked war\' with no institutional explanation. +0.5: Menti...\\n**Feedback**: Analysis: The essay explicitly links militarism to specific institutions and practices\\u2014war plans (Schlieffen), mobilization timetables, and railway schedules\\u2014and explains how these created pressure to mobilize quickly and reduced diplomatic room, thereby accelerating escalation. 
This goes beyond a vague claim that \\"countries liked war\\" and provides the causal mechanism the criterion requires.\\n\\nScore: 1.0\\n\\n### Rubric 6: 1.0 / 1.0\\n**Criterion**: Nationalism and the Balkans. 0: Balkan nationalism as background noise. +0.5: Acknowledges volatilit...\\n**Feedback**: Analysis: The essay clearly treats Balkan nationalism as more than background, detailing Serbian nationalism, the Balkan Wars, and the Ottoman retreat, and it links these developments to great\\u2011power responses\\u2014Austria\\u2011Hungary\\u2019s alarm, Russia\\u2019s patronage of Serbia, and how the Sarajevo assassination produced a chain of mobilizations. This shows how Balkan politics interacted with and helped shape great\\u2011power calculations, so the treatment meets the highest level of the criterion.\\n\\nScore: 1.0\\n\\n### Rubric 7: 1.0 / 1.0\\n**Criterion**: Great-power agency. 0: One-sided blame or vague \'everyone equally\'. +0.5: Covers actors unevenly; re...\\n**Feedback**: Analysis: The essay attributes clear agency to the main great powers\\u2014Germany (Schlieffen Plan, \\u201cblank check\\u201d), Austria\\u2011Hungary (ultimatum), Russia (mobilization to defend Serbia), France and Britain (alliance commitments, response to Belgian invasion)\\u2014and links these actions to specific motives and constraints like alliance obligations, mobilization timetables, and imperial/nationalist pressures. While it could deepen causal weighting, it nevertheless argues responsibility with concrete examples rather than vague equal\\u2011blame assertions.\\n\\nScore: 1.0\\n\\n### Rubric 8: 0.0 / 1.0\\n**Criterion**: Historiography. 0: Little engagement or name-dropping. +0.5: Uses interpretations but evaluation is ...\\n**Feedback**: Analysis: The essay provides a clear, conventional summary of causes but contains no engagement with historiographical debate\\u2014no historians, schools, or competing interpretations are referenced or evaluated (e.g., Fischer, A.J.P. Taylor, Christopher Clark). It therefore qualifies as little engagement with historiography rather than using or weighing rival scholarly arguments.\\n\\nScore: 0\\n\\n### Rubric 9: 0.0 / 1.0\\n**Criterion**: Evidence and scholarship. 0: Sparse/unsuitable evidence. +0.5: Uses works but cherry-picks or doesn\'...\\n**Feedback**: Analysis: The essay gives accurate factual examples (Schlieffen Plan, Moroccan Crises, Balkan Wars, assassination) but offers no engagement with scholarship or citation of historians/primary sources, nor does it acknowledge limits or competing interpretations (e.g., Fischer, A.J.P. Taylor, Christopher Clark). Because it presents assertions without discriminating evidence or historiographical context, it fails the criterion for evidence and scholarship.\\n\\nScore: 0\\n\\n### Rubric 10: 1.0 / 1.0\\n**Criterion**: Argumentative craft. 0: Unclear structure, weak paragraphs. +0.5: Readable and structured but argume...\\n**Feedback**: Analysis: The essay is well organized with a clear architecture\\u2014separate, focused sections on long\\u2011term causes and the short\\u2011term trigger\\u2014and each paragraph/point concisely advances the argument. It deploys relevant evidence (Schlieffen Plan, Balkan Wars, assassination, \\u201cblank check,\\u201d Belgian invasion) and maintains a coherent causal chain from underlying tensions to escalation.  \\nScore: 1.0\\n"}'}
0.65
In the example above we got a reward of 0.65, or 6.5/10. Studying the grader response, we see the model answer lost points because it did not engage with the existing scholarship and missed key dates and other details. This shows how rubrics work in practice, and how we can use them in ORS environments. But we should reflect: there are some issues with the rubrics we have defined for this environment, as well as with the way we have just prompted the language model:
  • Hallucination. The rubrics, as defined, do not punish hallucination directly. For example, what if the model made up an event or got a date wrong? Our rubrics do not punish this directly.
  • Required output format. It may be acceptable to omit detail in a chatbot response; it is less acceptable in a university thesis or an undergraduate essay. We should make it clear to the language model what type of output is required so it knows how much detail is needed.
  • Tools and citations. We have not given the language model access to a web search tool so it can search the literature. We have also not explicitly mentioned a need for citations, nor defined a rubric for citation use (see the sketch after this list). Likewise, we have not given the grader access to tools that could be used for verification (and to identify and punish hallucination).
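One mitigation for the first and third issues is to add criteria that directly target fabrication and sourcing. Here is a hypothetical sketch of two additional entries, following the same criterion/points format as WWI_RUBRICS (the wording is illustrative only):
# Hypothetical additions to WWI_RUBRICS
EXTRA_RUBRICS = [
    {
        "criterion": "Factual accuracy. 0: Invents events, dates, or figures. +0.5: Minor factual slips that do not affect the argument. +1.0: All named events, dates, and figures are accurate.",
        "points": 1.0
    },
    {
        "criterion": "Citation use. 0: No citations. +0.5: Cites works inconsistently or without supporting key claims. +1.0: Consistent citations that support the central claims.",
        "points": 1.0
    }
]
Bear in mind that a grader without search tools can only judge accuracy against its own knowledge, so a criterion like the first is most reliable when paired with verification tools.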
This is not a problem with rubrics per se, but it does show we need to be careful to elicit and reward the right behaviour. When using rubrics to train models, you should pay close attention to model responses to ensure they are not reward hacking and are appropriately incentivised to exhibit the right behaviours.

Limitations of Rubrics

We have already touched upon some limitations of our rubric example, but we should note some general themes to look out for:
  • Reward Hacking - if we do not specify the desired behaviours correctly, or underspecify requirements, then the model may find a way to achieve reward without exhibiting the desired behaviours.
  • Subjectivity - if an objective metric is available, we should always prefer that. If we use a subjective human measure of quality, then we may limit the creativity of the model to discover better-than-human solutions.
  • Cost - LLM graders can become costly, especially in large-scale reinforcement learning. With rubrics, we often make a separate grader call for each rubric, which multiplies costs further. If a simple rules-based method can identify and reward the same behaviour, it is preferable to an LLM-graded rubric (see the sketch after this list).
  • Grader Quality - We are also bound by the quality of the grader. If a rubric is difficult to judge, or requires deep expertise, then it is recommended to use reasoning models as graders.
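As an example of the cost point, a binary rubric such as “did the answer mention a particular event?” from the introduction can often be scored with a cheap rules-based check rather than an LLM call. A minimal sketch (the patterns and sample essay are illustrative):
import re

def mentions_event(essay: str, patterns: list[str]) -> float:
    """Binary rubric: 1.0 if any pattern for the event appears, else 0.0."""
    return 1.0 if any(re.search(p, essay, re.IGNORECASE) for p in patterns) else 0.0

essay_text = "The assassination of Archduke Franz Ferdinand in Sarajevo triggered the July Crisis."
print(mentions_event(essay_text, [r"franz ferdinand", r"sarajevo"]))  # 1.0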