Goals

  • Set up distributed RL training with Slime
  • Configure an OpenReward environment for training
  • Monitor training progress with WandB
  • Train a model on the WhoDunIt environment

Prerequisites

  • Slime installed locally (pip install -e /path/to/slime)
  • An OpenReward account and API key
  • A WandB account and API key
  • Python 3.11+
  • NVIDIA GPUs (tested on H100/H200)

Setup

Slime is an RL post-training framework from Tsinghua University. It uses SGLang for fast inference and supports FSDP or Megatron backends for distributed training. In this tutorial, we’ll use it to train a language model on an OpenReward environment using reinforcement learning with GRPO. First, clone the OpenReward cookbook repository and navigate to the Slime training example:
git clone https://github.com/OpenRewardAI/openreward-cookbook.git
cd openreward-cookbook/training/slime
Install the required packages:
pip install -r requirements.txt
Or using uv:
uv pip install -r requirements.txt
Next, set the required environment variables:
export OPENREWARD_API_KEY=your_openreward_key_here
export WANDB_API_KEY=your_wandb_key_here
export OPENAI_API_KEY=your_openai_key_here  # If environments use LLM-based graders
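Before launching anything, it can help to confirm these keys are actually visible to your shell. A quick, purely optional check in Python:
import os
for var in ("OPENREWARD_API_KEY", "WANDB_API_KEY"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set - export it before training")
print("API keys found")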

Understanding the Training Pipeline

The training pipeline combines three services:
  • Slime provides the distributed compute infrastructure for running training (FSDP or Megatron backend) and SGLang for fast inference during rollouts
  • OpenReward provides the environments and tasks for the agent to learn from
  • WandB tracks metrics, logs, and training progress
As training runs, Slime samples multi-turn rollouts from your OpenReward environment, scores them with the environment’s rewards, computes advantages using GRPO, and updates the model with reinforcement learning. Per-token log probabilities are tracked for importance sampling, and trajectories are uploaded to OpenReward for visualization.
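As a rough illustration of the GRPO advantage step (a minimal sketch, not Slime’s implementation, which may differ in details such as normalization): each prompt’s group of n-samples rollouts is scored by the environment, and each rollout’s advantage is its reward normalized against the group mean and standard deviation.
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    # group_rewards: environment rewards for the rollouts of a single prompt
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four rollouts of the same prompt, two of which solved the task
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for successes, negative otherwise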

Selecting an Environment

Browse available environments at OpenReward. For this tutorial we’ll use the GeneralReasoning/WhoDunIt environment, which challenges agents to solve mystery scenarios. Click the copy button on the environment page to copy the identifier GeneralReasoning/WhoDunIt for use in your config.

Configuration

Training is configured via two files:

train_config.yaml — Environment & agent settings

Open train_config.yaml and update the environment configuration to use GeneralReasoning/WhoDunIt:
environments:
  GeneralReasoning/WhoDunIt:
    splits:
      - train
    nonterminal_reward: 0.0
    reward_reduction: sum
    max_turns: 20
You can train on multiple environments simultaneously by adding entries:
environments:
  GeneralReasoning/WhoDunIt:
    splits: [train]
    reward_reduction: sum
    max_turns: 20

  MATH/GSM8K:
    splits: [train]
    reward_reduction: mean
    max_turns: 10
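To make reward_reduction concrete: sum totals a trajectory’s per-turn rewards, while mean averages them. A hypothetical helper illustrating the two reductions (the actual reduction happens inside the rollout code):
def reduce_rewards(turn_rewards, reduction="sum"):
    # turn_rewards: per-turn rewards collected over one multi-turn rollout
    if reduction == "sum":
        return sum(turn_rewards)
    if reduction == "mean":
        return sum(turn_rewards) / len(turn_rewards)
    raise ValueError(f"unknown reward_reduction: {reduction}")

print(reduce_rewards([0.0, 0.0, 1.0], "sum"))   # 1.0
print(reduce_rewards([0.5, 1.0], "mean"))       # 0.75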

run.sh — Training hyperparameters

All training, optimizer, cluster, and rollout settings are passed via run.sh CLI flags:
Flag                   Default             Description
--model                Qwen/Qwen3-30B-A3B  HuggingFace checkpoint
--lr                   1e-5                Learning rate
--n-samples            16                  Rollouts per prompt (for GRPO)
--rollout-batch-size   32                  Prompts per rollout batch
--max-response-len     4096                Max response tokens per generation call
--max-tokens-per-gpu   8192                Token cap per GPU in training (OOM prevention)
--temperature          1.0                 Sampling temperature
--train-backend        fsdp                fsdp or megatron

Running Training

Training is a two-step process. First, fetch tasks from OpenReward and write a Slime-compatible JSONL dataset:
python prepare_tasks.py --config train_config.yaml --output tasks.jsonl
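To sanity-check the generated dataset before launching training, you can count the tasks and peek at one record (the exact schema is whatever prepare_tasks.py emits; no field names are assumed here):
import json
with open("tasks.jsonl") as f:
    tasks = [json.loads(line) for line in f]
print(f"{len(tasks)} tasks")
print(sorted(tasks[0].keys()))  # fields of the first task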
Then, from the Slime repo root, launch training:
cd /path/to/slime
bash /path/to/openreward-cookbook/training/slime/run.sh
Common overrides:
# Different model
bash run.sh --model Qwen/Qwen3-4B

# Adjust GPU allocation
bash run.sh --actor-gpus 4 --rollout-gpus 4 --tp 4

# Tune training
bash run.sh --lr 5e-6 --n-samples 8 --rollout-batch-size 16

# Enable gradient checkpointing (reduces memory, ~10% slower)
bash run.sh -- --gradient-checkpointing

# Pass arbitrary Slime args after --
bash run.sh -- --context-parallel-size 2 --use-kl-loss --kl-loss-coef 0.01
To resume from a checkpoint:
bash run.sh --load /path/to/checkpoints/
Training will begin and you’ll see output in your terminal. The training process will:
  1. Load your model and prepare for distributed training
  2. Connect to SGLang for inference
  3. Sample multi-turn rollouts from the WhoDunIt environment
  4. Compute rewards and update the model using GRPO
  5. Log metrics to WandB
  6. Save checkpoints periodically

Monitoring Training

Your training metrics will appear in your WandB dashboard. You can track rewards, response lengths, and other key metrics in real time. To view your WandB dashboard, go to https://wandb.ai/ and navigate to your project. You’ll see charts showing:
  • Training loss over time
  • Average reward per episode
  • Success rate on tasks
  • Learning rate schedule
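If you prefer scripting over the dashboard, you can also pull metrics with the public wandb API. A small sketch that lists runs and the size of their logged history (replace the entity/project placeholder with your own):
import wandb
api = wandb.Api()
for run in api.runs("your-entity/your-project"):  # placeholder entity/project
    df = run.history()  # sampled history of logged metrics as a DataFrame
    print(run.name, run.state, df.shape)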
Detailed rollout data is uploaded to your OpenReward runs page, where you can browse the list of rollouts and inspect individual trajectories.

Additional tips

Some environments require additional secrets, for example those that use LLM graders or external search APIs. You can configure these in the secrets section of train_config.yaml:
secrets:
  openai_api_key: null  # null = read from OPENAI_API_KEY env var
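The null convention means the value is resolved from the corresponding environment variable at runtime, conceptually something like this hypothetical helper (not the actual loader):
import os
def resolve_secret(name, value):
    # null in the YAML -> fall back to the upper-cased environment variable
    return value if value is not None else os.environ.get(name.upper())
print(resolve_secret("openai_api_key", None))  # reads OPENAI_API_KEY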

Memory considerations

Multi-turn agent rollouts produce long sequences (system prompt + tools + N turns of generation + tool responses). This can cause OOM during training. Key levers:
  • --max-tokens-per-gpu N + --use-dynamic-batch-size: Caps tokens packed per GPU per training step. Start at max_response_len and increase for throughput.
  • --gradient-checkpointing: Trades ~10% speed for significantly less activation memory. Recommended for models with large vocabularies (e.g. Qwen3’s 152k vocab).
  • --context-parallel-size N: Splits long sequences across N GPUs (requires N actor GPUs).
  • max_turns in train_config.yaml: Fewer turns = shorter sequences.
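A quick back-of-envelope calculation shows why these levers matter: with the defaults above, the worst-case sequence length of a 20-turn rollout dwarfs the per-GPU token cap (the system-prompt and tool-output sizes below are assumptions; adjust them to your environment).
system_and_tools = 1500       # tokens for system prompt + tool schemas (assumption)
per_turn_generation = 4096    # --max-response-len
per_turn_tool_output = 800    # tokens returned by the environment per turn (assumption)
max_turns = 20                # from train_config.yaml
worst_case = system_and_tools + max_turns * (per_turn_generation + per_turn_tool_output)
print(worst_case)  # 99420 tokens, far above --max-tokens-per-gpu 8192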

Next Steps