Goals
- Set up distributed RL training with Tinker
- Configure an OpenReward environment for training
- Monitor training progress with WandB
- Train a model on the WhoDunIt environment
Prerequisites
- A Tinker account and API key
- An OpenReward account and API key
- A WandB account and API key
- Python 3.11+
Setup
Tinker is a flexible API for efficiently fine-tuning open-source models with LoRA. In this tutorial, we'll use it to train a language model on an OpenReward environment using reinforcement learning. First, clone the OpenReward cookbook repository and navigate to the Tinker training example. Then create a `.env` file with your API credentials.
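For example (the repository URL, example directory, and environment-variable names below are assumptions; use the exact values from the cookbook's README):

```bash
# Clone the cookbook and enter the Tinker training example
# (URL and path are illustrative).
git clone https://github.com/openreward/cookbook.git
cd cookbook/examples/tinker

# Store your API credentials in a .env file; the variable names the
# example expects may differ from these.
cat > .env <<'EOF'
TINKER_API_KEY=your-tinker-key
OPENREWARD_API_KEY=your-openreward-key
WANDB_API_KEY=your-wandb-key
EOF
```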
Understanding the Training Pipeline
The training pipeline combines three services:
- Tinker provides the distributed compute infrastructure for running training
- OpenReward provides the environments and tasks for the agent to learn from
- WandB tracks metrics, logs, and training progress
Selecting an Environment
Browse the available environments at OpenReward. We'll use the `GeneralReasoning/WhoDunIt` environment for this tutorial; it challenges agents to solve mystery scenarios. Note the environment identifier, `GeneralReasoning/WhoDunIt`, for use in your config.
Configuration
Now we'll update the configuration file for the training run. Open `tinker-config.yaml` and update the environment configuration to use `GeneralReasoning/WhoDunIt`.
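The exact layout of the cookbook's config may differ; the snippet below is an illustrative sketch in which the `environment` key name and all values are assumptions:

```yaml
# Illustrative tinker-config.yaml sketch — key names, nesting, and values
# are assumptions; check the cookbook's actual file.
environment: GeneralReasoning/WhoDunIt  # OpenReward environment ID (assumed key name)

wandb_project_name: openreward-tinker   # WandB project for metrics
wandb_run_name: whodunit-lora-rl        # name for this run

model_name: Qwen/Qwen3-8B               # illustrative; use any base model Tinker supports
lora_rank: 32                           # LoRA adapter rank
batch_size: 64
learning_rate: 1.0e-5
save_every: 10                          # checkpoint every N batches
num_rollouts: 8                         # rollouts per environment per batch
temperature: 1.0                        # sampling temperature
```

Each setting controls the following: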
- `wandb_project_name`: The WandB project where metrics will be logged
- `wandb_run_name`: A name for this specific training run
- `model_name`: The base model to fine-tune (using LoRA)
- `lora_rank`: The rank for LoRA adapters (lower = fewer parameters)
- `batch_size`: Number of samples per training batch
- `learning_rate`: Learning rate for the optimizer
- `save_every`: Save a checkpoint every N batches
- `num_rollouts`: Number of rollouts to collect per environment per batch
- `temperature`: Sampling temperature for the model (1.0 = default)
Running Training
Now we can start training. The cookbook includes a `main.py` file with the complete training loop.
Run training with:
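(Assuming `main.py` reads `tinker-config.yaml` and the `.env` credentials from the working directory; the exact invocation may differ.)

```bash
python main.py
```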
The training script (sketched in code after this list) will:
- Load your model and initialize LoRA adapters
- Connect to Tinker’s infrastructure
- Sample rollouts from the WhoDunIt environment
- Compute rewards and update the model
- Log metrics to WandB
- Save checkpoints periodically
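Putting those steps together, here is a simplified Python sketch of the loop. The helpers (`load_config`, `make_tinker_lora_trainer`, `make_env_client`, `collect_rollouts`, `compute_advantages`, `mean_reward`) and the `num_batches` setting are hypothetical stand-ins for the cookbook's actual code, not Tinker's or OpenReward's real APIs:

```python
# Simplified structure of the RL training loop; all helpers below are
# hypothetical stand-ins, not the cookbook's or Tinker's real API.
import wandb

def train(config_path: str = "tinker-config.yaml") -> None:
    cfg = load_config(config_path)  # hypothetical: parse the YAML config
    wandb.init(project=cfg["wandb_project_name"], name=cfg["wandb_run_name"])

    # Connect to Tinker's infrastructure and initialize LoRA adapters.
    trainer = make_tinker_lora_trainer(
        model_name=cfg["model_name"],
        lora_rank=cfg["lora_rank"],
        learning_rate=cfg["learning_rate"],
    )
    # Connect to the OpenReward environment.
    env = make_env_client("GeneralReasoning/WhoDunIt")

    for step in range(cfg["num_batches"]):  # num_batches is an assumed setting
        # Sample rollouts from the WhoDunIt environment with the current policy.
        rollouts = collect_rollouts(
            trainer, env,
            n=cfg["num_rollouts"],
            temperature=cfg["temperature"],
        )
        # Compute rewards/advantages and apply an RL update to the LoRA weights.
        advantages = compute_advantages(rollouts)
        metrics = trainer.update(rollouts, advantages, batch_size=cfg["batch_size"])

        # Log metrics to WandB.
        wandb.log({"reward/mean": mean_reward(rollouts), **metrics}, step=step)

        # Save checkpoints periodically.
        if step > 0 and step % cfg["save_every"] == 0:
            trainer.save_checkpoint(step)
```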
Monitoring Training
Your training metrics will appear in your WandB dashboard, where you can track rewards, response lengths, and other key metrics in real time, including:
- Training loss over time
- Average reward per episode
- Success rate on tasks
- Learning rate schedule
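If you prefer to pull these metrics programmatically rather than through the dashboard, the standard `wandb.Api` works; the run path and metric key names below are placeholders:

```python
import wandb

# Fetch logged metrics for an in-progress or finished run.
api = wandb.Api()
run = api.run("your-entity/your-wandb-project/your-run-id")  # placeholder run path
history = run.history(keys=["reward/mean", "train/loss"])    # assumed metric names
print(history.tail())
```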


Logs and checkpoints are also written to a local directory (/tmp/tinker-rl in our example). These contain detailed information about each training step.

