# Qwen3-4B GRPO on Haiku with slime on Modal

Qwen3-4B GRPO on haiku poems — structure score + LLM judge.
This tutorial teaches Qwen3-4B to write 5-7-5 haiku poems about Modal-flavored topics. The training algorithm is GRPO (Group Relative Policy Optimization, the method popularized by DeepSeek-R1): for each prompt, the model generates a group of candidate responses, each response is scored by a reward model, and the policy is updated to prefer responses that scored above the group’s mean.
The interesting piece is the reward model, which has two halves:
- Structure — deterministic, cheap, and always on. A 5-7-5 syllable check via CMUdict. No network, no GPU, runs inside the training loop.
- Style — an LLM-as-judge score. A separately deployed vLLM server rates the poem on relevance, poetic quality, and Modal vocabulary usage. Optional: activated only when `LLM_JUDGE_URL` is set in the training environment, so the tutorial is runnable end-to-end without standing up a second service.
Both pieces are wired into slime through its `custom_rm_path` hook. In notebooks, the tutorial writes a local `haiku.py` file that slime imports; from the CLI, the same config falls back to the packaged `modal_training_gym.common.haiku_reward` module.
The workflow has four stages:
| Stage | What it does | Where |
|---|---|---|
| `download_model` | Pulls `Qwen/Qwen3-4B` into the HF cache volume | 1×H100 |
| `prepare_dataset` | Downloads `statworx/haiku` and writes train/test parquet | CPU |
| `train` | GRPO training loop over the haiku prompts | 1×8×H100, colocated |
| `serve_app` (separate) | Hosts the finished checkpoint via vLLM + Flash | 1×H100 |
Invoke any function on the returned training app via `modal run` from the CLI, or interactively with `app.run()` + `.remote()` from a notebook.
## Imports

Three import groups:

- `modal` — needed only for the interactive cells below (`modal.enable_output()`, `app.run()`). The training app itself is built by the launcher, so the tutorial body never touches the Modal SDK directly.
- `haiku` / `modal_training_gym.common.haiku_reward` — the custom reward module. In notebooks, the next cell writes a local `haiku.py` file so you can edit the rubric inline and ship it to the training image with `local_python_sources=["haiku"]`. In the plain `.py` tutorial, the same code falls back to the packaged reference module at `modal_training_gym/common/haiku_reward.py`.
- `modal_training_gym.*` — shared containers (`DatasetConfig`, `Qwen3_4B`, `WandbConfig`), the vLLM serving helper, and slime’s framework-specific config + launcher classes. `DATA_PATH` is the canonical mount point (`/data`) for the shared data volume on every slime container.
```python
import modal

try:
    import haiku as _local_haiku
except ImportError:
    _local_haiku = None

if _local_haiku is not None and hasattr(_local_haiku, "haiku_rm"):
    haiku = _local_haiku
    CUSTOM_RM_PATH = "haiku.haiku_rm"
    LOCAL_PYTHON_SOURCES = ["haiku"]
else:
    from modal_training_gym.common import haiku_reward as haiku

    CUSTOM_RM_PATH = "modal_training_gym.common.haiku_reward.haiku_rm"
    LOCAL_PYTHON_SOURCES = []

from modal_training_gym.common.dataset import HuggingFaceDataset
from modal_training_gym.common.models import Qwen3_4B
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.common.serve_vllm import build_vllm_serve_app
from modal_training_gym.frameworks.slime import ModalConfig, SlimeConfig
from modal_training_gym.frameworks.slime.config import DATA_PATH
```

## Define the dataset

slime reads training data from parquet files on a shared volume. A `DatasetConfig` subclass specifies two things:
- Where the files live (`prompt_data`, `eval_prompt_data`) — paths under `DATA_PATH` (`/data` inside every slime container).
- How to produce them (`prepare()`) — runs once inside the `prepare_dataset` Modal function with the data volume mounted read-write, and should write the parquet files atomically.
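"Atomically" here means write-then-rename, so a reader never observes a half-written parquet file. A minimal sketch of that pattern (`atomic_write` is a hypothetical helper for illustration; the tutorial's `prepare()` below writes the files directly):

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename.

    os.replace is atomic on POSIX, so concurrent readers of `path`
    see either the old file or the complete new one, never a
    partial write.
    """
    dir_ = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The same-directory temp file matters: `os.replace` is only atomic within a single filesystem.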
The class attrs also map to slime CLI flags:
input_key="messages"— column name slime reads the chat conversation from.label_key="label"— column holding the ground-truth poem (used only by the optional LLM judge to grade relative quality; the structure-only reward ignores it).apply_chat_template=True— ask slime to run the tokenizer’s chat template overmessagesbefore rollout, matching what the model was pretrained with.rm_type="async_rm"— the reward model is an async Python function rather than a built-in (math,deepscaler, …) or a remote HTTP service.
Inside `prepare()` we:

- Download `statworx/haiku`, a dataset of ~7k haiku poems tagged with a `keywords` topic.
- Turn each row into a chat conversation: a short system prompt that asks for a haiku, plus a user prompt like `"Write me a haiku about cat."`.
- Also precompute the tokenized `prompt` string via the model’s chat template — slime uses that as the exact rollout input.
- Hold out the last 20% (capped at 1000 rows) as a test split so `eval_prompt_data` has something to score against during training.
```python
class HaikuDataset(HuggingFaceDataset):
    hf_repo = "statworx/haiku"
    input_key = "messages"
    label_key = "label"

    def __init__(self, data_path, hf_checkpoint):
        # `data_path` is DATA_PATH on whichever container is mounting
        # the shared volume — /data on slime containers. Stashing it
        # plus the HF model id lets prepare() run standalone later.
        self._data_path = str(data_path)
        self._hf_checkpoint = hf_checkpoint
        self.prompt_data = f"{self._data_path}/haiku/train.parquet"
        self.eval_prompt_data = ["haiku", f"{self._data_path}/haiku/test.parquet"]

    def prepare(self):
        import os

        from datasets import load_dataset
        from transformers import AutoTokenizer

        # Use the base-model tokenizer to render the chat template;
        # slime will re-tokenize at rollout, but pre-rendering the
        # prompt string here lets us log exactly what the model sees.
        tokenizer = AutoTokenizer.from_pretrained(self._hf_checkpoint)
        ds = load_dataset("statworx/haiku")

        system_prompt = (
            "You are a haiku poet. You will be given a prompt and you "
            "will need to write a haiku about the prompt."
        )

        def to_chat(example):
            # statworx/haiku rows look like
            # {"keywords": "Cat", "text": "<reference haiku>"}.
            # We turn `keywords` into a user question and keep the
            # reference poem under `label` for the judge to consume.
            keyword = example["keywords"].lower()
            question = f"Write me a haiku about {keyword}."
            messages = [
                {"content": system_prompt, "role": "system"},
                {"content": question, "role": "user"},
            ]
            return {
                "question": question,
                "label": example["text"],
                "messages": messages,
                "prompt": tokenizer.apply_chat_template(
                    messages, tokenize=False, enable_thinking=False
                ),
            }

        # Hold out the last 20% (≤ 1000 rows) as the test split so
        # `eval_prompt_data` has fresh prompts the policy never sees
        # during rollout.
        train_ds = ds["train"]
        test_size = min(1000, int(len(train_ds) * 0.2))
        test_ds = train_ds.select(range(len(train_ds) - test_size, len(train_ds)))
        train_ds = train_ds.select(range(len(train_ds) - test_size))

        os.makedirs(f"{self._data_path}/haiku", exist_ok=True)
        train_ds.map(to_chat, remove_columns=["keywords"]).to_parquet(
            f"{self._data_path}/haiku/train.parquet"
        )
        test_ds.map(to_chat, remove_columns=["keywords"]).to_parquet(
            f"{self._data_path}/haiku/test.parquet"
        )
```

## Reward model
GRPO needs a scalar reward for every rollout. The notebook writes a local `haiku.py` module; the CLI version uses the packaged `modal_training_gym.common.haiku_reward` module. Both expose the same `haiku_rm`, an async reward function matching the signature slime expects:
```python
async def haiku_rm(args, sample, **kwargs) -> float: ...
```

- `args` is the slime CLI namespace (rarely needed).
- `sample` exposes `.response` (the model’s generation), `.prompt`/`.question` (the input), and `.label` (the reference answer, when available).
- The return value is the reward. Higher is better; GRPO subtracts the group mean, so absolute magnitudes matter less than consistency across samples.
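To make the shape concrete, here is a toy reward function with the same calling convention (a stand-in for illustration, not the tutorial's `haiku_rm`; it rewards only the three-line structure):

```python
async def line_count_rm(args, sample, **kwargs) -> float:
    """Toy async reward: 1.0 if the response has exactly three
    non-empty lines, else 0.0.

    Follows the (args, sample, **kwargs) convention described above:
    `args` is unused here, and only `sample.response` is read.
    """
    lines = [ln for ln in sample.response.splitlines() if ln.strip()]
    return 1.0 if len(lines) == 3 else 0.0
```

You can exercise it with `asyncio.run(line_count_rm(None, sample))`, where `sample` is any object with a `.response` attribute.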
`haiku_rm` composes two scoring components:

- `score_haiku_structure(response, cmudict)` — a pure-Python check that returns a value in `[0, 1]`: a quarter-point for having three lines, plus up to a quarter per line for hitting the 5-7-5 syllable target (half credit for off-by-one). Uses NLTK’s CMUdict for pronunciation; falls back to a vowel-group heuristic for out-of-dict words.
- `HaikuStyleJudge(LlmJudge)` — a subclass of the shared `modal_training_gym.common.llm_judge.LlmJudge`. The base class handles the boring part (POST to an OpenAI-compatible chat-completions endpoint, regex-parse a number out of the reply, normalize to `[0, 1]`). The subclass overrides `build_prompt()` with the haiku-specific rubric: relevance + poetic quality + Modal vocabulary usage.
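The vowel-group fallback can be approximated like this (an illustrative sketch, not the module's exact heuristic):

```python
def fallback_syllables(word: str) -> int:
    """Rough syllable count for out-of-dict words: each maximal run
    of vowels counts as one syllable, a likely-silent trailing 'e'
    is dropped, and every word scores at least one syllable.
    """
    word = word.lower().strip(".,;:!?\"'")
    if word.endswith("e") and len(word) > 2 and word[-2] not in "aeiou":
        word = word[:-1]  # drop a likely-silent final 'e'
    count, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            count += 1  # start of a new vowel group
        prev_vowel = is_vowel
    return max(count, 1)
```

It is deliberately crude (English syllabification has many exceptions), but it keeps the structure score well-defined for words CMUdict has never seen.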
The two components are combined with curriculum-style gating —
the style score is evaluated only when the structure score is
perfect (1.0). Early in training the model has to learn 5-7-5
first; once it does, the style score kicks in and rewards
thematically interesting poems. Total reward is therefore
structure + style ∈ [0, 2].
Gating also makes the style path opt-in: it only fires when `LLM_JUDGE_URL` is set in the training environment. Without a judge deployed, the reward degrades gracefully to structure-only and the run still progresses — convenient for smoke-testing this tutorial.
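The two rules above read directly as code. A simplified sketch of the combination (the names `gated_reward` and `style_scorer` are illustrative, not the module's actual API):

```python
import os


async def gated_reward(structure_score: float, style_scorer) -> float:
    """Curriculum-style gating: structure is always rewarded; style
    is awaited only when structure is perfect AND a judge endpoint
    is configured via LLM_JUDGE_URL.

    `style_scorer` is an async callable returning a [0, 1] style
    score, so the total stays in [0, 2] with a graceful
    structure-only fallback.
    """
    reward = structure_score
    if structure_score == 1.0 and os.environ.get("LLM_JUDGE_URL"):
        reward += await style_scorer()
    return reward
```

Gating on a perfect structure score also saves judge calls early in training, when most rollouts fail the 5-7-5 check anyway.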
## Define the experiment

`SlimeConfig(...)` wires the model, dataset, reward, and cluster
shape into a single object. Fields that aren’t set here fall back to
slime’s defaults (optimizer, Megatron parallelism, GRPO clipping,
etc.) — see the
slime_gsm8k tutorial
for a fully explicit example.
Key choices for this run:
rm_type="async_rm"+custom_rm_path=CUSTOM_RM_PATH— when the notebook-createdhaiku.pyis present, slime importshaiku.haiku_rm; otherwise it falls back tomodal_training_gym.common.haiku_reward.haiku_rm.local_python_sources=LOCAL_PYTHON_SOURCESonModalConfig— ships the notebook-authoredhaiku.pyinto the training image when present. The packaged fallback already lives undermodal_training_gym, so it doesn’t need an extra local source.apply_chat_template_kwargs='{"enable_thinking": false}'— Qwen3’s chat template has aenable_thinkingtoggle; we disable it so rollouts produce poems, not chain-of-thought traces.save_interval=10,eval_interval=20— checkpoint every 10 steps, evaluate every 20.
```python
base_model = Qwen3_4B()

my_training_run = SlimeConfig(
    model=base_model,
    dataset=HaikuDataset(DATA_PATH, base_model.model_name),
    wandb=WandbConfig(project="slime-grpo", group="qwen3-4b-haiku"),
    ref_load=base_model.model_name,
    megatron_to_hf_mode="bridge",
    rm_type="async_rm",
    custom_rm_path=CUSTOM_RM_PATH,
    num_rollout=50,
    apply_chat_template_kwargs='{"enable_thinking": false}',
    eval_interval=20,
    n_samples_per_eval_prompt=8,
    save="/checkpoints/qwen3-4b-haiku",
    save_interval=10,
    local_python_sources=LOCAL_PYTHON_SOURCES,
    image_run_commands=[
        # Quote the nltk spec so the shell doesn't parse `>=` as a redirect.
        "uv pip install --system aiohttp 'nltk>=3.8.0'",
        "python -c \"import nltk; nltk.download('cmudict', quiet=True)\"",
    ],
)
```

## Build the Modal app

`build_app()` returns a `modal.App` with four functions defined against the right volumes, secrets, and GPU spec:
- `download_model` — pulls the HF checkpoint into the `huggingface-cache` volume (1×H100, 2-hour timeout).
- `prepare_dataset` — runs `HaikuDataset.prepare()` against the `slime-data` volume (CPU).
- `convert_checkpoint` — a no-op in bridge mode; for `megatron_to_hf_mode="raw"` experiments it would do the HF → torch_dist conversion.
- `train` — the actual GRPO run on a Modal `@clustered(n_nodes)` cluster (1×8×H100 here), with Ray brought up in-container via `ModalRayCluster`.
Every function closes over the my_training_run config above, so the
app is fully determined by that single object.
```python
app = my_training_run.build_app()
```

## Run training

Invoke the app’s functions with `modal run` from the CLI, or interactively via `app.run()` + `.remote()`.

## Serve the trained model

`build_vllm_serve_app(...)` is the post-training counterpart to
`build_app()`: a thin wrapper around the canonical Modal vLLM inference example that returns a second Modal app, fully independent of the training app. It registers one `@modal.web_server`-decorated function that mounts the checkpoints volume, runs `vllm serve <model_path>`, and exposes the standard OpenAI chat-completions API at `https://<workspace>--<app_name>-serve.modal.run`.
Because it’s a separate app, you `modal deploy` it (not `modal run`) — that gives you a long-lived URL you can point a playground UI or eval harness at. Modal scales the replica to zero after `scaledown_window` of idleness.
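Once deployed, any OpenAI-compatible client can call the endpoint. A minimal request-body sketch (the host below is a placeholder — substitute the URL `modal deploy` prints; `model` must match `served_model_name`):

```python
import json

# Placeholder URL — substitute your workspace and deployed app name.
url = "https://<workspace>--serve-haiku-model-serve.modal.run/v1/chat/completions"

payload = {
    "model": "qwen3-4b-haiku",  # must match served_model_name
    "messages": [
        {"role": "user", "content": "Write me a haiku about volumes."},
    ],
}
body = json.dumps(payload)
# POST `body` to `url` with header Content-Type: application/json,
# e.g. requests.post(url, data=body, headers={...}).
```

The same payload works from any OpenAI SDK by setting its base URL to `.../v1` on the deployed host.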
### Pointing at the right checkpoint

`model_path` must point at a directory containing an HF-format
checkpoint (`config.json` plus weight shards). slime bridge mode writes Megatron torch_dist checkpoints under `save` — pick out an HF-compatible subdirectory once training completes, or run a torch_dist → HF conversion if your config didn’t emit one. Inspect what training left behind with:

```shell
modal volume ls slime-checkpoints qwen3-4b-haiku/
```

and point `model_path` at whichever directory has a `config.json`.
The reference
qwen3-haiku convert_torch_dist_to_hf.py
is a worked example of the conversion flow if you need it.
```python
serve_app = build_vllm_serve_app(
    # Modal app name — determines the deployed URL shape:
    # https://<workspace>--serve-haiku-model-serve.modal.run
    app_name="serve-haiku-model",
    # Absolute path inside the container to an HF-format checkpoint
    # directory (must contain config.json). Update after inspecting
    # the slime-checkpoints volume — the exact subdirectory depends
    # on slime's bridge-mode output layout. (To smoke-test the
    # serving pipeline before training, swap this for the HF repo
    # id "Qwen/Qwen3-4B" and drop the `checkpoints_volume` arg
    # below; vLLM will download the base model itself.)
    model_path="/checkpoints/qwen3-4b-haiku",
    # The `model` field callers pass in chat-completions requests.
    served_model_name="qwen3-4b-haiku",
    # Mount the training checkpoints volume; without this,
    # /checkpoints wouldn't exist in the serving container.
    checkpoints_volume="slime-checkpoints",
    gpu="H100",
    n_gpu=1,
    # Qwen3's reasoning parser separates <think>...</think> traces
    # from the final response on the server side. --enforce-eager
    # skips CUDA-graph capture (faster cold starts, slightly slower
    # inference — ideal for scale-from-zero endpoints).
    extra_vllm_args=["--enforce-eager", "--reasoning-parser", "qwen3"],
)
```

## Related API Reference
Source: tutorials/rl/003_slime_with_llm_as_judge/003_slime_with_llm_as_judge.py