Skip to content
GitHub

RL basics: verifiable rewards, haiku edition

Qwen3-4B haiku evaluation with verifiable rewards — serve, evaluate, train, compare

This tutorial uses Qwen3-4B and haiku poems to introduce the verifiable reward pattern that underpins RL post-training:

  1. Serve the base model.
  2. Define a scoring function with a verifiable reward (syllable structure).
  3. Evaluate the base model against that scorer.
  4. GRPO-train the model with slime using the reward function.
  5. Serve the trained checkpoint.
  6. Evaluate it with the same scorer and compare.

Why haikus? A haiku has two attributes you can score automatically — whether it follows the 5-7-5 syllable format (deterministic, cheap) and whether the poem is actually good. That split between verifiable and subjective rewards is exactly the landscape RL post-training operates in. This tutorial covers the verifiable half. In a later tutorial, we will cover the subjective half.

import re
from modal_training_gym import (
DeploymentConfig,
EvalConfig,
EvalRowResult,
HuggingFaceDataset,
Qwen3_4B,
SlimeRecipe,
TrainConfig,
list_checkpoints,
)

So, how does Qwen3-4B currently fare at writing haikus? We can serve the base model and find out.

The training gym has several config classes so you can define deploymnet, training, and evaluation configurations, and reuse them across different runs for parameter sweeps.

Let’s start by initializng a DeploymentConfig.

Calling DeploymentConfig.serve() builds and deploys a vLLM app, then returns a ModelDeployment that contains a the concrete endpoint URL.

base_model = Qwen3_4B()
base_model_deployment = DeploymentConfig(
model=base_model,
).serve()
print(f"Base model deployed to {base_model_deployment.url}")

Let’s now cover the evaluation part of the tutorial.

A good eval takes a particular outcome and assigns a score to it. It can be binary (pass/fail) or continuous (0-100), deterministic or subjective, and cheap or expensive to compute.

In our case, we want our model to be good at writing haiku poems, so how do we evaluate if an llm response was a good haiku or not?

Well, a haiku must follow the 5-7-5 syllable format, so we can count syllables using NLTK’s CMU Pronouncing Dictionary (with a regex fallback for words not in the dictionary) and score how close each line is to its target syllable count.

We can give it score 0 if it doesn’t follow the 5-7-5 syllable format, and 1 if it does. But that’s not very informative. Instead, we can score it based on how close it is to the target syllable count for each line.

_cmudict_cache = {}
def _get_cmudict() -> dict:
if not _cmudict_cache:
import nltk
from nltk.corpus import cmudict
nltk.download("cmudict", quiet=True)
_cmudict_cache.update(cmudict.dict())
return _cmudict_cache
def _count_syllables(text: str) -> int:
cmu = _get_cmudict()
total = 0
for word in re.findall(r"[a-zA-Z]+", text):
phones = cmu.get(word.lower())
if phones:
total += sum(p[-1].isdigit() for p in phones[0])
else:
count = len(re.findall(r"[aeiouy]+", word.lower()))
if word.lower().endswith("e") and count > 1:
count -= 1
total += max(count, 1)
return total
def score_haiku(response: str) -> float:
lines = [line.strip() for line in response.strip().split("\n") if line.strip()]
if len(lines) != 3:
return -10
total_diff = sum(
abs(_count_syllables(line) - target)
for line, target in zip(lines, [5, 7, 5])
)
return -float(total_diff)

Let’s also define a Haiku dataset. Here, we use the statworx/haiku dataset from HuggingFace. Each row has a keywords topic and a reference text haiku. We can use this dataset to train our model.

Datasets for training models can take many form factors, and huggingface dataset is just one of them. If you’re curious about other options, check out the DatasetConfig documentation.

class HaikuDataset(HuggingFaceDataset):
hf_repo = "statworx/haiku"
input_column = "keywords"
output_column = "text"
output_format = "jsonl"
apply_chat_template = True
system_prompt = (
"You are a haiku poet. Write a haiku about the given topic. "
"Use the 5-7-5 syllable format across three lines."
)
prompt_template = "Write a haiku about {input}."
always_prepare = True # For the purpose of this tutorial, we want to prepare the dataset every time we run it, in case there is stale data from a previous run.
train_dataset = HaikuDataset(n_rows=10)
eval_dataset = HaikuDataset(n_rows=5)

Seems straightforward enough, right? How do we run an eval on our base model with this dataset? We can transform our scoring function above into an Eval Configuration.

First, to explain, an Eval Configuration is a class that owns the model-calling loop. The task-specific part is a scoring function passed to .evaluate(...), which must return EvalRowResult.

The very simple form of an eval is given a dataset, and the corresponding model response, return its score. That can be configured using EvalConfig.eval_response_fn.

For more complex evals (e.g. multi-turn), you can also define a custom EvalConfig.eval_fn that takes a ModelDeployment and a dataset row and returns a score.

def eval_response_fn(_example: dict, response: str) -> EvalRowResult:
return EvalRowResult(score=score_haiku(response), response=response)
eval_config = EvalConfig(
dataset=eval_dataset,
eval_response_fn=eval_response_fn,
generate_kwargs={"chat_template_kwargs": {"enable_thinking": False}},
)
print("——— Running base model evaluation... ———")
base_eval = eval_config.evaluate(base_model_deployment, debug=True)
print(f"Average haiku score: {base_eval.mean:.1f}")
print("——— Base model evaluation complete ———")

Now, let’s actually train the model to write good haikus. Here, we use the slime framework (https://github.com/THUDM/slime) on Modal.

All flags that are native to slime can be passed to the TrainConfig object. You can also add patches to slime using the image_overlay argument.

async def haiku_rm(args, sample, **kwargs) -> float:
return score_haiku(sample.response)
training_run = TrainConfig(
model=base_model,
dataset=train_dataset,
recipe=SlimeRecipe(
custom_rm_function=haiku_rm,
gpu_type="H100",
colocate=True,
tensor_model_parallel_size=1,
sequence_parallel=False,
rollout_num_gpus_per_engine=1,
num_rollout=10,
rollout_batch_size=16,
rollout_max_response_len=4096,
rollout_temperature=1.0,
save_interval=5,
apply_chat_template_kwargs='{"enable_thinking": false}',
image_overlay=lambda image: image.run_commands(
"uv pip install --system aiohttp nltk>=3.8.0",
"python -c \"import nltk; nltk.download('cmudict', quiet=True)\"",
),
),
)

TrainConfig.train() builds the Modal app, runs training, and returns a TrainResult with the run ID and checkpoint path.

print("——— Running training... ———")
train_result = training_run.train()
print("——— Training complete ———")

The returned TrainResult has the checkpoint path and volume metadata attached. You can pass an explicit checkpoint= to DeploymentConfig to pin a specific checkpoint, or omit it to use the model’s default path.

checkpoint = list_checkpoints(train_result.training_run_id)[-1]
print(checkpoint.path)
trained_model_deployment = DeploymentConfig(
model=Qwen3_4B(),
checkpoint=checkpoint,
app_name="qwen3-4b-haiku-serve",
served_model_name="qwen3-4b-haiku",
).serve()
print(f"Trained model deployed to {trained_model_deployment.url}")

Now let’s run the same eval on the trained model and compare.

print("——— Running trained model evaluation... ———")
trained_eval = eval_config.evaluate(trained_model_deployment, debug=True)
print(f"Trained haiku score: {trained_eval.mean:.1f}")
print("——— Trained model evaluation complete ———")

Hmm, looks like the trained model is not doing very well. Maybe it’s because it only trained for 10 iterations.

What happens if we train it for more? We want to train it off of the latest checkpoint, not from scratch.

new_training_run = TrainConfig(
model=Qwen3_4B(),
dataset=train_dataset,
checkpoint=checkpoint,
recipe=SlimeRecipe(
custom_rm_function=haiku_rm,
gpu_type="H100",
colocate=True,
tensor_model_parallel_size=1,
sequence_parallel=False,
rollout_num_gpus_per_engine=1,
num_rollout=20,
rollout_batch_size=16,
rollout_max_response_len=4096,
rollout_temperature=1.0,
save_interval=10,
apply_chat_template_kwargs='{"enable_thinking": false}',
image_overlay=lambda image: image.run_commands(
"uv pip install --system aiohttp nltk>=3.8.0",
"python -c \"import nltk; nltk.download('cmudict', quiet=True)\"",
),
),
)
print("——— Running new training... ———")
new_train_result = new_training_run.train()
print("——— New training complete ———")

Now let’s run the same eval on the newly trained model and compare.

new_checkpoint = list_checkpoints(new_train_result.training_run_id)[-1]
print(new_checkpoint.path)
new_model_deployment = DeploymentConfig(
model=Qwen3_4B(),
checkpoint=new_checkpoint,
app_name="qwen3-4b-haiku-serve-new",
served_model_name="qwen3-4b-haiku",
).serve()
print(f"Newly trained model deployed to {new_model_deployment.url}")

Now let’s compare the results of the newly trained model and the base model.

print("——— Running trained model evaluation... ———")
new_eval = eval_config.evaluate(new_model_deployment, debug=True)
print(f"Trained model (new) haiku score: {new_eval.mean:.1f}")
print("——— Trained model (new) evaluation complete ———")

Now let’s compare the results across all three checkpoints.

print(f"Base model haiku score: {base_eval.mean:.1f}")
print(f"Trained model haiku score: {trained_eval.mean:.1f}")
print(f"Trained model (new) haiku score: {new_eval.mean:.1f}")

Source: tutorials/rl/000_rl_basics/000_rl_basics.py | Open in Modal Notebook