Skip to content
GitHub

Code RL with Harbor hello-world + Modal sandboxes

Code RL with Harbor hello-world and sandboxed verification

What if you have a task where you want to score model outputs by running them in an environment?

This tutorial trains a model on the hello-world task from Harbor Hub, scoring solutions by spawning and executing them in Modal sandboxes.

Workflow:

  1. Pull the hello-world task from Harbor Hub via HarborDataset.
  2. Score model outputs with HarborEval — it extracts code, runs it in a Modal sandbox, and compares stdout automatically.
  3. Reuse the same score_in_sandbox helper as a SLIME custom_rm_function.
  4. Train and compare base vs. trained behavior.
from modal_training_gym import (
DeploymentConfig,
HarborDataset,
HarborEval,
Qwen3_4B,
SlimeRecipe,
TrainConfig,
extract_code,
list_checkpoints,
score_in_sandbox,
)

HarborDataset accepts a dataset_name to pull tasks from Harbor Hub. Each task has:

  • instruction.md — the problem statement (prompt)
  • task.toml — metadata (difficulty, category)
  • tests/ — verification tests (format varies by task)

The hello-world task uses pytest-based verification rather than *.in/*.out file pairs, so we define stdin/stdout test cases inline and pass them to HarborEval via the test_cases field.

A single dataset instance handles both training and eval — prepare() writes train and eval splits to the volume, while load() returns all tasks for offline evaluation.

HELLO_WORLD_TESTS = [{"input": "", "expected_output": "Hello, world!\n"}]
dataset = HarborDataset(
dataset_name="harbor/hello-world",
label_metadata_path="task.toml",
train_repeats=20,
always_prepare=True, # For the purpose of this tutorial, we want to prepare the dataset every time we run it, in case there is stale data from a previous run.
system_prompt=(
"You are an expert Python programmer. "
"Solve the given problem by writing a complete Python program. "
"Your program must print the answer to stdout using print(). "
"Do not create or write any files. "
"Put your solution in a ```python code fence."
),
)

HarborEval automates the sandbox scoring loop. It:

  1. Sends each task’s prompt to the deployed model.
  2. Extracts Python code from the response (stripping thinking tags, chat-template artifacts, and code fences via extract_code).
  3. Runs the extracted code in a Modal sandbox against the test cases.
  4. Returns a score = fraction of test cases passed.

Since hello-world doesn’t ship *.in/*.out file pairs, we pass test_cases directly — HarborEval uses them as a fallback when the dataset label doesn’t contain test cases.

Passing model=Qwen3_4B() enables model-aware response parsing, which populates parsed_response on each result row for richer dashboard display.

base_model = Qwen3_4B()
base_deployment = DeploymentConfig(model=base_model).serve()
print(f"Base model URL: {base_deployment.url}")
eval_config = HarborEval(
dataset=dataset,
model=base_model,
test_cases=HELLO_WORLD_TESTS,
)
print("Running base eval...")
base_eval = eval_config.evaluate(base_deployment, debug=True)
print(f"Base mean reward: {base_eval.mean:.4f}")

For training, we reuse the same score_in_sandbox and extract_code helpers that HarborEval uses internally — wrapped in an async reward function for SLIME’s custom_rm_function.

score_in_sandbox enforces sandbox_cpu/sandbox_memory with a "limit" policy by default: rather than reserving that capacity up front, the values become burst ceilings, so Modal bills each sandbox by actual CPU-/RAM-second usage instead of the (usually idle) reservation. Pass cpu_policy="ignore" to let rollouts burst above the configured values, or "reserve" for the legacy fixed-reservation behavior.

async def sandbox_rm(args, sample, **kwargs) -> float:
import asyncio
code = extract_code(sample.response, model=base_model)
reward, meta = await asyncio.to_thread(
score_in_sandbox, code, test_cases=HELLO_WORLD_TESTS,
)
sample.metadata = {**(getattr(sample, "metadata", None) or {}), "sandbox": meta}
return float(reward)
training_run = TrainConfig(
model=Qwen3_4B(),
dataset=dataset,
recipe=SlimeRecipe(
custom_rm_function=sandbox_rm,
gpu_type="H100",
colocate=True,
tensor_model_parallel_size=1,
sequence_parallel=False,
rollout_num_gpus_per_engine=1,
num_rollout=10,
rollout_batch_size=8,
n_samples_per_prompt=8,
rollout_max_response_len=2048,
rollout_temperature=0.9,
global_batch_size=8,
eval_max_response_len=2048,
n_samples_per_eval_prompt=8,
max_tokens_per_gpu=4096,
save_interval=10,
image_overlay=lambda image: image.run_commands(
"uv pip install --system modal>=1.2.0",
),
),
)
print("Starting training...")
train_result = training_run.train()
print(f"Training run id: {train_result.training_run_id}")
checkpoint = list_checkpoints(train_result.training_run_id)[-1]
trained_deployment = DeploymentConfig(
model=Qwen3_4B(),
checkpoint=checkpoint,
app_name="qwen3-4b-hello-world-serve",
served_model_name="qwen3-4b-hello-world",
).serve()
print(f"Trained model URL: {trained_deployment.url}")
trained_eval = eval_config.evaluate(trained_deployment, debug=True)
print(f"Trained mean reward: {trained_eval.mean:.4f}")
print(f"Base mean reward: {base_eval.mean:.4f}")

Source: tutorials/rl/001_sandboxes/001_sandboxes.py | Open in Modal Notebook