Code RL with Harbor hello-world + Modal sandboxes
Code RL with Harbor hello-world and sandboxed verification
What if you have a task where you want to score model outputs by running them in an environment?
This tutorial trains a model on the hello-world task from Harbor Hub, scoring solutions by spawning and executing them in Modal sandboxes.
Workflow:
- Pull the hello-world task from Harbor Hub via HarborDataset.
- Score model outputs with HarborEval: it extracts code, runs it in a Modal sandbox, and compares stdout automatically.
- Reuse the same score_in_sandbox helper as a SLIME custom_rm_function.
- Train and compare base vs. trained behavior.
    from modal_training_gym import (
        DeploymentConfig,
        HarborDataset,
        HarborEval,
        Qwen3_4B,
        SlimeRecipe,
        TrainConfig,
        extract_code,
        list_checkpoints,
        score_in_sandbox,
    )

Load hello-world from Harbor Hub

HarborDataset accepts a dataset_name to pull tasks from Harbor Hub. Each task has:
- instruction.md: the problem statement (prompt)
- task.toml: metadata (difficulty, category)
- tests/: verification tests (format varies by task)
The hello-world task uses pytest-based verification rather than
*.in/*.out file pairs, so we define stdin/stdout test cases
inline and pass them to HarborEval via the test_cases field.
A single dataset instance handles both training and eval —
prepare() writes train and eval splits to the volume,
while load() returns all tasks for offline evaluation.
    HELLO_WORLD_TESTS = [{"input": "", "expected_output": "Hello, world!\n"}]

    dataset = HarborDataset(
        dataset_name="harbor/hello-world",
        label_metadata_path="task.toml",
        train_repeats=20,
        # For this tutorial, prepare the dataset on every run in case there is
        # stale data from a previous run.
        always_prepare=True,
        system_prompt=(
            "You are an expert Python programmer. "
            "Solve the given problem by writing a complete Python program. "
            "Your program must print the answer to stdout using print(). "
            "Do not create or write any files. "
            "Put your solution in a ```python code fence."
        ),
    )

Evaluate with HarborEval

HarborEval automates the sandbox scoring loop. It:
- Sends each task's prompt to the deployed model.
- Extracts Python code from the response (stripping thinking tags, chat-template artifacts, and code fences via extract_code).
- Runs the extracted code in a Modal sandbox against the test cases.
- Returns a score equal to the fraction of test cases passed.
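The extract, run, and score steps above can be sketched locally. This is an illustration under stated assumptions, not the library's implementation: `extract_python` and `score_locally` are hypothetical names, the `<think>` tag format and exact stdout comparison are assumptions, and a plain subprocess stands in for the Modal sandbox. A subprocess provides no isolation, so this is only suitable for code you trust.

```python
import re
import subprocess
import sys

# Assumed formats: <think>...</think> reasoning blocks and Markdown code fences.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
FENCE_RE = re.compile(r"```(?:python|py)?[ \t]*\n(.*?)```", re.DOTALL)


def extract_python(response: str) -> str:
    """Drop thinking blocks, then return the last fenced block (or the raw text)."""
    visible = THINK_RE.sub("", response)
    blocks = FENCE_RE.findall(visible)
    return (blocks[-1] if blocks else visible).strip()


def score_locally(code: str, test_cases: list[dict], timeout_s: float = 10.0) -> float:
    """Run `code` once per test case; return the fraction whose stdout matches."""
    if not test_cases:
        return 0.0
    passed = 0
    for case in test_cases:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=case["input"],  # fed to the program's stdin
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            continue  # a timeout counts as a failed test case
        if proc.returncode == 0 and proc.stdout == case["expected_output"]:
            passed += 1
    return passed / len(test_cases)
```

With a single test case, as in this tutorial, the score can only be 0.0 or 1.0; tasks with more test cases yield intermediate values.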
Since hello-world doesn’t ship *.in/*.out file pairs, we pass
test_cases directly — HarborEval uses them as a fallback when
the dataset label doesn’t contain test cases.
Passing model=Qwen3_4B() enables model-aware response parsing,
which populates parsed_response on each result row for richer
dashboard display.
    base_model = Qwen3_4B()
    base_deployment = DeploymentConfig(model=base_model).serve()
    print(f"Base model URL: {base_deployment.url}")

    eval_config = HarborEval(
        dataset=dataset,
        model=base_model,
        test_cases=HELLO_WORLD_TESTS,
    )
    print("Running base eval...")
    base_eval = eval_config.evaluate(base_deployment, debug=True)
    print(f"Base mean reward: {base_eval.mean:.4f}")

Train with SLIME and sandbox reward

For training, we reuse the same score_in_sandbox and extract_code helpers that HarborEval uses internally, wrapped in an async reward function for SLIME's custom_rm_function.
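The wrapping pattern is worth seeing in isolation. The sketch below assumes the scorer is a blocking call, which is why it is handed to a worker thread; `blocking_score`, `reward`, and `score_batch` are hypothetical stand-ins, and a short sleep replaces the sandbox round trip.

```python
import asyncio
import time


def blocking_score(code: str) -> float:
    """Stand-in for a blocking scorer; sleeps instead of launching a sandbox."""
    time.sleep(0.2)
    return 1.0 if "Hello, world!" in code else 0.0


async def reward(code: str) -> float:
    # to_thread runs the blocking call in a worker thread, so the event loop
    # stays free to start scoring other samples in the meantime.
    return await asyncio.to_thread(blocking_score, code)


async def score_batch(codes: list[str]) -> list[float]:
    # gather runs the reward coroutines concurrently and preserves input order.
    return await asyncio.gather(*(reward(c) for c in codes))


if __name__ == "__main__":
    print(asyncio.run(score_batch(["print('Hello, world!')", "print('hi')"])))
```

Calling the blocking function directly inside an async function would stall every other coroutine on the loop for the duration of each call, so concurrent rollouts would be scored one at a time.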
score_in_sandbox enforces sandbox_cpu/sandbox_memory with a
"limit" policy by default: rather than reserving that capacity up
front, the values become burst ceilings, so Modal bills each sandbox
by actual CPU-/RAM-second usage instead of the (usually idle)
reservation. Pass cpu_policy="ignore" to let rollouts burst above
the configured values, or "reserve" for the legacy fixed-reservation
behavior.
    async def sandbox_rm(args, sample, **kwargs) -> float:
        import asyncio

        code = extract_code(sample.response, model=base_model)
        # score_in_sandbox is called in a worker thread so the event loop is not blocked.
        reward, meta = await asyncio.to_thread(
            score_in_sandbox,
            code,
            test_cases=HELLO_WORLD_TESTS,
        )
        # Keep any existing metadata and attach the sandbox details alongside it.
        sample.metadata = {**(getattr(sample, "metadata", None) or {}), "sandbox": meta}
        return float(reward)

    training_run = TrainConfig(
        model=Qwen3_4B(),
        dataset=dataset,
        recipe=SlimeRecipe(
            # Reward
            custom_rm_function=sandbox_rm,

            # Hardware and parallelism
            gpu_type="H100",
            colocate=True,
            tensor_model_parallel_size=1,
            sequence_parallel=False,
            rollout_num_gpus_per_engine=1,

            # Rollout
            num_rollout=10,
            rollout_batch_size=8,
            n_samples_per_prompt=8,
            rollout_max_response_len=2048,
            rollout_temperature=0.9,

            # Optimization, eval, and checkpointing
            global_batch_size=8,
            eval_max_response_len=2048,
            n_samples_per_eval_prompt=8,
            max_tokens_per_gpu=4096,
            save_interval=10,
            # Quote the requirement so the shell does not treat ">" as a redirect.
            image_overlay=lambda image: image.run_commands(
                "uv pip install --system 'modal>=1.2.0'",
            ),
        ),
    )
    print("Starting training...")
    train_result = training_run.train()
    print(f"Training run id: {train_result.training_run_id}")

Evaluate the trained checkpoint
    checkpoint = list_checkpoints(train_result.training_run_id)[-1]
    trained_deployment = DeploymentConfig(
        model=Qwen3_4B(),
        checkpoint=checkpoint,
        app_name="qwen3-4b-hello-world-serve",
        served_model_name="qwen3-4b-hello-world",
    ).serve()
    print(f"Trained model URL: {trained_deployment.url}")

    trained_eval = eval_config.evaluate(trained_deployment, debug=True)
    print(f"Trained mean reward: {trained_eval.mean:.4f}")
    print(f"Base mean reward: {base_eval.mean:.4f}")

Related API Reference

- HarborDataset
- DeploymentConfig
- HarborEval
- Qwen3_4B
- SlimeRecipe
- TrainConfig
- score_in_sandbox
- extract_code
Source: tutorials/rl/001_sandboxes/001_sandboxes.py