Code RL with Harbor hello-world + Modal sandboxes

What if the only reliable way to score a model's output is to run it in an execution environment?

This tutorial trains a model on the hello-world task from Harbor Hub and scores each solution by executing it in a Modal sandbox.

Workflow:

Pull the hello-world task from Harbor Hub via HarborDataset.
Score model outputs with HarborEval — it extracts code, runs it in a Modal sandbox, and compares stdout automatically.
Reuse the same score_in_sandbox helper as a SLIME custom_rm_function.
Train and compare base vs. trained behavior.

from modal_training_gym import (
    DeploymentConfig,
    HarborDataset,
    HarborEval,
    Qwen3_4B,
    SlimeRecipe,
    TrainConfig,
    extract_code,
    list_checkpoints,
    score_in_sandbox,
)

Load hello-world from Harbor Hub

HarborDataset accepts a dataset_name to pull tasks from Harbor Hub. Each task has:

instruction.md — the problem statement (prompt)
task.toml — metadata (difficulty, category)
tests/ — verification tests (format varies by task)

The hello-world task uses pytest-based verification rather than *.in/*.out file pairs, so we define stdin/stdout test cases inline and pass them to HarborEval via the test_cases field.
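The expected output in the inline test case ends with a newline because Python's print() appends one. The sketch below uses only the standard library to capture what a one-line solution writes to stdout. It does not show how HarborEval compares output (exact match versus whitespace-normalized), which this tutorial does not specify.

```python
import contextlib
import io

# Capture everything a candidate solution writes to stdout.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("Hello, world!")

captured = buffer.getvalue()

# print() appends "\n", so the captured text is 14 characters, not 13.
matches_exactly = captured == "Hello, world!\n"
```

If the comparison is exact, a solution that writes the text without the trailing newline (for example via sys.stdout.write("Hello, world!")) would not match this test case.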

A single dataset instance handles both training and eval — prepare() writes train and eval splits to the volume, while load() returns all tasks for offline evaluation.

HELLO_WORLD_TESTS = [{"input": "", "expected_output": "Hello, world!\n"}]

dataset = HarborDataset(
    dataset_name="harbor/hello-world",
    label_metadata_path="task.toml",
    train_repeats=20,
    # Re-prepare the dataset on every run so that stale data from a
    # previous run is never reused (a convenience for this tutorial).
    always_prepare=True,
    system_prompt=(
        "You are an expert Python programmer. "
        "Solve the given problem by writing a complete Python program. "
        "Your program must print the answer to stdout using print(). "
        "Do not create or write any files. "
        "Put your solution in a ```python code fence."
    ),
)

Evaluate with HarborEval

HarborEval automates the sandbox scoring loop. It:

Sends each task’s prompt to the deployed model.
Extracts Python code from the response (stripping thinking tags, chat-template artifacts, and code fences via extract_code).
Runs the extracted code in a Modal sandbox against the test cases.
Returns a score equal to the fraction of test cases passed.

Since hello-world doesn’t ship *.in/*.out file pairs, we pass test_cases directly — HarborEval uses them as a fallback when the dataset label doesn’t contain test cases.

Passing model=Qwen3_4B() enables model-aware response parsing, which populates parsed_response on each result row for richer dashboard display.
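The extract, run, and compare steps listed above can be illustrated with a local stand-in. This is a rough sketch, not the library's implementation: extract_python_block and score_locally are hypothetical names, the regexes are assumptions about typical model output, and the code runs in a plain subprocess with no isolation, which is exactly what the Modal sandbox is for.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built programmatically so this listing contains no literal fence


def extract_python_block(response: str) -> str:
    """Drop <think>...</think> spans, then return the last fenced Python block."""
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    pattern = re.escape(FENCE) + r"(?:python|py)?[ \t]*\n(.*?)" + re.escape(FENCE)
    blocks = re.findall(pattern, visible, flags=re.DOTALL)
    # Fall back to the remaining text when the model did not use a fence.
    return blocks[-1] if blocks else visible.strip()


def score_locally(code: str, test_cases: list[dict], timeout: float = 10.0) -> float:
    """Return the fraction of test cases whose stdout matches exactly."""
    if not test_cases:
        return 0.0
    passed = 0
    for case in test_cases:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            continue  # a hung program fails this test case
        if proc.returncode == 0 and proc.stdout == case["expected_output"]:
            passed += 1
    return passed / len(test_cases)


# Usage: a fake model response with a thinking span and a fenced solution.
tests = [{"input": "", "expected_output": "Hello, world!\n"}]
response = "<think>plan</think>\n" + FENCE + "python\nprint('Hello, world!')\n" + FENCE
code = extract_python_block(response)
score = score_locally(code, tests)
```

Running untrusted model output this way is only reasonable for a toy check on your own machine. The sandboxed path in the tutorial exists so that generated code cannot touch the host.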

base_model = Qwen3_4B()
base_deployment = DeploymentConfig(model=base_model).serve()
print(f"Base model URL: {base_deployment.url}")

eval_config = HarborEval(
    dataset=dataset,
    model=base_model,
    test_cases=HELLO_WORLD_TESTS,
)
print("Running base eval...")
base_eval = eval_config.evaluate(base_deployment, debug=True)
print(f"Base mean reward: {base_eval.mean:.4f}")

Train with SLIME and sandbox reward

For training, we reuse the same score_in_sandbox and extract_code helpers that HarborEval uses internally — wrapped in an async reward function for SLIME’s custom_rm_function.
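Wrapping a blocking scorer in asyncio.to_thread matters because an async reward function that blocks would stall every other sample waiting on the same event loop. The sketch below shows the effect with a hypothetical blocking_score stand-in that only sleeps; it does not call Modal or SLIME, and the 0.2 second latency is an arbitrary illustrative value.

```python
import asyncio
import time


def blocking_score(code: str) -> float:
    """Hypothetical stand-in for a blocking sandbox call."""
    time.sleep(0.2)  # simulate sandbox start-up plus execution latency
    return 1.0 if "print" in code else 0.0


async def reward(code: str) -> float:
    # to_thread runs the blocking call in a worker thread, so the event
    # loop stays free to start scoring other samples in the meantime.
    return await asyncio.to_thread(blocking_score, code)


async def score_batch(codes: list[str]) -> tuple[list[float], float]:
    start = time.perf_counter()
    rewards = await asyncio.gather(*(reward(c) for c in codes))
    return list(rewards), time.perf_counter() - start


rewards, elapsed = asyncio.run(
    score_batch(["print('hi')", "pass", "print(1)", "x = 1"])
)
```

Four calls that each block for 0.2 seconds finish in roughly the time of one call rather than four, since they overlap in the default thread pool.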

By default, score_in_sandbox applies sandbox_cpu and sandbox_memory with a "limit" policy. Instead of reserving that capacity up front, the values act as burst ceilings, so Modal bills each sandbox for the CPU-seconds and RAM-seconds it actually uses rather than for a (usually idle) reservation. Pass cpu_policy="ignore" to let rollouts burst above the configured values, or cpu_policy="reserve" for the legacy fixed-reservation behavior.

async def sandbox_rm(args, sample, **kwargs) -> float:
    import asyncio

    code = extract_code(sample.response, model=base_model)
    reward, meta = await asyncio.to_thread(
        score_in_sandbox, code, test_cases=HELLO_WORLD_TESTS,
    )
    sample.metadata = {**(getattr(sample, "metadata", None) or {}), "sandbox": meta}
    return float(reward)

training_run = TrainConfig(
    model=Qwen3_4B(),
    dataset=dataset,
    recipe=SlimeRecipe(
        custom_rm_function=sandbox_rm,

        gpu_type="H100",
        colocate=True,
        tensor_model_parallel_size=1,
        sequence_parallel=False,
        rollout_num_gpus_per_engine=1,

        num_rollout=10,
        rollout_batch_size=8,
        n_samples_per_prompt=8,
        rollout_max_response_len=2048,
        rollout_temperature=0.9,

        global_batch_size=8,
        eval_max_response_len=2048,
        n_samples_per_eval_prompt=8,
        max_tokens_per_gpu=4096,
        save_interval=10,
        # Quote the requirement so the shell does not treat ">" as a redirect.
        image_overlay=lambda image: image.run_commands(
            "uv pip install --system 'modal>=1.2.0'",
        ),
    ),
)
print("Starting training...")
train_result = training_run.train()
print(f"Training run id: {train_result.training_run_id}")
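To get a feel for how much sandbox work this recipe triggers, the arithmetic below multiplies out the rollout settings. It assumes SlimeRecipe passes these values through to the SLIME arguments of the same name with their usual meaning (prompts per rollout, responses per prompt, samples per optimizer step), and that every sampled response is executed once per test case. Neither assumption is confirmed by this tutorial.

```python
# Values copied from the SlimeRecipe in this tutorial.
num_rollout = 10
rollout_batch_size = 8     # assumed: prompts sampled per rollout step
n_samples_per_prompt = 8   # assumed: responses generated per prompt
global_batch_size = 8      # assumed: samples per optimizer step
num_test_cases = 1         # len(HELLO_WORLD_TESTS)

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
optimizer_steps_per_rollout = samples_per_rollout // global_batch_size
total_samples = num_rollout * samples_per_rollout
total_program_runs = total_samples * num_test_cases

print(samples_per_rollout)          # 64
print(optimizer_steps_per_rollout)  # 8
print(total_program_runs)           # 640
```

Under these assumptions, even this small run executes several hundred generated programs, which is why the per-sandbox resource policy described earlier affects cost.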

Evaluate the trained checkpoint

# Take the last entry, assuming list_checkpoints returns checkpoints oldest-first.
checkpoint = list_checkpoints(train_result.training_run_id)[-1]
trained_deployment = DeploymentConfig(
    model=Qwen3_4B(),
    checkpoint=checkpoint,
    app_name="qwen3-4b-hello-world-serve",
    served_model_name="qwen3-4b-hello-world",
).serve()
print(f"Trained model URL: {trained_deployment.url}")

trained_eval = eval_config.evaluate(trained_deployment, debug=True)
print(f"Trained mean reward: {trained_eval.mean:.4f}")
print(f"Base mean reward:    {base_eval.mean:.4f}")

Source: tutorials/rl/001_sandboxes/001_sandboxes.py