Customizing your slime run — scaling nodes, parallelism, and throughput
The intro tutorial runs
Qwen3-4B GRPO on GSM8K with all defaults — one node, colocated,
the model’s built-in SlimePreset. This tutorial shows how to
override those defaults when you need more from your training run.
Why would you want this?
The default SlimePreset for Qwen3-4B gives you a working run on
a single 8×H100 node. That’s great for prototyping — you can
validate your reward function, check that loss moves, and iterate
on your dataset in minutes.
But a single-node run has limits:
- Speed. GRPO generates rollouts for every prompt, scores them, then trains. On one node, rollout and training take turns on the same GPUs. A 7k-prompt dataset with 8 samples per prompt means 56k generations per training step. On 8 GPUs that’s slow.
- Batch diversity. Larger global batches give more diverse advantage estimates per step, which stabilizes GRPO training. But you can only fit so much in 8 GPUs’ worth of memory.
- Rollout throughput. More rollout GPUs = more generations per second. Separating rollout onto dedicated GPUs (non-colocated) lets training and inference overlap.
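To make the rollout-volume point concrete, here is the arithmetic from the Speed bullet as a quick back-of-the-envelope check (the dataset size and sample count come from the text above; the per-GPU split assumes generations spread evenly across rollout GPUs):

```python
# Rollout volume per training pass, using the numbers from the text.
prompts = 7_000            # prompts in the dataset
samples_per_prompt = 8     # n_samples_per_prompt
generations = prompts * samples_per_prompt  # 56k generations

# Rough per-GPU load, assuming generations spread evenly across rollout GPUs.
per_gpu_8 = generations // 8    # one node
per_gpu_32 = generations // 32  # four nodes: 4x less work per GPU
```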
Scaling to 4 nodes (32 GPUs) addresses all three: 4× the data
parallelism, 4× the rollout capacity, and room for much larger
batches. The only thing that changes in your code is a few fields
on SlimeConfig.
```python
from modal_training_gym.common.dataset import HuggingFaceDataset
from modal_training_gym.common.models import Qwen3_4B
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.frameworks.slime import SlimeConfig
from modal_training_gym.frameworks.slime.config import DATA_PATH
```

What the model provides
Qwen3_4B ships a SlimePreset with these values:
| Field | Value | Why |
|---|---|---|
| gpu_type | "H100" | Qwen3-4B fits comfortably on H100 |
| actor_num_nodes | 1 | 4B params → no need for multi-node |
| actor_num_gpus_per_node | 8 | Standard Modal GPU node |
| colocate | True | Actor + rollout share GPUs |
| tensor_model_parallel_size | 1 | Model fits on one GPU |
| sequence_parallel | False | Not needed at TP=1 |
These are the minimum viable settings. For a quick experiment they’re fine. For a production run you’ll want to scale up.
Scaling up with more nodes
The simplest way to speed up training: add more nodes. This gives you more GPUs for data parallelism — each node processes a different batch slice in parallel, then gradients are all-reduced.
```python
# 1 node (8 GPUs) — default from Qwen3_4B preset
run = SlimeConfig(model=Qwen3_4B(), ...)

# 4 nodes (32 GPUs) — 4x data parallelism
run = SlimeConfig(model=Qwen3_4B(), actor_num_nodes=4, ...)
```

With more GPUs you’ll typically also increase global_batch_size
and num_rollout to keep each GPU busy:
| Nodes | GPUs | Suggested global_batch_size | Suggested num_rollout |
|---|---|---|---|
| 1 | 8 | 16 | 50 |
| 2 | 16 | 64 | 200 |
| 4 | 32 | 128 | 3000 |
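One way to sanity-check the suggested batch sizes is to compute the per-GPU share of the global batch. This sketch assumes pure data parallelism at TP=1 (every GPU is a data-parallel rank), which matches the preset above:

```python
# Per-GPU slice of the global batch under pure data parallelism (TP=1),
# using the suggested values from the table above.
suggested = {1: 16, 2: 64, 4: 128}  # nodes -> global_batch_size
gpus_per_node = 8

per_gpu = {}
for nodes, gbs in suggested.items():
    dp = nodes * gpus_per_node          # data-parallel world size at TP=1
    assert gbs % dp == 0, "global batch must divide evenly across GPUs"
    per_gpu[nodes] = gbs // dp          # samples each GPU processes per step
```

Note the per-GPU share grows as you scale (2 → 4 samples per GPU): more nodes buy both more parallelism and larger, more diverse global batches.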
Colocated vs non-colocated rollout
By default, actor (training) and rollout (SGLang inference) share
the same GPUs (colocate=True). This is memory-efficient but
means training and inference can’t overlap.
For larger runs, separating them lets training and rollout happen in parallel on different GPU pools:
```python
# Non-colocated: 3 nodes train, 1 node rolls out
run = SlimeConfig(
    model=Qwen3_4B(),
    actor_num_nodes=4,
    colocate=False,
    # rollout_num_gpus auto-derived from spare nodes
)
```

Throughput knobs
- use_dynamic_batch_size=True + max_tokens_per_gpu=9216 — pack variable-length prompts into a per-GPU token budget instead of a fixed batch count. Better GPU utilization for datasets with mixed prompt lengths.
- recompute_granularity="full" — recompute activations during the backward pass instead of storing them. Trades compute for memory, letting you fit larger batches.
- n_samples_per_prompt=8 — generate more rollout samples per prompt for better advantage estimation (GRPO benefits from larger groups).
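To see why a token budget beats a fixed batch count for mixed-length prompts, here is a minimal greedy-packing sketch. This is illustrative only — slime's actual dynamic-batching logic may differ; the function name and example lengths are made up:

```python
def pack_by_token_budget(lengths, max_tokens_per_gpu):
    """Greedily group sequences so each micro-batch stays under the token budget."""
    batches, current, used = [], [], 0
    for n in lengths:
        if current and used + n > max_tokens_per_gpu:
            batches.append(current)   # budget exceeded: start a new micro-batch
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches

# Mixed-length prompts: short ones pack densely, long ones get small batches.
lengths = [512, 256, 4096, 1024, 8192, 128]
batches = pack_by_token_budget(lengths, max_tokens_per_gpu=9216)
# -> [[512, 256, 4096, 1024], [8192, 128]]
```

With a fixed batch count, a batch of short prompts would leave most of the token budget idle while a batch of long ones could overflow memory; the token budget keeps every micro-batch close to capacity.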
Putting it together
Here’s a production-scale config that overrides the preset defaults for a 4-node run with larger batches and non-colocated rollout:
```python
class GSM8KDataset(HuggingFaceDataset):
    hf_repo = "zhuzilin/gsm8k"
    input_key = "messages"
    label_key = "label"

    def __init__(self, data_root):
        super().__init__(data_root)
        test_path = self.prompt_data.replace("train.parquet", "test.parquet")
        self.eval_prompt_data = ["gsm8k", test_path]

    def prepare(self):
        from datasets import load_dataset

        super().prepare()
        test_path = self.prompt_data.replace("train.parquet", "test.parquet")
        ds = load_dataset(self.hf_repo, split="test")
        ds.to_parquet(test_path)


base_model = Qwen3_4B()
my_training_run = SlimeConfig(
    model=base_model,
    dataset=GSM8KDataset(DATA_PATH),
    wandb=WandbConfig(project="slime-grpo", group="qwen3-4b-scaled"),
    ref_load=base_model.model_name,
    # Override preset defaults for a larger run:
    actor_num_nodes=4,
    global_batch_size=128,
    num_rollout=3000,
    n_samples_per_prompt=8,
    rollout_batch_size=64,
)
app = my_training_run.build_app()
```

Related API Reference
Source: tutorials/rl/002_customizing_your_slime_run/002_customizing_your_slime_run.py