Customizing your slime run — scaling nodes, parallelism, and throughput

The intro tutorial runs Qwen3-4B GRPO on GSM8K with all defaults — one node, colocated, the model’s built-in SlimePreset. This tutorial shows how to override those defaults when you need more from your training run.

The default SlimePreset for Qwen3-4B gives you a working run on a single 8×H100 node. That’s great for prototyping — you can validate your reward function, check that loss moves, and iterate on your dataset in minutes.

But a single-node run has limits:

  • Speed. GRPO generates rollouts for every prompt, scores them, then trains. On one node, rollout and training take turns on the same GPUs. A 7k-prompt dataset with 8 samples per prompt means 56k generations per training step. On 8 GPUs that’s slow.

  • Batch diversity. Larger global batches give more diverse advantage estimates per step, which stabilizes GRPO training. But you can only fit so much in 8 GPUs’ worth of memory.

  • Rollout throughput. More rollout GPUs = more generations per second. Separating rollout onto dedicated GPUs (non-colocated) lets training and inference overlap.

Scaling to 4 nodes (32 GPUs) addresses all three: 4× the data parallelism, 4× the rollout capacity, and room for much larger batches. The only thing that changes in your code is a few numbers on SlimeConfig.
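The batch-diversity point is easiest to see in GRPO's advantage estimate: each sample's reward is normalized against the other samples for the same prompt, so more (and more varied) groups per batch mean less noisy updates. Here is a minimal sketch of that group-relative normalization (illustrative only, not slime's actual implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: center each sample's reward on its
    prompt-group mean and scale by the group's std deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# One prompt, 4 sampled completions scored 1.0 (correct) or 0.0 (wrong):
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

With only a handful of groups per step, a few lucky or unlucky groups dominate the gradient; larger global batches average over many groups.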

Start with the imports used throughout this tutorial:

from modal_training_gym.common.dataset import HuggingFaceDataset
from modal_training_gym.common.models import Qwen3_4B
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.frameworks.slime import SlimeConfig
from modal_training_gym.frameworks.slime.config import DATA_PATH

Qwen3_4B ships a SlimePreset with these values:

| Field | Value | Why |
| --- | --- | --- |
| `gpu_type` | `"H100"` | Qwen3-4B fits comfortably on H100 |
| `actor_num_nodes` | `1` | 4B params → no need for multi-node |
| `actor_num_gpus_per_node` | `8` | Standard Modal GPU node |
| `colocate` | `True` | Actor + rollout share GPUs |
| `tensor_model_parallel_size` | `1` | Model fits on one GPU |
| `sequence_parallel` | `False` | Not needed at TP=1 |

These are the minimum viable settings. For a quick experiment they’re fine. For a production run you’ll want to scale up.

The simplest way to speed up training: add more nodes. This gives you more GPUs for data parallelism — each node processes a different batch slice in parallel, then gradients are all-reduced.

# 1 node (8 GPUs) — default from Qwen3_4B preset
run = SlimeConfig(model=Qwen3_4B(), ...)
# 4 nodes (32 GPUs) — 4x data parallelism
run = SlimeConfig(model=Qwen3_4B(), actor_num_nodes=4, ...)

With more GPUs you’ll typically also increase global_batch_size and num_rollout to keep each GPU busy:

| Nodes | GPUs | Suggested `global_batch_size` | Suggested `num_rollout` |
| --- | --- | --- | --- |
| 1 | 8 | 16 | 50 |
| 2 | 16 | 64 | 200 |
| 4 | 32 | 128 | 3000 |
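If you script several runs, these suggestions can be captured in a small lookup helper. This is a hypothetical convenience function, not part of the library; the returned keys mirror the `SlimeConfig` keyword arguments named above:

```python
# Suggested scaling table from this tutorial (starting points; tune per workload)
SCALING = {
    1: {"gpus": 8,  "global_batch_size": 16,  "num_rollout": 50},
    2: {"gpus": 16, "global_batch_size": 64,  "num_rollout": 200},
    4: {"gpus": 32, "global_batch_size": 128, "num_rollout": 3000},
}

def suggested_overrides(actor_num_nodes: int) -> dict:
    """Return SlimeConfig keyword overrides for a given node count."""
    row = SCALING[actor_num_nodes]
    return {
        "actor_num_nodes": actor_num_nodes,
        "global_batch_size": row["global_batch_size"],
        "num_rollout": row["num_rollout"],
    }

# e.g. SlimeConfig(model=Qwen3_4B(), **suggested_overrides(4), ...)
```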

By default, actor (training) and rollout (SGLang inference) share the same GPUs (colocate=True). This is memory-efficient but means training and inference can’t overlap.

For larger runs, separating them lets training and rollout happen in parallel on different GPU pools:

# Non-colocated: 3 nodes train, 1 node rolls out
run = SlimeConfig(
    model=Qwen3_4B(),
    actor_num_nodes=4,
    colocate=False,
    # rollout_num_gpus auto-derived from spare nodes
)
Beyond node count and colocation, a few SlimeConfig knobs trade memory for throughput:

  • use_dynamic_batch_size=True + max_tokens_per_gpu=9216 — pack variable-length prompts into a per-GPU token budget instead of a fixed batch count. Better GPU utilization for datasets with mixed prompt lengths.

  • recompute_granularity="full" — recompute activations during backward pass instead of storing them. Trades compute for memory, letting you fit larger batches.

  • n_samples_per_prompt=8 — generate more rollout samples per prompt for better advantage estimation (GRPO benefits from larger groups).
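Combined, those knobs look like the following config fragment (a sketch; other arguments elided, values taken from the suggestions above):

```python
run = SlimeConfig(
    model=Qwen3_4B(),
    actor_num_nodes=4,
    # Pack variable-length prompts up to a per-GPU token budget:
    use_dynamic_batch_size=True,
    max_tokens_per_gpu=9216,
    # Recompute activations in backward to fit larger batches:
    recompute_granularity="full",
    # Larger groups -> better GRPO advantage estimates:
    n_samples_per_prompt=8,
    ...
)
```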

Here’s a production-scale config that overrides the preset defaults for a 4-node run with larger batches and non-colocated rollout:

class GSM8KDataset(HuggingFaceDataset):
    hf_repo = "zhuzilin/gsm8k"
    input_key = "messages"
    label_key = "label"

    def __init__(self, data_root):
        super().__init__(data_root)
        test_path = self.prompt_data.replace("train.parquet", "test.parquet")
        self.eval_prompt_data = ["gsm8k", test_path]

    def prepare(self):
        from datasets import load_dataset

        super().prepare()
        test_path = self.prompt_data.replace("train.parquet", "test.parquet")
        ds = load_dataset(self.hf_repo, split="test")
        ds.to_parquet(test_path)


base_model = Qwen3_4B()

my_training_run = SlimeConfig(
    model=base_model,
    dataset=GSM8KDataset(DATA_PATH),
    wandb=WandbConfig(project="slime-grpo", group="qwen3-4b-scaled"),
    ref_load=base_model.model_name,
    # Override preset defaults for a larger run:
    actor_num_nodes=4,
    global_batch_size=128,
    num_rollout=3000,
    n_samples_per_prompt=8,
    rollout_batch_size=64,
)

app = my_training_run.build_app()

Source: tutorials/rl/002_customizing_your_slime_run/002_customizing_your_slime_run.py