# Quickstart
modal-training-gym wraps a handful of training frameworks (slime,
Megatron, MS-SWIFT) behind a single pattern: declare
a model, a dataset, and a logging config; hand
them to a framework factory; `modal run` the returned app. This
notebook walks through the pieces so the per-framework tutorials can
focus on what makes each framework different instead of re-explaining
the plumbing.
Nothing here launches a job. At the end there are links to concrete tutorials at each difficulty tier.
## The three config containers

Every tutorial builds its run from three pure-data containers: a
ModelConfiguration, a DatasetConfig, and a WandbConfig. Each
framework’s launcher translates these into its own CLI vocabulary —
you don’t write framework-specific flag names for the parts that are
shared across frameworks.
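To make the "translated into its own CLI vocabulary" idea concrete, here is a toy sketch — not the library's real launcher code, and the flag spellings are invented for illustration — of how the same shared fields might become different flags per framework:

```python
# Toy sketch of config-to-flags translation. Field and flag names are
# illustrative assumptions, not modal-training-gym's real ones.
shared = {"input_key": "messages", "apply_chat_template": True}

def to_slime_flags(cfg: dict) -> list[str]:
    # One framework's hypothetical spelling of the shared fields.
    flags = [f"--input-key={cfg['input_key']}"]
    if cfg["apply_chat_template"]:
        flags.append("--apply-chat-template")
    return flags

def to_swift_flags(cfg: dict) -> list[str]:
    # Another framework spells the same concepts differently.
    flags = [f"--dataset_input_key {cfg['input_key']}"]
    if cfg["apply_chat_template"]:
        flags.append("--use_chat_template true")
    return flags

print(to_slime_flags(shared))
print(to_swift_flags(shared))
```

You write the fields once on the config object; each launcher owns its own spelling.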
### ModelConfiguration — the model to train

A ModelConfiguration is identity + a download hook. Built-in subclasses
(Qwen3_4B, Qwen3_32B, GLM_4_7, Llama2_7B, KimiK2_5) set
model_name and a ModelArchitecture spec. HFModelConfiguration is
the base for any HF-hosted model — it implements download_model() via
huggingface_hub.snapshot_download pulling weights into the
huggingface-cache Modal volume.
For a custom HF model, subclass ModelConfiguration inline in your
tutorial (see
002_custom_model) —
there’s no registry to update.
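To see the shape of an inline subclass without the library, here is a minimal sketch against a stub base class — the attribute and method names are assumptions for illustration, and the real download hook would use huggingface_hub.snapshot_download as described above:

```python
from dataclasses import dataclass

# Stub standing in for modal_training_gym's ModelConfiguration base class;
# names here are assumptions for illustration.
@dataclass
class ModelConfiguration:
    model_name: str = ""

    def download_model(self) -> None:
        raise NotImplementedError

# An inline subclass for a custom HF-hosted model: set the identity,
# implement the download hook, use it directly — no registry involved.
@dataclass
class MyHFModel(ModelConfiguration):
    model_name: str = "some-org/my-model"

    def download_model(self) -> None:
        # The real HFModelConfiguration would call
        # huggingface_hub.snapshot_download here, pulling weights into
        # the shared huggingface-cache volume.
        print(f"(stub) would download {self.model_name}")

model = MyHFModel()
print(model.model_name)
```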
```python
from modal_training_gym.common.models import Qwen3_4B

model = Qwen3_4B()
print("model_name:", model.model_name)
print("num_layers:", model.architecture.num_layers)
```

### DatasetConfig — what to train on

Subclass DatasetConfig, set the fields your framework needs, and
implement prepare() to materialize the data into the shared /data
volume. prepare() runs inside a Modal container so it can import
datasets, talk to HF Hub, or run any preprocessing; the output lands
on a volume that every subsequent run (on any node) reads from.
Fields like input_key, label_key, apply_chat_template, and
rm_type are read by each framework’s launcher and translated into its
own flags — you only write them once.
```python
from modal_training_gym.common.dataset import DatasetConfig


class MyDataset(DatasetConfig):
    input_key = "messages"
    label_key = "label"
    apply_chat_template = True

    def __init__(self, data_path):
        self._data_path = str(data_path)
        self.prompt_data = f"{self._data_path}/my_dataset/train.parquet"

    def prepare(self):
        from datasets import load_dataset

        ds = load_dataset("some-org/my-dataset")
        ds["train"].to_parquet(self.prompt_data)


ds = MyDataset(data_path="/data")
print("prompt_data:", ds.prompt_data)
print("input_key:", ds.input_key)
```

### WandbConfig — where to log

Just metadata: project, group, optional exp_name. The W&B API key
is injected at launch time from the wandb Modal secret — you don’t
put it in code. Create the secret once via the Modal dashboard or
modal secret create.
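Assuming the secret stores the key under the conventional WANDB_API_KEY environment-variable name (an assumption — check your workspace's secret), creating it from a shell looks like:

```shell
# One-time setup: store your W&B API key in a Modal secret named "wandb".
# The placeholder value is illustrative — paste your real key.
modal secret create wandb WANDB_API_KEY=<your-api-key>
```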
```python
from modal_training_gym.common.wandb import WandbConfig

wandb = WandbConfig(project="my-runs", group="concepts-demo")
print("project:", wandb.project)
```

## The framework factory pattern

Every framework package exposes a build_<name>_app(...) factory
that takes a framework-specific config plus a ModalConfig and
returns a modal.App with a standard set of remote functions:
- `download_model` — pulls the model into the `huggingface-cache` volume. One-time per model — subsequent runs reuse the volume.
- `prepare_dataset` — runs your `DatasetConfig.prepare()` against the `/data` volume. One-time per dataset.
- `train` — the actual training run. This is the long one; launch it with `modal run --detach`.
Framework-specific extras exist too: raw-mode slime/Megatron runs add
convert_checkpoint; RL frameworks may add serving helpers. A
tutorial’s .py file is where all of this is wired together —
calling framework_config.build_app() closes over your config and
hands back the app.
The same shape, written out:
```python
from modal_training_gym.common.models import Qwen3_4B
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.frameworks.slime import (
    ModalConfig,
    SlimeConfig,
)

run = SlimeConfig(
    model=Qwen3_4B(),  # GPU type derived from model.training.gpu_type
    dataset=MyDataset(...),
    wandb=WandbConfig(project="my-runs", group="concepts-demo"),
    # … framework-specific flags …
)

app = run.build_app()  # modal.App with download_model / prepare_dataset / train
```

Different framework, same shape — swap slime for ms_swift,
etc. The framework-specific flags differ (that’s what makes
the frameworks different), but everything around them stays put.
## Volume layout

Three Modal volumes are shared across tutorials so you pay the download cost once:
| Path (in container) | Volume | Contents |
|---|---|---|
| `/root/.cache/huggingface` | `huggingface-cache` | HF weights + tokenizers. Populated by `download_model`. |
| `/data` | `<framework>-data` | Preprocessed datasets. Populated by `prepare_dataset`. |
| `/checkpoints` | `<framework>-checkpoints` | Training outputs. Populated by `train`. |
Each framework uses its own data and checkpoints volumes (so one
framework’s half-written state can’t corrupt another’s), but the HF
cache is truly shared — download Qwen3-4B once, use it from slime
and ms-swift.
Nothing in a tutorial directly manipulates volumes — the launchers
mount them and the three remote functions know where to write. If you
need to delete stale checkpoints, use modal volume rm against the
relevant volume from a shell.
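For example, to inspect and then prune a checkpoints volume from a shell — the volume and path names below are illustrative, following the `<framework>-checkpoints` pattern above:

```shell
# List the contents of a framework's checkpoints volume…
modal volume ls slime-checkpoints
# …then recursively delete a stale run directory.
modal volume rm -r slime-checkpoints /old-run
```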
## Running the pipeline

Every tutorial exposes the same three-stage pipeline. Two ways to invoke it:
CLI — one command per stage, easy to script. Always --detach for
training so the run survives your terminal closing.
```shell
uv run modal run tutorials/<tutorial>/<tutorial>.py::app.download_model
uv run modal run tutorials/<tutorial>/<tutorial>.py::app.prepare_dataset
uv run modal run --detach tutorials/<tutorial>/<tutorial>.py::app.train
```

Reattach to a detached run's logs from another terminal with
`modal app logs <app-name>`.
Interactive — inside a notebook cell, open an ephemeral app and
run one stage at a time. Good for iterating on prepare() or on a
config without re-submitting the whole script.
```python
import modal

with modal.enable_output():
    with app.run():
        app.download_model.remote()
```

The `modal.enable_output()` context manager streams logs back into
the notebook so you can see what’s happening in real time.
## Where to go next

Pick a tutorial that matches what you're trying to learn. See the full
catalog in tutorials/README.md.
**Beginner** (single node, one concept at a time)

- `nccl_benchmark` — validate that multi-node NCCL works in your workspace before running real training.
- `002_custom_model` — LoRA SFT on a tiny SmolLM2-135M, with an inline custom `ModelConfiguration` subclass.

**Intermediate** (non-default wiring, 1–2 nodes)

- `slime_haiku` — GRPO with a custom async reward function (structure scoring + optional LLM judge).
- `ray_slime_standalone` — Ray-on-Modal pattern demo with a custom training loop.

**Advanced** (≥2 nodes, non-trivial parallelism, large models)

- `slime_gsm8k` — colocated 4-node GRPO, the canonical math-RL reference.
- `001_ms_swift` — GLM-4.7 LoRA SFT on GSM8K using ms-swift + Megatron.
## Related API Reference

- `ModelConfiguration`
- `HFModelConfiguration`
- `ModelArchitecture`
- `Qwen3_4B`
- `DatasetConfig`
- `WandbConfig`
- `SlimeConfig`
Source: tutorials/intro/001_quickstart/001_quickstart.py