# Quickstart
modal-training-gym wraps a handful of training frameworks (slime,
Megatron, MS-SWIFT) behind a single pattern: declare
a model, a dataset, and a logging config; hand
them to a framework factory; `modal run` the returned app. This
notebook walks through the pieces so the per-framework tutorials can
focus on what makes each framework different instead of re-explaining
the plumbing.
Nothing here launches a job. At the end there are links to concrete tutorials at each difficulty tier.
## The three config containers

Every tutorial builds its run from three pure-data containers: a
ModelConfiguration, a DatasetConfig, and a WandbConfig. Each
framework’s launcher translates these into its own CLI vocabulary —
you don’t write framework-specific flag names for the parts that are
shared across frameworks.
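To make the "translated into its own CLI vocabulary" idea concrete, here is a toy sketch — not the library's real launcher code, and the flag spellings are invented for illustration — of how the same shared fields might become different flags per framework:

```python
# Toy sketch of config-to-flags translation. Field and flag names are
# illustrative assumptions, not modal-training-gym's real ones.
shared = {"input_key": "messages", "apply_chat_template": True}

def to_slime_flags(cfg: dict) -> list[str]:
    # One framework's hypothetical spelling of the shared fields.
    flags = [f"--input-key={cfg['input_key']}"]
    if cfg["apply_chat_template"]:
        flags.append("--apply-chat-template")
    return flags

def to_swift_flags(cfg: dict) -> list[str]:
    # Another framework spells the same concepts differently.
    flags = [f"--dataset_input_key {cfg['input_key']}"]
    if cfg["apply_chat_template"]:
        flags.append("--use_chat_template true")
    return flags

print(to_slime_flags(shared))
print(to_swift_flags(shared))
```

You write the fields once on the config object; each launcher owns its own spelling.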
### ModelConfiguration — the model to train

A ModelConfiguration is identity + a download hook. Built-in subclasses
(Qwen3_4B, Qwen3_32B, GLM_4_7, Llama2_7B, KimiK2_5) set
model_name and a ModelArchitecture spec. HFModelConfiguration is
the base for any HF-hosted model — it implements download_model() via
huggingface_hub.snapshot_download pulling weights into the
huggingface-cache Modal volume.
For a custom HF model, subclass ModelConfiguration inline in your
tutorial (see
002_custom_model) —
there’s no registry to update.
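To see the shape of an inline subclass without the library, here is a minimal sketch against a stub base class — the attribute and method names are assumptions for illustration, and the real download hook would use huggingface_hub.snapshot_download as described above:

```python
from dataclasses import dataclass

# Stub standing in for modal_training_gym's ModelConfiguration base class;
# names here are assumptions for illustration.
@dataclass
class ModelConfiguration:
    model_name: str = ""

    def download_model(self) -> None:
        raise NotImplementedError

# An inline subclass for a custom HF-hosted model: set the identity,
# implement the download hook, use it directly — no registry involved.
@dataclass
class MyHFModel(ModelConfiguration):
    model_name: str = "some-org/my-model"

    def download_model(self) -> None:
        # The real HFModelConfiguration would call
        # huggingface_hub.snapshot_download here, pulling weights into
        # the shared huggingface-cache volume.
        print(f"(stub) would download {self.model_name}")

model = MyHFModel()
print(model.model_name)
```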
```python
from modal_training_gym.common.models import Qwen3_4B

model = Qwen3_4B()
print("model_name:", model.model_name)
print("num_layers:", model.architecture.num_layers)
```

### DatasetConfig — what to train on

Subclass DatasetConfig, set the fields your framework needs, and
implement prepare() to materialize the data into the shared /data
volume. prepare() runs inside a Modal container so it can import
datasets, talk to HF Hub, or run any preprocessing; the output lands
on a volume that every subsequent run (on any node) reads from.
Fields like input_key, label_key, apply_chat_template, and
rm_type are read by each framework’s launcher and translated into its
own flags — you only write them once.
```python
from modal_training_gym.common.dataset import DatasetConfig


class MyDataset(DatasetConfig):
    input_key = "messages"
    label_key = "label"
    apply_chat_template = True

    def __init__(self, data_path):
        self._data_path = str(data_path)
        self.prompt_data = f"{self._data_path}/my_dataset/train.parquet"

    def prepare(self):
        from datasets import load_dataset

        ds = load_dataset("some-org/my-dataset")
        ds["train"].to_parquet(self.prompt_data)


ds = MyDataset(data_path="/data")
print("prompt_data:", ds.prompt_data)
print("input_key:", ds.input_key)
```

### WandbConfig — where to log

Just metadata: project, group, optional exp_name. The W&B API key
is injected at launch time from the wandb Modal secret — you don’t
put it in code. Create the secret once via the Modal dashboard or
modal secret create.
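Assuming the secret stores the key under the conventional WANDB_API_KEY environment-variable name (an assumption — check your workspace's secret), creating it from a shell looks like:

```shell
# One-time setup: store your W&B API key in a Modal secret named "wandb".
# The placeholder value is illustrative — paste your real key.
modal secret create wandb WANDB_API_KEY=<your-api-key>
```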
```python
from modal_training_gym.common.wandb import WandbConfig

wandb = WandbConfig(project="my-runs", group="concepts-demo")
print("project:", wandb.project)
```

## The framework factory pattern

Every framework package exposes a build_<name>_app(...) factory
that takes a framework-specific config plus a ModalConfig and
returns a modal.App with a standard set of remote functions:
- `download_model` — pulls the model into the `huggingface-cache` volume. One-time per model — subsequent runs reuse the volume.
- `prepare_dataset` — runs your `DatasetConfig.prepare()` against the `/data` volume. One-time per dataset.
- `train` — the actual training run. This is the long one; launch it with `modal run --detach`.
Framework-specific extras exist too: raw-mode slime/Megatron runs add
convert_checkpoint; RL frameworks may add serving helpers. A
tutorial’s .py file is where all of this is wired together —
calling framework_config.build_app() closes over your config and
hands back the app.
The same shape, written out:
```python
from modal_training_gym.common.models import Qwen3_4B
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.frameworks.slime import (
    ModalConfig,
    SlimeConfig,
)

run = SlimeConfig(
    model=Qwen3_4B(),  # GPU type derived from model.training.gpu_type
    dataset=MyDataset(...),
    wandb=WandbConfig(project="my-runs", group="concepts-demo"),
    # … framework-specific flags …
)

app = run.build_app()  # modal.App with download_model / prepare_dataset / train
```

Different framework, same shape — swap slime for ms_swift,
etc. The framework-specific flags differ (that’s what makes
the frameworks different), but everything around them stays put.
## Volume layout

Three Modal volumes are shared across tutorials so you pay the download cost once:
| Path (in container) | Volume | Contents |
|---|---|---|
| `/root/.cache/huggingface` | `huggingface-cache` | HF weights + tokenizers. Populated by `download_model`. |
| `/data` | `<framework>-data` | Preprocessed datasets. Populated by `prepare_dataset`. |
| `/checkpoints` | `<framework>-checkpoints` | Training outputs. Populated by `train`. |
Each framework uses its own data and checkpoints volumes (so one
framework’s half-written state can’t corrupt another’s), but the HF
cache is truly shared — download Qwen3-4B once, use it from slime
and ms-swift.
Nothing in a tutorial directly manipulates volumes — the launchers
mount them and the three remote functions know where to write. If you
need to delete stale checkpoints, use modal volume rm against the
relevant volume from a shell.
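For example, to inspect and then prune a checkpoints volume from a shell — the volume and path names below are illustrative, following the `<framework>-checkpoints` pattern above:

```shell
# List the contents of a framework's checkpoints volume…
modal volume ls slime-checkpoints
# …then recursively delete a stale run directory.
modal volume rm -r slime-checkpoints /old-run
```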
## Running the pipeline

Every tutorial exposes the same three-stage pipeline. Two ways to invoke it:
CLI — one command per stage, easy to script. Always --detach for
training so the run survives your terminal closing.
```shell
uv run modal run tutorials/<tutorial>/<tutorial>.py::app.download_model
uv run modal run tutorials/<tutorial>/<tutorial>.py::app.prepare_dataset
uv run modal run --detach tutorials/<tutorial>/<tutorial>.py::app.train
```

Reattach to a detached run's logs from another terminal with
`modal app logs <app-name>`.
Interactive — inside a notebook cell, open an ephemeral app and
run one stage at a time. Good for iterating on prepare() or on a
config without re-submitting the whole script.
```python
import modal

with modal.enable_output():
    with app.run():
        app.download_model.remote()
```

The `modal.enable_output()` context manager streams logs back into
the notebook so you can see what’s happening in real time.
## Where to go next

Pick a tutorial that matches what you're trying to learn. See the full
catalog in tutorials/README.md.
**Beginner** (single node, one concept at a time)

- `nccl_benchmark` — validate that multi-node NCCL works in your workspace before running real training.
- `002_custom_model` — LoRA SFT on a tiny SmolLM2-135M, with an inline custom `ModelConfiguration` subclass.

**Intermediate** (non-default wiring, 1–2 nodes)

- `slime_haiku` — GRPO with a custom async reward function (structure scoring + optional LLM judge).
- `ray_slime_standalone` — Ray-on-Modal pattern demo with a custom training loop.

**Advanced** (≥2 nodes, non-trivial parallelism, large models)

- `slime_gsm8k` — colocated 4-node GRPO, the canonical math-RL reference.
- `001_ms_swift` — GLM-4.7 LoRA SFT on GSM8K using ms-swift + Megatron.
## Related API Reference

- `ModelConfiguration`
- `HFModelConfiguration`
- `ModelArchitecture`
- `Qwen3_4B`
- `DatasetConfig`
- `WandbConfig`
- `SlimeConfig`
Source: tutorials/intro/001_quickstart/001_quickstart.py