
Quickstart

modal-training-gym wraps a handful of training frameworks (slime, Megatron, MS-SWIFT) behind a single pattern: declare a model, a dataset, and a logging config; hand them to a framework factory; modal run the returned app. This notebook walks through the pieces so the per-framework tutorials can focus on what makes each framework different instead of re-explaining the plumbing.

Nothing here launches a job. At the end there are links to concrete tutorials at each difficulty tier.

Every tutorial builds its run from three pure-data containers: a ModelConfiguration, a DatasetConfig, and a WandbConfig. Each framework’s launcher translates these into its own CLI vocabulary — you don’t write framework-specific flag names for the parts that are shared across frameworks.

A ModelConfiguration is identity + download hook. Built-in subclasses (Qwen3_4B, Qwen3_32B, GLM_4_7, Llama2_7B, KimiK2_5) set model_name and a ModelArchitecture spec. HFModelConfiguration is the base for any HF-hosted model — it implements download_model() via huggingface_hub.snapshot_download pulling weights into the huggingface-cache Modal volume.

For a custom HF model, subclass ModelConfiguration inline in your tutorial (see ms_swift_custom_hf) — there’s no registry to update.

from modal_training_gym.common.models import Qwen3_4B
model = Qwen3_4B()
print("model_name:", model.model_name)
print("num_layers:", model.architecture.num_layers)
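For the custom-model case, the shape is similar. The sketch below is illustrative only — it uses a plain class as a stand-in for an HFModelConfiguration subclass, and the model name and field names are assumptions, not the library's exact API:

```python
# Hypothetical stand-in for an inline HFModelConfiguration subclass.
# Field and method names mirror the pattern described above but are
# illustrative, not the library's exact API.
from dataclasses import dataclass


@dataclass
class MyTinyModel:
    model_name: str = "HuggingFaceTB/SmolLM2-135M"

    def download_model(self, cache_dir: str = "/root/.cache/huggingface"):
        # The real class implements this via huggingface_hub.snapshot_download,
        # pulling weights into the huggingface-cache volume.
        from huggingface_hub import snapshot_download
        return snapshot_download(self.model_name, cache_dir=cache_dir)


model = MyTinyModel()
print("model_name:", model.model_name)
```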

Subclass DatasetConfig, set the fields your framework needs, and implement prepare() to materialize the data into the shared /data volume. prepare() runs inside a Modal container so it can import datasets, talk to HF Hub, or run any preprocessing; the output lands on a volume that every subsequent run (on any node) reads from.

Fields like input_key, label_key, apply_chat_template, and rm_type are read by each framework’s launcher and translated into its own flags — you only write them once.

from modal_training_gym.common.dataset import DatasetConfig

class MyDataset(DatasetConfig):
    input_key = "messages"
    label_key = "label"
    apply_chat_template = True

    def __init__(self, data_path):
        self._data_path = str(data_path)
        self.prompt_data = f"{self._data_path}/my_dataset/train.parquet"

    def prepare(self):
        from datasets import load_dataset
        ds = load_dataset("some-org/my-dataset")
        ds["train"].to_parquet(self.prompt_data)

ds = MyDataset(data_path="/data")
print("prompt_data:", ds.prompt_data)
print("input_key:", ds.input_key)
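To make the "translated into its own flags" idea concrete, here is a sketch of how a launcher *might* map the shared fields onto CLI flags. The function and flag names below are made up for illustration; each real launcher has its own vocabulary:

```python
# Illustrative only: translating shared DatasetConfig fields into one
# framework's (hypothetical) flag names.
def to_framework_flags(cfg) -> list[str]:
    flags = [
        f"--input-key={cfg.input_key}",
        f"--label-key={cfg.label_key}",
    ]
    if getattr(cfg, "apply_chat_template", False):
        flags.append("--apply-chat-template")
    return flags


class DemoConfig:
    input_key = "messages"
    label_key = "label"
    apply_chat_template = True


print(to_framework_flags(DemoConfig()))
# → ['--input-key=messages', '--label-key=label', '--apply-chat-template']
```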

A WandbConfig is just metadata: project, group, and an optional exp_name. The W&B API key is injected at launch time from the wandb Modal secret — you don’t put it in code. Create the secret once via the Modal dashboard or modal secret create.

from modal_training_gym.common.wandb import WandbConfig
wandb = WandbConfig(project="my-runs", group="concepts-demo")
print("project:", wandb.project)

Every framework package exposes a build_<name>_app(...) factory that takes a framework-specific config plus a ModalConfig and returns a modal.App with a standard set of remote functions:

  • download_model — pulls the model into the huggingface-cache volume. Runs once per model — subsequent runs reuse the volume.
  • prepare_dataset — runs your DatasetConfig.prepare() against the /data volume. Runs once per dataset.
  • train — the actual training run. This is the long one; launch it with modal run --detach.

Framework-specific extras exist too: raw-mode slime/Megatron runs add convert_checkpoint; RL frameworks may add serving helpers. A tutorial’s .py file is where all of this is wired together — calling framework_config.build_app() closes over your config and hands back the app.

The same shape, written out:

from modal_training_gym.common.models import Qwen3_4B
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.frameworks.slime import (
    ModalConfig,
    SlimeConfig,
)

run = SlimeConfig(
    model=Qwen3_4B(),
    dataset=MyDataset(...),
    wandb=WandbConfig(project="my-runs", group="concepts-demo"),
    # … framework-specific flags …
    modal=ModalConfig(gpu="H100"),
)
app = run.build_app()  # modal.App with download_model / prepare_dataset / train

Different framework, same shape — swap slime for ms_swift, etc. The framework-specific flags differ (that’s what makes the frameworks different), but everything around them stays put.
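The factory shape itself is just a closure over the config. A minimal sketch in plain Python, with illustrative names standing in for the real modal.App machinery:

```python
# Minimal sketch of the build_app pattern: the factory closes over a
# config and returns an object exposing the three standard stages.
# Names are illustrative, not the library's API.
from types import SimpleNamespace


def build_demo_app(config: dict) -> SimpleNamespace:
    def download_model():
        return f"downloading {config['model']}"

    def prepare_dataset():
        return f"preparing {config['dataset']}"

    def train():
        return f"training {config['model']} on {config['dataset']}"

    return SimpleNamespace(
        download_model=download_model,
        prepare_dataset=prepare_dataset,
        train=train,
    )


app = build_demo_app({"model": "Qwen3-4B", "dataset": "my-dataset"})
print(app.train())  # → training Qwen3-4B on my-dataset
```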

Three Modal volumes are shared across tutorials so you pay the download cost once:

| Path (in container)      | Volume                    | Contents                                          |
| ------------------------ | ------------------------- | ------------------------------------------------- |
| /root/.cache/huggingface | huggingface-cache         | HF weights + tokenizers. Populated by download_model. |
| /data                    | &lt;framework&gt;-data          | Preprocessed datasets. Populated by prepare_dataset. |
| /checkpoints             | &lt;framework&gt;-checkpoints   | Training outputs. Populated by train.             |

Each framework uses its own data and checkpoints volumes (so one framework’s half-written state can’t corrupt another’s), but the HF cache is truly shared — download Qwen3-4B once, use it from slime and ms-swift.

Nothing in a tutorial directly manipulates volumes — the launchers mount them and the three remote functions know where to write. If you need to delete stale checkpoints, use modal volume rm against the relevant volume from a shell.

Every tutorial exposes the same three-stage pipeline. Two ways to invoke it:

CLI — one command per stage, easy to script. Always --detach for training so the run survives your terminal closing.

Terminal window
uv run modal run tutorials/<tutorial>/<tutorial>.py::app.download_model
uv run modal run tutorials/<tutorial>/<tutorial>.py::app.prepare_dataset
uv run modal run --detach tutorials/<tutorial>/<tutorial>.py::app.train

Reattach to a detached run’s logs from another terminal with modal app logs <app-name>.

Interactive — inside a notebook cell, open an ephemeral app and run one stage at a time. Good for iterating on prepare() or on a config without re-submitting the whole script.

import modal

with modal.enable_output():
    with app.run():
        app.download_model.remote()

The modal.enable_output() context manager streams logs back into the notebook so you can see what’s happening in real time.

Pick a tutorial that matches what you’re trying to learn. See the full catalog in tutorials/README.md.

Beginner (single node, one concept at a time)

  • nccl_benchmark — validate that multi-node NCCL works in your workspace before running real training.

  • ms_swift_custom_hf — LoRA SFT on a tiny SmolLM2-135M, with an inline custom ModelConfiguration subclass.

Intermediate (non-default wiring, 1–2 nodes)

  • slime_haiku — GRPO with a custom async reward function (structure scoring + optional LLM judge).

  • ray_slime_standalone — Ray-on-Modal pattern demo with a custom training loop.

Advanced (≥2 nodes, non-trivial parallelism, large models)


Source: tutorials/intro/quickstart/quickstart.py | Open in Modal Notebook