GLM-4.7 LoRA SFT on GSM8K with ms-swift on Modal
What ms-swift is. ms-swift is ModelScope’s end-to-end fine-tuning toolkit. It supports PEFT methods (LoRA / QLoRA / DoRA) and full fine-tuning across HuggingFace and Megatron-LM backends. The Megatron backend is what makes it interesting at large scale — it gives you 4-D parallelism (TP / PP / EP / CP) for models that don’t fit on one GPU under HF’s data parallelism alone.
What this tutorial does. LoRA SFT of GLM-4.7 (a large MoE model)
on GSM8K, on 4 nodes × 8×H100 (32 GPUs). The interesting piece is
the parallelism split: TP=2, EP=4, PP=4, CP=1 — tensor parallel
across pairs of GPUs, 4-way expert parallel for the MoE layers,
4-stage pipeline parallel for the transformer blocks. Under the
hood this launches `megatron sft` via `torchrun` on each node in
the cluster. For the shared primitives (DatasetConfig, Model,
3-stage pipeline) see the quickstart.
What you’ll need.
- Access to Modal’s multi-node training preview (4 × 8×H100).
- A `wandb` Modal secret.
What to watch. W&B project `glm-4-7-sft`. Watch `train/loss`
and `train/grad_norm`; LoRA converges quickly on GSM8K, so expect
the loss to drop sharply within the first few hundred iterations.
```python
import modal

from modal_training_gym.common.dataset import DatasetConfig
from modal_training_gym.common.models import GLM_4_7
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.frameworks.ms_swift import (
    MsSwiftConfig,
    MsSwiftFrameworkConfig,
)
from modal_training_gym.frameworks.ms_swift.config import HF_CACHE_PATH
```

Define the dataset
ms-swift reads a JSONL file where each line is a chat-format object:
{"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}.
prepare() converts GSM8K’s (question, answer) columns into that
shape and writes it under the HF cache volume so both download and
dataset prep share the same mount.
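For concreteness, here is how a single GSM8K-style row maps onto one line of that JSONL file. The row values below are illustrative stand-ins, not real dataset content:

```python
import json

# Hypothetical GSM8K-style row (illustrative values only).
row = {
    "question": "A farmer has 3 fields with 12 rows of corn each. How many rows in total?",
    "answer": "3 * 12 = 36 rows. #### 36",
}

# One chat-format JSONL line, in the shape prepare() writes.
line = json.dumps(
    {
        "messages": [
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    }
)
print(line)
```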
```python
class GSM8KDataset(DatasetConfig):
    def __init__(
        self,
        hf_cache_root,
        data_folder="gsm8k",
        hf_dataset="openai/gsm8k",
        split="train",
        input_col="question",
        output_col="answer",
    ):
        self._hf_cache_root = str(hf_cache_root)
        self._data_folder = data_folder
        self._hf_dataset = hf_dataset
        self._split = split
        self._input_col = input_col
        self._output_col = output_col
        self.prompt_data = (
            f"{self._hf_cache_root}/msswift-data/{data_folder}/training.jsonl"
        )

    def prepare(self):
        import json
        import os

        from datasets import load_dataset

        output_dir = f"{self._hf_cache_root}/msswift-data/{self._data_folder}"
        os.makedirs(output_dir, exist_ok=True)

        # openai/gsm8k requires a config name; retry with "main" if the
        # bare load raises.
        try:
            ds = load_dataset(self._hf_dataset, split=self._split)
        except ValueError:
            ds = load_dataset(self._hf_dataset, "main", split=self._split)

        out_path = f"{output_dir}/training.jsonl"
        with open(out_path, "w") as f:
            for row in ds:
                messages = [
                    {"role": "user", "content": row[self._input_col]},
                    {"role": "assistant", "content": row[self._output_col]},
                ]
                f.write(json.dumps({"messages": messages}) + "\n")
        print(f"Wrote {out_path}")
```

Define the experiment
MsSwiftFrameworkConfig holds ms-swift-specific knobs; the launcher
forwards them to `megatron sft` as `--flag value` args.
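The forwarding mechanism is easy to picture — roughly, each config field becomes a `--flag value` pair on the command line. A toy sketch of the idea (not the actual launcher code):

```python
def to_cli_args(config: dict) -> list[str]:
    # Turn {"global_batch_size": 8} into ["--global_batch_size", "8"].
    args = []
    for flag, value in config.items():
        args.extend([f"--{flag}", str(value)])
    return args


print(to_cli_args({"global_batch_size": 8, "max_length": 2048}))
```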
Parallelism, MoE, and LoRA — from ModelTrainingConfig
GLM-4.7’s parallelism, MoE, and LoRA settings are defined on the
model itself via its ModelTrainingConfig (see GLM_4_7 in
common/models/glm_4_7.py). The framework pulls them automatically
— no need to set them on MsSwiftFrameworkConfig. Here’s what the
model provides for 32 GPUs = 4 nodes × 8 H100:
| Axis | Setting | Why |
|---|---|---|
| Tensor (TP) | 2 | Shard individual weight matrices across 2 GPUs |
| Expert (EP) | 4 | Spread MoE experts across 4 GPUs |
| Pipeline (PP) | 4 | 4-stage pipeline over transformer blocks |
| Context (CP) | 1 | No sequence-dim parallelism at this context length |
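A quick sanity check on how these axes tile the 32 GPUs: in Megatron-style parallelism, the data-parallel size is whatever is left of the world after TP, PP, and CP are taken out, and EP then shards the MoE experts within that dimension. A back-of-the-envelope sketch, not ms-swift’s internal bookkeeping:

```python
world_size = 32  # 4 nodes x 8 H100
tp, pp, cp = 2, 4, 1

# Data-parallel size is the remainder of the world after model parallelism.
assert world_size % (tp * pp * cp) == 0
dp = world_size // (tp * pp * cp)
print(dp)  # 4 data-parallel replicas; EP=4 spreads experts across them
```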
LoRA: lora_rank=128, lora_alpha=32 — higher rank than the usual
8–16; GLM-4.7 is large enough that a bigger rank pays for itself.
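One consequence of this pairing worth noting: under the standard LoRA convention the adapter update is scaled by alpha / rank, so `lora_rank=128` with `lora_alpha=32` applies a 0.25 multiplier to the low-rank update (a sketch of the usual convention; frameworks can differ in details):

```python
lora_rank = 128
lora_alpha = 32

# Standard LoRA: output = W @ x + (alpha / rank) * (B @ A) @ x
scaling = lora_alpha / lora_rank
print(scaling)  # 0.25
```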
Throughput
- `global_batch_size=8`, `max_length=2048` — GSM8K is short, so we don’t need long context; the batch is small because GLM-4.7 is big.
- `lr=1e-4` — standard LoRA LR (higher than a full-finetune LR because only the adapter params update).
- `train_iters=1`, `num_train_epochs=1` — set for a quick smoke run; bump either for real training.
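Putting those numbers together, the upper bound on tokens processed per optimizer step is global_batch_size × max_length (actual counts are lower once short GSM8K samples and padding are accounted for):

```python
global_batch_size = 8
max_length = 2048

tokens_per_iter = global_batch_size * max_length
print(tokens_per_iter)  # 16384 tokens per optimizer step, at most
```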
```python
swift_framework_config = MsSwiftFrameworkConfig(
    gpu="H100",
    n_nodes=4,
    gpus_per_node=8,
    global_batch_size=8,
    max_length=2048,
    train_iters=1,
    num_train_epochs=1,
)

my_training_run = MsSwiftConfig(
    dataset=GSM8KDataset(HF_CACHE_PATH),
    model=GLM_4_7(),
    wandb=WandbConfig(project="glm-4-7-sft"),
    framework_config=swift_framework_config,
)
```

Build and run
`build_app()` returns a Modal app with download_model,
prepare_dataset, and train. See
quickstart for the pattern.
```python
app = my_training_run.build_app()
```

Related API Reference
Source: tutorials/sft/ms_swift_glm_4_7_gsm8k/ms_swift_glm_4_7_gsm8k.py