GLM-4.7 LoRA SFT on GSM8K with ms-swift on Modal
What ms-swift is. ms-swift is ModelScope’s end-to-end fine-tuning toolkit. It supports PEFT methods (LoRA / QLoRA / DoRA) and full fine-tuning across HuggingFace and Megatron-LM backends. The Megatron backend is what makes it interesting at large scale — it gives you 4-D parallelism (TP / PP / EP / CP) for models that don’t fit on one GPU under HF’s data parallelism alone.
What this tutorial does. LoRA SFT of GLM-4.7 (a large MoE model)
on GSM8K, on 4 nodes × 8×H100 (32 GPUs). The interesting piece is
the parallelism split: TP=2, EP=4, PP=4, CP=1 — tensor parallel
across pairs of GPUs, 4-way expert parallel for the MoE layers,
4-stage pipeline parallel for the transformer blocks. Under the
hood this launches `megatron sft` via `torchrun` on each node in
the cluster. For the shared primitives (DatasetConfig, Model,
3-stage pipeline) see the quickstart.
What you’ll need.
- Access to Modal’s multi-node training preview (4 × 8×H100).
- A `wandb` Modal secret.
What to watch. W&B project `glm-4-7-sft`. Watch `train/loss`
and `train/grad_norm`; LoRA converges quickly on GSM8K, so expect
the loss to drop sharply within the first few hundred iterations.
```python
import modal

from modal_training_gym.common.dataset import DatasetConfig
from modal_training_gym.common.models import GLM_4_7
from modal_training_gym.common.wandb import WandbConfig
from modal_training_gym.frameworks.ms_swift import (
    MsSwiftConfig,
    MsSwiftFrameworkConfig,
)
from modal_training_gym.frameworks.ms_swift.config import HF_CACHE_PATH
```

Define the dataset
ms-swift reads a JSONL file where each line is a chat-format object:
{"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}.
prepare() converts GSM8K’s (question, answer) columns into that
shape and writes it under the HF cache volume so both download and
dataset prep share the same mount.
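For concreteness, here is how a single GSM8K-style row maps onto one line of that JSONL file. The row values below are illustrative stand-ins, not real dataset content:

```python
import json

# Hypothetical GSM8K-style row (illustrative values only).
row = {
    "question": "A farmer has 3 fields with 12 rows of corn each. How many rows in total?",
    "answer": "3 * 12 = 36 rows. #### 36",
}

# One chat-format JSONL line, in the shape prepare() writes.
line = json.dumps(
    {
        "messages": [
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    }
)
print(line)
```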
```python
class GSM8KDataset(DatasetConfig):
    def __init__(
        self,
        hf_cache_root,
        data_folder="gsm8k",
        hf_dataset="openai/gsm8k",
        split="train",
        input_col="question",
        output_col="answer",
    ):
        self._hf_cache_root = str(hf_cache_root)
        self._data_folder = data_folder
        self._hf_dataset = hf_dataset
        self._split = split
        self._input_col = input_col
        self._output_col = output_col
        self.prompt_data = (
            f"{self._hf_cache_root}/msswift-data/{data_folder}/training.jsonl"
        )

    def prepare(self):
        import json
        import os

        from datasets import load_dataset

        output_dir = f"{self._hf_cache_root}/msswift-data/{self._data_folder}"
        os.makedirs(output_dir, exist_ok=True)

        # openai/gsm8k requires a config name; retry with "main" if the
        # bare load raises.
        try:
            ds = load_dataset(self._hf_dataset, split=self._split)
        except ValueError:
            ds = load_dataset(self._hf_dataset, "main", split=self._split)

        out_path = f"{output_dir}/training.jsonl"
        with open(out_path, "w") as f:
            for row in ds:
                messages = [
                    {"role": "user", "content": row[self._input_col]},
                    {"role": "assistant", "content": row[self._output_col]},
                ]
                f.write(json.dumps({"messages": messages}) + "\n")
        print(f"Wrote {out_path}")
```

Define the experiment
MsSwiftFrameworkConfig holds ms-swift-specific knobs; the launcher
forwards them to `megatron sft` as `--flag value` args.
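The forwarding mechanism is easy to picture — roughly, each config field becomes a `--flag value` pair on the command line. A toy sketch of the idea (not the actual launcher code):

```python
def to_cli_args(config: dict) -> list[str]:
    # Turn {"global_batch_size": 8} into ["--global_batch_size", "8"].
    args = []
    for flag, value in config.items():
        args.extend([f"--{flag}", str(value)])
    return args


print(to_cli_args({"global_batch_size": 8, "max_length": 2048}))
```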
Parallelism, MoE, and LoRA — from ModelTrainingConfig
GLM-4.7’s parallelism, MoE, and LoRA settings are defined on the
model itself via its ModelTrainingConfig (see GLM_4_7 in
common/models/glm_4_7.py). The framework pulls them automatically
— no need to set them on MsSwiftFrameworkConfig. Here’s what the
model provides for 32 GPUs = 4 nodes × 8 H100:
| Axis | Setting | Why |
|---|---|---|
| Tensor (TP) | 2 | Shard individual weight matrices across 2 GPUs |
| Expert (EP) | 4 | Spread MoE experts across 4 GPUs |
| Pipeline (PP) | 4 | 4-stage pipeline over transformer blocks |
| Context (CP) | 1 | No sequence-dim parallelism at this context length |
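A quick sanity check on how these axes tile the 32 GPUs: in Megatron-style parallelism, the data-parallel size is whatever is left of the world after TP, PP, and CP are taken out, and EP then shards the MoE experts within that dimension. A back-of-the-envelope sketch, not ms-swift’s internal bookkeeping:

```python
world_size = 32  # 4 nodes x 8 H100
tp, pp, cp = 2, 4, 1

# Data-parallel size is the remainder of the world after model parallelism.
assert world_size % (tp * pp * cp) == 0
dp = world_size // (tp * pp * cp)
print(dp)  # 4 data-parallel replicas; EP=4 spreads experts across them
```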
LoRA: lora_rank=128, lora_alpha=32 — higher rank than the usual
8–16; GLM-4.7 is large enough that a bigger rank pays for itself.
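One consequence of this pairing worth noting: under the standard LoRA convention the adapter update is scaled by alpha / rank, so `lora_rank=128` with `lora_alpha=32` applies a 0.25 multiplier to the low-rank update (a sketch of the usual convention; frameworks can differ in details):

```python
lora_rank = 128
lora_alpha = 32

# Standard LoRA: output = W @ x + (alpha / rank) * (B @ A) @ x
scaling = lora_alpha / lora_rank
print(scaling)  # 0.25
```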
Throughput
- `global_batch_size=8`, `max_length=2048` — GSM8K is short, so we don’t need long context; the batch is small because GLM-4.7 is big.
- `lr=1e-4` — standard LoRA LR (higher than a full-finetune LR because only the adapter params update).
- `train_iters=1`, `num_train_epochs=1` — set for a quick smoke run; bump either for real training.
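Putting those numbers together, the upper bound on tokens processed per optimizer step is global_batch_size × max_length (actual counts are lower once short GSM8K samples and padding are accounted for):

```python
global_batch_size = 8
max_length = 2048

tokens_per_iter = global_batch_size * max_length
print(tokens_per_iter)  # 16384 tokens per optimizer step, at most
```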
```python
swift_framework_config = MsSwiftFrameworkConfig(
    gpu="H100",
    n_nodes=4,
    gpus_per_node=8,
    global_batch_size=8,
    max_length=2048,
    train_iters=1,
    num_train_epochs=1,
)

my_training_run = MsSwiftConfig(
    dataset=GSM8KDataset(HF_CACHE_PATH),
    model=GLM_4_7(),
    wandb=WandbConfig(project="glm-4-7-sft"),
    framework_config=swift_framework_config,
)
```

Build and run
`build_app()` returns a Modal app with download_model,
prepare_dataset, and train. See
quickstart for the pattern.
```python
app = my_training_run.build_app()
```

Related API Reference
Source: tutorials/sft/ms_swift_glm_4_7_gsm8k/ms_swift_glm_4_7_gsm8k.py