```python
from modal_training_gym.frameworks.slime.config import SlimeConfig
```

slime GRPO training configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| environment | dict | `{'PYTHONPATH': '/root/Megatron-LM/', 'CUDA_DEVICE_MAX_CONNECTIONS': '1', 'NCCL_NVLS_ENABLE': '1'}` | Injected into the Ray job runtime env. |
| async_mode | bool | False | When True, uses train_async.py. |
| slime_model_script | str | "" | Shell script path relative to /root/slime. |
| model | `ModelConfiguration \| None` | None | |
| Field | Type | Default | Description |
|---|---|---|---|
| dataset | `DatasetConfig \| None` | None | |
| wandb | `WandbConfig \| None` | None | |
| modal | `ModalConfig \| None` | None | |
| app_tags | dict | {} | Extra Modal app tags. |
| Field | Type | Default | Description |
|---|---|---|---|
| actor_num_nodes | `int \| None` | None | |
| actor_num_gpus_per_node | `int \| None` | None | |
| colocate | `bool \| None` | None | |
| rollout_num_gpus | `int \| None` | None | |
| rollout_num_gpus_per_engine | int | 1 | GPUs per SGLang rollout engine. |
| tensor_model_parallel_size | `int \| None` | None | |
| use_critic | bool | False | Use a separate critic network. |
| critic_num_nodes | `int \| None` | None | |
| critic_num_gpus_per_node | `int \| None` | None | |
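The cluster fields above determine how many nodes the training job occupies. As a rough sketch (the helper name `total_nodes` and the 8-GPUs-per-node assumption are illustrative, not part of the library): actor nodes are always counted, and rollout GPUs add nodes only when the engines are not colocated with the actors.

```python
# Illustrative sketch: estimate total nodes from the cluster fields above.
# Assumes a fixed number of GPUs per node for the rollout engines.
import math

def total_nodes(actor_num_nodes: int,
                rollout_num_gpus: int,
                colocate: bool,
                gpus_per_node: int = 8) -> int:
    """Actor nodes plus, when not colocated, enough nodes for rollout GPUs."""
    nodes = actor_num_nodes
    if not colocate:
        nodes += math.ceil(rollout_num_gpus / gpus_per_node)
    return nodes

# Colocated: rollout engines share the actor GPUs, so no extra nodes.
total_nodes(actor_num_nodes=2, rollout_num_gpus=8, colocate=True)   # -> 2
# Disaggregated: 16 rollout GPUs need 2 extra 8-GPU nodes.
total_nodes(actor_num_nodes=2, rollout_num_gpus=16, colocate=False)  # -> 4
```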
| Field | Type | Default | Description |
|---|---|---|---|
| advantage_estimator | str | "grpo" | Advantage estimation method. |
| n_samples_per_prompt | int | 2 | Rollout samples per prompt. |
| eps_clip | float | 0.2 | PPO clipping epsilon. |
| eps_clip_high | float | 0.28 | Asymmetric high-side PPO clip. |
| use_kl_loss | bool | True | Enable KL divergence loss. |
| kl_loss_type | str | "low_var_kl" | KL loss variant. |
| kl_loss_coef | float | 0.0 | KL loss coefficient. |
| entropy_coef | float | 0.0 | Entropy bonus coefficient. |
| ref_load | str | "" | Reference model checkpoint path. |
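To make the clipping and KL fields concrete, here is a minimal sketch of the asymmetric ("clip-higher") PPO objective and the standard low-variance KL estimator that `kl_loss_type = "low_var_kl"` names. The function names are illustrative, and the formulas assume the usual definitions; this is not slime's actual loss code.

```python
# Sketch of the asymmetric PPO clip and a low-variance KL estimator,
# using the textbook definitions; field names mirror the table above.
import math

def clipped_objective(ratio: float, advantage: float,
                      eps_clip: float = 0.2, eps_clip_high: float = 0.28) -> float:
    # Clip-higher: the upper bound uses eps_clip_high, the lower uses eps_clip,
    # letting the ratio grow slightly further for positive-advantage tokens.
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)
    return min(ratio * advantage, clipped * advantage)

def low_var_kl(logp: float, ref_logp: float) -> float:
    # k3 estimator: r - 1 - log(r) with r = pi_ref / pi.
    # Always non-negative and lower-variance than the naive -log(r) sample.
    r = math.exp(ref_logp - logp)
    return r - 1.0 - (ref_logp - logp)

clipped_objective(1.5, 1.0)   # upper-clipped at 1 + eps_clip_high = 1.28
low_var_kl(-1.0, -1.0)        # identical policies -> 0.0
```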
| Field | Type | Default | Description |
|---|---|---|---|
| num_rollout | int | 1 | Number of rollout episodes per step. |
| rollout_batch_size | int | 8 | Batch size for rollout generation. |
| rollout_max_response_len | int | 8192 | Maximum response length during rollout. |
| rollout_temperature | float | 1.0 | Sampling temperature for rollouts. |
| sglang_mem_fraction_static | float | 0.7 | SGLang static memory fraction. |
| Field | Type | Default | Description |
|---|---|---|---|
| global_batch_size | int | 16 | Global batch size. |
| lr | float | 1e-06 | Learning rate. |
| lr_decay_style | str | "constant" | LR decay schedule. |
| weight_decay | float | 0.1 | Weight decay. |
| adam_beta1 | float | 0.9 | Adam beta1. |
| adam_beta2 | float | 0.98 | Adam beta2. |
| optimizer | str | "adam" | Optimizer name. |
| name | str | "" | |
| Field | Type | Default | Description |
|---|---|---|---|
| attention_dropout | float | 0.0 | Attention dropout. |
| hidden_dropout | float | 0.0 | Hidden layer dropout. |
| attention_softmax_in_fp32 | bool | True | Compute softmax in FP32. |
| accumulate_allreduce_grads_in_fp32 | bool | True | Accumulate allreduce grads in FP32. |
| recompute_granularity | str | "full" | Activation recomputation granularity. |
| recompute_method | str | "uniform" | Activation recomputation method. |
| recompute_num_layers | int | 1 | Number of layers to recompute. |
| Field | Type | Default | Description |
|---|---|---|---|
| use_dynamic_batch_size | bool | True | Use dynamic batch sizing. |
| max_tokens_per_gpu | int | 9216 | Max tokens per GPU for dynamic batching. |
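Dynamic batch sizing caps each micro-batch by total token count rather than by sequence count, so short responses pack densely and long ones get batches to themselves. A minimal greedy sketch of the idea (the function name and packing strategy are illustrative, not slime's actual implementation):

```python
# Illustrative token-based dynamic batching: greedily pack sequences into
# micro-batches so no batch exceeds max_tokens_per_gpu.

def pack_by_tokens(seq_lens: list[int], max_tokens_per_gpu: int = 9216) -> list[list[int]]:
    batches: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for n in seq_lens:
        # Start a new batch when adding this sequence would overflow the cap.
        if current and current_tokens + n > max_tokens_per_gpu:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

pack_by_tokens([4096, 4096, 2048, 8192], max_tokens_per_gpu=9216)
# -> [[4096, 4096], [2048], [8192]]
```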
| Field | Type | Default | Description |
|---|---|---|---|
| eval_interval | int | 20 | Evaluation interval in training steps. |
| n_samples_per_eval_prompt | int | 4 | Eval samples per prompt. |
| eval_max_response_len | int | 16384 | Max response length for eval. |
| eval_top_p | float | 1.0 | Top-p sampling for eval. |
| eval_config | `dict \| None` | None | |
| Field | Type | Default | Description |
|---|---|---|---|
| save | str | "/checkpoints" | Checkpoint save directory. |
| save_interval | int | 1000 | Checkpoint save interval. |
| megatron_to_hf_mode | str | "bridge" | Checkpoint conversion mode ("bridge" or "raw"). |
| use_fault_tolerance | bool | True | Enable fault tolerance. |
| Field | Type | Default | Description |
|---|---|---|---|
| custom_rm_path | str | "" | Python import path for custom reward function. |
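The shape of a custom reward function can be sketched as below. The signature and return convention here are assumptions for illustration; check the slime documentation for the interface it actually expects.

```python
# Illustrative custom reward function; the exact signature slime expects
# is an assumption here -- consult the slime docs for the real interface.

def math_reward(prompt: str, response: str, label: str) -> float:
    """Return 1.0 when the final answer after '=' matches the label, else 0.0."""
    answer = response.rsplit("=", 1)[-1].strip()
    return 1.0 if answer == label.strip() else 0.0

# custom_rm_path would then point at this function by import path, e.g.
# custom_rm_path = "my_rewards.math_reward"   # hypothetical module path
math_reward("What is 2+2?", "2+2 = 4", "4")   # -> 1.0
```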
| Field | Type | Default | Description |
|---|---|---|---|
| sglang_config | `dict \| None` | None | |
| custom_config_path | `dict \| None` | None | |
| apply_chat_template_kwargs | str | "" | Extra kwargs passed to the chat template. |
| Field | Type | Default | Description |
|---|---|---|---|
| image_run_commands | list[str] | [] | |
| local_python_sources | list[str] | [] | |
| gpu_type | `str \| None` | None | |
| sequence_parallel | `bool \| None` | None | |
| rm_type | str | "math" | |
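Putting a few of the fields above together, a configuration might be built as follows. This assumes keyword-argument construction (e.g. a dataclass or pydantic model); the specific values are illustrative, not recommended settings.

```python
# Hedged sketch: constructing a SlimeConfig with a handful of the
# documented fields, assuming keyword-argument construction.
from modal_training_gym.frameworks.slime.config import SlimeConfig

config = SlimeConfig(
    async_mode=False,
    actor_num_nodes=1,
    actor_num_gpus_per_node=8,
    colocate=True,
    advantage_estimator="grpo",
    n_samples_per_prompt=8,
    rollout_batch_size=32,
    lr=1e-6,
    gpu_type="H100",
)
```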
Derived views and helpers:

- slime CLI arguments derived from this config.
- Materialize the training data.
- Extract dataset-related slime flags back into a DatasetConfig.
- Extract model-related slime flags back into a ModelConfiguration.
- Extract wandb-related slime flags back into a WandbConfig.
- Total Modal cluster nodes required by this config.
Source: modal_training_gym/frameworks/slime/config.py