Qwen3_6_35b_Recipe

Qwen3.6-35B-A3B (MoE) on 1×8×H100 with TP2/PP2/CP1/EP4.

`from modal_training_gym.train_recipes.slime_recipe.qwen3_6_35b import Qwen3_6_35b_Recipe`

Inherits from: SlimeRecipe, BaseTrainRecipe
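
The TP2/PP2/CP1/EP4 layout on 8 GPUs can be sanity-checked with a little arithmetic. The sketch below assumes the usual Megatron-style decomposition, where the world size factors as TP × PP × CP × DP for dense layers and as ETP × EP × PP × expert-DP for MoE layers. The `Layout` class and its method names are hypothetical helpers for illustration and are not part of `modal_training_gym`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical helper: parallel sizes for one training job."""

    world_size: int  # total GPUs (1 node x 8 H100)
    tp: int          # tensor parallel
    pp: int          # pipeline parallel
    cp: int          # context parallel
    ep: int          # expert parallel
    etp: int = 1     # expert tensor parallel

    def data_parallel_size(self) -> int:
        # Dense layers: whatever is left after TP, PP and CP is data parallel.
        dense = self.tp * self.pp * self.cp
        if self.world_size % dense:
            raise ValueError("world_size must be divisible by TP*PP*CP")
        return self.world_size // dense

    def expert_data_parallel_size(self) -> int:
        # MoE layers: assumed factorisation ETP * EP * PP * expert-DP.
        expert = self.etp * self.ep * self.pp
        if self.world_size % expert:
            raise ValueError("world_size must be divisible by ETP*EP*PP")
        return self.world_size // expert


layout = Layout(world_size=8, tp=2, pp=2, cp=1, ep=4, etp=1)
dp = layout.data_parallel_size()
expert_dp = layout.expert_data_parallel_size()
print(f"dense DP = {dp}, expert DP = {expert_dp}")
```

Under these assumptions the dense layers run with a data-parallel size of 2, and the four ranks in each pipeline stage are fully consumed by expert parallelism.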

Fields

Field	Type	Default
`gpu_type`	`str`	`"H100"`
`colocate`	`bool`	`True`
`tensor_model_parallel_size`	`int`	`2`
`sequence_parallel`	`bool`	`True`
`rollout_num_gpus_per_engine`	`int`	`4`
`num_rollout`	`int`	`3000`
`rollout_batch_size`	`int`	`16`
`rollout_max_response_len`	`int`	`16384`
`rollout_temperature`	`float`	`1.0`
`save_interval`	`int`	`20`
`recipe_type`	`RecipeType`	`slime`
`name`	`str`	`""`
`app_tags`	`dict`	`{}`
`environment`	`dict`	`{'PYTHONPATH': '/root/Megatron-LM/', 'CUDA_DEVICE_MAX_CONNECTIONS': '1', 'NCCL_NVLS_ENABLE': '1'}`
`async_mode`	`bool`	`False`
`wandb`	`WandbConfig \| None`	`None`
`image_overlay`	`collections.abc.Callable[[modal.image.Image], modal.image.Image] \| None`	`None`
`local_slime`	`str \| None`	`None`
`memory`	`int \| tuple[int, int] \| None`	`None`
`cloud`	`str \| None`	`None`
`region`	`str \| None`	`None`
`slime_model_script`	`str`	`"scripts/models/qwen3.5-35B-A3B.sh"`
`source_hf_checkpoint`	`str \| None`	`None`
`megatron_conversion_hf_checkpoint`	`str \| None`	`None`
`patch_files`	`list[str]`	`[]`
`image_run_commands`	`list[str]`	`[]`
`image_env`	`dict[str, str]`	`{}`
`train_function_kwargs`	`dict[str, int]`	`{'ephemeral_disk': 1048576}`
`actor_num_nodes`	`int`	`1`
`actor_num_gpus_per_node`	`int`	`8`
`rollout_num_gpus`	`int \| None`	`None`
`use_critic`	`bool`	`False`
`critic_num_nodes`	`int \| None`	`None`
`critic_num_gpus_per_node`	`int \| None`	`None`
`advantage_estimator`	`str`	`"grpo"`
`n_samples_per_prompt`	`int`	`8`
`eps_clip`	`float`	`0.2`
`eps_clip_high`	`float`	`0.28`
`use_kl_loss`	`bool`	`True`
`kl_loss_type`	`str`	`"low_var_kl"`
`kl_loss_coef`	`float`	`0.0`
`kl_coef`	`float`	`0.0`
`entropy_coef`	`float`	`0.0`
`calculate_per_token_loss`	`bool`	`False`
`ref_load`	`str`	`"/checkpoints/Qwen3.6-35B-A3B_torch_dist_tp2pp2"`
`over_sampling_batch_size`	`int \| None`	`None`
`dynamic_sampling_filter_path`	`str \| None`	`None`
`balance_data`	`bool`	`True`
`rollout_shuffle`	`bool`	`True`
`rollout_top_p`	`float`	`1.0`
`rollout_stop_token_ids`	`list[int] \| None`	`None`
`sglang_mem_fraction_static`	`float`	`0.75`
`global_batch_size`	`int`	`128`
`lr`	`float`	`1e-06`
`lr_decay_style`	`str`	`"constant"`
`weight_decay`	`float`	`0.1`
`adam_beta1`	`float`	`0.9`
`adam_beta2`	`float`	`0.98`
`optimizer`	`str`	`"adam"`
`attention_dropout`	`float`	`0.0`
`hidden_dropout`	`float`	`0.0`
`attention_softmax_in_fp32`	`bool`	`True`
`accumulate_allreduce_grads_in_fp32`	`bool`	`True`
`use_distributed_optimizer`	`bool`	`False`
`recompute_granularity`	`str`	`"full"`
`recompute_method`	`str`	`"uniform"`
`recompute_num_layers`	`int`	`1`
`use_dynamic_batch_size`	`bool`	`True`
`max_tokens_per_gpu`	`int`	`8192`
`eval_interval`	`int \| None`	`None`
`n_samples_per_eval_prompt`	`int`	`4`
`eval_max_response_len`	`int`	`16384`
`eval_top_p`	`float`	`1.0`
`eval_config`	`dict \| None`	`None`
`save`	`str`	`"/checkpoints"`
`load`	`str`	`""`
`no_save_optim`	`bool`	`True`
`megatron_to_hf_mode`	`str`	`""`
`use_fault_tolerance`	`bool`	`True`
`update_weight_mode`	`str`	`"full"`
`update_weight_transport`	`str`	`"nccl"`
`update_weight_encoding`	`str`	`"indices"`
`update_weight_disk_dir`	`str`	`""`
`rm_type`	`str \| None`	`None`
`custom_rm_function`	`collections.abc.Callable \| None`	`None`
`custom_generate_function`	`collections.abc.Callable \| None`	`None`
`custom_rollout_log_function`	`collections.abc.Callable \| str \| None`	`None`
`custom_eval_rollout_log_function`	`collections.abc.Callable \| str \| None`	`None`
`rollout_function`	`collections.abc.Callable \| str \| None`	`None`
`custom_megatron_before_log_prob_hook`	`collections.abc.Callable \| str \| None`	`None`
`custom_megatron_before_train_step_hook`	`collections.abc.Callable \| str \| None`	`None`
`sglang_enable_dp_attention`	`bool`	`True`
`sglang_dp_size`	`int \| None`	`4`
`sglang_ep_size`	`int \| None`	`4`
`sglang_enable_dp_lm_head`	`bool`	`True`
`sglang_disable_custom_all_reduce`	`bool`	`False`
`sglang_cuda_graph_bs`	`list[int] \| None`	`[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]`
`sglang_max_running_requests`	`int \| None`	`512`
`extra_config`	`dict \| None`	`None`
`sglang_config`	`dict \| None`	`None`
`sglang_request_params`	`dict \| None`	`None`
`apply_chat_template_kwargs`	`dict \| str`	`""`
`train_env_vars`	`dict \| str \| None`	`None`
`multimodal_keys`	`dict \| str \| None`	`None`
`hf_checkpoint`	`str`	`"Qwen/Qwen3.6-35B-A3B"`
`pipeline_model_parallel_size`	`int`	`2`
`context_parallel_size`	`int`	`1`
`expert_model_parallel_size`	`int`	`4`
`expert_tensor_parallel_size`	`int`	`1`
`sglang_speculative_algorithm`	`str`	`"EAGLE"`
`sglang_speculative_num_steps`	`int`	`3`
`sglang_speculative_eagle_topk`	`int`	`1`
`sglang_speculative_num_draft_tokens`	`int`	`4`
`sglang_mamba_scheduler_strategy`	`str`	`"extra_buffer"`
`moe_token_dispatcher_type`	`str`	`"flex"`
`moe_enable_deepep`	`bool`	`True`
`optimizer_cpu_offload`	`bool`	`True`
`overlap_cpu_optimizer_d2h_h2d`	`bool`	`True`
`use_precision_aware_optimizer`	`bool`	`True`
`attention_backend`	`str`	`"flash"`
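
The rollout and batch defaults above are related by simple arithmetic. The sketch below copies the default values from the table and assumes that each rollout produces `rollout_batch_size` × `n_samples_per_prompt` samples, that these are consumed in optimizer steps of `global_batch_size` samples, and that `save_interval` is counted in rollouts. The variable names other than the field names are illustrative only.

```python
# Defaults copied from the Fields table.
rollout_batch_size = 16        # prompts per rollout
n_samples_per_prompt = 8       # responses sampled per prompt
global_batch_size = 128        # samples per optimizer step
num_rollout = 3000             # total rollouts
save_interval = 20             # assumed to be counted in rollouts
rollout_max_response_len = 16384

# Samples generated by one rollout.
samples_per_rollout = rollout_batch_size * n_samples_per_prompt

# Optimizer steps needed to consume one rollout (assumes exact divisibility).
assert samples_per_rollout % global_batch_size == 0
steps_per_rollout = samples_per_rollout // global_batch_size

# Totals over the whole run.
total_optimizer_steps = num_rollout * steps_per_rollout
total_checkpoints = num_rollout // save_interval

# Upper bound on response tokens generated in a single rollout.
max_response_tokens_per_rollout = samples_per_rollout * rollout_max_response_len

print(samples_per_rollout, steps_per_rollout,
      total_optimizer_steps, total_checkpoints,
      max_response_tokens_per_rollout)
```

With the defaults, one rollout fills exactly one global batch, so changing `rollout_batch_size` or `n_samples_per_prompt` without adjusting `global_batch_size` changes the number of optimizer steps per rollout.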

Methods

`cli_args(self, dataset: 'DatasetConfig | None' = None, model: 'ModelConfig | None' = None) -> list[str]`

`get_base_recipe(model_config: modal_training_gym.common.models.base.ModelConfig) -> 'SlimeRecipe | None'`
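
A minimal usage sketch for `cli_args` follows. It assumes the recipe can be constructed with keyword arguments that override the field defaults, which this page does not confirm. The `build_cli_args` wrapper is a hypothetical helper, and the import is guarded so the snippet still runs where `modal_training_gym` is not installed.

```python
try:
    from modal_training_gym.train_recipes.slime_recipe.qwen3_6_35b import (
        Qwen3_6_35b_Recipe,
    )
except ImportError:
    # Library not available in this environment.
    Qwen3_6_35b_Recipe = None


def build_cli_args(**overrides):
    """Hypothetical helper: build the recipe and return its CLI arguments.

    Assumes keyword-argument construction with field overrides.
    Returns None when modal_training_gym is not importable.
    """
    if Qwen3_6_35b_Recipe is None:
        return None
    recipe = Qwen3_6_35b_Recipe(**overrides)
    # dataset and model both default to None in the documented signature.
    return recipe.cli_args()


# Example: a shorter run that checkpoints more often.
args = build_cli_args(num_rollout=100, save_interval=10)
if args is not None:
    print(" ".join(args))
```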

Source: modal_training_gym/train_recipes/slime_recipe/qwen3_6_35b.py