from modal_training_gym.common.models.base import ModelArchitecture
Transformer architecture parameters for a specific model.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_layers` | `int` | `0` | Number of transformer layers. Default 0. |
| `hidden_size` | `int` | `0` | Hidden dimension size. Default 0. |
| `ffn_hidden_size` | `int` | `0` | Feed-forward network intermediate size. Default 0. |
| `vocab_size` | `int` | `0` | Vocabulary size. Default 0. |


| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_attention_heads` | `int` | `0` | Number of attention heads. Default 0. |
| `group_query_attention` | `bool` | `True` | Enable grouped-query attention (GQA). Default True. |
| `num_query_groups` | `int` | `0` | Number of KV head groups for GQA. Default 0. |
| `kv_channels` | `int` | `0` | Per-head key/value channel dimension. Default 0. |

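
To make the relationship between the attention fields concrete, here is a minimal sketch of the projection widths they imply. `AttentionShape` and `projection_widths` are hypothetical helpers written for this illustration, not part of `modal_training_gym`, and the sketch assumes the usual GQA convention that each KV group is shared by an equal number of query heads.

```python
from dataclasses import dataclass


@dataclass
class AttentionShape:
    """Hypothetical stand-in holding only the attention fields from the table."""
    num_attention_heads: int
    num_query_groups: int
    kv_channels: int
    group_query_attention: bool = True


def projection_widths(cfg: AttentionShape) -> dict:
    """Return the Q/K/V projection widths implied by the attention fields."""
    # Without GQA, every query head has its own key/value head.
    kv_heads = cfg.num_query_groups if cfg.group_query_attention else cfg.num_attention_heads
    if kv_heads <= 0 or cfg.num_attention_heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be a positive multiple of the KV head count")
    return {
        "q_width": cfg.num_attention_heads * cfg.kv_channels,
        "k_width": kv_heads * cfg.kv_channels,
        "v_width": kv_heads * cfg.kv_channels,
        "queries_per_kv_head": cfg.num_attention_heads // kv_heads,
    }


# Illustrative values only; they do not describe any particular model.
shape = projection_widths(
    AttentionShape(num_attention_heads=32, num_query_groups=8, kv_channels=128)
)
```

With these illustrative numbers the key and value projections are a quarter of the query width, which is where GQA saves KV-cache memory.
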

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `normalization` | `str` | `"RMSNorm"` | Layer normalization type. Default "RMSNorm". |
| `norm_epsilon` | `float` | `1e-06` | Normalization epsilon. Default 1e-6. |
| `swiglu` | `bool` | `True` | Use SwiGLU activation in FFN. Default True. |
| `disable_bias_linear` | `bool` | `True` | Disable bias in linear layers. Default True. |
| `qk_layernorm` | `bool` | `True` | Apply layer norm to query and key projections. Default True. |
| `untie_embeddings_and_output_weights` | `bool` | `False` | Use separate output projection weights instead of tying to token embeddings. Default False. |


| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_experts` | `int` | `0` | Total number of MoE experts. Default 0 (dense model). |
| `moe_ffn_hidden_size` | `int` | `0` | Per-expert FFN intermediate size. Default 0. |
| `moe_shared_expert_intermediate_size` | `int` | `0` | Shared expert FFN intermediate size. Default 0. |

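
As a rough illustration of how the dense and MoE size fields interact, the sketch below estimates FFN weight counts per layer. `ffn_weights_per_layer` is a hypothetical helper, not a library function. It assumes three weight matrices per SwiGLU FFN (gate, up, down), two otherwise, no biases, and a single linear router of shape `hidden_size x num_experts`; real implementations may differ.

```python
def ffn_weights_per_layer(
    hidden_size: int,
    ffn_hidden_size: int = 0,
    num_experts: int = 0,
    moe_ffn_hidden_size: int = 0,
    moe_shared_expert_intermediate_size: int = 0,
    swiglu: bool = True,
) -> int:
    """Back-of-envelope FFN weight count for one layer (biases ignored)."""
    # SwiGLU uses gate, up, and down projections; a plain MLP uses up and down.
    mats = 3 if swiglu else 2
    if num_experts == 0:
        # num_experts == 0 is documented as the dense case.
        return mats * hidden_size * ffn_hidden_size
    routed = num_experts * mats * hidden_size * moe_ffn_hidden_size
    shared = mats * hidden_size * moe_shared_expert_intermediate_size
    router = hidden_size * num_experts  # assumed linear router
    return routed + shared + router


# Tiny illustrative sizes so the arithmetic is easy to follow.
dense = ffn_weights_per_layer(hidden_size=8, ffn_hidden_size=16)
moe = ffn_weights_per_layer(
    hidden_size=8,
    num_experts=4,
    moe_ffn_hidden_size=16,
    moe_shared_expert_intermediate_size=16,
)
```
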

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `moe_router_score_function` | `str` | `""` | Router scoring function (e.g. "softmax"). Default "". |
| `moe_token_drop_policy` | `str` | `""` | Token drop policy for MoE routing. Default "". |
| `moe_router_dtype` | `str` | `""` | Data type for router computation (e.g. "fp32"). Default "". |
| `moe_permute_fusion` | `bool` | `False` | Enable permute fusion optimization for MoE. Default False. |
| `moe_aux_loss_coeff` | `float \| None` | `None` | Auxiliary load-balancing loss coefficient. Default None. |


| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `megatron_model_type` | `str` | `""` | Slime/Megatron model type string for checkpoint conversion (e.g. "qwen3.5-35B-A3B"). Used when the training recipe selects a non-bridge conversion mode. Default "". |


| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `apply_layernorm_1p` | `bool` | `False` | Use zero-centered LayerNorm (add 1 to gamma). Default False. |

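
The "add 1 to gamma" behaviour can be sketched in plain Python. The `layernorm` function below is a hypothetical illustration, not the library's or Megatron's kernel: it assumes the stored scale is centered on zero and that 1 is added back when the flag is set, so a stored gamma of all zeros behaves like a standard gamma of all ones.

```python
import math
from typing import List


def layernorm(
    x: List[float],
    gamma: List[float],
    beta: List[float],
    eps: float = 1e-6,
    apply_layernorm_1p: bool = False,
) -> List[float]:
    """LayerNorm over one vector, optionally with a zero-centered scale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    out = []
    for v, g, b in zip(x, gamma, beta):
        # Zero-centered variant: the effective scale is (1 + stored gamma).
        scale = g + 1.0 if apply_layernorm_1p else g
        out.append((v - mean) * inv_std * scale + b)
    return out


x = [1.0, 2.0, 4.0]
standard = layernorm(x, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
zero_centered = layernorm(
    x, gamma=[0.0, 0.0, 0.0], beta=[0.0, 0.0, 0.0], apply_layernorm_1p=True
)
```

Here `standard` and `zero_centered` hold the same values; only the parameterization of the scale differs.
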

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `use_gated_attention` | `bool` | `False` | Enable gated attention mechanism. Default False. |
| `attention_output_gate` | `bool` | `False` | Enable output gating on attention layers (required by some hybrid architectures such as Qwen 3.6). Default False. |


| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `use_rotary_position_embeddings` | `bool` | `True` | Use RoPE positional encoding. Default True. |
| `rotary_base` | `int` | `10000` | Base frequency for RoPE. Default 10000. |
| `rotary_percent` | `float` | `1.0` | Fraction of hidden dims to apply RoPE to. Default 1.0. |

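
For intuition on how `rotary_base` and `rotary_percent` shape the embedding, the following sketch computes the standard RoPE inverse frequencies. `rope_inv_freq` is a hypothetical helper, and it assumes the fraction is taken of the per-head dimension (`kv_channels`), which is a common convention but is not confirmed by this page.

```python
from typing import List


def rope_inv_freq(
    kv_channels: int,
    rotary_base: int = 10000,
    rotary_percent: float = 1.0,
) -> List[float]:
    """Inverse frequencies for the rotated slice of each attention head."""
    # Assumption: only the first rotary_dim channels of each head are rotated.
    rotary_dim = int(kv_channels * rotary_percent)
    # One frequency per channel pair: base^(-2i / rotary_dim) for i = 0, 1, ...
    return [1.0 / (rotary_base ** (i / rotary_dim)) for i in range(0, rotary_dim, 2)]


full = rope_inv_freq(kv_channels=128)
partial = rope_inv_freq(kv_channels=128, rotary_percent=0.5)
```

A larger `rotary_base` stretches the wavelengths, while a `rotary_percent` below 1.0 leaves the remaining channels unrotated.
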
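
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `moe_grouped_gemm` | `bool` | `False` | |
| `moe_shared_expert_gate` | `bool` | `False` | |
| `moe_router_topk` | `int` | `0` | |
| `megatron_spec` | `list[str] \| None` | `None` | |
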
Generate Megatron-LM CLI flags from this architecture spec.
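
The page does not show how the flags are produced, so the following is only a sketch of one plausible mapping. `TinyArch` and `to_cli_flags` are hypothetical names, not the library's API. The sketch assumes that field names become flags by swapping underscores for hyphens, that `True` booleans are emitted as bare switches, and that zero, empty-string, and `None` values mean "unset" and are skipped.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TinyArch:
    """Hypothetical stand-in with a handful of the documented fields."""
    num_layers: int = 0
    hidden_size: int = 0
    normalization: str = "RMSNorm"
    swiglu: bool = True
    untie_embeddings_and_output_weights: bool = False
    moe_router_dtype: str = ""
    moe_aux_loss_coeff: Optional[float] = None


def to_cli_flags(arch) -> List[str]:
    """Turn dataclass fields into a flat list of CLI-style arguments."""
    flags: List[str] = []
    for f in fields(arch):
        value = getattr(arch, f.name)
        name = "--" + f.name.replace("_", "-")
        if isinstance(value, bool):
            # Booleans become bare switches and are emitted only when True.
            if value:
                flags.append(name)
        elif value is None or value == "" or value == 0:
            # Assumption: these defaults mean "not set", so no flag is emitted.
            continue
        else:
            flags.extend([name, str(value)])
    return flags


flags = to_cli_flags(TinyArch(num_layers=24, hidden_size=2048))
```

The resulting list can be passed to `subprocess` or joined with spaces for a launch command. The real method may rename, group, or special-case fields, so treat its output as authoritative.
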
Source: modal_training_gym/common/models/base.py