ModelArchitecture

Transformer architecture parameters for a specific model.

from modal_training_gym.common.models.base import ModelArchitecture

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| num_layers | int | 0 | Number of transformer layers. Default 0. |
| hidden_size | int | 0 | Hidden dimension size. Default 0. |
| ffn_hidden_size | int | 0 | Feed-forward network intermediate size. Default 0. |
| vocab_size | int | 0 | Vocabulary size. Default 0. |

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| num_attention_heads | int | 0 | Number of attention heads. Default 0. |
| group_query_attention | bool | True | Enable grouped-query attention (GQA). Default True. |
| num_query_groups | int | 0 | Number of KV head groups for GQA. Default 0. |
| kv_channels | int | 0 | Per-head key/value channel dimension. Default 0. |

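To make the relationship between the attention fields concrete, the following is a minimal sketch of the projection widths they imply under grouped-query attention. `AttentionDims` and `qkv_projection_widths` are hypothetical names used only for illustration (they are not part of `modal_training_gym`), and the numeric values are made up rather than taken from a specific model.

```python
from dataclasses import dataclass


@dataclass
class AttentionDims:
    """Hypothetical stand-in holding only the attention-related fields."""
    num_attention_heads: int
    group_query_attention: bool
    num_query_groups: int
    kv_channels: int


def qkv_projection_widths(a: AttentionDims) -> tuple:
    """Return (query_width, key_or_value_width) of the projection outputs."""
    # Without GQA, every query head has its own key/value head.
    kv_heads = a.num_query_groups if a.group_query_attention else a.num_attention_heads
    if a.num_attention_heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be divisible by the number of KV heads")
    return a.num_attention_heads * a.kv_channels, kv_heads * a.kv_channels


# Illustrative values: 32 query heads sharing 8 KV groups, 128 channels per head.
dims = AttentionDims(num_attention_heads=32, group_query_attention=True,
                     num_query_groups=8, kv_channels=128)
q_width, kv_width = qkv_projection_widths(dims)
print(q_width, kv_width)  # 4096 1024
```

With these values the key and value projections are a quarter of the query projection's width, which is where GQA saves KV-cache memory.
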
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| normalization | str | "RMSNorm" | Layer normalization type. Default "RMSNorm". |
| norm_epsilon | float | 1e-06 | Normalization epsilon. Default 1e-6. |
| swiglu | bool | True | Use SwiGLU activation in FFN. Default True. |
| disable_bias_linear | bool | True | Disable bias in linear layers. Default True. |
| qk_layernorm | bool | True | Apply layer norm to query and key projections. Default True. |
| untie_embeddings_and_output_weights | bool | False | Use separate output projection weights instead of tying to token embeddings. Default False. |

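As a rough illustration of what `normalization="RMSNorm"` and `norm_epsilon` control, here is a pure-Python sketch of RMSNorm over a single vector. The helper name `rms_norm` is hypothetical; a real training stack would use a fused tensor kernel rather than Python lists.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale element-wise by gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    # eps keeps the division finite when the input is all zeros.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


normalized = rms_norm([3.0, 4.0], gamma=[1.0, 1.0])
# After normalization the mean of squares is approximately 1.
print(sum(v * v for v in normalized) / len(normalized))
```

Unlike standard LayerNorm, RMSNorm does not subtract the mean and has no bias term, so the only learned parameter is the scale `gamma`.
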
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| num_experts | int | 0 | Total number of MoE experts. Default 0 (dense model). |
| moe_ffn_hidden_size | int | 0 | Per-expert FFN intermediate size. Default 0. |
| moe_shared_expert_intermediate_size | int | 0 | Shared expert FFN intermediate size. Default 0. |

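The size fields above determine most of a layer's feed-forward weight count. The sketch below estimates that count under explicit assumptions that are not stated in this reference: every FFN uses SwiGLU (three weight matrices: gate, up, down), linear layers have no bias, and the router is a single `hidden_size x num_experts` matrix. The function name `ffn_weights_per_layer` is hypothetical.

```python
def ffn_weights_per_layer(hidden_size, ffn_hidden_size, num_experts=0,
                          moe_ffn_hidden_size=0,
                          moe_shared_expert_intermediate_size=0):
    """Estimate FFN weight count for one layer (assumes SwiGLU, no biases)."""

    def swiglu_weights(inner):
        # Gate, up, and down projections: three hidden_size x inner matrices.
        return 3 * hidden_size * inner

    if num_experts == 0:
        # num_experts == 0 denotes a dense model, so only the dense FFN exists.
        return swiglu_weights(ffn_hidden_size)

    routed = num_experts * swiglu_weights(moe_ffn_hidden_size)
    shared = swiglu_weights(moe_shared_expert_intermediate_size)
    router = hidden_size * num_experts  # assumed single linear router
    return routed + shared + router


dense = ffn_weights_per_layer(hidden_size=1024, ffn_hidden_size=4096)
moe = ffn_weights_per_layer(hidden_size=1024, ffn_hidden_size=4096, num_experts=8,
                            moe_ffn_hidden_size=512,
                            moe_shared_expert_intermediate_size=512)
print(dense, moe)  # 12582912 14163968
```

The estimate ignores any extra weights a shared-expert gate might add, so treat it as a back-of-the-envelope figure.
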
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| moe_router_score_function | str | "" | Router scoring function (e.g. "softmax"). Default "". |
| moe_token_drop_policy | str | "" | Token drop policy for MoE routing. Default "". |
| moe_router_dtype | str | "" | Data type for router computation (e.g. "fp32"). Default "". |
| moe_permute_fusion | bool | False | Enable permute fusion optimization for MoE. Default False. |
| moe_aux_loss_coeff | float \| None | None | Auxiliary load-balancing loss coefficient. Default None. |

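To show how a softmax score function and an auxiliary loss coefficient typically interact, here is a small pure-Python sketch of top-k routing with a Switch-Transformer-style load-balancing loss. This is an illustration of the general technique, not the implementation used by this package; the function names, the `top_k` parameter, and the choice to treat a `None` coefficient as "no auxiliary loss" are all assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(token_logits, top_k):
    """Return (chosen expert indices per token, softmax probabilities per token)."""
    probs = [softmax(row) for row in token_logits]
    choices = [sorted(range(len(p)), key=lambda e: p[e], reverse=True)[:top_k]
               for p in probs]
    return choices, probs


def aux_loss(choices, probs, coeff):
    """coeff * num_experts * sum_e(fraction routed to e * mean probability of e)."""
    if coeff is None:  # assumption: None disables the auxiliary loss
        return 0.0
    num_tokens, num_experts = len(probs), len(probs[0])
    top_k = len(choices[0])
    total = 0.0
    for e in range(num_experts):
        frac_routed = sum(e in c for c in choices) / (num_tokens * top_k)
        mean_prob = sum(p[e] for p in probs) / num_tokens
        total += frac_routed * mean_prob
    return coeff * num_experts * total


balanced = aux_loss(*route([[2.0, 0.0], [0.0, 2.0]], top_k=1), coeff=0.01)
skewed = aux_loss(*route([[2.0, 0.0], [2.0, 0.0]], top_k=1), coeff=0.01)
print(balanced < skewed)  # True: sending every token to one expert is penalized
```
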
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| megatron_model_type | str | "" | Slime/Megatron model type string for checkpoint conversion (e.g. "qwen3.5-35B-A3B"). Used when the training recipe selects a non-bridge conversion mode. Default "". |

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| apply_layernorm_1p | bool | False | Use zero-centered LayerNorm (add 1 to gamma). Default False. |

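The "add 1 to gamma" wording can be read as follows: the stored scale is kept centered around zero and the value actually applied is `1 + gamma`. The sketch below shows only that offset; `effective_scale` is a hypothetical helper name, and the way the real normalization layer consumes the result is not covered here.

```python
def effective_scale(stored_gamma, zero_centered):
    """Return the scale applied after normalization.

    With zero_centered=True the stored parameter is an offset from 1,
    so a stored value of 0.0 corresponds to an identity scale.
    """
    if zero_centered:
        return [1.0 + g for g in stored_gamma]
    return list(stored_gamma)


print(effective_scale([0.0, 0.5], zero_centered=True))   # [1.0, 1.5]
print(effective_scale([1.0, 1.5], zero_centered=False))  # [1.0, 1.5]
```
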
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| use_gated_attention | bool | False | Enable gated attention mechanism. Default False. |
| attention_output_gate | bool | False | Enable output gating on attention layers (required by some hybrid architectures such as Qwen 3.6). Default False. |

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| use_rotary_position_embeddings | bool | True | Use RoPE positional encoding. Default True. |
| rotary_base | int | 10000 | Base frequency for RoPE. Default 10000. |
| rotary_percent | float | 1.0 | Fraction of hidden dims to apply RoPE to. Default 1.0. |

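For intuition about `rotary_base` and `rotary_percent`, here is a minimal sketch of the standard RoPE inverse-frequency schedule. It assumes the fraction is applied to the per-head dimension (for example `kv_channels`), which this reference does not state explicitly; `rope_inverse_frequencies` and `head_dim` are hypothetical names.

```python
def rope_inverse_frequencies(head_dim, rotary_base=10000, rotary_percent=1.0):
    """Return one inverse frequency per rotated pair of dimensions."""
    # Only the first rotary_dim dimensions are rotated; the rest pass through.
    rotary_dim = int(head_dim * rotary_percent)
    return [1.0 / (rotary_base ** (i / rotary_dim)) for i in range(0, rotary_dim, 2)]


full = rope_inverse_frequencies(head_dim=8)
half = rope_inverse_frequencies(head_dim=8, rotary_percent=0.5)
print(len(full), len(half))  # 4 2
```

A larger `rotary_base` stretches the lowest frequencies, which is a common lever for extending context length.
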
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| moe_grouped_gemm | bool | False | |
| moe_shared_expert_gate | bool | False | |
| moe_router_topk | int | 0 | |
| megatron_spec | list[str] \| None | None | |

Generate Megatron-LM CLI flags from this architecture spec.
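The sketch below illustrates one plausible way to turn an architecture spec into Megatron-LM style flags. It is not the package's implementation: `ArchSketch` is a hypothetical stand-in with a few of the documented fields, `to_cli_flags` is an invented name, and the rules used here (underscores become hyphens, `True` booleans become bare flags, and `0`, `""`, and `None` are treated as unset and skipped) are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ArchSketch:
    """Hypothetical stand-in carrying a handful of the documented fields."""
    num_layers: int = 0
    hidden_size: int = 0
    num_attention_heads: int = 0
    group_query_attention: bool = True
    num_query_groups: int = 0
    normalization: str = "RMSNorm"
    norm_epsilon: float = 1e-6
    swiglu: bool = True
    moe_aux_loss_coeff: Optional[float] = None


def to_cli_flags(arch):
    """Map dataclass fields to a flat list of CLI tokens (assumed conventions)."""
    flags = []
    for f in fields(arch):
        value = getattr(arch, f.name)
        name = "--" + f.name.replace("_", "-")
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            if value:
                flags.append(name)
        elif value is None or value == "" or value == 0:
            continue  # assumption: default-like values mean "not set"
        else:
            flags.extend([name, str(value)])
    return flags


arch = ArchSketch(num_layers=2, hidden_size=64, num_attention_heads=4, num_query_groups=2)
print(" ".join(to_cli_flags(arch)))
```

Not every field necessarily maps one-to-one onto a Megatron-LM argument, so consult the source file for the authoritative mapping.
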

Source: modal_training_gym/common/models/base.py