SglangRecipe

SGLang serving configuration.

from modal_training_gym.deploy_recipes.sglang_recipe import SglangRecipe

Inherits from: BaseDeployRecipe

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| recipe_type | DeployRecipeType | sglang | |
| gpu | Literal['H100', 'H200', 'B200', 'B300'] | "H100" | GPU type for the serving container. |
| tp | int \| None | None | Tensor parallelism degree. When None, SGLang infers it from the GPU count. |
| dp | int \| None | None | Data parallelism degree. Enables --enable-dp-attention when set. |
| context_length | int \| None | None | Maximum context length. When None, the model default is used. |
| mem_fraction_static | float \| None | None | Fraction of GPU memory reserved for static allocation (model weights and the KV cache pool). When None, the SGLang default is used. |
| chunked_prefill_size | int \| None | None | Chunked prefill token budget. |
| max_running_requests | int \| None | None | Maximum number of concurrent requests per worker. |
| sglang_image | str | "lmsysorg/sglang:v0.5.12" | Docker image for the SGLang container, given as repository:tag. The default pins the v0.5.12 tag. |
| extra_server_args | dict[str, str] \| None | None | Additional --flag value pairs passed to sglang.launch_server. Use an empty string value for boolean flags (e.g. {"--trust-remote-code": ""}). |
| environment_name | str \| None | None | Modal environment to deploy into. |
| deploy_strategy | str | "rolling" | Modal deployment strategy. |
| startup_timeout | int | 1200 | Seconds (default 20 minutes) the server container may spend in startup before Modal kills it. This value gates both Modal's container startup_timeout and the SGLang health-check poll. Increase it for very large models whose weight load exceeds the default (e.g. GLM-4.7 at 355B, Kimi-K2.5 at ~1T). |
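
The empty-string convention for boolean flags in extra_server_args can be illustrated with a small sketch. The helper name flags_to_argv is hypothetical and not part of modal_training_gym; it only shows one plausible way a {flag: value} dict could be flattened into command-line tokens, and the library's own handling may differ.

```python
def flags_to_argv(flags: dict[str, str]) -> list[str]:
    """Flatten a {flag: value} mapping into command-line tokens.

    Hypothetical helper for illustration only. An empty-string value
    marks a boolean flag, so only the flag name is emitted for it.
    """
    argv: list[str] = []
    for flag, value in flags.items():
        argv.append(flag)
        if value != "":
            # Valued flags contribute a second token holding the value.
            argv.append(value)
    return argv


# "--trust-remote-code" is boolean (empty value); "--log-level" takes a value.
extra_server_args = {"--trust-remote-code": "", "--log-level": "info"}
print(flags_to_argv(extra_server_args))
# ['--trust-remote-code', '--log-level', 'info']
```

Keeping every value a string, with "" standing in for "no value", lets the whole mapping stay a plain dict[str, str].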

server_args(self, *, served_model_name: str) -> dict[str, str]

Build the dict that maps each --flag to its value for the SGLang launch command (sglang.launch_server).
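
As a rough sketch of what such a method might do, the stand-in class below mirrors a subset of the documented fields and builds a flag dict from them. RecipeSketch is hypothetical and is not the real SglangRecipe. The mapping from fields to SGLang flags (--tp-size, --dp-size, and so on) and the merge order for extra_server_args are assumptions, and the actual implementation in sglang_recipe.py may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecipeSketch:
    """Hypothetical stand-in for SglangRecipe with a subset of its fields."""

    tp: Optional[int] = None
    dp: Optional[int] = None
    context_length: Optional[int] = None
    mem_fraction_static: Optional[float] = None
    extra_server_args: Optional[dict[str, str]] = None

    def server_args(self, *, served_model_name: str) -> dict[str, str]:
        args: dict[str, str] = {"--served-model-name": served_model_name}
        # Assumed field-to-flag mapping; None fields are skipped so that
        # SGLang falls back to its own defaults.
        optional = {
            "--tp-size": self.tp,
            "--dp-size": self.dp,
            "--context-length": self.context_length,
            "--mem-fraction-static": self.mem_fraction_static,
        }
        for flag, value in optional.items():
            if value is not None:
                args[flag] = str(value)
        # Per the field table, setting dp also enables DP attention.
        if self.dp is not None:
            args["--enable-dp-attention"] = ""
        # Assumption: user-supplied extras are merged last and win on conflict.
        if self.extra_server_args:
            args.update(self.extra_server_args)
        return args


recipe = RecipeSketch(tp=8, dp=8, extra_server_args={"--trust-remote-code": ""})
args = recipe.server_args(served_model_name="my-model")
print(args)
```

Boolean flags appear with an empty-string value, matching the convention documented for extra_server_args.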

Source: modal_training_gym/deploy_recipes/sglang_recipe.py