Windowed-FIFO rollout scheduling for over-sampled RL

When you over-sample an RL rollout — generate more prompt groups than you keep, as DAPO does — slime collects the first rollout_batch_size groups to finish and discards the rest.

This greedy “first-finished-first-out” (FFFO) collection has a subtle failure mode.

In agent RL, completion time tracks task difficulty: hard prompts produce longer generations and so take longer to finish.

The fastest-finishing groups are therefore the easy ones, so the kept batch skews easy and harder prompts are systematically dropped. The training distribution drifts and gradients oscillate (MiniMax Forge §3.1).

There are several ways to fix this. One is windowed FIFO: keep a sliding window of width W = ratio × N over the generation queue, where N is the generation batch size.

A completed group may be collected only while it sits inside the window. Groups past the window stay blocked even when they have finished, until the window advances as the head is consumed.

This keeps collection close to queue order, which is independent of difficulty, so the kept batch closely tracks the true population.
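The window rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the library's WindowedFIFOCollector: the class name WindowedCollectorSketch and its internals are hypothetical, the method names only mirror the ones used in the simulation further down, and the sketch assumes the window spans queue positions [head, head + W), where head is the oldest position not yet collected.

```python
class WindowedCollectorSketch:
    """Hypothetical stand-in for illustration; not the library's collector."""

    def __init__(self, total, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.total = total
        self.window_size = window_size
        self.head = 0          # oldest queue position not yet collected
        self.completed = {}    # position -> finished group awaiting collection
        self.collected = set() # positions already handed out

    def mark_completed(self, position, group):
        self.completed[position] = group

    def drain(self):
        """Return every finished group that currently sits inside the window."""
        out = []
        progressed = True
        while progressed:
            progressed = False
            end = min(self.head + self.window_size, self.total)
            for pos in range(self.head, end):
                if pos in self.completed:
                    out.append(self.completed.pop(pos))
                    self.collected.add(pos)
                    progressed = True
            # The window only slides once the head position has been consumed.
            while self.head < self.total and self.head in self.collected:
                self.head += 1
        return out


collector = WindowedCollectorSketch(total=6, window_size=2)
collector.mark_completed(3, "g3")
print(collector.drain())  # [] : position 3 is past the window [0, 2)
collector.mark_completed(1, "g1")
print(collector.drain())  # ['g1'] : inside the window, head still at 0
collector.mark_completed(0, "g0")
print(collector.drain())  # ['g0', 'g3'] : head consumed, window slides to [2, 4)
```

With a window of 2, the group at position 3 finishes first but stays blocked until position 0 is consumed and the window slides forward.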

You only need this if you are using over-sampling (or dynamic filtering). If every generated group is kept, there is nothing to reorder, so set over_sampling_batch_size > rollout_batch_size for the knob to take effect.

See the bias (no GPU needed)

Before launching anything, let’s watch the problem and the fix on the actual scheduler. WindowedFIFOCollector is the pure-Python core of the rollout — no slime, no GPU. We simulate 256 prompt groups (2x over-sampling for a batch of 128) where harder prompts finish later, then collect a batch under FFFO vs. windowed FIFO and compare how many hard prompts survive.

import numpy as np

from modal_training_gym.frameworks.slime.windowed_fifo_rollout import (
    WindowedFIFOCollector,
)

N, target = 256, 128

def kept_batch(window_size):
    rng = np.random.default_rng(0)
    # Half easy, half hard prompts; latency rises with difficulty.
    d = np.where(rng.random(N) < 0.5, rng.beta(2, 5, N), rng.beta(5, 2, N))
    finish_order = np.argsort(d + 0.03 * rng.normal(0, 1, N))

    collector = WindowedFIFOCollector(total=N, window_size=window_size)
    kept = []
    for pos in finish_order:  # groups finish fastest-first
        collector.mark_completed(int(pos), int(pos))
        for g in collector.drain():
            if len(kept) < target:
                kept.append(g)
        if len(kept) >= target:
            break
    hard_frac = (d[np.array(kept)] > 0.6).mean()
    return hard_frac, (d > 0.6).mean()

fffo_hard, pop_hard = kept_batch(window_size=N)  # ratio = 1.0
win_hard, _ = kept_batch(window_size=int(0.3 * N))  # ratio = 0.3

print(f"population hard-prompt fraction: {pop_hard:.0%}")
print(f"  greedy FFFO  (ratio=1.0): {fffo_hard:.0%}  <- hard prompts dropped")
print(f"  windowed     (ratio=0.3): {win_hard:.0%}  <- matches population")

Turn it on in training

Setting windowed_fifo_ratio does two things for you: it selects the windowed-FIFO rollout function (--rollout-function-path) and ships the ratio to slime. Pair it with over_sampling_batch_size — that’s what gives the scheduler completed groups to choose between. Here we over-sample 2x (32 → 16) on a short Qwen3-4B math run.

Async training (async_mode=True) is the natural home for this: it overlaps generation with training, so the scheduler is continuously deciding which completed groups to admit.

We use windowed_fifo_ratio=0.3, which is Forge’s recommended balance. windowed_fifo_ratio is the ratio of the window size to the generation batch size:

windowed_fifo_ratio=0.0 → window of 1 → strict FIFO (a slow head blocks everything)
windowed_fifo_ratio=1.0 → window of N → greedy FFFO (no blocking at all)
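As a rough illustration of how a ratio maps to a window width, the helper below applies W = ratio × N with a floor of one slot. The function name window_size_for is hypothetical, and the truncation toward zero is an assumption chosen to match the int(0.3 * N) used in the simulation above; the actual rounding inside the rollout function may differ.

```python
def window_size_for(ratio, n):
    """Hypothetical helper: window width for a ratio and generation batch size n."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Assumed rounding: truncate, but never go below a window of 1.
    return max(1, int(ratio * n))


print(window_size_for(0.0, 256))  # 1   (strict FIFO)
print(window_size_for(0.3, 256))  # 76  (the simulation's windowed case)
print(window_size_for(1.0, 256))  # 256 (greedy FFFO)
print(window_size_for(0.3, 32))   # 9   (if N is the over-sampled count of 32)
```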

from typing import Any

from modal_training_gym import (
    HuggingFaceDataset,
    Qwen3_4B,
    SlimeRecipe,
    TrainConfig,
)

class MathDataset(HuggingFaceDataset):
    hf_repo = "zhuzilin/dapo-math-17k"
    input_key = "prompt"
    label_key = "label"
    output_format = "jsonl"
    apply_chat_template = True
    always_prepare = True

    def load(self, split: str = "all") -> Any:
        from datasets import load_dataset

        ds = load_dataset(self.hf_repo, self.hf_config, split=self.hf_split)
        stop = len(ds) if not self.n_rows else min(self.n_rows, len(ds))
        return ds.select(range(stop))

train_dataset = MathDataset(n_rows=2_000)

base_model = Qwen3_4B()
training_run = TrainConfig(
    model=base_model,
    dataset=train_dataset,
    recipe=SlimeRecipe(
        rm_type="dapo",
        gpu_type="H100",
        colocate=True,
        actor_num_nodes=1,
        actor_num_gpus_per_node=8,
        tensor_model_parallel_size=2,
        sequence_parallel=True,
        rollout_num_gpus_per_engine=1,
        async_mode=True,
        num_rollout=15,
        rollout_batch_size=16,
        # Over-sample 2x so the windowed-FIFO scheduler has groups to choose
        # between — without this, the ratio below is a no-op.
        over_sampling_batch_size=32,
        windowed_fifo_ratio=0.3,
        n_samples_per_prompt=8,
        rollout_max_response_len=8192,
        rollout_temperature=1.0,
        global_batch_size=32,
        lr=1e-6,
        advantage_estimator="grpo",
        use_kl_loss=False,
        kl_coef=0.0,
        eps_clip=0.2,
        eps_clip_high=0.28,
        use_dynamic_batch_size=True,
        max_tokens_per_gpu=9216,
        sglang_mem_fraction_static=0.75,
        save_interval=10,
        apply_chat_template_kwargs='{"enable_thinking": true}',
    ),
)
train_result = training_run.train()
print(f"Training run id: {train_result.training_run_id}")

Tuning

| `windowed_fifo_ratio` | Behavior | When to use |
| --- | --- | --- |
| `0.0` | strict FIFO: max consistency, a slow head blocks the batch | low-variance tasks |
| `0.3` | Forge’s balance | general agent RL |
| `0.5` | more straggler-tolerant | high-variance tasks (coding, web agents) |
| `1.0` | greedy FFFO (slime’s default) | uniform difficulty |
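The trade-off in the table can be explored with a small standard-library simulation. This is a sketch under assumptions, not slime's scheduler: the collect function is hypothetical, the window is taken to cover queue positions [head, head + W), and the latency model (difficulty plus a little Gaussian noise) is made up for illustration. For each ratio it reports the mean difficulty of the kept batch and how many completions had to be observed before the batch filled, which serves as a rough proxy for waiting on stragglers.

```python
import random


def collect(finish_order, window, target):
    """Windowed collection; returns (kept positions, completions seen when full)."""
    total = len(finish_order)
    done, taken, kept = set(), set(), []
    head = 0  # oldest queue position not yet collected
    for steps, pos in enumerate(finish_order, start=1):
        done.add(pos)
        progressed = True
        while progressed and len(kept) < target:
            progressed = False
            for p in range(head, min(head + window, total)):
                if p in done and p not in taken and len(kept) < target:
                    taken.add(p)
                    kept.append(p)
                    progressed = True
            while head in taken:  # slide the window past consumed positions
                head += 1
        if len(kept) >= target:
            return kept, steps
    return kept, total


rng = random.Random(0)
N, target = 256, 128
difficulty = [rng.random() for _ in range(N)]
# Assumed latency model: harder prompts finish later, with small noise.
finish_order = sorted(range(N), key=lambda i: difficulty[i] + rng.gauss(0, 0.03))

print(f"population mean difficulty: {sum(difficulty) / N:.2f}")
for ratio in (0.0, 0.3, 0.5, 1.0):
    window = max(1, int(ratio * N))
    kept, steps = collect(finish_order, window, target)
    mean_kept = sum(difficulty[p] for p in kept) / len(kept)
    print(f"ratio={ratio:.1f} window={window:3d} "
          f"kept mean difficulty={mean_kept:.2f} completions seen={steps}")
```

At ratio 1.0 the kept batch is exactly the first 128 groups to finish, and at ratio 0.0 it is exactly queue positions 0 through 127; the intermediate ratios sit between those two extremes.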

Windowed FIFO composes cleanly with the rest of the stack: over_sampling_batch_size feeds it candidates, dynamic_sampling_filter_path still drops zero-variance groups (now in windowed order), and balance_data independently controls how the kept batch is spread across DP ranks. It changes which groups train, not how they’re optimized.

Source: tutorials/rl/007_windowed_fifo/007_windowed_fifo.py
