HuggingFaceDataset

Dataset backed by a HuggingFace datasets repo.

from modal_training_gym.common.dataset import HuggingFaceDataset

Inherits from: DatasetConfig

Field                 Type          Default      Description
dataset_id            str           ""
input_key             str           ""
label_key             str           "label"
apply_chat_template   bool          True
always_prepare        bool          False
hf_repo               str           ""
hf_split              str           "train"
hf_config             str | None    None
output_format         str           "parquet"
input_column          str           ""
output_column         str           ""
system_prompt         str           ""
prompt_template       str           "{input}"
n_rows                int           0
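
The field table can be read as a plain configuration record. The sketch below mirrors a subset of the documented fields and defaults in a stand-in dataclass. `DatasetFieldsSketch` and `render_prompt` are hypothetical names introduced for illustration, not part of the library. The example also assumes that `prompt_template` is filled in with `str.format` using the `{input}` placeholder shown in its default; the real class may build prompts differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetFieldsSketch:
    """Hypothetical stand-in mirroring some documented fields and defaults."""

    hf_repo: str = ""
    hf_split: str = "train"
    hf_config: Optional[str] = None
    output_format: str = "parquet"
    input_column: str = ""
    output_column: str = ""
    system_prompt: str = ""
    prompt_template: str = "{input}"
    n_rows: int = 0


def render_prompt(cfg: DatasetFieldsSketch, row: dict) -> str:
    """Assumed behavior: substitute the row's input column into the template."""
    return cfg.prompt_template.format(input=row[cfg.input_column])


# The repo name and column name below are placeholders, not real identifiers.
cfg = DatasetFieldsSketch(
    hf_repo="example-org/example-dataset",
    input_column="question",
    prompt_template="Question: {input}\nAnswer:",
)
prompt = render_prompt(cfg, {"question": "What is 2 + 2?"})
```

With the default template `"{input}"`, this sketch would pass the input column through unchanged.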

load(self, split: Literal['all', 'train', 'eval'] = 'all') -> Any

Load raw examples, optionally filtered by split.

prepare(self, path: str, eval_paths: dict[str, str] | None = None) -> None

Materialize training data to path (and eval splits to eval_paths).

to_pandas(self, *, formatted: bool = False)

validate_prepared(self, path: str) -> None

Inspect what prepare() wrote and confirm that it contains the columns the framework will index.
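
A minimal sketch of how `prepare()` and `validate_prepared()` might be chained, using only the signatures documented above. `prepare_then_validate` and `RecordingDataset` are hypothetical helpers introduced here; the stub stands in for a real `HuggingFaceDataset` so the example runs without the library or network access. The file paths and the `"eval"` key in `eval_paths` are placeholder values.

```python
from typing import Dict, List, Optional, Tuple


def prepare_then_validate(
    dataset,
    path: str,
    eval_paths: Optional[Dict[str, str]] = None,
) -> str:
    """Materialize the data, then check what was written (hypothetical helper)."""
    dataset.prepare(path, eval_paths=eval_paths)
    dataset.validate_prepared(path)
    return path


class RecordingDataset:
    """Stub with the documented method signatures; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple] = []

    def prepare(self, path: str, eval_paths: Optional[Dict[str, str]] = None) -> None:
        self.calls.append(("prepare", path, eval_paths))

    def validate_prepared(self, path: str) -> None:
        self.calls.append(("validate_prepared", path))


stub = RecordingDataset()
result = prepare_then_validate(
    stub,
    "data/train.parquet",
    eval_paths={"eval": "data/eval.parquet"},
)
```

Any object that exposes the same two methods could be passed in place of the stub, which keeps the helper independent of how the dataset is constructed.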

Source: modal_training_gym/common/dataset.py