HarborDataset

Dataset backed by a Harbor task directory structure.

from modal_training_gym.common.harbor.dataset import HarborDataset

Inherits from: DatasetConfig

Field                 Type               Default            Description
dataset_id            str                ""
input_key             str                ""
label_key             str                ""
apply_chat_template   bool               True
always_prepare        bool               False
dataset_name          str                ""
path                  str | None         None
task_root             str                ""
task_glob             str                "*"
task_names            list[str] | None   None
instruction_path      str                "instruction.md"
label_metadata_path   str | None         None
test_data_dir         str | None         None
output_format         str                "parquet"
prompt_template       str                "{instruction}"
system_prompt         str                ""
train_size            int | None         None
eval_size             int | None         None
train_repeats         int                1
eval_repeats          int                1
shuffle_tasks         bool               False
shuffle_seed          int                0
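
The defaults above suggest that each task is a directory under task_root, matched by task_glob, with its instruction text in instruction_path and the final prompt produced by filling prompt_template. The sketch below illustrates that reading with the standard library only. It is not the HarborDataset implementation: render_prompts and build_demo_root are hypothetical helpers, and the directory layout is an assumption inferred from the field names.

```python
import tempfile
from pathlib import Path


def render_prompts(task_root, task_glob="*", instruction_path="instruction.md",
                   prompt_template="{instruction}", task_names=None):
    """Hypothetical helper: build one prompt per matching task directory."""
    prompts = {}
    for task_dir in sorted(Path(task_root).glob(task_glob)):
        if not task_dir.is_dir():
            continue  # only directories are treated as tasks (assumption)
        if task_names is not None and task_dir.name not in task_names:
            continue  # optional allow-list, mirroring the task_names field
        instruction_file = task_dir / instruction_path
        if not instruction_file.is_file():
            continue  # skip directories without an instruction file
        instruction = instruction_file.read_text(encoding="utf-8").strip()
        prompts[task_dir.name] = prompt_template.format(instruction=instruction)
    return prompts


def build_demo_root():
    """Create a throwaway task tree so the sketch runs anywhere."""
    root = Path(tempfile.mkdtemp())
    for name, text in [("add-numbers", "Add two numbers."),
                       ("sort-list", "Sort a list.")]:
        (root / name).mkdir()
        (root / name / "instruction.md").write_text(text + "\n", encoding="utf-8")
    return root


demo_root = build_demo_root()
demo_prompts = render_prompts(demo_root, prompt_template="Task: {instruction}")
```

With the default prompt_template of "{instruction}", the prompt would simply be the instruction text unchanged.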

load(self, split: Literal['all', 'train', 'eval'] = 'all') -> Any

Load raw examples, optionally filtered by split.
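
This page does not spell out how the train and eval splits are derived from the task list. The sketch below shows one plausible interpretation of train_size, eval_size, train_repeats, eval_repeats, shuffle_tasks, and shuffle_seed, purely as an illustration. split_tasks is a hypothetical function, and the ordering and slicing rules in it are assumptions rather than documented behavior.

```python
import random


def split_tasks(tasks, train_size=None, eval_size=None, train_repeats=1,
                eval_repeats=1, shuffle_tasks=False, shuffle_seed=0):
    """Hypothetical split: first train_size tasks train, the next eval_size eval."""
    ordered = list(tasks)
    if shuffle_tasks:
        # A seeded Random instance keeps the shuffle reproducible.
        random.Random(shuffle_seed).shuffle(ordered)
    n_train = len(ordered) if train_size is None else train_size
    train = ordered[:n_train]
    remainder = ordered[n_train:]
    evaluation = remainder if eval_size is None else remainder[:eval_size]
    # Repeats duplicate the whole split (assumption).
    return {"train": train * train_repeats, "eval": evaluation * eval_repeats}


demo_tasks = ["task-a", "task-b", "task-c", "task-d"]
demo_split = split_tasks(demo_tasks, train_size=2, eval_size=1, train_repeats=2)
```

Under these assumptions, fixing shuffle_seed makes the split reproducible across runs, which is the usual reason for exposing a seed alongside a shuffle flag.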

prepare(self, path: str, eval_paths: dict[str, str] | None = None) -> None

Materialize training data to path (and eval splits to eval_paths).

to_pandas(self, *, formatted: bool = False)

validate_prepared(self, path: str) -> None

Sniff what prepare() wrote and confirm the columns the framework will index.
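
As a rough illustration of this kind of check, the sketch below confirms that every record in a prepared file carries a set of required columns. It is not the library's validate_prepared: check_prepared_columns is a hypothetical function, the "prompt" and "label" column names are made up for the demo, and the file is assumed to be JSON Lines (not the default "parquet" output) so the example runs with the standard library alone.

```python
import json
import tempfile
from pathlib import Path


def check_prepared_columns(path, required_columns):
    """Hypothetical check: each JSON Lines record has all required keys."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [c for c in required_columns if c not in record]
            if missing:
                raise ValueError(
                    f"{path}:{line_number} is missing columns {missing}"
                )
            count += 1
    if count == 0:
        raise ValueError(f"{path} contains no records")
    return count  # number of records that passed the check


# Write a one-record demo file, then validate it.
demo_path = Path(tempfile.mkdtemp()) / "train.jsonl"
demo_path.write_text(
    json.dumps({"prompt": "Add two numbers.", "label": "add-numbers"}) + "\n",
    encoding="utf-8",
)
checked = check_prepared_columns(demo_path, ["prompt", "label"])
```

Failing fast with the offending line number makes it easier to trace a bad record back to the task that produced it.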

Source: modal_training_gym/common/harbor/dataset.py