Summary
The documentation and the shipped environment template describe an EnvAdapter API (execute() / evaluate() / build_prompt(), with DataItem / TaskResult data classes and a BENCHMARK_REGISTRY) that does not exist anywhere in the codebase. The real abstract contract in skillopt/envs/base.py is a different set of methods (build_train_env / build_eval_env / rollout / reflect / get_task_types). A contributor who follows docs/guide/new-benchmark.md or copies skillopt/envs/_template/ writes an adapter that cannot satisfy EnvAdapter's abstract methods and will fail to instantiate.
All references are to main.
Static evidence
1. The real EnvAdapter contract — skillopt/envs/base.py:
class EnvAdapter(ABC):
    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self) -> list[str]: ...
No execute, evaluate, or build_prompt. The real reference env confirms this — skillopt/envs/officeqa/adapter.py implements build_train_env / rollout, not execute / evaluate.
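For contrast with the documented API, here is a minimal sketch of an adapter that does satisfy the five abstract methods. The EnvAdapter class below is a stand-in that only mirrors the signatures quoted above so the snippet runs without the repository; ToyEnv and every method body are hypothetical placeholders, and the return type of reflect() is not shown in this issue, so an empty list is an assumption.

```python
from abc import ABC, abstractmethod


class EnvAdapter(ABC):
    """Stand-in mirroring the signatures quoted above (not the real class)."""

    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...

    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...

    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...

    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...

    @abstractmethod
    def get_task_types(self) -> list[str]: ...


class ToyEnv(EnvAdapter):
    """Hypothetical adapter; every body below is a placeholder."""

    def build_train_env(self, batch_size, seed, **kwargs):
        # Assumption: the "env manager" can be any object rollout() understands.
        return [{"id": i, "question": f"q{i}"} for i in range(batch_size)]

    def build_eval_env(self, env_num, split, seed, **kwargs):
        return [{"id": i, "question": f"{split}-q{i}"} for i in range(env_num)]

    def rollout(self, env_manager, skill_content, out_dir, **kwargs):
        # The issue reports that real rollouts return list[dict] with hard / soft keys.
        return [{"id": task["id"], "hard": 0, "soft": 0.0} for task in env_manager]

    def reflect(self, results, skill_content, out_dir, **kwargs):
        # Assumption: the real return type is not shown in this issue.
        return []

    def get_task_types(self) -> list[str]:
        return ["toy"]


env = ToyEnv()  # instantiates because all five abstract methods are defined
results = env.rollout(env.build_train_env(batch_size=2, seed=0), "skill text", "out")
print(results)
```

The point is only the shape of the class: five concrete methods with these names, and no execute / evaluate / build_prompt.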
2. docs/reference/api.md documents a different contract:
class EnvAdapter(ABC):
    async def execute(self, item, skill, model) -> TaskResult
    def evaluate(self, prediction, ground_truth) -> float
    def build_prompt(self, item, skill) -> str
It also documents @dataclass DataItem, @dataclass TaskResult, DataLoader.get_split_items, ModelBackend, and Trainer.
3. DataItem and TaskResult are never defined in code. A repo code search for class DataItem and class TaskResult returns exactly one hit each — both in docs/reference/api.md. They do not exist in skillopt/types.py, which instead defines RolloutResult, Edit, Patch, RawPatch, SlowUpdateResult, etc. Real rollouts return list[dict] (see officeqa), not TaskResult.
4. docs/guide/new-benchmark.md compounds it. It tells contributors to import DataItem from skillopt.data.base (module path doesn't exist; the real base is skillopt.datasets.base), implement execute() / evaluate() / get_split_items(), and register via a BENCHMARK_REGISTRY dict in skillopt/envs/__init__.py. The real registration mechanism is _ENV_REGISTRY + _register_builtins() in scripts/train.py; skillopt/envs/__init__.py is a one-line docstring with no BENCHMARK_REGISTRY.
5. The shipped template propagates the broken API. skillopt/envs/_template/env_template.py defines class TemplateBenchmarkEnv(EnvAdapter) with async def execute(self, item, skill, model) and def evaluate(...); loader_template.py defines get_split_items(). Copying the official template produces an adapter missing all five real abstract methods.
6. The real dataloader base is also different. Docs say subclass DataLoader and implement get_split_items. The real base is SplitDataLoader(BaseDataLoader) in skillopt/datasets/base.py; subclasses implement load_raw_items / load_split_items (get_split_items exists only as an internal accessor).
Practical reproduction
I followed docs/guide/new-benchmark.md verbatim against main on a clean checkout and captured the real python3 output at each step. Captions match the output exactly.
Step 2 — the loader import the guide prescribes (from skillopt.data.base import DataLoader, DataItem):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    from skillopt.data.base import DataLoader, DataItem
ModuleNotFoundError: No module named 'skillopt.data'
Step 3 — the env import the guide prescribes (from skillopt.envs.base import EnvAdapter, TaskResult):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    from skillopt.envs.base import EnvAdapter, TaskResult
ImportError: cannot import name 'TaskResult' from 'skillopt.envs.base'. Did you mean: 'GateResult'?
Step 3 (cont.) — importing only the real EnvAdapter and defining the guide's __init__(cfg) + execute() / evaluate() / build_prompt() adapter, then instantiating it:
Traceback (most recent call last):
  File "<string>", line 9, in <module>
    DocFaithfulEnv({})
TypeError: Can't instantiate abstract class DocFaithfulEnv without an implementation for abstract methods 'build_eval_env', 'build_train_env', 'get_task_types', 'reflect', 'rollout'
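The same failure can be detected without instantiating anything, by reading the class's __abstractmethods__ attribute (standard abc behaviour). The sketch below uses a stand-in EnvAdapter that mirrors the five signatures from skillopt/envs/base.py and a DocFaithfulEnv written the way the guide prescribes; both are reconstructions for illustration, not the repository's code, and the exact TypeError wording depends on the Python version.

```python
from abc import ABC, abstractmethod


class EnvAdapter(ABC):
    """Stand-in for the real ABC; only the method names matter here."""

    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...

    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...

    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...

    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...

    @abstractmethod
    def get_task_types(self) -> list[str]: ...


class DocFaithfulEnv(EnvAdapter):
    """Adapter shaped like the guide's example: documented methods only."""

    def __init__(self, cfg):
        self.cfg = cfg

    async def execute(self, item, skill, model):
        return None

    def evaluate(self, prediction, ground_truth):
        return 0.0

    def build_prompt(self, item, skill):
        return ""


# Names still abstract on the subclass, i.e. what the TypeError will list.
missing = sorted(DocFaithfulEnv.__abstractmethods__)
print(missing)

try:
    DocFaithfulEnv({})
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

A check of this kind against skillopt/envs/_template/ would have flagged the template as soon as it drifted from the ABC.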
Step 6 — the documented run command after Steps 1-5 exactly. It crashes inside the guide's own Step-5 config: the _base_: ['...'] list form is not supported by the loader:
Traceback (most recent call last):
  File ".../scripts/train.py", line 458, in main
    cfg = load_config(args)
  File ".../skillopt/config.py", line 142, in _load_yaml
    base_path = os.path.join(os.path.dirname(abs_path), base_ref)
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'list'
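The crash reduces to os.path.join receiving a list as its second argument. The sketch below reproduces that in isolation and shows one hypothetical way a loader could normalize _base_ to a list of paths if maintainers preferred to support both forms. resolve_base_paths and the example file names are invented for illustration and are not part of skillopt/config.py.

```python
import os


def resolve_base_paths(config_path, base_ref):
    """Hypothetical helper: accept `_base_` as one path or a list of paths."""
    if isinstance(base_ref, (str, os.PathLike)):
        refs = [base_ref]
    else:
        refs = list(base_ref)
    base_dir = os.path.dirname(os.path.abspath(config_path))
    return [os.path.join(base_dir, ref) for ref in refs]


# The failing call from the traceback, reduced to its essentials.
try:
    os.path.join("configs", ["base.yaml"])
    list_form_error = None
except TypeError as exc:
    list_form_error = str(exc)
print(list_form_error)

# Both forms resolve to the same single path with the helper.
from_string = resolve_base_paths("configs/docfaithful.yaml", "base.yaml")
from_list = resolve_base_paths("configs/docfaithful.yaml", ["base.yaml"])
print(from_string == from_list)
```

Whether to accept the list form or only document the string form is a maintainer decision; either way the guide's Step-5 config and the loader need to agree.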
Corrected run — reverting the Step-4 BENCHMARK_REGISTRY edit and changing only _base_ to the supported string form, to get past the two bugs above. The run now reaches adapter construction and fails because BENCHMARK_REGISTRY is never consulted; scripts/train.py resolves envs from its own _ENV_REGISTRY:
Traceback (most recent call last):
  File ".../scripts/train.py", line 486, in main
    adapter = get_adapter(cfg)
  File ".../scripts/train.py", line 108, in get_adapter
    raise ValueError(...)
ValueError: Unknown environment 'docfaithful'. Available: ['alfworld', 'searchqa', 'livemathematicianbench', 'docvqa', 'officeqa']
Impact
Following the official "Add a New Benchmark" guide or the shipped template is a dead end: the prescribed module path (skillopt.data.base) does not exist, the prescribed symbol (TaskResult) does not exist, the prescribed adapter cannot instantiate against the real EnvAdapter ABC, the guide's own Step-5 config crashes the loader, and the Step-4 registration mechanism is ignored. The only working way to add an environment today is to reverse-engineer skillopt/envs/officeqa/.
Suggested fix
Rewrite docs/reference/api.md, docs/guide/new-benchmark.md, and skillopt/envs/_template/ to match the real EnvAdapter contract (build_train_env / build_eval_env / rollout / reflect / get_task_types), the real dataloader base (SplitDataLoader.load_raw_items / load_split_items), the real result shape (list[dict] with hard / soft keys, or RolloutResult), the supported _base_ string form in configs, and the real registration path (_ENV_REGISTRY in scripts/train.py). Alternatively, implement the API the docs describe (execute / evaluate / build_prompt with DataItem / TaskResult and BENCHMARK_REGISTRY); it is much cleaner.