Docs and env template describe an API that does not exist in code #30

@emmahyde

Summary

The documentation and the shipped environment template describe an EnvAdapter API (execute() / evaluate() / build_prompt(), with DataItem / TaskResult data classes and a BENCHMARK_REGISTRY) that does not exist anywhere in the codebase. The real abstract contract in skillopt/envs/base.py is a different set of methods (build_train_env / build_eval_env / rollout / reflect / get_task_types). A contributor who follows docs/guide/new-benchmark.md or copies skillopt/envs/_template/ writes an adapter that cannot satisfy EnvAdapter's abstract methods and will fail to instantiate.

All references are to main.

Static evidence

1. The real EnvAdapter contract — skillopt/envs/base.py:

class EnvAdapter(ABC):
    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self) -> list[str]: ...

No execute, evaluate, or build_prompt. The real reference env confirms this — skillopt/envs/officeqa/adapter.py implements build_train_env / rollout, not execute / evaluate.
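For orientation, here is a minimal sketch of an adapter shaped to the contract quoted above. So that it runs on its own, it redeclares a stand-in EnvAdapter with the five signatures from this issue instead of importing skillopt. The class name SketchEnv, every method body, the dict used as an "env manager", and the hard / soft result keys are assumptions for illustration, not the project's actual behavior.

```python
from abc import ABC, abstractmethod


class EnvAdapter(ABC):
    """Stand-in mirroring the signatures quoted from skillopt/envs/base.py."""

    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self) -> list[str]: ...


class SketchEnv(EnvAdapter):
    """Hypothetical adapter; bodies are placeholders, not real skillopt logic."""

    def build_train_env(self, batch_size, seed, **kwargs):
        # Assumption: the env manager is modeled as a plain dict.
        return {"mode": "train", "size": batch_size, "seed": seed}

    def build_eval_env(self, env_num, split, seed, **kwargs):
        return {"mode": "eval", "size": env_num, "split": split, "seed": seed}

    def rollout(self, env_manager, skill_content, out_dir, **kwargs):
        # Assumption: one dict per task carrying hard / soft scores.
        return [
            {"task_id": i, "hard": 0, "soft": 0.0}
            for i in range(env_manager["size"])
        ]

    def reflect(self, results, skill_content, out_dir, **kwargs):
        # Assumption: reflection yields a list of suggestions; empty here.
        return []

    def get_task_types(self) -> list[str]:
        return ["default"]


# Instantiation succeeds because all five abstract methods are overridden.
env = SketchEnv()
results = env.rollout(env.build_train_env(batch_size=2, seed=0), "skill text", "out")
```

The point of the sketch is only the method surface: a class that overrides these five names can be constructed, whatever the real return types turn out to be.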

2. docs/reference/api.md documents a different, incompatible interface:

class EnvAdapter(ABC):
    async def execute(self, item, skill, model) -> TaskResult
    def evaluate(self, prediction, ground_truth) -> float
    def build_prompt(self, item, skill) -> str

It also documents @dataclass DataItem, @dataclass TaskResult, DataLoader.get_split_items, ModelBackend, and Trainer.

3. DataItem and TaskResult are never defined in code. A repo code search for class DataItem and class TaskResult returns exactly one hit each — both in docs/reference/api.md. They do not exist in skillopt/types.py, which instead defines RolloutResult, Edit, Patch, RawPatch, SlowUpdateResult, etc. Real rollouts return list[dict] (see officeqa), not TaskResult.
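The code search described above can be approximated with a short script. This is a sketch under assumptions: the helper name find_class_definitions and the choice to scan only .py and .md files are mine, and it matches class statements by regular expression, not by parsing.

```python
import re
from pathlib import Path


def find_class_definitions(root, class_name, suffixes=(".py", ".md")):
    """Return (path, line_number) pairs where `class <class_name>` starts a line."""
    pattern = re.compile(rf"^\s*class\s+{re.escape(class_name)}\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types outside the chosen suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.match(line):
                hits.append((str(path), line_number))
    return hits
```

Run from the repository root as find_class_definitions(".", "DataItem"), it would be expected to list only docs/reference/api.md if the finding above holds.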

4. docs/guide/new-benchmark.md compounds the problem. It tells contributors to import DataItem from skillopt.data.base (that module path does not exist; the real base is skillopt.datasets.base), to implement execute() / evaluate() / get_split_items(), and to register the adapter via a BENCHMARK_REGISTRY dict in skillopt/envs/__init__.py. The real registration mechanism is _ENV_REGISTRY + _register_builtins() in scripts/train.py; skillopt/envs/__init__.py is a one-line docstring with no BENCHMARK_REGISTRY.

5. The shipped template propagates the broken API. skillopt/envs/_template/env_template.py defines class TemplateBenchmarkEnv(EnvAdapter) with async def execute(self, item, skill, model) and def evaluate(...); loader_template.py defines get_split_items(). Copying the official template produces an adapter missing all five real abstract methods.

6. The real dataloader base is also different. Docs say subclass DataLoader and implement get_split_items. The real base is SplitDataLoader(BaseDataLoader) in skillopt/datasets/base.py; subclasses implement load_raw_items / load_split_items (get_split_items exists only as an internal accessor).

Practical reproduction

I followed docs/guide/new-benchmark.md verbatim against main on a clean checkout and captured the real python3 output at each step. Each caption describes exactly the output shown beneath it.

Step 2 — the loader import the guide prescribes (from skillopt.data.base import DataLoader, DataItem):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
    from skillopt.data.base import DataLoader, DataItem
ModuleNotFoundError: No module named 'skillopt.data'

Step 3 — the env import the guide prescribes (from skillopt.envs.base import EnvAdapter, TaskResult):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
    from skillopt.envs.base import EnvAdapter, TaskResult
ImportError: cannot import name 'TaskResult' from 'skillopt.envs.base'. Did you mean: 'GateResult'?

Step 3 (cont.) — importing only the real EnvAdapter and defining the guide's __init__(cfg) + execute() / evaluate() / build_prompt() adapter, then instantiating it:

Traceback (most recent call last):
  File "<string>", line 9, in <module>
    DocFaithfulEnv({})
TypeError: Can't instantiate abstract class DocFaithfulEnv without an implementation for abstract methods 'build_eval_env', 'build_train_env', 'get_task_types', 'reflect', 'rollout'
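The same gap can be detected without instantiating anything, which might make a cheap regression test for the template. The sketch below is self-contained: EnvAdapter is a stand-in carrying the five abstract names from this issue, DocFaithfulEnv follows the guide's shape, and missing_abstract_methods is a hypothetical helper built on the standard `__abstractmethods__` attribute that ABCs expose.

```python
from abc import ABC, abstractmethod


class EnvAdapter(ABC):
    """Stand-in with the abstract method names quoted from skillopt/envs/base.py."""

    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self): ...


class DocFaithfulEnv(EnvAdapter):
    """Adapter written the way the guide describes it."""

    def __init__(self, cfg):
        self.cfg = cfg

    async def execute(self, item, skill, model): ...
    def evaluate(self, prediction, ground_truth): ...
    def build_prompt(self, item, skill): ...


def missing_abstract_methods(cls):
    """Names of abstract methods that `cls` has not overridden, sorted."""
    return sorted(getattr(cls, "__abstractmethods__", frozenset()))


missing = missing_abstract_methods(DocFaithfulEnv)
```

Here `missing` holds the same five names the TypeError reports. A test asserting that this list is empty for the class in skillopt/envs/_template/ would flag the template as soon as it drifts from the ABC.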

Step 6 — the documented run command, after completing Steps 1-5 exactly as written. It crashes while loading the guide's own Step-5 config, because the loader does not support the _base_: ['...'] list form:

Traceback (most recent call last):
  File ".../scripts/train.py", line 458, in main
    cfg = load_config(args)
  File ".../skillopt/config.py", line 142, in _load_yaml
    base_path = os.path.join(os.path.dirname(abs_path), base_ref)
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'list'
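The failure can be reproduced in isolation, because os.path.join raises TypeError when handed a list. The sketch below shows that, plus one possible guard that reports the problem in terms of the config key. resolve_base and the file names are hypothetical, and this is not the code in skillopt/config.py.

```python
import os


def resolve_base(config_path, base_ref):
    """Resolve a `_base_` reference relative to the config that declares it.

    Hypothetical guard: rejects non-string values with an explicit message
    instead of letting os.path.join fail deeper in the stack.
    """
    if not isinstance(base_ref, (str, os.PathLike)):
        raise TypeError(
            "_base_ must be a single path string, "
            f"got {type(base_ref).__name__}: {base_ref!r}"
        )
    config_dir = os.path.dirname(os.path.abspath(config_path))
    return os.path.join(config_dir, base_ref)


# Reproduce the raw failure: a list is not a valid path component.
try:
    os.path.join("configs", ["base.yaml"])  # hypothetical file name
    raw_error = None
except TypeError as exc:
    raw_error = str(exc)
```

Whether the loader should accept the list form or the guide should switch to the string form is a separate decision; the guard only makes the mismatch easier to diagnose.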

Corrected run — reverting the Step-4 BENCHMARK_REGISTRY edit and changing only _base_ to the supported string form, to get past the two bugs above. The run now reaches adapter construction and fails because BENCHMARK_REGISTRY is never consulted; scripts/train.py resolves envs from its own _ENV_REGISTRY:

Traceback (most recent call last):
  File ".../scripts/train.py", line 486, in main
    adapter = get_adapter(cfg)
  File ".../scripts/train.py", line 108, in get_adapter
    raise ValueError(...)
ValueError: Unknown environment 'docfaithful'. Available: ['alfworld', 'searchqa', 'livemathematicianbench', 'docvqa', 'officeqa']
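A toy model of the lookup makes the behavior easier to see. Only the names _ENV_REGISTRY and BENCHMARK_REGISTRY and the wording of the error come from this issue; get_adapter_class, the placeholder classes, and the way entries are added are assumptions, since the internals of scripts/train.py are not shown here.

```python
_ENV_REGISTRY = {}        # stand-in for the registry in scripts/train.py
BENCHMARK_REGISTRY = {}   # the dict the guide asks contributors to edit


class OfficeQAEnv:
    """Placeholder for a built-in environment."""


class DocFaithfulEnv:
    """Placeholder for the contributor's new environment."""


def get_adapter_class(env_name):
    """Hypothetical lookup that consults only _ENV_REGISTRY."""
    if env_name not in _ENV_REGISTRY:
        raise ValueError(
            f"Unknown environment {env_name!r}. Available: {list(_ENV_REGISTRY)}"
        )
    return _ENV_REGISTRY[env_name]


_ENV_REGISTRY["officeqa"] = OfficeQAEnv
BENCHMARK_REGISTRY["docfaithful"] = DocFaithfulEnv  # written, but never read

try:
    get_adapter_class("docfaithful")
    lookup_error = None
except ValueError as exc:
    lookup_error = str(exc)

# Registering in the dict that is actually consulted resolves the name.
_ENV_REGISTRY["docfaithful"] = DocFaithfulEnv
resolved = get_adapter_class("docfaithful")
```

In this model the entry added to BENCHMARK_REGISTRY has no effect on the lookup, which matches the observed error.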

Impact

Following the official "Add a New Benchmark" guide or the shipped template is a dead end: the prescribed module path (skillopt.data.base) does not exist, the prescribed symbol (TaskResult) does not exist, the prescribed adapter cannot instantiate against the real EnvAdapter ABC, the guide's own Step-5 config crashes the loader, and the Step-4 registration mechanism is ignored. The only working way to add an environment today is to reverse-engineer skillopt/envs/officeqa/.

Suggested fix

Rewrite docs/reference/api.md, docs/guide/new-benchmark.md, and skillopt/envs/_template/ to match the real EnvAdapter contract (build_train_env / build_eval_env / rollout / reflect / get_task_types), the real dataloader base (SplitDataLoader.load_raw_items / load_split_items), the real result shape (list[dict] with hard / soft keys, or RolloutResult), the supported _base_ string form in configs, and the real registration path (_ENV_REGISTRY in scripts/train.py). Alternatively, implement the documented API in code; it is the much cleaner design.
