# Docs and env template describe an API that does not exist in code (#30)

## Summary

The documentation and the shipped environment template describe an `EnvAdapter` API (`execute()` / `evaluate()` / `build_prompt()`, with `DataItem` / `TaskResult` data classes and a `BENCHMARK_REGISTRY`) that **does not exist anywhere in the codebase**. The real abstract contract in `skillopt/envs/base.py` is a different set of methods (`build_train_env` / `build_eval_env` / `rollout` / `reflect` / `get_task_types`). A contributor who follows `docs/guide/new-benchmark.md` or copies `skillopt/envs/_template/` writes an adapter that cannot satisfy `EnvAdapter`'s abstract methods and will fail to instantiate.

All references are to `main`.

## Static evidence

**1. The real `EnvAdapter` contract — `skillopt/envs/base.py`:**

```python
from abc import ABC, abstractmethod


class EnvAdapter(ABC):
    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self) -> list[str]: ...
```

No `execute`, `evaluate`, or `build_prompt`. The real reference env confirms this — `skillopt/envs/officeqa/adapter.py` implements `build_train_env` / `rollout`, not `execute` / `evaluate`.
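For contrast with the documented API, here is a minimal sketch of an adapter that does satisfy the real contract. It re-declares the ABC locally (same five abstract methods as `skillopt/envs/base.py`) so it runs without the package installed. `MinimalEnv` and every method body are hypothetical placeholders, and the `hard` / `soft` keys in the rollout result are an assumption about the real `list[dict]` shape.

```python
from abc import ABC, abstractmethod
from typing import Any


# Local stand-in mirroring the abstract methods of skillopt/envs/base.py.
class EnvAdapter(ABC):
    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self) -> list[str]: ...


class MinimalEnv(EnvAdapter):
    """Hypothetical adapter: every body is a placeholder, not real skillopt behaviour."""

    def build_train_env(self, batch_size, seed, **kwargs):
        # Placeholder "env manager" that only records how it was built.
        return {"split": "train", "batch_size": batch_size, "seed": seed}

    def build_eval_env(self, env_num, split, seed, **kwargs):
        return {"split": split, "env_num": env_num, "seed": seed}

    def rollout(self, env_manager, skill_content, out_dir, **kwargs) -> list[dict[str, Any]]:
        # Real rollouts return list[dict]; the "hard" / "soft" keys here are assumed.
        return [{"split": env_manager["split"], "hard": 0.0, "soft": 0.0}]

    def reflect(self, results, skill_content, out_dir, **kwargs):
        # Placeholder: performs no reflection.
        return None

    def get_task_types(self) -> list[str]:
        return ["default"]


env = MinimalEnv()  # instantiates because all five abstract methods are overridden
print(env.get_task_types())  # ['default']
```

Note that `__init__` is not part of the contract at all; the guide's `__init__(cfg)` is optional, while the five methods above are mandatory.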

**2. `docs/reference/api.md` documents the opposite:**

```python
class EnvAdapter(ABC):
    async def execute(self, item, skill, model) -> TaskResult
    def evaluate(self, prediction, ground_truth) -> float
    def build_prompt(self, item, skill) -> str
```

It also documents `@dataclass DataItem`, `@dataclass TaskResult`, `DataLoader.get_split_items`, `ModelBackend`, and `Trainer`.

**3. `DataItem` and `TaskResult` are never defined in code.** A repo code search for `class DataItem` and `class TaskResult` returns exactly one hit each — both in `docs/reference/api.md`. They do not exist in `skillopt/types.py`, which instead defines `RolloutResult`, `Edit`, `Patch`, `RawPatch`, `SlowUpdateResult`, etc. Real rollouts return `list[dict]` (see `officeqa`), not `TaskResult`.
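The static search above is easy to reproduce locally. A minimal standard-library sketch; the helper name `find_class_definitions` is hypothetical, and it only matches lines that begin with `class <Name>` in `.py` and `.md` files, so it may differ slightly from GitHub's code search.

```python
import re
from pathlib import Path


def find_class_definitions(root: str, class_name: str,
                           suffixes: tuple[str, ...] = (".py", ".md")) -> list[str]:
    """Return paths (relative to root) of files with a line starting `class <class_name>`."""
    # \b stops `DataItem` from also matching e.g. `DataItemExtra`.
    pattern = re.compile(rf"^\s*class\s+{re.escape(class_name)}\b", re.MULTILINE)
    root_path = Path(root)
    hits = []
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if pattern.search(text):
                hits.append(path.relative_to(root_path).as_posix())
    return hits
```

Run from the repository root as `find_class_definitions(".", "DataItem")`; given the search results reported above, on `main` it should list only `docs/reference/api.md`, while `"RolloutResult"` should point at `skillopt/types.py`.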

**4. `docs/guide/new-benchmark.md` compounds it.** It tells contributors to import `DataItem` from `skillopt.data.base` (module path doesn't exist; the real base is `skillopt.datasets.base`), implement `execute()` / `evaluate()` / `get_split_items()`, and register via a `BENCHMARK_REGISTRY` dict in `skillopt/envs/__init__.py`. The real registration mechanism is `_ENV_REGISTRY` + `_register_builtins()` in `scripts/train.py`; `skillopt/envs/__init__.py` is a one-line docstring with no `BENCHMARK_REGISTRY`.

**5. The shipped template propagates the broken API.** `skillopt/envs/_template/env_template.py` defines `class TemplateBenchmarkEnv(EnvAdapter)` with `async def execute(self, item, skill, model)` and `def evaluate(...)`; `loader_template.py` defines `get_split_items()`. Copying the official template produces an adapter missing all five real abstract methods.

**6. The real dataloader base is also different.** Docs say subclass `DataLoader` and implement `get_split_items`. The real base is `SplitDataLoader(BaseDataLoader)` in `skillopt/datasets/base.py`; subclasses implement `load_raw_items` / `load_split_items` (`get_split_items` exists only as an internal accessor).
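To make the difference concrete, here is a sketch of the split-loader shape described above. Only the class and method names (`BaseDataLoader`, `SplitDataLoader`, `load_raw_items`, `load_split_items`, `get_split_items`) come from the repo; the signatures, the per-split caching in `get_split_items`, and the `ToyLoader` subclass are assumptions for illustration, not the contents of `skillopt/datasets/base.py`.

```python
from abc import ABC, abstractmethod
from typing import Any


# Stand-ins for skillopt.datasets.base: names from the repo, signatures assumed.
class BaseDataLoader(ABC):
    @abstractmethod
    def load_raw_items(self) -> list[dict[str, Any]]:
        """Load every raw item from the underlying source."""


class SplitDataLoader(BaseDataLoader):
    def __init__(self) -> None:
        self._cache: dict[str, list[dict[str, Any]]] = {}

    @abstractmethod
    def load_split_items(self, split: str) -> list[dict[str, Any]]:
        """Return the items belonging to one split (e.g. 'train' or 'test')."""

    def get_split_items(self, split: str) -> list[dict[str, Any]]:
        # Internal accessor: subclasses implement load_split_items, not this.
        if split not in self._cache:
            self._cache[split] = self.load_split_items(split)
        return self._cache[split]


class ToyLoader(SplitDataLoader):
    """Hypothetical loader that alternates items between 'train' and 'test'."""

    def load_raw_items(self) -> list[dict[str, Any]]:
        return [{"id": i, "question": f"q{i}"} for i in range(4)]

    def load_split_items(self, split: str) -> list[dict[str, Any]]:
        start = 0 if split == "train" else 1
        return self.load_raw_items()[start::2]


loader = ToyLoader()
print([item["id"] for item in loader.get_split_items("train")])  # [0, 2]
```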

## Practical reproduction

I followed `docs/guide/new-benchmark.md` verbatim against `main` on a clean checkout and captured the real `python3` output at each step. Each caption describes exactly the output shown beneath it.

**Step 2 — the loader import the guide prescribes (`from skillopt.data.base import DataLoader, DataItem`):**

```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    from skillopt.data.base import DataLoader, DataItem
ModuleNotFoundError: No module named 'skillopt.data'
```

**Step 3 — the env import the guide prescribes (`from skillopt.envs.base import EnvAdapter, TaskResult`):**

```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    from skillopt.envs.base import EnvAdapter, TaskResult
ImportError: cannot import name 'TaskResult' from 'skillopt.envs.base'. Did you mean: 'GateResult'?
```
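Both import failures can be caught before anyone writes an adapter by checking every import the guide prescribes. A minimal standard-library sketch; `check_import` is a hypothetical helper, not part of `skillopt`.

```python
import importlib
import importlib.util
from typing import Optional


def check_import(module: str, symbol: Optional[str] = None) -> str:
    """Return 'ok', or a short reason why `from module import symbol` would fail."""
    try:
        spec = importlib.util.find_spec(module)
    except ModuleNotFoundError as exc:
        # find_spec imports the parent package first and raises if it is missing.
        return f"missing parent package: {exc.name}"
    if spec is None:
        return f"missing module: {module}"
    if symbol is None:
        return "ok"
    mod = importlib.import_module(module)
    return "ok" if hasattr(mod, symbol) else f"missing symbol: {module}.{symbol}"
```

Against `main`, `check_import("skillopt.data.base", "DataItem")` and `check_import("skillopt.envs.base", "TaskResult")` would be expected to report a missing parent package and a missing symbol respectively, matching the two tracebacks.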

**Step 3 (cont.) — importing only the real `EnvAdapter` and defining the guide's `__init__(cfg)` + `execute()` / `evaluate()` / `build_prompt()` adapter, then instantiating it:**

```
Traceback (most recent call last):
  File "<string>", line 9, in <module>
    DocFaithfulEnv({})
TypeError: Can't instantiate abstract class DocFaithfulEnv without an implementation for abstract methods 'build_eval_env', 'build_train_env', 'get_task_types', 'reflect', 'rollout'
```
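The same failure reproduces without the package. This sketch re-declares the five abstract methods from `skillopt/envs/base.py` locally and defines the adapter as the guide prescribes; `unimplemented` is a hypothetical helper that reads `__abstractmethods__`, the set `ABCMeta` keeps of abstract methods a class has not overridden.

```python
from abc import ABC, abstractmethod


# Stand-in for the real ABC in skillopt/envs/base.py (same five abstract methods).
class EnvAdapter(ABC):
    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self) -> list[str]: ...


class DocFaithfulEnv(EnvAdapter):
    """Adapter shaped the way docs/guide/new-benchmark.md prescribes (bodies elided)."""

    def __init__(self, cfg):
        self.cfg = cfg

    async def execute(self, item, skill, model): ...
    def evaluate(self, prediction, ground_truth) -> float: ...
    def build_prompt(self, item, skill) -> str: ...


def unimplemented(cls: type) -> list[str]:
    # Concrete classes have an empty __abstractmethods__; plain classes have none.
    return sorted(getattr(cls, "__abstractmethods__", frozenset()))


print(unimplemented(DocFaithfulEnv))
# ['build_eval_env', 'build_train_env', 'get_task_types', 'reflect', 'rollout']
try:
    DocFaithfulEnv({})
except TypeError as exc:
    print("TypeError:", exc)
```

The exact `TypeError` wording differs between Python versions, but the list of missing methods is the same five the traceback reports.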

**Step 6 — the documented run command after Steps 1-5 exactly. It crashes inside the guide's own Step-5 config: the `_base_: ['...']` list form is not supported by the loader:**

```
Traceback (most recent call last):
  File ".../scripts/train.py", line 458, in main
    cfg = load_config(args)
  File ".../skillopt/config.py", line 142, in _load_yaml
    base_path = os.path.join(os.path.dirname(abs_path), base_ref)
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'list'
```
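If the list form is meant to be supported, the loader could normalize `_base_` before joining paths. A sketch under assumptions: `resolve_base_paths` is a hypothetical helper rather than the actual `_load_yaml` in `skillopt/config.py`, and it only resolves each entry relative to the declaring config, mirroring the `os.path.join(os.path.dirname(abs_path), base_ref)` call in the traceback. How several bases would be merged, and in what order, is not addressed here.

```python
import os
from typing import Union


def resolve_base_paths(config_path: str, base_ref: Union[str, list[str], None]) -> list[str]:
    """Resolve `_base_` entries relative to the directory of the config declaring them.

    Accepts the string form the current loader supports and the list form the guide uses.
    """
    if base_ref is None:
        return []
    refs = [base_ref] if isinstance(base_ref, str) else list(base_ref)
    config_dir = os.path.dirname(os.path.abspath(config_path))
    resolved = []
    for ref in refs:
        if not isinstance(ref, str):
            raise TypeError(f"_base_ entries must be str, got {type(ref).__name__}")
        resolved.append(os.path.normpath(os.path.join(config_dir, ref)))
    return resolved
```

With this normalization, `_base_: '../base.yaml'` and `_base_: ['../base.yaml']` resolve to the same single path.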

**Corrected run — reverting the Step-4 `BENCHMARK_REGISTRY` edit and changing only `_base_` to the supported string form, to get past the two bugs above. The run now reaches adapter construction and fails because `BENCHMARK_REGISTRY` is never consulted; `scripts/train.py` resolves envs from its own `_ENV_REGISTRY`:**

```
Traceback (most recent call last):
  File ".../scripts/train.py", line 486, in main
    adapter = get_adapter(cfg)
  File ".../scripts/train.py", line 108, in get_adapter
    raise ValueError(...)
ValueError: Unknown environment 'docfaithful'. Available: ['alfworld', 'searchqa', 'livemathematicianbench', 'docvqa', 'officeqa']
```

## Impact

Following the official "Add a New Benchmark" guide or the shipped template is a dead end: the prescribed module path (`skillopt.data.base`) does not exist, the prescribed symbol (`TaskResult`) does not exist, the prescribed adapter cannot instantiate against the real `EnvAdapter` ABC, the guide's own Step-5 config crashes the loader, and the Step-4 registration mechanism is ignored. The only working way to add an environment today is to reverse-engineer `skillopt/envs/officeqa/`.

## Suggested fix

Rewrite `docs/reference/api.md`, `docs/guide/new-benchmark.md`, and `skillopt/envs/_template/` to match the real `EnvAdapter` contract (`build_train_env` / `build_eval_env` / `rollout` / `reflect` / `get_task_types`), the real dataloader base (`SplitDataLoader.load_raw_items` / `load_split_items`), the real result shape (`list[dict]` with `hard` / `soft` keys, or `RolloutResult`), the supported `_base_` string form in configs, and the real registration path (`_ENV_REGISTRY` in `scripts/train.py`). Alternatively, implement the documented API in code; it is much cleaner.
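Whichever direction is chosen, a small CI check could keep the reference page and the ABC from drifting apart again. A sketch, assuming the reference page keeps its signatures in python-fenced blocks; `documented_methods` and `contract_drift` are hypothetical helpers, and the ABC and doc snippet below are local stand-ins for `skillopt/envs/base.py` and `docs/reference/api.md`. It only compares method names, so documented dataclasses such as `DataItem` are not checked.

```python
import re
from abc import ABC, abstractmethod

FENCE = "`" * 3  # built at runtime so this sketch can itself sit inside a Markdown fence


def documented_methods(markdown: str) -> set[str]:
    """Collect `def` / `async def` names from python-fenced blocks in a Markdown document."""
    pattern = re.escape(FENCE) + r"python\n(.*?)" + re.escape(FENCE)
    names: set[str] = set()
    for block in re.findall(pattern, markdown, flags=re.DOTALL):
        names.update(re.findall(r"^\s*(?:async\s+)?def\s+(\w+)", block, flags=re.MULTILINE))
    return names


def contract_drift(markdown: str, abc_cls: type) -> tuple[set[str], set[str]]:
    """Return (abstract methods the docs omit, documented methods the class lacks)."""
    required = set(abc_cls.__abstractmethods__)
    documented = documented_methods(markdown)
    missing = required - documented
    stale = {name for name in documented if not hasattr(abc_cls, name)}
    return missing, stale


# Stand-ins for skillopt/envs/base.py and the snippet currently in docs/reference/api.md.
class EnvAdapter(ABC):
    @abstractmethod
    def build_train_env(self, batch_size, seed, **kwargs): ...
    @abstractmethod
    def build_eval_env(self, env_num, split, seed, **kwargs): ...
    @abstractmethod
    def rollout(self, env_manager, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def reflect(self, results, skill_content, out_dir, **kwargs): ...
    @abstractmethod
    def get_task_types(self) -> list[str]: ...


API_MD = (
    f"{FENCE}python\n"
    "class EnvAdapter(ABC):\n"
    "    async def execute(self, item, skill, model) -> TaskResult\n"
    "    def evaluate(self, prediction, ground_truth) -> float\n"
    "    def build_prompt(self, item, skill) -> str\n"
    f"{FENCE}\n"
)

missing, stale = contract_drift(API_MD, EnvAdapter)
print(sorted(missing))
print(sorted(stale))
```

With the current snippet, `missing` holds all five real abstract methods and `stale` holds `build_prompt`, `evaluate` and `execute`; a CI job could fail whenever either set is non-empty.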

