This document is the contract reference for the public interfaces used by the current codebase. Paths are relative to the repository root.
Module: gridworld/task_spec.py
TaskSpecification is the canonical gridworld task object. Load it with
TaskSpecification.from_json(path) or TaskSpecification.from_dict(data).
Required top-level fields:
- task_id: str
- seed: int
- difficulty_tier: int
- maze: MazeLayout
- goal: GoalSpec
- max_steps: int
Optional top-level fields:
- version: str = "1.0"
- description: str = ""
- mechanisms: MechanismSet = {}
- rules: Rules = {}
- dependency_chain
- distractors
- metadata
Supported mechanisms:
- keys: id, position, color
- doors: id, position, requires_key, initial_state
- switches: id, position, controls, color, switch_type, initial_state
- gates: id, position, initial_state
- blocks: id, position, pushable, color
- teleporters: id, position_a, position_b, bidirectional
- hazards: id, position, hazard_type
Supported rules:
- key_consumption: bool
- switch_type: "toggle" | "hold" | "one_shot"
- hidden_mechanisms: list[str]
- observability: "full" | "view_cone" | "fog_of_war"
- view_size: int, odd and at least 3
Supported goal types:
- reach_position
- collect_all
- push_block_to
- survive_steps
Validation:
is_valid, errors = spec.validate()

Validation checks bounds, border-wall conflicts, duplicate mechanism IDs,
position overlaps, door-key color references, switch-gate references,
hidden-mechanism references, dependency-chain references, distractor references,
goal references, max_steps, and view_size.
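As a rough illustration of the (is_valid, errors) return shape, the sketch below reimplements only the two scalar checks named above. The helper `check_scalar_rules` is hypothetical and is not part of gridworld/task_spec.py. The view_size rule (odd, at least 3) comes from the rules list, while the requirement that max_steps be positive is an assumption.

```python
def check_scalar_rules(max_steps: int, view_size: int) -> tuple[bool, list[str]]:
    """Hypothetical helper mirroring the (is_valid, errors) shape of validate()."""
    errors: list[str] = []
    # Assumption: a usable task needs at least one step.
    if max_steps <= 0:
        errors.append(f"max_steps must be positive, got {max_steps}")
    # From the rules list: view_size is odd and at least 3.
    if view_size < 3 or view_size % 2 == 0:
        errors.append(f"view_size must be odd and at least 3, got {view_size}")
    return (not errors, errors)


is_valid, errors = check_scalar_rules(max_steps=100, view_size=4)
for message in errors:
    print(message)
```

Collecting every error before returning, rather than stopping at the first one, lets a caller report all problems in a task file in a single pass.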
Module: gridworld/backends/base.py
All grid backends implement AbstractGridBackend:
backend.configure(task_spec)
obs, state, info = backend.reset(seed=None)
obs, reward, terminated, truncated, state, info = backend.step(action)
image = backend.render()
mission = backend.get_mission_text()
state = backend.get_state()
backend.close()

Backends use the MiniGrid-compatible 7-action external interface:
| ID | Name | Meaning |
|---|---|---|
| 0 | turn_left | Rotate counter-clockwise |
| 1 | turn_right | Rotate clockwise |
| 2 | move_forward | Move one cell in the facing direction |
| 3 | pickup | Pick up an object in front |
| 4 | drop | Drop the held object |
| 5 | toggle | Interact with a door, switch, or object in front |
| 6 | done | Wait/no-op |
MiniGridBackend executes these actions directly. MultiGridBackend translates
them to the native multigrid.agent.Action enum:
| External ID | Native ID |
|---|---|
| 0 | 2 (TURN_LEFT) |
| 1 | 3 (TURN_RIGHT) |
| 2 | 0 (FORWARD) |
| 3 | 4 (PICKUP) |
| 4 | 5 (DROP) |
| 5 | 6 (TOGGLE) |
| 6 | 8 (WAIT) |
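The translation table above can be expressed as a plain lookup. This is a minimal sketch that copies the ID pairs from the table. The names `EXTERNAL_TO_NATIVE` and `to_native_action` are illustrative and may not match what MultiGridBackend uses internally.

```python
# External (MiniGrid-compatible) action ID -> native MultiGrid action ID,
# copied from the translation table.
EXTERNAL_TO_NATIVE = {
    0: 2,  # turn_left    -> TURN_LEFT
    1: 3,  # turn_right   -> TURN_RIGHT
    2: 0,  # move_forward -> FORWARD
    3: 4,  # pickup       -> PICKUP
    4: 5,  # drop         -> DROP
    5: 6,  # toggle       -> TOGGLE
    6: 8,  # done         -> WAIT
}


def to_native_action(external_id: int) -> int:
    """Translate an external action ID, rejecting IDs outside 0..6."""
    try:
        return EXTERNAL_TO_NATIVE[external_id]
    except KeyError:
        raise ValueError(f"unknown external action ID: {external_id}") from None
```

Raising on unknown IDs, instead of passing them through, keeps an out-of-range model prediction from silently becoming a different native action.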
Module: gridworld/backends/base.py
GridState is the backend-independent snapshot returned by reset, step, and
get_state.
Fields:
- agent_position: tuple[int, int]
- agent_direction: int
- agent_carrying: str | None
- step_count: int
- max_steps: int
- terminated: bool
- truncated: bool
- reward: float
- open_doors: set[str]
- collected_keys: set[str]
- active_switches: set[str]
- open_gates: set[str]
- block_positions: dict[str, tuple[int, int]]
- teleporter_cooldowns: dict[str, int]
- goal_reached: bool
- observability_mode: str
- visible_cells: set[tuple[int, int]]
- explored_cells: set[tuple[int, int]]
Use state.to_dict() and GridState.from_dict(data) for serialization.
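Several GridState fields are sets or tuples, which JSON cannot represent directly, so a round trip has to convert them. The sketch below shows one way this could work, using a hypothetical stand-in class `MiniState` that carries only four of the fields. The real to_dict() output format is not documented here, and the list-based encoding is an assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MiniState:
    """Hypothetical stand-in holding a few GridState fields."""
    agent_position: tuple[int, int]
    agent_direction: int
    open_doors: set[str] = field(default_factory=set)
    block_positions: dict[str, tuple[int, int]] = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Assumed encoding: tuples become lists, sets become sorted lists.
        return {
            "agent_position": list(self.agent_position),
            "agent_direction": self.agent_direction,
            "open_doors": sorted(self.open_doors),
            "block_positions": {k: list(v) for k, v in self.block_positions.items()},
        }

    @classmethod
    def from_dict(cls, data: dict) -> "MiniState":
        # Restore the original container types after JSON decoding.
        return cls(
            agent_position=tuple(data["agent_position"]),
            agent_direction=data["agent_direction"],
            open_doors=set(data["open_doors"]),
            block_positions={k: tuple(v) for k, v in data["block_positions"].items()},
        )


state = MiniState((1, 2), 0, {"door_a"}, {"block_1": (3, 4)})
restored = MiniState.from_dict(json.loads(json.dumps(state.to_dict())))
```

Sorting the sets before encoding makes the serialized form deterministic, which helps when snapshots are compared or stored in version control.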
Module: gridworld/task_parser.py
TaskParser creates a populated CustomMiniGridEnv from a TaskSpecification.
It is used by MiniGridBackend.
parser = TaskParser(render_mode="rgb_array")
env = parser.parse(spec, seed=spec.seed)
env = parser.parse_file("gridworld/tasks/tier1/maze_simple_001.json")
env = parser.parse_dict(data)

Important behavior: TaskParser.parse() calls env.reset() internally before
placing task objects. Backend code must not call env.reset() again after parsing
or the task-specific objects will be removed.
Module: gridworld/runner/grid_runner.py
GridRunner executes episodes over any AbstractGridBackend.
runner = GridRunner(backend=backend)
result = runner.run_episode(spec, policy_fn=policy, seed=spec.seed)
results = runner.run_batch([spec1, spec2], policy_fn=policy)
demo = runner.collect_demonstrations(spec, actions=[2, 2, 1, 2])
records = runner.generate_observation_dataset([spec], output_dir="observations")

policy_fn receives (observation, state, mission) and returns either an action
integer or (action, info_dict).
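A policy matching this contract might look like the sketch below. `always_forward` and `split_policy_result` are illustrative names. The second function shows one way a caller could normalize the two allowed return shapes, and is not necessarily how GridRunner does it.

```python
from typing import Any, Union

# A policy may return a bare action ID or an (action, info_dict) pair.
PolicyResult = Union[int, tuple[int, dict[str, Any]]]


def always_forward(observation: Any, state: Any, mission: str) -> PolicyResult:
    """Trivial policy: always choose external action 2 (move_forward)."""
    return 2, {"reason": "fixed forward policy"}


def split_policy_result(result: PolicyResult) -> tuple[int, dict[str, Any]]:
    """Normalize either return shape to an (action, info) pair."""
    if isinstance(result, tuple):
        action, info = result
        return int(action), dict(info)
    return int(result), {}


action, info = split_policy_result(always_forward(None, None, "reach the goal"))
```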
EpisodeResult contains task_id, success, total_reward, steps_taken,
max_steps, terminated, truncated, trajectory, final_state, seed, and
mission.
Module: model_interface.py
Model adapters implement:
class MyModel(ModelInterface):
    @property
    def model_name(self) -> str: ...

    def predict(self, input: ModelInput) -> ModelOutput: ...

ModelInput fields:
- image: np.ndarray
- text_prompt: str
- action_space: dict[int, str]
- step_number: int
- max_steps: int
- additional_context: str | None
- prior_images: list[np.ndarray] | None
ModelOutput fields:
- action: int
- confidence: float | None
- reasoning: str | None
- raw_output: str | None
Built-in adapters:
- RandomModelInterface
- FileBasedModelInterface
- adapters.ollama_vlm_adapter.OllamaVLMAdapter
- adapters.lmstudio_vlm_adapter.LMStudioVLMAdapter
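To make the adapter contract concrete, here is a self-contained sketch of a custom adapter. The `ModelInput`, `ModelOutput`, and `ModelInterface` definitions below are local stand-ins written so the block runs on its own; they carry only a subset of the documented fields and are not the classes in model_interface.py. `UniformModel` is a hypothetical adapter name.

```python
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ModelInput:  # stand-in with a subset of the documented fields
    image: Any  # np.ndarray in the real interface
    text_prompt: str
    action_space: dict[int, str]
    step_number: int
    max_steps: int


@dataclass
class ModelOutput:  # stand-in
    action: int
    confidence: Optional[float] = None
    reasoning: Optional[str] = None
    raw_output: Optional[str] = None


class ModelInterface(ABC):  # stand-in for the documented base class
    @property
    @abstractmethod
    def model_name(self) -> str: ...

    @abstractmethod
    def predict(self, input: ModelInput) -> ModelOutput: ...


class UniformModel(ModelInterface):
    """Hypothetical adapter that picks uniformly among the offered actions."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    @property
    def model_name(self) -> str:
        return "uniform-sketch"

    def predict(self, input: ModelInput) -> ModelOutput:
        # Choose only from IDs present in action_space.
        action = self._rng.choice(sorted(input.action_space))
        return ModelOutput(action=action, reasoning="uniform random choice")
```

Sampling from the keys of action_space, rather than from a hard-coded range, keeps the adapter valid if a task exposes fewer than seven actions.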
Module: evaluation_harness.py
EvaluationHarness adapts ModelInterface to GridRunner and computes aggregate
metrics.
harness = EvaluationHarness(model, backend=backend)
episode = harness.evaluate_task(spec)
tier_metrics = harness.evaluate_tier(1, task_dir="gridworld/tasks")
all_metrics = harness.evaluate_all(task_dir="gridworld/tasks", tiers=[1, 2, 3])
benchmark = harness.evaluate_task_dir("mazes/validation_10")
harness.close()

The harness can include prior images and rolling text summaries in ModelInput
via history_images, history_text, and history_text_window.
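How the harness builds its rolling text summary is not specified here. One plausible reading of history_text_window is a fixed-size buffer of recent step descriptions, sketched below. The class `TextHistory` and the "step N: action" line format are assumptions made for illustration.

```python
from collections import deque
from typing import Optional


class TextHistory:
    """Hypothetical rolling summary that keeps the last `window` entries."""

    def __init__(self, window: int) -> None:
        # deque with maxlen discards the oldest entry automatically.
        self._entries: deque = deque(maxlen=window)

    def add(self, step: int, action_name: str) -> None:
        self._entries.append(f"step {step}: {action_name}")

    def as_context(self) -> Optional[str]:
        """Return the joined summary, or None when nothing is recorded yet."""
        return "\n".join(self._entries) if self._entries else None


history = TextHistory(window=2)
history.add(0, "turn_left")
history.add(1, "move_forward")
history.add(2, "toggle")
context = history.as_context()  # only the two most recent steps remain
```

A string built this way could be passed as additional_context, which is typed str | None and therefore accepts the empty-history case.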
Primary CLI:
python run_eval.py --benchmark validation_10 --model random
python run_eval.py --benchmark tiers --tier 1-3 --model random
python run_eval.py --benchmark directory --task-dir path/to/json_dir --model random

Installed entry point:
multinet-run-eval --benchmark validation_10 --model random