Knot So Simple: A Minimalistic Environment for Spatial Reasoning

Cornell Tech
May 2025

Abstract

We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents.

Three Tasks

Three tasks in KnotGym: unknot is the easiest task, while tie and convert are both goal-conditioned and thus harder. The difficulty of each task can be tuned by the number of crossings (nx) of the initial or goal configuration. The goal space grows drastically as nx increases.
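To make the crossing count (nx) concrete, here is a minimal sketch that counts self-crossings of a rope represented as a 2D polyline. It is an illustration only, not KnotGym's implementation: the function name `count_crossings` and the polyline representation are assumptions, and the sketch ignores over/under information, which a real knot description needs.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: > 0 if c lies to the left of the line a -> b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1-p2 and q1-q2 properly intersect (no touching endpoints)."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def count_crossings(points: List[Point]) -> int:
    """Count crossings between non-adjacent segments of a 2D polyline.

    Hypothetical helper: assumes the rope is given as an ordered list of
    projected 2D points and treats every proper intersection as one crossing.
    """
    num_segments = len(points) - 1
    count = 0
    for i in range(num_segments):
        # Skip the adjacent segment (i + 1), which shares an endpoint with i.
        for j in range(i + 2, num_segments):
            if _segments_cross(points[i], points[i + 1], points[j], points[j + 1]):
                count += 1
    return count


straight_rope = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
single_loop = [(0.0, 0.0), (2.0, 2.0), (2.0, 0.0), (0.0, 2.0)]
print(count_crossings(straight_rope), count_crossings(single_loop))
```

Under these assumptions, a straight rope has zero crossings and the single loop has one. The pairwise check is quadratic in the number of segments, which is acceptable for a short rope but is a choice made for clarity rather than speed.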

Results

We benchmarked general-purpose RL methods from different classes on KnotGym. The results are summarized in the figures below. While RL methods can learn to solve the easiest task (unknot), they struggle to learn the goal-conditioned tasks or to generalize to new goals. In contrast, chain-of-thought reasoning methods produce valid plans yet fail to generate grounded actions. The number of crossings (nx) is a key factor in the difficulty of the tasks and presents a ladder of generalization challenges.

Benchmarking representative methods over nine KnotGym setups. Entries are training split success rates calculated over N rollouts.
Left: Increasing the training pool size challenges RL training (DreamerV3, tie, nx=3). Right: Policies learned via RL generalize to test configurations for the unknot task regardless of the number of crossings. However, RL-tuned policies struggle to learn the goal-conditioned tasks (tie and convert), let alone generalize to unseen configurations.
Example of VLM (gpt4) response to the Open prompt. VLMs are capable of recognizing the abstract goal and generating valid plans for the task, yet during rollouts their actions are too weak and imprecise to produce the desired effect. Including interaction history (Stateful) helps communicate the task dynamics only to a limited extent.

BibTeX

@misc{chen2025knotsimpleminimalisticenvironment,
      title={Knot So Simple: A Minimalistic Environment for Spatial Reasoning}, 
      author={Zizhao Chen and Yoav Artzi},
      year={2025},
      eprint={2505.18028},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.18028}, 
}