We propose KnotGym, an interactive environment for complex spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents.
We benchmarked general-purpose RL methods from different classes on KnotGym. The results are summarized in the figures below. While RL methods can learn to solve the easiest task (unknot), they struggle to learn or generalize to new goals. In contrast, chain-of-thought reasoning methods produce valid plans yet fail to generate grounded actions. The number of crossings (nx) is a key factor in task difficulty and presents a ladder of generalization challenges.
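To make the complexity axis concrete, the sketch below counts the self-crossings of a rope drawn as a 2D polyline. It is only an illustration under stated assumptions: the polyline representation and the names `segments_cross` and `count_crossings` are hypothetical and are not part of KnotGym's API. It also counts crossings in one particular projection and ignores over/under information, so it may differ from how KnotGym itself measures nx.

```python
from itertools import combinations


def _orient(a, b, c):
    """Signed area test: > 0 if a->b->c turns counter-clockwise, < 0 if clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1, p2, q1, q2):
    """Return True if segments p1p2 and q1q2 intersect at one interior point."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    # Each segment's endpoints must lie strictly on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0


def count_crossings(points):
    """Count proper self-crossings of an open polyline given as (x, y) tuples.

    Hypothetical helper for illustration; not KnotGym's definition of nx.
    """
    segments = list(zip(points, points[1:]))
    count = 0
    for (i, s), (j, t) in combinations(enumerate(segments), 2):
        if j - i < 2:
            # Consecutive segments share an endpoint and cannot properly cross.
            continue
        if segments_cross(s[0], s[1], t[0], t[1]):
            count += 1
    return count


straight = [(0, 0), (1, 0), (2, 0)]          # no self-crossing
one_loop = [(0, 0), (2, 2), (2, 0), (0, 2)]  # first and last segments cross
print(count_crossings(straight), count_crossings(one_loop))
```

The pairwise check is quadratic in the number of segments, which is acceptable for a rope discretized into a few dozen points. A faithful crossing count for knots would additionally need depth information to distinguish over-crossings from under-crossings.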
@misc{chen2025knotsimpleminimalisticenvironment,
title={Knot So Simple: A Minimalistic Environment for Spatial Reasoning},
author={Zizhao Chen and Yoav Artzi},
year={2025},
eprint={2505.18028},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.18028},
}