Vision-guided locomotion in snake-like robots presents a formidable challenge, combining complex multi-joint undulation with the demands of real-time perception. Traditional control approaches often split vision processing and locomotion into separate modules, requiring extensive tuning to coordinate them. In contrast, a model-free reinforcement learning (RL) strategy enables a direct mapping from visual input to joint commands, creating an end-to-end perception-action pipeline.

Snake-like robots, inspired by biological snakes, are hyper-redundant mechanisms with numerous degrees of freedom. This structural agility allows them to perform tasks beyond the reach of wheeled or legged robots—search and rescue in collapsed structures, teleoperation in space, and minimally invasive surgery. However, their flexibility complicates control, especially in dynamic environments where model-based methods often fail to adapt.
Vision-guided locomotion is essential for autonomous deployment in unpredictable scenarios. With onboard cameras, snake-like robots can track moving targets and avoid obstacles, critical for disaster response and surveillance. Existing locomotion methods—sinusoid-based, central pattern generator (CPG) models, and dynamics-based control—do not directly address vision-based tracking. Such tasks demand rapid adaptation to unpredictable target trajectories, a domain where RL’s ability to learn complex mappings from perception to motion is advantageous.
The RL-based controller in this work uses a head-mounted RGB camera to capture target position. Due to undulatory motion, images shift horizontally; to reduce complexity, the system extracts a single row of pixels containing the target, converts it to grayscale based on red intensity, and uses this compact representation to infer relative position and distance. Alongside visual data, the controller receives proprioceptive inputs—joint angles, joint velocities, and head module velocity—forming a 49-dimensional observation space.
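The observation construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact split of the 49 dimensions between the pixel row and the proprioceptive signals is an assumption here (30 pixels, 8 joint angles, 8 joint velocities, and a 3-component head velocity), and `build_observation` is a hypothetical helper name.

```python
import numpy as np

def build_observation(frame, joint_angles, joint_velocities, head_velocity, target_row):
    """Assemble a flat observation vector from camera and proprioceptive data.

    frame: HxWx3 RGB image; target_row: index of the image row containing
    the target. The dimension split is illustrative, not from the paper.
    """
    # Take a single image row and use its red channel as a grayscale
    # intensity signal, normalized to [0, 1].
    row = frame[target_row, :, 0].astype(np.float32) / 255.0
    # Concatenate visual input with joint angles, joint velocities,
    # and head-module velocity into one vector.
    return np.concatenate([row, joint_angles, joint_velocities, head_velocity])
```

With a 30-pixel-wide row, 8 joint angles, 8 joint velocities, and a 3-D head velocity, the concatenation yields the 49-dimensional observation.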
Actions correspond to eight joint positions, mapped continuously between −90° and 90°. The reward function encourages maintaining a specified distance from the target, defined by changes in distance before and after each action. Notably, the reward does not explicitly incentivize keeping the target centered in the field of view; the agent learns this behavior implicitly during training.
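The action mapping and distance-based reward can be sketched as below. The reward form shown, the reduction in absolute error against the desired distance, is one plausible reading of "changes in distance before and after each action"; the paper's exact formulation may differ, and both function names are hypothetical.

```python
import math

def action_to_joint_angles(action):
    """Map normalized policy outputs in [-1, 1] to joint targets
    in [-pi/2, pi/2] radians, i.e. -90 deg to 90 deg."""
    return [max(-1.0, min(1.0, a)) * math.pi / 2 for a in action]

def distance_reward(prev_distance, curr_distance, desired_distance=4.0):
    """Reward = how much the last action reduced the error between the
    measured target distance and the desired distance (sketch only)."""
    prev_err = abs(prev_distance - desired_distance)
    curr_err = abs(curr_distance - desired_distance)
    return prev_err - curr_err  # positive when the error shrank
```

Note that nothing here rewards centering the target in the image; as the text observes, that behavior emerges implicitly.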
A fully connected two-hidden-layer neural network approximates the policy, trained using proximal policy optimization (PPO) for its robustness in continuous action spaces. Training occurs in simulation with randomly generated tracks to prevent overfitting. Over three million time steps, the mean reward stabilizes, indicating convergence to effective tracking behavior.
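The policy's forward pass can be sketched as a two-hidden-layer fully connected network. Hidden widths and activations below are illustrative assumptions (the paper does not specify them here), and only inference is shown; PPO training itself would typically use a library such as Stable-Baselines3 rather than hand-rolled updates.

```python
import numpy as np

rng = np.random.default_rng(0)

class PolicyMLP:
    """Two-hidden-layer fully connected policy, forward pass only.
    Hidden size (64) and tanh activations are illustrative choices."""

    def __init__(self, obs_dim=49, act_dim=8, hidden=64):
        self.W1 = rng.normal(0.0, 0.1, (obs_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0.0, 0.1, (hidden, act_dim))
        self.b3 = np.zeros(act_dim)

    def forward(self, obs):
        h = np.tanh(obs @ self.W1 + self.b1)
        h = np.tanh(h @ self.W2 + self.b2)
        # Squash to [-1, 1]; downstream code scales to the ±90° joint range.
        return np.tanh(h @ self.W3 + self.b3)
```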
Testing on four track types—line, wave, zigzag, and random—demonstrates the RL controller's adaptability. The robot's head module follows the target closely, sometimes cutting corners while still maintaining the desired distance. Distance histograms cluster around the 4.0 m setpoint, with small deviations attributable to the oscillatory head motion.
For comparison, a model-based gait-equation controller, derived from sinusoidal undulation patterns, was tuned extensively. It processes the same visual data to estimate the target's lateral position and distance, using a proportional-integral (PI) controller for steering and speed adjustments. Despite this optimization, its tracking accuracy lags behind the RL controller's.
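The baseline's structure can be sketched as a phase-shifted sinusoidal (serpenoid-style) gait with a steering bias supplied by a PI loop on the target's lateral offset. All parameter values and names below are illustrative, not the paper's tuned settings.

```python
import math

def sinusoidal_gait(t, n_joints=8, amplitude=0.6, omega=2.0, phase_lag=0.8, steer=0.0):
    """Gait equation: each joint tracks a phase-shifted sine; a common
    steering bias is added to every joint to turn the body."""
    return [amplitude * math.sin(omega * t + i * phase_lag) + steer
            for i in range(n_joints)]

class PIController:
    """Simple PI loop, e.g. mapping the target's lateral image offset
    to the steering bias fed into the gait equation."""

    def __init__(self, kp, ki, dt):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = 0.0

    def update(self, error):
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral
```

A second PI loop of the same form could adjust speed (e.g. via `omega`) from the estimated distance error.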
Two metrics highlight the difference: direct distance tracking and the Averaged Tracking Error (ATE), which incorporates both distance and angular deviation. The RL controller consistently achieves lower ATE—by roughly 50% on simpler tracks and up to 70% on more complex zigzag and random patterns. Its reduced lag in responding to target motion further enhances accuracy.
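One plausible form of the ATE metric, a per-step weighted sum of distance error and angular deviation averaged over an episode, is sketched below. The weights and the exact combination are assumptions here; the paper's definition may weight or normalize the two terms differently.

```python
def averaged_tracking_error(distances, angles, desired_distance=4.0,
                            w_dist=1.0, w_ang=1.0):
    """Average over the episode of a weighted sum of the distance error
    (vs. the desired distance) and the absolute angular deviation (rad).
    The combination and weights are illustrative assumptions."""
    errs = [w_dist * abs(d - desired_distance) + w_ang * abs(a)
            for d, a in zip(distances, angles)]
    return sum(errs) / len(errs)
```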
Limitations remain. The RL controller, trained in simulation, lacks target recovery behavior when the target leaves the field of view, a capability linked to predictive memory functions. Bridging the simulation-to-reality gap requires careful parameter matching to physical properties and consideration of real-world training challenges, such as resetting mobile robots between episodes. Policy transfer methods from simulation to hardware offer potential but are beyond the current scope.
This approach demonstrates that reinforcement learning can effectively unify perception and locomotion for snake-like robots, outperforming traditional methods in dynamic target tracking. The learned gaits offer a foundation for future work in more complex visual environments, obstacle-rich terrains, and recovery strategies when visual contact is lost.
