CMU Advances Drone Autonomy with Cross-Modal Simulation
Researchers at Carnegie Mellon University have developed a novel training method for autonomous drones that separates perception from control, enabling safe real-world deployment after training entirely in simulation. This two-stage approach addresses the persistent “simulation-to-reality gap” that has hindered performance when transferring skills learned in virtual environments to unpredictable physical settings.

Rogerio Bonatti, a doctoral student in the Robotics Institute at CMU’s School of Computer Science, explained the challenge: “Typically drones trained on even the best photorealistic simulated data will fail in the real world because the lighting, colors and textures are still too different to translate. Our perception module is trained with two modalities to increase robustness against environmental variabilities.”
The first modality focuses on visual data. Using a photorealistic simulator, the team created an environment featuring a drone, a soccer field, and red square gates positioned randomly to form a navigation course. Thousands of configurations of drone and gate positions were generated to build a large dataset of simulated images. The second modality captures the spatial attributes of the gates — their position and orientation — derived from the same dataset. By combining these two streams of information, the perception system gains a more resilient understanding of the environment, one that remains effective despite real-world variations.
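The pairing of the two modalities can be pictured with a small sketch. This is not the team's code: the simulator call that would render each image is left out, and the field names, value ranges, and the `sample_gate_pose` and `build_dataset` helpers are hypothetical, chosen only to show how each simulated frame could be stored alongside the gate's position and orientation.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class GatePose:
    """Gate position and orientation relative to the drone (illustrative fields)."""
    distance_m: float     # range from the drone to the gate centre
    azimuth_rad: float    # horizontal bearing of the gate
    elevation_rad: float  # vertical bearing of the gate
    yaw_rad: float        # gate heading relative to the drone


def sample_gate_pose(rng: random.Random) -> GatePose:
    # All ranges below are assumptions made for this sketch, not values from the study.
    return GatePose(
        distance_m=rng.uniform(3.0, 10.0),
        azimuth_rad=rng.uniform(-math.pi / 4, math.pi / 4),
        elevation_rad=rng.uniform(-math.pi / 8, math.pi / 8),
        yaw_rad=rng.uniform(-math.pi / 2, math.pi / 2),
    )


def build_dataset(num_samples: int, seed: int = 0) -> list:
    """Pair a placeholder image identifier with a randomly sampled gate pose."""
    rng = random.Random(seed)
    samples = []
    for index in range(num_samples):
        pose = sample_gate_pose(rng)
        # A real pipeline would ask the simulator to render a frame for this pose;
        # here only a hypothetical file name stands in for the image.
        samples.append({"image_id": f"frame_{index:06d}", "gate_pose": pose})
    return samples


dataset = build_dataset(1000)
```

Each entry then carries both views of the same scene: what the camera would see, and where the gate actually is.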
Compressing the images so that they contain fewer pixels also helps. Learning from this low-dimensional representation lets the model see past visual noise and focus on essential features, so the drone can identify gates in real-world conditions even when lighting or textures differ from the simulation.
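The simplest version of this idea, reducing the pixel count, can be sketched with block averaging. This is a rough illustration only and should not be read as the team's perception module; the `downsample` function and the toy image are assumptions made for the example.

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2D grayscale image.

    `image` is a list of equal-length rows of numbers. Averaging each block
    shrinks the image and smooths out pixel-level noise.
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    reduced = []
    for row_start in range(0, height, factor):
        reduced_row = []
        for col_start in range(0, width, factor):
            block = [
                image[row_start + dr][col_start + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            reduced_row.append(sum(block) / len(block))
        reduced.append(reduced_row)
    return reduced


# Toy 2 x 4 image: a noisy dark region on the left, a bright region on the right.
noisy = [
    [0, 2, 10, 10],
    [2, 0, 10, 10],
]
small = downsample(noisy, 2)  # one averaged value per 2 x 2 block
```

In the toy image the speckle in the dark region averages out while the contrast between the two regions survives, which is the intuition behind learning from fewer, more informative values.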
Once perception is established, the second stage trains the drone’s control policy within the simulated environment. Here, the system learns how to maneuver physically, that is, which velocity to command when approaching and passing through gates. Because the drone's trajectory can be calculated precisely in simulation, the policy can be optimized without the risks and costs of real-world trial and error. This contrasts with manually supervised training, in which an expert human operator guides the drone, often over extensive and potentially hazardous field sessions.
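One common way to learn a policy from computed trajectories is behavior cloning, in which the learner is fitted to reproduce an expert's commands. The sketch below assumes that setup for illustration; the article does not spell out the training procedure, and the proportional `expert_yaw_rate` controller, its gain, and the single-feature linear policy are all hypothetical simplifications.

```python
import random


def expert_yaw_rate(azimuth_rad, gain=1.5):
    """Hypothetical simulated expert: turn toward the gate in proportion to its bearing."""
    return gain * azimuth_rad


def fit_linear_policy(features, commands):
    """Least-squares fit of command = weight * feature + bias."""
    count = len(features)
    mean_x = sum(features) / count
    mean_y = sum(commands) / count
    variance = sum((x - mean_x) ** 2 for x in features)
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(features, commands)
    )
    weight = covariance / variance
    bias = mean_y - weight * mean_x
    return weight, bias


# Generate demonstrations in "simulation": random gate bearings and the
# expert's command for each. No real flight is needed to collect them.
rng = random.Random(0)
bearings = [rng.uniform(-0.8, 0.8) for _ in range(200)]
expert_commands = [expert_yaw_rate(b) for b in bearings]

weight, bias = fit_linear_policy(bearings, expert_commands)


def learned_policy(azimuth_rad):
    """Policy recovered from the demonstrations."""
    return weight * azimuth_rad + bias
```

Because the expert here is exactly linear, the fitted policy recovers its gain; a real policy would map a learned image representation to a full velocity command rather than a single bearing to a yaw rate.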
Bonatti described his method of shaping the drone’s capabilities: “I make the drone turn to the left and to the right in different track shapes, which get harder as I add more noise. The robot is not learning to recreate going through any specific track. Rather, by strategically directing the simulated drone, it’s learning all of the elements and types of movements to race autonomously.” This targeted training builds a repertoire of maneuvers that can be applied flexibly in new environments.
The research diverges from prior autonomous racing efforts that emphasized speed through additional sensors and specialized software. “Most of the work on autonomous drone racing so far has focused on engineering a system augmented with extra sensors and software with the sole aim of speed. Instead, we aimed to create a computational fabric, inspired by the function of a human brain, to map visual information to the correct control actions going through a latent representation,” Bonatti said.
While drone racing serves as a compelling demonstration, the broader implications of this approach extend to other domains of artificial intelligence. Separating perception and control could benefit applications ranging from autonomous driving to industrial robotics. The perception stage could incorporate alternative modalities — such as sound or shape recognition — to support tasks like wildlife monitoring, vehicle identification, or object classification.
The work was a collaboration between CMU’s Rogerio Bonatti and Sebastian Scherer, along with Ratnesh Madaan, Vibhav Vineet, and Ashish Kapoor of Microsoft Corporation. Their paper, *Learning Visuomotor Policies for Aerial Navigation Using Cross-Modal Representations*, was accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2020. The team has open-sourced the code so that other researchers can build on this cross-modal simulation framework.
