RarePlanes: Synthetic Data Takes Flight

A collaboration between CosmiQ Works and AI.Reverie.

Figure 1: Example of the real and synthetic datasets present in RarePlanes. The top two rows feature the real Maxar WorldView-3 satellite imagery and the bottom two rows show the AI.Reverie synthetic data. The dataset features variable weather conditions, biomes, and ground surface types.


Key Takeaways

RarePlanes is the largest openly available, very-high-resolution dataset built to test the value of synthetic data from an overhead perspective. Experiments with RarePlanes show:

  1. Synthetic data alone can train a robust object detection algorithm, as benchmarked against real-world data.
  2. Fine-tuning the synthetic-only model on 10% of the observed dataset achieved roughly the same results as training on 100% of the observed dataset. This approach would bypass 90% of the manual labeling and collection effort.

Why It Matters

This study shows that synthetic data can dramatically reduce reliance on real data, which is slow, expensive, and often difficult to procure. This opens opportunities for far more rapid and widespread adoption of computer vision technologies across industries.


RarePlanes

Over the past decade, computer vision research and the development of new algorithms have been driven largely by open datasets. However, developing such datasets is often labor-intensive, time-consuming, and costly. An alternative approach is to create computer-generated images and annotations (referred to as synthetic data), a process that can provide thousands of images at very low marginal cost.

Overhead datasets in particular remain one of the best avenues for developing new computer vision methods that can adapt to limited sensor resolution and variable look angles, and that can locate tightly grouped, cluttered objects. Such methods can extend beyond the overhead space and be helpful in domains such as autonomous driving and surveillance.


Varying Conditions and a Difficult Perspective

Creating synthetic datasets from an overhead perspective is a significant challenge: simulators must closely mimic the complexities of a spaceborne or aerial sensor as well as Earth’s ever-changing conditions.

For example, to create a large and heterogeneous synthetic dataset, one must account for:

  • Each sensor’s varying spatial resolution
  • Changes in sensor look angle
  • Time of day of collection
  • Shadowing
  • Changes in illumination due to the sun’s location relative to the sensor
  • Ground appearance due to seasonal change, weather conditions and varying geographies or biomes

Further, the aircraft are difficult to classify because many of them are visually similar, which makes this a fine-grained classification problem.
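
The factors listed above can be thought of as the parameters a simulator randomizes for each rendered image. The sketch below is only an illustration of that idea: the `SceneConfig` class, its field names, the value ranges, and the weather and biome labels are all hypothetical and are not taken from the AI.Reverie simulator or the RarePlanes metadata.

```python
import random
from dataclasses import dataclass

# Hypothetical labels and ranges, chosen only for illustration.
WEATHER = ("clear", "overcast", "rain", "snow")
BIOMES = ("desert", "grassland", "forest", "urban")


@dataclass(frozen=True)
class SceneConfig:
    """Hypothetical rendering parameters for one synthetic overhead image."""
    gsd_m: float              # spatial resolution (ground sample distance), metres per pixel
    off_nadir_deg: float      # sensor look angle away from straight down
    local_hour: float         # time of day of collection
    sun_azimuth_deg: float    # sun direction, affects shadow orientation
    sun_elevation_deg: float  # sun height, affects shadow length and illumination
    weather: str
    biome: str


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one scene configuration.

    Sun position is sampled independently of the hour for brevity; a real
    simulator would derive it from time, date, and location.
    """
    return SceneConfig(
        gsd_m=rng.uniform(0.3, 1.5),
        off_nadir_deg=rng.uniform(0.0, 30.0),
        local_hour=rng.uniform(8.0, 17.0),
        sun_azimuth_deg=rng.uniform(0.0, 360.0),
        sun_elevation_deg=rng.uniform(15.0, 75.0),
        weather=rng.choice(WEATHER),
        biome=rng.choice(BIOMES),
    )


rng = random.Random(0)  # fixed seed so the sampled scenes are reproducible
configs = [sample_scene(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Sampling every factor per image, rather than fixing a handful of presets, is one simple way to obtain the large and heterogeneous dataset described above.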

AI.Reverie’s Configurable Weather Conditions


The Largest Dataset of Its Kind

RarePlanes is the largest openly available, very-high-resolution dataset built to test the value of synthetic data from an overhead perspective. It consists of:

  • Observed data: 253 Maxar WorldView-3 satellite scenes spanning 112 locations and 2,142 km², with 14,700 hand-annotated aircraft. Annotations underwent two rounds of quality control, one by a professional service and one by the study authors.
  • Synthetic data: 50,000 synthetic satellite images with 630,000 aircraft annotations.
  • All data:
    • 10 fine-grained attributes: aircraft length, wingspan, wing shape, wing position, wingspan class, propulsion, number of engines, number of vertical stabilizers, presence of canards, and aircraft role.
    • 33 sub-attribute choices within the categories above.

How We Tested It

We ran three experiments for two tasks: object detection (aircraft) and instance segmentation (the aircraft’s civil role).

For each task, we trained a model on:

  • Observed data only (the benchmark)
  • Synthetic data only
  • Synthetic data, then fine-tuned on roughly 10% of the observed data
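
As a rough illustration of the third setup, the sketch below draws a reproducible 10% subset of observed training images to use for fine-tuning. The function name `split_for_finetuning`, the seed, and the placeholder image ids are assumptions made for this example; the study's actual sampling procedure is not described here.

```python
import random


def split_for_finetuning(image_ids, fraction=0.10, seed=0):
    """Return (finetune_ids, held_back_ids) from a list of observed image ids."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ids = sorted(image_ids)           # sort so the result does not depend on input order
    random.Random(seed).shuffle(ids)  # seeded shuffle keeps the subset reproducible
    n_keep = max(1, round(len(ids) * fraction))
    return ids[:n_keep], ids[n_keep:]


# Placeholder ids standing in for observed training images.
observed_train = [f"scene_{i:04d}" for i in range(1000)]
finetune_ids, held_back_ids = split_for_finetuning(observed_train)

print(len(finetune_ids), "images kept for fine-tuning")
print(len(held_back_ids), "images that would not need labeling")
```

The held-back portion corresponds to the labeling effort that the fine-tuning approach avoids.
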

AI.Reverie’s Simulated Airports


Synthetic Data Effective Alone and in Combination With Real Data

The mean average precision (mAP) scores show that the synthetic dataset alone is enough to build an accurate airplane detection model. Most importantly, when a small subset of real data is added for fine-tuning, we observe a significant gain in mAP, leading to performance on par with the model trained on the real dataset only. Training this way would bypass 90% of the manual labeling effort.
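
For readers unfamiliar with the metric, mAP is the mean of per-class average precision (AP). The sketch below computes AP for one class at a single IoU threshold, using greedy matching of detections to ground-truth boxes and the area under the precision-recall envelope. It is a simplified illustration, not the evaluation code used in the study; the threshold of 0.5, the function names, and the toy boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_thresh=0.5):
    """AP for one class. detections: list of (score, box); ground_truth: list of boxes."""
    if not ground_truth:
        return 0.0
    matched, hits = set(), []
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_thresh:
            matched.add(best_j)
            hits.append(True)   # true positive
        else:
            hits.append(False)  # false positive
    # Precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / len(ground_truth))
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the envelope.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap):
    """Mean of per-class AP values, given as a dict of class name to AP."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# Toy example: two aircraft, three detections (one is a false positive).
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
ap = average_precision(dets, gt_boxes)
print(round(ap, 3))  # 0.833
```

Benchmarks often average AP over several IoU thresholds as well; this sketch uses a single threshold to keep the idea visible.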



About AI.Reverie

AI.Reverie is a simulation platform that trains AI to understand the world. It offers a suite of synthetic data and vision APIs to help businesses across different industries train their machine learning algorithms and improve their AI applications, along with benchmarking services to measure the impact.

About CosmiQ Works

Founded in 2015 as a technology challenge lab within In-Q-Tel (IQT), CosmiQ Works is focused on developing, prototyping, and evaluating emerging open source artificial intelligence capabilities for geospatial use cases. CosmiQ Works helps accelerate development and adoption of these technologies into deployable products.