Thursday, November 27, 2025

End-to-End Object Detection with Transformers for Aerial Drone Images


Abstract

We present a novel approach to object detection in aerial drone imagery by extending the end-to-end detection paradigm introduced by DETR to the unique challenges of high-altitude, wide-area visual data. Traditional aerial detection pipelines rely heavily on handcrafted components such as anchor generation, multi-scale feature pyramids, and non-maximum suppression to handle the variability of object sizes and densities. Our method, DroneDETR, eliminates these components by framing detection as a direct set prediction problem. Leveraging a transformer encoder-decoder architecture, DroneDETR reasons globally about spatial context and object relations, while a bipartite matching loss enforces unique assignments between predictions and ground truth. We demonstrate that this approach achieves competitive accuracy compared to established baselines on aerial datasets, particularly excelling in large-scale geospatial scenes where contextual reasoning is critical. Furthermore, DroneDETR generalizes naturally to segmentation tasks, enabling unified panoptic analysis of aerial imagery. We provide code and pretrained models to encourage adoption in the aerial analytics community.

Introduction

Aerial drone imagery has become a cornerstone of modern geospatial analytics, with applications ranging from urban planning and agriculture to disaster response and wildlife monitoring. The task of object detection in this domain is particularly challenging due to the wide range of object scales, the frequent occlusions caused by environmental structures, and the need to process large images efficiently. Conventional detectors approach this problem indirectly, relying on anchors, proposals, or grid centers to generate candidate regions. These methods are sensitive to the design of anchors and require extensive postprocessing, such as non-maximum suppression, to eliminate duplicate predictions.

Inspired by advances in end-to-end structured prediction tasks such as machine translation, we propose a direct set prediction approach for aerial object detection. Our model, DroneDETR, adapts the DETR framework to aerial imagery by combining a convolutional backbone with a transformer encoder-decoder. The model predicts all objects simultaneously, trained with a bipartite matching loss that enforces one-to-one correspondence between predictions and ground truth. This design removes the need for anchors and postprocessing, streamlining the detection pipeline.

DroneDETR is particularly well-suited to aerial benchmarks such as DOTA (Dataset for Object Detection in Aerial Images) because transformers excel at modeling long-range dependencies. In aerial scenes, objects such as vehicles, buildings, or trees often appear in structured spatial arrangements, and global reasoning is essential to distinguish them from background clutter. Our experiments show that DroneDETR achieves strong performance on aerial datasets, outperforming baselines on large-object detection while maintaining competitive accuracy on small objects.

Related Work

Object detection in aerial imagery has traditionally relied on adaptations of ground-level detectors such as Faster R-CNN or YOLO. These methods incorporate multi-scale feature pyramids to handle the extreme variation in object sizes, from small pedestrians to large buildings. However, their reliance on anchors and heuristic assignment rules introduces complexity and limits generalization.

Set prediction approaches, such as those based on bipartite matching losses, provide a more principled solution by enforcing permutation invariance and eliminating duplicates. DETR pioneered this approach in natural images, demonstrating that transformers can replace handcrafted components. In aerial imagery, several works have explored attention mechanisms to capture spatial relations, but most still rely on anchors or proposals. DroneDETR builds on DETR by applying parallel decoding transformers to aerial data, enabling efficient global reasoning across large-scale scenes.

The DroneDETR Model

DroneDETR consists of three main components: a CNN backbone, a transformer encoder-decoder, and feed-forward prediction heads. The backbone extracts high-level features from aerial images, which are often large and require downsampling for computational efficiency. These features are flattened and supplemented with positional encodings before being passed to the transformer encoder.
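To make the data flow concrete, below is a minimal PyTorch sketch of this stage. The ResNet-50 backbone, the 256-dimensional model width, and the sine positional encodings are assumptions carried over from the original DETR recipe rather than confirmed DroneDETR details.

    import torch
    import torch.nn as nn
    import torchvision

    class FlattenedBackbone(nn.Module):
        """ResNet-50 features projected to the transformer width, then
        flattened into a (sequence, batch, channel) tensor."""
        def __init__(self, d_model=256):
            super().__init__()
            resnet = torchvision.models.resnet50(weights=None)
            self.body = nn.Sequential(*list(resnet.children())[:-2])  # drop pool/fc
            # 1x1 conv projects 2048 backbone channels down to the model width.
            self.proj = nn.Conv2d(2048, d_model, kernel_size=1)

        def forward(self, images):                    # images: (B, 3, H, W)
            feat = self.proj(self.body(images))       # (B, d_model, H/32, W/32)
            h, w = feat.shape[-2:]
            # Flatten spatial dims into a sequence for the transformer encoder.
            return feat.flatten(2).permute(2, 0, 1), (h, w)   # (h*w, B, d_model)

    def sine_positional_encoding(h, w, d_model=256):
        """2D sine-cosine encoding; one d_model vector per feature-map cell."""
        d = d_model // 4                              # quarter each for sin/cos of y/x
        freq = 1.0 / (10000 ** (torch.arange(d, dtype=torch.float32) / d))
        ys = torch.arange(h, dtype=torch.float32)[:, None] * freq    # (h, d)
        xs = torch.arange(w, dtype=torch.float32)[:, None] * freq    # (w, d)
        pe_y = torch.cat([ys.sin(), ys.cos()], 1)[:, None, :].expand(h, w, 2 * d)
        pe_x = torch.cat([xs.sin(), xs.cos()], 1)[None, :, :].expand(h, w, 2 * d)
        return torch.cat([pe_y, pe_x], -1).reshape(h * w, 1, d_model)

    # Usage: flatten a 512x512 aerial crop and build its positional encoding.
    seq, (h, w) = FlattenedBackbone()(torch.rand(1, 3, 512, 512))
    pos = sine_positional_encoding(h, w)              # broadcasts over the batch dim

The flattened sequence and its positional encoding are exactly what the encoder consumes in the next stage.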

The encoder models global interactions across the entire image, capturing contextual relations between distant objects. The decoder operates on a fixed set of learned object queries, each attending to the encoder output to produce predictions. Unlike autoregressive models, DroneDETR decodes all objects in parallel, ensuring scalability for large aerial scenes.
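A sketch of this encoder-decoder stage follows. For brevity the positional encoding is added once at the input (DETR injects it at every attention layer), and the 100 object queries, 8 heads, and 6 layers are assumed hyperparameters borrowed from DETR.

    import torch
    import torch.nn as nn

    class DetectionTransformer(nn.Module):
        """Encoder-decoder that decodes all learned object queries in parallel."""
        def __init__(self, d_model=256, nhead=8, num_layers=6, num_queries=100):
            super().__init__()
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=num_layers, num_decoder_layers=num_layers)
            # Each learned query slot tends to specialize in certain
            # locations and object sizes across the dataset.
            self.query_embed = nn.Embedding(num_queries, d_model)

        def forward(self, seq, pos):
            # seq, pos: (HW, B, d_model) from the backbone stage.
            src = seq + pos
            tgt = self.query_embed.weight[:, None, :].expand(-1, seq.size(1), -1)
            # No causal mask: every query attends at once (non-autoregressive).
            return self.transformer(src, tgt)         # (num_queries, B, d_model)

Because decoding is a single forward pass over all queries, inference cost does not grow with the number of objects in the scene.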

Predictions are generated by feed-forward networks that output bounding box coordinates and class labels. A special “no object” class handles empty slots, allowing the model to predict a fixed-size set larger than the actual number of objects. Training is guided by a bipartite matching loss, computed via the Hungarian algorithm, which enforces unique assignments between predictions and ground truth. The loss combines classification terms with a bounding box regression term based on a linear combination of L1 and generalized IoU losses; the scale-invariant generalized IoU term offsets the L1 loss's sensitivity to object size, which matters given the diverse object scales in aerial scenes.
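The matching step itself is compact. The sketch below runs SciPy's Hungarian solver on a simplified cost combining negative class probability with an L1 box distance; the weight of 5.0 is an assumption, and the generalized IoU term is omitted from the cost for brevity.

    import torch
    from scipy.optimize import linear_sum_assignment

    @torch.no_grad()
    def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
        """Single image. pred_logits: (Q, C+1), pred_boxes: (Q, 4) in cxcywh,
        gt_labels: (G,), gt_boxes: (G, 4). Returns matched index pairs."""
        prob = pred_logits.softmax(-1)                       # (Q, C+1)
        cost_class = -prob[:, gt_labels]                     # (Q, G)
        cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (Q, G) L1 distance
        cost = cost_class + 5.0 * cost_bbox                  # weight is an assumption
        pred_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
        return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

Queries left unmatched by the assignment are supervised toward the “no object” class, while matched pairs receive the classification and box regression losses described above.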

Experiments

We evaluate DroneDETR on aerial datasets such as DOTA and VisDrone, which contain diverse scenes with varying object densities and scales. Training follows the DETR protocol, using AdamW optimization and long schedules to stabilize transformer learning. We compare DroneDETR against Faster R-CNN and RetinaNet baselines adapted for aerial imagery.
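For reference, the DETR-style optimizer setup this protocol implies looks roughly like the following; the learning rates, weight decay, and schedule are illustrative assumptions, not reported DroneDETR values.

    import torch
    import torch.nn as nn

    # Stand-in model: assume `backbone` and `transformer` are the modules
    # sketched earlier; a small stub keeps this snippet self-contained.
    model = nn.ModuleDict({
        "backbone": nn.Conv2d(3, 256, kernel_size=1),
        "transformer": nn.Linear(256, 256),
    })

    # DETR recipe: the pretrained backbone trains with a 10x smaller rate.
    backbone = [p for n, p in model.named_parameters() if n.startswith("backbone")]
    rest = [p for n, p in model.named_parameters() if not n.startswith("backbone")]
    optimizer = torch.optim.AdamW(
        [{"params": backbone, "lr": 1e-5}, {"params": rest, "lr": 1e-4}],
        weight_decay=1e-4)
    # Long schedule with a single learning-rate drop late in training.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)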

Results show that DroneDETR achieves comparable mean average precision to tuned baselines, with notable improvements in detecting large-scale objects such as buildings and vehicles. Performance on small objects, such as pedestrians, is lower, reflecting the limitations of global attention at fine scales. However, incorporating dilated backbones improves small-object detection, at the cost of higher computational overhead.
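The dilated-backbone variant mirrors DETR's DC5 trick: replacing the stride of the last ResNet stage with dilation doubles the feature-map resolution, which helps small objects but enlarges the sequence the encoder must attend over. In torchvision this is a one-line change, shown here as an assumed configuration.

    import torchvision

    # Assumed DC5-style configuration: dilate the last stage instead of
    # striding, so the final feature map keeps stride-16 resolution.
    resnet_dc5 = torchvision.models.resnet50(
        weights=None, replace_stride_with_dilation=[False, False, True])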

Qualitative analysis highlights DroneDETR’s ability to reason globally about spatial context, correctly distinguishing vehicles in crowded parking lots and separating overlapping structures without reliance on non-maximum suppression. Furthermore, extending DroneDETR with a segmentation head enables unified panoptic segmentation, outperforming baselines in pixel-level recognition tasks.

Conclusion

We have introduced DroneDETR, an end-to-end transformer-based detector for aerial drone imagery. By framing detection as a direct set prediction problem, DroneDETR eliminates anchors and postprocessing, simplifying the pipeline while enabling global reasoning. Our experiments demonstrate competitive performance on aerial datasets, with particular strengths in large-object detection and contextual reasoning. Future work will focus on improving small-object detection through multi-scale attention and exploring real-time deployment on edge devices for autonomous drone platforms.

