Friday, April 25, 2025

Drone Imagery Processing

We mentioned that the drone video sensing platform DFCS comprises an image processor, an analytical engine, and a drone router, and that the image processor creates KeyPoint vectors, where each KeyPoint is a tuple of a pixel position and a feature descriptor of the patch around that pixel, which is then translated into world co-ordinates and time-lapse information for that location. This article explains some of the tenets of the image processor.
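As a rough illustration, the sketch below shows what such a KeyPoint record could look like; the field names (pixel_xy, descriptor, world_lla, observed_at) are assumptions made for this article and not the platform's actual schema.

from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class KeyPointRecord:
    # pixel position of the KeyPoint in the source frame
    pixel_xy: Tuple[float, float]
    # SIFT feature descriptor of the patch around the pixel
    descriptor: np.ndarray
    # world co-ordinates (longitude, latitude, height) after frame alignment
    world_lla: Tuple[float, float, float]
    # capture timestamp, for the time-lapse information at that location
    observed_at: float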

One of the main requirements of the image processor is fast frame alignment. Given that the images can come from any unit of the UAV swarm and from any position, aligning the video frames is essential for the subsequent tasks of object detection and change-tracking. These three tasks are carried out by operators in an image pipeline fed with images from the drones' sensors. The first flight around the region specified by the user provides most of the survey of the landscape and brings in images from various vantage points; most of these images are top-down imagery from this first video.
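As a rough sketch, one way to chain such operators over incoming frames is shown below; the operator names and the use of OpenCV's SIFT here are illustrative assumptions, not the platform's actual API.

import cv2

def to_grayscale(frame):
    # normalize the sensor image for feature extraction
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

def extract_keypoints(gray):
    # per-frame SIFT KeyPoints and descriptors
    sift = cv2.SIFT_create()
    return sift.detectAndCompute(gray, None)

def run_pipeline(frames):
    # each operator consumes the previous operator's output; downstream operators
    # (frame alignment, object detection, change-tracking) would attach here
    for frame in frames:
        gray = to_grayscale(frame)
        keypoints, descriptors = extract_keypoints(gray)
        yield keypoints, descriptors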

Frame alignment computes a mapping from each pixel to world co-ordinates (longitude, latitude, height). Object detection and change-tracking encode the structured information obtained from the images, and machine learning models extract information from the video. Frame alignment efficiently combines GPS and compass readings with image features, and there is no need to compute or store intermediate or output images in this processing.

SIFT feature extraction derives KeyPoints in each video frame. The KeyPoints that describe the same world location, such as a road divider or a chimney, are then grouped together in two phases: first, stable groups are formed from the KeyPoints in multiple top-down images within a segment of video from an aerial flight over the world location; second, global groups are created by merging stable groups that describe the same world location. This consolidates all KeyPoints pertaining to a given world location. A video frame is then aligned by matching the SIFT KeyPoints computed in that single frame against the global groups, and this matching is used to estimate the drone's position and orientation when it captured the frame. In short, SIFT yields KeyPoints, grouping yields the KeyPoints corresponding to the same world location, and frame alignment yields position and orientation.

Grouping is iterative and starts with an empty set. For each frame, a KeyPoint is matched against an existing group based on two conditions: 1. the distance between the KeyPoint's descriptor and the mean of the descriptors in the group must fall below a threshold, and 2. the pixel position of the most recent KeyPoint in the group, when transformed via optical flow, must fall within a small threshold of the candidate KeyPoint's pixel position. Closeness is measured by Euclidean distance, and the optical-flow transformation uses the Lucas-Kanade method. If there is no match, the KeyPoint becomes a new group with a single member. Both the existing and the newly created groups are added to the set of global groups.
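A condensed sketch of this grouping loop is given below, using OpenCV's SIFT and Lucas-Kanade optical flow; the Group class, the threshold values, and the function name are assumptions made for illustration rather than the platform's actual code.

import cv2
import numpy as np

DESC_THRESH = 150.0   # assumed threshold on descriptor distance
PIX_THRESH = 3.0      # assumed threshold on pixel distance (Euclidean)

class Group:
    def __init__(self, pt, desc):
        self.points = [pt]            # pixel positions, most recent last
        self.descriptors = [desc]

    def mean_descriptor(self):
        return np.mean(self.descriptors, axis=0)

def group_frame(prev_gray, gray, groups):
    # Extract SIFT KeyPoints and descriptors for this frame.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:
        return groups

    # Predict where each group's most recent KeyPoint moved, via Lucas-Kanade optical flow.
    if groups:
        prev_pts = np.float32([g.points[-1] for g in groups]).reshape(-1, 1, 2)
        next_pts, _, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
        predicted = next_pts.reshape(-1, 2)
    else:
        predicted = np.empty((0, 2), dtype=np.float32)

    for kp, desc in zip(keypoints, descriptors):
        pt = np.float32(kp.pt)
        matched = False
        # zip stops at the groups that existed before this frame, so groups created
        # within this loop only become match candidates on the next frame
        for g, pred in zip(groups, predicted):
            close_desc = np.linalg.norm(desc - g.mean_descriptor()) < DESC_THRESH
            close_pix = np.linalg.norm(pt - pred) < PIX_THRESH
            if close_desc and close_pix:
                g.points.append(tuple(pt))
                g.descriptors.append(desc)
                matched = True
                break
        if not matched:
            # no match: the KeyPoint starts a new group with a single member
            groups.append(Group(tuple(pt), desc))
    return groups

A fuller version would also filter out points whose optical-flow status indicates a tracking failure; the sketch omits that for brevity.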

After this aggregation into groups, GPS and compass readings are used to determine the world co-ordinates of the stable groups. To merge stable groups into global groups, the co-ordinates of a global group are computed as the average of those of its constituent stable groups, and the optical-flow constraint is replaced by a position-estimate similarity constraint, which requires the least-squares error between the position estimates to be below a threshold.
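The following is a minimal sketch of that merge step, under the assumption that each stable group already carries a world-position estimate and its descriptors; the dictionary layout and both thresholds are illustrative, not the platform's actual representation.

import numpy as np

DESC_THRESH = 150.0    # assumed descriptor-distance threshold, carried over from grouping
POS_ERR_THRESH = 4.0   # assumed least-squares error threshold on position estimates

def merge_stable_groups(stable_groups, global_groups):
    # stable_groups: list of dicts with 'world_xyz' (position estimate from GPS/compass
    # readings) and 'descriptors' (list of SIFT descriptors); global_groups starts empty.
    for sg in stable_groups:
        sg_pos = np.asarray(sg["world_xyz"], dtype=float)
        sg_mean = np.mean(sg["descriptors"], axis=0)
        merged = False
        for gg in global_groups:
            close_desc = np.linalg.norm(sg_mean - np.mean(gg["descriptors"], axis=0)) < DESC_THRESH
            # position-estimate similarity: least-squares error below a threshold
            pos_err = np.sum((sg_pos - gg["world_xyz"]) ** 2)
            if close_desc and pos_err < POS_ERR_THRESH:
                gg["descriptors"].extend(sg["descriptors"])
                gg["member_positions"].append(sg_pos)
                # the global group's co-ordinates are the average of its stable groups'
                gg["world_xyz"] = np.mean(gg["member_positions"], axis=0)
                merged = True
                break
        if not merged:
            global_groups.append({
                "world_xyz": sg_pos,
                "descriptors": list(sg["descriptors"]),
                "member_positions": [sg_pos],
            })
    return global_groups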

