Tuesday, November 4, 2025

 Transient and Transit objects in aerial drone scene sequences 

Time, location, and frequency are the dimensions we would ideally like to capture for each object we detect in an aerial drone image scene and its sequences. Objects, however, do not have a generalized signature and often require training and deep-learning supervision to detect, especially transient and transit objects such as pedestrians and vehicles. Given several pedestrians and vehicles in scene sequences, we apply Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which works robustly on clusters of different shapes, does not require the number of clusters in advance, and makes it especially easy to filter out noise. 
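A minimal sketch of this clustering step, assuming detections have already been reduced to (x, y) centroids and using scikit-learn's DBSCAN with illustrative (untuned) parameters:

# Cluster detected pedestrian/vehicle centroids pooled across a scene sequence.
# eps and min_samples are illustrative values, not tuned for any dataset.
import numpy as np
from sklearn.cluster import DBSCAN

detections = np.array([
    [105.0, 220.0], [107.5, 222.0], [110.0, 219.5],   # one pedestrian group
    [480.0, 310.0], [482.0, 312.5],                    # a passing vehicle
    [900.0,  50.0],                                    # isolated point (likely noise)
])

clustering = DBSCAN(eps=10.0, min_samples=2).fit(detections)
labels = clustering.labels_          # -1 marks points filtered out as noise
print(labels)                        # e.g. [0 0 0 1 1 -1]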

Each image clip sequence is preprocessed to construct a high-quality, neural-network-friendly representation. For each frame, we extract three features: normalized spectral signatures, estimated uncertainty (e.g., motion blur or sensor noise), and timestamp. Spectral signature values are converted from log scale to linear flux using calibration constants derived from the UAV sensor specifications. We then subtract the median and standardize using the interquartile range (IQR), followed by compression into the [-1, 1] range using the arcsinh function. 

Time values are normalized to [0, 1] based on the total observation window, typically spanning 10–30 seconds. Uncertainty values are similarly rescaled and compressed to match the flux scale. The final input tensor for each sequence is a matrix of shape T × 3, where T is the number of frames, and each row contains spectral signature, uncertainty, and timestamp. 
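A sketch of this preprocessing under stated assumptions: the log-to-flux conversion uses a magnitude-style zero point (a placeholder, not a real sensor constant), and the final rescaling into [-1, 1] divides by the maximum absolute arcsinh value, which is one plausible reading of the step described above.

import numpy as np

def preprocess_sequence(log_signal, uncertainty, timestamps, zero_point=25.0):
    # log-scale signature -> linear flux (magnitude-style conversion; zero_point is a placeholder)
    flux = 10.0 ** (-0.4 * (log_signal - zero_point))

    # robust standardization: subtract the median, divide by the interquartile range
    q25, q75 = np.percentile(flux, [25, 75])
    flux = (flux - np.median(flux)) / (q75 - q25 + 1e-9)

    # arcsinh compression, then rescale into [-1, 1]
    flux = np.arcsinh(flux)
    flux = flux / (np.max(np.abs(flux)) + 1e-9)

    # uncertainty rescaled and compressed to match the flux scale
    unc = np.arcsinh(uncertainty / (q75 - q25 + 1e-9))
    unc = unc / (np.max(np.abs(unc)) + 1e-9)

    # time normalized to [0, 1] over the observation window
    t = (timestamps - timestamps.min()) / (timestamps.max() - timestamps.min() + 1e-9)

    return np.stack([flux, unc, t], axis=1)   # final tensor of shape (T, 3)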

This representation ensures any model can handle sequences of varying length and sampling rate, a critical requirement for aerial deployments where cadence may fluctuate due to flight path, altitude, or environmental conditions. 

The model that we use to leverage the frequency domain has three core components: 
Wavelet Decomposition: A one-dimensional discrete wavelet transform (DWT) is applied to the spectral signature vector to suppress noise and highlight localized changes. This is particularly effective in identifying transient objects that appear briefly and then vanish. 

Finite-Embedding Fourier Transform (FEFT): A modified discrete Fourier transform is applied to the time series to extract periodic and harmonic features. FEFT enables detection of transit-like behavior, such as vehicles passing through occluded regions or pedestrians crossing paths. 

Convolutional Neural Network (CNN): The frequency-domain tensor is passed through a series of convolutional and fully connected layers, which learn to discriminate between the four object states. The model is trained using a categorical cross-entropy loss function and optimized with Adam. 

Each entry of the output vector v is called a logit and represents the predicted likelihood that the object belongs to each class (we use v0 = null, v1 = transient, v2 = stable, v3 = transit). The logits are converted into class probabilities with the softmax function

p_i = exp(v_i) / Σ_j exp(v_j)

and compared to the one-hot target vectors t with the categorical cross-entropy loss

L(t, p) = − Σ_i t_i · log(p_i)
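For concreteness, a small numeric sketch of this comparison, with made-up logit values:

import numpy as np

logits = np.array([0.2, 2.1, -0.5, 0.4])          # v0=null, v1=transient, v2=stable, v3=transit
target = np.array([0.0, 1.0, 0.0, 0.0])           # one-hot: true class is "transient"

probs = np.exp(logits - logits.max())             # subtract max for numerical stability
probs /= probs.sum()

loss = -np.sum(target * np.log(probs + 1e-12))    # categorical cross-entropy
print(probs.round(3), round(float(loss), 3))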

We apply the DFT on an N-long signal x(n) as follows:

X(k) = Σ_{n=0}^{N−1} x(n) · e^{−i·2π·k·n/N},  for k = 0, 1, …, N−1
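Written as a matrix product, which is the form the next paragraph factorizes, the DFT can be checked against NumPy's FFT in a few lines:

import numpy as np

N = 8
x = np.random.default_rng(0).normal(size=N)

n = np.arange(N)
k = n.reshape(-1, 1)
W = np.exp(-1j * 2 * np.pi * k * n / N)   # DFT matrix, entry (k, n)

X = W @ x
assert np.allclose(X, np.fft.fft(x))      # matches NumPy's FFT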

To extract features from variable-length sequences, we introduce a vector u by writing the exponent of the DFT matrix as the outer product u ⊗ v, with u being a parameter of the model and its dimension a hyperparameter named samples. When u_k ranges over the integers from 0 to N−1, the transform samples the same N frequencies as the DFT. We do not modify vector v but construct it as multiples of the factor 2π/N, up to the length of the input vector. 

Then we initialize the finite-embedding Fourier Transform as

FEFT(x) = exp(−i · (u ⊗ v)) · x,

where exp(·) acts element-wise, so the transform matrix has entries e^{−i·u_k·v_n}.

The outer product and the element-wise exponentiation of the resulting matrix are fast operations. 
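A minimal sketch of this construction; u is initialized here to the integer DFT frequencies so the output reduces to the ordinary DFT, but in the model it would be a trainable parameter of length samples:

import numpy as np

def feft(x, u):
    N = len(x)
    v = 2 * np.pi * np.arange(N) / N          # fixed multiples of 2*pi/N
    F = np.exp(-1j * np.outer(u, v))          # outer product + element-wise exponentiation
    return F @ x                              # shape: (samples,)

x = np.random.default_rng(1).normal(size=16)
u = np.arange(16).astype(float)               # samples == N, so this reduces to the DFT
assert np.allclose(feft(x, u), np.fft.fft(x))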

The DWT comprises two operations: a high-pass filtering and a low-pass filtering. Applying the wavelet decomposition to the signal yields two sets of coefficients: one representing a downscaled, smoothed version of the function (the approximation) and the other representing its variations (the detail), and the two are taken as biorthogonal. 
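A single-level DWT sketch using PyWavelets; the choice of the 'bior1.3' biorthogonal wavelet is an assumption, since no specific wavelet is named here:

import numpy as np
import pywt

signal = np.random.default_rng(2).normal(size=64)

approx, detail = pywt.dwt(signal, 'bior1.3')   # low-pass (approximation) and high-pass (detail)
print(approx.shape, detail.shape)              # each roughly half the input length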

Samples from the DOTA dataset are taken to ensure generalization across diverse environments. The model is trained on a four-class scheme: null (no object), transient (brief appearance), stable (persistent presence), and transit (periodic occlusion or movement). 
#codingexercise: CodingExercise-11-04-2025.docx

Monday, November 3, 2025

 Transient and transit object detection in aerial drone images: 

Introduction: 

The increasing availability of high-resolution aerial imagery from unmanned aerial vehicles (UAVs) presents a unique opportunity for time-domain object detection. Unlike traditional satellite imagery, UAVs offer flexible sampling rates, dynamic perspectives, and real-time responsiveness. However, the irregular cadence and noise inherent in aerial sequences pose challenges for conventional object detection pipelines, especially when attempting to identify transient or fleeting objects such as pedestrians, vehicles, or small mobile assets. 

Machine learning techniques have become indispensable in aerial image analysis, particularly in large datasets where manual annotation is infeasible. Convolutional neural networks (CNNs) have been widely adopted for static object detection, but their performance degrades when applied to temporally sparse or noisy sequences. Prior work has explored phase-folding and frame-by-frame tracking, but these methods are computationally expensive and sensitive to sampling irregularities. 

This paper introduces DroneWorldNet, a frequency-domain model that bypasses the limitations of traditional tracking by transforming image clip vectors into frequency-domain tensors. DroneWorldNet applies a discrete wavelet transform (DWT) to suppress noise and highlight localized changes, followed by a finite-embedding Fourier transform (FEFT) to extract periodic and harmonic features across time. These tensors are then classified into one of four object states: null (no object), transient (brief appearance), stable (persistent presence), or transit (periodic occlusion or movement). 

We apply DroneWorldNet to the DOTA dataset, which contains annotated aerial scenes from diverse environments. Each image clip is treated as a temporal stack, and the model is trained on both real and synthetic sequences to ensure robustness across lighting, altitude, and occlusion conditions. The pipeline includes spatial clustering, data normalization, and tensor construction, followed by classification using CNN and fully connected layers. 

DroneWorldNet achieves subsecond inference latency and high classification accuracy, demonstrating its suitability for real-time deployment in edge-cloud UAV systems. This work lays the foundation for a full-scale variability survey of aerial scenes and opens new avenues for time-domain analysis in geospatial workflows. 

Data Preprocessing: 

Each image clip sequence is preprocessed to construct a high-quality, neural-network-friendly representation. For each frame, we extract three features: normalized brightness, estimated uncertainty (e.g., motion blur or sensor noise), and timestamp. Brightness values are converted from log scale to linear flux using calibration constants derived from the UAV sensor specifications. We then subtract the median and standardize using the interquartile range (IQR), followed by compression into the [-1, 1] range using the arcsinh function. 

Time values are normalized to [0, 1] based on the total observation window, typically spanning 10–30 seconds. Uncertainty values are similarly rescaled and compressed to match the flux scale. The final input tensor for each sequence is a matrix of shape T × 3, where T is the number of frames, and each row contains brightness, uncertainty, and timestamp. 

This representation ensures that DroneWorldNet can handle sequences of varying length and sampling rate, a critical requirement for aerial deployments where cadence may fluctuate due to flight path, altitude, or environmental conditions. 

DroneWorldNet model: 

DroneWorldNet is a hybrid signal-processing and deep learning model designed to classify aerial image sequences into four object states: null, transient, stable, and transit. The model architecture integrates three core components: 

Wavelet Decomposition: A one-dimensional discrete wavelet transform (DWT) is applied to the brightness vector to suppress noise and highlight localized changes. This is particularly effective in identifying transient objects that appear briefly and then vanish. 

Finite-Embedding Fourier Transform (FEFT): A modified discrete Fourier transform is applied to the time series to extract periodic and harmonic features. FEFT enables detection of transit-like behavior, such as vehicles passing through occluded regions or pedestrians crossing paths. 

Convolutional Neural Network (CNN): The frequency-domain tensor is passed through a series of convolutional and fully connected layers, which learn to discriminate between the four object states. The model is trained using a categorical cross-entropy loss function and optimized with Adam. 
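A minimal sketch of such a classifier in PyTorch; the layer sizes and kernel widths are placeholders, not DroneWorldNet's actual architecture:

import torch
import torch.nn as nn

class FrequencyCNN(nn.Module):
    def __init__(self, in_channels=3, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),           # tolerates variable-length sequences
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8, 64), nn.ReLU(),
            nn.Linear(64, n_classes),          # logits for null/transient/stable/transit
        )

    def forward(self, x):                      # x: (batch, channels, length)
        return self.classifier(self.features(x))

model = FrequencyCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                # categorical cross-entropy over the logits

x = torch.randn(4, 3, 128)                     # toy batch of frequency-domain tensors
labels = torch.tensor([0, 1, 2, 3])
optimizer.zero_grad()
loss = loss_fn(model(x), labels)
loss.backward()
optimizer.step()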

 

Training and Evaluation: 

To train DroneWorldNet, we generate synthetic aerial sequences using motion simulation models that replicate pedestrian and vehicle dynamics under varying conditions. These include changes in lighting, altitude, occlusion, and background texture. Synthetic sequences are blended with real samples from the DOTA dataset to ensure generalization across diverse environments. 

The model is trained on a four-class scheme: null (no object), transient (brief appearance), stable (persistent presence), and transit (periodic occlusion or movement). On a held-out validation set, DroneWorldNet achieves an F1 score of 0.89, with precision and recall exceeding 0.90 for stable and transit classes. Transient detection remains challenging due to low signal-to-noise ratio, but wavelet decomposition significantly improves sensitivity. 
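The validation metrics above can be computed with scikit-learn; the label arrays here are placeholders, not actual model output:

from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 3, 3, 1, 0]      # 0=null, 1=transient, 2=stable, 3=transit
y_pred = [0, 1, 2, 2, 3, 1, 1, 0]

print(f1_score(y_true, y_pred, average="macro"))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2, 3], zero_division=0)
print(precision, recall, f1)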


#codingexercise: CodingExercise-11-03-2025.docx