Saturday, October 25, 2025

The previous article explained how the software architecture shifts toward deep learning, cataloguing, and agentic-retrieval-based analytics of a selected image set, rather than image processing of every frame extracted from an aerial drone video. The model used to interpret objects in scenes influences the depth and breadth of the drone world catalog. In the example below, we use a YOLOv5m model pretrained on DOTA (an aerial imagery dataset) to improve the detections for our purpose. 

The model can be downloaded as: 

and the Python code to detect the objects in a scene can be as follows: 

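import subprocess

def run_yolov5_obb_inference(image_path, weights_path="weights/yolov5m.pt", output_dir="runs/detect"):
    # Build the command line for detect.py; run this from the YOLOv5 repository root.
    cmd = [
        "python", "detect.py",
        "--weights", weights_path,
        "--source", image_path,
        "--img", "1024",
        # --conf and --iou are abbreviations of detect.py's --conf-thres and --iou-thres.
        "--conf", "0.25",
        "--iou", "0.4",
        "--save-txt",
        "--save-conf",
        "--project", output_dir,
        "--name", "drone_inference"
    ]
    # check=True raises CalledProcessError if detect.py exits with a non-zero status.
    subprocess.run(cmd, check=True)

# Example usage
run_yolov5_obb_inference("inference/images/urban_drone_image.jpg")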
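The --save-txt and --save-conf flags above write one text file of detections per image. The following is a minimal sketch for reading such a file back, assuming the standard axis-aligned YOLOv5 label format of "class x_center y_center width height confidence" with box values normalized to the image size. Oriented-bounding-box forks may write polygon corners instead, so the field layout here is an assumption, and the names Detection, parse_label_line, to_pixel_box, and read_label_file are hypothetical helpers.

```python
from pathlib import Path
from typing import NamedTuple


class Detection(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float
    confidence: float


def parse_label_line(line: str) -> Detection:
    """Parse one 'class x y w h conf' line (box values normalized to 0..1)."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    cls, x, y, w, h, conf = parts
    return Detection(int(cls), float(x), float(y), float(w), float(h), float(conf))


def to_pixel_box(det: Detection, img_w: int, img_h: int) -> tuple:
    """Convert a normalized center box to (x1, y1, x2, y2) in pixels."""
    x1 = (det.x_center - det.width / 2) * img_w
    y1 = (det.y_center - det.height / 2) * img_h
    x2 = (det.x_center + det.width / 2) * img_w
    y2 = (det.y_center + det.height / 2) * img_h
    return tuple(round(v) for v in (x1, y1, x2, y2))


def read_label_file(path) -> list:
    """Read every non-empty line of a label file into Detection records."""
    text = Path(path).read_text()
    return [parse_label_line(line) for line in text.splitlines() if line.strip()]
```

The pixel boxes can then be stored alongside the scene in the catalog, which keeps the detector output independent of the image resolution used at inference time.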

 

Alternatively, the ultralytics package can be used, as shown below with a COCO-pretrained model: 

from ultralytics import YOLO 
 
# Load a COCO-pretrained YOLOv8n model 
model = YOLO("yolov8n.pt") 
 
# Display model information (optional) 
model.info() 

results = model("parking2.jpg") 
print(results) 

Result: 

YOLOv8n summary: 129 layers, 3,157,200 parameters, 0 gradients, 8.9 GFLOPs 
 
image 1/1 C:\Users\ravib\vision\ezvision\analytics\track\parking2.jpg: 384x640 1 truck, 1 snowboard, 175.9ms 
Speed: 21.1ms preprocess, 175.9ms inference, 11.1ms postprocess per image at shape (1, 3, 384, 640) 
[ultralytics.engine.results.Results object with attributes: 
 
boxes: ultralytics.engine.results.Boxes object 
keypoints: None 
masks: None 
names: {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table', 61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote', 66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven', 70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock', 75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier', 79: 'toothbrush'} 
obb: None 
orig_img: array([[[ 81,  79,  85], 
        [ 54,  52,  58], 
        [ 46,  44,  50], 
        ..., 
        [ 18,  32,  30], 
        [ 26,  39,  37], 
        [ 29,  42,  40]], 
 
       [[124, 122, 128], 
        [ 97,  95, 101], 
        [ 67,  65,  71], 
        ..., 
        [ 20,  34,  32], 
        [ 29,  42,  40], 
        [ 33,  46,  44]], 
 
       [[164, 162, 168], 
        [159, 157, 163], 
        [131, 129, 135], 
        ..., 
        [ 28,  42,  40], 
        [ 36,  50,  48], 
        [ 41,  55,  53]], 
 
       ..., 
 
       [[ 16,  10,   3], 
        [ 16,  10,   3], 
        [ 16,  10,   3], 
        ..., 
        [103,  84,  69], 
        [103,  84,  69], 
        [103,  84,  69]], 
 
       [[ 16,  10,   5], 
        [ 16,  10,   5], 
        [ 16,  10,   5], 
        ..., 
        [103,  84,  69], 
        [103,  84,  71], 
        [103,  84,  71]], 
 
       [[ 16,  10,   5], 
        [ 16,  10,   5], 
        [ 16,  10,   5], 
        ..., 
        [103,  84,  69], 
        [103,  84,  71], 
        [103,  84,  71]]], shape=(720, 1280, 3), dtype=uint8) 
orig_shape: (720, 1280) 
path: 'C:\\Users\\ravib\\vision\\ezvision\\analytics\\track\\parking2.jpg' 
probs: None 
save_dir: 'runs\\detect\\predict' 
speed: {'preprocess': 21.11890004016459, 'inference': 175.85100000724196, 'postprocess': 11.089799925684929}] 

 

on the following input: 

Aerial view of a parking lot
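The printed Results object is verbose, so a per-class count is often easier to catalog. The following is a minimal sketch that tallies detections above a confidence threshold. The function name summarize_detections and the min_conf default are assumptions for illustration; it takes plain lists so that it does not depend on any particular detector.

```python
from collections import Counter


def summarize_detections(class_ids, confidences, names, min_conf=0.25):
    """Count detections per class name, ignoring those below min_conf."""
    counts = Counter()
    for cls_id, conf in zip(class_ids, confidences):
        if conf >= min_conf:
            # Class ids may arrive as floats (for example from a tensor), so cast to int.
            counts[names[int(cls_id)]] += 1
    return dict(counts)
```

With the Ultralytics results shown above, the inputs could be taken from the first result as results[0].boxes.cls.tolist(), results[0].boxes.conf.tolist(), and results[0].names.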


 

Thursday, October 23, 2025

  Analytical Framework 

The "Agentic Retrieval with RAG-as-a-Service and Vision" framework is a modular, cloud-native analytics system designed to ingest, enrich, index, and retrieve multimodal content, specifically documents that combine text and images. Built entirely on Microsoft Azure, this architecture enables scalable and intelligent processing of complex inputs such as objects and scenes, logs, locations, and timestamps. It is particularly suited for enterprise scenarios that need fast, accurate, and context-aware responses from large volumes of visual and textual data derived from aerial drone images. 

 Architecture Overview 

The system is organized into four primary layers: ingestion, enrichment, indexing, and retrieval. Each layer is implemented as a containerized microservice, orchestrated, and designed to scale horizontally. 
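The four layers can be pictured as a write path (ingest, enrich, index) followed by a read path (retrieve). The following is a toy, in-memory sketch of that flow only; the Scene record, the function names, and the keyword match are all placeholders assumed for illustration and do not represent the actual microservices or Azure services.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    scene_id: str
    source: str
    caption: str = ""


def ingest(sources):
    """Ingestion: wrap each raw input in a Scene record with an id."""
    return [Scene(scene_id=f"scene-{i}", source=src) for i, src in enumerate(sources)]


def enrich(scenes, captioner):
    """Enrichment: attach a caption from a caller-supplied captioner function."""
    for scene in scenes:
        scene.caption = captioner(scene.source)
    return scenes


def build_index(scenes):
    """Indexing: store scenes by id (stand-in for a search index)."""
    return {scene.scene_id: scene for scene in scenes}


def retrieve(index, keyword):
    """Retrieval: naive case-insensitive keyword match over captions."""
    return [s.scene_id for s in index.values() if keyword.lower() in s.caption.lower()]
```

Keeping each layer behind a simple function boundary like this is what allows the layers to be deployed and scaled separately.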

 1. Ingestion Layer: Parsing objects and scenes 

The ingestion pipeline begins with video and image input, either as a continuous stream or in batch mode. These inputs are parsed and chunked into objects and scenes by a custom ingestion service. Each scene is tagged with metadata and prepared for downstream enrichment. This layer supports batch ingestion, including video indexing to extract only a handful of salient images, and is optimized for documents up to 20 MB in size. Performance benchmarks show a throughput of approximately 50 documents per minute per container instance, depending on image density and document complexity. 
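One simple way to keep only a handful of salient images is to drop frames that barely differ from the last frame kept. The following is a minimal sketch of that idea, assuming each frame is a flat, non-empty list of pixel intensities of equal length. The mean-absolute-difference measure and the threshold value are assumptions for illustration, not the method used by the video indexing service.

```python
def mean_abs_diff(a, b):
    """Average absolute per-pixel difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_salient_frames(frames, threshold=20.0):
    """Return indices of frames that differ enough from the last kept frame."""
    kept = []
    last = None
    for i, frame in enumerate(frames):
        # Always keep the first frame; afterwards keep only noticeable changes.
        if last is None or mean_abs_diff(frame, last) > threshold:
            kept.append(i)
            last = frame
    return kept
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents slow drift from being discarded entirely.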

 2. Enrichment Layer: Semantic Understanding with Azure AI 

Once ingested, the content flows into the enrichment layer, which applies Azure AI Vision and Azure OpenAI services to extract semantic meaning. Scenes and objects are embedded using OpenAI’s embedding models, while objects are classified, captioned, and analyzed using Azure AI Vision. The outputs are fused into a unified representation that captures both textual and visual semantics. 

This layer supports feedback loops for human-in-the-loop validation, allowing users to refine enrichment quality over time. Azure AI Vision processes up to 10 images per second per instance, with latency averaging 300 milliseconds per image. Text embeddings are generated in batches, with latency around 100 milliseconds per 1,000 tokens. Token limits and rate caps apply based on the user’s Azure subscription tier. 
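The text does not specify how textual and visual outputs are fused. The following is a minimal sketch of one common option, assumed here for illustration: L2-normalize each embedding, weight it, and concatenate the two. The function names and the text_weight parameter are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def fuse_embeddings(text_vec, image_vec, text_weight=0.5):
    """Concatenate weighted, normalized text and image embeddings."""
    text_part = [text_weight * v for v in l2_normalize(text_vec)]
    image_part = [(1.0 - text_weight) * v for v in l2_normalize(image_vec)]
    return text_part + image_part
```

Normalizing first keeps either modality from dominating the fused vector simply because its raw embedding has a larger magnitude.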

 3. Indexing Layer: Fast Retrieval with Azure AI Search 

 

Enriched content is indexed into Azure AI Search, which supports vector search, semantic ranking, and hybrid retrieval. Each scene or object is stored with its embeddings, metadata, and image descriptors, enabling multimodal queries. The system supports object caching and deduplication to optimize retrieval speed and reduce storage overhead. 

Indexing throughput is benchmarked at 100 objects per second per indexer instance. Vector search queries typically return results in under 500 milliseconds. This latency is acceptable given the enhanced spatial and temporal analytics, which make it possible to interpret what came before or after a scene. Azure AI Search supports up to 1 million documents per index in the Standard tier, with higher limits available in Premium. 
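The combination of deduplication and vector search can be illustrated without any cloud service. The following is a minimal in-memory sketch, assuming deduplication by a hash of the scene description and ranking by cosine similarity. The SceneIndex class and its methods are hypothetical and are not the Azure AI Search API.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SceneIndex:
    def __init__(self):
        self._docs = {}  # content hash -> (description, embedding, metadata)

    def add(self, description, embedding, metadata=None):
        """Store a scene; return False if an identical description already exists."""
        key = hashlib.sha256(description.encode("utf-8")).hexdigest()
        if key in self._docs:
            return False
        self._docs[key] = (description, list(embedding), metadata or {})
        return True

    def search(self, query_vec, top_k=3):
        """Return up to top_k (score, description) pairs, best match first."""
        scored = [(cosine(query_vec, emb), desc) for desc, emb, _ in self._docs.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

A brute-force scan like this is linear in the number of scenes; a production index would use an approximate nearest-neighbor structure instead.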

 4. Retrieval & Generation Layer: Context-Aware Responses 

The final stage is the RAG orchestration layer. When a user submits a query, it is embedded and matched against the indexed content. Automatic query decomposition, rewriting and parallel searches are implemented using the vector store and the agentic retrieval. Relevant scenes are retrieved and passed to Azure OpenAI’s GPT model for synthesis. This enables grounded, context-aware responses that integrate both textual and visual understanding. 

End-to-end query response time is approximately 1.2 seconds for text-only queries and 2.5 seconds for multimodal queries. GPT models have context window limits (e.g., 8K or 32K tokens) and rate limits based on usage tier. The retrieval layer is exposed via RESTful APIs and can be integrated into dashboards, chatbots, or enterprise search portals. 
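Query decomposition with parallel searches needs a way to merge the per-subquery rankings into one list. The following is a minimal sketch that runs subqueries concurrently and merges them with reciprocal rank fusion. The search_fn callable is a hypothetical stand-in for the real vector store, and the choice of reciprocal rank fusion with k=60 is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of ids; items ranked high in many lists score best."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def parallel_retrieve(subqueries, search_fn, max_workers=4):
    """Run search_fn on every subquery in parallel and fuse the rankings."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        ranked_lists = list(pool.map(search_fn, subqueries))
    return reciprocal_rank_fusion(ranked_lists)
```

The fused list of scene ids is what would then be passed, together with the scene content, to the language model for synthesis.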

 Infrastructure and Deployment 

The entire system is containerized and supports deployment via CI/CD pipelines. A minimal deployment requires 4–6 container instances, each with 2 vCPUs and 4–8 GB RAM. The app hosting resource supports autoscaling up to 100 nodes, enabling ingestion and retrieval at enterprise scale. Monitoring is handled via Azure Monitor and Application Insights, and authentication is managed through Azure Active Directory with role-based access control. 
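The stated figures allow a rough capacity estimate. The following is a back-of-the-envelope sketch that combines the 50 documents per minute per instance quoted for ingestion with the 100-node autoscaling ceiling. It assumes throughput scales linearly with instance count, which is an idealization, and the function name is hypothetical.

```python
import math


def instances_needed(target_docs_per_min, per_instance_docs_per_min=50, max_nodes=100):
    """Estimate ingestion instances for a target rate, assuming linear scaling."""
    needed = max(1, math.ceil(target_docs_per_min / per_instance_docs_per_min))
    if needed > max_nodes:
        raise ValueError(
            f"{needed} instances exceed the autoscaling ceiling of {max_nodes} nodes"
        )
    return needed
```

For example, a target of 1,200 documents per minute works out to 24 instances under these assumptions, while anything above 5,000 per minute would exceed the ceiling.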

 Security and Governance 

Security is baked into every layer. Data is encrypted at rest and in transit. Role-based access control ensures that only authorized users can access sensitive content or enrichment services. The system also supports audit logging and compliance tracking for enterprise governance. 

Applications: 

The agentic retrieval with RAG-as-a-Service and Vision framework offers a robust and scalable solution for multimodal document intelligence. Its modular design, Azure-native infrastructure, and performance benchmarks make it well suited to real-time aerial imagery workflows, technical document analysis, and enterprise search. Whether deployed for UAV swarm analytics or document triage, this system provides a powerful foundation for intelligent, vision-enhanced retrieval at scale. 

Wednesday, October 22, 2025

 The 2019 paper “Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review” by Lei Ma et al. offers a comprehensive and accessible overview of how deep learning (DL) has transformed the field of remote sensing. Such a survey is pertinent to drone-based analytics. Over the past decade, remote sensing has evolved from traditional image processing methods to embrace powerful DL algorithms, which now play a central role in tasks like land cover classification, object detection, and scene interpretation. This review not only introduces key DL models but also analyzes over 200 publications to map out trends, challenges, and future directions. 

Remote sensing involves capturing and analyzing images of the Earth’s surface using satellites, drones, or aircraft. Traditionally, methods like support vector machines (SVMs) and random forests (RFs) were favored for their robustness and ease of use. However, since 2014, DL has gained traction due to its ability to automatically learn complex patterns from large datasets. The paper highlights that DL models now outperform traditional techniques in many areas, especially when high-resolution imagery is available. 

The authors begin by explaining the architecture of DL models. At the core are neural networks—systems of interconnected nodes (neurons) that process data through layers. Deep neural networks (DNNs) contain multiple hidden layers that progressively extract higher-level features from input data. Among these, convolutional neural networks (CNNs) are the most widely used in remote sensing. CNNs are particularly effective for image data because they can capture spatial hierarchies and patterns using convolutional and pooling layers. Popular CNN architectures like AlexNet, VGG, ResNet, and Inception have been adapted for remote sensing tasks. 
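The two building blocks named above, convolution and pooling, can be shown on a tiny image. The following is a minimal pure-Python sketch, not taken from the paper: conv2d slides a kernel over the image without padding (implemented as cross-correlation, as deep learning frameworks do), and max_pool2d keeps the largest value in each non-overlapping window.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of a 2D list with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Max over non-overlapping size x size windows; leftover edges are dropped."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in cols
        ]
        for r in rows
    ]
```

With a kernel such as [[-1, 1]], the convolution responds only where intensity rises from left to right, and pooling then summarizes where in the image that edge occurs at a coarser resolution.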

Recurrent neural networks (RNNs) are another class of DL models discussed in the paper. RNNs are designed to handle sequential data, making them suitable for time-series analysis in remote sensing. They can learn long-term dependencies, although they sometimes struggle with very long sequences. To address this, variants like long short-term memory (LSTM) networks and gated recurrent units (GRUs) have been developed. 
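The way a recurrent network carries information forward can be seen in a single-unit example. The following is a minimal sketch of a scalar recurrent update, h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), with weights chosen by hand for illustration rather than learned; it is not code from the paper.

```python
import math


def run_rnn(inputs, w_x=1.0, w_h=0.5, b=0.0, h0=0.0):
    """Apply a one-unit recurrent update to a sequence and return every hidden state."""
    states = []
    h = h0
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states
```

Because each state depends on the previous one, an early input still influences later states, although its effect shrinks at every step, which is the difficulty with very long sequences that LSTM and GRU variants are designed to ease.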

Autoencoders (AEs), including stacked autoencoders (SAEs), are unsupervised models used for feature compression and dimensionality reduction. These models are especially useful for spectral-spatial feature learning in hyperspectral imagery. Similarly, deep belief networks (DBNs), built from restricted Boltzmann machines (RBMs), are used for unsupervised pretraining followed by supervised fine-tuning, often yielding strong results in classification tasks. 

Generative adversarial networks (GANs) represent a newer frontier. GANs consist of two competing networks—a generator and a discriminator—that learn to produce realistic synthetic data. Though less common in remote sensing, GANs have shown promise in image enhancement and data augmentation. 

The paper’s meta-analysis reveals that most DL applications in remote sensing focus on land use and land cover (LULC) classification, object detection, and scene recognition. These tasks benefit from high-resolution imagery, which provides rich spatial detail. CNNs dominate the landscape, followed by AEs and RNNs. Interestingly, while segmentation, image fusion, and registration are less frequently studied, DL models have still demonstrated strong performance in these areas. 

The authors also examine the types of data used—hyperspectral, SAR, LiDAR—and the study areas, which range from urban environments to vegetation and water bodies. Most studies rely on publicly available benchmark datasets like Indian Pines, University of Pavia, and Vaihingen, which offer high-resolution imagery for testing DL models. 

In terms of accuracy, DL models consistently achieve high performance across classification tasks. However, the paper notes that many studies are still experimental, with limited real-world deployment. Challenges include the need for large labeled datasets, computational resources, and model interpretability. 

This review underscores the transformative impact of DL on remote sensing. It highlights the strengths of various DL models, maps out their applications, and calls for more practical implementations and interdisciplinary collaboration. As DL continues to evolve, its integration with remote sensing promises to unlock deeper insights into Earth’s systems and support more informed decision-making across domains like agriculture, urban planning, and environmental monitoring.