Monday, September 29, 2025

Object Tracking

Object tracking is performed through video and image analysis, either with or without depending on the cloud. Using embedded processors or lightweight edge AI chips, drones can continuously monitor and track objects in real time, maintaining lock even as the scene changes.

If the target is known beforehand (for example, a specific vehicle, person, or structure), onboard tracking algorithms—such as Kernelized Correlation Filters (KCF), Kalman Filters, or deep learning-based trackers pre-trained with the object's signature—can initialize detection using a reference image or video frame. Once the initial detection is established, the drone processes each incoming video frame locally, updating the object's spatial position, velocity, and trajectory. This supports persistent tracking—even when the object moves across the frame or through complex environments—without incurring the delays of transmitting every frame to the cloud.
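
As a minimal sketch, assuming OpenCV with the contrib tracking module available (in some builds the KCF tracker lives under cv2.legacy) and a reference bounding box supplied at initialization, the onboard loop might look like this:

```python
import cv2

# Initial bounding box (x, y, w, h) of the known target in the first frame,
# e.g., obtained by matching against a reference image; values are illustrative.
init_bbox = (320, 240, 80, 60)

cap = cv2.VideoCapture("drone_feed.mp4")  # or a live camera index
ok, frame = cap.read()

# Kernelized Correlation Filter tracker; some OpenCV builds expose it as
# cv2.legacy.TrackerKCF_create() instead.
tracker = cv2.TrackerKCF_create()
tracker.init(frame, init_bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, bbox = tracker.update(frame)  # per-frame local update, no cloud round trip
    if found:
        x, y, w, h = (int(v) for v in bbox)
        # The offset of (cx, cy) from the frame center can drive gimbal or
        # flight-path corrections to keep the target centered in view.
        cx, cy = x + w // 2, y + h // 2
    else:
        # Target lost: fall back to re-detection / re-identification.
        pass
```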

Crucially, real-time onboard processing enables immediate response behaviors, such as adjusting flight path, camera gimbal, or surveillance pattern to keep the object centered in view. If multiple images are captured in burst mode, image-based keypoint matching (e.g., SIFT, ORB) or template matching can further refine object identity. Embedded systems may use bounding box prediction and re-identification modules to handle occlusion or temporary loss of visibility—all within the drone, keeping autonomy and performance high.
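
A rough sketch of ORB-based keypoint matching to confirm object identity after occlusion or temporary loss might look like the following; the file names and thresholds are assumptions that would be tuned per deployment:

```python
import cv2

# Reference view of the target and a candidate crop from the current frame.
reference = cv2.imread("target_reference.png", cv2.IMREAD_GRAYSCALE)
candidate = cv2.imread("candidate_crop.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
kp_ref, des_ref = orb.detectAndCompute(reference, None)
kp_cand, des_cand = orb.detectAndCompute(candidate, None)

# Hamming distance suits ORB's binary descriptors; cross-check keeps only
# mutually best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des_ref, des_cand)

# Count sufficiently close matches to decide whether the candidate is the
# same object (assumed thresholds).
good = [m for m in matches if m.distance < 40]
is_same_object = len(good) > 20
```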

Onboard tracking is especially effective when the UAV must operate in bandwidth-limited or communication-constrained environments, delivering low-latency control and adaptive behavior for predetermined targets while limiting data offloading to only critical events or summary information.
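
One hedged sketch of limiting data offloading to critical events: the tracker stays onboard, and a keyframe is uploaded only when tracking confidence drops below a floor. The endpoint URL, confidence field, and threshold are hypothetical placeholders, not part of any particular system.

```python
import cv2
import requests  # any transport (MQTT, gRPC, etc.) would work equally well

CLOUD_ENDPOINT = "https://example.com/track/keyframe"  # hypothetical endpoint
CONFIDENCE_FLOOR = 0.5                                 # assumed threshold

def maybe_offload(frame, confidence, frame_id):
    """Upload a JPEG keyframe only when onboard tracking confidence is low."""
    if confidence >= CONFIDENCE_FLOOR:
        return  # normal case: nothing leaves the drone
    ok, jpeg = cv2.imencode(".jpg", frame)
    if ok:
        requests.post(
            CLOUD_ENDPOINT,
            files={"frame": (f"{frame_id}.jpg", jpeg.tobytes(), "image/jpeg")},
            data={"frame_id": str(frame_id), "confidence": str(confidence)},
            timeout=2,
        )
```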

When extended to the cloud, object tracking leverages significant compute capabilities, virtually unlimited storage, and low inter-service latency (often under 10 ms between co-located services) to deliver better performance and efficiency, as the following examples show:

1. CloudTrack: Semantic Object Tracking with Foundation Models

• CloudTrack introduces a two-part framework: real-time trackers run on the UAV, while a cloud back-end uses foundation models (e.g., large vision-language models) to deliver advanced semantic understanding and object disambiguation that are not feasible onboard. Experiments show cloud-enabled semantic object tracking outperforms onboard-only methods in accuracy, scalability, and multi-object scenarios, especially for open-vocabulary or rarely seen object types.3

• Extensive evaluation demonstrates improvement over state-of-the-art onboard approaches for semantic tracking, although it incurs extra runtime in the cloud (i.e., slightly more latency). The cloud empowers missions like search-and-rescue or tracking multiple distinct objects simultaneously that onboard systems struggle with.

2. DeepBrain: Energy and Throughput Benefits

• The DeepBrain project demonstrates cloud vision analytics deployed for drone video streams, with CNN models run in the cloud. Cloud GPUs achieve much greater throughput (up to 12 frames/sec) than the ~1 fps limit of GPU-less onboard edge devices.

• Cloud offloading reduces the drone's energy consumption, enabling complex model execution (deep CNNs for car detection) that would otherwise overwhelm onboard resources. Thus, real-time cloud object tracking extends flight time and enables richer analytics.

3. AI-Powered Video Analysis for Scale and Accuracy

• Cloud video analytics allows for the aggregation of data across many cameras (or drones), using deep learning for retrospective tracking and behavioral analysis across large regions and time periods.

• AI-powered backend analysis often detects behavioral patterns, anomalies, or tracking targets that onboard models—limited by resource constraints and single-scene context—cannot match.

• Advanced cloud video analysis yields higher tracking accuracy (up to 99% in some deep-learning safety applications) and supports forensic tracking and adaptive queries, outperforming in-camera or edge-only solutions for large-scale applications.

4. Scalable Multi-Camera/Multi-Drone Tracking

• Multi-camera tracking research shows cloud analytics can correlate objects across different drones’ feeds, resolving ambiguities and re-identifying targets across wider areas than onboard systems typically support.

• The cloud backend processes and fuses metadata, supporting cross-drone object association, long-term monitoring, and efficient resource allocation.

Summary Table

Advantage                        | Onboard Only         | Cloud Video Analytics          | Reference
---------------------------------|----------------------|--------------------------------|----------
Tracking accuracy                | Limited by resources | High (deep learning, semantic) | 3
Throughput (fps)                 | ~1–5 fps             | Up to 12 fps (GPU)             | 4
Multi-object/vocabulary          | Limited              | Flexible, open vocabulary      | 3
Energy consumption (drone)       | High                 | Low (offloaded)                | 4
Large-scale, post-hoc analytics  | Not feasible         | Aggregated, region-wide        | 5

Cloud video analytics has enabled more complex and accurate object tracking in drone applications—especially when advanced model architectures, semantic context, cross-feed correlation, and high throughput are crucial.


Sunday, September 28, 2025

Location refers to a data type that is useful for tasks that involve a map or the globe. It can be represented either as a point or as a polygon, and each helps answer questions such as finding the top 3 stores nearest to a geographic point or the stores within a region. Since it is a data type, many data storage products support out-of-the-box features for determining location and performing operations based on it. One such example is a relational database management system such as SQL Server. This database server defines not one but two data types for specifying location: the Geography data type and the Geometry data type. The Geography data type stores ellipsoidal data such as GPS latitude and longitude, while the Geometry data type stores data in a Euclidean (flat) coordinate system. Points and polygons are examples of shapes the Geography data type can represent. Both the Geography and the Geometry data types must reference a spatial reference system, and since there are many such systems, each instance must be associated with a specific one. This is done with the help of a parameter called the Spatial Reference Identifier, or SRID for short. SRID 4326 is the well-known GPS coordinate system that expresses positions as latitude/longitude. Translating an address to a latitude/longitude/SRID tuple is supported with the help of built-in functions that progressively drill down from the overall coordinate span.

We will get to this shortly, but you can use both Geography and Geometry to define an entity. A table such as ZipCode could have an identifier, code, state, boundary, and center point with the help of these two data types. The boundary can be thought of as the polygon formed by the zip code, and the center point as the central location within that zip code. Distances between stores, and their membership in a zip code, can be calculated from this center point and boundary. The Geography data type also lets you perform clustering analytics, which answers questions such as how many stores or restaurants satisfy a certain spatial condition and/or match certain attributes. These are implemented using R-Tree data structures that support such clustering techniques.
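
A small Python sketch of the kind of query these types enable: the top 3 stores nearest to a zip code's center point, using a great-circle (haversine) distance in the spirit of SRID 4326. The store coordinates are made-up values.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius of ~6371 km

# Hypothetical stores: (name, lat, lon)
stores = [
    ("Store A", 47.61, -122.33),
    ("Store B", 47.66, -122.30),
    ("Store C", 47.57, -122.38),
    ("Store D", 47.70, -122.20),
]

# Made-up center point of a ZipCode row, analogous to the center point column above.
zip_center = (47.62, -122.32)

# Top 3 stores nearest to the zip center.
top3 = sorted(
    stores,
    key=lambda s: haversine_km(zip_center[0], zip_center[1], s[1], s[2]),
)[:3]
print(top3)
```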

The operations performed with these data types include computing the distance between two Geography objects, determining a range around a point (such as a buffer or a margin), and finding the intersection of two geographic shapes. The Geometry data type supports operations such as area and distance because it maps directly to planar coordinates. Other methods supported by these data types include contains, overlaps, touches, and within.
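
A minimal sketch of these operations on the planar (Geometry-style) side, assuming the shapely library is installed; the coordinates are arbitrary:

```python
from shapely.geometry import Point, Polygon

# Two simple shapes in an arbitrary flat coordinate system.
region = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
store = Point(3, 4)

print(region.area)                            # area of the polygon
print(store.distance(Point(8, 9)))            # distance between two points
buffer_zone = store.buffer(2.5)               # range/margin around a point
print(region.intersection(buffer_zone).area)  # intersection of two shapes

# Predicate-style methods corresponding to contains, overlaps, touches, within
print(region.contains(store))
print(buffer_zone.overlaps(region))
print(region.touches(Point(0, 5)))
print(store.within(region))
```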

A note about the use of these data types now follows. One approach is to store the coordinates in a separate table whose primary key is the pair of latitude and longitude, declared unique so that a given pair does not repeat. Such an approach is questionable because the uniqueness constraint for locations carries a maintenance overhead. For example, two entities could refer to the same point, and unreferenced rows might need to be cleaned up. Locations also change ownership: store A could take over a location previously owned by store B while B never updates its record. Moreover, stores can undergo renames or conversions. Thus, it may be better to keep the spatial data stored alongside the information about the location, even if coordinates repeat. Also, these data types do not participate in set operations. Such aggregation is easier to do with collections and enumerables in the programming language of choice and usually consists of four steps: initializing the answer, accumulating (called for each row), merging (called when combining the processing from parallel workers), and returning the answer on termination. These steps resemble a map-reduce algorithm.

These data types and their operations are improved with the help of a spatial index. These indexes are like indexes on other data types and are stored using B-Trees. Since a B-Tree is an ordinary one-dimensional index, the two-dimensional spatial data is reduced in dimension by means of tessellation, which divides the area into small subareas and records the subareas that intersect each spatial instance. For example, with the Geography data type, the entire globe is divided into hemispheres, and each hemisphere is projected onto a plane. When a given Geography instance covers one or more of the resulting subsections or tiles, the spatial index has an entry for each covered tile. The Geometry data type has its own rectangular coordinate system that you define, which you can use to specify the boundaries or bounding box that the spatial index covers. Visualizers support overlays with spatial data, which is popular with mapping applications that superimpose information over the map with the help of transparent layers.
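
Returning to the four-step aggregation pattern, a small Python sketch that computes the bounding box of many latitude/longitude rows could look like this; the class and method names simply mirror the initialize/accumulate/merge/terminate shape rather than any particular product's API:

```python
class BoundingBoxAggregate:
    """Aggregates (lat, lon) rows into a bounding box, map-reduce style."""

    def __init__(self):
        # Step 1: initialize the answer.
        self.min_lat = self.min_lon = float("inf")
        self.max_lat = self.max_lon = float("-inf")

    def accumulate(self, lat, lon):
        # Step 2: called once for each row.
        self.min_lat, self.max_lat = min(self.min_lat, lat), max(self.max_lat, lat)
        self.min_lon, self.max_lon = min(self.min_lon, lon), max(self.max_lon, lon)

    def merge(self, other):
        # Step 3: combine partial results from parallel workers.
        self.min_lat = min(self.min_lat, other.min_lat)
        self.max_lat = max(self.max_lat, other.max_lat)
        self.min_lon = min(self.min_lon, other.min_lon)
        self.max_lon = max(self.max_lon, other.max_lon)

    def terminate(self):
        # Step 4: return the answer.
        return (self.min_lat, self.min_lon, self.max_lat, self.max_lon)

# Example usage with made-up points.
agg = BoundingBoxAggregate()
for lat, lon in [(47.61, -122.33), (47.66, -122.30), (47.57, -122.38)]:
    agg.accumulate(lat, lon)
print(agg.terminate())
```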

In aerial drone image analytics, the Geometry and Geography data types correspond to the aerial field-of-view (FOV) polygon and the per-frame geotag. The FOV polygon enables queries like “find all video frames covering a given region.” The per-frame geotag supports “retrieve all frames from Point X at Time Y.” Other comparisons follow (a small query sketch appears at the end):

Aerial-FOV Representation 

  • Spatial Coverage Geometry 

      ◦ Instead of just storing the geolocation for a frame center, an “aerial-FOV” is defined as the spatial coverage polygon (typically a quadrilateral) representing the area visible to the camera at a particular time. 

      ◦ This FOV footprint is computed from intrinsic camera parameters (focal length, sensor size, aspect ratio) and extrinsic UAV metadata (GPS location, altitude, pitch, yaw, roll); a simplified computation sketch follows this list. 

  • Aerial-FOV Index Fields 

      ◦ Frame ID 

      ◦ Timestamp 

      ◦ FOV Polygon (encoded in geo-coordinates) 

      ◦ Camera orientation/attitude 

      ◦ Coverage area (m² or km²) 

  • Usage: Enables rapid spatial querying for coverage overlap, area-of-interest selection, and event localization within the observed region, rather than a single point location. 
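
A simplified footprint computation, assuming a nadir-pointing camera (zero pitch and roll) so the harder projective geometry drops out; the field-of-view half-angles come from focal length and sensor size, and all default values are illustrative rather than tied to any specific camera:

```python
from math import atan, tan, radians, cos, sin

def nadir_fov_polygon(lat, lon, altitude_m, yaw_deg,
                      focal_mm=8.8, sensor_w_mm=13.2, sensor_h_mm=8.8):
    """Approximate the ground footprint of a nadir-pointing camera.

    Returns four (lat, lon) corners. Uses a flat-earth approximation
    (~111,320 m per degree of latitude), which is fine for small footprints.
    """
    # Half-angles of the horizontal and vertical field of view.
    hfov_half = atan(sensor_w_mm / (2 * focal_mm))
    vfov_half = atan(sensor_h_mm / (2 * focal_mm))

    # Ground half-extents of the footprint in meters.
    half_w = altitude_m * tan(hfov_half)
    half_h = altitude_m * tan(vfov_half)

    # Footprint corners in a local east/north frame, rotated by yaw.
    yaw = radians(yaw_deg)
    corners_local = [(-half_w, -half_h), (half_w, -half_h),
                     (half_w, half_h), (-half_w, half_h)]
    corners = []
    for east, north in corners_local:
        e = east * cos(yaw) - north * sin(yaw)
        n = east * sin(yaw) + north * cos(yaw)
        corners.append((lat + n / 111_320.0,
                        lon + e / (111_320.0 * cos(radians(lat)))))
    return corners
```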

 

Per-Frame Geotag Representation 

  • Frame Point Metadata 

      ◦ Every frame indexed has a center geotag: latitude, longitude, and altitude. 

      ◦ Typically included as: 

          ▪ Frame ID 

          ▪ Timestamp 

          ▪ GPS (lat/lon/alt) 

  • Usage: Directly maps objects or features detected in an image to a specific Earth point, ideal for cataloging, measurement, and change detection. 

DJI and FMV (full-motion video) metadata standards store both camera-target footprints (FOVs) and frame centers, allowing overlays on GIS platforms and precision annotation.
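
To close the loop on the two query patterns above, here is a hedged sketch using shapely over an in-memory list of frame records; the field names, coordinates, and thresholds are assumptions for illustration:

```python
from datetime import datetime, timedelta
from shapely.geometry import Point, Polygon

# Hypothetical indexed frames: each carries a geotag, a timestamp, and an FOV polygon.
frames = [
    {
        "frame_id": 101,
        "timestamp": datetime(2025, 9, 27, 10, 15, 0),
        "geotag": Point(-122.33, 47.61),  # lon, lat
        "fov": Polygon([(-122.34, 47.60), (-122.32, 47.60),
                        (-122.32, 47.62), (-122.34, 47.62)]),
    },
    # ... more frames ...
]

# "Find all video frames covering a given region."
region_of_interest = Polygon([(-122.335, 47.605), (-122.325, 47.605),
                              (-122.325, 47.615), (-122.335, 47.615)])
covering = [f for f in frames if f["fov"].intersects(region_of_interest)]

# "Retrieve all frames from Point X at Time Y" (small radius, small time window).
point_x = Point(-122.33, 47.61)
time_y = datetime(2025, 9, 27, 10, 15, 30)
nearby = [
    f for f in frames
    if f["geotag"].distance(point_x) < 0.001          # roughly 100 m, expressed in degrees
    and abs(f["timestamp"] - time_y) < timedelta(minutes=1)
]
```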