This is an implementation outline of an image retrieval system built for images captured by a UAV swarm. Queries from the user can then be used to designate objects as targets for which the drones acquire a better viewing angle in real time.
Traditionally, image retrieval systems work with either images or text, for the data and the query respectively, but not both. With multimodal AI vector search, we can span both, which gives better results. Azure provides Multimodal embeddings APIs that enable the vectorization of images and text queries. They convert images to coordinates in a multi-dimensional vector space. Incoming text queries can then also be converted to vectors, and images can be matched to the text based on semantic closeness. This allows the user to search a set of images using text, without relying on image tags or other metadata, and semantic closeness often produces better search results. The Vectorize Image and Vectorize Text APIs are available to convert images and text to vectors.
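A minimal sketch of calling these APIs from Python is shown below. The endpoint, API version, model version, and response field names are assumptions based on the Azure AI Vision multimodal embeddings documentation and should be verified against the deployed resource; the cosine-similarity helper is only for illustration.

```python
import math
import requests

# Assumed Azure AI Vision resource endpoint and key (placeholders).
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"
# Assumed API and model versions; confirm against your deployment.
PARAMS = {"api-version": "2024-02-01", "model-version": "2023-04-15"}

def vectorize_image(image_bytes: bytes) -> list[float]:
    """Call the Vectorize Image API for a single image."""
    resp = requests.post(
        f"{ENDPOINT}/computervision/retrieval:vectorizeImage",
        params=PARAMS,
        headers={"Ocp-Apim-Subscription-Key": KEY,
                 "Content-Type": "application/octet-stream"},
        data=image_bytes,
    )
    resp.raise_for_status()
    return resp.json()["vector"]

def vectorize_text(query: str) -> list[float]:
    """Call the Vectorize Text API for a text query."""
    resp = requests.post(
        f"{ENDPOINT}/computervision/retrieval:vectorizeText",
        params=PARAMS,
        headers={"Ocp-Apim-Subscription-Key": KEY,
                 "Content-Type": "application/json"},
        json={"text": query},
    )
    resp.raise_for_status()
    return resp.json()["vector"]

def semantic_closeness(image_vec: list[float], text_vec: list[float]) -> float:
    """Cosine similarity between an image vector and a text-query vector."""
    dot = sum(x * y for x, y in zip(image_vec, text_vec))
    norm = math.sqrt(sum(x * x for x in image_vec)) * math.sqrt(sum(y * y for y in text_vec))
    return dot / norm
```

Ranking every stored image vector by this closeness score against the query vector is what lets plain-text queries retrieve untagged drone imagery.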
An open-source equivalent for such an image retrieval system could be built around the ViT Image Retrieval project available on GitHub, which uses Vision Transformers and Facebook AI Similarity Search (FAISS) for content-based image retrieval. It relies on the ViT-B/16 model pretrained on ImageNet for robust feature extraction and on FAISS for indexing, and it ships with a user-friendly graphical interface for feature extraction and image search. Its Python API is used to index a directory of images and search for similar ones, and this can be extended to work directly with an S3 bucket or an Azure storage account.
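The following is a minimal sketch of the same idea rather than the project's own code: it extracts features with torchvision's pretrained ViT-B/16 and indexes them in a flat FAISS index. The model choice, 768-dimensional feature size, and local file handling are assumptions for illustration; a real deployment would stream images from S3 or an Azure container instead.

```python
import faiss
import numpy as np
import torch
from PIL import Image
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 pretrained on ImageNet and drop the classification head
# so the forward pass returns the 768-d feature vector.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads = torch.nn.Identity()
model.eval()
preprocess = weights.transforms()

def embed(path: str) -> np.ndarray:
    """Return an L2-normalized feature vector for one image file."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        v = model(x).squeeze(0).numpy().astype("float32")
    return v / np.linalg.norm(v)

def build_index(paths: list[str]) -> faiss.IndexFlatIP:
    """Index a directory's worth of images; inner product equals cosine on unit vectors."""
    index = faiss.IndexFlatIP(768)
    index.add(np.stack([embed(p) for p in paths]))
    return index

# Usage sketch (paths would come from a local directory, S3, or Azure blob listing):
# index = build_index(paths)
# scores, ids = index.search(embed("query.jpg").reshape(1, -1), k=5)
```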
“Contextual Embeddings” improves retrieval accuracy, cutting failures further when combined with re-ranking. It pairs the well-known Retrieval-Augmented Generation technique of semantic search using embeddings with lexical search using sparse retrievers such as BM25. The entire knowledge base is split into chunks, and both TF-IDF encodings and semantic embeddings are generated for each chunk. Lexical and semantic searches are then run in parallel, their results are combined and ranked, the most relevant chunks are located, and the response is generated with this enhanced context. This enhancement over multimodal embeddings and GraphRAG is inspired by Anthropic and a Microsoft Community blog.
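A hedged sketch of this hybrid retrieval flow is below. The rank_bm25 and sentence-transformers libraries are stand-ins for whatever lexical and embedding components a real pipeline would use, the sample chunks are invented, and the reciprocal-rank-fusion constant of 60 is a conventional default rather than something prescribed by the referenced posts.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Knowledge base split into chunks (toy examples).
chunks = [
    "aerial view of a red pickup truck on a dirt road",
    "drone footage of a bridge at dusk",
    "close-up of a parked blue sedan near a warehouse",
]

# Lexical index: tokenized chunks fed to BM25.
bm25 = BM25Okapi([c.split() for c in chunks])

# Semantic index: embed every chunk once.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)

def hybrid_search(query: str, k: int = 3) -> list[str]:
    """Run lexical and semantic searches, then fuse the two rankings."""
    lexical_rank = np.argsort(bm25.get_scores(query.split()))[::-1]
    q = encoder.encode([query], normalize_embeddings=True)[0]
    semantic_rank = np.argsort(chunk_vecs @ q)[::-1]

    # Reciprocal rank fusion: score(chunk) = sum over rankings of 1 / (60 + rank).
    fused: dict[int, float] = {}
    for ranking in (lexical_rank, semantic_rank):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [chunks[i] for i in top]

print(hybrid_search("red truck seen from above"))
```

The fused top-k chunks are what get passed to the generator as enhanced context; a cross-encoder re-ranker could be slotted in after the fusion step to cut failures further.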
#codingexercise: CodingExercise-11-24-2024.docx