Thursday, June 27, 2024

Even when a vector database is a straightforward choice for specific use cases involving drone data, the choice of vector database still matters. For example, generating vector embeddings and running vector similarity search are two different use cases. The embedding model is a neural network that transforms raw data into a vector embedding, that is, a vector of numbers that represents the original data. Querying the vector database requires a similarity search between the query vectors and the vectors in the database, and the result of the search is the set of most relevant vectors. The scope of the search can be limited to a subset of the stored vectors, and this is done with the help of metadata filtering. So, the difference between the two is that the first is geared toward storing and retrieving a large number of high-dimensional numerical vectors, while the second optimizes for selectivity and heavy computation over a subset of the data.

Metadata might include dates, times, genres, categories, names, types, descriptions and, depending on our use case, something custom such as tags and labels. Frameworks like LangChain and LlamaIndex offer capabilities to automatically tag incoming queries with metadata. Cloud vector search services like Azure Cognitive Search (since renamed Azure AI Search) can automatically index vector data from two primary sources: Azure Blob indexers and Azure Cosmos DB for NoSQL indexers. Azure Cognitive Search also includes algorithms for vector search, which are primarily of two types: exhaustiveKnn, which calculates the distance between the query vector and every data point, and Hierarchical Navigable Small World, also known as hnsw, which organizes high-dimensional data points into a hierarchical graph structure for approximate nearest neighbor search.

Amazon also offers abundant cloud resources for varying purposes, but they are not all tightly integrated into a single platform the way Vertex AI, Databricks or Snowflake are. A large number of Databricks users in organizations also use Snowflake.
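
To make the idea of similarity search scoped by metadata filtering concrete, here is a minimal sketch in plain Python. It is not the API of any particular vector database: the record layout, the `filtered_knn` helper, and the sample drone frame ids are all assumptions made up for illustration. It filters on metadata first and then runs an exhaustive (brute-force) nearest neighbor search over the remaining vectors using cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def filtered_knn(query, records, k=2, where=None):
    """Exhaustive k-nearest-neighbor search over records matching `where`.

    `records` is a list of dicts with hypothetical keys "id", "vector",
    and "metadata"; `where` is a dict of exact-match metadata conditions.
    """
    where = where or {}
    # Step 1: metadata filtering narrows the search to a subset.
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    # Step 2: score every remaining vector against the query.
    scored = [(cosine_similarity(query, r["vector"]), r["id"]) for r in candidates]
    # Step 3: keep the k most similar vectors.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical embeddings of drone frames, with made-up metadata.
records = [
    {"id": "frame-001", "vector": [1.0, 0.0, 0.0], "metadata": {"type": "thermal"}},
    {"id": "frame-002", "vector": [0.9, 0.1, 0.0], "metadata": {"type": "rgb"}},
    {"id": "frame-003", "vector": [0.0, 1.0, 0.0], "metadata": {"type": "thermal"}},
    {"id": "frame-004", "vector": [0.8, 0.2, 0.0], "metadata": {"type": "thermal"}},
]

query = [1.0, 0.0, 0.0]
thermal_only = filtered_knn(query, records, k=2, where={"type": "thermal"})
unfiltered = filtered_knn(query, records, k=2)

for score, record_id in thermal_only:
    print(f"{record_id}: {score:.3f}")
```

With the filter applied, the rgb frame is excluded even though it is close to the query. An exhaustive scan like this is exact but grows linearly with the number of candidates, which is the tradeoff that graph-based indexes such as hnsw are meant to address.
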
The options for storing vectors include pure vector databases such as Pinecone, full-text search databases like Elasticsearch, vector libraries like Faiss, Annoy and Hnswlib, vector-capable NoSQL databases such as MongoDB, Cosmos DB and Cassandra, and vector-capable SQL databases like SingleStoreDB and PostgreSQL. Rockset is a leader in this quadrant.

Once the functional requirements are met, choices are often prioritized by storage efficiency, high-performance storage and retrieval, and the variety of metrics that can be used to perform similarity searches. Pure vector databases provide efficient similarity search through indexing techniques, scalability for large datasets and high query workloads, support for high-dimensional data, HTTP and JSON-based APIs, and native support for vector operations including dot products. Their main drawback is usually that indexing is time-consuming, especially given that there may be various indexing parameters and incorrect values can introduce inefficiencies. Full-text search databases work great for text and work well with indexing libraries like Apache Lucene and with vector libraries. If we want off-the-shelf vector computations such as fast nearest neighbor search, recommendation systems, image search and NLP, vector libraries are useful, and more are continually being added to open source. Their main drawback is that we must bring our own infrastructure.
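
Since the variety of similarity metrics is one of the selection criteria, a small sketch may help show why the metric matters. The three functions below are standard definitions of dot product, Euclidean distance and cosine similarity written in plain Python; the sample vectors are made up for illustration. The point is that the same query can rank two candidates differently depending on the metric.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; sensitive to vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product of the normalized vectors; ignores magnitude."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 1.0]
same_direction = [3.0, 3.0]   # points the same way as the query, but longer
nearby_point = [1.0, 0.9]     # close to the query in space, slightly rotated

for name, vector in [("same_direction", same_direction), ("nearby_point", nearby_point)]:
    print(
        name,
        f"dot={dot_product(query, vector):.3f}",
        f"euclidean={euclidean_distance(query, vector):.3f}",
        f"cosine={cosine_similarity(query, vector):.5f}",
    )
```

Here dot product and cosine similarity both prefer `same_direction`, while Euclidean distance prefers `nearby_point`. Which behavior is right depends on whether the embedding model encodes meaning in vector length, so it is worth checking which metrics a candidate database supports against what the embedding model was trained for.
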

