Tuesday, March 18, 2025

 The use of Large Language Models (LLMs) for building a knowledge base (KB) can seem like a tribal art, but it applies as readily here as it does to the vast collections of domain-specific text found across many industries. A knowledge graph captures relationships between entities, so both the nodes and the edges must be discovered, and there is no estimate of precision and recall to begin with. We take a specific example of applying an LLM to build a KB with iText2KG. This is a zero-shot method for constructing incremental, topic-independent knowledge graphs from unstructured data using large language models, without the extensive post-processing that is one of the main challenges of knowledge graph construction. Other challenges generally include the unstructured nature of the data, which can result in lossy processing and require advanced NLP techniques to yield meaningful insights, as well as few-shot learning and cross-domain knowledge extraction. NLP techniques, in turn, face their own limitations, including reliance on predefined entities and extensive human annotation.

This approach consists of four modules: the Document Distiller, the Incremental Entities Extractor, the Incremental Relations Extractor, and the Neo4j graph integrator. The Document Distiller uses an LLM, specifically GPT-4, to rewrite documents into semantic blocks, guided by a flexible schema to improve graph construction. The Incremental Entities Extractor iteratively builds a global entity set by matching the local entities extracted from each document against the previously extracted global entities. The Incremental Relations Extractor uses the global document entities to extract both stated and implied relations, with variations depending on the context provided. The approach is adaptable to a variety of use cases, since the schema can be customized to user preferences. The final module integrates the extracted entities and relations into a Neo4j database to visualize the knowledge graph. This makes it a zero-shot technique, because no predefined examples or ontologies are required.
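The incremental flow can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the actual iText2KG API: `llm_complete` and `embed` are stand-ins for whatever chat-completion and embedding calls are available, and the prompt, similarity threshold, and Cypher pattern are assumptions chosen for the example.

```python
import numpy as np

def llm_complete(prompt: str) -> str:
    """Stand-in for a GPT-4 (or similar) chat-completion call."""
    raise NotImplementedError("wire up your LLM client here")

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding call used to compare entities."""
    raise NotImplementedError("wire up your embedding model here")

# Document Distiller: rewrite a document into schema-guided semantic blocks.
DISTILL_PROMPT = (
    "Rewrite the document below into semantic blocks that follow this schema: "
    "{schema}. Return one block per line.\n\nDocument:\n{document}"
)

def distill(document: str, schema: str) -> list[str]:
    return llm_complete(
        DISTILL_PROMPT.format(schema=schema, document=document)
    ).splitlines()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_entities(local: list[str],
                   global_set: dict[str, np.ndarray],
                   threshold: float = 0.85) -> dict[str, np.ndarray]:
    """Incremental Entities Extractor: add a local entity to the global set
    only if no existing global entity is similar enough; otherwise reuse it."""
    for name in local:
        vec = embed(name)
        if not any(cosine(vec, g) >= threshold for g in global_set.values()):
            global_set[name] = vec
    return global_set

# Graph integrator: each (head, relation, tail) triple from the relation
# extractor becomes a Cypher MERGE against the Neo4j database.
CYPHER = (
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (h)-[:RELATION {type: $rel}]->(t)"
)
```

A relation-extraction step analogous to `distill`, prompting the LLM with the global entities and the semantic blocks, would produce the triples fed into the Cypher statement above.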

The effectiveness of this technique, which has broad applicability, is best described by a few metrics: a schema consistency score across documents, where a higher score reflects better performance; an information consistency metric, where higher consistency is desirable; triplet extraction precision, which is higher for local, context-specific entities than for global entities and affects the richness of the graph; the false discovery rate, which should be as low as possible for a successful entity resolution process; and an estimated cosine similarity used to merge entities and relationships and to remove duplicates. The method outperforms baselines on all of these metrics. Experiments with documents such as CVs, scientific articles, and websites have also highlighted effective data refinement and the impact of document chunk size on KG construction.
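As a concrete illustration of the false discovery rate, the sketch below scores an entity-resolution step against a small set of hand-labelled merge decisions. The entity pairs and labels are invented for the example and are not from the paper.

```python
# Hypothetical merge decisions: (entity pair, predicted merge, true merge).
decisions = [
    (("GPT-4", "GPT4"), True, True),            # correct merge
    (("Neo4j", "Neo4J"), True, True),           # correct merge
    (("Paris", "Paris Hilton"), True, False),   # wrong merge (false discovery)
    (("CV", "resume"), False, True),            # missed merge (does not affect FDR)
]

tp = sum(1 for _, pred, true in decisions if pred and true)
fp = sum(1 for _, pred, true in decisions if pred and not true)

# False discovery rate: the share of predicted merges that were wrong.
fdr = fp / (fp + tp) if (fp + tp) else 0.0
print(f"FDR = {fdr:.2f}")  # 1 wrong merge out of 3 predicted -> 0.33
```

The lower this rate, the fewer spurious entity merges pollute the resulting graph, which is why it is reported alongside the consistency and precision metrics above.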

