Ontology enrichment using a large language model: Applying lexical, semantic, and knowledge network-based similarity for concept placement

Navya Martin Kollapally, James Geller, Vipina Kuttichi Keloth, Zhe He, Julia Xu

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Ontologies are essential for representing the knowledge of a domain. To be useful, an ontology must cover its domain comprehensively. Ontology enrichment requires discovering new concepts to add, either because they were missed when the ontology was first built or because the state of the art has advanced and new real-world concepts have emerged. Our goal is to develop an automatic enrichment pipeline using a seed ontology, a Large Language Model (LLM), and a source of text. The pipeline is applied to the domain of Social Determinants of Health (SDoH), using PubMed as a source of concepts. In this work, the applicability and effectiveness of the enrichment pipeline are demonstrated by extending the SDoH Ontology called SOHOv1; however, our methodology could be used in other domains as well.
Methods: We first retrieved PubMed abstracts of candidate articles, using existing SOHOv1 concepts as search terms. Next, we used GPT-4-1201 to extract semantic triples from the abstracts. We identified concepts from these triples using lexical, semantic, and knowledge network-based filtering. We also compared the granularity of the semantic triples extracted with our method to that of the triples in SemMedDB (Semantic MEDLINE Database). The results were evaluated by human experts and by standard ontology tools that check consistency and semantic correctness.
Results: We expanded SOHOv1, which contained 173 concepts and 585 axioms (including 207 logical axioms), to SOHOv2, which contains 572 concepts and 1,542 axioms (including 725 logical axioms). Our methods identified more concepts than those extracted from SemMedDB for the same task. While we have shown the feasibility of our approach for an SDoH ontology, the methodology is generalizable to other domains for which a seed ontology and a text corpus exist.
Conclusions: The contributions of this work are: extracting semantic triples from PubMed abstracts using GPT-4-1201 with prompt chaining; showing the superiority of triples from GPT-4-1201 over triples from SemMedDB for SDoH; using lexical and semantic similarity search techniques together with knowledge network-based search to identify the concepts to be added to the ontology; and confirming the quality of the new concepts with human experts.
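
As a rough illustration of the lexical filtering step named in the Methods, the sketch below scores terms taken from extracted triples against seed ontology labels. It is not the authors' implementation: the token-level Jaccard measure, the 0.5 threshold, the function names, and the example labels and triples are all assumptions made for illustration only.

```python
import re


def tokens(label):
    """Lowercase a label and return its set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a, b):
    """Token-level Jaccard similarity between two labels (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_concepts(triples, seed_concepts, threshold=0.5):
    """Map each new term to its closest seed concept and similarity score.

    A term qualifies if it is not already a seed concept and its best
    lexical match among the seed concepts reaches the threshold.
    """
    if not seed_concepts:
        return {}
    seed_token_sets = {tokens(s) for s in seed_concepts}
    found = {}
    for subj, _pred, obj in triples:
        for term in (subj, obj):
            if tokens(term) in seed_token_sets:
                continue  # already present in the seed ontology
            best = max(seed_concepts, key=lambda s: jaccard(term, s))
            score = jaccard(term, best)
            if score >= threshold:
                found[term] = (best, score)
    return found


# Hypothetical seed labels and triples, invented for this example.
seed = ["food insecurity", "housing instability"]
triples = [
    ("household food insecurity", "is associated with", "depression"),
    ("housing instability", "affects", "school attendance"),
]
result = candidate_concepts(triples, seed)
```

Here only "household food insecurity" survives the filter, matched to the seed label "food insecurity". A pipeline like the one described would still need the semantic and knowledge network-based steps to catch related terms that share no tokens with any seed label.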

Original language: English
Article number: 104865
Journal: Journal of Biomedical Informatics
Volume: 168
State: Published - Aug 2025

Keywords

  • Large language model
  • Ontology enrichment
  • Ontology evaluation
  • SemMedDB database
  • Semantic MEDLINE
  • Semantic MEDLINE database
  • Similarity search
  • Social determinants of health
