Abstract: Quality assurance of biomedical terminologies such as the National Cancer Institute (NCI) Thesaurus is an essential part of the terminology management lifecycle. We investigate a structural-lexical approach based on non-lattice subgraphs to automatically identify missing hierarchical relations and missing concepts in the NCI Thesaurus. We mine six structural-lexical patterns exhibited in non-lattice subgraphs: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a specific type of potential error and suggests a corresponding type of remediation. We found 809 non-lattice subgraphs with these patterns in the NCI Thesaurus (version 16.12d). Domain experts evaluated a random sample of 50 small non-lattice subgraphs, of which 33 were confirmed to contain errors for which the suggested remediations were correct (33/50 = 66%). Of the 25 evaluated subgraphs exhibiting multiple patterns, 22 were verified as correct (22/25 = 88%). These results indicate that our structural-lexical-pattern-based approach is effective in detecting errors and suggesting remediations in the NCI Thesaurus.
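The idea of combining a structural check with a lexical one can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a pair of concepts is "non-lattice" when the pair has more than one maximal common descendant, and it uses a simplified word-set test as a stand-in for the containment pattern; both definitions are assumptions, since the abstract does not spell them out. The toy hierarchy, the concept names, and all function names are invented for illustration and are not taken from the NCI Thesaurus.

```python
from itertools import combinations, permutations

# Hypothetical toy hierarchy (child -> direct is-a parents); not NCI data.
PARENTS = {
    "lung disorder": set(),
    "neoplasm": set(),
    "lung neoplasm": {"lung disorder", "neoplasm"},
    "malignant lung neoplasm": {"lung disorder", "neoplasm"},
}


def build_children(parents):
    """Invert the child -> parents map into parent -> children."""
    children = {node: set() for node in parents}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)
    return children


def down(node, children):
    """Return the node together with all of its descendants."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def maximal_common_descendants(a, b, children):
    """Common descendants of a and b that lie below no other common descendant."""
    common = down(a, children) & down(b, children)
    return {
        c for c in common
        if not any(o != c and c in down(o, children) for o in common)
    }


def non_lattice_pairs(parents):
    """Assumed definition: pairs with more than one maximal common descendant."""
    children = build_children(parents)
    found = []
    for a, b in combinations(sorted(parents), 2):
        bounds = maximal_common_descendants(a, b, children)
        if len(bounds) > 1:
            found.append((a, b, bounds))
    return found


def containment_suggestions(bounds, children):
    """Simplified containment check (an assumption): if the words of x are a
    proper subset of the words of y and y is not already below x, suggest the
    possibly missing relation 'y is-a x'."""
    suggestions = []
    for x, y in permutations(sorted(bounds), 2):
        if set(x.split()) < set(y.split()) and y not in down(x, children):
            suggestions.append((y, x))
    return sorted(suggestions)


pairs = non_lattice_pairs(PARENTS)
children = build_children(PARENTS)
for a, b, bounds in pairs:
    print(a, "|", b, "->", sorted(bounds))
    print("suggested is-a:", containment_suggestions(bounds, children))
```

In this toy hierarchy, the pair "lung disorder" and "neoplasm" shares two maximal common descendants, and the word-set check suggests that "malignant lung neoplasm" may be missing an is-a relation to "lung neoplasm". Adding that relation would leave the pair with a single maximal common descendant.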
Learning Objective 1: Mine lexical and structural patterns among the concepts in non-lattice subgraphs of biomedical terminologies to identify certain types of errors and suggest remediations.
Rashmie Abeysinghe (Presenter)
University of Kentucky
Michael Brooks, University of Kentucky
Jeffery Talbert, University of Kentucky
Licong Cui, University of Kentucky