Abstract: In our study, we trained two word embeddings separately on two corpora, an EHR Corpus and a PubMed Corpus, and compared the trained word embeddings with two public pre-trained embeddings, Google News and Wikipedia, with regard to the semantic representation of medical terms. Four data sets were utilized to evaluate how well the word embeddings capture medical term semantics. Pearson correlation coefficients show that the similarity results using word embeddings trained on the PubMed Corpus are the closest to human experts' results. However, the difference between the word embeddings trained on the EHR Corpus and those trained on the PubMed Corpus is not statistically significant. Both are superior to the pre-trained word embeddings from Wikipedia and Google News.
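The evaluation described above compares embedding-based similarity scores against human expert ratings via the Pearson correlation coefficient. A minimal sketch of that protocol, using hypothetical toy vectors and made-up expert ratings (the actual corpora, term pairs, and ratings are not given in the abstract):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(x, y):
    # Pearson correlation coefficient between two score lists
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym)))

# Hypothetical 3-dimensional embeddings for illustration only;
# real models (e.g., word2vec trained on EHR or PubMed text) use
# hundreds of dimensions.
emb = {
    "diabetes": np.array([0.9, 0.1, 0.0]),
    "insulin":  np.array([0.8, 0.2, 0.1]),
    "fracture": np.array([0.1, 0.9, 0.2]),
    "bone":     np.array([0.2, 0.8, 0.1]),
}

# Term pairs with hypothetical human similarity ratings (0-4 scale)
pairs = [("diabetes", "insulin"), ("fracture", "bone"), ("diabetes", "fracture")]
human = [3.8, 3.5, 0.5]

# Model similarity for each pair, then correlation with human ratings
model = [cosine(emb[a], emb[b]) for a, b in pairs]
r = pearson(model, human)
print(f"Pearson r = {r:.3f}")
```

A higher r indicates that the embedding's similarity ranking agrees more closely with the human experts' judgments, which is the comparison criterion used across the four data sets.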

Learning Objective 1: Learn which word embeddings are best for representing the semantics of medical terms.


Yanshan Wang, Mayo Clinic
Naveed Afzal (Presenter), Mayo Clinic

Liwei Wang, Mayo Clinic
Feichen Shen, Mayo Clinic
Hongfang Liu, Mayo Clinic
