Jason Mancuso, John Carroll University
Peter Short, John Carroll University
Edmunds Reineks, Cleveland Clinic
Elena Manilich (Presenter), John Carroll University
Abstract: Automated laboratories analyze thousands of blood samples every day. Some samples become contaminated through improper collection and produce incorrect results. Identifying such samples is critical but often requires labor-intensive expert review. We present a set of machine learning algorithms that efficiently analyze Big Laboratory Data in Hadoop and Apache Spark. We compared the performance of each model using 13,945 blood samples. Our best model, Random Forest, achieved 86.5% sensitivity and 99.9% specificity.
Learning Objective 1: Understand how machine learning algorithms can be used to identify contaminated blood samples.
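For readers less familiar with the two metrics reported in the abstract, the following is a minimal sketch of how sensitivity and specificity are typically computed for a binary contaminated/clean classifier. The function name and the toy labels are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Convention assumed here: 1 = contaminated sample, 0 = clean sample.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # contaminated, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # contaminated, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean, passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged

    # Sensitivity: fraction of contaminated samples that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of clean samples that were correctly passed.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy labels, not taken from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Contaminated samples are usually rare, so the two metrics can differ widely: high specificity means few clean samples are flagged unnecessarily, while sensitivity measures how many contaminated samples are caught.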