Participants
Hyunkwang Lee, Boston, MA (Presenter) Nothing to Disclose
Sehyo Yune, MD, MPH, Boston, MA (Abstract Co-Author) Nothing to Disclose
Stuart R. Pomerantz, MD, Boston, MA (Abstract Co-Author) Research Grant, General Electric Company
Mohammad Mansouri, MD, MPH, Framingham, MA (Abstract Co-Author) Nothing to Disclose
Ramon G. Gonzalez, MD, PhD, Boston, MA (Abstract Co-Author) Nothing to Disclose
Michael H. Lev, MD, Boston, MA (Abstract Co-Author) Consultant, General Electric Company; Institutional research support, General Electric Company; Stockholder, General Electric Company; Consultant, MedyMatch Technology, Ltd; Consultant, Takeda Pharmaceutical Company Limited; Consultant, D-Pharm Ltd
Synho Do, PhD, Boston, MA (Abstract Co-Author) Nothing to Disclose
hyunkwanglee@seas.harvard.edu
PURPOSE
Most currently published deep learning studies in medical image analysis report performance on carefully selected data. To use such tools in clinical practice, however, it is critical to know how they perform on real-world data. Here, we evaluated the applicability of our intracranial hemorrhage (ICH) detection system in the clinical setting by comparing model performance on real-world cases with its performance on a selected dataset.
METHOD AND MATERIALS
We previously trained and validated a deep learning system for ICH detection using a total of 904 cases of 5-mm, non-contrast head CT scans: 625 cases with ICH and 279 cases without ICH. Six board-certified neuroradiologists annotated all 2D axial slices for the presence of ICH by consensus. To evaluate the model, we retrieved an additional, non-overlapping set of 200 cases (100 with ICH and 100 without ICH), excluding cases with any history of brain surgery, intracranial tumor, intracranial device placement, skull fracture, or cerebral infarct. For performance evaluation in the real-world setting, all non-contrast head CT scans acquired at a single emergency department over three months, from September to November 2017, were obtained: 2,606 consecutive cases, including 163 cases with ICH.
RESULTS
The area under the receiver operating characteristic curve (AUC) was 0.993 for detecting the presence of ICH on the 200 selected cases, with sensitivity of 98.0%, specificity of 95.0%, and negative predictive value of 97.9%. The same model achieved an AUC of 0.834 on the real-world cases, with sensitivity of 87.1%, specificity of 58.3%, and negative predictive value of 98.5% at the high-sensitivity operating point.
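The reported metrics on the selected test set are mutually consistent; a minimal sketch below back-calculates confusion-matrix counts from the reported sensitivity and specificity on the 200-case set (100 ICH, 100 non-ICH) and recomputes the negative predictive value. The counts are illustrative derivations, not the study's raw data.

```python
# Selected 200-case test set: 100 ICH, 100 non-ICH.
# Counts are back-calculated from the reported 98.0% sensitivity
# and 95.0% specificity (illustrative, not the study's raw output).
tp, fn = 98, 2   # 98 of 100 ICH cases flagged -> 2 false negatives
tn, fp = 95, 5   # 95 of 100 non-ICH cases cleared -> 5 false positives

sensitivity = tp / (tp + fn)  # fraction of ICH cases detected
specificity = tn / (tn + fp)  # fraction of non-ICH cases correctly cleared
npv = tn / (tn + fn)          # probability a negative call is truly negative

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, NPV={npv:.3f}")
```

Note that NPV, unlike sensitivity and specificity, depends on disease prevalence in the test set, which is why it can remain high (98.5%) on the real-world data even as specificity drops: ICH prevalence there is only 163/2,606, so true negatives dominate the negative calls.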
CONCLUSION
The deep-learning-based ICH detection model achieved lower sensitivity and specificity when tested on real-world data than on the selected data, which had excluded potentially confounding cases. However, the negative predictive values were similar across the two test datasets.
CLINICAL RELEVANCE/APPLICATION
The performance of deep-learning-based systems should be evaluated on real-world data before they are used in clinical practice to assist clinicians in interpreting automated output.