Why May Neural Network-based CADx Models Fail in Clinical Translation?

SSA11-08

Scientific Formal (Paper) Presentations

Presented on November 28, 2010
Presented as part of SSA11: ISP: Informatics (Clinical Decision Support)

Turgay Ayer, Presenter: Nothing to Disclose

Oguzhan Alagoz PhD, Abstract Co-Author: Nothing to Disclose

Elizabeth S. Burnside MD, MPH, Abstract Co-Author: Nothing to Disclose

Our findings indicate that the conventional ANN training methodology (using an enriched data set) yields inferior performance in both discrimination and accuracy of risk prediction, and may contribute to systematic error in clinical practice.

Artificial Neural Networks (ANNs) are currently in clinical use to detect breast cancer and are being proposed for breast cancer risk prediction. The current conventional training method is to use a balanced dataset of malignant and benign findings. In this study, we show that ANNs trained using the conventional method perform poorly on a dataset that is representative of real clinical data. We propose an alternative training method to improve the performance of ANNs in clinical practice.
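One way to see why training on an enriched data set can distort risk estimates is a simple prior-shift calculation. The sketch below is illustrative only and is not the authors' method (they propose training on representative data instead). It assumes a hypothetical classifier whose outputs are perfectly calibrated on a balanced training set (50% malignant), and applies Bayes' rule to show what risk the same output corresponds to at the prevalence implied by the finding counts reported in this abstract (510 malignant of 62,219 findings). The function name `rescale_risk` is an assumption introduced for this example.

```python
def rescale_risk(p_train: float, train_prev: float, target_prev: float) -> float:
    """Convert a probability calibrated at train_prev into the equivalent
    probability at target_prev, assuming only the class priors differ."""
    if not (0.0 < train_prev < 1.0 and 0.0 < target_prev < 1.0):
        raise ValueError("prevalences must be strictly between 0 and 1")
    if not 0.0 <= p_train <= 1.0:
        raise ValueError("p_train must be a probability")
    # Reweight each class by the ratio of target prior to training prior.
    pos = p_train * target_prev / train_prev
    neg = (1.0 - p_train) * (1.0 - target_prev) / (1.0 - train_prev)
    return pos / (pos + neg)


# Prevalence of malignant findings, from the counts reported in the abstract.
clinical_prev = 510 / 62219

if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        adjusted = rescale_risk(p, train_prev=0.5, target_prev=clinical_prev)
        print(f"balanced-set output {p:.2f} -> risk at clinical prevalence {adjusted:.4f}")
```

Under these assumptions, an output of 0.5 from the balanced-set model corresponds to a risk equal to the clinical prevalence itself (under 1%), so reading the raw output as a patient's risk would overstate it substantially.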

We collected structured reports from 48,744 consecutive mammography examinations on 18,270 patients from 4/5/1999 to 2/9/2004. Using the National Mammography Database format, 62,219 mammographic findings (61,709 benign, 510 malignant) were successfully matched with our cancer registry, which served as our reference standard. We built two ANNs. ANN-I was trained on an enriched, balanced subset (255 benign and 255 malignant abnormalities), which is the conventional method. ANN-II was trained on an unbalanced subset (30,855 benign and 255 malignant abnormalities), in which the prevalence of malignant abnormalities reflects clinical practice. We tested both models on the remaining 31,109 abnormalities (30,854 benign and 255 malignant). We evaluated and compared 1) the discriminative performances of the two models in differentiating malignant findings from benign ones and 2) the accuracy of risk prediction (calibration).

In terms of discrimination, ANN-II performed significantly better (AUC=0.971) than ANN-I (AUC=0.921) (P<0.001). In terms of calibration, again ANN-II performed significantly better (P>0.05). This is also illustrated in the figure.
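The partition described above can be sketched as follows. This is a minimal illustration that reproduces only the reported subset sizes. The abstract does not say how findings were assigned to subsets, so the random shuffling, the seed, the choice to draw ANN-I's benign findings from within ANN-II's benign training pool, and the names `split_findings`, `ann1_train`, `ann2_train`, and `test` are all assumptions.

```python
import random


def split_findings(labels, seed=0):
    """Partition finding indices into the three subsets described in the text.

    labels: sequence of 0 (benign) or 1 (malignant), one entry per finding.
    Returns a dict of index lists; random assignment is an assumption.
    """
    rng = random.Random(seed)
    malignant = [i for i, y in enumerate(labels) if y == 1]
    benign = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(malignant)
    rng.shuffle(benign)

    # Half of the malignant findings go to training, the rest to testing.
    n_mal_train = len(malignant) // 2
    mal_train, mal_test = malignant[:n_mal_train], malignant[n_mal_train:]

    # Half of the benign findings (rounded down) go to testing.
    n_ben_test = len(benign) // 2
    ben_test, ben_train = benign[:n_ben_test], benign[n_ben_test:]

    return {
        # Balanced, enriched set: as many benign as malignant findings.
        "ann1_train": mal_train + ben_train[:len(mal_train)],
        # Unbalanced set: all benign training findings, clinical prevalence.
        "ann2_train": mal_train + ben_train,
        # Held-out set shared by both models.
        "test": mal_test + ben_test,
    }


if __name__ == "__main__":
    # Label counts taken from the abstract: 510 malignant, 61,709 benign.
    labels = [1] * 510 + [0] * 61709
    for name, idx in split_findings(labels).items():
        n_mal = sum(labels[i] for i in idx)
        print(f"{name}: {len(idx)} findings, {n_mal} malignant "
              f"({100 * n_mal / len(idx):.2f}%)")
```

With the reported label counts, the sketch yields 510, 31,110, and 31,109 findings for the three subsets, each containing 255 malignant findings, matching the sizes given in the text.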

We demonstrate that an ANN trained on data reflecting the prevalence of breast cancer in the population performs significantly better than an ANN trained using the conventional, balanced-dataset method.

Ayer, T, Alagoz, O, Burnside, E. Why May Neural Network-based CADx Models Fail in Clinical Translation? Radiological Society of North America 2010 Scientific Assembly and Annual Meeting, November 28 - December 3, 2010, Chicago IL. http://archive.rsna.org/2010/9004850.html