Improving Specificity of Lung Cancer Screening CT Using Deep Learning

Participants
Diego Ardila, Mountain View, CA (Presenter) Employee, Alphabet Inc
Bokyung Choi, PhD, Mountain View, CA (Abstract Co-Author) Employee, Alphabet Inc
Atilla P. Kiraly, PhD, Mountain View, CA (Abstract Co-Author) Former Employee, Siemens AG; Employee, Alphabet Inc
Sujeeth Bharadwaj, PhD, Mountain View, CA (Abstract Co-Author) Employee, Alphabet Inc
Joshua J. Reicher, MD, Stanford, CA (Abstract Co-Author) Investor, Health Companion, Inc; Consultant, Alphabet Inc
Greg Corrado, PhD, Mountain View, CA (Abstract Co-Author) Employee, Alphabet Inc
Daniel Tse, MD, Mountain View, CA (Abstract Co-Author) Employee, Alphabet Inc
Lily Peng, MD, PhD, Mountain View, CA (Abstract Co-Author) Employee, Alphabet Inc
Shravya Shetty, Mountain View, CA (Abstract Co-Author) Employee, Alphabet Inc

For information about this presentation, contact:

sshetty@google.com

PURPOSE

Evaluate the utility of deep learning to improve the specificity and sensitivity of lung cancer screening with low-dose helical computed tomography (LDCT), relative to the Lung-RADS guidelines.

METHOD AND MATERIALS

We analyzed 42,943 CT studies from 14,863 patients, 620 of whom developed biopsy-confirmed cancer. All cases were from the National Lung Screening Trial (NLST). We randomly split patients into training (70%), tuning (15%), and test (15%) sets. A study was marked "true" if the patient was diagnosed with biopsy-confirmed lung cancer in the same screening year as the study. A deep learning model was trained with 3D CT volumes (400x512x512) as input. We selected the operating point corresponding to 95% specificity on the tuning set and evaluated our approach on the test set. To estimate radiologist performance, we retrospectively applied Lung-RADS criteria to each study in the test set. Lung-RADS categories 1 to 2 constitute negative screening results, and categories 3 to 4 constitute positive results. Neither the model nor the Lung-RADS results took prior studies into account, but studies from all screening years were used in the evaluation.
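The patient-level split and the choice of operating point described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the seed, and the rule "lowest observed score that reaches the target specificity" are assumptions, and Lung-RADS subcategories (such as 4A or 4B) are assumed to have been mapped to the integers 1 to 4 beforehand.

```python
import random


def split_patients(patient_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Randomly assign each patient (not each study) to exactly one split."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_tune = round(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "tune": set(ids[n_train:n_train + n_tune]),
        "test": set(ids[n_train + n_tune:]),
    }


def specificity_at(scores, labels, threshold):
    """Fraction of cancer-negative studies scored below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if not y]
    return sum(s < threshold for s in negatives) / len(negatives)


def threshold_for_specificity(scores, labels, target=0.95):
    """Lowest observed score that, used as a cutoff, reaches the target specificity."""
    for t in sorted(set(scores)):
        if specificity_at(scores, labels, t) >= target:
            return t
    return float("inf")  # no observed score reaches the target


def lung_rads_positive(category):
    """Binarize as in the text: categories 1 to 2 negative, 3 to 4 positive."""
    return category >= 3
```

In this sketch the threshold would be computed once from tuning-set scores and then applied unchanged to the test set, so that all studies from one patient stay in a single split and the test set plays no part in choosing the cutoff.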

RESULTS

The area under the receiver operating characteristic curve (AUC) of the deep learning model was 94.2% (95% CI 91.0, 96.9). Compared to Lung-RADS on the test set, the trained model achieved a statistically significant absolute 9.2% (95% CI 8.4, 10.1) higher specificity and a trend toward higher sensitivity (absolute 3.4%; 95% CI -5.2, 12.6) that was not statistically significant. Radiologists qualitatively reviewed disagreements between the model and Lung-RADS. Preliminary analysis suggests that the model may be superior in distinguishing scarring from early malignancy.
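The abstract does not state how the confidence intervals were obtained. One common way to get an interval for a paired difference in specificity or sensitivity is a percentile bootstrap over the test set, sketched below under that assumption. The function names, the number of resamples, and the study-level resampling are illustrative choices, not the authors' method; resampling whole patients would better respect the fact that one patient contributes several studies.

```python
import random


def specificity(preds, labels):
    """Fraction of cancer-negative studies predicted negative."""
    neg = [p for p, y in zip(preds, labels) if not y]
    return sum(not p for p in neg) / len(neg)


def sensitivity(preds, labels):
    """Fraction of cancer-positive studies predicted positive."""
    pos = [p for p, y in zip(preds, labels) if y]
    return sum(bool(p) for p in pos) / len(pos)


def bootstrap_diff_ci(metric, preds_a, preds_b, labels,
                      n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(A) - metric(B) on paired predictions."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # resample lacks one class, so a metric is undefined
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        diffs.append(metric(a, ys) - metric(b, ys))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Here `preds_a` would hold the model's thresholded predictions and `preds_b` the binarized Lung-RADS calls for the same studies. An interval that excludes zero corresponds to a statistically significant difference at the chosen level, which matches how the specificity and sensitivity results are reported above.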

CONCLUSION

A deep learning-based model improved the specificity of lung cancer screening over Lung-RADS on the NLST dataset and could potentially help reduce unnecessary procedures. This research could supplement future versions of Lung-RADS or support assisted-read or second-read workflows.

CLINICAL RELEVANCE/APPLICATION

While Lung-RADS criteria are recommended for lung cancer screening with LDCT, there is still an opportunity to reduce false-positive rates, which lead to unnecessary invasive procedures.

Abstract Archives of the RSNA, 2018