Participants
Bryan Haslam, Cambridge, MA (Presenter) Employee, DeepHealth, Inc
William Lotter, PhD, Cambridge, MA (Abstract Co-Author) Officer, DeepHealth Inc
Abdul Rahman Diab, Cambridge, MA (Abstract Co-Author) Employee, DeepHealth, Inc
Mack K. Bandler, MD, Medford, OR (Abstract Co-Author) Nothing to Disclose
A. Gregory Sorensen, MD, Belmont, MA (Abstract Co-Author) Employee, DeepHealth, Inc; Board member, IMRIS Inc; Board member, Siemens AG; Board member, Fusion Healthcare Staffing; Board member, DFB Healthcare Acquisitions, Inc; Board member, inviCRO, LLC
bhaslam@deep.health
PURPOSE
Generalization is critical for the successful clinical application of machine learning and cannot be assumed; recent studies have shown that many machine learning algorithms (models) do not transfer across patient populations or even across imaging equipment from different manufacturers. Applying machine learning to screening mammography has shown promise in classifying the presence of cancer, but most results reported to date have been tested on data drawn from the same distribution on which the algorithms were trained, including the same manufacturer and the same clinic. We therefore sought to develop and test a model that could transfer from one manufacturer to another and from one site to another.
METHOD AND MATERIALS
We compiled two separate test data sets consisting of de-identified images and linked reports, collected from two mammography centers (Site A and Site B) under an IRB-approved protocol. Data from both sites originated from GE equipment and included presentation FFDM studies. We developed a novel convolutional neural network (CNN) architecture and trained the model entirely on Hologic data, including the Digital Mammography DREAM Challenge training data set. The model was tested on the DREAM Challenge test set and additionally on the two data sets described above: Site A (1880 studies, 41 biopsy-confirmed malignancies) and Site B (1792 studies, 83 biopsy-confirmed malignancies). The receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) were calculated for each of the two site data sets. For the DREAM test set, the AUC was obtained, but because the data are protected, the ROC curve was not reported to us.
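For illustration only, below is a minimal Python sketch (not the study's code) of how a per-site ROC curve and AUC could be computed from study-level model scores; the variable names and data are hypothetical placeholders, not study data.

# Minimal sketch of per-site evaluation; placeholder data, not study data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical placeholders: one score per study; label 1 = biopsy-confirmed
# malignancy, 0 = no malignancy (Site A size used for illustration).
labels_site_a = rng.integers(0, 2, size=1880)
scores_site_a = labels_site_a * 0.5 + rng.random(1880) * 0.5

# Study-level AUC summarizes discrimination on this site's test set.
auc_site_a = roc_auc_score(labels_site_a, scores_site_a)

# Full ROC curve: (FPR, TPR) operating points across score thresholds.
fpr, tpr, thresholds = roc_curve(labels_site_a, scores_site_a)

print(f"Site A AUC: {auc_site_a:.2f}")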
RESULTS
AUC values on the test data sets were: DREAM, 0.90; Site A, 0.91; Site B, 0.89.
CONCLUSION
The developed machine learning model demonstrated successful transfer across different manufacturers and different clinical sites.
CLINICAL RELEVANCE/APPLICATION
Machine learning models can be developed so that testing at new sites and on equipment from new manufacturers does not result in significant loss of performance; such robustness is critical for deploying machine learning in the clinic.