Regularized Training of CADx Algorithms with Unlabeled Data Using Dimension Reduction Techniques

SSG17-05

Scientific Papers

Presented on December 1, 2009
Presented as part of SSG17: Physics (CAD: Colonography and Other)

Andrew Robert Jamieson BA, Presenter: Nothing to Disclose

Maryellen L. Giger PhD, Abstract Co-Author: Stockholder, Hologic, Inc; Royalties, Hologic, Inc; Research funded, Hologic, Inc; Royalties, Riverain Medical; Royalties, Mitsubishi Corporation; Royalties, MEDIAN Technologies; Royalties, General Electric Company; Royalties, Toshiba Corporation

Lorenzo Pesce PhD, Abstract Co-Author: Consultant, Carestream Health, Inc; Consultant, Siemens AG

The potential of leveraging unlabeled data to design more robust and stable CAD classification algorithms for breast mass lesions was considered. Recently developed non-linear dimension reduction and data representation techniques offer a principled approach for integrating unlabeled and labeled image feature data.

For an ultrasound feature database consisting of 1126 unique lesions (2956 images), subsets of different sizes were randomly sampled and identified to the algorithm as labeled or unlabeled. The unlabeled lesion feature information can be selectively incorporated during unsupervised dimension reduction mapping, which is used in place of feature selection. PCA and two recently developed non-linear methods were explored: Laplacian Eigenmaps (Belkin and Niyogi) and t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton). The newer methods aim to preserve local and global structural information inherent in the original high-dimensional feature space while embedding it in fewer dimensions. Using the reduced mapping as input, two classifiers were employed, one linear and one nonlinear: linear discriminant analysis (LDA) and a Markov chain Monte Carlo based Bayesian artificial neural network (MCMC-BANN). The AUC was estimated for each classifier using ROC analysis and leave-one-out-by-case (LOO) validation. The difference in AUC_LOO with and without the use of unlabeled data (ΔAUC_LOO) was calculated. The impact of labeled/unlabeled sample size, lesion-category prevalence, and reduced embedding dimension on CADx performance characteristics was investigated.
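The pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are synthetic Gaussian data, each lesion is represented by a single feature vector (so leaving out one row stands in for leave-one-out-by-case), PCA stands in for the non-linear embeddings, a simple Fisher LDA is the only classifier, and ΔAUC is taken as AUC(with) minus AUC(without). All function and variable names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return (mean, components) of a PCA fit on the rows of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project the rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T

def lda_scores(X_train, y_train, X_test):
    """Fisher LDA discriminant scores (higher means more class-1-like)."""
    X0, X1 = X_train[y_train == 0], X_train[y_train == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance, with a small ridge for stability.
    scatter = (np.cov(X0, rowvar=False) * (len(X0) - 1)
               + np.cov(X1, rowvar=False) * (len(X1) - 1))
    Sw = scatter / (len(X_train) - 2) + 1e-6 * np.eye(X_train.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    return (X_test - 0.5 * (m0 + m1)) @ w

def auc(scores, y):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def loo_auc(Z, y):
    """Leave-one-out AUC: each case is scored by a model trained on the rest."""
    scores = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        scores[i] = lda_scores(Z[keep], y[keep], Z[i:i + 1])[0]
    return auc(scores, y)

# Synthetic stand-in for the lesion feature database (assumption).
rng = np.random.default_rng(0)
n_features, n_dims = 20, 5

def sample(n, label):
    shift = np.zeros(n_features)
    shift[:3] = 1.0 * label  # classes differ only in the first 3 features
    return rng.normal(size=(n, n_features)) + shift

X_lab = np.vstack([sample(30, 0), sample(30, 1)])    # 30 benign, 30 malignant
y_lab = np.repeat([0, 1], 30)
X_unl = np.vstack([sample(250, 0), sample(250, 1)])  # labels never used

# Without unlabeled data: the mapping is fit on the labeled cases only.
mean_a, comp_a = pca_fit(X_lab, n_dims)
auc_without = loo_auc(pca_transform(X_lab, mean_a, comp_a), y_lab)

# With unlabeled data: the mapping is fit on labeled plus unlabeled cases,
# but the classifier is still trained and tested on labeled cases only.
mean_b, comp_b = pca_fit(np.vstack([X_lab, X_unl]), n_dims)
auc_with = loo_auc(pca_transform(X_lab, mean_b, comp_b), y_lab)

delta_auc = auc_with - auc_without
print(f"AUC without: {auc_without:.3f}  with: {auc_with:.3f}  delta: {delta_auc:+.3f}")
```

Because the mapping is unsupervised, fitting it once on all available feature vectors before the leave-one-out loop does not expose any labels to the classifier. Swapping in a non-linear embedding would change only the two mapping steps; the classification and validation code would stay the same.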

Over 100 runs, each sampling 60 labeled ultrasound (US) lesions (30 malignant, 30 benign; ~150 images) from the total database, with and without 500 unlabeled images, and using a 5-dimensional t-SNE embedding, the MCMC-BANN produced a significant mean ΔAUC_LOO = 0.0271 (p = 0.019) for the lower 25th percentile and ΔAUC_LOO = -0.0154 (p = 0.006) for the upper 75th percentile of the distribution of AUC_LOO(without) values, highlighting a regularizing effect.
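One way to read the percentile breakdown above is sketched below, assuming that "lower 25th" and "upper 75th" refer to the runs whose AUC without unlabeled data falls at or below the 25th percentile, or at or above the 75th percentile, of that distribution, and that ΔAUC is AUC(with) minus AUC(without). The function name and the four toy values are hypothetical and are not results from the study; the significance test used for the reported p-values is not stated in the abstract and is not reproduced here.

```python
import numpy as np

def quartile_mean_delta(auc_without, auc_with):
    """Mean delta-AUC among runs in the lower and upper quartiles of AUC(without)."""
    auc_without = np.asarray(auc_without, dtype=float)
    auc_with = np.asarray(auc_with, dtype=float)
    delta = auc_with - auc_without
    q25, q75 = np.percentile(auc_without, [25, 75])
    lower = delta[auc_without <= q25].mean()  # runs that started out weakest
    upper = delta[auc_without >= q75].mean()  # runs that started out strongest
    return lower, upper

# Toy per-run values (hypothetical), one entry per sampling run.
toy_without = [0.60, 0.70, 0.80, 0.90]
toy_with = [0.65, 0.70, 0.80, 0.88]

lower_delta, upper_delta = quartile_mean_delta(toy_without, toy_with)
print(f"lower-quartile mean delta: {lower_delta:+.4f}")
print(f"upper-quartile mean delta: {upper_delta:+.4f}")
```

A positive mean in the lower group together with a negative mean in the upper group is the pattern the abstract describes as a regularizing effect: runs with unusually low or unusually high AUC are pulled toward the middle of the distribution.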

Incorporating unlabeled data can affect CAD algorithms in a non-trivial fashion: it helps regularize the structure of CAD classifiers so that they better reflect the performance expected with larger sample sizes.

Assembling labeled datasets can be very resource-intensive, whereas unlabeled data are abundant. Understanding how to use unlabeled data intelligently is vital in the age of digital radiology.

Jamieson, A, Giger, M, Pesce, L. Regularized Training of CADx Algorithms with Unlabeled Data Using Dimension Reduction Techniques. Radiological Society of North America 2009 Scientific Assembly and Annual Meeting, November 29 - December 4, 2009, Chicago, IL. http://archive.rsna.org/2009/8009553.html