Clinical Context Improves the Performance of AI Models for Cranial Fracture Detection

For information about this presentation, contact:

swetha.tanamala@qure.ai

PURPOSE

Clinical history plays a vital role in a physician's or radiologist's diagnosis. However, when AI models are trained, clinical history and the presence of abnormalities that correlate with the target abnormality are generally not considered. In this study, we use scalp hematoma as additional clinical context when training the models, and we compare the performance (AUC and average precision) of a fracture detection AI model before and after adding this clinical context.

METHOD AND MATERIALS

Using 141,105 studies, we trained a convolutional neural network (CNN) to detect cranial fractures on non-contrast head CT scans. Physicians consider scalp hematoma a good indicator when diagnosing fractures, and we confirmed this association by automated natural language processing (NLP) analysis of a large number of reports. Scalp hematoma is therefore a good candidate for improving AI algorithms that detect fractures. A logistic regression model was trained to detect a cranial fracture, using the presence of a scalp hematoma and the output probability of the CNN as inputs. The original CNN by itself (Model 1) and the combined CNN-logistic regression algorithm (Model 2) were tested on an independent set of 18,200 scans. As evaluation metrics, we used area under the ROC curve (AUC) and average precision (AP), a summary of the precision-recall curve that falls as the number of false positives rises.
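
The abstract does not describe how the logistic regression stage or the metrics were implemented, so the following is only a minimal sketch of the idea, written in standard-library Python. All function names (`fit_logistic`, `predict`, `roc_auc`, `average_precision`, `demo`) are hypothetical, the gradient-descent settings are arbitrary, and `demo` runs on synthetic data with made-up rates, so its numbers say nothing about the study's results. Model 1 is represented by the raw CNN probability, and Model 2 by a logistic regression over that probability plus a binary scalp hematoma flag.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, X):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def roc_auc(y, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y, scores):
    """Mean of the precision at each true positive, ranked by descending score.

    Tied scores are not grouped, unlike some library implementations.
    """
    order = sorted(range(len(y)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def demo(seed=0, n=2000):
    """Toy comparison on SYNTHETIC data; all rates below are assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        fracture = 1 if rng.random() < 0.10 else 0
        # Assumed: hematoma is more common when a fracture is present.
        hematoma = 1 if rng.random() < (0.5 if fracture else 0.1) else 0
        # Assumed: noisy CNN probability, higher on average for fractures.
        cnn_prob = min(1.0, max(0.0, rng.gauss(0.7 if fracture else 0.3, 0.2)))
        rows.append((cnn_prob, hematoma, fracture))
    train, test = rows[: n // 2], rows[n // 2:]
    w, b = fit_logistic([[p, h] for p, h, _ in train], [f for _, _, f in train])
    y_test = [f for _, _, f in test]
    model1 = [p for p, _, _ in test]                      # CNN alone
    model2 = predict(w, b, [[p, h] for p, h, _ in test])  # CNN + hematoma flag
    return {
        "model1": (roc_auc(y_test, model1), average_precision(y_test, model1)),
        "model2": (roc_auc(y_test, model2), average_precision(y_test, model2)),
    }


if __name__ == "__main__":
    for name, (auc, ap) in demo().items():
        print(f"{name}: AUC={auc:.4f} AP={ap:.4f}")
```

Feeding the CNN probability directly into the second stage keeps the image model unchanged, so the hematoma flag only re-ranks scans the CNN has already scored. In practice a library implementation of logistic regression and of both metrics would likely replace these hand-written versions.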

RESULTS

Analysis of 141,105 reports confirmed that scalp hematoma was present in 49.8% of scans with fractures and, conversely, that fractures were present in 29.8% of scans with scalp hematoma. The CNN with images as its sole input reached an AUC of 0.9599 and an AP of 0.7952. Adding scalp hematoma as a feature increased AUC to 0.9666 and increased AP significantly, to 0.8190.
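
The two conditional rates above are linked by Bayes' rule, which gives a quick consistency check. The short sketch below uses only the two reported percentages; the variable names are illustrative, and the absolute prevalence of either finding is not stated in the abstract, so only their ratio can be derived.

```python
# Conditional rates reported in the abstract.
p_hematoma_given_fracture = 0.498  # P(H | F)
p_fracture_given_hematoma = 0.298  # P(F | H)

# Bayes' rule: P(F | H) * P(H) = P(H | F) * P(F),
# so the prevalence ratio is P(F) / P(H) = P(F | H) / P(H | F).
fracture_to_hematoma_ratio = p_fracture_given_hematoma / p_hematoma_given_fracture

print(round(fracture_to_hematoma_ratio, 3))  # 0.598
```

Under these figures, fractures were reported about 0.6 times as often as scalp hematoma in the analyzed reports.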

CONCLUSION

Using a simple probabilistic algorithm to add clinical context to a CNN resulted in a significant improvement in AP. Because AUC was already close to saturation, it did not change significantly. The results show a significant decrease in false positive rate without loss of sensitivity.

CLINICAL RELEVANCE/APPLICATION

Like radiologists, deep learning models can be more accurate when they incorporate clinical context in addition to image analysis.

Abstract Archives of the RSNA, 2019