RSNA 2013 

Abstract Archives of the RSNA, 2013


SSA11-06

Compression of Radiology Reports Using a Semi-static Dictionary and Directed Pseudoforest  

Scientific Formal (Paper) Presentations

Presented on December 1, 2013
Presented as part of SSA11: ISP: Informatics (Education and Research)

Participants

Naveen Garg MD, Presenter: Consultant, Document Storage Systems, Inc
Peter Kamel, Abstract Co-Author: Nothing to Disclose
Sarfaraz Sadruddin MD, Abstract Co-Author: Nothing to Disclose
Jorge Herskovic MD, PhD, Abstract Co-Author: Nothing to Disclose
David Joseph Vining MD, Abstract Co-Author: Royalties, Bracco Group; CEO, VisionSR; Stockholder, VisionSR
Kevin W. McEnery MD, Abstract Co-Author: Advisor, Koninklijke Philips Electronics NV

PURPOSE

A radiologist will generally dictate a normal chest the same way every day, and will usually describe the same pathology in a consistent style. Speech recognition systems rely on these recurring patterns of reporting style to develop statistical language models that improve recognition accuracy. Because of this, we hypothesized that radiology reports would be highly compressible using static dictionaries. Commonly used compression algorithms such as gzip obtain approximately 4x compression, but lose random access to the compressed data. In this work, we report the compression ratios achieved on a large corpus of radiology reports using static dictionaries. We also present a novel method of compressing the static dictionary itself using a directed pseudoforest.

METHOD AND MATERIALS

We constructed dictionaries from a variable number of radiology reports. Dictionaries were constructed using a variation of a generalized suffix tree, pruned by a threshold frequency of the suffixes. Each dictionary was then itself compressed using a directed pseudoforest, taking advantage of the shared structure between phrases in the dictionary. Source documents were then compressed as integer indices into the dictionary, coded with a prefix-free entropy code. The algorithm was implemented in C++11 with no platform-specific dependencies.
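The pipeline described above can be illustrated with a minimal sketch. It is not the authors' C++11 implementation, and it rests on several assumptions: plain word n-gram counting stands in for the generalized suffix tree, the parent-pointer layout (each phrase stored as "index of its prefix plus one final word", so every node has at most one outgoing edge) is only one plausible reading of the directed pseudoforest, and the prefix-free entropy coding of the indices is omitted. All function names and parameters (build_dictionary, to_parent_pointers, max_n, min_count) are hypothetical.

```python
from collections import Counter

def build_dictionary(reports, max_n=4, min_count=2):
    """Keep every single word, plus word n-grams seen at least min_count times."""
    counts = Counter()
    for text in reports:
        words = text.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    phrases = [p for p, c in counts.items() if len(p) == 1 or c >= min_count]
    return sorted(phrases, key=lambda p: (len(p), p))

def to_parent_pointers(phrases):
    """Store each phrase as (index of its prefix, last word); -1 marks a root.

    A kept n-gram's prefix is at least as frequent as the n-gram itself,
    so the prefix is always present in the dictionary.
    """
    index = {p: i for i, p in enumerate(phrases)}
    return [(index[p[:-1]] if len(p) > 1 else -1, p[-1]) for p in phrases]

def expand(nodes, i):
    """Rebuild a phrase by following parent pointers back to a root."""
    words = []
    while i != -1:
        i, word = nodes[i]
        words.append(word)
    return words[::-1]

def encode(text, phrases, max_n=4):
    """Greedy longest-match encoding of a report as dictionary indices."""
    index = {p: i for i, p in enumerate(phrases)}
    words, codes, i = text.split(), [], 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            code = index.get(tuple(words[i:i + n]))
            if code is not None:
                codes.append(code)
                i += n
                break
        else:
            raise KeyError(f"word not in dictionary: {words[i]}")
    return codes

def decode(codes, nodes):
    """Each report decodes on its own, which is what gives random access."""
    return " ".join(w for c in codes for w in expand(nodes, c))

reports = ["the lungs are clear", "the lungs are clear", "the heart is normal"]
phrases = build_dictionary(reports)
nodes = to_parent_pointers(phrases)
codes = encode(reports[0], phrases)
restored = decode(codes, nodes)
```

Because every report is encoded against the same fixed dictionary, any single report can be decoded without touching its neighbors. The sketch splits on whitespace, so it round-trips only single-spaced text; a real implementation would also need to preserve spacing and punctuation.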

RESULTS

Compression ratios improved as the number of reports increased. A million reports compressed to 18.7% of their original size, counting both the compressed reports and the dictionary. These randomly accessible compressed reports were further compressible by gzip, bringing the compressed size to 13.7%. Pruning the dictionary of less frequently used n-grams substantially decreased the size of the dictionary with only a minor increase in the size of the compressed reports. On a million reports, limiting the dictionary to n-grams that occur at least 30 times in the corpus resulted in better overall compression than allowing n-grams that occur 10 or more times.
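The abstract quotes gzip as a ratio (approximately 4x) but its own results as a percentage of original size. As a small worked conversion, assuming the usual definition of compression ratio as original size divided by compressed size, the reported percentages can be put on the same scale (the helper name ratio is hypothetical):

```python
def ratio(fraction_of_original):
    """Convert 'compressed to this fraction of original size' into an Nx ratio."""
    return 1.0 / fraction_of_original

dictionary_only = ratio(0.187)  # 18.7% of original size, about 5.3x
with_gzip = ratio(0.137)        # 13.7% of original size, about 7.3x
gzip_alone_fraction = 1.0 / 4   # a 4x ratio corresponds to 25% of original size
```

On this scale, the dictionary method alone is roughly 5.3x and the dictionary method followed by gzip is roughly 7.3x, compared with the approximately 4x cited for gzip.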

CONCLUSION

Static dictionaries with directed pseudoforests can compress radiology reports with very high efficiency while retaining random access capability.

CLINICAL RELEVANCE/APPLICATION

Better compression of radiology reports and other medical records lets data mining applications retain more data in memory, allowing faster analytics.

Cite This Abstract

Garg, N, Kamel, P, Sadruddin, S, Herskovic, J, Vining, D, McEnery, K, Compression of Radiology Reports Using a Semi-static Dictionary and Directed Pseudoforest. Radiological Society of North America 2013 Scientific Assembly and Annual Meeting, December 1 - December 6, 2013, Chicago IL. http://archive.rsna.org/2013/13020314.html