The Pilot Study of Character-based Adjacent Probability in Japanese CT Clinical Reports

LL-IN6166-R01

Scientific Posters

Presented on November 29, 2007
Presented as part of LL-IN-R: Informatics

Naoki Nishimoto MS, Presenter: Nothing to Disclose

Satoshi Terae MD, Abstract Co-Author: Research grant, Daiichi Sankyo Company, Ltd; Research grant, Eisai Co, Ltd; Research grant, Medical Image Lab, Inc

Masahito Uesugi, Abstract Co-Author: Nothing to Disclose

Takayoshi Terashita, Abstract Co-Author: Nothing to Disclose

Takumi Tanikawa, Abstract Co-Author: Nothing to Disclose

Katsuhiko Ogasawara, Abstract Co-Author: Nothing to Disclose

Akira Endoh, Abstract Co-Author: Nothing to Disclose

Tsunetaro Sakurai, Abstract Co-Author: Nothing to Disclose

et al, Abstract Co-Author: Nothing to Disclose

Building a medical ontology may contribute to the information retrieval task of extracting information that supports diagnosis from narrative texts written by experts. Because words are not separated by white space in languages such as Japanese and Chinese, and because compound technical terms are common in the medical domain, it is difficult for computer programs to parse sentences appropriately. The purpose of this study is to investigate the distribution of the transitional probability between characters at medical term boundaries in compounds.

We used 100 Japanese computed tomography (CT) reports randomly selected from 2,000 reports written during July 2005 at the Hokkaido University Hospital. Medical terms in the CT reports were identified using the morphological analysis system ChaSen. ChaSen is based on a probabilistic language model and was developed by the Matsumoto laboratory at the Nara Institute of Science and Technology. MeSH-based medical terms (51,385 entries), obtained from the Metathesaurus in UMLS (Unified Medical Language System, 2005AA), were added to ChaSen as a medical dictionary. A radiographer corrected the parsing errors in the result set. We computed the transitional probability as the conditional probability over uni-grams, bi-grams, and tri-grams.
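The conditional probability described above can be illustrated with a minimal sketch. It assumes the transitional probability of a character is estimated as the count of the n-gram divided by the count of all n-grams sharing the same leading context, which is a common maximum-likelihood estimate and not necessarily the exact procedure used in the study. The function names and the toy strings are hypothetical and are not taken from the study data.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count character n-grams across a list of strings."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def transition_probability(texts, context, next_char):
    """Estimate P(next_char | context) from character n-gram counts.

    Assumption: maximum-likelihood estimate, i.e.
    count(context + next_char) / count(n-grams that start with context).
    """
    n = len(context) + 1
    full = ngram_counts(texts, n)
    # Only count the context where a following character exists,
    # so the probabilities over all next characters sum to 1.
    prefix_total = sum(c for gram, c in full.items() if gram.startswith(context))
    if prefix_total == 0:
        return 0.0
    return full[context + next_char] / prefix_total


# Hypothetical toy strings (not study data): three short anatomical terms.
reports = ["肺門部", "肺門", "肺野"]

# Bi-gram case: probability that "門" follows "肺".
p_bigram = transition_probability(reports, "肺", "門")
# Tri-gram case: probability that "部" follows "肺門".
p_trigram = transition_probability(reports, "肺門", "部")
print(p_bigram, p_trigram)
```

A low transitional probability between two adjacent characters suggests a likely term boundary, while a high value suggests the characters belong to the same compound. In practice the counts would come from the full report set after segmentation errors have been corrected.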

The number of characters in each report was 256.4±13.7, and the numbers of character types and word types were 863 and 1,941, respectively. As an example of an anatomical location, “pulmonary hilum” was parsed as a tri-gram and occurred 74 times (probability 6.54 × 10^-3).

Retrieving transitional probabilities should help in parsing medical texts correctly. The transitional probabilities may allow us to fix the dictionary size for parsing narrative texts and to develop a medical ontology by using them in a term extraction algorithm. Further work will be required to parse the texts precisely.

Parsing narrative texts may contribute to information retrieval tasks that extract information supporting diagnosis from narrative texts written by experts.

Nishimoto, N, Terae, S, Uesugi, M, Terashita, T, Tanikawa, T, Ogasawara, K, Endoh, A, Sakurai, T, et al, The Pilot Study of Character-based Adjacent Probability in Japanese CT Clinical Reports. Radiological Society of North America 2007 Scientific Assembly and Annual Meeting, November 25 - November 30, 2007, Chicago, IL. http://archive.rsna.org/2007/5015943.html