
Automatic Classification of Pathology Reports into SNOMED Codes June 2008


Presentation Transcript


1. Automatic Classification of Pathology Reports into SNOMED Codes, June 2008
By Weihang ZHANG (MIT)
Supervisors: Prof. Jon PATRICK, Dr. Irena KOPRINSKA

2. INTRODUCTION – Motivation
• SWAPS – South West Area Pathology Service
• Natural language medical records – 400K pathology texts
• The texts contain a great deal of formal terminology, but it is used in an informal and haphazard way
• Medical records need to be converted to the formal terminology:
  • to enable accurate retrieval
  • to compile aggregated statistics of the medical care

3. INTRODUCTION – Context: Text Categorization (TC)
• Definition: Given a collection of documents D = {d1, d2, . . . , dn} and a pre-defined category set C = {c1, c2, . . . , cm}, assign a True or False value to each pair <di, cj> ∈ D × C [Sebastiani F., 2002]
• Meaningful categories:
  • Topics (politics, sports, entertainment, etc.)
  • Spam, child safety, scams
• A successful project: the ScamSeek Project
• A classification task, but the results depend more on the Feature Set than on the Machine Learner (a minimal sketch of this framing follows)
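To make the D × C framing concrete, here is a minimal sketch that assigns a True/False decision to every <document, code> pair, using scikit-learn's LinearSVC in a one-vs-rest setup as a stand-in for the learners used in the project; the toy documents and the SNOMED-style code labels are invented for illustration only.

```python
# Minimal sketch of the D x C True/False assignment described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = ["hyperkeratosis with follicular plugging",
        "dense chronic inflammatory cell infiltrate",
        "no evidence of malignancy"]
codes = [{"M-72000"}, {"M-43000"}, {"M-09450"}]        # hypothetical codes per text

X = CountVectorizer().fit_transform(docs)              # bag-of-words indexing
Y = MultiLabelBinarizer().fit_transform(codes)         # True/False matrix over D x C
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)       # one binary decision per <di, cj>
print(clf.predict(X))                                  # each row: one decision per category
```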

4. How to select a perfect Feature Set for TC?
• Usually ad hoc (not an ad)
• Domain Knowledge [Adam et al., 1998]
  • Medical Terminologies
• Natural Language Processing
  • Negation [Crammer et al., 2007]
• Text Manipulation
  • Bag of Words (treat the whole text as a set of words)
  • Section Selection (filter out ambiguous parts)
  • Section Separation (treat different parts in different ways)

5. INTRODUCTION – Context (Cont'd)
• SNOMED – Systematized Nomenclature of Medicine
• Concepts:
  • The basic unit of meaning, designated by a unique numeric code, a unique name (the Fully Specified Name) and descriptions, including a preferred term and one or more synonyms.
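As a concrete illustration of this concept structure, the record below models a concept with its code, Fully Specified Name, preferred term and synonyms; the field values are placeholders, not verified SNOMED content.

```python
# Illustrative record for the concept structure described above.
from dataclasses import dataclass, field

@dataclass
class SnomedConcept:
    code: str                              # unique numeric code
    fully_specified_name: str              # unique name
    preferred_term: str
    synonyms: list[str] = field(default_factory=list)

example = SnomedConcept(
    code="00000000",                       # placeholder, not a real code
    fully_specified_name="Example disorder (disorder)",
    preferred_term="Example disorder",
    synonyms=["Example condition"],
)
```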

6. INTRODUCTION – Context (Cont'd)
• TTSCT – Text to SNOMED CT*
  • A system which automatically maps free text into a medical reference terminology
  • An NLP*-technique-enhanced lexical token matcher
  • Qualifier identifier
  • Negation identifier [J. Patrick et al., 2006] (a toy matching sketch follows)
*CT – Clinical Terms
*NLP – Natural Language Processing
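The TTSCT system itself is not reproduced here; the deliberately naive sketch below only illustrates the kind of lexical token matching with negation detection it performs. The two-entry lexicon, the 25-character negation window and the cue list are assumptions for illustration.

```python
# Naive lexical concept matching with a simple negation check (NOT TTSCT).
import re

LEXICON = {"hyperkeratosis": "concept-1",   # term -> hypothetical concept id
           "malignancy": "concept-2"}
NEGATION_CUES = ("no ", "without ")

def match_concepts(text):
    hits = []
    lowered = text.lower()
    for term, concept_id in LEXICON.items():
        for m in re.finditer(re.escape(term), lowered):
            window = lowered[max(0, m.start() - 25):m.start()]
            negated = any(cue in window for cue in NEGATION_CUES)
            hits.append((concept_id, term, negated))
    return hits

print(match_concepts("No evidence of malignancy. Hyperkeratosis is present."))
# [('concept-1', 'hyperkeratosis', False), ('concept-2', 'malignancy', True)]
```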

7. OBJECTIVES
• Explore an effective information retrieval mechanism for medical notes classification
• Evaluate the performance of classifiers with TTSCT support
• Develop a SNOMED auto-coding system which helps clinicians with research and decision making

8. RESEARCH METHOD
• Data Inspection
• System Design
• Evaluation Metrics
• Experiment

9. Data Inspection – The Pathology Text
• Each report (pathology text) has a set of diagnoses, presented as SNOMED codes
• A text consists of many sections, for example (a parsing sketch follows the report):

<Title>CLINICAL HISTORY</Title>
Biopsy of discoid erythematosus like lesion from right cheek ? DLE.

<Title>MACROSCOPIC</Title>
LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm. A poorly defined pale nodular lesion 3 x 3mm. It appears to abut the surgical margin. Representative sections embeded, A tips face on, B lesion and surgical margin. (MR 17/4) <DOT>TA</DOT>

<Title>MICROSCOPIC</Title>
Section shows hyperkeratosis with occasional follicular plugging, epidermal atrophy and severe sundamage to dermal collagen. A dense chronic inflammatory cell infiltrate, both superficial and deep is present, mainly in a perivascular and periadnexal distribution. No liquefaction degeneration of the basal layer, no dermal oedema and no interface dermatitis are seen. PAS stain reveals no thickening of the epidermal basement membrane and only an occasional fungal spore on the skin surface. Immunofluorescence for immunoglobulins and complement fractions are negative. The differential diagnosis rests between chronic discoid erythematosus, lymphocytic infiltration of skin of Jessner and the plaque type of polymorphous light eruption. The presence of marked solar damage to collagen, the absence of basal liquefaction degeneration and the negative immunofluorescence favours polymorphous light eruption. A reaction to drugs or an insect bite is also a possibility. No evidence of malignancy. Reported 24/4/98
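A small sketch of splitting such a report into its named sections, which is what section selection, separation and hiding rely on; the regex is based only on the <Title>...</Title> markup visible in the example above, and real reports may need more robust handling.

```python
# Split a report into {section title: section body} using the <Title> markup.
import re

def split_sections(report):
    parts = re.split(r"<Title>(.*?)</Title>", report)
    # parts = [preamble, title1, body1, title2, body2, ...]
    return {parts[i].strip(): parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}

report = ("<Title>CLINICAL HISTORY</Title> Biopsy of discoid lesion. "
          "<Title>MICROSCOPIC</Title> No evidence of malignancy.")
print(list(split_sections(report)))   # ['CLINICAL HISTORY', 'MICROSCOPIC']
```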

10. Data Inspection – SNOMED Codes Distribution
• 10K pathology texts were uniformly randomly selected from the 400K texts in the database
• 867 distinct codes occurred, and about 30K code assignments were made over the 10K texts
• The 9 most frequent of these 867 codes were selected for the experiments
• All the remaining codes are grouped as "others" (a label-building sketch follows)
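A sketch of how this label space could be built: count code frequencies over the sampled texts, keep the most frequent codes and map everything else to "others". The function, variable names and example codes are assumptions, since the real data layout is not shown.

```python
# Keep the top-N most frequent codes; everything else becomes "others".
from collections import Counter

def build_label_space(code_sets, top_n=9):
    counts = Counter(code for codes in code_sets for code in codes)
    keep = {code for code, _ in counts.most_common(top_n)}
    return [{c if c in keep else "others" for c in codes} for codes in code_sets]

# code_sets holds one set of assigned SNOMED codes per pathology text
sample = [{"T-01000", "M-40000"}, {"M-40000"}, {"T-02000", "M-40000"}]
print(build_label_space(sample, top_n=1))   # only 'M-40000' is kept; the rest map to 'others'
```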

11. TC Work Flow
Read Document → Tokenize Text → Remove Stopwords → Stem → Lexical Validation → Vector Representation of Text (Indexing) → Feature Filtering to Reduce Dimensionality → Machine Learning (classifiers) → SNOMED Code
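A minimal end-to-end sketch mirroring this workflow, with scikit-learn components standing in for the tools actually used in the project (SVM-Light, MaxEnt, J48); stemming, lexical validation and TTSCT concept integration are omitted for brevity, and chi-squared selection stands in for Information Gain.

```python
# Indexing -> feature filtering -> machine learner, as in the workflow above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # tokenize, remove stopwords and index each text as a weighted vector
    ("index", TfidfVectorizer(stop_words="english")),
    # feature filtering to reduce dimensionality (chi2 here, IG in the project)
    ("select", SelectKBest(chi2, k=100)),
    # machine learner that assigns a SNOMED code (or "others") to each text
    ("learn", LinearSVC()),
])
# usage: pipeline.fit(train_texts, train_codes); pipeline.predict(new_texts)
```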

12. Evaluation Metrics
• Measurements for classification:
  • Recall (R)
  • Precision (P)
  • F-measure (combines recall and precision): F = 2 * P * R / (P + R)
• Performance averaging:
  • Micro: bias towards texts
  • Macro: bias towards categories
  • Standard deviation
• Measurement for system performance: Micro-F, which gives equal weight to every text (a short computation sketch follows)
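The arithmetic behind micro- versus macro-averaged F, computed on made-up per-category true-positive/false-positive/false-negative counts:

```python
# Micro vs macro F over per-category confusion counts (illustrative numbers).
def f_measure(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0       # precision
    r = tp / (tp + fn) if tp + fn else 0.0       # recall
    return 2 * p * r / (p + r) if p + r else 0.0

counts = [(90, 10, 20), (40, 5, 15), (5, 2, 8)]  # one (tp, fp, fn) per category

macro_f = sum(f_measure(*c) for c in counts) / len(counts)   # equal weight per category
micro_f = f_measure(*(sum(col) for col in zip(*counts)))     # equal weight per decision
print(round(macro_f, 3), round(micro_f, 3))
```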

13. Experiment
Feature Manipulation
• N-grams: Unigram, Bigram, Trigram
• TTSCT Concept ID integration (negation included):
  • keep the text and add the Concepts
  • replace concept words with the Concepts
  • only Concepts, no text at all
• Text Section Hiding: <Clinical History>, <Microscopic>, <Macroscopic>
• Text Section Separation
System Component Comparison
• Machine Learners: SVM-Light, MaxEnt, J48
• Indexing Methods: Boolean Weight, Word Frequency, Entropy Weight
• Stemming Strategy: all words, none of the words
• Dimension Reduction:
  • Frequency threshold: >= 1, >= 4
  • Information Gain: top 100, 200, 500, 1000, 2000, 4000
(a feature-building sketch follows)
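A sketch of the n-gram features and the three concept-integration modes listed above; the function names and the toy tokens/concept hit are assumptions, and real concept hits would come from TTSCT output.

```python
# N-gram features plus the "add" / "replace" / "concepts only" modes.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_features(tokens, concept_hits, mode="add", n=2):
    """concept_hits: list of (concept_id, matched_token) pairs."""
    matched = {tok for _, tok in concept_hits}
    concept_ids = [cid for cid, _ in concept_hits]
    if mode == "concepts_only":          # only concepts, no text at all
        return concept_ids
    kept = tokens if mode == "add" else [t for t in tokens if t not in matched]
    return ngrams(kept, 1) + ngrams(kept, n) + concept_ids

tokens = ["no", "evidence", "of", "malignancy"]
hits = [("concept-2", "malignancy")]     # hypothetical concept id
print(build_features(tokens, hits, mode="replace"))
# ['no', 'evidence', 'of', 'no evidence', 'evidence of', 'concept-2']
```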

14. RESULTS & DISCUSSION
• Example result: a result record table
• Sample: 10,000 texts
• Categories: 9 SNOMED codes, plus 1 "others"
• Recall, Precision and F: for each individual category and for the whole system
• Standard deviation
• Micro and Macro averages: for system performance
• The Micro_9 results were used for evaluation, while the other three results are for reference

15. System Component Comparison
Baseline: Stem None, Unigram
• SVM is better at classification, but it costs much more time
• Stemming none of the words keeps more information
• Entropy Weight indexing gives better results
• InfoGain showed an overfitting problem, but it did not affect the results much
• Filtering words by the frequency threshold of 4 does not change the results much

16. Feature Manipulation
• Adding Bigrams and Concepts raises the F score
• The bigger the N, the better the result
• Restricting N-grams to within sentences does not decrease the results much, but it reduces processing time and gives more natural significance

17. Feature Manipulation (Cont'd)
• Hiding misleading parts of the text raises the classification results
• Keeping the original words together with the concept IDs performs better than filtering out the concept-matched words
• Using only concept IDs does not give good results – too much information is lost

18. FUTURE WORK
• Treat the hidden parts of the text in another way – chunk them into a new type of bag-of-words (experiments are ongoing)
• Develop a web-based utility for the production environment, i.e., "give the system the report, and the system gives back the result"

19. CONCLUSIONS
• Not stemming the words performed better than stemming
• N-grams increased the classification accuracy as N increased (Trigram > Bigram > Unigram)
• The NLP techniques and Concept Extraction did enhance the classification
• Hiding misleading parts of the text did raise the performance scores

20. Thanks to
• YOU
• My supervisors
• Y. Wang
Glad to have your questions!
