Text Classification for Healthcare Information Support

Rey-Long Liu (劉瑞瓏), Dept. of Medical Informatics, Tzu Chi University, Taiwan

Background
• Text categorization (TC) is a fundamental component of information processing
• Many TC techniques have been developed
• Unfortunately, high-quality TC is often an unrealizable ideal
  • Very high precision
  • Very high recall

Background (Cont.)
• An application scenario: healthcare information support
[Figure: healthcare information support built on high-quality TC. Labels: general users (e.g. patients); healthcare professionals; consultancy; inquiry, classified inquiry, and inquiry classification confirmation; query, classified query, and relevant information; information gathering systems, information gathered, classified information, and classified information base.]

Outline
• Interaction as an approach to high-quality TC
  • Main consideration
    • Reducing the amount of interaction
  • Criteria & straightforward interaction strategies
• An intelligent interaction strategy: COM (Content Overlapping Measurement)
• Empirical evaluation
  • Classification of Chinese cancer texts
• Conclusion

Interaction for High-Quality TC
• Interaction with the user
  • Possibly a “final” approach
• More application scenarios
  • Information recommendation & archiving
  • Definitely relevant vs. potentially relevant
• Main consideration
  • Reducing the number of interactions

Interaction for High-Quality TC (Cont.)
• Evaluation criteria
  • Confirmation Precision (CP)
    • Related to the cognitive load on users
  • Confirmation Recall (CR)
    • Related to the quality of TC
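The slide names the two criteria without defining them. The sketch below shows one plausible reading, stated here as an assumption rather than the author's definition: a confirmation is "needed" when the classifier's own decision on a document would have been wrong, CP is the share of requested confirmations that were needed, and CR is the share of needed confirmations that were actually requested. The function name and the empty-set convention are hypothetical.

```python
def confirmation_precision_recall(requested, needed):
    """Compute (CP, CR) under the assumed definitions described above.

    requested: set of document ids the system sent to the user for confirmation.
    needed:    set of document ids the classifier alone would get wrong
               (assumed meaning of a "necessary" confirmation).
    """
    hits = len(requested & needed)
    # Convention chosen for this sketch: an empty denominator counts as perfect.
    cp = hits / len(requested) if requested else 1.0
    cr = hits / len(needed) if needed else 1.0
    return cp, cr


cp, cr = confirmation_precision_recall({1, 2, 3, 4}, {2, 4, 5})
```

Under this reading, a low CP means the user is asked about many documents the classifier would have handled correctly (higher cognitive load), and a low CR means errors slip through unconfirmed (lower TC quality).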

Interaction for High-Quality TC (Cont.)
• Straightforward interaction strategies
• Uniform Confirmation (UC): preferring CR
(A) Set two thresholds to identify the DOA range for confirmation (o: positive validation document; x: negative validation document):
[Figure: validation documents ordered by DOA between Min DOA and Max DOA, with the Rejection Threshold (RT) and the Acceptance Threshold (AT) marked.]
(B) Confirmation strategy:
Prob = 1.0 (when RT ≤ DOA(d, c) ≤ AT)
Prob = 0 (when DOA(d, c) < RT)
Prob = 0 (when DOA(d, c) > AT)
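The UC rule can be sketched as a small function. This is a minimal illustration, assuming the confirmation range includes both thresholds (the slide gives probability 0 strictly below RT and strictly above AT); the function name is hypothetical and the DOA value is whatever score the underlying classifier produces.

```python
def uc_confirmation_probability(doa, rt, at):
    """Uniform Confirmation: always confirm inside [RT, AT], never outside.

    doa: the classifier's DOA for document d with respect to category c.
    rt:  rejection threshold; at: acceptance threshold (rt <= at).
    """
    if rt > at:
        raise ValueError("rejection threshold must not exceed acceptance threshold")
    return 1.0 if rt <= doa <= at else 0.0


# Documents scored 0.2, 0.5 and 0.8 with RT = 0.3 and AT = 0.7:
probs = [uc_confirmation_probability(s, 0.3, 0.7) for s in (0.2, 0.5, 0.8)]
```

Because every document in the range is sent to the user, widening the range raises CR at the cost of CP, which matches the slide's note that UC prefers CR.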

Interaction for High-Quality TC (Cont.)
• Probabilistic Confirmation (PC): preferring CP
(A) Tune a threshold in the hope of optimizing F1 (o: positive validation document; x: negative validation document):
[Figure: validation documents ordered by DOA between Min DOA and Max DOA, with the classifier's threshold (T) marked.]
(B) Confirmation strategy:
Prob = 1.0 (when DOA(d, c) = threshold)
Prob = 0 (when DOA(d, c) = Min)
Prob = 0 (when DOA(d, c) = Max)
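The slide fixes the PC probability at only three points (1.0 at the threshold, 0 at Min DOA and at Max DOA). The sketch below fills in the values in between by linear interpolation; that interpolation is an assumption made for illustration, not something the slide states, and the function names are hypothetical.

```python
import random


def pc_confirmation_probability(doa, threshold, min_doa, max_doa):
    """Probabilistic Confirmation: 1.0 at the threshold, 0.0 at Min/Max DOA.

    Values in between are linearly interpolated (assumption of this sketch).
    """
    if not (min_doa < threshold < max_doa):
        raise ValueError("expected min_doa < threshold < max_doa")
    doa = min(max(doa, min_doa), max_doa)  # clamp scores outside the observed range
    if doa <= threshold:
        return (doa - min_doa) / (threshold - min_doa)
    return (max_doa - doa) / (max_doa - threshold)


def should_confirm(doa, threshold, min_doa, max_doa, rng=random):
    """Draw the actual confirm / do-not-confirm decision from the probability."""
    return rng.random() < pc_confirmation_probability(doa, threshold, min_doa, max_doa)
```

Documents scored far from the threshold are rarely sent to the user, so fewer unnecessary confirmations are requested; this is consistent with the slide's note that PC prefers CP.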

ICCOM: Interactive Confirmation by COM
[Figure: ICCOM architecture. Training: training documents for classifier building feed feature selection and classifier building for the underlying classifier; training documents for threshold tuning (validation) feed (1) Content Overlap Measurement (COM) and (2) threshold tuning based on content overlapping. Testing: each incoming document goes through classification and (3) Content Overlap Measurement (COM), producing classified/filtered documents and documents to be confirmed.]

ICCOM: Interactive Confirmation by COM (content overlapping measurement)
• Procedure COM(c, d), where
  (1) c is a category,
  (2) d is a document for thresholding or testing
• Return: Degree of content overlap (DCO) between d and c
• Begin
  (1) DCO = 0;
  (2) For each term t that is positively correlated with c but does not appear in d, do
    (2.1) DCO = DCO - χ²(t, c);
  (3) For each term t that is negatively correlated with c but appears in d, do
    (3.1) DCO = DCO - (number of occurrences of t in d) × χ²(t, c);
  (4) Return DCO;
• End.
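The COM procedure above can be sketched in Python as follows. This is a minimal illustration, assuming the χ² scores and the positive/negative term sets for the category have already been computed from training data; the function name, argument names, and example terms are hypothetical.

```python
from collections import Counter


def com(doc_tokens, chi2, positive_terms, negative_terms):
    """Return the degree of content overlap (DCO) between a document and a category.

    doc_tokens:     list of terms in document d.
    chi2:           dict mapping each term t to its chi-square score for category c.
    positive_terms: terms positively correlated with c.
    negative_terms: terms negatively correlated with c.
    """
    counts = Counter(doc_tokens)
    dco = 0.0
    # Step (2): penalize each positively correlated term missing from d.
    for t in positive_terms:
        if counts[t] == 0:
            dco -= chi2[t]
    # Step (3): penalize each negatively correlated term, once per occurrence in d.
    for t in negative_terms:
        if counts[t] > 0:
            dco -= counts[t] * chi2[t]
    return dco


chi2_scores = {"tumor": 4.0, "liver": 2.0, "football": 3.0}
dco = com(["liver", "football", "football"], chi2_scores,
          positive_terms={"tumor", "liver"}, negative_terms={"football"})
```

DCO is never positive: 0 means the document contains every positively correlated term and none of the negatively correlated ones, and more negative values indicate less overlap with the category.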

ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)
• A term t is “positively correlated” with c if AD > BC; otherwise it is “negatively correlated”, where
  • N: total number of documents,
  • A: # documents that are in c and contain t,
  • B: # documents that are not in c but contain t,
  • C: # documents that are in c but do not contain t, and
  • D: # documents that are not in c and do not contain t.
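The transcript defines N, A, B, C, and D but does not carry the formula that uses them. Assuming the slide showed the standard chi-square statistic for term-category association, it would read as follows in terms of those counts:

```latex
\chi^2(t, c) = \frac{N \,(AD - BC)^2}{(A + B)(A + C)(B + D)(C + D)},
\qquad N = A + B + C + D
```

The statistic itself is non-negative, so the direction of the association has to be read separately from the sign of AD - BC, which is what the positive/negative correlation rule above does.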

ICCOM: Interactive Confirmation by COM (thresholding)
[Figure: validation documents (o: positive; x: negative) ordered by DOA between Min DOA and Max DOA, with the Rejection Threshold (RT) and the classifier's threshold (T) marked. Documents in the rejection region are rejected; for the others, COM is invoked to compute DCO. The Positive Confirmation Threshold (PCT) separates acceptance from acceptance confirmation, and the Negative Confirmation Threshold (NCT) separates rejection from rejection confirmation.]

ICCOM: Interactive Confirmation by COM (collaboration with the classifier)
• Procedure InteractiveHighQualityTC(c, d, T, RT, PCT, NCT), where
  (1) c is a category,
  (2) d is the document to be processed,
  (3) T is the classifier’s threshold for c,
  (4) RT is the rejection threshold for c,
  (5) PCT is the positive confirmation threshold for c, and
  (6) NCT is the negative confirmation threshold for c.
• Return:
  • A decision (acceptance, rejection, or confirmation) for d with respect to c.
• Begin
  (1) DOAd = Invoke the classifier to compute DOA of d with respect to c;
  (2) If (DOAd ≤ RT), Return “rejection”;
  (3) Else
    (3.1) DCOd = Invoke COM to compute DCO of d with respect to c;
    (3.2) If (DOAd ≥ T)
      (3.2.1) If (DCOd ≥ PCT), Return “acceptance”;
      (3.2.2) Return “confirmation”;
    (3.3) Else
      (3.3.1) If (DCOd ≤ NCT), Return “rejection”;
      (3.3.2) Return “confirmation”;
• End.
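The decision procedure can be sketched in Python as below. The comparison operators did not survive in the transcript, so their directions here are an assumption: low DOA leads to rejection, and because DCO is at most 0 with higher values meaning more overlap, a high DCO supports acceptance and a low DCO supports rejection. The function and parameter names are hypothetical, and `compute_dco` stands in for a call to COM.

```python
def interactive_high_quality_tc(doa, compute_dco, t, rt, pct, nct):
    """Return "acceptance", "rejection", or "confirmation" for one document.

    doa:         the classifier's DOA of d with respect to c.
    compute_dco: zero-argument callable returning DCO of d with respect to c;
                 only called when the document is not rejected outright.
    t, rt:       the classifier's threshold and the rejection threshold.
    pct, nct:    positive and negative confirmation thresholds on DCO.
    """
    if doa <= rt:  # assumed direction: clearly too low, reject without COM
        return "rejection"
    dco = compute_dco()
    if doa >= t:  # the classifier would accept
        return "acceptance" if dco >= pct else "confirmation"
    # the classifier would reject
    return "rejection" if dco <= nct else "confirmation"


decision = interactive_high_quality_tc(
    doa=0.6, compute_dco=lambda: -3.0, t=0.5, rt=0.2, pct=-1.0, nct=-5.0
)
```

In this example the classifier would accept the document (0.6 is above T), but its content overlap is below PCT, so the user is asked to confirm instead of the document being accepted automatically.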

Empirical Evaluation
• Chinese disease (cancer) texts
  • 16 types of cancer (e.g. liver cancer, lung cancer) top-ranked by the Department of Health in Taiwan
  • Collected by sending cancer names to “知識+” (Knowledge+) on Yahoo! Taiwan
  • For each cancer, there are 5 subcategories
    • Cause, symptom, treatment, side effect, and prevention
  • Therefore, we have 80 (16 × 5) categories with 2850 documents
• 90% for training; 10% for testing
  • 2-fold cross-validation (classifier building vs. thresholding)

Empirical Evaluation (cont.) Classification of cancer information

Empirical Evaluation (cont.)
Classification of 40 symptom descriptions without cancer names
Note: For the 40 test symptom documents, RO+ICCOM conducts 35 and 51 confirmations in the 1st and 2nd folds, respectively.

Conclusion
• High-quality TC is essential but often unrealizable
• Interactive confirmation may be one final resort
  • Information recommendation & archiving
  • Healthcare information support
• COM as a classifier-independent strategy for interaction

Thank you!
