Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records

ChengXiang (“Cheng”) Zhai Department of Computer Science Affiliated with Carl R.Woese Institute for Genomic Biology Department of Statistics School of Information Sciences University of Illinois, Urbana-Champaign http://czhai.cs.illinois.edu, czhai@illinois.edu Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records Distinguished Lecture in Causal Discovery, Univ. of Pittsburgh, May 18, 2017

Outline Overview of TIMAN Group Traditional Chinese Medicine (TCM) data analysis KnowEng Center and opportunities for collaboration

My Research Group: TIMAN(Text Information Management & Analysis) http://timan.cs.uiuc.edu Current: 11 Ph.D. students 5 MS students 2 Undergraduates 1 Visiting scholar Alumni: 27 Ph.D. students 40 MS students 20 Undergraduates  Academia + Industry Funding:

Research Roadmap of TIMAN Text Mining Decision Support Information Retrieval Biomedical researchers Physicians Patients … Biomedical Literature Medical Records Health Forums … Big Raw Data Small Relevant Data • We emphasize • - Development of general and robust computational methods • optimization of human-computer collaboration • + computers help human find patterns in data • + humans train computers to make them intelligent

Our Work in Medical & Health Informatics How can we find similar medical cases in medical literature, in online forums, …? EHR (Patient Records) Medical Case Retrieval Improved Health Care Similar Medical Cases Medical Knowledge Discovery Medical Knowledge How can we analyze EHR together with other knowledge bases to discover medical knowledge (e.g., effectiveness of drugs, ADRs) from EHR?

Today’s Talk: Analysis of Traditional Chinese Medicine Patient Records Probabilistic generative model Improve clustering by exploiting domain knowledge Integration of EMRs and biomedical information network • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

Traditional Chinese medicine (TCM) Tu Youyou屠呦呦 2015 Nobel Prize in Physiology or Medicine (jointly with William C. Campbell and Satoshi Ōmura) for discovering artemisinin (青蒿素) • More than 2,500 years of Chinese medical practice • Complementary and alternative medicine • Knowledge is not well-documented • A significant challenge to inherit/use TCM knowledge • An interesting opportunity for data mining!

Vision of TCM Data Mining • TCM patient records = “experimental results” of herbs on patients • Potentially discover effective herbs for treating particular groups of patients  • Effective chemical ingredients in effective herbs  • Combination with western medicine + genomics + biomedical knowledge  • Discover new medical knowledge & Provide a scientific foundation for TCM • TCM philosophy  Individualized experiments • Large-space of empirical hypotheses explored

Collaboration with Beijing TCM Data Center Sheng Wang Edward Huang • Data Sets: > 300,000 clinical cases • Collected EMRs since 2007 from six hospitals • Data stored in a clinical data warehouse • Key Collaborators: • Dr. Runshun Zhang, Dr. Jie Liu: Guang’anmen Hospital, Chinese Academy of Chinese Medical Sciences • Prof. Xuezhong Zhou, Department of Computer Science, Beijing Jiaotong University • PhD Students at UIUC

Outline Probabilistic generative model S. Wang, E. Huang, R. Zhang, X. Zhang, B. Liu. X. Zhou, C. Zhai, A Conditional Probabilistic Model for Joint Analysis of Symptoms, Diseases, and Herbs in Traditional Chinese Medicine Patient Records,IEEE BIBM 2016. • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

Problem definition • Input: TCM patient records (with symptoms, diseases, and herbs) • Output: disease profile: • Typical symptom groups associated with disease • Typical herb groups associated with disease • Correlations between symptom groups and herb groups

An example analysis of patient record

Challenge: mixed vocabulary • Each record may contain multiple diseases • Patient may have anemia, hypertension and chronic gastritis at the same time • The corresponding symptoms come from different diseases

Our solution • A conditional probabilistic topic model • Explicitly model each disease as one topic • Each disease has its own distribution of symptoms and herbs • Difference from topic models (e.g., PLSA/LDA) • Diseases as priors on topics. • A topic with two coordinated subtopics (one for herbs and one for symptoms) • Difference from previous work on TCM analysis • Modeling asymmetric causal relations between diseases and {symptoms, herbs}

The Conditional Symptom-Herb Model Typical symptoms of disease d Typical herbs of disease d Likelihood of patient t having disease d Symptom Disease Herb All diseases of patient t Observed Patient Record Optimization: Find optimal Pr(s|d), Pr(h|d) and Pr(d|t) so that the probability of all the observed { Pr(s,h|t)} would be maximized (EM algorithm)

Evaluation: Dataset • 10,907 patients TCM records in digestive system treatment • 3,000 symptoms, 97 diseases and 652 herbs • Most frequently occurring disease: chronic gastritis • Most frequently occurring symptoms: abdominal pain and chills • Ground truth: 27,285 manually curated herb-symptom relationship.

Output of the model

“Typical Symptoms” of 3 Diseases

“Typical Herbs” Prescribed for 3 Diseases

Algorithm-Recommended Herbs vs. Physician-Prescribed Herbs

Predict symptom-herb annotations

herb-symptoms relationships Top 10 herb-symptoms relationships identified by our method but not by frequent pattern mining

Summary of Disease Profiling • A new probabilistic model to analysis traditional Chinese medicine patient record • discovers meaningful TCM knowledge and outperforms previous work • can be used to develop a practically useful clinical decision making system • Future Work • Build an application system (e.g., recommending herbs) • Analyze effectiveness of herbs

Outline Improve clustering by exploiting domain knowledge E. Huang, S. Wang, R. Zhang, B. Liu, X. Zhou, C. Zhai, PaReCat: Patient Record Subcategorization for Precision Traditional Chinese Medicine, ACM BCB 2016. • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huanget al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

Subcategorizations

Approach • Use both herb and symptom information for clustering • Challenge: data sparseness • Solution: Leverage domain knowledge (TCM dictionary); using an embedding approach

PaReCat Pipeline

Data Description • 2,276 medical records Manually annotated by doctors • 3 level 1 labels • 51 level 2 labels • 274 level 3 labels

Clustering • Agglomerative clustering • Cosine similarity as affinity measure • Clustered with level 2 labels as ground truth • Number of clusters = number of level 2 labels • Two feature types • Clustering using only symptoms • Clustering with both symptoms and herbs • Didn’t do clustering using only herbs (symptoms are assumed to be always available)

Evaluation • Adjusted Rand Index (ARI) • Maximum score of 1.0 • Counts overlaps in contingency table • Adjusted for chance

Sample Clustering Results

Similar symptoms treated differently

Different symptoms treated similarly

Summary of Subcategorization First study of subcategorization of TCM records Improve clustering by using TCM dictionary Experiment results show that using TCM is beneficial Future work: comparative analysis of similar subcategories & effectiveness of herbs

Outline Integration of EMRs and biomedical information network E. W. Huang, S. Wang, B. Li, R. Zhang, B. Liu, R. Zhang, J. Liu, X. Zhou, H. Lin, C. Zhai. HEMnet: Integration of Electronic Medical Records with Molecular Interaction Networks and Domain Knowledge for Survival Analysis, under review • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

Challenge in Analysis of EMR: Missing Data • Many data mining methods assume availability of all features • Assumption usually doesn’t hold for EMR • Example: doctors do not perform all medical tests on all patients • Mean imputation • Most common method to fill missing values • Introduces noise rather than reduce it

Challenge in Analysis of EMR: Semantic mismatch • Features that mean the same thing, but occupy different spaces in the vocabulary • Example: “hypertension” and “high blood pressure” • Similar problem in text mining can be solved with word2vec or other word embedding techniques, but such techniques are insufficient for EMR (e.g., med2vec)

Solution: External Knowledge • Idea: use external knowledge (expert curated, literature mined, etc.) to fill in missing context information • Molecular interaction networks, such as protein-protein interaction (PPI) networks (protein-protein edges) • Known drug targets (drug-protein edges) • Known drug-symptom relationships (drug-symptom edges) • But how can we incorporate all of this external information into EMRs?

HEMnet • Create a heterogeneous information network: HEterogeneous Electronic Medical network (HEMnet) • Use co-occurrence information to build network • Example: create edge between node d and node s if, for a patient, d is a drug that is prescribed to treat some symptom s. • With this co-occurrence network, we can add in the PPI network and other domain knowledge to create the HEMnet

Data Description • Lung cancer data set • 43 patients with squamous-cell lung carcinoma • 90 patients non-squamous-cell lung carcinoma • 449 unique features (excluding proteins) • 133 patient records with high degree of sparsity • Each record missing an average of 407 out of 449 features • Patient record matrix: 133x449 matrix • Each record also contains the patient’s survival information • Number of days until death • If no death event, number of days until hospital discharge • HEMnet • 11,911 nodes (most of these are proteins) • 379,715 edges • 23 edge types

Using HEMnet to enrich EMRs • With the HEMnet, we now wish to fill in the missing “contexts” of the medical records • Use network embedding technique, ProSNet, which has been shown to be useful in protein function prediction • word2vec: in sentence, a word’s neighbors should predict the word’s context • ProSNet: in graph, a node’s neighbors should predict the node’s context • Get a low-dimensional vector for each node • Get a pairwise cosine similarity matrix of nodes • Multiply matrix into patient record matrix

Heterogenous network PosNet Network Embedding [Wang et al. 17] z E3 E2 E4 E1 y Node/Entity vector space x E5 ProsNet Dimensionality Reduction Input Output • General: applicable to any heterogeneous network • e.g., gene, drug, disease network • Efficient: • 60K nodes and 10M edges: less than 30 minutes on a 12-core CPU

Path based nodes similarity score Score of path M starts from node u and ends at node v Weights on different dimension of Xv Global bias for path type M Weights on different dimension of Xu Compact node representation

Objective function

Fast online optimization • Challenge: • The heterogeneous network may have millions of nodes and edges • Methods based on random walk with restart can only handle network with ~10K nodes • Solution: Online learning • Randomly sample a path in the network every time • Only optimize nodes on this path • Sampling according to node degree • Inherently parallelized • A network with 60K nodes and 10M edges. • Less than 30 minutes on a 12-core CPU

Experiments • Compare performance of HEMnet-enriched patient record matrix with the baseline patient record matrix • Clustering experiment • Hospitals are interested in whether a patient will survive • With patient survival functions as ground truth, we want to separate patients into two groups • Thus, the best clustering is one in which the two clusters have the most different survival functions • Survival functions can be estimated with Kaplan-Meier curves, then compared with log-rank test • Low p-value means the two clusters have significantly different survival rates • So, the lower the p-value, the more successful the method was in separating relatively healthy patients from those with short survival times

Squamous-Cell Lung Carcinoma Results With HEMnet, p-value = 0.001020 Baseline, p-value = 0.01267

Non-Squamous-Cell Lung Carcinoma Results With HEMnet, p-value = 0.003652 Baseline, p-value = 0.02333

Summary of Survival Analysis HEMnet-enriched patient records can improve the performance of clustering over the baseline For the baseline, neither cancer subtype was significantly separated at the 0.01 level, while the HEMnet-enriched feature matrix was Therefore, the external information (PPI network and domain knowledge) is helpful in solving the issues associated with missing data and semantic mismatch

Summary: Analysis of Traditional Chinese Medicine Patient Records Probabilistic generative model Improve clustering by exploiting domain knowledge Integration of EMRs and biomedical information network • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records

Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records

Presentation Transcript

Traditional Chinese Medicine

Traditional Chinese medicine in treating hypertension

SHANGHAI UNIVERSITY OF TRADITIONAL CHINESE MEDICINE

Traditional Chinese Medicine anno 2009

Traditional Chinese Veterinary Medicine

Traditional Chinese Medicine

Traditional Chinese Medicine and Acupuncture

Traditional Chinese Medicine

Traditional Chinese Herbal Medicine: An Overview

Traditional Approaches to Modeling and Analysis

Traditional Chinese Medicine

Traditional Chinese Medicine and Acupuncture

TRADITIONAL CHINESE MEDICINE TCM

Global Equipments Of Traditional Chinese Medicine Industry

Traditional Chinese Medicine Dublin

Chinese herbal medicine or traditional Chinese medicine

Traditional Chinese Medicine Expert - City Acupuncture

Brief Introduction of Traditional Chinese Medicine

Traditional Chinese Medicine

Winter Health According to Traditional Chinese Medicine

Traditional Chinese Medicine Practitioner

The Benefits of Using Traditional Chinese Medicine