
Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records

ChengXiang (“Cheng”) Zhai Department of Computer Science Affiliated with Carl R.Woese Institute for Genomic Biology Department of Statistics School of Information Sciences University of Illinois, Urbana-Champaign http://czhai.cs.illinois.edu, czhai@illinois.edu.


Presentation Transcript


  1. ChengXiang (“Cheng”) Zhai Department of Computer Science Affiliated with Carl R.Woese Institute for Genomic Biology Department of Statistics School of Information Sciences University of Illinois, Urbana-Champaign http://czhai.cs.illinois.edu, czhai@illinois.edu Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records Distinguished Lecture in Causal Discovery, Univ. of Pittsburgh, May 18, 2017

  2. Outline Overview of TIMAN Group Traditional Chinese Medicine (TCM) data analysis KnowEng Center and opportunities for collaboration

  3. My Research Group: TIMAN (Text Information Management & Analysis) http://timan.cs.uiuc.edu • Current: 11 Ph.D. students, 5 MS students, 2 Undergraduates, 1 Visiting scholar • Alumni: 27 Ph.D. students, 40 MS students, 20 Undergraduates → Academia + Industry • Funding:

  4. Research Roadmap of TIMAN [Diagram labels: Big Raw Data (Biomedical Literature, Medical Records, Health Forums, …); Information Retrieval; Small Relevant Data; Text Mining; Decision Support (Biomedical researchers, Physicians, Patients, …)] • We emphasize: • Development of general and robust computational methods • Optimization of human-computer collaboration: computers help humans find patterns in data, and humans train computers to make them intelligent

  5. Our Work in Medical & Health Informatics How can we find similar medical cases in medical literature, in online forums, …? EHR (Patient Records) Medical Case Retrieval Improved Health Care Similar Medical Cases Medical Knowledge Discovery Medical Knowledge How can we analyze EHR together with other knowledge bases to discover medical knowledge (e.g., effectiveness of drugs, ADRs) from EHR?

  6. Today’s Talk: Analysis of Traditional Chinese Medicine Patient Records Probabilistic generative model Improve clustering by exploiting domain knowledge Integration of EMRs and biomedical information network • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

  7. Traditional Chinese medicine (TCM) Tu Youyou (屠呦呦): 2015 Nobel Prize in Physiology or Medicine (jointly with William C. Campbell and Satoshi Ōmura) for discovering artemisinin (青蒿素) • More than 2,500 years of Chinese medical practice • Complementary and alternative medicine • Knowledge is not well-documented • A significant challenge to inherit/use TCM knowledge • An interesting opportunity for data mining!

  8. Vision of TCM Data Mining • TCM patient records = “experimental results” of herbs on patients • Potentially discover effective herbs for treating particular groups of patients • Effective chemical ingredients in effective herbs • Combination with western medicine + genomics + biomedical knowledge • Discover new medical knowledge & provide a scientific foundation for TCM • TCM philosophy → individualized experiments • Large space of empirical hypotheses explored

  9. Collaboration with Beijing TCM Data Center Sheng Wang Edward Huang • Data Sets: > 300,000 clinical cases • Collected EMRs since 2007 from six hospitals • Data stored in a clinical data warehouse • Key Collaborators: • Dr. Runshun Zhang, Dr. Jie Liu: Guang’anmen Hospital, Chinese Academy of Chinese Medical Sciences • Prof. Xuezhong Zhou, Department of Computer Science, Beijing Jiaotong University • PhD Students at UIUC

  10. Outline Probabilistic generative model S. Wang, E. Huang, R. Zhang, X. Zhang, B. Liu, X. Zhou, C. Zhai, A Conditional Probabilistic Model for Joint Analysis of Symptoms, Diseases, and Herbs in Traditional Chinese Medicine Patient Records, IEEE BIBM 2016. • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

  11. Problem definition • Input: TCM patient records (with symptoms, diseases, and herbs) • Output: disease profile: • Typical symptom groups associated with disease • Typical herb groups associated with disease • Correlations between symptom groups and herb groups

  12. An example analysis of patient record

  13. Challenge: mixed vocabulary • Each record may contain multiple diseases • Patient may have anemia, hypertension and chronic gastritis at the same time • The corresponding symptoms come from different diseases

  14. Our solution • A conditional probabilistic topic model • Explicitly model each disease as one topic • Each disease has its own distribution of symptoms and herbs • Difference from topic models (e.g., PLSA/LDA) • Diseases as priors on topics. • A topic with two coordinated subtopics (one for herbs and one for symptoms) • Difference from previous work on TCM analysis • Modeling asymmetric causal relations between diseases and {symptoms, herbs}

  15. The Conditional Symptom-Herb Model [Diagram: observed patient record t, with each Disease linked to Symptom and Herb] • Pr(s|d): typical symptoms of disease d • Pr(h|d): typical herbs of disease d • Pr(d|t): likelihood of patient t having disease d, defined over all diseases of patient t • Optimization: find the Pr(s|d), Pr(h|d) and Pr(d|t) that maximize the probability of all observed { Pr(s,h|t) } (EM algorithm)
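The optimization on this slide can be illustrated with a small EM loop. This is only a sketch under stated assumptions, not the authors' implementation: the slide's formula was not captured in the transcript, so the mixture form Pr(s,h|t) = Σ_d Pr(d|t) Pr(s|d) Pr(h|d), summed over the diseases recorded for patient t, is assumed here, and every symptom-herb pair in a record is treated as one observation. The function names `normalize` and `fit_em`, the smoothing constant `eps`, and the record layout are all hypothetical.

```python
import random
from collections import defaultdict


def normalize(weights):
    """Scale a dict of non-negative weights so that its values sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def fit_em(records, n_iter=50, seed=0, eps=1e-9):
    """Fit Pr(s|d), Pr(h|d), Pr(d|t) by EM (assumed mixture form, see lead-in).

    records: list of (diseases, symptoms, herbs) tuples, each a list of strings.
    """
    rng = random.Random(seed)
    diseases = sorted({d for ds, _, _ in records for d in ds})
    symptoms = sorted({s for _, ss, _ in records for s in ss})
    herbs = sorted({h for _, _, hs in records for h in hs})

    # Random positive start for the disease profiles, uniform start for Pr(d|t).
    p_s = {d: normalize({s: rng.random() + 0.1 for s in symptoms}) for d in diseases}
    p_h = {d: normalize({h: rng.random() + 0.1 for h in herbs}) for d in diseases}
    p_d = [normalize({d: 1.0 for d in ds}) for ds, _, _ in records]

    for _ in range(n_iter):
        count_s = {d: defaultdict(float) for d in diseases}
        count_h = {d: defaultdict(float) for d in diseases}
        count_d = [defaultdict(float) for _ in records]

        # E-step: posterior over the patient's own diseases for each (s, h) pair.
        for t, (ds, ss, hs) in enumerate(records):
            for s in ss:
                for h in hs:
                    post = normalize(
                        {d: p_d[t][d] * p_s[d][s] * p_h[d][h] for d in ds}
                    )
                    for d, q in post.items():
                        count_s[d][s] += q
                        count_h[d][h] += q
                        count_d[t][d] += q

        # M-step: re-estimate each distribution from the expected counts.
        # eps is a tiny smoothing term (an assumption) that avoids zero probabilities.
        p_s = {d: normalize({s: count_s[d][s] + eps for s in symptoms}) for d in diseases}
        p_h = {d: normalize({h: count_h[d][h] + eps for h in herbs}) for d in diseases}
        p_d = [
            normalize({d: count_d[t][d] + eps for d in ds})
            for t, (ds, _, _) in enumerate(records)
        ]
    return p_s, p_h, p_d
```

Restricting the posterior to the diseases actually recorded for a patient is what makes the model "conditional": a symptom in a multi-disease record can only be attributed to one of that patient's diagnoses.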

  16. Evaluation: Dataset • 10,907 TCM patient records from digestive system treatment • 3,000 symptoms, 97 diseases and 652 herbs • Most frequently occurring disease: chronic gastritis • Most frequently occurring symptoms: abdominal pain and chills • Ground truth: 27,285 manually curated herb-symptom relationships

  17. Output of the model

  18. “Typical Symptoms” of 3 Diseases

  19. “Typical Herbs” Prescribed for 3 Diseases

  20. Algorithm-Recommended Herbs vs. Physician-Prescribed Herbs

  21. Predict symptom-herb annotations

  22. herb-symptoms relationships Top 10 herb-symptoms relationships identified by our method but not by frequent pattern mining

  23. Summary of Disease Profiling • A new probabilistic model to analyze traditional Chinese medicine patient records • Discovers meaningful TCM knowledge and outperforms previous work • Can be used to develop a practically useful clinical decision-making system • Future Work • Build an application system (e.g., recommending herbs) • Analyze effectiveness of herbs

  24. Outline Improve clustering by exploiting domain knowledge E. Huang, S. Wang, R. Zhang, B. Liu, X. Zhou, C. Zhai, PaReCat: Patient Record Subcategorization for Precision Traditional Chinese Medicine, ACM BCB 2016. • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

  25. Subcategorizations

  26. Approach • Use both herb and symptom information for clustering • Challenge: data sparseness • Solution: Leverage domain knowledge (TCM dictionary); using an embedding approach

  27. PaReCat Pipeline

  28. Data Description • 2,276 medical records, manually annotated by doctors • 3 level 1 labels • 51 level 2 labels • 274 level 3 labels

  29. Clustering • Agglomerative clustering • Cosine similarity as affinity measure • Clustered with level 2 labels as ground truth • Number of clusters = number of level 2 labels • Two feature types • Clustering using only symptoms • Clustering with both symptoms and herbs • Didn’t do clustering using only herbs (symptoms are assumed to be always available)
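The clustering setup on this slide can be sketched as follows. The slide names agglomerative clustering with cosine similarity but does not state the linkage criterion, so average linkage is an assumption here, and `cosine_distance` and `agglomerative` are hypothetical helper names. The naive search over cluster pairs is meant for illustration on small inputs only.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; all-zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative(vectors, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def average_link(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average pairwise distance.
        _, i, j = min(
            (average_link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for member in members:
            labels[member] = label
    return labels
```

In the setting described on the slide, each vector would be a patient's symptom features (optionally concatenated with herb features), and `n_clusters` would be set to the number of level 2 labels.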

  30. Evaluation • Adjusted Rand Index (ARI) • Maximum score of 1.0 • Counts overlaps in contingency table • Adjusted for chance
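The Adjusted Rand Index can be computed directly from the contingency table, as the bullets describe. The sketch below uses the standard Hubert-Arabie form of the index; the function name `adjusted_rand_index` and the handling of degenerate inputs are choices made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treated as trivial agreement

    # Pair counts within contingency cells, within rows, and within columns.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs  # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

Because the score depends only on which items are grouped together, renaming the cluster labels does not change it, which is why it suits comparing cluster assignments against annotated level 2 labels.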

  31. Sample Clustering Results

  32. Similar symptoms treated differently

  33. Different symptoms treated similarly

  34. Summary of Subcategorization • First study of subcategorization of TCM records • Improves clustering by using a TCM dictionary • Experimental results show that using the TCM dictionary is beneficial • Future work: comparative analysis of similar subcategories & effectiveness of herbs

  35. Outline Integration of EMRs and biomedical information network E. W. Huang, S. Wang, B. Li, R. Zhang, B. Liu, R. Zhang, J. Liu, X. Zhou, H. Lin, C. Zhai. HEMnet: Integration of Electronic Medical Records with Molecular Interaction Networks and Domain Knowledge for Survival Analysis, under review • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)

  36. Challenge in Analysis of EMR: Missing Data • Many data mining methods assume availability of all features • Assumption usually doesn’t hold for EMR • Example: doctors do not perform all medical tests on all patients • Mean imputation • Most common method to fill missing values • Introduces noise rather than reducing it
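For reference, mean imputation as mentioned on this slide amounts to the following. This is a minimal sketch: missing entries are represented as `None`, the function name `mean_impute` is hypothetical, and filling a column with no observed values with 0.0 is an assumption made for this example.

```python
def mean_impute(matrix):
    """Replace each missing entry (None) with the mean of its column."""
    n_cols = len(matrix[0]) if matrix else 0
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Assumption: a column with no observed values is filled with 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]
```

When most entries are missing, most of each imputed row consists of the same column means, so patient records end up looking alike regardless of their actual condition.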

  37. Challenge in Analysis of EMR: Semantic mismatch • Features that mean the same thing, but occupy different spaces in the vocabulary • Example: “hypertension” and “high blood pressure” • Similar problem in text mining can be solved with word2vec or other word embedding techniques, but such techniques are insufficient for EMR (e.g., med2vec)

  38. Solution: External Knowledge • Idea: use external knowledge (expert curated, literature mined, etc.) to fill in missing context information • Molecular interaction networks, such as protein-protein interaction (PPI) networks (protein-protein edges) • Known drug targets (drug-protein edges) • Known drug-symptom relationships (drug-symptom edges) • But how can we incorporate all of this external information into EMRs?

  39. HEMnet • Create a heterogeneous information network: HEterogeneous Electronic Medical network (HEMnet) • Use co-occurrence information to build network • Example: create edge between node d and node s if, for a patient, d is a drug that is prescribed to treat some symptom s. • With this co-occurrence network, we can add in the PPI network and other domain knowledge to create the HEMnet
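The network construction described above can be sketched as a typed edge list. This is an illustration under assumptions: co-occurrence of a drug and a symptom in the same record is used as a proxy for "prescribed to treat", the record keys `"drugs"` and `"symptoms"`, the edge-type strings, and the function names are all hypothetical, and the example entity names in the usage note are placeholders.

```python
from collections import Counter
from itertools import product


def cooccurrence_edges(records):
    """Count drug-symptom co-occurrences across patient records.

    records: list of dicts with hypothetical keys "drugs" and "symptoms".
    Returns a Counter keyed by (edge_type, source_node, target_node).
    """
    edges = Counter()
    for record in records:
        for drug, symptom in product(set(record["drugs"]), set(record["symptoms"])):
            edges[("drug-symptom", drug, symptom)] += 1
    return edges


def add_external_edges(edges, edge_type, pairs):
    """Add edges from an external source (e.g., a PPI network or known drug targets)."""
    for source, target in pairs:
        edges[(edge_type, source, target)] += 1
    return edges
```

A call such as `add_external_edges(edges, "protein-protein", [("P1", "P2")])` would then layer curated knowledge on top of the co-occurrence network, with the edge type kept on every edge so that the network stays heterogeneous.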

  40. Data Description • Lung cancer data set • 43 patients with squamous-cell lung carcinoma • 90 patients with non-squamous-cell lung carcinoma • 449 unique features (excluding proteins) • 133 patient records with high degree of sparsity • Each record missing an average of 407 out of 449 features • Patient record matrix: 133x449 matrix • Each record also contains the patient’s survival information • Number of days until death • If no death event, number of days until hospital discharge • HEMnet • 11,911 nodes (most of these are proteins) • 379,715 edges • 23 edge types

  41. Using HEMnet to enrich EMRs • With the HEMnet, we now wish to fill in the missing “contexts” of the medical records • Use a network embedding technique, ProSNet, which has been shown to be useful in protein function prediction • word2vec: in a sentence, a word’s neighbors should predict the word’s context • ProSNet: in a graph, a node’s neighbors should predict the node’s context • Get a low-dimensional vector for each node • Get a pairwise cosine similarity matrix of nodes • Multiply the similarity matrix into the patient record matrix
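The last three steps on this slide (node vectors, pairwise cosine similarity, matrix product) can be sketched as below. The embeddings are taken as given, since ProSNet itself is not reproduced here. Right-multiplying the patient matrix by the feature-by-feature similarity matrix is an assumption about the order of the product, and `cosine_similarity_matrix` and `enrich` are hypothetical names.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between feature embeddings (one vector per feature)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0.0 and norms[j] > 0.0:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def enrich(patient_matrix, sim):
    """Compute patient_matrix x sim, spreading each observed feature to similar features."""
    n = len(sim)
    return [
        [sum(row[k] * sim[k][j] for k in range(n)) for j in range(n)]
        for row in patient_matrix
    ]
```

Under this reading, a patient with a recorded value for one feature receives a nonzero value for every feature whose embedding is similar to it, which is how the product fills in missing context and bridges features that differ only in name.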

  42. ProSNet Network Embedding [Wang et al. 17] [Figure: a heterogeneous network (input) is mapped by ProSNet dimensionality reduction to a node/entity vector space (output) with axes x, y, z, in which entities E1 to E5 are placed] • General: applicable to any heterogeneous network • e.g., gene, drug, disease network • Efficient: • 60K nodes and 10M edges: less than 30 minutes on a 12-core CPU

  43. Path-based node similarity score [Equation not captured in the transcript; its annotated parts are:] • Score of a path of type M that starts at node u and ends at node v • Weights on the different dimensions of Xu • Weights on the different dimensions of Xv • Global bias for path type M • Xu, Xv: compact node representations

  44. Objective function

  45. Fast online optimization • Challenge: • The heterogeneous network may have millions of nodes and edges • Methods based on random walk with restart can only handle networks with ~10K nodes • Solution: Online learning • Randomly sample a path in the network every time • Only optimize nodes on this path • Sampling according to node degree • Inherently parallelized • A network with 60K nodes and 10M edges takes less than 30 minutes on a 12-core CPU

  46. Experiments • Compare performance of HEMnet-enriched patient record matrix with the baseline patient record matrix • Clustering experiment • Hospitals are interested in whether a patient will survive • With patient survival functions as ground truth, we want to separate patients into two groups • Thus, the best clustering is one in which the two clusters have the most different survival functions • Survival functions can be estimated with Kaplan-Meier curves, then compared with log-rank test • Low p-value means the two clusters have significantly different survival rates • So, the lower the p-value, the more successful the method was in separating relatively healthy patients from those with short survival times
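The evaluation described above can be sketched with a Kaplan-Meier estimator and a two-group log-rank test. This is a textbook formulation written for illustration, not the code used in the study. The function names are hypothetical, durations are in days, and an event flag of `False` marks a censored patient (for example, discharged without a death event). The p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve, survival = [], 1.0
    for t in sorted({d for d, e in zip(durations, events) if e}):
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if e and d == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_p_value(dur_a, ev_a, dur_b, ev_b):
    """Two-sided log-rank test p-value for two groups (inputs are lists)."""
    event_times = sorted({d for d, e in zip(dur_a + dur_b, ev_a + ev_b) if e})
    observed_minus_expected, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for d in dur_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in dur_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(dur_a, ev_a) if e and d == t)
        d_b = sum(1 for d, e in zip(dur_b, ev_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - n_a * d / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0.0:
        return 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))
```

To score a clustering, the durations and event flags of the patients in each of the two clusters would be passed as the two groups; a smaller p-value indicates that the clusters have more clearly separated survival curves.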

  47. Squamous-Cell Lung Carcinoma Results With HEMnet, p-value = 0.001020 Baseline, p-value = 0.01267

  48. Non-Squamous-Cell Lung Carcinoma Results With HEMnet, p-value = 0.003652 Baseline, p-value = 0.02333

  49. Summary of Survival Analysis • HEMnet-enriched patient records can improve the performance of clustering over the baseline • With the baseline, neither cancer subtype was significantly separated at the 0.01 level, whereas with the HEMnet-enriched feature matrix both were • Therefore, the external information (PPI network and domain knowledge) is helpful in solving the issues associated with missing data and semantic mismatch

  50. Summary: Analysis of Traditional Chinese Medicine Patient Records Probabilistic generative model Improve clustering by exploiting domain knowledge Integration of EMRs and biomedical information network • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review)
