Disease prediction based on prior knowledge

Disease Prediction Based on Prior Knowledge. Gregor Stiglic, Igor Pernek, Peter Kokol, Faculty of Health Sciences, University of Maribor, Slovenia. Zoran Obradovic, Center for Information Science and Technology, Temple University, Philadelphia, USA.

Gregor Stiglic, Igor Pernek, Peter Kokol

Faculty of Health Sciences

University of Maribor

Slovenia

Zoran Obradovic

Center for Information Science and Technology, Temple University, Philadelphia, USA

ACM SIGKDD Workshop on Health Informatics (HI-KDD 2012)

August 12, 2012


Outline

  • Background

  • Objectives

  • Evolution

  • Disease networks

  • SVM

  • Experimental setup

  • Class-imbalanced data

  • Classification

  • Experiments

  • Stability


Background


Objectives

  • Using prior knowledge from human disease networks to lower the burden of building classifiers

  • Estimating disease risk from hospital discharge data as accurately as possible

  • Extending the Support Vector Machine Recursive Feature Elimination (SVM-RFE) approach to Support Vector Machine Reweighted Recursive Feature Elimination (SVM-RRFE)


Evolution

Bringing health data into digital form

Increasing acceptance of electronic health records (comparing patients)

Constructing disease related networks

Increasing number of studies applying data mining approaches in this field


Methods of integrating network knowledge

  • Network centric:

    • Focuses on mapping gene/disease expression data onto a network and uses techniques from network analysis to select the important genes/diseases.

  • Data centric:

    • Focuses on machine learning techniques where prior knowledge from biological networks is used to bias the feature selection process toward strongly connected genes/diseases (such as SVM-RRFE).
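
The data-centric idea can be sketched as a reweighting of per-feature classifier weights by a network-derived prior. This is only an illustration under stated assumptions: the multiplicative form, the function name `reweighted_scores`, the neutral default of 1.0, and all numbers are hypothetical, and the slides do not give the exact formula used by SVM-RRFE.

```python
def reweighted_scores(svm_weights, network_prior):
    """Combine each feature's SVM weight magnitude with a network-derived prior.

    Features missing from the prior keep their plain weight magnitude
    (neutral prior of 1.0). The multiplicative form is an assumption.
    """
    return {
        feature: abs(weight) * network_prior.get(feature, 1.0)
        for feature, weight in svm_weights.items()
    }

# Hypothetical SVM weights for three diagnosis codes.
svm_weights = {"401.9": 0.50, "250.00": -0.45, "V15.82": 0.48}
# Hypothetical prior, e.g. a normalised connectivity score in the disease network.
network_prior = {"401.9": 1.0, "250.00": 0.9, "V15.82": 0.2}

scores = reweighted_scores(svm_weights, network_prior)
by_weight = sorted(svm_weights, key=lambda f: abs(svm_weights[f]), reverse=True)
by_score = sorted(scores, key=scores.get, reverse=True)
```

With these numbers the weakly connected code "V15.82" ranks second by weight alone but last after reweighting, which is the kind of bias toward strongly connected diseases described above.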


SVM

  • An SVM is used for classification: it constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space that divides the data into two separate classes.

  • The hyperplane should separate the classes with the largest possible distance to the nearest training data point of any class (the maximum margin).
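
To make the margin criterion concrete, the following minimal sketch computes the distance from a given hyperplane to its nearest training point and compares two candidate hyperplanes. The toy points and both hyperplanes are hypothetical, and the code only evaluates margins; it does not train an SVM.

```python
import math

def geometric_margin(points, labels, w, b):
    """Smallest signed distance from any training point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical 2-D data: two positive and two negative samples.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two hyperplanes that both separate the classes.
margin_a = geometric_margin(points, labels, w=(1.0, 1.0), b=-2.0)
margin_b = geometric_margin(points, labels, w=(1.0, 0.0), b=-1.0)
```

Both hyperplanes classify every point correctly, but the first leaves a larger gap to its nearest point, so it is the one a maximum-margin classifier would prefer.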


Two basic concepts that are used when constructing disease networks

  • Morbidity:

    • Represents the support for a single diagnosis in the given population.

  • Co-morbidity:

    • Represents the support for co-occurrence of two diseases.
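
As a rough sketch of how these two quantities could be counted, assume each discharge record is reduced to a set of diagnosis codes. The records and the variable names `morbidity` and `comorbidity` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical discharge records: the set of diagnosis codes for each stay.
records = [
    {"272.4", "401.9"},
    {"272.4", "401.9", "250.00"},
    {"401.9"},
    {"250.00", "272.4"},
]

morbidity = Counter()    # number of records containing a single diagnosis
comorbidity = Counter()  # number of records containing both diagnoses of a pair

for codes in records:
    morbidity.update(codes)
    # Sort the codes so each unordered pair is always stored under the same key.
    comorbidity.update(combinations(sorted(codes), 2))
```

Dividing either count by the number of records gives the corresponding support in the population.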


Common measures for calculating relations

Weight:

Relative Risk:

Phi:
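
The formulas for these measures did not survive in the transcript. The sketch below uses the forms of relative risk and phi correlation that are common in the comorbidity-network literature; treating them as the definitions used on this slide is an assumption, and the weight measure is left out because its definition cannot be recovered. The symbols are illustrative: `c_ij` is the number of records with both diseases, `p_i` and `p_j` are the numbers of records with each disease, and `n` is the total number of records.

```python
import math

def relative_risk(c_ij, p_i, p_j, n):
    """Observed co-occurrence count divided by the count expected under independence."""
    return (c_ij * n) / (p_i * p_j)

def phi_correlation(c_ij, p_i, p_j, n):
    """Pearson correlation between two binary disease indicators."""
    denom = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return (c_ij * n - p_i * p_j) / denom

# Hypothetical counts: 1,000 records, 200 with disease i, 100 with disease j, 40 with both.
rr = relative_risk(40, 200, 100, 1000)
phi = phi_correlation(40, 200, 100, 1000)
```

Here the pair co-occurs twice as often as independence would predict, so the relative risk is 2, while phi stays modest because both diseases mostly occur without the other.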

SVM-RFE

SVM-RRFE


Experimental setup

  • Healthcare Cost and Utilization Project (HCUP)

  • Agency for Healthcare Research and Quality

  • Nationwide Inpatient Sample (NIS)

    • 20% of US hospitals

  • Data for the adult population from 2008 is used for network construction.

  • Data for the adult population from 2009 is used for model evaluation.

  • 6,840,196 discharge records in 2008.

  • 6,546,273 discharge records in 2009.


Experimental setup

  • Each record contains:

  • Personal characteristics of a patient

    • Age

    • Gender

    • Race

  • Administrative information:

    • Length of stay

    • Discharge status

  • Medical information:

    • Diagnosis codes (ICD-9-CM)

    • Surgical and nonsurgical procedures


Experimental setup

  • ICD-9-CM:

    • The International Classification of Diseases, 9th Revision, Clinical Modification.

    • Uses a taxonomy of codes with up to 5 digits.

    • The first 3 digits represent the general diagnosis.

    • Up to 2 additional digits describe a more detailed subgroup of the general diagnosis.

  • 14,000 diagnosis codes in the dataset.

  • After removing the codes that were not used in the 2009 data, 11,170 codes remain.
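
The code structure described above can be illustrated with a small helper that separates the general diagnosis from its subgroup. It assumes codes are written in dotted form, as in "272.4"; the function name `split_icd9` is hypothetical, and source files that store codes without the dot would need a different split.

```python
def split_icd9(code):
    """Split a dotted ICD-9-CM diagnosis code into (general diagnosis, subgroup)."""
    category, _, subgroup = code.partition(".")
    return category, subgroup

category, subgroup = split_icd9("272.4")
category_only = split_icd9("401")  # a code with no subgroup digits
```

Grouping records by the first element is one way to collapse detailed codes into general diagnoses.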


Experimental setup

Age group frequency for both 2008 and 2009


Experimental setup

  • Elimination of rare diagnoses with prevalence below 1%

  • Ranking all diagnosis codes

  • Choosing 5 diagnoses with prevalence closest to 20, 10, 5, 2, and 1%
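
One possible reading of this selection step is sketched below: drop codes under the 1% threshold, then pick one code per target prevalence. That reading (one diagnosis per target), the function name `closest_to_targets`, and the prevalence table are all assumptions made for illustration.

```python
def closest_to_targets(prevalence, targets, min_prevalence=0.01):
    """For each target, pick the not-yet-chosen code whose prevalence is nearest."""
    candidates = {c: p for c, p in prevalence.items() if p >= min_prevalence}
    chosen = []
    for target in targets:
        code = min(
            (c for c in candidates if c not in chosen),
            key=lambda c: abs(candidates[c] - target),
        )
        chosen.append(code)
    return chosen

# Hypothetical prevalence table: fraction of records carrying each code.
prevalence = {
    "A": 0.21, "B": 0.12, "C": 0.095, "D": 0.048,
    "E": 0.022, "F": 0.011, "G": 0.004,
}
selected = closest_to_targets(prevalence, targets=[0.20, 0.10, 0.05, 0.02, 0.01])
```

Code "G" never becomes a candidate because its prevalence is below 1%.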


Class-imbalanced data

  • This problem occurs when we have a dataset with a small number of positive (target class) samples and a much larger number of negative samples.

  • Solutions:

    • Undersampling

    • Oversampling

After balancing, classification performance increases.
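
The two balancing strategies listed above can be sketched as follows, assuming the positive class is the smaller one. The sample ids and function names are hypothetical, and real pipelines often use more elaborate resampling schemes.

```python
import random

def undersample(positives, negatives, rng):
    """Keep all positives and a random subset of negatives of the same size."""
    return positives, rng.sample(negatives, len(positives))

def oversample(positives, negatives, rng):
    """Keep all negatives and redraw positives with replacement until sizes match."""
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    return positives + extra, negatives

rng = random.Random(0)             # fixed seed so the sketch is reproducible
positives = list(range(10))        # hypothetical ids of target-class samples
negatives = list(range(100, 190))  # hypothetical ids of the remaining samples

under_pos, under_neg = undersample(positives, negatives, rng)
over_pos, over_neg = oversample(positives, negatives, rng)
```

Undersampling discards information from the majority class, while oversampling repeats minority samples; which trade-off is preferable depends on the data size.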


Experimental setup

  • Repeated random subsampling:

    • In each iteration, 10,000 samples are selected at random for testing.

    • In the first experiment, the training set is balanced.

    • In the second experiment, 75% of the training samples are positive and 25% are negative.

    • Each random subsampling evaluation was repeated 10 times for all target diagnosis codes.
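
A single repetition of this procedure might look like the sketch below. The training-set size is not given in the slides, so `train_size` is an assumption, as are the function name, the small data sizes, and the choice to keep train and test sets disjoint.

```python
import random

def subsample_split(labels, test_size, train_size, positive_fraction, rng):
    """Return (train_idx, test_idx) for one repetition.

    The test set is a plain random draw. The training set is drawn from the
    remaining records with the requested share of positive samples.
    """
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    test_idx, pool = indices[:test_size], indices[test_size:]

    pos = [i for i in pool if labels[i] == 1]
    neg = [i for i in pool if labels[i] == 0]
    n_pos = round(train_size * positive_fraction)
    train_idx = rng.sample(pos, n_pos) + rng.sample(neg, train_size - n_pos)
    return train_idx, test_idx

rng = random.Random(42)
labels = [1] * 400 + [0] * 1600  # hypothetical data: 2,000 records, 20% positive

splits = [
    subsample_split(labels, test_size=500, train_size=200,
                    positive_fraction=0.75, rng=rng)
    for _ in range(10)  # each evaluation is repeated 10 times
]
```

Setting `positive_fraction=0.5` gives the balanced variant of the first experiment.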


Experimental setup

  • Feature elimination:

    • In the first experiment, 10% of the low-impact features (those with the lowest RR measure) are eliminated in each iteration.

    • In the second experiment, 50% of the low-impact features are eliminated in each iteration.
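
The practical difference between the two removal rates is the number of training rounds needed before no features remain. The sketch below counts those rounds, assuming the 11,170 diagnosis codes are the starting features and that at least one feature is removed per round; both are assumptions, as is the function name.

```python
import math

def elimination_schedule(n_features, removal_rate):
    """Feature counts at the start of each training round until none remain."""
    counts = []
    while n_features > 0:
        counts.append(n_features)
        # Remove at least one feature so the loop always terminates.
        n_features -= max(1, math.floor(n_features * removal_rate))
    return counts

rounds_10 = len(elimination_schedule(11170, 0.10))
rounds_50 = len(elimination_schedule(11170, 0.50))
```

The 50% rate needs far fewer rounds than the 10% rate, which is why it is the faster and less complex option.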


Experimental setup

  • Balanced subsampling (50% positive, 50% negative)

    • Feature elimination with 10% removal rate

    • Feature elimination with 50% removal rate

  • Imbalanced subsampling (75% positive, 25% negative)

    • Feature elimination with 10% removal rate

    • Feature elimination with 50% removal rate


Flowchart of the experimental workflow:

  • Dataset

  • Subsampling (repeated 10 times) into a train set and a test set

  • Feature elimination on the train set:

    • Train SVM

    • Sort features

    • Remove 10 or 50 percent of low-impact features

    • Are features left? Yes: train the SVM again. No: stop.

  • Test final model on the test set
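
The elimination loop in this workflow can be sketched in a few lines. This is a simplified stand-in, not the authors' implementation: the linear SVM is trained with a tiny full-batch subgradient method, features are ranked by absolute weight only (no network reweighting), and the function names, hyperparameters, and toy data are all hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [reg * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            # Only samples inside the margin contribute to the subgradient.
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def rfe_ranking(X, y, removal_rate):
    """Return feature indices ordered from last eliminated to first eliminated."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:  # "are features left?"
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Sort surviving features by absolute weight, weakest first.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]))
        n_drop = max(1, int(len(remaining) * removal_rate))
        dropped = [remaining[k] for k in order[:n_drop]]
        eliminated.extend(dropped)
        remaining = [j for j in remaining if j not in dropped]
    return eliminated[::-1]

# Hypothetical data: only feature 0 carries information about the label.
X = [
    [1.0, 0.2, 0.0, 0.1], [1.0, -0.2, 0.0, -0.1], [0.9, 0.1, 0.0, 0.0],
    [-1.0, 0.2, 0.0, 0.1], [-1.0, -0.2, 0.0, -0.1], [-0.9, 0.1, 0.0, 0.0],
]
y = [1, 1, 1, -1, -1, -1]
ranking = rfe_ranking(X, y, removal_rate=0.5)
```

The feature that survives longest ends up first in `ranking`; a final model would then be trained on the top-ranked features and scored on the held-out test set.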


Classification

Comparison of AUC for SVM-RRFE and SVM-RFE with 10% removal rate.
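
For readers unfamiliar with the metric, AUC can be computed as the fraction of positive/negative pairs in which the positive sample receives the higher score. The sketch below uses that pairwise definition on hypothetical labels and scores; it is quadratic in the number of samples and is meant only to show what the compared values measure.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels and classifier scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
value = auc(labels, scores)
```

In this example 11 of the 12 positive/negative pairs are ordered correctly. Because AUC depends only on the ordering of scores, it is a common choice for class-imbalanced test sets.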


Classification

Comparison of AUC for SVM-RRFE and SVM-RFE with 50% removal rate.


Experiments

  • In hospital discharge classification, it is crucial to use less complex and faster methods, so the 50% removal rate is the case of most interest; in that case RRFE improves classification performance.

  • With a 10% removal rate, the difference in AUC between RFE and RRFE is not significant; with a 50% removal rate, it is.

  • Testing RRFE with another large dataset covering 2000 to 2008 shows that a larger network does not produce significantly better classification performance.

  • Using a less complex and more recent disease network does not significantly affect classification performance.


Stability

Frequency of disease code selection in the optimal feature sets for Hyperlipidemia (272.4) classification.
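
A selection-frequency summary of this kind can be computed by counting how often each code appears in the selected feature set across repeated runs. The runs below are hypothetical and the function name is illustrative.

```python
from collections import Counter

def selection_frequency(selected_sets):
    """Fraction of runs in which each feature appears in the selected set."""
    counts = Counter(code for s in selected_sets for code in set(s))
    return {code: c / len(selected_sets) for code, c in counts.items()}

# Hypothetical optimal feature sets from four repeated runs.
runs = [
    {"401.9", "250.00", "414.01"},
    {"401.9", "250.00"},
    {"401.9", "414.01", "V58.67"},
    {"401.9", "250.00", "414.01"},
]
freq = selection_frequency(runs)
```

Codes with a frequency close to 1 are selected consistently, which indicates a stable feature selection method.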


Conclusion

  • The RRFE method was adapted for feature selection in imbalanced, high-dimensional hospital discharge data.

  • Significant improvements in classification performance were observed when large batches of features are eliminated.

  • Now that the classification performance of the proposed solution has been evaluated, it could be used in combination with another classification model.

