
Majid Masso School of Systems Biology, George Mason University Manassas, Virginia

Prediction of HIV-1 Drug Resistance: Representation of Target Sequence Mutational Patterns via an n-Grams Approach. Majid Masso School of Systems Biology, George Mason University Manassas, Virginia BIBM 2012, Philadelphia, Pennsylvania. Graphical Outline of Presentation.



Presentation Transcript


  1. Prediction of HIV-1 Drug Resistance: Representation of Target Sequence Mutational Patterns via an n-Grams Approach. Majid Masso, School of Systems Biology, George Mason University, Manassas, Virginia. BIBM 2012, Philadelphia, Pennsylvania

  2. Graphical Outline of Presentation

  3. HIV-1 Protein Sequence Datasets
     • Data available from Stanford HIV Drug Resistance Database
     • 548 protease (PR) and 331 reverse transcriptase (RT) sequences with distinct mutational patterns defined by residue substitutions
     • For each of 8 PR and 11 RT inhibitors, PhenoSense assay used to measure degree to which mutant target proteins are susceptible
     • PR/RT genotyping much faster and cheaper than phenotyping
     • Hence accurate predictive models of drug susceptibility only from target sequence are in high demand
     • Here we develop 19 inhibitor-specific predictive classification and regression models trained on the available phenotype data

  4. HIV-1 Protein Sequence Datasets (table not reproduced)
     a. S, sensitive; I, intermediate; R, resistant

  5. Sequence Feature Vectors Using n-Grams
     • Used successfully by other groups for sequence representation to study proteins; first application in this context (HIV-1 PR/RT)
     • Each of the 19 inhibitor sequence datasets encoded separately
     • Relative frequency method: a sliding window of size n = 2 captures all ordered 2-grams of the sequences; calculate the relative frequency for all 400 types of 2-grams; represent each sequence as an ordered vector of relative frequencies
     • Counts method: each sequence represented as a 400-dimensional vector, in which each component represents a specific 2-gram type and its value is the absolute frequency of that 2-gram in the sequence
     • Dataset sequences have inhibitor susceptibility (phenotype) values (regression models), which can be placed into 3 (S/I/R) groups (classification models)
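The two encodings on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the 20-letter alphabet ordering, the function names `bigram_counts` and `bigram_rel_freqs`, and the choice to normalise each sequence's counts by its own number of recognised windows are all assumptions, since the slide does not spell out how the relative frequencies were computed.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 * 20 = 400 ordered 2-gram types, in a fixed order.
BIGRAMS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def bigram_counts(seq):
    """Counts method: 400-dim vector of absolute 2-gram frequencies."""
    # Slide a window of size 2 along the sequence.
    windows = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Windows containing non-standard characters are simply not counted.
    return [windows.get(bigram, 0) for bigram in BIGRAMS]


def bigram_rel_freqs(seq):
    """Relative frequencies: counts divided by the number of counted windows.

    Assumption: normalisation is per sequence; the slide does not say.
    """
    counts = bigram_counts(seq)
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


if __name__ == "__main__":
    toy = "ACAC"  # toy sequence, not real PR/RT data
    vec = bigram_counts(toy)
    print(len(vec), vec[BIGRAMS.index("AC")], vec[BIGRAMS.index("CA")])
```

For the toy sequence `ACAC` the three windows are `AC`, `CA`, `AC`, so the `AC` component holds 2 and the `CA` component holds 1; every other component is zero.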

  6. Classification and Regression Models
     • Algorithms: random forest (RF) for classification, reduced-error pruned tree (REPTree) for regression, implemented in Weka
     • Testing: stratified tenfold cross-validation applied to each dataset
     • Reported results on each dataset:
       • RF classification: accuracy (% correct), out-of-bag (OOB) error, balanced error rate (BER), area under ROC curve (AUC)
       • REPTree regression: corr coeff (r²), mean-squared error (MSE), accuracy (% correct) based on where predicted numerical susceptibility values fall relative to S/I/R category thresholds
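Two of the reported quantities are easy to illustrate without Weka: turning a numerical susceptibility prediction into an S/I/R label, and the balanced error rate. The sketch below is hypothetical throughout. The helper names are invented, the two cutoffs passed to `categorize` are placeholders (the slides do not give the actual per-inhibitor thresholds), and it assumes larger values mean greater resistance. BER is taken in its usual sense, the mean of the per-class error rates.

```python
def categorize(value, low_cutoff, high_cutoff):
    """Map a numerical susceptibility value to S, I, or R.

    Assumption: larger values mean more resistance; cutoffs are placeholders.
    """
    if value < low_cutoff:
        return "S"
    if value < high_cutoff:
        return "I"
    return "R"


def accuracy(actual, predicted):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def balanced_error_rate(actual, predicted):
    """Mean of the per-class error rates over classes present in `actual`."""
    rates = []
    for label in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == label]
        errors = sum(1 for i in idx if predicted[i] != label)
        rates.append(errors / len(idx))
    return sum(rates) / len(rates)


if __name__ == "__main__":
    actual = ["S", "S", "S", "R"]
    predicted = ["S", "S", "S", "S"]
    print(accuracy(actual, predicted), balanced_error_rate(actual, predicted))
```

In the toy case, accuracy is 0.75 while BER is 0.5, which shows why BER is worth reporting next to accuracy when the S/I/R groups are unevenly sized.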

  7. Accuracy Results (table not reproduced; surviving comparison values: Rhee, et al. (Stanford): 0.78, 0.76, 0.83)

  8. Information-Rich REPTree Attributes
     • Based on relative frequency method for generating sequence feature vectors
     • Node attribute i is a vector component number, whose value is the rel. freq. for the (i, i + 1) sequence 2-gram
     • Ex.: root node 10 for APV corresponds to PR sequence positions (10, 11), and at least one of these is known to be an important drug resistance position (10 is in both IAS and TSM subsets)
     Tables (not reproduced): Protease Inhibitors; Nucleoside / Nucleotide RT Inhibitors; Non-nucleoside RT Inhibitors
     a. Regular font, both IAS and TSM sets of positions; bold, TSM only; underlined, neither.

  9. Application: Drug Cocktail Effectiveness
     • Used relative frequency method and REPTree regression
     • Train with one inhibitor dataset, test with another
     • High corr coeff (r) between actual and predicted susceptibility values on test set: both inhibitors (train and test sets) have similar resistance patterns and/or are likely not good taken together
     • Low or slightly negative r: potentially good in combination
     Table annotations (table not reproduced):
     • Labels: known bad pairing; known good pairing
     • 3TC/ABC or FTC/ABC pairs are effective, but carry a high risk of severe adverse events that require stoppage
     • Shaded areas: NRTI/NNRTI pairs (known good together)
     • Two NNRTIs should NOT be taken together (based on clinical trials)
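The quantity compared across inhibitor pairs is the correlation coefficient r between actual and predicted susceptibility values on the test set. A minimal sketch of that calculation follows, assuming the ordinary Pearson formula; the function name and the toy numbers are illustrative only and are not taken from the study.

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_p)


if __name__ == "__main__":
    # Toy values only: a model trained on one inhibitor, tested on another.
    actual = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(round(pearson_r(actual, predicted), 3))
```

Under the slide's reading, an r close to 1 for a train/test pair would suggest overlapping resistance patterns, while an r near zero or slightly negative would flag a candidate combination.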

  10. Acknowledgements and References
     • Thanks to the Stanford HIV Drug Resistance Database (http://hivdb.stanford.edu/) for the genotype-phenotype correlation data characterizing HIV-1 PR and RT sequences
     • This study was inspired by Rhee, et al., PNAS (2006)
     • Effective cocktails, and drugs not to co-administer, based on Antiretroviral Guidelines for Adults and Adolescents from the U.S. Department of Health and Human Services: http://www.aidsinfo.nih.gov/ContentFiles/AdultandAdolescentGL.pdf
