Ivan Dimitrov

Ivan Dimitrov. School of Pharmacy Medical University of Sofia. Application of machine learning techniques for allergenicity prediction. 2nd Regional Conference “Supercomputing Applications in Science and Industry” Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011.

Presentation Transcript


  1. Ivan Dimitrov School of Pharmacy Medical University of Sofia Application of machine learning techniques for allergenicity prediction 2nd Regional Conference “Supercomputing Applications in Science and Industry” Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011

  2. Allergen processing pathways C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283

  3. FAO and WHO Codex Alimentarius guidelines for evaluating the potential allergenicity of novel proteins. A query protein is potentially allergenic if it has an identity of 6 to 8 contiguous amino acids, or > 35% sequence similarity over a window of 80 amino acids, when compared with known allergens. Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003. Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization.
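The two criteria on this slide can be sketched in code. This is only a minimal illustration and not the procedure of any tool named in the talk: the function names are hypothetical, the window comparison is ungapped and counts identical residues only (the slide speaks of sequence similarity, and real pipelines use an alignment program), and the k-mer length defaults to 6, the lower bound of the 6 to 8 range.

```python
def shares_contiguous_identity(query: str, allergen: str, k: int = 6) -> bool:
    """True if query and allergen share at least one identical stretch of k residues."""
    if len(query) < k or len(allergen) < k:
        return False
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers for i in range(len(query) - k + 1))


def max_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Highest ungapped percent identity of any query window against the allergen.

    Assumption: windows are compared position by position without gaps, and
    matches are always divided by the full window length, so sequences shorter
    than the window can never reach 100%.
    """
    best = 0.0
    for q_start in range(max(1, len(query) - window + 1)):
        q_win = query[q_start:q_start + window]
        for a_start in range(max(1, len(allergen) - len(q_win) + 1)):
            a_win = allergen[a_start:a_start + len(q_win)]
            matches = sum(a == b for a, b in zip(q_win, a_win))
            best = max(best, 100.0 * matches / window)
    return best


def flagged_as_potential_allergen(query, known_allergens, k=6, window=80, threshold=35.0):
    """Apply both criteria against every known allergen; either one is enough to flag."""
    return any(
        shares_contiguous_identity(query, allergen, k)
        or max_window_identity(query, allergen, window) > threshold
        for allergen in known_allergens
    )
```

Because a single shared hexapeptide is enough to flag a protein, a rule of this kind tends to favour sensitivity over precision, which is the behaviour the next slide attributes to alignment-based methods.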

  4. Bioinformatics approaches to allergen prediction
     1. Sequence-alignment search of the query protein
        • Extensive databases of known allergen proteins and the FAO/WHO guidelines
          - Structural Database of Allergenic Proteins
          - Allermatch
        Characteristics:
          - High sensitivity (true positives/(true positives + false negatives))
          - Many false positives, and therefore low precision (true positives/(true positives + false positives))
          - Discovery of novel antigens is restricted by their lack of similarity to known allergens.
     Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362; Fiers et al. BMC Bioinformatics 2004, 5, 133

  5. Bioinformatics approaches to allergen prediction
     2. Identification of conserved allergenicity-related linear motifs
        • Comparing allergens to non-allergens by the MEME motif discovery tool
          - Clustering of known allergens, wavelet analysis and hidden Markov model
          - Automated Selection of Allergen-Representative Peptides (DASARP)
        • Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP)
          - Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine
     Both approaches are based on the assumption that allergenicity is a linearly coded property.
     Stadler and Stadler FASEB J. 2003, 17, 1141-1143; Saha and Raghava Nucleic Acids Research 2006, 34, 202-209; Li et al. Bioinformatics 2004, 20, 2572-2578; Muh et al. PLoS ONE 2009, 4 (6), art. no. e5861; Björklund et al. Bioinformatics 2005, 21, 39–50

  6. Aim of the study: To create an alignment-free method for in silico identification of allergens, based on the main chemical properties of amino acid sequences, and to implement it on a web server. Obstacles: (1) the choice of appropriate descriptors to represent the physicochemical properties of amino acid sequences; (2) allergens are proteins of different lengths.

  7. The z-scales: each amino acid is described by three descriptors, z1 (hydrophobicity), z2 (molecular size) and z3 (polarity). Example for the fragment …Phe – Arg – Trp…:
     Residue   z1      z2      z3
     Phe      -4.22    1.94    1.08
     Arg       3.62    2.60   -3.60
     Trp      -4.36    3.94    0.69
     Hellberg et al. J. Med. Chem. 1987, 30, 1126-1135
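As an illustration of this encoding, the sketch below maps a sequence to one (z1, z2, z3) triple per residue. The lookup table holds only the three residues shown on the slide, with the values copied as they appear in this transcript; they should be checked against Hellberg et al. before any real use. The names `Z_SCALES` and `encode` are hypothetical.

```python
# z-scale values as transcribed on the slide, keyed by one-letter residue code.
# Tuple order: (z1 hydrophobicity, z2 molecular size, z3 polarity).
Z_SCALES = {
    "F": (-4.22, 1.94, 1.08),   # Phe
    "R": (3.62, 2.60, -3.60),   # Arg
    "W": (-4.36, 3.94, 0.69),   # Trp
}


def encode(sequence: str) -> list:
    """Replace every residue by its z-scale triple (KeyError for residues not in the table)."""
    return [Z_SCALES[residue] for residue in sequence]
```

A sequence of n residues therefore becomes an n × 3 table of numbers, whose length still depends on the protein. That is the second obstacle named on the previous slide.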

  8. ACC transformation: auto-covariance (the same z-scale at both positions) and cross-covariance (two different z-scales). j, k are the z-scales (j, k = 1, 2, 3); i is the amino acid position; n is the number of amino acids in the sequence. Example for the protein Phe – Arg – Trp – Phe – Arg – Trp (n = 6) at lag 1: ACC11(1) sums the products of z1 at each position with z1 at the next position and divides by 5; ACC13(1) sums the products of z1 at each position with z3 at the next position and divides by 5. Wold et al. Anal. Chim. Acta 1993, 277, 239-253
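The equations themselves did not survive in the transcript. The usual form of the auto- and cross-covariance, written with the symbols j, k, i and n defined on the slide, is given below; the lag symbol l is an assumption introduced here.

```latex
\[
AC_{jj}(l) = \frac{1}{n-l}\sum_{i=1}^{n-l} z_{j,i}\, z_{j,i+l}
\qquad
CC_{jk}(l) = \frac{1}{n-l}\sum_{i=1}^{n-l} z_{j,i}\, z_{k,i+l}, \quad j \neq k
\]
```

For the six-residue example with l = 1 the denominator is n − l = 5, which agrees with the "/5" shown on the slide for ACC11(1) and ACC13(1).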

  9. Preliminary study
     Data: 595 food allergens from the CSL allergen database (http://allergen.csl.gov.uk); 595 non-allergens from the NCBI database (http://www.ncbi.nlm.nih.gov/)
     Training set: 475 food allergens, 475 non-allergens
     Test set: 120 food allergens, 120 non-allergens
     ACC transformation of z descriptors: matrix with 45 variables (3² × 5) and 950 observations
     Statistical and machine learning methods: PLS discriminant analysis, logistic regression, naïve Bayes algorithm, decision tree algorithm, k nearest neighbours
     External validation on the test set: sensitivity, specificity, accuracy
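The 45 variables follow from 3 × 3 combinations of z-scales at 5 lags. The sketch below shows one way such a fixed-length vector could be computed from an n × 3 table of z-scale values. It assumes lags 1 to 5 and a lag-major ordering of the output; neither detail is stated on the slides, and `acc_transform` is a hypothetical name.

```python
import numpy as np


def acc_transform(z, max_lag=5):
    """Turn an (n, 3) array of z-scales into a vector of 3 * 3 * max_lag ACC terms.

    Output order (an assumption): all nine (j, k) pairs for lag 1, then lag 2, and so on.
    """
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    blocks = []
    for lag in range(1, max_lag + 1):
        head, tail = z[:n - lag], z[lag:]
        # Entry [j, k] = sum over i of z[i, j] * z[i + lag, k], divided by (n - lag).
        # The diagonal holds the auto-covariances, the rest the cross-covariances.
        blocks.append((head.T @ tail / (n - lag)).ravel())
    return np.concatenate(blocks)
```

Whatever the protein length, the result has 45 entries, so 950 proteins give the 950 × 45 matrix mentioned on the slide.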

  10. Results from the preliminary study (TP – true positive, FP – false positive, TN – true negative, FN – false negative)
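The results table is not part of the transcript, but the measures it reports follow directly from the four counts. A minimal sketch, using the definitions of sensitivity and precision given on slide 4 and the standard definitions of specificity and accuracy (`classification_metrics` is a hypothetical name; every denominator is assumed to be non-zero):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the validation measures from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),            # share of allergens recognised
        "specificity": tn / (tn + fp),            # share of non-allergens recognised
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),              # share of positive calls that are correct
    }
```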

  11. Web servers on the test set
      Servers: Algpred (SVM with single amino acid composition; SVM with dipeptide composition), Evaller, APPEL, Allerhunter
      Test set: 120 food allergens, 120 non-allergens
      Measures: sensitivity, specificity, accuracy
      Saha and Raghava Nucleic Acids Research 2006, 34, 202-209; Barrio et al. Nucleic Acids Research 2007, 35, 694-700; http://jing.cz3.nus.edu.sg/cgi-bin/APPEL; Muh et al. PLoS ONE 2009, 4 (6), art. no. e5861

  12. Conclusions from the preliminary study
      1. The model developed by the k Nearest Neighbours method shows the best performance on the test set compared with the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study.
      2. The server Allerhunter is the best performing among the known servers for allergen prediction. kNN needs further improvement.
      3. A large imbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs improvement too.

  13. The kNN algorithm
      Training set: 475 allergens, 475 non-allergens; ACC transformation of z descriptors; matrix of 45 variables (3² × 5) and 950 observations
      Unknown protein: ACC transformation of z descriptors; vector with 45 variables (3² × 5)
      1. Calculate the Euclidean distance between the vector and each observation
      2. Sort the distances in ascending order
      3. Determine the k nearest neighbours
      4. Determine the class of the unknown protein according to the majority of its nearest neighbours
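The procedure on this slide can be sketched in a few lines. This is a minimal illustration rather than the code behind the study: `knn_predict` is a hypothetical name, the default k = 3 is the value shown on slide 16, and ties in distance are broken by training-set order.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Classify one ACC vector by majority vote among its k nearest training vectors."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance between the query vector and every observation
    distances = np.linalg.norm(train_X - query, axis=1)
    # Sort distances in ascending order and keep the k closest observations
    nearest = np.argsort(distances, kind="stable")[:k]
    # The most frequent class among the neighbours is the predicted class
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

With two classes, an odd k guarantees that the vote cannot end in a tie.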

  14. Next: Extend the data sets
      Allergens: CSL allergen database, FARRP allergen database, SDAP database, ADFS database; 684 food, 1157 inhalant, 553 toxins, venom or salivary allergens
      Non-allergens: allergen species; NCBI database; local database of proteins from allergen species; BLAST search against all allergens; 684 non-allergens of food origin, 1157 non-allergens of inhalant origin, 553 non-allergens from species with toxins, venom or salivary allergens
      http://allergen.csl.gov.uk http://www.allergenonline.org/ http://fermi.utmb.edu/SDAP/ http://allergen.nihs.go.jp/ADFS/index.jsp http://www.ncbi.nlm.nih.gov/

  15. Next: kNN optimization
      Data: 684 food allergens, 684 non-allergens
      Training set: 528 allergens, 528 non-allergens (machine learning: k nearest neighbours)
      Test set: 156 allergens, 156 non-allergens (external validation: sensitivity, specificity, accuracy)
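The slide does not say how k was tuned. One common approach, shown here purely as an assumption, is to score several odd values of k on held-out data and compare the accuracies. The function names are hypothetical and class labels are assumed to be coded as 0 (non-allergen) and 1 (allergen).

```python
import numpy as np


def knn_accuracy(train_X, train_y, test_X, test_y, k):
    """Accuracy of a k nearest neighbours vote on a held-out set (labels coded 0/1)."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    test_y = np.asarray(test_y)
    # Pairwise Euclidean distances, shape (n_test, n_train)
    distances = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    # With 0/1 labels the majority class is 1 when more than half the neighbours are 1
    predicted = (train_y[nearest].mean(axis=1) > 0.5).astype(int)
    return float((predicted == test_y).mean())


def sweep_k(train_X, train_y, test_X, test_y, candidates=(1, 3, 5, 7, 9)):
    """Return {k: accuracy} for every candidate k."""
    return {k: knn_accuracy(train_X, train_y, test_X, test_y, k) for k in candidates}
```

If k is chosen this way, the data used for the choice should be kept separate from the final test set, otherwise the reported accuracy is optimistic.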

  16. kNN models
      Food: 684 food allergens, 684 non-allergens; training set 528 allergens, 528 non-allergens; test set 156 allergens, 156 non-allergens; kNN, k = 3
      Inhalant: 1157 inhalant allergens, 1157 non-allergens; training set 933 allergens, 933 non-allergens; test set 224 allergens, 224 non-allergens; kNN, k = 3
      External validation: sensitivity, specificity, accuracy

  17. kNN models

  18. AllerTOP web tool for allergenicity prediction. Training set: 1952 food, inhalant and other allergens and 1952 non-allergens; ACC transformation of z descriptors; kNN model; external validation. AllerTOP: http://www.pharmfac.net/allertop

  19. Server performance on the united test set. United test set of 441 food and inhalant allergens and 441 non-allergens. Two of the servers from the preliminary study, APPEL and Evaller, were not available at the time of this study. The results for the Allerhunter server were obtained with a smaller test set because it cannot process short sequences (< 21 amino acids).

  20. Conclusions
      1. An alignment-free method for in silico prediction of allergens based on the main physicochemical properties of proteins was developed.
      2. The method uses z descriptors to represent the amino acids in the protein sequences and the ACC transformation to convert proteins into uniform vectors.
      3. The k Nearest Neighbours classification method showed the best performance among the algorithms tested in the study; the others were PLS discriminant analysis, logistic regression, naïve Bayes and the decision tree algorithm.
      4. The kNN algorithm was optimized and its performance was compared to the freely available web servers for prediction of allergens.
      5. The kNN algorithm was implemented on a web server, freely available at: http://www.pharmfac.net/allertop

  21. Drug Design Group, School of Pharmacy, Medical University of Sofia: Irini Doytchinova, Ivan Dimitrov, Mariyana Atanasova, Panaiot Garnev. Acknowledgements: Darren R. Flower, Aston University, Birmingham, UK. Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009
