
Lecture 2 - Tools



Presentation Transcript


  1. Lecture 2 - Tools. Objective: to familiarize you with the available WWW resources, so that you can work your way through the analysis of your data and interpret the results.

  2. Computational Biology vs. Biologist using Computers: Two Different Things. A biologist or medical researcher typically supplies data to or retrieves data from a database and analyzes that data using available tools created by others. A computational biologist develops the original tools, applies tools in new ways to make discoveries in the data, develops and maintains databases, and attempts to form a bigger picture from large amounts of complex data. An ‘applied computational biologist’ uses computational tools and laboratory skills together.

  3. Most Important: Use every database and tool available, public and private. Check them regularly, for there is new data every day. Remember: a computer search can save you years!

  4. Tools • BLAST - preparing the input, interpreting the output. • Multiple alignment - assembly. • Protein structure visualization. • Coding region determination. • Feature extraction - CpG islands, polymorphism, visualization….
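As one illustration of the feature-extraction item, here is a minimal sketch of a CpG island check. It is not taken from the lecture: the function names are hypothetical, and the thresholds (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6) are the commonly cited rule-of-thumb values, assumed here for illustration.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: (C * G) / N
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed rule-of-thumb thresholds to a single window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp
```

A real scan would slide a window along the genomic sequence and merge adjacent windows that pass; this sketch only scores one window.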

  5. Entrez is a search and retrieval system that integrates information from databases at NCBI. These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, and MEDLINE, through PubMed.

  6. What is BLAST? BLAST (Basic Local Alignment Search Tool) is an algorithm and a computer program that compares a query DNA or protein sequence to a database of other DNA or protein sequences. The results of that comparison are ranked by score, and each high-scoring ‘hit’ is shown with the residues of the query and the hit aligned to show the regions of similarity. Search engines like BLAST can find distant relationships between a query and a database entry, i.e. similarities that are far from identity. These programs use an adjustable scoring matrix to assign a value for a match and a penalty for a mismatch. For proteins, the matrix (e.g. PAM or BLOSUM) reflects how often one residue is substituted for another in related sequences, so it encodes evolutionary information; it is not specific to a single species.
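To make the idea of a match value and a mismatch penalty concrete, here is a minimal sketch of local alignment scoring by dynamic programming (the Smith-Waterman recurrence). BLAST itself is a faster heuristic and does not fill this whole table; the function name and the scoring values (+1 match, -3 mismatch, -2 per gap position) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-3, gap=-2):
    """Best local alignment score between strings a and b (linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Replacing the match/mismatch pair with a lookup in a substitution matrix gives the protein version of the same calculation.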

  7. Searching - BLAST. Select your program and database. Be careful! • blastp compares an amino acid query sequence against a protein sequence database; • blastn compares a nucleotide query sequence against a nucleotide sequence database; • blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database; • tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands); • tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html
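The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be sketched as follows. This is an illustrative sketch, not BLAST's own code: it assumes the standard genetic code, the function names are hypothetical, and codons containing ambiguous bases are written as X.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG-ordered table
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X, stops become *."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

For example, the six frames of ATGGCC are MA, W, G on the plus strand and GH, A, P on the minus strand.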

  8. Pre-filtering before a BLAST search. DNA sequences, especially those of mammals and plants, contain a large number of repeated sequences, like CACACACACACACA….. The purpose of these sequences is unknown at this time. Since database entries and queries often contain many repeat sequences, spurious similarities that are due only to these repeats often occur, and they detract from identifying important real similarities. To eliminate spurious hits, the repeat sequences in a query are usually filtered and masked so that they make no contribution to the overall similarity score. Careful: the repeat sequence databases used for masking are incomplete, and vary from species to species!
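A minimal sketch of what masking means in practice: runs of a short repeated unit are replaced by N so they cannot contribute to a match. Production filters compare the query against repeat libraries or use complexity measures; the regular expression here (a unit of 1 to 3 bases repeated at least five times) and the function name are assumptions for illustration only.

```python
import re

# A unit of 1-3 bases followed by at least four more copies of itself
SIMPLE_REPEAT = re.compile(r"([ACGT]{1,3})\1{4,}")


def mask_simple_repeats(seq):
    """Replace each simple-sequence repeat run with N of the same length."""
    return SIMPLE_REPEAT.sub(lambda m: "N" * len(m.group(0)), seq.upper())
```

Masking with N keeps the coordinates of the rest of the query unchanged, which is why the run is overwritten rather than deleted.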

  9. BLAST inputs. • Query, usually in FASTA format (a header line, >Skipgene, followed by the sequence, CAGTATAGTATATCAT) • Search parameters • Number of ‘hits’ to save. • Search as DNA (4 nucleotides) or translated protein (amino acid) sequence • Similarity (PAM) matrix, a matrix of scores and penalties used to compute the similarity score given the types of discrepancies between the query and the database entries.
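For readers preparing their own queries, here is a minimal sketch of reading FASTA-formatted text. The function name is hypothetical, and it assumes the record name is the first whitespace-delimited word after the ">".

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its concatenated sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect them all
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}
```

With the example query from this slide, `parse_fasta(">Skipgene\nCAGTATAGTATATCAT\n")` yields a single record named Skipgene.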

  10. BLAST output components. • Execution statistics - database, its size, size of the query. • ‘Hits’ - entries in the databases that have the highest similarity to the query. • Alignments - a base-by-base or residue-by-residue comparison that can be inspected by eye to confirm regions of similarity.

  11. How the researcher uses similarity results…. • Each user typically inspects the score or the P(N) value against a personal significance threshold (a higher score or a lower P(N) is more significant); others use alignments, voodoo, etc. • The user then inspects the short description for keywords or information of biological interest. The biological background and specific research objective greatly affect what is of interest. • For hits of interest, the user typically inspects the alignment to confirm a real similarity. • For each hit of interest, the user might retrieve the full database entry and inspect the complete annotation.

  12. Let's start somewhere: how about a short pair of sequences you saw as a marker in some paper. Where can we go from there? GCGAGCGTGTGGAAT GACGACCACAACTA How about complementing one of them to put both on the same strand, concatenating them with an “n” so that you know where you joined them, and submitting to BLASTn, WITH GAPs: GCGAGCGTGTGGAATnCTGCTGGTGTTGAT
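The query construction on this slide can be reproduced in a few lines. Note that the joined query shown corresponds to the plain complement of the second marker, without reversing it; the function names below are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Base-by-base complement, keeping the written order of the sequence."""
    return seq.translate(COMPLEMENT)


def join_markers(first, second, spacer="n"):
    """Complement the second marker and join it to the first with a spacer."""
    return first + spacer + complement(second)


query = join_markers("GCGAGCGTGTGGAAT", "GACGACCACAACTA")
print(query)  # GCGAGCGTGTGGAATnCTGCTGGTGTTGAT
```

The lower-case n marks the join so that any alignment spanning it is easy to spot in the output.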

  13. You get this back:
                                                                      Score     E
      Sequences producing significant alignments:                     (bits)  Value
      gb|AF110314|AF110314 Homo sapiens herpesvirus immunoglobuli...    36     0.27
      gb|AF060231|AF060231 Homo sapiens herpesvirus entry protein...    36     0.27
      ref|NM_002855.1|HVEC| Homo sapiens herpesvirus entry mediat...    36     0.27
      emb|Z34275|UUTUFG U.urealyticum tuf gene for elongation fac...    32     4.2
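Applying a personal significance threshold to a hit list like this one is a simple filter. The sketch below uses the four hits and values shown on the slide; the cutoff of 1.0 and the function name are arbitrary assumptions, not a recommendation.

```python
# (identifier, score in bits, E value) as shown in the hit list above
HITS = [
    ("gb|AF110314|AF110314", 36, 0.27),
    ("gb|AF060231|AF060231", 36, 0.27),
    ("ref|NM_002855.1|HVEC|", 36, 0.27),
    ("emb|Z34275|UUTUFG", 32, 4.2),
]


def significant_hits(hits, max_evalue):
    """Keep hits at or below the E-value cutoff, most significant first."""
    kept = [h for h in hits if h[2] <= max_evalue]
    return sorted(kept, key=lambda h: h[2])


for identifier, bits, evalue in significant_hits(HITS, max_evalue=1.0):
    print(identifier, bits, evalue)
```

With this cutoff the three human herpesvirus entry hits are kept and the U. urealyticum hit is dropped.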

  14. The alignment on the second looks good, so click on it and let's see what is up.
      gb|AF060231|AF060231 Homo sapiens herpesvirus entry protein C (HVEC) mRNA, complete cds
      Length = 1710

      Score = 36.2 bits (18), Expect = 0.27
      Identities = 20/21 (95%)
      Strand = Plus / Plus
      Query: 1   gcgagcgtgtggaatncctgc 21
                 ||||||||||||||| |||||
      Sbjct: 437 gcgagcgtgtggaattcctgc 457

      Score = 32.2 bits (16), Expect = 4.2
      Identities = 16/16 (100%)
      Strand = Plus / Plus
      Query: 17   cctgctggtgttgatt 32
                  ||||||||||||||||
      Sbjct: 1248 cctgctggtgttgatt 1263
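The "Identities" line can be checked by hand or with a short sketch like the one below, which counts matching columns in an ungapped alignment. The function name is hypothetical; the strings are the first aligned segment shown above.

```python
def identity(query, subject):
    """Return (matching columns, alignment length, rounded percent identity)."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    matches = sum(q == s for q, s in zip(query.lower(), subject.lower()))
    return matches, len(query), round(100 * matches / len(query))


print(identity("gcgagcgtgtggaatncctgc", "gcgagcgtgtggaattcctgc"))  # (20, 21, 95)
```

The single mismatch is the n spacer in the query, which faces a t in the database entry.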

  15. The HVEC DNA sequence can be retrieved. GCGAGCGTGTGGAATTCCTGCGGCCCTCCTTCACCGATGGCACTATCCGCCTCTCCCGCCTGGAGCTGGA GGATGAGGGTGTCTACATCTGCGAGTTTGCTACCTTCCCTACGGGCAATCGAGAAAGCCAGCTCAATCTC ACGGTGATGGCCAAACCCACCAATTGGATAGAGGGTACCCAGGCAGTGCTTCGAGCCAAGAAGGGGCAGG ATGACAAGGTCCTGGTGGCCACCTGCACCTCAGCCAATGGGAAGCCTCCCAGTGTGGTATCCTGGGAAAC TCGGTTAAAAGGTGAGGCCAGAGTACCAGGAGACTCCGGAACCCCAATGGCACCAGTGACGGTCATCAGC CGCTACCGCCTGGTGCCCAGCAGGGAAGCCCACCAGCAGTCCTTGGCCTGCATCGTCAACTACCACATGG ACCGCTTCAAGGAAAGCCTCACTCTCAACGTGCAGTATGAGCCTGAGGTAACCATTGAGGGGTTTGATGG CAACTGGTACCTGCAGCGGATGGACGTGAAGCTCACCTGCAAAGCTGATGCTAACCCCCCAGCCACTGAG TACCACTGGACCACGCTAAATGGCTCTCTCCCCAAGGGTGTGGAGGCCCAGAACAGAACCCTCTTCTTCA AGGGACCCATCAACTACAGCCTGGCAGGGACCTACATCTGTGAGGCCACCAACCCCATCGGTACACGCTC AGGCCAGGTGGAGGTCAATATCACAGAATTCCCCTACACCCCGTCTCCTCCCGAACATGGGCGGCGCGCC GGGCCGGTGCCCACGGCCATCATTGGGGGCGTGGCGGGGAGCATCCTGCTGGTGTTGATTGTGGTCGGCG There are a lot of directions one can go from here.

  16. M_002855 . Homo sapiens herpe...[gi:4506336]
      LOCUS       HVEC 1557 bp mRNA PRI 10-NOV-1999
      DEFINITION  Homo sapiens herpesvirus entry mediator C (poliovirus receptor-related 1; nectin) (HVEC), mRNA.
      ACCESSION   NM_002855
      NID         g4506336
      VERSION     NM_002855.1 GI:4506336
      KEYWORDS    .
      SOURCE      human.
        ORGANISM  Homo sapiens
                  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
      REFERENCE   1 (bases 1 to 1557)
        AUTHORS   Lopez,M., Eberle,F., Mattei,M.G., Gabert,J., Birg,F., Bardin,F., Maroc,C. and Dubreuil,P.
        TITLE     Complementary DNA characterization and chromosomal localization of a human gene related to the poliovirus receptor-encoding gene
        JOURNAL   Gene 155 (2), 261-265 (1995)
        MEDLINE   95237621
      REFERENCE   2 (bases 1 to 1557)
        AUTHORS   Geraghty RJ, Krummenacher C, Cohen GH, Eisenberg RJ and Spear PG.
        TITLE     Entry of alphaherpesviruses mediated by poliovirus receptor-related protein 1 and poliovirus receptor
        JOURNAL   Science 280 (5369), 1618-1620 (1998)
        MEDLINE   98279152
      REFERENCE   3 (bases 1 to 1557)
        AUTHORS   Cocchi F, Menotti L, Mirandola P, Lopez M and Campadelli-Fiume G.
        TITLE     The ectodomain of a novel member of the immunoglobulin subfamily related to the poliovirus receptor has the attributes of a bona fide receptor for herpes simplex virus types 1 and 2 in human cells
        JOURNAL   J. Virol. 72 (12), 9992-10002 (1998)
        MEDLINE   99030909
      COMMENT     REFSEQ: This reference sequence was derived from X76400.1. PROVISIONAL RefSeq: This is a provisional reference sequence record that has not yet been subject to human review. The final curated reference sequence record may be somewhat different from this one.
      Inspect the annotation in the GenBank entry.

  17. FEATURES Location/Qualifiers source 1..1557 /organism="Homo sapiens" /db_xref="taxon:9606" /map="11q23- q24" /clone_lib="cDNA in pSPORT" gene 1..1557 /gene="HVEC" /note="PVRL1; HIGR; PRR1; PVRR1; SK-12" /db_xref="LocusID:5818" /db_xref="MIM:600644" CDS 1..1557 /gene="HVEC" /codon_start=1 /db_xref="LocusID:5818" /db_xref="MIM:600644" /product="herpesvirus entry mediator C (poliovirus receptor-related 1; nectin)" /protein_id="NP_002846.1" /db_xref="PID:g4506337" /db_xref="GI:4506337" /db_xref="SPTREMBL:Q15223" /translation="MARMGLAGAAGRWWGLALGLTAFFLPGVHSQVVQVNDSMYGFIG TDVVLHCSFANPLPSVKITQVTWQKSTNGSKQNVAIYNPSMGVSVLAPYRERVEFLRP SFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNLTVMAKPTNWIEGTQAVLRAK KGQDDKVLVATCTSANGKPPSVVSWETRLKGEARVPGDSGTPMAPVTVISRYRLVPSR EAHQQSLACIVNYHMDRFKESLTLNVQYEPEVTIEGFDGNWYLQRMDVKLTCKADANP PATEYHWTTLNGSLPKGVEAQNRTLFFKGPINYSLAGTYICEATNPIGTRSGQVEVNI TEFPYTPSPPEHGRRAGPVPTAIIGGVAGSILLVLIVVGGIVVALRRRRHTFKGDYST KKHVYGNGYSKAGIPQHHPPMAQNLQYPDDSDDEKKAGPLGGSSYEEEEEEEEGGGGG ERKVGGPHPKYDEDAKRPYFTVDEAEARQDGYGDRTLGYQYDPEQLDLAENMVSQNDG

  18. Let's see if there is any information we can dig up on the protein.

  19. >gi|4506337|ref|NP_002846.1|pHVEC| herpesvirus entry mediator C (poliovirus receptor-related 1; nectin) MARMGLAGAAGRWWGLALGLTAFFLPGVHSQVVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVTWQKS TNGSKQNVAIYNPSMGVSVLAPYRERVEFLRPSFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNL TVMAKPTNWIEGTQAVLRAKKGQDDKVLVATCTSANGKPPSVVSWETRLKGEARVPGDSGTPMAPVTVIS RYRLVPSREAHQQSLACIVNYHMDRFKESLTLNVQYEPEVTIEGFDGNWYLQRMDVKLTCKADANPPATE YHWTTLNGSLPKGVEAQNRTLFFKGPINYSLAGTYICEATNPIGTRSGQVEVNITEFPYTPSPPEHGRRA GPVPTAIIGGVAGSILLVLIVVGGIVVALRRRRHTFKGDYSTKKHVYGNGYSKAGIPQHHPPMAQNLQYP DDSDDEKKAGPLGGSSYEEEEEEEEGGGGGERKVGGPHPKYDEDAKRPYFTVDEAEARQDGYGDRTLGYQ YDPEQLDLAENMVSQNDGSFISKKEWYV Let work from the FASTA format of protein sequence. ref|NP_002846.1|PHVEC| herpesvirus entry mediator C (poliovirus receptor-related 1; nectin) >gi|1082702|pir||JC4024 poliovirus receptor-related protein - human >gi|732796|emb|CAA53980| (X76400) PRR1 [Homo sapiens] Length = 518 Score = 57.0 bits (135), Expect = 6e-08 Identities = 33/119 (27%), Positives = 61/119 (50%), Gaps = 5/119 (4%) Query: 2 VVYTDREVYGAVGSQVTLHCSFWSSEWVSDDISFTWRYQPEGGRDAISIFHYAKGQPYID 61 VV + +YG +G+ V LHCSF + TW+ G + ++I++ + G + Sbjct: 32 VVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVTWQKSTNGSKQNVAIYNPSMG---VS 88 Query: 62 EVGTFKERIQWVGDPSWKDGSIVIHNLDYSDNGTFTCDVKNPPDIVGKTSQVTLYVFEK 120 + ++ER++++ PS+ DG+I + L+ D G + C+ P + SQ+ L V K Sbjct: 89 VLAPYRERVEFL-RPSFTDGTIRLSRLELEDEGVYICEFATFP-TGNRESQLNLTVMAK 145 Myelin Membrane Adhesion Molecule is one thing we get back.

  20. And the sequence of the other hit, for HIgR >gi|4154346|gb|AAD04944.1| herpesvirus immunoglobulin-like receptor HIgR MARMGLAGAAGRWWGLALGLTAFFLPGVHSQVVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVTWQKS TNGSKQNVAIYNPSMGVSVLAPYRERVEFLRPSFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNL TVMAKPTNWIEGTQAVLRAKKGQDDKVLVATCTSANGKPPSVVSWETRLKGEAEYQEIRNPNGTVTVISR YRLVPSREAHQQSLACIVNYHMDRFKESLTLNVQYEPEVTIEGFDGNWYLQRMDVKLTCKADANPPATEY HWTTLNGSLPKGVEAQNRTLFFKGPINYSLAGTYICEATNPIGTRSGQVEVNITEKPRPQRGLGSAARLL AGTVAVFLILVAVLTVFFLYNRQQKSPPETDGAGTDQPLSQKPEPSPSRQSSLVPEDIQVVHLDPGRQQQ QEEEDLQKLSLQPPYYDLGVSPSYHPSVRTTEPRGECP Can you identify motifs, or highly conserved regions, in these sequences? Try http://www.sdsc.edu/MEME/meme/website/ What about conserved regions for Myelin and HVEC, for which sequence homology was found?

  21. Let's put HVEC and Myelin into MEME. The following motifs are found.
      DATABASE meme.30154.data (peptide)
      Last updated on Tue Nov 23 05:59:31 1999
      Database contains 1 sequences, 642 residues
      MOTIFS meme.30154.results (peptide)
      MOTIF  WIDTH  BEST POSSIBLE MATCH
      -----  -----  -------------------
        1      8    VYTCEFAN
        2     12    ERHEQSLTCNVD
        3     12    RSSQVNLNVFEK
        4      8    PSWNDGSI
        5     12    VSWQKRLKGEKR
      Myelin HVEC

  22. Myelin membrane adhesion molecule, which has a solved structure, has shared motifs, Motif 1,3,4, with HVEC. Myelin HVEC homology [4] [1] [3] 7.7e-11 1.0e-05 1.1e-11 PSWNDGSI VYTCEFAN RSSQVNLNVFEK ++++++++ +++++ + ++++++++++++ 76 PSWKDGSIVIHNLDYSDNGTFTCDVKNPPDIVGKTSQVTLYVFEKVPTRMARMGLAGAAGRWWGLALGLTAFFLP [1] [5] 1.3e-07 3.4e-12 VYTCEFAN VSWQKRLKGEKR + ++ +++ ++++++++++++ 151 GVHSQVVQVNDSMYGFIGTDVVLHCSFANPLPSVKITQVTWQKSTNGSKQNVAIYNPSMGVSVLAPYRERVEFLR [4] [1] [3] [1] 1.2e-08 5.0e-08 8.5e-09 5.2e-06 PSWNDGSI VYTCEFAN RSSQVNLNVFEK VYTCEFAN ++++++++ +++++++ ++++++++++ + + ++ ++ 226 PSFTDGTIRLSRLELEDEGVYICEFATFPTGNRESQLNLTVMAKPTNWIEGTQAVLRAKKGQDDKVLVATCTSAN [5] [2] [2] 4.3e-13 4.3e-09 2.6e-08 VSWQKRLKGEKR ERHEQSLTCNVD ERHEQSLTCNVD ++++++++++++ + ++++++++++ ++++++++ +++ 301 GKPPSVVSWETRLKGEARVPGDSGTPMAPVTVISRYRLVPSREAHQQSLACIVNYHMDRFKESLTLNVQYEPEVT

  23. There is no solved structure for our sequence, so we take the protein sequence BLAST results, see whether any of the hits have solved structures, and then look at those.
                                                                      Score     E
      Sequences producing significant alignments:                     (bits)  Value
      pdb|1NEU| Structure Of Myelin Membrane Adhesion Molecule P0       57     4e-09
      pdb|1BIH|A Chain A, Crystal Structure Of The Insect Immune ...    34     0.025
      pdb|2H1P|L Chain L, The Three-Dimensional Structures Of A P...    34     0.033
      pdb|1A3L|L Chain L, Catalysis Of A Disfavored Reaction: An ...    33     0.056
      pdb|1A4J|L Chain L, Diels Alder Catalytic Antibody Germline...    33     0.056
      Lucky, Myelin Membrane Adhesion Molecule is solved! The region of high homology is on the outside of the protein, perhaps a hint as to a domain involved in some kind of interaction, maybe not.

  24. Can we find a genomic clone for this sequence? Why would we want to?
      gb|AC015907.1|AC015907 Homo sapiens clone RP11-48A13, LOW-PASS SEQUENCE SAMPLING
      Length = 55317
      Score = 48.1 bits (24), Expect = 0.005
      Identities = 24/24 (100%)
      Strand = Plus / Plus
      Query: 878   gtgtggaggcccagaacagaaccc 901
                   ||||||||||||||||||||||||
      Sbjct: 44103 gtgtggaggcccagaacagaaccc 44126
      Maybe, only maybe, because this is probably a repeat sequence that passed the filters (WHY?). It might be worth trying to see what the rest of the sequence looks like from this Roswell Park clone, but the link is not good. Dead end?

  25. Maybe we can find it by electronic PCR.
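Electronic PCR looks in a sequence for a pair of primer sites in the right orientation and at a plausible spacing. The following is a minimal sketch of that idea with exact matching only; the function names, the size limits, and the exact-match simplification are assumptions, and real tools also tolerate mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def epcr(template, forward, reverse, min_size=50, max_size=1000):
    """Return (start, product_size) for each primer pair hit on the plus strand."""
    template = template.upper()
    fwd = forward.upper()
    # The reverse primer anneals to the plus strand as its reverse complement
    rev_site = reverse_complement(reverse.upper())
    products = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_site, start + len(fwd))
        while end != -1:
            size = end + len(rev_site) - start
            if size > max_size:
                break
            if size >= min_size:
                products.append((start, size))
            end = template.find(rev_site, end + 1)
        start = template.find(fwd, start + 1)
    return products
```

A full search would repeat the scan on the reverse complement of the template to cover the other strand.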

  26. What is missing? • Unification and integration of the analysis. • In-depth analysis. • The big picture? • Tools that work on many pieces of data at once. Data mining. • Expression database - mRNA, proteins. • Other? Now, a quick look at a couple of stabs at this list!

  27. Local Software Projects: • BANAL - NLP/Bayesian Network analysis of Expression Arrays • ARROGANT - Optimized Expression Array Design and Analysis • X-Hyb - Looking for cross-hybridization in Expression Arrays • MAD & PAD - Expression Array database and layout • Protein Molecular Dynamics - Sequence polymorphism effects on solved protein structures • SNIDE - SNP prediction • Rep-X (aka UniPOMPOUS) - simple sequence repeat polymorphism prediction

  28. PANORAMA - a new server for Integrated Genomic Sequence Analysis • Genomic sequence features visualization • Preparation for Expert System Based Analysis • GenBank (EST and non-EST) homologies • Gene prediction (GenScan) • POMPOUS • New - control / recognition sequences, • Transcription factors, CpG islands, enhancers, termination sequences… more on the way!

  29. PANORAMA: Integrated Analysis on the WWW • BLAST • CpG islands • GenScan • Repeats • POMPOUS • …….. Java soon.

  30. Polymorphism prediction software • SNIDE (SNp IDEntification) – Predict high-impact, high-probability SNPs. • POMPOUS – Prediction of polymorphic markers for allelotyping (PNAS, June 98, Vol. 95 p7514-19) • Rep-X (UniPOMPOUS)– Improvement of POMPOUS code and application to expressed gene sequences via Unigene

  31. Rep-X (Repeat eXpansions within mRNAs). Background on nucleic acid repetitive elements • Repeating sequence units (microsatellites) have long been known to undergo expansion and contraction in the number of repeat units • Slipped-strand mispairing and unequal recombination are thought to be responsible • Well-known polymorphic sequence units: CA (intervening sequence) and CAG (linked to several neurological disorders). • Polymorphic repeat units mentioned in the literature range from 1 to over 250 bp. • Impact of polymorphisms found in all regions: 5’ UTR – Hyperandrogenaemia; CDS – Haw River Syndrome; Intron – Friedreich’s Ataxia; 3’ UTR – Myotonic Dystrophy
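Locating candidate repeats of this kind in an mRNA can be sketched with a regular expression that reports the repeat unit and copy number. This is not the Rep-X or POMPOUS code; the function name and the defaults (units of up to 6 bases, at least 5 copies) are assumptions for illustration.

```python
import re


def find_tandem_repeats(seq, min_copies=5, max_unit=6):
    """Return (start, unit, copies) for each perfect tandem repeat found."""
    # Lazy unit match, so the shortest repeating unit is reported
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    results = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        results.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return results
```

For example, `find_tandem_repeats("TTGAC" + "CAG" * 6 + "TTA")` reports one CAG repeat of 6 copies starting at position 5 (0-based). Predicting which repeats are actually polymorphic needs more than this, which is what the prediction software is for.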

  32. Reasons to study: • Candidates for genetic diseases • Candidates for phenotypic variations • Polymorphism profile indicative of functional role for protein region • Nature may use non-degenerate codon repeats for more rapid evolutionary response to selection pressure • Learn more about roles of peptide repeats

  33. Computational Process • Download the UniGene (Unique Gene) dataset of assembled EST sequences • Longest, cleanest sequence obtained for each UniGene cluster • Program run on the entire UniGene database (10/99 build = 85,639 entries) • Candidates for follow-up experimental study picked by repeat type, location and interest in the gene

  34. Example Follow-up on 30 patient DNAs: • Herpes Virus Entry Protein C – AGG(8) 5’-----------[start]--------------------------X---------[stop]----------------------->3’ • Variable resistance to HSV infection in population • HSV unable to penetrate cell in C-terminal deletion experiment (including Glu repeat) • Glu region bears homology to calcium-based transporters • HSV unable to enter without calcium present

  35. Experimental Verification Results • Out of 146 genes chosen for testing, 102 amplified, and 54 of those 102 were polymorphic (~53%) • Tested on 30 patient DNA samples.

  36. We can predict repeat polymorphisms, and there are a lot of them. We have found defective entries in UniGene that result in overprediction of the number of genes by ~20%

  37. Data Mining - What is it? Certainly a fashionable term.

  38. …and public servers are available for SQL queries to linked data… [Diagram: MEDLINE, GenBank, Protein Sequences, Genomes, and Structures databases connected by links, illustrated with example nucleotide and protein sequences.]

  39. Some are using simple mining methods on titles… • pieces of evidence extracted from titles of articles in the biomedical literature (Swanson 1988, Swanson and Smalheiser 1997) • stress is associated with migraines • calcium channel blockers prevent some migraines • spreading cortical depression (SCD) is implicated in some migraines • migraine patients have high platelet aggregability • stress can lead to loss of magnesium • magnesium is a natural calcium channel blocker • high levels of magnesium inhibit SCD • magnesium can suppress platelet aggregability • led to the discovery: • magnesium deficiency may play a role in migraine headaches • confirmed by subsequent study (Ramadan et al. 1989)
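The reasoning on this slide (A is linked to B in one set of titles, B is linked to C in another, so A and C may be related) can be sketched as a two-step walk over co-occurrence pairs. The pairs below restate the slide's bullet points; the function name and the flat pair representation are assumptions, and extracting such pairs from real titles is the hard part that is not shown.

```python
from collections import defaultdict

# Term pairs restating the pieces of evidence listed above
PAIRS = [
    ("stress", "migraine"),
    ("calcium channel blockers", "migraine"),
    ("spreading cortical depression", "migraine"),
    ("platelet aggregability", "migraine"),
    ("magnesium", "stress"),
    ("magnesium", "calcium channel blockers"),
    ("magnesium", "spreading cortical depression"),
    ("magnesium", "platelet aggregability"),
]


def indirect_links(pairs, start):
    """Map each term reachable only via an intermediate to those intermediates."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    direct = graph[start]
    found = defaultdict(set)
    for b in direct:
        for c in graph[b]:
            # Keep only terms with no direct link to the starting term
            if c != start and c not in direct:
                found[c].add(b)
    return {c: sorted(bs) for c, bs in found.items()}


print(indirect_links(PAIRS, "migraine"))
```

Starting from migraine, the only term reached indirectly is magnesium, supported by all four intermediate findings.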

  40. …and there is still enormous opportunity!! • Traditional approach • one thread is followed through several databases • Result • finding A is related to sequence B and structure C [Diagram: finding A in text databases, linked to sequence databases and structure databases.]

  41. …and there is still enormous opportunity!! • Directed approaches: • keyword/grammar based SQL • simple data mining on titles and abstracts • Result • finding A is related to other findings of the same data type [Diagram: finding A in text databases, alongside sequence databases and structure databases.]

  42. …and there is still enormous opportunity!! • Machine learning: • data mining on full texts and other biomedical data • Result • finding A is related to other findings of the same data type through connections found among other datatypes [Diagram: finding A in text databases, connected through sequence databases and structure databases.]

  43. …because tree overlap may be new associations…? [Diagram: overlapping trees with nodes labeled V, F, T, D, U, S, A, E, B, C.]

  44. …and EMILE (Perot Systems) for knowledge discovery. Entity Modeling Intelligent Learning Engine • language-independent grammar induction • cluster analysis on text to identify semantic and syntactic clusters • clustering of biomedical data based on concepts • will use for biological knowledge base construction • concept clustering may reveal previously undiscovered knowledge • molecular interaction networks (MINe) to serve as basis for static cell and dynamic cell modeling

  45. EMILE Results: dataset too small to learn the language • 91 PubMed abstracts • keywords: cancer, polymorphism

  46. EMILE realized that these are related:
      [94] --> Chinese
      [94] --> Japanese
      [94] --> Polish
      And found a biological connection among diverse verb use:
      [11] --> LOH was [105] %
      [105] --> identified in 13 cases (72
      [105] --> detected in 9 of 87 informative cases (10
      [105] --> observed in 5 (55
      EMILE results: interesting associations are discovered when clusters are inspected manually.

  47. Some vision of what is to come. • Data mining will become important given the amount of data becoming available. • Patient/phenotypes become increasingly important for identifying genes and their function within existing genomic sequence. • Genome-Transcriptome-Proteome are unified. • Understanding of complex systems (humans) possible from network analysis and computers. • Novel genes will continue to be discovered as we sequence more organisms.

  48. Closing message: The intent of this set of lectures was to introduce you to the wide variety of data and tools that are available, mainly on the WWW, and to encourage you to use these tools. For an in-depth understanding of the organization of the data and the algorithms that are the basis of the tools, there may be a new course next year.
