
Concept Modeling in Bio-informatics

Concept Modeling in Bio-informatics. Sanida Omerovic*, Saso Tomazic*, Mateo Valero**, Milos Milovanovic**, David Torrents** (*University of Ljubljana, Slovenia; **UPC, Barcelona, Catalonia). IPSI Firence-2007.


Presentation Transcript


  1. Concept Modeling in Bio-informatics Sanida Omerovic*, Saso Tomazic*, Mateo Valero**, Milos Milovanovic**, David Torrents** *University of Ljubljana, Slovenia ** UPC, Barcelona, Catalonia IPSI Firence-2007

  2. WHAT IS A CONCEPT?

  3. Decision Making Algorithm

  4. Concept Modeling Layer • What is a concept? • How is it modeled? • How is it built? • How is it exploited? • How is it updated?

  5. Classification of concept modeling (CM) and decision making systems (DMS) • This classification is based on the following assumption: • Any decision making system, regardless of whether the process is performed entirely by humans, supported by machines, or fully automated, is a layered process with one layer (explicit or implicit) that can be called the Concept Modeling Layer.

  6. Purpose (DMS) • General • Specialized (Bio-informatics)

  7. Bio-informatics Genomic researchers mostly deal with similarity between genomic sequences. Genomic sequences are treated as long sequences of letters: • A (Adenine) • G (Guanine) • C (Cytosine) • T (Thymine) which represent the nitrogenous bases in DNA structure.

  8. DNA sequence A DNA sequence is presented as an array of letters that map to the nucleotides in DNA (each consisting of one of four types of nitrogenous bases A/G/C/T, a five-carbon sugar, and a molecule of phosphoric acid). [Figure: the bases (A), (T), (G), (C) shown as a DNA chemical compound and as a DNA sequence]
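As a minimal sketch of the "array of letters" view, the Python snippet below checks a sequence against the four-letter alphabet and counts each base. The function name `base_counts` is hypothetical and not taken from the presentation; the sample string is the first fragment shown on the DNA sequence analysis slide.

```python
from collections import Counter

DNA_ALPHABET = set("AGCT")  # the four nitrogenous bases named on the slide


def base_counts(sequence: str) -> Counter:
    """Count each base in a DNA sequence given as a string of letters."""
    seq = sequence.upper()
    unknown = set(seq) - DNA_ALPHABET
    if unknown:
        # Reject anything that is not A, G, C or T.
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return Counter(seq)


counts = base_counts("GATTCATCGA")
print(counts)
```

Treating the sequence as a plain string keeps the example close to the slide's description; real pipelines typically use dedicated sequence types instead.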

  9. DNA sequence analysis • GATTCATCGA CCATCAAAT GATT [Figure labels: useful data, noisy data, start sequence, start sequence, end sequence]

  10. Bio-informatics in DMS • Sequence concept (still impossible; there is no protein conceptual model) • Sequence analysis (software: BLAST, Smith-Waterman, FASTA, etc.) • Sequence retrieval (easy; available for free on the Web: ENSEMBL.ORG, NCBI, UCSC, etc.) • Sequencing (hard; laboratory work at the level of chemical reactions to determine whether C, T, G, or A is present at a given position in the DNA chain)
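To make the sequence analysis step more concrete, here is a minimal sketch of Smith-Waterman local alignment scoring, one of the methods named above. The scoring values (match 2, mismatch -1, linear gap -2) and the function name are illustrative assumptions; production tools use substitution matrices and affine gap penalties, and BLAST and FASTA rely on faster heuristics rather than this full dynamic program.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so it reports the score, not the alignment.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTCATCGA", "TTCAT"))
```

Because the second string occurs verbatim inside the first, the best local alignment is five matches with no gaps.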

  11. Sequence analysis • The example in the next two figures shows a fraction of the results obtained from a BLAST comparison of the human protein SLC7A7 against the SwissProt protein database. • We selected two illustrative examples that range from a perfect (word) match to a similar match.

  12. BLAST sample session, perfect match
      >gi|12643348|sp|Q9UHI5|LAT2_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643348&dopt=GenPept> Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643348%5BPUID%5D> Large neutral amino acids transporter small subunit 2 (L-type amino acid transporter 2) (hLAT2)
      Length=535
      Score = 665 bits (1717), Expect = 0.0, Method: Composition-based stats.
      Identities = 332/332 (100%), Positives = 332/332 (100%), Gaps = 0/332 (0%)
      Query 1 MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK 60
      MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK
      Sbjct 204 MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK 263
      Query 61 NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA 120
      NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA
      Sbjct 264 NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA 323
      Query 121 LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD 180
      LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD
      Sbjct 324 LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD 383
      Query 181 MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW 240
      MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW
      Sbjct 384 MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW 443
      Query 241 SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG 300
      SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG
      Sbjct 444 SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG 503
      Query 301 TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP 332
      TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP
      Sbjct 504 TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP 535

  13. BLAST sample session, similar match
      >gi|12643378|sp|Q9UM01|YLA1_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643378&dopt=GenPept> Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643378%5BPUID%5D> Y+L amino acid transporter 1 (y(+)L-type amino acid transporter 1) (y+LAT-1) (Y+LAT1) (Monocyte amino acid permease 2) (MOP-2)
      Length=511
      Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
      Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)
      Query 2 GIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYKN 61
      GIV++ +G E N+FE +G +ALA F+Y GW+ LNYVTEE+ +P +N
      Sbjct 202 GIVRLGQGASTHFE--NSFEG-SSFAVGDIALALYSALFSYSGWDTLNYVTEEIKNPERN 258
      Query 62 LPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVAL 121
      LP +I IS+P+VT +Y+ NVAY T + +++LAS+AVAVTF +++ G+ WI+P+SVAL
      Sbjct 259 LPLSIGISMPIVTIIYILTNVAYYTVLDMRDILASDAVAVTFADQIFGIFNWIIPLSVAL 318
      Query 122 STFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSDM 181
      S FGG+N S+ +SRLFF G+REGHLP + MIHV+R TP+P+LLF I L+ L D+
      Sbjct 319 SCFGGLNASIVAASRLFFVGSREGHLPDAICMIHVERFTPVPSLLFNGIMALIYLCVEDI 378
      Query 182 YTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLWS 241
      + LINY F + F G+++ GQ+ LRWK+PD PRP+K+++ FPI++ L FL+ L+S
      Sbjct 379 FQLINYYSFSYWFFVGLSIVGQLYLRWKEPDRPRPLKLSVFFPIVFCLCTIFLVAVPLYS 438
      Query 242 EPVVCGIGLAIMLTGVPVYFL--GVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGS 299
      + + IG+AI L+G+P YFL V +P + T Q +C+ V E++
      Sbjct 439 DTINSLIGIAIALSGLPFYFLIIRVPEHKRPLYLRRIVGSATRYLQVLCMSVAAEMDLED 498
      Query 300 GTEEANEDMEEQQQP 314
      G E M +Q+ P
      Sbjct 499 GGE-----MPKQRDP 508

  14. BLAST output: Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats. Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%) • BLAST expresses the level of similarity between the query sequence and a database sequence in terms of Score, Expect, Method, Identities, Positives, and Gaps. This is where our DMA layer ends; from this point on, inference must be done by researchers on the basis of the software output (e.g. BLAST) and knowledge gathered elsewhere (books, computers, brains…). • A forthcoming challenge in comparative genomic analysis is comparing large amounts of genomic data (letters). For example, to compare one mammalian genomic sequence against all existing mammalian sequences, one would need a database with 60 GB of storage (Sargasso Sea project).
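Since the researcher's inference starts from these summary numbers, a small sketch of reading them programmatically may help. The snippet parses the Identities, Positives, and Gaps fields from the summary line quoted on this slide and recomputes the fractions; the function name `parse_blast_summary` and the regular expression are assumptions for illustration, not part of any BLAST tooling.

```python
import re

# Summary line copied from the slide's "similar match" example.
SUMMARY = ("Identities = 138/315 (43%), Positives = 203/315 (64%), "
           "Gaps = 10/315 (3%)")


def parse_blast_summary(text: str) -> dict:
    """Extract (count, alignment_length) pairs for each named statistic."""
    pattern = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)")
    return {name: (int(n), int(total))
            for name, n, total in pattern.findall(text)}


stats = parse_blast_summary(SUMMARY)
for name, (n, total) in stats.items():
    print(f"{name}: {n}/{total} = {n / total:.1%}")
```

Keeping the raw counts rather than the printed percentages avoids depending on how the report rounds them.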

  15. Application for text analysis: • Frequency (number of occurrences) • Distance • Exclude stop words (and, if, or, etc.) • Stemming (traveling => travel; traveled => travel) • Synonyms (sick = ill) • Visual Basic
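The preprocessing steps listed above can be sketched in a few lines of Python (the original application was written in Visual Basic, which is not reproduced here). The stop-word list, the suffix-stripping stemmer, and the synonym map are deliberately tiny, hypothetical stand-ins. The distance measure is left out because the slides do not define how it is computed.

```python
import re
from collections import Counter

STOP_WORDS = {"and", "if", "or"}   # illustrative list, taken from the slide
SYNONYMS = {"ill": "sick"}         # map each synonym to one canonical form


def naive_stem(word: str) -> str:
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text: str) -> Counter:
    """Tokenize, drop stop words, stem, merge synonyms, then count."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = (naive_stem(w) for w in words if w not in STOP_WORDS)
    return Counter(SYNONYMS.get(s, s) for s in stems)


freq = term_frequencies("Traveling and traveled; sick or ill")
print(freq)
```

With the slide's own examples, "traveling" and "traveled" collapse to "travel", and "ill" is counted together with "sick".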

  16. Home-made Brandy Production • Grape-gathering is the first phase in the production of brandy, though it might also be made from plums, figs, pears or cornel berries. The gathered grapes are crushed and then poured into wooden barrels. They are mixed several times a day, the more often the better. The obtained mass is called wine-marc. The process of alcoholic fermentation usually lasts fifteen or thirty days. When it is finished, or when, as people usually say, the marc is still, distillation begins, i.e. the making of brandy, which is done in special copper cauldrons. Handmade copper cauldrons can still be found in Tuscany households…

  17. word     related word   frequency   distance
      brandy   grape          10          0
      brandy   alcohol        4           1
      brandy   distillation   3           3
      brandy   strength       3           5
      brandy   making         2           5
      …

  18. Concept criteria: • Frequency > 5 • Distance < 2 • Concepts: brandy grape; brandy alcohol • Transcription: brandy - made of - grapes; brandy - kind of - alcohol
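The selection step can be sketched as a simple filter over the word-pair table from the previous slide. The slide does not say how the two criteria combine. The sketch assumes that meeting either one is enough, because that reproduces the two concepts listed here: the pair brandy/alcohol has a frequency of 4 and would be rejected if both criteria were required. The function name `select_concepts` is hypothetical.

```python
# (word, related word, frequency, distance) rows from the slide's table.
PAIRS = [
    ("brandy", "grape", 10, 0),
    ("brandy", "alcohol", 4, 1),
    ("brandy", "distillation", 3, 3),
    ("brandy", "strength", 3, 5),
    ("brandy", "making", 2, 5),
]


def select_concepts(pairs, min_frequency=5, max_distance=2):
    """Keep pairs with frequency > min_frequency OR distance < max_distance.

    The OR is an assumption; the slide only lists the two thresholds.
    """
    return [(word, related) for word, related, freq, dist in pairs
            if freq > min_frequency or dist < max_distance]


concepts = select_concepts(PAIRS)
print(concepts)
```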

  19. Concept Modeling layer • Implicit (concepts are not explicitly mentioned): protein conceptual model • Explicit (concepts are explicitly mentioned and/or defined): Frequency > 5, Distance < 2

  20. Concept definition (CM) • Node in concept network (semantic web) • Node in concept web

  21. Concept definitions • A structure that carries meaning. • Needs other concepts and the relations among them to be defined. Without relations, a concept cannot exist. • Relations between concepts can also be regarded as concepts. • All concepts can be related to each other, forming either: 1. a concept web (where relations are also concepts) 2. a concept network (where relations are not concepts)
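One possible way to encode the difference between the two structures is sketched below, reusing the brandy transcription from the earlier slide. In the network, a relation is only a label on an edge. In the web, each relation is promoted to a node of its own. The data layout and the function name `network_to_web` are assumptions for illustration; the presentation does not prescribe a representation.

```python
# Concept network: relations are plain edge labels, not concepts.
network_nodes = {"brandy", "grapes", "alcohol"}
network_edges = [
    ("brandy", "made of", "grapes"),
    ("brandy", "kind of", "alcohol"),
]


def network_to_web(nodes, edges):
    """Promote every relation label to a node, yielding unlabeled links."""
    web_nodes = set(nodes)
    web_links = []
    for source, relation, target in edges:
        web_nodes.add(relation)               # the relation becomes a concept
        web_links.append((source, relation))  # source concept -> relation
        web_links.append((relation, target))  # relation -> target concept
    return web_nodes, web_links


web_nodes, web_links = network_to_web(network_nodes, network_edges)
print(sorted(web_nodes))
print(web_links)
```

In the web form a relation such as "made of" can itself take part in further relations, which is what the definition above allows.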

  22. [Figure: Concept Network and Concept Web]

  23. Concept Modeling Learning Module

  24. Thank you for your attention! Questions?
