Reading for the Next Week - PowerPoint PPT Presentation

charity-trevino
Presentation Transcript

  1. Reading for the Next Week • Sequence Analysis and Alignment • Chapter 5, Chapter 8, Chapter 11 • Only about the 1st third of each chapter

  2. Sequence Files • FASTA format has the simplest structure: >Sequence Title [new line] Sequence [new line] • very useful for handling sequence alone • usually included as one of the formats supported by programs that use sequence

  3. Example of Fasta Format
     >gi|212244|gb|M16260.1|CHKLCAMR…
     AGCTCCGTGCGCAGCGGTACCCGTACCGGTACCGGCCCGGTCCCTGAGCCATGGGCCGGCGGTGGGGTTCCCCCGCCCTGCAGCGCTTCCCCGTGTTGGTGCTGCTGCTGCTGCTCCAGGTGTGCGGCCGGCGGTGCGACGAGGCAGCCCCCTGCCAGCCCGGCTTTGCTGCAGAGACCTTCAGCTTCAGTGTGCCCCAGGACAGCGTGGCGGCGGGCAGGGAGCTGGGACGAGTGAGCTTTGCAGCCTGCAGCGGGCGGCCGTGGGCCGTGTATGTCCCGACTGACA…
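The title-line-then-sequence structure described above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not part of the original slides; the function name `parse_fasta` and the demo titles are hypothetical, and it assumes every record starts with a `>` line.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {title: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    title = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records[title] = "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence may be wrapped over several lines; collect them all.
            chunks.append(line)
    if title is not None:
        records[title] = "".join(chunks)
    return records


# Hypothetical two-record input (not real database entries).
example = """>demo_seq_1 hypothetical title
AGCTCCGTGC
GCAGCGGTAC
>demo_seq_2
ATGGGCC
"""
records = parse_fasta(example.splitlines())
print(records["demo_seq_1 hypothetical title"])
```

Because the format carries nothing but a title and the sequence, the parser needs no knowledge of fields or features, which is why so many programs accept it.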

  4. GENBANK Flat File • holdover from earlier versions of GENBANK, the US government-supported public database • DNA-centric, sequence based view of data • contains a number of fields with non-sequence information

  5. LOCUS       CHKLCAMR     3545 bp    mRNA    linear   VRT 30-NOV-1995
     DEFINITION  Chicken liver cell adhesion molecule L-CAM mRNA, complete cds.
     ACCESSION   M16260 J04074 M22179
     VERSION     M16260.1  GI:212244
     KEYWORDS    cadherin; glycoprotein; liver cell adhesion molecule.
     SOURCE      Gallus gallus cDNA to mRNA.
       ORGANISM  Gallus gallus
                 Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                 Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
                 Phasianinae; Gallus.
     REFERENCE   1  (bases 201 to 3545)
       AUTHORS   Gallin,W.J., Sorkin,B.C., Edelman,G.M. and Cunningham,B.A.
       TITLE     Sequence analysis of a cDNA clone encoding the liver cell
                 adhesion molecule, L-CAM
       JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 84 (9), 2808-2812 (1987)
       MEDLINE   87204217

  6. FEATURES             Location/Qualifiers
          source          1..3545
                          /organism="Gallus gallus"
                          /db_xref="taxon:9031"
                          /clone="pEC3(20,30,31)"
                          /tissue_type="liver"
                          /dev_stage="10-11 day old embryo"
          mRNA            <1..3545
                          /product="L-CAM mRNA"
          CDS             51..2714
                          /codon_start=1
                          /product="liver cell adhesion protein precursor"
                          /protein_id="AAA82573.1"
                          /db_xref="GI:212245"
                          /translation="MGRRWGSPALQRFPVLVLLLLLQV…GGEDDE"
          sig_peptide     51..128
          mat_peptide     531..2711
                          /product="liver cell adhesion protein"
     BASE COUNT      757 a   1125 c   1051 g    612 t
     ORIGIN      20 bp upstream of KpnI site.
             1 agctccgtgc gcagcggtac ccgtaccggt accggcccgg tccctgagcc atgggccggc
            61 ggtggggttc…
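The feature table locates things by 1-based, inclusive base positions in the ORIGIN sequence. As a rough sketch (not from the original slides), the Python below turns ORIGIN lines back into a plain sequence and slices it by a simple `start..end` location. The helper names are hypothetical, only the simple location form is handled, and because the slide shows just the first 70 bases, the demo uses the made-up sub-range `51..70` rather than the full CDS `51..2714`.

```python
import re


def origin_to_sequence(origin_lines):
    """Join ORIGIN lines, dropping the position numbers and spaces."""
    return "".join(re.sub(r"[^a-z]", "", line.lower()) for line in origin_lines)


def parse_location(location):
    """Parse a simple 'start..end' location; '<' or '>' mark partial ends."""
    match = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", location)
    if match is None:
        raise ValueError("unsupported location: " + location)
    return int(match.group(1)), int(match.group(2))


def extract(sequence, location):
    """Return the bases covered by a 1-based, inclusive location."""
    start, end = parse_location(location)
    return sequence[start - 1:end]


# The two ORIGIN lines visible on the slide (the second one is truncated there).
origin_lines = [
    "1 agctccgtgc gcagcggtac ccgtaccggt accggcccgg tccctgagcc atgggccggc",
    "61 ggtggggttc",
]
sequence = origin_to_sequence(origin_lines)

# Hypothetical sub-range starting at the CDS start position, base 51.
cds_start = extract(sequence, "51..70")
print(cds_start)
```

The first three bases of that slice are `atg`, consistent with the CDS starting at position 51 and the translation beginning with M.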

  7. Other Formats • XML - Extensible Markup Language • similar to HTML, but allows user-defined tags • Graphic • extracts positions from features and creates a graphical output

  8. Database Types Characteristics, Strengths and Weaknesses

  9. What is a Database? • well-defined storage method for digital data • allows for relatively rapid retrieval of data • allows for complex conditional retrieval

  10. Three Main Types Used in Bioinformatics • Flat File • text stored in a file in a stereotyped format • a hierarchical variant adds “tree” organization • Relational • a set of tables, with unique identifiers, and overlapping content • Object Oriented • data stored as part of a data structure (the object) that includes methods for manipulating the data

  11. Flat File Database • data is stored as an “unstructured” record • relationships between the data are inherent in the database schema, the description of the syntax of the storage file

  12. Flat File Database • Advantages • low overhead: no complex computational superstructure is needed to organize the data and keep track of it in memory • retrieval is not computationally complex • can take advantage of generalized standards for information organization

  13. Disadvantages • no random access, therefore the simplicity of storage imposes a cost on access and manipulation • partially resolved by indexing • change in the schema requires parsing and rewriting the whole database • all linkages between data entries must be explicitly defined either in the schema or by software that accesses the database
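The point about indexing can be illustrated with a small sketch in Python (not part of the original slides). It scans a FASTA-style flat file once, records the byte offset of each record, and then seeks straight to a record instead of re-reading the file. The function names, the demo records, and the choice of the first word of the title as the key are all assumptions made for illustration.

```python
import os
import tempfile


def build_index(path):
    """Map the first word of each title line to the byte offset of that line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                parts = line[1:].split()
                if parts:
                    index[parts[0].decode("ascii")] = offset
    return index


def fetch(path, index, key):
    """Seek directly to one record and return its sequence."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[key])
        handle.readline()  # skip the title line
        for line in handle:
            if line.startswith(b">"):
                break  # next record begins
            chunks.append(line.strip().decode("ascii"))
    return "".join(chunks)


# Hypothetical two-record flat file written to a temporary location.
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    tmp.write(b">rec1 first\nAGCT\nCCGT\n>rec2 second\nGGGG\n")
    path = tmp.name
try:
    index = build_index(path)
    fetched = fetch(path, index, "rec1")
finally:
    os.remove(path)

print(index)
print(fetched)
```

The index has to be rebuilt whenever the file changes, which is one reason indexing only partially resolves the access cost.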

  14. Relational Databases • functionally consist of a set of tables, where each row in the table contains a set of properties of some entity • extensive formal analysis of the relational approach has yielded a set of “normalizations” that maximize the interconnections between information and minimize redundancies

  15. Relational Databases • Advantages • readily available database management systems (DBMSs) that handle the computational overhead invisibly • high interconnectivity of data enhances data mining process • Structured Query Language (SQL) exists to make searching automated and relatively rapid, even complex searches

  16. changing the schema does not necessarily involve rewriting the whole database; new tables, or new columns in existing tables, can be added • the most common commercial database type, so lots of support is available (if you have the money) • wide usage means user skills are generalizable
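A minimal sketch of these ideas, using Python's built-in `sqlite3` module (not part of the original slides): two tables linked by a unique identifier, a column added to an existing table without rewriting the data, and an SQL query that joins the tables. The table and column names are hypothetical; the values are taken from the L-CAM record shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two tables; taxon_id is the unique identifier that links them.
conn.executescript("""
CREATE TABLE organism (
    taxon_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE sequence (
    accession TEXT PRIMARY KEY,
    taxon_id  INTEGER NOT NULL REFERENCES organism(taxon_id),
    length_bp INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO organism VALUES (?, ?)", (9031, "Gallus gallus"))
conn.execute("INSERT INTO sequence VALUES (?, ?, ?)", ("M16260", 9031, 3545))

# Schema change: add a column to an existing table, existing rows are kept.
conn.execute("ALTER TABLE sequence ADD COLUMN molecule TEXT")
conn.execute(
    "UPDATE sequence SET molecule = ? WHERE accession = ?", ("mRNA", "M16260")
)

# Conditional retrieval across both tables with a join.
rows = conn.execute(
    """
    SELECT s.accession, o.name, s.length_bp, s.molecule
    FROM sequence AS s
    JOIN organism AS o ON o.taxon_id = s.taxon_id
    WHERE s.length_bp > ?
    """,
    (1000,),
).fetchall()
print(rows)
```

Storing the organism name once in its own table, and referring to it by identifier, is a small instance of the redundancy reduction that normalization aims for.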

  17. Disadvantages • overhead (computational and expertise) makes cost high for small databases • content-based querying is only rudimentary; complex “fuzzy” queries cannot be done within SQL • not all implementations fully conform to the theoretical criteria, so problems arise in large databases and/or complex queries

  18. 5’ Break

  19. Object-Oriented Databases • based on the object concept, a computational entity that consists of data and a set of methods that will perform operations on that data • ACeDB, the core DBMS for the C. elegans sequencing project is object oriented
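The object concept, data bundled with the methods that operate on it, can be sketched in a few lines of Python. This is only an illustration of the idea, not ACeDB's data model or API; the class name `DnaRecord` and its methods are hypothetical.

```python
class DnaRecord:
    """A sequence (the data) plus methods that operate on it."""

    _COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence.upper()

    def length(self):
        return len(self.sequence)

    def gc_fraction(self):
        """Fraction of bases that are G or C (0.0 for an empty sequence)."""
        if not self.sequence:
            return 0.0
        gc = self.sequence.count("G") + self.sequence.count("C")
        return gc / len(self.sequence)

    def reverse_complement(self):
        """Return a new record holding the reverse-complement strand."""
        flipped = self.sequence.translate(self._COMPLEMENT)[::-1]
        return DnaRecord(self.title + " (reverse complement)", flipped)


record = DnaRecord("demo", "atgggccggc")  # hypothetical record
print(record.length(), record.gc_fraction())
print(record.reverse_complement().sequence)
```

Code that uses the record never touches the raw string handling directly, which is the sense in which methods intrinsic to the object reduce procedural programming.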

  20. Object-Oriented Databases • Advantages • pre-existing schema is already worked out if you use ACeDB (http://www.acedb.org/) • a lot of procedural programming is not needed because methods for data manipulation are intrinsic to the object • natural database for object-oriented languages like C++ and Java

  21. Disadvantages • not easy to tweak; the DBMS is fairly complex, and really only the developer community can alter it • if its data model is not adequate for your project, there is no easy way to expand it • therefore, tends to be good for specific genomes and high throughput operations, not for databases set up and maintained by small users for idiosyncratic projects

  22. Combined Relational/Object • Relational database (tables) that can hold objects • As implemented, the DBMS simulates the object by creating a set of hidden tables • Larger computational overhead, less user control of database structure

  23. Summary • each database type has strengths and weaknesses • choice of database to use depends on many cost factors (money, computational overhead, learning curve for use, pre-existing support) • there is no single right choice

  24. GENBANK • the core GENBANK archival database is in a flat file format • historically, that is how it started • when a major revamping was undertaken in the mid-90s, it stayed with the flat file format but introduced a defined hierarchical data model using ASN.1

  25. ASN.1 • the underlying structure of the GENBANK database uses Abstract Syntax Notation One (ASN.1) as the syntax definition • this is a standard, general syntax definition for holding information in a machine-parseable form • hierarchical structure helps organize data

  26. Seq-entry ::= set {
        level 1 ,
        class nuc-prot ,
        descr {
          title "Chicken liver cell adhesion molecule L-CAM mRNA, and translated products" ,
          update-date std { year 1995 , month 11 , day 30 } ,
          source {
            org {
              taxname "Gallus gallus" ,
              common "chicken" ,
              db { { db "taxon" , tag

  27. GENBANK Data Model • to implement a database, you must have a data model • for flat files, consists of a set of rules about • the format of data storage • the syntax of the storage • implementation of any data analysis or manipulation is the responsibility of the user

  28. GENBANK Data Model • explicitly defined in an on-line document • http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML • note: although not object oriented, the specification uses much of the terminology of object-oriented programming

  29. BioSeqs • has at least one Seq-id • contains information about a biological sequence • virtual - contains a molecular type, a size, and topology (e.g. a band on a gel, an intron whose sequence has not been determined)

  30. raw - simple single sequence, which has all the properties of virtual plus actual sequence data • segmented - contains identifiers for other bioseqs and relative positional information, thus yielding a size • map - contains a rough size and co-ordinates that represent some kind of map data
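The relationship between the bioseq kinds can be sketched as Python dataclasses. This is a hypothetical illustration of the descriptions above, not the NCBI toolkit's types or field names: a raw bioseq is modelled as a virtual one plus sequence data, and a segmented bioseq derives its size from the parts it points to. The map kind is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VirtualBioseq:
    """Molecule type, size and topology, but no sequence data."""
    seq_ids: List[str]
    molecule: str
    length: int
    topology: str = "linear"

    def __post_init__(self):
        if not self.seq_ids:
            raise ValueError("a bioseq needs at least one Seq-id")


@dataclass
class RawBioseq(VirtualBioseq):
    """Everything a virtual bioseq has, plus the actual sequence."""
    sequence: str = ""

    def __post_init__(self):
        super().__post_init__()
        if len(self.sequence) != self.length:
            raise ValueError("sequence does not match the declared length")


@dataclass
class SegmentedBioseq:
    """Points at parts of other bioseqs: (seq_id, start, end), 1-based inclusive."""
    seq_ids: List[str]
    molecule: str
    segments: List[Tuple[str, int, int]]

    @property
    def length(self):
        # The size is not stored; it follows from the segments.
        return sum(end - start + 1 for _, start, end in self.segments)


# Hypothetical identifiers and coordinates.
raw = RawBioseq(["demo|1"], "dna", 10, "linear", "atgggccggc")
seg = SegmentedBioseq(["demo|seg"], "dna", [("demo|1", 1, 10), ("demo|2", 5, 9)])
print(raw.length, seg.length)
```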

  31. Bioseq Sets • sets of bioseq entities that are related somehow • nuc-prot set - nucleotide type bioseq and one or more associated protein type bioseqs • population set - set of related bioseqs that are aligned with each other. This is a basic type for population and phylogenetic studies

  32. Seq-Annot • a self-contained annotation that refers to a specific bioseq entity • a bioseq can have multiple seq-annots • these elements hold the annotation data, e.g. positions of start sites, stop sites, introns, regulatory sequences

  33. Our Lab Project • one of the main elements of the labs in this course is designing a database, populating it, analyzing the sequences in various ways, and annotating the database • we will use a flat file format to store data on a family of proteins • therefore, we need to define a schema