Reading for the Next Week

PowerPoint Slideshow about ' Reading for the Next Week' - charity-trevino


Reading for the Next Week
  • Sequence Analysis and Alignment
  • Chapter 5, Chapter 8, Chapter 11
  • Only about the 1st third of each chapter
Sequence Files
  • FASTA format has the simplest structure:

>Sequence Title [new line]
Sequence [new line]

  • very useful for handling the sequence alone
  • usually among the formats supported by programs that use sequence data
Example of Fasta Format

>gi|212244|gb|M16260.1|CHKLCAMR…

AGCTCCGTGCGCAGCGGTACCCGTACCGGTACCGGCCCGGTCCCTGAGCCATGGGCCGGCGGTGGGGTTCCCCCGCCCTGCAGCGCTTCCCCGTGTTGGTGCTGCTGCTGCTGCTCCAGGTGTGCGGCCGGCGGTGCGACGAGGCAGCCCCCTGCCAGCCCGGCTTTGCTGCAGAGACCTTCAGCTTCAGTGTGCCCCAGGACAGCGTGGCGGCGGGCAGGGAGCTGGGACGAGTGAGCTTTGCAGCCTGCAGCGGGCGGCCGTGGGCCGTGTATGTCCCGACTGACA…
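Because the structure is just a title line starting with ">" followed by sequence lines, a FASTA reader can be very small. The following is a minimal sketch, not a reference implementation: the function name `parse_fasta` and the two toy records are made up for illustration, and it assumes the whole file fits in memory.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted text."""
    title = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence may be wrapped over several lines; join them later.
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


# Toy input (hypothetical records, not real database entries).
example = (
    ">seq1 first toy record\n"
    "AGCTCCGTGC\n"
    "GCAGCGGTAC\n"
    ">seq2 second toy record\n"
    "ATGGGCCGGC\n"
)
records = list(parse_fasta(example))
```

Each record comes back as a title string (without the ">") and a single joined sequence string, which is usually all a downstream program needs.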

GENBANK Flat File
  • holdover from earlier versions of GENBANK, the US government-supported public database
  • DNA-centric, sequence based view of data
  • contains a number of fields with non-sequence information
LOCUS       CHKLCAMR     3545 bp    mRNA    linear   VRT 30-NOV-1995
DEFINITION  Chicken liver cell adhesion molecule L-CAM mRNA, complete cds.
ACCESSION   M16260 J04074 M22179
VERSION     M16260.1  GI:212244
KEYWORDS    cadherin; glycoprotein; liver cell adhesion molecule.
SOURCE      Gallus gallus cDNA to mRNA.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
            Phasianinae; Gallus.
REFERENCE   1  (bases 201 to 3545)
  AUTHORS   Gallin,W.J., Sorkin,B.C., Edelman,G.M. and Cunningham,B.A.
  TITLE     Sequence analysis of a cDNA clone encoding the liver cell adhesion
            molecule, L-CAM
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 84 (9), 2808-2812 (1987)
  MEDLINE   87204217
FEATURES             Location/Qualifiers
     source          1..3545
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /clone="pEC3(20,30,31)"
                     /tissue_type="liver"
                     /dev_stage="10-11 day old embryo"
     mRNA            <1..3545
                     /product="L-CAM mRNA"
     CDS             51..2714
                     /codon_start=1
                     /product="liver cell adhesion protein precursor"
                     /protein_id="AAA82573.1"
                     /db_xref="GI:212245"
                     /translation="MGRRWGSPALQRFPVLVLLLLLQV…GGEDDE"
     sig_peptide     51..128
     mat_peptide     531..2711
                     /product="liver cell adhesion protein"
BASE COUNT      757 a   1125 c   1051 g    612 t
ORIGIN      20 bp upstream of KpnI site.
        1 agctccgtgc gcagcggtac ccgtaccggt accggcccgg tccctgagcc atgggccggc
       61 ggtggggttc…
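The non-sequence fields in the record above can be read with simple text processing. As a rough sketch (the variable names are illustrative, and only plain "start..end" locations are handled, not forms such as "<1..3545"), the following checks that the BASE COUNT line agrees with the length on the LOCUS line and works out the length of the CDS feature:

```python
import re

# Lines copied from the flat file record shown above.
locus_line = "LOCUS CHKLCAMR 3545 bp mRNA linear VRT 30-NOV-1995"
base_count_line = "BASE COUNT 757 a 1125 c 1051 g 612 t"
cds_location = "51..2714"

# Declared length: the integer that precedes "bp" on the LOCUS line.
declared_length = int(re.search(r"(\d+)\s+bp", locus_line).group(1))

# Base counts: each number is followed by the base it counts.
counts = {
    base: int(number)
    for number, base in re.findall(r"(\d+)\s+([acgt])\b", base_count_line)
}

# Locations are 1-based and inclusive, so length = end - start + 1.
start, end = (int(x) for x in cds_location.split(".."))
cds_length = end - start + 1
codons = cds_length // 3

print(declared_length, sum(counts.values()))  # both 3545
print(cds_length, codons)                     # 2664 nucleotides, 888 codons
```

The counts sum to the declared 3545 bp, and the CDS spans a whole number of codons, which is a quick sanity check on the annotation.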

Other Formats
  • XML - extensible markup language
    • similar to HTML, but allows user-defined tags
  • Graphic
    • extracts positions from features and creates a graphical output

Database Types

Characteristics, Strengths and Weaknesses

What is a Database?
  • well-defined storage method for digital data
  • allows for relatively rapid retrieval of data
  • allows for complex conditional retrieval
Three Main Types Used in Bioinformatics
  • Flat File
    • text stored in a file in stereotyped format
    • a hierarchical flat file adds “tree” organization
  • Relational
    • a set of tables, with unique identifiers, and overlapping content
  • Object Oriented
    • data stored as part of a data structure (the object) that includes methods for manipulating the data
Flat File Database
  • data is stored as an “unstructured” record
  • relationships between the data are inherent in the database schema, i.e. the description of the syntax of the storage file
Flat File Database
  • Advantages
    • low overhead, do not need to have a complex computational superstructure to organize the data and keep track of it in memory
    • retrieval is not computationally complex
    • can take advantage of generalized standards for information organization
  • Disadvantages
    • no random access, therefore the simplicity of storage imposes a cost on access and manipulation
      • partially resolved by indexing
    • change in the schema requires parsing and rewriting the whole database
    • all linkages between data entries must be explicitly defined either in the schema or by software that accesses the database
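Indexing, mentioned above as a partial fix for the lack of random access, can be sketched as follows. The idea is to scan the file once, remember the byte offset at which each record starts, and later seek straight to a record. This is a minimal illustration under stated assumptions: the file is FASTA-like, the first word of each title is used as the key, and `build_index` and `fetch` are hypothetical helper names.

```python
import io


def build_index(handle):
    """Map the first word of each title line to the byte offset of that line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            key = line[1:].split()[0].decode("ascii")
            index[key] = offset
    return index


def fetch(handle, index, key):
    """Jump to one record via the index and return its sequence."""
    handle.seek(index[key])
    handle.readline()  # skip the title line
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # start of the next record
        chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")


# An in-memory stand-in for a flat file on disk (toy records).
flat_file = io.BytesIO(b">seq1 toy\nAGCT\nCCGT\n>seq2 toy\nGGTA\n")
index = build_index(flat_file)
seq2 = fetch(flat_file, index, "seq2")
```

The index must be rebuilt whenever the file changes, which is part of the cost that the simple storage format imposes.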
Relational Databases
  • functionally consist of a set of tables, where each row in the table contains a set of properties of some entity
  • extensive formal analysis of the relational approach has yielded a set of “normalizations” that maximize the interconnections between pieces of information and minimize redundancy
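A small sketch of the table idea, using Python's built-in sqlite3 module with an in-memory database. The table and column names (`organism`, `seq_record`, and so on) are invented for this example; the values come from the chicken L-CAM record used earlier. The organism name is stored once and referred to by its unique identifier, and a query joins the two tables back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE seq_record (
        accession   TEXT PRIMARY KEY,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
        length_bp   INTEGER NOT NULL
    );
    """
)

# Each row holds the properties of one entity.
conn.execute("INSERT INTO organism VALUES (1, 'Gallus gallus')")
conn.execute("INSERT INTO seq_record VALUES ('M16260', 1, 3545)")

# A conditional query that follows the shared identifier between tables.
rows = conn.execute(
    """
    SELECT s.accession, o.name, s.length_bp
    FROM seq_record AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    WHERE s.length_bp > 1000
    """
).fetchall()
```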
Relational Databases
  • Advantages
    • readily available database management systems (DBMSs) that handle the computational overhead invisibly
    • high interconnectivity of data enhances data mining process
    • Structured Query Language (SQL) makes searches, even complex ones, automated and relatively rapid
    • changing schema does not necessarily involve rewriting whole database; can add new tables or new columns to existing tables
    • most common commercial database type therefore lots of support available (if you have the money)
    • wide usage means user skills are generalizable
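The point about schema changes can be illustrated with a short sketch, again using sqlite3 and made-up table and column names. A new column is added to a table that already holds data, and the existing row is left in place rather than rewritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seq_record (accession TEXT PRIMARY KEY, length_bp INTEGER)"
)
conn.execute("INSERT INTO seq_record VALUES ('M16260', 3545)")

# Extend the schema; existing rows get NULL in the new column.
conn.execute("ALTER TABLE seq_record ADD COLUMN molecule_type TEXT")
conn.execute(
    "UPDATE seq_record SET molecule_type = 'mRNA' WHERE accession = 'M16260'"
)

# table_info returns one row per column; index 1 is the column name.
columns = [info[1] for info in conn.execute("PRAGMA table_info(seq_record)")]
row = conn.execute("SELECT * FROM seq_record").fetchone()
```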
  • Disadvantages
    • overhead (computational and expertise) makes cost high for small databases
    • content-based querying is only rudimentary; complex “fuzzy” queries cannot be done within SQL
    • not all implementations fully conform to the theoretical criteria, so problems arise with large databases and/or complex queries
Object-Oriented Databases
  • based on the object concept, a computational entity that consists of data and a set of methods that will perform operations on that data
  • ACeDB, the core DBMS for the C. elegans sequencing project, is object oriented
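The object concept can be shown with a small class that keeps sequence data together with the methods that operate on it. This is only an illustration of the idea in Python; the class `DnaSequence` and its methods are invented here and are unrelated to how ACeDB itself is implemented.

```python
class DnaSequence:
    """An object bundling sequence data with methods that act on it."""

    _COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

    def __init__(self, identifier: str, residues: str) -> None:
        self.identifier = identifier
        self.residues = residues

    def __len__(self) -> int:
        return len(self.residues)

    def gc_fraction(self) -> float:
        """Fraction of bases that are G or C (0.0 for an empty sequence)."""
        if not self.residues:
            return 0.0
        gc = sum(1 for base in self.residues.upper() if base in "GC")
        return gc / len(self.residues)

    def reverse_complement(self) -> "DnaSequence":
        """Return a new object holding the reverse complement."""
        flipped = self.residues.translate(self._COMPLEMENT)[::-1]
        return DnaSequence(self.identifier + "_rc", flipped)


seq = DnaSequence("toy1", "AGCTCCGTGC")  # toy data
rc = seq.reverse_complement()
```

Code that uses the object does not need to know how the sequence is stored; it only calls the methods.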
Object-Oriented Databases
  • Advantages
    • pre-existing schema is already worked out if you use ACeDB (http://www.acedb.org/)
    • a lot of procedural programming is not needed because methods for data manipulation are intrinsic to the object
    • natural database for object-oriented languages like C++ and Java
  • Disadvantages
    • not easy to tweak; the DBMS is fairly complex, really only the developer community can alter it
    • if their data model is not adequate for your project, there is no easy way to expand it
    • therefore, tends to be good for specific genomes, high throughput operations, not databases set up and maintained by small users for idiosyncratic projects
Combined Relational/Object
  • Relational database (tables) that can hold objects
  • As implemented, the DBMS simulates the object by creating a set of hidden tables
  • Larger computational overhead, less user control of database structure
Summary
  • each database type has strengths and weaknesses
  • choice of database to use depends on many cost factors (money, computational overhead, learning curve for use, pre-existing support)
  • there is no single right choice
GENBANK
  • the core GENBANK archival database is a flat file format
  • historically that is the way it started
  • when a major revamp was undertaken in the mid-1990s, GENBANK stayed with the flat file format but introduced a defined hierarchical data model using ASN.1
ASN.1
  • the underlying structure of the GENBANK database uses Abstract Syntax Notation One (ASN.1) as the syntax definition
  • this is a standard, general syntax definition for holding information in a machine-parseable form
  • hierarchical structure helps organize data
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Chicken liver cell adhesion molecule L-CAM mRNA, and translated
 products" ,
    update-date
      std {
        year 1995 ,
        month 11 ,
        day 30 } ,
    source {
      org {
        taxname "Gallus gallus" ,
        common "chicken" ,
        db {
          {
            db "taxon" ,
            tag
GENBANK Data Model
  • to implement a database, you must have a data model
  • for flat files, consists of a set of rules about
    • the format of data storage
    • the syntax of the storage
  • implementation of any data analysis or manipulation is the responsibility of the user
GENBANK Data Model
  • explicitly defined in an online document
  • http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML
  • note: although not object oriented, the specification uses much of the terminology of object-oriented programming
BioSeqs
  • has at least one Seq-id
  • contains information about a biological sequence
    • virtual - contains a molecular type, a size, and topology (e.g. a band on a gel, an intron whose sequence has not been determined)
    • raw - simple single sequence, which has all the properties of virtual plus actual sequence data
    • segmented - contains identifiers for other bioseqs and relative positional information, thus yielding a size
    • map - contains a rough size and co-ordinates that represent some kind of map data
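The relationship between the Bioseq types above can be sketched with Python dataclasses. All class and field names here are invented for illustration and do not reproduce the actual NCBI specification; the map type is omitted. A raw Bioseq derives its size from its residues, and a segmented Bioseq derives its size from the intervals it points to.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VirtualBioseq:
    """Molecule type, size and topology, but no sequence data."""
    seq_ids: List[str]
    molecule_type: str
    topology: str = "linear"
    length: Optional[int] = None

    def __post_init__(self):
        if not self.seq_ids:
            raise ValueError("a Bioseq needs at least one Seq-id")


@dataclass
class RawBioseq(VirtualBioseq):
    """Everything a virtual Bioseq has, plus the actual residues."""
    residues: str = ""

    def __post_init__(self):
        super().__post_init__()
        self.length = len(self.residues)


@dataclass
class Segment:
    seq_id: str  # identifier of another Bioseq
    start: int   # 1-based, inclusive
    end: int


@dataclass
class SegmentedBioseq(VirtualBioseq):
    """Built from intervals on other Bioseqs; size follows from them."""
    segments: List[Segment] = field(default_factory=list)

    def __post_init__(self):
        super().__post_init__()
        self.length = sum(s.end - s.start + 1 for s in self.segments)


# Toy instances (identifiers are made up).
band = VirtualBioseq(seq_ids=["toy|band1"], molecule_type="DNA", length=1200)
raw = RawBioseq(seq_ids=["toy|raw1"], molecule_type="mRNA", residues="AGCTCCGTGC")
seg = SegmentedBioseq(
    seq_ids=["toy|seg1"],
    molecule_type="mRNA",
    segments=[Segment("toy|raw1", 1, 4), Segment("toy|raw1", 7, 10)],
)
```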
Bioseq Sets
  • sets of bioseq entities that are related somehow
    • nuc-prot set - nucleotide type bioseq and one or more associated protein type bioseqs
    • population set - set of related bioseqs that are aligned with each other. This is a basic type for population and phylogenetic studies
Seq-Annot
  • A self-contained annotation that refers to a specific Bioseq entity
  • A Bioseq can have multiple Seq-annots
  • These elements hold the annotation data, e.g. positions of start sites, stop site, introns, regulatory sequences
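A rough sketch of the Seq-annot idea: each annotation block names the Bioseq it refers to and carries its own list of positioned features, and several blocks can point at the same Bioseq. The class and function names are invented for illustration; the feature positions are taken from the L-CAM flat file record shown earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    kind: str
    start: int  # 1-based, inclusive
    end: int


@dataclass
class SeqAnnot:
    seq_id: str  # the Bioseq this annotation refers to
    features: List[Feature] = field(default_factory=list)


# Two separate annotation blocks on the same Bioseq.
annots = [
    SeqAnnot("M16260.1", [Feature("CDS", 51, 2714),
                          Feature("sig_peptide", 51, 128)]),
    SeqAnnot("M16260.1", [Feature("mat_peptide", 531, 2711)]),
]


def features_at(annots, seq_id, position):
    """Names of all features on seq_id that cover the given position."""
    return sorted(
        feature.kind
        for annot in annots
        if annot.seq_id == seq_id
        for feature in annot.features
        if feature.start <= position <= feature.end
    )


at_100 = features_at(annots, "M16260.1", 100)
at_600 = features_at(annots, "M16260.1", 600)
```

Because each annotation is self-contained, new blocks can be added later without touching the sequence record itself.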
Our Lab Project
  • one of the main elements of the labs in this course is designing a database, populating it, analyzing the sequences in various ways, and annotating the database
  • we will use a flat file format to store data on a family of proteins
  • therefore, we need to define a schema