Reading for the Next Week - PowerPoint PPT Presentation


PowerPoint Slideshow about ' Reading for the Next Week' - charity-trevino


Reading for the Next Week

  • Sequence Analysis and Alignment

  • Chapter 5, Chapter 8, Chapter 11

  • Only about the 1st third of each chapter


Sequence Files

  • FASTA format has the simplest structure

    >Sequence Title [new line]

    Sequence [new line]

  • very useful for handling sequence alone

  • usually included as one of the formats supported by programs that use sequence


Example of FASTA Format

>gi|212244|gb|M16260.1|CHKLCAMR…

AGCTCCGTGCGCAGCGGTACCCGTACCGGTACCGGCCCGGTCCCTGAGCCATGGGCCGGCGGTGGGGTTCCCCCGCCCTGCAGCGCTTCCCCGTGTTGGTGCTGCTGCTGCTGCTCCAGGTGTGCGGCCGGCGGTGCGACGAGGCAGCCCCCTGCCAGCCCGGCTTTGCTGCAGAGACCTTCAGCTTCAGTGTGCCCCAGGACAGCGTGGCGGCGGGCAGGGAGCTGGGACGAGTGAGCTTTGCAGCCTGCAGCGGGCGGCCGTGGGCCGTGTATGTCCCGACTGACA…
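The two-line structure described above (a ">" title line followed by sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch, not a full-featured parser; the function name `parse_fasta` and the two toy records are made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of title -> sequence from lines of FASTA text."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # a title line starts a new record
            title = line[1:].strip()
            records[title] = ""
        elif title is not None:
            # a sequence may be wrapped over several lines; join them
            records[title] += line
    return records


# Toy input (hypothetical titles and sequences)
example = """>seq1 hypothetical title
AGCTCCGTGC
GCAGCGGTAC
>seq2 another hypothetical title
ATGGGCCGGC
"""

records = parse_fasta(example.splitlines())
print(records["seq1 hypothetical title"])  # AGCTCCGTGCGCAGCGGTAC
```

Because the format carries nothing but a title and the sequence, any extra information has to be packed into the title line, as in the gi|...|gb|... identifier of the example above.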


GENBANK Flat File

  • holdover from earlier versions of GENBANK, the US government-supported public database

  • DNA-centric, sequence based view of data

  • contains a number of fields with non-sequence information


LOCUS       CHKLCAMR     3545 bp    mRNA    linear   VRT 30-NOV-1995
DEFINITION  Chicken liver cell adhesion molecule L-CAM mRNA, complete cds.
ACCESSION   M16260 J04074 M22179
VERSION     M16260.1  GI:212244
KEYWORDS    cadherin; glycoprotein; liver cell adhesion molecule.
SOURCE      Gallus gallus cDNA to mRNA.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
            Phasianinae; Gallus.
REFERENCE   1  (bases 201 to 3545)
  AUTHORS   Gallin,W.J., Sorkin,B.C., Edelman,G.M. and Cunningham,B.A.
  TITLE     Sequence analysis of a cDNA clone encoding the liver cell adhesion
            molecule, L-CAM
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 84 (9), 2808-2812 (1987)
  MEDLINE   87204217
FEATURES             Location/Qualifiers
     source          1..3545
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /clone="pEC3(20,30,31)"
                     /tissue_type="liver"
                     /dev_stage="10-11 day old embryo"
     mRNA            <1..3545
                     /product="L-CAM mRNA"
     CDS             51..2714
                     /codon_start=1
                     /product="liver cell adhesion protein precursor"
                     /protein_id="AAA82573.1"
                     /db_xref="GI:212245"
                     /translation="MGRRWGSPALQRFPVLVLLLLLQV…GGEDDE"
     sig_peptide     51..128
     mat_peptide     531..2711
                     /product="liver cell adhesion protein"
BASE COUNT      757 a   1125 c   1051 g    612 t
ORIGIN      20 bp upstream of KpnI site.
        1 agctccgtgc gcagcggtac ccgtaccggt accggcccgg tccctgagcc atgggccggc
       61 ggtggggttc…
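The non-sequence fields of a flat file record can be pulled out by position, because each keyword sits in a fixed-width column at the start of the line. The sketch below assumes the usual GENBANK layout (keyword in the first 12 columns, continuation lines left blank there). The names `parse_header` and `location_length` are invented for this example, and only simple `start..end` locations are handled.

```python
from typing import Dict

# A few header lines from the record above, in fixed-column layout
SAMPLE = """\
LOCUS       CHKLCAMR     3545 bp    mRNA    linear   VRT 30-NOV-1995
DEFINITION  Chicken liver cell adhesion molecule L-CAM mRNA, complete cds.
ACCESSION   M16260 J04074 M22179
VERSION     M16260.1  GI:212244
SOURCE      Gallus gallus cDNA to mRNA.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
"""


def parse_header(text: str) -> Dict[str, str]:
    """Map each keyword to its value, joining continuation lines."""
    fields: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
            fields[current] = value
        elif current is not None:
            # blank keyword column: this line continues the previous field
            fields[current] += " " + value
    return fields


def location_length(location: str) -> int:
    """Length in bases of a simple 'start..end' feature location."""
    start, end = location.lstrip("<>").split("..")
    return int(end.lstrip("<>")) - int(start) + 1


fields = parse_header(SAMPLE)
print(fields["ACCESSION"])          # M16260 J04074 M22179
print(location_length("51..2714"))  # 2664 bases in the CDS
```

Note that all of the interpretation (which column means what, how locations are written) lives in the reading program, not in the file itself.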


Other Formats

  • XML - extensible markup language

    • similar to HTML, but allows user-defined tags

  • Graphic

    • extracts positions from features and creates a graphical output


Database Types

Characteristics, Strengths and Weaknesses


What is a Database?

  • well-defined storage method for digital data

  • allows for relatively rapid retrieval of data

  • allows for complex conditional retrieval


Three Main Types Used in Bioinformatics

  • Flat File

    • text stored in a file in stereotyped format

    • Hierarchical adds “tree” organization

  • Relational

    • a set of tables, with unique identifiers, and overlapping content

  • Object Oriented

    • data stored as part of a data structure (the object) that includes methods for manipulating the data


Flat File Database

  • data is stored as an “unstructured” record

  • relationships between the data are inherent in the database schema, the description of the syntax of the storage file


Flat File Database

  • Advantages

    • low overhead, do not need to have a complex computational superstructure to organize the data and keep track of it in memory

    • retrieval is not computationally complex

    • can take advantage of generalized standards for information organization


  • Disadvantages

    • no random access, therefore the simplicity of storage imposes a cost on access and manipulation

      • partially resolved by indexing

    • change in the schema requires parsing and rewriting the whole database

    • all linkages between data entries must be explicitly defined either in the schema or by software that accesses the database
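The indexing workaround mentioned above can be sketched as follows: scan the flat file once, remember the byte offset at which each record starts, and later seek directly to a record instead of reading the whole file. This is an illustrative sketch using a FASTA-style file; `build_index`, `fetch`, and the three toy records are assumptions made for the example.

```python
import os
import tempfile
from typing import Dict


def build_index(path: str) -> Dict[str, int]:
    """Map each record title to the byte offset of its '>' line."""
    index: Dict[str, int] = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                index[line[1:].strip().decode("ascii")] = offset
    return index


def fetch(path: str, index: Dict[str, int], title: str) -> str:
    """Seek straight to one record and return its sequence."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[title])
        handle.readline()  # skip the title line
        for line in handle:
            if line.startswith(b">"):
                break  # next record begins
            parts.append(line.strip().decode("ascii"))
    return "".join(parts)


# Write a small toy flat file (hypothetical records)
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">rec1\nAGCT\nCCGT\n>rec2\nGGGG\n>rec3\nTTAA\n")
    path = tmp.name

index = build_index(path)
rec2 = fetch(path, index, "rec2")
os.remove(path)
print(index)  # {'rec1': 0, 'rec2': 16, 'rec3': 27}
print(rec2)   # GGGG
```

The index only partially resolves the problem: it must be rebuilt whenever the file changes, since every edit shifts the offsets of the records that follow.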


Relational Databases

  • functionally consist of a set of tables, where each row in the table contains a set of properties of some entity

  • extensive formal analysis of relational approach has yielded a set of “normalizations” that maximize the interconnections between information, minimize redundancies


Relational Databases

  • Advantages

    • readily available database management systems (DBMSs) that handle the computational overhead invisibly

    • high interconnectivity of data enhances data mining process

    • Structured Query Language (SQL) exists to make searching automated and relatively rapid, even complex searches



    • … database; can add new tables or new columns to existing tables

  • Disadvantages

    • overhead (computational and expertise) makes cost high for small databases

    • content-based query is only rudimentary; cannot do complex “fuzzy” queries within SQL

    • implementations do not fully conform to the theoretical criteria, so problems arise in large databases and/or complex queries
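To make the table-based approach concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are invented for this example, and the accessions X00001 and X00002 are made-up placeholders; only M16260 and its length come from the record shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two tables linked by a unique identifier (organism_id)
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    length_bp   INTEGER NOT NULL
);
""")

conn.execute("INSERT INTO organism VALUES (1, 'Gallus gallus')")
conn.execute("INSERT INTO organism VALUES (2, 'Hypothetical species')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("M16260", 1, 3545), ("X00001", 1, 900), ("X00002", 2, 4000)],
)

# Conditional retrieval across both tables in one SQL statement
rows = conn.execute("""
    SELECT s.accession, o.name
    FROM sequence AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    WHERE o.name = 'Gallus gallus' AND s.length_bp > 1000
    ORDER BY s.accession
""").fetchall()
print(rows)  # [('M16260', 'Gallus gallus')]

# A new column can be added without rewriting the existing rows
conn.execute("ALTER TABLE sequence ADD COLUMN molecule_type TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(sequence)")]
print(columns)
```

The query also shows the limitation noted above: SQL matches on exact values and simple patterns, so a "fuzzy" search such as sequence similarity has to be done by software outside the database.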


5’ Break


Object-Oriented Databases

  • based on the object concept, a computational entity that consists of data and a set of methods that will perform operations on that data

  • ACeDB, the core DBMS for the C. elegans sequencing project, is object oriented
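The object concept can be illustrated with a small class that bundles sequence data with the methods that operate on it. This is a generic sketch of the idea, not a depiction of how ACeDB stores its objects; the class name `DnaSequence` and its methods are invented for the example.

```python
class DnaSequence:
    """An object: data (name, bases) plus methods that act on that data."""

    _COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def __init__(self, name: str, bases: str) -> None:
        self.name = name
        self.bases = bases.upper()

    def reverse_complement(self) -> str:
        # complement each base, then reverse the strand
        return self.bases.translate(self._COMPLEMENT)[::-1]

    def gc_fraction(self) -> float:
        # fraction of bases that are G or C
        gc = sum(1 for base in self.bases if base in "GC")
        return gc / len(self.bases)


seq = DnaSequence("demo", "atggcc")
print(seq.reverse_complement())  # GGCCAT
print(round(seq.gc_fraction(), 3))
```

The caller never manipulates the raw string directly; it asks the object to perform the operation, which is what makes separate procedural code for each manipulation unnecessary.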


Object-Oriented Databases

  • Advantages

    • pre-existing schema is already worked out if you use ACeDB (http://www.acedb.org/)

    • a lot of procedural programming is not needed because methods for data manipulation are intrinsic to the object

    • natural database for object-oriented languages like C++ and Java


  • Disadvantages

    • not easy to tweak; the DBMS is fairly complex, really only the developer community can alter it

    • if their data model is not adequate for your project, there is no easy way to expand it

    • therefore, tends to be good for specific genomes, high throughput operations, not databases set up and maintained by small users for idiosyncratic projects


Combined Relational/Object

  • Relational database (tables) that can hold objects

  • As implemented, the DBMS simulates the object by creating a set of hidden tables

  • Larger computational overhead, less user control of database structure


Summary

  • each database type has strengths and weaknesses

  • choice of database to use depends on many cost factors (money, computational overhead, learning curve for use, pre-existing support)

  • there is no single right choice


GENBANK

  • the core GENBANK archival database is a flat file format

  • historically that is the way it started

  • when a major revamp was undertaken in the mid-90s, GENBANK stayed with the flat file format but introduced a defined hierarchical data model using ASN.1


ASN.1

  • the underlying structure of the GENBANK database uses Abstract Syntax Notation One (ASN.1) as the syntax definition

  • this is a standard, general syntax definition for holding information in a machine-parseable form

  • hierarchical structure helps organize data


Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Chicken liver cell adhesion molecule L-CAM mRNA, and translated products" ,
    update-date
      std {
        year 1995 ,
        month 11 ,
        day 30 } ,
    source {
      org {
        taxname "Gallus gallus" ,
        common "chicken" ,
        db {
          {
            db "taxon" ,
            tag
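The value of the hierarchy is that any item can be reached by following a path of named levels from the top of the entry. The sketch below mirrors the visible part of the ASN.1 fragment above as nested Python dictionaries; it is a simplified illustration of the tree organization, not an ASN.1 parser, and the helper `lookup` is invented for the example.

```python
from typing import Any, Sequence

# Simplified stand-in for the fragment above (only the fields shown there)
entry = {
    "level": 1,
    "class": "nuc-prot",
    "descr": {
        "title": "Chicken liver cell adhesion molecule L-CAM mRNA, "
                 "and translated products",
        "update-date": {"std": {"year": 1995, "month": 11, "day": 30}},
        "source": {"org": {"taxname": "Gallus gallus", "common": "chicken"}},
    },
}


def lookup(node: Any, path: Sequence[str]) -> Any:
    """Walk down the tree one named level at a time."""
    for key in path:
        node = node[key]
    return node


taxname = lookup(entry, ["descr", "source", "org", "taxname"])
year = lookup(entry, ["descr", "update-date", "std", "year"])
print(taxname, year)  # Gallus gallus 1995
```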


GENBANK Data Model

  • to implement a database, you must have a data model

  • for flat files, consists of a set of rules about

    • the format of data storage

    • the syntax of the storage

  • implementation of any data analysis or manipulation is the responsibility of the user


GENBANK Data Model

  • explicitly defined in on-line document

  • http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML

  • note: although not object oriented, the specification uses much of the terminology of object-oriented programming


BioSeqs

  • has at least one Seq-id

  • contains information about a biological sequence

    • virtual - contains a molecular type, a size, and topology (e.g. a band on a gel, an intron whose sequence has not been determined)

    • raw - simple single sequence, which has all the properties of virtual plus actual sequence data

    • segmented - contains identifiers for other bioseqs and relative positional information, thus yielding a size

    • map - contains a rough size and co-ordinates that represent some kind of map data
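The differences between the bioseq classes can be sketched as small data structures. This is an illustrative model only, not the NCBI implementation: the class and field names are invented, and the map class is omitted for brevity. The key point is where the size comes from in each case.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VirtualBioseq:
    """Molecule type, size and topology are known; the sequence is not."""
    seq_id: str
    molecule: str
    length: int
    topology: str = "linear"


@dataclass
class RawBioseq:
    """Everything a virtual bioseq has, plus the actual sequence data."""
    seq_id: str
    molecule: str
    residues: str
    topology: str = "linear"

    @property
    def length(self) -> int:
        return len(self.residues)  # size follows from the data itself


@dataclass
class SegmentedBioseq:
    """Built from pieces of other bioseqs: (seq_id, start, end), 1-based."""
    seq_id: str
    segments: List[Tuple[str, int, int]]

    @property
    def length(self) -> int:
        # size is derived from the positions of the referenced pieces
        return sum(end - start + 1 for _, start, end in self.segments)


band = VirtualBioseq("gel-band-1", "dna", 1200)  # hypothetical gel band
raw = RawBioseq("rawA", "dna", "AGCTCCGTGC")
seg = SegmentedBioseq("seg1", [("rawA", 1, 100), ("rawB", 51, 150)])
print(band.length, raw.length, seg.length)  # 1200 10 200
```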


Bioseq Sets

  • sets of bioseq entities that are related somehow

    • nuc-prot set - nucleotide type bioseq and one or more associated protein type bioseqs

    • population set - set of related bioseqs that are aligned with each other. This is a basic type for population and phylogenetic studies


Seq-Annot

  • A self-contained annotation that refers to a specific bioseq entity

  • Can have multiple seq-annots

  • These elements hold the annotation data, e.g. positions of start sites, stop sites, introns, regulatory sequences
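An annotation of this kind stores only an identifier and positions, and the sequence itself stays in the bioseq it refers to. The sketch below assumes a hypothetical `Feature` class (not an NCBI structure) and uses the first 60 bases of the M16260 record shown earlier, where the CDS begins at position 51.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Feature:
    """An annotation: points at a bioseq by id and gives a position."""
    seq_id: str
    kind: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    def extract(self, sequences: Dict[str, str]) -> str:
        # look up the referenced bioseq and slice out the annotated span
        return sequences[self.seq_id][self.start - 1:self.end]


# First 60 bases of M16260.1, taken from the ORIGIN section of the record
sequences = {
    "M16260.1": "agctccgtgcgcagcggtacccgtaccggtaccggcccgg"
                "tccctgagccatgggccggc",
}

# The record's CDS is 51..2714, so its first codon occupies 51..53
start_codon = Feature("M16260.1", "start_codon", 51, 53)
codon = start_codon.extract(sequences)
print(codon)  # atg
```

Because the annotation is separate from the sequence, several independent annotations can describe the same bioseq without duplicating it.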


Our Lab Project

  • one of the main elements of the labs in this course is designing a database, populating it, analyzing the sequences in various ways, and annotating the database

  • we will use a flat file format to store data on a family of proteins

  • therefore, we need to define a schema
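As a rough idea of what defining a schema for a flat file involves, the sketch below declares a list of allowed fields and checks a record against it. The field names, the "KEY: value" line syntax, and the functions `parse_record` and `validate` are all hypothetical choices made for this example, not the schema used in the lab.

```python
from typing import Dict, List

# Hypothetical schema: (field name, required?)
SCHEMA = [
    ("ID", True),
    ("NAME", True),
    ("ORGANISM", True),
    ("NOTE", False),
    ("SEQUENCE", True),
]


def parse_record(text: str) -> Dict[str, str]:
    """Read one record written as 'KEY: value' lines."""
    record: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


def validate(record: Dict[str, str]) -> List[str]:
    """Return a list of schema violations (empty if the record is valid)."""
    allowed = {name for name, _ in SCHEMA}
    errors = [f"missing field: {name}"
              for name, required in SCHEMA
              if required and name not in record]
    errors += [f"unknown field: {name}"
               for name in record if name not in allowed]
    return errors


good = parse_record("ID: P001\nNAME: demo\nORGANISM: Gallus gallus\nSEQUENCE: MGRRWG")
bad = parse_record("ID: P002\nNAME: demo\nORGANISM: Gallus gallus\nCOLOR: blue")
print(validate(good))  # []
print(validate(bad))   # ['missing field: SEQUENCE', 'unknown field: COLOR']
```

Writing the schema down as data, rather than burying it in parsing code, makes it easier to see what a later change to the schema would require rewriting.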

