Reading for the Next Week

  • Sequence Analysis and Alignment

  • Chapter 5, Chapter 8, Chapter 11

  • Only about the 1st third of each chapter


Sequence Files

  • FASTA format has the simplest structure

    >Sequence Title [new line]

    Sequence [new line]

  • very useful when only the sequence itself needs to be handled

  • usually one of the formats supported by programs that work with sequences
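A minimal sketch of reading this format, assuming each record is a title line starting with ">" followed by one or more sequence lines. The function name `read_fasta` and the two-record input are made up for illustration.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted text."""
    title = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)  # emit the previous record
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence may span several lines
    if title is not None:
        yield title, "".join(chunks)


# Hypothetical input, not a real database entry.
example = io.StringIO(">seq1 demo\nAGCT\nCCGT\n>seq2\nGGTA\n")
records = list(read_fasta(example))
```

Because the format carries nothing but a title and the residues, a parser this small is usually enough, which is part of why so many programs accept it.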


Example of FASTA Format

>gi|212244|gb|M16260.1|CHKLCAMR…

AGCTCCGTGCGCAGCGGTACCCGTACCGGTACCGGCCCGGTCCCTGAGCCATGGGCCGGCGGTGGGGTTCCCCCGCCCTGCAGCGCTTCCCCGTGTTGGTGCTGCTGCTGCTGCTCCAGGTGTGCGGCCGGCGGTGCGACGAGGCAGCCCCCTGCCAGCCCGGCTTTGCTGCAGAGACCTTCAGCTTCAGTGTGCCCCAGGACAGCGTGGCGGCGGGCAGGGAGCTGGGACGAGTGAGCTTTGCAGCCTGCAGCGGGCGGCCGTGGGCCGTGTATGTCCCGACTGACA…


GENBANK Flat File

  • holdover from earlier versions of GENBANK, the US government-supported public database

  • DNA-centric, sequence based view of data

  • contains a number of fields with non-sequence information


LOCUS       CHKLCAMR                3545 bp    mRNA    linear   VRT 30-NOV-1995
DEFINITION  Chicken liver cell adhesion molecule L-CAM mRNA, complete cds.
ACCESSION   M16260 J04074 M22179
VERSION     M16260.1  GI:212244
KEYWORDS    cadherin; glycoprotein; liver cell adhesion molecule.
SOURCE      Gallus gallus cDNA to mRNA.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
            Phasianinae; Gallus.
REFERENCE   1  (bases 201 to 3545)
  AUTHORS   Gallin,W.J., Sorkin,B.C., Edelman,G.M. and Cunningham,B.A.
  TITLE     Sequence analysis of a cDNA clone encoding the liver cell adhesion
            molecule, L-CAM
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 84 (9), 2808-2812 (1987)
  MEDLINE   87204217
FEATURES             Location/Qualifiers
     source          1..3545
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /clone="pEC3(20,30,31)"
                     /tissue_type="liver"
                     /dev_stage="10-11 day old embryo"
     mRNA            <1..3545
                     /product="L-CAM mRNA"
     CDS             51..2714
                     /codon_start=1
                     /product="liver cell adhesion protein precursor"
                     /protein_id="AAA82573.1"
                     /db_xref="GI:212245"
                     /translation="MGRRWGSPALQRFPVLVLLLLLQV…GGEDDE"
     sig_peptide     51..128
     mat_peptide     531..2711
                     /product="liver cell adhesion protein"
BASE COUNT      757 a   1125 c   1051 g    612 t
ORIGIN      20 bp upstream of KpnI site.
        1 agctccgtgc gcagcggtac ccgtaccggt accggcccgg tccctgagcc atgggccggc
       61 ggtggggttc…
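Because the non-sequence fields follow a fixed keyword layout, they can be pulled out with very little code. The sketch below is illustrative only: it reads the LOCUS and BASE COUNT lines of the record above and measures simple `start..end` feature locations. The helper names are invented here, and more complex locations (joins, complements) are not handled.

```python
import re

# Two lines copied from the record above.
locus_line = "LOCUS       CHKLCAMR                3545 bp    mRNA    linear   VRT 30-NOV-1995"
base_count_line = "BASE COUNT      757 a   1125 c   1051 g    612 t"


def locus_length(line: str) -> int:
    """Return the sequence length (the number before 'bp') from a LOCUS line."""
    match = re.search(r"(\d+)\s+bp", line)
    if match is None:
        raise ValueError("no 'N bp' field found in LOCUS line")
    return int(match.group(1))


def base_counts(line: str) -> dict[str, int]:
    """Return a mapping such as {'a': 757, ...} from a BASE COUNT line."""
    return {base: int(n) for n, base in re.findall(r"(\d+)\s+([acgt])\b", line)}


def span_length(location: str) -> int:
    """Length of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(x.lstrip("<>")) for x in location.split(".."))
    return end - start + 1


# The per-base counts should add up to the length given on the LOCUS line.
consistent = sum(base_counts(base_count_line).values()) == locus_length(locus_line)
cds_bases = span_length("51..2714")
```

For this record the four base counts add up to the 3545 bp on the LOCUS line, and the CDS location 51..2714 covers 2664 bases, a multiple of three.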


Other Formats

  • XML - extensible markup language

    • similar to HTML, but allows user-defined tags

  • Graphic

    • extracts positions from features and creates a graphical output
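To illustrate what user-defined tags look like in practice, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The tag and attribute names (`sequence_record`, `feature`, and so on) are invented for this example and do not come from any sequence-data standard; the values are taken from the chicken L-CAM record shown earlier.

```python
import xml.etree.ElementTree as ET

# Build a record with made-up tags (assumption: any names could be chosen).
record = ET.Element("sequence_record", accession="M16260")
ET.SubElement(record, "organism").text = "Gallus gallus"
cds = ET.SubElement(record, "feature", type="CDS")
cds.set("start", "51")
cds.set("end", "2714")

# Serialize to text, as it would be stored or exchanged.
xml_text = ET.tostring(record, encoding="unicode")

# Parse it back and look values up by tag name.
parsed = ET.fromstring(xml_text)
organism = parsed.findtext("organism")
cds_start = int(parsed.find("feature").get("start"))
```

A graphical viewer could work from the same parsed positions, turning each feature's start and end into coordinates on a drawing.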


Database Types

Characteristics, Strengths and Weaknesses


What is a Database?

  • well-defined storage method for digital data

  • allows for relatively rapid retrieval of data

  • allows for complex conditional retrieval


Three Main Types Used in Bioinformatics

  • Flat File

    • text stored in a file in stereotyped format

    • a hierarchical flat file adds a “tree” organization

  • Relational

    • a set of tables, with unique identifiers, and overlapping content

  • Object Oriented

    • data stored as part of a data structure (the object) that includes methods for manipulating the data


Flat File Database

  • data is stored as an “unstructured” record

  • relationships between the data are inherent in the database schema, i.e. the description of the storage file's syntax


  • Advantages

    • low overhead, do not need to have a complex computational superstructure to organize the data and keep track of it in memory

    • retrieval is not computationally complex

    • can take advantage of generalized standards for information organization


  • Disadvantages

    • no random access, therefore the simplicity of storage imposes a cost on access and manipulation

      • partially resolved by indexing

    • change in the schema requires parsing and rewriting the whole database

    • all linkages between data entries must be explicitly defined either in the schema or by software that accesses the database
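The indexing workaround mentioned above can be sketched as follows: scan the flat file once, remember the byte offset at which each record starts, and later seek straight to a record instead of reading the file from the top. This is an illustration under simple assumptions (FASTA-style records, the identifier is the first word of the title line); the function names and the in-memory "file" are made up.

```python
import io


def build_index(handle) -> dict[str, int]:
    """Map each record identifier (first word after '>') to its byte offset."""
    index = {}
    while True:
        offset = handle.tell()  # position of the line about to be read
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            index[line[1:].split()[0].decode()] = offset
    return index


def fetch(handle, index: dict[str, int], identifier: str) -> str:
    """Seek directly to one record and read only its sequence lines."""
    handle.seek(index[identifier])
    handle.readline()  # skip the title line
    chunks = []
    for line in handle:
        if line.startswith(b">"):
            break  # reached the next record
        chunks.append(line.strip().decode())
    return "".join(chunks)


# Hypothetical flat file held in memory; a real one would be opened with open(path, "rb").
flat_file = io.BytesIO(b">id1 first\nAGCT\nCCGT\n>id2 second\nGGTA\n")
index = build_index(flat_file)
second = fetch(flat_file, index, "id2")
```

The index has to be rebuilt whenever the file changes, which is one reason indexing only partially resolves the access problem.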


Relational Databases

  • functionally consist of a set of tables, where each row in the table contains a set of properties of some entity

  • extensive formal analysis of the relational approach has yielded a set of “normalizations” that maximize the interconnections between pieces of information and minimize redundancy
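As a rough illustration of tables linked by unique identifiers, the sketch below uses Python's built-in `sqlite3` module. The table and column names are invented for this example; the values come from the chicken L-CAM record shown earlier. Storing the organism once and referring to it by `taxon_id` is the kind of redundancy reduction that normalization aims for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE organism (
        taxon_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    CREATE TABLE sequence (
        accession TEXT PRIMARY KEY,
        length_bp INTEGER NOT NULL,
        taxon_id  INTEGER NOT NULL REFERENCES organism(taxon_id)
    );
""")
conn.execute("INSERT INTO organism VALUES (9031, 'Gallus gallus')")
conn.execute("INSERT INTO sequence VALUES ('M16260', 3545, 9031)")

# Conditional retrieval across both tables with a single SQL query.
rows = conn.execute("""
    SELECT s.accession, s.length_bp, o.name
    FROM sequence AS s
    JOIN organism AS o ON o.taxon_id = s.taxon_id
    WHERE s.length_bp > 1000
""").fetchall()
```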


  • Advantages

    • readily available database management systems (DBMSs) that handle the computational overhead invisibly

    • high interconnectivity of data enhances data mining process

    • Structured Query Language (SQL) exists to make searching automated and relatively rapid, even complex searches


    • changing schema does not necessarily involve rewriting whole database; can add new tables or new columns to existing tables

    • most common commercial database type, therefore lots of support available (if you have the money)

    • wide usage means user skills are generalizable
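The point about schema changes can be shown with a short sketch, again using Python's built-in `sqlite3` module and a made-up `sequence` table. Adding a column leaves the existing rows in place; they simply have no value for the new field until one is set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (accession TEXT PRIMARY KEY, length_bp INTEGER)")
conn.execute("INSERT INTO sequence VALUES ('M16260', 3545)")

# Schema change: add a column without rewriting or reloading the existing data.
conn.execute("ALTER TABLE sequence ADD COLUMN molecule_type TEXT")
before_update = conn.execute("SELECT * FROM sequence").fetchone()

# Fill in the new field for the existing row.
conn.execute("UPDATE sequence SET molecule_type = 'mRNA' WHERE accession = 'M16260'")
after_update = conn.execute("SELECT * FROM sequence").fetchone()

# Column names as reported by SQLite (second item of each table_info row).
columns = [info[1] for info in conn.execute("PRAGMA table_info(sequence)")]
```

Compare this with a flat file, where adding a field means parsing and rewriting every record.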


  • Disadvantages

    • overhead (computational and expertise) makes cost high for small databases

    • content-based querying is only rudimentary; complex “fuzzy” queries cannot be done within SQL

    • implementations do not fully conform to the theoretical criteria, therefore problems arise in large databases and/or complex queries


5’ Break


Object-Oriented Databases

  • based on the object concept, a computational entity that consists of data and a set of methods that will perform operations on that data

  • ACeDB, the core DBMS for the C. elegans sequencing project, is object oriented
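The object concept, data bundled with the methods that operate on it, can be sketched in a few lines of Python. The class below is purely illustrative and is not how ACeDB represents sequences; the class name and methods are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class DnaSequence:
    """Sequence data bundled with the operations that act on it."""
    identifier: str
    residues: str

    def length(self) -> int:
        return len(self.residues)

    def gc_fraction(self) -> float:
        """Fraction of bases that are G or C (0.0 for an empty sequence)."""
        if not self.residues:
            return 0.0
        upper = self.residues.upper()
        return (upper.count("G") + upper.count("C")) / len(upper)

    def reverse_complement(self) -> "DnaSequence":
        """Return a new object holding the reverse-complement strand."""
        table = str.maketrans("ACGTacgt", "TGCAtgca")
        return DnaSequence(self.identifier + "_rc", self.residues.translate(table)[::-1])


seq = DnaSequence("demo", "AGCTCC")  # hypothetical six-base sequence
rc = seq.reverse_complement()
```

Code that uses `seq` never needs to know how the residues are stored; it only calls the methods, which is the sense in which less procedural programming is needed.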


  • Advantages

    • pre-existing schema is already worked out if you use ACeDB (http://www.acedb.org/)

    • a lot of procedural programming is not needed because methods for data manipulation are intrinsic to the object

    • natural database for object-oriented languages like C++ and Java


  • Disadvantages

    • not easy to tweak; the DBMS is fairly complex, really only the developer community can alter it

    • if their data model is not adequate for your project, there is no easy way to expand it

    • therefore, tends to be good for specific genomes, high throughput operations, not databases set up and maintained by small users for idiosyncratic projects


Combined Relational/Object

  • Relational database (tables) that can hold objects

  • As implemented, the DBMS simulates the object by creating a set of hidden tables

  • Larger computational overhead, less user control of database structure


Summary

  • each database type has strengths and weaknesses

  • choice of database to use depends on many cost factors (money, computational overhead, learning curve for use, pre-existing support)

  • there is no single right choice


GENBANK

  • the core GENBANK archival database uses a flat file format

  • historically, that is how it started

  • when a major revamp was undertaken in the mid-1990s, GENBANK kept the flat file format but introduced a defined hierarchical data model using ASN.1


ASN.1

  • the underlying structure of the GENBANK database uses Abstract Syntax Notation One (ASN.1) as the syntax definition

  • this is a standard, general syntax definition for holding information in a machine-parseable form

  • hierarchical structure helps organize data


Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Chicken liver cell adhesion molecule L-CAM mRNA, and translated products" ,
    update-date
      std {
        year 1995 ,
        month 11 ,
        day 30 } ,
    source {
      org {
        taxname "Gallus gallus" ,
        common "chicken" ,
        db {
          {
            db "taxon" ,
            tag


GENBANK Data Model

  • to implement a database, you must have a data model

  • for flat files, the data model consists of a set of rules about

    • the format of data storage

    • the syntax of the storage

  • implementation of any data analysis or manipulation is the responsibility of the user


  • explicitly defined in an online document

  • http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML

  • note: although not object oriented, the specification uses much of the terminology of object-oriented programming


BioSeqs

  • has at least one Seq-id

  • contains information about a biological sequence

    • virtual - contains a molecular type, a size, and topology (e.g. a band on a gel, an intron whose sequence has not been determined)


    • raw - simple single sequence, which has all the properties of virtual plus actual sequence data

    • segmented - contains identifiers for other bioseqs and relative positional information, thus yielding a size

    • map - contains a rough size and co-ordinates that represent some kind of map data
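A loose sketch of how the virtual, raw, and segmented kinds relate, written as Python dataclasses. This is not the NCBI data model itself: the class and field names are invented, and the segmented size is simply the sum of its parts, which assumes the parts are contiguous and do not overlap.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualBioseq:
    """Molecule type, size and topology, but no residues."""
    seq_id: str
    mol_type: str
    length: int
    topology: str = "linear"


@dataclass
class RawBioseq:
    """Everything a virtual bioseq has, plus the actual sequence data."""
    seq_id: str
    mol_type: str
    residues: str
    topology: str = "linear"

    @property
    def length(self) -> int:
        return len(self.residues)


@dataclass
class SegmentedBioseq:
    """Ordered references to other bioseqs; the size follows from the parts."""
    seq_id: str
    parts: list = field(default_factory=list)

    @property
    def length(self) -> int:
        # Assumption: parts are contiguous and non-overlapping.
        return sum(part.length for part in self.parts)


# Hypothetical gene: two sequenced exons around an intron of known size only.
exon1 = RawBioseq("ex1", "DNA", "ATGGCC")
intron = VirtualBioseq("in1", "DNA", 250)
exon2 = RawBioseq("ex2", "DNA", "GGTTAA")
gene = SegmentedBioseq("gene1", [exon1, intron, exon2])
```

Here the segmented entry has a size of 262 even though only 12 residues are actually stored, which mirrors the idea that a segmented bioseq yields a size from positional information alone.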


Bioseq Sets

  • sets of bioseq entities that are related somehow

    • nuc-prot set - nucleotide type bioseq and one or more associated protein type bioseqs

    • population set - set of related bioseqs that are aligned with each other. This is a basic type for population and phylogenetic studies


Seq-Annot

  • a self-contained annotation that refers to a specific bioseq entity

  • a bioseq can have multiple seq-annots

  • these elements hold the annotation data, e.g. positions of start sites, stop sites, introns, regulatory sequences


Our Lab Project

  • one of the main elements of the labs in this course is designing a database, populating it, analyzing the sequences in various ways, and annotating the database

  • we will use a flat file format to store data on a family of proteins

  • therefore, we need to define a schema
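As a starting point for thinking about such a schema, here is one possible sketch: an ordered list of field tags, a writer that enforces the required ones, and a parser that reads the records back. Everything here is hypothetical (the tags, the `//` record terminator, the sample values) and is not the schema used in the labs.

```python
# Hypothetical schema: ordered (tag, required) pairs.
SCHEMA = [("ID", True), ("NAME", True), ("ORGANISM", True), ("NOTE", False), ("SEQ", True)]
RECORD_END = "//"


def format_record(fields: dict) -> str:
    """Serialize one record as 'TAG value' lines, terminated by '//'."""
    lines = []
    for tag, required in SCHEMA:
        if tag in fields:
            lines.append(f"{tag:<10}{fields[tag]}")  # tag padded to 10 columns
        elif required:
            raise ValueError(f"missing required field {tag}")
    lines.append(RECORD_END)
    return "\n".join(lines) + "\n"


def parse_records(text: str) -> list:
    """Parse text written by format_record back into dictionaries."""
    allowed = {tag for tag, _ in SCHEMA}
    records, current = [], {}
    for line in text.splitlines():
        if line.strip() == RECORD_END:
            records.append(current)
            current = {}
        elif line.strip():
            tag, _, value = line.partition(" ")
            if tag not in allowed:
                raise ValueError(f"unknown tag {tag!r}")
            current[tag] = value.strip()
    return records


# Placeholder values for illustration.
entry = {"ID": "P001", "NAME": "demo protein", "ORGANISM": "Gallus gallus", "SEQ": "MGRRWGSPAL"}
text = format_record(entry)
```

Writing the rules down as data (the `SCHEMA` list) rather than scattering them through the code makes it easier to see what a schema change would involve: every stored record has to be re-read and re-written under the new rules.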

