Information organization


Information organization. Oct 2, 2012


PowerPoint Slideshow about 'Information organization' - elsie


Information organization
  • Oct 2, 2012
  • Learning objectives:
    • Demonstrate the Dotter program.
    • Understand how information is stored in GenBank.
    • Learn how to read a GenBank flat file.
    • Learn how to search GenBank for information.
    • Understand the difference between header, features, and sequence.
    • Distinguish between a primary database and a secondary database.
  • Homework #2 due today.
  • Homework #3 due Tues. Oct. 9
What is GenBank?
  • Gene sequence database
  • Annotated records, each representing a single contiguous stretch of DNA or RNA; a record may have more than one coding region.
  • Generated from direct submissions by the authors to the DNA sequence databases.
  • Part of the International Nucleotide Sequence Database Collaboration.
History of GenBank
  • Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965)
  • It began sharing data with EMBL in 1986 and with DDBJ in 1987.
  • Primary database
  • Examples of secondary databases derived from GenBank: UniProt, EST database.
  • The GenBank Flat File (GBFF) is a human-readable form of a GenBank record.
[Figure: gene structure and expression. Labels on the DNA: promoter, transcription initiation site, start of gene, protein coding sequence (CDS), end of gene, transcription termination site; coding strand and template strand with their 5’ and 3’ ends; upstream and downstream are defined relative to the CDS. Transcription yields RNA with a 5’ untranslated region (5’UTR) and a 3’ untranslated region (3’UTR); translation yields protein; protein folding yields the folded protein.]

Transcript splicing

[Figure: DNA with segments 1 to 4 separated by introns 1 to 3. Transcription yields the primary transcript, splicing yields mRNA, and translation yields protein.]
Alternative splicing

[Figure: a primary transcript with segments 1 to 4.]

General Comments on GBFF
  • Three sections:
    • 1) Header: information about the whole record.
    • 2) Features: descriptions of the annotations, each represented by a key.
    • 3) Nucleotide sequence: the record ends with // on its last line.
  • DNA-centered
  • Translated sequence is a feature
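
The three-section layout can be made concrete with a short sketch. The Python below is only an illustration, not an official parser: the function name `split_record` and the abbreviated sample text are assumptions made for this example. It assumes that a line starting with FEATURES opens the feature table, that BASE COUNT or ORIGIN opens the sequence section, and that // closes the record.

```python
def split_record(text):
    """Split one GenBank flat-file record into its three sections.

    Assumes a single record whose feature table starts at a line beginning
    with FEATURES and whose sequence section starts at BASE COUNT or ORIGIN.
    """
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        sections[current].append(line)
    return sections


# Abbreviated sample based on the U49845 record used in these slides.
SAMPLE = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
//
"""

parts = split_record(SAMPLE)
for name, lines in parts.items():
    print(name, len(lines))
```

For real work a maintained parser is a better choice; the point here is only that the section boundaries are plain keywords at the start of a line.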
Feature Keys
  • Purpose:
    • 1) Indicates biological nature of sequence
    • 2) Supplies information about changes to sequences
  • Feature key: description
    • conflict: separate determinations of the same sequence differ
    • rep_origin: origin of replication
    • protein_bind: protein binding site on DNA
    • CDS: (protein) coding sequence

Feature Keys-Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"

The feature CDS is a coding sequence beginning at base 23 and ending at base 400 that has a product called “alcohol dehydrogenase” and corresponds to the gene called “adhI”.
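
As a small worked example, a simple location such as 23..400 can be turned into numbers. This is a sketch under the assumption that the location has the plain `start..end` form; `parse_span` is a hypothetical helper name, not part of any GenBank tool.

```python
def parse_span(location):
    """Parse a simple 'start..end' location into (start, end, length)."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    # Positions are 1-based and inclusive, so both endpoints count.
    return start, end, end - start + 1


start, end, length = parse_span("23..400")
print(start, end, length)   # 23 400 378
print(length // 3)          # 126 codons' worth of bases
```

Because both endpoints are included, the feature spans 378 bases, which is a whole number of codons.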

Feature Keys-Terminology (Cont.)

Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial

The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
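
The join operator can be illustrated the same way. The sketch below assumes a location made only of plain `start..end` segments inside `join(...)`; the helper names `parse_join` and `joined_length` are hypothetical.

```python
import re


def parse_join(location):
    """Return the (start, end) segments of a 'join(a..b,c..d)' location."""
    match = re.fullmatch(r"join\s*\((.+)\)", location.strip())
    if match is None:
        raise ValueError(f"not a join location: {location!r}")
    segments = []
    for part in match.group(1).split(","):
        start_text, end_text = part.strip().split("..")
        segments.append((int(start_text), int(end_text)))
    return segments


def joined_length(segments):
    """Total number of bases once the segments are placed end to end."""
    return sum(end - start + 1 for start, end in segments)


segments = parse_join("join(544..589,688..1032)")
print(segments)                 # [(544, 589), (688, 1032)]
print(joined_length(segments))  # 391
```

The two segments contribute 46 and 345 bases, and the bases between 589 and 688 are left out of the joined sequence.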

Record from GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
            Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

  • LOCUS: locus name (SCU49845), GenBank division (PLN: plant, fungal and algal), and modification date (21-JUN-1999).
  • DEFINITION: "cds" stands for coding sequence.
  • ACCESSION: accession number (never changes).
  • VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in the sequence).
  • GI: GeneInfo identifier (changes whenever the sequence changes).
  • KEYWORDS: word or phrase describing the sequence (not based on a controlled vocabulary). Not used in newer records.
  • SOURCE: common name for the organism.
  • ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.
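
To show how the header fields break down, here is a minimal sketch that splits the LOCUS and VERSION lines of this record. It assumes the lines carry exactly the fields seen on this slide; the function names `parse_locus` and `parse_version` are hypothetical, and records with additional fields would need a fuller parser.

```python
def parse_locus(line):
    """Split a LOCUS line laid out like the one on this slide.

    Assumes exactly these fields: name, length, unit, molecule type,
    division, and modification date.
    """
    _, name, length, unit, molecule, division, date = line.split()
    return {
        "name": name,
        "length": int(length),
        "unit": unit,
        "molecule": molecule,
        "division": division,
        "date": date,
    }


def parse_version(line):
    """Split a VERSION line into accession, version number, and GI."""
    _, accession_version, gi_field = line.split()
    accession, _, version = accession_version.partition(".")
    return {
        "accession": accession,
        "version": int(version),
        "gi": gi_field.split(":")[1],
    }


locus = parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
version = parse_version("VERSION U49845.1 GI:1293613")
print(locus["division"], locus["length"])        # PLN 5028
print(version["accession"], version["version"])  # U49845 1
```

The accession part stays fixed, while the number after the dot is the piece that is incremented when the sequence changes.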

Record from GenBank (cont.1)

REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA

  • REFERENCE: the oldest reference is listed first.
  • MEDLINE: Medline UID.
  • Direct Submission: the submitter of the sequence (always the last reference).

Record from GenBank (cont.2)

Each feature has three parts: a key (a keyword that indicates the functional group), a location (an instruction for finding the feature), and qualifiers (auxiliary information about the feature).

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

  • Keys: source and CDS. Locations: 1..5028 and <1..206. Qualifiers: the lines beginning with /; the text after = is the value.
  • <1..206: partial sequence on the 5’ end. The 3’ end is complete.
  • /codon_start=3: start of the open reading frame.
  • /product: descriptive free text must be in quotation marks.
  • /db_xref: database cross-references.
  • /protein_id: protein sequence ID number.
  • /translation: note that this is only a partial sequence.
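
The key, location, and qualifier structure maps naturally onto a small data structure. The following is a rough sketch, assuming each qualifier fits on one line (a wrapped value such as a long /translation is not handled) and that a leading < marks a 5’ partial feature on the listed strand; `parse_feature` is a hypothetical helper.

```python
def parse_feature(lines):
    """Parse one feature: the first line holds key and location,
    the remaining lines hold one qualifier each."""
    key, location = lines[0].split()
    qualifiers = {}
    for line in lines[1:]:
        name, _, value = line.strip().lstrip("/").partition("=")
        qualifiers[name] = value.strip('"')   # drop the surrounding quotes
    return {
        "key": key,
        "location": location,
        "partial_5prime": location.startswith("<"),
        "qualifiers": qualifiers,
    }


feature = parse_feature([
    "CDS             <1..206",
    "/codon_start=3",
    '/product="TCP1-beta"',
    '/protein_id="AAA98665.1"',
    '/db_xref="GI:1293614"',
])
print(feature["key"], feature["location"], feature["partial_5prime"])
print(feature["qualifiers"]["product"])
```

Note that unquoted values such as the codon_start number come back as strings here; converting them is left to the caller.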

Record from GenBank (cont.3)

     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."

  • 687..3158 and complement(3300..4037): further examples of locations.
  • ". . .": the translation is cut off on the slide.
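
A location wrapped in complement(...) refers to the strand opposite the one printed in the record, so the feature's sequence is the reverse complement of the listed bases. The sketch below illustrates this on the first ten bases of the U49845 sequence only, since the full 5028 bases are not reproduced in these slides; the helper names are hypothetical and the code assumes lowercase a, c, g, t only.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(seq, start, end, complement=False):
    """Return bases start..end (1-based, inclusive);
    reverse-complement the piece if the location is a complement."""
    piece = seq[start - 1:end]
    return reverse_complement(piece) if complement else piece


first_ten = "gatcctccat"   # first ten bases of the record's sequence
print(extract(first_ten, 1, 10))                   # gatcctccat
print(extract(first_ten, 1, 10, complement=True))  # atggaggatc
print(4037 - 3300 + 1)                             # 738 bases in complement(3300..4037)
```

With the full sequence in hand, `extract(seq, 3300, 4037, complement=True)` would give the REV7 coding sequence in the orientation in which it is read.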

Record from GenBank (cont.4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
. . .
//
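
The first 120 bases shown here are enough to check the /codon_start=3 qualifier of the TCP1-beta CDS (<1..206). The sketch below is an illustration, assuming the standard genetic code and lowercase a, c, g, t input; the names `CODON_TABLE`, `translate`, and `FIRST_120` are choices made for this example. Reading starts at base 3 and proceeds in whole codons.

```python
# Standard genetic code, built from the conventional t, c, a, g ordering.
BASES = "tcag"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, codon_start=1):
    """Translate seq starting at the 1-based position codon_start,
    ignoring any incomplete codon at the end."""
    coding = seq[codon_start - 1:]
    usable = len(coding) - len(coding) % 3
    return "".join(CODON_TABLE[coding[i:i + 3]] for i in range(0, usable, 3))


# The two sequence lines shown on the slide, with the spacing removed.
FIRST_120 = (
    "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg "
    "ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct"
).replace(" ", "")

protein_start = translate(FIRST_120, codon_start=3)
print(protein_start)
# SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVS
```

The result matches the beginning of the /translation qualifier for TCP1-beta, which is what codon_start=3 predicts; starting at base 1 or 2 instead would read the wrong frame.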

Protein databases derived from GenBank containing data for a single gene
  • Non-redundant (nr)
  • UniProtKB

DNA databases derived from GenBank containing data for a single gene
  • Non-redundant (nr)
  • dbGSS
  • dbSTS

RNA (cDNA) databases derived from GenBank containing data for a single gene
  • dbEST
  • UniGene
  • RefSeq
Types of primary databases carrying biological information

GenBank/EMBL/DDBJ

dbEST: expressed sequence tags; single-pass cDNA sequences (high error frequency)

It is non-redundant

PDB: three-dimensional structure coordinates of biological molecules

PROSITE: database of protein domain/function relationships.
Summary
  • GenBank: the longest-running molecular biology database.
  • Every GenBank record has three sections.
  • Primary databases and secondary databases.
  • RefSeq: contains a unique record for each RNA variant.
  • UniProtKB: protein-centered.
Workshop
  • Do problem 1 in Chapter 2.
Homework
  • Do problems 2 and 3 in Chapter 2.