
Computer Storage of Sequences

(Chapter 2 of

Bioinformatics: Sequence and Genome Analysis

By David W. Mount)

CSE730: Seminar on

“Information Retrieval of Biomedical Text and Data”


Outline

  • Storing DNA/Protein sequences into computer files or databases.

  • Related information placed in the database along with the sequence in a number of sequence data formats.

  • Online public-access databases for sequence retrieval.




Sequence Formats

A sequence is stored as ASCII text (e.g. a string of A, G, C, T, ...) along with annotations.

Different sequence analysis programs recognize different sequence formats.

A sequence format includes accessory information (gene names, source organism, investigator name, references) along with the actual sequence.


Sequence Formats (continued)

  • FASTA

  • GenBank Flat File format

  • PIR/CODATA format

  • EMBL sequence entry format

  • Intelligenetics sequence entry format

  • GCG (Genetics Computer Group) sequence entry format.

  • ASN.1

  • XML
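
As a minimal sketch of the simplest format in the list, FASTA: each record is a header line beginning with ">" followed by one or more lines of sequence. The function name `parse_fasta` and the sample records are hypothetical, made up for illustration; this is not the parser used by any particular analysis program.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one string per record.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, invented for this example.
SAMPLE_FASTA = """>seq1 example DNA fragment
ACGTACGTAC
GTTAGC
>seq2 example protein fragment
MKTAYIAKQR
"""

for header, sequence in parse_fasta(SAMPLE_FASTA):
    print(header, len(sequence))
```

The richer formats in the list (GenBank, EMBL, PIR/CODATA) carry the same sequence plus structured annotation fields, so a parser for them has more to do.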


Databases

  • NCBI

    GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, Maryland

  • NBRF

    Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC


Databases (continued)

  • SwissProt

    The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research.

  • EMBL

    European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England

  • DDBJ

    DNA DataBank of Japan (DDBJ) at Mishima, Japan


Databases on Internet

  • NCBI http://www.ncbi.nlm.nih.gov

  • PIR

    http://www-nbrf.georgetown.edu/pirwww

  • SwissProt

    http://www.expasy.ch/cgi-bin/sprot-search-de

  • EMBL http://www.ebi.ac.uk/embl/index.html

  • DDBJ http://www.ddbj.nig.ac.jp/


NCBI

  • National resource for molecular biology information.

  • Maintains comprehensive databases for a variety of biotechnology-related information.

  • Develops and manages access to a range of databases and software for the scientific and medical communities.


NCBI : Integrated Databases

  • Literature Databases

    • PubMed

    • PubMed Central

    • OMIM

    • PROW

    • BookShelf


NCBI : Integrated Databases (continued)

  • Nucleotide Databases

    • GenBank

    • EST Database

    • GSS Database

    • SNPs Database

    • RefSeq

    • STS Database


NCBI : Integrated Databases (continued)

  • Entrez Databases

    • PubMed

    • Protein Sequence Database

    • Nucleotide Sequence Database

    • Taxonomy

    • OMIM


GenBank

  • GenBank is the NIH genetic sequence database.

  • Annotated collection of all publicly available DNA sequences.

  • GenBank is a part of an international collaboration of sequence databases along with EMBL and DDBJ.


GenBank DNA Sequence Format

A DNA sequence entry in GenBank is divided into the following fields:

  • Locus: locus name, sequence length, division, date

  • Definition: description of entry

  • Accession: unique accession number

  • Version: version of sequence

  • Keywords: keywords for cross referencing


GenBank DNA Sequence Format (continued)

  • Source: source organism of DNA

  • Organism: description of organism

  • References: authors, title, journal, Medline identifier, etc.

  • Features: information about sequence

  • Basecount: number of bases in sequence

  • Origin: marks the start of the sequence data, which begin on the following line.

  • GenBank sample
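
The field layout above can be sketched in code. In a GenBank flat file, each top-level keyword starts in the first column and indented lines continue the field that precedes them; the record ends with "//". The function name `split_genbank_fields`, the accession XX000000, and the whole sample record are hypothetical, invented for illustration only.

```python
def split_genbank_fields(record):
    """Group the lines of one GenBank flat-file record by top-level keyword."""
    fields = {}
    current = None
    for line in record.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line and not line[0].isspace():
            # A keyword starts in the first column; its value follows it.
            # Note: a two-word keyword such as BASE COUNT would be keyed
            # by its first word only in this sketch.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields.setdefault(current, []).append(value.strip())
        elif current is not None and line.strip():
            # Indented lines (sub-keywords, features, sequence) continue
            # the current field.
            fields[current].append(line.strip())
    return fields


# Hypothetical record, invented for this example (not a real entry).
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   XX000000
VERSION     XX000000.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""

fields = split_genbank_fields(SAMPLE_RECORD)
for keyword in fields:
    print(keyword, fields[keyword])
```

Keeping each field as a list of lines preserves the sub-keywords (such as ORGANISM under SOURCE) without interpreting them.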


NCBI : Tools

Tools for Data Retrieval and submission

  • Text Term Searching

  • Sequence Similarity Searching

  • Taxonomy Searching

  • Sequence Submission


NCBI : ENTREZ

  • Entrez is a search and retrieval system that integrates information from databases at NCBI.

  • These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, and the biomedical literature (MEDLINE, through PubMed).

  • Entrez
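
The slides describe Entrez as a web search system. NCBI also exposes Entrez programmatically through its E-utilities service, which is not covered in the slides; the sketch below assumes that service and only builds a retrieval URL without sending any request. The helper name `build_efetch_url` and the record identifier XX000000 are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_efetch_url(database, record_id, rettype="fasta"):
    """Build an EFetch URL asking for one record as plain text."""
    params = {
        "db": database,      # Entrez database to query, e.g. "nucleotide"
        "id": record_id,     # identifier of the record to fetch
        "rettype": rettype,  # requested record format
        "retmode": "text",   # plain text rather than XML
    }
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


# XX000000 is a placeholder, not a real accession number.
url = build_efetch_url("nucleotide", "XX000000")
print(url)
```

Opening such a URL (for example with `urllib.request.urlopen`) would return the record text; the request is left out here so the sketch runs without network access.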


NCBI : BLAST

BLAST: Basic Local Alignment Search Tool

  • It is a set of similarity search programs designed to explore available sequence databases.

  • It uses a heuristic algorithm that can detect relationships among sequences that share only isolated regions of similarity.

    Q-BLAST: a queuing system for BLAST that lets users retrieve results at their convenience and choose how those results are formatted.
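
To give a feel for what "heuristic" means here, the toy sketch below finds short exact words shared by a query and a subject sequence, which is the kind of seed a local-alignment heuristic can start from. It is not BLAST: real BLAST also scores and extends the seeds and assesses statistical significance. The function name `shared_words` and the example sequences are hypothetical.

```python
def shared_words(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where both share an exact word."""
    # Index every word of the subject by its starting positions.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    # Look up each word of the query in that index.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Made-up sequences: they share isolated regions, not a full-length match.
hits = shared_words("GATTACA", "TTGATTTACA", word_size=4)
print(hits)
```

Only the regions around such hits need to be examined further, which is why this style of search is much faster than aligning every position against every other.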


NCBI : BLAST (continued)

Access to BLAST service

  • Web-BLAST

  • Standalone BLAST

  • Network BLAST

  • BLAST URL API


NCBI : BLAST (continued)

BLAST Programs

  • blastp: compares an amino acid query sequence against a protein sequence database

  • blastn: compares a nucleotide query sequence against a nucleotide sequence database

  • blastx: compares a nucleotide query sequence, translated in all reading frames, against a protein sequence database

  • tblastn: compares a protein query sequence against a nucleotide sequence database translated in all reading frames

    BLAST
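
The choice among the four programs above depends only on the type of the query and the type of the database, which can be written as a small lookup table. The names `BLAST_PROGRAMS` and `choose_blast_program` are hypothetical helpers for illustration, not part of any BLAST distribution.

```python
# (query type, database type) -> program, as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program name for a query/database type pair."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no program listed for {query_type!r} query "
            f"against {database_type!r} database"
        ) from None


print(choose_blast_program("nucleotide", "protein"))
```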


NBRF: PIR

Protein Information Resource

3 Major Databases:

  • PSD (Protein Sequence Database)

  • iProClass

  • PIR-NREF

    (Nonredundant REFerence protein database)


PIR: PSD

  • The PIR, in collaboration with MIPS and JIPID, produces and distributes the PIR-International Protein Sequence Database (PSD).

  • Comprehensive and expertly annotated protein sequence database.

  • The primary sources of PSD data are sequences from GenBank/EMBL/DDBJ translations, published literature, and direct submission to PIR-International.


PIR: PSD (continued)

  • The PIR-PSD data are available in XML, NBRF, and PIR/CODATA formats. The sequence file is also available in FASTA format.

  • The files can also be downloaded from the PIR FTP server:

    ftp://ftp.pir.georgetown.edu/pir_databases/psd/


CODATA format

  • CODATA format carries approximately the same information as a GenBank or EMBL sequence file, but the layout is slightly different and the field names differ.

  • Also called PIR format, used by NBRF.

    CODATA Sample


PIR: iProClass

  • The iProClass database provides comprehensive descriptions of all proteins and serves as a framework for data integration in a distributed networking environment.

  • Very user-friendly description.


PIR: NREF (Non-redundant REFerence protein database)

  • Comprehensive: contains all sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, and GenPept; updated biweekly.

  • Non-Redundant: Clustered by sequence identity and taxonomy at the species level.

  • Source Attribution: Containing protein IDs and names from associated databases (with hypertext links), in addition to protein sequence, taxonomy, and bibliography.

    The current version (July 2002) consists of more than 809,000 non-redundant PIR-PSD, SwissProt and TrEMBL proteins organized with more than 36,200 PIR superfamilies, 145,340 families, and links to over 50 molecular biology databases.
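
The non-redundancy and source-attribution ideas above can be sketched as follows. The sketch assumes that "clustered by sequence identity and taxonomy at the species level" means identical sequences from the same species are merged into one entry; that reading, the function name `merge_redundant`, and every identifier in the sample data are assumptions made for illustration, not the actual PIR-NREF procedure.

```python
def merge_redundant(entries):
    """Merge entries sharing species and sequence, keeping every source ID.

    entries: iterable of (source_db, protein_id, species, sequence) tuples.
    Returns a dict mapping (species, sequence) to a list of
    (source_db, protein_id) pairs.
    """
    merged = {}
    for source_db, protein_id, species, sequence in entries:
        key = (species, sequence)
        # Identical sequence + same species -> one non-redundant entry
        # that remembers where each copy came from.
        merged.setdefault(key, []).append((source_db, protein_id))
    return merged


# Hypothetical entries; the IDs and sequences are invented.
ENTRIES = [
    ("PIR-PSD", "ID_A1", "Homo sapiens", "MKTAYIAKQR"),
    ("Swiss-Prot", "ID_B7", "Homo sapiens", "MKTAYIAKQR"),
    ("TrEMBL", "ID_C3", "Mus musculus", "MKTAYIAKQR"),
]

nref = merge_redundant(ENTRIES)
for (species, sequence), sources in nref.items():
    print(species, sequence, sources)
```

The two human entries collapse into one record with two source IDs, while the mouse entry stays separate because the species differs.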


Swiss-Prot

  • Swiss-Prot is a protein knowledgebase established in 1986.

  • Maintained collaboratively by the Department of Medical Biochemistry of the University of Geneva (now the Swiss Institute of Bioinformatics) and the EMBL Data Library.

    Swiss-Prot Sequence Entry Example


Sequence Format Conversion

READSEQ:

Sequence Format Conversion program.

http://bimas.dcrt.nih.gov/molbio/readseq/

Can convert to/from:

  • ASN.1

  • FASTA

  • CODATA

  • GCG

  • EMBL format

  • GenBank format and many other formats
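
To illustrate what a format conversion involves, here is a minimal sketch that turns one GenBank flat-file record into FASTA. It is not READSEQ and does not reproduce its behaviour; it assumes single-line ACCESSION and DEFINITION fields. The function name `genbank_to_fasta`, the accession XX000000, and the sample record are hypothetical.

```python
def genbank_to_fasta(record, line_width=60):
    """Convert one GenBank flat-file record to FASTA text."""
    accession = ""
    definition = ""
    sequence_parts = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("ACCESSION"):
            accession = line[len("ACCESSION"):].strip()
        elif line.startswith("DEFINITION"):
            # Assumption: the definition fits on one line.
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    sequence = "".join(sequence_parts).upper()
    # FASTA keeps only a one-line header and the sequence itself.
    lines = [f">{accession} {definition}".rstrip()]
    for start in range(0, len(sequence), line_width):
        lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"


# Hypothetical record, invented for this example (not a real entry).
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   XX000000
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""

fasta = genbank_to_fasta(SAMPLE_RECORD)
print(fasta, end="")
```

Going from an annotated format to FASTA discards most of the annotation, so the reverse conversion cannot restore it; a converter can only fill in the fields that the source format actually holds.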


References

  • http://www.ncbi.nlm.nih.gov

  • http://www-nbrf.georgetown.edu/pirwww

  • http://www.expasy.ch/cgi-bin/sprot-search-de

  • http://www.ebi.ac.uk/embl/index.html

  • http://www.ddbj.nig.ac.jp/



Thank You

Presented by: Hemal Patel & Jeetal Shah

