Sequence Databases

Sequence Databases • NCBI (National Center for Biotechnology Information) • NIH (National Institutes of Health)

What does NCBI do? • Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.

Databases and Software • NCBI assumed responsibility for the GenBank DNA sequence database in October 1992. • In addition to GenBank, NCBI supports and distributes a variety of databases for the medical and scientific communities. These include the Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) of 3D protein structures, the Unique Human Gene Sequence Collection (UniGene), a Gene Map of the Human Genome, the Taxonomy Browser, and the Cancer Genome Anatomy Project (CGAP), in collaboration with the National Cancer Institute.

Databases and Software • Entrez is NCBI's search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data. • BLAST is a program for sequence similarity searching developed at NCBI, and is instrumental in identifying genes and genetic features. BLAST can execute sequence searches against the entire DNA database in less than 15 seconds.

What is BLAST • BLAST stands for Basic Local Alignment Search Tool.The emphasis of this tool is to find regions of sequence similarity, which will yield functional and evolutionary clues about the structure and function of your novel sequence. WU-BLAST 2.0 and NCBI BLAST2 are distinctly different software packages, although they have a common lineage for some portions of their code, so the two packages do their work differently and obtain different results and offer different features.

Blast Query Tutorial • The core of NCBI 's BLAST services is BLAST 2.0 otherwise known as "Gapped BLAST". This service is designed to take protein and nucleic acid sequences and compare them against a selection of NCBI databases.

1. Selecting the BLAST Program

1. Selecting the BLAST Program • blastp • Compares an amino acid query sequence against a protein sequence database. • blastn • Compares a nucleotide query sequence against a nucleotide sequence database. • blastx • Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence. • tblastn • Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. • tblastx • Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.

2.Selecting the BLAST Database

2.Selecting the BLAST Database • You can select several NCBI databases to compare your query sequences against. Note that some databases are specific to proteins or nucleotides and cannot be used in combination with certain BLAST programs (for example a blastn search against swissprot).

Proteins Database

Nucleotides Database

3. Entering your Sequence

3. Entering your Sequence • The BLAST web pages accept input sequences in three formats; FASTA sequence format, NCBI Accession numbers, or GIs.

4. Submitting your Search • 1. Make sure you have selected the correct BLAST program and BLAST database. • 2. If you have entered your FASTA sequence or an Accession or GI number, click the "Submit Query Button". • 3. BLAST will now open a new window and tell you it is working on your search. • 4. Once your results are computed they will be presented in the window.

FASTA format description • A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

An example sequence in FASTAformat • gi|532319|pir|TVFV2E|TVFV2E envelope protein ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL LAAVEAQQQMLKLTIWGVK

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

The nucleic acid codes supported are: • A --> adenosine U --> uridine • M --> A C (amino) D --> G A T • C --> cytidine R --> G A (purine) • S --> G C (strong) H --> A C T • G --> guanine Y --> T C (pyrimidine) • W --> A T (weak) V --> G C A • T --> thymidine K --> G T (keto) • B --> G T C N --> A G C T (any) - gap of indeterminate length

For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are: • A alanine P proline • B aspartate or asparagine Q glutamine • C cystine R arginine • D aspartate S serine • E glutamate T threonine • F phenylalanine U selenocysteine • G glycine V valine • H histidine W tryptophan • I isoleucine Y tyrosine • K lysine Z glutamate or glutamine • L leucine X any • M methionine * translation stop N asparagine - gap of indeterminate length

GenBank Flat File Format

National Institutes of Health • Begun as a one-room Laboratory of Hygiene in 1887, the National Institutes of Health (NIH) today is one of the world's foremost medical research centers. An agency of the Department of Health and Human Services, the NIH is the Federal focal point for health research. • NIH is the steward of medical and behavioral research for the Nation. Its mission is science in pursuit of fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to extend healthy life and reduce the burdens of illness and disability.

Databases : Nucleic Acids • ENTREZ - Protein and Nucleotide Databases (see Documentation) at NCBI • SRS - Sequence Retrieval System at EBI, UK • GenBank Database and Updates at ( NCBI). A form based interface for searching the Genbank Database and daily updates at NCBI. • BankIt -- GenBank Submissions by WWW (NCBI) • EMBL - Nucleotide Sequence Database search via SRS • Genome Sequence Data Base • dbEST -- an NCBI resource that contains sequence and mapping data on partial, "single-pass" cDNA sequences or Expressed Sequence Tags. • NDB - Nucleic Acid Database at Rutgers U. • Inherited Disease Genes Identified by Positional Cloning • The Tumor Gene Database

Database : Proteins • SwissProt search and the SwissProt Release info. SwissProt can also be accessed from its home at the ExPASy Server in Switzerland. • PIR Search and Release Version. PIR access via its home at NBRF, Georgetown. • OWL - Non-Redundant Protein Sequence Database [Updated Daily] (John Hopkins) • ENTREZ - Protein and Nucleotide Databases (see Documentation) at NCBI • SRS - Sequence Retrieval System at EBI, UK • Molecules 'R US at NIH accesses the Protein Data Bank and displays the structures as text, images or interactive molecules. 3DB Browser is another tool for searching PDB. • Merops - The Peptidase Database

Computer Centers • Helix Systems at CIT. • The Helix Systems are a group of high-performance computers available to NIH personnel for scientific purposes. Major protein, nucleotide and other scientific databases are maintained and updated, and an extensive collection of software is available to manipulate them. In addition, mathematical, graphical and internet tools are available on the helix systems. • Frederick Biomedical Supercomputing Center at NCI. • NCI's FBSC is located in Frederick, MD, and provides computational support for biomedical research to NCI, NIH and extramural biomedical researchers. Web-based applications include sequence analysis, phylogenetic analysis, gene mapping and databases. • National Center for Biotechnology Information at NLM. • NCBI, NLM maintains Genbank, Entrez, and Blast which are available to all researchers at NIH and elsewhere. In addition, selected databases such as OMIM are web-accessible at NCBI's web site.

Conclusion • http://www.ncbi.nlm.nih.gov/ • http://www.nih.gov/ • http://cmgm.stanford.edu/biochem218/03Genome%20Databases.html • http://www.sp.uconn.edu/~gogarten/temp/lecture4_16_02.htm • http://www.sp.uconn.edu/~gogarten/tools/tools.htm • http://people.musc.edu/~hazards/WebBioInformatics/

Sequence Databases