
DNA Databanks

DNA Databanks. Speaker: Yu-Chung Chang 張猷忠, Institute of Biochemistry, National Yang-Ming University.


Presentation Transcript


  1. DNA Databanks • Speaker: Yu-Chung Chang 張猷忠, Institute of Biochemistry, National Yang-Ming University

  2. Biological Databases • DNA databanks: GenBank, DDBJ, EMBL,… • Protein databases: PIR, Swiss-Prot, PRF, GenPept, TrEMBL, PDB,… • EST databases: dbEST, DOTS, UniGene, GIs, STACK,… • Structure databases: MMDB, PDB, Swiss-3DIMAGE,… • Pathway databases: KEGG, BRITE, TRANSPATH,… • Integrated databases: SRS • Motif or cis-element databases: Prosite, Pfam, BLOCKS, TransFac, PRINTS, URLs,… • Gene, protein & disease databases: GeneCards, OMIM, OMIA,… • Taxonomy databases • Literature databases: PubMed, Medline,… • Patent databases: Apipa, CA-STN, IPN, USPTO, EPO, Beilstein,… • Others: RNA databases,…

  3. DNA Databanks • cDNA resources: GenBank (NCBI), Nucleotide Sequence Database (EMBL), DDBJ, MGC,… • Genomic DNA resources: HTG, dbGSS, GOLD, ERGO,… • EST resources: dbEST, UniGene, GIs, STACK, DOTS,… • Others: dbSTS, UniSTS, dbSNP, TransFac, ISIS, Repbase,…

  4. GenBank at the National Center for Biotechnology Information (NCBI) • GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. • As of February 2001 it holds approximately 11,720,000,000 bases in 10,897,000 sequence records. • GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis. • http://www.ncbi.nlm.nih.gov/Entrez/
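As a quick check on the figures quoted on this slide, the average record length they imply can be computed directly. The variable names below are illustrative only.

```python
# Figures quoted on the slide (GenBank, February 2001).
total_bases = 11_720_000_000
total_records = 10_897_000

# Average number of bases per sequence record.
average_length = total_bases / total_records
print(f"about {average_length:.0f} bases per record")
```

The result is a little over a thousand bases per record, which fits a collection dominated by short entries.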

  5. NCBI-SITEMAP

  6. European Molecular Biology Laboratory (EMBL) • The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide sequence resource. The main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. • http://www.ebi.ac.uk/Databases/index.html

  7. EBI-databases & tools

  8. DNA Data Bank of Japan (DDBJ) http://www.ddbj.nig.ac.jp/ • Database Search: Getentry, SFgate & WAIS, SRS, Homology Search, TXSearch, SQmatch • Data Analysis: malign, ClustalW • Genome Analysis: GTOP • Protein Structure: PDB Retriever, SSThread, LIBRA I

  9. Genome Projects • Whole genome sequences • EST projects • MGC projects • SNP projects • GSS projects • STS projects

  10. Graphs created on 12 Dec 2000

  11. Graphs created on 12 Dec 2000

  12. GenBank Sequence Submission Policy • At this time the following types of submissions are NOT acceptable: • sequences of less than 50 bp in length. • computer generated or otherwise predicted sequences (e.g. EST assembled sequences). • third party sequences downloaded from a sequence database or journal. • one genomic sequence with multiple exons joined together without the sequence of the intervening introns. • primer only sequences.

  13. GenBank Sequence Submission Policy (cont.) • At this time the following types of submissions are NOT acceptable: • protein only sequences. • non-biologically contiguous sequences containing internal unsequenced spacers. • sequences containing a mix of genomic and mRNA sequence represented as a single sequence. • Submission routes: • EST submissions should be submitted through the dbEST system. • as of 1 January 2000, Genome Survey Sequences (GSSs) should not be submitted through BankIt; use the dbGSS system.
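Some of the rules on the two policy slides can be checked mechanically before submitting. The following is a minimal sketch, not GenBank's own validation: the function name `precheck_nucleotide`, the use of the IUPAC nucleotide alphabet to spot protein-only input, and the choice of a run of ten or more `N` characters as an "unsequenced spacer" are all assumptions made for illustration.

```python
import re

MIN_LENGTH_BP = 50  # the slide rules out sequences shorter than 50 bp


def precheck_nucleotide(seq: str, spacer_run: int = 10) -> list[str]:
    """Return a list of reasons the sequence would likely be rejected."""
    problems = []
    cleaned = re.sub(r"\s+", "", seq).upper()
    if len(cleaned) < MIN_LENGTH_BP:
        problems.append(f"shorter than {MIN_LENGTH_BP} bp")
    # Characters outside the IUPAC nucleotide alphabet suggest a protein sequence.
    if re.search(r"[^ACGTURYSWKMBDHVN]", cleaned):
        problems.append("contains non-nucleotide characters (protein-only?)")
    # A long internal run of N is treated here as an unsequenced spacer (assumption).
    if re.search(rf"(?<=[^N])N{{{spacer_run},}}(?=[^N])", cleaned):
        problems.append("contains an internal unsequenced spacer")
    return problems
```

An empty list means none of these three checks fired. The remaining rules (third party data, predicted sequences, joined exons) depend on where the sequence came from and cannot be tested from the string alone.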

  14. Data Submission • WWW: BankIt, WebIn, Sakura • Sequin • e-mail: Sequin • Diskette: Sequin

  15. DNA Databases at NCBI • Nucleotides • dbEST • UniGene • dbGSS • dbSTS • UniSTS • RefSeq • MGC • dbSNP • HTGs • UniVec

  16. dbEST http://www.ncbi.nlm.nih.gov/dbEST/index.html • dbEST is a database of expressed sequence tags: short, single pass read cDNA (mRNA) sequences. • It also includes cDNA sequences from differential display experiments and RACE experiments.

  17. dbGSS http://www.ncbi.nlm.nih.gov/dbGSS/index.html • Database of genome survey sequences. • Short, single pass read genomic sequences. • Exon trapped sequences. • Cosmid/BAC/YAC ends. • Alu PCR sequences. • GSS sequences are available from two sources: dbGSS and the GSS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.

  18. dbSTS http://www.ncbi.nlm.nih.gov/dbSTS/index.html • Database of sequence tagged sites. • Short sequences that are operationally unique in the genome, used to generate mapping reagents. • STS sequences are available from two sources: dbSTS and the STS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.

  19. HTGs http://www.ncbi.nlm.nih.gov/HTGS/ • High throughput genome sequences from large scale genome sequencing centers. • Unfinished (phase 0, 1, 2) and finished (phase 3) sequences. • Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month.

  20. dbSNP http://www.ncbi.nlm.nih.gov/SNP/ • Database of single nucleotide polymorphisms. • Small-scale insertions/deletions. • Polymorphic repetitive elements. • Microsatellite variation.

  21. New HTC (High Throughput cDNA) division • At the May 2000 collaborative meeting DDBJ/EMBL/GenBank agreed to create a new database division HTC to represent unfinished High Throughput cDNA sequences. HTC sequences may include 5'UTR and 3'UTR regions and (part of a) coding region. Upon finishing of these sequences, they will be moved to the corresponding taxonomic division. HTC sequence entries will include the keyword 'HTC'. The keyword will be removed once the entry has been included in the taxonomic division.

  22. Mammalian Gene Collection (MGC) http://www.ncbi.nlm.nih.gov/MGC/ • The Mammalian Gene Collection (MGC) project is a new effort by the NIH to generate full-length complementary DNA (cDNA) resources.

  23. Entrez -A search & retrieval system

  24. Entrez Searching • Subject searching • Phrase searching • Searching for authors • Searching for unique identifiers • Searching by molecular weight • Range searching • Truncation searching (wildcard searching) • Combining sets

  25. Entrez -Subject searching • Text searching, e.g. hiv-1 • Subject terms are automatically combined with AND: hiv-1 protease is searched as hiv-1 AND protease

  26. Entrez -Phrase searching • "hiv-1 protease" • Using quotes forces Entrez to check a phrase list against which the search terms are matched. • It is not adjacency searching. • If the search phrase is not in the phrase list, Entrez treats it as a subject search.
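The difference between subject and phrase searching described on these two slides can be sketched as a small decision function. This is an illustration of the stated rules, not Entrez's implementation: `PHRASE_LIST` is a hypothetical stand-in for Entrez's internal phrase list, and `interpret_query` is an invented name.

```python
# Hypothetical stand-in for Entrez's internal phrase list.
PHRASE_LIST = {"hiv-1 protease", "reverse transcriptase"}


def interpret_query(query: str) -> str:
    """Describe how a query would be interpreted under the rules on the slides."""
    text = query.strip()
    quoted = len(text) >= 2 and text[0] == '"' and text[-1] == '"'
    if quoted:
        phrase = text[1:-1].strip().lower()
        if phrase in PHRASE_LIST:
            return f'phrase: "{phrase}"'
        text = phrase  # not in the phrase list: fall back to a subject search
    # Subject search: terms are combined with AND automatically.
    return "subject: " + " AND ".join(text.lower().split())
```

For example, `interpret_query('hiv-1 protease')` yields a subject search on both terms, while the quoted form is treated as a phrase only because it appears in the (assumed) phrase list.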

  27. Entrez -Searching for authors • Chang YC: searches only the author field • Chang: searches all fields (subject searching) • Do not use punctuation.
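Author names usually arrive with punctuation, so a small helper can put them into the "Surname Initials" form the slide uses. This is a sketch under the assumption that initials are taken from each part of the given name; `author_term` is an invented helper, not part of Entrez.

```python
import re


def author_term(surname: str, given_names: str = "") -> str:
    """Format an author name as 'Surname Initials' with no punctuation."""
    # Take the first letter of each given-name part, e.g. "Yu-Chung" -> "YC".
    parts = re.split(r"[\s.\-]+", given_names)
    initials = "".join(part[0].upper() for part in parts if part)
    return f"{surname.strip()} {initials}".strip()
```

Called as `author_term("Chang", "Yu-Chung")`, it produces the form shown on the slide, `Chang YC`.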

  28. Entrez -Searching for unique identifiers • Accession numbers • GenBank/EMBL/DDBJ: U12345, AF123456 • GenPept: AAA12345 • SwissProt & PIR: P12345 • RefSeq: NM_123456, NT_123456, NP_123456, NC_123456, XM_123456, XP_123456 • Sequence identification numbers • GI numbers: 6995995 • Version numbers: AF123456.3
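The identifier formats listed on this slide are regular enough to recognise with simple patterns. The sketch below infers its patterns only from the examples shown, so treat them as heuristics: real accession formats are more varied, and treating a leading `P` as SwissProt/PIR is an assumption taken from the single example on the slide.

```python
import re

# Heuristic patterns inferred from the slide's examples, checked in order.
PATTERNS = [
    ("RefSeq", re.compile(r"^(NM|NT|NP|NC|XM|XP)_\d{6}$")),
    ("Version number", re.compile(r"^[A-Z]{1,2}\d{5,6}\.\d+$")),
    ("GI number", re.compile(r"^\d+$")),
    ("SwissProt/PIR", re.compile(r"^P\d{5}$")),  # assumption: leading P only
    ("GenPept", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("GenBank/EMBL/DDBJ", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})$")),
]


def classify_identifier(identifier: str) -> str:
    """Return the first identifier type whose pattern matches, else 'unknown'."""
    token = identifier.strip().upper()
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "unknown"
```

The order matters: `P12345` would also match the one-letter GenBank pattern, so the SwissProt/PIR check runs first.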

  29. Entrez -Searching by molecular weight • 010600[Molecular Weight] • 012345[MOLWT] • 010000:050000[MOLWT] • 002000:010000[MOLWT] AND human[Organism] • [field name]  feature table

  30. Entrez -Range searching • Accession numbers [ACCN], sequence length [SLEN], and molecular weight [MOLWT] • AF114696:AF114714[ACCN] • Not for GI and Version numbers • 3000:4000[SLEN] • 002002:002100[MOLWT]
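The molecular weight and range examples follow a fixed textual pattern, so they are easy to generate. The helpers below are a sketch: the six-digit zero padding is inferred from the slide's examples (`010600`, `002000:010000`) rather than from documentation, and both function names are invented.

```python
from typing import Optional


def molwt_term(low: int, high: Optional[int] = None) -> str:
    """Build a [MOLWT] term, zero-padding each value to six digits."""
    if high is None:
        return f"{low:06d}[MOLWT]"
    return f"{low:06d}:{high:06d}[MOLWT]"


def range_term(low, high, field: str) -> str:
    """Build a generic low:high[FIELD] range term, e.g. for ACCN or SLEN."""
    return f"{low}:{high}[{field}]"


# Reproduces the combined example from the molecular weight slide.
query = f"{molwt_term(2000, 10000)} AND human[Organism]"
```

Here `query` is `002000:010000[MOLWT] AND human[Organism]`, and `range_term(3000, 4000, "SLEN")` gives `3000:4000[SLEN]`.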

  31. Entrez -Truncation searching • Wildcard searching • Root word plus * • bacte*, retroviru* • Only the first 150 variations of a truncated term are retrieved • Left-hand truncation is not possible, e.g. *ology
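The truncation rules can be illustrated with a local simulation against a list of index terms. This is only a sketch: `expand_truncated` is an invented function, the term index is supplied by the caller, and taking the "first 150" in alphabetical order is an assumption, since the slide does not say how variations are ordered.

```python
MAX_VARIATIONS = 150  # the slide states only the first 150 variations are used


def expand_truncated(term: str, index_terms: list[str]) -> list[str]:
    """Expand 'root*' against a term index, keeping at most 150 matches."""
    if term.startswith("*"):
        raise ValueError("left-hand truncation is not supported")
    if not term.endswith("*"):
        return [term]
    root = term[:-1].lower()
    # Assumption: variations are taken in alphabetical order.
    matches = sorted(t for t in index_terms if t.lower().startswith(root))
    return matches[:MAX_VARIATIONS]
```

With a short root such as `t*`, the cap means some matching terms are silently dropped, which is why longer roots give more complete results.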

  32. Entrez -Combining sets • Use your search History to combine documents • #1 AND #4

  33. Entrez -Boolean operators • AND, OR, NOT • bacteria AND virus NOT phage • (bacteria AND virus) NOT phage • hiv-1 OR bacterial protease • hiv OR (bacterial AND protease)
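The effect of the three operators can be shown with Python sets of record IDs. The sketch assumes operators are applied strictly left to right when no parentheses are given, which is how the first pair of examples on the slide reads; the `index` contents are invented.

```python
def evaluate(tokens: list[str], index: dict[str, set[int]]) -> set[int]:
    """Apply AND, OR, NOT strictly left to right over sets of record IDs."""
    result = set(index.get(tokens[0], set()))
    for op, term in zip(tokens[1::2], tokens[2::2]):
        hits = index.get(term, set())
        if op == "AND":
            result &= hits
        elif op == "OR":
            result |= hits
        elif op == "NOT":
            result -= hits
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


# Invented record IDs for illustration.
index = {"bacteria": {1, 2, 3}, "virus": {2, 3, 4}, "phage": {3}}
print(evaluate("bacteria AND virus NOT phage".split(), index))  # {2}
```

Records 2 and 3 mention both bacteria and virus, and NOT phage then removes record 3, matching the parenthesised form (bacteria AND virus) NOT phage.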

  34. Entrez -Boolean operators

  35. Entrez -Using limits

  36. Entrez -Limit a search to a particular database field • You are only interested in nucleotide sequences from the mouse. • Select the Nucleotide database from the black menu bar or the Search pull-down menu. • Select Limits. • In the "Limited To:" section, select Organism from the Search Field pull-down menu. • Type "mouse" without quotes in the query box and select Go.

  37. Entrez -Limit a search to a particular database field • You are only interested in protein sequences that are less than 50 amino acids in length. • Select the Protein database from the black menu bar or the Search pull-down menu. • Select Limits. • In the "Limited To:" section, select Sequence Length from the Search Field pull-down menu. • Type "0:50" without quotes in the query box and select Go.

  38. Entrez -Exclude certain kinds of sequences • You are interested in mitochondrial carriers but you do not want the EST sequences. • Select the Nucleotide database from the black menu bar or the Search pull-down menu. • Type "mitochondrial carrier" without quotes in the query box. • Select Limits. • In the "Limited To:" section, check the box next to "Exclude ESTs" and select Go.

  39. Entrez -Limit the search to a particular molecule type • You are only interested in Cryptosporidium ribosomal RNA sequences. • Select the Nucleotide database from the black menu bar or the Search pull-down menu. • Type "cryptosporidium" without quotes in the query box. • Select Limits. • In the "Limited To:" section, select the "Molecule" pull-down menu, choose rRNA and select Go.

  40. Entrez -Limit the search to a particular gene location • You are interested in the genes in the chloroplast of flowering plants. • Select the Nucleotide database from the black menu bar or the Search pull-down menu. • Type "flowering plants" without quotes in the query box. • Select Limits. • In the "Limited To:" section, select the "Gene Location" pull down menu and choose chloroplast and select Go.

  41. Entrez -Limit the search to records from a particular sequence database • You are interested only in cysteine phosphatase protein sequences submitted directly to PIR. • Select the Protein database from the black menu bar or the Search pull-down menu. • Type "cysteine phosphatase" without quotes in the query box. • Select Limits. • In the "Limited To:" section, select the "Only From" pull-down menu, choose PIR and select Go.

  42. Entrez -Limit the search by date • You want to see any nucleotide sequences from pigs added to the database (or updated) in the last 30 days. • Select the Nucleotide database from the black menu bar or the Search pull-down menu. • Type "pigs" without quotes in the query box. • Select Limits. • In the "Limited To:" section, select Organism from the Search Field pull-down menu. • And in the "Limited To:" section, select the "Modification Date" pull down menu and choose 30 days and select Go.

  43. Entrez -Limit the search by date • You want to retrieve all mouse or human nucleotide sequences added to the database (or updated) during 1997. • Select the Nucleotide database from the black menu bar or the Search pull-down menu. • Type "mouse OR human" without quotes in the query box. • Select Limits. • In the "Limited To:" section, select Organism from the Search Field pull-down menu. • And in the "Limited To:" section, select the "Modification Date" pull down menu and choose Modification Date. In the date boxes, type the dates in the format YYYY/MM/DD. You can tab from box to box in the date fields. Select Go.
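The same date-limited search can be written as a single typed query instead of being set through the Limits form. The sketch below assumes that the `low:high[FIELD]` range syntax from the range searching slide also applies to the `[MDAT]` field with YYYY/MM/DD dates; the `[ORGN]` and `[MDAT]` tags are the ones used on the advanced search slides, and both function names are invented.

```python
from datetime import date


def mdat_range(start: date, end: date) -> str:
    """Format a modification-date range as YYYY/MM/DD:YYYY/MM/DD[MDAT]."""
    return f"{start:%Y/%m/%d}:{end:%Y/%m/%d}[MDAT]"


def year_query(organisms: list[str], year: int) -> str:
    """Combine several organisms with OR and restrict to one calendar year."""
    organism_part = " OR ".join(f"{name}[ORGN]" for name in organisms)
    span = mdat_range(date(year, 1, 1), date(year, 12, 31))
    return f"({organism_part}) AND {span}"
```

For the slide's example, `year_query(["mouse", "human"], 1997)` returns `(mouse[ORGN] OR human[ORGN]) AND 1997/01/01:1997/12/31[MDAT]`. The parentheses keep the OR from being combined with the date term first.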

  44. Entrez -Using more than one limit at a time • You are interested in the protein translations of human GenBank nucleotide sequences added to the protein database (or updated) in the last 30 days. You do not want patent records. • Select the Protein database from the black menu bar or the Search pull-down menu. • Type "human" without quotes in the query box. • Select Limits. • In the "Limited To:" section, select Organism from the Search Field pull-down menu. • On the same screen, select the exclude patents check box, select GenBank from the Only From pull-down menu, and finally select 30 days from the Modification Date pull-down menu and select Go.

  45. Entrez -Writing advanced search statements • Find all human nucleotide sequences with LTR annotations. • In the Nucleotide database use the following expression - LTR[FKEY] AND human[ORGN] • Find drosophila population studies published in the Journal of Molecular Evolution • In the PopSet database use the following expression - j mol evol[JOUR] AND drosophila[ORGN]

  46. Entrez -Writing advanced search statements • Find all human protein sequences with lengths between 50 and 60 amino acids and that were entered into the database during 1999. • In the Protein database use the following expression - human[ORGN] AND 50[SLEN]:60[SLEN] AND 1999[MDAT]
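Advanced search statements like these can also be sent programmatically. The sketch below only builds a request URL and sends nothing. It assumes NCBI's E-utilities ESearch service, which is not mentioned on the slides; the database names passed in (`nucleotide`, `protein`, `popset`) and the `retmax` default are likewise assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities ESearch endpoint, not mentioned on the slides.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for an Entrez query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The three expressions from the advanced search slides.
urls = [
    esearch_url("nucleotide", "LTR[FKEY] AND human[ORGN]"),
    esearch_url("popset", "j mol evol[JOUR] AND drosophila[ORGN]"),
    esearch_url("protein", "human[ORGN] AND 50[SLEN]:60[SLEN] AND 1999[MDAT]"),
]
```

`urlencode` takes care of escaping the square brackets and spaces, so the query text can be written exactly as it appears on the slides.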

  47. Feature key or descriptor line Feature qualifiers

  48. Feature Key Name (partial list) • allele, attenuator, CAAT_signal, CDS, enhancer, exon, gene, GC_signal, iDNA, intron, J_region, LTR, misc_binding, misc_feature, mRNA, polyA_signal, polyA_site, STS, 3'UTR, 5'clip • ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
