
In the Name of God, the Most Gracious, the Most Merciful

Using NCBI Resources for Gene Discovery. Lecturer: Dr. Farkhondeh Poursina, PhD (poursina@med.mui.ac.ir), 1392. National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health: http://www.ncbi.nlm.nih.gov/


Presentation Transcript


  1. In the Name of God, the Most Gracious, the Most Merciful

  2. Using NCBI Resources for Gene Discovery • Lecturer: Dr. Farkhondeh Poursina, PhD • poursina@med.mui.ac.ir • 1392 • National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health • http://www.ncbi.nlm.nih.gov/

  3. Primary biological databases (nucleic acid & protein) • EMBL (European Molecular Biology Laboratory) • DDBJ (DNA Data Bank of Japan) • GenBank (NCBI, the National Center for Biotechnology Information)

  4. EMBL/GenBank/DDBJ • These three databases contain essentially the same information (with a few differences in format) • They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from: • Genome projects and sequencing centers • Individual scientists • Non-confidential data are exchanged daily • Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp • Sequences from > 50,000 different species

  5. THE ‘PERFECT’ DATABASE • Comprehensive, but easy to search. • Annotated, but not “too annotated”. • A simple, easy to understand structure. • Cross-referenced. • Minimum redundancy. • Easy retrieval of data.

  6. The National Center for Biotechnology Information, Bethesda, MD • Created in 1988 as part of the National Library of Medicine at the NIH (National Institutes of Health) • Establish public databases • Conduct research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information

  7. Web Access: www.ncbi.nlm.nih.gov • New pages! • New homepage • Common footer

  8. TYPES OF MOLECULAR (SEQUENCE) DATABASES AT NCBI • Primary Databases • Original submissions by experimentalists • Content controlled by the submitter • Examples: GenBank, Trace, SRA, SNP, GEO • Derivative Databases • Derived from primary data • Curated/expert review (content controlled by a third party, NCBI) • Compilation and correction of data • Examples: NCBI Protein, RefSeq, RefSNP, UniGene, HomoloGene, Structure, Conserved Domain

  9. PRIMARY VS. DERIVATIVE SEQUENCE DATABASES • [Figure: sequence fragments flowing between databases. Labels: Labs, Sequencing Centers, GenBank, Curators, Algorithms, Genome Assembly, RefSeq, UniGene] • Updated ONLY by submitters • Updated continually by NCBI

  10. The Problem • Rapidly growing databases with complex and changing relationships • Rapidly changing interfaces to match the above • Result: many people don’t know: • Where to begin • Where to click on a Web page • Why it might be useful to click there

  11. Derivative Sequence Databases

  12. ENTREZ: FINDING RELEVANT INFORMATION IN NCBI DATABASES

  13. You can search the DNA sequence databases and retrieve known sequences with ENTREZ • http://www.ncbi.nlm.nih.gov/Entrez/ • Click Nucleotide • Search by accession number OR by keyword
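The same two kinds of lookup (by accession number or by keyword) can also be scripted through the NCBI E-utilities, the programmatic interface to Entrez. The sketch below only builds the request URLs and does not contact NCBI; the helper names `esearch_url` and `efetch_url` are illustrative, the search term is an assumed example, and the accession U07418 is the one shown later in these slides.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities (programmatic access to Entrez).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore", retmax=20):
    """Build a keyword-search URL; the response lists matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build a URL that retrieves one record by accession as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


# Keyword search (hypothetical search term) and accession lookup.
print(esearch_url("MLH1 AND Homo sapiens"))
print(efetch_url("U07418"))
```

Passing `rettype="fasta"` to `efetch_url` requests the FASTA form of the record instead of the GenBank flat file.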

  14. Entrez is Internally Cross-linked • DNA and protein sequences are linked to other similar sequences • Medline citations are linked to other citations that contain similar keywords • 3-D structures are linked to similar structures

  15. Databases contain more than just DNA & protein sequences

  16. Retrieve all sequences for an organism or taxon • Starting with an organism or taxon name... • How to: Download the complete genome for an organism • Starting at the Genomes

  17. How to: Find transcript sequences for a gene • Starting with ... • A GENE NAME, PRODUCT NAME, OR SYMBOL • How to: Obtain genomic sequence for/near a gene, marker, transcript or protein • Starting with... • A GENE NAME OR SYMBOL

  18. ENTREZ TIP: START SEARCHES IN GENE • [Figure labels: Gene, Entrez Protein, HomoloGene, UniGene, BLink, Gene Neighbors, other Entrez DBs]

  19. How to: Display genomic annotation graphically • Starting with... • A NUCLEOTIDE RECORD (e.g. NC_000001)

  20. By applying limits, there are now just two entries
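The slide does not say which limits were applied, so the following is only a sketch of how limits narrow an Entrez search when written as field tags in the query string. The function name `limited_query` and the chosen gene and organism are assumptions for illustration; `[Gene Name]`, `[Organism]`, `srcdb_refseq[PROP]`, and `biomol_mrna[PROP]` are Entrez Nucleotide field tags and property filters.

```python
def limited_query(keyword, organism=None, refseq_only=False, molecule=None):
    """Combine a search term with optional Entrez limits using AND."""
    parts = [keyword]
    if organism:
        # Restrict hits to one organism.
        parts.append(f'"{organism}"[Organism]')
    if refseq_only:
        # Keep only curated RefSeq records.
        parts.append("srcdb_refseq[PROP]")
    if molecule:
        # Restrict by molecule type, e.g. "mrna" or "genomic".
        parts.append(f"biomol_{molecule}[PROP]")
    return " AND ".join(parts)


broad = limited_query("MLH1[Gene Name]")
narrow = limited_query(
    "MLH1[Gene Name]", organism="Homo sapiens", refseq_only=True, molecule="mrna"
)
print(broad)
print(narrow)
```

Each added limit is one more AND clause, which is why the result list shrinks as limits are applied.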

  21. Precise Results

  22. A Traditional GenBank Record • [Figure labels: Locus field, molecular weight, molecule type, GenBank division, modification date, definition line, accession no., accession version, GI (GenInfo), taxonomy, submission field]

  23. Traditional GenBank Record • ACCESSION U07418 • VERSION U07418.1 GI:466461 • Accession: stable, reportable, universal • Version: tracks changes in the sequence • GI number: NCBI internal use • Coding sequence • The sequence is the data
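As a small illustration of the accession/version relationship, the sketch below splits an identifier such as U07418.1 into its stable accession and its version number. The names `SeqId`, `split_accession`, and `same_record` are hypothetical helpers, not part of any NCBI library.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str          # stable part, e.g. "U07418"
    version: Optional[int]  # increases when the sequence changes; None if absent


def split_accession(identifier):
    """Split 'U07418.1' into accession 'U07418' and version 1."""
    base, _, ver = identifier.strip().partition(".")
    return SeqId(base, int(ver) if ver else None)


def same_record(a, b):
    """True when two identifiers share an accession, whatever their versions."""
    return split_accession(a).accession == split_accession(b).accession


print(split_accession("U07418.1"))
print(same_record("U07418.1", "U07418"))
```

Citing the versioned form pins down one exact sequence, while the bare accession always refers to the record as a whole.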

  24. What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4; grouped on the slide as DNA, RNA, and protein): • X02775: GenBank genomic DNA sequence • NT_030059: Genomic contig • rs7079946: dbSNP (single nucleotide polymorphism) • N91759.1: An expressed sequence tag (1 of 170) • NM_006744: RefSeq DNA sequence (from a transcript) • NP_007635: RefSeq protein • AAC02945: GenBank protein • Q28369: SwissProt protein • 1KT7: Protein Data Bank structure record

  25. Feature Table GenPept Record Genomic DNA Sequence

  26. GenPept: GenBank CDS translations
      FEATURES             Location/Qualifiers
           source          1..2484
                           /organism="Homo sapiens"
                           /mol_type="mRNA"
                           /db_xref="taxon:9606"
                           /chromosome="3"
                           /map="3p22-p23"
           gene            1..2484
                           /gene="MLH1"
           CDS             22..2292
                           /gene="MLH1"
                           /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                           Number P14242), S. cerevisiae MLH1 (GenBank Accession
                           Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                           P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                           Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                           Accession Number P14160)"
                           /codon_start=1
                           /product="DNA mismatch repair protein homolog"
                           /protein_id="AAC50285.1"
                           /db_xref="GI:463989"
                           /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                           TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                           ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                           TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
      >gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
      MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
      EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
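The CDS location in this feature table can be turned into an expected protein length with a little arithmetic. The sketch below assumes a simple `start..end` location (no `join(...)`, `complement(...)`, or partial markers) and assumes the span includes the stop codon, as is conventional for complete GenBank CDS features; `cds_span` and `protein_length` are hypothetical helper names.

```python
def cds_span(location):
    """Parse a simple 'start..end' location into 1-based inclusive integers."""
    start, end = location.split("..")
    return int(start), int(end)


def protein_length(location):
    """Number of amino acids encoded by a complete CDS that ends in a stop codon."""
    start, end = cds_span(location)
    n_bases = end - start + 1  # both endpoints are included
    if n_bases % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return n_bases // 3 - 1    # the final codon is the stop codon


# MLH1 CDS from the record above: 2271 bases, 757 codons, 756 residues.
print(protein_length("22..2292"))  # 756
```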

  27. Reference Sequences (RefSeq) • Nucleotide sequences and protein translations • Curated by NCBI or NCBI-approved programs • Difference between GenBank and RefSeq: • GenBank has raw data and duplicated records • Metadata in GenBank can be incomplete • RefSeq is annotated, curated, and non-redundant • NCBI takes the best sequences from GenBank and curates them for RefSeq records

  28. Selected RefSeq Accession Numbers • mRNAs and Proteins: • NM_123456: Curated mRNA • NP_123456: Curated Protein • NR_123456: Curated non-coding RNA • XM_123456: Predicted mRNA • XP_123456: Predicted Protein • XR_123456: Predicted non-coding RNA • Gene Records: • NG_123456: Reference Genomic Sequence • Chromosome: • NC_123455: Microbial replicons, organelle genomes, human chromosomes • AC_123455: Alternate assemblies • Assemblies: • NT_123456: Contig • NW_123456: WGS Supercontig
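Because the two-letter prefix carries the record type, the table above can be applied mechanically. The following is a minimal sketch: the dictionary simply restates the slide, and `describe_refseq` is a hypothetical helper that returns `None` for anything that does not look like a RefSeq accession.

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "NG": "reference genomic sequence",
    "NC": "chromosome, microbial replicon, or organelle genome",
    "AC": "alternate assembly",
    "NT": "contig",
    "NW": "WGS supercontig",
}


def describe_refseq(accession):
    """Return the record type for a RefSeq accession, or None if unrecognized."""
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        return None  # RefSeq accessions always contain an underscore
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_006744", "NP_007635", "NT_030059", "U07418"):
    print(acc, "->", describe_refseq(acc))
```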

  29. Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq

  30. How to save? • Choose FASTA from the Display drop-down menu. • Convert the content of this window to plain text by choosing Text from the drop-down menu at the far right of the menu bar. • Save the FASTA sequence as follows: • a. In the Edit menu of your Web browser, click Select All and then click Copy. • b. Open a new Word document and, in the Edit menu of Word, click Paste. • c. Finally, save your document as dUTPaseDNA.txt by choosing Text Only (*.txt) as the Save as type option.
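The copy-and-paste steps above can also be replaced by a short script that writes the record straight to a plain-text file, which avoids any word-processor formatting. This is only a sketch: `format_fasta` and `save_fasta` are hypothetical helpers, and the 70-character line width is an assumed default chosen to stay under 80 characters.

```python
from pathlib import Path


def format_fasta(description, sequence, width=70):
    """Return one FASTA record as text: a '>' line, then wrapped sequence lines."""
    residues = "".join(sequence.split())  # drop spaces and line breaks
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([">" + description] + lines) + "\n"


def save_fasta(path, description, sequence, width=70):
    """Write one FASTA record to a plain-text file."""
    Path(path).write_text(format_fasta(description, sequence, width), encoding="ascii")


# Short made-up sequence, wrapped at 4 characters to show the layout.
print(format_fasta("demo sequence", "ACGTACGTAC", width=4))
```

Calling `save_fasta("dUTPaseDNA.txt", description, sequence)` with the description and sequence copied from the record would produce the same file as step c.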

  31. FASTA format description • FASTA is a DNA and protein sequence alignment software package, first described (as FASTP) by David J. Lipman and William R. Pearson in 1985 • The sequence format it uses is now popular and widely supported • A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
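The format description above translates directly into a small reader: a line starting with ">" opens a new record, and every following line is sequence data until the next ">". The sketch below is a minimal illustration rather than a full parser; `parse_fasta` is a hypothetical name and the sample records are made up.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # sequence data may span several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


sample = ">seq1 first example\nACGT\nTTGA\n>seq2\nGG\n"
for desc, seq in parse_fasta(sample):
    print(desc, len(seq))
```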
