Genome Annotation and Databases

Presentation Transcript


  1. Genome Annotation and Databases • Genomic DNA sequence • Genomic annotation • Reading: Ch 9, Ch 10 • BIO520 Bioinformatics, Jim Lund

  2. Genome Annotation • Find known repeats • Search for new repeated sequences • Predict Genes • BLASTX • Genewise, Fgenes, Genscan… • Integrate other data sources. Accuracy highest in “high homology” class

  3. Genome annotation servers • Integrate information from several maps • DNA sequence (contigs, quality). • Physical (cytogenetic, STS content). • Genes (show gene annotations and evidence). • Several prediction programs. • Expressed sequence tags (ESTs, Unigene clusters) • Evidence (Predicted, confirmed) • Non-coding RNA (ncRNA) transcripts. • Variation (e.g., SNPs) • Regions of shared synteny.

  4. Data Release • Human genome sequence released under 1996 Bermuda rules • Assembled sequence greater than 1000 bp long is deposited in a public database (GenBank/EMBL/DDBJ) every 24 hours • No patents are filed • Bermuda principles reaffirmed at January 2003 WT/NIH meeting • Pre-release of data for all “community projects” • Nature 421, 875 (2003) • NHGRI: http://www.genome.gov/page.cfm?pageID=10506376 • WT: http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTD002751.htm • Benefits of Open Data Access supported by OECD report: http://dataaccess.ucsd.edu

  5. Accessing the Genome • Genome sequences are becoming available very rapidly • Large and difficult to handle computationally • Everyone expects to be able to access them immediately • Bench Biologists • Has my gene been sequenced? • What are the genes in this region? • Where are all the GPCRs? • Connect the genome to other resources. • Research Bioinformatics • Give me a dataset of human genomic DNA. • Give me a protein dataset.

  6. Getting information out • Search/browse to find the gene or region. • Export formats: • Screen shot • FASTA sequence • GenBank file with features annotated • Feature list (GFF, tab-delimited text) • Pip (plot of sequence identity between organisms).
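
The feature-list export mentioned above is easy to work with locally. The following is a minimal sketch, assuming a tab-delimited GFF file with the usual column order (sequence, source, type, start, end, score, strand, phase, attributes). The names `Feature`, `parse_gff`, `features_in_region`, and the example records are illustrative and do not come from any particular browser.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str


def parse_gff(text: str) -> List[Feature]:
    """Parse GFF text, keeping the location columns and skipping comments."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment/directive
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a well-formed feature line
        features.append(
            Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), cols[6])
        )
    return features


def features_in_region(features, seqid, start, end):
    """Return features on `seqid` that overlap the closed interval [start, end]."""
    return [f for f in features
            if f.seqid == seqid and f.start <= end and f.end >= start]


# Made-up example records (hypothetical gene on "chr1").
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "chr1\tdemo\texon\t100\t300\t.\t+\t.\tParent=gene1\n"
    "chr1\tdemo\texon\t600\t900\t.\t+\t.\tParent=gene1\n"
)
```

Calling `features_in_region(parse_gff(EXAMPLE_GFF), "chr1", 350, 650)` answers the "what is in this region?" question for the toy data.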

  7. Challenges • Scale and data flow • Presentation, ease of use. • Engineering problems. • User interface design. • Algorithmic • Partly engineering (pre-compute hard computations, etc.) • Partly research.

  8. NCBI sequence assembly (sequence to chromosome) • Remove contaminants • Bin by chromosome arms • Sequence Layout • Sequence Building • Place on chromosomes • http://www.ncbi.nlm.nih.gov/genome/guide/build.shtml

  9. NCBI sequence assembly - a modified greedy approach (diagram labels: BAC, Sequence Fragments, Assemble, Order, NCBI Contig) • Sequence Layout • Curated Finished Regions • Curated assembly instructions • MegaBLAST hits • Consider clone order • BAC chromosome assignment • annotation • STS markers • personal communication • Remove conflicting overlaps, redundant BACs • Sequence Building • Consider fragment:fragment sequence overlaps for each BAC pair in layout • Meld overlapping sequence • Order and Orient (o+o): • alignments (mRNA, EST) • BAC annotation • paired plasmid reads
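
To make the "meld overlapping sequence" step concrete, here is a textbook greedy merge: repeatedly join the pair of fragments with the longest exact suffix/prefix overlap. It is a toy sketch only. NCBI's modified approach, as the slide says, also relies on curated layouts, MegaBLAST alignments, and clone order, none of which are modelled here, and the function names and `min_overlap` parameter are assumptions for illustration.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Greedily merge fragments by best exact overlap; returns remaining contigs."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return frags  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
```

The sketch ignores sequencing errors, reverse-complement fragments, and fragments fully contained in others, all of which a real assembler has to handle.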

  10. NCBI Genome Build Process [Flowchart; only its labels survive] • Input resources: STS, dbSNP, Clones, GenomeScan, Collaboration, Curation, GenBank, LocusLink, RefSeq • Freeze input data: Sequences, Curated NTs, TPF, BLAST hits • Exclude problem accessions • Assembly • Contig Build & Release • Annotation • Analysis & Review • Corrections for next build • Prepare for release • Resource updates: Links, gi’s, LocusLink • Public Release: Sequences (contig, mRNA, protein), Map Viewer, FTP, BLAST

  11. What is being annotated? (Feature: Method) • Genes: By alignment, by prediction • Markers: By ePCR • Variation: By alignment • Clones/Cytogenetic location: By alignment (BAC ends) • Phenotype (MIM): Via gene identification, associated markers • Cytogenetic position: By annotated BAC-end sequenced clones; by FISH-mapped clones used in assembly
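
Placing markers "by ePCR" means searching the assembled sequence for a primer pair in the right orientation and at roughly the expected product size. The sketch below shows the idea with exact string matching only. It is not the e-PCR program itself, which tolerates mismatches and uses an indexed search, and the function names and the `tolerance` parameter are assumptions for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_all(seq: str, pattern: str):
    """Return every 0-based start position of `pattern` in `seq`."""
    positions = []
    i = seq.find(pattern)
    while i != -1:
        positions.append(i)
        i = seq.find(pattern, i + 1)
    return positions


def epcr_hits(seq, forward, reverse, expected_size, tolerance=50):
    """Find exact-match primer pairs that imply a product near `expected_size`.

    Both primers are given 5'->3'. Returns (start, end, size) tuples with
    0-based start and exclusive end on the forward strand.
    """
    seq = seq.upper()
    fwd = forward.upper()
    rev_site = reverse_complement(reverse.upper())  # reverse primer's site on the forward strand
    hits = []
    for s in find_all(seq, fwd):
        for r in find_all(seq, rev_site):
            end = r + len(rev_site)
            size = end - s
            # The reverse site must lie downstream of the forward primer.
            if r >= s + len(fwd) and abs(size - expected_size) <= tolerance:
                hits.append((s, end, size))
    return hits
```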

  12. RefSeq: a reagent for genome annotation • Potential problems with ESTs: • Gene families • Partial • Chimeric • Intron read-through • Linker • Vector • Wrong organism • RefSeq advantages: • Separate gene families • Not partial • Means to correct problem sequences • RefSeq process results in excluding problem GenBank sequences from the annotation pipeline • Diagram labels: Contig, RefSeq mRNAs, GenBank mRNAs, ESTs, TBLASTN, RPSBLAST, GenomeScan

  13. NCBI: Products of annotation • RefSeqs (transcripts, proteins) • Gene ID (LocusID) • Features in chromosome coordinates • Features in contig (NT accession) coordinates • Available in: • Map Viewer: graphical display, tabular display, sequence downloads • FTP: RefSeqs (contigs, transcripts, proteins), mapping data • LocusLink & other resources

  14. NCBI Map Viewer

  15. NCBI Map Viewer: Tabular report

  16. Genes in regions of conserved synteny • Anchored by human gene order • Anchored by mouse gene order

  17. Chromosomal segments in dog conserved with human and mouse • Dog: 38 autosomes + sex chromosomes

  18. Query by sequence: Review the alignment • A click away: • Alignments (BLAST hit) • Gene Description (LocusLink) • Report of all features in the region • Contig sequence • Sequence in the region • other mRNAs aligning in the region • Define your own gene model based on alignments in the region

  19. Quality Control - Genome review • Is the sequence correct? • Is the feature correctly placed? • Is there a feature that should be placed? • Are the attributes of the feature correct? • Approaches: • In-house analysis & review (manual curation) • Shared information (UCSC/Ensembl) • Solicited review by experts in local regions

  20. Ensembl Annotation pipeline • Set of high-quality gene predictions • From known human mRNAs aligned against the genome • From similar proteins and mRNAs aligned against the genome • From Genscan predictions confirmed via BLAST against protein, cDNA, and EST databases. • Initial functional annotation from InterPro • Integration with external resources (SNPs, SAGE, OMIM) • Comparative analysis between mouse/human • DNA sequence alignment • Protein orthologs
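
The idea of keeping only ab initio predictions that are "confirmed" by similarity evidence can be illustrated with a simple interval-overlap filter. This is a sketch under stated assumptions, not the Ensembl implementation: predicted exons and evidence hits are reduced to (start, end) pairs on one sequence, and the names `confirmed_predictions` and `min_supported_exons` are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based, inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirmed_predictions(
    predicted_genes: Dict[str, List[Interval]],
    evidence_hits: List[Interval],
    min_supported_exons: int = 1,
) -> Dict[str, List[Interval]]:
    """Keep predictions with enough exons overlapped by an evidence hit.

    Returns a mapping from prediction name to its supported exons.
    """
    confirmed = {}
    for name, exons in predicted_genes.items():
        supported = [e for e in exons
                     if any(overlaps(e, h) for h in evidence_hits)]
        if len(supported) >= min_supported_exons:
            confirmed[name] = supported
    return confirmed
```

Raising `min_supported_exons` trades sensitivity for specificity: fewer predictions survive, but each has more independent support.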

  21. Ensembl gene prediction pipeline [Flowchart labels] • DNA • RepeatMasker • Genscan • Pmatch all human proteins and cDNAs • BLAST Genscan peptides vs. protein, UniGene, EST, vertebrate mRNA • MiniGenewise • MiniEst2genome • Genes

  22. Genome Annotation • The generic structure of an automatic genome annotation pipeline and delivery system

  23. Configuration • Chromosome • Overview: Genes and Markers (1 Mb) • Detailed View: Genes, ESTs, CpG, etc. (100 kb)

  24. Useful genomic annotation and browser URLs • EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/ • NCBI Human Genome Browser: http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=hum_chr.inf&query • Oak Ridge National Laboratory Genome Channel: http://compbio.ornl.gov/channel/ • UCSC Human Genome Browser: http://genome.ucsc.edu/cgi-bin/hgGateway • The Institute for Genomic Research (TIGR): http://www.tigr.org/

  25. Genome annotation -things still being worked out- • Annotation servers. • Pro: make genomics information accessible to biologists without expert bioinformatics skills. • Con: makes it difficult to perform large-scale data mining. • Solution: enable more experienced users to retrieve the data they require and to run analyses locally. • Open annotation systems. • Biologists need to have access to annotations available in the community and to share their own contributions with the community. • A common protocol between systems that enables genome data to be freely exchanged • AGAVE (Architecture for Genomic Annotation, Visualization and Exchange) • Distributed Annotation System (DAS) projects
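
As an illustration of the "retrieve the data and run analyses locally" solution, the sketch below reads FASTA-formatted text, such as a sequence download from an annotation server, and computes a simple per-record statistic. The function names and the example records are made up for illustration.

```python
def read_fasta(text: str):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """Fraction of G+C among unambiguous bases (A, C, G, T); 0.0 if none."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / counted if counted else 0.0


# Made-up example input.
EXAMPLE_FASTA = ">contigA demo\nACGT\nGGCC\n>contigB\nATAT\n"

for name, sequence in read_fasta(EXAMPLE_FASTA):
    print(name, len(sequence), gc_fraction(sequence))
```

For genome-scale files the same generator pattern can be applied to an open file handle line by line, so the whole dataset never has to sit in memory.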

  26. Genome annotation servers • Several ways to find information: • Search by clone, gene, EST, marker. • Browse sequence. • BLAST searches. • Homology: start in one organism, then jump to the syntenic region of another.

  27. UCSC Genome Browser http://genome.ucsc.edu/cgi-bin/hgGateway