The Zebrafish Genome Sequencing Project: Bioinformatics resources. Kerstin Howe, Mario Caccamo, Ian Sealy. Outline: clone mapping, sequencing and manual annotation in Vega; genome assemblies and automated annotation in Ensembl; integrated ZF-Models data and tools.
The Zebrafish Genome Sequencing Project: Bioinformatics resources
Kerstin Howe, Mario Caccamo, Ian Sealy
Bioinformatics resources
• outline
  • clone mapping, sequencing and manual annotation in Vega
  • genome assemblies and automated annotation in Ensembl
  • integrated ZF-Models data and tools
Clone mapping and sequencing
• mapping
  • 2 Tuebingen BAC libraries
  • 1 BAC and 1 cosmid library from a single Tuebingen double-haploid fish
  • end sequencing, RH mapping, fingerprinting
  • pieced together according to fingerprints, marker mapping, sequence alignment
  • currently ~2,500 contigs
Clone mapping and sequencing
• sequencing pipeline
  • select clones based on position in fpc contig
  • subcloning
  • sequencing
  • automatic assembly/pre-finishing (back to sequencing if necessary)
  • finishing
  • QC
  • automated analysis pipeline
  • manual annotation
  • submission to EMBL
Manual annotation
(diagram labels: unfinished sequence, finished sequence, automated analysis pipeline, manual annotation, otter, EMBL)
• automated analysis pipeline
  • RepeatMasker
  • CpG island prediction
  • Genscan
  • FGenesh
  • halfwise (Pfam)
  • EPCR
  • Blast (ESTs, cDNAs, proteins)
• manual annotation
  • gene structures
  • remarks (gene names, function, similarities)
  • other features
• otter
  • MySQL database in 'ensembl style'
  • acedb or apollo front end
  • open to users from the 'outside'
Manual annotation
• annotation policy
  • follows guidelines for human annotation (HAVANA team, Sanger Institute)
  • no "guesses", annotations solely based on supporting evidence
  • annotation of:
    • CDSs and UTRs / transcripts
    • splice variants
    • pseudogenes
    • poly A features
    • transposons
    • repeats
  • approved nomenclature (SI:clone.number)
• collaboration with ZFIN
  • existing ZFIN records are reported
  • ZFIN provides new records for newly found genes
Manual annotation (screenshot; evidence tracks: DNA, CpG islands, repeats, Genscan, FGenesH, proteins, ESTs, mRNAs)
Vega contigview
Vega geneview
When to use what
• go to vega.sanger.ac.uk if you need
  • highly reliable sequence
  • highly reliable annotation (with your input)
  • ‘your gene’ stable over time (TILLING)
• go to www.ensembl.org if you need
  • the whole genome
  • comparative data
  • ZF-Models microarray or insertional mutagenesis data
  • complicated searches (BioMart)
Zebrafish Genome Project (overview diagram)
• clone mapping and sequencing: clone libraries, markers (T51), map, fpc ctg, tile path BACs, sequencing, finish clone, (un)finished clones, ~8,000 finished clones (~1 Gb)
• whole genome shotgun sequencing: WGS reads, WGS assembly, contig, supercontig, 1.63 Gb
• integration: clones + ctgs, assembly release (Zv5), automatic annotation, manual annotation
WGS assembly
• Phusion assembler, High Performance Assembly Group (Zemin Ning et al.)
• diagram stages: reads, group reads, contigs (phrap), supercontigs (read-pair tracker)
• contigs A, B, C are ordered using read pairs; a gap between contigs is shown as NNNNNNNN
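The last stage in the diagram, where contigs linked by read pairs are joined into a supercontig and each gap is written as a run of Ns, can be sketched roughly as follows. This is a minimal illustration and not Phusion's code: the function names `estimate_gap` and `build_supercontig`, the overhang parameters and the `min_gap` floor are assumptions made for the example.

```python
def estimate_gap(insert_size, left_overhang, right_overhang):
    """Estimate the gap between two contigs bridged by one read pair.

    insert_size:    expected distance spanned by the read pair (hypothetical
                    library insert size).
    left_overhang:  bases from the start of the forward read to the end of
                    the left contig.
    right_overhang: bases from the start of the right contig to the end of
                    the reverse read.
    """
    return insert_size - left_overhang - right_overhang


def build_supercontig(contigs, gaps, min_gap=1):
    """Join ordered contig sequences, writing each gap as a run of 'N'.

    contigs: contig sequences, already ordered and oriented.
    gaps:    one gap estimate per adjacent contig pair.
    min_gap: assumed floor, so that even a zero or negative estimate
             leaves a visible break between contigs.
    """
    if not contigs:
        return ""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate per adjacent contig pair")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        parts.append("N" * max(gap, min_gap))
        parts.append(contig)
    return "".join(parts)
```

For instance, `build_supercontig(["ACGT", "GGCC"], [estimate_gap(20, 7, 5)])` joins two toy contigs with an eight-base run of Ns. A real assembler would combine many read pairs per link before settling on a gap size.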
Read grouping
• word distribution (plot of frequency against k-mer occurrence; labels: seq. errors, repeats, k-mer occurrence ~7)
• k-mer word hashing
  • continuous base hash, k=12
    ATGGCGTGCAGTCCATGTTCGGATCA
    ATGGCGTGCAGT
     TGGCGTGCAGTC
      GGCGTGCAGTCC
       GCGTGCAGTCCA
  • gap hash, k=12 (4x3), dealing with variation
    ATGGCGTGCAGTCCATGTTCGGATCA
    ATGGCGTGCAGTCCATGT
     TGGCGTGCAGTCCATGTT
      GGCGTGCAGTCCATGTTC
       GCGTGCAGTCCATGTTCG
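The two hashing schemes on this slide can be illustrated with a short sketch. The continuous version slides a 12-base window along the read, as in the example words shown. For the gap hash, the slide gives only "k=12 (4x3)" and 18-base windows; the mask used here (four blocks of 3 bases separated by 2 skipped bases, which spans exactly 18 bases) is an assumption chosen to fit those numbers, not a documented Phusion parameter. The function names and the count thresholds in `informative_words` are likewise hypothetical.

```python
from collections import Counter

READ = "ATGGCGTGCAGTCCATGTTCGGATCA"  # example read from the slide

# Assumed gap pattern: 1 = base used in the word, 0 = base skipped.
GAP_MASK = "111001110011100111"


def kmers(seq, k=12):
    """Continuous base hash: every overlapping k-base word in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, mask=GAP_MASK):
    """Gap hash: sample only the masked-in bases from each window."""
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    span = len(mask)
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]


def informative_words(reads, k=12, min_count=2, max_count=20):
    """Count words across reads and keep those in a middle frequency band.

    Very rare words tend to come from sequencing errors and very frequent
    ones from repeats; the thresholds here are illustrative only.
    """
    counts = Counter(word for read in reads for word in kmers(read, k))
    return {w: n for w, n in counts.items() if min_count <= n <= max_count}
```

A base change that falls on a skipped position leaves the gapped word unchanged while the continuous word differs, which is one way to read "dealing with variation": `gapped_kmers(READ)[0]` and `gapped_kmers("ATGACGTGCAGTCCATGTTCGGATCA")[0]` are identical even though the reads differ at the fourth base.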
Zebrafish Genome Project (overview diagram, repeated)
• clone mapping and sequencing: clone libraries, markers (T51), map, sequencing, (un)finished clones, ~7,000 finished clones (~1 Gb)
• whole genome shotgun sequencing: WGS reads, WGS assembly
• integration: assembly release (Zv5), automatic annotation, manual annotation
Integration (diagram)
• labels: cDNA, WGS supercontig, bacends, marker, BACs, fpc contig, Zv5 scaffoldn
• BACs: BX005049.6, BX005123.6, BX005153, BX005057.8
• resulting order: Zv5 scaffoldn.1, BX005153, Zv5 scaffoldn.3, BX005057.8, Zv5 scaffoldn.5, BX005049.6, BX005123.6, Zv5 scaffoldn.7
Automatic annotation (pipeline diagram)
• inputs: zebrafish proteins, other proteins, zebrafish cDNAs, zebrafish ESTs
• alignment: Genewise (proteins), Exonerate (cDNAs, ESTs)
• intermediate results: Genewise genes, aligned cDNAs, aligned ESTs
• merging: ClusterMerge, Genewise genes with UTRs, supported ab initio (optional)
• Genebuilder
• outputs: Ensembl EST genes, final set
BioMart (screenshots of the three steps: start, filter, output)
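The same three steps (choose a dataset, set filters, pick output attributes) also map onto BioMart's XML query format. The sketch below only builds such a query document and does not contact any server. The dataset, filter and attribute names in the usage note are assumptions based on Ensembl BioMart naming and may not match the release these slides describe; `build_biomart_query` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string.

    dataset:    dataset name (the 'start' step).
    filters:    dict of filter name to value (the 'filter' step).
    attributes: list of attribute names to return (the 'output' step).
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")
```

For example, `build_biomart_query("drerio_gene_ensembl", {"chromosome_name": "1"}, ["ensembl_gene_id"])` yields a query for gene identifiers on chromosome 1, assuming those names exist in the mart being queried.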
Do’s and Don’ts
• go elsewhere (Ensembl) if you
  • want to know about the whole genome
  • need comparative data
  • need ZF-Models microarray or insertional mutagenesis data
  • need to do complicated searches
• go to Vega if you
  • need highly reliable sequence
  • need highly reliable annotation
  • need ‘your gene’ stable over time (TILLING)
DAS (diagram)
• DAS client: genome browser with local storage
• three DAS servers, each with remote storage
• other labels: reference sequence, XML
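What the client side of this diagram does can be sketched in a few lines: it asks a DAS server for the features on a segment of the reference sequence and reads the XML that comes back. The sketch assumes the DAS 1.x `features` request form and DASGFF response layout; the server address, data source name and sample response are made up for illustration, and no network call is made.

```python
import xml.etree.ElementTree as ET


def das_features_url(server, source, segment, start, stop):
    """Build a DAS 'features' request URL for one segment range."""
    return f"{server.rstrip('/')}/das/{source}/features?segment={segment}:{start},{stop}"


def parse_das_features(xml_text):
    """Extract id, type, start and end of each FEATURE in a DASGFF document."""
    root = ET.fromstring(xml_text.strip())
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.find("TYPE").get("id"),
                "start": int(feature.findtext("START").strip()),
                "end": int(feature.findtext("END").strip()),
            })
    return features


# Hypothetical response, shaped like a DASGFF document.
SAMPLE_RESPONSE = """
<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/zebrafish/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="exon1" label="exon1">
        <TYPE id="exon">exon</TYPE>
        <METHOD id="manual">manual</METHOD>
        <START>1100</START>
        <END>1250</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""
```

A genome browser acting as DAS client would issue one such request per configured server and overlay the returned features on the shared reference sequence.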