Slow and Steady: The Sea Urchin Genome Project
David A. Schwarz
Mentor: Dr. Andrew Cameron
Site: California Institute of Technology
Objective
• Curate the non-annotated, predicted genes of the sea urchin genome.
• Learn to annotate genes and register as many as possible at spbase.org.
Importance
• The purple sea urchin: the only non-chordate deuterostome with a sequenced genome.
• It could help us understand the evolution of biological processes such as odor perception and immunity.
• Developments made in the project could benefit future genome projects.
Strongylocentrotus purpuratus
• Phylum: Echinodermata
• Radially symmetrical shell, 3–10 cm.
• Spines can reach 3 cm long.
• Moves slowly, feeding mostly on algae.
• Reproduces by external fertilization.
Genome Sequencing
• WGS = Whole Genome Shotgun Sequencing
• Genome assembly named Spur_v0.5
• CAPSS = Clone-Array Pooled Shotgun Sequencing strategy
• Genome assembly named Spur_v2.1
Sequencing
WGS: Extract DNA, digest, sequence the fragments, assemble the genome.
CAPSS: Combines WGS with BACs (bacterial artificial chromosomes). Uses BACs as a framework for genome assembly.
Discrepancy
Spur_v0.5: 28,944 predicted (~10,044 annotated, 18,944 non-annotated)
Spur_v2.1: 23,300 estimated
Gene number reduced when duplicates overlap
• ~5,700 gene difference possibly due to:
  • 4–5% species polymorphism (E. Davidson, et al.)
  • Assembly error
  • Prediction error
Methods
Python Filtering
• BioPython module: BLAST hit FASTA sequences
Python Searching
• Grep-like functions:
  • GLEAN models by protein type
  • FASTA sequences in GLEAN protein database
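A grep-like search over FASTA headers could look something like the sketch below. It is not the project's own code: the slides mention BioPython, while this version uses only the standard library so it runs without dependencies. The function names `read_fasta` and `grep_fasta`, the header annotations, and the sequences are all hypothetical.

```python
import io
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def grep_fasta(handle, pattern):
    """Return records whose header matches the regex (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(h, s) for h, s in read_fasta(handle) if regex.search(h)]


# Hypothetical records, invented for illustration only.
example = io.StringIO(
    ">GLEAN3_00014 ubiquitin-conjugating enzyme\nMSTPARRRL\nMRDFKR\n"
    ">GLEAN3_00030 olfactory receptor\nMKLVNQ\n"
)
matches = grep_fasta(example, r"olfactory")
```

With BioPython installed, `Bio.SeqIO.parse(handle, "fasta")` would be a natural replacement for the hand-written reader.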
Example List
GLEAN3_00003 ref|NP_104627.1| hypothetical protein [Mesorhizobium loti] >gi|1... 38 0.48
GLEAN3_00004 ref|NP_788284.1| CG33087-PC [Drosophila melanogaster] >gi|232403... 40 0.19
GLEAN3_00005 ref|NP_509604.1| abnormal NUClease NUC-1, deoxyribonuclease DLAD... 69 4e-11
GLEAN3_00008 ref|XP_293875.3| similar to RIKEN cDNA B130016O10 gene [Homo sap... 240 5e-62
GLEAN3_00010 gb|AAH36744.1| FLJ11712 protein [Homo sapiens] 86 6e-16
GLEAN3_00011 gb|AAH36744.1| FLJ11712 protein [Homo sapiens] 143 3e-32
GLEAN3_00014 ref|NP_062642.1| ubiquitin-conjugating enzyme E2A, RAD6 homolog;... 229 2e-59
GLEAN3_00018 failed
GLEAN3_00019 failed
GLEAN3_00020 failed
GLEAN3_00021 ref|NP_196259.2| chaperone protein - related [Arabidopsis thalia... 110 4e-23
GLEAN3_00023 failed
GLEAN3_00024 sp|O42587|PRSA_XENLA 26S protease regulatory subunit 6A (TAT-bin... 130 1e-29
GLEAN3_00027 gb|AAD19348.1| reverse transcriptase-like protein [Takifugu rubr... 172 2e-41
GLEAN3_00028 gb|AAH53792.1| MGC64389 protein [Xenopus laevis] 164 3e-39
GLEAN3_00029 failed
GLEAN3_00030 ref|XP_060945.2| similar to Olfactory receptor 10T2 [Homo sapien... 54 5e-06
GLEAN3_00032 dbj|BAA22375.1| Nfrl [Xenopus laevis] 339 7e-92
GLEAN3_00033 ref|XP_354640.1| RIKEN cDNA D430035D22 gene [Mus musculus] 186 1e-45
GLEAN3_00034 dbj|BAC04242.1| unnamed protein product [Homo sapiens] 207 5e-52
GLEAN3_00037 dbj|BAC02921.1| zVeph-A [Danio rerio] 112 4e-23
GLEAN3_00038 ref|NP_004198.1| solute carrier family 16, member 3; monocarboxy... 44 0.008
GLEAN3_00039 failed
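Each entry in the list above appears to be a GLEAN model name followed either by the word "failed" or by a hit accession, a description, a score and an e-value. A minimal parsing sketch under that assumption follows; the `Hit` type and `parse_hit_line` function are hypothetical names, not part of the project's scripts.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    model: str
    accession: Optional[str]
    description: Optional[str]
    score: Optional[float]
    evalue: Optional[float]


def parse_hit_line(line):
    """Parse one list entry, assuming the layout described above."""
    model, rest = line.strip().split(None, 1)
    if rest.strip() == "failed":
        # No hit was recorded for this model.
        return Hit(model, None, None, None, None)
    # The last two whitespace-separated fields are the score and the e-value.
    body, score, evalue = rest.rsplit(None, 2)
    # The first field of the remainder is the database accession.
    accession, _, description = body.partition(" ")
    return Hit(model, accession, description.strip(), float(score), float(evalue))


hit = parse_hit_line("GLEAN3_00010 gb|AAH36744.1| FLJ11712 protein [Homo sapiens] 86 6e-16")
miss = parse_hit_line("GLEAN3_00018 failed")
```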
Data Curation
Condition: Different name, same genome coordinates
Genes removed: 139
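One way to detect models with different names but identical coordinates is to group them by coordinate and keep the groups with more than one member. The sketch below assumes each model is described by a (name, scaffold, start, end) tuple. That layout, the function name, and the sample values are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


def group_by_coordinates(models):
    """Map (scaffold, start, end) to the names sharing it, keeping only shared ones."""
    groups = defaultdict(list)
    for name, scaffold, start, end in models:
        groups[(scaffold, start, end)].append(name)
    return {coords: names for coords, names in groups.items() if len(names) > 1}


# Hypothetical models, invented for illustration only.
models = [
    ("MODEL_A", "Scaffold1", 100, 900),
    ("MODEL_B", "Scaffold1", 100, 900),
    ("MODEL_C", "Scaffold2", 50, 400),
]
shared = group_by_coordinates(models)
```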
Data Curation
Condition: Evidence for gene expression
Genes removed: 1,603
Data Curation
Condition: No hits
Genes removed: 3,145
Data Curation
Condition: Exactly the same BLAST hit
Genes removed: 4,545
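Models that share exactly the same BLAST hit can be found by grouping on the hit accession. The sketch below is one possible approach, not the project's script. It assumes a dictionary mapping each model to its top-hit accession, with `None` where no hit was found; the function name is hypothetical, and the sample entries are taken from the example list.

```python
from collections import defaultdict


def models_sharing_a_hit(best_hits):
    """Map each accession hit by more than one model to the sorted model names."""
    by_hit = defaultdict(list)
    for model, accession in best_hits.items():
        if accession is not None:
            by_hit[accession].append(model)
    return {acc: sorted(names) for acc, names in by_hit.items() if len(names) > 1}


best_hits = {
    "GLEAN3_00008": "ref|XP_293875.3|",
    "GLEAN3_00010": "gb|AAH36744.1|",
    "GLEAN3_00011": "gb|AAH36744.1|",
    "GLEAN3_00018": None,  # listed as "failed"
}
duplicates = models_sharing_a_hit(best_hits)
```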
Data Curation
Condition: Successful reciprocal BLAST match
Genes removed: 3,952
Reciprocal BLAST
[Figure: Good Reciprocal BLAST, between the sea urchin protein database (GLEAN) and the NCBI nr database. Labels: A, B, X, Y; GLEAN_A; NCBI Protein B (score, e-value).]
Reciprocal BLAST
[Figure: Bad Reciprocal BLAST, between the sea urchin protein database (GLEAN) and the NCBI nr database. Labels: A, B, X, Y; GLEAN_A; NCBI Protein B (score, e-value).]
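As commonly defined, a reciprocal BLAST match holds when a query's best hit in the other database has the query as its own best hit in return. The sketch below checks that condition on two dictionaries of best hits. The function name and identifiers are hypothetical, and the slides do not say whether the project used exactly this criterion.

```python
def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, subject) pairs that are each other's best hit.

    forward: best hit of each GLEAN model in the other database (or None)
    reverse: best hit of each of those proteins back in the GLEAN database
    """
    pairs = []
    for query, subject in forward.items():
        if subject is not None and reverse.get(subject) == query:
            pairs.append((query, subject))
    return sorted(pairs)


# Hypothetical identifiers: A and B point at each other, X and Y do not.
forward = {"GLEAN_A": "Protein_B", "GLEAN_X": "Protein_Y"}
reverse = {"Protein_B": "GLEAN_A", "Protein_Y": "GLEAN_Z"}
good = reciprocal_best_hits(forward, reverse)
```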
Data Curation
Conditions: Names such as “hypothetical”, “predicted”, “unnamed”
Genes removed: 3,041
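A name-based condition like this can be expressed as a keyword match on the hit description. The sketch below uses only the three words named on the slide. Since the slide says "such as", the real list may have been longer, and the function name is hypothetical.

```python
import re

# Only the three keywords the slide names; the actual list may have been longer.
UNINFORMATIVE = re.compile(r"\b(hypothetical|predicted|unnamed)\b", re.IGNORECASE)


def has_uninformative_name(description):
    """True if the hit description contains one of the placeholder words."""
    return bool(UNINFORMATIVE.search(description))


# Descriptions taken from the example list.
flags = [
    has_uninformative_name("hypothetical protein [Mesorhizobium loti]"),
    has_uninformative_name("unnamed protein product [Homo sapiens]"),
    has_uninformative_name("zVeph-A [Danio rerio]"),
]
```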
Contributions to Annotation
• AnnotationAssist.py
• Automates searching for families in the Glean database
• Autofetches sequences for Clustal X
• Stores everything in a unique directory based on Glean model name and family
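The source of AnnotationAssist.py is not shown, so the following is only a guess at how a directory named after the model and family might be created. The `<model>_<family>` naming scheme and the `output_dir` function are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def output_dir(root, model, family):
    """Create and return <root>/<model>_<family> (hypothetical naming scheme)."""
    # Replace characters that are awkward in directory names.
    safe_family = re.sub(r"[^A-Za-z0-9_.-]+", "_", family.strip())
    path = Path(root) / f"{model}_{safe_family}"
    path.mkdir(parents=True, exist_ok=True)
    return path


# Usage example in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = output_dir(tmp, "GLEAN3_00030", "olfactory receptor")
    created_name = created.name
    existed = created.is_dir()
```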
References
• Polymorphism: R. J. Britten, A. Cetta, E. H. Davidson, Cell 15, 1175 (1978).
• CAPSS: W. W. Cai, R. Chen, R. A. Gibbs, A. Bradley, Genome Res. 11, 1619 (2001).
Acknowledgments
Dr. Andrew Cameron
David Felt
Lauren Lee and Nowelle Ibarra
SoCalBSI Staff and Coordinator
SoCalBSI Participants
Funding: NIH, NSF, DOE, Beckman Institute