
Midterm Project


steffi


  1. Midterm Project

  2. Database Schema
     • GeneIDTable
     • Information about “gene” and corresponding “protein”
     • gene_id, gene_name, gene_seq, protein_id, protein_name, protein_seq, gene_type
     • gene_id – primary key (type varchar(255))
     • gene_type – type varchar(255)
     • All other entries are of type longtext

  3. Database Schema
     • GeneFuncTable
     • Information about “gene functions”
     • gene_id, gene_fun, comment
     • gene_id – foreign key
     • All entries are of type longtext

  4. Database Schema
     • ProteinFuncTable
     • Information about “protein functions”
     • protein_id, protein_fun, comment
     • All entries are of type longtext

  5. Database Schema
     • PathwayFuncTable
     • Information about “pathway functions”
     • pathway_id, pathway_name, pathway_fun, pathway_loc, comment
     • All entries are of type longtext

  6. Database Schema
     • PathwayTable
     • Information about “gene pathway association”
     • gene_id, pathway_id
     • gene_id – type varchar(255)
     • pathway_id – type longtext

  7. Database Schema
     • BiologicalProcessTable
     • Gene Ontology related table
     • Information about “biological processes” of a particular gene
     • gene_id, GO_num, biological_process
     • gene_id – foreign key (type varchar(255))
     • All other entries are of type longtext

  8. Database Schema
     • CellularComponentTable
     • Gene Ontology related table
     • Information about “cellular component”
     • gene_id, GO_num, cellular_component
     • gene_id – foreign key (type varchar(255))
     • All other entries are of type longtext

  9. Database Schema
     • MolecularFunctionTable
     • Gene Ontology related table
     • Information about “molecular functions”
     • gene_id, GO_num, molecular_function
     • gene_id – foreign key (type varchar(255))
     • All other entries are of type longtext
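The table descriptions above can be written out as CREATE TABLE statements. The following is a minimal sketch, not part of the original project files: it builds the statements for GeneIDTable and BiologicalProcessTable as strings. The class and method names are hypothetical, the SQL dialect is assumed to be MySQL (suggested by the longtext type, but not stated on the slides), and the assumption that the foreign key points at GeneIDTable.gene_id is mine.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds CREATE TABLE statements from the column lists on the slides.
final class SchemaSketch {

    // Joins "name type" pairs and an optional constraint into one statement.
    static String createTable(String table, Map<String, String> columns, String constraint) {
        StringJoiner body = new StringJoiner(", ", "CREATE TABLE " + table + " (", ")");
        columns.forEach((name, type) -> body.add(name + " " + type));
        if (constraint != null) {
            body.add(constraint);
        }
        return body.toString();
    }

    // GeneIDTable: gene_id and gene_type are varchar(255), everything else longtext.
    static String geneIdTable() {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("gene_id", "VARCHAR(255)");
        cols.put("gene_name", "LONGTEXT");
        cols.put("gene_seq", "LONGTEXT");
        cols.put("protein_id", "LONGTEXT");
        cols.put("protein_name", "LONGTEXT");
        cols.put("protein_seq", "LONGTEXT");
        cols.put("gene_type", "VARCHAR(255)");
        return createTable("GeneIDTable", cols, "PRIMARY KEY (gene_id)");
    }

    // BiologicalProcessTable: gene_id is a foreign key.
    // Assumption: it references GeneIDTable.gene_id (the slides do not name the target).
    static String biologicalProcessTable() {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("gene_id", "VARCHAR(255)");
        cols.put("GO_num", "LONGTEXT");
        cols.put("biological_process", "LONGTEXT");
        return createTable("BiologicalProcessTable", cols,
                "FOREIGN KEY (gene_id) REFERENCES GeneIDTable (gene_id)");
    }
}
```

CellularComponentTable and MolecularFunctionTable would follow the same pattern as BiologicalProcessTable, with the third column renamed.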

  10. Steps to Follow – Step 1
     • Get the RefSeq accession number of your species from the NCBI Genome database
     • e.g. NC_000913 for Escherichia coli K12

  11. Steps to Follow – Step 2
     • Download the files needed from the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov)
     • genomes/Bacteria/[species name]/[RefSeq #].gbk (main information for genes, proteins, and GO functions)
       • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.gbk
     • genomes/Bacteria/[species name]/[RefSeq #].ffn (gene sequence)
       • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.ffn
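The two file locations in Step 2 follow one pattern, which can be sketched as a small helper. The class and method names below are hypothetical, and the directory layout is copied from the slide, so it may not match what the server offers today.

```java
// Hypothetical helper: assembles the Step 2 download locations from the slide's path pattern.
final class NcbiPaths {

    static final String FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov";

    // speciesDir is the directory name under genomes/Bacteria,
    // refSeq is the accession from Step 1, extension is "gbk" or "ffn".
    static String genomeFile(String speciesDir, String refSeq, String extension) {
        return FTP_ROOT + "/genomes/Bacteria/" + speciesDir + "/" + refSeq + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(genomeFile("Escherichia_coli_k12", "NC_000913", "gbk"));
        System.out.println(genomeFile("Escherichia_coli_k12", "NC_000913", "ffn"));
    }
}
```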

  12. Steps to Follow – Step 3
     • Go to the KEGG selected organisms list (http://www.genome.jp/kegg/catalog/org_list.html)
     • Find your species and click its entry in the second column (e.g. eco for E. coli)
     • Go to “pathway maps” to get the pathway information to put into the PathwayFunc table

  13. Steps to Follow – Step 4
     • Use the eutils functions of NCBI Entrez to get the file that contains the gene pathway association (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)
     • Use esearch to search for your species in the gene database
       • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=database&term=query&usehistory=y
     • Use efetch to fetch the result file
       • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=database&WebEnv=WebEnvString&query_key=key
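The esearch and efetch calls in Step 4 are chained: with usehistory=y, the esearch reply carries a WebEnv string and a QueryKey value that are then passed to efetch. The sketch below only builds the two URLs and extracts those two values from a reply. It performs no network access, the class and method names are hypothetical, it assumes Java 10 or later for the URLEncoder overload, and the regex extraction is a shortcut for a flat XML reply rather than a real XML parser.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for Step 4: builds the eutils URLs shown on the slide.
final class EutilsSketch {

    static final String BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    // esearch with usehistory=y, so the reply carries WebEnv and QueryKey.
    static String esearchUrl(String db, String term) {
        return BASE + "esearch.fcgi?db=" + encode(db) + "&term=" + encode(term) + "&usehistory=y";
    }

    // efetch for the result set stored on the history server by the esearch call.
    static String efetchUrl(String db, String webEnv, String queryKey) {
        return BASE + "efetch.fcgi?db=" + encode(db)
                + "&WebEnv=" + encode(webEnv) + "&query_key=" + encode(queryKey);
    }

    // Returns the text of the first <tag>...</tag> element, or null if it is absent.
    static String firstElement(String xml, String tag) {
        Matcher m = Pattern.compile("<" + tag + ">([^<]*)</" + tag + ">").matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

A typical sequence would be: request `esearchUrl("gene", "Escherichia coli K12[orgn]")`, read `WebEnv` and `QueryKey` from the reply with `firstElement`, then request `efetchUrl("gene", webEnv, queryKey)`.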

  14. Steps to Follow – Step 5
     • Edit the .gbk file to remove its beginning and end parts
     • Parse the .gbk and .ffn files to fill all the tables except the PathwayFunc table and the Pathway table
     • Link to the sample parser file: Parse.java
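For the .ffn half of Step 5, the file is in FASTA format: each record starts with a header line beginning with ">" and is followed by one or more sequence lines. The sketch below is not the course's Parse.java, whose contents are not shown here. It is a minimal illustration with hypothetical class and method names that maps each header to its concatenated sequence. Mapping a header to a gene_id is left out, because that depends on the header layout of the downloaded file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reads FASTA records (as found in a .ffn file) from a list of lines.
final class FfnSketch {

    // Returns header (without the leading '>') -> full sequence, in file order.
    // A repeated header replaces the earlier record.
    static Map<String, String> parseFasta(List<String> lines) {
        Map<String, StringBuilder> building = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                current = new StringBuilder(); // start a new record
                building.put(line.substring(1), current);
            } else if (current != null) {
                current.append(line); // sequence lines may wrap, so join them
            }
        }
        Map<String, String> records = new LinkedHashMap<>();
        building.forEach((header, seq) -> records.put(header, seq.toString()));
        return records;
    }
}
```

On Java 11 or later, the input could come from `Files.readAllLines(Path.of("NC_000913.ffn"))`, which is fine for a bacterial genome but reads the whole file into memory.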

  15. Steps to Follow – Step 6
     • Parse the eutils result file to get the gene pathway association
     • Link to the sample parsePath file: ParsePath.java

  16. Database Name Format
     • Example species: Escherichia Coli K12
     • Species name: Escherichia_Coli_K12
     • Database name: escherichia_coli_k12
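The naming rule in the example above amounts to replacing spaces with underscores for the species name and lowercasing the result for the database name. A minimal sketch, with hypothetical class and method names:

```java
import java.util.Locale;

// Hypothetical helper: derives the species and database names used in the example.
final class DbNameSketch {

    // "Escherichia Coli K12" -> "Escherichia_Coli_K12"
    static String speciesName(String displayName) {
        return displayName.trim().replaceAll("\\s+", "_");
    }

    // "Escherichia Coli K12" -> "escherichia_coli_k12"
    static String databaseName(String displayName) {
        return speciesName(displayName).toLowerCase(Locale.ROOT);
    }
}
```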

  17. Sample Output File
     • outputFile.txt (output file after parsing the .gbk and .ffn files)
     • outputPath.txt (output file after parsing the gene pathway association file)
     • PathwayFunc.txt (output file after analyzing the KEGG pathways)

  18. To Find the Number of Genes
     • Search for your species in the NCBI gene database
     • e.g. Escherichia coli K12 [orgn]
     • Check the number of genes in your parsed output against the number reported by this search

  19. Submit your project (the 3 output files, plus the parsers if you changed them) to:
     • vgummulu@cise.ufl.edu
     • Any questions:
       • yizhang@cise.ufl.edu
       • anupamd@ufl.edu
