
The DoOP database

The DoOP database. Endre Sebestyén, Agricultural Research Institute of the Hungarian Academy of Sciences. Beyond Next Generation Sequencing Workshop, Budapest, July 20-23, 2011.


Presentation Transcript


  1. The DoOP database Endre Sebestyén Agricultural Research Institute of the Hungarian Academy of Sciences Beyond Next Generation Sequencing Workshop Budapest, July 20-23, 2011

  2. Transcription

  3. Transcription factors and binding sites • Transcription factors • Activator domain and DNA-binding domain • Recognize specific sequence motifs • Binding is influenced by various factors • Binding sites • Short sequence motifs (6-12 bp) • Usually in the promoter, but in theory anywhere (3’ and 5’ UTR, introns, etc.) • Conserved and ambiguous positions

  4. “Real” promoter structure • No general motifs • No TATA-box, GC-box, etc. • Many false-positive TFBSs • With both wet-lab and in silico methods • Sometimes no apparent common TFBSs between coregulated genes

  5. Binding site search and promoter analysis • Wet-lab methods • DNase footprinting • Electrophoretic mobility shift assay • ChIP-chip, ChIP-Seq • In silico methods • Experimentally verified sites • Consensus sequences • Consensus matrices • De novo motif discovery • Oligo frequency • Phylogenetic footprinting • Other methods

  6. Representation of sites • Consensus sequence • IUPAC nomenclature • Ambiguous positions • ACACTSSNWTT • With repeats • ACACTS{1,4}N{1,2}WTT
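
The consensus notation on this slide maps directly onto regular expressions. The following is a minimal Python sketch, assuming the standard IUPAC nucleotide codes and assuming that a repeat such as `{1,4}` applies to the single preceding symbol, as in regex syntax. The function name `consensus_to_regex` and the test sequence are illustrative only and do not come from the presentation.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus, optionally with {m,n} repeats, into a regex string."""
    parts = []
    i = 0
    while i < len(consensus):
        ch = consensus[i]
        if ch == "{":
            # Copy a repeat specifier such as {1,4} through unchanged.
            end = consensus.index("}", i)
            parts.append(consensus[i:end + 1])
            i = end + 1
        else:
            parts.append(IUPAC[ch.upper()])
            i += 1
    return "".join(parts)

# The two consensus strings shown on the slide.
plain = consensus_to_regex("ACACTSSNWTT")
with_repeats = consensus_to_regex("ACACTS{1,4}N{1,2}WTT")
print(plain)
print(with_repeats)

# Hypothetical test sequence, used only to show the pattern in action.
for m in re.finditer(with_repeats, "GGACACTCGATTTCC"):
    print(m.start(), m.group())
```

Keeping the translation as a plain string means the result can also be pasted into other regex-capable tools.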

  7. Representation of sites • Matrices • Position Frequency Matrix (1.) • Position Weight Matrix (3.) • (Position Scoring Matrix)
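
A position frequency matrix can be turned into a position weight matrix in several ways. The sketch below shows one common choice (log2-odds against a background, with a pseudocount), which is an assumption and not necessarily the transformation shown on the slide. The counts in `example_pfm` are made up for illustration.

```python
import math

BASES = "ACGT"

def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert per-position base counts into log2-odds weights.

    pfm maps each base to a list of counts, one per motif position.
    The pseudocount is split across bases in proportion to the background.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}  # uniform background (assumption)
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for pos in range(length):
        total = sum(pfm[b][pos] for b in BASES) + pseudocount
        for b in BASES:
            freq = (pfm[b][pos] + pseudocount * background[b]) / total
            pwm[b].append(math.log2(freq / background[b]))
    return pwm

# Hypothetical counts from 8 aligned sites, 4 positions wide.
example_pfm = {
    "A": [8, 0, 0, 4],
    "C": [0, 8, 0, 0],
    "G": [0, 0, 8, 0],
    "T": [0, 0, 0, 4],
}
example_pwm = pfm_to_pwm(example_pfm)
for base in BASES:
    print(base, [round(w, 2) for w in example_pwm[base]])
```

Positive weights mark bases that are more frequent in the sites than in the background, negative weights mark depleted bases, and the pseudocount keeps unseen bases from producing an infinite penalty.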

  8. Sequence logo
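
In a sequence logo the height of each column is usually its information content in bits, and each letter is scaled by its relative frequency. A minimal Python sketch of that calculation follows, assuming DNA (maximum 2 bits) and omitting the small-sample correction that some logo tools apply. The function names are illustrative.

```python
import math

def column_information(counts):
    """Information content in bits of one alignment column, from base counts."""
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA

def logo_heights(counts):
    """Height of each letter: relative frequency times column information."""
    info = column_information(counts)
    total = sum(counts.values())
    return {base: info * n / total for base, n in counts.items()}

# Hypothetical columns: fully conserved, two-way ambiguous, and uniform.
print(column_information({"A": 8, "C": 0, "G": 0, "T": 0}))
print(logo_heights({"A": 4, "C": 0, "G": 0, "T": 4}))
print(column_information({"A": 2, "C": 2, "G": 2, "T": 2}))
```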

  9. Search for sites • Known motifs represented as consensus seqs • Perl regular expressions • if ($seq =~ /[AT]{1,}CCT[CG]/) { print "got it!\n" } • EMBOSS • http://emboss.sourceforge.net/ • Fuzznuc • [CG](5)TG{A}N(1,5)C
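
A regular-expression search like the Perl one above only sees the strand it is given, while a binding site can lie on either strand. The Python sketch below applies the same pattern to both strands and reports forward-strand coordinates. The helper names and the test sequence are hypothetical, and `finditer` returns non-overlapping matches only, so overlapping sites would need a lookahead-based pattern.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sites(seq, pattern):
    """Return (strand, start, matched_text) tuples; start is 0-based on the forward strand."""
    regex = re.compile(pattern)
    hits = []
    for m in regex.finditer(seq):
        hits.append(("+", m.start(), m.group()))
    for m in regex.finditer(reverse_complement(seq)):
        # Map reverse-strand coordinates back onto the forward strand.
        hits.append(("-", len(seq) - m.end(), m.group()))
    return hits

# Same pattern as the Perl example; the sequence is made up.
hits = find_sites("GGATCCTCAAGAGGAA", r"[AT]{1,}CCT[CG]")
for strand, start, text in hits:
    print(strand, start, text)
```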

  10. Search for sites • Known motifs represented as matrices • TFBS module • Bio::Matrix module • MotifScanner • http://homes.esat.kuleuven.be/~thijs/Work/MotifScanner.html • Using a background model
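
Matrix-based search amounts to sliding the matrix along the sequence and scoring every window. The following is a minimal sketch in Python, assuming a log-odds weight matrix in which a zero-order background is already folded into the weights. It does not reproduce the method of MotifScanner or of the Perl modules named on the slide, and `toy_pwm` and the threshold are invented values.

```python
def score_window(pwm, window):
    """Sum the weight of the observed base at each motif position."""
    return sum(pwm[base][i] for i, base in enumerate(window))

def scan(pwm, seq, threshold):
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in pwm for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = score_window(pwm, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical 3-position log-odds matrix favouring the word ACG.
toy_pwm = {
    "A": [1.5, -2.0, -2.0],
    "C": [-2.0, 1.5, -2.0],
    "G": [-2.0, -2.0, 1.5],
    "T": [-2.0, -2.0, -2.0],
}
for start, window, score in scan(toy_pwm, "TTACGTTNACG", threshold=4.0):
    print(start, window, score)
```

The threshold trades sensitivity against false positives. A richer background model changes how each window is scored, but the scanning loop stays the same.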

  11. Search for sites • De novo motifs • Orthologous genes • Same function in different species • Organ-specific genes • Tissue-specific genes • Developmental stage-specific genes • Etc.

  12. Search for sites • De novo motifs • Short oligo frequency • Expected vs observed frequency • Over- or underrepresented oligos in various promoter groups
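
The expected-versus-observed comparison can be sketched as follows. This Python example assumes the simplest possible expectation, the product of single-base frequencies in the same sequence set. Real tools often use higher-order background models or a separate reference promoter set. The function name and input sequences are illustrative.

```python
from collections import Counter

def oligo_enrichment(seqs, k):
    """Map each observed k-mer to (observed, expected, observed / expected).

    Expected counts come from single-base frequencies in the same sequences.
    """
    base_counts = Counter()
    kmer_counts = Counter()
    windows = 0
    for seq in seqs:
        seq = seq.upper()
        base_counts.update(b for b in seq if b in "ACGT")
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # ignore windows with N etc.
                kmer_counts[kmer] += 1
                windows += 1
    total_bases = sum(base_counts.values())
    freqs = {b: base_counts[b] / total_bases for b in "ACGT"}
    result = {}
    for kmer, observed in kmer_counts.items():
        prob = 1.0
        for b in kmer:
            prob *= freqs[b]
        expected = prob * windows
        result[kmer] = (observed, expected, observed / expected)
    return result

# Hypothetical promoter fragment, ranked by over-representation.
table = oligo_enrichment(["ACGTACGT"], k=2)
for kmer, (obs, exp, ratio) in sorted(table.items(), key=lambda kv: -kv[1][2]):
    print(kmer, obs, round(exp, 4), round(ratio, 2))
```

A ratio well above 1 suggests over-representation and a ratio well below 1 suggests under-representation. With short inputs like this one the ratios are not statistically meaningful.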

  13. Search for sites • De novo motifs • Phylogenetic footprinting • Functional binding sites and regions should be conserved • Sequence alignment • Global/local – ClustalW/Dialign • Dialign is useful where sequences share only local homologies

  14. Global/local alignment

  15. Search for sites • M.K. Das and H.-K. Dai, “A survey of DNA motif finding algorithms,” BMC Bioinformatics, vol. 8, 2007. • M. Tompa, N. Li, T.L. Bailey et al., “Assessing computational tools for the discovery of transcription factor binding sites,” Nature Biotechnology, vol. 23, Jan. 2005, pp. 137-44.

  16. Promoter databases • EPD http://epd.vital-it.ch/ • Eukaryotic Promoter Database • Release 107 • One part based on experimental results (4800) • Maize • Drosophila • Xenopus • Mouse • Human • Etc. • Bulk promoter annotation (13000) • Rice

  17. Promoter databases • DBTSS http://dbtss.hgc.jp/ • Database of Transcriptional Start Sites • Release 7.0 • cDNA 5’ sequencing, exact transcription start sites • Alternative promoters too • Species • Mouse • Rat • Fugu • Etc.

  18. Promoter databases • PlantProm http://mendel.cs.rhul.ac.uk/mendel.php?topic=plantprom • Plant promoters • PromoSer http://biowulf.bu.edu/zlab/PromoSer/ • Human, mouse, rat • SCPD http://rulai.cshl.edu/SCPD/ • Saccharomyces cerevisiae • DCPD http://www-biology.ucsd.edu/labs/Kadonaga/DCPD.html • Drosophila • CEPDB http://rulai.cshl.edu/cgi-bin/CEPDB/home.cgi • C. elegans • NAR database (January) & webserver (July) issue

  19. Transcription factor binding site databases • TRANSFAC http://www.gene-regulation.com/ • Transcription factors, binding sites, literature data • Matrices and consensus sequences

  20. Transcription factor binding site databases • JASPAR http://jaspar.genereg.net/ • Smaller amount of data • Non-redundant • Downloadable in multiple formats • Free

  21. Transcription factor binding site databases • ORegAnno http://www.oreganno.org/ • Open REGulatory ANNOtation database • cisRED http://www.cisred.org/ • Cis-regulatory element database • Based on ENSEMBL • Human, mouse, rat, C. elegans • PLACE http://www.dna.affrc.go.jp/PLACE/ • PlantCARE http://bioinformatics.psb.ugent.be/webtools/plantcare/html/ • Plant binding sites • Based on literature data

  22. Database of Orthologous Promoters • Collection of orthologous eukaryotic promoters • Two sections • Plant: based on the Arabidopsis thaliana genome • Chordate: based on the Homo sapiens genome • Aims • Provide a comprehensive promoter collection • Define and annotate conserved regions • Create a search interface for analysis, wet-lab pre-screening, etc.

  23. DoOP – Database creation

  24. DoOP – Database creation

  25. Cluster number in the plant versions

  26. Cluster number in the chordate versions

  27. DoOP – Cluster subsets • Cluster > Subset • Subset: collection of evolutionarily monophyletic sequences in a cluster • Plant subsets • Brassicaceae • Arabidopsis thaliana • Brassicaceae species • Eudicotyledons • Grape, Solanum species, papaya, tobacco • Magnoliophyta • Maize, rice • Viridiplantae

  28. DoOP – Cluster subsets • Chordate subsets • Primates: primates only • Euarchontoglires: rodents • Eutheria: placental mammals • Theria: marsupials (opossum) • Mammalia: all mammals, incl Prototheria (platypus) • Amniota: birds and reptiles • Tetrapoda: amphibians • Teleostomi: most of the fishes • Vertebrata: all vertebrates • Chordata: all chordates, incl Ciona sp.

  29. Chordate subsets in last version

  30. Global/local alignment

  31. Modified information content value • 80% of maximum value as seeds • Can drop to 65% • Min 5 nucleotides • Max 20% gaps in each contributing sequence • Max 40% gap or N letters
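
One possible reading of the thresholds above is a seed-and-extend scan over per-column conservation scores. The Python sketch below implements only that reading. It assumes columns scoring at least 80% of the maximum act as seeds, that a region may extend through neighbouring columns scoring at least 65%, and that regions shorter than 5 columns are discarded. The gap and N filters are omitted, the "modified information content" itself is not defined here (plain scores with a 2-bit maximum stand in for it), and the function name is hypothetical. It should not be taken as the actual DoOP procedure.

```python
def conserved_regions(scores, max_score=2.0, seed_frac=0.80,
                      extend_frac=0.65, min_len=5):
    """Return half-open (start, end) column ranges of conserved regions.

    A region is a maximal run of columns scoring >= extend_frac * max_score
    that contains at least one seed column scoring >= seed_frac * max_score.
    """
    seed_cut = seed_frac * max_score
    extend_cut = extend_frac * max_score
    regions = []
    i = 0
    n = len(scores)
    while i < n:
        if scores[i] < extend_cut:
            i += 1
            continue
        j = i
        has_seed = False
        # Walk to the end of the run of columns above the extension cutoff.
        while j < n and scores[j] >= extend_cut:
            if scores[j] >= seed_cut:
                has_seed = True
            j += 1
        if has_seed and j - i >= min_len:
            regions.append((i, j))
        i = j
    return regions

# Hypothetical per-column scores in bits.
column_scores = [0.2, 1.4, 1.7, 2.0, 1.9, 1.35, 1.4, 0.1,
                 1.4, 1.4, 1.4, 1.4, 1.4, 0.0]
print(conserved_regions(column_scores))
```

In this example the second run of columns is long enough but contains no seed, so only the first run is reported.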

  32. Motif generation (figure; panel labels: Eudicotyledons, Magnoliophyta, Brassicaceae)

  33. DoOP – Motif number

  34. DoOP – Web interface • Search in the promoter collections • Annotation • Name • Sequence Ids • Taxons • Sequence • Search in the motifs • Search in the promoters with your own motifs

  35. References • E. Sebestyén, T. Nagy, S. Suhai, and E. Barta, “DoOPSearch: a web-based tool for finding and analysing common conserved motifs in the promoter regions of different chordate and plant genes,” BMC Bioinformatics, vol. 10, Jan. 2009. • E. Barta, E. Sebestyén, T.B. Pálfy, G. Tóth, C.P. Ortutay, and L. Patthy, “DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants,” Nucleic Acids Research, vol. 33, 2005, pp. D86-90.

  36. Contact • http://doop.abc.hu • http://www.slideshare.net/razZ0r • sebestyene@mail.mgki.hu
