
Motif Finding Workshop Project

Motif Finding Workshop Project. Chaim Linhart January 2008. Outline. 1. Some background again… 2. The project. 1. Background. Slides with Ron Shamir and Adi Akavia. Gene: from DNA to protein. Pre-mRNA. Mature mRNA. DNA. protein. transcription. splicing. translation. DNA.


Presentation Transcript


  1. Motif Finding Workshop Project • Chaim Linhart • January 2008

  2. Outline 1. Some background again… 2. The project

  3. 1. Background Slides with Ron Shamir and Adi Akavia

  4. Gene: from DNA to protein • DNA → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein

  5. DNA • DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T } • Resides in chromosomes • Complementary strands: A-T ; C-G • Forward/sense strand: AACTTGCG • Reverse-complement/anti-sense strand: TTGAACGC when written base-by-base beneath the forward strand (i.e. 3’ to 5’); read 5’ to 3’ it is CGCAAGTT • Directional: from 5’ to 3’: (upstream) 5’ end AACTTGCGATACTCCTA 3’ end (downstream)
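The strand rules on this slide can be sketched in a few lines of Java. This is only an illustration: the class and method names (`DnaStrand`, `complement`, `reverseComplement`) are hypothetical, and mapping every non-ACGT symbol to N is an assumption.

```java
/** Illustrative helper (hypothetical names): complement and reverse complement of a DNA string. */
final class DnaStrand {
    private DnaStrand() {}

    /** Complement of one base (A-T, C-G); anything else, including N, maps to N (assumption). */
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N';
        }
    }

    /** Reverse complement, returned in upper case and read 5' to 3'. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        // Walk the input from its 3' end so the result reads 5' to 3'.
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```

For the forward strand AACTTGCG, `DnaStrand.reverseComplement` returns CGCAAGTT, which is the complementary strand read in its own 5’ to 3’ direction.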

  6. Gene structure (eukaryotes) [Figure: DNA with promoter, transcription start site (TSS) and coding strand → transcription (RNA polymerase) → pre-mRNA with exons and intron → splicing (spliceosome) → mature mRNA with 5’ UTR, start codon, coding region, stop codon and 3’ UTR → translation (ribosome) → protein]

  7. Translation • Codon - a triplet of bases, codes a specific amino acid (except the stop codons); many-to-1 relation • Stop codons - signal termination of the protein synthesis process http://ntri.tamuk.edu/cell/ribosomes.html

  8. Genome sequences • Many genomes have been sequenced, including those of viruses, microbes, plants and animals. • Human: • 23 pairs of chromosomes • 3+ Gbps (bps = base pairs), only ~3% are genes • ~25,000 genes • Yeast: • 16 chromosomes • 20 Mbps • 6,500 genes

  9. Regulation of Expression • Each cell contains an identical copy of the whole genome - but utilizes only a subset of the genes to perform diverse, unique tasks • Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, physiological condition • Main regulatory mechanism – transcriptional regulation

  10. Transcriptional regulation [Figure: a gene with its TSS and 5’/3’ ends; two TFs bound to binding sites (BS)] • Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs) • TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream of the gene’s transcription start site (TSS) • BSs of a particular TF share a common pattern, or motif • Some TFs operate together – TF modules

  11. TFBS motif models • Consensus (“degenerate”) string: (A/C) A C T (C/G) T [Figure: promoters of genes 1–10, some of them containing the sites AACTGT, CACTGT, CACTCT, CACTGT, AACTGT] • Statistical models… • Motif logo representation
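As a small illustration of the consensus model, the sketch below derives a degenerate string from a list of aligned, equal-length sites. The class name `ConsensusString` and the bracket notation (`[AC]` for "A or C") are assumptions made for this example; the slides do not prescribe a notation.

```java
import java.util.List;
import java.util.TreeSet;

/** Illustrative sketch (hypothetical names): degenerate consensus of aligned binding sites. */
final class ConsensusString {
    private ConsensusString() {}

    /** Assumes a non-empty list of sites that all have the same length. */
    static String of(List<String> sites) {
        int length = sites.get(0).length();
        StringBuilder consensus = new StringBuilder();
        for (int pos = 0; pos < length; pos++) {
            // Collect the distinct bases seen at this position, in alphabetical order.
            TreeSet<Character> bases = new TreeSet<>();
            for (String site : sites) {
                bases.add(Character.toUpperCase(site.charAt(pos)));
            }
            if (bases.size() == 1) {
                consensus.append(bases.first());
            } else {
                consensus.append('[');
                for (char base : bases) {
                    consensus.append(base);
                }
                consensus.append(']');
            }
        }
        return consensus.toString();
    }
}
```

Applied to the five sites on the slide (AACTGT, CACTGT, CACTCT, CACTGT, AACTGT), this yields `[AC]ACT[CG]T`. A consensus built this way ignores how often each base occurs, which is what the statistical models mentioned on the slide capture.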

  12. Human G2+M cell-cycle genes: The CHR – NF-Y module • TFs: NF-Y, CHR
      CDCA3 (trigger of mitotic entry 1): CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
      CDCA8 (cell division cycle associated 8): TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
      CDC2 (cell division control protein 2 homolog): CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
      CDC42EP4 (cdc42 effector protein 4): GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
      CCNB1 (G2/mitotic-specific cyclin B1): AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
      CCNB2 (G2/mitotic-specific cyclin B2): TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
      • BSs are short, non-specific, hiding in both strands and at various locations along the promoters

  13. The computational challenge • Given a set of co-regulated genes (e.g., from gene expression chips) • Find a motif that is over-represented (occurs unusually often) in their promoters • This may be the TF binding site motif • Find TF modules – over-represented motifs that tend to co-occur

  14. The computational challenge (II) • Motifs can also be found w/o a given target-set – “genome-wide” • Find a motif that is localized – occurs more often near the TSS of genes • Find a motif with a strand bias – occurs more often on the genes’ coding strand • Find TF modules with biases in their order / orientation / distance

  15. Motif finding algorithms • >100 motif finding algs • Main differences between them: • Type of analysis & input: • Target-set vs. genome-wide • Single vs. multi-species (conservation) • Single motifs vs. modules • Motif model • Score for evaluating motif • Motif search technique: • Combinatorial (enumeration) vs. Statistical optimization

  16. Example - Amadeus • Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle: CHR, NF-Y [motif logos not preserved]

  17. 2. The project

  18. General goals • Develop software from A-Z: • Design • Implementation • (Optimization) • Execution & analysis of real data • A taste of bioinformatics • Have fun • Get credit…

  19. The computational task • Given a set of DNA sequences • Find “interesting” pairs of motifs: • Order bias • Other scores… • Main challenges: • Performance (time, memory) • Output redundancy
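One possible way to address the time and memory challenge, offered here purely as an assumption and not as the approach the project requires, is to encode each short string as an integer using 2 bits per base. With motif lengths of at most 10, a code fits comfortably in an `int` and can index an array of counters directly. The names `KmerCodec`, `encode` and `decode` are hypothetical.

```java
/** Illustrative sketch (hypothetical names): 2-bit packing of short DNA strings (k up to 15). */
final class KmerCodec {
    private KmerCodec() {}

    /** Encodes seq[start, start + k) as an int, or returns -1 if it contains a non-ACGT symbol such as N. */
    static int encode(String seq, int start, int k) {
        int code = 0;
        for (int i = start; i < start + k; i++) {
            int bits;
            switch (seq.charAt(i)) {
                case 'A': case 'a': bits = 0; break;
                case 'C': case 'c': bits = 1; break;
                case 'G': case 'g': bits = 2; break;
                case 'T': case 't': bits = 3; break;
                default: return -1; // unknown or masked base
            }
            code = (code << 2) | bits;
        }
        return code;
    }

    /** Inverse of encode for a code that represents exactly k bases. */
    static String decode(int code, int k) {
        char[] bases = new char[k];
        for (int i = k - 1; i >= 0; i--) {
            bases[i] = "ACGT".charAt(code & 3);
            code >>>= 2;
        }
        return new String(bases);
    }
}
```

With this encoding, the number of distinct strings of length k is 4^k (about one million for k = 10), so per-string counters can live in a plain array rather than a hash map keyed by `String`.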

  20. Input • File with DNA sequences in “fasta” format:
      >sequence-name1 <space> [header1]
      ACCCGNNNNTCGGAAATGANN
      CGGAGTAAAATATGCGAGCGT
      >sequence-name2 <space> [header2]
      cggattnnnaccgcannnnnnnnaccgtga
      >sequence-name3 <space> [header3]
      agtttagactgctagctcgatcgcta
      gcggatnggctannnnnatctag

  21. Input (II) • Ignore the header lines • Sequence may span multiple lines or one long line • Sequence contains the characters A,C,G,T,N in upper or lower case • “N” means unknown or masked base • Sample input files will be supplied
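The input rules above can be sketched as a small parser. This is a minimal illustration under stated assumptions: the class name `FastaReader` is hypothetical, header lines are skipped entirely, sequences are normalized to upper case, and any text before the first header is ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch (hypothetical names): reads the sequences of a FASTA file, ignoring headers. */
final class FastaReader {
    private FastaReader() {}

    /** Takes the lines of a FASTA file and returns one upper-case string per sequence. */
    static List<String> parse(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null; // null until the first header line is seen
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.charAt(0) == '>') {
                // A header starts a new sequence; its text is ignored.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                // A sequence may span several lines; join them.
                current.append(line.toUpperCase(Locale.ROOT));
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }
}
```

The lines can be obtained with `java.nio.file.Files.readAllLines(path)`. For very large inputs a streaming reader would use less memory, which may matter given the performance goals of the project.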

  22. Input (III) • Search parameters: • Length of motifs (between 5 and 10) • Min. + Max. distance between the motifs, e.g. ACGGATTGATNNNTGGATGCCAT (distance=9) • Single vs. two strands search • Min. number of occurrences (hits) of pair, e.g. GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA (hits were marked on the original slide) • Max. p-value • Additional parameters… (don’t count overlaps, e.g. AAAAAA)
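To make the distance parameters concrete, the sketch below counts occurrences of one string followed by another within a gap range, on a single strand. Several points are assumptions rather than part of the project specification: "distance" is taken to be the number of bases between the end of the first string and the start of the second, overlapping occurrences are all counted, and the names (`OrderedPairCounter`, `countOrdered`) are hypothetical.

```java
/** Illustrative sketch (hypothetical names): counts "first ... second" hits within a gap range. */
final class OrderedPairCounter {
    private OrderedPairCounter() {}

    /**
     * Counts positions where `first` occurs and `second` starts between minGap and maxGap
     * bases after the end of `first` (single strand, overlapping occurrences included).
     */
    static int countOrdered(String seq, String first, String second, int minGap, int maxGap) {
        int hits = 0;
        int firstLength = first.length();
        // Visit every occurrence of `first`, including overlapping ones.
        for (int i = seq.indexOf(first); i >= 0; i = seq.indexOf(first, i + 1)) {
            for (int gap = minGap; gap <= maxGap; gap++) {
                // startsWith returns false when the offset runs past the end of seq.
                if (seq.startsWith(second, i + firstLength + gap)) {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

Calling `countOrdered(seq, a, b, min, max)` and `countOrdered(seq, b, a, min, max)` over all input sequences gives the two counts compared by the order-bias score. Scanning once per pair is far too slow for all pairs of strings, so a real implementation would enumerate positions once and update counters for every pair it encounters.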

  23. Output • A list of the string pairs with the best order-bias score (smallest p-values):
      Motif A   Motif B   A→B   B→A   p-value
      ACGTT     GGATT      97    17   4.3E-15
      ACGTT     GATTC      87    16   2.7E-13
      TTAAC     CAGCC      31   114   1.2E-12
      • A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores

  24. Part A: String pairs with order bias • nA = # of A→B ; nB = # of B→A • WLOG, nA > nB • n = nA + nB • H0 = random order: nA ~ B(n, 0.5) • p-value = probability of at least nA occurrences of A→B = tail of B(n, 0.5) • Normal approximation (central limit thm.) • Fix for multiple testing: ×2 (multiply the p-value by 2)
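The order-bias score can be sketched as follows. Note the substitution: this example sums the exact binomial tail in log space instead of using the normal approximation named on the slide, which keeps it self-contained with the standard library only. The names `OrderBias`, `binomialTail` and `pValue` are hypothetical, and capping the doubled value at 1 is an assumption.

```java
/** Illustrative sketch (hypothetical names): order-bias p-value from the tail of B(n, 0.5). */
final class OrderBias {
    private OrderBias() {}

    /** P(X >= k) for X ~ B(n, 0.5), computed as an exact sum in log space. */
    static double binomialTail(int n, int k) {
        // logFactorial[i] = ln(i!), built incrementally.
        double[] logFactorial = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            logFactorial[i] = logFactorial[i - 1] + Math.log(i);
        }
        double nLog2 = n * Math.log(2.0);
        double sum = 0.0;
        for (int i = Math.max(k, 0); i <= n; i++) {
            // ln( C(n, i) * 0.5^n )
            double logTerm = logFactorial[n] - logFactorial[i] - logFactorial[n - i] - nLog2;
            sum += Math.exp(logTerm);
        }
        return Math.min(1.0, sum);
    }

    /** Takes the larger of the two counts as nA, then doubles the tail (capped at 1). */
    static double pValue(int countAB, int countBA) {
        int n = countAB + countBA;
        int larger = Math.max(countAB, countBA);
        return Math.min(1.0, 2.0 * binomialTail(n, larger));
    }
}
```

For example, with 3 occurrences of A→B and 1 of B→A, the tail is (4 + 1) / 16 = 0.3125 and the doubled value is 0.625. The exact sum costs O(n) per pair; the normal approximation avoids that cost, which may matter since only Part A is timed.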

  25. Part B: Non-redundant list of motif pairs • Collect similar strings into a motif with a better score (motif = consensus):
      String pair (p-value)          Motif pair
      ACGTT , GGATT (4.3E-15)
      ACGAT , GGATT (2.4E-11)        [motif-pair logos not preserved] (8.1E-31)
      AGGAT , GGTTT (1.7E-5)
      AGGTT , GGTTT (5.9E-5)
      • Don’t report similar motif pairs: • Motifs that consist of similar strings • Motif pairs that are small shifts of one another • Palindromes
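Two of the redundancy checks listed above can be sketched with simple string comparisons. The slides do not define "similar" or "small shift", so the criteria here are assumptions: similarity is measured as the number of mismatching positions (Hamming distance), and a shift means that one string's suffix equals the other's prefix after dropping a few bases. The names `MotifSimilarity`, `hamming` and `isShiftOf` are hypothetical.

```java
/** Illustrative sketch (hypothetical names): simple similarity tests between equal-length strings. */
final class MotifSimilarity {
    private MotifSimilarity() {}

    /** Number of positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    /** True if a and b overlap exactly when one is shifted by 1..maxShift positions. */
    static boolean isShiftOf(String a, String b, int maxShift) {
        for (int shift = 1; shift <= maxShift && shift < a.length(); shift++) {
            int overlap = a.length() - shift;
            // a's suffix equals b's prefix, or b's suffix equals a's prefix.
            if (a.regionMatches(shift, b, 0, overlap) || b.regionMatches(shift, a, 0, overlap)) {
                return true;
            }
        }
        return false;
    }
}
```

For instance, ACGTT and ACGAT differ in one position, and GGATT and GATTC are shifts of one another by one base. Thresholds such as the maximum number of mismatches or the maximum shift would need to be chosen as part of the design.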

  26. Part B (cont.): Additional score • Option I: Co-occurrence rate • N = total # of sequences • sA = # of sequences that contain motif A • sB = # of sequences that contain motif B • sAB = # of sequences that contain motifs A and B • H0 = motifs occur independently and randomly • p-value = probability of at least sAB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution
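The co-occurrence score can be sketched as an exact hypergeometric tail. This is an illustration only: the names `CoOccurrence` and `hypergeometricTail` are hypothetical, the parameter `sB` (number of sequences containing motif B) is assumed by analogy with sA, and the course's supplied statistics package may offer this computation already.

```java
/** Illustrative sketch (hypothetical names): co-occurrence p-value as a hypergeometric tail. */
final class CoOccurrence {
    private CoOccurrence() {}

    /** ln C(n, k), using a precomputed table of ln(i!). */
    private static double logChoose(double[] logFactorial, int n, int k) {
        return logFactorial[n] - logFactorial[k] - logFactorial[n - k];
    }

    /**
     * P(X >= sAB), where X is the number of sequences containing both motifs if the sB
     * sequences with motif B were drawn at random from all `total` sequences, sA of which
     * contain motif A.
     */
    static double hypergeometricTail(int total, int sA, int sB, int sAB) {
        double[] logFactorial = new double[total + 1];
        for (int i = 1; i <= total; i++) {
            logFactorial[i] = logFactorial[i - 1] + Math.log(i);
        }
        // Support of the distribution: max(0, sA + sB - total) <= x <= min(sA, sB).
        int low = Math.max(sAB, Math.max(0, sA + sB - total));
        int high = Math.min(sA, sB);
        double logDenominator = logChoose(logFactorial, total, sB);
        double sum = 0.0;
        for (int x = low; x <= high; x++) {
            double logTerm = logChoose(logFactorial, sA, x)
                    + logChoose(logFactorial, total - sA, sB - x)
                    - logDenominator;
            sum += Math.exp(logTerm);
        }
        return Math.min(1.0, sum);
    }
}
```

As a check on a small case: with N = 10, sA = 4, sB = 3 and sAB = 2, the tail is (6·6 + 4·1) / 120 = 1/3.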

  27. Part B (cont.): Additional score • Option II: Distance bias – is the distance between the two motifs uniform (H0), or are there specific distances that are very common? • Option III: Gap variability – are the sequences between the motifs conserved (H0), or are they highly variable? • Other options??

  28. Implementation • Java (Eclipse) ; Linux • GUI: Simple graphical user interface for supplying the input parameters and reporting the results • Packages for motif logo and statistical scores will be supplied • Time performance will be measured only for part A • Reasonable documentation • Separate packages for data-structures, scores, GUI, I/O, etc.

  29. Design document • Due in 3 weeks (Feb 24) • 3-5 pages (Word), Hebrew/English • Briefly describe main goal, input and output of program • Describe main data structures, algorithms, and scores for parts A+B • Meet with me before submission

  30. Fin
