
Detection of Transcription Factor Binding Sites

Michael Morra, CSE 4939W. Background: DNA is composed of four chemical bases: adenine (A), thymine (T), guanine (G), and cytosine (C). Each individual organism has a unique DNA sequence.


Presentation Transcript


  1. Detection of Transcription Factor Binding Sites Michael Morra CSE 4939W

  2. Background • DNA is composed of four chemical bases • Adenine – A • Thymine – T • Guanine – G • Cytosine – C Image from: http://www.genetest.org/page5.html

  3. Background (Continued) • Each individual organism has a unique DNA sequence • The DNA sequence contains information which can be used by a cell to construct proteins • Each set of instructions within this sequence is called a gene Image from: http://www.buzzle.com/articles/point-mutations.html

  4. Transcription Factors • Cells regulate the expression of genes using proteins known as transcription factors • Each transcription factor binds to the DNA sequence, turning a gene on or off Image from: http://www.cs.uiuc.edu/homes/sinhas/work.html

  5. Binding Sites • The portions of the DNA where transcription factors are able to bind are known as binding sites • The binding sites of a single transcription factor may vary in sequence from one site to another

  6. Introduction • The detection of binding sites is important to understanding the regulatory network of an organism • As binding sites can vary considerably, searching for them within a DNA sequence is tedious

  7. Project • Implement a method to accurately and precisely locate transcription factor binding sites within a DNA sequence.

  8. Data • 4 species (Human, Mouse, Fruit Fly & Yeast) • Human: 26 transcription factors, 300 binding sites • Mouse: 12 transcription factors, 98 binding sites • Fruit Fly: 6 transcription factors, 51 binding sites • Yeast: 8 transcription factors, 75 binding sites

  9. Multiple Sequence Alignment • To be able to analyze the data effectively, each transcription factor’s binding sites need to be aligned • http://www.ebi.ac.uk/Tools/clustalw2/index.html >s1 GACTTTTCGCT >s2 CGATTTTCTCG >s3 GCATTTTCCCA >s4 AGAGAAAACCC >s5 GAATAACCCAAGAGAAA >s6 ACAGAAAAATC >s7 CGAGAAAATCG >s8 TGGTTTTCCCG >s9 GGGTTTCTCCC

  10. Scoring • Berg and von Hippel method • l = length of the sequence to be scored • j = position in the sequence • tj = base at position j in the sequence to be scored • nj(b) = number of times base b occurs at position j in the alignment • nj(0) = number of times the most common base occurs at position j in the alignment
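
The formula itself does not appear in the transcript. The following reconstruction is an assumption inferred from the worked example on the next slide, where each term has the form log(1.5/2.5) for a count of 1 against a most-common count of 2 and the total matches base-10 logarithms. It uses only the symbols defined above, with a pseudocount of 0.5 added to each count:

```latex
\text{Score} = \sum_{j=1}^{l} \log_{10} \frac{n_j(t_j) + 0.5}{n_j(0) + 0.5}
```

Under this form a sequence that carries the most common base at every position scores 0, and every mismatch contributes a negative term.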

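  11. Scoring Example • Sequence to be scored: ACTCA • Counts of the most common base at each position: n1(0) = 3, n2(0) = 2, n3(0) = 2, n4(0) = 2, n5(0) = 2 • Counts of the bases of ACTCA at each position: n1(A) = 3, n2(C) = 1, n3(T) = 2, n4(C) = 1, n5(A) = 2 • Score = log(1) + log(1.5/2.5) + log(1) + log(1.5/2.5) + log(1) = -0.443697499 (base-10 logarithms)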
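
The example above can be reproduced with a short C++ sketch. It assumes the score is the sum of base-10 logs of (count + 0.5) / (most common count + 0.5), which is inferred from the numbers on this slide rather than stated in it. The names `ColumnCounts`, `baseIndex`, `countBases` and `score` are hypothetical, and since the slide gives only counts, any alignment used to exercise the code has to be made up to match them.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Counts of A, C, G, T in one alignment column (hypothetical helper type).
using ColumnCounts = std::array<int, 4>;

// Map a base to an index 0..3, or -1 for anything else (for example '-' gaps).
int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Build per-column base counts from aligned sequences of equal length.
std::vector<ColumnCounts> countBases(const std::vector<std::string>& aligned) {
    assert(!aligned.empty());
    std::vector<ColumnCounts> counts(aligned[0].size(), ColumnCounts{0, 0, 0, 0});
    for (const std::string& row : aligned) {
        assert(row.size() == counts.size());
        for (std::size_t j = 0; j < row.size(); ++j) {
            int idx = baseIndex(row[j]);
            if (idx >= 0) ++counts[j][idx];  // gaps are not counted
        }
    }
    return counts;
}

// Assumed scoring rule: sum over j of log10((n_j(t_j) + 0.5) / (n_j(0) + 0.5)).
double score(const std::vector<ColumnCounts>& counts, const std::string& seq) {
    assert(seq.size() == counts.size());
    double total = 0.0;
    for (std::size_t j = 0; j < seq.size(); ++j) {
        int idx = baseIndex(seq[j]);
        int observed = (idx >= 0) ? counts[j][idx] : 0;
        int mostCommon = *std::max_element(counts[j].begin(), counts[j].end());
        total += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return total;
}
```

With a made-up four-sequence alignment such as AGTGA, AGTGA, ACACC, GTCAT, the per-position counts for ACTCA match the ones listed on the slide, and `score` returns the same two non-zero terms, log(1.5/2.5) at positions 2 and 4.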

  12. Leave One Out Cross Validation • To determine the effectiveness of the algorithm, a cross validation technique is used • This technique involves leaving one binding site out when the multiple sequence alignment is performed, and then scoring that left-out sequence against the alignment • If the algorithm is effective, the left-out sequence should score higher than most of the other binding sites within that species (more than 80-90% of them)
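
A minimal sketch of this loop in C++ follows, under several assumptions that go beyond the slide: the sites are already aligned, gap-free and of equal length (so the held-out site is simply dropped from the counts instead of re-running the alignment), the background sites have the same length, and the score uses the 0.5 pseudocount form inferred from the scoring example. The function names `scoreWithout` and `leaveOneOut` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Index of a base in "ACGT", or -1 for any other symbol.
int baseIndex(char b) {
    const std::string bases = "ACGT";
    std::size_t pos = bases.find(b);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}

// Score `seq` against the aligned sites, ignoring the row at index `skip`.
double scoreWithout(const std::vector<std::string>& sites, std::size_t skip,
                    const std::string& seq) {
    double total = 0.0;
    for (std::size_t j = 0; j < seq.size(); ++j) {
        std::array<int, 4> n{0, 0, 0, 0};
        for (std::size_t r = 0; r < sites.size(); ++r) {
            if (r == skip) continue;  // leave this site out of the counts
            int idx = baseIndex(sites[r][j]);
            if (idx >= 0) ++n[idx];
        }
        int idx = baseIndex(seq[j]);
        int observed = idx >= 0 ? n[idx] : 0;
        int mostCommon = *std::max_element(n.begin(), n.end());
        total += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return total;
}

// For each site: the fraction of background sequences it strictly outscores
// when it is held out of the counts.
std::vector<double> leaveOneOut(const std::vector<std::string>& sites,
                                const std::vector<std::string>& background) {
    assert(sites.size() >= 2 && !background.empty());
    std::vector<double> fractions;
    for (std::size_t i = 0; i < sites.size(); ++i) {
        double held = scoreWithout(sites, i, sites[i]);
        std::size_t beaten = 0;
        for (const std::string& other : background) {
            assert(other.size() == sites[i].size());
            if (held > scoreWithout(sites, i, other)) ++beaten;
        }
        fractions.push_back(static_cast<double>(beaten) / background.size());
    }
    return fractions;
}
```

Each returned fraction can then be compared with the 80-90% target mentioned on the slide. Ties count as not beaten here, which is a conservative choice of this sketch, not something the slides specify.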

  13. Implementation • C++ • Input • Multiple Sequence Alignment of a transcription factor’s binding sites • All binding sites of a species • Output • Scores • Results of Leave One Out Cross Validation

  14. Desired Functionality • Handle cases where the sequence to be scored is longer or shorter than the multiple sequence alignment • Slide the sequence over the alignment and take the highest-scoring portion
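
One way the sliding step could look in C++ is sketched below. It assumes a per-column count table for A, C, G, T and the 0.5 pseudocount score inferred from the scoring example. The names `ColumnCounts`, `scoreWindow` and `bestSlidingScore` are hypothetical. When the sequence is longer, a window of alignment width moves along the sequence; when it is shorter, the whole sequence moves along the alignment columns.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Counts of A, C, G, T in one alignment column (hypothetical helper type).
using ColumnCounts = std::array<int, 4>;

// Index of a base in "ACGT", or -1 for any other symbol.
int baseIndex(char b) {
    const std::string bases = "ACGT";
    std::size_t pos = bases.find(b);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}

// Score seq[seqStart, seqStart + len) against columns [colStart, colStart + len).
double scoreWindow(const std::vector<ColumnCounts>& counts, const std::string& seq,
                   std::size_t seqStart, std::size_t colStart, std::size_t len) {
    double total = 0.0;
    for (std::size_t k = 0; k < len; ++k) {
        const ColumnCounts& n = counts[colStart + k];
        int idx = baseIndex(seq[seqStart + k]);
        int observed = idx >= 0 ? n[idx] : 0;
        int mostCommon = *std::max_element(n.begin(), n.end());
        total += std::log10((observed + 0.5) / (mostCommon + 0.5));
    }
    return total;
}

// Try every offset of the shorter item along the longer one and keep the best score.
double bestSlidingScore(const std::vector<ColumnCounts>& counts, const std::string& seq) {
    assert(!counts.empty() && !seq.empty());
    const std::size_t len = std::min(counts.size(), seq.size());
    double best = -std::numeric_limits<double>::infinity();
    // Only one of the two loops has more than one offset, unless the lengths are equal.
    for (std::size_t s = 0; s + len <= seq.size(); ++s) {
        for (std::size_t c = 0; c + len <= counts.size(); ++c) {
            best = std::max(best, scoreWindow(counts, seq, s, c, len));
        }
    }
    return best;
}
```

A caveat of this sketch: a sequence shorter than the alignment is scored over fewer columns, so it accumulates fewer negative terms, and its score may not be directly comparable with full-length scores without some normalization.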

  15. Timeline • Oct 4th – Oct 18th • Create multiple sequence alignments for all transcription factors • Oct 18th – Nov 15th • Implement scoring algorithm in C++ • Nov 15th – Nov 29th • Implement leave one out methods • Nov 29th – Dec 6th • Tweaks and Improvements

  16. Questions? Image from: http://www.ideacenter.org/contentmgr/showdetails.php/id/954
