Homework 1 and 2 review session

Presented by Kirill Bessonov, November 2012

HW1: classical Q & A (GenomeGraphs) (1)
The first two questions were on Bioconductor libraries.
• There are 608 Bioconductor (BioC) packages
• To get the citation for a particular library, use citation("library_name")
You were asked to get genomic data on a specific gene:
    library(GenomeGraphs)
    # connect to the Ensembl database and select the human genes dataset
    ensembl_Human_Genes = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    # get info on the gene from the database, using its Ensembl ID
    gene <- makeGene(id = "ENSG00000115145", type = "ensembl_gene_id", biomart = ensembl_Human_Genes)
    # get info on its transcripts
    transcript <- makeTranscript(id = "ENSG00000115145", type = "ensembl_gene_id", biomart = ensembl_Human_Genes)
    # plot the gene and its transcripts
    gdPlot(list("gene" = gene, "transcripts" = transcript))
    # retrieve info from the database, displaying the first 25 entries
    getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
          filters = c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),
          values = list(TRUE, TRUE, TRUE),
          mart = ensembl_Human_Genes)[1:25, ]

HW1: classical Q & A (GenomeGraphs) (2)
What is the gene name (i.e. hgnc_symbol) and function represented by the Ensembl ID ENSG00000115145?
    geneInfo = getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
                     filters = c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),
                     values = list(TRUE, TRUE, TRUE),
                     mart = ensembl_Human_Genes)
    > geneInfo[geneInfo$ensembl_gene_id == "ENSG00000115145", ]
         ensembl_gene_id hgnc_symbol description
    4829 ENSG00000115145 STAM2       signal transducing adaptor molecule (SH3 domain and ITAM motif) 2
How many exons does the Ensembl ID ENSG00000115145 have? 51 exons
    > attr(gene, "ens")
       ensembl_gene_id ensembl_transcript_id ensembl_exon_id exon_chrom_start exon_chrom_end rank strand biotype
    1  ENSG00000115145 ENST00000263904       ENSE00001351655 153032117        153032506      1    -1     protein_coding
    2  ENSG00000115145 ENST00000263904       ENSE00002888710 153006659        153006743      2    -1     protein_coding
    ……
    48 ENSG00000115145 ENST00000494589       ENSE00002785037 153004538        153004636      3    -1     protein_coding
    49 ENSG00000115145 ENST00000494589       ENSE00002808134 153003676        153003822      4    -1     protein_coding
    50 ENSG00000115145 ENST00000494589       ENSE00002929781 153001402        153001471      5    -1     protein_coding
    51 ENSG00000115145 ENST00000494589       ENSE00001828491 153000503        153000527      6    -1     protein_coding

HW1: classical Q & A (GenomeGraphs) (3)
Execute the following command. How many chromosomes do you see?
    > getBM("chromosome_name", "", "", ensembl_Human_Genes)[c(1:22, 433:435), 1]
    [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "3" "4" "5" "6" "7" "8" "9" "MT" "X" "Y"
25 chromosomes: 22 autosomal pairs, 1 sex pair and one mitochondrial chromosome.
Why is the number of chromosomes in this Ensembl dataset greater than the 23 chromosome pairs? What do "MT", "X" and "Y" refer to?
"MT" is the mitochondrial chromosome; "X" and "Y" are the sex chromosomes. The dataset lists X and Y separately, although they can be grouped into a single pair, and adds the MT chromosome, so it contains 25 names rather than 23.
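The counting argument on this slide can be checked with a few lines of base R. This is only a sketch: the vector `chrom` is typed in by hand from the output shown on the slide (no database query is made), and the names `kind`, `counts` and `natural` are illustrative.

```r
# Chromosome names as printed on the slide (copied by hand, not queried)
chrom <- c("1", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
           "2", "20", "21", "22", "3", "4", "5", "6", "7", "8", "9",
           "MT", "X", "Y")

# Classify each name: purely numeric names are autosomes
kind <- ifelse(grepl("^[0-9]+$", chrom), "autosome",
        ifelse(chrom %in% c("X", "Y"), "sex chromosome", "mitochondrial"))
counts <- table(kind)
print(counts)

# The names are character strings, hence the "1", "10", "11" ordering above.
# Ordering by the integer value puts them in the usual order; the
# non-numeric names become NA and are placed last.
natural <- chrom[order(suppressWarnings(as.integer(chrom)))]
print(natural)
```

The three classes add up to the 25 names returned by the query.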

HW2: Pairwise alignments (classical Q&A)

HW2: Pairwise alignments (classical Q&A) Q1
• Please align the following DNA sequences globally, using the Needleman–Wunsch algorithm.
• Use the following scoring rules: a) gap -5; b) match between two bases +5; c) mismatch between two bases +3
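To check a hand-made alignment against these scoring rules, one can sum the column scores. The sketch below assumes the scores exactly as stated on the slide; the function name `score_alignment` and the two example sequences are hypothetical and are not the homework sequences.

```r
# Score an already aligned pair of sequences column by column.
# Defaults follow the slide: gap -5, match +5, mismatch +3.
score_alignment <- function(alnA, alnB, match = 5, mismatch = 3, gap = -5) {
  x <- strsplit(alnA, "")[[1]]
  y <- strsplit(alnB, "")[[1]]
  stopifnot(length(x) == length(y))   # aligned strings must have equal length
  is_gap <- x == "-" | y == "-"
  sum(ifelse(is_gap, gap, ifelse(x == y, match, mismatch)))
}

# Hypothetical example: 6 matching columns and 1 gap column
score_alignment("GATTACA", "GA-TACA")   # 6 * 5 + 1 * (-5) = 25
```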

HW2: Pairwise alignments (classical Q&A) Q3
Perform a local protein alignment of the sequences HEAGAWGHEE and PAWHAE using the BLOSUM62 matrix. The scoring rules are: a) gap -8; b) matches and mismatches are scored according to the BLOSUM62 matrix.
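A minimal sketch of the score matrix for this question, assuming a Smith-Waterman style recursion with a linear gap penalty of -8. To keep it self-contained, only the BLOSUM62 entries for the six residues that occur in the two sequences are typed in by hand; the names `blosum62_sub` and `local_score_matrix` are illustrative.

```r
# BLOSUM62 entries for the residues A, E, G, H, P, W (hand-copied subset)
aa <- c("A", "E", "G", "H", "P", "W")
blosum62_sub <- matrix(c(
   4, -1,  0, -2, -1, -3,
  -1,  5, -2,  0, -1, -3,
   0, -2,  6, -2, -2, -2,
  -2,  0, -2,  8, -2, -2,
  -1, -1, -2, -2,  7, -4,
  -3, -3, -2, -2, -4, 11),
  nrow = 6, byrow = TRUE, dimnames = list(aa, aa))

# Local alignment score matrix: like the global F-matrix, but the borders
# are 0 and no cell is allowed to drop below 0
local_score_matrix <- function(A, B, S, d = -8) {
  a <- strsplit(A, "")[[1]]
  b <- strsplit(B, "")[[1]]
  H <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1,
              dimnames = list(c("", a), c("", b)))
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      H[i + 1, j + 1] <- max(0,
                             H[i, j] + S[a[i], b[j]],   # align a[i] with b[j]
                             H[i, j + 1] + d,           # gap in B
                             H[i + 1, j] + d)           # gap in A
    }
  }
  H
}

H <- local_score_matrix("HEAGAWGHEE", "PAWHAE", blosum62_sub)
print(H)
best <- max(H)                               # score of the best local alignment
print(which(H == best, arr.ind = TRUE))      # cell where the trace-back starts
```

The trace-back for a local alignment starts at the highest-scoring cell and stops at the first cell containing 0.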

HW2: Pairwise alignments (classical Q&A) Q5
Produce a dot plot of the Human and Mouse p53 proteins from the previous question and paste the plot below. Complete the lines of R code to get the dot plot.
Are both proteins similar? Yes, very similar, since we see a clear diagonal covering >90% of the sequence length.
Where do the region(s) of greatest variation occur? Between positions 50 and 100.
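The idea behind a dot plot can be sketched in base R without any package: put a dot wherever the residue at position i of one sequence equals the residue at position j of the other. The function name `dot_matrix` is illustrative, and the two short sequences are reused from Q3 as toy input; they are not the p53 proteins.

```r
# Logical matrix with TRUE where s1[i] == s2[j]
dot_matrix <- function(s1, s2) {
  x <- strsplit(s1, "")[[1]]
  y <- strsplit(s2, "")[[1]]
  outer(x, y, "==")
}

m <- dot_matrix("HEAGAWGHEE", "PAWHAE")
hits <- which(m, arr.ind = TRUE)   # (row, col) positions of identical residues

# Draw the dots only in an interactive session
if (interactive()) {
  plot(hits[, "row"], hits[, "col"], pch = 15,
       xlab = "position in sequence 1", ylab = "position in sequence 2")
}
```

Similar sequences give a long run of dots along the main diagonal; breaks or shifts in that diagonal mark regions of variation or insertions and deletions.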

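HW2: Pairwise alignments (classical Q&A) Q7
What global alignment score do you get for the two p53 proteins when you use the BLOSUM62 alignment matrix, a gap opening penalty of -10 and a gap extension penalty of -0.5?
Answer: score of 1556
    library(seqinr)       # query(), getSequence(), c2s()
    library(Biostrings)   # pairwiseAlignment()
    choosebank("swissprot")                           # select the database before querying
    p53_HUMAN = query("p53_HUMAN", "AC=P04637")       # human p53, UniProt accession P04637
    p53_HUMAN_seq = c2s(getSequence(p53_HUMAN)[[1]])  # sequence as a single string
    p53_MOUSE = query("p53_MOUSE", "AC=P02340")       # mouse p53, UniProt accession P02340
    p53_MOUSE_seq = c2s(getSequence(p53_MOUSE)[[1]])
    closebank()
    globalAlign <- pairwiseAlignment(p53_HUMAN_seq, p53_MOUSE_seq,
                                     substitutionMatrix = "BLOSUM62",
                                     gapOpening = -10, gapExtension = -0.5)
    score(globalAlign)    # global alignment score
Common errors: the R code was not shown, and the protein IDs (e.g. UniProt ID P04637) were not given.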

HW2: Computer Style • Implementation of NW algorithm in R

HW2: Computer style (NW algorithm) [1]
    for i=0 to length(A)
        F(i,0) ← d*i
    for j=0 to length(B)
        F(0,j) ← d*j
    for i=1 to length(A) {
        for j=1 to length(B) {
            Match  ← F(i-1,j-1) + S(Ai, Bj)
            Delete ← F(i-1, j) + d
            Insert ← F(i, j-1) + d
            F(i,j) ← max(Match, Insert, Delete)
        }
    }
d = gap penalty score; i and j = positions in the A and B sequences
• Given the pseudo-code, implement the NW algorithm in R
• The algorithm has two parts
  • Calculation of the alignment F-matrix
  • Finding the optimal path(s) through the matrix

HW2: Computer style (NW algorithm) [2]
    Fmatrix = function(A, B){
      fmatrix = matrix(0, nrow = nchar(A) + 1, ncol = nchar(B) + 1)
      d = -8                                # this is the gap penalty
      for(i in 0 : nchar(A)){
        fmatrix[i+1, 1] = d * i             # populates the first column with gap penalties
      }
      for(j in 0 : nchar(B)){
        fmatrix[1, j+1] = d * j             # populates the first row with gap penalties
      }
      for(i in 1 : nchar(A)){
        for(j in 1 : nchar(B)){
          # score for the current pair of aa or nt; rules() is expected to
          # return the substitution score for one pair of residues
          score  = rules(substr(A, i, i), substr(B, j, j))
          match  = fmatrix[i, j] + score    # diagonal cell
          delete = fmatrix[i, j+1] + d      # cell above
          insert = fmatrix[i+1, j] + d      # cell to the left
          fmatrix[i+1, j+1] = max(match, delete, insert)
        }
      }
      # label rows and columns with the residues (first row/column is blank)
      colnames(fmatrix) = strsplit(paste(" ", B, sep = ""), "")[[1]]
      rownames(fmatrix) = strsplit(paste(" ", A, sep = ""), "")[[1]]
      return(fmatrix)
    }

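HW2: Computer style (NW algorithm) [3]
    > s.matrix
       A  C  G  T
    A  2 -1 -1 -1
    C -1  2 -1 -1
    G -1 -1  2 -1
    T -1 -1 -1  2
    # substitution matrix: match +2, mismatch -1 (rows and columns in the same order)
    s.matrix <- matrix(rep(0, 16), nrow = 4, ncol = 4, byrow = TRUE,
                       dimnames = list(c("A","C","G","T"), c("A","C","G","T")))
    s.matrix["A",] = c( 2,-1,-1,-1)
    s.matrix["C",] = c(-1, 2,-1,-1)
    s.matrix["G",] = c(-1,-1, 2,-1)
    s.matrix["T",] = c(-1,-1,-1, 2)
    # returns the score for one pair of nucleotides, e.g. rules("A", "T")
    rules = function(a, b){
      return(s.matrix[a, b])
    }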

HW2: Computer style (NW algorithm) [4]
• Check the F-matrix (match +2, mismatch -1, gap -8)
    > A = "ATCG"; B = "TG"; d = -8
    > fmatrix = Fmatrix(A, B)
    > fmatrix
            T   G
        0  -8 -16
    A  -8  -1  -9
    T -16  -6  -2
    C -24 -14  -7
    G -32 -22 -12
• Start finding the optimal path(s) through the matrix, beginning at the bottom-right cell
    AlignmentA = ""
    AlignmentB = ""
    i = nchar(A) + 1
    j = nchar(B) + 1
• At every step, look at what is around the current F-matrix cell: the diagonal cell, the cell above and the cell to the left

HW2: Computer style (NW algorithm) [5]
Which cell of the F-matrix am I in now? Starting from the bottom-right cell, trace back the path of the optimal alignment. On the diagonal path, the current score equals the diagonal cell plus the score of the aligned pair.
    while(i > 1 && j > 1){
      CurrentScore = fmatrix[i, j]          # score at the current position of the F-matrix
      ScoreDiag    = fmatrix[i - 1, j - 1]  # diagonal cell
      ScoreUp      = fmatrix[i - 1, j]      # cell above
      ScoreLeft    = fmatrix[i, j - 1]      # cell to the left
      # considering if the score came from the diagonal (residues aligned);
      # row i of the F-matrix corresponds to residue i-1 of A, same for j and B
      if(CurrentScore == ScoreDiag + s.matrix[substr(A,i-1,i-1), substr(B,j-1,j-1)]){
        AlignmentA = paste(substr(A,i-1,i-1), AlignmentA, sep = "")
        AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
        i = i - 1
        j = j - 1
      }

HW2: Computer style (NW algorithm) [6]
      # considering if the score came from the cell above (introducing a gap in B)
      else if(CurrentScore == ScoreUp + d){
        AlignmentA = paste(substr(A,i-1,i-1), AlignmentA, sep = "")
        AlignmentB = paste("-", AlignmentB, sep = "")
        i = i - 1
      }
      # considering if the score came from the cell to the left (introducing a gap in A)
      else if(CurrentScore == ScoreLeft + d){
        AlignmentA = paste("-", AlignmentA, sep = "")
        AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
        j = j - 1
      }
    }
    # an edge of the matrix was reached: pad the remaining residues with gaps
    while(i > 1){
      AlignmentA = paste(substr(A,i-1,i-1), AlignmentA, sep = "")
      AlignmentB = paste("-", AlignmentB, sep = "")
      i = i - 1
    }
    while(j > 1){
      AlignmentA = paste("-", AlignmentA, sep = "")
      AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
      j = j - 1
    }
    print(AlignmentA)
    print(AlignmentB)
    cat("Final score :", fmatrix[nchar(A) + 1, nchar(B) + 1], "\n")

HW2: Computer style (NW algorithm) [7]
• The scoring matrices could have been accessed through character indices, which requires no conversion and makes the code faster
• How would one output more than one best possible alignment?
• Please use more comments in your R code
• It would be nice to see the trace-backs visually
• Also, the scoring rules were not stated clearly
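One possible answer to the question about several best alignments is to follow every neighbouring cell that explains the current score, instead of only the first one that matches. The sketch below is self-contained and is not the homework solution: the function name `nw_all`, the simple match/mismatch scoring (+2 / -1) and the gap penalty of -8 are assumptions taken from the example values used on the earlier slides.

```r
# Needleman-Wunsch returning ALL optimal alignments (sketch)
nw_all <- function(A, B, match = 2, mismatch = -1, d = -8) {
  a <- strsplit(A, "")[[1]]
  b <- strsplit(B, "")[[1]]
  s <- function(x, y) if (x == y) match else mismatch

  # F-matrix: borders hold accumulated gap penalties
  Fm <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  Fm[, 1] <- d * (0:length(a))
  Fm[1, ] <- d * (0:length(b))
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      Fm[i + 1, j + 1] <- max(Fm[i, j] + s(a[i], b[j]),   # diagonal
                              Fm[i, j + 1] + d,           # from above
                              Fm[i + 1, j] + d)           # from the left
    }
  }

  # Recursive trace-back: branch into every predecessor that reproduces
  # the current score (scores are integers, so == is a safe comparison)
  walk <- function(i, j, alnA, alnB) {
    if (i == 1 && j == 1) return(list(c(alnA, alnB)))
    out <- list()
    if (i > 1 && j > 1 && Fm[i, j] == Fm[i - 1, j - 1] + s(a[i - 1], b[j - 1]))
      out <- c(out, walk(i - 1, j - 1, paste0(a[i - 1], alnA), paste0(b[j - 1], alnB)))
    if (i > 1 && Fm[i, j] == Fm[i - 1, j] + d)
      out <- c(out, walk(i - 1, j, paste0(a[i - 1], alnA), paste0("-", alnB)))
    if (j > 1 && Fm[i, j] == Fm[i, j - 1] + d)
      out <- c(out, walk(i, j - 1, paste0("-", alnA), paste0(b[j - 1], alnB)))
    out
  }

  list(score = Fm[length(a) + 1, length(b) + 1],
       alignments = walk(length(a) + 1, length(b) + 1, "", ""))
}

res1 <- nw_all("ATCG", "TG")   # a single optimal alignment
res2 <- nw_all("AA", "A")      # two equally good alignments
print(res1)
print(res2)
```

Each element of `alignments` is a pair of gapped strings. The number of optimal paths can grow quickly for long, repetitive sequences, so this approach is mainly useful for small examples such as the homework sequences.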
