Sho Murakami, Takuya Yoshihiro, Etsuko Inoue and Masaru Nakagawa

Predicting Combinatorial Protein-Protein Interactions from Protein Expression Data Based on Correlation Coefficient
Sho Murakami, Takuya Yoshihiro, Etsuko Inoue and Masaru Nakagawa
Faculty of Systems Engineering, Wakayama University

Agenda
• Background
• Combinatorial Protein-Protein Interactions
• The Proposed Data Mining Method
• Evaluation
• Conclusion

Background
• Finding interactions among genes/proteins is important.
• Many data-mining algorithms for discovering gene-gene (or protein-protein) interactions have been proposed so far.
• One of the main data sources is gene or protein expression data.
[Figure: a microarray (for gene expression), where color strength is the expression level, and 2D electrophoresis (for protein expression), where the size of a spot is the expression level]

Related Work for Interaction Discovery
• Bayesian Networks
  • Discover interactions from expression data based on conditional probabilities among events
Ex. To discover protein-protein interactions among proteins A, B and C:
1. Define events A, B and C
2. Compute the conditional probabilities relating A, B and C
If a conditional probability is high, an interaction is predicted.
[Figure: samples plotted over the overlapping events A, B and C; one region is the event "C is expressed"]

Problems of Bayesian Networks
• Bayesian networks require a large number of samples
  • For genes: microarrays provide cheap, high-speed experiments
  • For proteins: 2D electrophoresis is time-consuming and expensive
Ex. To discover protein-protein interactions among proteins A, B and C:
1. Define events A, B and C
2. Compute the conditional probabilities relating A, B and C
Are there sufficient samples in each region of the event diagram?
Many samples are necessary to obtain statistically reliable results.
[Figure: samples plotted over the overlapping events A, B and C]

The Objective of Our Study
Finding combinatorial protein-protein interactions from small-size protein expression data

Expression Data
• 2D electrophoresis is performed for each sample; each result contains the expression level of every protein.
• Expression levels are obtained by measuring the size of the spot areas.
• Normalization is applied as pre-processing.
[Figure: 2D-electrophoresis images for sample 1, sample 2, sample 3, ...; each black area indicates a protein, and the size of the area represents its expression level]

Model of Protein-Protein Interaction Considered
• Model: two proteins A and B affect another protein C's expression level only when both A and B are expressed.
• Usually, only the sole effects of A and of B on C are considered.
• Only if both A and B exist does the combinatorial effect act on C.
• We want to estimate this combinatorial effect.
[Figure: A and B form a complex, and the complex of A and B affects the expression level of C]

Predicting Interactions by Correlation Coefficient
• Compute the correlation coefficient between (A,B) and C
• A correlation coefficient requires fewer samples
• The amount of the complex (A,B) is estimated by min(A,B)
• The total effect on C is considered high if the correlation is high
[Figure: expression levels of A and B in one sample; min(A,B) is the estimated amount of the complex of A and B, which is the amount that would affect C; the correlation of min(A,B) and C is computed]
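The estimate on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `pearson` and `complex_amount` and the five-sample expression levels are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def complex_amount(a, b):
    """Estimate the amount of the A-B complex in each sample as min(A, B)."""
    return [min(ai, bi) for ai, bi in zip(a, b)]

# Hypothetical expression levels for five samples (illustration only).
A = [3.0, 5.0, 2.0, 6.0, 4.0]
B = [4.0, 1.0, 5.0, 6.0, 2.0]
C = [3.1, 1.2, 2.0, 6.3, 1.9]

# Correlation between the estimated complex amount and C.
r = pearson(complex_amount(A, B), C)
print(round(r, 3))
```

In this toy data C tracks min(A,B) closely, so the correlation is close to 1 even though neither A nor B alone follows C as well.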

The Problem of Scale Difference
• The expression level that corresponds to one molecule differs among proteins, so equal expression levels of A and B do not necessarily combine in equal amounts.
• Therefore, simply taking the min cannot express the correct amount of the complex.
• Solution: correct the scale of A, so that taking the min leads to the correct amount of the complex.
[Figure: Scaling problem and solution. Expression levels of proteins A and B, the expression level required for a complex, and the estimated number of complexes; before scaling the estimated amount of complex is not correct, and after correcting the scale of A, taking the min gives the correct amount]

How to Determine the Correct Scale?
• Select the scale that yields the maximum correlation coefficient between min(A,B) and C
• If an interaction following our model exists, a high correlation value must appear
• We compute Score S: the total effect of (A, B) on C
[Figure: A is rescaled as A, k1A, k2A, k3A; for each scale, the correlation of min(A,B) with C is computed (example correlations: 0.1, 0.2, 0.3, 0.7), and the maximum gives Score S]
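The scale search can be sketched as a loop over candidate factors k. The slides do not say which scales are tried, so the log-spaced grid `SCALES` below is an assumption, as are the function names and the toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Hypothetical grid of scale factors from 1/4 to 4 (not taken from the slides).
SCALES = [2.0 ** (e / 4) for e in range(-8, 9)]

def score_s(a, b, c, scales=SCALES):
    """Return (Score S, best k): the maximum correlation of min(k*A, B) with C."""
    best_r, best_k = -1.0, None
    for k in scales:
        complex_est = [min(k * ai, bi) for ai, bi in zip(a, b)]
        r = pearson(complex_est, c)
        if r > best_r:
            best_r, best_k = r, k
    return best_r, best_k

# Toy data: C is constructed as min(2*A, B), so the search should recover k = 2.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 3.0]
B = [6.0, 2.0, 4.0, 10.0, 4.0, 5.0, 8.0]
C = [min(2.0 * ai, bi) for ai, bi in zip(A, B)]

s, k = score_s(A, B, C)
print(k, round(s, 3))
```

Only A is rescaled here. Because the correlation coefficient is unchanged when min(k1·A, k2·B) is multiplied by a positive constant, only the ratio of the two scales matters.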

Estimating Combinatorial Effect from Score S
• Score S consists of a "sole effect" and a "combinatorial effect"
• Compute Score S': Score S assuming no combinatorial effect
• The difference between score S and S' is the level of the combinatorial effect
[Figure: Score S is computed from A, B and C; Score S' comes from a statistical distribution computed under the assumption of no combinatorial effect, using only the A-C and B-C relations; the gap between them is the level of combinatorial effect]

How to Compute the Distribution of Score S'?
• Assume that the expression levels of proteins A, B and C follow normal distributions
• Computer simulation yields the distribution of Score S'
① Randomly generate distributions of A, B and C in which the correlation coefficient of A-B is α and that of B-C is β
② Repeat the computation of score S to obtain the distribution of score S'
③ Create a table of the average and stddev for each α and β
We can obtain the distribution for each α and β.
[Figure: distributions of A, B and C linked by correlations α and β; histograms of Score S' for α=0.5, β=0.3 and for α=0.5, β=0.4; a table indexed by α and β with the average in the upper row and the stddev in the lower row of each cell]
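Steps ① to ③ can be sketched as a small Monte Carlo simulation. Several details are assumptions not stated in the slides: A and C are each generated from B plus independent noise (which fixes corr(A,B)=α and corr(B,C)=β and leaves A and C otherwise unrelated), levels are shifted by a hypothetical `mean` so they stay positive, and the scale grid, sample size, and trial count are illustrative.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Hypothetical grid of scale factors (not taken from the slides).
SCALES = [2.0 ** (e / 4) for e in range(-8, 9)]

def score_s(a, b, c, scales=SCALES):
    """Maximum correlation of min(k*A, B) with C over the scale grid."""
    return max(pearson([min(k * ai, bi) for ai, bi in zip(a, b)], c)
               for k in scales)

def simulate_s_prime(alpha, beta, n_samples, n_trials, seed=0, mean=10.0):
    """Return (average, stddev) of Score S' for one (alpha, beta) cell."""
    rng = random.Random(seed)
    wa = math.sqrt(1.0 - alpha ** 2)
    wc = math.sqrt(1.0 - beta ** 2)
    scores = []
    for _ in range(n_trials):
        # Step 1: draw B, then A and C with the requested correlations to B.
        b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        a = [alpha * bi + wa * rng.gauss(0.0, 1.0) for bi in b]
        c = [beta * bi + wc * rng.gauss(0.0, 1.0) for bi in b]
        a_pos = [mean + v for v in a]
        b_pos = [mean + v for v in b]
        # Step 2: score the data, which contains no combinatorial effect.
        scores.append(score_s(a_pos, b_pos, c))
    return statistics.mean(scores), statistics.stdev(scores)

# Step 3: table of (average, stddev) for each (alpha, beta).
table = {(al, be): simulate_s_prime(al, be, n_samples=50, n_trials=30)
         for al in (0.3, 0.5) for be in (0.3, 0.4)}
for key, (avg, sd) in sorted(table.items()):
    print(key, round(avg, 3), round(sd, 3))
```

A real table would need far more trials per cell and a finer grid of α and β; the small numbers here only keep the sketch fast.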

Computing Combinatorial Effect as Z-score
• Place the score S in the corresponding distribution of S'
• Z-score: measures the difference between score S and the average of S' in units of standard deviation
Z-score = (score S - avg(S')) / stddev(S')
• The higher the z-score, the stronger the combinatorial effect
[Figure: the distribution of score S' with its average marked, and score S placed on the same axis; the distance between them, counted in standard deviations, is the z-score, i.e. the level of combinatorial effect]
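The z-score formula translates directly into code. The numbers below are hypothetical and chosen only to show the arithmetic; they are not results from the slides.

```python
def z_score(score_s, mean_s_prime, stddev_s_prime):
    """Distance of Score S from the average of S', in standard deviations."""
    return (score_s - mean_s_prime) / stddev_s_prime

# Hypothetical values: S = 0.7, avg(S') = 0.4, stddev(S') = 0.05.
z = z_score(0.7, 0.4, 0.05)
print(round(z, 2))
```

With these inputs S lies about six standard deviations above the average of S'.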

Summary of the Proposed Algorithm
• Try all combinations of A, B and C
• Compute the maximum correlation coefficient among all scales of A and B to obtain Score S
• Compute the z-scores and create a ranking by them
[Figure: (1) try all combinations of A, B and C from the expression data; (2) compute the max correlation among every scale (e.g. Score S = 0.8); (3) compute z-scores from the distribution of S' (e.g. z-score = 5.5); (4) list all combinations, ranked by z-score]
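The outer loop and the final ranking can be sketched as follows. The slides do not state how the roles of A, B and C are enumerated; this sketch assumes every unordered pair {A, B} is tried against every remaining protein as C. The callback `z_of` stands in for the scoring steps and is hypothetical, as is the toy scoring function.

```python
from itertools import combinations

def rank_triples(protein_ids, z_of):
    """Score every ({A, B}, C) candidate with z_of and sort by z-score, highest first."""
    rows = []
    for a, b in combinations(protein_ids, 2):
        for c in protein_ids:
            if c != a and c != b:
                rows.append((z_of(a, b, c), a, b, c))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows

def toy_z(a, b, c):
    """Hypothetical stand-in for the z-score; a real run would use Score S and S'."""
    return float(10 * c - a - b)

ranking = rank_triples([1, 2, 3, 4], toy_z)
for rank, (z, a, b, c) in enumerate(ranking[:3], start=1):
    print(rank, a, b, c, z)
```

Under this enumeration each unordered triple of proteins is scored three times, once per choice of C.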

Evaluation
• We apply our method to real expression data
  • Protein expression data of black cattle
  • Number of samples: 195; number of proteins: 879
• We search for combinatorial protein-protein interactions using our method

The Expression Data Follows a Normal Distribution
• Using the Jarque-Bera test at a confidence level of 95%, we test whether the expression data follows a normal distribution.
• Result: 454 of the 879 proteins follow a normal distribution
• Thus, we use these 454 proteins for the evaluation
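A normality filter of this kind can be sketched with the standard Jarque-Bera statistic, JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis. Under normality JB is asymptotically chi-square with 2 degrees of freedom, whose upper tail is exp(−JB/2). The function names and test inputs are hypothetical, and the slides do not say which implementation was used.

```python
import math
from statistics import NormalDist

def jarque_bera(x):
    """Return (JB statistic, asymptotic p-value) for a sequence of numbers."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Upper tail of the chi-square distribution with 2 degrees of freedom.
    return jb, math.exp(-jb / 2.0)

def follows_normal(x, significance=0.05):
    """True if normality is not rejected at the given significance level."""
    return jarque_bera(x)[1] >= significance

# Deterministic illustrations: normal quantiles versus a strongly skewed series.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
skewed = [2.0 ** i for i in range(20)]
print(follows_normal(normal_like), follows_normal(skewed))
```

Applied per protein across the samples, a filter like `follows_normal` would keep only the proteins whose expression levels pass the test.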

Results
• We found many combinations of proteins that would have a combinatorial effect
• The maximum z-score is 11.0
• Combinations whose z-score is more than about 5.5 (p-value less than 0.000000019 (= 0.05/454C3)) would have a combinatorial effect at a confidence level of 95%
[Figure: the histogram of z-scores; x-axis: z-score, y-axis: number of combinations]
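The threshold arithmetic can be checked with a short sketch. It assumes that 454C3 denotes the binomial coefficient "454 choose 3" and that the p-value is the one-sided upper tail of the standard normal distribution; neither is spelled out on the slide.

```python
import math

n_proteins = 454
n_triples = math.comb(n_proteins, 3)      # 454 choose 3 = 15,493,204
corrected_level = 0.05 / n_triples        # about 3.2e-9

def upper_tail(z):
    """One-sided upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

tail_at_5_5 = upper_tail(5.5)             # about 1.9e-8

print(n_triples, corrected_level, tail_at_5_5)
```

Under these assumptions the figure 0.000000019 matches the normal tail probability at z = 5.5, while 0.05 divided by the binomial coefficient is smaller, about 3.2e-9.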

Comparing Z-scores with Normal Distribution
• We compare the histogram with the one expected without combinatorial effect
  • The latter is created by scaling the normal distribution by the number of trials (454C3)
• It is inferred that this data includes a considerable amount of combinatorial effect
[Figure: two histograms. Histogram of real data: estimated distribution of z-scores obtained from real data. Histogram without combinatorial effect: distribution of z-scores under the assumption of no combinatorial effect. x-axis: z-score, y-axis: number of combinations]
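The reference histogram can be sketched by multiplying the probability mass of each z-score bin under the standard normal distribution by the number of trials. The bin edges and the function name are hypothetical, and 454C3 is again read as a binomial coefficient.

```python
import math
from statistics import NormalDist

def expected_null_histogram(n_trials, edges):
    """Expected count per bin if z-scores followed a standard normal distribution."""
    std_normal = NormalDist()
    return [n_trials * (std_normal.cdf(hi) - std_normal.cdf(lo))
            for lo, hi in zip(edges, edges[1:])]

n_trials = math.comb(454, 3)
edges = [-6.0 + 0.5 * i for i in range(25)]   # hypothetical bins from -6 to 6
counts = expected_null_histogram(n_trials, edges)

# Expected number of combinations above z = 5.5 under the same assumption.
expected_above_5_5 = n_trials * (1.0 - NormalDist().cdf(5.5))
print(round(expected_above_5_5, 2))
```

Under this assumption fewer than one combination is expected above z = 5.5, so a real histogram with many combinations beyond that point departs visibly from the reference.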

The Ranking Based on Z-score
• The ranking table shows that:
  • Combinations with a low score S are retrieved.
  • The same protein tends to appear many times.
[Table: the ranking of z-scores obtained from real data; columns: Rank, A Protein Num, B Protein Num, C Protein Num, Score S, Correlation of A-C, Correlation of B-C, Z-score]

Conclusion
• Summary
  • We proposed a method to estimate the combinatorial effect of three proteins from protein expression data
  • Applying the method to real data, we found many combinations that would have a combinatorial effect
• Future work
  • To confirm the reliability, we plan to study whether the found combinations include well-known protein-protein interactions.
