Explore how social science methods for voting and decision-making can be applied to information technology problems. Topics include computational intractability, limitations on computational power, security, privacy, and more.
 
                
Voting, Meta-search, and Bioconsensus
Fred S. Roberts
Department of Mathematics and DIMACS (Center for Discrete Mathematics and Theoretical Computer Science)
Rutgers University, Piscataway, New Jersey
From Social Science Methods to Information Technology Applications
Over the years, social scientists have developed a variety of methods for dealing with problems of voting, decisionmaking, conflict and cooperation, measurement, etc. These methods, often heavily mathematical, are beginning to find novel uses in a variety of information technology applications.
Such methods will need to be substantially improved to deal with such issues as:
• Computational Intractability
• Limitations on Computational Power/Information
• The Sheer Size of Some of the New Applications
• Learning Through Repetition
• Security, Privacy, and Cryptography
This talk will concentrate on social science methods for dealing with voting and decisionmaking. We will look briefly at various applications of these methods to a variety of information technology problems and then concentrate on a particular application to biological databases.
Voting/Group Decisionmaking
In a standard model for voting, each member of a group gives an opinion. We seek a consensus among these opinions. Sometimes the opinion is just a vote for a first choice among a set of alternative choices or candidates. In other contexts, the opinion might be a ranking of all the alternatives.
Obtaining opinions as rankings among alternatives or candidates can sometimes give a lot more information about a voter’s true preferences than simply obtaining their first choice. But then we have the challenge of defining what we mean by a consensus. In many applications, we seek a ranking that is in some sense a consensus of the rankings provided by all of the voters.
Medians and Means
Among the most important directions of research in the theory of group consensus is the idea that we can obtain a group consensus by first finding a way to measure the distance between any two alternatives or any two rankings. Let M be the set of alternatives (candidates) or the set of rankings of alternatives, and let d(a,b) be the distance between a and b in M. A profile (of opinions) is a vector $(a_1, a_2, \ldots, a_n)$ of points from M.
The median of a profile is the set of all points x of M that minimize $\sum_{i=1}^{n} d(a_i, x)$, and the mean is the set of all points x of M that minimize $\sum_{i=1}^{n} d(a_i, x)^2$.
One very commonly used method for measuring the distance between two rankings of candidates is called the Kemeny-Snell distance: twice the number of pairs of candidates i and j for which i is ranked above j in one ranking and below j in the other, plus the number of pairs that are ranked in one ranking and tied in the other.
Consider the following profile:
Voter 1 ($a_1$): Bush, Gore, Nader
Voter 2 ($a_2$): Bush, Gore, Nader
Voter 3 ($a_3$): Gore, Bush, Nader
In the case of this profile, the Kemeny-Snell median is the ranking x = Bush, Gore, Nader. We have $d(a_1,x) + d(a_2,x) + d(a_3,x) = 0 + 0 + 2 = 2$. However, the Kemeny-Snell mean is the ranking y = Bush-Gore, Nader, in which Bush and Gore are tied for first place, since $d(a_1,y)^2 + d(a_2,y)^2 + d(a_3,y)^2 = 1 + 1 + 1 = 3$, while $d(a_1,x)^2 + d(a_2,x)^2 + d(a_3,x)^2 = 0 + 0 + 4 = 4$.
Note that medians or means need not be unique.
Voter 1: Bush, Gore, Nader
Voter 2: Gore, Nader, Bush
Voter 3: Nader, Bush, Gore
This is the “voter’s paradox” situation. In this case, there are three Kemeny-Snell medians, namely the three rankings in the profile. However, there is a unique Kemeny-Snell mean, the ranking in which all three candidates are tied. Because of non-uniqueness, we think of consensus as defining a function F from the set of profiles to the set of sets of rankings. Kenneth Arrow called this a group consensus function or social welfare function.
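The two profiles above are small enough to check by exhaustive search. The following is a minimal Python sketch, not a method from the talk: it assumes a ranking is stored as a dictionary from candidate to level (equal levels mean a tie), and the helper names `ks_distance`, `weak_orders`, `consensus`, and `strict` are illustrative choices.

```python
from itertools import combinations, product


def ks_distance(r1, r2):
    """Kemeny-Snell distance between two rankings.

    A ranking is a dict mapping candidate -> level (0 is best);
    candidates sharing a level are tied.
    """
    total = 0
    for i, j in combinations(sorted(r1), 2):
        s1 = (r1[i] > r1[j]) - (r1[i] < r1[j])  # -1, 0 (tie) or +1
        s2 = (r2[i] > r2[j]) - (r2[i] < r2[j])
        total += abs(s1 - s2)  # 2 if reversed, 1 if ranked vs. tied, else 0
    return total


def weak_orders(candidates):
    """All rankings of the candidates with ties allowed (13 for three candidates)."""
    seen = set()
    n = len(candidates)
    for levels in product(range(n), repeat=n):
        dense = {v: rank for rank, v in enumerate(sorted(set(levels)))}
        seen.add(tuple(dense[v] for v in levels))
    return [dict(zip(candidates, t)) for t in sorted(seen)]


def consensus(profile, power):
    """Rankings minimizing the sum of d(a_i, x)**power (1: median, 2: mean)."""
    candidates = sorted(profile[0])
    scored = [(sum(ks_distance(a, x) ** power for a in profile), x)
              for x in weak_orders(candidates)]
    best = min(score for score, _ in scored)
    return [x for score, x in scored if score == best]


def strict(*names):
    """Build a ranking without ties from best to worst."""
    return {name: level for level, name in enumerate(names)}


profile = [strict("Bush", "Gore", "Nader"),
           strict("Bush", "Gore", "Nader"),
           strict("Gore", "Bush", "Nader")]
median = consensus(profile, 1)   # Bush, Gore, Nader
mean = consensus(profile, 2)     # Bush and Gore tied, then Nader

paradox = [strict("Bush", "Gore", "Nader"),
           strict("Gore", "Nader", "Bush"),
           strict("Nader", "Bush", "Gore")]
paradox_median = consensus(paradox, 1)  # the three rankings of the profile
paradox_mean = consensus(paradox, 2)    # all three candidates tied
```

Exhaustive search is only feasible for a handful of candidates, since the number of rankings with ties grows very quickly with the number of candidates.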
In case the elements of M are rankings, calculation of medians and means can be quite difficult.
Theorem (Bartholdi, Tovey and Trick, 1989; Wakabayashi, 1986): Under the Kemeny-Snell distance, the calculation of the median of a profile of rankings is an NP-complete problem.
Meta-Search and Other Information Technology Applications of Consensus Methods
Meta-Search
Meta-search is the process of combining the results of several search engines. We seek the consensus of the search engines, whether it is a consensus first choice website or a consensus ranking of websites.
Meta-search has been studied using consensus methods by Cohen, Schapire, and Singer (1999) and Dwork, Kumar, Naor, and Sivakumar (2000). One important point: In voting, there are usually few candidates and many voters. In meta-search, there are usually few voters and many candidates. In this setting, Dwork, et al. developed an approximation to the Kemeny-Snell median that preserves most of its desirable properties and is computationally tractable.
Information Retrieval
We rank documents according to their probability of relevance to a query. Given a number of queries, we seek a consensus ranking.
Collaborative Filtering
We use knowledge of the behavior of multiple users to make recommendations to an active user, for example combining movie ratings by others to prepare an ordered list of movies for a given user.
Consensus methods have been applied to collaborative filtering by Freund, Iyer, Schapire, and Singer (1998), who designed an efficient “boosting system” for combining preferences. Pennock, Horvitz, and Giles (2000) applied consensus methods to develop “Recommender Systems.”
Software Measurement
Combining several different measures or ratings through appropriate consensus methods is an important topic in the measurement of the understandability, quality, functionality, reliability, efficiency, usability, maintainability, or portability of software (Fenton and Pfleeger, 1997).
Ordinal Filtering in Digital Image Processing
In one method of noise removal, to check if a pixel is noise, one compares it with neighboring pixels. If its value differs from theirs by more than a certain threshold, one replaces the value of the given pixel with a mean or median of the values of its neighbors. (See Janowitz (1986).)
Related methods are used in models of “distributed consensus.” A number of processors each hold an initial (binary) value. Some of them may be faulty and ignore any protocol, yet it is required that the non-faulty processors eventually agree (reach consensus) on a value.
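The noise-removal step described above can be sketched in a few lines. This is a one-dimensional illustration under stated assumptions rather than the method of Janowitz (1986): the function name `ordinal_filter`, the neighborhood `radius`, and the comparison of each value against the median of its neighbors are all hypothetical choices made for the example.

```python
from statistics import median


def ordinal_filter(signal, threshold, radius=2):
    """Replace values that differ from the median of their neighbors by more than threshold.

    Neighbors are the values within `radius` positions on either side,
    taken from the original signal (not from already-filtered values).
    """
    filtered = list(signal)
    for i, value in enumerate(signal):
        lo = max(0, i - radius)
        hi = min(len(signal), i + radius + 1)
        neighbors = signal[lo:i] + signal[i + 1:hi]
        m = median(neighbors)
        if abs(value - m) > threshold:
            filtered[i] = m  # treat the value as noise
    return filtered


noisy = [10, 11, 10, 90, 12, 11, 10]
cleaned = ordinal_filter(noisy, threshold=20)
```

In an image the same idea applies with a two-dimensional neighborhood around each pixel; the median is often preferred to the mean here because a single noisy neighbor shifts it far less.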
Berman and Garay (1993) developed a protocol for distributed consensus based on the parliamentary procedure known as “cloture” and showed it was very good in terms of a number of important criteria including polynomial computation and communication.
Bioconsensus
In recent years, methods of consensus developed for applications in the social sciences have become widely used in biology. In molecular biology alone, Bill Day has compiled a bibliography of hundreds of papers that use such consensus methods.
The following are some of the ways that bioconsensus problems arise:
• Alternative phylogenies (evolutionary trees) are produced using different methods and we need to choose a consensus tree.
• Alternative taxonomies (classifications) are produced using different models and we need to choose a consensus taxonomy.
• Alternative molecular sequences are produced using different criteria or different algorithms and we need to choose a consensus sequence.
• Alternative sequence alignments are produced and we need to choose a consensus alignment.
Finding a Pattern or Feature Appearing in a Set of Molecular Sequences
In many problems of the social and biological sciences, data is presented as a sequence or “word” from some alphabet $\Sigma$. Given a set of sequences, we seek a pattern or feature that appears widely, and we think of this as a consensus sequence or set of sequences. A pattern is often thought of as a consecutive subsequence of short, fixed length. In biology, such sequences arise from DNA, RNA, proteins, etc.
Why Look for Such Patterns?
Similarities between sequences or parts of sequences lead to the discovery of shared phenomena. For example, it was discovered that the sequence for platelet-derived growth factor, which causes growth in the body, is 87% identical to the sequence for v-sis, a cancer-causing gene. This led to the discovery that v-sis works by stimulating growth.
In recent years, we have developed huge databases of molecular sequences. For example, GenBank has over 7 million sequences comprising 8.6 billion bases. The search for similarity or patterns has extended from pairs of sequences to finding patterns that appear in common in a large number of sequences or throughout the database. To find patterns in a database of sequences, it is useful to measure the distance between sequences. If a and b are sequences of the same length, a common way to define the distance d(a,b) is to take it to be the number of mismatches between the sequences.
To measure how closely a pattern fits into a sequence, we have to measure the distance between sequences of different lengths. If b is longer than a, then d(a,b) could be the smallest number of mismatches in all possible alignments of a as a consecutive subsequence of b. We call this the best-mismatch distance.
Example: a = 0011, b = 111010
Possible alignments:
111010    111010    111010
0011       0011       0011
The three alignments have 3, 3, and 2 mismatches, so the best-mismatch distance is 2, which is achieved in the third alignment.
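The best-mismatch distance can be computed by sliding the shorter word along the longer one. The sketch below is a straightforward illustration; the function names `mismatches` and `best_mismatch` are assumptions made for the example, not names used in the talk.

```python
def mismatches(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(u, v))


def best_mismatch(a, b):
    """Smallest number of mismatches over all placements of a as a
    consecutive subsequence of b (requires len(a) <= len(b))."""
    if len(a) > len(b):
        raise ValueError("a must not be longer than b")
    return min(mismatches(a, b[i:i + len(a)])
               for i in range(len(b) - len(a) + 1))


a, b = "0011", "111010"
per_alignment = [mismatches(a, b[i:i + len(a)]) for i in range(len(b) - len(a) + 1)]
distance = best_mismatch(a, b)
```

Here `per_alignment` lists the mismatch counts of the three alignments in order, and `distance` is their minimum.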
An alternative way to measure d(a,b) is to count the smallest number of mismatches between sequences obtained from a and b by inserting gaps in appropriate places, where a mismatch between a letter of $\Sigma$ and a gap is counted as an ordinary mismatch. We won’t use this alternative measure of distance.
Waterman (1989), Waterman, Galas, and Arratia (1984), Galas, Eggert, and Waterman (1985) and others study the following situation:
• $\Sigma$ is a finite alphabet
• k is a fixed finite number (the pattern length)
• A profile $\Pi = (a_1, a_2, \ldots, a_n)$ consists of a set of words (sequences) of length L from $\Sigma$, with $L \ge k$
We seek a set $F(\Pi) = F(a_1, a_2, \ldots, a_n)$ of consensus words of length k from $\Sigma$.
Here is a small piece of data from Waterman (1989), in which he looks at 59 bacterial promoter sequences:
RRNABP1:  ACTCCCTATAATGCGCCA
TNAA:     GAGTGTAATAATGTAGCC
UVRBP2:   TTATCCAGTATAATTTGT
SFC:      AAGCGGTGTTATAATGCC
Notice that if we are looking for patterns of length 4, each sequence has the pattern TAAT.
However, suppose that we add another sequence:
M1 RNA:   AACCCTCTATACTGCGCG
The pattern TAAT does not appear here. However, it almost appears, since the word TACT appears, and this has only one mismatch from the pattern TAAT. So, in some sense, TAAT is a good consensus pattern. We now make this idea precise.
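The observation about TAAT can be checked directly on the five sequences quoted above. This is a small illustrative sketch; the helper `best_mismatch` and the dictionary layout are assumptions made for the example.

```python
def best_mismatch(pattern, sequence):
    """Fewest mismatches of pattern against any window of the sequence."""
    k = len(pattern)
    return min(sum(x != y for x, y in zip(pattern, sequence[i:i + k]))
               for i in range(len(sequence) - k + 1))


promoters = {
    "RRNABP1": "ACTCCCTATAATGCGCCA",
    "TNAA": "GAGTGTAATAATGTAGCC",
    "UVRBP2": "TTATCCAGTATAATTTGT",
    "SFC": "AAGCGGTGTTATAATGCC",
    "M1 RNA": "AACCCTCTATACTGCGCG",
}

# Distance from the pattern TAAT to each sequence: 0 means an exact occurrence.
distances = {name: best_mismatch("TAAT", seq) for name, seq in promoters.items()}
```

The first four distances are 0 and the distance to M1 RNA is 1, which is the sense in which TAAT "almost appears" there.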
In practice, the problem is a bit more complicated than we have described it. We have long sequences and we consider “windows” of length L beginning at a fixed position, say the jth. Thus, we consider words of length L in a long sequence, beginning at the jth position. For each possible pattern of length k, we ask how closely it can be matched in each of the sequences in a window of length L starting at the jth position.
Formalization
Let $\Sigma$ be a finite alphabet of size at least 2 and $\Pi$ be a finite collection of words of length L on $\Sigma$. Let $F(\Pi)$ be the set of words of length $k \ge 2$ that are our consensus patterns. (We drop the distinction between profile as vector and profile as set.) Let $\Pi = \{a_1, a_2, \ldots, a_n\}$.
One way to define $F(\Pi)$ is as follows. Let d(a,b) be the best-mismatch distance. Consider nonnegative parameters $\lambda_d$ that are monotone decreasing with d, and let $F(a_1, a_2, \ldots, a_n)$ be all those words w of length k that maximize $s(w) = \sum_{i=1}^{n} \lambda_{d(w,a_i)}$.
We call such an F a Waterman consensus. In particular, Waterman and others use the parameters $\lambda_d = (k-d)/k$.
Example: An alphabet used frequently is the purine/pyrimidine alphabet {R,Y}, where R = A (adenine) or G (guanine) and Y = C (cytosine) or T (thymine). For simplicity, it is easier to use the digits 0,1 rather than the letters R,Y. Thus, let $\Sigma = \{0,1\}$ and let k = 2. Then the possible pattern words are 00, 01, 10, 11.
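Suppose $a_1 = 111010$, $a_2 = 111111$. How do we find $F(a_1,a_2)$? We have:
$d(00,a_1) = 1$, $d(00,a_2) = 2$
$d(01,a_1) = 0$, $d(01,a_2) = 1$
$d(10,a_1) = 0$, $d(10,a_2) = 1$
$d(11,a_1) = 0$, $d(11,a_2) = 0$
$s(00) = \sum_i \lambda_{d(00,a_i)} = \lambda_1 + \lambda_2$
$s(01) = \sum_i \lambda_{d(01,a_i)} = \lambda_0 + \lambda_1$
$s(10) = \sum_i \lambda_{d(10,a_i)} = \lambda_0 + \lambda_1$
$s(11) = \sum_i \lambda_{d(11,a_i)} = \lambda_0 + \lambda_0$
As long as $\lambda_0 > \lambda_1 > \lambda_2$, it follows that 11 is the consensus pattern, according to Waterman’s consensus.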
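The computation in this example can be reproduced mechanically. The following is a minimal sketch, assuming the weights $\lambda_d = (k-d)/k$ mentioned above are stored in a list indexed by d; the function names `best_mismatch` and `waterman_consensus` are illustrative, not taken from Waterman's papers.

```python
from itertools import product


def best_mismatch(w, a):
    """Fewest mismatches of the pattern w against any window of the word a."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))


def waterman_consensus(profile, k, weights, alphabet="01"):
    """Return the patterns maximizing s(w) = sum_i weights[d(w, a_i)], and all scores."""
    scores = {}
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        scores[w] = sum(weights[best_mismatch(w, a)] for a in profile)
    top = max(scores.values())
    return sorted(w for w, s in scores.items() if s == top), scores


k = 2
weights = [(k - d) / k for d in range(k + 1)]  # lambda_0 = 1, lambda_1 = 0.5, lambda_2 = 0
consensus, scores = waterman_consensus(["111010", "111111"], k, weights)
```

With these weights the scores of 00, 01, 10, 11 are 0.5, 1.5, 1.5, and 2.0, so 11 is the unique consensus pattern, as in the example.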
Example: Let $\Sigma = \{0,1\}$, k = 3, and consider $F(a_1,a_2,a_3)$ where $a_1 = 000000$, $a_2 = 100000$, $a_3 = 111110$. The possible pattern words are: 000, 001, 010, 011, 100, 101, 110, 111.
$d(000,a_1) = 0$, $d(000,a_2) = 0$, $d(000,a_3) = 2$,
$d(001,a_1) = 1$, $d(001,a_2) = 1$, $d(001,a_3) = 2$,
$d(100,a_1) = 1$, $d(100,a_2) = 0$, $d(100,a_3) = 1$, etc.
$s(000) = \lambda_2 + 2\lambda_0$, $s(001) = \lambda_2 + 2\lambda_1$, $s(100) = 2\lambda_1 + \lambda_0$, etc.
Now, $\lambda_0 > \lambda_1 > \lambda_2$ implies that $s(000) > s(001)$. Similarly, one shows that the score is maximized by 000 or 100. Monotonicity alone doesn’t say which of these two scores is higher.
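To see why monotonicity alone cannot settle the choice between 000 and 100, it may help to write out the difference of the two scores. The block below is a worked check, writing s(w) for the Waterman score and $\lambda_d$ for the weights; the second weight choice, $\lambda_d = k^2 - d^2$, is an assumption picked only for illustration.

```latex
\[
s(000) - s(100)
  = (2\lambda_0 + \lambda_2) - (\lambda_0 + 2\lambda_1)
  = \lambda_0 - 2\lambda_1 + \lambda_2 .
\]
% Linear weights \lambda_d = (k-d)/k with k = 3:
\[
\lambda_0 - 2\lambda_1 + \lambda_2 = 1 - \tfrac{4}{3} + \tfrac{1}{3} = 0
  \quad\Longrightarrow\quad \text{000 and 100 tie.}
\]
% Quadratic weights \lambda_d = k^2 - d^2 with k = 3:
\[
\lambda_0 - 2\lambda_1 + \lambda_2 = 9 - 16 + 5 = -2 < 0
  \quad\Longrightarrow\quad \text{100 is the unique consensus.}
\]
```

The sign of $\lambda_0 - 2\lambda_1 + \lambda_2$ depends on how the weights curve, not only on their being decreasing, so either word can win or the two can tie.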
Other Consensus Functions
The median is the collection of words w of length k which minimize $\sigma(w) = \sum_{i=1}^{n} d(w,a_i)$.
The mean is the collection of words w of length k which minimize $\tau(w) = \sum_{i=1}^{n} d(w,a_i)^2$.
Another measure which it might be of interest to minimize is a convex combination of the median score $\sigma(w) = \sum_i d(w,a_i)$ and the mean score $\tau(w) = \sum_i d(w,a_i)^2$: $\kappa_\mu(w) = \mu\,\sigma(w) + (1-\mu)\,\tau(w)$, $\mu \in [0,1]$. Words which minimize $\kappa_\mu$ will be called the mixed median-mean. This might be of interest if we are not ready to choose either medians or means or want some combination of the two. We might also choose to minimize $\sum_i d(w,a_i)^m$ or $\sum_i \log d(w,a_i)^m$ for fixed m.
Example: Let $\Sigma = \{0,1\}$, k = 2, $\Pi = \{a_1,a_2,a_3,a_4\}$, $a_1 = 1111$, $a_2 = 0000$, $a_3 = 1000$, $a_4 = 0001$. Possible pattern words: 00, 01, 10, 11.
$\sum_i d(00,a_i) = 2$, $\sum_i d(01,a_i) = 3$, $\sum_i d(10,a_i) = 3$, $\sum_i d(11,a_i) = 4$. Thus, 00 is the median.
$\sum_i d(00,a_i)^2 = 4$, $\sum_i d(01,a_i)^2 = 3$, $\sum_i d(10,a_i)^2 = 3$, $\sum_i d(11,a_i)^2 = 6$, so the mean consists of the two words 01 and 10, neither of which is a median.
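The sums in this example can be recomputed with a short script. This is an illustrative sketch; `minimizers` and its `cost` argument are hypothetical names, with `cost` applied to each best-mismatch distance before summing (identity for the median, squaring for the mean).

```python
from itertools import product


def best_mismatch(w, a):
    """Fewest mismatches of the pattern w against any window of the word a."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))


def minimizers(profile, k, cost, alphabet="01"):
    """Patterns of length k minimizing sum_i cost(d(w, a_i)), plus all totals."""
    totals = {}
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        totals[w] = sum(cost(best_mismatch(w, a)) for a in profile)
    low = min(totals.values())
    return sorted(w for w, t in totals.items() if t == low), totals


profile = ["1111", "0000", "1000", "0001"]
median, distance_sums = minimizers(profile, 2, lambda d: d)
mean, squared_sums = minimizers(profile, 2, lambda d: d * d)
```

The totals reproduce the sums quoted in the example, with 00 as the only median and 01, 10 as the two means.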
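Summary of Notation
$s(w) = \sum_{i=1}^{n} \lambda_{d(w,a_i)}$ (Waterman score)
$\sigma(w) = \sum_{i=1}^{n} d(w,a_i)$ (median score)
$\tau(w) = \sum_{i=1}^{n} d(w,a_i)^2$ (mean score)
$\kappa_\mu(w) = \mu\,\sigma(w) + (1-\mu)\,\tau(w)$, $\mu \in [0,1]$ (mixed median-mean score)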
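The Special Case $\lambda_d = (k-d)/k$
Suppose that $\lambda_d = (k-d)/k$. With $\sigma(w) = \sum_i d(w,a_i)$, we have $s(w) = \sum_i \lambda_{d(w,a_i)} = n - (1/k)\sum_i d(w,a_i) = n - \sigma(w)/k$. Thus, for fixed $k \ge 2$, $\Sigma$ of size at least 2, and a set $\Pi$ of any size, for all pattern words w, w' of length k: $\sigma(w) \le \sigma(w') \iff s(w) \ge s(w')$.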
It follows that for fixed $k \ge 2$, $\Sigma$ of size at least 2, and a set $\Pi$ of any size, there is a choice of the parameters $\lambda_d$ so that the Waterman consensus is the same as the median. (This also holds for k = 1 or $\Sigma$ of size 1, but these are uninteresting cases.)
Similarly, one can show that for any fixed $k \ge 2$, $\Sigma$ of size at least 2, and a set $\Pi$ of any size, there is a choice of parameters $\lambda_d$ so that for all pattern words w, w' of length k, the mean score $\tau(w) = \sum_i d(w,a_i)^2$ satisfies $\tau(w) \le \tau(w') \iff s(w) \ge s(w')$. For this choice of $\lambda_d$, a word is a Waterman consensus iff it is a mean.
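More generally, for all rational numbers $\mu \in [0,1]$, for fixed $k \ge 2$, $\Sigma$ of size at least 2, and a set $\Pi$ of any size, there is a choice of parameters $\lambda_d$ so that for all pattern words w, w' of length k, the mixed score $\kappa_\mu(w) = \mu \sum_i d(w,a_i) + (1-\mu) \sum_i d(w,a_i)^2$ satisfies $\kappa_\mu(w) \le \kappa_\mu(w') \iff s(w) \ge s(w')$. For this choice of $\lambda_d$, a word is a Waterman consensus iff it is a mixed median-mean with convex combination depending upon $\mu$.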
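These three claims can be illustrated numerically on a small profile. The sketch below uses the four words 1111, 0000, 1000, 0001 with k = 2; the specific weight lists are assumptions chosen for the illustration (linear, quadratic, and an equal mix of the two, each shifted to stay nonnegative), and the function names are hypothetical.

```python
from itertools import product


def best_mismatch(w, a):
    """Fewest mismatches of the pattern w against any window of the word a."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))


def waterman_consensus(profile, k, weights, alphabet="01"):
    """Patterns of length k maximizing sum_i weights[d(w, a_i)]."""
    scores = {}
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        scores[w] = sum(weights[best_mismatch(w, a)] for a in profile)
    top = max(scores.values())
    return sorted(w for w, s in scores.items() if s == top)


profile = ["1111", "0000", "1000", "0001"]
k = 2

linear = [k - d for d in range(k + 1)]               # 2, 1, 0
quadratic = [k * k - d * d for d in range(k + 1)]    # 4, 3, 0
mixed = [3 - 0.5 * d * d - 0.5 * d for d in range(k + 1)]  # 3, 2, 0 (equal weight on both)

as_median = waterman_consensus(profile, k, linear)
as_mean = waterman_consensus(profile, k, quadratic)
as_mixed = waterman_consensus(profile, k, mixed)
```

On this profile the linear weights select 00 (the median), the quadratic weights select 01 and 10 (the means), and the equal mix leaves 00, 01, and 10 tied.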
What Parameters $\lambda_d$ Give Rise to Median, Mean, or Mixed Median-Mean?
Let us first decide if $\Pi$ can have repeated words. From the point of view of the application, allowing repeats is a reasonable assumption. (Repeats are allowed in the database, or some words have more significance than others.) We shall investigate both the repetitive case, where $\Pi$ is allowed to have repeated words, and the nonrepetitive case.
The following results are joint with Boris Mirkin.
When Do We Get the Median in Waterman Consensus?
Theorem: Suppose k is fixed, $k \ge 2$, $\Sigma$ is an alphabet of at least two letters, and $\lambda_d$ is a sequence with $\lambda_1 < \lambda_0$. Then the following are equivalent:
(a) The equivalence $\sigma(w) \le \sigma(w') \iff s(w) \ge s(w')$, where $\sigma(w) = \sum_i d(w,a_i)$, holds for all words w, w' of length k from $\Sigma$ and all finite nonrepetitive sets $\Pi$ (of any size) of words of length $L \ge k$ from $\Sigma$.
(b) There are constants B, C, B < 0, such that for all $0 \le j \le k$, $\lambda_j = Bj + C$.
In other words, under the hypotheses of the theorem, the median procedure corresponds exactly to the choice of parameters $\lambda_j = Bj + C$, B < 0. Note that $\lambda_j = (k-j)/k$ is a special case of this.
Remark: This theorem (and all subsequent theorems) also holds if we replace “words of length $L \ge k$” by “words of fixed length L, $L \ge k$.”
When Do We Get the Mean in Waterman Consensus?
Theorem: Suppose k is fixed, $k \ge 2$, $\Sigma$ is an alphabet of at least two letters, and $\lambda_d$ is a sequence with $\lambda_1 < \lambda_0$. Then the following are equivalent:
(a) The equivalence $\tau(w) \le \tau(w') \iff s(w) \ge s(w')$, where $\tau(w) = \sum_i d(w,a_i)^2$, holds for all words w, w' of length k from $\Sigma$ and all finite sets $\Pi$ (of any size) of words of length $L \ge k$ from $\Sigma$.
(b) There are constants A, C, A < 0, such that for all $0 \le j \le k$, $\lambda_j = Aj^2 + C$.
In other words, under the hypotheses of the theorem, the mean procedure corresponds exactly to the choice of parameters $\lambda_j = Aj^2 + C$, A < 0.
Note that we require (a) to hold for all finite sets $\Pi$, even those with repetitions. It is a technicality to try to remove this hypothesis. To do so, we have found it necessary to allow a larger alphabet or to take k sufficiently large and also to consider only L larger than k. The first result uses an alphabet of size at least four.
Theorem: Suppose k is fixed, $k \ge 2$, $\Sigma$ is an alphabet of at least four letters, and $\lambda_d$ is a sequence with $\lambda_1 < \lambda_0$. Then the following are equivalent:
(a) The equivalence $\tau(w) \le \tau(w') \iff s(w) \ge s(w')$, where $\tau(w) = \sum_i d(w,a_i)^2$, holds for all words w, w' of length k from $\Sigma$ and all finite nonrepetitive sets $\Pi$ (of any size) of words of length L > k from $\Sigma$.
(b) There are constants A, C, A < 0, such that for all $0 \le j \le k$, $\lambda_j = Aj^2 + C$.
The next result removes the hypothesis of the alphabet being of size at least 4, but adds a hypothesis that $k \ge 3$:
Theorem: Suppose k is fixed, $k \ge 3$, $\Sigma$ is an alphabet of at least two letters, and $\lambda_d$ is a sequence with $\lambda_1 < \lambda_0$. Then the following are equivalent:
(a) The equivalence $\tau(w) \le \tau(w') \iff s(w) \ge s(w')$, where $\tau(w) = \sum_i d(w,a_i)^2$, holds for all words w, w' of length k from $\Sigma$ and all finite nonrepetitive sets $\Pi$ (of any size) of words of length L > k from $\Sigma$.
(b) There are constants A, C, A < 0, such that for all $0 \le j \le k$, $\lambda_j = Aj^2 + C$.
When Do We Get the Mixed Median-Mean in Waterman Consensus?
Recall that $\kappa_\mu(w) = \mu\,\sigma(w) + (1-\mu)\,\tau(w)$, $\mu \in [0,1]$, where $\sigma(w) = \sum_i d(w,a_i)$ and $\tau(w) = \sum_i d(w,a_i)^2$. In the following, we shall assume that $\mu$ is a rational number. It might just be of purely technical interest to figure out what happens when $\mu$ is irrational, but we have not been able to obtain analogous results for this case.
Theorem: Suppose k is fixed, $k \ge 2$, $\Sigma$ is an alphabet of at least two letters, and $\lambda_d$ is a sequence with $\lambda_1 < \lambda_0$. Then the following are equivalent:
(a) The equivalence $\kappa_\mu(w) \le \kappa_\mu(w') \iff s(w) \ge s(w')$ holds for all words w, w' of length k from $\Sigma$ and all finite sets $\Pi$ (of any size) of words of length $L \ge k$ from $\Sigma$.
(b) There are constants D, E, D < 0, such that for all $0 \le j \le k$, $\lambda_j = D(1-\mu)j^2 + D\mu j + E$.
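The direction (b) ⇒ (a) of this theorem can be checked by direct substitution. The block below is a short worked derivation, not the authors' proof; it writes $d_i = d(w,a_i)$, $\sigma(w) = \sum_i d_i$ for the median score, $\tau(w) = \sum_i d_i^2$ for the mean score, and $\kappa_\mu(w) = \mu\sigma(w) + (1-\mu)\tau(w)$ for the mixed score, which are the symbol names assumed in this write-up.

```latex
\begin{align*}
s(w) &= \sum_{i=1}^{n} \lambda_{d_i}
      = \sum_{i=1}^{n} \bigl[ D(1-\mu)\,d_i^{2} + D\mu\,d_i + E \bigr] \\
     &= D\bigl[(1-\mu)\,\tau(w) + \mu\,\sigma(w)\bigr] + nE \\
     &= D\,\kappa_\mu(w) + nE .
\end{align*}
% Since D < 0, s is a strictly decreasing function of \kappa_\mu, hence
\[
\kappa_\mu(w) \le \kappa_\mu(w') \iff s(w) \ge s(w').
\]
```

Setting $\mu = 1$ gives the linear weights $\lambda_j = Dj + E$ of the median theorem, and $\mu = 0$ gives the quadratic weights $\lambda_j = Dj^2 + E$ of the mean theorem. The converse direction, that only weights of this form work, is the substantive part of the theorem and is not addressed by this calculation.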
In other words, under the hypotheses of the theorem, the mixed median-mean procedure corresponds exactly to the choice of parameters $\lambda_j = D(1-\mu)j^2 + D\mu j + E$, D < 0.
If we only want to require that the equivalence in part (a) holds for finite nonrepetitive sets $\Pi$, one way to do so, as with the results about the mean, is to assume that $\Sigma$ is sufficiently large. We have been able to prove this result under the added hypothesis that L > k and the added hypothesis that $\Sigma$ has at least $r(\mu) + 1$ elements, where $2(2-\mu) = s/t$ for s, t positive integers and $r(\mu) = \max\{s-t,\, t+1\}$.
Other Consensus Functions as Special Cases of Waterman’s Consensus
It would be interesting to study conditions under which other consensus methods are special cases of Waterman’s. Of particular interest might be such consensus methods as minimizing $\sum_i d(w,a_i)^m$ or minimizing $\sum_i \log d(w,a_i)^m$.
Algorithms
In practical applications in molecular biology, good algorithms for obtaining a consensus pattern are essential. Waterman and his co-authors provide a method for computing their consensus patterns in the case $\lambda_d = (k-d)/k$.
The Brute Force Algorithm
Suppose that D(w,w') is the number of mismatches between two words w, w' of length k. The most naïve algorithm for finding the Waterman consensus would proceed by brute force and calculate all best-mismatch distances d(w,a), for w a potential pattern word of length k and $a \in \Pi$, by calculating D(w,w') for all words w' of length k in a.
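The brute-force procedure just described can be sketched as three nested loops. This is an illustrative sketch rather than Waterman's implementation; the function name `brute_force_distances` and the evaluation counter are assumptions added to make the cost visible.

```python
from itertools import product


def brute_force_distances(profile, k, alphabet):
    """Compute d(w, a) for every pattern w of length k and every word a in the profile.

    Returns the table of best-mismatch distances and the number of
    D(w, w') evaluations that were needed.
    """
    distances = {}
    evaluations = 0
    for letters in product(alphabet, repeat=k):      # every candidate pattern w
        w = "".join(letters)
        for a in profile:                             # every word a in the profile
            best = k
            for start in range(len(a) - k + 1):       # every window w' of length k in a
                window = a[start:start + k]
                mismatches = sum(x != y for x, y in zip(w, window))  # D(w, w')
                evaluations += 1
                best = min(best, mismatches)
            distances[(w, a)] = best
    return distances, evaluations


distances, evaluations = brute_force_distances(["111010", "111111"], 2, "01")
```

For an alphabet of size $|\Sigma|$, n words of length L, and pattern length k, this performs $|\Sigma|^k \cdot n \cdot (L-k+1)$ evaluations of D, each comparing k letters, so the work grows exponentially in the pattern length k.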