This text dives into critical concepts in computer science, including the probability of node failures in distributed systems, explained using the formula f = 1 - (1-p)^n. It illustrates practical examples such as the Map-Reduce approach to data processing and the K-means algorithm for clustering. Key topics include calculating failure probabilities, assigning clusters, and utilizing Bayes' rule in classification problems. This knowledge is essential for effective algorithm design and understanding performance metrics like accuracy and precision.
From W2-S9 Node failure
The probability that at least one of n nodes fails, when each node fails independently with probability p, is: f = 1 - (1-p)^n. When n = 1, f = p.
Suppose p = 0.0001 but n = 10000. Then f = 1 - (1 - 0.0001)^10000 ≈ 0.63. [Why/how? For small p, (1-p)^n ≈ e^(-np); here np = 1, so f ≈ 1 - e^(-1) ≈ 0.63.] This is one of the most important formulas to know (in general).
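A minimal sketch (mine, not from the slides) that checks the exact formula against the e^(-np) approximation:

import math

def prob_any_failure(p, n):
    # Probability that at least one of n nodes fails,
    # when each fails independently with probability p.
    return 1 - (1 - p) ** n

p, n = 0.0001, 10000
print(prob_any_failure(p, n))      # ~0.6321  (exact: 1 - (1-p)^n)
print(1 - math.exp(-n * p))        # ~0.6321  (approx: 1 - e^(-np))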
From W2-S15 Example
• For example, suppose the hash function maps {to, Java, road} to one node. Then:
• (to, 1) remains (to, 1)
• (Java, 1); (Java, 1); (Java, 1) → (Java, [1, 1, 1])
• (road, 1); (road, 1) → (road, [1, 1])
• Now the REDUCE function converts
• (Java, [1, 1, 1]) → (Java, 3), etc.
• Remember this is a very simple example…the challenge is to take complex tasks and express them as Map and Reduce!
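A minimal word-count sketch of the same pipeline (the map_fn/shuffle/reduce_fn names are mine for illustration; a real framework handles the grouping and distribution):

from collections import defaultdict

def map_fn(document):
    # MAP: emit a (word, 1) pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key: (Java,1);(Java,1);(Java,1) -> (Java, [1,1,1])
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # REDUCE: collapse the list to a count: (Java, [1,1,1]) -> (Java, 3)
    return key, sum(values)

pairs = map_fn("to Java road Java road Java")
print([reduce_fn(k, v) for k, v in shuffle(pairs).items()])
# [('to', 1), ('Java', 3), ('road', 2)]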
From W2-S19 Similarity Example [2] Notice, it requires some ingenuity to come up with key-value pairs. This is key to using map-reduce effectively.
From W3-S14 K-means algorithm
Let C = initial k cluster centroids (often selected randomly)
Mark C as unstable
While <C is unstable>
    Assign all data points to their nearest centroid in C.
    Compute the centroids of the points assigned to each element of C.
    Update C as the set of new centroids.
    Mark C as stable or unstable by comparing with the previous set of centroids.
End While
Complexity: O(nkdI), where n: num of points; k: num of clusters; d: dimension; I: num of iterations.
Take away: complexity is linear in n.
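A minimal NumPy sketch of the loop above (random initialization and the exact-match stability test are assumptions; the slide leaves them open):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(max_iter):                          # at most I iterations
        # Assign each point to its nearest centroid: O(nkd) per iteration.
        dists = np.linalg.norm(X[:, None] - C[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster goes empty).
        newC = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                         else C[j] for j in range(k)])
        if np.allclose(newC, C):                       # C is stable: stop
            return newC, labels
        C = newC
    return C, labels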
From W3-S16 Example: 2 Clusters
Four points: A(-1, 2), B(1, 2), C(-1, -2), D(1, -2).
K-means Problem: the optimal solution is centroids (0, 2) and (0, -2), and the clusters are {A, B} and {C, D}.
K-means Algorithm: suppose the initial centroids are (-1, 0) and (1, 0); then {A, C} and {B, D} end up as the two clusters. The algorithm gets stuck in a local optimum and never finds the true solution.
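A standalone check of this example, hard-coding the unlucky initial centroids (an illustration of mine, not from the slides):

import numpy as np

X = np.array([[-1, 2], [1, 2], [-1, -2], [1, -2]], dtype=float)  # A, B, C, D
C = np.array([[-1, 0], [1, 0]], dtype=float)   # the unlucky initial centroids

for _ in range(10):
    labels = np.linalg.norm(X[:, None] - C[None, :], axis=2).argmin(axis=1)
    C = np.array([X[labels == j].mean(axis=0) for j in range(2)])

print(labels)   # [0 1 0 1]: clusters {A, C} and {B, D}
print(C)        # [[-1. 0.] [1. 0.]]: stable, never reaches (0, 2) and (0, -2)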
From W4-S21 Bayes Rule
P(H | D) = P(D | H) P(H) / P(D), where P(H) is the Prior and P(H | D) is the Posterior.
From W4-S25 Example: Iris Flower
Choose the flower F that maximizes the posterior. F = Flower; SL = Sepal Length; SW = Sepal Width; PL = Petal Length; PW = Petal Width. [The training data table appeared on the slide.]
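A minimal sketch of the "choose the maximum" step as a Gaussian naive Bayes classifier (the Gaussian class-conditionals and the per-class statistics below are illustrative assumptions; the slide's data table is not reproduced here):

import numpy as np

def gaussian_pdf(x, mean, var):
    # Class-conditional density P(feature = x | F), assumed Gaussian.
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict(x, stats, priors):
    # Posterior is proportional to P(F) * product_i P(x_i | F);
    # return the flower F that maximizes it.
    scores = {f: priors[f] * np.prod(gaussian_pdf(x, m, v))
              for f, (m, v) in stats.items()}
    return max(scores, key=scores.get)

# Hypothetical per-class means and variances for [SL, SW, PL, PW]:
stats = {
    "setosa":     (np.array([5.0, 3.4, 1.5, 0.2]), np.array([0.12, 0.14, 0.03, 0.01])),
    "versicolor": (np.array([5.9, 2.8, 4.3, 1.3]), np.array([0.27, 0.10, 0.22, 0.04])),
}
priors = {"setosa": 0.5, "versicolor": 0.5}
print(predict(np.array([5.1, 3.5, 1.4, 0.2]), stats, priors))   # setosa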
From W5-S7 Confusion Matrix
Label 1 is called Positive; label -1 is called Negative.
Let the number of test samples be N, with N = N1 + N2 + N3 + N4, where N1 = true positives, N2 = false positives, N3 = false negatives, N4 = true negatives.
False Positive Rate (FPR) = N2/(N2+N4)
True Positive Rate (TPR) = N1/(N1+N3)
False Negative Rate (FNR) = N3/(N1+N3)
True Negative Rate (TNR) = N4/(N4+N2)
Accuracy = (N1+N4)/(N1+N2+N3+N4)
Precision = N1/(N1+N2)
Recall = N1/(N1+N3)
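A small sketch computing these metrics from true and predicted labels in {1, -1} (the function name and return layout are mine, not the slide's):

def confusion_metrics(y_true, y_pred):
    # Count the four cells of the confusion matrix.
    n1 = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))    # true positives
    n2 = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))   # false positives
    n3 = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))   # false negatives
    n4 = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))  # true negatives
    return {
        "FPR": n2 / (n2 + n4),
        "TPR": n1 / (n1 + n3),                       # same as Recall
        "Accuracy": (n1 + n4) / (n1 + n2 + n3 + n4),
        "Precision": n1 / (n1 + n2),
    }

print(confusion_metrics([1, 1, -1, -1, 1], [1, -1, 1, -1, 1]))
# {'FPR': 0.5, 'TPR': 0.667, 'Accuracy': 0.6, 'Precision': 0.667} (rounded)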
From W5-S9 ROC (Receiver Operating Characteristic) Curves
• Generally a learning algorithm A will return a real number…but what we want is a label {1 or -1}.
• We can apply a threshold T: predict 1 when the score is at least T, and -1 otherwise. Each choice of T yields one (FPR, TPR) point (the slide's figure showed, e.g., TPR = 3/4, FPR = 2/5 at one threshold and TPR = 2/4, FPR = 2/5 at another); sweeping T traces out the ROC curve.
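A minimal sketch of that threshold sweep (the scores and labels are made-up data for illustration):

def roc_points(scores, labels):
    # One (FPR, TPR) point per candidate threshold T.
    P = sum(y == 1 for y in labels)    # number of positives
    N = sum(y == -1 for y in labels)   # number of negatives
    points = []
    for T in sorted(set(scores), reverse=True):
        pred = [1 if s >= T else -1 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(pred, labels))
        fp = sum(p == 1 and y == -1 for p, y in zip(pred, labels))
        points.append((fp / N, tp / P))
    return points

scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, -1, 1, 1, -1, -1, -1, -1]
print(roc_points(scores, labels))   # (FPR, TPR) pairs as T decreases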
From W5-S13 Random Variable
• A random variable X can take values in a set which is:
• Discrete and finite.
• Let's toss a coin and set X = 1 if it's a head and X = 0 if it's a tail. X is a random variable.
• Discrete and infinite (countable).
• Let X be the number of accidents in Sydney in a day. Then X = 0, 1, 2, …
• Infinite (uncountable).
• Let X be the height of a Sydney-sider.
• X could be 150, 150.11, 150.112, …
From W7-S2 These slides are from Tan, Steinbach and Kumar.
From W11-S9 The Key Idea
• Decompose the User x Movie rating matrix into:
• User x Movie = (User x Genre) x (Genre x Movie)
• The number of genres is typically small.
• Or: R ≈ UV.
• Find U and V such that ||R - UV|| is minimized… (see the sketch below)
• Almost like k-means clustering…why?
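A minimal sketch of minimizing ||R - UV|| by gradient descent over the observed entries (the learning rate, step count, and NaN-for-missing convention are all my assumptions for illustration, not the slides'):

import numpy as np

def factorize(R, k, steps=5000, lr=0.01, seed=0):
    # Find U (users x k) and V (k x movies) minimizing ||R - UV||
    # over the observed entries of R (NaN marks a missing rating).
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(R)
    R0 = np.nan_to_num(R)
    U = rng.normal(scale=0.1, size=(R.shape[0], k))
    V = rng.normal(scale=0.1, size=(k, R.shape[1]))
    for _ in range(steps):
        E = mask * (R0 - U @ V)       # error on observed entries only
        U += lr * E @ V.T             # gradient step on U
        V += lr * U.T @ E             # gradient step on V
    return U, V

R = np.array([[5.0, 2.0, np.nan], [4.0, np.nan, 1.0], [np.nan, 3.0, 5.0]])
U, V = factorize(R, k=2)
print(np.round(U @ V, 1))             # reconstruction, including the blanks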
From W11-S15 UV Computation…. This example is from Rajaraman, Leskovec and Ullman: see the textbook (Mining of Massive Datasets).