Parallel Data Mining Brandon Filer, Austin Davis
What is Data Mining?
Data mining is the process of analyzing large collections (millions or more) of data items, possibly with many variables, to find meaningful patterns in the data set.
What uses does data mining have?
- Have you ever used a shopping site's "You might like" feature? It suggests items that people commonly buy together, e.g. an electric guitar is normally bought with an amp.
- Credit card security: keeps records of your purchase history (internet, types of things bought); extreme purchases are questioned and the account is held or the user is called.
- Netflix's movie suggestions (Netflix challenge)
Recent example
Target found out that a teenage girl was pregnant before her father did, using data mining on her shopping history (Andrew Pole, Target statistician).
[Pole] ran test after test, analyzing the data, and before long some useful patterns emerged. Lotions, for example. Lots of people buy lotion, but one of Pole’s colleagues noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date. As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
How do I data mine?
Data mining commonly uses a neural network or learning machine (see the presentation from last week about neural networks) to learn from the data set using training data (data where the input and output are both known). Feed in the input and adjust the weights until the expected output is produced. There are many different algorithms; 5 will be discussed.
Why parallelize?
For smaller teams/projects, specialized cloud computing can be expensive and include many functions that are not required. Running millions of data items through multiple iterations (common for training) takes time when run sequentially. Some common functions are easily parallelizable. Databases are becoming larger and larger. Performance of computers increases by 10-15% a year while the amount of data collected doubles.
Attribute focusing
The support of a pattern A in a set S is the ratio of the number of transactions containing A to the total number of transactions in S. The confidence of a rule A -> B is the probability that pattern B occurs in S when pattern A occurs in S, and can be defined as the ratio of the support of AB to the support of A. The rule is then described as A -> B [support, confidence]. A strong association rule has a support greater than a predetermined minimum support and a confidence greater than a predetermined minimum confidence. This can also be taken as the measure of “interestingness” of the rule.
From: High Performance Data Mining Using Data Cubes On Parallel Computers
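A minimal sketch of the support and confidence definitions, assuming each transaction is represented as a Python set. The function names and the toy transactions are illustrative and not taken from the source.

```python
def support(transactions, pattern):
    """Fraction of transactions that contain every item in `pattern`."""
    pattern = set(pattern)
    return sum(pattern <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, a, b):
    """Confidence of the rule A -> B: support(AB) / support(A)."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)


# Hypothetical toy data, not from the source.
transactions = [
    {"guitar", "amp"},
    {"guitar", "amp", "strings"},
    {"guitar"},
    {"strings"},
]

sup = support(transactions, {"guitar", "amp"})        # 2 of 4 transactions
conf = confidence(transactions, {"guitar"}, {"amp"})  # 2 of the 3 guitar transactions
print(sup, conf)
```

A rule would count as strong here only if both numbers exceed whatever minimum support and minimum confidence were chosen in advance.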
Interestingness
The “interestingness” measure is the size $I_j(E)$ of the difference between: (a) the probability of E among all such events in the data set, and (b) the probability that $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n$ and $x_j$ occurred independently. The condition of interestingness can then be defined as $I_j(E) > \delta$, where $\delta$ is some fixed threshold.
Applied with a hypercube.
The cube is constructed by partitioning the large data set across processors. Each processor loads its partition into a multidimensional array in which the size of each dimension is the number of unique values of that dimension's attribute (a large array). Each tuple in the multidimensional array is indexed by its attribute values. The measures (values of the tuples) are loaded into their places in the array using a hash-based method. Finally, the processors aggregate.
Calculating interestingness
1. Replicate each single-attribute sub-cube on all processors using a Gather followed by a Broadcast.
2. Perform a Combine operation of ALL (0D cube) followed by a Broadcast to get the correct value of ALL on all processors.
3. Take the ratio of each element of the AB sub-cube and ALL to get P(AB). Similarly, calculate P(A) and P(B) using the replicated sub-cubes A and B.
4. For each element i in AB, calculate |P(AB) - P(A)P(B)| and compare it with a threshold δ, setting AB[i] to 1 if it is greater and to 0 otherwise.
Use of sub-cubes AB, A and B for calculations on 3 processors
|P(AB) - P(A)P(B)| => |0.03 - 0.22*0.08| = 0.0124, which is greater than the interestingness thresholds of 0.01 and 0.001.
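The calculation can be sketched on a single process as follows. This is only an illustration under stated assumptions: the Gather, Combine and Broadcast steps are replaced by plain sums over one in-memory count table, and the function names and the 2x2 table are hypothetical. The last lines reproduce the arithmetic from the slide.

```python
def interestingness(ab_counts):
    """Return |P(AB) - P(A)P(B)| for every cell of a 2-D count table."""
    total = sum(sum(row) for row in ab_counts)            # ALL (0-D cube)
    p_a = [sum(row) / total for row in ab_counts]         # sub-cube A
    p_b = [sum(col) / total for col in zip(*ab_counts)]   # sub-cube B
    return [
        [abs(n / total - p_a[i] * p_b[j]) for j, n in enumerate(row)]
        for i, row in enumerate(ab_counts)
    ]


def flag(ab_counts, delta):
    """Set a cell to 1 if its interestingness exceeds delta, else 0."""
    return [[1 if d > delta else 0 for d in row]
            for row in interestingness(ab_counts)]


# Hypothetical count table: rows are values of A, columns are values of B.
ab_counts = [[30, 10], [20, 40]]
flags_loose = flag(ab_counts, delta=0.05)
flags_strict = flag(ab_counts, delta=0.2)

# The worked figure from the slide: |0.03 - 0.22 * 0.08|
slide_value = abs(0.03 - 0.22 * 0.08)
print(round(slide_value, 4))  # 0.0124
```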
All images from: High Performance Data Mining Using Data Cubes On Parallel Computers
Decision Tree
Target (or dependent) variables are the ones we want to find associations for. A common example is deciding whether or not to play golf. In this example, the target variable is Class (whether or not golf was played).
Decision Tree Algorithm
Works by starting with the target variable, comparing every other variable against it, and picking the 'most associated' variable as the second choice. The tree gains a branch for every option deemed relevant. Repeat until the stopping criteria are met (tree depth, maximum number of nodes, etc.). Prune results that do not increase confidence by the desired amount.
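The slide does not say how 'most associated' is measured. The sketch below assumes information gain (reduction in entropy of the target), which is one common choice and not necessarily the one the authors had in mind. The rows are hypothetical and are not the actual golf data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, target):
    """Pick the attribute assumed to be 'most associated' with the target."""
    attributes = [a for a in rows[0] if a != target]
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical rows; "Class" is the target variable as in the golf example.
rows = [
    {"Outlook": "sunny", "Windy": "no", "Class": "play"},
    {"Outlook": "sunny", "Windy": "yes", "Class": "play"},
    {"Outlook": "rain", "Windy": "no", "Class": "stay"},
    {"Outlook": "rain", "Windy": "yes", "Class": "stay"},
]
choice = best_attribute(rows, "Class")
print(choice)
```

A full tree builder would apply `best_attribute` recursively to each branch until a depth or node limit is reached.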
Gradient Boosted Decision Trees
- Used for search engines
- Easily parallelizable
- Good for ranking and classification
- Small amount of communication overhead when parallelized
From: Parallel Boosted Regression Trees for Web Search Ranking
GBDT
Like the Decision Tree algorithm, but with gradient descent included.
Boosting trees
Gradient boosting finds a predictor h that minimizes the cost C(h):
$C(h) = \sum_{i=1}^{n} (h(x_i) - y_i)^2$
This is accomplished by adding a function $g_t(x)$ to h at each iteration t, where $g_t$ is derived from the derivative (gradient) of the cost function. Finally, trees are built as described previously, with the cost function guiding the construction. Still runs in O(n log n), though already more precise.
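A minimal sketch of boosting with the squared-error cost, under several assumptions not stated in the source: inputs are 1-D, each added function is a single-split tree (a stump), each stump is fitted to the residuals y_i - h(x_i) (the negative gradient of the cost up to a constant factor), and a fixed learning rate scales every update. All names are illustrative.

```python
def fit_stump(xs, residuals):
    """Best single-split regression tree for 1-D inputs (needs >= 2 distinct xs)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # no data on the right side, not a real split
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=50, rate=0.1):
    """Build h as a sum of stumps, each fitted to the current residuals."""
    trees = []

    def h(x):
        return sum(rate * g(x) for g in trees)

    for _ in range(rounds):
        residuals = [y - h(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return h


def cost(h, xs, ys):
    """C(h) = sum over i of (h(x_i) - y_i)^2."""
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: a step function.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]
h = boost(xs, ys)
print(cost(lambda x: 0.0, xs, ys), cost(h, xs, ys))
```

The cost of the boosted predictor should be far below the cost of the all-zero predictor it starts from.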
Parallelize
The costs are still generated and iterated sequentially; only the construction of the trees is parallelized. A master-slave scheme is used: the master finds the 'root' and sends the layer to each slave. The slaves compress their data into histograms and return them to the master. The master finds the best fit, creates the leaves, and sends the next layer to the slaves until termination. Time complexity is reduced to O(n log b), where b << n.
Netflix prize
Description of problem: suggesting movies based upon past ratings. The contest had 480,189 users with 17,770 movies. Users who had rated 18 movies within the data-gathering period were selected; the last 9 ratings were recorded for the data sets. The hold-out set, saved for the Probe, Test, and Quiz sets, consisted of 4.2 million ratings; the rest of the ratings were used for the Training data set. The hold-out set was composed of users who rated few movies and were therefore harder to predict for. One million dollar prize.
Netflix 2 Used GBDT (Gradient Boosted Decision Trees) to blend
Apriori
- Used for association rule mining
- Apriori and Apriori-based algorithms are the most common algorithms used to find associations
- Widely known
- Easy to implement
- Works by finding all associations of size 1 and then working up
- Uses previously found associations to build larger ones
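A minimal sequential sketch of this bottom-up search, assuming a minimum support given as an absolute count. The function name and the toy transactions are illustrative and not from the source.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]  # associations of size 1
    frequent, k = {}, 1
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Build size k+1 candidates from the frequent size k sets, then drop
        # any candidate that has an infrequent size k subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support=3)
print(len(frequent))
```

The pruning step is where the search space shrinks: a larger set is counted only if all of its smaller subsets were already found to be frequent.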
Sequential Apriori Image from: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput499/slides/Lect10/sld054.htm
Advantage of Apriori Reduces search space significantly Image from: http://www.inf.ed.ac.uk/undergraduate/projects/mathiasengvall/
Improving and Parallelizing Association Rule Mining
- It is not easy to parallelize sequential data mining algorithms
Challenges
- Minimize I/O
  - Too much reading and writing from the disk can drastically lower performance
- Evenly distribute the workload
- Minimize communication overhead
- Avoid duplication of work
Ways to parallelize Apriori
- Count Distribution: Give part of the database to each processor, then find the local counts of all associations being searched for. Sum the local counts after each iteration to get the global count.
- Data Distribution: Partition both the database and the candidate associations.
- Candidate Distribution: Do not split up the database; split up only the associations and let each process work independently.
Count Distribution
- Divides data between processes evenly
- Each process gets local counts of associations
- All the processes communicate to get global counts
- The processes then generate candidate sets from the previous sets independently
- The part of the database held by each processor is then scanned to get the new local count
- Again all processes get a global count and generate new candidate sets
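The steps above can be sketched in a single process, with each list in `partitions` standing in for the part of the database held by one processor. This is an illustration only: the communication step is simulated by summing dictionaries, whereas a real implementation would use a message-passing reduction. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, candidates):
    """Counts one process would get by scanning only its own partition."""
    return Counter({c: sum(c <= t for t in partition) for c in candidates})


def count_distribution(partitions, min_support):
    items = {i for part in partitions for t in part for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while candidates:
        per_process = [local_counts(part, candidates) for part in partitions]
        global_counts = Counter()
        for local in per_process:  # stand-in for the global sum exchange
            global_counts.update(local)
        level = {c: n for c, n in global_counts.items() if n >= min_support}
        frequent.update(level)
        # Every process would generate the same next candidates independently.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


# Hypothetical database split across two "processes".
partitions = [
    [{"bread", "milk"}, {"bread", "diapers", "beer", "eggs"},
     {"milk", "diapers", "beer", "cola"}],
    [{"bread", "milk", "diapers", "beer"}, {"bread", "milk", "diapers", "cola"}],
]
result = count_distribution(partitions, min_support=3)
print(len(result))
```

Only the counts cross process boundaries in this scheme; the transactions themselves never move.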
Count Distribution Image from:http://users.eecs.northwestern.edu/~yingliu/papers/para_arm_cluster.pdf
Eclat
- The other well-known association rule mining algorithm
- Very scalable when parallelized
- Data is represented vertically
- Given an association, make a table of the TIDs that contain the association
- Additional associations are made by intersecting previous associations
- Builds from small to large associations and therefore takes advantage of the same major optimization as Apriori
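A minimal sequential sketch of the vertical representation and TID-list intersection, assuming a minimum support given as an absolute count and TIDs assigned by position in the input list. The function names and data are illustrative.

```python
def vertical(transactions):
    """Map each item to the set of transaction IDs (TIDs) that contain it."""
    tids = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault(item, set()).add(tid)
    return tids


def eclat(prefix, items, min_support, out):
    """Depth-first search; `items` is a list of (item, tidset) pairs."""
    for i, (item, tidset) in enumerate(items):
        if len(tidset) < min_support:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tidset)
        # Extend the itemset by intersecting TID sets with the later items.
        suffix = [(other, tidset & other_tids)
                  for other, other_tids in items[i + 1:]]
        eclat(itemset, suffix, min_support, out)
    return out


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
items = sorted(vertical(transactions).items())
found = eclat(frozenset(), items, 3, {})
print(len(found))
```

After the vertical table is built, support counting needs no further database scans: the size of an intersected TID set is the support count.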
Parallelizing Eclat
- Divide the database between processes
- Find local counts for all sets of size one and two
- Find global counts and partition the frequent sets of size 2 by prefix between the processors (each processor gets its own prefix)
- Each process changes its local database to be vertical sets of two items
- Each processor exchanges parts of the vertical database with other processors to get the global parts of the vertical database it has been assigned
- Join pairs of sets assigned to the same processor and find intersections until all sets are found
Parallel Eclat Images from:http://users.eecs.northwestern.edu/~yingliu/papers/para_arm_cluster.pdf
Clustering
- Grouping data with similar attributes together
- Several types of clustering algorithms, each serving different purposes
  - Partitioning, Ranking, Density based, Grid based and more
- Very different results based on choice
- Applications
  - Pattern recognition
  - Market research
  - Gene expression analysis
k-means
- Partitioning algorithm
- Randomly selects k objects to define the initial clusters
- Each object not chosen is then assigned to the closest cluster, based on the distance between it and the currently defined clusters
- After every object has been assigned to a cluster, the mean of each cluster (its centroid) is found
- Each object is then reassigned to the cluster whose mean is the least distance from the object
- Stop when the sum of squared error has converged
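A minimal sequential sketch of these steps, assuming points are tuples of numbers, squared Euclidean distance, and a fixed random seed for the initial selection. The tolerance, iteration cap and toy points are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, seed=0, tol=1e-9, max_iter=100):
    centroids = random.Random(seed).sample(points, k)  # k random objects
    prev_sse = float("inf")
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        sse = 0.0
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
            sse += sq_dist(p, centroids[j])
        if prev_sse - sse <= tol:  # sum of squared error has converged
            break
        prev_sse = sse
        # An empty cluster keeps its old centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, sse


# Hypothetical data: two well-separated groups.
points = [(1, 1), (2, 2), (3, 3), (101, 101), (102, 102), (103, 103)]
centroids, sse = kmeans(points, k=2)
print(sorted(centroids), sse)
```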
Parallel k-means
- Evenly partition data between processors
- Select the k objects
- Each processor assigns its objects to the nearest cluster and computes the error of the data it is holding
- Communicate to get the global sum of squared error and the new means
- Keep assigning data to new clusters and finding the error until the error converges
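The scheme above can be sketched in a single process, with each list in `partitions` standing in for the data held by one processor. This is only an illustration: the communication step is simulated by summing the per-partition results, the initial centroids are passed in as if already selected and shared, and all names and data are hypothetical.

```python
def local_step(partition, centroids):
    """Work done by one processor: per-cluster sums, counts and local error."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    sse = 0.0
    for p in partition:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        j = dists.index(min(dists))  # nearest cluster
        counts[j] += 1
        sse += dists[j]
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts, sse


def parallel_kmeans(partitions, centroids, tol=1e-9, max_iter=100):
    k, dim = len(centroids), len(centroids[0])
    prev_sse = float("inf")
    for _ in range(max_iter):
        results = [local_step(part, centroids) for part in partitions]
        total_sse = sum(r[2] for r in results)  # stand-in for a global sum
        if prev_sse - total_sse <= tol:
            break
        prev_sse = total_sse
        new_centroids = []
        for j in range(k):
            n = sum(r[1][j] for r in results)
            if n == 0:
                new_centroids.append(centroids[j])  # keep an empty cluster's mean
                continue
            new_centroids.append(
                tuple(sum(r[0][j][d] for r in results) / n for d in range(dim)))
        centroids = new_centroids
    return centroids, total_sse


# Hypothetical data split across two "processors".
partitions = [
    [(1, 1), (101, 101), (2, 2)],
    [(3, 3), (102, 102), (103, 103)],
]
final_centroids, final_sse = parallel_kmeans(partitions, [(1, 1), (101, 101)])
print(final_centroids, final_sse)
```

Each processor sends only k sums, k counts and one error value per iteration, so the communication volume does not grow with the number of data points.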
k-means Image from:http://users.eecs.northwestern.edu/~yingliu/papers/para_arm_cluster.pdf
Sources Li, Jianwei, et al. "Parallel data mining algorithms for association rules and clustering." International Conference on Management of Data. 2008.