What is Cluster Analysis? (1/4)

warda

Presentation Transcript


  1. What is Cluster Analysis? (1/4) • Cluster: a collection of data objects (things of a kind flock together) • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Partitioning a diverse group into more homogeneous clusters or subgroups • Clustering is unsupervised classification: no predefined classes • Data gather into clusters according to their own self-similarity; what each cluster means can only be learned through interpretation after the fact. Data Mining

  2. What is Cluster Analysis? (2/4) • Discover hidden phenomena or internal structure

  3. What is Cluster Analysis? (3/4) • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms • Clustering might be the first step in a market segmentation effort • Not: a one-size-fits-all rule for “what kind of promotion do customers respond to best” • Instead: what kind of promotion works best for each cluster (of customers with similar buying habits)

  4. What is Cluster Analysis? (4/4) • User segments and purchasing power on an online shopping site • People with similar profile data usually have similar behavior patterns

  5. What Is Good Clustering? (1/2) • A good clustering method will produce high quality clusters with • high intra-class similarity and low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation. • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns. • Example: among a dozen or so clusters of credit-card usage behavior, one cluster contains a high proportion of bad-debt cases, while the other clusters have no distinguishing features
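
One rough way to make "high intra-class similarity and low inter-class similarity" concrete is to compare average distances within clusters against average distances between clusters. The sketch below is only an illustration: the two small clusters, the Euclidean distance, and the helper name `mean_pairwise_distance` are assumptions, not something the slides specify.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points_a, points_b=None):
    """Mean Euclidean distance within one group, or between two groups."""
    if points_b is None:
        pairs = list(combinations(points_a, 2))
    else:
        pairs = [(p, q) for p in points_a for q in points_b]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical 2-D data: two tight, well-separated clusters.
cluster_1 = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_2 = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# Small distance inside clusters means high intra-class similarity.
intra = (mean_pairwise_distance(cluster_1) + mean_pairwise_distance(cluster_2)) / 2
# Large distance between clusters means low inter-class similarity.
inter = mean_pairwise_distance(cluster_1, cluster_2)

print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

With a different similarity measure the numbers change, which is the slide's point that quality depends on the measure used.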

  6. What Is Good Clustering? (2/2)

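  7. Issues in Cluster Analysis • Which information (features, attributes) to cluster on • Deciding the number of clusters in advance is difficult • Which cluster a data point belongs to should be a matter of degree (fuzzy) rather than a yes-or-no question (crisp) • In unsupervised learning there is no such thing as a best model • Visualization tools vs. clustering algorithms (expert experience)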
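
The fuzzy versus crisp distinction can be sketched with a small example. The code below assumes the membership formula used in fuzzy c-means (with fuzzifier `m`), which the slides do not name; the cluster centers and the query point are hypothetical.

```python
from math import dist


def fuzzy_memberships(point, centers, m=2.0):
    """Degree of membership of one point in each cluster (fuzzy c-means formula)."""
    distances = [dist(point, c) for c in centers]
    hits = [d == 0.0 for d in distances]
    if any(hits):
        # A point lying exactly on a center belongs fully to that center
        # (shared equally if several centers coincide).
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


def crisp_assignment(point, centers):
    """Index of the single nearest center (a yes-or-no assignment)."""
    distances = [dist(point, c) for c in centers]
    return distances.index(min(distances))


# Hypothetical centers and a point that lies between them.
centers = [(0.0, 0.0), (10.0, 0.0)]
point = (4.0, 0.0)

memberships = fuzzy_memberships(point, centers)
print("crisp cluster:", crisp_assignment(point, centers))
print("fuzzy memberships:", [round(u, 3) for u in memberships])
```

The crisp answer hides how close the point is to the boundary, while the fuzzy memberships keep that information.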

  8. A scatter graph helps to understand and visualize clusters of customers (1/2)

  9. A scatter graph helps to understand and visualize clusters of customers (2/2) • Each axis • a purchase of an item associated with that pet • The box at the intersection • the number of customers who purchased the corresponding items • Four segments of customers • Only-dog-owners • Only-cat-owners • Only-fish-owners • Cat-and-dog-owners • The rest can be lumped together as “others”

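  10. Cluster Analysis based on RFM (1/2) • Analyzing RFM values quantifies customers' purchasing behavior and measures their loyalty and contribution, which supports customer segmentation and the targeting of customers • R (Recency): date of the most recent purchase • the time period since the last purchase • F (Frequency): purchase frequency • the number of purchases made in a certain time period • M (Monetary): purchase amount • the amount of money spent during a certain period of time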
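
As a minimal sketch of how R, F, and M might be derived from raw purchases, the code below computes the three values per customer for one period. The transaction records, the choice to measure recency in days up to the end of the period, and the function name `rfm` are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 3, 20), 80.0),
    ("c2", date(2024, 2, 11), 40.0),
    ("c1", date(2024, 3, 28), 50.0),
    ("c2", date(2024, 1, 2), 60.0),
]


def rfm(transactions, period_start, period_end):
    """Return {customer: (recency_days, frequency, monetary)} for one period."""
    acc = {}
    for customer, day, amount in transactions:
        if not (period_start <= day <= period_end):
            continue  # ignore purchases outside the period
        last, freq, money = acc.get(customer, (period_start, 0, 0.0))
        acc[customer] = (max(last, day), freq + 1, money + amount)
    # Recency: days from the last purchase to the end of the period.
    return {
        customer: ((period_end - last).days, freq, money)
        for customer, (last, freq, money) in acc.items()
    }


scores = rfm(transactions, date(2024, 1, 1), date(2024, 3, 31))
for customer, (r, f, m) in sorted(scores.items()):
    print(customer, "R =", r, "days, F =", f, "purchases, M =", m)
```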

  11. Cluster Analysis based on RFM (2/2) • Obtain the customers' RFM values within a given time interval • Run cluster analysis • Average RFM values of each cluster (Vc) are compared with the total average RFM values of all clusters (Vt) • if Vc > Vt then give ↑ else give ↓ • Target customers and marketing strategies • R  F  M : Promising customers • R  F  M : Loyal customers • R  F  M : Vulnerable customers • Some combinations of changes are hard to interpret, and the magnitude of the change is not taken into account
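
The comparison of each cluster's average (Vc) with the total average (Vt) can be sketched as follows. The cluster contents are hypothetical, the total average is assumed to be the mean over all customers, and the labels "up" and "down" stand in for the arrows. No mapping from patterns to customer types is attempted, because the arrow combinations did not survive in this transcript.

```python
from statistics import mean

# Hypothetical clustered customers: cluster label -> list of (R, F, M) tuples.
clusters = {
    "A": [(5, 12, 900.0), (8, 10, 700.0)],
    "B": [(60, 2, 80.0), (45, 3, 120.0)],
}


def column_means(rows):
    """Average R, F, and M over a list of (R, F, M) tuples."""
    return tuple(mean(col) for col in zip(*rows))


def rfm_pattern(cluster_avg, total_avg):
    """'up' where the cluster average exceeds the total average, else 'down'."""
    return tuple(
        "up" if vc > vt else "down"
        for vc, vt in zip(cluster_avg, total_avg)
    )


all_rows = [row for rows in clusters.values() for row in rows]
total_avg = column_means(all_rows)

patterns = {
    label: rfm_pattern(column_means(rows), total_avg)
    for label, rows in clusters.items()
}
for label, pattern in patterns.items():
    print(label, dict(zip("RFM", pattern)))
```

Because only the direction is kept, a cluster barely above the average and one far above it receive the same label, which is the limitation the slide mentions.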

  12. Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Text Mining: document classification, customer complaint handling, analysis of patient medical records, management of military and criminal intelligence (based on similarity of keyword structure)

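  13. Comparison of Data Classification and Data Clustering • Data Classification • Classifies data according to its attributes and a set of pre-established rules • Requires some prior understanding of the structure of the data • Finds the relationships between many (input) variables and the target (output variable) • Data Clustering • Groups data without needing to understand the characteristics and structure of the data in the database • Makes similarity within each group as high as possible and similarity between groups as low as possible • Reveals the structure among variables, leaving more room for interpretation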

  14. Description and Visualization (1/2) • Describe what is actually happening in a complex database; doing so gives us a better understanding of our customers, products, processes, and so on. • A good enough description of a behavior will often suggest an explanation for it as well • Parental movie viewing habits are strongly influenced by the taste of children

  15. Description and Visualization (2/2) • Data visualization is one powerful form of descriptive data mining. • It is not always easy to come up with meaningful visualizations, but the right picture really can be worth a thousand association rules • Data Cube, Scatter graph, Histogram, …

  16. Data Mining Techniques • Statistical Analysis • Association Analysis • Classification • Clustering Analysis • Other techniques • Trend Analysis, Time Series Analysis, Regression Analysis, Outlier Analysis, Neural Network techniques from the field of artificial intelligence, and so on.

  17. All six tasks in one small database: the Moviegoers database as an example • We wondered • what movies a person watches • who goes to see a movie • The moviegoers database contains • the responses to an informal survey conducted during August and September of 1996 • The Sample Populations • the survey was distributed to four different populations in hopes that interesting intergroup differences might be revealed • The survey asked for age, sex, and last movies seen in a movie theater

  18. The layout of the moviegoers database (figure not reproduced in this transcript: a relational diagram whose links are labeled with 1 and ∞ cardinalities)

  19. Moviegoer Survey (the first few rows are shown)
      Name     Sex  Age  Source          Movie
      Amy      F    27   Oberlin         Independence Day
      Andrew   M    25   Oberlin         12 Monkeys
      Andy     M    34   Oberlin         The Birdcage
      Anne     F    30   Oberlin         Trainspotting
      Ansje    F    25   Oberlin         I Shot Andy Warhol
      Beth     F    30   Oberlin         Chain Reaction
      Bob      M    51   Pinewoods       Schindler’s List
      Brian    M    23   Oberlin         Super Cop
      Candy    F    29   Oberlin         Eddie
      Cara     F    25   Oberlin         Phenomenon
      Cathy    F    39   124 Mt. Auburn  The Birdcage
      Charles  M    25   Oberlin         Kingpin
      Curt     M    30   MRJ             T2 Judgment Day
      David    M    40   MRJ             Independence Day
      Erica    F    23   124 Mt. Auburn  Trainspotting

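  20. What can data mining do? (1/3) • Moviegoer Classification • Determine sex from age, source, and movies seen • Determine source from sex, age, and movies seen • Determine which movie a person will see (most recent movie) from movies seen in the past, age, sex, and source • Technique: decision tree • Moviegoer Estimation • Age is a continuous variable, so it can serve as the target variable of an estimation task. • age = f(source, sex, movies seen)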

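  21. What can data mining do? (2/3) • Moviegoer Prediction • When a new movie is released, who will its audience be? • Run cluster analysis on moviegoers and on movies • For each cluster of moviegoers, mine rules that explain the group's taste in movies • For each cluster of movies, mine rules that describe its best target audience • When a new movie is released, the cluster it belongs to identifies the target audience • Moviegoer Affinity grouping • Which movies are always watched by the same kind of people (which movies go together?) • Analyze classification by sex through the generated association rules (virtual items)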

  22. What can data mining do? (3/3) • Moviegoer Clustering • to find groups of movies that go together because they are seen by the same people • to find groups of people that go together because they see the same movies • people with young children form a clearly recognizable cluster in the moviegoers database • Moviegoer Description • Basic statistics: average age, percentage of women • Association rules: people who have seen movie X also see movie Y • Rules can also be viewed as a kind of description: males aged 12 to 17 like to watch movie X

  23. Evaluation and Interpretation • Model validation • After building a model, you must evaluate its results and interpret their significance • Accuracy by itself is not necessarily the right metric for selecting the best model. You need to know more about the type of errors and the costs associated with them • Confusion matrices • For classification problems, a confusion matrix is a very useful tool for understanding results • It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong

  24. Confusion matrix (1/2) Model X • This is much more informative than simply telling us an overall accuracy rate of 82% (123/150) • If there are different costs associated with different errors, a model with a lower overall accuracy may be preferable to one with higher accuracy but a greater cost to the organization due to the types of errors it makes

  25. Confusion matrix (2/2) Model Y • The accuracy has dropped to 79% (118/150) • Suppose each correct answer had a value of $10 and each incorrect answer for class A had a cost of $5, for class B a cost of $10, and for class C a cost of $20 • The net value of model X = (123*10)-(5*5)-(12*10)-(10*20) = 885 • The net value of model Y = (118*10)-(22*5)-(7*10)-(3*20) = 940
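
The net value arithmetic for models X and Y can be checked with a few lines of code. The per-class error counts are read off the two formulas on this slide; the function name `net_value` and the dictionary layout are illustrative assumptions.

```python
def net_value(correct, errors_by_class, value_per_correct, cost_by_class):
    """Value of correct answers minus the class-specific cost of errors."""
    gain = correct * value_per_correct
    loss = sum(n * cost_by_class[cls] for cls, n in errors_by_class.items())
    return gain - loss


# Costs per incorrect answer, as stated on the slide.
cost_by_class = {"A": 5, "B": 10, "C": 20}

# Error counts per class, taken from the slide's two formulas.
model_x = net_value(123, {"A": 5, "B": 12, "C": 10}, 10, cost_by_class)
model_y = net_value(118, {"A": 22, "B": 7, "C": 3}, 10, cost_by_class)

print("net value of model X:", model_x)
print("net value of model Y:", model_y)
```

Model Y has the lower accuracy but the higher net value, because its errors fall mostly in the cheapest class.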

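  26. Using the Confusion Matrix (1/4) • Data mining: use historical data to find rare events • Rare events bring high profit or severe loss, but taking action on every customer is not cost-effective • A confusion matrix provides three kinds of information (the 3 Rs) • Response Rate: how many rare events are found in the list we predicted? • Recall: what proportion of all rare events do the predicted rare events account for? • Range Reduce: how much does the list shrink when a data mining model is used to look for rare events?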

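  27. Using the Confusion Matrix (2/4) 0: will not buy 1: will buy • Response Rate: the ability to be selective (better to go without than to accept poor candidates) • Response Rate = 6961 / (2497 + 6961) = 73.6% • Overall Response Rate = (6961 + 2171) / (6855 + 2171 + 2497 + 6961) = 49.4% • The response rate is 1.49 times the overall rate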

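  28. Using the Confusion Matrix (3/4) 0: will not buy 1: will buy • Recall: better to flag ten thousand by mistake than to let a single one slip through • Recall = 6961 / (6961 + 2171) = 76.23% • Range Reduce: the cost of running a campaign based on the model • Range Reduce = (6961 + 2497) / (6855 + 2171 + 2497 + 6961) = 51.2%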
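
The three measures can be recomputed from the four confusion-matrix counts used in these slides. The matrix itself is not reproduced in the transcript, so the assignment of counts to cells below (true negatives, false negatives, false positives, true positives) is inferred from the formulas and should be read as an assumption.

```python
# Counts inferred from the slide formulas (1 = will buy, 0 = will not buy).
tn = 6855  # predicted 0, actual 0
fn = 2171  # predicted 0, actual 1
fp = 2497  # predicted 1, actual 0
tp = 6961  # predicted 1, actual 1

total = tn + fn + fp + tp

response_rate = tp / (tp + fp)        # share of the predicted list that buys
overall_rate = (tp + fn) / total      # share of all customers who buy
recall = tp / (tp + fn)               # share of all buyers that the list catches
range_reduce = (tp + fp) / total      # size of the list relative to everyone
lift = response_rate / overall_rate   # improvement over contacting everyone

print(f"response rate: {response_rate:.1%}")
print(f"overall rate:  {overall_rate:.1%}")
print(f"recall:        {recall:.2%}")
print(f"range reduce:  {range_reduce:.1%}")
print(f"lift:          {lift:.2f}")
```

Response rate corresponds to what is usually called precision, which makes the trade-off with recall explicit.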

  29. Using the Confusion Matrix (4/4) • Which model is best depends on the business problem • For a marketing response problem, we want to get as many potential responders as possible and we do not care about false positives • For a medical diagnostic test for cancer, we might use such a model as an initial screen. We care a lot about false negatives – and we want as few as possible

  30. The Lift (Gain) Chart • It shows how responses are changed by applying the model. This change ratio is called the lift

  31. The ROI (Return on Investment) Chart • A pattern may be interesting, but acting on it may cost more than the revenue or savings it generates • Here, ROI is defined as the ratio of profit to cost

  32. The Profit Chart • Profit = revenue minus cost • The maximum lift was achieved at the 1st decile (10%), the maximum ROI at the 2nd decile (20%), and the maximum profit at the 3rd and 4th deciles
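
Lift, ROI, and profit per decile can be tabulated with a short sketch. It uses the definitions given in the slides (ROI as the ratio of profit to cost, profit as revenue minus cost) and assumes cumulative lift, that is, the response rate among everyone contacted so far divided by the overall response rate. The responder counts, revenue per response, and cost per contact are hypothetical and do not reproduce the charts in the deck.

```python
def decile_metrics(responders_per_decile, customers_per_decile,
                   revenue_per_response, cost_per_contact):
    """Cumulative (decile, lift, roi, profit) when contacting the top deciles."""
    total_customers = customers_per_decile * len(responders_per_decile)
    baseline_rate = sum(responders_per_decile) / total_customers
    rows = []
    cumulative_responders = 0
    for decile, responders in enumerate(responders_per_decile, start=1):
        cumulative_responders += responders
        contacted = decile * customers_per_decile
        lift = (cumulative_responders / contacted) / baseline_rate
        cost = contacted * cost_per_contact
        profit = cumulative_responders * revenue_per_response - cost
        roi = profit / cost
        rows.append((decile, lift, roi, profit))
    return rows


# Hypothetical model output: responders found in each score decile, best first.
responders = [300, 200, 120, 80, 60, 50, 40, 30, 20, 10]
rows = decile_metrics(responders, customers_per_decile=1000,
                      revenue_per_response=10, cost_per_contact=1)

for decile, lift, roi, profit in rows:
    print(f"decile {decile:2d}: lift {lift:.2f}, ROI {roi:.2f}, profit {profit}")
```

In this made-up data lift and ROI peak at the first decile while profit peaks at the third, which illustrates why the three charts can point to different cut-offs.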

  33. External Validation • No matter how good the accuracy of a model is estimated to be, there is no guarantee that it reflects the real world • One of the main reasons for this problem is that there are always assumptions implicit in the model • The inflation rate may not have been included as a variable in a model that predicts the propensity of an individual to buy • It is important to test a model in the real world • do a test mailing to verify the model • try the model on a small set of applicants before full deployment

  34. Deploy the model and results (1/2) • The first way is for an analyst to recommend actions based on simply viewing the model and its results • The analyst may look at the clusters the model has identified, the rules that define the model, or the lift and ROI charts that depict the effect of the model • The second way is to apply the model to different data sets • to flag records based on their classification, • to assign a score such as the probability of an action, or • to select some records from the database and subject these to further analyses with an OLAP tool, and so on

  35. Deploy the model and results (2/2) • The amount of time to process each new transaction, and the rate at which new transactions arrive, will determine whether a parallelized algorithm is needed • Monitoring credit card transactions or cellular telephone calls for fraud • When delivering a complex application, data mining is often only a small, albeit critical, part of the final product • In a fraud detection system, known patterns of fraud may be combined with discovered patterns • You must measure how well your model has worked after you use it (model monitoring) • To be retested, retrained and possibly completely rebuilt

  36. Acting on the Results (1/2) • Sometimes, it is valuable to incorporate a bit of experimental design into the process • If we are predicting customer response to a product, we might have three different groups • A group of customers based on the results of the Data Mining model, who get the marketing message • A group of customers chosen at random, who get the marketing message • A group of customers chosen at random, who do not get the marketing message

  37. Acting on the Results (2/2) • What we hope is that • the first group will have a high response rate • The second group will have a mediocre response rate • The third will have a negligible response rate • We can test the strength of the marketing message • The difference in response between the second and third groups • We can test the strength of the data mining • The difference between the first and second groups
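
The two comparisons in this three-group design can be sketched as simple differences in response rate. The group sizes and response counts below are hypothetical, and the sketch stops at raw differences; it does not test whether they are statistically significant.

```python
# Hypothetical results: group name -> (responses, group size).
groups = {
    "model_selected_with_message": (120, 1000),
    "random_with_message": (50, 1000),
    "random_without_message": (10, 1000),
}

rates = {name: responses / size for name, (responses, size) in groups.items()}

# Strength of the marketing message: second group versus third group.
message_effect = rates["random_with_message"] - rates["random_without_message"]

# Strength of the data mining: first group versus second group.
model_effect = rates["model_selected_with_message"] - rates["random_with_message"]

print(f"message effect: {message_effect:+.1%}")
print(f"model effect:   {model_effect:+.1%}")
```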

  38. Measuring the Model’s Effectiveness • We need to compare the results to what actually happened in the real world • Did the predicted behavior actually happen? • Did the prospects accept the offer, did the customers purchase the new product, did they churn? • The lift charts and confusion matrices can be adapted to compare actual results to predicted results • The score set is usually more recent than the model set • Model performance usually degrades over time • The model captures patterns from the past and, over time, the patterns become less relevant

  39. What Makes Predictive Modeling Successful? • Modeling Shelf-Life • The whole process of predictive modeling is based on some key assumptions

  40. A. Modeling Shelf-Life • Looking at time frames brings up two critical questions about models and their predictions: • What is the shelf-life of a model? • The things being modeled change over time • A model created five years ago, or last year, or last month, may no longer be valid • You need to train a new model on more recent data • What is the shelf-life of a prediction? • Predictions are valid during a particular time frame

  41. B. Key Assumption 1 (1/2) • The Past Is a Good Predictor of the Future • How patients reacted to a drug in the past • However, external factors will always have an influence on the model being built • Retail sales decrease during cold weather and blizzards • Mortgage lending increases when interest rates go down • Seasonal patterns • The Christmas season and back-to-school season drive many retail sales • Models developed during years of relatively stable financial markets were not applicable in the more volatile markets

  42. B. Key Assumption 1 (2/2) • The Past Is a Good Predictor of the Future • How do we know when the past is a good predictor of the future? • We can never know for sure • It is critical to • Include domain experts (have insight about important factors) in the modeling process • Include enough of the right data (seasonal factors) to make good decisions

  43. B. Key Assumption 2 • The Data is Available • Data may not be available for any number of different reasons • The data may not be collected by the operational systems • The database is too busy most of the time to prepare extracts • The data is owned by an outside vendor • And so on • Ensuring that the right data is available is critical to building successful predictive models

  44. B. Key Assumption 3 • The Data Contains What We Want to Predict • To apply the lessons of the past to the future, we need to be comparing apples to apples and oranges to oranges • Often, the business people phrase their needs very ambiguously • We are interested in people who do not pay their bills • Sometimes business users have unreasonable expectations of their data • When building a response model, we must know who responded to the campaign and who received the campaign • For advertising campaigns, the second group is not known • However, we can compare the responders to a random sample of the general population

  45. Selecting Data Mining Products (1/3) • There are three main types of data mining products • Tools that are analysis aids for OLAP • Help OLAP users identify the most important dimensions and segments on which they should focus attention • Business Objects Business Miner, Cognos Scenario • The “pure” data mining products • Horizontal tools aimed at data mining analysts concerned with solving a broad range of problems • IBM Intelligent Miner, Oracle Darwin, SAS Enterprise Miner, SGI MineSet, and SPSS Clementine • Analytic applications which implement specific business processes for which data mining is an integral part • Customized packages with the data mining embedded

  46. Selecting Data Mining Products (2/3) • Basic capabilities • Nothing substitutes for actual hands-on experience • Depending on your particular circumstances – system architecture, staff resources, database size, problem complexity – some data mining products will be better suited than others to meet your needs • System architecture • Work on a stand-alone desktop machine or a client-server architecture • Data preparation • Data access • No single product can support the large variety of database servers • Algorithms

  47. Selecting Data Mining Products (3/3) • Basic capabilities (continued) • Interfaces to other products • Many tools can help you understand your data before you build your model, and help you interpret the results of your model • These include traditional query and reporting tools, graphics and visualization tools, and OLAP tools • Model evaluation and interpretation • Model deployment • When you need to apply the model to new cases as they come, it is usually necessary to incorporate the model into a program using an API or code generated by the data mining tool • Scalability • User interface • The people who build, deploy, and use the results of the models may be different groups with varying skills

  48. The Virtuous Cycle of DM (1/2) • Data mining can be applied to many problems in many industries • Most common applications are in marketing, specifically for CRM • Applied to prospecting for new customers, retaining existing ones, and increasing customer value • Applied to understanding customer behavior and optimizing manufacturing processes • Although they may have much in common, every application has its own unique characteristics • Within a single industry, different companies have different strategic plans and different approaches

  49. The Virtuous Cycle of DM (2/2) • The virtuous cycle is a high-level process, consisting of four major business processes: • Identifying the business problem • Transforming data into actionable results • Acting on the results • Measuring the results • There are no shortcuts – success in DM requires all four processes • Expertise grows as organizations focus on the right business problems, learn about data and modeling techniques, and improve Data Mining processes based on the results of previous efforts

  50. Data Description and Data Mining Model Building (1/2) • Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions • The first and simplest analytical step in data mining is to describe the data • Summarize its statistical attributes (such as means and standard deviations) • Visually review it using charts and graphics (visualization) • Look for potentially meaningful links among variables (such as values that often occur together) • Clustering • Collecting, exploring, and selecting the right data are critically important
