
Mining Changes of Classification by Correspondence Tracing


Presentation Transcript


  1. Mining Changes of Classification by Correspondence Tracing. Ke Wang, Senqiang Zhou (Simon Fraser University, {wangk, szhoua}@cs.sfu.ca); Wai Chee Ada Fu, Jeffrey Xu Yu (The Chinese University of Hong Kong, adafu@cs.cuhk.edu.hk, yu@se.cuhk.edu.hk)

  2. The Problem • Mining changes of classification patterns as the data changes • What we have: the old classifier and the new data • What we want: the changes of classification characteristics in the new data • Example: the old rule "members with a large family → shop frequently" becomes "members with a large family → shop infrequently"

  3. Targets • State the changes explicitly; simply comparing the old and new classifiers does not work • Distinguish the important changes, otherwise users could be overwhelmed • Focus on the changes that cause the change in classification accuracy

  4. Related Work • [GGR99]: transform two classifiers into a common specialization; the change is measured by the amount of work required for this transformation • Human users hardly measure changes this way • Does not address the primary goal: accuracy

  5. Related Work (Cont.) • [LHHX00]: extract changes by requiring the new classifier to be similar to the old one, e.g., by using the same splitting attributes or the same splitting points in decision-tree construction • This puts a severe restriction on mining important changes

  6. Our Approach: Basic Idea • For each old rule o, trace its corresponding new rules in the new classifier through the examples they both classify • Example: n1 and n2 are corresponding rules of o • [Figure: examples in the new data set covered by old rule o are split between new rules n1 and n2]

  7. The Algorithm • Input: the old classifier and the new data • Output: the changes of classification patterns • Step 1: build the new classifier • Step 2: identify the corresponding new rules for each old rule • Step 3: present the changes

  8. Identify the Corresponding Rules • Given an old rule o: • Collect the examples classified by o • For each such example, identify the new rule n that classifies it • Characteristic change: o → ⟨n1, n2, …, nk⟩ • [Figure: the examples of old rule o divided between new rules n1 and n2]
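A minimal Python sketch of this tracing step; representing rules as (name, predicate, class) triples is an illustrative assumption, not the paper's data structure:

    from collections import defaultdict

    def trace_correspondence(old_rules, new_rules, new_data):
        """For each old rule o, count how many of the examples it classifies
        fall to each new rule n, yielding o -> <n1, ..., nk>."""
        corr = defaultdict(lambda: defaultdict(int))  # old name -> {new name: count}
        for example in new_data:
            o = next((r for r in old_rules if r[1](example)), None)  # first firing rule
            n = next((r for r in new_rules if r[1](example)), None)
            if o and n:
                corr[o[0]][n[0]] += 1
        return corr

    # Toy usage mirroring slides 8 and 12: o corresponds to <n1, n2>.
    old_rules = [("o", lambda x: x["A4"] == 1, "C3")]
    new_rules = [("n1", lambda x: x["A3"] == 1 and x["A4"] == 1, "C3"),
                 ("n2", lambda x: x["A3"] == 2 and x["A4"] == 1, "C2")]
    data = [{"A3": 1, "A4": 1}] * 3 + [{"A3": 2, "A4": 1}] * 4
    print({o: dict(ns) for o, ns in trace_correspondence(old_rules, new_rules, data).items()})
    # {'o': {'n1': 3, 'n2': 4}}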

  9. Any Important Changes? • Given o → ⟨n1, n2, …, nk⟩, which changes are more important? • Hint: users are usually interested in the changes that cause the change in classification accuracy • Basic idea: measure the importance of changes by the estimated accuracy of the rules on future data

  10. Pessimistic Error Estimation • Consider an old rule o that classifies N examples (in the new data set), E of them wrongly • Observed error rate: E/N • What about the error rate over the whole population? • Given a confidence level CF, the upper bound on the error rate is U_CF(N, E) • Details in [Q93]
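A sketch of the pessimistic estimate of [Q93]: U_CF(N, E) can be read as the largest error rate p for which observing at most E errors in N trials still has probability at least CF under Binomial(N, p). Solving it by bisection is an assumed numerical method, but it reproduces the figures on slide 12:

    from math import comb

    def binom_cdf(e, n, p):
        """P(X <= e) for X ~ Binomial(n, p)."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1))

    def u_cf(n, e, cf=0.25):
        """Pessimistic upper bound U_CF(n, e) on the true error rate."""
        lo, hi = e / n, 1.0
        for _ in range(60):  # bisect: binom_cdf is decreasing in p
            mid = (lo + hi) / 2
            if binom_cdf(e, n, mid) > cf:
                lo = mid
            else:
                hi = mid
        return lo

    print(round(u_cf(7, 4), 2))  # ~0.75, matching slide 12
    print(round(u_cf(5, 0), 2))  # ~0.24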

  11. Estimating Quantitative Change • Consider a rule pair ⟨o, ni⟩ with o: (No, Eo) and ni: (Nn, En) • Step 1: estimate the error rates of both rules • Step 2: calculate the decrease in error rate • Step 3: calculate the increase in accuracy (the quantitative change)
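A minimal sketch of these three steps, building on the u_cf function above; weighting the error decrease by the shared examples' fraction of the new data set is inferred from the worked example on slide 12:

    def acc_gain(n_o, e_o, n_n, e_n, shared, total, cf=0.25):
        """Quantitative change of the pair <o, n_i>: the increase in estimated
        accuracy on the `shared` examples that o classifies and n_i takes
        over, out of `total` examples in the new data set."""
        error_drop = u_cf(n_o, e_o, cf) - u_cf(n_n, e_n, cf)  # steps 1 and 2
        return (shared / total) * error_drop                  # step 3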

  12. An Example • Consider ⟨o, n1, n2⟩: • o: A4=1 → C3 (N=7, E=4) • n1: A3=1, A4=1 → C3 (N=5, E=0), 3 of its examples classified by o • n2: A3=2, A4=1 → C2 (N=6, E=0), 4 of its examples classified by o • Assume the new data set has 18 examples and CF = 25% • Consider the quantitative change of ⟨o, n1⟩: • The estimated error rate of o: U_CF(7, 4) = 75% • The estimated error rate of n1: U_CF(5, 0) = 24% • The decrease in error rate: 75% − 24% = 51% • The increase in accuracy (o, n1) = (3/18) × 51% = 8.5% • The increase in accuracy (o, ⟨n1, n2⟩) = 8.5% + 12% = 20.5%
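The same numbers fall out of the u_cf and acc_gain sketches above:

    gain_n1 = acc_gain(7, 4, 5, 0, shared=3, total=18)  # ~0.084 (slide: 8.5%)
    gain_n2 = acc_gain(7, 4, 6, 0, shared=4, total=18)  # ~0.120 (12%)
    print(round(gain_n1 + gain_n2, 3))                  # ~0.205, i.e. ~20.5%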

  13. Types of Changes • Global changes: both the characteristic change and the quantitative change are large • Alternative changes: the characteristic change is large but the quantitative change is small

  14. Types of Changes (Cont.) • Target changes: similar characteristics but different classes (targets) • Interval changes: a shift of boundary points due to the emergence of new cutting points
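A toy labeling of a traced change under the global/alternative part of this taxonomy; the numeric threshold and the function itself are illustrative assumptions, not the paper's exact criteria:

    def change_type(characteristics_changed, quant_change, threshold=0.05):
        """Label a traced change given whether its characteristics changed
        substantially and how large its quantitative change is."""
        if characteristics_changed and quant_change >= threshold:
            return "global change"       # large in both respects
        if characteristics_changed:
            return "alternative change"  # characteristics changed, little accuracy impact
        return "minor change"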

  15. Experiments • Two data sets: • the German Credit Data from the UCI repository [MM96] • the IPUMS Census Data [IPUMS] • Goal: verify whether the algorithm finds the important changes that are "supposed" to be found

  16. Methodologies • For the German Credit data: plant changes into the original data and check whether the proposed method finds them • For the IPUMS census data: apply the proposed method to find changes across years and across sub-populations • All classifiers are built with the C4.5 algorithm [Q93]

  17. Summary of the German Data • Data description: • 2 classes: bad and good • 20 attributes, 13 of them categorical • 1000 examples, 700 of them "good" • Changes planted: • a target change • an interval change • etc.

  18. Planting a Target Change • The old rule: Personal-status = single-male, Foreign = no → Credit = good (23, 0) • Consider the examples classified by the old rule • Change planted: if Liable-people = 1, change the class from good to bad • 12 examples changed

  19. The Changes Found • The old rule: Personal-status = single-male, Foreign = no → Credit = good • The new rules: • Personal-status = single-male, Liable-people ≤ 1, Foreign = no → Credit = bad (quantitative change = 0.48%) • Liable-people > 1, Foreign = no → Credit = good (quantitative change = 0.54%)

  20. Planting an Interval Change • The old rule: Status = 0DM, Duration > 11, Foreign = yes → Credit = bad (164, 47) • Consider the examples classified by the old rule • Change planted: increase the Duration value by 6 (months) for each example classified by the old rule • 164 examples changed

  21. The Changes Found • The old rule: Status = 0DM, Duration > 11, Foreign = yes → Credit = bad • The new rule: Status = 0DM, Duration > 16, Foreign = yes → Credit = bad (quantitative change = 1.20%)

  22. Summary of the IPUMS Data • Take "vetstat" (veteran status) as the class • 3 data sets: 1970, 1980, and 1990 • Each data set contains examples for several races • The proposed method is applied to find changes across years and across sub-populations

  23. Interesting Changes Found • A. 1970-black vs 1990-black: 35 < age ≤ 54 → Vetstat = yes becomes 40 < age ≤ 72, sex = male → Vetstat = yes (quantitative change = 1.20%) • B. 1990-black vs 1990-chinese: 40 < age ≤ 72, sex = male → Vetstat = yes becomes bplg = china, incss ≤ 5748 → Vetstat = no (quantitative change = 4.56%)

  24. Conclusion • Extracting changes from potentially very dissimilar old and new classifiers by correspondence tracing • Ranking the importance of changes • Presenting the changes • Experiments on real-life data sets

  25. References • [GGR99] V. Ganti, J. Gehrke, and R. Ramakrishnan. A framework for measuring changes in data characteristics. In PODS, 1999. • [IPUMS] IPUMS census data, http://www.ipums.umn.edu/. • [LHHX00] B. Liu, W. Hsu, H.S. Han, and Y. Xia. Mining changes for real-life applications. In DaWaK, 2000.

  26. References (Cont.) • [MM96] C.J. Merz and P. Murphy. UCI repository of machine learning databases, 1996. • [Q93] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
