Mining for Context Recognition in Document Filtering and Classification

Rey-Long Liu, Dept. of Information Management, Chung Hua University, HsinChu, Taiwan, R.O.C.



  1. Mining for Context Recognition in Document Filtering and Classification Rey-Long Liu Dept. of Information Management Chung Hua University HsinChu, Taiwan, R.O.C.

  2. Problem Definition • Hierarchical document filtering & document classification (DF & DC) • Goal • Putting suitable information into suitable categories, which are organized hierarchically • Motivation • Information management, dissemination, & sharing

  3. Problem Definition (Cont.) • Main challenge • Recognition of context of discussion (COD) • For example, a document mentioning “911” may be from many categories in the text hierarchy of Google, including recreation! • Deriving COD thresholds for making DF & DC decisions

  4. Problem Definition (Cont.) • Main contributions • ICenter, which integrates DF and DC by COD recognition • Mining the profile of each category • Tuning a COD threshold for each category • DF & DC • Through COD recognition, higher-quality DF & DC may be achieved

  5. Outline • Overview of ICenter • The profile miner • The COD threshold tuner • The filtering classifier • Empirical evaluation • Conclusion

  6. [Figure: ICenter architecture. Training: Training Documents, Profile Mining, Threshold Tuning, Category Profiles, Category Thresholds. Testing: Incoming Documents, Filtering & Classification, Classified Documents, Filtered Documents.] • ICenter: an information center for a user community

  7. The Profile Miner
     • Procedure: ProfileMining(c), where c is a category in the text hierarchy.
     • Effect: Build the profile $P_x$ of each descendant category x of c.
     • Begin
     • (1) For each child category x of c, do
     • (1.1) $P_x = \emptyset$;
     • (1.2) W = {w | w is a word in documents under x, and w is not a stop word};
     • (1.3) For each word w in W, do
     • (1.3.1) $s_{w,x} = P(w|x)$;
     • (1.3.2) $g_{w,x} = P(w|x) \times (B_x / \sum_i P(w|x_i))$, where $B_x$ = 1 + number of siblings of x, and $x_i$ is the i-th child of c, $1 \le i \le B_x$;
     • (1.3.3) $P_x = P_x \cup \{\langle w, s_{w,x}, g_{w,x} \rangle\}$;
     • (1.4) If x is not a leaf category, recursively invoke ProfileMining(x) to build the profile of each descendant category of x;
     • End.
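The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the nested-dict hierarchy layout, the tiny stop list, and the estimate of P(w|x) as a relative word frequency are all hypothetical choices, since the slide does not say how P(w|x) is computed. Category names are assumed to be unique across the hierarchy.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop list


def word_probs(docs):
    """Estimate P(w|x) as the relative frequency of w among non-stop words (assumption)."""
    counts = Counter(w for d in docs for w in d.lower().split() if w not in STOP_WORDS)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def docs_under(node):
    """Collect the documents of a category and of all its descendants."""
    docs = list(node.get("docs", []))
    for child in node.get("children", {}).values():
        docs.extend(docs_under(child))
    return docs


def profile_mining(c, profiles=None):
    """Build the profile {w: (s_wx, g_wx)} of each descendant category x of c."""
    profiles = {} if profiles is None else profiles
    children = c.get("children", {})
    probs = {name: word_probs(docs_under(x)) for name, x in children.items()}
    b = len(children)  # B_x = 1 + number of siblings of x
    for name, x in children.items():
        profile = {}
        for w, s in probs[name].items():
            total = sum(p.get(w, 0.0) for p in probs.values())  # sum_i P(w|x_i)
            profile[w] = (s, s * b / total)  # (s_wx, g_wx)
        profiles[name] = profile
        if x.get("children"):  # step (1.4): recurse into non-leaf categories
            profile_mining(x, profiles)
    return profiles


# Hypothetical two-category hierarchy with toy documents.
hierarchy = {"children": {
    "TPS": {"docs": ["accounting system", "sales system"]},
    "DSS": {"docs": ["decision system", "simulation system"]},
}}
profiles = profile_mining(hierarchy)
print(profiles["TPS"]["system"])      # equally common in both siblings
print(profiles["TPS"]["accounting"])  # occurs only under TPS
```

In this toy data "system" is equally frequent under both siblings, so its g value is 1, while "accounting" occurs only under TPS and gets a g value of 2.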

  8. [Figure: example text hierarchy with the category labels Manufacturing Systems Development, Transaction Processing Systems, and Decision Support Systems. Candidate terms are marked (O) or (X): "Product, factory, …" (O); "System, Computer, Analysis, …" (O); under Transaction Processing Systems, "Accounting, Sales, …" (O) and "System, Computer, …" (X); under Decision Support Systems, "Decision, simulation, …" (O) and "System, Computer, …" (X).]
     • Measuring how representative and discriminative a term w is in a category x:
     • $s_{w,x}$ = Support(w, x) (= $P(w|x)$)
     • $g_{w,x}$ = Support(w, x) / Avg Support(w, $x_i$), where $x_i \in \{x\} \cup \{\text{siblings of } x\}$
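A short worked example may help here. The numbers are hypothetical and chosen only for illustration: suppose x has two siblings ($B_x = 3$), and the support of w is 0.30 in x and 0.10 and 0.20 in the two siblings.

```latex
\[
g_{w,x} = \frac{P(w \mid x)}{\frac{1}{B_x}\sum_{i=1}^{B_x} P(w \mid x_i)}
        = \frac{0.30}{\frac{1}{3}\,(0.30 + 0.10 + 0.20)}
        = \frac{0.30}{0.20} = 1.5
\]
```

A value above 1 means w is more common in x than in the average category at that level. A term with the same support in every sibling would get exactly 1, however frequent it is.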

  9. The COD Threshold Tuner
     • Procedure: CODThresholdTuning(x), where x is a leaf category.
     • Effect: (1) For each ancestor a of x, tune a COD threshold $h_{a,x}$, and (2) tune a COD threshold $h_{x,x}$ for x.
     • Begin
     • (1) P = {p | p is a document belonging to x};
     • (2) For each ancestor category a of x, do
     • (2.1) UB = Min{$DOA_{p,a}$}, where $p \in P$, and $DOA_{p,a}$ is the DOA value of p with respect to a (DOAp,a =  sw,angw,arw,atsw,p);
     • (2.2) $h_{a,x}$ = Max{$DOA_{n,a}$}, where n is a document not belonging to a, and $DOA_{n,a} \le UB$;
     • (3) Q = {q | q is a document not belonging to x};
     • (4) For each q in Q, do
     • (4.1) For each ancestor a of x
     • (4.1.1) If $DOA_{q,a} \le h_{a,x}$, Q = Q – {q};
     • (5) $h_{x,x} = DOA_{p,x}$, which maximizes the system's performance on P and Q ($p \in P$);
     • End.
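A minimal Python sketch of the tuning steps follows. It rests on several assumptions that the slide does not settle: DOA values are taken as precomputed inputs (`doa` and `members` are hypothetical data structures), a document is read as failing an ancestor test when its DOA value does not exceed the threshold, "the system's performance" in step (5) is taken to be F1, and the ancestor threshold defaults to negative infinity when no non-member falls at or below UB.

```python
def f1_at(threshold, pos, neg):
    """F1 when documents with DOA >= threshold are accepted (F1 is an assumed measure)."""
    tp = sum(v >= threshold for v in pos)
    fp = sum(v >= threshold for v in neg)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / len(pos)
    return 2 * precision * recall / (precision + recall)


def cod_threshold_tuning(x, ancestors, doa, members):
    """Tune h[a] for each ancestor a of leaf x, and h[x] for x itself.

    doa[d][c]  : DOA value of document d with respect to category c (assumed given)
    members[c] : set of documents belonging to category c (assumed given)
    """
    h = {}
    p_docs = members[x]                                    # step (1)
    for a in ancestors:                                    # step (2)
        ub = min(doa[p][a] for p in p_docs)                # (2.1)
        below = [doa[n][a] for n in doa
                 if n not in members[a] and doa[n][a] <= ub]
        h[a] = max(below, default=float("-inf"))           # (2.2)
    q_docs = {q for q in doa if q not in p_docs}           # step (3)
    for a in ancestors:                                    # step (4): drop q failing a's test
        q_docs = {q for q in q_docs if doa[q][a] > h[a]}
    pos = [doa[p][x] for p in p_docs]
    neg = [doa[q][x] for q in q_docs]
    # step (5): candidate thresholds are the DOA values of x's own documents
    h[x] = max(sorted(pos), key=lambda t: f1_at(t, pos, neg))
    return h


# Hypothetical DOA values: p1, p2 belong to DSS; t1 to a sibling under SA; m1, m2 lie outside SA.
doa = {
    "p1": {"SA": 0.9, "DSS": 0.8},
    "p2": {"SA": 0.7, "DSS": 0.6},
    "t1": {"SA": 0.8, "DSS": 0.3},
    "m1": {"SA": 0.2, "DSS": 0.7},
    "m2": {"SA": 0.5, "DSS": 0.1},
}
members = {"SA": {"p1", "p2", "t1"}, "DSS": {"p1", "p2"}}
h = cod_threshold_tuning("DSS", ["SA"], doa, members)
print(h)
```

In this toy run m1 has a high DOA value for DSS but is removed by the SA test, so it does not push the DSS threshold upward. Only t1, which passes SA, competes with p1 and p2 in step (5).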

  10. [Figure: the example text hierarchy, with Manufacturing Systems Development marked (SA) and Decision Support Systems marked (DSS), alongside Transaction Processing Systems.]
      • The threshold allows all relevant documents to pass (but may filter out many non-relevant documents)
      • Only those non-relevant documents that pass the test of SA are considered to tune an optimum threshold

  11. The Filtering Classifier
      • Procedure: DF&DC(d), where d is a document.
      • Return: A set S of categories to which d is classified (d may be classified into c only when it passes all tests of c and the ancestors of c)
      • Begin
      • (1) Invoke DOAEstimation(d) to estimate $DOA_{d,c}$ for each category c;
      • (2) $S = \emptyset$;
      • (3) For each leaf category x, do
      • (3.1) IsAccepted = true;
      • (3.2) For each ancestor a of x, do
      • (3.2.1) If $DOA_{d,a} \le h_{a,x}$,
      • (3.2.1.1) IsAccepted = false;
      • (3.2.1.2) Exit the for-loop;
      • (3.3) If IsAccepted = true,
      • (3.3.1) If $DOA_{d,x} < h_{x,x}$,
      • (3.3.1.1) IsAccepted = false;
      • (3.4) If IsAccepted = true,
      • (3.4.1) $S = S \cup \{x\}$;
      • (4) Return S;
      • End.
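The decision procedure can be sketched in Python as below. This is an illustration under assumptions: DOA values are passed in rather than estimated, the dictionaries `leaves` and `h` are hypothetical layouts, and the ancestor test is read as rejecting a document whose DOA value does not exceed the ancestor's threshold. An empty result is interpreted here as the document being filtered out.

```python
def df_dc(doa_d, leaves, h):
    """Return the set of leaf categories that accept a document.

    doa_d[c]  : DOA value of the document for category c (assumed precomputed)
    leaves[x] : list of the ancestors of leaf category x
    h[(a, x)] : COD threshold of category a tuned for leaf x; h[(x, x)] is x's own
    """
    accepted = set()
    for x, ancestors in leaves.items():
        # all() stops at the first failing ancestor, like "Exit the for-loop"
        passes_ancestors = all(doa_d[a] > h[(a, x)] for a in ancestors)
        if passes_ancestors and doa_d[x] >= h[(x, x)]:
            accepted.add(x)
    return accepted


# Hypothetical thresholds for two leaves sharing the ancestor SA.
leaves = {"DSS": ["SA"], "TPS": ["SA"]}
h = {("SA", "DSS"): 0.5, ("DSS", "DSS"): 0.6,
     ("SA", "TPS"): 0.4, ("TPS", "TPS"): 0.7}

in_context = df_dc({"SA": 0.8, "DSS": 0.65, "TPS": 0.2}, leaves, h)
out_of_context = df_dc({"SA": 0.3, "DSS": 0.9, "TPS": 0.9}, leaves, h)
print(in_context, out_of_context)
```

The second document scores highly on both leaves but fails the SA test, so neither leaf accepts it. This is the case the ancestor thresholds are meant to catch: strong leaf-level term matches in the wrong context of discussion.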

  12. Empirical Evaluation • Data • Source: the text hierarchy of Yahoo! • There were 507 categories (under 5 first-level categories) among which 211 were leaves (maximum height = 8) • There were 3612 documents in the leaves • Data splitting • 90% of the leaves served as “in-space” data (for DC) • 10% of the leaves served as “out-space” data (for DF)

  13. Empirical Evaluation (Cont.) • Validation • 5-fold cross validation (i.e., 80% for training and 20% for testing) • Evaluation criteria • For DC • Precision (P) • Recall (R) • F1 = 2PR / (P + R) • For DF • Percentage of out-space documents successfully filtered (FR) • Average # of misclassifications for misclassified out-space documents (AM)
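The criteria above can be computed along the following lines. This is a sketch under assumptions: the slide does not say how precision and recall are aggregated, so the code pools (document, category) decisions across all documents, and it treats an out-space document as successfully filtered when no category accepts it. The function names and toy data are hypothetical.

```python
def dc_scores(predicted, actual):
    """Precision, recall, and F1 over pooled (document, category) decisions."""
    tp = sum(len(predicted[d] & actual[d]) for d in actual)
    n_pred = sum(len(predicted[d]) for d in actual)
    n_true = sum(len(actual[d]) for d in actual)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_true if n_true else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def df_scores(predicted_out):
    """FR (percent of out-space documents filtered) and AM (average number of
    categories assigned to those out-space documents that were not filtered)."""
    sizes = [len(s) for s in predicted_out.values()]
    wrong = [n for n in sizes if n > 0]
    fr = 100.0 * (len(sizes) - len(wrong)) / len(sizes)
    am = sum(wrong) / len(wrong) if wrong else 0.0
    return fr, am


# Hypothetical in-space predictions and true categories.
predicted = {"d1": {"DSS"}, "d2": {"TPS", "DSS"}, "d3": set()}
actual = {"d1": {"DSS"}, "d2": {"TPS"}, "d3": {"DSS"}}
p, r, f1 = dc_scores(predicted, actual)

# Hypothetical out-space documents: two filtered, two misclassified.
out_space = {"o1": set(), "o2": {"DSS", "TPS"}, "o3": set(), "o4": {"TPS"}}
fr, am = df_scores(out_space)
print(p, r, f1, fr, am)
```

FR and AM are complementary: FR counts how many out-space documents are rejected outright, while AM measures how widely the remaining ones are spread across categories.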

  14. Empirical Evaluation (Cont.) • Systems evaluated • ICenter • Baseline: Rocchio's classifier with thresholding (RO+T) • χ² (chi-square) technique for feature selection
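For readers unfamiliar with χ² feature selection, a minimal sketch follows. The slide does not say how the scores were computed, so the code uses the standard 2×2 term/category contingency form as an assumption. The function names and document counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(tables, k):
    """Keep the k terms with the highest chi-square score."""
    scores = {term: chi_square(*cells) for term, cells in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical document counts for three terms with respect to one category.
tables = {
    "decision": (40, 10, 10, 40),
    "system": (25, 25, 25, 25),
    "simulation": (30, 10, 20, 40),
}
selected = select_features(tables, 2)
print(chi_square(40, 10, 10, 40), selected)
```

A term spread evenly inside and outside the category, such as "system" here, scores 0 and is dropped first. The size k of the retained feature set is a parameter that has to be tuned for this kind of baseline.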

  15. Empirical Evaluation (Cont.) • Result • Compared with the baseline using a feature set of size 40000, ICenter achieved a 6.2% improvement in FR and an 18% reduction in AM

  16. Conclusion • Main contribution • Exploring how and to what extent COD recognition may contribute to integrated DF and DC • The developed technique, ICenter, is both • More manageable (no need to tune feature sets), and • More competent (able to achieve better performance in both DC and DF)
