This paper presents an approach to hierarchical document filtering and classification that centers on recognizing the context of discussion (COD). The goal is to place documents into suitable categories by deriving a COD threshold for each category and using these thresholds to make filtering and classification decisions. The approach is implemented in ICenter, which integrates profile mining and threshold tuning to support information management, dissemination, and sharing. We describe the procedures involved, evaluate the method on empirical data, and summarize its contribution to higher-quality document categorization.
Mining for Context Recognition in Document Filtering and Classification • Rey-Long Liu • Dept. of Information Management, Chung Hua University, HsinChu, Taiwan, R.O.C.
Problem Definition • Hierarchical document filtering & document classification (DF & DC) • Goal • Putting suitable information into suitable categories, which are organized hierarchically • Motivation • Information management, dissemination, & sharing
Problem Definition (Cont.) • Main challenge • Recognition of context of discussion (COD) • For example, a document mentioning “911” may come from many categories in the text hierarchy of Google, including recreation! • Deriving COD thresholds for making DF & DC decisions
Problem Definition (Cont.) • Main contributions • ICenter, which integrates DF and DC by COD recognition • Mining the profile of each category • Tuning a COD threshold for each category • DF & DC • Through COD recognition, higher-quality DF & DC may be achieved
Outline • Overview of ICenter • The profile miner • The COD threshold tuner • The filtering classifier • Empirical evaluation • Conclusion
ICenter • ICenter: an information center for a user community • [Figure: ICenter architecture. Training phase: Training Documents, Profile Mining, Category Profiles, Threshold Tuning, Category Thresholds. Testing phase: Incoming Documents, Filtering & Classification, Classified Documents, Filtered Documents.]
The Profile Miner
• Procedure: ProfileMining(c), where c is a category in the text hierarchy.
• Effect: Build the profile P_x of each descendant category x of c.
• Begin
• (1) For each child category x of c, do
• (1.1) P_x = ∅;
• (1.2) W = {w | w is a word in documents under x, and w is not a stop word};
• (1.3) For each word w in W, do
• (1.3.1) s_{w,x} = P(w|x);
• (1.3.2) g_{w,x} = P(w|x) × (B_x / Σ_i P(w|x_i)), where B_x = 1 + number of siblings of x, and x_i is the i-th child of c, 1 ≤ i ≤ B_x;
• (1.3.3) P_x = P_x ∪ {<w, s_{w,x}, g_{w,x}>};
• (1.4) If x is not a leaf category, recursively invoke ProfileMining(x) to build the profile of each descendant category of x;
• End.
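The procedure above can be sketched in Python as follows. This is a minimal illustration, not the ICenter implementation: the `Category` class, the tiny stop list, and the estimate of P(w|x) as the fraction of documents under x that contain w are all assumptions made here, since the slide does not say how P(w|x) is estimated.

```python
from dataclasses import dataclass, field

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # hypothetical, tiny stop list

@dataclass
class Category:  # hypothetical container for one node of the text hierarchy
    name: str
    docs: list = field(default_factory=list)      # documents filed directly here, each a list of words
    children: list = field(default_factory=list)  # child categories

def all_docs(cat):
    """Documents under cat, including those of its descendants."""
    docs = list(cat.docs)
    for child in cat.children:
        docs.extend(all_docs(child))
    return docs

def support(word, docs):
    """Assumed estimate of P(w|x): fraction of documents containing the word."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if word in d) / len(docs)

def profile_mining(c, profiles=None):
    """Build {category name: {word: (s, g)}} for every descendant of c."""
    if profiles is None:
        profiles = {}
    docs_of = {x.name: all_docs(x) for x in c.children}
    b = len(c.children)  # B_x = 1 + number of siblings of x
    for x in c.children:
        words = {w for d in docs_of[x.name] for w in d if w not in STOP_WORDS}
        profile = {}
        for w in words:
            s = support(w, docs_of[x.name])
            total = sum(support(w, docs_of[xi.name]) for xi in c.children)
            g = s * b / total  # total >= s > 0 because w occurs under x
            profile[w] = (s, g)
        profiles[x.name] = profile
        if x.children:  # step (1.4): recurse into non-leaf categories
            profile_mining(x, profiles)
    return profiles
```

Under these assumptions, a word that appears only under x receives g equal to B_x, while a word spread evenly over x and its siblings receives g = 1.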
[Figure: example category hierarchy with candidate profile terms marked (O) or (X). Manufacturing: Product, factory, … (O). Systems Development: System, Computer, Analysis, … (O). Transaction Processing Systems: Accounting, Sales, … (O); System, Computer, … (X). Decision Support Systems: Decision, simulation, … (O); System, Computer, … (X).]
• Measuring how representative and discriminative a term w is in a category x:
• s_{w,x} = Support(w,x) (= P(w|x))
• g_{w,x} = Support(w,x) / AvgSupport(w,x_i), where x_i is in {x} ∪ {siblings of x}
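As a worked illustration of the discriminative measure, the block below evaluates g for two hypothetical words. The support values are invented for this example and do not come from the presentation.

```latex
% Hypothetical supports: P(w|x) = 0.6, and 0.2 and 0.1 in the two siblings of x.
\[
B_x = 3, \qquad
g_{w,x} = \frac{P(w \mid x)}{\frac{1}{B_x}\sum_{i=1}^{B_x} P(w \mid x_i)}
        = \frac{0.6}{(0.6 + 0.2 + 0.1)/3} = 2 .
\]
% A word spread evenly over x and its siblings, P(w|x_i) = 0.5 for all i:
\[
g_{w,x} = \frac{0.5}{(0.5 + 0.5 + 0.5)/3} = 1 .
\]
```

A value above 1 therefore indicates a word whose support in x exceeds the average over x and its siblings.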
The COD Threshold Tuner
• Procedure: CODThresholdTuning(x), where x is a leaf category.
• Effect: (1) For each ancestor a of x, tune a COD threshold h_{a,x}, and (2) tune a COD threshold h_{x,x} for x.
• Begin
• (1) P = {p | p is a document belonging to x};
• (2) For each ancestor category a of x, do
• (2.1) UB = Min{DOA_{p,a}}, where p ∈ P, and DOA_{p,a} is the DOA value of p with respect to a (DOAp,a = sw,angw,arw,atsw,p);
• (2.2) h_{a,x} = Max{DOA_{n,a}}, where n is a document not belonging to a, and DOA_{n,a} ≤ UB;
• (3) Q = {q | q is a document not belonging to x};
• (4) For each q in Q, do
• (4.1) For each ancestor a of x
• (4.1.1) If DOA_{q,a} ≤ h_{a,x}, Q = Q – {q};
• (5) h_{x,x} = DOA_{p,x}, which maximizes the system’s performance on P and Q (p ∈ P);
• End.
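A minimal Python sketch of the tuning procedure follows. It makes several assumptions that the slide does not settle: DOA values are taken as precomputed inputs, the comparisons in steps (2.2) and (4.1.1) are treated as non-strict (≤), F1 stands in for "the system's performance" in step (5), ties go to the lowest candidate threshold, and an ancestor with no qualifying negative document gets a threshold of negative infinity. The function and parameter names are hypothetical.

```python
import math

def tune_cod_thresholds(x, ancestors, members, doa, doc_ids):
    """Return {category: threshold} with h_{a,x} for each ancestor a and h_{x,x}.

    doa[(d, c)]  assumed precomputed DOA of document d for category c
    members[c]   set of documents belonging to c (or to a descendant of c)
    Assumes at least one training document belongs to x.
    """
    positives = [d for d in doc_ids if d in members[x]]
    thresholds = {}
    # Step (2): ancestor thresholds that let every positive document pass.
    for a in ancestors:
        upper = min(doa[(p, a)] for p in positives)
        below = [doa[(n, a)] for n in doc_ids
                 if n not in members[a] and doa[(n, a)] <= upper]
        # Assumption: with no qualifying negative, nothing is filtered at a.
        thresholds[a] = max(below, default=-math.inf)
    # Steps (3)-(4): keep only negatives that survive every ancestor test.
    negatives = [q for q in doc_ids
                 if q not in members[x]
                 and all(doa[(q, a)] > thresholds[a] for a in ancestors)]

    # Step (5): pick h_{x,x} among the positives' DOA values (F1 is assumed).
    def f1(h):
        tp = sum(doa[(p, x)] >= h for p in positives)
        fp = sum(doa[(q, x)] >= h for q in negatives)
        fn = len(positives) - tp
        return 2 * tp / (2 * tp + fp + fn)

    candidates = sorted({doa[(p, x)] for p in positives})
    thresholds[x] = max(candidates, key=f1)  # first maximum = lowest threshold
    return thresholds
```

Restricting the leaf-level tuning to negatives that already passed the ancestor tests means the leaf threshold is only fitted against the documents it will actually have to reject.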
[Figure: example hierarchy containing Manufacturing, Systems Development (SA), Transaction Processing Systems, and Decision Support Systems (DSS), with two annotations.] • At SA: the threshold allows all relevant documents to pass (but may filter out many non-relevant documents) • At DSS: only those non-relevant documents that pass the test of SA are considered when tuning an optimum threshold
The Filtering Classifier
• Procedure: DF&DC(d), where d is a document.
• Return: A set S of categories to which d is classified (d may be classified into c only when it passes all tests of c and ancestors of c)
• Begin
• (1) Invoke DOAEstimation(d) to estimate DOA_{d,c}, for each category c;
• (2) S = ∅;
• (3) For each leaf category x, do
• (3.1) IsAccepted = true;
• (3.2) For each ancestor a of x, do
• (3.2.1) If DOA_{d,a} ≤ h_{a,x},
• (3.2.1.1) IsAccepted = false;
• (3.2.1.2) Exit the for-loop;
• (3.3) If IsAccepted = true,
• (3.3.1) If DOA_{d,x} < h_{x,x},
• (3.3.1.1) IsAccepted = false;
• (3.4) If IsAccepted = true,
• (3.4.1) S = S ∪ {x};
• (4) Return S;
• End.
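The decision procedure above can be sketched in Python as shown below. The sketch assumes that DOA values have already been estimated and that a document is rejected at an ancestor when its DOA is at or below that ancestor's threshold (≤). The function name and the dictionary layouts are hypothetical.

```python
def filter_and_classify(doa_d, ancestors_of, thresholds):
    """Return the set of leaf categories that accept one document.

    doa_d[c]            assumed precomputed DOA of the document for category c
    ancestors_of[x]     list of the ancestors of leaf category x
    thresholds[(c, x)]  threshold h_{c,x} tuned for leaf x (c is x or an ancestor)
    """
    accepted = set()
    for x, ancestors in ancestors_of.items():
        # The document must clear every ancestor test tuned for this leaf...
        passes_ancestors = all(doa_d[a] > thresholds[(a, x)] for a in ancestors)
        # ...and then reach the leaf's own threshold.
        if passes_ancestors and doa_d[x] >= thresholds[(x, x)]:
            accepted.add(x)
    return accepted
```

An empty result corresponds to filtering: the document is not placed in any category of the hierarchy.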
Empirical Evaluation • Data • Source: the text hierarchy of Yahoo! • There were 507 categories (under 5 first-level categories) among which 211 were leaves (maximum height = 8) • There were 3612 documents in the leaves • Data splitting • 90% of the leaves served as “in-space” data (for DC) • 10% of the leaves served as “out-space” data (for DF)
Empirical Evaluation (Cont.) • Validation • 5-fold cross validation (i.e., 80% for training and 20% for testing) • Evaluation criteria • For DC • Precision • Recall • F1 = 2PR / (P + R) • For DF • Percentage of out-space documents successfully filtered (FR) • Average # of misclassifications for misclassified out-space documents (AM)
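The criteria above can be computed as in the following Python sketch. It assumes micro-averaging of precision and recall over (document, category) decisions, which the slide does not specify, and it reads AM as the mean number of categories assigned to those out-space documents that were not filtered. The function names are hypothetical.

```python
def dc_scores(predicted, truth):
    """Micro-averaged precision, recall and F1 for in-space documents.

    predicted[d] and truth[d] are sets of category names for document d.
    """
    tp = sum(len(predicted[d] & truth[d]) for d in truth)
    n_pred = sum(len(predicted[d]) for d in truth)
    n_true = sum(len(truth[d]) for d in truth)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_true if n_true else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def df_scores(predicted_out):
    """FR and AM for out-space documents, which belong to no known category.

    predicted_out[d] is the set of categories assigned to out-space document d;
    an empty set means d was successfully filtered.
    """
    wrong = [len(s) for s in predicted_out.values() if s]
    fr = 1 - len(wrong) / len(predicted_out)       # share of documents filtered
    am = sum(wrong) / len(wrong) if wrong else 0.0  # mean misclassifications
    return fr, am
```

Reporting AM only over the misclassified documents keeps it independent of FR: a system can filter most out-space documents and still scatter the remainder over many categories.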
Empirical Evaluation (Cont.) • Systems evaluated • ICenter • Baseline: Rocchio’s classifier with thresholding (RO+T) • χ² (chi-square) technique for feature selection
Empirical Evaluation (Cont.) • Result • When compared with the baseline using a feature set of size 40000, ICenter achieved a 6.2% improvement in FR and an 18% reduction in AM
Conclusion • Main contribution • Exploring how and to what extent COD recognition may contribute to integrated DF and DC • The developed technique, ICenter, is both • More manageable (no need to tune feature sets), and • More competent (able to achieve better performance in both DC and DF)