This presentation examines hierarchical document classification models with error control schemes for improved accuracy and performance. It covers recovery-oriented and error-masking schemes and their impact, with experiments, comparisons, and conclusions on real-life applications.
Hierarchical Classification of Documents with Error Control
Chun-Hung Cheng, Jian Tang, Ada Wai-chee Fu, Irwin King
Overview • Abstract • Problem Description • Document Classification Model • Error Control Schemes • Recovery oriented scheme • Error masking scheme • Experiments • Conclusion
Abstract • Traditional document classification (flat classification) involves only a single classifier • The single classifier handles all classes • Slow, with high overhead
Abstract • Hierarchical document classification • Class hierarchy • Use one classifier at each internal node
Abstract • Advantage • Better performance • Disadvantage • A wrong final result if the document is misclassified at any node
Abstract • Introduce error control mechanism • Approach 1 (recovery oriented) • Detect and correct misclassification • Approach 2 (error masking) • Mask errors by using multiple versions of classifiers
Problem Description • [Diagram: the training system takes the class taxonomy, the training documents, and the class-doc relation (a table of class | doc_id pairs) as input, and outputs statistics and feature terms]
Problem Description • [Diagram: the classification system takes the statistics, the feature terms, and incoming documents as input, and outputs the target class for each document]
Problem Description • Objective: achieve • High accuracy • Fast classification • Our proposed algorithms provide a good trade-off between accuracy and performance
Document Classification Model • [Figure: a parent class c with child classes c1, c2, …, cn] • Formally, we use the model from [Chakrabarti et al. 1997] • Based on a naive Bayes model • For simplicity, we study a single-node classifier
The probability that an incoming document d belongs to class c is

$$\Pr(c \mid d) \;\propto\; \Pr(c)\prod_{i \in d} p_{i,c}^{\,z_{i,d}}$$

where z_{i,d} is the number of occurrences of term i in the incoming document d, and p_{j,c} is the probability that a word in class c is term j (estimated using the training data).
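As a minimal sketch (not the authors' implementation), the classifier can be evaluated in log space to avoid floating-point underflow; `prior` and `term_prob` are hypothetical inputs holding the training statistics, and the smoothing constant for unseen terms is an assumption:

```python
import math

def classify(doc_terms, prior, term_prob, classes):
    """Score each class c by log Pr(c) + sum_i z_{i,d} * log p_{i,c}.

    doc_terms: dict mapping term -> count z_{i,d} in document d
    prior:     dict mapping class -> Pr(c)
    term_prob: dict mapping (term, class) -> p_{i,c}
    """
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = math.log(prior[c])
        for term, count in doc_terms.items():
            # Floor for terms unseen in class c (an assumed smoothing choice).
            p = term_prob.get((term, c), 1e-9)
            score += count * math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```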
Feature Selection • Previous formula involves all the terms • Feature selection reduces cost by using only the terms with good discriminating power • Use the training sets to identify the feature terms
Fisher's Index • Fisher's Index indicates the discriminating power of a term • Good discriminating power: large interclass distance, small intraclass distance • [Figure: classes c1 and c2 plotted along the term weight w(t), showing the interclass distance between the class means and the intraclass spread within each class]
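For two classes, a common form of the index for a term t is

$$\mathrm{Fisher}(t) = \frac{\big(\mu_1(t) - \mu_2(t)\big)^2}{\sigma_1^2(t) + \sigma_2^2(t)}$$

where μ_c(t) and σ_c²(t) are the mean and variance of t's weight over the documents of class c (the paper generalizes over all class pairs; the exact normalization here is an assumption). A sketch:

```python
def fisher_index(weights_c1, weights_c2):
    """Fisher's Index for one term: squared interclass distance over
    intraclass spread. weights_c1 / weights_c2 are the term's weights
    (e.g. normalized frequencies) in the documents of each class."""
    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    interclass = (mean(weights_c1) - mean(weights_c2)) ** 2
    intraclass = variance(weights_c1) + variance(weights_c2)
    return interclass / intraclass if intraclass > 0 else float("inf")
```

Terms are ranked by this score and only the top-ranked ones are kept as feature terms.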
Document Classification Model • [Figure: a parent class c with child classes c1, c2, …, cn] • Consider only feature terms in the classification function p(ci | c, d) • Pick the ci with the largest probability • Use one classifier at each internal node
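Putting this together, hierarchical classification walks the taxonomy top-down, invoking one classifier per internal node. A minimal sketch, assuming a per-node `classify_at(doc_terms, node)` function (for example, the `classify` sketch above supplied with that node's children and statistics):

```python
def classify_hierarchical(doc_terms, root, children, classify_at):
    """Top-down hierarchical classification.

    root:        root class of the taxonomy
    children:    dict mapping a class to its list of child classes
    classify_at: function(doc_terms, node) -> most probable child of node
    Returns the leaf class reached.
    """
    node = root
    while children.get(node):          # stop when a leaf class is reached
        node = classify_at(doc_terms, node)
    return node
```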
Recovery Oriented Scheme • Database system • Failure in DBMS • Restart from a consistent state • Document classification • Error detected • Restart from a correct class (High Confidence Ancestor, or HCA)
Recovery Oriented Scheme • In practice, rollback is slow, so we identify wrong paths and avoid them • To identify wrong paths, we define a closeness indicator (CI) • The classifier is assumed to be on a wrong path when the CI falls below a threshold (see the sketch below)
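An illustrative sketch only: the paper's actual CI definition and restart rule are not given on these slides, so the `closeness` function, the candidate-restricted `classify_at(doc_terms, node, options)` variant, and the branch-avoidance bookkeeping are all assumptions; only the distance-2 HCA comes from the slides:

```python
def classify_with_recovery(doc_terms, root, children, classify_at,
                           closeness, threshold, hca_distance=2):
    """Top-down classification with error recovery (a sketch).

    At each node the best non-avoided child is chosen; if the chosen
    child's closeness indicator (CI) falls below the threshold, the
    path is assumed wrong: roll back to the high confidence ancestor
    (HCA), kept hca_distance levels above the current node, and avoid
    the branch that was taken just below it.
    """
    path = [root]
    avoid = set()
    while children.get(path[-1]):
        node = path[-1]
        options = [c for c in children[node] if c not in avoid]
        if not options:
            break                       # nowhere left to go; stop here
        child = classify_at(doc_terms, node, options)
        if closeness(doc_terms, child) < threshold:
            hca_idx = max(len(path) - 1 - hca_distance, 0)
            # Mark the branch below the HCA as a wrong path.
            avoid.add(path[hca_idx + 1] if hca_idx + 1 < len(path) else child)
            path = path[:hca_idx + 1]   # restart from the HCA
        else:
            path.append(child)
    return path[-1]
```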
Recovery Oriented Scheme • Define the distance between the HCA and the current node to be 2 • [Figure: two animation steps showing a path in the class taxonomy, with the HCA two levels above the current node and the wrong path marked]
Error Masking Scheme • Software Fault Tolerance • Run multiple versions of software • Majority voting • Document Classification • Run classifiers of different designs • Majority voting
O-Classifier • Traditional classifier
N-Classifier • Skips some intermediate levels of the class hierarchy
Error Masking Scheme • Run three classifiers in parallel • O-classifier • N-classifier • O-classifier using a new feature length • This selection minimizes the time wasted waiting for the slowest classifier (a voting sketch follows)
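A minimal sketch of the masking step, assuming each classifier maps a document to a target class; the two-out-of-three rule and the fall-back when all three disagree are assumptions, since the slides do not spell out tie handling:

```python
from collections import Counter

def masked_classify(doc_terms, classifiers):
    """Run independently designed classifiers and take a majority
    vote, masking an error made by any single classifier."""
    votes = [clf(doc_terms) for clf in classifiers]
    winner, count = Counter(votes).most_common(1)[0]
    # With three classifiers, any class chosen at least twice wins;
    # if all three disagree, fall back to the first classifier.
    return winner if count >= 2 else votes[0]
```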
Experiments • Data Sets • US Patents • Preclassified • Rich text content • Highly hierarchical • 3 sets collected • 3 levels / large number of documents • 4 levels / large number of documents • 7 levels / small number of documents
Experiments • Algorithms compared • Simple hierarchical • TAPER • Flat • Recovery oriented • Error masking • Generally, flat is the slowest and the most accurate, while simple hierarchical is the fastest and the least accurate
Conclusion • Real-life applications involve large taxonomies, where flat classification is too slow • Our algorithms are faster than flat classification with as few as 4 levels • The performance gain widens as the number of levels increases • A good trade-off between accuracy and performance for most applications
Thank You The End