
ApMl (All Purpose Machine Learning) Toolkit

David W. Miller and Helen Howell. Semantic Web Final Project, Spring 2002. Department of Computer Science, University of Georgia. www.cs.uga.edu/~miller/SemWeb www.cs.uga.edu/~helen/SemWeb/SemWeb.html


Presentation Transcript


  1. ApMl (All Purpose Machine Learning) Toolkit David W. Miller and Helen Howell Semantic Web Final Project Spring 2002 Department of Computer Science University of Georgia www.cs.uga.edu/~miller/SemWeb www.cs.uga.edu/~helen/SemWeb/SemWeb.html

  2. What Has Been Done • Extensive research into the effectiveness of machine learning algorithms has been performed • Systems are trained on an expert-created taxonomy with expert-specified documents

  3. What We Did • Train the system on a domain-specific taxonomy • E.g., CNN’s Sports Pages • Test the system’s ability to correctly classify documents from a second, yet similar, taxonomy • E.g., Yahoo! Sports Pages

  4. Automatic Text Classification via Statistical Methods • Text Categorization is the problem of assigning predefined categories to free text documents. • Statistical Learning Methods used in ApMl • Bayes Method • Rocchio Method (most popular) • K-Nearest Neighbor Classification • Probabilistic Indexing

  5. “Bag-of-words” • Example document: “Reinforcement Learning: a Survey. This paper surveys the field of reinforcement learning from a computer science perspective.” • Word counts: a 35, block 1, computer 12, field 4, leg 1, machine 7, of 44, paper 3, perspective 2, rate 1, reinforcement 5, science 9, survey 2, the 56, this 11, underrated 1, … • A Probabilistic Generative Model (Bayes): define a probabilistic generative model for documents with classes. Automatic Text Classification through Machine Learning, McCallum, et al.
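
The bag-of-words idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's code: the function name `bag_of_words` and the simple letters-only tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping; case and word order are discarded."""
    # Assumption: a token is any run of ASCII letters, lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = ("Reinforcement Learning: a Survey. This paper surveys the field of "
       "reinforcement learning from a computer science perspective.")
counts = bag_of_words(doc)
```

Only the counts survive; the order of the words in the document is lost, which is what the representation's name refers to.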

  6. Bayes Method • Pick the most probable class, given the evidence: $c^* = \arg\max_{c_j} P(c_j \mid d)$ • $c_j$: a class (like “Planning”) • $d$: a document (like “language intelligence proof...”) • Bayes Rule: $P(c_j \mid d) = \frac{P(c_j)\, P(d \mid c_j)}{P(d)}$ • $P(c_j \mid d)$: probability that category $c_j$ should be assigned to document $d$. Automatic Text Classification through Machine Learning, McCallum, et al.

  7. Bayes Rule • $P(c_j \mid d)$: probability that document $d$ belongs to category $c_j$ • $P(d)$: probability that a randomly picked document has the same attributes as $d$ • $P(c_j)$: probability that a randomly picked document belongs to this category • $P(d \mid c_j)$: probability that category $c_j$ contains document $d$

  8. Bayes Method • Generates conditional probabilities of particular words occurring in a document, given that it belongs to a particular category. • A larger vocabulary generates better probabilities. • Each category is given a threshold p by which it judges whether a document is worthy of falling in that classification. • Documents may fall into one category, more than one, or none at all.
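
As a rough sketch of the Bayes method with a per-category threshold, the following Python trains word probabilities and returns every category whose posterior reaches p. It assumes a multinomial model with add-one (Laplace) smoothing and a single shared threshold; the rainbow implementation used by ApMl may differ, and all function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (list_of_words, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocab = set()
    for words, cat in labeled_docs:
        doc_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return word_counts, doc_counts, vocab

def posteriors(model, words):
    """Return P(category | document) for every category."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        score = math.log(doc_counts[cat] / total_docs)  # log P(cj)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                # Assumption: add-one smoothing for P(w | cj).
                score += math.log((word_counts[cat][w] + 1)
                                  / (total_words + len(vocab)))
        log_scores[cat] = score
    # Normalizing over categories plays the role of dividing by P(d).
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def classify(model, words, p=0.5):
    """Return all categories whose posterior is at least the threshold p."""
    return sorted(c for c, prob in posteriors(model, words).items() if prob >= p)

training = [
    (["goal", "ball", "team"], "sports"),
    (["ball", "score", "team"], "sports"),
    (["doctor", "diet", "health"], "health"),
    (["diet", "vaccine", "doctor"], "health"),
]
model = train_naive_bayes(training)
```

With this toy data, `classify(model, ["ball", "team"], p=0.5)` selects a single category, while raising or lowering p yields no category or more than one, matching the three outcomes listed on the slide.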

  9. Rocchio Method • Each document D is represented as a vector within a given vector space V: • Documents with similar content have similar vectors • Each dimension of the vector space represents a word selected via a feature selection process

  10. Rocchio Method • Values of d(i) for a document d are calculated as a combination of the statistics TF(w,d) and DF(w) • TF(w,d) (Term Frequency) is the number of times word w occurs in a document d. • DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.

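  11. Rocchio Method • The inverse document frequency is calculated as $IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)$, where $|D|$ is the total number of documents • The value $d^{(i)}$ of feature $w_i$ for a document $d$ is calculated as the product $d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$ • $d^{(i)}$ is called the weight of the word $w_i$ in the document $d$.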

  12. Rocchio Method • Based on word weight heuristics, the word wi is an important indexing term for a document d if it occurs frequently in that document • However, words that occur frequently in many documents spanning many categories are rated as less important
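
The TF and DF statistics above combine into word weights roughly as follows. This sketch assumes the common definition IDF(w) = log(|D| / DF(w)), with |D| the number of documents; the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))  # DF counts each word at most once per document
    # Assumption: IDF(w) = log(|D| / DF(w)).
    idf = {w: math.log(n_docs / df[w]) for w in df}
    vectors = []
    for words in docs:
        tf = Counter(words)  # TF(w, d): occurrences of w in this document
        vectors.append({w: tf[w] * idf[w] for w in tf})
    return vectors

docs = [["golf", "golf", "the"], ["tennis", "the"], ["the", "nba"]]
vectors = tfidf_vectors(docs)
```

A word such as "the" that appears in every document gets weight zero, while "golf", frequent in one document and absent elsewhere, gets the largest weight in that document.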

  13. K-Nearest Neighbor • Features • All instances correspond to points in an n-dimensional Euclidean space • Classification is delayed till a new instance arrives • Classification done by comparing feature vectors of the different points • Target function may be discrete or real-valued K-Nearest Neighbor Learning, Dipanjan Chakraborty

  14. 1-Nearest Neighbor K-Nearest Neighbor Learning, Dipanjan Chakraborty

  15. K-Nearest Neighbor • An arbitrary instance x is represented by the feature vector $(a_1(x), a_2(x), a_3(x), \ldots, a_n(x))$ • $a_i(x)$ denotes the i-th feature • Euclidean distance between two instances: $d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$ • Find the k nearest neighbors whose distance from the test case falls within a threshold p. • If x of those k nearest neighbors are in category ci, then assign the test case to ci; otherwise it is unmatched. K-Nearest Neighbor Learning, Dipanjan Chakraborty
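
The procedure on this slide can be sketched as below. The parameter names `k`, `p`, and `x` follow the slide; reading "x of those neighbors" as "at least x votes for the most common category" is an assumption, as are the function names and the toy training points.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def knn_classify(train, query, k=3, p=1.0, x=2):
    """train: list of (vector, category). Returns a category, or None if unmatched."""
    scored = sorted(((euclidean(vec, query), cat) for vec, cat in train),
                    key=lambda pair: pair[0])
    # Keep only those of the k nearest neighbors that lie within distance p.
    near = [cat for dist, cat in scored[:k] if dist <= p]
    if not near:
        return None
    cat, votes = Counter(near).most_common(1)[0]
    # Assumption: the test case matches only if at least x neighbors agree.
    return cat if votes >= x else None

train = [((0, 0), "golf"), ((0, 1), "golf"),
         ((5, 5), "tennis"), ((5, 6), "tennis")]
```

No model is built in advance: all the work happens when `knn_classify` is called, which is the "classification is delayed till a new instance arrives" property noted earlier.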

  16. Probabilistic Indexing • Goal is to estimate P(C|si, dm) • Probability that assignment of term si to the document dm is correct • Once terms have been identified, assign Form Of Occurrence (FOC) • Certainty that term is correctly identified • Significance of Term

  17. Probabilistic Indexing Cont. • If term t appears in document d and a term descriptor from t to s exists, where s is an indexing term, then generate a descriptor indicator • The set of generated term descriptors can be evaluated and a probability calculated that document d lies in class c

  18. ApMl Toolkit • Built on top of and extends existing toolkits • rainbow (CMU) – Machine Learning • wget (GNU) – Web Crawler • 4 Machine Learning Algorithms and 2 Classification Committees • Web Crawler and Document Retrieval • Automated Testing

  19. Machine Learning Components • 4 Machine Learning Algorithms (rainbow) • Naïve Bayes, Rocchio, KNN, Probabilistic Indexing • 2 Classification Committees (ApMl) • Weight Assigned For Overall Accuracy • Weights Assigned For Accuracy within each Class of Taxonomy
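
One plausible reading of the two committees is weighted voting, where each algorithm's vote counts in proportion either to its overall accuracy or to its accuracy on the class it predicts. The slides do not say how ApMl actually combines the votes, so the sketch below is an assumption throughout: `committee_vote`, the algorithm labels, and every accuracy value are made up for illustration.

```python
from collections import defaultdict

def committee_vote(predictions, overall_acc=None, class_acc=None):
    """predictions: {algorithm: predicted_category}.

    overall_acc: {algorithm: accuracy}, one weight per algorithm.
    class_acc:   {algorithm: {category: accuracy}}, one weight per class.
    If class_acc is given it takes precedence over overall_acc.
    """
    totals = defaultdict(float)
    for algo, cat in predictions.items():
        if class_acc is not None:
            weight = class_acc[algo].get(cat, 0.0)
        else:
            weight = overall_acc[algo]
        totals[cat] += weight
    # The category with the largest summed weight wins.
    return max(totals, key=totals.get)

# Hypothetical predictions and accuracies, not measured results.
predictions = {"bayes": "golf", "rocchio": "tennis", "knn": "tennis", "prind": "golf"}
overall = {"bayes": 0.9, "rocchio": 0.8, "knn": 0.4, "prind": 0.7}
per_class = {
    "bayes": {"golf": 0.5},
    "rocchio": {"tennis": 0.9},
    "knn": {"tennis": 0.6},
    "prind": {"golf": 0.6},
}
```

With these made-up numbers the two committees disagree: the overall-accuracy committee picks "golf" while the per-class committee picks "tennis", which shows why the two weighting schemes are worth comparing.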

  20. Document Retrieval • Web Crawler and Document Retrieval • Specify Starting URL • Specify Recursion Depth • Allow Multiple Domain Spanning • Specify Excluded Domains • Store all retrieved pages into a single directory (ApMl)
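
The retrieval options listed here map naturally onto standard GNU wget flags. The sketch below only builds the command line; how ApMl actually invokes wget is not stated in the slides, so the helper `build_wget_command`, the `pages` directory, and the excluded domain are assumptions.

```python
def build_wget_command(start_url, depth, out_dir,
                       span_domains=False, excluded_domains=()):
    """Build a wget argument list for a recursive crawl into one directory."""
    cmd = [
        "wget",
        "--recursive",                     # follow links from the starting URL
        f"--level={depth}",                # recursion depth
        "--no-directories",                # store all pages in a single directory
        f"--directory-prefix={out_dir}",   # where retrieved pages are saved
    ]
    if span_domains:
        cmd.append("--span-hosts")         # allow the crawl to leave the start domain
    if excluded_domains:
        cmd.append("--exclude-domains=" + ",".join(excluded_domains))
    cmd.append(start_url)
    return cmd

cmd = build_wget_command("http://sports.yahoo.com", 2, "pages",
                         span_domains=True,
                         excluded_domains=["ads.example.com"])
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` to start the crawl; building the arguments as a list avoids shell quoting problems with URLs.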

  21. Automated Testing • Choose Algorithms to Test • Choose Test Directory • Specify Number of Tests • All results are placed into persistent window for evaluation

  22. Effectiveness: Contingency Table Machine Learning for Text Classification, David D. Lewis, AT&T Labs

  23. Effectiveness Measures • precision = a/(a+b) • Documents classified correctly vs. All classified as a particular category • recall = a/(a+c) • Documents classified correctly vs. All that should have been classified in a category • accuracy = (a+d)/(a+b+c+d) • All documents classified as positive or negative in a category correctly vs All classified Machine Learning for Text Classification, David D. Lewis, AT&T Labs
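
The three measures can be computed directly from the contingency-table cells. The sketch assumes the usual cell layout implied by the formulas: a = correctly assigned to the category, b = incorrectly assigned, c = incorrectly rejected, d = correctly rejected. The function name and the sample counts are hypothetical.

```python
def effectiveness(a, b, c, d):
    """Return (precision, recall, accuracy) for one category.

    a: in the category and classified as such
    b: not in the category but classified as such
    c: in the category but not classified as such
    d: not in the category and not classified as such
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    total = a + b + c + d
    accuracy = (a + d) / total if total else 0.0
    return precision, recall, accuracy

# Hypothetical counts, not results from the ApMl tests.
precision, recall, accuracy = effectiveness(a=8, b=2, c=4, d=86)
```

Accuracy can stay high even when recall is poor, because the many correctly rejected documents in d dominate the total; this is why the test plan reports all three measures per category.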

  24. Test Plan • Choose two areas and selected subcategories • Sports • Football • Tennis • Golf • NBA • Health • Children • Men • Women

  25. Test Plan Continued • Sport Web Sites • www.sportsillustrated.com • sports.yahoo.com • www.usatoday.com/sports/sfront.htm • Health Web Sites • www.patient.co.uk • www.cdc.gov/health • www.bbc.co.uk/health

  26. Test Plan Continued • Train the system on pages from one taxonomy from one domain and test on another taxonomy for the same area • Determine contingency tables for each category • Compute effectiveness using precision, recall, and accuracy

  27. Sports Test Results ApMl Test Results

  28. Health Test Results ApMl Test Results

  29. Comparison of Precision ApMl Test Results

  30. Comparison of Recall ApMl Test Results

  31. Comparison of Sports Additional Levels ApMl Test Results

  32. Comparison of Health Additional Levels ApMl Test Results

  33. Comparison of Accuracy ApMl Test Results

  34. Trends of Results • K Nearest Neighbor effectiveness was significantly lower than that of the other algorithms • It continuously categorized documents the same way • The Health class was much more difficult for the algorithms to categorize correctly • Children’s health is a non-gender class • No improvement in our results with additional training

  35. Conclusions • Results of automatic text categorization are subjective • Trends can occur because of various factors • Heterogeneous taxonomies can be used for automatic classification with acceptable efficiencies • More research needed

  36. Resources • Dipanjan Chakraborty. “K-Nearest Neighbor Learning.” A PowerPoint Presentation. • Norbert Fuhr and Ulrich Pfeifer. “Combining Model-Oriented and Description-Oriented Approaches for Probabilistic Indexing.” Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46-56. ACM, New York. 1991. • Thorsten Joachims. “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.” Technical Report, CMU, March 1996. • Fabrizio Sebastiani. “Machine Learning in Automated Text Categorization.” ACM Computing Surveys, 34(1):1-47, 2002. • Amit Sheth, et al. “Semantic Web Content Management for Enterprises and the Web.” In submission to IEEE Internet Computing.
