Practical Issues for Automated Categorization of Web Sites

Presentation Transcript

  1. Practical Issues for Automated Categorization of Web Sites
     John M. Pierre, jpierre@metacode.com
     Metacode Technologies, Inc., 139 Townsend Street, San Francisco, CA 94107
     (Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)

  2. Outline
     • Project overview
     • Web content
       • Automated Categorization
       • Feature Selection
       • Metadata
     • Experimental Setup
       • Data
       • Targeted Spidering
       • System Architecture
     • Results
     • Conclusions

  3. Project Overview
     • Specific:
       • Categorize a large number of domain names by industry category
       • NAICS classification scheme
       • ~30,000 domain names for testing (.com)
       • Text categorization approach
     • General:
       • Domain specific classification
       • Metadata
       • Targeted spidering
       • Feature selection
       • Classifier training

  4. Web Content: Automated Categorization
     • Challenges:
       • Vast (over 1 billion pages)
       • Heterogeneous (content, formats, not just HTML)
       • Dynamic (growing, changing)
     • Benefits:
       • Good source of information
       • Accessible!
       • Machine readable (vs. machine understandable)
       • Semi-structured
     • Tools:
       • Classification
       • Automated classification
       • Text Categorization/Machine Learning
       • Intelligent agents
     • Related Work
       • Manual: Yahoo!, Open Directory Project, Looksmart
       • Automatic: Northern Light, Thunderstone/Texis, Inktomi
       • Other: EU Project DESIRE II, Pharos, Attardi, Sebastiani et al., L. Page et al., McCallum et al.

  5. Web Content: Feature Selection
     • Text Features (D. Lewis):
       • Relatively few in number
       • Moderate in frequency of assignment
       • Low in redundancy
       • Low in noise
       • Related in semantic scope to the classes to be assigned
       • Relatively unambiguous in meaning
     • Preliminary Experiment
       • 1125 web domains
       • SEC+NAICS training set
     • Use metadata if possible; use body text as a last resort!
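The "metadata first, body text as a last resort" rule from slide 5 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The choice of the `keywords` and `description` metatags, and the names `FeatureExtractor` and `extract_features`, are assumptions made for the example.

```python
from html.parser import HTMLParser


class FeatureExtractor(HTMLParser):
    """Collects meta keywords/description and visible body text from HTML."""

    def __init__(self):
        super().__init__()
        self.meta = []         # content of <meta name="keywords|description">
        self.body = []         # text chunks found inside <body>
        self._in_body = False
        self._skip = 0         # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            # Assumption: only these two metatags are treated as features.
            if name in ("keywords", "description") and content:
                self.meta.append(content)
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip and data.strip():
            self.body.append(data.strip())


def extract_features(html):
    """Return (source, text): metadata if present, otherwise body text."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    if parser.meta:
        return "metadata", " ".join(parser.meta)
    return "body", " ".join(parser.body)
```

Returning the source label alongside the text lets a caller track how often the fallback to body text was needed.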

  6. Web Content: Metadata

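  7. Experimental Setup: Targeted Spidering
     [Flowchart; labels recovered from the slide, branch targets partly lost in extraction]
     • Lane labels: Domain name, ‘Query’, Pages
     • HTTP Get
     • Live? Yes / No (on No: try www.)
     • Frames? Yes / No
     • Metatags? Yes / No
     • Use <body>
     • <a href=? (link keywords: prod, service, about, info, press, news)
     • Send Query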
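The link-following step of the targeted spider on slide 7 can be sketched as below. The keyword list is the one shown on the slide. Everything else is an assumption for illustration: matching keywords against the URL path, staying on the same host, trying the bare domain before the `www.` variant, and the names `candidate_urls`, `LinkCollector` and `select_target_links`. No network access is performed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Keyword list as shown on the slide.
TARGET_KEYWORDS = ("prod", "service", "about", "info", "press", "news")


def candidate_urls(domain):
    """Start URLs to try in order: the bare domain, then the www. variant."""
    return ["http://%s/" % domain, "http://www.%s/" % domain]


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def select_target_links(base_url, html):
    """Return same-host links whose URL path contains a target keyword."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc.lower()
    selected = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)       # resolve relative links
        parts = urlparse(url)
        if parts.netloc.lower() != base_host:
            continue                        # stay on the domain being categorized
        if any(k in parts.path.lower() for k in TARGET_KEYWORDS):
            if url not in selected:         # drop duplicates, keep page order
                selected.append(url)
    return selected
```

Restricting the crawl to a handful of keyword-matched pages is one way to get the "efficient use of resources" that the conclusions attribute to targeted spidering.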

  8. Experimental Setup: Data
     Classification scheme: NAICS
       11     Agriculture, Forestry, Fishing and Hunting
       21     Mining
       23     Construction
       31-33  Manufacturing
       42     Wholesale Trade
       44-45  Retail Trade
       48-49  Transportation and Warehousing
       51     Information
       52     Finance and Insurance
       53     Real Estate and Rental and Leasing
       54     Professional, Scientific and Technical Services
       55     Management of Companies and Enterprises
       56     Admin. Support, Waste Mgmt and Remediation Srvcs
       61     Educational Services
       62     Health Care and Social Assistance
       71     Arts, Entertainment & Recreation
       72     Accommodation and Food Services
       81     Other Services (except 92)
       92     Public Administration
       99     Unclassified Establishments
     • Test Data
       • ~30,000 domain names (SIC)
       • ~13,500 pre-classified/content
     • Training Data
       • “SEC-NAICS”:
         • 1504 SEC 10-K filings (SIC)
         • 426 NAICS labels/descriptions
       • “Web pages”:
         • 3618 pre-classified domains
     • Crosswalk
       • SIC <-> NAICS

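  9. Experimental Setup: System Architecture
     [Architecture diagram; component labels recovered from the slide]
     • The Web
     • Domain Names
     • Spider
     • Text Query
     • IR Engine
     • Training sets: SEC-NAICS, Web pages
     • Matching documents
     • Decision
     • Example output: Foo.com: 11, 21, 23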
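The architecture on slide 9 treats the spidered text of a domain as a query against an index of labeled training documents, then makes a decision from the matching documents. The slides do not specify the IR engine's scoring function or the decision rule, so the sketch below substitutes a small TF-IDF cosine ranking and a similarity-weighted vote over the top `k` matches. The class name `TinyIRClassifier` and the parameters `k` and `max_categories` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIRClassifier:
    """Indexes labeled documents; classifies a query by voting over its
    top-k most similar documents (an assumed decision rule)."""

    def __init__(self, training_docs):
        # training_docs: list of (category, text) pairs
        self.labels = [label for label, _ in training_docs]
        counts = [Counter(tokenize(text)) for _, text in training_docs]
        n = len(counts)
        df = Counter(term for c in counts for term in c)  # document frequency
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
        self.vectors = [self._weight(c) for c in counts]

    def _weight(self, counts):
        """Unit-length TF-IDF vector; terms unseen in training are dropped."""
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def classify(self, query_text, k=3, max_categories=3):
        """Return up to max_categories labels, best first."""
        q = self._weight(Counter(tokenize(query_text)))
        scored = []
        for label, vec in zip(self.labels, self.vectors):
            sim = sum(w * vec.get(t, 0.0) for t, w in q.items())  # cosine
            if sim > 0:
                scored.append((sim, label))
        scored.sort(reverse=True)
        votes = defaultdict(float)
        for sim, label in scored[:k]:
            votes[label] += sim  # similarity-weighted vote
        ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
        return [label for label, _ in ranked[:max_categories]]
```

Returning a short ranked list rather than a single label mirrors the slide's example output, where Foo.com receives several candidate sectors.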

  10. Results
      • P = Precision = # correctly assigned / # assigned
      • R = Recall = # correctly assigned / # total correct
      • F1 = 2PR / (P + R)
      • Micro-averaged = computed over all categories
      • Macro-averaged = computed per category, then averaged
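The definitions on slide 10 can be turned into a short evaluation routine. This sketch assumes each domain may carry several category labels, and it reports macro-averaged F1 as the mean of the per-category F1 values (the slide does not say which macro-F1 variant was used). The function names and the convention of scoring an undefined ratio as 0.0 are assumptions.

```python
from collections import Counter


def _ratio(num, den):
    """num / den, with an undefined ratio scored as 0.0."""
    return num / den if den else 0.0


def f1(p, r):
    """F1 = 2PR / (P + R)."""
    return _ratio(2 * p * r, p + r)


def evaluate(assigned, correct):
    """assigned, correct: dicts mapping item -> set of category labels.

    Returns {"micro": (P, R, F1), "macro": (P, R, F1)}.
    """
    hits, n_assigned, n_correct = Counter(), Counter(), Counter()
    for item in set(assigned) | set(correct):
        a = assigned.get(item, set())
        c = correct.get(item, set())
        n_assigned.update(a)
        n_correct.update(c)
        hits.update(a & c)  # correctly assigned labels
    categories = sorted(set(n_assigned) | set(n_correct))

    # Micro-average: pool the counts over all categories, then compute P, R, F1.
    micro_p = _ratio(sum(hits.values()), sum(n_assigned.values()))
    micro_r = _ratio(sum(hits.values()), sum(n_correct.values()))

    # Macro-average: compute P, R, F1 per category, then take the mean.
    per_cat = []
    for cat in categories:
        p = _ratio(hits[cat], n_assigned[cat])
        r = _ratio(hits[cat], n_correct[cat])
        per_cat.append((p, r, f1(p, r)))
    n = len(per_cat)
    macro = tuple(_ratio(sum(x[i] for x in per_cat), n) for i in range(3))

    return {"micro": (micro_p, micro_r, f1(micro_p, micro_r)), "macro": macro}
```

Micro-averaging weights every assignment equally, so large categories dominate; macro-averaging weights every category equally, so rarely used sectors count as much as common ones.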

  11. Conclusions • Domain Specific Classification • Knowledge Gathering • Use of specialized knowledge • Targeted Spidering • Efficient use of resources • Extract key features, Metadata • Training • Prior knowledge • Bootstrapping • Classification • Robust, tolerant of noisy data • Benefits of Semantic Web • Better Metadata • Semantic linking & intelligent spidering