Web classification

Web classification. Ontology and Taxonomy. References. Using Ontologies to Discover Domain-Level Web Usage Profiles {hdai,mobasher}@cs.depaul.edu

meira

Presentation Transcript


  1. Web classification Ontology and Taxonomy

  2. References • Using Ontologies to Discover Domain-Level Web Usage Profiles {hdai,mobasher}@cs.depaul.edu • Learning to Construct Knowledge Bases from the World Wide Web. {M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery} Carnegie Mellon University, Pittsburgh, USA; {D. Freitag, A. McCallum} Just Research, Pittsburgh, USA

  3. Definitions • Ontology • An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them. • Taxonomy • A classification of organisms into groups based on similarities of structure, origin, etc.

  4. Goal • Capture and model behavioral patterns and profiles of users interacting with a web site. • Why? • Collaborative filtering • Personalization systems • Improve the organization and structure of the site • Provide dynamic recommendations (www.recommend-me.com)

  5. Algorithm 0 (by Rafa’s brother: Gabriel) • Recommend pages viewed by other users with similar page ranks. • Problems • New item problem • Considers neither content similarity nor item-to-item relationships.

  6. User session • User session s: <w(p1,s), w(p2,s), …, w(pn,s)> • w(pi,s) is the weight associated with page pi in session s • Session clusters {cl1, cl2, …} • Each cli is a subset of the set of sessions • Usage profile pr_cl = {<p, weight(p, pr_cl)> : weight(p, pr_cl) ≥ μ} • weight(p, pr_cl) = (1/|cl|) · ∑_{s ∈ cl} w(p,s)
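The profile definition on this slide can be sketched in a few lines of Python. This is only an illustration: representing a session as a dict from page to weight, and the names `usage_profile`, `cluster` and `mu`, are assumptions and do not come from the slides.

```python
from collections import defaultdict

def usage_profile(cluster, mu):
    """Average each page's weight over the sessions of one cluster.

    cluster: list of sessions, each a dict {page: w(p, s)} (assumed layout).
    mu: threshold; pages whose mean weight falls below it are dropped.
    """
    totals = defaultdict(float)
    for session in cluster:
        for page, weight in session.items():
            totals[page] += weight
    size = len(cluster)
    # weight(p, pr_cl) = (1/|cl|) * sum of w(p, s) over the sessions s in cl
    return {page: total / size for page, total in totals.items()
            if total / size >= mu}

# Hypothetical cluster of two sessions.
cluster = [
    {"/a": 1.0, "/b": 0.5},
    {"/a": 0.5, "/c": 0.25},
]
profile = usage_profile(cluster, mu=0.25)
```

A page missing from a session contributes a weight of zero to the average, which is why `/c` drops out of the profile here.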

  7. Algorithm 1 • For every session, create a vector containing the viewed pages and a weight for each page. • Each vector represents a point in an N-dimensional space, so we can identify clusters. • For a new session, check which cluster its vector/point belongs to, and recommend the high-scoring pages of that cluster • Problems • New item problem • Considers neither content similarity nor item-to-item relationships
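A minimal sketch of the recommendation step, assuming the clusters have already been found and are summarized by their usage profiles. The slides do not say how a session is matched to a cluster; cosine similarity is used here purely as an assumption, and all names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(page, 0.0) for page, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(session, profiles, top_n=3):
    """Pick the closest cluster profile and return its best unseen pages."""
    best = max(profiles, key=lambda name: cosine(session, profiles[name]))
    ranked = sorted(profiles[best].items(), key=lambda kv: kv[1], reverse=True)
    return best, [page for page, _ in ranked if page not in session][:top_n]

# Hypothetical cluster profiles (page -> weight) and a new session.
profiles = {
    "sports": {"/scores": 0.9, "/teams": 0.7, "/tickets": 0.4},
    "news": {"/world": 0.8, "/politics": 0.6},
}
best, pages = recommend({"/scores": 1.0}, profiles)
```

Because only pages that already appear in some profile can be returned, the sketch also makes the new item problem visible: a page nobody has visited yet is never recommended.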

  8. Algorithm 2: keyword search • Solves the new item problem. • Not good enough • A page can contain information about more than one object. • Key data may be linked from the page rather than included in it. • What exactly is a keyword? • Solution • Domain ontologies for objects

  9. Domain Ontologies • Domain-level aggregate profile: a set of pseudo-objects, each characterizing objects of a different type that occur commonly across the user sessions. • Class: C • Attribute a: <Da, Ta, ≤a, Ψa> • Da: domain of the values for a (red, blue, …) • Ta: type of the attribute • ≤a: ordering relation among the values in Da • Ψa: combination function

  10. Example – movie web site • Classes: • movies, actors, directors, etc. • Attributes: • Movies: title, genre, starring actors • Actors: name, filmography, gender, nationality • Functions: • Ψactor(<{S,0.7; T,0.2; U,0.1}, 1>, <{S,0.5; T,0.5}, 0.7>) = sum_i(wi*wo) / sum_i(wi) • Ψyear({1991},{1994}) = {1991,1994} • Ψis_a({person,student},{person,TA}) = {person}

  11. Creating an Aggregated Representation of a usage profile • pr = {<o1, wo1>, …, <on, won>} • oi: object; woi: significance of oi in the profile pr • Assume all the objects are instances of the same class • Create a new virtual object o’, with attributes ai’ = Ψi(o1,…,on)
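The construction of the virtual object o’ can be sketched as follows. The three combination functions mirror the movie example only loosely: the union and the intersection reproduce the Ψyear and Ψis_a results shown there, while the weighted mean normalizes by the sum of the object weights, which is an assumption, since the slide's formula does not make the denominator unambiguous. All function names are hypothetical.

```python
def psi_union(values, weights):
    """Psi_year-style combination: union of the value sets."""
    combined = set()
    for value in values:
        combined |= set(value)
    return combined

def psi_common(values, weights):
    """Psi_is_a-style combination: keep only concepts shared by all objects."""
    return set.intersection(*(set(value) for value in values))

def psi_weighted_mean(values, weights):
    """Psi_actor-style combination (assumed normalization by object weights)."""
    total = sum(weights)
    keys = set().union(*values)  # iterating a dict yields its keys
    return {key: sum(v.get(key, 0.0) * w for v, w in zip(values, weights)) / total
            for key in keys}

def aggregate(profile, combiners):
    """Build the pseudo-object: one combined value per attribute.

    profile: list of (object, weight) pairs, each object a dict of attributes.
    combiners: dict mapping attribute name to its combination function.
    """
    objects = [obj for obj, _ in profile]
    weights = [weight for _, weight in profile]
    return {attr: psi([obj[attr] for obj in objects], weights)
            for attr, psi in combiners.items()}

# Hypothetical profile of two movie objects with significance 1 and 0.7.
profile = [
    ({"year": {1991}, "actor": {"S": 0.7, "T": 0.2, "U": 0.1}}, 1.0),
    ({"year": {1994}, "actor": {"S": 0.5, "T": 0.5}}, 0.7),
]
pseudo = aggregate(profile, {"year": psi_union, "actor": psi_weighted_mean})
```

Keeping one function per attribute matches the slide's Ψi notation: each attribute decides for itself how values from several objects are merged.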

  12. Item level usage profile

  13. A real (estate property) example

  14. Item Level Usage Profile

  15. Algorithm 2 • Do not just recommend items viewed by other users; recommend items similar to the class representative. • Advantages: • Higher accuracy • Needs fewer examples • No new item problem • Also considers content similarity (item-to-item relationships).

  16. Item Level Usage Profile

  17. Final Algorithm • Given a web site • Classify its contents into classes and attributes. • Merge the objects of each user profile and create a pseudo-object. • Recommend according to this pseudo-object.

  18. Problems • A per-topic solution • The patterns found can be incomplete • User patterns may change over time (for movies): the “I loved ET” problem. • Needs cookies and other methods to identify users. • How is the weight calculated? May require many examples: the “I loved American Beauty” problem. • How to automatically group the web pages?

  19. Break?

  20. Constructing Knowledge Base from WWW • Goal: • Automatically create a computer-understandable knowledge base from the web. • Why? • For use in the previously described work and in similar applications • Find all universities that offer Java Programming courses • Make me hotel and flight arrangements for the upcoming Linux conference

  21. …Constructing Knowledge Base from WWW • How? • Use machine learning to create information extraction methods for each desired type of knowledge • Apply them to extract symbolic, probabilistic statements directly from the web: Student-of(Rafa, sdbi) = 99% • Method used • Provide an initial ontology (classes and relations) • Training examples: 3 out of 4 university sites (8000 web pages, 1400 web-page pairs)

  22. Example of web pages • Jim’s Home Page • I teach several courses: • Fundamentals of CS • Intro to AI • My research includes • Intelligent web agents • Fundamentals of CS Home Page • Instructors: • Jim • Tom • Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other • Relations: instructor-of, members-of-project, department-of.

  23. Ontology Web KB instances

  24. Problem Assumptions • Class instance: one instance per web page • Multiple instances in one web page • Multiple linked/related web pages for one instance • Elvis problem • Relation R(A,B) is represented by: • Hyperlinks A→B or A→C→D→…→B • Inclusion in a particular context (“I teach Intro2cs”) • Statistical model of typical words

  25. To Learn • Recognizing class instances by classifying bodies of hypertext • Recognizing relation instances by classifying chains of hyperlinks • Extracting text fields

  26. Recognizing class instances by classifying bodies of hypertext • Statistical bag-of-words approach • Full Text • Hyperlinks • Title/Head • Learning first order rules • Combine the previous 4 methods

  27. Statistical bag-of-words approach • Context-less classification • Given a set of classes C = {c1, c2, …, cN} • Given a document consisting of n ≤ 2000 words {w1, w2, …, wn} • c* = argmax_c Pr(c | w1, …, wn)
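One common way to evaluate this argmax is a naive Bayes model that treats the words as independent given the class. The sketch below takes that route as an assumption; the slides do not give the exact estimator, and the add-one smoothing, the function names and the toy training data are all hypothetical.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (class_label, list_of_words) training pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, words in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, model):
    """Return argmax_c of log Pr(c) + sum_i log Pr(w_i | c)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())

    def log_posterior(label):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for word in words:
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total + len(vocab)))
        return score

    return max(class_counts, key=log_posterior)

# Hypothetical two-document training set.
model = train([
    ("course", ["syllabus", "homework", "instructor"]),
    ("faculty", ["professor", "research", "publications"]),
])
predicted = classify(["homework", "instructor"], model)
```

Working with log probabilities keeps the product over up to 2000 words from underflowing.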

  28. (Table: actual vs. predicted classes)

  29. Statistical bag-of-words approach: Pr(wi|c) log (Pr(wi|c)/Pr(wi|~c))
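The expression on this slide scores how strongly a word indicates class c compared with the other classes. A minimal sketch, assuming the probabilities are estimated from word counts with add-one smoothing (the estimator, the function name and the word lists are assumptions):

```python
import math
from collections import Counter

def word_scores(class_words, other_words):
    """Score each word by Pr(w|c) * log(Pr(w|c) / Pr(w|~c)).

    class_words: all word occurrences in pages of class c.
    other_words: all word occurrences in pages of the other classes.
    """
    in_class = Counter(class_words)
    out_class = Counter(other_words)
    vocab = set(in_class) | set(out_class)
    n_in = len(class_words) + len(vocab)    # denominators with add-one smoothing
    n_out = len(other_words) + len(vocab)
    scores = {}
    for word in vocab:
        p_in = (in_class[word] + 1) / n_in
        p_out = (out_class[word] + 1) / n_out
        scores[word] = p_in * math.log(p_in / p_out)
    return scores

# Hypothetical word occurrences for a "course" class versus everything else.
scores = word_scores(
    ["homework", "homework", "syllabus", "the"],
    ["research", "the", "the", "grant"],
)
top_word = max(scores, key=scores.get)
```

Words that are frequent in the class and rare elsewhere get the highest scores, while words that are more typical of other classes come out negative.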

  30. Accuracy/Coverage tradeoff for full-text classifiers

  31. Accuracy/coverage tradeoff for hyperlinks classifiers

  32. Accuracy/Coverage for title heading classifiers

  33. Learning first order rules • The previous method doesn’t consider relations between pages • Example: a page is a course home page if it contains the words textbook and TA and points to a page containing the word assignment. • FOIL is a learning system that constructs Horn clause programs from examples

  34. Relations • Has_word(Page). Stemmed words: computer = computing = comput. 200 occurrences but less than 30% in other class pages • Link_to(page,page) • m-estimate accuracy = (nc + m*p)/(n + m) • nc: # of instances correctly classified by the rule • n: total # of instances classified by the rule • m = 2 • p: proportion of instances in the training set that belong to that class • Predict each class with confidence = best_match / total_#_of_matches
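The m-estimate on this slide is a one-line computation; the sketch below only restates the formula in Python, with hypothetical argument names and example numbers.

```python
def m_estimate(n_correct, n_covered, prior, m=2):
    """m-estimate of rule accuracy: (nc + m*p) / (n + m).

    n_correct: instances the rule classifies correctly (nc).
    n_covered: all instances the rule classifies (n).
    prior: proportion of training instances in the target class (p).
    """
    return (n_correct + m * prior) / (n_covered + m)

# A rule covering 10 pages, 9 of them correctly, for a class with prior 0.2.
accuracy = m_estimate(9, 10, 0.2)
```

With no covered instances the estimate falls back to the class prior, and as the number of covered instances grows it approaches the raw accuracy nc/n, so rules backed by few examples are pulled toward the prior.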

  35. New learned rules • student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)). • faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B). • course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B, 1)),has_assign(B).

  36. Accuracy/coverage for FOIL page classifiers

  37. Boosting • Which classifier predicts best depends on the class • Combine the predictions using the confidence measure

  38. Accuracy/coverage tradeoff for combined classifiers (2000-word vocabulary)

  39. Boosting • Disappointing: the combination is not uniformly better • Possible solutions • Using reduced-size dictionaries (next) • Using other methods for combining predictions (voting instead of best_match / total_#_of_matches)

  40. Accuracy/coverage tradeoff for combined classifiers (200-word vocabulary)

  41. Multi-Page segments • The group is the longest prefix (indicated in parentheses) • (@/{user,faculty,people,home,projects}/*)/*.{html,htm} • (@/{cs???,www/,*})/*.{html,htm} • (@/{cs???,www/,*})/ • … • A primary page is any page whose URL matches: • @/index.{html,htm} • @/home.{html,htm} • @/%1/%1.{html,htm} • … • If no page in the group matches one of these patterns, then the page with the highest score for any non-other class is a primary page. • Any non-primary page is tagged as Other
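The primary-page heuristic can be sketched with regular expressions. Several things here are assumptions: that `@` stands for the group prefix (so paths are given relative to it), that `%1/%1` means a directory and a file sharing the same name, and that a page classifier has already produced one best non-other class and score per page. Only the three patterns listed on the slide are covered; the elided ones are not guessed.

```python
import re

# The three listed patterns, written for paths relative to the group prefix.
PRIMARY_PATTERNS = [
    re.compile(r"index\.html?"),
    re.compile(r"home\.html?"),
    re.compile(r"([^/]+)/\1\.html?"),  # %1/%1: directory and file share a name
]

def is_primary(relative_path):
    """True if the path matches one of the primary-page patterns."""
    return any(p.fullmatch(relative_path) for p in PRIMARY_PATTERNS)

def tag_group(pages, scores):
    """Tag every page of one group; non-primary pages become "other".

    pages: relative paths within the group.
    scores: path -> (best non-other class, score), from a hypothetical
            page classifier.
    """
    primary = [p for p in pages if is_primary(p)]
    if not primary and pages:
        # Fallback: the page with the highest non-other score is primary.
        primary = [max(pages, key=lambda p: scores[p][1])]
    return {p: (scores[p][0] if p in primary else "other") for p in pages}

# Hypothetical group with an index page and a secondary page.
tags = tag_group(
    ["index.html", "cv.html"],
    {"index.html": ("faculty", 0.8), "cv.html": ("student", 0.6)},
)
```

In this example `cv.html` ends up as "other" even though the classifier alone would have called it a student page, which is the effect the grouping heuristic is meant to have.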

  42. Accuracy/coverage tradeoff for the full text after URL grouping heuristics

  43. Conclusion – Recognizing Classes • Hypertext provides redundant information • We can classify using several methods • Full text • Heading/title • Hyperlinks • Text in neighboring pages • + Grouping pages • No single method is good enough. • Combining the predictions of several classifiers gives a better result.

  44. Learning to Recognize Relation Instances • Assume: relations are represented by hyperlinks • Given the following background relations • Class(Page) • Link-to(Hyperlink,P1,P2) • Has-word(H): the word is part of the hyperlink • All-words-capitalized(H) • Has-alphanumeric-word(H), e.g. “I Teach CS2765” • Has-neighborhood-word(H), where neighborhood = paragraph

  45. … Learning to Recognize Relation Instances • Try to learn the following • Members_of_project(P1,P2) • Instructors_of_course(P1,P2) • Department_of_person(P1,P2)

  46. Learned relations • instructors_of(A,B) :- course(A), person(B), link_to(C,B,A). • Test Set: 133 Pos, 5 Neg • department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E). • Test Set: 371 Pos, 4 Neg • members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C). • Test Set: 18 Pos, 0 Neg
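To make the first learned rule concrete, the sketch below applies it to a toy set of classified pages and hyperlinks. It assumes link_to(C,B,A) means that hyperlink C leads from page B to page A, following the Link-to(Hyperlink,P1,P2) signature given earlier, and that any page classified as faculty, student or staff counts as a person. The data layout and names are hypothetical.

```python
# Classes treated as sub-classes of Person (an assumption based on slide 22).
PERSON_CLASSES = {"faculty", "student", "staff", "person"}

def instructors_of(classes, links):
    """Apply: instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).

    classes: dict page -> predicted class label.
    links: iterable of (hyperlink_id, source_page, target_page).
    Returns the set of (course_page, person_page) pairs the rule derives.
    """
    return {
        (target, source)
        for _, source, target in links
        if classes.get(target) == "course"
        and classes.get(source) in PERSON_CLASSES
    }

# Hypothetical classified pages and hyperlinks.
classes = {
    "jim.html": "faculty",
    "cs101.html": "course",
    "dept.html": "department",
}
links = [
    ("h1", "jim.html", "cs101.html"),
    ("h2", "dept.html", "cs101.html"),
]
pairs = instructors_of(classes, links)
```

The rule fires only for the link from Jim's page; the department page also links to the course, but it is not a person page, so no pair is derived from it.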

  47. Accuracy/Coverage tradeoff for learned relation rules

  48. Learning to Extract Text Fields • Sometimes we want a small fragment of text (e.g. a name such as Jon or Peter), not the whole web page or its class • Make me hotel and flight arrangements for the upcoming Linux conference

  49. Predefined predicates • Let F= w1, w2, … wj be a fragment of text • length({<,>,=…}, N). • some(Var, Path, Feat, Value): some (A,[next_token, next_token], numeric, true) • position(Var, From, Relop, N): • relpos(Var1, Var2, Relop, N):
