Web classification

Ontology and Taxonomy

References

  • Using Ontologies to Discover Domain-Level Web Usage Profiles {hdai,mobasher}@cs.depaul.edu

  • Learning to Construct Knowledge Bases from the World Wide Web. {M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery} Carnegie Mellon University, Pittsburgh, USA; {D. Freitag, A. McCallum} Just Research, Pittsburgh, USA


  • Ontology

    • An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.

  • Taxonomy

    • A classification of organisms into groups based on similarities of structure or origin.


  • Capture and model behavioral patterns and profiles of users interacting with a web site.

  • Why?

    • Collaborative filtering

    • Personalization systems

    • Improve the organization and structure of the site

    • Provide dynamic recommendations (www.recommend-me.com)

Algorithm 0 (by Rafa’s brother: Gabriel)

  • Recommend pages viewed by other users with similar page ranks.

  • Problems

    • New item problem

    • Doesn’t consider content similarity or item-to-item relationships.

User session

  • User session s: <w(p1,s), w(p2,s), …, w(pn,s)>

    • w(pi,s) is the weight associated with page pi in session s

  • Session clusters {cl1, cl2, …}

    • Each cli is a subset of the set of sessions

  • Usage profile prcl = {<p, weight(p,prcl)> : weight(p,prcl) ≥ μ}

    • weight(p,prcl) = (1/|cl|) · ∑s∈cl w(p,s)
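The weight formula above translates directly into code. A minimal sketch, assuming each session is represented as a dict mapping page → w(p, s); the names `usage_profile`, `cl`, and `mu` are illustrative, not from the paper:

```python
# Sketch of the usage-profile formula: weight(p, pr_cl) is the mean of
# w(p, s) over the sessions s in cluster cl; pages below the threshold
# mu are dropped from the profile.

def usage_profile(cluster, mu):
    pages = set()
    for session in cluster:
        pages.update(session)
    profile = {}
    for p in pages:
        w = sum(s.get(p, 0.0) for s in cluster) / len(cluster)
        if w >= mu:           # keep only significant pages
            profile[p] = w
    return profile

# Two sessions: index.html is popular, movies.html is not.
cl = [{"index.html": 1.0, "movies.html": 0.5},
      {"index.html": 0.8}]
print(usage_profile(cl, mu=0.5))  # movies.html (weight 0.25) is filtered out
```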

Algorithm 1

  • For every session, create a vector containing the viewed pages and a weight for each page.

  • Each vector represents a point in an N-dimensional space, so we can identify clusters.

  • For a new session, check which cluster its vector/point belongs to, and recommend the high-scoring pages of that cluster.

  • Problems

    • New item problem

    • Doesn’t consider content similarity or item-to-item relationships
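The pipeline of Algorithm 1 can be sketched with a simple nearest-centroid assignment; the clusters, page vectors, and the 0.5 recommendation cutoff below are all invented for illustration:

```python
# Each session vector is a point in N-dimensional page-weight space.
# A new session is assigned to the nearest cluster centroid, and the
# cluster's high-weight pages are recommended.
import math

def centroid(points):
    n = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(n)]

def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# Two hand-made clusters over three pages (p0, p1, p2).
clusters = [[[1.0, 0.9, 0.0], [0.8, 1.0, 0.1]],
            [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]]
cents = [centroid(c) for c in clusters]

new_session = [0.0, 0.2, 0.8]           # resembles the second cluster
best = nearest(new_session, cents)
recs = [i for i, w in enumerate(cents[best]) if w >= 0.5]
print(best, recs)  # cluster 1, recommend page p2
```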

Algorithm 2: keyword search

  • Solves new item problem.

  • Not good enough:

    • A page can contain information about more than one object.

    • Essential data may only be pointed to by the page, not included in it.

    • What exactly counts as a keyword?

  • Solution

    • Domain ontologies for objects

Domain Ontologies

  • Domain-Level Aggregate Profile: a set of pseudo-objects, each characterizing objects of different types that occur commonly across the user sessions.

  • Class - C

  • Attributes – a: <Da, Ta, ≤a, Ψa>

    • Ta: type of the attribute

    • Da: domain of the values for a (red, blue, …)

    • ≤a: ordering relation on Da

    • Ψa: combination function

Example – movie web site

  • Classes:

    • movies, actors, directors, etc

  • Attributes:

    • Movies: title, genre, starring actors

    • Actors: name, filmography, gender, nationality

  • Functions:

    • Ψactor(<{S,0.7; T,0.2; U,0.1}, 1>, <{S,0.5; T,0.5}, 0.7>) = ∑i(wi·wo) / ∑i(wi)

    • Ψyear({1991},{1994}) = {1991,1994}

    • Ψis_a({person,student},{person,TA})= {person}
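The three combination functions above can be sketched as follows. The per-actor weighted-average reading of Ψactor is an assumption: the slide's formula ∑i(wi·wo)/∑i(wi) leaves the indexing implicit, so treat `psi_actor` as one plausible interpretation:

```python
# Psi_year: multi-valued attribute -> union of observed values.
def psi_year(*value_sets):
    out = set()
    for v in value_sets:
        out |= set(v)
    return out

# Psi_is_a: concept hierarchy -> keep concepts common to all objects.
def psi_is_a(*concept_sets):
    return set.intersection(*map(set, concept_sets))

# Psi_actor (assumed reading): per-actor weighted average, where each
# argument pairs an {actor: weight} dict with the object's weight w_o.
def psi_actor(weighted):
    num, den = {}, 0.0
    for actor_weights, w_obj in weighted:
        den += w_obj
        for actor, w in actor_weights.items():
            num[actor] = num.get(actor, 0.0) + w * w_obj
    return {a: v / den for a, v in num.items()}

print(psi_year({1991}, {1994}))                           # {1991, 1994}
print(psi_is_a({"person", "student"}, {"person", "TA"}))  # {'person'}
```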

Creating an Aggregated Representation of a usage profile

  • pr = {<o1, wo1>, …, <on, won>}

    • oi is an object; woi is its significance in the profile pr

  • Assume all the objects are instances of the same class

  • Create a new virtual object o’ with attributes ai’ = Ψi(o1, …, on)

Algorithm 2

  • Do not just recommend items viewed by other users; recommend items similar to the class representative.

  • Advantages:

    • Higher accuracy

    • Needs fewer examples

    • No new-item problem

    • Also considers content similarity (item-to-item relationships).

Final Algorithm

  • Given a web site:

    • Classify its contents into classes and attributes.

    • Merge the objects of each user profile and create a pseudo-object.

    • Recommend according to this pseudo-object.


  • A per-topic solution.

  • Found patterns can be incomplete.

  • User patterns may change with time (for movies): the “I loved ET” problem.

  • Needs cookies and other methods to identify users.

  • How is the weight calculated? It can require many examples: the “I loved American Beauty” problem.

  • How do we automatically group the web pages?

Constructing Knowledge Base from WWW

  • Goal:

    • Automatically create a computer-understandable knowledge base from the web.

  • Why?

    • To use in the previously described work, and similar applications

    • Find all universities that offer Java Programming courses

    • Make me hotel and flight arrangements for the upcoming Linux conference

Constructing Knowledge Base from WWW (cont.)

  • How?

    • Use machine learning to create information extraction methods for each of the desired types of knowledge

    • Apply them to extract symbolic, probabilistic statements directly from the web: Student-of(Rafa, sdbi) = 99%

  • Method used

    • Provide an initial ontology (classes and relations)

    • Training examples – 3 of the 4 university sites (8,000 web pages, 1,400 web-page pairs)

Example of web pages

  • Jim’s Home Page

  • I teach several courses:

    • Fundamentals of CS

    • Intro to AI

  • My research includes

    • Intelligent web agents

  • Fundamentals of CS Home Page

  • Instructors:

    • Jim

    • Tom

Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other

Relations: instructor-of, members-of-project, department-of.


Web KB instances

Problem Assumption

  • Class instances: assume one instance per web page

    • But multiple instances can appear in one web page

    • But one instance can span multiple linked/related web pages

    • The Elvis problem

  • A relation R(A,B) is represented by:

    • Hyperlinks A→B or A→C→D→…→B

    • Inclusion in a particular context (“I teach Intro2cs”)

    • A statistical model of typical words

To Learn

  • Recognizing class instances by classifying bodies of hypertext

  • Recognizing relation instances by classifying chains of hyperlinks

  • Extracting text fields

Recognizing class instances by classifying bodies of hypertext

  • Statistical bag-of-words approach

    • Full Text

    • Hyperlinks

    • Title/Head

  • Learning first-order rules

  • Combine the previous four methods

Statistical bag-of-words approach: hypertext

  • Context-less classification

  • Given a set of classes C = {c1, c2, …, cN}

  • Given a document consisting of n ≤ 2000 words {w1, w2, …, wn}

  • c* = argmaxc Pr(c | w1, …, wn)
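The argmax above is the standard naive-Bayes bag-of-words classifier. A minimal sketch with add-one smoothing; the toy training data and function names are invented:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (class_label, list_of_words)."""
    counts, totals, prior, vocab = {}, Counter(), Counter(), set()
    for c, words in docs:
        prior[c] += 1
        counts.setdefault(c, Counter()).update(words)
        totals[c] += len(words)
        vocab.update(words)
    return counts, totals, prior, vocab

def classify(words, model):
    """c* = argmax_c Pr(c) * prod_i Pr(w_i | c), computed in log space."""
    counts, totals, prior, vocab = model
    n_docs = sum(prior.values())
    def log_post(c):
        lp = math.log(prior[c] / n_docs)
        for w in words:
            # Add-one (Laplace) smoothing over the vocabulary.
            lp += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        return lp
    return max(counts, key=log_post)

docs = [("course", ["syllabus", "homework", "exam"]),
        ("faculty", ["professor", "research", "publications"])]
model = train(docs)
print(classify(["homework", "exam"], model))  # course
```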

[Figure: results on actual hypertext]


Statistical bag-of-words approach: hypertext

[Figure: for each class, the words wi ranked by Pr(wi|c) · log(Pr(wi|c) / Pr(wi|¬c))]

Learning first-order rules: hypertext

  • The previous method doesn’t consider relations between pages.

  • Example: a page is a course home page if it contains the words “textbook” and “TA” and points to a page containing the word “assignment”.

  • FOIL is a learning system that constructs Horn-clause programs from examples.

Relations hypertext

  • has_word(Page). Words are stemmed: computer = computing = comput. A word is kept if it has 200 occurrences but appears in less than 30% of the pages of other classes

  • link_to(Page, Page)

  • m-estimate accuracy = (nc + m·p) / (n + m)

    • nc: number of instances correctly classified by the rule

    • n: total number of instances classified by the rule

    • m = 2

    • p: proportion of instances in the training set that belong to that class

  • Predict each class with confidence = best_match / total_#_of_matches
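The m-estimate above is a one-liner; this sketch just makes the smoothing effect concrete (the example numbers are invented):

```python
# m-estimate accuracy: (n_c + m*p) / (n + m). With m = 2, a rule
# backed by few examples is pulled toward the class prior p instead
# of its raw accuracy n_c / n.

def m_estimate(n_correct, n_total, p, m=2):
    return (n_correct + m * p) / (n_total + m)

# A rule matching 9 of 10 instances, class prior 0.3:
print(round(m_estimate(9, 10, p=0.3), 3))  # 0.8, below the raw 0.9
# A rule with no coverage falls back to the prior:
print(m_estimate(0, 0, p=0.3))
```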

New learned rules hypertext

  • student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).

  • faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).

  • course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B, 1)),has_assign(B).
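Such Horn clauses are straightforward to check against a fact base. A hedged sketch of the student(A) rule; the page/link representation (word sets plus hyperlink pairs) is an assumption, not the paper's data structure:

```python
# pages: page -> set of (stemmed) words it contains.
# links: list of (source, target) hyperlink pairs.

def student(a, pages, links):
    """student(A) :- not(has_data(A)), not(has_comment(A)),
       link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B))."""
    if "data" in pages[a] or "comment" in pages[a]:
        return False
    return any("jame" in pages[b] and "paul" in pages[b]
               and "mail" not in pages[b]
               for b, target in links if target == a)

pages = {"p1": {"homework"}, "p2": {"jame", "paul"}}
links = [("p2", "p1")]              # p2 links to p1
print(student("p1", pages, links))  # True: p2 satisfies the rule body
```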

Boosting hypertext

  • Which method predicts best depends on the class

    • Combine the predictions using the confidence measure

Boosting hypertext

  • Disappointing: the combined method is not uniformly better

  • Possible solutions

    • Using reduced size dictionaries (next)

    • Using other methods for combining predictions (voting instead of best_match / total_#_of_matches)

Multi-Page segments hypertext

  • The group is the longest prefix (indicated in parentheses)

    • (@/{user,faculty,people,home,projects}/*)/*.{html,htm}

    • (@/{cs???,www/,*})/*.{html,htm}

    • (@/{cs???,www/,*})/

  • A primary page is any page whose URL matches:

    • @/index.{html,htm}

    • @/home.{html,htm}

    • @/%1/%1.{html,htm}

  • If no page in the group matches one of these patterns, then the page with the highest score for any non-other class is a primary page.

  • Any non-primary page is tagged as Other
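The primary-page patterns translate naturally into regular expressions; a sketch assuming `@` is the group's URL prefix and %1/%1 means "directory name equals file name" (function and variable names are made up):

```python
import re

def is_primary(url, prefix):
    """Does url match @/index.{html,htm}, @/home.{html,htm},
    or @/%1/%1.{html,htm} under the given group prefix?"""
    if not url.startswith(prefix):
        return False
    path = url[len(prefix):]
    if re.fullmatch(r"/(index|home)\.html?", path):
        return True
    # %1/%1: directory name equal to file name, via a backreference.
    return re.fullmatch(r"/([^/]+)/\1\.html?", path) is not None

group = "http://cs.cmu.edu/~jim"
print(is_primary(group + "/index.html", group))       # True
print(is_primary(group + "/jim/jim.html", group))     # True (%1/%1)
print(is_primary(group + "/courses/ai.html", group))  # False
```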

Accuracy/coverage tradeoff for the full text after URL grouping heuristics (hypertext)

Conclusion (hypertext): Recognizing Classes

  • Hypertext provides redundant information

    • We can classify using several methods

      • Full text

      • Heading/title

      • Hyperlinks

      • Text in neighboring pages

      • + Grouping pages

    • No method alone is good enough.

    • Combining the predictions of several classification methods gives better results.

Learning to Recognize Relation Instances hypertext

  • Assume: relations are represented by hyperlinks

  • Given the following background relations

    • Class (Page)

    • Link-to(Hyperlink,P1,P2)

    • Has-word(H) – the word is part of the hyperlink

    • All-words-capitalized(H)

    • Has-alphanumeric-word(H) – e.g., “I teach CS2765”

    • Has-neighborhood-word(H) – neighborhood = paragraph

Learning to Recognize Relation Instances (cont.)

  • Try to learn the following relations:

    • Members-of-project(P1,P2)

    • Instructors_of_course(P1,P2)

    • Department_of_person(P1,P2)

Learned relations hypertext

  • instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).

    • Test set: 133 Pos, 5 Neg

  • department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E).

    • Test set: 371 Pos, 4 Neg

  • members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C).

    • Test set: 18 Pos, 0 Neg

Learning to Extract Text Fields hypertext

  • Sometimes we want a small fragment of text (like Jon, Peter, etc.), not a whole web page or class

    • Make me hotel and flight arrangements for the upcoming Linux conference

Predefined predicates hypertext

  • Let F= w1, w2, … wj be a fragment of text

    • length({<,>,=…}, N).

    • some(Var, Path, Feat, Value): e.g. some(A, [next_token, next_token], numeric, true)

    • position(Var, From, Relop, N):

    • relpos(Var1, Var2, Relop, N):

A wrong example (hypertext)

Last-Modified: Wednesday, 26-Jun-96 01:37:46 GMT

<img src="ftp://ftp.cs.cornell.edu/pub/brd/images/brd.gif">


Bruce Randall Donald<br>

Associate Professor<br>

  • ownername(Fragment) :-

    • some(A, [prev_token], word, "gmt"),

    • some(A, [ ], in_title, true),

    • some(A, [ ], word, unknown),

    • some(A, [ ], quadrupletonp, false),

    • length(<, 3)
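A hedged sketch of how such predicates might be evaluated over a candidate fragment F = w1…wj. The token/feature-table representation below is invented for illustration; only the predicate semantics follow the slide:

```python
def length_lt(fragment, n):
    """length(<, N): the fragment has fewer than n tokens."""
    return len(fragment) < n

def some(fragment, path, feature, value, features):
    """some(Var, Path, Feat, Value): does some token of the fragment,
    after following Path (e.g. [prev_token]), have Feat == Value?"""
    for tok in fragment:
        cur = tok
        for step in path:
            cur = features[cur].get(step)
            if cur is None:
                break
        else:
            if features[cur].get(feature) == value:
                return True
    return False

# Toy feature table for "... GMT ... Bruce Randall ..."
features = {
    "Bruce": {"prev_token": "gmt", "in_title": True, "word": "bruce"},
    "gmt":   {"word": "gmt"},
}
frag = ["Bruce"]
ok = (length_lt(frag, 3)
      and some(frag, ["prev_token"], "word", "gmt", features)
      and some(frag, [], "in_title", True, features))
print(ok)  # True: the fragment satisfies these rule conditions
```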

Conclusions hypertext

  • Used machine learning algorithms to create information extraction methods for each desired type of knowledge.

  • WebKB achieves 70% accuracy at 30% coverage.

  • Bag-of-words (hyperlinks, web pages, and full text) and first-order learning can be combined to boost confidence

  • First-order learning can look outward from the page and consider its neighbors

Problems hypertext

  • Not as accurate as we want

    • You can get more accuracy at the cost of coverage

    • Use linguistic features (verbs)

    • Add new methods to the booster (predict the department of a professor based on the departments of his student advisees)

  • A per-topic, per-language, per-… method. Needs hand-made labeling to learn.

    • Learners with high accuracy can be used to teach learners with low accuracy.