Contents

  • Introduction

  • Knowledge discovery from text & links

  • Knowledge discovery from usage data

  • Important open issues


WWW: the new face of the Net

Once upon a time, the Internet was a forum for exchanging information. Then…

…came the Web.

The Web introduced new capabilities …

…and attracted many more people …

…increasing commercial interest …

…and turning the Net into a real forum …


Information overload

…as more people started using it ...

…increasing the quantity of online information further...

…the quantity of information on the Web increased...

…attracting even more people ...

…and leading to the overload of information for the users ...


WWW: an expanding forum

  • The Web is large and volatile:

    • More than 600.000.000 users online

    • More than 800.000 sign up every day

    • More than 9.000.000 Web sites

    • More than 300.000.000.000 pages online

    • Less than 50% of Web sites will be there next year

  • … leading to the abundance problem:

    “99% of online information is of no interest to 99% of the people”


Information access services

A number of services aim to help the user gain access to online information and products ...

… but can they really cope?


New requirements

  • Current indexing does not allow for wide coverage: Less than 5% of the Web covered by search engines.

  • What I want is hardly ever ranked high enough.

  • Product information in catalogues is often biased towards specific suppliers and outdated.

  • Product descriptions are incomplete and insufficient for comparison purposes.

  • ‘E’ in ‘E-commerce’ stands for ‘English’: More than 70% of the Web is English.

  • … and many more problems lead to the conclusion ...

    … that more intelligent solutions are needed!


A new generation of services

Some have already made their way to the market…

… many more are being developed as I speak …


Approaches to Web mining

  • Primary data (Web content):

    • Mainly text,

    • with some multimedia content (increasing)

    • and mark-up commands including hyperlinks.

    • Underlying databases (not directly accessible).

  • Knowledge discovery from text and links

    • Pattern discovery in unstructured textual data.

    • Pattern discovery in the Web graph / hypertext.


Approaches to Web mining

  • Secondary data (Web usage):

    • Access logs collected by servers,

    • potentially using cookies,

    • and a variety of navigational information collected by Web clients (mainly JavaScript agents).

  • Knowledge Discovery from usage data

    • Discovery of interesting usage patterns, mainly from server logs.

    • Web personalization & Web intelligence.


Contents

  • Introduction

  • Knowledge discovery from text & links

    • Introduction

    • Information filtering and retrieval

    • Ontology learning

  • Knowledge discovery from usage data

  • Important open issues


Information access

  • Goals:

    • Organize documents into categories.

    • Assign new documents to the categories.

    • Retrieve information that matches a user query.

  • Dominating statistical idea:

    TF-IDF = term frequency × inverse document frequency

  • Problems on the Web:

    • Huge scale and high volatility demand automation.
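
As an illustration of the TF-IDF idea above, here is a minimal sketch. It assumes one common variant (relative term frequency multiplied by the natural log of the inverse document fraction); the slides do not say which variant is meant, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus (assumed for illustration only).
docs = [
    "cheap hotel in torino".split(),
    "hotel booking and hotel offers".split(),
    "football news and results".split(),
]
weights = tf_idf(docs)
```

In this toy corpus "torino" occurs in one document and "hotel" in two, so "torino" receives the higher weight in the first document.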


Text mining

  • Knowledge (pattern) discovery in textual data.

  • Clarifying common misconceptions:

    • Text mining is NOT about assigning documents to thematic categories, but about learning document classifiers.

    • Text mining is NOT about extracting information from text, but about learning information extraction patterns.

  • Difficulty: unstructured format of textual data.


Approaches to text mining

Combination of language engineering (LE), machine learning (ML) and statistical methods.

[Figure: diagram labelled LE and ML-Stats]


Hyperlink information is useful

  • Information access can be improved by identifying: authoritative pages (authorities) and resource index pages (hubs).

  • Linked pages often contain complementary information (e.g. product offers).

  • Thematically related pages are often linked, either directly or indirectly.


Document category modelling

  • Input: training documents (pre-classified).

  • Pre-processing:

    • Stopword removal (and, the, etc.)

    • Stemming (‘played’ → ‘play’)

    • Bag-of-words coding

  • Dimensionality reduction: statistical selection/combination of characteristic terms (MI, PCA).

  • Machine learning: supervised classifier learning.

  • Output: category models (classifiers).


Document category modelling

  • Example: Filtering spam email.

  • Task: classify incoming email as spam or legitimate (2 document categories).

  • Simple blacklist and keyword-based methods have failed.

  • More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modeling).


Document category modelling

  • Step 1 (linguistic pre-processing): Tokenization, removal of stopwords, stemming/lemmatization.

  • Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3).

  • Step 3 (feature selection): information gain evaluation.

  • Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency.
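
The steps above can be sketched as a small multinomial naive Bayes filter. This is a simplified illustration, not the method of any particular system: it covers tokenization, stopword removal, bag-of-words counts and Bayesian modelling with add-one smoothing, and it omits stemming and information-gain feature selection. The stopword list, class name and training messages are all assumptions.

```python
import math
from collections import Counter

STOPWORDS = {"and", "the", "a", "to", "is", "for", "at"}  # illustrative only

def tokenize(text):
    """Step 1 (simplified): lowercase, split on whitespace, drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

class NaiveBayes:
    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of word frequencies
        self.doc_counts = Counter(labels)  # label -> number of training documents
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihoods with add-one (Laplace) smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in tokenize(text):
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training messages.
texts = ["win money now", "cheap money offer now",
         "meeting agenda for monday", "project meeting notes"]
labels = ["spam", "spam", "legitimate", "legitimate"]
model = NaiveBayes().fit(texts, labels)
```

With this toy data, `model.predict("cheap money")` returns `"spam"` and `model.predict("monday meeting notes")` returns `"legitimate"`.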


Link structure analysis

  • Improve information retrieval by scoring Web pages according to their importance in the Web or a thematic sub-domain of it.

  • Nodes with large fan-in (authorities) provide high quality information.

  • Nodes with large fan-out (hubs) are good starting points.


Link structure analysis

  • The HITS algorithm [Kleinberg, Journal of the ACM, 1999]:

    • Given a set of Web pages, e.g. as generated by a query,

    • expand the base set by including pages that are linked to by the ones in the initial set or link to them,

    • assign a hub and an authority weight to each page, initialised to 1,

    • update the authority weight a(p) of page p according to the hub weights h(q) of the pages q that link to it: $a(p) = \sum_{q \to p} h(q)$,

    • update the hub weight h(p) of page p according to the authority weights a(q) of the pages q that it links to: $h(p) = \sum_{p \to q} a(q)$,

    • repeat the weight update for a given number of times,

    • return a list of the pages ranked by their weights.
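
A minimal sketch of the iteration described above, on a hypothetical four-page link graph. The normalisation step after each round is not mentioned on the slide; it is added here as an assumption so that the weights stay bounded. The function name `hits` and the page names are illustrative.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages linking to p.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum the authority weights of the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise (assumption, not on the slide) to keep weights bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth

# Hypothetical link graph.
links = {"portal": ["a", "b"], "index": ["a", "b", "c"], "a": [], "b": ["a"]}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

Here page `a` has the largest fan-in and ends up as the top authority, while `index` has the largest fan-out and ends up as the top hub.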


Link structure analysis

  • Interesting issues:

    • Does the social network hypothesis hold, i.e., “authorities are highly cited”? This may be unrealistic in competitive commercial domains.

    • What happens if link structure adapts to the method, e.g. unrelated pages link to each other to increase their rating?

    • What about interesting new pages? How will people get to them?


Focused crawling & spidering

  • Crawling/Spidering: Automatic navigation through the Web by robots with the aim of indexing the Web.

  • Crawling v. Spidering (subjective): inter-site v. intra-site navigation.

  • Focused crawling/spidering: Efficient, thematic indexing of relevant Web pages, e.g. maintenance of a thematic portal.

  • Underlying assumption similar to HITS: thematically similar pages are linked.


Focused crawling

  • Focused crawling [Chakrabarti et al., WWW 1999]:

    • Given an initial set of Web pages about a topic, e.g. as found in a Web directory,

    • use document category modelling to build a topic classifier,

    • extract the hyperlinks within the initial set of pages and add them to a queue of pages to be visited,

    • retrieve pages from the queue,

    • use the classifier to assess the relevance of retrieved pages,

    • use a variant of HITS to assign a hub score to pages and the hyperlinks in the queue,

    • re-sort the links in the queue according to their hub score,

    • continue the retrieval of new pages, periodically updating the score of hyperlinks in the queue.
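
The queue-driven loop above can be sketched as follows. This is a heavily simplified illustration and not the algorithm of Chakrabarti et al.: a keyword ratio stands in for the learned topic classifier, and each link simply inherits the relevance of the page it was found on instead of a HITS-style hub score. The in-memory `WEB` dictionary, `fetch`, `relevance` and all URLs are hypothetical.

```python
import heapq

def focused_crawl(seeds, fetch, relevance, max_pages=10, threshold=0.3):
    """fetch(url) -> (text, outgoing_links); relevance(text) -> score in [0, 1]."""
    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-1.0, url) for url in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    relevant, visited = [], 0
    while queue and visited < max_pages:
        _, url = heapq.heappop(queue)
        text, out_links = fetch(url)
        visited += 1
        score = relevance(text)
        if score >= threshold:
            relevant.append(url)
        for link in out_links:
            if link not in seen:
                seen.add(link)
                # Simplification: a link inherits the score of its source page.
                heapq.heappush(queue, (-score, link))
    return relevant

# Toy "Web" held in memory (assumed for illustration only).
WEB = {
    "dir/hotels": ("hotel listings torino", ["h1", "news"]),
    "h1": ("hotel rooms and hotel prices", ["h2"]),
    "news": ("football results", ["sports"]),
    "h2": ("hotel offers", []),
    "sports": ("more football", []),
}

def fetch(url):
    return WEB[url]

def relevance(text):
    words = text.split()
    return words.count("hotel") / len(words)

found = focused_crawl(["dir/hotels"], fetch, relevance, max_pages=4)
```

With a budget of four page retrievals, the crawler follows the hotel pages first and never reaches the `sports` page.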


Focused crawling & spidering

  • Domain-specific spidering:

    • Goal: retrieve interesting pages, without traversing the whole site.

    • Differences from crawling:

      • The site is much more restricted in size and thematic diversity than the whole of the Web.

      • Social network analysis is less relevant within a site (no hubs and authorities).

    • Requirement: link scoring using local features, e.g. the anchor text and the textual context.


Information extraction

  • Goals:

    • Identify interesting “events” in unstructured text.

    • Extract information related to the events and store it in structured templates.

  • Typical application:

    Information extraction from newsfeeds.

  • Difficulties:

    • Deals with unstructured or semi-structured text.

    • Identification of entities and relations.

    • Usually requires some understanding of the text.


A typical extraction system

  • Input: unstructured text and database schema (event templates).

  • Morphology: sentence and word separation, lemmatization (‘said’ → ‘say’), part-of-speech tagging, etc.

  • Syntax: shallow syntactic parsing.

  • Semantics: named-entity recognition, co-reference resolution, sense disambiguation.

  • Discourse: pattern matching.

  • Output: structured data (filled templates).


Wrappers/fact extraction

  • Simplified information extraction:

    • Extract interesting facts from Web documents.

    • Assumes structure in the documents (usually dynamically generated from databases).

    • Reduced demand for pre-processing and LE.

  • Typical application:

    Product comparison services (price, availability, …).

  • Difficulties:

    • Semi-structured data.

    • Different underlying database schemata and presentation formats.
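
To make the idea of a wrapper concrete, here is a hand-written one for a hypothetical, dynamically generated product listing. The markup, the pattern and the function name `extract_products` are assumptions made for illustration; a different site would need a different pattern, which is the difficulty noted above.

```python
import re

# Hypothetical markup of a dynamically generated product listing.
PAGE = """
<tr><td class="name">Laptop X1</td><td class="price">EUR 999.00</td></tr>
<tr><td class="name">Phone Z</td><td class="price">EUR 349.50</td></tr>
"""

# The wrapper: a pattern that relies on the regular structure of each row.
ROW = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>'
    r'<td class="price">EUR (?P<price>\d+\.\d{2})</td>'
)

def extract_products(html):
    """Return the facts found on the page as (name, price) records."""
    return [(m["name"], float(m["price"])) for m in ROW.finditer(html)]

products = extract_products(PAGE)
```

The extracted records can then be compared across suppliers, for example by price.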



Wrapper induction

  • Input: training documents (semi-structured) and database schema (interesting facts).

  • Data pre-processing: abstraction of mark-up structure (often omitted).

  • Machine learning: structural/sequence learning.

  • Output: fact extraction patterns (wrapper).


Ontology learning

  • Input: training documents (unclassified).

  • Pre-processing:

    • Stopword removal (and, the, etc.)

    • Stemming (‘played’ → ‘play’)

    • Syntactic/semantic analysis

    • Bag-of-words coding

  • Dimensionality reduction:

    • Hand-made thesauri (WordNet)

    • Term co-occurrence (LSI)

  • Machine learning: unsupervised learning (clustering and association discovery).

  • Output: ontologies.


Ontology learning

  • Hierarchical clustering is most suitable:

    • Agglomerative clustering

    • Conceptual clustering (COBWEB)

    • Model-based clustering (EM-type: MCLUST)

  • … but flat clustering can also be adapted:

    • K-means and its variants

    • Bayesian clustering (Autoclass)

    • Neural networks (self-organizing maps)

  • Association discovery (e.g. Apriori) for non-taxonomic relations.


Ontology learning

  • Example: Acquisition of an ontology for tourist information. [based on Maedche & Staab, ECAI 2000]


Ontology learning

  • Source data: Web pages of tourist sites.

  • Background knowledge: generic and domain-specific ontologies.

  • Target users: Tourist directories, large travel agencies.

  • Goals:

    • Identify types of page (e.g. room descriptions) and terms/entities inside pages (e.g. hotel addresses).

    • Identify taxonomic relations between concepts (e.g. accommodation – hotel).

    • Identify non-taxonomic relations between concepts (e.g. accommodation – area).


Ontology learning

  • Heavy linguistic pre-processing:

    • Syntactic analysis, e.g. verb subcategorization frames: verb(arrive) -> prep(at), dir_obj(Torino).

    • Semantic analysis, e.g. named entity recognition: ‘Via Lagrange’ -> Street name; e.g. special dependency relations: ‘Hotel Concord in Torino’.


Contents

  • Introduction

  • Knowledge discovery from text & links

  • Knowledge discovery from usage data

    • Personalization on the Web

    • Data collection and preparation issues

    • Personalized assistants

    • Discovering generic user models

    • Sequential pattern discovery

  • Knowledge discovery in action

  • Important open issues


Personalized information access

[Figure: sources, personalization server, receivers]


Personalization v. intelligence

  • Better service for the user:

    • Reduction of the information overload.

    • More accurate information retrieval and extraction.

    • Recommendation and guidance.


Personalized assistants

  • Personalized crawling [Lieberman et al., Comm. ACM, 2001]:

    • The system knows the user (log-in).

    • It uses heuristics to extract “important” terms from the Web pages that the user visits and add them to thematic profiles.

    • Each time the user views a page, the system:

      • searches the Web for related pages,

      • filters them according to the relevant thematic profile,

      • and constructs a list of recommended links for the user.

    • The Letizia version of the system searches the Web locally, following outgoing links from the current page.

    • The Powerscout version uses a search engine to explore the Web.


Personalized assistants

  • Adaptive Web interfaces [Jörding, UM 1999]:

    • The TELLIM system collects user information (e.g. the selection of a link) using a Java applet.

    • User information is used as training data in order to create generic models reflecting the users’ interest in different products.

    • The system creates short-term personal models using the generic models and the current user’s behavior.

    • Web pages containing more detailed information about these products, together with multimedia content and VRML presentations are created dynamically and presented to the users.


User modelling

  • Basic elements:

    • Constructing models that can be used to adapt the system to the user’s requirements.

    • Different types of requirement: interests (sports and finance news), knowledge level (novice - expert), preferences (no-frame GUI), etc.

    • Different types of model: personal – generic.

  • Knowledge discovery facilitates the acquisition of user models from data.


User models

  • User model (type A) [PERSONAL]:

    User x -> sports, stock market

  • User model (type B) [PERSONAL]:

    User x, Age 26, Male -> sports, stock market

  • User community [GENERIC]:

    Users {x,y,z} -> sports, stock market

  • User stereotype [GENERIC]:

    Users {x,y,z}, Age [20..30], Male -> sports, stock market


Generic user models

  • Stereotypes: Models that represent a type of user, associating personal characteristics with parameters of the system,

    e.g. Male users of age 20-30 are interested in sports and politics.

  • Communities: Models that represent a group of users with common preferences,

    e.g. Users that are interested in sports and politics.


Learning user models

[Figure: observation of the users interacting with the system produces user models (User 1 to User 5), which are grouped into user communities (Community 1, Community 2).]


Knowledge discovery process

  • Data collection: collection of usage data by the server and the client.

  • Data pre-processing: data cleaning, user identification, session identification.

  • Pattern discovery: construction of user models.

  • Knowledge post-processing: report generation, visualization, personalization module.


Pre-processing usage data

  • Cleaning, i.e. removing from the logs:

    • Log entries that correspond to error responses.

    • Trails of robots.

    • Pages that have not been requested explicitly by the user (mainly image files, loaded automatically). What to discard should be decided per domain.

  • User identification:

    • Identification by log-in.

    • Cookies and JavaScript.

    • Extended Log Format (browser and OS version).

    • Bookmark user-specific URL.

    • Various other heuristics.
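
A minimal sketch of the cleaning step, assuming the server writes Common or Combined Log Format lines. The suffix list and the robot heuristic are crude, illustrative choices (the slide itself notes that cleaning should be domain-specific), and the names `LOG_LINE` and `keep` are hypothetical.

```python
import re

# Common Log Format, with the optional referrer and user-agent fields
# of the Combined Log Format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")  # domain-specific choice
ROBOT_HINTS = ("bot", "crawler", "spider")          # crude heuristic

def keep(line):
    """Return True if the log entry should survive cleaning."""
    m = LOG_LINE.match(line)
    if m is None:
        return False
    if int(m["status"]) >= 400:                      # error responses
        return False
    if m["path"].lower().endswith(IMAGE_SUFFIXES):   # automatically loaded files
        return False
    agent = (m["agent"] or "").lower()
    if m["path"] == "/robots.txt" or any(h in agent for h in ROBOT_HINTS):
        return False                                 # trails of robots
    return True

PREFIX = '10.0.0.1 - - [10/Oct/2000:13:55:36 +0000] '
sample = [
    PREFIX + '"GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"',
    PREFIX + '"GET /logo.gif HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
    PREFIX + '"GET /missing.html HTTP/1.0" 404 0 "-" "Mozilla/4.0"',
    PREFIX + '"GET /index.html HTTP/1.0" 200 2326 "-" "ExampleBot/1.0"',
]
cleaned = [line for line in sample if keep(line)]
```

Only the first sample entry survives: the others are an image request, an error response and a robot visit.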


Pre-processing usage data

  • User session/Transaction identification in log files:

    • Time-based methods, e.g. 30 min silence interval. Problems with cache. Partial solutions: special HTTP headers, Java agents.

    • Context-based methods: e.g. separate pages into navigational and content and impose heuristics on the type of page that a user session may consist of.

    • User sessions can be subdivided into smaller transaction sequences, e.g. by identifying a “backward reference” in the sequence of requests.

  • Encoding of training data:

    • Bag-of-pages representation of sessions/transactions.

    • Transition-based representation of sessions/transactions.

    • Manually determined features of interest.
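
The time-based method and the bag-of-pages encoding can be sketched as below. The sketch assumes that users have already been identified and that timestamps are given in seconds; the function names and the sample requests are hypothetical, and caching problems are ignored.

```python
from collections import defaultdict

def sessionize(requests, timeout=30 * 60):
    """requests: iterable of (user, timestamp_in_seconds, page) tuples."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:   # silence interval exceeded
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

def bag_of_pages(session_pages, all_pages):
    """Binary bag-of-pages vector over a fixed page ordering."""
    visited = set(session_pages)
    return [1 if p in visited else 0 for p in all_pages]

# Hypothetical requests: (user, seconds since start of log, page).
requests = [
    ("u1", 0, "/home"), ("u1", 120, "/sports"),
    ("u1", 4000, "/finance"), ("u2", 50, "/home"),
]
sessions = sessionize(requests)
vector = bag_of_pages(sessions[0][1], ["/home", "/sports", "/finance"])
```

User `u1` is silent for more than 30 minutes before requesting `/finance`, so that request starts a new session.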


Collaborative filtering

  • Information filtering according to the choices of similar users.

  • Avoids semantic content analysis.

  • Cold-start problem with new users.

  • Approaches:

    • memory-based learning,

    • model-based clustering,

    • item-based recommendation.


Memory-based learning

  • Nearest-neighbour approach:

    • Construct a model for each user. Often use explicit user ratings for each item.

    • Index the user in the space of system parameters, e.g. item ratings.

    • For each new user,

      • index the user in the same space, and

      • find the k closest neighbours.

      • Simple metrics to measure the similarity between users, e.g. Pearson correlation.

    • Recommend the items that the new user has not seen and are popular among the neighbours.
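
A minimal sketch of the nearest-neighbour approach, using Pearson correlation over co-rated items. Summing the neighbours' ratings is used here as a simple proxy for "popular among the neighbours"; that choice, the function names and the ratings are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def recommend(new_user, ratings, k=2):
    """Unseen items, ranked by summed rating among the k nearest neighbours."""
    neighbours = sorted(
        ratings, key=lambda u: pearson(new_user, ratings[u]), reverse=True
    )[:k]
    scores = {}
    for u in neighbours:
        for item, rating in ratings[u].items():
            if item not in new_user:
                scores[item] = scores.get(item, 0) + rating
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical explicit ratings on a 1-5 scale.
ratings = {
    "x": {"sports": 5, "finance": 4, "politics": 1, "world": 2},
    "y": {"sports": 4, "finance": 5, "politics": 2, "travel": 5},
    "z": {"sports": 1, "finance": 2, "politics": 5, "arts": 4},
}
new_user = {"sports": 5, "finance": 5, "politics": 1}
suggestions = recommend(new_user, ratings)
```

Users `x` and `y` correlate positively with the new user and `z` negatively, so the suggestions come from `x` and `y` only.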


Model-based clustering

  • Clustering users into communities.

  • Methods used:

    • Conceptual clustering (COBWEB).

    • Graph-based clustering (Cluster mining).

    • Statistical clustering (Autoclass).

    • Neural Networks (Self-Organising Maps).

    • Model-based clustering (EM-type).

    • BIRCH.

  • Community models: cluster descriptions.


Model-based clustering

[Figure: graph with weighted edges; edge weights 0.9, 0.9, 0.8, 0.4, 0.1, 0.5]


Item-based recommendation

  • Focus on item usage in the profiles, instead of the users themselves.

  • Practically useful in e-commerce, e.g. cross-sell recommendations.

  • Simple modification to the clique-based clustering method: graph of items instead of graph of users.

  • Related to frequent itemset discovery in association rule mining.
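
The graph-of-items idea can be sketched as follows. This is a plain co-occurrence graph with a support threshold, closer to frequent item-pair discovery than to full clique mining, which is not implemented here. The edge weight (fraction of profiles that contain both items), the function names and the profiles are assumptions.

```python
from collections import Counter
from itertools import combinations

def item_graph(profiles, min_weight=0.5):
    """Edges between items, weighted by the fraction of profiles containing both."""
    pair_counts = Counter()
    for profile in profiles:
        pair_counts.update(combinations(sorted(set(profile)), 2))
    n = len(profiles)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_weight}

def related_items(item, graph):
    """Neighbours of an item in the graph, strongest edge first."""
    neighbours = {}
    for (a, b), w in graph.items():
        if item == a:
            neighbours[b] = w
        elif item == b:
            neighbours[a] = w
    return sorted(neighbours, key=neighbours.get, reverse=True)

# Hypothetical usage profiles (sets of visited news sections).
profiles = [
    {"sports", "politics"},
    {"sports", "politics", "finance"},
    {"sports", "finance"},
    {"world", "politics"},
]
graph = item_graph(profiles)
cross_sell = related_items("sports", graph)
```

A cross-sell style recommendation for a user viewing `sports` is then the list of its neighbours in the graph.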


Item-based recommendation

[Figure: graph of the items Politics, Sports, World and Finance; edge weights 0.9, 0.9, 0.8, 0.4, 0.1, 0.5]


Contents

  • Introduction

  • Knowledge discovery from text & links

  • Knowledge discovery from usage data

    • Personalization on the Web

    • Data collection and preparation issues

    • Personalized assistants

    • Discovering generic user models

    • Sequential pattern discovery

  • Knowledge discovery in action

  • Important open issues


Sequential pattern discovery

  • Identifying navigational patterns, rather than “bag-of-page” models.

  • Methods:

    • Clustering transitions between pages.

    • First-order Markov models.

    • Probabilistic grammar induction.

    • Association-rule sequence mining.

    • Path traversal through graphs.

  • Personal and community navigation models.
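
Of the methods listed, the first-order Markov model is the simplest to sketch: transition probabilities are estimated from consecutive page pairs in the sessions. The function names and the sessions below are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def fit_markov(sessions):
    """Estimate P(next page | current page) from observed transitions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for page, following in counts.items()
    }

def predict_next(model, page):
    """Most probable next page, or None if no transition from it was seen."""
    following = model.get(page)
    return max(following, key=following.get) if following else None

# Hypothetical navigation sessions.
sessions = [
    ["home", "sports", "politics"],
    ["home", "sports", "finance"],
    ["home", "finance"],
    ["sports", "politics"],
]
model = fit_markov(sessions)
next_from_home = predict_next(model, "home")
```

Fitting the model on one user's sessions gives a personal navigation model; fitting it on the sessions of a whole community gives a community model.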


Sequential pattern discovery

Clique-based transition clustering: a small modification of the model-based item clustering approach, in which an item is a transition between pages.

[Figure: graph of the transitions Sports->Politics, Finance->Politics, Sports->Finance and Finance->Sports; edge weights 0.9, 0.9, 0.8, 0.4, 0.1, 0.5]


References

J. Borges and M. Levene, Data mining of user navigation patterns. Proceedings of Workshop on Web Usage Analysis and User Profiling (WEBKDD), in conjunction with ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, CA., pp. 31-36.

S. Chakrabarti, M. H. van den Berg, B. E. Dom, Focused Crawling: a new approach to topic-specific Web resource discovery, Proceedings of the Eighth International World Wide Web Conference (WWW), Toronto, Canada, May 1999.

T. Jörding, A Temporary User Modeling Approach for Adaptive Shopping on the Web, In Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, UM'99, Banff, Canada, 1999.

J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, v. 46, 1999.

H. Lieberman, C. Fry and L. Weitzman. Exploring the Web with Reconnaissance Agents, Communications of the ACM, August 2001, pp. 69-75.

A. Maedche, S. Staab. Discovering Conceptual Relations from Text. In: W.Horn (ed.): ECAI 2000. Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), Berlin, August 21-25, 2000.

A. McCallum, D. Freitag and F. Pereira, Maximum Entropy Markov Models for Information Extraction and Segmentation, Proceedings of the International Conference on Machine Learning (ICML), Stanford, CA, 2000, pp. 591-598.

I. Muslea, S. Minton and C. Knoblock, STALKER: Learning extraction rules for semistructured Web-based information sources. Proceedings of the National Conference on Artificial Intelligence (AAAI), Madison, Wisconsin, 1998.

C. Nédellec, Corpus-based learning of semantic relations by the ILP system, Asium, Learning Language in Logic, Cussens J. and Dzeroski S. (Eds.), Springer Verlag, September 2000.

J. Rennie and A. McCallum. Efficient Web Spidering with Reinforcement Learning. Proceedings of the International Conference on Machine Learning (ICML), 1999.

E. I. Schwartz. Webonomics. New York: Broadway Books, 1997.

E. Schwarzkopf, An adaptive Web site for the UM2001 conference. Proceedings of the Workshop on Machine Learning for User Modeling, in conjunction with the International Conference on User modelling (UM), pp 77-86, 2001.

