Text & Web Mining


PowerPoint Slideshow about 'Text & Web Mining' - jaden


Presentation Transcript
Structured Data
  • So far we have focused on mining from structured data:

Attribute    Value
Outlook      Sunny
Temperature  Hot
Windy        Yes
Humidity     High
Play         Yes

  • Most data mining involves such data

Complex Data Types
  • Increased importance of complex data:
    • Spatial data: includes geographic data and medical & satellite images
    • Multimedia data: images, audio, & video
    • Time-series data: for example banking data and stock exchange data
    • Text data: word descriptions for objects
    • World-Wide-Web: highly unstructured text and multimedia data
Text Databases
  • Many text databases exist in practice
    • News articles
    • Research papers
    • Books
    • Digital libraries
    • E-mail messages
    • Web pages
  • Growing rapidly in size and importance

Semi-Structured Data
  • Text databases are often semi-structured
  • Example:
    • Structured attribute/value pairs
      • Title
      • Author
      • Publication_Date
      • Length
      • Category
    • Unstructured
      • Abstract
      • Content
Handling Text Data
  • Modeling semi-structured data
  • Information Retrieval (IR) from unstructured documents
  • Text mining
    • Compare documents
    • Rank importance & relevance
    • Find patterns or trends across documents
Information Retrieval
  • IR locates relevant documents
    • Key words
    • Similar documents
  • IR Systems
    • On-line library catalogs
    • On-line document management systems
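Performance Measure
  • Two basic measures, defined over the set of all documents:
    • Precision: the fraction of retrieved documents that are relevant, $|\text{Relevant} \cap \text{Retrieved}| \,/\, |\text{Retrieved}|$
    • Recall: the fraction of relevant documents that are retrieved, $|\text{Relevant} \cap \text{Retrieved}| \,/\, |\text{Relevant}|$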
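
The two measures on this slide are commonly called precision and recall. A minimal sketch of how they could be computed from sets of document ids follows; the function name and the ids are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
precision, recall = precision_recall(retrieved, relevant)
```

Returning 0.0 for an empty set is a convention assumed here to avoid division by zero; other conventions are possible.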

Retrieval Methods
  • Keyword-based IR
    • E.g., “data and mining”
    • Synonymy problem: a document may talk about “knowledge discovery” instead
    • Polysemy problem: mining can mean different things
  • Similarity-based IR
    • Set of common keywords
    • Return the degree of relevance
    • Problem: what is the similarity of “data mining” and “data analysis”?
Modeling a Document
  • Set of n documents and m terms
  • Each document is a vector v in $\mathbb{R}^m$
    • The j-th coordinate of v measures the association of the j-th term with the document
    • Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term
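
The weighting formula itself is not given in the text above, so the sketch below assumes one simple choice: coordinate j is the relative frequency r / R, with R counted over the m chosen terms only. The function name, the whitespace tokenizer, and the sample text are all illustrative assumptions.

```python
from collections import Counter


def term_vector(text, terms):
    """Map a document to a vector with one coordinate per term.

    Assumption: coordinate j is r / R, where r counts occurrences of the
    j-th term and R counts occurrences of any of the chosen terms.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer
    counts = Counter(t for t in tokens if t in terms)
    total = sum(counts.values())  # R
    if total == 0:
        return [0.0] * len(terms)
    return [counts[t] / total for t in terms]


terms = ["association", "mining"]
vec = term_vector("scalable algorithms for association mining", terms)
```

Other weightings (binary presence, raw counts, log-scaled counts) fit the same vector model; only the body of the function would change.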
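Similarity Measures
  • Cosine measure: the dot product of the two document vectors divided by the product of the norms of the vectors

$$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$$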
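
As a rough illustration of the cosine measure, the sketch below divides the dot product of two document vectors by the product of their norms. The two vectors are hypothetical term counts, not values taken from the slides.

```python
import math


def cosine(v1, v2):
    """Cosine measure: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm1 * norm2)


# Hypothetical counts for the terms ("association", "mining")
doc1 = [1, 2]
doc2 = [1, 1]
similarity = cosine(doc1, doc2)
```

Because the measure depends only on the angle between the vectors, scaling a document vector (for example, a longer document with the same term proportions) leaves the similarity unchanged.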

Example
  • Google search for “association mining”
  • Two of the documents retrieved:
    • Idaho Mining Association: mining in Idaho (doc 1)
    • Scalable Algorithms for Association mining (doc 2)
  • Using only the two terms
New Model
  • Add the term “data” to the document model
Association Analysis
  • Collect set of keywords frequently used together and find association among them
  • Apply any association rule algorithm to a database in the format

{document_id, a_set_of_keywords}
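
A database in this format can be scanned for keywords that frequently occur together. The sketch below covers only the pair-counting step, not a full association rule algorithm such as Apriori; the function name, the documents, and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(database, min_support):
    """Return keyword pairs that co-occur in at least min_support documents."""
    counts = Counter()
    for _doc_id, keywords in database:
        # Sort so that each pair has one canonical ordering
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical database of (document_id, a_set_of_keywords) records
database = [
    (1, {"data", "mining", "association"}),
    (2, {"data", "mining"}),
    (3, {"idaho", "mining", "association"}),
]
pairs = frequent_pairs(database, min_support=2)
```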

Document Classification
  • Need already classified documents as training set
  • Induce a classification model
  • Any difference from before?
    • A set of keywords associated with a document has no fixed set of attributes or dimensions

Association-Based Classification
  • Classify documents based on associated, frequently occurring text patterns
    • Extract keywords and terms with IR and simple association analysis
    • Create a concept hierarchy of terms
    • Classify training documents into class hierarchies
    • Use association mining to discover associated terms to distinguish one class from another
Remember Generalized Association Rules
  • Taxonomy:
    • Clothes
      • Outerwear
        • Jackets
        • Ski Pants
      • Shirts
    • Footwear (ancestor of shoes and hiking boots)
      • Shoes
      • Hiking Boots
  • Generalized association rule: $X \Rightarrow Y$, where no item in Y is an ancestor of an item in X

Classifiers
  • Let X be a set of terms
  • Let Anc(X) be those terms and their ancestor terms
  • Consider a rule $X \Rightarrow C$ and a document d
  • If $X \subseteq \mathrm{Anc}(d)$, then $X \Rightarrow C$ covers d
  • A rule that covers d may be used to classify d (but only one can be used)
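
The covering test can be sketched as a subset check against a document's terms plus their ancestors. The parent map below is an assumed encoding of the clothes/footwear taxonomy from the earlier slide, and the names `PARENT`, `anc`, and `covers` are hypothetical.

```python
# Assumed encoding of the taxonomy: each term maps to its parent term
PARENT = {
    "jackets": "outerwear",
    "ski pants": "outerwear",
    "outerwear": "clothes",
    "shirts": "clothes",
    "shoes": "footwear",
    "hiking boots": "footwear",
}


def anc(terms):
    """Return the given terms together with all of their ancestor terms."""
    result = set()
    for term in terms:
        # Walk up the taxonomy; stop at a root or at an already-visited term
        while term is not None and term not in result:
            result.add(term)
            term = PARENT.get(term)
    return result


def covers(rule_terms, doc_terms):
    """A rule X => C covers document d when X is a subset of Anc(d)."""
    return set(rule_terms) <= anc(doc_terms)
```

For example, a rule whose left-hand side is {"outerwear"} covers a document containing "jackets", because "outerwear" is an ancestor of "jackets".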
Procedure
  • Step 1: Generate all generalized association rules $X \Rightarrow C$, where X is a set of terms and C is a class, that satisfy minimum support.
  • Step 2: Rank the rules according to some rule ranking criterion
  • Step 3: Select rules from the list
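
Steps 2 and 3 might look roughly like the sketch below. The ranking criterion is not specified above, so ordering by confidence and then support is an assumption, as are the `Rule` type, the sample rules, and the choice to classify with the first rule that covers the document. Rule generation (Step 1) is not shown.

```python
from typing import FrozenSet, NamedTuple, Optional


class Rule(NamedTuple):
    terms: FrozenSet[str]  # X, the set of terms
    label: str             # C, the class
    support: float
    confidence: float


def rank(rules):
    # Assumed criterion: higher confidence first, ties broken by support
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))


def classify(ranked_rules, doc_terms_with_ancestors) -> Optional[str]:
    """Use the first (highest-ranked) rule that covers the document."""
    for rule in ranked_rules:
        if rule.terms <= doc_terms_with_ancestors:
            return rule.label
    return None  # no rule covers the document


# Hypothetical rules, as if produced by Step 1
rules = [
    Rule(frozenset({"mining"}), "geology", 0.30, 0.55),
    Rule(frozenset({"mining", "data"}), "computing", 0.20, 0.90),
]
ranked = rank(rules)
label = classify(ranked, {"data", "mining", "algorithms"})
```

Using only the top-ranked covering rule is one way to honor the constraint that a single rule classifies each document.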
Web Mining
  • The World Wide Web may have more opportunities for data mining than any other area
  • However, there are serious challenges:
    • It is too huge
    • Complexity of Web pages is greater than any traditional text document collection
    • It is highly dynamic
    • It has a broad diversity of users
    • Only a tiny portion of the information is truly useful
Search Engines → Web Mining
  • Current technology: search engines
    • Keyword-based indices
    • Too many relevant pages
    • Synonymy and polysemy problems
  • More challenging: web mining
    • Web content mining
    • Web structure mining
    • Web usage mining
Example: Classification of Web Documents
  • Assign a class to each document based on predefined topic categories
  • E.g., use Yahoo!’s taxonomy and associated documents for training
  • Keyword-based document classification
  • Keyword-based association analysis
Authoritative Web Pages
  • High quality relevant Web pages are termed authoritative
  • Explore linkages (hyperlinks)
    • Linking a Web page can be considered an endorsement of that page
    • Those pages that are linked frequently are considered authoritative
    • (This has its roots back to IR methods based on journal citations)
Structure via Hubs
  • A hub is a set of Web pages containing collections of links to authorities
  • There is a wide variety of hubs:
    • Simple list of recommended links on a person’s home page
    • Professional resource lists on commercial sites
HITS
  • Hyperlink-Induced Topic Search (HITS)
    • Form a root set of pages using the query terms in an index-based search (200 pages)
    • Expand into a base set by including all pages the root set links to (1000-5000 pages)
    • Go into an iterative process to determine hubs and authorities
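Calculating Weights
  • Authority weight of page p: the sum of the hub weights of the pages q that point to p

$$a_p = \sum_{q \to p} h_q$$

  • Hub weight of page p: the sum of the authority weights of the pages q that p points to

$$h_p = \sum_{p \to q} a_q$$

Here $q \to p$ means that page p is pointed to by page q.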

Adjacency Matrix
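  • Let's number the pages {1,2,…,n}
  • The adjacency matrix A is defined by $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise
  • By writing the authority and hub weights as vectors a and h we have $a = A^{T} h$ and $h = A a$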
Recursive Calculations
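  • We now have $a = A^{T} h = A^{T} A \, a$ and $h = A a = A A^{T} h$, where A is the adjacency matrix and a and h are the authority and hub weight vectors
  • By linear algebra theory, repeating this update (with normalization) converges to the principal eigenvectors of the two matrices $A^{T} A$ and $A A^{T}$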
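
The recursive calculation can be sketched as a power iteration in plain Python. The fixed iteration count, the unit-length normalization, and the four-page link graph are assumptions made for illustration; this is not the reference HITS implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def hits(adjacency, iterations=50):
    """adjacency[i][j] == 1 when page i links to page j."""
    n = len(adjacency)
    authority = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of hub weights of the pages i that link to j
        authority = [sum(adjacency[i][j] * hub[i] for i in range(n))
                     for j in range(n)]
        # Hub of i: sum of authority weights of the pages j that i links to
        hub = [sum(adjacency[i][j] * authority[j] for j in range(n))
               for i in range(n)]
        authority = normalize(authority)
        hub = normalize(hub)
    return authority, hub


# Hypothetical graph: pages 0 and 1 link to page 2; page 0 also links to page 3
adjacency = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(adjacency)
```

In this small graph page 2 ends up with the largest authority weight and page 0 with the largest hub weight, which matches the intuition that good hubs point to good authorities.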
Output
  • The HITS algorithm finally outputs
    • Short list of pages with high hub weights
    • Short list of pages with high authority weights
  • Have not accounted for context
Applications
  • The Clever Project at IBM’s Almaden Labs
    • Developed the HITS algorithm
  • Google
    • Developed at Stanford
    • Uses algorithms similar to HITS (PageRank)
    • On-line version
Complex Data Types Summary
  • Emerging areas of mining complex data types:
    • Text mining can be done quite effectively, especially if the documents are semi-structured
    • Web mining is more difficult due to lack of such structure
      • Data includes text documents, hypertext documents, link structure, and logs
      • Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification