Unleashing the Potential of Text and Web Mining for Structured Data Analysis

Text & Web Mining

Structured Data • So far we have focused on mining from structured data, stored as attribute/value pairs: • Outlook = Sunny • Temperature = Hot • Windy = Yes • Humidity = High • Play = Yes • Most data mining involves such data

Focus: Complex Data Types • Increased importance of complex data: • Spatial data: includes geographic data and medical & satellite images • Multimedia data: images, audio, & video • Time-series data: for example banking data and stock exchange data • Text data: word descriptions for objects • World-Wide-Web: highly unstructured text and multimedia data

Text Databases • Many text databases exist in practice • News articles • Research papers • Books • Digital libraries • E-mail messages • Web pages • Growing rapidly in size and importance

Semi-Structured Data • Text databases are often semi-structured: they fall between structured attribute/value pairs and fully unstructured text • Example: • Title • Author • Publication_Date • Length • Category • Abstract • Content

Handling Text Data • Modeling semi-structured data • Information Retrieval (IR) from unstructured documents • Text mining • Compare documents • Rank importance & relevance • Find patterns or trends across documents

Information Retrieval • IR locates relevant documents • Key words • Similar documents • IR Systems • On-line library catalogs • On-line document management systems

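Performance Measure • Two basic measures, defined over the set of all documents: • Precision: the fraction of retrieved documents that are relevant, $|\text{Relevant} \cap \text{Retrieved}| / |\text{Retrieved}|$ • Recall: the fraction of relevant documents that are retrieved, $|\text{Relevant} \cap \text{Retrieved}| / |\text{Relevant}|$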
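
A minimal sketch of the two measures on sets of document ids. The function name and the example ids are assumptions made for illustration; they do not come from the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant_docs = {1, 2, 3, 4}
retrieved_docs = {3, 4, 5}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(precision, recall)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved.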

Retrieval Methods • Keyword-based IR • E.g., “data and mining” • Synonymy problem: a document may talk about “knowledge discovery” instead • Polysemy problem: “mining” can mean different things • Similarity-based IR • Set of common keywords • Return the degree of relevance • Problem: what is the similarity of “data mining” and “data analysis”?

Modeling a Document • Set of n documents and m terms • Each document is a vector $v \in \mathbb{R}^m$ • The j-th coordinate of v measures the association of the j-th term with the document, e.g. as the relative frequency $v_j = r / R$ • Here r is the number of occurrences of the j-th term in the document and R is the number of occurrences of any term in the document

Frequency Matrix
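
A small sketch of how such a matrix could be built, with one row per document and one column per term. It assumes each entry is the relative frequency r / R, with R taken as the total number of word tokens in the document; the function name and the sample texts are hypothetical.

```python
from collections import Counter


def frequency_matrix(docs, terms):
    """Rows are documents, columns are terms, entries are r / R."""
    matrix = []
    for text in docs:
        words = text.lower().split()
        counts = Counter(words)
        total = len(words)  # R: occurrences of any term (assumed: all tokens)
        matrix.append([counts[t] / total if total else 0.0 for t in terms])
    return matrix


# Hypothetical documents, for illustration only.
docs = ["data mining finds patterns in data", "mining in Idaho"]
terms = ["data", "mining"]
for row in frequency_matrix(docs, terms):
    print(row)
```

With m terms and n documents the matrix has n × m entries, most of them zero in practice, which is why it grows large quickly.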

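Similarity Measures • Dot product: $v_1 \cdot v_2 = \sum_{j=1}^{m} v_{1j} v_{2j}$ • Cosine measure: $\text{sim}(v_1, v_2) = \dfrac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$, where $\|v\| = \sqrt{v \cdot v}$ is the norm of the vector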
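
The two measures can be sketched in a few lines of plain Python. The vectors below are made-up term counts, not the values from the slides, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Hypothetical term-count vectors, for illustration only.
doc_a = [1, 2]
doc_b = [2, 4]
print(cosine(doc_a, doc_b))
```

Because the cosine measure divides by the norms, it depends only on the direction of the vectors, so a long and a short document with the same term proportions count as maximally similar.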

Example • Google search for “association mining” • Two of the documents retrieved: • Idaho Mining Association: mining in Idaho (doc 1) • Scalable Algorithms for Association mining (doc 2) • Using only the two terms

New Model • Add the term “data” to the document model

Frequency Matrix • Will quickly become large • Singular value decomposition can be used to reduce it

Association Analysis • Collect set of keywords frequently used together and find association among them • Apply any association rule algorithm to a database in the format {document_id, a_set_of_keywords}
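
As a rough sketch of the idea, the snippet below counts keyword pairs that occur together in at least a minimum number of documents. It is a brute-force pair count, not a full association rule algorithm such as Apriori, and the database contents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(db, min_support):
    """db maps document_id -> set of keywords.

    Returns the keyword pairs that co-occur in at least min_support documents,
    together with their support counts.
    """
    counts = Counter()
    for keywords in db.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(keywords), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Hypothetical {document_id, a_set_of_keywords} database.
db = {
    "d1": {"data", "mining", "association"},
    "d2": {"data", "mining"},
    "d3": {"mining", "idaho"},
}
print(frequent_pairs(db, min_support=2))
```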

Document Classification • Need already classified documents as training set • Induce a classification model • Any difference from before? A set of keywords associated with a document has no fixed set of attributes or dimensions

Association-Based Classification • Classify documents based on associated, frequently occurring text patterns • Extract keywords and terms with IR and simple association analysis • Create a concept hierarchy of terms • Classify training documents into class hierarchies • Use association mining to discover associated terms to distinguish one class from another

Remember Generalized Association Rules • Taxonomy: • Clothes: Outerwear, Shirts • Outerwear: Jackets, Ski Pants • Footwear: Shoes, Hiking Boots (Footwear is the ancestor of Shoes and Hiking Boots) • Generalized association rule $X \Rightarrow Y$, where no item in Y is an ancestor of an item in X

Classifiers • Let X be a set of terms • Let Anc(X) be those terms and their ancestor terms • Consider a rule $X \Rightarrow C$ and a document d • If $X \subseteq \text{Anc}(d)$ then $X \Rightarrow C$ covers d • A rule that covers d may be used to classify d (but only one can be used)
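
A sketch of Anc and the coverage test, reusing the clothes/footwear taxonomy from the generalized association rules slide as a stand-in for a term hierarchy. The `PARENT` mapping and the function names are assumptions made for illustration.

```python
# Child -> parent links of the example taxonomy (lower-cased).
PARENT = {
    "outerwear": "clothes",
    "shirts": "clothes",
    "jackets": "outerwear",
    "ski pants": "outerwear",
    "shoes": "footwear",
    "hiking boots": "footwear",
}


def anc(terms):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    for term in terms:
        # Walk up the taxonomy until the root or an already-visited term.
        while term is not None and term not in closed:
            closed.add(term)
            term = PARENT.get(term)
    return closed


def covers(rule_terms, doc_terms):
    """A rule with left-hand side X covers document d if X is a subset of Anc(d)."""
    return set(rule_terms) <= anc(doc_terms)


print(anc({"jackets"}))
print(covers({"footwear"}, {"hiking boots", "shirts"}))
```

Matching against Anc(d) rather than d itself lets a rule stated on a general term such as "footwear" apply to a document that only mentions "hiking boots".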

Procedure • Step 1: Generate all generalized association rules $X \Rightarrow C$, where X is a set of terms and C is a class, that satisfy minimum support • Step 2: Rank the rules according to some rule ranking criterion • Step 3: Select rules from the list
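
Steps 2 and 3 might look roughly like the sketch below. It assumes the rules from Step 1 are already available, ranks them by confidence and then support (one possible ranking criterion, not necessarily the intended one), and classifies a document with the first rule in the ranked list that covers it. The `Rule` type, the rule values, and the class labels are all hypothetical.

```python
from collections import namedtuple

# terms: frozenset X, label: class C, plus the statistics used for ranking.
Rule = namedtuple("Rule", "terms label support confidence")


def rank(rules):
    """Step 2: order rules by confidence, breaking ties by support."""
    return sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)


def classify(ranked_rules, doc_terms, default=None):
    """Step 3 (simplified): use the first rule that covers the document.

    doc_terms is assumed to already include ancestor terms, i.e. Anc(d).
    """
    for rule in ranked_rules:
        if rule.terms <= doc_terms:
            return rule.label
    return default


# Hypothetical rules and document, for illustration only.
rules = [
    Rule(frozenset({"footwear"}), "outdoor", 0.10, 0.80),
    Rule(frozenset({"clothes"}), "fashion", 0.20, 0.60),
    Rule(frozenset({"footwear", "clothes"}), "retail", 0.05, 0.90),
]
document = {"shoes", "footwear"}
print(classify(rank(rules), document))
```

Taking only the first covering rule reflects the constraint that several rules may cover a document but only one can be used to classify it.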

Web Mining • The World Wide Web may have more opportunities for data mining than any other area • However, there are serious challenges: • It is too huge • Complexity of Web pages is greater than any traditional text document collection • It is highly dynamic • It has a broad diversity of users • Only a tiny portion of the information is truly useful

Search Engines → Web Mining • Current technology: search engines • Keyword-based indices • Too many relevant pages • Synonymy and polysemy problems • More challenging: web mining • Web content mining • Web structure mining • Web usage mining

Web Content Mining

Example: Classification of Web Documents • Assign a class to each document based on predefined topic categories • E.g., use Yahoo!’s taxonomy and associated documents for training • Keyword-based document classification • Keyword-based association analysis

Web Structure Mining

Authoritative Web Pages • High quality relevant Web pages are termed authoritative • Explore linkages (hyperlinks) • Linking a Web page can be considered an endorsement of that page • Those pages that are linked frequently are considered authoritative • (This has its roots back to IR methods based on journal citations)

Structure via Hubs • A hub is a set of Web pages containing collections of links to authorities • There is a wide variety of hubs: • Simple list of recommended links on a person’s home page • Professional resource lists on commercial sites

HITS • Hyperlink-Induced Topic Search (HITS) • Form a root set of pages using the query terms in an index-based search (200 pages) • Expand into a base set by including all pages the root set links to (1000-5000 pages) • Go into an iterative process to determine hubs and authorities

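Calculating Weights • Authority weight: $a_p = \sum_{q \to p} h_q$, summed over all pages q that point to page p • Hub weight: $h_p = \sum_{p \to q} a_q$, summed over all pages q that page p points to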

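Adjacency Matrix • Let's number the pages {1,2,…,n} • The adjacency matrix A is defined by $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise • By writing the authority and hub weights as vectors a and h we have $a = A^T h$ and $h = A a$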

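Recursive Calculations • With A the adjacency matrix and a, h the authority and hub weight vectors, we now have $a = A^T A \, a$ and $h = A A^T h$ • By linear algebra theory, iterating these updates (with normalization) converges to the principal eigenvectors of the two matrices $A^T A$ and $A A^T$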
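
The iteration can be sketched in plain Python as below. This is a simplified illustration rather than the original HITS implementation: the fixed iteration count, the Euclidean normalization, and the tiny link graph are assumptions.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the set of pages it points to.

    Returns (authority, hub) weight dictionaries, each scaled to unit length.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub weights of the pages q that point to p.
        auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # Hub of p: sum of authority weights of the pages that p points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the weights do not grow without bound.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return auth, hub


# Hypothetical link graph: h1 and h2 act as hubs, a1 and a2 as authorities.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
authority, hub_weight = hits(links)
print(sorted(authority.items(), key=lambda kv: -kv[1]))
```

In this graph a1 is pointed to by both hubs and ends up with the highest authority weight, while h1, which links to both authorities, gets the highest hub weight.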

Output • The HITS algorithm finally outputs • Short list of pages with high hub weights • Short list of pages with high authority weights • Have not accounted for context

Applications • The Clever Project at IBM’s Almaden Labs • Developed the HITS algorithm • Google • Developed at Stanford • Uses algorithms similar to HITS (PageRank) • On-line version

Web Usage Mining

Complex Data Types Summary • Emerging areas of mining complex data types: • Text mining can be done quite effectively, especially if the documents are semi-structured • Web mining is more difficult due to lack of such structure • Data includes text documents, hypertext documents, link structure, and logs • Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification
