Applying Semantic Analyses to Content-based Recommendation and Document Clustering

Eric Rozell, MRC Intern

Rensselaer Polytechnic Institute

Bio
  • Graduate Student @ Rensselaer Polytechnic Institute
  • Research Assistant @ Tetherless World Constellation
  • Student Fellow @ Federation of Earth Science Informatics Partners
  • Research Advisor: Peter Fox
  • Research Focus: Semantic eScience
  • Contact: rozele@rpi.edu
Outline


  • Background
  • Semantic Analysis
    • Probase Conceptualization
    • Explicit Semantic Analysis
    • Latent Dirichlet Allocation
  • Recommendation Experiment
    • Recommendation Systems
    • Experiment Setup
    • Results
  • Clustering Experiment
    • Problem
    • K-Means
    • Results
  • Conclusions


Background


  • Billions of documents on the Web
  • Semi-structured data from Web 2.0 (e.g., tags, microformats)
  • Most knowledge remains in unstructured text
  • Many natural language techniques for:
    • Ontology extraction
    • Topic extraction
    • Named entity recognition/disambiguation
  • Some techniques are better than others for various information retrieval tasks…


Probase


  • Developed at Microsoft Research Asia
  • Probabilistic knowledge base built from Bing index and query logs (and other sources)
  • Text mining patterns
    • Namely, Hearst patterns: “… artists such as Picasso”
      • Evidence for hypernym(artists, Picasso)
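
As a rough illustration (not Probase's actual extractor), a single "such as" Hearst pattern can be mined with a regular expression. The pattern and the head-noun heuristic below are simplified assumptions for the sketch:

```python
import re

# One Hearst pattern: "<hypernym phrase> such as <hyponym>[, <hyponym>...]"
PATTERN = re.compile(r"(\w[\w ]*?)\s+such as\s+([\w ]+(?:,\s*[\w ]+)*)")

def hearst_pairs(text):
    """Extract (hypernym, hyponym) evidence pairs from free text."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1).split()[-1]   # crude heuristic: take the head noun
        for hyponym in re.split(r",\s*", m.group(2)):
            pairs.append((hypernym, hyponym.strip()))
    return pairs

# hearst_pairs("famous artists such as Picasso") -> [("artists", "Picasso")]
```

A production system would add many more patterns ("including", "especially", "and other") and a real noun-phrase chunker; this shows only the core idea of turning surface patterns into hypernym evidence.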



Probase


  • Very capable at conceptualizing groups of entities:
    • “China; India; United States” yields “country”
    • “China; India; Brazil; Russia” yields “emerging market”
  • Differentiates attributes and entities
    • “birthday” -> “person” as attribute
    • “birthday” -> “occasion” as entity
  • Applications
    • Clustering Tweets from Concepts [Song et al., 2011]
    • Understanding Web Tables
    • Query Expansion (Topic Search)
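
To make the entity-set examples concrete, here is a toy sketch of conceptualization. The P(concept | entity) table is invented for illustration, and the scoring (a plain product of posteriors over shared concepts) is a simplification of Probase's actual probabilistic model:

```python
# Hypothetical P(concept | entity) table standing in for Probase.
P = {
    "china":  {"country": 0.4, "emerging market": 0.6},
    "india":  {"country": 0.4, "emerging market": 0.6},
    "brazil": {"country": 0.3, "emerging market": 0.7},
    "russia": {"country": 0.3, "emerging market": 0.7},
    "united states": {"country": 0.9},
}

def conceptualize(entities):
    """Rank concepts shared by all entities by the product of P(c|e)."""
    shared = set.intersection(*(set(P[e]) for e in entities))
    scores = {}
    for c in shared:
        score = 1.0
        for e in entities:
            score *= P[e][c]
        scores[c] = score
    return max(scores, key=scores.get) if scores else None
```

With these invented numbers, {China, India, United States} yields "country" (the only shared concept), while {China, India, Brazil, Russia} yields "emerging market", mirroring the behavior described on the slide.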


Research Questions


  • What’s the best way of extracting concepts from text?
    • Compare techniques for semantic analysis
  • How are extracted concepts useful?
    • Generate data about where semantic analysis techniques are applicable
  • Are user ratings affected by the concepts in media items such as movies?
    • Test semantic analysis techniques in recommender systems
  • How useful is Web-scale domain knowledge in narrower domains for information retrieval?
    • Identify need for domain specific knowledge


Semantic Analysis


  • Generating meaning (concepts) from text
  • Specifically, get prevalent hypernyms
    • E.g., “… Apple, IBM, and Microsoft …”
    • “technology companies”
  • Semantic analysis using external knowledge
    • Probase Conceptualization
    • Explicit Semantic Analysis
    • WordNet Synsets
  • Semantic analysis using latent features
    • Latent Dirichlet Allocation
    • Latent Semantic Analysis


Probase Conceptualization


(Diagram) For each document in the corpus, each term (t1, t2, t3, …) is mapped through Probase to candidate concepts (c1, c2, c3, …); the per-term concepts are combined by Naïve Bayes / summation and filtered by inverse document frequency to yield the document concepts.

Probase Conceptualization


  • “Cowboy doll Woody (Tom Hanks) is coordinating a reconnaissance mission to find out what presents his owner Andy is getting for his birthday party days before they move to a new house. Unfortunately for Woody, Andy receives a new spaceman toy, Buzz Lightyear (Tim Allen), who impresses the other toys and Andy, who starts to like Buzz more than Woody. Buzz thinks that he is an actual space ranger, not a toy, and thinks that Woody is interfering with his "mission" to return to his home planet…”


Text Source: Internet Movie Database (IMDb)

Sample Features for “Toy Story” (Probase)


  • dvd encryptions (0.050): “RC”
  • duty free item (0.044): “toys”
  • generic word (0.043): “they, travel, it, …”
  • satellite mission (0.032): “reconnaissance mission”
  • creator-owned work (0.020): “Woody”
  • amazing song (0.013): “fury”
  • doubtful word (0.013): “overcome”
  • ill-fated tool (0.013): “Buzz”
  • lovable “toy story” character (0.011): “Buzz Lightyear, Woody, …”
  • pleased star (0.010): “Woody”
  • trail builder (0.010): “Woody”


Explicit Semantic Analysis


Image Source: Gabrilovich et al., 2007

Sample Features for “Toy Story” (ESA)


  • #REDIRECT [[Buzz!]] (0.034)
  • #REDIRECT [[The Buzz]] (0.028)
  • #REDIRECT [[Buzz (comics)]] (0.027)
  • #REDIRECT [[Buzz cut]] (0.027)
  • #REDIRECT [[Buzz (DC Thomson)]] (0.024)
  • #REDIRECT [[Buzz Out Loud]] (0.024)
  • #REDIRECT [[The Daily Buzz]] (0.023)
  • #REDIRECT [[Buzz Aldrin]] (0.022)
  • #REDIRECT [[Buzz cut]] (0.022)
  • #REDIRECT [[Buzzing Tree Frog]] (0.022)


Latent Dirichlet Allocation


  • Blei et al., 2003
  • Unsupervised Learning Method
  • “Generates” documents from Dirichlet distributions over words and topics
  • Topic distributions over documents can be inferred from corpus
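
The generative story can be sketched in plain Python. The topics, vocabulary, and the Dirichlet sampling via normalized Gamma draws below are illustrative assumptions; this shows how LDA "generates" a document, not how inference works:

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def choice(weights, items, rng):
    """Pick one item with probability proportional to its weight."""
    r, acc = rng.random() * sum(weights), 0.0
    for w, item in zip(weights, items):
        acc += w
        if r <= acc:
            return item
    return items[-1]

def generate_document(n_words, topics, vocab, alpha, rng):
    """LDA generative story: draw theta ~ Dir(alpha); for each word,
    draw a topic z ~ theta, then a word w ~ topics[z]."""
    theta = dirichlet([alpha] * len(topics), rng)
    doc = []
    for _ in range(n_words):
        z = choice(theta, range(len(topics)), rng)
        doc.append(choice(topics[z], vocab, rng))
    return doc

rng = random.Random(0)
topics = [[0.9, 0.1, 0.0, 0.0],   # topic 0 favours "space", "toy"
          [0.0, 0.0, 0.5, 0.5]]   # topic 1 favours "party", "house"
vocab = ["space", "toy", "party", "house"]
doc = generate_document(10, topics, vocab, alpha=0.5, rng=rng)
```

Inference runs this story in reverse: given the observed documents, it recovers the per-document topic distributions theta, which are the features used in the experiments here.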


Image Source: Wikipedia

Recommendation Systems


  • Collaborative Filtering
    • “Customers who purchased X also purchased Y.”
  • Content-based
    • “Because you enjoyed ‘GoldenEye’, you may want to watch ‘Mission: Impossible’.”
  • Hybrid
    • Most modern systems take a hybrid approach.
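
The collaborative "customers who purchased X also purchased Y" idea reduces to co-occurrence counting. The purchase baskets below are hypothetical:

```python
from collections import Counter

# Hypothetical purchase histories.
baskets = [
    {"GoldenEye", "Mission: Impossible"},
    {"GoldenEye", "Mission: Impossible", "Heat"},
    {"GoldenEye", "Toy Story"},
    {"Toy Story", "Cars"},
]

def also_purchased(item):
    """Rank items most often co-purchased with `item`."""
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(basket - {item})
    return [other for other, _ in co.most_common()]

# also_purchased("GoldenEye")[0] is "Mission: Impossible" (2 co-purchases)
```

Real collaborative filtering normalizes these counts (e.g., by item popularity) and works on ratings rather than raw purchases, but the co-occurrence core is the same.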


Content-based Recommendation


  • In GoldenEye/Mission: Impossible example…
    • Structured item content
      • Genre – Action/Adventure/Thriller
      • Tags – Action, Espionage, Adventure
    • Unstructured item content
      • Plot synopses – “helicopter, agent, infiltrate, CIA, …”
      • Concepts? – “aircraft, intelligence agency, …”


Recommendation Systems

(Diagram) Structured content-based approaches, collaborative filtering approaches, and unstructured content-based approaches; the semantic analysis approaches are tested in the unstructured content-based setting.

Experiment

(Diagram) Movie ratings from MovieLens and movie synopses from IMDb feed feature generation; features are passed to the Matchbox recommendation platform, and predictions are evaluated by mean absolute error (MAE).

Matchbox


Source: Matchbox API Documentation

Experimental Data


  • Data: MovieLens Dataset [HetRec ’11]
    • 855,598 ratings
    • 10,197 movies
    • 2,113 users
  • Movie synopses from IMDb (http://www.imdb.com)
    • Collected synopses for 2,633 movies
    • With 435,043 ratings
    • From 2,113 users
  • Ratings data:
    • Scored by half points from 0.5 to 5
  • Choose different numbers of movies (200; 1,000; all)
  • Train on 90% of ratings, test on remaining 10%
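
Held-out predictions are scored by mean absolute error (MAE), which is straightforward to compute. The ratings below are made up for illustration:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Ratings on the MovieLens half-point scale (0.5 to 5.0).
truth = [4.0, 3.5, 5.0, 2.0]
preds = [3.5, 4.0, 4.5, 2.5]
# mean_absolute_error(truth, preds) == 0.5
```

Lower MAE is better; on a 0.5-to-5 scale, an MAE of 0.5 means predictions are off by one half-point step on average.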


Experimental Data


  • Controls
    • Baseline 1: Only features are user IDs and movie IDs
    • Baseline 2: User IDs, Movie IDs, Movie Genre
    • Baseline 3: User IDs, Movie IDs, Movie Tags
  • Feature Sets
    • Term Frequency – Inverse Document Frequency
    • Latent Dirichlet Allocation
    • Explicit Semantic Analysis
    • Probase Conceptualization


Experimental Setup


  • 4 Scenarios: (training: white, testing: black)

(Diagram) The four train/test splits over the Users × Movies rating matrix.

Results


Results


  • testing set contains users and movies not seen in training set
  • recommendations based on item features alone
  • small amounts of structured data (e.g., genre) are the most influential in this scenario


Results


  • testing set contains users not seen in training set.
  • lots of collaborative data available (explains comparable performance in all feature sets)
  • given extensive collaborative data, item features are marginally beneficial (in Matchbox)


Results


  • testing set contains movies not seen in the training set
  • recommendations based on item features and extensive information on users “rating model”
  • small amounts of structured data (e.g., genre) are the most influential in this scenario (even for long-term users)


Results


  • testing set contains users and movies seen in the training set
  • recommendations again are primarily collaborative
  • given a large corpus of rating data for users and items, item features are only marginally beneficial


Results


Document Clustering


  • Divide a corpus into a specified number of groups
  • Useful for information retrieval
    • Automatically generated topics for search results
    • Recommendations for similar items/pages
    • Visualization of search space


K-Means


  • Start with initial clusters
  • Compute means of clusters
  • Compare cosine distance of each item to means
  • Assign items to clusters based on minimum distance
  • Repeat from step 2 until convergence
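
The steps above can be sketched directly; this is a minimal pure-Python K-Means with cosine distance, not the implementation used in the experiments:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def kmeans(items, assignments, k, max_iters=100):
    """Iterate mean computation and reassignment until convergence."""
    for _ in range(max_iters):
        # Step 2: compute the mean of each cluster.
        means = []
        for c in range(k):
            members = [items[i] for i, a in enumerate(assignments) if a == c]
            if not members:
                members = [items[c % len(items)]]   # re-seed an empty cluster
            means.append([sum(col) / len(members) for col in zip(*members)])
        # Steps 3-4: assign each item to the nearest mean.
        new = [min(range(k), key=lambda c: cosine_distance(item, means[c]))
               for item in items]
        if new == assignments:   # Step 5: converged
            return new
        assignments = new
    return assignments
```

With cosine distance, vector length is ignored and only direction matters, which suits TF-IDF-style document vectors where document length should not dominate similarity.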


Experimental Setup


  • Generate features for datasets
  • Randomly assign initial clusters
  • Run K-Means
  • Compute purity and ARI
  • Repeat steps 2-4 20 times for mean and standard deviation
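
Purity and the adjusted Rand index (ARI) can be computed from cluster assignments and gold labels as follows; this is a generic sketch of the standard definitions, not tied to any particular library:

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Fraction of items sharing the majority label of their cluster."""
    total = correct = 0
    for c in set(clusters):
        members = [labels[i] for i, a in enumerate(clusters) if a == c]
        correct += Counter(members).most_common(1)[0][1]
        total += len(members)
    return correct / total

def adjusted_rand_index(clusters, labels):
    """ARI: pair-counting agreement between the two partitions,
    corrected for the agreement expected by chance."""
    n = len(clusters)
    contingency = Counter(zip(clusters, labels))
    sum_ij = sum(math.comb(v, 2) for v in contingency.values())
    sum_a = sum(math.comb(v, 2) for v in Counter(clusters).values())
    sum_b = sum(math.comb(v, 2) for v in Counter(labels).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Purity alone rewards making many tiny clusters; ARI's chance correction penalizes that, which is why reporting both gives a fuller picture.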


Experimental Data


From sci.electronics…

“A couple of years ago I put together a Tesla circuit which was published in an electronics magazine and could have been the circuit which is referred to here. This one used a flyback transformer from a TV onto which you wound your own primary windings...”

  • 20 Newsgroups (mini)
  • 2,000 messages from Usenet newsgroups
  • 100 messages per topic
  • Filter messages for body text
  • Source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html


Results


Results Comparison


  • Song et al. Tweets Clustering
    • Experiment #2: Subtle Cluster Distinctions
    • Used Tweets about NA, Asia, Africa and Europe
    • Comparable performance for ESA and Probase Conceptualization
  • Hotho et al. WordNet Clustering
    • Used Reuters dataset and Bisecting K-Means
    • Found best results for combined TF-IDF and feature sets
    • Overall improvement from WordNet features was comparable to Probase features (on the order of +10%)


Conclusions


  • Semantic analysis features are marginally beneficial in recommendation
  • Structured data from a limited vocabulary works best for recommending “new items”
  • Explicit and latent semantic analysis are comparable in recommendation
  • Knowledge bases generated at Web-scale may be too noisy for narrow domain tasks
  • Confirmed the efficacy of semantic analysis in document clustering tasks


Future Directions


  • Noise Reduction
    • Tune the recommender platform for “concepts”
    • Further explore parameter space for feature generators
    • Hybrid Conceptualization / Named Entity Disambiguation?
  • Domain-specific knowledge sources
    • Comparison of Web-scale and domain-specific resources as external knowledge (e.g., [Aljaber et al., 2010])


Further Reading


  • Short Text Conceptualization Using a Probabilistic Knowledge Base [Song et al., 2011]
  • Exploiting Wikipedia as External Knowledge for Document Clustering [Hu et al., 2009]
  • Hybrid Recommender Using WordNet “Bag of Synsets” [Degemmis et al., 2007]
  • Hybrid Recommender Using LDA [Jin et al., 2005]
  • Feature Generation for Text Categorization Using World Knowledge [Gabrilovich and Markovitch, 2005]
  • WordNet Improves Text Document Clustering [Hotho et al., 2003]


Acknowledgements
  • David Stern, Ulrich Paquet, Jurgen Van Gael
  • Haixun Wang, Yangqiu Song, Zhongyuan Wang
  • Special thanks to Evelyne Viegas!
  • Microsoft Research Connections
References
  • [Gabrilovich et al., 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), Rajeev Sangal, Harish Mehta, and R. K. Bagga (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1606-1611.
  • [Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003), 993-1022.
  • [Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In IJCAI, 2011.
  • [Stern et al., 2009] David H. Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 111-120.
  • [HetRec ’11] Ivan Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, New York, NY, USA.
  • [Degemmis et al., 2007] Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2007. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Modeling and User-Adapted Interaction. Vol. 17, Issue 3, 217-255.
References
  • [Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. 2005. A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). ACM, New York, NY, USA, 612-617.
  • [Hu et al., 2009] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 389-396.
  • [Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'05).
  • [Hotho et al., 2003] Andreas Hotho, Steffen Staab, and Gerd Stumme. 2003. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, 541-544.
  • [Aljaber et al., 2010] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei. 2010. Document clustering of scientific texts using citation contexts. Information Retrieval. Vol. 13, Issue 2, 101-131.
Questions?
  • Thanks for attending
Appendix
  • Matchbox Details
  • Implementation Details
  • Probase Conceptualization Details
  • Explicit Semantic Analysis Details
  • Learnings from Probase
(Appendix A) Matchbox


  • [Stern et al., 2009]
  • MSR Cambridge recommendation platform
  • Implements a hybrid recommender using Infer.NET
    • Uses combination of expectation propagation (EP) and variational message passing
  • Reduces user, item, and context features to low dimensional trait space


(Appendix A) Matchbox Setup


  • Matchbox settings
    • Use 20 trait dimensions (determined experimentally)
    • 10 iterations of EP algorithm
    • Trained on approx. 90% of ratings
    • Updated model with 75% of ratings per user (in remaining 10%)
    • MAE computed for remaining 25% per user


(Appendix B) Implementation


  • ESA: https://github.com/faraday/wikiprep-esa
  • LDA: Infer.NET
  • Probase: Probase Package v. 0.18
  • TF-IDF: http://www.codeproject.com/KB/cs/tfidf.aspx
  • Matchbox: http://codebox/matchbox


(Appendix C) Probase Conceptualization


  • Identify all Probase terms in text
  • Use Noisy-or Model to combine:
    • Concepts from tl as attribute (zl = 1)
    • Concepts from tl as entity/concept (zl = 0)


(Appendix C) Probase Conceptualization


  • Weight terms based on occurrence
    • Naïve Bayes (similar to Song et al., 2010)
      • Compute P(c|t) for individual terms and use Naïve Bayes model to derive concepts
      • Penalizes false positives, does not reward true positives
      • Generates very small probabilities for large numbers of terms
    • Weighted Sum (similar to Gabrilovich et al., 2007)
      • Compute P(c|t) for individual terms and compute sum over document for each concept
      • Rewards true positives, does not penalize false positives (accurate concepts and inaccurate concepts, resp.)
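
The two combination rules can be contrasted on a toy example. The P(c|t) values are invented, and the Naïve Bayes variant is shown without the P(c) prior correction, as a deliberate simplification:

```python
from functools import reduce

# Hypothetical P(concept | term) scores for terms observed in a document.
p_c_given_t = {
    "woody": {"character": 0.4, "tree": 0.2},
    "buzz":  {"character": 0.5, "sound": 0.3},
}

def naive_bayes_combine(terms):
    """Multiply P(c|t) across terms; a concept missing for any term
    is dropped entirely (false positives are penalized)."""
    shared = set.intersection(*(set(p_c_given_t[t]) for t in terms))
    return {c: reduce(lambda acc, t: acc * p_c_given_t[t][c], terms, 1.0)
            for c in shared}

def weighted_sum_combine(terms):
    """Sum P(c|t) across terms; any supporting term adds evidence
    (true positives are rewarded, false positives survive)."""
    scores = {}
    for t in terms:
        for c, p in p_c_given_t[t].items():
            scores[c] = scores.get(c, 0.0) + p
    return scores
```

On ["woody", "buzz"], the product keeps only "character" (supported by both terms), while the sum also keeps "tree" and "sound", illustrating the trade-off stated in the bullets above.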


(Appendix C) Probase Conceptualization


  • Penalize frequent concepts
    • Standard stop-word lists are domain-independent
    • For films, there are many domain-specific stop concepts
      • E.g., “movie”, “character”, “actor”, etc.
    • Inverse document frequency on concepts penalizes those that are too frequent
    • But it also rewards those that are too infrequent (appearing in only one document)
    • Solution: filter for minimum and maximum occurrence
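
A minimal sketch of the filtering described above, assuming the per-concept document frequencies have already been counted:

```python
import math

def concept_idf(concept_df, n_documents, min_df=2, max_df_ratio=0.5):
    """IDF weights over concepts, dropping those too rare or too frequent.

    concept_df maps each concept to its document frequency."""
    weights = {}
    for concept, df in concept_df.items():
        if df < min_df or df > max_df_ratio * n_documents:
            continue   # filter stop-concepts and one-off concepts
        weights[concept] = math.log(n_documents / df)
    return weights
```

With the thresholds used here (minimum occurrence 2, maximum half the corpus), a stop-concept like "movie" appearing in 95 of 100 documents and a one-off concept appearing once are both discarded.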


(Appendix C) Probase Conceptualization


  • Using Summation (similar to Wikipedia ESA)
  • Using Naïve Bayes from Song et al. approach
    • P(c|T) ∝ P(T|c)·P(c)/P(T) ∝ ∏ P(c|tl) / P(c)^(L-1)

  • Inverse Document Frequency for concepts
    • IDF(ck) = log ( # of documents / document frequency of ck )
    • Minimum occurrence = 2
    • Maximum occurrence = 0.5 * # of documents


(Appendix D) Explicit Semantic Analysis


  • Gabrilovich et al., 2007
  • Builds inverted index of Wikipedia content
  • Input text converted to weight vector of concepts based on TF-IDF
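
A miniature sketch of the ESA pipeline using a three-article toy "Wikipedia"; the articles and weighting details are illustrative, not Gabrilovich et al.'s exact scheme:

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini "Wikipedia": concept name -> article text.
articles = {
    "Buzz Aldrin":  "astronaut moon space apollo",
    "Toy":          "toy child play plastic",
    "Space Ranger": "space toy ranger fiction",
}

# Build the inverted index: word -> {concept: TF-IDF weight}.
index = defaultdict(dict)
n = len(articles)
df = Counter(w for text in articles.values() for w in set(text.split()))
for concept, text in articles.items():
    for word, tf in Counter(text.split()).items():
        index[word][concept] = tf * math.log(n / df[word])

def interpret(text):
    """ESA interpretation: sum the concept weight vectors of the input words."""
    vector = Counter()
    for word in text.split():
        for concept, weight in index.get(word, {}).items():
            vector[concept] += weight
    return vector
```

For the input "space toy", the concept "Space Ranger" dominates because both input words point to it, which is exactly how ESA turns free text into a weighted concept vector.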


(Appendix E) Learnings from Probase
  • Conceptualization works wonders for small numbers of entities
  • Would be extremely useful in a large-scale QA environment with many semantic analysis and ML algorithms (e.g., Watson)
  • A noisy source of knowledge is best suited to noise-tolerant IR applications
  • Still being developed and improving!
    • Working on recognizing verbs