
Link Distribution on Wikipedia


Presentation Transcript


  1. Link Distribution on Wikipedia [0407] KwangHee Park

  2. Table of contents • Introduction • Topic modeling • Preliminary Problem • Conclusion

  3. Introduction • Why focus on links • When someone creates a new article on Wikipedia, they usually start by linking to the corresponding article in another language or to similar and related articles; the article is then written out by other editors • Assumption • The linked terms in a Wikipedia article are key terms that represent the specific characteristics of that article

  4. Introduction • The problem we want to solve • To analyze the latent topic distribution of a set of target documents by topic modeling

  5. Topic modeling • Topic • “Topics are latent concepts buried in the textual artifacts of a community, described by a collection of many terms that co-occur frequently in context” (Laura Dietz, Avaré Stewart 2006, ‘Utilize Probabilistic Topic Models to Enrich Knowledge Bases’) • T = {W1, …, Wn}

  6. Topic modeling • Bag-of-words assumption • “The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.” (from Wikipedia) • Each document in the corpus is represented by a vector of integers • {f1, f2, …, f|W|} • fi = frequency of the ith word • |W| = number of words
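The bag-of-words vector above can be sketched in a few lines of Python; the vocabulary and example sentence here are made up for illustration:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as a vector of word frequencies over a fixed
    vocabulary; word order and grammar are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["wikipedia", "link", "topic", "model"]
doc = "topic model of link structure: each link on Wikipedia is a topic hint"
print(bag_of_words(doc, vocab))  # -> [1, 2, 2, 1]
```

Note that two documents using the same words in a different order get identical vectors, which is exactly the simplification the slide describes.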

  7. Topic modeling • Instead of directly associating documents with words, associate each document with some topics and each topic with some significant words • Document = {Tn, Tk, …, Tm} • {Doc : 1} → {Tn : 0.4, Tk : 0.3, …}

  8. Topic modeling • Based upon the idea that documents are mixtures of topics • Modeling • Document → topic → term
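The document → topic → term generative story can be sketched as a toy simulation; the topic names and probabilities below are invented for illustration only:

```python
import random

random.seed(0)

# Toy topic model: a document is a mixture of topics, and each topic
# is a distribution over terms (document -> topic -> term).
doc_topics = {"T1": 0.7, "T2": 0.3}                 # P(topic | document)
topic_terms = {
    "T1": {"link": 0.5, "wikipedia": 0.5},          # P(term | topic)
    "T2": {"cancer": 0.6, "article": 0.4},
}

def generate_word(doc_topics, topic_terms):
    """Sample a topic from the document's mixture, then a term from it."""
    topic = random.choices(list(doc_topics), weights=doc_topics.values())[0]
    terms = topic_terms[topic]
    return random.choices(list(terms), weights=terms.values())[0]

words = [generate_word(doc_topics, topic_terms) for _ in range(5)]
print(words)
```

Running the generative process forward like this is the easy direction; topic modeling is the inverse problem of recovering `doc_topics` and `topic_terms` from the observed words.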

  9. Topic modeling • LSA • Performs dimensionality reduction using the singular value decomposition • The transformed word–document co-occurrence matrix X is factorized into three smaller matrices: U, D, and V • U provides an orthonormal basis for a spatial representation of words • D weights those dimensions • V provides an orthonormal basis for a spatial representation of documents
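The LSA factorization on this slide can be sketched with NumPy; the tiny 3×3 co-occurrence matrix is a made-up example, not data from the study:

```python
import numpy as np

# Toy word-document co-occurrence matrix X (rows = words, cols = documents).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 2.0]])

# SVD: X = U @ diag(d) @ Vt.  Columns of U span word space, d weights
# the latent dimensions, and rows of Vt (columns of V) span document space.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Dimensionality reduction: keep only the k strongest dimensions.
k = 2
X_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]
print(np.round(X_k, 2))  # rank-k approximation of X
```

Truncating to the top k singular values is what turns the raw co-occurrence counts into a low-dimensional "latent semantic" space.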

  10. Topic modeling • pLSA • [Figure: observed word distributions arise from topic distributions per document and word distributions per topic]

  11. Topic modeling • LDA (Latent Dirichlet Allocation) • [Figure: LDA vs. pLSA plate diagrams] • The number of parameters to be estimated in pLSA grows with the size of the training set • In this respect LDA has an advantage • Alpha and beta are corpus-level parameters, sampled once when generating the corpus (outside of the plates!)
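The parameter-growth claim can be made concrete with rough counts. This is a simplification that ignores beta's full dimensionality and hyperparameter choices; the topic, vocabulary, and corpus sizes are arbitrary illustrative numbers:

```python
# Rough parameter counts behind the pLSA-vs-LDA comparison:
# pLSA estimates P(topic | document) separately for every training
# document, so its parameter count grows linearly with corpus size;
# LDA replaces those per-document parameters with a single corpus-level
# Dirichlet parameter alpha (one entry per topic).
def plsa_params(n_topics, vocab_size, n_docs):
    return n_topics * vocab_size + n_topics * n_docs

def lda_params(n_topics, vocab_size):
    return n_topics * vocab_size + n_topics

for n_docs in (1_000, 100_000):
    print(n_docs, plsa_params(50, 10_000, n_docs), lda_params(50, 10_000))
```

With 50 topics and a 10,000-word vocabulary, pLSA's count explodes from 550,000 to 5,500,000 parameters as the corpus grows, while LDA stays at 500,050; this growth is also why pLSA tends to overfit.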

  12. Topic modeling – our approach • Target • Document = Wikipedia article • Terms = linked terms in the document • Modeling method • LDA • Modeling tool • LingPipe API

  13. Advantages of linked terms • No extra preprocessing is needed • Boundary detection • Stopword removal • Word stemming • They carry more semantics • Correlation between term and document • Ex) “cancer” as a term ↔ “cancer” as a document

  14. Preliminary problem • How well do the link terms in a document represent the specific characteristics of that document? • Link evaluation • Calculate similarity between documents

  15. Link evaluation • Similarity-based evaluation • Calculate similarity between terms • Sim_t(term1, term2) • Calculate similarity between documents • Sim_d(doc1, doc2) • Compare the two similarities

  16. Link evaluation • Sim_t • Similarity between terms • Not affected by the input term set • Sim_d • Similarity between documents • Significantly affected by the input term set • (Lin 1991) p, q = topic distribution of each document
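The slide cites Lin (1991), which introduced the Jensen-Shannon divergence, so Sim_d is presumably the JS divergence between the two documents' topic distributions p and q (a formula image appears to have been lost from the slide). A minimal sketch under that assumption, with made-up 3-topic distributions:

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits (0*log 0 taken as 0)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions (Lin 1991):
    JS(p, q) = 1/2 KL(p || m) + 1/2 KL(q || m), with m = (p + q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.4, 0.3, 0.3]   # topic distribution of document 1 (illustrative)
q = [0.1, 0.1, 0.8]   # topic distribution of document 2 (illustrative)
print(round(js_divergence(p, q), 4))
```

Unlike plain KL, JS divergence is symmetric and bounded in [0, 1] bits, which makes it convenient as a document-similarity score over LDA topic distributions.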

  17. Link evaluation • Compare the top 10 most similar items for each link • Ex) Link A • The list of terms most similar to A as a term • The list of documents most similar to A as a document • Compare the two lists – number of overlaps • Experiments are now under way
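The overlap comparison on this slide amounts to intersecting two top-k neighbor lists; the neighbor lists below are hypothetical, not experimental results:

```python
def top_k_overlap(term_neighbors, doc_neighbors, k=10):
    """Count overlaps between the k most similar items under each view.

    term_neighbors: links most similar to A when A is treated as a term
    doc_neighbors:  links most similar to A when A is treated as a document
    """
    return len(set(term_neighbors[:k]) & set(doc_neighbors[:k]))

# Hypothetical neighbor lists for the link "cancer":
as_term = ["tumor", "oncology", "chemotherapy", "leukemia", "biopsy"]
as_doc = ["oncology", "surgery", "tumor", "radiology", "genetics"]
print(top_k_overlap(as_term, as_doc, k=5))  # -> 2 ("tumor", "oncology")
```

A high overlap count would support the assumption that a link behaves consistently whether viewed as a term or as a document of its own.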

  18. Conclusion • Topic modeling with the link distribution on Wikipedia • Need to measure how well the link distribution can represent each article’s characteristics • After that, analyze the topic distribution in a variety of ways • We expect the topic distribution can be applied in many applications

  19. Thank you
