
Link Distribution in Wikipedia



Presentation Transcript


  1. Link Distribution in Wikipedia [0324] KwangHee Park

  2. Table of contents
     • Introduction
     • Clustering using LDA
     • Experiment
     • Disease, Settlement
     • Demo
     • Considering Application

  3. Introduction
     • Why focus on links?
       • When someone creates a new article in Wikipedia, they usually start by linking to a source in another language or to similar and related articles. Other editors then write the rest of the article.
     • Assumption
       • The link terms in a Wikipedia article are key terms that represent the specific characteristics of that article.

  4. Introduction
     • The problem we want to solve:
       • Analyze the latent distribution of a set of target documents by clustering their link term sets
       • Find the tendency of the latent distribution for a specific domain by limiting the input documents to that domain

  5. Process
     • Terminology
       • Term set = all terms in the input documents
       • Topic = a set of terms: {Wi, …, Wn}
       • Document = a set of terms: {Wk, Wl, …, Wn}
       • Document = a weighted set of topics: {Tn, Tk, …, Tm}
         • {Doc: 1} {Tn: 0.4, Tk: 0.3, …}
     • Cluster the term set
     • Find the latent distribution of each document
     • Group by domain
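The terminology on this slide can be made concrete with a small sketch. The document names, link terms, and topic weights below are invented for illustration, and the helper names `build_term_set` and `is_distribution` are hypothetical; the slides do not show the actual data format.

```python
# Hypothetical toy data: each document is the list of link terms found in one article.
docs = {
    "Doc1": ["Fever", "Virus", "Vaccine"],
    "Doc2": ["City", "County", "Population"],
    "Doc3": ["Virus", "Population"],
}


def build_term_set(documents):
    """Return the term set: every link term that occurs in the input documents."""
    terms = set()
    for link_terms in documents.values():
        terms.update(link_terms)
    return terms


def is_distribution(weights, tol=1e-9):
    """Check that topic weights are non-negative and sum to 1."""
    values = list(weights.values())
    return all(w >= 0 for w in values) and abs(sum(values) - 1.0) < tol


term_set = build_term_set(docs)

# A document viewed as a weighted set of topics, in the slide's
# {Tn: 0.4, Tk: 0.3, ...} form. The weights are made up; a real run
# would take them from the clustering step.
doc1_topics = {"T1": 0.4, "T2": 0.3, "T3": 0.3}
```

Keeping documents as plain lists of link terms means the same structure can feed the clustering step directly.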

  6. LDA
     • The clustering technique: latent Dirichlet allocation (LDA)
     • The LDA model consists of a fixed number of topics.
     • Each topic is modeled as a distribution over words.
     • A document under LDA is modeled as a distribution over topics.
     • [Figure: a term set, Topic 1 to Topic n, and documents Doc 1, Doc 2, Doc 3]
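The slides do not say which LDA implementation or inference method was used. As a rough illustration of the model described here, the following is a minimal collapsed Gibbs sampling sketch in plain Python. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions made for this example.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of link terms.

    Returns (doc_topic, topic_word): one topic distribution per document
    and one word distribution per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables used by the sampler.
    n_dk = [[0] * num_topics for _ in docs]              # document x topic
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic x word
    n_k = [0] * num_topics                                # tokens per topic
    z = []                                                # topic of each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][v] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Smoothed estimates: a document is a distribution over topics,
    # a topic is a distribution over words.
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        {w: (n_kw[t][word_id[w]] + beta) / (n_k[t] + vocab_size * beta)
         for w in vocab}
        for t in range(num_topics)
    ]
    return doc_topic, topic_word


# Invented toy input: link terms of four short articles.
toy_docs = [
    ["Fever", "Virus", "Vaccine", "Virus"],
    ["City", "County", "Population", "City"],
    ["Virus", "Fever", "Vaccine"],
    ["County", "City", "Population"],
]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
```

Each row of `doc_topic` corresponds to the {Tn: 0.4, Tk: 0.3, …} form from the Process slide. For corpora of the size reported in the experiment, an optimized library implementation would be a more practical choice than this sketch.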

  7. Experiment
     • Domains
       • Disease
         • #Docs: 208
         • #Link terms: English: 46615, Spanish: 34560, French: 31747, Chinese: 9286, Korean: 3272
       • Settlement
         • #Docs: 1328
         • #Link terms: English: 372483, Spanish: 227950, French: 150921, Chinese: 93227, Korean: 38089
     • Number of topics: 10, 20, 30, 40, 50, 75, 100, 125, 150, 175, 200, 225, 250
     • Demo site: http://143.248.135.30
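To get a feel for these figures, the sketch below computes link terms per document for each domain and language. It assumes the link-term counts are totals over all documents in the domain, which the slide does not state, and it reads the garbled French and Chinese figures for Disease as 31747 and 9286. The dictionary layout and the function name `terms_per_doc` are hypothetical.

```python
# Counts copied from the slide; the layout is an assumption for this example.
experiment = {
    "Disease": {
        "docs": 208,
        "link_terms": {
            "English": 46615, "Spanish": 34560, "French": 31747,
            "Chinese": 9286, "Korean": 3272,
        },
    },
    "Settlement": {
        "docs": 1328,
        "link_terms": {
            "English": 372483, "Spanish": 227950, "French": 150921,
            "Chinese": 93227, "Korean": 38089,
        },
    },
}


def terms_per_doc(domain):
    """Average number of link terms per document, per language."""
    return {
        lang: count / domain["docs"]
        for lang, count in domain["link_terms"].items()
    }


averages = {name: terms_per_doc(data) for name, data in experiment.items()}

for name, per_lang in averages.items():
    for lang, avg in per_lang.items():
        print(f"{name:10s} {lang:8s} {avg:8.2f}")
```

Under that assumption, the English articles carry far more link terms per document than the Korean ones in both domains.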

  8. Considering Application
     • Document classification
       • Classify the domain of a target document by calculating the similarity between the topic distributions of documents
       • Usage: template recommendation, …
     • Domain characteristic
       • [Figure: Disease and Settlement; axis labels: "# of appearance / # of total Doc" and "Topic number"]

  9. Template recommendation
     • Examples: Starvation, Trenton,_New_Jersey
     • Starvation → Disease
     • Trenton,_New_Jersey → Settlement
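The classification idea from the previous slide, applied to these two examples, might look like the sketch below. The slides do not name a similarity measure, so cosine similarity is an assumption here, as is representing each domain by a single profile vector. All topic weights are invented, and the function names are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def classify_domain(doc_topics, domain_profiles):
    """Return the domain whose profile is most similar to the document."""
    return max(
        domain_profiles,
        key=lambda name: cosine_similarity(doc_topics, domain_profiles[name]),
    )


# Hypothetical domain profiles over three topics, e.g. the mean topic
# distribution of the documents already known to belong to each domain.
domain_profiles = {
    "Disease": [0.7, 0.2, 0.1],
    "Settlement": [0.1, 0.2, 0.7],
}

# Invented topic distributions for the two example articles.
starvation = [0.6, 0.3, 0.1]
trenton = [0.05, 0.25, 0.7]

starvation_domain = classify_domain(starvation, domain_profiles)
trenton_domain = classify_domain(trenton, domain_profiles)
```

The predicted domain could then be used to pick which article template to suggest. Other measures over distributions could be swapped in for `cosine_similarity` without changing `classify_domain`.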

  10. Thanks
