
Advisors: Prof. 陳彥良, Prof. 許秉瑜. Presenters: 楊詠喬, 龍晶珠

Web People Search via Connection Analysis. Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, Member, IEEE, and Rabia Nuray-Turan. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 11, NOVEMBER 2008. Advisors: Prof. 陳彥良, Prof. 許秉瑜. Presenters: 楊詠喬, 龍晶珠.



Presentation Transcript


  1. Web People Search via Connection Analysis. Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, Member, IEEE, and Rabia Nuray-Turan. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 11, NOVEMBER 2008. Advisors: Prof. 陳彥良, Prof. 許秉瑜. Presenters: 楊詠喬, 龍晶珠

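  2. Introduction (1/2) • People search accounts for more than 5% of today's web searches. • When queried with a person's name as the keyword, search engines such as Google or Yahoo return a list of web pages about the different people who share that name. • Next-generation search engines will use clustering to make people search easier.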

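  3. Introduction (2/2) This paper presents: 1. A new web people search approach that produces high-quality clustering results 2. A thorough empirical evaluation of the approach 3. The impact of the approach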

  4. Outline • Overview of the approach • Generating a graph representation • Disambiguation algorithm • Interpreting clustering results • Related work • Experimental results • Conclusions and future work

  5. Overview of the approach (1/3) • User input • Web page retrieval: retrieves a fixed number (top K) of relevant pages • Preprocessing: -- computation of TF/IDF -- extraction of named entities (NEs) and Web-related information

  6. Overview of the approach (2/3) • Graph creation: build the entity-relationship (ER) graph • Clustering • Cluster processing: (1) sketch (2) cluster ranking (3) web page ranking
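
The graph creation step can be pictured with a minimal sketch. It assumes the ER graph simply links each retrieved page to the named entities extracted from it; the slide does not list the node and edge types the paper actually uses, so the `("page", ...)` / `("ne", ...)` node scheme, the lowercase normalisation, and the function name `build_er_graph` are all illustrative assumptions.

```python
from collections import defaultdict

def build_er_graph(pages):
    """Build an undirected page/entity graph (illustrative assumption).

    pages: dict mapping a page id to the named entities extracted from it.
    Nodes are ("page", id) and ("ne", name); a page is linked to each
    entity it mentions, so two pages become connected through shared entities.
    """
    graph = defaultdict(set)
    for page_id, entities in pages.items():
        page_node = ("page", page_id)
        graph[page_node]  # touch the key so pages without entities still appear
        for name in entities:
            ne_node = ("ne", name.strip().lower())  # naive normalisation (assumption)
            graph[page_node].add(ne_node)
            graph[ne_node].add(page_node)
    return dict(graph)

# Hypothetical input: three retrieved pages and their extracted entities.
pages = {
    "p1": ["UC Irvine", "Sharad Mehrotra"],
    "p2": ["uc irvine", "IEEE"],
    "p3": [],
}
graph = build_er_graph(pages)
```

In this toy input, pages `p1` and `p2` end up two hops apart through the shared entity node, which is the kind of connection the later clustering step can exploit.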

  7. Overview of the approach (3/3)

  8. Generating a graph representation (1/2)

  9. Generating a graph representation (2/2)

  10. Disambiguation algorithm (1/5) • CC (Correlation Clustering): the focus is on developing and learning a new, accurate similarity function s(u,v) • Connection strength c(u,v): c(u,v) can help design a better similarity function s(u,v)
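
One way to make connection strength concrete is sketched below. The slide does not give the formula for c(u,v), so this is an assumption rather than the paper's definition: it counts the simple paths between two nodes up to a maximum length, groups them by length, and takes a weighted sum. The grouping by length, the `weights` table, and `max_len` are hypothetical.

```python
def count_paths_by_length(graph, u, v, max_len):
    """Count simple paths from u to v with at most max_len edges, keyed by length."""
    counts = {}

    def dfs(node, visited, length):
        if node == v:
            counts[length] = counts.get(length, 0) + 1
            return
        if length == max_len:
            return
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, length + 1)

    dfs(u, {u}, 0)
    return counts

def connection_strength(graph, u, v, weights, max_len=4):
    """Weighted sum of path counts; `weights` maps path length -> weight (assumed)."""
    counts = count_paths_by_length(graph, u, v, max_len)
    return sum(weights.get(length, 0.0) * n for length, n in counts.items())

# Toy graph: pages p1..p3 linked through entities e1..e4.
toy_graph = {
    "p1": {"e1", "e2", "e3"},
    "p2": {"e1", "e2", "e4"},
    "p3": {"e3", "e4"},
    "e1": {"p1", "p2"},
    "e2": {"p1", "p2"},
    "e3": {"p1", "p3"},
    "e4": {"p3", "p2"},
}
strength = connection_strength(toy_graph, "p1", "p2", weights={2: 0.5, 4: 0.1})
```

Here `p1` and `p2` are joined by two paths of length 2 and one path of length 4, so shorter, more direct connections contribute more under the chosen weights.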

  11. Disambiguation algorithm (2/5) • Similarity function s(u,v): s(u,v) labels data using the threshold τ and the δ-band approach
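
A small sketch of threshold-based labeling follows. The slide names the threshold τ and the δ-band but not the rule itself, so the reading used here is an assumption: scores clearly above τ are labeled positive, scores clearly below are labeled negative, and scores within δ of τ are left unlabeled. The function name `label_edge` is hypothetical.

```python
def label_edge(s, tau, delta):
    """Label a similarity score s as '+', '-', or None (inside the δ-band).

    Assumed rule: '+' if s >= tau + delta, '-' if s <= tau - delta,
    otherwise the edge is treated as too uncertain to label.
    """
    if s >= tau + delta:
        return "+"
    if s <= tau - delta:
        return "-"
    return None

labels = [label_edge(s, tau=0.5, delta=0.1) for s in (0.9, 0.55, 0.2)]
```

Leaving the band unlabeled keeps borderline pairs from forcing a merge or a split on weak evidence.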

  12. Disambiguation algorithm (3/5) • TF/IDF: used to compute the feature-based similarity f(u,v) between two documents u and v
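
The feature-based similarity can be illustrated with a common TF/IDF formulation. The slide does not say which weighting variant the paper uses, so the sketch assumes tf × log(N/df) weights compared by cosine similarity; the tokenised example documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical page contents, already tokenised.
docs = [
    "database systems research irvine".split(),
    "database research group irvine".split(),
    "football coach season".split(),
]
vecs = tfidf_vectors(docs)
f_01 = cosine(vecs[0], vecs[1])  # pages sharing vocabulary
f_02 = cosine(vecs[0], vecs[2])  # pages with no terms in common
```

Pages that share informative terms get a positive f(u,v), while pages with disjoint vocabulary score zero.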

  13. Disambiguation algorithm (4/5) • For each edge (u,v), we should require that [formula not captured in the transcript] • Adding slack

  14. Disambiguation algorithm (5/5) • Choosing the negative weight: the value of w-( ) is chosen to be zero when [quantity not captured in the transcript] is less than a certain threshold, and 1 when it is above this threshold. The threshold itself is learned from the data.

  15. Interpreting clustering results • Cluster rank • Cluster sketch • Web page rank: the remaining pages are displayed in order of their affinity to the selected cluster.

  16. Related work (1/2) • Disambiguation

  17. Related work (2/2) Web people search: 1. server-side setting 2. middleware approach (✓)

  18. Experimental results • 1. Experimental setup • 2. testing disambiguation quality • 3. impact on search • 4. efficiency

  19. Experimental setup • Data sets (leave-one-out cross validation): 1. WWW 2005 data set 2. WEPS data set 3. Context data set • Quality evaluation measures: B-cubed, Fp • Baseline method: agglomerative vector space clustering • Statistical significance test: t-test
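
To show what the B-cubed measure captures, here is a sketch of its standard definition: per-item precision and recall averaged over all items, combined into an F-measure. Whether the paper uses exactly this variant is not stated on the slide, and the example clustering is invented.

```python
def b_cubed(predicted, truth):
    """predicted, truth: dicts mapping item -> cluster label.

    Returns (precision, recall, F1) under the standard B-cubed definition.
    """
    items = list(truth)
    precision = recall = 0.0
    for i in items:
        same_pred = {j for j in items if predicted[j] == predicted[i]}
        same_true = {j for j in items if truth[j] == truth[i]}
        overlap = len(same_pred & same_true)  # always >= 1: item i itself
        precision += overlap / len(same_pred)
        recall += overlap / len(same_true)
    precision /= len(items)
    recall /= len(items)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: five pages, two real people, one page misplaced.
truth = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
predicted = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Y"}
p, r, f = b_cubed(predicted, truth)
```

Because the score is averaged per page, a single misplaced page lowers both precision (its new cluster is impure) and recall (its true cluster is incomplete).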

  20. Testing disambiguation quality: Experiment 1 (disambiguation quality: overall)

  21. Testing disambiguation quality: Experiment 2 (disambiguation quality: group identification)

  22. Testing disambiguation quality: Experiment 3 (disambiguation quality: queries with context)

  23. Testing disambiguation quality: Experiment 4 (quality of generating cluster sketches)

  24. Impact on search: measures (Experiment 5) • First-dominant cluster • Regular cluster

  25. Impact on search: measures (Experiment 5) • Average

  26. Impact on search: with context

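  27. Efficiency (Experiment 6) 1. Because NEs are extracted by a third-party tool (the NE extractor GATE), the initial download and preprocessing take 3.82 seconds per web page. 2. With a server-side approach, preprocessing could be done offline in advance. 3. The clustering algorithm itself takes 4.7 seconds per name on average.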
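
The two timings above can be combined into a rough per-query estimate. The sketch assumes pages are downloaded and preprocessed one after another and uses a hypothetical K = 100 pages per query; the slides give neither the value of K nor whether preprocessing runs in parallel.

```python
PREPROCESS_SECONDS_PER_PAGE = 3.82  # reported on the slide
CLUSTERING_SECONDS_PER_NAME = 4.7   # reported on the slide (average)

def estimated_query_seconds(num_pages, preprocessed_offline=False):
    """Rough latency for one name query, assuming sequential preprocessing."""
    preprocessing = 0.0 if preprocessed_offline else num_pages * PREPROCESS_SECONDS_PER_PAGE
    return preprocessing + CLUSTERING_SECONDS_PER_NAME

middleware = estimated_query_seconds(100)                              # about 386.7 s
server_side = estimated_query_seconds(100, preprocessed_offline=True)  # 4.7 s
```

Under these assumptions nearly all of the cost is preprocessing, which is why moving it offline in a server-side setting matters so much.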

  28. Future work • Employ external data sources for disambiguation as well • Use more advanced extraction capabilities • Interpret extracted entities better by taking into account the roles they play with respect to each other • Develop disambiguation algorithms for other people search problems that have different settings • Develop algorithms for generic entity search

  29. Thank you for listening
