
Multilingual Synchronization focusing on Wikipedia


byrd

Presentation Transcript


  1. Multilingual Synchronization focusing on Wikipedia 2011-03-17

  2. Goal of M-Sync • Multilingual Synchronization • Synchronizing the contents of Wikipedia across multiple languages • Linking content across languages • Combining it into a synthesis • The Wikipedia editions in different languages • can offer more precise and detailed information, reflecting different intentions, backgrounds, and cultures • can fill the gaps between languages so that integrated knowledge can be acquired

  3. Multilingual Resource Synthesis • Constructing an association network from the hyperlink structures of multiple languages • based on occurrence data • A Web page makes a direct reference to another Web page via a hyperlink • Definitions • Association • A relation, with a strength, between any two words or concepts • Association Network • Nodes and edges • Node: an entity (a concept or a named entity) • Edge: a link between entities, weighted by the strength of association
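
The definitions on this slide can be sketched as a small weighted graph. This is only an illustration, not the authors' implementation: the class name `AssociationNetwork`, its methods, and the rule "each observed hyperlink adds 1.0 to the edge weight" are assumptions made for the example.

```python
from collections import defaultdict


class AssociationNetwork:
    """Directed weighted graph: nodes are entities, edge weights are strengths."""

    def __init__(self):
        # edges[source][target] -> accumulated association strength
        self.edges = defaultdict(dict)

    def add_link(self, source, target, weight=1.0):
        # Assumption: every observed hyperlink adds to the edge strength.
        self.edges[source][target] = self.edges[source].get(target, 0.0) + weight

    def strength(self, source, target):
        # Unknown pairs have strength 0.0; .get avoids creating empty entries.
        return self.edges.get(source, {}).get(target, 0.0)

    def associations(self, source):
        # Strongest associations first, ties broken alphabetically.
        items = self.edges.get(source, {}).items()
        return sorted(items, key=lambda kv: (-kv[1], kv[0]))


net = AssociationNetwork()
net.add_link("Baldness", "Hair")
net.add_link("Baldness", "Hair")
net.add_link("Baldness", "Hamilton-Norwood_scale")
```

Here `net.associations("Baldness")` lists "Hair" before "Hamilton-Norwood_scale" because the first edge was observed twice.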

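  4. Motivating Example • [Figure: nodes Baldness and Hair (en)]

  5. Motivating Example • [Figure: nodes Baldness and Hair (en); 탈모증 (baldness) and 털 (hair) (ko)]

  6. Motivating Example • [Figure: nodes Baldness and Hair (en); 탈모증 and 털 (ko); label: Increase Strength]

  7. Motivating Example • [Figure: nodes Hamilton-Norwood_scale, Baldness, and Hair (en); 탈모증 and 털 (ko)]

  8. Motivating Example • [Figure: nodes Hamilton-Norwood_scale, Baldness, and Hair (en); 탈모증 and 털 (ko); label: Link Prediction or Keyword Recommendation, pointing to Hamilton-Norwood_scale]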

  9. Multilingual Resource Synthesis: Construction of Association Network • Hypothesis • If X is associated with Y in L1, then X’ should be associated with Y’ in L2 • where X’ is the term corresponding to X, and Y’ the term corresponding to Y, in the other language • Assumption • Inter-language links are accurate links connecting two pages about the same entity or concept in different languages • X is associated with Y according to a strength synthesized from the respective (per-language) occ-scores
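
The hypothesis can be illustrated with a short sketch that projects L1 associations into L2 through inter-language links and reports pairs that L2 does not yet contain, in the spirit of the link prediction in the motivating example. The function names, the dictionary layout, the fallback to the L1 title when no counterpart exists, and the weights are all assumptions made for illustration.

```python
def project_associations(assoc_l1, interlang):
    """Map (X, Y) -> weight in L1 to (X', Y') -> weight in L2.

    Pairs are kept only when both ends have an inter-language link.
    """
    projected = {}
    for (x, y), weight in assoc_l1.items():
        if x in interlang and y in interlang:
            projected[(interlang[x], interlang[y])] = weight
    return projected


def recommend_links(assoc_l1, assoc_l2, interlang):
    """Return (X', Y') candidates associated in L1 but missing in L2."""
    candidates = []
    for (x, y) in assoc_l1:
        if x not in interlang:
            continue
        x2 = interlang[x]
        # Assumption: fall back to the L1 title when Y has no L2 counterpart.
        y2 = interlang.get(y, y)
        if (x2, y2) not in assoc_l2:
            candidates.append((x2, y2))
    return sorted(candidates)


# Toy data; the weights are invented for the example.
en = {("Baldness", "Hair"): 0.8, ("Baldness", "Hamilton-Norwood_scale"): 0.5}
ko = {("탈모증", "털"): 0.7}
en_to_ko = {"Baldness": "탈모증", "Hair": "털"}

projected = project_associations(en, en_to_ko)
suggestions = recommend_links(en, ko, en_to_ko)
```

With this toy data the only suggestion is to associate 탈모증 with Hamilton-Norwood_scale, since the Baldness and Hair pair already exists in both languages.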

  10. Proposed System Workflow • English, Korean, and Chinese documents are each processed by the same steps • Link Extraction • Association set for each English / Korean / Chinese word • Calculation of Correlations • Bilingual dictionaries then connect the languages • Association for each pair of English and Korean words • Association for each pair of Korean and Chinese words • Calculation of Correlations • Selection of highly associated pairs of words

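  11. Example of Association Set • Association set of “Baldness” • androgenic alopecia • 머리카락 (ko: hair) • 荷爾蒙 (zh: hormone) • 남성형 탈모증 (ko: male-pattern baldness) • 頭髮 (zh: hair) • 心理 (zh: psychology) • adult • alopecia areata • 원형 탈모증 (ko: alopecia areata) • 營養 (zh: nutrition)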

  12. Preprocessing: Selecting Target • Source languages (5) • English, Spanish, French, Chinese, Korean • Callout: a subset of UN official languages • Extracting target pages that form a 5-clique of inter-language links • Assumption: • Pages found in all 5 languages are key pages and are the targets to sync • Enforcing consistency of a link path • If a path from X (L1) to X’ (L2) is found once, its inverse path (X’, X) is automatically added to the output • Example 5-clique: en:Badminton, fr:Badminton, es:Bádminton, ko:배드민턴, zh:羽毛球
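
A minimal sketch of this preprocessing step, assuming pages are written as `lang:Title` strings and inter-language links as `(source, target)` pairs. The helper names and the brute-force clique check are choices made for illustration; the slides do not say how the authors implemented the selection.

```python
from itertools import combinations, product

LANGS = ("en", "fr", "es", "ko", "zh")


def lang_of(page):
    # "ko:배드민턴" -> "ko"
    return page.split(":", 1)[0]


def symmetric_closure(links):
    """Add the inverse (X', X) of every link (X, X') that was found."""
    closed = set(links)
    closed.update((b, a) for a, b in links)
    return closed


def five_cliques(links, langs=LANGS):
    """Return groups with one page per language, all pairwise linked."""
    closed = symmetric_closure(links)
    neighbours = {}
    for a, b in closed:
        neighbours.setdefault(a, set()).add(b)
    anchor_lang, other_langs = langs[0], langs[1:]
    cliques = []
    for page in sorted(neighbours):
        if lang_of(page) != anchor_lang:
            continue
        # Candidate counterparts of this page in every other language.
        options = [
            sorted(p for p in neighbours[page] if lang_of(p) == lang)
            for lang in other_langs
        ]
        for combo in product(*options):  # empty if any language is missing
            group = (page,) + combo
            if all((a, b) in closed for a, b in combinations(group, 2)):
                cliques.append(group)
    return cliques


badminton = ["en:Badminton", "fr:Badminton", "es:Bádminton", "ko:배드민턴", "zh:羽毛球"]
# Each pair is linked in one direction only; the closure adds the inverses.
links = set(combinations(badminton, 2)) | {("en:Hair", "ko:털")}
targets = five_cliques(links)
```

The Hair pair is dropped because it exists in only two of the five languages, while the Badminton pages form a complete 5-clique once the inverse links are added.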

  13. Link types of Wikipedia • Internal links to other pages in the wiki • Syntax usage: [[Main Page]] • External links to other websites • Interwiki links to other websites registered to the wiki in advance • Unlike internal links, interwiki links do not use page existence detection • Syntax usage: [[wikipedia:Sunflower]] • Interlanguage links to other websites registered as other language versions of the wiki
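
The four link types can be told apart with a rough classifier over raw wikitext. This is a simplified sketch: the two prefix sets are assumptions (a real wiki keeps its interwiki and language prefixes in configuration), and namespace prefixes such as `File:` or `Category:` are simply treated as internal here.

```python
import re

# [[Target]] or [[Target|label]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# [http://example.org label], but not the inner bracket of [[...]]
EXTERNAL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)[^\]]*\]")

# Assumed prefix tables, for illustration only.
LANGUAGE_PREFIXES = {"en", "fr", "es", "ko", "zh"}
INTERWIKI_PREFIXES = {"wikipedia"}


def classify_links(wikitext):
    """Return (kind, target) pairs: wikilinks first, then external links."""
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix = target.split(":", 1)[0].lower() if ":" in target else ""
        if prefix in LANGUAGE_PREFIXES:
            kind = "interlanguage"
        elif prefix in INTERWIKI_PREFIXES:
            kind = "interwiki"
        else:
            kind = "internal"
        found.append((kind, target))
    for match in EXTERNAL.finditer(wikitext):
        found.append(("external", match.group(1)))
    return found


sample = (
    "See [[Main Page]] and [[wikipedia:Sunflower]]. "
    "[http://example.org Example] [[ko:배드민턴]]"
)
classified = classify_links(sample)
```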

  14. Experiment • Link Extraction • Input: • 40,000 pages in each language • Output (association pairs): • en: 5,453,959 links • fr: 3,965,279 links • es: 3,368,949 links • zh: 2,397,959 links • ko: 1,347,811 links

  15. Experiment: respective mode • Computing the strength of association: LF-IDF • Motivated by the TF-IDF heuristic
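
The slide does not give the LF-IDF formula. One plausible reading by analogy with TF-IDF, stated here purely as an assumption, is score(X, Y) = LF(X, Y) × log(N / n_Y), where LF is the share of X's outgoing links that point to Y, N is the number of source pages, and n_Y is the number of distinct pages linking to Y. A sketch under that assumption:

```python
import math
from collections import Counter


def lf_idf(links):
    """Score (source, target) pairs; repeated pairs count as repeated links.

    Assumed definition (not taken from the slides):
      LF  = links from source to target / all outgoing links of source
      IDF = log(number of source pages / pages that link to target)
    """
    links = list(links)
    pair_counts = Counter(links)
    out_totals = Counter(source for source, _ in links)
    n_pages = len(out_totals)
    # pair_counts has one key per distinct pair, so this counts distinct sources.
    linking_pages = Counter(target for _, target in pair_counts)
    scores = {}
    for (source, target), count in pair_counts.items():
        lf = count / out_totals[source]
        idf = math.log(n_pages / linking_pages[target])
        scores[(source, target)] = lf * idf
    return scores


# Toy link list, invented for the example.
toy_links = [
    ("Baldness", "Hair"),
    ("Baldness", "Hair"),
    ("Baldness", "Hamilton-Norwood_scale"),
    ("Wig", "Hair"),
]
scores = lf_idf(toy_links)
```

As with TF-IDF, a target that every page links to (here "Hair") gets an IDF of zero, so the rarer target scores higher even though it is linked less often.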

  16. Experiment: synthesis mode • M: Simple average • NM: Average with standard deviation • Demo • http://nlplab.kaist.ac.kr/~kekeeo/term
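
The two synthesis modes combine the per-language scores of one association pair. The slide names them but does not define them, so the following is only a guess at their shape: M is taken as the plain mean, and NM is assumed to be the mean minus the population standard deviation, which lowers the score when languages disagree. The function names and sample scores are hypothetical.

```python
from statistics import mean, pstdev


def synthesize_m(scores):
    """M mode: simple average of the per-language scores."""
    return mean(scores)


def synthesize_nm(scores):
    """NM mode (assumed form): average penalized by the standard deviation."""
    return mean(scores) - pstdev(scores)


# Hypothetical scores of one association pair in three languages.
per_language = [0.8, 0.6, 0.4]
m_score = synthesize_m(per_language)
nm_score = synthesize_nm(per_language)
```

Under this reading, a pair with identical scores in every language keeps its mean in NM mode, while a pair with widely varying scores is ranked lower than its mean alone would suggest.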
