Named Entity Disambiguation: A Hybrid Statistical and Rule-based Incremental Approach

Presentation Transcript

  1. BK TP.HCM Named Entity Disambiguation: A Hybrid Statistical and Rule-based Incremental Approach Hien Nguyen* (Ton Duc Thang University, Vietnam), Tru Cao (Ho Chi Minh City University of Technology, Vietnam) Semantic Web Group (VN-KIM), Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology *Email: hien@tut.edu.vn

  2. Outline • Introduction • Wikipedia • Algorithm • Experimental results • Concluding remarks

  3. Introduction: Named Entities • Named entities (NEs) include people, organizations, locations, dates, times, money, measures, percentages, etc. • Example • “Ms. Washington's candidacy is being championed by several powerful lawmakers including her boss, Chairman John Dingell (D., Mich.) of the House Energy and Commerce Committee.”

  4. Introduction: Problem • Different NEs may have the same name. • “John McCarthy has been a staple of the Ultimate Fighting Championship since its second event on March 11, 1994.” John McCarthy → John McCarthy (referee) • “John McCarthy, professor of computer science at Stanford University, who developed LISP.” John McCarthy → John McCarthy (computer scientist) • “John McCarthy, Britain's longest-held hostage in Lebanon, has been set free after more than five years in captivity.” John McCarthy → John McCarthy (journalist)

  5. Introduction: Motivation • Web searches • Queries about named entities (NEs) constitute a significant portion of popular web queries (Bunescu et al., EACL 2006). • ~30% of search engine queries include person names (R. Guha et al., WWW 2004). • Named entity disambiguation can improve the effectiveness of web search results for popular named entities. • Web-based Information Extraction • Identifying NEs precisely in web pages can improve accuracy in IE tasks (e.g. extracting relationships between NEs). • Question Answering • Identifying NEs precisely in questions can improve the accuracy of answers.

  6. Introduction: NE Disambiguation • Mapping entity names in a text to the actual entities in a KB of discourse (e.g. Wikipedia). • An ambiguous entity name does not occur in the KB • An ambiguous entity name occurs in the KB, but it refers to a named entity outside the KB • An ambiguous entity name refers to two or more named entities in the KB
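
The cases on this slide can be illustrated with a plain candidate lookup. This is a minimal sketch, not the authors' implementation: `TOY_KB`, `candidates`, and `lookup_status` are hypothetical names, and the KB is a hand-made dictionary standing in for Wikipedia. Note that a lookup alone cannot detect the second case (name in the KB, referent outside it); that needs context.

```python
from typing import Dict, List

# Hypothetical toy KB: lower-cased surface name -> candidate entity page titles.
TOY_KB: Dict[str, List[str]] = {
    "john mccarthy": [
        "John McCarthy (referee)",
        "John McCarthy (computer scientist)",
        "John McCarthy (journalist)",
    ],
    "tbilisi": ["Tbilisi"],
}


def candidates(name: str, kb: Dict[str, List[str]]) -> List[str]:
    """Return every KB entity that the surface name could refer to."""
    return kb.get(name.strip().lower(), [])


def lookup_status(name: str, kb: Dict[str, List[str]]) -> str:
    """Classify a name by how many KB candidates it has."""
    found = candidates(name, kb)
    if not found:
        return "name not in KB"
    if len(found) == 1:
        # Still not guaranteed correct: the referent may lie outside the KB.
        return "single candidate"
    return "ambiguous"
```

For example, `lookup_status("John McCarthy", TOY_KB)` reports the name as ambiguous, which is the case the rest of the deck focuses on.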

  7. Introduction: NE disambiguation But much like the first presidential debate held two weeks ago in Oxford, Mississippi, a draw for Obama would be considered a win.

  8. Introduction: NE disambiguation Gamsakhurdia is seen as a national hero by those who mourn him Zviad Gamsakhurdia, Georgia's first president after independence from the USSR, has been buried in the capital Tbilisi 14 years after his death.

  9. NE disambiguation John McCarthy, 'great man' of computer science, wins major award

  10. Introduction: Approach • Disambiguation based on context • Co-occurring entity names • Co-occurring NE identifiers • Tokens in a context window centered at the name under consideration • Disambiguation based on a KB • We view instances in the KB as carrying two kinds of information • Attributes • Relations • We represent those instances by their attributes and relations
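
One of the context features above, the tokens in a window centered at a name, can be sketched as follows. The tokenizer, the function name `window_tokens`, and the default window size are assumptions for illustration; the slides do not state the window size or tokenization used.

```python
import re
from typing import List


def window_tokens(text: str, name: str, size: int = 5) -> List[str]:
    """Return up to `size` tokens on each side of the first occurrence of `name`.

    Tokenization is a simple lower-cased word split (an assumption).
    """
    tokens = re.findall(r"\w+", text.lower())
    name_tokens = re.findall(r"\w+", name.lower())
    if not name_tokens:
        return []
    n = len(name_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            left = tokens[max(0, i - size):i]
            right = tokens[i + n:i + n + size]
            return left + right
    return []  # name does not occur in the text
```

The returned tokens exclude the name itself, so they describe only its surroundings.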

  11. Introduction: Approach • Text containing ambiguous names: all keywords in the window text centred around the ambiguous name; the whole text, extended with the page titles of the previously identified NEs • Wikipedia article: entity page titles; redirecting page titles; category labels; hyperlink labels • Matching: heuristics + TF-IDF vector similarity

  12. Wikipedia • Wikipedia is a free encyclopedia written collaboratively by a global community of more than 150,000 volunteers • These volunteers have contributed more than 11 million articles in 265 languages • More than 275 million people visit the Wikipedia site every month • 2,697,848 articles in the English version (as visited on Jan 14th, 2009)

  13. Wikipedia – Pages & Titles Page Titles (ID)

  14. Wikipedia – Pages & Titles Disambiguation text

  15. Wikipedia – Category Category

  16. Wikipedia – Redirect pages Redirect page titles

  17. Wikipedia – Hyperlinks Hyperlinks

  18. Wikipedia – Hyperlinks Hyperlinks

  19. Algorithm • Hybrid statistical and rule-based incremental algorithm: • Rule-based NE disambiguation • Utilizing Wikipedia disambiguation texts. E.g. in “… Rockville, Maryland …”, the disambiguation text “Maryland” helps identify that Rockville is an area in Maryland
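
The disambiguation-text rule can be sketched roughly as below. This is an illustration under assumptions, not the rule set from the talk: it assumes the disambiguation text is the part of a page title after a comma or inside trailing parentheses, uses a crude substring check against the text, and the names `split_title` and `resolve_by_disambiguation_text` are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def split_title(title: str) -> Tuple[str, str]:
    """Split a page title into (base name, disambiguation text)."""
    m = re.match(r"^(.*?)\s*\((.+)\)$", title)
    if m:  # e.g. "John McCarthy (referee)"
        return m.group(1), m.group(2)
    if ", " in title:  # e.g. "Rockville, Maryland"
        base, qualifier = title.split(", ", 1)
        return base, qualifier
    return title, ""


def resolve_by_disambiguation_text(
    name: str, text: str, candidate_titles: List[str]
) -> Optional[str]:
    """Pick the candidate whose disambiguation text occurs in the text.

    Returns None unless exactly one candidate matches, so the rule
    stays conservative and leaves hard cases to a later stage.
    """
    lowered = text.lower()
    matches = []
    for title in candidate_titles:
        base, qualifier = split_title(title)
        if base.lower() == name.lower() and qualifier and qualifier.lower() in lowered:
            matches.append(title)
    return matches[0] if len(matches) == 1 else None
```

With an illustrative candidate list such as `["Rockville, Maryland", "Rockville, Connecticut"]`, the text “… Rockville, Maryland …” selects the first title, while a text that mentions neither qualifier returns `None`.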

  20. Algorithm • Rule-based NE disambiguation (cont.) • Exploiting coreference relationships between referents: propagation of the identified NE, if any, along its coreference chain. E.g. “On Thursday morning, Sen. Barack Obama warned supporters not to get "cocky," while a few hours later McCain pledged to Pennsylvania voters he would erase Obama's lead by Election Day.” • Extension of the whole text with the Wikipedia entity page titles of the identified NEs
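
The two steps on this slide, propagation along a coreference chain and extension of the text with identified page titles, might look like the sketch below. It assumes coreference chains are already available (for instance from an NLP pipeline) and keys mentions by their surface string for brevity; `propagate` and `extend_text` are hypothetical names.

```python
from typing import Dict, List


def propagate(chains: List[List[str]], resolved: Dict[str, str]) -> Dict[str, str]:
    """Copy an identified entity to every other mention in its coreference chain.

    `chains` lists mentions that corefer; `resolved` maps already
    disambiguated mentions to entity page titles.
    """
    out = dict(resolved)
    for chain in chains:
        entity = next((out[m] for m in chain if m in out), None)
        if entity is None:
            continue  # nothing identified in this chain yet
        for mention in chain:
            out.setdefault(mention, entity)
    return out


def extend_text(text: str, resolved: Dict[str, str]) -> str:
    """Append the page titles of identified entities to the text."""
    titles = sorted(set(resolved.values()))
    if not titles:
        return text
    return text + " " + " ".join(titles)
```

In the slide's example, once “Sen. Barack Obama” is identified, the later mention “Obama” in the same chain inherits that entity, and the appended page title becomes extra context for the names that are still ambiguous.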

  21. Algorithm • After the rule-based stage, the remaining ambiguous names are resolved by matching the whole-text vector against the Wikipedia candidate entity pages • Extracted context surrounding ambiguous names: all keywords in the window text centred around the ambiguous name; the whole text, extended with the page titles of the previously identified NEs • Wikipedia article: entity page titles; redirecting page titles; category labels; hyperlink labels • Matching: TF-IDF vector similarity
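
The statistical stage can be sketched as TF-IDF weighting plus cosine similarity. The slides do not give the exact weighting scheme, so the smoothed IDF used here is an assumption, as are the function names and the idea of passing each candidate's features (page titles, redirect titles, category labels, hyperlink labels) as one concatenated string.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Build sparse TF-IDF vectors; the smoothed IDF is one common choice."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for key, tokens in docs.items():
        tf = Counter(tokens)
        vectors[key] = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()
        }
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(
    context: str, candidate_features: Dict[str, str]
) -> List[Tuple[str, float]]:
    """Rank candidate entity pages by similarity to the text context."""
    docs = {title: tokenize(feats) for title, feats in candidate_features.items()}
    docs["__context__"] = tokenize(context)  # reserved key for the text side
    vectors = tfidf_vectors(docs)
    ctx = vectors.pop("__context__")
    scored = [(title, cosine(ctx, vec)) for title, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The top-ranked title is the proposed entity; a real system would also need a threshold or fallback for names whose referent is not among the candidates.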

  22. Algorithm

  23. Experimental results • Experiments: 10 news articles from CNN on Travel, Entertainment, World, World Business, and Americas

  24. Experimental results • D1: obtained after running GATE • D2: obtained after GATE’s errors were corrected

  25. Experimental results • We measure accuracy as the number of correct assignments of an NE in the text to a Wikipedia NE, divided by the total number of assignments
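
The accuracy measure defined above is a simple ratio. A minimal sketch, assuming predictions and gold annotations are both given as mention-to-page-title mappings (the name `accuracy` and the data layout are illustrative, not taken from the talk):

```python
from typing import Dict


def accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Correct assignments divided by the total number of assignments made."""
    if not predicted:
        return 0.0
    correct = sum(
        1 for mention, entity in predicted.items()
        if mention in gold and gold[mention] == entity
    )
    return correct / len(predicted)
```

The denominator counts the assignments the system made, matching the slide's definition, so names left unassigned do not lower the score.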

  26. Experimental results • Results:

  27. Concluding remarks • The proposed method is a hybrid, incremental process that uses previously identified NEs and related terms co-occurring with ambiguous names in a text for entity disambiguation • Work under investigation: • Disambiguating cases where ambiguous names occur in a KB but refer to named entities outside the KB.

  28. Thanks for your attention VN-KIM Group http://www.cse.hcmut.edu.vn/vn-kim/ Contact author: hien@tut.edu.vn or nthien97@yahoo.com