
To Link or Not to Link? A Study on End-to-End Tweet Entity Linking

Stephen Guo, Ming-Wei Chang, Emre Kıcıman


Motivation

  • Microblogs are data gold mines!

    • Twitter reports that it alone captures over 340M short messages per day

  • Many applications on tweet information extraction

    • Election results (Tumasjan et al., 2010)

    • Disease spreading (Paul and Dredze, 2011)

    • Tracking product feedback and sentiment (Asur and Huberman, 2010)

    • ...

  • Existing tools (for example, NER) are often too limited

    • Stanford NER achieves 44% F1 on a tweet data set [Ritter et al., 2011]


Entity Linking (Wikifier) in Tweets

Oh Yes!! giants vs packers game now!! Touchdown!!

  • Q1: Which phrase should be linked? (mention detection)

  • Q2: Which Wikipedia page should be linked for selected phrases? (disambiguation)


Contributions

  • Proposed a new evaluation scheme for entity linking

    • A natural evaluation scheme for microblogs

  • A system that performs significantly better on tweets than other systems

    • Learns to detect mentions and perform linking jointly

    • Outperforms TagMe [Ferragina & Scaiella 2010] and Cucerzan [Cucerzan 07] by 15% F1

  • What we have learned

    • Mention detection is a difficult problem

    • Entity information can help mention detection


Outline

  • Task Definition (again!)

  • Two stage versus Joint

  • Model + Features

  • Results + Analysis


What should be linked?

Oh Yes!! giants vs packers game now!! Touchdown!!

  • Comparing different Wikifiers is a tough problem [Cornolti, WWW 2013]

  • Really, there is no good definition of what should be linked


Our Scenario

What are people saying about the movie “The Town” on Twitter?

  • Assume our customers are only interested in entities of certain types

    • Movies; Video Games; Sports Teams; …

    • Type information can be inferred directly from the corresponding Wikipedia page

  • Now, it is fair to compare different systems

    • We assume PER, LOC, ORG, BOOK, TVSHOW, MOVIE


The Desired Results

  • Oh Yes!! giants vs packers game now!! Touchdown!!


Terminology

  • Oh Yes!! giants vs packers game now!! Touchdown!!

Assignment

Mention Candidates

Mentions

Entity


Related Work

  • Wikifier [Cucerzan, 2007; Milne and Witten, 2008, ...]

    • Given a document, create Wikipedia-like links

    • Very difficult to evaluate/compare

    • Mention detection and disambiguation are often treated separately

  • NER [Li et al., 2012; Ritter et al., 2011, ...]

    • No Linking

    • Limited Types

  • KBP [Ji et al., 2010; Ji et al., 2011,...]

    • Focus on disambiguation aspect


Outline

  • Task Definition (again!)

  • Two stage versus Joint

  • Model + Features

  • Results + Analysis


What approach should we use?

  • Task: Wikify entities of certain types (all named entities)

  • Approach 1:

    • Train a general named entity recognizer for those types

    • Link to entities from the output of the first stage

  • Approach 2:

    • Learn to jointly detect mentions and disambiguate entities

    • Take advantage of Wikipedia information

    • Incorporate type information into our model

Limited Types; Adaptation

Advanced model


The Necessity of the Joint Approach

The town is so so good, Don’t worry Ben, we already forgave you for Gigli

  • Q: Is “the town” a mention?

  • Deep analysis with knowledge is required

    • Gigli is a Ben Affleck movie that was poorly reviewed

    • Ben Affleck is the lead actor in the movie “The Town”


Outline

  • Task Definition (again!)

  • Two stage versus Joint

  • Model + Features

  • Results + Analysis


Features

  • Oh Yes!! giants vs packers game now!! Touchdown!!

Mention, Entity Pair Features

2-nd Order Features

Type Features

Mention Specific Features


Mention Specific Features

  • Mention Specific Features

    • How likely is “giants” to be used as an anchor?

    • How likely is “giants” to be capitalized in Wikipedia?

    • Is “giants” a stopword? The number of tokens…

  • Entity - Mention Pair Features

    • Given the string “giants”, how likely is each candidate entity? Estimated from the Wikipedia link structure

    • Similarity between the context of the mention and the words on the entity's Wikipedia page

    • View count
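The slide does not give formulas for these features. A minimal sketch, assuming the usual count-based estimates: the anchor likelihood is the fraction of a string's occurrences that are hyperlinked, and the entity likelihood is the fraction of those anchors that point to each entity. The function names and all counts below are hypothetical, chosen only for illustration.

```python
def link_probability(times_as_anchor, times_in_text):
    """Fraction of a string's occurrences that are hyperlinked."""
    if times_in_text == 0:
        return 0.0
    return times_as_anchor / times_in_text


def entity_given_mention(anchor_targets):
    """Turn per-entity anchor counts into a probability distribution."""
    total = sum(anchor_targets.values())
    if total == 0:
        return {}
    return {entity: count / total for entity, count in anchor_targets.items()}


# Made-up counts for the string "giants"; not real Wikipedia statistics.
anchor_targets = {
    "New York Giants": 120,
    "San Francisco Giants": 60,
    "Giant (mythology)": 20,
}
p_link = link_probability(times_as_anchor=200, times_in_text=1000)
p_entity = entity_given_mention(anchor_targets)
most_likely = max(p_entity, key=p_entity.get)
print(p_link, most_likely)
```

Both values would be fed to the model as features rather than used as hard decisions.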


View Count

  • The Wikipedia statistics

    • http://dumps.wikimedia.org/other/pagecounts-raw/

    • Log exists for every hour

    • Very valuable data

  • View count is useful

    • Sometimes the most linked entity in Wikipedia is not the most popular one

    • “jersey shore” ==> ?

    • Jersey Shore links: 441 views: 509140

    • Jersey Shore (TV_series) links: 324 views: 5081377
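The “jersey shore” example can be checked with a few lines of code. The counts are the ones quoted on the slide; the dictionary layout and the helper name `rank_candidates` are assumptions made for this sketch.

```python
# Candidate entities for the string "jersey shore", with the link and
# view counts quoted on the slide. The data layout is hypothetical.
CANDIDATES = {
    "Jersey Shore": {"links": 441, "views": 509140},
    "Jersey Shore (TV_series)": {"links": 324, "views": 5081377},
}


def rank_candidates(candidates, key):
    """Return entity titles sorted by the given count, highest first."""
    return sorted(candidates, key=lambda title: candidates[title][key], reverse=True)


by_links = rank_candidates(CANDIDATES, "links")
by_views = rank_candidates(CANDIDATES, "views")
print(by_links[0])  # Jersey Shore
print(by_views[0])  # Jersey Shore (TV_series)
```

Ranking by links picks the place, while ranking by views picks the TV series, which is why view count is useful as a separate feature.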


Second Order Features

  • For each entity, take the set of Wikipedia pages that link to it

  • The Jaccard score between the in-link sets of two entities
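The Jaccard score of two sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the in-link sets are made up for illustration, and returning 0.0 for two empty sets is a convention chosen here, not something stated on the slide.

```python
def jaccard(inlinks_a, inlinks_b):
    """Jaccard score between two sets of in-linking page titles."""
    a, b = set(inlinks_a), set(inlinks_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical in-link sets; not taken from a real Wikipedia dump.
giants_inlinks = {"NFL", "NFC", "Eli Manning", "MetLife Stadium"}
packers_inlinks = {"NFL", "NFC", "Aaron Rodgers", "Lambeau Field"}

score = jaccard(giants_inlinks, packers_inlinks)
print(round(score, 3))
```

A high score between two candidate entities in the same tweet suggests that they are related, which is the kind of entity-entity signal a second-order feature captures.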


Type Features

  • The content on Wikipedia is different from the content on Twitter

    • Wikipedia is informational; Tweets are actionable

    • Misspelled words: “watchin”, “watchn”, ...

  • We want to find contextual words for PER, LOC, ORG, … in tweets

    • Step 1: Train a system

    • Step 2: Use it to label 10 million unlabeled tweets

    • Step 3: Collect popular contextual words for each type

    • Step 4: Train a new system with one new feature

      • Check if the context matches the type
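Steps 3 and 4 can be sketched as follows, assuming the automatically labeled tweets are available as (tokens, mention start, mention end, type) tuples. That tuple format, the window size, the function names, and the toy tweets are all assumptions for illustration; the slides do not specify them.

```python
from collections import Counter, defaultdict


def nearby_words(tokens, start, end, window):
    """Lowercased tokens within `window` positions of the mention span."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return [w.lower() for w in left + right]


def collect_context_words(labeled_tweets, window=2, top_k=1):
    """Step 3: most frequent words around mentions of each type."""
    counts = defaultdict(Counter)
    for tokens, start, end, entity_type in labeled_tweets:
        counts[entity_type].update(nearby_words(tokens, start, end, window))
    return {t: {w for w, _ in c.most_common(top_k)} for t, c in counts.items()}


def context_matches_type(tokens, start, end, entity_type, context_words, window=2):
    """Step 4 feature: does any nearby word belong to the type's word list?"""
    wanted = context_words.get(entity_type, set())
    return any(w in wanted for w in nearby_words(tokens, start, end, window))


# Toy stand-in for the automatically labeled tweets.
labeled = [
    ("watchin the town tonight".split(), 1, 3, "MOVIE"),
    ("watchin inception again".split(), 1, 2, "MOVIE"),
    ("flying to seattle tomorrow".split(), 2, 3, "LOC"),
]
context_words = collect_context_words(labeled, window=2, top_k=1)
feature = context_matches_type("watchin gigli now".split(), 1, 2, "MOVIE", context_words)
print(context_words["MOVIE"], feature)
```

Because the word lists come from tweets rather than from Wikipedia, they can include tweet-specific spellings such as “watchin”.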


Mining Contextual Words


Procedure

  • Testing: step 1

    • Given a tweet

    • Tokenize it, remove symbols, segment hashtags

  • Testing: step 2

    • For all k-gram words in the tweet, do table look up

      • To find mention candidates and the entities they can link to

  • Testing: step 3

    • Construct features and output the assignment with the trained model

  • Learning: Structural SVM; Inference: Exact/Beam search

    • A rule-based system for categorizing Wikipedia pages
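Testing steps 1 and 2 can be sketched as a tokenizer followed by a k-gram table lookup. The lookup table, its entity titles, the `max_k` limit, and the regular-expression tokenizer are assumptions for illustration; hashtag segmentation is left out of this sketch.

```python
import re

# Hypothetical lookup table: surface string -> entities it can link to.
MENTION_TABLE = {
    "giants": ["New York Giants", "San Francisco Giants"],
    "packers": ["Green Bay Packers"],
    "the town": ["The Town (2010 film)"],
}


def tokenize(tweet):
    """Lowercase the tweet and keep alphanumeric tokens, dropping symbols."""
    return re.findall(r"[a-z0-9]+", tweet.lower())


def mention_candidates(tweet, table, max_k=3):
    """Look up every k-gram (k <= max_k) of the tweet in the table.

    Returns (start, end, phrase, entities) tuples, shortest k-grams first.
    """
    tokens = tokenize(tweet)
    found = []
    for k in range(1, max_k + 1):
        for i in range(len(tokens) - k + 1):
            phrase = " ".join(tokens[i:i + k])
            if phrase in table:
                found.append((i, i + k, phrase, table[phrase]))
    return found


candidates = mention_candidates(
    "Oh Yes!! giants vs packers game now!! Touchdown!!", MENTION_TABLE
)
for start, end, phrase, entities in candidates:
    print(start, end, phrase, entities)
```

The trained model then decides, for each candidate, whether to link it to one of its entities or to leave it unlinked.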


Outline

  • Task Definition (again!)

  • Two stage versus Joint

  • Model + Features

  • Results + Analysis


Data

  • We sample two sets of tweets

    • Train, Test 1 from [Ritter 2011]

    • Test 2 from Twitter with entertainment keywords

      • “director”, “actress”, ...

  • P@1 is very high

    • Many, many algorithms focus on disambiguation

    • However, if the mentions are correctly extracted, the system is already very good


Main Results

  • TagMe [Ferragina & Scaiella 2010] and Cucerzan [Cucerzan 07]

    • Cucerzan is designed for well-written documents

    • We have a more principled way to handle mention detection than TagMe


Impact of Features

  • Entity information helps mention detection

  • Mining contextual words helps a bit

  • Capturing entity-entity relations also improves the model


Conclusion & Discussions

  • We provide an experimental study on tweets

    • Jointly detect mentions and disambiguate

    • A structured learning approach

  • What have we learned

    • Mention detection is a difficult problem

    • Entity information could potentially help mention detection

  • Future work

    • Explore the connections between the joint approaches and the two stage approaches

      • [Illinois, ACL 2011; AIDA, VLDB 2011]

    • A more principled way to handle context

