to link or not to link a study on end to end tweet entity linking n.
Download
Skip this Video
Download Presentation
To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking

Loading in 2 Seconds...

play fullscreen
1 / 26

To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking - PowerPoint PPT Presentation


  • 140 Views
  • Uploaded on

To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking. Stephen Guo, Ming-Wei Chang , Emre Kıcıman. Motivation. Microblogs are data gold mines! Twitter reports that it alone captures over 340M short messages per day Many applications on tweet information extraction

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking' - jethro


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
to link or not to link a study on end to end tweet entity linking

To Link or Not to Link? A Study on End-to-End Tweet Entity Linking

Stephen Guo, Ming-Wei Chang, Emre Kıcıman

motivation
Motivation
  • Microblogs are data gold mines!
    • Twitter reports that it alone captures over 340M short messages per day
  • Many applications on tweet information extraction
    • Election results (Tumasjan et al., 2010)
    • Disease spreading (Paul and Dredze, 2011)
    • Tracking product feedback and sentiment (Asur and Huberman, 2010)
    • ...
  • Existing tools (for example, NER) are often too limited
    • Stanford NER on tweets set achieves 44% F1 [Ritter et. al, 2011]
entity linking wikifier in tweets
Entity Linking (Wikifier) in Tweets

Oh Yes!! giants vs packers game now!! Touchdown!!

  • Q1: Which phrase should be linked? (mention detection)
  • Q2: Which Wikipedia page should be linked for selected phrases? (disambiguation)
contributions
Contributions
  • Proposed a new evaluation scheme for entity linking
    • A natural evaluation scheme for microblogs
  • A system that performs significantly better on tweets than other systems
    • Learn to detect mention and perform linking jointly
    • Outperform Tagme[Ferragina & Scaiella 2010] and [Cucerzan 07] by 15% F1
  • What we have learned
    • Mention detection is a difficult problem
    • Entity information can help mention detection
outline
Outline
  • Task Definition (again!)
  • Two stage versus Joint
  • Model + Features
  • Results + Analysis
what should be linked
What should be linked?

Oh Yes!! giants vs packers game now!! Touchdown!!

  • Comparing different Wikifiersis a tough problem [Cornolti, WWW 2013]
  • Really, there is no good definition on what should be linked
our scenario
Our Scenario

What people are talking about the movie “The Town” on twitter?

  • Assume our customers are only interested in entities of certain types
    • Movies; Video Games; Sports Team;…
    • Type information can be directly inferred by the corresponding Wikipedia page
  • Now, it is fair to compare different systems
    • We assume PER, LOC, ORG, BOOK, TVSHOW, MOVIE
the desired results
The Desired Results
  • Oh Yes!! giants vs packers game now!! Touchdown!!
terminology
Terminology
  • Oh Yes!! giants vs packers game now!! Touchdown!!

Assignment

Mention Candidates

Mentions

Entity

related work
Related Work
  • Wikifier [Cucerzan, 2007; Milne and Witten, 2008…….]
    • Given a document, create Wikipedia-like links
    • Very difficult to evaluate/compare
    • Mention detection and disambiguation are often treated separately
  • NER [Li et al., 2012; Ritter et al., 2011, ...]
    • No Linking
    • Limited Types
  • KBP [Ji et al., 2010; Ji et al., 2011,...]
    • Focus on disambiguation aspect
outline1
Outline
  • Task Definition (again!)
  • Two stage versus Joint
  • Model + Features
  • Results + Analysis
what approach should we use
What approach should we use?
  • Task: Wikifier to the entities of the certain types (all named entities)
  • Approach 1:
    • Train a general named entity recognizer for those types
    • Link to entities from the output of the first stage
  • Approach 2:
    • Learn to jointly detect mention and disambiguate entities
    • Take advantage of Wikipedia information
    • Take advantage of type information into our model

Limited Types; Adaptation

Advanced model

the necessity of the joint approach
The Necessity of the Joint Approach

The town is so so good, Don’t worry Ben, we already forgave you for Gigli

  • Q: Is “the town” a mention?
  • Deep analysis with knowledge is required
    • Gigli is Ben Affleck’s movie, which did not receive a good review
    • Ben Affleck is the lead actor in the movie “The Town”
outline2
Outline
  • Task Definition (again!)
  • Two stage versus Joint
  • Model + Features
  • Results + Analysis
features
Features
  • Oh Yes!! giants vs packers game now!! Touchdown!!

Mention, Entity Pair Features

2-nd Order Features

Type Features

Mention Specific Features

mention specific features
Mention Specific Features
  • Mention Specific Features
    • How likely “giants” is being used as an anchor?
    • How likely “giants” is capitalized in Wikipedia?
    • Is the “giants” a stopword? The number of tokens…
  • Entity - Mention Pair Features
    • Given a string "giants". Estimate by Wikipedia link structure
    • Similarity between the context of the and the words in Wikipedia “”
    • View count
view count
View Count
  • The Wikipedia statistics
    • http://dumps.wikimedia.org/other/pagecounts-raw/
    • Log exists for every hour
    • Very valuable data
  • View count is useful
    • Sometimes the most linked entity in Wikipedia is not the most popular one
    • “jersey shore” ==> ?
    • Jersey Shore links: 441 views: 509140
    • Jersey Shore (TV_series) links: 324 views: 5081377
second order features
Second Order Features
  • = the set of Wikipages that link to
  • The Jaccard score
type features
Type Features
  • The information content on Wikipedia are different from Twitter
    • Wikipedia is informational; Tweets are actionable
    • Misspelled words: “watchin, watchn, …… “
  • We want to find context for PER, LOC, ORG,… for tweets
    • Step 1: train on a system
    • Step 2: labeled 10 million unlabeled tweets
    • Step 3: Collect popular contextual words for each type
    • Step 4: train a new system with one new feature
      • Check if the context match the type
procedure
Procedure
  • Testing: step 1
    • Given a tweet
    • Tokenize it, remove symbols, segment hashtags
  • Testing: step 2
    • For all k-gram words in the tweet, do table look up
      • To find mention candidates and the entities they can link to
  • Testing: step 3
    • Construct features and output the assignment with the trained model
  • Learning: Structural SVM; Inference: Exact/Beamseach
    • A rule-base system for categorizing Wikipedia
outline3
Outline
  • Task Definition (again!)
  • Two stage versus Joint
  • Model + Features
  • Results + Analysis
slide23
Data
  • We sample two sets of tweets
    • Train, Test 1 from [Ritter 2011]
    • Test 2 from Twitter with entertainment keywords
      • “director, actress”……
  • P@1 is very high
    • Many, many algorithms focus on disambiguation
    • However, if the mention are correctly extracted, the system is already very good
main results
Main Results
  • TagMe[Ferragina & Scaiella 2010] and Cucerzan [Cucerzan 07]
    • Cucerzan is designed for well-written documents
    • We have a more principle way to handle mention detection than Tagme
impact of features
Impact of Features
  • Entity information helps mention detections
  • Mining contextual words helps a bit
  • Capturing Entity-Entity relation also improves the model
conclusion discussions
Conclusion & Discussions
  • We provide an experimental study on tweets
    • Jointly detect mentions and disambiguate
    • A structured learning approach
  • What have we learned
    • Mention detection is a difficult problem
    • Entity information could potentially help mention detection
  • Future work
    • Explore the connections between the joint approaches and the two stage approaches
      • [Illinois—ACL 2011, Aida-- VLDB 2011]
    • A more principled way to handle context