Social + Mobile + Commerce
Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach
Abhishek Gattani, Digvijay Lamba, Nikesh Garera, Mitul Tiwari (3), Xiaoyong Chai, Sanjib Das (1), Sri Subramaniam, Anand Rajaraman (2), Venky Harinarayan (2), AnHai Doan (1)
@WalmartLabs; (1) University of Wisconsin-Madison; (2) Cambrian Ventures; (3) LinkedIn
Aug 27th, 2013
The Problem
Extract, link, classify, and tag the entities in social media data, e.g.:
“Obama gave an immigration speech while on vacation in Hawaii”
Why? – Use Cases
• Used extensively at Kosmix and later at @WalmartLabs:
• Twitter event monitoring
• In-context ads
• User query parsing
• Product search and recommendations
• Social mining
• Core use cases:
• Central topic detection for a web page or tweet
• Getting a stream of tweets/messages about a topic
• Small team at scale:
• About 3 engineers at a time
• Processing the entire Twitter firehose
Based on a Knowledge Base
• Global: Covers a wide range of topics. Sources include WordNet, Wikipedia, Chrome, Adam, MusicBrainz, Yahoo! Stocks, etc.
• Taxonomy: The Wikipedia graph is converted into a hierarchical taxonomy with IsA edges, which are transitive
• Large: 6.5 million hierarchical concepts with 165 million relationships
• Real Time: Constantly updated from sources, analyst curation, and event detection
• Rich: Synonyms, homonyms, relationships, etc.
Published: Building, maintaining, and using knowledge bases: A report from the trenches. In SIGMOD, 2013.
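To make the "transitive IsA edges" point concrete, here is a minimal sketch of a taxonomy whose ancestor lookup follows IsA edges transitively. The class, node names, and helper are illustrative, not the actual KB schema.

```python
# A minimal sketch of a hierarchical taxonomy with transitive IsA edges.
from collections import defaultdict

class Taxonomy:
    def __init__(self):
        self.parents = defaultdict(set)   # node -> direct IsA parents

    def add_isa(self, child, parent):
        self.parents[child].add(parent)

    def ancestors(self, node):
        """All concepts reachable from node via IsA edges (transitive closure)."""
        seen, stack = set(), list(self.parents[node])
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(self.parents[p])
        return seen

taxonomy = Taxonomy()
taxonomy.add_isa("Barack Obama", "Presidents")
taxonomy.add_isa("Presidents", "Politicians")
taxonomy.add_isa("Politicians", "People")
print(taxonomy.ancestors("Barack Obama"))   # {'Presidents', 'Politicians', 'People'}
```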
Annotate with Contexts
Every social conversation takes place in a context that changes what it means.
• A real-time user context: What topics does this user talk about?
• A real-time social context: What topics are usually in the context of a hashtag, domain, or KB node?
• A web context: Topics in a link in a tweet. What are the topics in a KB node's Wiki page?
Compute these contexts at scale.
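One simple way to picture a "context" is a weighted bag of KB topics merged from several sources (user history, hashtag, linked web page). The sketch below is hypothetical: the merge rule, weights, and topic scores are made up for illustration.

```python
# A hypothetical sketch: a context as a weighted topic distribution,
# merged from user, social (hashtag), and web contexts.
from collections import Counter

def merge_contexts(*contexts, weights=None):
    """Combine several topic distributions into one context."""
    weights = weights or [1.0] * len(contexts)
    merged = Counter()
    for ctx, w in zip(contexts, weights):
        for topic, score in ctx.items():
            merged[topic] += w * score
    return merged

user_context = Counter({"Politics": 0.6, "Movies": 0.2})
hashtag_context = Counter({"Politics": 0.8})
web_context = Counter({"Hawaii": 0.5, "Politics": 0.3})

tweet_context = merge_contexts(user_context, hashtag_context, web_context,
                               weights=[1.0, 0.5, 0.5])
print(tweet_context.most_common(3))
```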
Key Differentiators – Why it works
• The knowledge base
• Interleaving several problems
• Use of context
• Scale
• Rule-based approach
How: First Find Candidate Mentions
“RT Stephen lets watch. Politics of Love is about Obama’s election @EricSu”
Step 1: Pre-process – clean up the tweet
“Stephen lets watch. Politics of Love is about Obama’s election”
Step 2: Find mentions – all strings in the KB + detectors
[“Stephen”, “lets”, “watch”, “Politics”, “Politics of Love”, “is”, “about”, “Obama”, “Election”]
Step 3: Initial rules – remove obvious bad cases
[“Stephen”, “watch”, “Politics”, “Politics of Love”, “Obama”, “Election”]
Step 4: Initial scoring – quick and dirty
[“Obama”: 10, “Politics of Love”: 9, “Stephen”: 7, “watch”: 7, “Politics”: 6, “Election”: 6]
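A minimal sketch of steps 2–4 follows: longest-match dictionary lookup of tweet substrings against the KB, a stopword-style filter standing in for the initial rules, and a quick-and-dirty score. The KB entries, stopword list, and scores are illustrative, not the real system's.

```python
# A sketch of candidate mention finding: KB dictionary lookup + quick scoring.
import re

KB = {"stephen": 7, "watch": 7, "politics": 6, "politics of love": 9,
      "obama": 10, "election": 6, "lets": 1, "is": 1, "about": 1}
STOPWORDS = {"lets", "is", "about"}          # stands in for Step 3's initial rules

def find_mentions(text, max_len=4):
    tokens = re.findall(r"[a-z]+", text.lower())
    mentions = {}
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            phrase = " ".join(tokens[i:j])
            if phrase in KB and phrase not in STOPWORDS:
                mentions[phrase] = KB[phrase]   # initial quick-and-dirty score
    return sorted(mentions.items(), key=lambda kv: -kv[1])

tweet = "Stephen lets watch. Politics of Love is about Obama's election"
print(find_mentions(tweet))
# [('obama', 10), ('politics of love', 9), ('stephen', 7), ('watch', 7), ...]
```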
How: Add Mention Features
Step 5: Tag and classify – quick and dirty
“Obama”: Presidents, Politicians, People; Politics, Places, Geography
“Politics of Love”: Movies, Political Movies, Entertainment, Politics
“Stephen”: Names, People
“watch”: Verb, English Words, Language, Fashion Accessories, Clothing
“Politics”: Politics
“Election”: Political Events, Politics, Government
Tweet: Politics, People, Movies, Entertainment, etc.
Step 6: Add features
Contexts, similarity to the tweet, similarity to the user or website, popularity measures, interestingness, social signals
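As a rough illustration of step 6, the sketch below assembles a feature dict per mention from its KB tags, a context-overlap score, and a popularity value. The feature names, tag lists, and numbers are assumptions for illustration, not the system's actual feature set.

```python
# A hypothetical sketch of per-mention feature assembly.
def mention_features(mention, kb_tags, tweet_context, popularity):
    tags = kb_tags.get(mention, [])
    # context similarity: fraction of the mention's tags already in the tweet context
    overlap = sum(1 for t in tags if t in tweet_context)
    ctx_sim = overlap / len(tags) if tags else 0.0
    return {
        "mention": mention,
        "tags": tags,
        "context_similarity": ctx_sim,
        "popularity": popularity.get(mention, 0.0),
        "is_proper_noun": mention.istitle(),
    }

kb_tags = {"Obama": ["Presidents", "Politicians", "People", "Politics"],
           "watch": ["Verbs", "Fashion Accessories"]}
tweet_context = {"Politics", "Movies", "People"}
popularity = {"Obama": 0.95, "watch": 0.30}

for m in ["Obama", "watch"]:
    print(mention_features(m, kb_tags, tweet_context, popularity))
```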
How: Finalize Mentions
Step 7: Apply rules
“Obama”: Boost popular entities and proper nouns
“Politics of Love”: Boost proper nouns; boost due to “watch”
“Stephen”: Delete out-of-context names
“watch”: Remove verbs
“Politics”: Boost tags that are also mentions
“Election”: Boost mentions in the central topic
Step 8: Disambiguate – the KB has many meanings; pick one
Obama: Barack Obama (popularity, context, social popularity)
watch: verb (Clothing is not in context)
Context is most important! We use many contexts for the best results.
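A minimal sketch of step 8 follows: each candidate KB sense of a mention is scored against the combined contexts plus popularity, and the best sense wins. The senses, tags, and weighting are made up to illustrate why context dominates.

```python
# A sketch of sense disambiguation: score each candidate sense by
# context overlap (weighted most heavily) plus popularity.
def disambiguate(mention, senses, context):
    """senses: list of (sense_name, tags, popularity). Returns the best sense name."""
    def score(sense):
        name, tags, popularity = sense
        ctx = sum(1 for t in tags if t in context)   # context is the strongest signal
        return 2.0 * ctx + popularity
    return max(senses, key=score)[0]

obama_senses = [
    ("Barack Obama", ["Presidents", "Politics", "People"], 0.95),
    ("Obama, Fukui", ["Cities", "Japan", "Geography"],     0.10),
]
context = {"Politics", "Elections", "Movies"}
print(disambiguate("Obama", obama_senses, context))   # -> "Barack Obama"
```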
How: Finalize
Step 9: Rescore – a logistic regression model on all the features
Step 10: Re-tag – use the latest scores and only the picked meanings
Step 11: Editorial rules – a regular-expression-like language for analysts to pick or block mentions
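For step 9, here is a minimal sketch of rescoring mentions with a logistic regression model over their features. The feature columns, training rows, and scikit-learn model choice are illustrative assumptions; the slide only states that a logistic regression model over all features is used.

```python
# A sketch of the rescoring step with logistic regression over mention features.
from sklearn.linear_model import LogisticRegression
import numpy as np

# columns: [initial_score, context_similarity, popularity, is_proper_noun]
X_train = np.array([
    [10, 0.8, 0.95, 1],   # good mention (e.g. "Obama")
    [ 7, 0.1, 0.30, 0],   # bad mention (e.g. "watch")
    [ 9, 0.7, 0.40, 1],
    [ 6, 0.0, 0.20, 0],
])
y_train = np.array([1, 0, 1, 0])   # 1 = keep the mention, 0 = drop it

model = LogisticRegression().fit(X_train, y_train)
new_mentions = np.array([[8, 0.6, 0.5, 1]])
print(model.predict_proba(new_mentions)[:, 1])   # final mention score
```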
Does it work? – Evaluation of Entity Extraction
• For 500 English tweets we hand-curated a list of mentions.
• For 99 of those we built a comprehensive list of tags.
• Entity extraction:
• Works well for people, organizations, and locations
• Works great for unique names
• Works badly for media: albums, songs
• The generic-name problem:
• Too many movies, books, albums, and songs have “generic” names (Inception, It's Friday, etc.)
• Even when popular, they are often used “in conversation”
• Very hard to disambiguate
• Very hard to find which ones are generic
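For reference, this is the standard way precision and recall are computed per tweet against the hand-curated gold mentions; the example data is illustrative, not from the evaluation set.

```python
# A sketch of per-tweet precision/recall against hand-curated gold mentions.
def precision_recall(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                        # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {"Obama", "Politics of Love", "election"}
predicted = {"Obama", "Politics of Love", "Stephen"}
print(precision_recall(predicted, gold))   # (0.666..., 0.666...)
```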
Does it work? – Evaluation of Tagging
• Tagging/classification:
• Works well for travel and sports
• Bad for products and social sciences
• The N-lineages problem:
• All mentions have multiple lineages in the KB.
• Usually one IsA lineage goes to “People” or “Product”, and a ContainedIn lineage goes to a topic like “Social Science”.
• Detecting which lineage is primary is a hard problem:
• Is Camera in Photography or Electronics?
• Is War in History or Politics?
• How far do we go?
Comparison with Existing Systems
• The first such comparison effort that we know of.
• OpenCalais: an industrial entity extraction system.
• StanNER-3 (from Stanford): a 3-class (Person, Organization, Location) named entity recognizer. It uses a CRF-based model trained on a mixture of CoNLL, MUC, and ACE named entity corpora.
• StanNER-3-cl (from Stanford): the caseless version of StanNER-3, i.e., it ignores capitalization in text.
• StanNER-4 (from Stanford): a 4-class (Person, Organization, Location, Misc) named entity recognizer for English text. It uses a CRF-based model trained on the CoNLL corpora.
For People, Organizations, Locations
• Details in the paper.
• We are far better in almost all respects:
• Overall: 85% precision vs. 78% for the best other system
• Overall: 68% recall vs. 40% for StanNER-3 and 28% for OpenCalais
• Significantly better on organizations
• Why? – Bigger knowledge base
• The larger knowledge base allows more comprehensive disambiguation.
• Is “Emilie Sloan” referring to a person or an organization?
• Why? – Common interjections
• LOL, ROFL, Haha are interpreted as organizations by other systems.
• Acronyms are misinterpreted.
• Vs. OpenCalais
• Recall is the major difference, with a significantly smaller set of entities recognized by OpenCalais.