
Social Media, Data Integration, and Human Computation

Social Media, Data Integration, and Human Computation. AnHai Doan, University of Wisconsin and @WalmartLabs. A Journey Starting in 2001: worked in data integration (combining multiple data sources into one), e.g., aggregation/comparison shopping sites, Google Scholar.


Presentation Transcript


  1. Social Media, Data Integration, and Human Computation • AnHai Doan, University of Wisconsin and @WalmartLabs

  2. A Journey Starting in 2001 ... • Worked in data integration • combine multiple data sources into one • e.g., aggregation/comparison shopping sites, Google Scholar • use schema matching, information extraction, entity disambiguation • Ph.D. thesis focused on schema matching • Find houses with 2 bedrooms under 400K: homes.com, realestate.com, fsbo.com

  3. Schema Matching • Developed automatic solution using machine learning • Realized that automatic solutions are not good enough • only 65-85% accuracy • need human intervention • Proposed a crowdsourcing approach address = location price = sold-at

  4. Crowdsourced Schema Matching • Can crowdsource other DI tasks too • Difficult to publish • Building data integration systems via mass collaboration, WebDB-03 • Subsequent reviews: great work, I don’t believe it, neutral • Example user votes on address = location: Yes, Yes, No • Build a large-scale DI system on the Web • Show that crowdsourcing is practical
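
The "Yes, Yes, No" votes on the candidate match address = location suggest a simple vote-aggregation step. The sketch below is only an illustration under stated assumptions: the slides do not say how votes were combined, so plain majority voting, the function name `aggregate_votes`, and the `min_votes` threshold are all hypothetical.

```python
from collections import Counter

def aggregate_votes(votes, min_votes=3):
    """Majority vote over user answers for one candidate schema match.

    Returns True (accept), False (reject), or None (undecided).
    Majority voting and min_votes are assumptions, not the talk's method.
    """
    if len(votes) < min_votes:
        return None  # not enough answers collected yet
    counts = Counter(v.strip().lower() for v in votes)
    yes, no = counts["yes"], counts["no"]
    if yes == no:
        return None  # tie: ask more users
    return yes > no

# Candidate match and votes taken from the slide.
candidate = ("address", "location")
decision = aggregate_votes(["Yes", "Yes", "No"])
print(candidate, "->", decision)
```

A real system would probably also weight users by reliability; that is left out here to keep the sketch small.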

  5. Started DBLife Project in 2005 [Architecture diagram. Sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages. Extracted entities and relations, e.g., HV Jagadish, give-talk, SIGMOD-07. Services over superpages: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Storage: file system, RDBMS, Hadoop]

  6. Example Superpage

  7. Example Crowdsourcing • Picture is removed if enough users vote “no”.

  8. Project Status in 2009 • Data integration • overall methodology: VLDB-07a, VLDB-07b, CIDR-09 • DI operators: VLDB-07c • optimization: VLDB-07c, SIGMOD-08, ICDE-08a, SIGMOD-09a • provenance/others: ICDE-07a, ICDE-07b, VLDB-08a • Crowdsourcing / human computation • schema matching: ICDE-08b • best-effort information extraction: SIGMOD-08 • human feedback into the DI pipeline: SIGMOD-09b • how lay users can query the database: SIGMOD-09c • System development • hard to build/maintain systems in academia • Wanted to know what’s going on in industry • Wanted to take DBLife to the next level • Joined Kosmix in 2010 to do “DBLife on steroids”

  9. Kosmix • Founded by Anand Rajaraman & Venky Harinarayan • formerly of Junglee, sold to Amazon for 250M • 55M in funding, 30+ engineers • Integrated Web data sources into a giant taxonomy [Diagram. Sources: IMDB, Musicbrainz, Tripadvisor, Wikipedia, … Operators: information extraction, entity disambiguation, entity merging, ... Taxonomy: all → places, people → actors → Angelina Jolie, Mel Gibson. Output: topic pages. Storage: file system, RDBMS, Hadoop]

  10. • Raised many interesting challenges • e.g., incremental updates, recycling human edits • Very good in certain topics (e.g., health) • But hard to compete with Google and Wikipedia • Switched to social media in early 2010

  11. Social Media Exploding • 100 million tweets per day • 1 billion Facebook shares per day • 1.5 million Foursquare checkins per day • 40,000 Flickr photos per second • “Every two days now we create as much information as we did from the dawn of civilization up until 2003.” -- Eric Schmidt

  12. Switching Made Much Business Sense • Lot of social media data • Lot of people using it, spending a lot of time on it • lot of links now come from social media, not search engines • Google is worried (hence Buzz, Google+, Google++) • New level playing field • Have a secret weapon: the giant taxonomy • Next hot Internet wave • SoLoMo = social + local + mobile • But can we build interesting applications? What is social media good for?

  13. From Frivolous to Serious • 95% of tweets are still junk • I feel good today • Help teenagers track Justin Bieber • the background noise of Twitter • Charlie Sheen, celebrity fighting, Weiner losing his job • Foster customer relationships • follow your dentist • Spread news • Manage disasters • Promote e-commerce • Help organize events, movements • revolutions

  14. Lot of Companies / Actions in This Space • Build platforms for social media • how to tweet more effectively • Understand social media • social analytics / route relevant information to users • Use social media to make predictions • Use social media to affect real-world changes • Mostly operate at the keyword level • how many times the keyword “Obama” has been mentioned today? • Kosmix: the leader in performing semantic analysis • how many times the entity President Obama has been mentioned today? • “Obama”, “Barack”, “Barry”, “BO”, “the Pres”, “the Messiah”, ...
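
The contrast between keyword-level and entity-level counting can be made concrete with a small sketch. It assumes a hand-built alias dictionary (the aliases are the ones listed on the slide; the entity ID `president-obama`, the sample tweets, and both function names are hypothetical) and is not a description of the Kosmix implementation.

```python
import re
from collections import Counter

# Aliases from the slide, mapped to a made-up entity ID.
ALIASES = {
    "obama": "president-obama",
    "barack": "president-obama",
    "barry": "president-obama",
    "bo": "president-obama",
    "the pres": "president-obama",
    "the messiah": "president-obama",
}

def count_keyword(tweets, keyword):
    """Keyword level: count tweets containing the literal keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(1 for t in tweets if pattern.search(t))

def count_entities(tweets, aliases):
    """Entity level: count tweets mentioning any alias of each entity."""
    patterns = [
        (re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE), entity)
        for alias, entity in aliases.items()
    ]
    counts = Counter()
    for t in tweets:
        # A tweet counts once per entity, however many aliases it contains.
        counts.update({entity for p, entity in patterns if p.search(t)})
    return counts

# Invented sample tweets.
tweets = [
    "Obama speaks at noon",
    "Barack is on TV again",
    "the Pres just landed in Ohio",
    "lunch with Bob",
]
print(count_keyword(tweets, "Obama"))
print(count_entities(tweets, ALIASES)["president-obama"])
```

Pure alias matching over-counts: "Barry" or "BO" can refer to many other things, which is why the later slides pair dictionary-based extraction with disambiguation.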

  15. Kosmix Solution [Architecture diagram. Crowdsourcing: internal analysts, users, Mechanical Turk workers, others. Sources: IMDB, Musicbrainz, Wikipedia, … Operators: information extraction, entity disambiguation, entity merging, schema matching, event detection, event monitoring, ... Output: Social Genome, applications. Highly scalable real-time infrastructure: file system, RDBMS, Hadoop, Muppet, Slates, stream servers]

  16. Social Genome [Graph diagram. Nodes: all; places (Egypt, Cairo, Tahrir); people: actors (Angelina Jolie, Mel Gibson), Twitter users (@melgibson, @dsmith, …), FB users (mel-gibson, davesmith, …); events: sports, celebrities, politics, … (Gibson car crash, Egyptian uprising). Edge labels: the-same-as, tweet-about, capital-of, located-in, related-to. Sample tweets: “@dsmith: Mel crashed. Maserati is gone.” and “@far213: Tahrir is packed!”]

  17. Building Social Genome: Three Sample Challenges [Social Genome graph diagram, repeated from the previous slide]

  18. Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media • Traditional text: “Mel was arrested again. What a dramatic fall since his Oscar-winning day.” • extraction, disambiguation: use rule-based / NLP / machine learning techniques • long-term, Web context: actor, movie, Oscar, Hollywood • Social media text: “@dsmith: mel crashed. maserati is gone.” • extraction, disambiguation: use dictionaries, use rules • short-term, social context: crash, car, Maserati [Taxonomy diagram: all → events (sports, celebrities, politics, …; Gibson car crash, Egyptian uprising), places, people (actors, directors; Angelina Jolie, Mel Gibson, Mel Brooks)]
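
As a rough illustration of dictionary-based extraction followed by context-based disambiguation, the sketch below looks up a surface form in a toy dictionary and scores each candidate entity by how many of its context words appear in the text. The context words come from the slide; the dictionary, the scoring rule, the context words assigned to Mel Brooks, and all names in the code are assumptions made for the example.

```python
import re

# Toy dictionary: surface form -> candidate entities (hypothetical).
DICTIONARY = {"mel": ["Mel Gibson", "Mel Brooks"]}

# Long-term Web context per entity (Gibson's words are from the slide).
WEB_CONTEXT = {
    "Mel Gibson": {"actor", "movie", "oscar", "hollywood"},
    "Mel Brooks": {"director", "comedy"},  # invented for illustration
}
# Short-term social context, e.g. learned from recent tweets about an event.
SOCIAL_CONTEXT = {
    "Mel Gibson": {"crash", "crashed", "car", "maserati"},
    "Mel Brooks": set(),
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_and_disambiguate(text):
    """Map each dictionary surface form found in text to its best entity.

    Score = number of context words (Web + social) present in the text.
    A surface form with no supporting context maps to None; ties go to
    the first candidate listed in the dictionary.
    """
    tokens = set(tokenize(text))
    mentions = {}
    for surface, candidates in DICTIONARY.items():
        if surface not in tokens:
            continue
        scores = {
            c: len(tokens & (WEB_CONTEXT[c] | SOCIAL_CONTEXT[c]))
            for c in candidates
        }
        best = max(scores, key=scores.get)
        mentions[surface] = best if scores[best] > 0 else None
    return mentions

print(extract_and_disambiguate("@dsmith: mel crashed. maserati is gone."))
print(extract_and_disambiguate(
    "Mel was arrested again. What a dramatic fall since his Oscar-winning day."
))
```

The tweet is resolved only through the short-term social context (crashed, maserati), while the longer sentence is resolved through the long-term Web context (Oscar), which mirrors the distinction the slide draws.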

  19. Must Maintain a Highly Dynamic Social Genome • Short-term, social context: crash, car, Maserati • Long-term, Web context: actor, movie, Oscar, Hollywood • Latency less than 2 seconds [Taxonomy diagram: all → events (sports, celebrities, politics, …; Gibson car crash, Egyptian uprising), places, people (actors, directors; Angelina Jolie, Mel Gibson, Mel Brooks)]

  20. The Giant Traditional Taxonomy is the Secret Weapon • Without it, dictionary-based extraction is not possible • Provide a framework to • “understand” social media, find related concepts, “hang” social contexts • Very hard to develop, takes years • like learning a new foreign language • Partly explains why it was hard for others to catch up • Must integrate traditional data well, then bootstrap [Taxonomy diagram: all → places (Egypt, Cairo, Tahrir; edges capital-of, located-in), people → actors (Angelina Jolie, Mel Gibson)]

  21. Event Detection: Current Solutions • Focus on Twitter + Foursquare • Lot of current work in academia / industry • Limitations of most of the current solutions • exploit just one kind of heuristic • e.g., find popular, strongly correlated words (Egypt, revolt) • does not exploit crowdsourcing • does not scale • not designed explicitly for parallelism [Diagram: sources Twitter, 4square, Facebook, Myspace, Flickr, … feed event detection, which adds events (Gibson car crash, Egyptian uprising) under sports, celebrities, politics, …]

  22. Event Detection: Kosmix Solution [Pipeline diagram. Components: Twitter, Foursquare; Population 1, Population 2, Population 3, …; Detector 1, Detector 2, ..., Detector n; candidate events; event evaluator and ranker; ranked events. Infrastructure: Hadoop, Muppet, Slates, stream servers]

  23. Event Monitoring: Current Solutions • Examples: “@dsmith: Baltimore shooting on TV5!” (event: Baltimore shooting), “@far213: Tahrir is packed!” (event: Egyptian uprising) • Manually write rules to match tweets to events • e.g., tweet contains certain keywords / userids → positive • conceptually simple, relatively easy to implement • often achieve high initial precision • Limitations • expensive, don’t scale • manually writing good rules can be hard • rules often become invalid/inadequate over time • e.g., Baltimore shooting → Johns Hopkins shooting

  24. Event Monitoring: Kosmix Solution • Event: Baltimore shooting • Initial profile: {Baltimore, shoot} • Tweets from the Twitter firehose: “Baltimore shooting on TV5!”, “Baltimore shooting. Johns Hopkins shut down.”, ... • Learning algorithm produces new profile: {Baltimore, shoot, Johns Hopkins}
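
The profile-update loop on this slide can be sketched as follows. The slide does not describe the learning algorithm, so this example substitutes a very simple one: terms that occur in at least half of the tweets matching the current profile are added to it. The matching rule (at least two profile terms, by token prefix), the `min_support` threshold, the stopword list, and the third sample tweet are all assumptions. It works on single tokens, so it learns "hopkins" rather than the full phrase.

```python
import re
from collections import Counter

STOPWORDS = {"on", "the", "a", "is", "in", "at", "after"}  # tiny illustrative list

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def hits(profile, tokens):
    """Number of profile terms that prefix some token ('shoot' hits 'shooting')."""
    return sum(1 for term in profile if any(tok.startswith(term) for tok in tokens))

def matches(profile, text, min_hits=2):
    return hits(profile, tokenize(text)) >= min_hits

def update_profile(profile, tweets, min_support=0.5):
    """Add terms that appear in at least min_support of the matching tweets."""
    matched = [set(tokenize(t)) for t in tweets if matches(profile, t)]
    if not matched:
        return set(profile)
    counts = Counter(tok for toks in matched for tok in toks)
    new_terms = {
        tok for tok, n in counts.items()
        if n / len(matched) >= min_support
        and tok not in STOPWORDS
        and not any(tok.startswith(term) for term in profile)
    }
    return set(profile) | new_terms

initial = {"baltimore", "shoot"}
tweets = [
    "Baltimore shooting on TV5!",
    "Baltimore shooting. Johns Hopkins shut down.",
    "Hopkins on lockdown after Baltimore shooting",  # invented for the example
    "Tahrir is packed!",
]
new_profile = update_profile(initial, tweets)
print(sorted(new_profile))
print(matches(initial, "Hopkins shooting update"), matches(new_profile, "Hopkins shooting update"))
```

With the updated profile, a tweet that mentions the shooting and Hopkins but not Baltimore is still matched, which is the drift that hand-written rules miss.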

  25. Social Analytics with The NYTimes • Components: tweets; annotators (e.g., location, sentiment, entity extraction, etc.); tweets & dimensions; SocialCubes; stats • Dimensions: location (New York, California, Arizona), topics (Barack Obama, Hillary Clinton, Medicare), sentiment (negative, positive, neutral) • Entity aliases: Barack Obama, President Obama, the Pres, Barry, BO, ... • Sample queries • How many are tweeting about Barack Obama in New York, by the minute for last 60 mins, by hour for last 24 hours, and by day for last 10 days? • How many people in Arizona feel positive about the new Medicare plan? • How many feel negative about Barack Obama across the US?
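
The sample queries amount to counting annotated tweets along the location, topic, sentiment, and time dimensions. The sketch below shows that idea over an in-memory list; it says nothing about how SocialCubes is actually built. The record layout, the sample data, and the functions `count_where` and `rollup` are hypothetical, and the sketch counts tweets, not distinct people.

```python
from collections import Counter
from datetime import datetime

# Invented annotated tweets: each record carries its dimension values.
annotated = [
    {"time": datetime(2011, 6, 1, 12, 0, 30), "location": "New York",
     "topic": "Barack Obama", "sentiment": "negative"},
    {"time": datetime(2011, 6, 1, 12, 0, 45), "location": "New York",
     "topic": "Barack Obama", "sentiment": "positive"},
    {"time": datetime(2011, 6, 1, 12, 5, 0), "location": "New York",
     "topic": "Barack Obama", "sentiment": "neutral"},
    {"time": datetime(2011, 6, 1, 13, 10, 0), "location": "Arizona",
     "topic": "Medicare", "sentiment": "positive"},
    {"time": datetime(2011, 6, 2, 9, 0, 0), "location": "California",
     "topic": "Barack Obama", "sentiment": "negative"},
]

def _selected(records, dims):
    """Records whose dimension values equal every given filter."""
    return [r for r in records if all(r.get(k) == v for k, v in dims.items())]

def count_where(records, **dims):
    """Count records matching the dimension filters."""
    return len(_selected(records, dims))

def rollup(records, granularity, **dims):
    """Counts per time bucket ('minute', 'hour', or 'day') for matching records."""
    formats = {"minute": "%Y-%m-%d %H:%M", "hour": "%Y-%m-%d %H", "day": "%Y-%m-%d"}
    fmt = formats[granularity]
    return Counter(r["time"].strftime(fmt) for r in _selected(records, dims))

# Negative tweets about Barack Obama across all locations.
print(count_where(annotated, topic="Barack Obama", sentiment="negative"))
# Tweets about Barack Obama in New York, by the minute.
print(rollup(annotated, "minute", topic="Barack Obama", location="New York"))
```

A production system would precompute such counts incrementally as tweets stream in; scanning a list per query is only meant to show what the cube cells contain.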

  26. Social Monitoring with an Unknown Agency • Count tweets related to Wael Ghonim: 146 in past 5 mins, 3267 in past 12 hours [Diagram labels: Twitter firehose; Egyptian uprising, Justin Bieber, Charlie Sheen, Jordan unrest, North, China unrest, Tibet, West, Southeast] • Bought by Walmart in May 2011

  27. The Walmart Acquisition • Deal reported to be 250-300M • Kosmix became @WalmartLabs • based in San Bruno • local office in India • plan new offices in China and Brazil • 100 persons today, actively hiring

  28. Why? • 400B+ in revenue, only 5-10B online vs. 34B for Amazon • Major problems if it doesn’t catch up within 5-10 years • see Borders • @WalmartLabs can help in many ways • Provides a core of technical people, attracts more • Improves traditional e-commerce • SEO, SEM, search on walmart.com • build a vast product taxonomy • Helps build the e-commerce of the future • social, local, and mobile • a good way to catch up and leapfrog Amazon

  29. Improve Traditional E-Commerce [Diagram. Sources: product data from thousands of vendors, in-house data, Web data. Operators: information extraction, entity disambiguation, entity merging, ... Product taxonomy: all products → books, cars → US cars → Ford, Chevrolet. Uses: search, ads. Storage: file system, RDBMS, Hadoop]

  30. Help Build the E-Commerce of the Future: Social, Local, and Mobile • O2O (Online 2 Offline) emerging as a major trend • increasingly tighter integration of online and offline parts • e.g., Groupon, LivingSocial • Social, local, and mobile commerce examples • gift recommendation: • “I love salt!” • “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?” • personalized “Groupon” with vendors: • “You seem to be interested in gourmet coffee. If 50 people sign up to buy the new DeLonghi coffee maker, you can get that for a 50% discount.” • stocking a local store • a Siri-like shopping assistant

  31. Wrapping Up • Social media has become a major frontier on the Web • Integrating social data is fundamentally much harder than integrating “traditional” data • lack of context • dynamic environment, new concepts appear quickly • quality issues, lots of spam • quick spread of information, user activities • fast data • solution will change over time, need human in the loop to monitor • Must integrate “traditional” data well, then bootstrap • giant taxonomy critical • Crowdsourcing becomes indispensable • but raises interesting challenges
