Big Text: from Language (Names and Phrases) to Knowledge (Entities and Relations)

Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany. http://www.mpi-inf.mpg.de/~weikum/ From Natural-Language Text to Knowledge: more knowledge, analytics, insight.

Presentation Transcript


  1. Big Text: from Language (Names and Phrases) to Knowledge (Entities and Relations). Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany. http://www.mpi-inf.mpg.de/~weikum/

  2. From Natural-Language Text to Knowledge: more knowledge, analytics, insight. Diagram labels: Web Contents, knowledge acquisition, Knowledge, intelligent interpretation

  3. Web of Data & Knowledge (Linked Open Data): > 50 billion subject-predicate-object triples from > 1000 sources. Shown: SUMO, BabelNet, WikiTaxonomy/WikiNet, ConceptNet 5, Cyc, ReadTheWeb, TextRunner/ReVerb. Figure: http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

  4. Web of Data & Knowledge: > 50 billion subject-predicate-object triples from > 1000 sources • 4M entities in 250 classes, 500M facts for 6000 properties, live updates • 600M entities in 15000 topics, 20B facts • 10M entities in 350K classes, 120M facts for 100 relations, 100 languages, 95% accuracy • 40M entities in 15000 topics, 1B facts for 4000 properties, core of Google Knowledge Graph

  5. Web of Data & Knowledge: > 50 billion subject-predicate-object triples from > 1000 sources. Example triples: Bob_Dylan type songwriter; Bob_Dylan type civil_rights_activist; songwriter subclassOf artist; Bob_Dylan composed Hurricane; Hurricane isAbout Rubin_Carter; Bob_Dylan marriedTo Sara_Lownds validDuring [Sep-1965, June-1977]; Bob_Dylan knownAs "voice of a generation"; Steve_Jobs "was big fan of" Bob_Dylan; Bob_Dylan "briefly dated" Joan_Baez. Kinds of knowledge: taxonomic knowledge, factual knowledge, temporal knowledge, terminological knowledge, evidence & belief knowledge
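The taxonomic triples on this slide can be queried with very little machinery. The following is a minimal sketch, assuming the facts are held in memory as plain (subject, predicate, object) tuples; the helper names `build_index` and `all_types` are illustrative and not part of any knowledge-base API.

```python
from collections import defaultdict

# Facts copied from the slide, stored as (subject, predicate, object) triples.
TRIPLES = [
    ("Bob_Dylan", "type", "songwriter"),
    ("Bob_Dylan", "type", "civil_rights_activist"),
    ("songwriter", "subclassOf", "artist"),
    ("Bob_Dylan", "composed", "Hurricane"),
    ("Hurricane", "isAbout", "Rubin_Carter"),
]


def build_index(triples):
    """Index objects by (subject, predicate) for quick lookup."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def all_types(entity, index):
    """Direct types of an entity plus every superclass reachable via subclassOf."""
    seen = set()
    frontier = list(index.get((entity, "type"), ()))
    while frontier:
        cls = frontier.pop()
        if cls not in seen:
            seen.add(cls)
            frontier.extend(index.get((cls, "subclassOf"), ()))
    return seen


INDEX = build_index(TRIPLES)
print(sorted(all_types("Bob_Dylan", INDEX)))
```

Following subclassOf edges is what lets a query for artists find Bob_Dylan even though no triple states that type directly.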

  6. Knowledge for Intelligent Applications. Enabling technology for: • disambiguation in written & spoken natural language • deep reasoning (e.g. QA to win quiz game) • machine reading (e.g. to summarize book or corpus) • semantic search in terms of entities & relations (not keywords & pages) • entity-level linkage for Big Data & Big Text analytics

  7. Use-Case: Semantic Search. Politicians who are also scientists? European composers who have won film music awards? Internet companies founded by Brazilian professors? Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure? ...

  8. Use-Case: Question Answering. "This town is known as 'Sin City' & its downtown is 'Glitter Gulch'." Q: Sin City? → movie, graphic novel, nickname for city, … A: Vegas? Vega? Strip? → Vega (star), Suzanne Vega, Vincent Vega, Las Vegas, … → comic strip, striptease, Las Vegas Strip, … "This American city has two airports named after a war hero and a WW II battle." Components: question classification & decomposition, knowledge back-ends. D. Ferrucci et al.: Building Watson. AI Magazine, Fall 2010. IBM Journal of R&D 56(3/4), 2012: This is Watson.

  9. Big Text Analytics: Who Covered Whom? Sources: 1000's of databases, 100 millions of Web tables, 100 billions of Web & social media pages. In different language, country, key, …; with more sales, awards, media buzz, …
     Musician | Original | Title
     Elvis Presley | Frank Sinatra | My Way
     Robbie Williams | Frank Sinatra | My Way
     Sex Pistols | Frank Sinatra | My Way
     Frank Sinatra | Claude Francois | Comme d'Habitude
     Claudia Leitte | Bruno Mars | Famo$a (Billionaire)
     ...

  10. Big Text Analytics: Who Covered Whom? Sources: 1000's of databases, 100 millions of Web tables, 100 billions of Web & social media pages. In different language, country, key, …; with more sales, awards, media buzz, …
     Musician | PerformedTitle: Sex Pistols | My Way; Frank Sinatra | My Way; Claudia Leitte | Famo$a; Petula Clark | Boy from Ipanema
     Name | Show: Petula C. | Muppets; Claudia L. | FIFA 2014
     Name | Group: Sid Vicious | Sex Pistols; Bono | U2
     Musician | CreatedTitle: Francis Sinatra | My Way; Paul Anka | My Way; Bruno Mars | Billionaire; Astrud Gilberto | Garota de Ipanema
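The "who covered whom" question amounts to a join between the performed and created tables, which only works once name variants are mapped to the same entity and title. The sketch below is illustrative only: the two alias dictionaries are hand-written assumptions standing in for real entity linkage, and `covers` is a hypothetical helper.

```python
# Rows taken from the slide's two tables.
PERFORMED = [
    ("Sex Pistols", "My Way"),
    ("Frank Sinatra", "My Way"),
    ("Claudia Leitte", "Famo$a"),
    ("Petula Clark", "Boy from Ipanema"),
]
CREATED = [
    ("Francis Sinatra", "My Way"),
    ("Paul Anka", "My Way"),
    ("Bruno Mars", "Billionaire"),
    ("Astrud Gilberto", "Garota de Ipanema"),
]

# Assumed alias tables (hypothetical): a linkage step would normally produce these.
MUSICIAN_ALIASES = {"Francis Sinatra": "Frank Sinatra"}
TITLE_ALIASES = {"Famo$a": "Billionaire", "Boy from Ipanema": "Garota de Ipanema"}


def canon(name, aliases):
    """Map a surface name to its canonical form, if an alias is known."""
    return aliases.get(name, name)


def covers(performed, created):
    """Join the tables on canonical title; a cover is a performance by a non-creator."""
    creators = {}
    for musician, title in created:
        key = canon(title, TITLE_ALIASES)
        creators.setdefault(key, set()).add(canon(musician, MUSICIAN_ALIASES))
    result = []
    for musician, title in performed:
        m = canon(musician, MUSICIAN_ALIASES)
        t = canon(title, TITLE_ALIASES)
        for original in sorted(creators.get(t, ())):
            if original != m:
                result.append((m, original, t))
    return result


for row in covers(PERFORMED, CREATED):
    print(row)
```

Without the alias tables, the join would miss the Claudia Leitte and Petula Clark rows entirely and would wrongly report Frank Sinatra as covering "Francis Sinatra".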

  11. Big Text Analytics: Who Covered Whom? Sources: 1000's of databases, 100 millions of Web tables, 100 billions of Web & social media pages. In different language, country, key, …; with more sales, awards, media buzz, … Big Data & Big Text: Volume, Velocity, Variety, Veracity
     Musician | PerformedTitle: Sex Pistols | My Way; Frank Sinatra | My Way; Claudia Leitte | Famo$a; Petula Clark | Boy from Ipanema
     Musician | CreatedTitle: Francis Sinatra | My Way; Paul Anka | My Way; Bruno Mars | Billionaire; Astrud Gilberto | Garota de Ipanema

  12. Big Data & Big Text Analytics. Entertainment: Who covered which other singer? Who influenced which other musicians? Health: Drugs (combinations) and their side effects. Politics: Politicians' positions on controversial topics and their involvement with industry. Business: Customer opinions on small-company products, gathered from social media. Culturomics: Trends in society, cultural factors, etc. General Design Pattern: • Identify relevant content sources • Identify entities of interest & their relationships • Position in time & space • Group and aggregate • Find insightful patterns & predict trends

  13. Outline: ✓ Introduction • Lovely NERD • The New Chocolate • The Dark Side • Conclusion

  14. Lovely NERD

  15. Named Entity Recognition & Disambiguation (NERD). Example: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington." Signals: contextual similarity of mention vs. entity (bag-of-words, language model); prior popularity of name-entity pairs
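The two signals on this slide can be combined into a per-candidate score. The following is a minimal sketch, assuming a linear mix of prior popularity and bag-of-words cosine similarity; the priors, context words, and the weight `alpha` are invented for illustration and do not come from any knowledge base.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def local_score(prior, mention_words, entity_words, alpha=0.5):
    """Mix name-entity prior with mention-vs-entity context similarity (alpha is assumed)."""
    sim = cosine(Counter(mention_words), Counter(entity_words))
    return alpha * prior + (1 - alpha) * sim


# Words around the mention "Hurricane" in the first example sentence.
MENTION_CONTEXT = ["carter", "bob", "desire"]

# Hypothetical candidates: (prior popularity, words describing the entity).
CANDIDATES = {
    "Hurricane_(song)": (0.2, ["bob", "dylan", "desire", "carter", "protest", "song"]),
    "Hurricane_Katrina": (0.6, ["storm", "new", "orleans", "2005", "flood"]),
    "The_Hurricane_(film)": (0.2, ["film", "washington", "carter", "boxer"]),
}

best = max(
    CANDIDATES,
    key=lambda e: local_score(CANDIDATES[e][0], MENTION_CONTEXT, CANDIDATES[e][1]),
)
print(best)
```

With these made-up numbers, context similarity outweighs the higher prior of the storm, which is the effect the slide is after.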

  16. Named Entity Recognition & Disambiguation (NERD). Example: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington." • Coherence of entity pairs: • semantic relationships • shared types (categories) • overlap of Wikipedia links

  17. Named Entity Recognition & Disambiguation. Example: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington." Keyphrases shown: racism protest song, boxing champion, wrong conviction; racism victim, middleweight boxing, nickname Hurricane, falsely convicted; Grammy Award winner, protest song writer, film music composer, civil rights advocate; Academy Award winner, African-American actor, Cry for Freedom film, Hurricane film. Coherence: (partial) overlap of (statistically weighted) entity-specific keyphrases
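Keyphrase-based coherence can be approximated by a weighted overlap of the words in two entities' keyphrases. This is a simplified stand-in, not the measure used by the authors: it uses weighted Jaccard over words, gives every word the weight 1.0 (real weights would be statistical), and the assignment of keyphrase groups to entities is an assumption read off the slide.

```python
def word_weights(keyphrases):
    """Split keyphrases into lowercase words; every word gets weight 1.0 (assumed)."""
    weights = {}
    for phrase in keyphrases:
        for word in phrase.lower().split():
            weights[word] = 1.0
    return weights


def coherence(kp_a, kp_b):
    """Weighted Jaccard: sum of per-word minimum weights over sum of maximum weights."""
    a, b = word_weights(kp_a), word_weights(kp_b)
    words = set(a) | set(b)
    num = sum(min(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    den = sum(max(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    return num / den if den else 0.0


# Keyphrase groups from the slide; the entity each group belongs to is assumed.
HURRICANE_SONG = ["racism protest song", "boxing champion", "wrong conviction"]
RUBIN_CARTER = ["racism victim", "middleweight boxing", "nickname Hurricane", "falsely convicted"]
DENZEL_WASHINGTON = ["Academy Award winner", "African-American actor",
                     "Cry for Freedom film", "Hurricane film"]

print(coherence(HURRICANE_SONG, RUBIN_CARTER))
print(coherence(HURRICANE_SONG, DENZEL_WASHINGTON))
```

Working at the word level is what makes the overlap partial: "racism protest song" and "racism victim" share a word without being the same phrase.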

  18. Named Entity Recognition & Disambiguation. Example: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington." • KB provides building blocks: • name-entity dictionary, • relationships, types, • text descriptions, keyphrases, • statistics for weights. NED algorithms compute mention-to-entity mapping over weighted graph of candidates by popularity & similarity & coherence

  19. Joint Mapping [Figure: weighted graph of mentions m1–m4 and candidate entities e1–e6] • Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB • Compute high-likelihood mapping (ML or MAP) or dense subgraph (with high total edge weight) such that: • each m is connected to exactly one e (or at most one e)

  20. Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11, VLDB'12] [Figure: weighted graph of mentions m1–m4 and candidate entities e1–e6, annotated with weighted degrees] • Compute dense subgraph to maximize min weighted degree among entity nodes such that: • each m is connected to exactly one e (or at most one e) • Approx. algorithms (greedy, randomized, …), hash sketches, … • 82% precision on CoNLL'03 benchmark • Open-source software & online service AIDA http://www.mpi-inf.mpg.de/yago-naga/aida/ (slide footer: D5 Overview, May 14, 2013)
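A greedy approximation of the dense-subgraph idea can be sketched as follows. This is a simplified illustration, not the published AIDA algorithm: it repeatedly drops the entity with the smallest weighted degree as long as every mention keeps at least one candidate. The graph, its weights, and the function name `greedy_disambiguate` are made up for the example.

```python
def greedy_disambiguate(me_edges, ee_edges):
    """me_edges: {(mention, entity): weight}; ee_edges: {(entity, entity): weight}."""
    candidates = {}
    for mention, entity in me_edges:
        candidates.setdefault(mention, set()).add(entity)
    alive = {entity for _, entity in me_edges}

    def degree(e):
        # Weighted degree: mention edges plus edges to entities still in the graph.
        d = sum(w for (_, x), w in me_edges.items() if x == e)
        d += sum(w for (a, b), w in ee_edges.items()
                 if (a == e and b in alive) or (b == e and a in alive))
        return d

    def removable(e):
        # An entity may go only if each mention it serves keeps another live candidate.
        return all(len(c & alive) > 1 for c in candidates.values() if e in c)

    while True:
        victims = [e for e in alive if removable(e)]
        if not victims:
            break
        alive.remove(min(victims, key=lambda e: (degree(e), e)))

    # If a mention still has several candidates, keep the strongest mention edge.
    return {m: max(c & alive, key=lambda e: (me_edges[(m, e)], e))
            for m, c in candidates.items()}


# Hypothetical weights: popularity/similarity on mention edges, coherence on entity edges.
ME = {
    ("Hurricane", "Hurricane_(song)"): 30, ("Hurricane", "Hurricane_Katrina"): 50,
    ("Carter", "Rubin_Carter"): 20, ("Carter", "Jimmy_Carter"): 50,
    ("Washington", "Denzel_Washington"): 20, ("Washington", "Washington_DC"): 60,
}
EE = {
    ("Hurricane_(song)", "Rubin_Carter"): 90,
    ("Rubin_Carter", "Denzel_Washington"): 80,
    ("Hurricane_(song)", "Denzel_Washington"): 30,
    ("Jimmy_Carter", "Washington_DC"): 40,
    ("Hurricane_Katrina", "Washington_DC"): 10,
}

mapping = greedy_disambiguate(ME, EE)
print(mapping)
```

In this toy graph each popular candidate has the higher mention-edge weight, yet the coherent trio survives because its entity-entity edges keep its weighted degrees high.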

  21. NERD Online Tools • J. Hoffart et al.: EMNLP 2011, VLDB 2011: https://d5gate.ag5.mpi-sb.mpg.de/webaida/ • P. Ferragina, U. Scaiella: CIKM 2010: http://tagme.di.unipi.it/ • R. Isele, C. Bizer: VLDB 2012: http://spotlight.dbpedia.org/demo/index.html • D. Milne, I. Witten: CIKM 2008: http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/ • L. Ratinov, D. Roth, D. Downey, M. Anderson: ACL 2011: http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier • Reuters Open Calais: http://viewer.opencalais.com/ • Alchemy API: http://www.alchemyapi.com/api/demo.html

  22. NERD at Work https://gate.d5.mpi-inf.mpg.de/webaida/

  23. NERD at Work https://gate.d5.mpi-inf.mpg.de/webaida/

  24. NERD at Work https://gate.d5.mpi-inf.mpg.de/webaida/

  25. NERD at Work https://gate.d5.mpi-inf.mpg.de/webaida/

  26. NERD on Tables

  27. Entity Matching in Structured Data. Variety & Veracity! Example rows to be matched (?): Hurricane, Dylan; Like a Hurricane, Young; Hurricane, Everette; Hurricane Katrina, New Orleans, 2005; Hurricane Sandy, New York, 2012; … against: Hurricane, 1975; Forever Young, 1972; Like a Hurricane, 1975; … and: Dylan, Bob, 1941; Thomas, Dylan, Swansea, 1914; Young, Brigham, 1801; Young, Neil, Toronto, 1945; Denny, Sandy, London, 1947 • entity linkage: • key to data integration • long-standing problem, very difficult, unsolved. H.L. Dunn: Record Linkage. American Journal of Public Health 36 (12), 1946. H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science 130 (3381), 1959

  28. Entity Matching in Structured Data [Figure: records e1–e3 and f1–f4] sameAs linking: similarity of contexts

  29. Entity Matching in Structured Data [Figure: records e1–e3, f1–f4, g1–g3] sameAs linking: similarity of contexts & coherence of neighborhoods & constraints (transitivity etc.) → joint inference over (probabilistic) graph!
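The transitivity constraint on sameAs links can be illustrated with a union-find structure over thresholded pairwise scores. This is a deliberately crude stand-in for the joint probabilistic inference named on the slide; the record identifiers follow the figure, while the scores and the threshold are hypothetical.

```python
def same_as_clusters(nodes, scored_pairs, threshold=0.8):
    """Group records into sameAs clusters by closing accepted pairs transitively."""
    parent = {n: n for n in nodes}

    def find(x):
        # Find the cluster representative, halving paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:  # accept the link, then merge the two clusters
            parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


NODES = ["e1", "e2", "f1", "f2", "g1"]
# Hypothetical context-similarity scores between records from different sources.
PAIRS = [("e1", "f1", 0.9), ("f1", "g1", 0.85), ("e2", "f2", 0.9), ("e1", "f2", 0.3)]

print(same_as_clusters(NODES, PAIRS))
```

Here e1 and g1 end up in one cluster although no score links them directly; a joint model would additionally weigh such implied links against conflicting evidence instead of accepting them unconditionally.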

  30. Linking Big Data & Big Text
     Musician | Song | Year | Listeners | Charts
     Sinatra | My Way | 1969 | 435 420 | ...
     Sex Pistols | My Way | 1978 | 87 729 | ...
     Pavarotti | My Way | 1993 | 4 239 | ...
     C. Leitte | Famo$a | 2011 | 272 468 | ...
     B. Mars | Billionaire | 2010 | 218 116 | ...

  31. Research Challenges & Opportunities • Efficient interactive & high-throughput batch NERD: a day's news, a month's publications, a decade's archive • Entity name disambiguation in difficult situations: short and noisy texts about long-tail entities in social media • Handling long-tail and emerging entities to complement and continuously update KB: key for KB life-cycle management • Web-scale entity linkage with high quality across text sources, linked data, KBs, Web tables, …

  32. Outline: ✓ Introduction • ✓ Lovely NERD • The New Chocolate • The Dark Side • Conclusion

  33. Big Text: the New Chocolate

  34. Semantic Search over News https://stics.mpi-inf.mpg.de

  35. Semantic Search over News https://stics.mpi-inf.mpg.de

  36. Semantic Search over News https://stics.mpi-inf.mpg.de

  37. Semantic Search over News https://stics.mpi-inf.mpg.de

  38. Semantic Search over News https://stics.mpi-inf.mpg.de

  39. Semantic Search over News https://stics.mpi-inf.mpg.de

  40. Entity Analytics over News https://stics.mpi-inf.mpg.de

  41. Entity Analytics over News https://stics.mpi-inf.mpg.de

  42. Machine Reading of Scholarly Papers https://gate.d5.mpi-inf.mpg.de/knowlife/

  43. Machine Reading of Health Forums https://gate.d5.mpi-inf.mpg.de/knowlife/ [P. Ernst et al.: ICDE'14]

  44. Big Data & Text Analytics: Side Effects of Drug Combinations • Deeper insight from both expert data & social media: • actual side effects of drugs • … and drug combinations • risk factors and complications of (wide-spread) diseases • alternative therapies • aggregation & comparison by age, gender, life style, etc. Structured Expert Data: http://dailymed.nlm.nih.gov Social Media: http://www.patient.co.uk

  45. Credibility of Statements in Health Communities [S. Mukherjee et al.: KDD'14] Example posts: "I took the whole med cocktail at once. Xanax gave me wild hallucinations and a demonic feel." "Xanax made me dizzy and sleepless." "Xanax and Prozac are known to cause drowsiness." [Figure: graph over posts p1–p3, users u1–u3, statements s1–s2] Language Objectivity, User Trustworthiness, Statement Credibility: joint reasoning with probabilistic graphical model

  46. Machine Reading: from Names and Phrases to Entities, Classes, and Relations. Example: "The Maestro from Rome wrote scores for westerns. Ma played his version of the Ecstasy." Candidate interpretations: Maestro Card, Rome (Italy), Jack Ma, MDMA, Leonard Bernstein, AS Roma, Yo-Yo Ma, l'Estasi dell'Oro, Lazio Roma, Ennio Morricone, plays sport, western movie, goal in football, cover of, born in, Western Digital, plays music, story about, film music, plays for

  47. Paraphrases of Relations. composed: musician × song; covered: musician × song. Examples: "Dylan wrote a sad song Knockin' on Heaven's Door, a cover song by the Dead"; "Morricone's masterpiece is the Ecstasy of Gold, covered by Yo-Yo Ma"; "Amy's souly interpretation of Cupid, a classic piece of Sam Cooke"; "Nina Simone's singing of Don't Explain revived Holiday's old song"; "Cat Power's voice is haunting in her version of Don't Explain"; "Cale performed Hallelujah written by L. Cohen" • SOL patterns over words, wildcards, POS tags, semantic types: <musician> wrote ADJ piece <song>. Sequence Mining with Type Lifting (N. Nakashole et al.: EMNLP'12, ACL'13, VLDB'12) • Relational phrases are typed: <singer> covered <song>; <book> covered <event> • Relational synsets (and subsumptions): covered: cover song, interpretation of, singing of, voice in version, …; composed: wrote, classic piece of, 's old song, written by, composed, … 350 000 SOL patterns from Wikipedia: http://www.mpi-inf.mpg.de/yago-naga/patty/
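The role of relational synsets and type signatures can be shown with a small lookup. This is a toy sketch, not the pattern-mining system itself: it matches the synset phrases listed on the slide by plain substring search and assumes the argument types of a sentence are already known; `match_relations` is a hypothetical helper.

```python
# Synsets and type signatures as listed on the slide.
SYNSETS = {
    "covered": {
        "signature": ("musician", "song"),
        "phrases": ["cover song", "interpretation of", "singing of", "voice in version"],
    },
    "composed": {
        "signature": ("musician", "song"),
        "phrases": ["wrote", "classic piece of", "'s old song", "written by", "composed"],
    },
}


def match_relations(text, arg_types):
    """Relations whose signature fits the argument types and whose phrase occurs in text."""
    text = text.lower()
    return sorted(
        rel for rel, info in SYNSETS.items()
        if info["signature"] == arg_types
        and any(phrase in text for phrase in info["phrases"])
    )


print(match_relations("Amy's souly interpretation of Cupid", ("musician", "song")))
print(match_relations("Cale performed Hallelujah written by L. Cohen", ("musician", "song")))
print(match_relations("The book covered the event", ("book", "event")))
```

The type check is what keeps "covered" in a book-and-event sentence from being read as the musical relation; real SOL patterns also generalize over wildcards and POS tags, which this sketch omits.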

  48. Disambiguation for Entities, Classes & Relations (M. Yahya et al.: EMNLP'12, CIKM'13). Candidates per phrase: Maestro → e: MaestroCard, e: Ennio Morricone, c: conductor, c: musician; from → r: actedIn, r: bornIn; Rome → e: Rome (Italy), e: Lazio Roma; wrote scores → r: composed, r: giveExam; scores for → c: soundtrack, r: soundtrackFor, r: shootsGoalFor; westerns → c: western movie, e: Western Digital. Weighted edges (coherence, similarity, etc.). Combinatorial Optimization by ILP (with type constraints etc.); ILP optimizers like Gurobi solve this in seconds
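The effect of type constraints in the joint optimization can be seen on a three-phrase fragment. The sketch below swaps the ILP for brute-force enumeration, which is only feasible at toy size; the entity types, relation signatures, and weights are all hypothetical values chosen for illustration.

```python
from itertools import product

# Candidate readings for three phrases, taken from the slide.
CANDIDATES = {
    "Maestro": ["MaestroCard", "Ennio_Morricone"],
    "from": ["actedIn", "bornIn"],
    "Rome": ["Rome_(Italy)", "Lazio_Roma"],
}

# Assumed types, signatures, and phrase-to-candidate weights (not from the source).
TYPE = {"MaestroCard": "product", "Ennio_Morricone": "person",
        "Rome_(Italy)": "city", "Lazio_Roma": "club"}
SIGNATURE = {"actedIn": ("person", "movie"), "bornIn": ("person", "city")}
WEIGHT = {
    ("Maestro", "MaestroCard"): 0.6, ("Maestro", "Ennio_Morricone"): 0.4,
    ("from", "actedIn"): 0.5, ("from", "bornIn"): 0.5,
    ("Rome", "Rome_(Italy)"): 0.7, ("Rome", "Lazio_Roma"): 0.3,
}


def best_assignment():
    """Enumerate all joint choices; keep the best one that satisfies the type constraint."""
    phrases = ["Maestro", "from", "Rome"]  # subject phrase, relation phrase, object phrase
    best, best_score = None, float("-inf")
    for choice in product(*(CANDIDATES[p] for p in phrases)):
        subj, rel, obj = choice
        if SIGNATURE[rel] != (TYPE[subj], TYPE[obj]):
            continue  # relation signature must match the argument types
        score = sum(WEIGHT[(p, c)] for p, c in zip(phrases, choice))
        if score > best_score:
            best, best_score = dict(zip(phrases, choice)), score
    return best


print(best_assignment())
```

An ILP encodes the same choice with one binary variable per candidate and linear constraints, which scales to full sentences where enumeration would not.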

  49. Outline: ✓ Introduction • ✓ Lovely NERD • ✓ The New Chocolate • The Dark Side • Conclusion

  50. The Dark Side of Big Data. Nobody interested in your research? We read your papers!
