
Research Overview for Harvard Medical Library

Andrew McCallum, Associate Professor, Computer Science Department, University of Massachusetts Amherst


  1. Research Overview for Harvard Medical Library. Andrew McCallum, Associate Professor, Computer Science Department, University of Massachusetts Amherst

  2. Outline • Self & Lab Intro • Information Extraction and Data Mining • Research vignette: Conditional Random Fields • Research vignette: Social network analysis & Topic models • Demo: Rexa, a new Web portal for research literature • Future work, brainstorming, collaboration, next steps

  3. Personal History • PhD, University of Rochester • Machine Learning, “Reinforcement Learning”, • Eye movements and short-term memory • Postdoc, Carnegie Mellon University • Machine Learning for Text, with Tom Mitchell • WebKB Project (met Hooman there) • Research Scientist, Just Research Labs • Information extraction from text, clustering... • Built Cora, a precursor to CiteSeer, in 1997. • VP Research & Development, WhizBang Labs • Information extraction from the Web, ~$50m VC funding, 170 people • Job search subsidiary, FlipDog.com, sold to Monster.com • Associate Professor, UMass Amherst • #5 CS department in Artificial Intelligence • Strong ML & NSF Center for Intelligent Information Retrieval

  4. Information Extraction & Synthesis Laboratory(IESL) • Assoc. Prof. Andrew McCallum, Director • 9 PhD students • 2 postdocs • 3 undergraduates • 2 full-time staff programmers • 40+ publications in the past 2 years • Grants from NSF, DARPA, DHS, Microsoft, IBM, IMS. • Collaborations with BBN, Aptima, BAE, IBM, SRI, ... MIT, Stanford, CMU, Princeton, UPenn, UWash, ... • ~70 compute servers, ~10 Tb disk storage

  5. Outline • Self & Lab Intro • Information Extraction and Data Mining • Research vignette: Conditional Random Fields • Research vignette: Social network analysis & Topic models • Demo: Rexa, a new Web portal for research literature • Future work, brainstorming, collaboration, next steps

  6. Goal of my research: Mine actionable knowledge from unstructured text.

  7. Extracting Job Openings from the Web. Example extracted record (foodscience.com-Job2): JobTitle: Ice Cream Guru; Employer: foodscience.com; JobCategory: Travel/Hospitality; JobFunction: Food Services; JobLocation: Upper Midwest; Contact Phone: 800-488-2611; DateExtracted: January 8, 2001; Source: www.foodscience.com/jobs_midwest.html; OtherCompanyJobs: foodscience.com-Job1
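
A minimal sketch of the kind of structured record this extraction produces; the schema below is hypothetical, simply mirroring the fields on the slide:

```python
# Hypothetical schema for an extracted job posting (field names taken
# from the slide; this is an illustration, not the actual system's types).
from dataclasses import dataclass

@dataclass
class JobPosting:
    job_title: str
    employer: str
    job_category: str
    job_function: str
    job_location: str
    contact_phone: str
    source_url: str

posting = JobPosting(
    job_title="Ice Cream Guru",
    employer="foodscience.com",
    job_category="Travel/Hospitality",
    job_function="Food Services",
    job_location="Upper Midwest",
    contact_phone="800-488-2611",
    source_url="www.foodscience.com/jobs_midwest.html",
)
print(posting.job_title)
```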

  8. A Portal for Job Openings

  9. Job Openings: Category = High Tech, Keyword = Java, Location = U.S.

  10. Data Mining the Extracted Job Information

  11. IE from Chinese Documents regarding Weather. Department of Terrestrial System, Chinese Academy of Sciences. 200k+ documents, several millennia old: Qing Dynasty archives, memos, newspaper articles, diaries.

  12. IE from Research Papers [McCallum et al ‘98]

  13. IE from Research Papers

  14. Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004] [Giles et al]

  15. What is "Information Extraction"? As a family of techniques: Information Extraction = segmentation + classification + association + clustering. Example text: October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying... Extracted mentions: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation.


  18. What is "Information Extraction"? As a family of techniques: Information Extraction = segmentation + classification + association + clustering. (Same passage and extracted mentions as above, now associated into records:)

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation
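
A minimal sketch of the association step: grouping already-tagged mentions into relational records like the table above. The tags here are hand-supplied for illustration; a real system would produce them with a trained extractor:

```python
# Toy association: a NAME mention starts a new record; TITLE and
# ORGANIZATION mentions attach to the current record (greedy grouping).
mentions = [
    ("Bill Gates", "NAME"), ("CEO", "TITLE"), ("Microsoft", "ORGANIZATION"),
    ("Bill Veghte", "NAME"), ("VP", "TITLE"), ("Microsoft", "ORGANIZATION"),
    ("Richard Stallman", "NAME"), ("founder", "TITLE"),
    ("Free Software Foundation", "ORGANIZATION"),
]

records, current = [], {}
for text, label in mentions:
    if label == "NAME" and current:   # a new NAME closes the open record
        records.append(current)
        current = {}
    current[label] = text
records.append(current)

for r in records:
    print(r["NAME"], "|", r.get("TITLE"), "|", r.get("ORGANIZATION"))
```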

  19. Larger Context. Pipeline: Spider -> Filter -> Document collection -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (Discover patterns: entity types, links/relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support).
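
A minimal structural sketch of this pipeline as composed stages; every stage below is a toy placeholder (invented criteria and a trivial extractor), not the lab's actual system:

```python
import re

def spider(seed_urls):
    # stand-in for a crawler: pretend each URL yields one document
    return ["Jones, a Microsoft VP, said ..." for _ in seed_urls]

def filter_docs(docs):
    # keep only documents that look relevant (toy criterion)
    return [d for d in docs if "Microsoft" in d]

def extract(docs):
    # IE stage (segment/classify/associate/cluster); here a toy
    # capitalized-word "entity" extractor stands in for a real one
    return [{"entities": re.findall(r"[A-Z][a-z]+", d), "source": d}
            for d in docs]

def mine(database):
    # data-mining stage: discover patterns over the extracted records
    counts = {}
    for rec in database:
        for e in rec["entities"]:
            counts[e] = counts.get(e, 0) + 1
    return counts

print(mine(extract(filter_docs(spider(["http://example.org"])))))
```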

  20. Outline • Self & Lab Intro • Information Extraction and Data Mining • Research vignette: Conditional Random Fields • Research vignette: Social network analysis & Topic models • Demo: Rexa, a new Web portal for research literature • Future work, brainstorming, collaboration, next steps

  21. Hidden Markov Models. HMMs: the standard sequence modeling tool in genomics, music, speech, NLP, ... Both a graphical model and a finite state model: hidden states s_{t-1} -> s_t -> s_{t+1} with transitions between them, each state emitting an observation o_t. Generates a state sequence and an observation sequence o_1, o_2, ..., o_8. Parameters, for all states S = {s_1, s_2, ...}: start state probabilities P(s_1); transition probabilities P(s_t | s_{t-1}); observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet. Training: maximize the probability of the training observations (with a prior).

  22. IE with Hidden Markov Models. Given a sequence of observations: "Yesterday Rich Caruana spoke this example sentence." and a trained HMM with states for person name, location name, and background, find the most likely state sequence (Viterbi). Any words generated by the designated "person name" state are extracted as a person name: Person name: Rich Caruana
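
To make the decoding step concrete, a minimal Viterbi sketch over the three states above; all probabilities are invented for illustration, not a trained model:

```python
import numpy as np

states = ["background", "person", "location"]
obs = "Yesterday Rich Caruana spoke this example sentence .".split()

# toy parameters: uniform start, sticky transitions, and an emission
# score that prefers capitalized words for name states
start = np.log(np.full(3, 1 / 3))
trans = np.log(np.array([[0.8, 0.1, 0.1],
                         [0.3, 0.6, 0.1],
                         [0.3, 0.1, 0.6]]))

def emit(word):
    cap = word[:1].isupper()
    p = np.array([0.6, 0.3 if cap else 0.05, 0.1 if cap else 0.05])
    return np.log(p / p.sum())

# standard Viterbi recursion with back-pointers
T, S = len(obs), len(states)
delta = np.zeros((T, S)); back = np.zeros((T, S), dtype=int)
delta[0] = start + emit(obs[0])
for t in range(1, T):
    scores = delta[t - 1][:, None] + trans      # scores[i, j]: i -> j
    back[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + emit(obs[t])

path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
path.reverse()
print(list(zip(obs, (states[s] for s in path))))
```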

  23. (Linear-Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001] (500 citations). An undirected graphical model over a finite-state structure, trained to maximize the conditional probability of an output sequence given an input sequence:

P(y|x) = (1/Z(x)) * prod_t exp( sum_k lambda_k f_k(y_{t-1}, y_t, x, t) )

where Z(x) normalizes by summing the same product over all possible label sequences. Example: input sequence x = "... said Jones a Microsoft VP ...", output sequence y = "... OTHER PERSON OTHER ORG TITLE ...". Widespread interest and positive experimental results in many applications: noun phrase and named entity tagging [HLT'03], [CoNLL'03]; protein structure prediction [ICML'04]; IE from bioinformatics text [Bioinformatics '04]; Asian word segmentation [COLING'04], [ACL'04]; IE from research papers [HLT'04]; object classification in images [CVPR '04].
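
As a concrete companion to the formula above, a minimal sketch that evaluates the linear-chain CRF conditional probability, computing Z(x) with the forward algorithm in log space; the unary and pairwise scores are random stand-ins for the learned feature sums, not a trained model:

```python
import numpy as np

def crf_log_prob(unary, pairwise, y):
    """unary: T x S per-position/state scores (sum_k lambda_k f_k);
    pairwise: S x S transition scores; y: a label sequence of length T."""
    T, S = unary.shape
    # unnormalized score of the given label path
    score = unary[0, y[0]] + sum(
        pairwise[y[t - 1], y[t]] + unary[t, y[t]] for t in range(1, T))
    # log Z(x) via the forward recursion
    alpha = unary[0].copy()
    for t in range(1, T):
        alpha = unary[t] + np.logaddexp.reduce(
            alpha[:, None] + pairwise, axis=0)
    log_Z = np.logaddexp.reduce(alpha)
    return score - log_Z

rng = np.random.default_rng(0)
unary, pairwise = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
print(np.exp(crf_log_prob(unary, pairwise, [0, 1, 1, 2, 0])))
```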

  24. Table Extraction from Government Reports. Example report:

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
        Number of       Production of Milk and Milkfat 2/
Year    Milk Cows 1/    Per Milk Cow         Percentage of Fat      Total
                        Milk      Milkfat    in All Milk Produced   Milk      Milkfat
        (1,000 Head)    (Pounds)  (Pounds)   (Percent)              (Million Pounds)
--------------------------------------------------------------------------------
1993    9,589           15,704    575        3.66                   150,582   5,514.4
1994    9,500           16,175    592        3.66                   153,664   5,623.7
1995    9,461           16,451    602        3.66                   155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

  25. Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]. 100+ documents from www.fedstats.gov (such as the milk report above), labeled line by line and modeled with a CRF. Labels (12 in all): Non-Table, Table Title, Table Header, Table Data Row, Table Section Data Row, Table Footnote, ... Features: percentage of digit chars; percentage of alpha chars; indented; contains 5+ consecutive spaces; whitespace in this line aligns with previous line; ...; plus conjunctions of all previous features at time offsets {0,0}, {-1,0}, {0,1}, {1,2}.
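
A minimal sketch of line-level feature functions like those listed above; the slide gives only the feature names, so the implementations below are plausible guesses at what was measured, not the paper's code:

```python
def line_features(line, prev_line=""):
    n = max(len(line), 1)
    digits = sum(c.isdigit() for c in line)
    alphas = sum(c.isalpha() for c in line)
    return {
        "pct_digit_chars": digits / n,
        "pct_alpha_chars": alphas / n,
        "indented": line.startswith((" ", "\t")),
        "has_5plus_spaces": "     " in line,
        # crude alignment check: whitespace at the same columns as the
        # previous line (real feature likely more careful than this)
        "aligns_with_prev": any(
            i < len(prev_line) and line[i] == " " == prev_line[i]
            for i in range(len(line))),
    }

row = "1993       :      9,589       15,704        575"
print(line_features(row, prev_line="Year :   of  :  Per Milk Cow"))
```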

  26. Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]. Line labels, percent correct: HMM 65%; stateless MaxEnt 85%; CRF 95%.

  27. IE from Research Papers [McCallum et al ‘99]

  28. IE from Research Papers. Field-level F1: Hidden Markov Models (HMMs) 75.6 [Seymore, McCallum, Rosenfeld, 1999]; Support Vector Machines (SVMs) 89.7 [Han, Giles, et al, 2003]; Conditional Random Fields (CRFs) 93.9 [Peng, McCallum, 2004], a roughly 40% error reduction. (Word-level accuracy is >99%.)

  29. Other Successful Applications of CRFs • Information extraction of gene & protein names from text • Winning teams from UPenn, UWisc, Stanford • ...using UMass CRF software • Gene finding in DNA sequences • [Culotta, Kulp, McCallum 2005] • New work at UPenn • Computer vision, OCR, music, robotics, reference matching, author resolution, protein fold recognition, ...

  30. Automatically Annotating MedLine Abstracts

  31. CRF String Edit Distance. Want to train from a set of string pairs, each labeled one of {match, non-match}:

match      "William W. Cohon"   "Willlleam Cohen"
non-match  "Bruce D'Ambrosio"   "Bruce Croft"
match      "Tommi Jaakkola"     "Tommi Jakola"
match      "Stuart Russell"     "Stuart Russel"
non-match  "Tom Deitterich"     "Tom Dean"

The model aligns string x1 to string x2 through a sequence of edit operations (copy, subst, insert, delete) over character positions; e.g. the first pair aligns via eleven copies, two substitutions, one insertion, and three deletions. Training maximizes the conditional (rather than joint) complete-data likelihood.
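
For intuition, a minimal sketch of plain edit-distance alignment that recovers the copy/subst/insert/delete operations; the CRF version replaces these fixed unit costs with learned, feature-dependent scores, so this is the classical baseline, not the model itself:

```python
def edit_alignment(s1, s2):
    m, n = len(s1), len(s2)
    # standard Levenshtein dynamic program
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): d[i][0] = i
    for j in range(n + 1): d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
    # trace back one optimal path of operations
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i and j and d[i][j] == d[i-1][j-1] + (s1[i-1] != s2[j-1]):
            ops.append("copy" if s1[i-1] == s2[j-1] else "subst"); i -= 1; j -= 1
        elif i and d[i][j] == d[i-1][j] + 1:
            ops.append("delete"); i -= 1
        else:
            ops.append("insert"); j -= 1
    return d[m][n], ops[::-1]

print(edit_alignment("William W. Cohon", "Willlleam Cohen"))
```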

  32. Outline • Self & Lab Intro • Information Extraction and Data Mining • Research vignette: Conditional Random Fields • Research vignette: Social network analysis & Topic models • Demo: Rexa, a new Web portal for research literature • Future work, brainstorming, collaboration, next steps

  33. Managing and Understanding Connections of People in our Email World. Workplace effectiveness ~ ability to leverage a network of acquaintances. But filling a Contacts DB by hand is tedious and incomplete; instead, fill it automatically from the email inbox and the WWW.

  34. System Overview. From email, a CRF performs person name extraction; homepage retrieval over the WWW then feeds contact info and person name extraction, keyword extraction, and name coreference; social network analysis runs over the resulting contacts.

  35. An Example To: “Andrew McCallum” mccallum@cs.umass.edu Subject ... Search for new people

  36. Summary of Results. Example keywords extracted; contact info and name extraction performance (25 fields). Applications. Expert finding: when solving some task, find friends-of-friends with relevant expertise; avoid "stove-piping" in large organizations by automatically suggesting collaborators; given a task, automatically suggest the right team for the job (a hiring aid!). Social network analysis: understand the social structure of your organization; suggest structural changes for improved efficiency.

  37. Social Network in an Email Dataset

  38. Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]. Generative process, for each document: sample a distribution over topics, theta; then, for each word in the document, sample a topic z (from theta) and sample a word w from that topic. Example: a document's topic distribution might be 70% "Iraq war" and 30% "US election"; sampling the topic "Iraq war" and then a word from it might yield "bombing".
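
A minimal sketch of this generative story with a toy two-topic vocabulary; the topics, word probabilities, and document length are all invented for illustration:

```python
import numpy as np
rng = np.random.default_rng(42)

topics = {
    "iraq_war":    {"bombing": 0.4, "troops": 0.4, "baghdad": 0.2},
    "us_election": {"ballot": 0.5, "senate": 0.3, "campaign": 0.2},
}
names = list(topics)

theta = rng.dirichlet(alpha=[1.0, 1.0])      # per-document topic mixture
doc = []
for _ in range(8):                           # for each word position
    z = rng.choice(len(names), p=theta)      # sample a topic z ~ theta
    words = topics[names[z]]                 # that topic's word distribution
    w = rng.choice(list(words), p=list(words.values()))  # sample a word w
    doc.append(w)
print(theta, doc)
```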

  39. Example topics induced from a large collection of text [Tennenbaum et al]:
• JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
• SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
• BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
• FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
• STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
• MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
• DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
• WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER

  41. From LDA to Author-Recipient-Topic (ART)

  42. Inference and Estimation • Gibbs sampling: easy to implement, reasonably fast
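
A minimal sketch of collapsed Gibbs sampling for plain LDA on a toy corpus, resampling each token's topic from its conditional given all other assignments; the corpus, K, and the hyperparameters alpha/beta are invented, and this is the standard LDA sampler, not the ART model's:

```python
import numpy as np
rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 0], [1, 1, 2, 3]]   # toy word ids
K, V, alpha, beta = 2, 4, 0.5, 0.1

z = [[rng.integers(K) for _ in d] for d in docs]    # random topic init
ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1

for _ in range(200):                                # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                             # remove current token
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # conditional p(z = k | everything else)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())        # resample its topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

print(nkw / nkw.sum(axis=1, keepdims=True))         # topic-word estimates
```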

  43. Enron Email Corpus • 250k email messages • 23k people. Example: Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT); From: debra.perlingiere@enron.com; To: steve.hooser@enron.com; Subject: Enron/TransAlta Contract dated Jan 1, 2001. "Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions. DP" Debra Perlingiere, Enron North America Corp., Legal Department, 1400 Smith Street, EB 3885, Houston, Texas 77002, dperlin@enron.com

  44. Topics, and prominent senders / receivers discovered by ART. (Topic names assigned by hand.)

  45. Topics, and prominent senders / receivers discovered by ART. Beck = "Chief Operations Officer"; Dasovich = "Government Relations Executive"; Shapiro = "Vice President of Regulatory Affairs"; Steffes = "Vice President of Government Affairs"

  46. Comparing Role Discovery. Connection strength(A,B) is computed from a different representation in each model. Traditional SNA: each person's distribution over recipients. ART: each person's distribution over authored topics (conditioned on recipients). Author-Topic: the distribution over authored topics alone.
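
One way to make this comparison concrete: measure how similar two people are under each view by comparing their probability distributions (over recipients for SNA, over authored topics for ART). A minimal sketch using Jensen-Shannon divergence, one reasonable choice of metric; all numbers below are invented placeholders:

```python
import numpy as np

def js_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# person A and person B under each representation (toy distributions)
sna = ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])   # distributions over recipients
art = ([0.1, 0.8, 0.1], [0.8, 0.1, 0.1])   # distributions over topics

print("SNA view:", js_divergence(*sna))    # small -> similar under SNA
print("ART view:", js_divergence(*art))    # large -> different under ART
```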

  47. Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty. Traditional SNA: similar roles (they correspond with many of the same people). ART and Author-Topic: different roles. In fact, Geaconne = "Secretary", McCarty = "Vice President".

  48. Comparing Role Discovery: Lynn Blair vs. Kimberly Watson. Traditional SNA: very different. ART: very similar. Author-Topic: different roles. In fact their roles are closely related: Blair = "Gas pipeline logistics", Watson = "Pipeline facilities planning".

  49. ART: Roles but not Groups (Enron TransWestern Division). Traditional SNA: block structured, revealing groups. ART and Author-Topic: not block structured. ART discovers roles, not groups.

  50. Groups and Topics • Input: • Observed relations between people • Attributes on those relations (text, or categorical) • Output: • Attributes clustered into "topics" • Groups of people, varying depending on topic
