
UMass and Learning for CALO



Presentation Transcript


  1. UMass and Learning for CALO. Andrew McCallum, Information Extraction & Synthesis Laboratory, Department of Computer Science, University of Massachusetts

  2. Outline
  • CC-Prediction: learning in the wild from user email usage
  • DEX: learning in the wild from user corrections... as well as KB records filled by other CALO components
  • Rexa: learning in the wild from user corrections to coreference... propagating constraints in a Markov-Logic-like system that scales to ~20 million objects
  • Several new topic models: discover interesting, useful structure without the need for supervision... learning from newly arrived data on the fly

  3. CC Prediction Using Various Exponential Family Factor Graphs. Learning to keep an organization connected & avoid stove-piping. First steps toward ad-hoc team creation. Learning in the wild from the user's CC behavior, and from other parts of the CALO ontology.

  4. Graphical Models for Email
  • Compute P(y|x) for CC prediction, where y is a candidate recipient of the email
  • [Figure: factor graph for the email model, with plates over the Nb words in the body (xb), the Ns words in the subject (xs), and the Nr-1 other recipients (xr); local-function nodes connect each observed variable to the recipient variable y]
  • The graph describes the joint distribution of random variables in terms of a product of local functions
  • Local functions facilitate system engineering through modularity
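The modularity described here can be illustrated with a minimal, hypothetical sketch (not the CALO implementation): each local function scores one source of evidence (body words, subject words, other recipients), and the unnormalized score for a candidate recipient is their sum in log space, mirroring P(y|x) in an exponential-family factor graph.

```python
import math
from collections import defaultdict

# Minimal sketch of modular CC-prediction scoring (hypothetical; not the CALO code).
# Each "local function" contributes log-weights for one evidence source, and the
# unnormalized score for a candidate recipient is the sum of those contributions.

def body_factor(weights, body_words, recipient):
    # Local function over body words x_b and candidate recipient y.
    return sum(weights.get(("body", w, recipient), 0.0) for w in body_words)

def subject_factor(weights, subject_words, recipient):
    # Local function over subject words x_s and candidate recipient y.
    return sum(weights.get(("subj", w, recipient), 0.0) for w in subject_words)

def recipient_factor(weights, other_recipients, recipient):
    # Local function over the other N_r - 1 recipients x_r and candidate y.
    return sum(weights.get(("co", r, recipient), 0.0) for r in other_recipients)

def cc_distribution(weights, email, candidates):
    """Return P(y|x) over candidate recipients as a softmax of factor scores."""
    scores = {
        y: body_factor(weights, email["body"], y)
           + subject_factor(weights, email["subject"], y)
           + recipient_factor(weights, email["recipients"], y)
        for y in candidates
    }
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: math.exp(s - z) for y, s in scores.items()}

# Example usage with toy weights (all names hypothetical).
weights = defaultdict(float, {("body", "calo", "melinda"): 1.2, ("co", "ray", "melinda"): 0.8})
email = {"body": ["calo", "meeting"], "subject": ["status"], "recipients": ["ray"]}
print(cc_distribution(weights, email, ["melinda", "adam"]))
```

Adding a new evidence source is then a matter of adding one more local function to the sum, which is the modularity benefit the slide highlights.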

  5. Document Models
  • Models may include relational attributes
  • [Figure: factor graph for a document model, with y the author of the document and plates over the title words (xt), abstract words (xs), body words (xb), the Na-1 co-authors (xr), and the Nr references]
  • We can optimize P(y|x) for classification performance, and P(x|y) for model interpretability and parameter transfer (to other models)

  6. CC Prediction and Relational Attributes
  • [Figure: factor graph extending the email model with relational variables: target recipient y, body words (Nb), subject words (Ns), the Nr-1 other recipients, thread-relation variables (Ntr), and recipient-relation variables]
  • Thread relations, e.g.: Was a given recipient ever included on this email thread?
  • Recipient relationships, e.g.: Does one of the other recipients report to the target recipient? (A sketch of such relational features follows below.)
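A minimal sketch of the two relational features named on the slide, assuming hypothetical thread-history and org-chart lookups (neither is the actual CALO/IRIS API):

```python
# Hypothetical relational features for CC prediction; the data structures are
# illustrative stand-ins, not the CALO/IRIS knowledge base.

def on_thread_feature(thread_history, thread_id, candidate):
    """1.0 if the candidate recipient was ever included on this email thread."""
    return 1.0 if candidate in thread_history.get(thread_id, set()) else 0.0

def reports_to_feature(org_chart, other_recipients, candidate):
    """1.0 if any of the other recipients reports to the candidate recipient."""
    return 1.0 if any(org_chart.get(r) == candidate for r in other_recipients) else 0.0

# Example usage with toy data.
thread_history = {"thread-42": {"melinda", "ray"}}
org_chart = {"adam": "melinda"}          # adam reports to melinda
print(on_thread_feature(thread_history, "thread-42", "melinda"))   # 1.0
print(reports_to_feature(org_chart, ["adam"], "melinda"))          # 1.0
```

These binary features would enter the model as additional local functions, exactly like the word features in the earlier sketch.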

  7. CC-Prediction Learning in the Wild
  • As documents are added to Rexa, models of expertise for authors grow
  • As DEX obtains more contact information and keywords, organizational relations emerge
  • Model parameters can be adapted on-line
  • Priors on parameters can be used to transfer learned information between models (see the sketch below)
  • New relations can be added on-line
  • Modular model construction and intelligent model optimization enable these goals
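One standard way to realize "priors on parameters transfer learned information" is a Gaussian prior centered on previously learned weights, so a new model is regularized toward the old one rather than toward zero. A minimal sketch under that assumption (not necessarily the mechanism used in CALO):

```python
import numpy as np

def transfer_regularized_loss(w_new, X, y, w_prior, sigma2=1.0):
    """Logistic loss plus a Gaussian prior centered on previously learned weights.

    w_prior are weights learned on an earlier model/task; penalizing distance
    from them (rather than from zero) transfers that information to w_new.
    """
    logits = X @ w_new
    nll = np.sum(np.log1p(np.exp(-y * logits)))          # labels y in {-1, +1}
    prior = np.sum((w_new - w_prior) ** 2) / (2.0 * sigma2)
    return nll + prior

# Toy usage: 3 features, 4 examples, prior weights carried over from an old model.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([1, -1, 1, -1])
w_prior = np.array([0.5, -0.2, 0.0])
print(transfer_regularized_loss(np.zeros(3), X, y, w_prior))
```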

  8. CC Prediction: Upcoming Work on Multi-Conditional Learning. A discriminatively-trained topic model, discovering low-dimensional representations for transfer learning and improved regularization & generalization.

  9. Objective Functions for Parameter Estimation
  Traditional:
  • Joint training (e.g., naive Bayes, most topic models)
  • Traditional mixture models (e.g., LDA)
  • Conditional training (e.g., MaxEnt classifiers, CRFs)
  • Conditional mixtures (e.g., Jebara's CEM, McCallum's CRF string edit distance, ...)
  New, multi-conditional:
  • Multi-conditional (mostly conditional, with generative regularization)
  • Multi-conditional (for semi-supervised learning)
  • Multi-conditional (for transfer learning: 2 tasks, shared hidden variables)

  10. “Multi-Conditional Learning” (Regularization) [McCallum, Pal, Wang, 2006]
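As a sketch of the idea (the exact weighting in McCallum, Pal & Wang 2006 may differ), multi-conditional learning maximizes a weighted combination of conditional log-likelihoods rather than a single joint or conditional objective, e.g. for labels y and inputs x:

```latex
% Multi-conditional objective: a weighted mix of conditional likelihoods, so the
% discriminative term P(y|x) is regularized by a generative-style term P(x|y).
% (Sketch; the weights alpha and beta are tuning parameters.)
\mathcal{O}(\Theta) \;=\; \alpha \sum_{i} \log P(y_i \mid x_i; \Theta)
\;+\; \beta \sum_{i} \log P(x_i \mid y_i; \Theta)
```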

  11. Predictive Random Fields: mixture of Gaussians on synthetic data [McCallum, Wang, Pal, 2005]
  [Figure: panels comparing the synthetic data (classify by color) with a generatively trained model, a multi-conditionally trained model, and a conditionally trained model [Jebara 1998]]

  12. Multi-Conditional Mixtures vs. Harmonium on a document retrieval task [McCallum, Wang, Pal, 2005]
  [Figure: retrieval results comparing four models: multi-conditional (multi-way conditionally trained); conditionally trained to predict class labels; Harmonium trained jointly with class labels and words; Harmonium trained jointly with words only, no labels]

  13. DEX Beginning with a review of previous work, then new work on record extraction, with the ability to leverage new KBs in the wild, and for transfer

  14. System Overview
  [Figure: DEX system diagram with components: person name extraction from email (CRF), contact info extraction, homepage retrieval from the WWW, keyword extraction, name coreference, and social network analysis]

  15. An Example
  [Figure: an email with header To: "Andrew McCallum" mccallum@cs.umass.edu, Subject: ...; the extracted name is used to search for new people]

  16. Summary of Results
  [Figure: example keywords extracted, and contact info and name extraction performance over 25 fields]
  Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid "stove-piping" in large organizations by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
  Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

  17. Importance of Accurate DEX Fields in IRIS
  • Information about people: contact information, email, affiliation, job title, expertise, ...
  • These fields are key to answering many CALO questions, both directly and as supporting inputs to higher-level questions.

  18. Learning Field Compatibilities in DEX
  Source text: "Professor Jane Smith, University of California, 209-555-5555. Professor Smith chairs the Computer Science Department. She hails from Boston, ... her administrative assistant ... John Doe, Administrative Assistant, University of California, 209-444-4444"
  Extracted record:
  • Name: Jane Smith, John Doe
  • JobTitle: Professor, Administrative Assistant
  • Company: U of California
  • Department: Computer Science
  • Phone: 209-555-5555, 209-444-4444
  • City: Boston
  [Figure: compatibility graph over the extracted field values, with positive and negative learned edge weights (e.g. .8, .4, -.5, -.6) indicating which values belong together]

  19. Learning Field Compatibilities in DEX
  [Figure: the same source text and extracted record as the previous slide, with the individual field values (Jane Smith, University of California, 209-555-5555, Professor, Computer Science, Boston, Administrative Assistant, John Doe, 209-444-4444) laid out as nodes of the compatibility graph]

  20. Learning Field Compatibilities in DEX
  • ~35% error reduction over transitive closure
  • Qualitatively better than the heuristic approach
  • Mine knowledge bases from other parts of IRIS to learn compatibility rules among fields:
  • "Professor" job title co-occurs with "University" company
  • Area code / city compatibility
  • "Senator" job title co-occurs with "Washington, D.C." location
  • In the wild: as the user adds new fields & makes corrections, DEX learns from this KB data
  • Transfer learning between departments/industries
  (A sketch of compatibility scoring follows below.)
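A minimal, hypothetical sketch of the idea: score pairs of extracted field values with learned (here hand-set) compatibility weights, then group values whose pairwise scores are positive into the same person record. The weights and the greedy grouping are illustrative only, not the DEX algorithm.

```python
# Hypothetical compatibility weights between (field, value-pattern) pairs; in DEX
# these would be learned from IRIS knowledge-base records rather than hand-set.
COMPAT = {
    (("JobTitle", "Professor"), ("Company", "University")): 0.8,
    (("JobTitle", "Professor"), ("JobTitle", "Administrative Assistant")): -0.6,
    (("Phone", "209"), ("City", "Boston")): -0.5,   # area code / city mismatch
}

def compatibility(a, b):
    """Symmetric pairwise compatibility between two (field, value) items."""
    return COMPAT.get((a, b), COMPAT.get((b, a), 0.0))

def group_fields(items):
    """Greedy grouping: put two field values in the same record if their
    pairwise compatibility is positive (toy stand-in for graph partitioning)."""
    records = []
    for item in items:
        for rec in records:
            if all(compatibility(item, other) >= 0 for other in rec) and \
               any(compatibility(item, other) > 0 for other in rec):
                rec.append(item)
                break
        else:
            records.append([item])
    return records

items = [("JobTitle", "Professor"), ("Company", "University"),
         ("JobTitle", "Administrative Assistant")]
print(group_fields(items))   # Professor/University grouped; Administrative Assistant separate
```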

  21. Rexa: a knowledge base of publications, grants, people, their expertise, topics, and inter-connections
  • Learning for information extraction and coreference
  • Incrementally leveraging multiple sources of information for improved coreference
  • Gathering information about people's expertise, and co-author and citation relations
  • First a tour of Rexa, then slides about learning

  22. Previous Systems

  23. Previous Systems
  [Figure: entity-relation diagram with a single entity type (Research Paper) and a single relation (Cites)]

  24. More Entities and Relations
  [Figure: entity-relation diagram with entity types Research Paper, Person, Grant, Venue, University, Groups, and relations such as Cites and Expertise]

  25. Learning in Rexa
  • Extraction, coreference
  • In the wild: re-adjusting the KB after corrections from a user
  • Also, learning research topics/expertise, and their interconnections

  26. (Linear-Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
  Undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence:
  P(y|x) = (1/Z(x)) ∏_t exp( Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )
  where Z(x) normalizes by summing over all output sequences.
  [Figure: finite state model and graphical model; FSM states (e.g. OTHER, PERSON, ORG, TITLE) form the output sequence y_{t-1} ... y_{t+3} over the input observations x_{t-1} ... x_{t+3}, e.g. "said Jones a Microsoft VP ..."]
  Wide-spread interest, positive experimental results in many applications:
  • Noun phrase, named entity [HLT'03], [CoNLL'03]
  • Protein structure prediction [ICML'04]
  • IE from bioinformatics text [Bioinformatics '04], ...
  • Asian word segmentation [COLING'04], [ACL'04]
  • IE from research papers [HLT'04]
  • Object classification in images [CVPR '04]
  (500 citations)
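As a concrete illustration of the formula above, here is a small, self-contained sketch (not the Rexa implementation) that computes log P(y|x) for a linear-chain CRF from per-position emission scores and a transition matrix, using the forward algorithm in log space:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(emissions, transitions, labels):
    """log P(y|x) for a linear-chain CRF.

    emissions:   (T, K) array of per-position label scores (features dotted with weights)
    transitions: (K, K) array of label-to-label scores
    labels:      length-T sequence of gold label indices
    """
    T, K = emissions.shape
    # Unnormalized score of the gold path: emissions plus transitions.
    score = emissions[np.arange(T), labels].sum()
    score += sum(transitions[labels[t - 1], labels[t]] for t in range(1, T))
    # Forward algorithm for log Z(x): sum over all possible label sequences.
    alpha = emissions[0].copy()
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    log_z = logsumexp(alpha)
    return score - log_z

# Toy usage: 3 positions, 2 labels.
rng = np.random.default_rng(0)
emissions = rng.normal(size=(3, 2))
transitions = rng.normal(size=(2, 2))
print(crf_log_prob(emissions, transitions, labels=[0, 1, 1]))
```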

  27. IE from Research Papers [McCallum et al ‘99]

  28. IE from Research Papers
  Field-level F1:
  • Hidden Markov Models (HMMs): 75.6 [Seymore, McCallum, Rosenfeld, 1999]
  • Support Vector Machines (SVMs): 89.7 [Han, Giles, et al., 2003]
  • Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004] (~40% error reduction)
  (Word-level accuracy is >99%)

  29. Joint Segmentation and Co-reference
  Extraction from and matching of research paper citations.
  [Figure: factor graph jointly modeling two citation strings, e.g. "Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990." and "Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990."; variables include the observed strings (o), their segmentations (s), citation attributes (c), co-reference decisions (y), database field values, and world knowledge]
  • 35% reduction in co-reference error by using segmentation uncertainty
  • 6-14% reduction in segmentation error by using co-reference
  Inference: variant of Iterated Conditional Modes [Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003], [Besag, 1986]. (An ICM sketch follows below.)
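Iterated Conditional Modes itself is simple: repeatedly set each variable to the value that maximizes the model score given the current values of all other variables, until nothing changes. A generic, hypothetical sketch (the joint segmentation/co-reference model and score function are stand-ins):

```python
def iterated_conditional_modes(assignment, domains, local_score, max_sweeps=20):
    """Generic ICM: greedily re-assign one variable at a time to its best value
    given the current assignment of all other variables [Besag, 1986].

    assignment:  dict var -> current value
    domains:     dict var -> iterable of candidate values
    local_score: function(var, value, assignment) -> score of setting var=value
    """
    for _ in range(max_sweeps):
        changed = False
        for var, candidates in domains.items():
            best = max(candidates, key=lambda v: local_score(var, v, assignment))
            if best != assignment[var]:
                assignment[var] = best
                changed = True
        if not changed:          # local maximum reached
            break
    return assignment

# Toy usage: two coupled binary variables that prefer to agree and, more strongly, prefer 1.
def score(var, value, assignment):
    other = assignment["b" if var == "a" else "a"]
    return (1 if value == other else 0) + 2 * value

print(iterated_conditional_modes({"a": 0, "b": 0}, {"a": [0, 1], "b": [0, 1]}, score))
```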

  30. Rexa: Learning in the Wild from User Feedback
  • Coreference will never be perfect.
  • Rexa allows users to enter corrections to coreference decisions
  • Rexa then uses this feedback to re-consider other inter-related parts of the KB, and to automatically make further error corrections by propagating constraints (a toy sketch follows below)
  • (Our coreference system uses underlying ideas very much like Markov Logic, and scales to ~20 million mention objects.)
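To make "propagating constraints" concrete, here is a toy sketch (assumed mechanics, not the Rexa system): a user correction is recorded as a must-link or cannot-link constraint over mentions, must-links are merged with union-find, and any cluster that now violates a cannot-link is flagged so the correction ripples to related entities.

```python
class CorefConstraints:
    """Toy constraint propagation for coreference corrections (illustrative only)."""

    def __init__(self, mentions):
        self.parent = {m: m for m in mentions}
        self.cannot = set()                      # mention pairs asserted to differ

    def find(self, m):
        while self.parent[m] != m:
            self.parent[m] = self.parent[self.parent[m]]   # path compression
            m = self.parent[m]
        return m

    def must_link(self, a, b):
        """User says a and b are the same entity: merge their clusters."""
        self.parent[self.find(a)] = self.find(b)

    def cannot_link(self, a, b):
        """User says a and b are different entities: record the constraint."""
        self.cannot.add(frozenset((a, b)))

    def violations(self):
        """Pairs whose current clusters violate a cannot-link and must be re-split."""
        return [tuple(p) for p in self.cannot
                if self.find(list(p)[0]) == self.find(list(p)[1])]

# Toy usage: three mentions of "A. McCallum"; the user corrects one decision.
c = CorefConstraints(["m1", "m2", "m3"])
c.must_link("m1", "m2")            # system had already merged m1 and m2
c.must_link("m2", "m3")            # and m3
c.cannot_link("m1", "m3")          # user correction: m1 and m3 are different people
print(c.violations())              # the cluster containing m1/m3 must be re-split
```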

  31. Finding Topics in 1 Million CS Papers. 200 topics & keywords automatically discovered.

  32. Topical Transfer: citation counts from one topic to another. Map the "producers and consumers".

  33. Topical Diversity
  Find the topics that are cited by many other topics, measuring diversity of impact.
  Metric: entropy of the topic distribution among papers that cite this paper (this topic).
  [Figure: example topics at the low-diversity and high-diversity ends of the scale] (A small entropy sketch follows below.)
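The diversity metric is just the Shannon entropy of the citing-topic distribution; a minimal sketch of computing it for one paper (the citing-topic counts are made up):

```python
import numpy as np

def topic_diversity(citing_topic_counts):
    """Shannon entropy (in bits) of the topic distribution over citing papers.

    Higher entropy = citations spread over many topics = more diverse impact.
    """
    counts = np.asarray(citing_topic_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                          # ignore topics with zero citations
    return float(-(p * np.log2(p)).sum())

# Toy usage: paper A is cited mostly from one topic, paper B from four topics evenly.
print(topic_diversity([97, 1, 1, 1]))     # low diversity (~0.24 bits)
print(topic_diversity([25, 25, 25, 25]))  # high diversity (2.0 bits)
```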

  34. Some New Work on Topic Models
  • Robustly capturing topic correlations: Pachinko Allocation Model
  • Capturing phrases in topic-specific ways: Topical N-Gram Model

  35. Pachinko Machine

  36. Pachinko Allocation Model [Li, McCallum, 2005]
  [Figure: model structure (not the graphical model), a DAG of nodes arranged in levels: a root, nodes holding distributions over distributions over topics, nodes holding distributions over topics (mixtures, representing topic correlations), leaf nodes holding distributions over words (like "LDA topics"), and the words themselves (word1 ... word8)]
  Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).
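A toy generative sketch of a four-level PAM as depicted (root, super-topics, sub-topics, word distributions); the Dirichlet parameters and vocabulary are invented, and real PAM inference (Gibbs sampling) is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy four-level PAM: root -> super-topics -> sub-topics -> words.
VOCAB = ["word1", "word2", "word3", "word4", "word5", "word6", "word7", "word8"]
N_SUPER, N_SUB = 2, 4

# Per-document multinomials at interior nodes are drawn from Dirichlets; the
# sub-topic word distributions are shared across documents, as in LDA.
alpha_root = np.ones(N_SUPER)                  # Dirichlet over super-topics
alpha_super = np.ones((N_SUPER, N_SUB))        # one Dirichlet over sub-topics per super-topic
phi = rng.dirichlet(np.ones(len(VOCAB)) * 0.5, size=N_SUB)   # word distributions

def generate_document(n_words=10):
    theta_root = rng.dirichlet(alpha_root)                    # document's super-topic mixture
    theta_super = [rng.dirichlet(a) for a in alpha_super]     # document's sub-topic mixtures
    words = []
    for _ in range(n_words):
        s = rng.choice(N_SUPER, p=theta_root)        # sample a super-topic
        z = rng.choice(N_SUB, p=theta_super[s])      # then a sub-topic given the super-topic
        words.append(VOCAB[rng.choice(len(VOCAB), p=phi[z])])   # then a word
    return words

print(generate_document())
```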

  37. Topic Coherence Comparison
  PAM, 100 topics ("estimation"): estimation bayesian parameters data methods estimate maximum probabilistic distributions noise variable variables noisy inference variance entropy models framework statistical estimating
  LDA, 100 topics ("estimation, some junk"): estimation likelihood maximum noisy estimates mixture scene surface normalization generated measurements surfaces estimating estimated iterative combined figure divisive sequence ideal
  LDA, 20 topics ("models, estimation, stopwords"): models model parameters distribution bayesian probability estimation data gaussian methods likelihood em mixture show approach paper density framework approximation markov
  Example super-topic (weight, then top words of each sub-topic):
  • 33: input hidden units function number
  • 27: estimation bayesian parameters data methods
  • 24: distribution gaussian markov likelihood mixture
  • 11: exact kalman full conditional deterministic
  • 1: smoothing predictive regularizers intermediate slope

  38. Topic Correlations in PAM. 5000 research paper abstracts, from across all of CS. Numbers on edges are the super-topics' Dirichlet parameters.
