School of Computer Science. Information Extraction with HMM Structures Learned by Stochastic Optimization. Dayne Freitag and Andrew McCallum. Presented by Tal Blum for the course: Machine Learning Approaches to Information Extraction and Information Integration.
Outline • Background on HMM transition structure selection • The algorithm for the sparse IE task • Comparison between their algorithm and the Borkar et al. algorithm • Discussion • Results
HMMs for IE • Have been used successfully in many tasks: • Speech Recognition • Information Extraction (Bikel et al., Borkar et al.) • IE in Bioinformatics (Leek) • POS Tagging (Ratnaparkhi)
Sparse Extraction Task • Fields are extracted from a long document • Most of the document is irrelevant • Examples: • Named entities (NE) • Conference time & location
HMM as a Dynamic BN • [Figure: an HMM unrolled as a Bayesian network, with states S1, S2, S3 emitting observations Obs1, Obs2, Obs3 over time t, shown next to a general BN over variables X, Y, Z, W] • Learning HMM structure?
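To make the HMM view concrete, here is a minimal sketch (not the paper's code) of a two-state IE-style HMM with Viterbi decoding; all probabilities, the tiny vocabulary, and the state names "background"/"target" are invented for illustration.

```python
import numpy as np

# Illustrative two-state HMM: "background" tokens vs. "target" (extracted) tokens.
# All numbers below are made-up for the example, not learned from data.
states = ["background", "target"]
A = np.array([[0.8, 0.2],    # transition matrix: row i = from-state i
              [0.4, 0.6]])
pi = np.array([0.95, 0.05])  # initial state distribution
vocab = {"the": 0, "seminar": 1, "3pm": 2}
B = np.array([[0.6, 0.3, 0.1],   # emission probabilities per state
              [0.1, 0.2, 0.7]])

def viterbi(obs):
    """Return the most probable state path for a list of token ids."""
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # scores[i, j] = prob of being in i at t-1, moving to j, emitting obs[t]
        scores = delta[t - 1][:, None] * A * B[None, :, obs[t]]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

path = viterbi([vocab["the"], vocab["seminar"], vocab["3pm"]])
print(path)  # -> ['background', 'background', 'target']
```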
Constrained Transitions • [Figure: state sequences over X1, X2, X3, X4, contrasting a fully connected transition structure with a constrained one]
HMM Structure Learning • [Figure: alternative address-field HMM structures C1 and C2 over states such as St. #, street, country, Zip code] • Unlike BN structure learning • Learn the structure of the transition matrix A • Learn structures with different numbers of states
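One simple way to view transition-structure learning is as choosing which entries of the transition matrix A may be non-zero. The sketch below (an assumed representation, not the paper's) encodes a structure as a 0/1 mask over a three-state model and renormalizes counts only over permitted transitions.

```python
import numpy as np

# Assumed encoding: states are (background, prefix, target); a 0/1 mask
# records which transitions the learned structure permits.
mask = np.array([[1, 1, 0],   # background -> background or prefix
                 [0, 0, 1],   # prefix -> target only
                 [1, 0, 1]])  # target -> background or self-loop

# Made-up transition counts from training data.
counts = np.array([[8.0, 2.0, 5.0],
                   [1.0, 3.0, 6.0],
                   [4.0, 2.0, 4.0]])

A = counts * mask                      # zero out transitions outside the structure
A = A / A.sum(axis=1, keepdims=True)   # renormalize each row to a distribution
print(A)
```

The mask is the "structure"; the search described later in these slides effectively searches over such masks (together with the number of states).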
Why learn HMM structure? • HMMs are not specifically suited to IE tasks • Including structural bias reduces the number of parameters to learn, and therefore requires less data • The parameters will be more accurate • Can constrain the number of times a class appears in a document • Can represent class lengths more accurately • The emission probability might be multimodal • Can model the left and right context of a class for the sparse IE task
Fully Observed vs. Partially Observed • Structure learning is only required when the data is partially observed • Partially observed – a field is represented by several states, but the label only identifies the field • With fully observed data we can let the probabilities “learn” the structure • Transitions that are never observed get zero probability • Learning the transition structure involves introducing new states • Naively allowing arbitrary transitions will not generalize well
The Problem • How to select the additional states and the state-transition structure • Manual selection doesn’t scale well • Human intuition does not always correspond to the best structures
The Solution • A system that automatically selects an HMM transition structure • The system starts from a simple initial model and extends it sequentially with a set of operations, searching for a better model • Model quality is measured by its discrimination on a validation dataset • The best model found is returned • The learned models are comparable with human-constructed HMM structures and on average outperform them
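A simplified, deterministic skeleton of that search might look as follows (the paper uses stochastic optimization; this hedged sketch is plain hill-climbing with placeholder `expand` and `score` callbacks, not the authors' implementation):

```python
def learn_structure(initial_model, expand, score, max_steps=50):
    """Greedy structure search: repeatedly apply expansion operations and
    keep the candidate scoring best on held-out validation data."""
    best, best_score = initial_model, score(initial_model)
    for _ in range(max_steps):
        candidates = expand(best)            # all one-step expansions of best
        if not candidates:
            break
        top_score, top = max((score(m), m) for m in candidates)
        if top_score <= best_score:
            break                            # no expansion helps: stop
        best, best_score = top, top_score
    return best

# Toy usage: a "model" is just a state count; the (made-up) validation
# score peaks at 4 states, so the search grows the model from 1 to 4.
result = learn_structure(1, expand=lambda n: [n + 1],
                         score=lambda n: -(n - 4) ** 2)
print(result)  # -> 4
```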
IE with HMMs • Each extracted field has its own HMM • Each HMM contains two kinds of states: • Target states • Non-target states • All of the field HMMs are concatenated into one consistent HMM • The entire document is used to train the models, with no need for pre-processing
Parameter Estimation • Transition probabilities are estimated by maximum likelihood • Unique path – ratio of counts • Non-unique path – use EM • Emission probabilities require smoothing with priors • Shrinkage, with mixture weights estimated by EM
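The shrinkage step can be illustrated with an assumed two-level hierarchy: each state's maximum-likelihood emission distribution is interpolated with a shared parent distribution and a uniform distribution. In the real method the mixture weights are fit by EM; here they are simply fixed, and all distributions are invented for the example.

```python
def shrink(state_ml, parent_ml, vocab_size, w=(0.7, 0.2, 0.1)):
    """Interpolate a state's ML emission distribution with a parent
    distribution and a uniform distribution (fixed weights w)."""
    uniform = 1.0 / vocab_size
    return {tok: w[0] * state_ml.get(tok, 0.0)
                 + w[1] * parent_ml.get(tok, 0.0)
                 + w[2] * uniform
            for tok in set(state_ml) | set(parent_ml)}

# Toy example: the state only ever emitted "seminar", but shrinkage keeps
# some probability mass on "the" via the parent and uniform components.
p = shrink({"seminar": 1.0}, {"seminar": 0.5, "the": 0.5}, vocab_size=10)
print(p["seminar"], p["the"])  # -> 0.81 0.11
```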
Learning State-Transition Structure • States: • Target • Prefix • Suffix • Background
Model Expansion Choices • Lengthen a prefix • Split a prefix • Lengthen a suffix • Split a suffix • Lengthen a target string • Split a target string • Add a background state
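Two of the expansion operations above can be sketched on an assumed data structure: a model as a dict of state chains, e.g. `{"prefix": ["p1"], ...}`. "Lengthen" appends a state to a chain; "split" adds a parallel copy of it. The representation and state names are hypothetical, not the paper's.

```python
import copy

def lengthen(model, part):
    """Return a new model whose `part` chain is one state longer."""
    m = copy.deepcopy(model)
    m[part].append(f"{part[0]}{len(m[part]) + 1}")
    return m

def split(model, part):
    """Return a new model with a parallel copy of the `part` chain."""
    m = copy.deepcopy(model)
    m[part + "_alt"] = list(m[part])
    return m

base = {"prefix": ["p1"], "target": ["t1"], "suffix": ["s1"]}
print(lengthen(base, "prefix"))  # prefix chain grows to ["p1", "p2"]
```

Each operator returns a fresh candidate, leaving the original model untouched, so a search procedure can generate and score all one-step expansions of the current best model.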
Discussion • Structure learning is similar to rule learning for word or boundary classification • The search for the best structure is not exhaustive • There is no attempt to generalize better by sharing emission probabilities across states
Comparison with the Borkar et al. Algorithm • Differences: • Segmentation vs. sparse extraction • Modeling of background and boundaries • Unique path – no use of EM • Backward search vs. forward search • Both assume boundaries, and that position is the most relevant feature for distinguishing different states
Experimental Results • Tested on 8 extraction tasks over 4 datasets: • Seminar Announcements (485) • Reuters Corporate Acquisition articles (600) • Job Announcements (298) • Call for Papers (363) • Training and test sets were of equal size • Performance averaged over 10 splits
Experimental Results • Compared to 4 other approaches: • Grown HMM – the learned structure • SRV – rule learning (Freitag 1998) • Rapier – rule learning (Califf 1998) • Simple HMM • Complex HMM
Conclusions • HMMs have proven to be a state-of-the-art method for IE • Constraining the transition structure has a crucial effect on performance • Automatic transition-structure learning matches and even outperforms manually crafted HMMs, which require laborious manual construction