
Instance Filtering for Entity Recognition



  1. Instance Filtering for Entity Recognition Advisor: Dr. Hsu Reporter: Chun Kai Chen Authors: Alfio Massimiliano Gliozzo, Claudio Giuliano and Raffaella Rinaldi SIGKDD Explorations, Volume 7, Issue 1

  2. Outline • Motivation • Objective • Background And Related work • Instance Filtering • Experimental Results • Conclusions • Personal Opinion

  3. Motivation_Introduction(1/3) • The objective of Information Extraction (IE) • to identify a set of relevant domain-specific classes of entities • and their relations in textual documents • this paper focuses on the problem of Entity Recognition (ER) • Recent evaluation campaigns on ER • most of the participating systems approach the task as a supervised classification problem • assigning an appropriate classification label to each token in the input documents • two problems are usually related to this approach • the skewed class distribution • the data set size

  4. Objective_Introduction(2/3) • To address these problems, we propose a technique called Instance Filtering (IF) • The goal of IF • to reduce both the skewness and the data set size • the main peculiarity of this technique • it is performed on both the training and test sets • it reduces the computation time and the memory requirements for learning and classification • and improves the classification performance

  5. Introduction(3/3) • Present a comparative study • on Stop Word Filters • e.g. “He got a job from this company.” (considering a, from and this to be stop words) • To evaluate our filtering techniques • the SIE system, a supervised system for ER developed at ITC-irst • designed to achieve the goal of being easily and quickly portable across tasks and languages • based on Support Vector Machines and uses a standard, general-purpose feature set • Performed experiments • on three different ER tasks (i.e. Named Entity, Bio-Entity and Temporal Expression Recognition) • in two languages (i.e. English and Dutch)

  6. 2. Background And Related work • Learning with skewed class distributions is a very well-known problem in machine learning • The most common technique for dealing with skewed data sets is sampling • An additional problem is the huge size of the data sets. • Instance Pruning techniques • have been mainly applied to instance-based learning algorithms (e.g. kNN), to speed up the classification process while minimizing the memory requirements • The main drawback of many Instance Pruning techniques is their time complexity

  7. 3. Instance Filtering • IF is a preprocessing step • performed to reduce the number of instances given as input to a supervised classifier for ER • In this section • describe a formal framework for IF • introduce two metrics to evaluate an Instance Filter. • In addition • define the class of Stop Word Filters • propose an algorithm for their optimization

  8. 3.1 A general framework(1/2) • An Instance Filter is a function Δ(tᵢ, T) • returns 0 if the token tᵢ is not expected to be part of a relevant entity, 1 otherwise • An Instance Filter can be evaluated using the two following functions: • ψ(Δ,T) • is called the Filtering Rate • denotes the total percentage of filtered tokens in the data set T • ψ+(Δ,T) • is called the Positive Filtering Rate • denotes the percentage of positive tokens (wrongly) removed

  9. 3.1 A general framework(2/2) • a good filter • minimizes ψ+(Δ,T) and maximizes ψ(Δ,T) • reduces the data set size as much as possible while preserving most of the positive instances • to avoid over-fitting • the Filtering Rates on the training and test sets (TL and TT, respectively) have to be preserved: ψ(Δ,TL) ≈ ψ(Δ,TT) • skewness ratio • used to evaluate the ability of an Instance Filter to reduce the data skewness (see the sketch below)
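To make these definitions concrete, here is a minimal Python sketch (assuming token-level binary labels, 1 for tokens inside an entity, and a filter given as a boolean predicate; all names are illustrative) of how ψ, ψ+ and the skewness ratio could be computed:

```python
def filtering_rates(tokens, labels, filt):
    """tokens: token strings; labels: 1 if the token is part of an entity,
    0 otherwise; filt(token) -> True when the filter removes the token
    (i.e. Delta = 0)."""
    removed = [filt(t) for t in tokens]
    psi = sum(removed) / len(tokens)                  # Filtering Rate
    n_pos = sum(labels)
    wrongly = sum(1 for r, y in zip(removed, labels) if r and y == 1)
    psi_pos = wrongly / n_pos if n_pos else 0.0       # Positive Filtering Rate
    return psi, psi_pos

def skewness_ratio(tokens, labels, filt):
    # Negative-to-positive ratio of the tokens that survive filtering;
    # an effective Instance Filter drives this ratio down.
    kept = [y for t, y in zip(tokens, labels) if not filt(t)]
    pos = sum(kept)
    return (len(kept) - pos) / pos if pos else float("inf")
```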

  10. 3.2 Stop Word Filters • Stop Word Filters are implemented in two steps: • first, Stop Words are identified from the training corpus T and collected in a set of types U ⊆ V (a subset of the vocabulary) • then all their tokens are removed from both the training and the test set (see the sketch below)
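A small sketch of this two-step procedure, with the scoring metric left as a pluggable predicate (the function names are illustrative, not from the paper):

```python
def build_stop_word_set(train_tokens, train_labels, is_stop_word):
    # Step 1: scan the training vocabulary and collect the set of
    # stop-word types U (a subset of V) judged uninformative by the
    # chosen metric.
    return {w for w in set(train_tokens)
            if is_stop_word(w, train_tokens, train_labels)}

def apply_filter(tokens, stop_words):
    # Step 2: remove every token whose type is a stop word; the same
    # set is applied to both the training and the test documents.
    return [t for t in tokens if t not in stop_words]
```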

  11. 3.2 Stop Word Filters • 3.2.1 Information Content (IC) • removes tokens whose type has a very low information content • 3.2.2 Correlation Coefficient (CC) • the χ² statistic is used to measure the lack of independence, to find the types that are less likely to express relevant information • 3.2.3 Odds Ratio (OR) • measures the ratio between the probability of a type to occur in the positive class and in the negative class • the distribution of features on relevant documents is different from the distribution on non-relevant documents

  12. 3.2.1 Information Content (IC) • The most commonly used feature selection metric in text classification is based on document frequency • Our approach consists in removing all the tokens whose type has a very low information content

  13. 3.2.2 Correlation Coefficient (CC) • In text classification the χ² statistic • is used to measure the lack of independence between a type w and a category [20] • In our approach • we use the correlation coefficient CC² = χ² of a term w with the negative class • to find those types that are less likely to express relevant information in texts

  14. 3.2.3 Odds Ratio (OR) • Odds ratio • measures the ratio between the probability of a type to occur in the positive class, and its probability to occur in the negative class • the idea is that the distribution of the features on the relevant documents is different from the distribution on non-relevant documents [21] • Following this assumption, our approach is • a type is non-informative when its probability of being a negative example is markedly higher than its probability of being a positive example [8]
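Below are hedged sketches of the three metrics in their standard feature-selection formulations, computed per type against the token-level classes used in the earlier sketch; the paper's exact definitions may differ in detail, and the helper names are invented:

```python
import math

def _counts(w, tokens, labels):
    # 2x2 contingency counts of type w against the negative class.
    A = sum(1 for t, y in zip(tokens, labels) if t == w and y == 0)  # w, neg
    B = sum(1 for t, y in zip(tokens, labels) if t == w and y == 1)  # w, pos
    C = labels.count(0) - A                                          # not w, neg
    D = labels.count(1) - B                                          # not w, pos
    return A, B, C, D

def information_content(w, tokens, labels=None):
    # Self-information -log p(w): very frequent (stop-word-like) types
    # have LOW information content and are the ones filtered out.
    p = sum(1 for t in tokens if t == w) / len(tokens)
    return -math.log(p)

def correlation_coefficient(w, tokens, labels):
    # Signed square root of the chi-square statistic (CC^2 = chi^2),
    # computed against the negative class: a HIGH value marks a type
    # strongly associated with non-entity tokens.
    A, B, C, D = _counts(w, tokens, labels)
    N = A + B + C + D
    denom = math.sqrt((A + C) * (B + D) * (A + B) * (C + D)) or 1.0
    return math.sqrt(N) * (A * D - C * B) / denom

def odds_ratio(w, tokens, labels, eps=0.5):
    # Log odds of w occurring in the negative vs. the positive class
    # (eps smooths zero counts): HIGH values mark uninformative types.
    A, B, C, D = _counts(w, tokens, labels)
    p_neg = (A + eps) / (A + C + 2 * eps)   # p(w | negative class)
    p_pos = (B + eps) / (B + D + 2 * eps)   # p(w | positive class)
    return math.log(p_neg * (1 - p_pos)) - math.log(p_pos * (1 - p_neg))
```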

  15. 3.3 Optimization Issues • How to find the optimal threshold for a Stop Word Filter? • To solve this problem • we observe the behaviors of ψ and ψ+ on the training set, as in the sketch below
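The slide leaves the procedure implicit; one plausible reading, sketched below, assumes scores are oriented so that higher means "more stop-word-like" (as CC and OR above; IC would be negated first) and that a tolerance ε bounds ψ+ on the training set. It sweeps a threshold over the scores, reusing filtering_rates from the earlier sketch:

```python
def optimize_threshold(train_tokens, train_labels, scores, epsilon, steps=100):
    """scores: dict mapping each type to its stop-word score.
    Returns the threshold maximizing psi subject to psi+ <= epsilon."""
    lo, hi = min(scores.values()), max(scores.values())
    best_thr, best_psi = hi, 0.0          # start from the most conservative setting
    for i in range(steps + 1):
        thr = hi - (hi - lo) * i / steps  # progressively more aggressive
        stop = {w for w, s in scores.items() if s >= thr}
        psi, psi_pos = filtering_rates(train_tokens, train_labels,
                                       lambda t, S=stop: t in S)
        # Keep the most aggressive threshold whose Positive Filtering
        # Rate on the training set stays within the bound epsilon.
        if psi_pos <= epsilon and psi > best_psi:
            best_psi, best_thr = psi, thr
    return best_thr
```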

  16. 4. A Simple Information Extraction System(1/4) • In the training phase • SIE learns off-line a set of data models from a corpus prepared in IOBE format (see 4.1) • In the classification phase • these models are applied to tag new documents

  17. 4. A Simple Information Extraction System(2/4) • Input Format • the corpus must be prepared in IOBE notation (illustrated below) • Instance Filtering Module • implements the three different Stop Word Filters • different Stop Word Lists • are provided for the beginning and the end boundaries of each entity, as SIE learns two distinct classifiers for them
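For illustration (the entity type and the exact tag names are hypothetical, not taken from the paper), a sentence in IOBE notation carries one tag per token: B marks the first token of an entity, I the tokens inside, E the last one, and O everything outside:

```
The       O
European  B-ORG
Central   I-ORG
Bank      E-ORG
raised    O
interest  O
rates     O
.         O
```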

  18. 4. A Simple Information Extraction System(3/4) • Feature Extraction • used to extract a predefined set of features for each unfiltered token in both the training and the test sets • Classification • SIE approaches the IE task as a classification problem • by assigning an appropriate classification label to unfiltered tokens • We use SVMlight for training the classifiers
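The slide does not enumerate the features; the following is only a plausible sketch of the standard, language-independent kind of feature set such systems use (every feature name here is invented):

```python
def token_features(tokens, i, window=2):
    # Features for the token at position i: lowercased surface form,
    # simple orthographic cues, and a character suffix, repeated for
    # every position in a small context window around i.
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(tokens):
            w = tokens[j]
            feats.append(f"word[{off}]={w.lower()}")
            feats.append(f"cap[{off}]={w[:1].isupper()}")
            feats.append(f"digit[{off}]={any(c.isdigit() for c in w)}")
            feats.append(f"suf3[{off}]={w[-3:].lower()}")
    return feats
```

Each distinct feature string would then be mapped to a dimension of a sparse binary vector, the representation SVMlight expects.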

  19. 4. A Simple Information Extraction System(4/4) • Tag Matcher • all the positive predictions produced by the begin and end classifiers are paired by the Tag Matcher module, which • provides the final output of the system • assigns a score to each candidate entity • if nested or overlapping entities occur, it selects the entity with the maximal score • the score of each entity is proportional to the entity length probability (i.e. the probability that an entity has a certain length) • and to the confidence given by the classifiers to the boundary predictions (see the sketch below)
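A hedged sketch of how such a matcher could work; the greedy strategy and all names are assumptions, since the paper only states that the score is proportional to the length probability and the boundary confidences:

```python
def match_tags(begin_preds, end_preds, length_prob):
    """begin_preds / end_preds: (token_index, confidence) pairs from the
    two boundary classifiers; length_prob: entity-length probabilities
    estimated from the training set."""
    candidates = []
    for b, cb in begin_preds:
        for e, ce in end_preds:
            if e >= b:
                # Score proportional to the entity length probability
                # and to the begin/end prediction confidences.
                score = length_prob.get(e - b + 1, 0.0) * cb * ce
                if score > 0:
                    candidates.append((score, b, e))
    entities, used = [], set()
    for score, b, e in sorted(candidates, reverse=True):
        # Resolve nested/overlapping candidates by keeping the one
        # with the maximal score.
        if used.isdisjoint(range(b, e + 1)):
            entities.append((b, e, score))
            used.update(range(b, e + 1))
    return entities
```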

  20. 5. EVALUATION • In order to assess the portability and the language independence of our filtering techniques • we performed a set of comparative experiments on three different tasks in two different languages (see Subsection 5.1)

  21. 5.1 Task Descriptions • JNLPBA • International Joint Workshop on Natural Language Processing in Biomedicine and its Applications • five entity types: DNA, RNA, protein, cell-line, and cell-type • CoNLL-2002 • recognize named entities in Dutch texts • four types of named entities are considered: persons, locations, organizations and names of miscellaneous entities • TERN • the TERN (Time Expression Recognition and Normalization) evaluation: recognize temporal expressions in English texts

  22. 5.2 Filtering Rates(1/2) • The results indicate that • both CC and OR exhibit good performance and are far better than IC in all the tasks • they also highlight that our optimization strategy is robust against overfitting

  23. 5.2 Filtering Rates(2/2) • The results also show a significant reduction of the data skewness • Table 3 shows that all the IF techniques markedly reduce the data skewness on the JNLPBA data set • As expected, both CC and OR consistently outperform IC

  24. 5.3 Time Reduction • Figure 4 displays the impact of IF on the computation time required to perform the overall IE process • It is important to note that the cost of the IF optimization process is negligible • The curves indicate that both CC and OR are far superior to IC, allowing a drastic reduction of the time

  25. 5.4 Prediction Accuracy • Figure 5 plots the values of the micro-averaged F1 • Both OR and CC allow us to drastically • reduce the computation time • while maintaining the prediction accuracy with small values of ε

  26. 5.5 Comparison with the state-of-the-art • Tables 4, 5 and 6 summarize the performance of SIE compared to the baselines and to the best systems in all the tasks

  27. Conclusion • Instance Filtering • a preprocessing technique to alleviate two relevant problems of classification-based learning: the skewed class distribution and the large data set size • An important advantage of Instance Filtering • the reduction of the computation time required by the entity recognition system to perform both training and classification • We presented a class of instance filters based on feature selection metrics • Stop Word Filters, which avoid the high complexity of Instance Pruning algorithms • The experiments • show that the results are close to the state-of-the-art
