
Evaluation of Different Algorithms for Metadata Extraction



  1. Evaluation of Different Algorithms for Metadata Extraction Work in Progress Metadata Extraction Project Sponsored by DTIC Department of Computer Science Old Dominion University 6 / 03 / 2004

  2. Contents • Introduction • Metadata Extraction Using SVMs • Support Vector Machines • Multi-Class SVMs • Metadata Extraction as Multi-Class SVMs • Metadata Extraction Using HMM • Hidden Markov Model • Metadata Extraction as HMM • Metadata Extraction Using Templates • Experiments • SVMs • HMMs • Templates • Conclusion and Future Work

  3. Introduction • Machine learning approach • Support Vector Machines • Hidden Markov Model • Rule-based approach • Using rules to specify how to extract metadata • Motivation: Evaluate different approaches to metadata extraction for the DTIC test bed.

  4. Introduction • Deliverables • Software tool to extract metadata and structure from a set of pdf documents. • Feasibility report on extracting complex objects such as figures, equations, references, and tables from the document and representing them in a DTD-compliant XML format.

  5. Introduction • Schedule (Starting Date: March 2004) • Months 0-2: Working with DTIC in identifying the set of documents and the metadata of interest. • Months 3-8: Developing software for metadata and structure extraction from the selected set of pdf documents • Months 9-12: Feasibility study for extracting complex objects and representing the complete document in XML

  6. Support Vector Machines • Binary classifier (classifies data into two classes) • Represent data with pre-defined features • Learning: find the hyperplane with the largest margin separating the two classes. • Classifying: assign data to a class based on which side of the hyperplane it lies on. [Figure: two classes of dots separated by a hyperplane with a margin; axes are feature 1 and feature 2] The figure shows an SVM example that classifies a person into two classes, overweight and not overweight; two features are pre-defined: weight (feature 1) and height (feature 2). Each dot represents a person. Red dot: overweight; blue dot: not overweight.
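
A minimal sketch of the binary SVM idea in code, assuming scikit-learn (the project itself used LibSVM); the (weight, height) data below is toy data for illustration only:

```python
# A minimal binary SVM sketch, assuming scikit-learn (the project itself
# used LibSVM); the (weight, height) data below is toy data.
from sklearn import svm

# Toy training data: [weight_kg, height_cm] with a binary label
# (1 = overweight, 0 = not overweight).
X_train = [[95, 170], [100, 165], [110, 175], [60, 175], [65, 180], [55, 165]]
y_train = [1, 1, 1, 0, 0, 0]

# Learning: fit the maximum-margin separating hyperplane.
clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)

# Classifying: the prediction depends on which side of the hyperplane
# the new point falls.
print(clf.predict([[90, 172]]))
```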

  7. Multi-Class SVMs • Combining binary SVMs into a multi-class classifier • One-vs-rest • Classes: in this class or not in this class • Positive training samples: data in this class • Negative training samples: all the rest • K binary SVMs (K is the number of classes) • One-vs-one • Classes: in class one or in class two • Positive training samples: data in one class • Negative training samples: data in the other class • K(K-1)/2 binary SVMs
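
A short sketch contrasting the two decomposition schemes, assuming scikit-learn's multiclass wrappers around a linear SVM; the data is toy data:

```python
# Sketch contrasting one-vs-rest and one-vs-one decompositions, assuming
# scikit-learn's multiclass wrappers around a linear SVM.
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X = [[0, 1], [1, 1], [2, 0], [3, 2], [1, 0], [2, 2]]
y = ["title", "author", "date", "title", "author", "date"]  # K = 3 classes

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # trains K binary SVMs
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # trains K(K-1)/2 binary SVMs

print(len(ovr.estimators_))  # 3
print(len(ovo.estimators_))  # 3  (= 3 * 2 / 2)
```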

  8. Metadata Extraction as SVMs • Each element (title, author, etc.) in the metadata set can be viewed as a class. • Classify each line (paragraph) into a class. • Feature set • Line features (number of words, etc.) • Word features: use each word as a feature. In practice, word clustering techniques are used to reduce the number of features; word clustering groups words based on similarity (see the sketch below).
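
A hypothetical sketch of line-level feature extraction; the feature names and rules below are illustrative stand-ins, not the project's actual feature set:

```python
# Hypothetical line-feature extractor: the feature names and rules are
# illustrative stand-ins for the project's actual feature set.
import re

def line_features(line: str) -> dict:
    words = line.split()
    return {
        "num_words": len(words),
        "num_capitalized": sum(w[0].isupper() for w in words),
        "num_digits": sum(ch.isdigit() for ch in line),
        "has_email": int(bool(re.search(r"\S+@\S+", line))),
        "has_year": int(bool(re.search(r"\b(19|20)\d{2}\b", line))),
    }

# Each header line becomes one feature vector, then gets one class label
# (title, author, date, ...) from the multi-class SVM.
print(line_features("Converting Existing Corpus to an OAI Compliant Repository"))
```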

  9. Hidden Markov Model • A probabilistic finite state automaton • A sequence of observation symbols is produced by the underlying (hidden) states based on • Transition probabilities: the probabilities of moving from one state to another • Emission probabilities: the probabilities of emitting each symbol in each state • Learning: determining the transition and emission probabilities from training data. • Decoding: finding the most probable sequence of hidden states that produced the observed symbol sequence.

  10. Metadata Extraction as HMM • A document header can be viewed as a sequence of symbols (words, etc.) produced by hidden states (title, author, etc.) • Metadata Extraction • For a sequence of symbols – a document header – find the most probable sequence of states (title, author, etc.) • For example, • Input: Converting Existing Corpus to an OAI Compliant Repository, K. Maly, M. Zubair, J. Tang • Output: title title title title title title title title author author author author author author

  11. Metadata Extraction Using Templates [Figure: documents Doc1–Doc3 matched to templates, which produce metadata] • A rule-based approach • But it decouples the code and the rules • All document types share the same code • One template per document type • Template • An XML file describing the document features • Uses rules to define how to extract metadata for this type of document (see the sketch below)
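
A hypothetical sketch of the template idea: an XML file describes, per document class, a rule for locating each metadata field. The element and attribute names here are illustrative, not the project's actual template schema:

```python
# Hypothetical sketch of the template idea: an XML file describes, per
# document class, a rule for locating each metadata field.  The element
# and attribute names here are illustrative, not the project's schema.
import re
import xml.etree.ElementTree as ET

TEMPLATE = """
<template doctype="dtic-report">
  <field name="title" rule="first-line"/>
  <field name="date"  rule="regex" pattern="\\b(19|20)\\d{2}\\b"/>
</template>
"""

def apply_template(template_xml, header_lines):
    metadata = {}
    for field in ET.fromstring(template_xml).findall("field"):
        name, rule = field.get("name"), field.get("rule")
        if rule == "first-line":
            metadata[name] = header_lines[0]
        elif rule == "regex":
            for line in header_lines:
                match = re.search(field.get("pattern"), line)
                if match:
                    metadata[name] = match.group(0)
                    break
    return metadata

print(apply_template(TEMPLATE, ["A Study of Metadata Extraction", "June 1992"]))
```

Because the rules live in the template file rather than in the code, a new document type only requires writing a new template, not new code.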

  12. Experiments • SVM • Apply SVM to different data sets • Objective: Evaluate performance on different data sets. • Software used • LibSVM • Multi-class SVMs: using the one-vs-one approach • Features: textual features only • word-specific features such as :city: • line-specific features such as the number of words in a line

  13. SVM Experiments with different data sets • Data Sets • Data Set 1: Seymore935 • Download from http://www-2.cs.cmu.edu/~kseymore/ie.html • 935 manually tagged document headers • 15 Tags: title, author, affiliation, address, note, email, date, abstract, introduction (intro), phone, keywords, web, degree, publication number (pubnum), and page • Ignore tags except: title, author, affiliation, date • Using the first 500 for training and the rest for test • Data Set 2: DTIC100 • Selected 100 PDF files from DTIC website based on Z39.18 standard • OCR the first pages and convert to text format • Manually tagged these 100 document headers • 5 Tags: title, author, affiliation, date and others • Using the first 75 for training and the rest for test • Data Set 3: DTIC33 • A subset of DTIC100 • 33 tagged document headers with identical layout • 5 Tags: title, author, affiliation, date and others • Using the first 24 for training and the rest for test

  14. SVM Experiments with different data sets • Result

  15. SVM Experiments with different data sets • Result

  16. SVM Experiments with different data sets • Result

  17. Experiments • SVM • Use SVM with different feature sets • Objective: Evaluate performance with different feature sets. • Software used • LibSVM • Multi-class SVMs: using the one-vs-one approach, i.e., training one SVM classifier for each pair of classes • Research from the LibSVM developers shows that the one-vs-one approach performs better than one-vs-rest • Data set: DTIC100 • Manually tagged the XML files with layout information • Feature sets • Text: textual features only • Text+font: textual features and a font-size feature • Text+font+bold: textual features plus font-size and bold features

  18. SVM with different feature sets • Result

  19. SVM with different feature sets • Result

  20. SVM with different feature sets • Result

  21. SVM with different feature sets • More • Using layout information for documents with very different layouts does not improve performance significantly. • A further step is to use a document set with similar layout. We ran the same experiment with DTIC33 and obtained better recall. However, because that data set is very small, we cannot draw firm conclusions yet.

  22. Experiments • HMM • Data Set: Seymore935 • One state per field (tag) • Using the first 500 for training and the rest for test • Experimental Result • Overall accuracy=93.0%

  23. Experiments • Template • Data Set • DTIC100: 100 XML files with font size and bold information • It is divided into 7 classes according to layout information • For each class, a template is developed after checking the first one or two documents in this class. This template is applied to the remaining documents to get performance data (recall and precision)

  24. Experiments • Result

  25. Experiments • Result

  26. Experiments • Templates with more data demo

  27. Discussions • We have done experiments with the SVM, HMM, and template approaches • The template approach is flexible and produces good results. • SVM looks more promising than HMM • Its results are better • It processes the data line by line (or paragraph by paragraph) instead of word by word • It is easy to incorporate layout information

  28. Discussions • SVM • +: Reported to have good performance • -: difficulty in selecting proper features; difficulty in labeling a lot of training data; converting data into features and training is time-consuming. • Template approach • +: Flexible and straightforward (rules can be understood by humans) • -: Rules are fixed; difficulty in adjusting rules when errors occur.

  29. Overall Approach for Handling a Large Collection • Manual classification • This approach assumes it is possible to manually classify the large set of documents into similar classes (based on time period, source organization, etc.) • For each class, randomly select, say, 100 documents and develop a template. Evaluate the template by statistical sampling and refine it until the error is under a tolerance level. Then apply the refined template to the whole set. • Auto-classification • This approach assumes it is not feasible to classify the large set of documents manually. In this case we develop a higher-level set of rules on a smaller sample for classification and evaluate the classification approach by statistical sampling. • Next, develop the template for each class, apply it, and refine it as outlined in the manual classification approach.

  30. Future work • Evaluate different approaches for the DTIC test bed, including a hybrid approach that integrates the SVM and template-based approaches.

  31. Future work • Enlarge the data set • Currently, the data set is small • We need to enlarge the data set to evaluate the different approaches

  32. Thanks

  33. SVM • The margin is the width of separation between the two classes. • Optimal hyperplane is the one with maximal margin of separation between the two classes. • The support vectors are the instances closest to the optimal hyperplane.

  34. SVM (cont.) • Geometric interpretation: the support vectors uniquely define the optimal hyperplane.

  35. SVM (cont.) • SVM learning determines the separating hyperplane between the two classes from the training set. • SVM classifies input data based on which side of the hyperplane they lie on.

  36. SVM (cont.) • Mathematical interpretation • We want $w \cdot x_i + b \ge 1$ if $y_i = 1$ ($x_i$ in class 1) and $w \cdot x_i + b \le -1$ if $y_i = -1$ ($x_i$ in class 2). The margin is $2/\lVert w \rVert$. • The problem then becomes a constrained optimization problem: maximize $2/\lVert w \rVert$ (equivalently, minimize $\lVert w \rVert^2$) subject to $y_i(w \cdot x_i + b) - 1 \ge 0$.

  37. SVM (cont.) • Unique solution: $w = \sum_i \alpha_i y_i x_i$, summed over the support vectors. • Decision function: $f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i \, x_i \cdot x + b\right)$ • All other $x_i$ are irrelevant to the solution. • Lagrangian: $L_p = \tfrac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$, with the stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.

  38. SVM (cont.) Advantages • Can manage a very large number of attributes/features. • Linear regression has an overfitting problem when the number of attributes is much larger than the training set size. • The SVM solution is determined by the support vectors only. • Various kernel functions can be used to map the input space into a feature space. • For data that are not linearly separable, SVM uses kernel functions to map them to a linearly separable space (see the kernel example below). • In this way, SVM uses linear separation to solve non-linear problems.
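
For reference, the standard kernelized form of the decision function from slide 37, with the widely used RBF kernel shown as one example (not necessarily the kernel used in these experiments):

```latex
% Kernelized decision function: the dot product x_i \cdot x is replaced
% by a kernel evaluation K(x_i, x).
f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i \, K(x_i, x) + b \Big)

% Example kernel: the RBF (Gaussian) kernel with width parameter \gamma.
K(x, z) = \exp\big( -\gamma \, \lVert x - z \rVert^2 \big)
```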

  39. Experiment (SVM) Our experiment (working on the 500 tagged headers described in the paper) • Knowledge collection • Collect authors' names from Archon (CERN collection) • Download a British word list from the internet • Collect country names from the web • Collect USA city names • Collect Canadian province names and USA state names • Collect month names and their abbreviations • Frequent words for degree, pubnum, notenum, affiliation, address • Regular expressions for email and URL

  40. Experiment (SVM) 2. Word Clustering: converting the original data. For example, <title> Protocols for Collecting Responses +L+ in Multi-hop Radio Networks +L+ </title> <author> Chungki Lee James E. Burns +L+ Mostafa H. Ammar +L+ </author> <pubnum> GIT-CC-92/28 +L+ </pubnum> <date> June 1992 +L+ </date> will be converted to <title> :Cap1DictWord: :DictWord: :Cap1DictWord: :Cap1DictWord: +L+ :prep: :CapWord1LowerWord4-LowerWord3: :Cap1DictWord: :Cap1DictWord: +L+ </title> <author> :CapWord1LowerWord6: :mayName: :mayName: :singleCap: :mayName: +L+ :CapWord1LowerWord6: :singleCap: :mayName: +L+ </author> <pubnum> :CapWord3-CapWord2-Digs2/Digs2: +L+ </pubnum> <date> :month: :Digs4: +L+ </date>
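
A hedged sketch of this word-to-class conversion; the class labels mirror the example above, but the matching rules (and the helper `word_class`) are illustrative, not the project's actual clustering code:

```python
# Illustrative word-to-class mapping: each token is replaced by a coarse
# class label before being used as an SVM feature.  The rules here are a
# simplified stand-in for the project's word clustering.
import re

MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}

def word_class(token: str) -> str:
    low = token.lower()
    if low in MONTHS:
        return ":month:"
    if re.fullmatch(r"\d{4}", token):
        return ":Digs4:"
    if re.fullmatch(r"[A-Z]", token):
        return ":singleCap:"
    if token[:1].isupper() and token[1:].islower():
        return ":Cap1Word:"
    return ":word:"

print([word_class(t) for t in "June 1992 Chungki Lee E".split()])
# [':month:', ':Digs4:', ':Cap1Word:', ':Cap1Word:', ':singleCap:']
```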

  41. Experiment (SVM) 3. Get features: treat each word in the converted file as a feature, using its occurrence count as the weight. 4. The 500 headers are divided into 450 training and 50 test headers. 5. Train each of the 15 classifiers using the one-versus-all approach.

  42. Hidden Markov Models Example [Figure: hidden weather states emitting observation symbols] Someone is trying to deduce the weather from a piece of seaweed • For some reason, he cannot access weather information (sun, cloud, rain) directly • But he can observe the dampness of a piece of seaweed (soggy, damp, dryish, dry) • And the state of the seaweed is probabilistically related to the state of the weather

  43. Hidden Markov Models (cont.)

  44. HMM problems (cont.) The most probable sequence of hidden states is the sequence that maximizes: Pr(dry, damp, soggy | sunny, sunny, sunny), Pr(dry, damp, soggy | sunny, sunny, cloudy), Pr(dry, damp, soggy | sunny, sunny, rainy), ..., Pr(dry, damp, soggy | rainy, rainy, rainy)

  45. Hidden Markov Models (cont.) • A Hidden Markov Model consists of two sets of states and three sets of probabilities: • hidden states: the (true) states of a system that may be described by a Markov process (e.g. the weather states in our example). • observable symbols: the symbols of the process that are 'visible' (e.g. the dampness of the seaweed). • Initial probabilities for the hidden states • Transition probabilities between hidden states • Emission probabilities for each observable symbol in each hidden state
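
A minimal Viterbi decoding sketch for the weather/seaweed example, finding the most probable hidden-state sequence without enumerating every combination as on slide 44; the probability values here are illustrative, not taken from the slides:

```python
# Minimal Viterbi sketch for the weather/seaweed HMM.  The probability
# values are illustrative placeholders.
states = ["sunny", "cloudy", "rainy"]

initial = {"sunny": 0.6, "cloudy": 0.25, "rainy": 0.15}
transition = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.35, "rainy": 0.15},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.35, "rainy": 0.4},
}
emission = {
    "sunny":  {"dry": 0.6,  "dryish": 0.2,  "damp": 0.15, "soggy": 0.05},
    "cloudy": {"dry": 0.25, "dryish": 0.25, "damp": 0.25, "soggy": 0.25},
    "rainy":  {"dry": 0.05, "dryish": 0.1,  "damp": 0.35, "soggy": 0.5},
}

def viterbi(observations):
    # prob[s]: best probability of any state path ending in s;
    # path[s]: that path.
    prob = {s: initial[s] * emission[s][observations[0]] for s in states}
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        new_prob, new_path = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prob[p] * transition[p][s])
            new_prob[s] = prob[best_prev] * transition[best_prev][s] * emission[s][obs]
            new_path[s] = path[best_prev] + [s]
        prob, path = new_prob, new_path
    best = max(states, key=lambda s: prob[s])
    return path[best], prob[best]

print(viterbi(["dry", "damp", "soggy"]))
```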

  46. Digital Library Research at ODU

  47. Open Archives Initiative OAI-PMH 2.0 http://www.openarchives.org

  48. Connecting Islands of Digital Libraries Islands of digital libraries need to be interconnected so that users can access different information resources from anywhere. There is a need to manipulate, organize, and correlate information from different repositories for better discovery. The Open Archives Protocol for Metadata Harvesting (OAI-PMH) is an international effort to facilitate bridges across these islands of digital libraries. OAI does for digital libraries what the Internet did for islands of isolated networks.

  49. Background - Open Archives Initiative (OAI) The goal of the Open Archives Initiative Protocol for Metadata Harvesting is to supply and promote an application-independent interoperability framework. The OAI protocol permits metadata harvesting of a data provider by a service provider. Data providers support the OAI protocol as a means of exposing metadata about the content in their systems. Service providers issue OAI protocol requests to the systems of data providers and use the returned metadata as a basis for building value-added services. http://www.openarchives.org The word "open" in OAI is meant from the architectural perspective – defining and promoting machine interfaces. Openness does not mean "free" or "unlimited" access to the information repositories that conform to the OAI technical framework. The OAI is an international effort. Major sponsors are the Council on Library and Information Resources (CLIR), the Digital Library Federation (DLF), and the Scholarly Publishing & Academic Resources Coalition (SPARC).
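
A minimal sketch of a harvesting request under OAI-PMH; the base URL is hypothetical, but the `verb` and `metadataPrefix` parameters are standard protocol arguments:

```python
# Minimal OAI-PMH harvesting sketch: a service provider builds a request
# against a data provider's base URL (hypothetical here) to fetch Dublin
# Core records.  verb and metadataPrefix are standard protocol arguments.
from urllib.parse import urlencode

base_url = "http://example.org/oai"          # hypothetical data provider
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

request_url = base_url + "?" + urlencode(params)
print(request_url)
# -> http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
# An HTTP GET on this URL returns an XML document of <record> elements,
# which the harvester parses and stores.
```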

  50. What does it mean to make an existing digital library OAI enabled? [Figure: an OAI layer sits on top of the digital library's storage and exposes only metadata – DC and parallel metadata sets – to OAI service providers]
