
Evaluation of Different Algorithms for Metadata Extraction

Work in Progress

Metadata Extraction Project Sponsored by DTIC

Department of Computer Science

Old Dominion University

6 / 03 / 2004

Contents
  • Introduction
  • Metadata Extraction Using SVMs
    • Support Vector Machines
    • Multi-Class SVMs
    • Metadata Extraction as Multi-Class SVMs
  • Metadata Extraction Using HMM
    • Hidden Markov Model
    • Metadata Extraction as HMM
  • Metadata Extraction Using Templates
  • Experiments
    • SVMs
    • HMMs
    • Templates
  • Conclusion and Future Work

Introduction
  • Machine Learning Approach
    • Support Vector Machines
    • Hidden Markov Model
  • Rule-based approach
    • Using rules to specify how to extract metadata

Motivation: Evaluate different approaches for metadata extraction for the DTIC test bed.

  • Deliverables
      • Software tool to extract metadata and structure from a set of PDF documents.
      • Feasibility report on extracting complex objects such as figures, equations, references, and tables from the document and representing them in a DTD-compliant XML format.
  • Schedule (Starting Date: March 2004)
      • Months 0-2: Work with DTIC to identify the set of documents and the metadata of interest.
      • Months 3-8: Develop software for metadata and structure extraction from the selected set of PDF documents.
      • Months 9-12: Feasibility study for extracting complex objects and representing the complete document in XML.
Support Vector Machines
  • Binary classifier (classifies data into two classes)
    • Represent data with pre-defined features
    • Learning: find the plane with the largest margin that separates the two classes.
    • Classifying: assign each data point to a class according to which side of the plane it lies on.





The figure shows an SVM example that classifies a person into one of two classes, overweight or not overweight, using two pre-defined features: weight (feature 1) and height (feature 2). Each dot represents a person. Red dot: overweight; blue dot: not overweight.

Multi-Class SVMs
  • Combining into multi-class classifier
    • One-vs-rest
      • Classes: in this class or not in this class
      • Positive training samples: data in this class
      • Negative training samples: the rest
      • K binary SVMs (K is the number of classes)
    • One-vs-One
      • Classes: in class one or in class two
      • Positive training samples: data in class one
      • Negative training samples: data in class two
      • K(K-1)/2 binary SVMs
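
To make the two counts concrete, the sketch below enumerates the binary problems each scheme would train for four of the metadata classes. The majority-vote step for one-vs-one is a common way to combine pairwise classifiers and is an assumption here; the slides do not say how the outputs are combined.

```python
from collections import Counter
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def vote(pairwise_winners):
    """Assumed combination rule: the class that wins the most pairwise duels."""
    return Counter(pairwise_winners).most_common(1)[0][0]


classes = ["title", "author", "affiliation", "date"]
print(len(one_vs_rest_tasks(classes)))  # K binary SVMs
print(len(one_vs_one_tasks(classes)))   # K(K-1)/2 binary SVMs
```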
Metadata Extraction as SVMs
  • Each element (title, author, etc.) in the metadata set can be viewed as a class.
  • Classify each line (or paragraph) into a class
  • Feature set
    • Line features (number of words, etc.)
    • Word features: use each word as a feature. In practice, word clustering techniques, which group words based on similarity, are used to reduce the number of features.
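
As a rough illustration of line features, the sketch below computes a few simple values for one header line. Only the word count is named in the slides; the other two features and the function name `line_features` are assumptions chosen for illustration.

```python
def line_features(line):
    """Compute a few line-level features for one line of a document header."""
    words = line.split()
    return {
        # Named in the slides: number of words in the line.
        "num_words": len(words),
        # Assumed extra features, not taken from the slides.
        "num_capitalized": sum(1 for w in words if w[0].isupper()),
        "has_digit": any(ch.isdigit() for ch in line),
    }


print(line_features("Old Dominion University"))
```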
Hidden Markov Model
  • A probabilistic finite state automaton
    • A sequence of observation symbols is produced by the underlying states (hidden states) based on
      • Transition probabilities: the probabilities of moving from one state to another
      • Emission probabilities: the probabilities of emitting each symbol in each state
  • Learning: determine the transition and emission probabilities from training data.
  • Decoding: find the most probable sequence of hidden states that produces the sequence of observation symbols.
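
Decoding is usually done with the Viterbi algorithm. The sketch below is a minimal version, assuming a two-state model (title, author) and three made-up observation symbols; all probabilities are invented for illustration and are not taken from the experiments.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            prob, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], back)
        best.append(column)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical model: every number below is made up.
states = ("title", "author")
start_p = {"title": 0.9, "author": 0.1}
trans_p = {"title": {"title": 0.8, "author": 0.2},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"word": 0.8, "initial": 0.05, "name": 0.15},
          "author": {"word": 0.1, "initial": 0.5, "name": 0.4}}

path, prob = viterbi(["word", "word", "initial", "name"], states, start_p, trans_p, emit_p)
print(path)
```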
Metadata Extraction as HMM
  • A document header can be viewed as a sequence of symbols (words, etc.) produced by the hidden states (title, author, etc.)
  • Metadata Extraction
    • For a sequence of symbols (a document header)
    • find the most probable sequence of states (title, author, etc.)
  • For example,
    • Input: Converting Existing Corpus to an OAI Compliant Repository, K. Maly, M. Zubair, J. Tang
    • Output: title title title title title title title title author author author author author author
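
Once every token carries a state label, consecutive tokens with the same label can be merged back into metadata fields. The sketch below does this for the example above; the helper name `group_fields` and whitespace tokenization are assumptions.

```python
from itertools import groupby


def group_fields(tokens, labels):
    """Merge consecutive tokens that share a label into (label, text) fields."""
    fields = []
    for label, group in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        fields.append((label, " ".join(token for _, token in group)))
    return fields


header = ("Converting Existing Corpus to an OAI Compliant Repository, "
          "K. Maly, M. Zubair, J. Tang")
tokens = header.split()                       # 14 whitespace-separated tokens
labels = ["title"] * 8 + ["author"] * 6       # the output sequence from the example

for label, text in group_fields(tokens, labels):
    print(label, "->", text)
```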
Metadata Extraction Using Templates
  • A rule-based approach that decouples the code from the rules
  • All document types share the same code
  • One template per document type
  • Template
    • An XML file that describes the document features
    • Uses rules to define how to extract metadata for this type of document
  • SVM
    • Apply SVM to different data sets
      • Objective: evaluate performance on different data sets.
      • Software used
        • LibSVM
      • Multi-class SVMs: one-vs-one approach
      • Features: textual features only
        • Word-specific features such as :city:
        • Line-specific features such as the number of words in a line
SVM Experiments with different data sets
  • Data Sets
    • Data Set 1: Seymore935
      • Downloaded from http://www-2.cs.cmu.edu/~kseymore/ie.html
      • 935 manually tagged document headers
      • 15 tags: title, author, affiliation, address, note, email, date, abstract, introduction (intro), phone, keywords, web, degree, publication number (pubnum), and page
      • All tags ignored except title, author, affiliation, and date
      • The first 500 headers are used for training and the rest for testing
    • Data Set 2: DTIC100
      • 100 PDF files selected from the DTIC website based on the Z39.18 standard
      • The first pages were OCRed and converted to text format
      • The 100 document headers were tagged manually
      • 5 tags: title, author, affiliation, date, and others
      • The first 75 headers are used for training and the rest for testing
    • Data Set 3: DTIC33
      • A subset of DTIC100
      • 33 tagged document headers with identical layout
      • 5 tags: title, author, affiliation, date, and others
      • The first 24 headers are used for training and the rest for testing
  • SVM
    • Use SVM with different feature sets
      • Objective: evaluate the performance of different feature sets.
      • Software used
        • LibSVM
      • Multi-class SVMs: one-vs-one approach, i.e., training one SVM classifier for each pair of classes.
        • Research from the LibSVM developers shows that the one-vs-one approach performs better than the one-vs-rest approach.
      • Data set: DTIC100
        • Manually tagged the XML files with layout information
      • Feature sets
        • Text: textual features only
        • Text+font: textual features and font size feature
        • Text+font+bold: textual features, font size feature, and bold feature
SVM with different feature sets
  • More
    • Using layout information for documents with very different layouts does not improve performance significantly.
    • A further step is to use a document set with similar layouts. We ran the same experiment on DTIC33 and obtained better recall. However, because the data set is so small, we cannot draw a conclusion yet.
  • HMM
    • Data Set: Seymore935
    • One state per field (tag)
    • Using the first 500 for training and the rest for test
    • Experimental Result
      • Overall accuracy=93.0%
  • Template
    • Data Set
      • DTIC100: 100 XML files with font size and bold information
      • The set is divided into 7 classes according to layout information
      • For each class, a template is developed after examining the first one or two documents in the class. The template is then applied to the remaining documents to obtain performance data (recall and precision)
  • Templates with more data
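
The recall and precision mentioned for the template experiment can be computed as in the sketch below. It uses the standard definitions over sets of extracted and expected field values; how the project actually matched and scored fields is not stated in the slides, so the exact-match comparison and the sample data are assumptions.

```python
def precision_recall(extracted, expected):
    """Standard precision and recall over sets of (document, field, value) items."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: three titles extracted, four expected, two of them correct.
extracted = {("doc1", "title", "A"), ("doc2", "title", "B"), ("doc3", "title", "X")}
expected = {("doc1", "title", "A"), ("doc2", "title", "B"),
            ("doc3", "title", "C"), ("doc4", "title", "D")}

precision, recall = precision_recall(extracted, expected)
print(precision, recall)
```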


  • We have done experiments with SVM, HMM and Template approach
    • Template approach is flexible and produces good results.
    • SVM looks more promising than HMM
      • Results are better
      • It processes the data line by line (or paragraph by paragraph) instead of word by word
      • It is easy to process layout information
  • SVM
    • +: Reported to have good performance
    • -: Selecting proper features is difficult; labeling a large amount of training data is difficult; converting data into features and training are time-consuming.
  • Template Approach
    • +: Flexible and straightforward (rules can be understood by humans)
    • -: Rules are fixed; adjusting the rules when errors occur is difficult.
Overall Approach for Handling Large Collection
  • Manual Classification
      • This approach assumes it is possible to classify the large set of documents manually into classes of similar documents (based on time period, source organization, etc.).
      • For each class, randomly select a number of documents (say, 100) and develop a template. Evaluate the template by statistical sampling and refine it until the error is under a tolerance level. Then apply the refined template to the whole set.
  • Auto-Classification
      • This approach assumes it is not feasible to classify the large set of documents manually. In this case we develop a higher-level set of rules for classification on a smaller sample, and evaluate the classification approach by statistical sampling.
      • Next, develop the template for each class, then apply and refine it as outlined in the manual classification approach.
Future work
  • Evaluate different approaches for the DTIC test bed including the hybrid Approach that integrates SVM and template based approach.
  • Enlarge the data set
    • Currently, the data set is small
    • We need to enlarge the data set to evaluate the different approaches
  • The margin is the width of separation between the two classes.
  • Optimal hyperplane is the one with maximal margin of separation between the two classes.
  • The support vectors are the instances closest to the optimal hyperplane.
SVM (cont.)

Geometric interpretation: the support vectors uniquely define the optimal hyperplane.

SVM (cont.)
  • An SVM determines the hyperplane between two classes from the training set.
  • An SVM classifies input data according to which side of the hyperplane it lies on.
SVM (cont.)
  • Mathematical interpretation
    • We want $w \cdot x_i + b \ge 1$ if $y_i = 1$ ($x_i$ in class 1) and $w \cdot x_i + b \le -1$ if $y_i = -1$ ($x_i$ in class 2). The margin is $2/\|w\|$.
    • The problem then becomes a constrained optimization problem: maximize $2/\|w\|$, or equivalently minimize $\|w\|^2$, subject to $y_i (w \cdot x_i + b) - 1 \ge 0$.
SVM (cont.)
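  • Unique solution: $w = \sum_i \alpha_i y_i x_i$, summed over all support vectors
  • Decision function: $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \, x_i \cdot x + b\right)$
  • All other $x_i$ are irrelevant to the solution.
    • Lagrangian: $L_P = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$, with $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$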
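
The solution and decision function can be checked on a tiny hand-solved case. The sketch below assumes two support vectors, (1, 1) in class +1 and (-1, -1) in class -1, for which the multipliers work out to 0.25 each and b to 0; the data and helper names are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_vector(support_vectors, labels, alphas):
    """w = sum_i alpha_i * y_i * x_i over the support vectors."""
    dim = len(support_vectors[0])
    return [sum(a * y * x[d] for a, y, x in zip(alphas, labels, support_vectors))
            for d in range(dim)]


def decide(x, support_vectors, labels, alphas, b):
    """f(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b)."""
    score = sum(a * y * dot(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if score >= 0 else -1


# Hand-solved toy problem (assumed data).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]   # satisfies sum_i alpha_i * y_i = 0
b = 0.0

w = weight_vector(support_vectors, labels, alphas)
print(w, decide((2.0, 0.0), support_vectors, labels, alphas, b))
```

With these values both support vectors sit exactly on the margin, since w . x + b equals +1 and -1 for them.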
SVM (cont.)


  • Can manage a very large number of attributes/features.
    • Linear regression has an overfitting problem when the number of attributes is much larger than the size of the training set.
    • The SVM solution is determined by the support vectors only.
  • Various kernel functions can be used to map the input space into a feature space
    • For data that is not linearly separable, SVM uses kernel functions to map it into a space where it is linearly separable.
    • In this way, SVM uses linear separation to solve non-linear problems.
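
A small sketch of the kernel idea, assuming a degree-2 polynomial kernel and XOR-like toy data (neither is taken from the slides): the kernel value computed in the input space equals a dot product in a feature space, and in that feature space the two classes become linearly separable.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed directly in the 2-D input space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map whose dot product equals poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


# XOR-like data: no straight line separates the classes in the input space.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# In the feature space, the sign of the middle coordinate separates them.
separated = all((phi(p)[1] > 0) == (y == 1) for p, y in zip(points, labels))
print(separated)
```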
Experiment (SVM)

Our experiment (working on 500 tagged headers, as the paper described)

1. Knowledge collection
    • Collect authors' names from Archon (CERN collection)
    • Download a British word list from the internet
    • Collect country names from the web
    • Collect USA city names
    • Collect Canadian province names and USA state names
    • Collect month names and their abbreviations
    • Collect frequent words for degree, pubnum, notenum, affiliation, and address
    • Write regular expressions for email and URL

2. Word Clustering

Converting the original data

For example,

<title> Protocols for Collecting Responses +L+ in Multi-hop Radio Networks +L+ </title>

<author> Chungki Lee James E. Burns +L+ Mostafa H. Ammar +L+ </author>

<pubnum> GIT-CC-92/28 +L+ </pubnum>

<date> June 1992 +L+ </date>

will be converted to

<title> :Cap1DictWord: :DictWord: :Cap1DictWord: :Cap1DictWord: +L+

:prep: :CapWord1LowerWord4-LowerWord3: :Cap1DictWord: :Cap1DictWord: +L+ </title>


<author> :CapWord1LowerWord6: :mayName: :mayName: :singleCap: :mayName: +L+

:CapWord1LowerWord6: :singleCap: :mayName: +L+ </author>

<pubnum> :CapWord3-CapWord2-Digs2/Digs2: +L+ </pubnum>

<date> :month: :Digs4: +L+ </date>
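
The conversion can be sketched with a handful of rules. The version below is a simplified guess that reproduces only some of the cluster names shown above (:month:, :DigsN:, :singleCap:, :DictWord:, :Cap1DictWord:, :CapWord1LowerWordN:); the actual rules, their order, and the word lists used in the experiment are not given in the slides, so everything here is hypothetical.

```python
import re

# Hypothetical month list: full names plus three-letter abbreviations.
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december",
          "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec"}


def cluster(word, dictionary):
    """Map one word to a cluster token; unknown shapes are returned unchanged."""
    lower = word.lower().rstrip(".")
    if lower in MONTHS:
        return ":month:"
    if word.isdigit():
        return f":Digs{len(word)}:"
    if re.fullmatch(r"[A-Z]\.?", word):
        return ":singleCap:"
    if lower in dictionary:
        return ":Cap1DictWord:" if word[0].isupper() else ":DictWord:"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return f":CapWord1LowerWord{len(word) - 1}:"
    return word


def convert_line(line, dictionary):
    return " ".join(cluster(w, dictionary) for w in line.split())


# Tiny stand-in for the downloaded word list.
DICTIONARY = {"protocols", "for", "collecting", "responses"}
print(convert_line("Protocols for Collecting Responses", DICTIONARY))
print(convert_line("June 1992", DICTIONARY))
```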


3. Get Features

Treat each word in the converted file as a feature and use its number of occurrences as the weight.

4. Divide the 500 headers into 450 training headers and 50 test headers.

5. Train each of the 15 classifiers using the one-versus-all approach.
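
Step 3 can be sketched as a simple token count per converted line. Dropping the `+L+` line marker, the vocabulary indices, and writing the result in LibSVM's sparse `label index:value` text format are assumptions about how the features might be fed to a classifier.

```python
from collections import Counter


def occurrence_features(converted_line):
    """Count each cluster token in a converted line (the +L+ marker is skipped)."""
    tokens = [t for t in converted_line.split() if t != "+L+"]
    return dict(Counter(tokens))


def to_libsvm_line(label_id, features, vocabulary):
    """Format one sample as '<label> <index>:<value> ...' with ascending indices."""
    pairs = sorted((vocabulary[token], count) for token, count in features.items())
    return f"{label_id} " + " ".join(f"{index}:{count}" for index, count in pairs)


# Hypothetical vocabulary: token -> feature index.
vocabulary = {":Cap1DictWord:": 1, ":DictWord:": 2}
features = occurrence_features(":Cap1DictWord: :DictWord: :Cap1DictWord: :Cap1DictWord: +L+")
print(to_libsvm_line(1, features, vocabulary))
```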

Hidden Markov Models Example

Someone is trying to deduce the weather from a piece of seaweed:

  • For some reason, he cannot access weather information (sun, cloud, rain) directly
  • But he can observe the dampness of a piece of seaweed (soggy, damp, dryish, dry)
  • And the state of the seaweed is probabilistically related to the state of the weather
HMM problems (cont.)

The most probable sequence of hidden states is the one with the largest probability among:

Pr(dry,damp,soggy | sunny,sunny,sunny), Pr(dry,damp,soggy | sunny,sunny,cloudy), Pr(dry,damp,soggy | sunny,sunny,rainy), . . . . Pr(dry,damp,soggy | rainy,rainy,rainy)
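
The exhaustive search over all 27 weather sequences can be written directly. In this sketch every probability is invented for illustration, and each candidate is scored by the joint probability of the hidden sequence and the observations, which is an assumption about what the slide intends.

```python
from itertools import product

WEATHER = ("sunny", "cloudy", "rainy")
# All numbers below are made up.
START = {"sunny": 0.63, "cloudy": 0.17, "rainy": 0.20}
TRANS = {
    "sunny": {"sunny": 0.50, "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.50, "rainy": 0.25},
    "rainy": {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.50},
}
EMIT = {
    "sunny": {"dry": 0.60, "dryish": 0.20, "damp": 0.15, "soggy": 0.05},
    "cloudy": {"dry": 0.25, "dryish": 0.25, "damp": 0.25, "soggy": 0.25},
    "rainy": {"dry": 0.05, "dryish": 0.10, "damp": 0.35, "soggy": 0.50},
}


def joint_probability(hidden, observed):
    """Probability of this hidden sequence together with the observations."""
    prob = START[hidden[0]] * EMIT[hidden[0]][observed[0]]
    for prev, cur, obs in zip(hidden, hidden[1:], observed[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][obs]
    return prob


def most_probable_sequence(observed):
    """Try every hidden sequence and keep the one with the highest score."""
    candidates = list(product(WEATHER, repeat=len(observed)))
    best = max(candidates, key=lambda h: joint_probability(h, observed))
    return best, len(candidates)


best, n_candidates = most_probable_sequence(("dry", "damp", "soggy"))
print(n_candidates, best)
```

The number of candidates grows exponentially with the length of the observation sequence, which is why dynamic programming is normally used instead of enumeration.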

Hidden Markov Models (cont.)
  • A Hidden Markov Model consists of two sets of states and three sets of probabilities:
    • hidden states: the (true) states of a system that may be described by a Markov process (e.g. weather states in our example).
    • observable symbols: the symbols of the process that are visible (e.g. dampness of the seaweed).
    • Initial probabilities for hidden states
    • Transition probabilities for hidden states
    • Emission probabilities for each observable symbol in each hidden state

Open Archives Initiative
OAI-PMH 2.0
http://www.openarchives.org

Connecting Islands of Digital Libraries

Islands of digital libraries need to be interconnected so that users can access different information resources from anywhere.

Information from different repositories needs to be manipulated, organized, and correlated for better discovery.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is an international effort to build bridges across islands of digital libraries.

OAI does for digital libraries what the Internet did for islands of isolated networks.


Background - Open Archives Initiative (OAI)

The goal of the Open Archives Initiative Protocol for Metadata Harvesting is to supply and promote an application-independent interoperability framework. The OAI protocol permits metadata harvesting of a data provider by a service provider.

Data Providers support the OAI protocol as a means of exposing metadata about the content in their systems.

Service Providers issue OAI protocol requests to the systems of data providers and use the returned metadata as a basis for building value-added services.


The word “open” in OAI is from the architectural perspective – defining and promoting machine interfaces. Openness does not mean “free” or “unlimited” access to the information repositories that conform to the OAI technical framework.

The OAI is an International effort. Major sponsors are: Council on Library and Information Resources (CLIR), the Digital Library Federation (DLF), the Scholarly Publishing & Academic

What does it mean to make an existing digital library OAI-enabled?





Exposing metadata to OAI service providers – DC and Parallel metadata sets




Minimal Dublin Core Metadata – OAI Requirement


Fifteen Elements (Optional)

Element: Title

A name given to the resource. Typically, a Title will be a name by which the resource is formally known.

Element: Creator

An entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organisation, or a service.

Element: Subject

The topic of the content of the resource. Typically, a Subject will be expressed as keywords,

Element: Description

An account of the content of the resource. Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.

Element: Publisher

An entity responsible for making the resource available. Examples of a Publisher include a person, an organisation, or a service.


Dublin Core Metadata…

Element: Contributor

An entity responsible for making contributions to the content of the resource.

Element: Date

A date associated with an event in the life cycle of the resource. Typically, Date will be associated with the creation or availability of the resource.

Element: Type

The nature or genre of the content of the resource.

Element: Format

The physical or digital manifestation of the resource. Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource.

Element: Identifier

An unambiguous reference to the resource within a given context. Example formal identification systems include the Uniform Resource Locator (URL).


Dublin Core Metadata…

Element: Source

A Reference to a resource from which the present resource is derived.

Element: Language

A language of the intellectual content of the resource.

Element: Relation

A reference to a related resource.

Element: Coverage

The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity).

Element: Rights

Information about rights held in and over the resource.
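
As an illustration of how a few of these elements look in practice, the sketch below builds an unqualified Dublin Core record in the oai_dc XML format with Python's standard library. The field values reuse the paper title and authors quoted earlier in these slides; the helper name `dublin_core_record` is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the oai_dc container and the Dublin Core element set.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def dublin_core_record(fields):
    """Build an oai_dc:dc element; every element is optional and repeatable."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, values in fields.items():
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return root


record = dublin_core_record({
    "title": ["Converting Existing Corpus to an OAI Compliant Repository"],
    "creator": ["K. Maly", "M. Zubair", "J. Tang"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```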


Beyond Dublin Core Metadata

Need to support parallel metadata sets to enable the OAI service provider to take advantage of the richer metadata fields for resource discovery.

The OAI metadata harvesting protocol supports the notion of parallel metadata sets, allowing collections to expose metadata in formats that are specific to their applications and domains. The OAI technical framework places no limitations on the nature of such parallel sets, other than that the metadata records be structured as XML data that have a corresponding XML schema for validation.


Metadata Harvesting

  • Move away from distributed searching.
    • It does not scale well to a large number of participants.
  • Extract metadata from various sources.
    • Build services on local copies of metadata.
    • Data remains at the remote repositories.

RCDL 2003, St. Petersburg


OAI Request and OAI Response.

- OAI Request for Metadata is embedded in HTTP.

- OAI Response to OAI Request is encoded in XML.

- XML Schema specification for OAI Response is provided in OAI-PMH document.
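
Since a request is just an HTTP URL with a verb and arguments, it can be assembled with the standard library. In the sketch below the base URL and the dates are hypothetical; the verbs, the argument names, and the identifier are the ones that appear in these slides.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments=None):
    """Build the URL for an OAI-PMH request sent with HTTP GET."""
    params = {"verb": verb}
    params.update(arguments or {})
    return f"{base_url}?{urlencode(params)}"


BASE_URL = "http://example.org/oai"  # hypothetical repository base URL

list_records_url = oai_request_url(BASE_URL, "ListRecords", {
    "metadataPrefix": "oai_dc",
    "from": "2004-01-01",   # made-up dates
    "until": "2004-06-03",
    "set": "klm",
})
get_record_url = oai_request_url(BASE_URL, "GetRecord", {
    "identifier": "oai:mlib:123a",
    "metadataPrefix": "oai_dc",
})
print(list_records_url)
print(get_record_url)
```

A dictionary is used for the arguments because `from` is a reserved word in Python and cannot be passed as a keyword argument.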

[Figure: a harvester (service provider) sends protocol requests to a repository (data provider).]

Supporting protocol requests:

  • Identify
  • ListMetadataFormats
  • ListSets

Harvesting protocol requests:

  • ListRecords
  • ListIdentifiers
  • GetRecord

Identify response:

  • Repository name
  • Base-URL
  • Admin e-mail
  • OAI protocol version
  • Description Container

ListMetadataFormats response:

  • Format prefix
  • Format XML schema


ListSets response:

  • Set Specification
  • Set Name


ListRecords request:

  • metadataPrefix=oai_dc
  • from=a
  • until=b
  • set=klm

ListRecords response, for each record:

  • Identifier
  • Datestamp
  • Metadata
  • About Container
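
On the harvester side, the identifier and datestamp of each record can be pulled out of the XML response with the standard library. The sample response below is a hand-written, heavily trimmed stand-in (a real response carries more elements), and the datestamp is made up.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed, hand-written sample; not output from a real repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:mlib:123a</identifier>
        <datestamp>2004-06-03</datestamp>
      </header>
      <metadata/>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_headers(xml_text):
    """Return (identifier, datestamp) for every record header in a response."""
    root = ET.fromstring(xml_text)
    return [
        (header.findtext("oai:identifier", namespaces=NS),
         header.findtext("oai:datestamp", namespaces=NS))
        for header in root.iterfind(".//oai:header", NS)
    ]


print(list_headers(SAMPLE_RESPONSE))
```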


ListIdentifiers request:

  • from=a
  • until=b
  • set=klm

ListIdentifiers response, for each record:

  • Identifier
  • Datestamp


GetRecord request:

  • identifier=oai:mlib:123a
  • metadataPrefix=oai_dc

GetRecord response:

  • Identifier
  • Datestamp
  • Metadata
  • About

OAI Mechanics

Request is encoded in HTTP

Response is encoded in XML

XML Schemas for the responses are defined in the OAI-PMH document

Courtesy: Michael Nelson