Learning to Extract Symbolic Knowledge from the World Wide Web

Changho Choi

Source: http://www.cs.cmu.edu/~knigam/

Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum

Carnegie Mellon University, J. Stefan Institute

AAAI-98


Abstract

[Diagram: information on the Web is understandable to humans; extracting that information into a knowledge base (KB) makes it available as symbolic knowledge.]

Changho Choi, University at Buffalo


Introduction (1/4)

  • Two types of input to the information extraction system

    • Ontology

      • Specifying the classes and relations of interest

        • For example, a hierarchy of classes including Person, Student, Research.Project, Course, etc.

    • Training examples

      • Represent instances of the ontology classes and relations

        • For example, a course web page as an instance of the Course class, a faculty web page as an instance of the Faculty class, and a pair of such pages as an instance of the Courses.Taught.By relation.
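
The two inputs above can be pictured as plain data structures. The following is a minimal sketch, not the authors' representation: the `Ontology` class, the parent links between classes, the domain and range chosen for Courses.Taught.By, and the example.edu URLs are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Ontology:
    """Hypothetical container: class hierarchy plus typed relations."""
    # class name -> parent class name (None for a top-level class)
    parents: Dict[str, Optional[str]] = field(default_factory=dict)
    # relation name -> (domain class, range class)
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def is_a(self, cls: Optional[str], ancestor: str) -> bool:
        """Walk up the hierarchy from cls, looking for ancestor."""
        while cls is not None:
            if cls == ancestor:
                return True
            cls = self.parents.get(cls)
        return False


# Input 1: the ontology (the hierarchy shape here is an assumption).
ontology = Ontology(
    parents={
        "Person": None,
        "Student": "Person",
        "Faculty": "Person",
        "Course": None,
        "Research.Project": None,
    },
    relations={"Courses.Taught.By": ("Faculty", "Course")},
)

# Input 2: training examples, i.e. pages labeled with a class and
# pairs of pages labeled with a relation (placeholder URLs).
class_examples = [
    ("http://example.edu/cs101/", "Course"),
    ("http://example.edu/~smith/", "Faculty"),
]
relation_examples = [
    ("Courses.Taught.By", "http://example.edu/~smith/", "http://example.edu/cs101/"),
]
```

Keeping the ontology separate from the labeled pages mirrors the split on the slide: the ontology says what to look for, and the examples show what instances look like on the Web.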


[Figure: example ontology showing classes and, for each class, its relations (relation: value).]


Introduction (3/4)

  • Assumptions about the mapping between the ontology and the Web

    1. Each instance of an ontology class is

      • a single Web page,

      • a contiguous string of text,

      • or a collection of several Web pages.

    2. Each instance of a relation is

      • a segment of hypertext,

      • a contiguous segment of text,

      • or the hypertext segment.


Introduction (4/4)

  • Three primary learning tasks

    • Involved in extracting knowledge-base instances from the Web

      1. Recognizing class instances by classifying bodies of hypertext.

      2. Recognizing relation instances by classifying chains of hyperlinks.

      3. Recognizing class and relation instances by extracting small fields of text from Web pages.


Experimental Testbed

  • Experiments

    • Based on the ontology

    • Classes: department, faculty, staff, student, research_project, course, other

    • Relations: Instructors.Of.Course (251), Members.Of.Project (392), Department.Of.Person (748)

  • Data sets

    • A set of 4,127 pages and 10,945 hyperlinks from four CS departments

    • A set of 4,120 pages from numerous other CS departments

  • Evaluation

    • Four-fold cross validation

      • 3 for training, 1 for testing
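
A minimal sketch of the four-fold split, assuming each fold corresponds to one of the four CS departments (the slide gives only the 3-to-1 ratio, so this pairing is an assumption). The function name, department names, and page identifiers are hypothetical.

```python
def leave_one_department_out(pages_by_dept):
    """Yield (held_out, train, test): one department is held out per fold."""
    for held_out in pages_by_dept:
        test = list(pages_by_dept[held_out])
        train = [
            page
            for dept, pages in pages_by_dept.items()
            if dept != held_out
            for page in pages
        ]
        yield held_out, train, test


# Placeholder data: four departments with a few page identifiers each.
pages_by_dept = {
    "dept_a": ["a1", "a2"],
    "dept_b": ["b1"],
    "dept_c": ["c1", "c2", "c3"],
    "dept_d": ["d1"],
}

for held_out, train, test in leave_one_department_out(pages_by_dept):
    print(held_out, len(train), len(test))
```

Splitting by department rather than by page keeps pages from the test site out of training, so a classifier cannot score well simply by memorizing one site's conventions.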


Statistical Text Classification

  • Process

    • Building a probabilistic model of each class using labeled training data

    • Classifying newly seen pages by selecting the class that is most probable given the evidence of words describing the new page.

  • Train three classifiers

    • Full-text

    • Title/Heading

    • Hyperlink
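
The two-step process above can be sketched as a small bag-of-words classifier. This is an illustrative multinomial naive Bayes with add-one smoothing, which is an assumption and not necessarily the smoothing used in the original work; the function names and the toy training words are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """Build per-class word counts from (words, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(words, model):
    """Return the class with the highest log-probability for the words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    (["homework", "syllabus", "lecture"], "course"),
    (["professor", "research", "publications"], "faculty"),
    (["lecture", "exam", "homework"], "course"),
]
model = train_naive_bayes(training)
print(classify(["lecture", "homework"], model))    # course
print(classify(["professor", "research"], model))  # faculty
```

Under this sketch, the three classifiers would share the same code and differ only in where the words come from: the full page text, the title and headings, or the text of hyperlinks.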


Statistical Text Classification

  • Approach

    • Naïve Bayes, with minor modifications

      • Based on Kullback-Leibler Divergence

      • Given a document d to classify, we calculate a score for each class c as follows:
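
A sketch of such a score in LaTeX, assuming the form given in the published AAAI-98 paper. The symbols are assumptions carried over from that paper rather than from this slide: n is the number of words in d, T is the vocabulary size, and w_i is the i-th word in the vocabulary.

```latex
\mathrm{Score}_c(d) = \frac{\log \Pr(c)}{n}
  + \sum_{i=1}^{T} \Pr(w_i \mid d)\,
    \log\!\left( \frac{\Pr(w_i \mid c)}{\Pr(w_i \mid d)} \right)
```

The sum is the negative Kullback-Leibler divergence between the word distribution of the document and that of the class, so a class scores higher the more closely its word distribution matches the document's.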


Statistical Text Classification

  • Experimental evaluation


Accuracy/coverage

  • Coverage

    • The percentage of pages for a given class that are correctly classified as belonging to the class

  • Accuracy

    • The percentage of pages classified into a given class that are actually members of that class
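
The two definitions above can be computed directly from paired true and predicted labels. This is a minimal sketch with a hypothetical function name and toy labels; it returns fractions rather than percentages, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def accuracy_and_coverage(true_labels, predicted_labels, target_class):
    """Return (accuracy, coverage) for one class, as fractions in [0, 1]."""
    pairs = list(zip(true_labels, predicted_labels))
    # Pages the classifier assigned to the class.
    predicted_as = sum(1 for _, p in pairs if p == target_class)
    # Pages that really belong to the class.
    actual = sum(1 for t, _ in pairs if t == target_class)
    # Pages that are both assigned to the class and really belong to it.
    correct = sum(1 for t, p in pairs if t == target_class and p == target_class)
    accuracy = correct / predicted_as if predicted_as else 0.0
    coverage = correct / actual if actual else 0.0
    return accuracy, coverage


true_labels = ["course", "course", "course", "course", "faculty"]
predicted = ["course", "course", "other", "other", "course"]
acc, cov = accuracy_and_coverage(true_labels, predicted, "course")
print(round(acc, 3), round(cov, 3))  # 0.667 0.5
```

In the toy data, three pages are labeled course by the classifier and two of them are right (accuracy 2/3), while only two of the four real course pages are found (coverage 1/2).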


Accuracy/coverage tradeoff

1. Full-text classifiers

2. Hyperlink classifiers

3. Title/heading classifiers

“Hyperlink information can provide strong knowledge.”
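
One common way to trace an accuracy/coverage tradeoff, assumed here rather than taken from the slide, is to accept a prediction only when the classifier's confidence reaches a threshold and then sweep that threshold. The function name, the confidence values, and the toy predictions below are hypothetical.

```python
def tradeoff_curve(scored_predictions, target_class, thresholds):
    """scored_predictions: list of (true_label, predicted_label, confidence).

    Returns a list of (threshold, accuracy, coverage); accuracy is None
    when no prediction for the class survives the threshold.
    """
    actual = sum(1 for t, _, _ in scored_predictions if t == target_class)
    curve = []
    for threshold in thresholds:
        # Keep only confident predictions of the target class.
        kept = [
            t
            for t, p, conf in scored_predictions
            if p == target_class and conf >= threshold
        ]
        correct = sum(1 for t in kept if t == target_class)
        accuracy = correct / len(kept) if kept else None
        coverage = correct / actual if actual else 0.0
        curve.append((threshold, accuracy, coverage))
    return curve


scored = [
    ("course", "course", 0.9),
    ("course", "course", 0.6),
    ("faculty", "course", 0.5),
    ("course", "other", 0.8),
    ("course", "course", 0.4),
]
for row in tradeoff_curve(scored, "course", [0.0, 0.55]):
    print(row)
```

Raising the threshold from 0.0 to 0.55 in the toy data lifts accuracy from 0.75 to 1.0 while coverage drops from 0.75 to 0.5, which is the shape of tradeoff the slide's curves compare across the three classifiers.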
