Automatically Generated DAML Markup for Semistructured Documents

Automatically Generated DAML Markup for Semistructured Documents William Krueger, Jonathan Nilsson, Tim Oates, Tim Finin Supported by DARPA contract F30602-00-2-0591

DAML and the Semantic Web • The most efficient way for machines to understand the semantics of the vast amount of information on the web is to add semantic markup to the information • DAML (DARPA Agent Markup Language) is one existing semantic markup language

The Problem • Semantically marking up large amounts of data by hand is far too time-consuming • We use machine learning techniques to automate the task

An Excerpt From a Talk Announcement The International Computer Science Institute is pleased to present a talk: "Automatic Classification of Acoustic Signals Based on Psychoacoustic and Neurophysiological Knowledge" <center>Michael Kleinschmidt</center> Medical Physics Group, University of Oldenburg, Germany Who is the speaker?

An Excerpt From a Talk Announcement (Solution) The International Computer Science Institute is pleased to present a talk: "Automatic Classification of Acoustic Signals Based on Psychoacoustic and Neurophysiological Knowledge" <center><Speaker>Michael Kleinschmidt</Speaker></center> Medical Physics Group, University of Oldenburg, Germany

Outline • Talk Ontology • Hierarchical Wrapper Induction • Contributions • Experimental Results • Future Considerations

Our Talk Ontology • An ontology is the hierarchically organized vocabulary used to semantically mark up information sources • The root of our talk ontology is Talk • The ontological children of Talk include elements such as Talk:Title and Talk:BeginTime • The element Talk:BeginTime has its own ontological children, Talk:BeginTime:Hour and Talk:BeginTime:Minute
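
One way to picture the hierarchy on this slide is as a nested mapping. This is only a sketch: the dictionary layout and the `paths` helper are illustrative assumptions, and only the element names mentioned on the slide are included.

```python
# Hypothetical nested-dict view of part of the talk ontology.
# Only elements named on the slide are listed; the real ontology has more.
TALK_ONTOLOGY = {
    "Talk": {
        "Title": {},
        "BeginTime": {"Hour": {}, "Minute": {}},
    }
}

def paths(tree, prefix=()):
    """Yield colon-separated element paths such as 'Talk:BeginTime:Hour'."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield ":".join(path)
        yield from paths(children, path)

for p in paths(TALK_ONTOLOGY):
    print(p)
```

Walking the mapping depth-first reproduces the colon-separated names used throughout the slides.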

Advantages of a Hierarchy • Using a hierarchical data model, we can break up documents into embedded segments • When learning rules for the speaker’s first name, for example, we only have to consider the speaker segment of each document

Wrappers • A wrapper is the set of rules used to extract data along with the code required to perform the extraction

The STALKER Algorithm • Stalker is a hierarchical wrapper induction algorithm developed at ISI • We use a modified Stalker algorithm to do information extraction on a source • The extracted information along with a DAML ontology can then be used to create markup for the source
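
As a rough illustration of the last bullet, extracted values keyed by ontology path can be turned into nested markup. The sketch below uses Python's standard `xml.etree.ElementTree`; the `build_markup` function and the colon-path input format are assumptions made for illustration, and the RDF/DAML namespace declarations that real markup carries are omitted.

```python
import xml.etree.ElementTree as ET

def build_markup(extracted):
    """Build nested XML from a dict of colon paths to text.

    Assumes every path starts with the same root element (e.g. 'Talk').
    """
    root = None
    for path, text in extracted.items():
        parts = path.split(":")
        if root is None:
            root = ET.Element(parts[0])
        node = root
        # Descend through the hierarchy, creating missing elements on the way.
        for name in parts[1:]:
            child = node.find(name)
            if child is None:
                child = ET.SubElement(node, name)
            node = child
        node.text = text
    return ET.tostring(root, encoding="unicode")

print(build_markup({
    "Talk:Title": "A Talk",
    "Talk:Speaker:Name": "Michael Kleinschmidt",
}))
```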

Defining Rule, Landmark, Token, etc. • A token is an elementary piece of text • Token types include lowercase words, HTML tags, numbers, alphanumeric words, symbols, etc. • A landmark is a sequence of one or more consecutive tokens • A rule clause contains one landmark and is one of two types: SkipTo or SkipUntil • A rule is an ordered list of rule clauses • A rule can be applied either forward or backward • Rules are used to locate both the beginning and the end of an information field
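
These definitions can be made concrete with a small sketch. The tokenizer regex and the function names are illustrative assumptions, not the system's actual code. The sketch also assumes the usual STALKER reading in which SkipTo consumes its landmark and SkipUntil stops just before it, and it covers forward application only.

```python
import re

def tokenize(text):
    """Split text into HTML tags, words/numbers, and single symbols."""
    return re.findall(r"</?\w+[^>]*>|\w+|[^\w\s]", text)

def find_landmark(tokens, landmark, start):
    """Index of the first occurrence of `landmark` at or after `start`, else None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return None

def apply_rule(tokens, rule):
    """Apply an ordered list of (kind, landmark) clauses from the left.

    Returns the token index where the field begins, or None if a clause fails.
    """
    pos = 0
    for kind, landmark in rule:
        i = find_landmark(tokens, landmark, pos)
        if i is None:
            return None
        # SkipTo moves past the landmark; SkipUntil stops in front of it.
        pos = i + len(landmark) if kind == "SkipTo" else i
    return pos

tokens = tokenize("present a talk: <center>Michael Kleinschmidt</center> Medical Physics")
start = apply_rule(tokens, [("SkipTo", ["<center>"])])
print(tokens[start])
```

Here the one-clause rule SkipTo(<center>) lands on the first token of the speaker's name in the talk-announcement excerpt shown earlier.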

Rule Disjunction • Because our system is based on a sequential covering algorithm, a rule disjunction is learned for each tag • A rule disjunction is an ordered set of rules that are applied in order when placing a tag • The first rule of that set to match in the document is used to place the tag • Keep in mind that it is a rule disjunction of one or more rules that is learned for each tag
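
The "first rule to match wins" behaviour can be sketched as follows. Rules are modelled here as plain functions that return a token position or None; `skip_to` and `place_tag` are hypothetical helpers, and the sample tokens are made up.

```python
def skip_to(token):
    """Build a one-clause rule: position just after the first `token`, else None."""
    def rule(tokens):
        return tokens.index(token) + 1 if token in tokens else None
    return rule

def place_tag(tokens, disjunction):
    """Try each rule in order and use the first one that matches."""
    for rule in disjunction:
        pos = rule(tokens)
        if pos is not None:
            return pos
    return None  # no rule in the disjunction matched this document

disjunction = [skip_to("<center>"), skip_to(":")]
tokens = ["Speaker", ":", "Jane", "Doe"]
print(place_tag(tokens, disjunction))
```

The first rule fails on this document, so the second rule places the tag.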

Example of a rule matching

Refining a Rule • A rule initially contains a single token • The token is taken from the tokens immediately adjacent to the target data item • Examples: SkipTo(SYMBOL) or SkipUntil(John) • Then, either a landmark is added to the rule or a token is added to one of the existing landmarks

Refining a Rule • Example • SkipTo(SYMBOL) can become: • SkipTo(be SYMBOL) • SkipTo(speaker) SkipTo(SYMBOL) • etc.
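
The two kinds of refinement can be sketched as a candidate generator. This is a simplified illustration under assumptions: rules are lists of (kind, landmark) pairs, new clauses are always SkipTo clauses added in front, and landmarks are only extended on the left. The actual system may generate more variants.

```python
def refine(rule, candidate_tokens):
    """Return candidate refinements of `rule`.

    rule: list of (kind, landmark) pairs, where landmark is a tuple of tokens.
    """
    candidates = []
    for tok in candidate_tokens:
        # Option 1: add a new single-token landmark as an extra clause.
        candidates.append([("SkipTo", (tok,))] + rule)
        # Option 2: add the token to one of the existing landmarks.
        for i, (kind, landmark) in enumerate(rule):
            candidates.append(rule[:i] + [(kind, (tok,) + landmark)] + rule[i + 1:])
    return candidates

start = [("SkipTo", ("SYMBOL",))]
for candidate in refine(start, ["be", "speaker"]):
    print(candidate)
```

Both refinements from the slide, SkipTo(be SYMBOL) and SkipTo(speaker) SkipTo(SYMBOL), appear among the generated candidates.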

Refining a Rule • After refining a rule, the best candidate rule is chosen and is determined to be either perfect or imperfect • The best candidate rule has the greatest number of matches on the remaining training documents • Early and failed matches are preferred over late matches • If the best candidate is perfect, it is returned; otherwise it is refined again

Keeping a Rule • We want to keep rules that have perfect accuracy on the training documents • No negative matches, where the rule being evaluated misplaces a tag in a training document • No false positive matches, where the rule places a tag for a data item in some training document where that data item does not exist • When a rule continues being refined without becoming “perfect”, it reaches a limit and is returned as is • The rule in this case is probably not very useful • This case is infrequent
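
A check for "perfect" in the sense above might look like the sketch below. It assumes a rule is a function returning a token position or None, and that a rule which simply fails to match a document is acceptable, since another rule in the disjunction can cover that document. The function names and sample data are illustrative.

```python
def is_perfect(rule, examples):
    """True if the rule never misplaces a tag and never places a spurious one.

    examples: list of (tokens, target); target is the correct position,
    or None when the data item does not occur in that document.
    """
    for tokens, target in examples:
        pos = rule(tokens)
        # pos != target covers both a misplaced tag and a false positive.
        if pos is not None and pos != target:
            return False
    return True

def coverage(rule, examples):
    """Number of documents in which the rule places the tag correctly."""
    return sum(1 for tokens, target in examples
               if target is not None and rule(tokens) == target)

def after_colon(tokens):
    return tokens.index(":") + 1 if ":" in tokens else None

examples = [
    (["Speaker", ":", "Ann"], 2),  # item present, rule matches correctly
    (["Host", "Ann"], None),       # item absent, rule does not fire
    (["By", "Ann"], 1),            # item present, rule fails to match
]
print(is_perfect(after_colon, examples), coverage(after_colon, examples))
```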

General overview of our improvements • Minimum Refinements • Rule Score • Refinement Window • Wildcards • In the upcoming examples, we often explain how each of these improvements is useful in finding a begin tag for an ontology element; the usefulness for end tags is similar

Minimum Refinements • Forces rules to be refined some minimum number of times • We typically use a minimum number of 5

Minimum Refinements Example • Consider the rule SkipTo(George) • Suppose this rule is perfect • In general, this rule would be very ineffective at finding the speaker’s first name • We would force this rule to be refined further so that it might ultimately have a greater coverage over all documents and reflect the structure of the domain
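
The effect of a minimum refinement count on the learning loop can be sketched as below. The loop structure, the function signature, and the upper limit of 20 are assumptions for illustration. Only the minimum of 5 comes from the slides.

```python
def learn_rule(initial, refine, is_perfect, score,
               min_refinements=5, max_refinements=20):
    """Refine a rule until it is perfect, but never stop before the minimum.

    refine(rule) returns candidate rules; score(rule) ranks them.
    max_refinements is a hypothetical stand-in for the refinement limit.
    """
    rule = initial
    for step in range(max_refinements):
        # A perfect rule is only accepted once it has been refined enough.
        if step >= min_refinements and is_perfect(rule):
            return rule
        rule = max(refine(rule), key=score)
    return rule  # limit reached: returned as is

# Toy stand-ins: a "rule" is just its refinement count and is always perfect.
result = learn_rule(0, refine=lambda r: [r + 1],
                    is_perfect=lambda r: True, score=lambda r: r)
print(result)
```

With the toy stand-ins, the rule is already perfect at the start but is still refined five times before being returned.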

Rule Score • Utilizes an evaluation set of documents • Decides whether forward rules or backward rules are better for a particular tag based on their performance on the evaluation set

Rule Score Example • What should we do when forward and backward rules disagree on the location of a tag? • We test the forward and backward rules on a set of evaluation documents that were not used during the training • If the forward rules have a better score on the evaluation set, they are stored as the rules for placing that tag • Requires additional marked-up documents
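
The selection step can be sketched as follows. The slides do not define the score, so this sketch assumes it is the fraction of evaluation documents in which the tag is placed correctly, and that ties go to the forward rules. Both choices, the function names, and the sample documents are illustrative.

```python
def score(apply_rules, evaluation):
    """Fraction of evaluation documents where the tag is placed correctly."""
    correct = sum(1 for tokens, target in evaluation
                  if apply_rules(tokens) == target)
    return correct / len(evaluation)

def choose_direction(forward, backward, evaluation):
    """Keep whichever rule set scores better on the held-out evaluation set."""
    f, b = score(forward, evaluation), score(backward, evaluation)
    return ("forward", forward) if f >= b else ("backward", backward)

# Toy rule sets: forward looks for a colon, backward takes the last token.
forward = lambda t: t.index(":") + 1 if ":" in t else None
backward = lambda t: len(t) - 1

evaluation = [
    (["Speaker", ":", "Ann"], 2),
    (["Who", "-", "Bob"], 2),
]
print(choose_direction(forward, backward, evaluation)[0])
```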

Refinement Window • Only consider the closest n tokens to a tag when refining a rule • We typically use n = 10

Refinement Window Example • Consider the tag Talk:Title • Its ontological parent is Talk, the entire talk announcement • Without a refinement window, many irrelevant tokens would be considered when learning rules for the title • At worst, some irrelevant tokens would actually be used in a rule • Such a rule would not generalize well
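
A refinement window amounts to a slice over the token sequence. The sketch below assumes a forward rule for a begin tag, so the window covers the n tokens immediately before the tag. Whether the actual system also looks at tokens after the tag is not stated on the slides.

```python
def refinement_window(tokens, tag_index, n=10):
    """Return up to n tokens immediately preceding the tag position."""
    return tokens[max(0, tag_index - n):tag_index]

tokens = [f"t{i}" for i in range(30)]
window = refinement_window(tokens, tag_index=25)
print(window[0], window[-1], len(window))
```

Only tokens inside the window are offered as refinement candidates, so distant tokens elsewhere in the announcement cannot end up in a rule for the title.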

Wildcards • Both domain-dependent and domain-independent; can be used in place of tokens • Allow us to better generalize a document’s structure • Examples are: MONTH, NUMBER, HTML_TAG, etc.

Wildcards Example • Consider the tags Talk:Date:Month and Talk:Date:DayOfWeek • We might start with the rule SkipUntil(INITIAL_CAP_WORD) for finding the month, but this rule would also match the day of the week • By virtue of the wildcard MONTH, we can use the rule SkipUntil(MONTH) to accurately locate the month
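
The month example can be sketched with a small wildcard table. The predicate definitions below are assumptions made for illustration. Only the wildcard names come from the slides.

```python
import re

MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}

# Hypothetical wildcard definitions: each maps a token to True or False.
WILDCARDS = {
    "MONTH": lambda t: t in MONTHS,
    "NUMBER": lambda t: t.isdigit(),
    "HTML_TAG": lambda t: re.fullmatch(r"</?\w+[^>]*>", t) is not None,
    "INITIAL_CAP_WORD": lambda t: t.isalpha() and t[:1].isupper(),
}

def matches(pattern, token):
    """A pattern is either a wildcard name or a literal token."""
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token

def skip_until(tokens, pattern):
    """Index of the first token matching `pattern`, or None."""
    for i, tok in enumerate(tokens):
        if matches(pattern, tok):
            return i
    return None

tokens = ["Thursday", ",", "March", "27", ",", "2002"]
print(tokens[skip_until(tokens, "INITIAL_CAP_WORD")])  # stops at the day of the week
print(tokens[skip_until(tokens, "MONTH")])             # stops at the month
```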

Marked Up by a Human <?xml version="1.0" ?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> <Talk> <Title>New Developments in Still Image and Video Compression</Title> <X-URI></X-URI> <DAML-URI>./daml/trainfile1.daml</DAML-URI> <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs.</Abstract> <BeginTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>3</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </BeginTime> <EndTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>4</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </EndTime> <Speaker rdf:parseType="Resource"> <Name>Hans L. Cycon</Name> <Organization>FHTW Berlin, University of Applied Sciences</Organization> <Email>hcycon@fhtw-berlin.de</Email> </Speaker> </Talk> </rdf:RDF>

Marked Up by our Basic System <?xml version="1.0" ?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> <Talk> <Title>New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title> <X-URI></X-URI> <DAML-URI>./daml/trainfile1.daml</DAML-URI> <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs. <Abstract> <BeginTime rdf:parseType="Resource"> <time:Year></time:Year> <time:Month></time:Month> <time:Day></time:Day> <time:DayOfWeek></time:DayOfWeek> <time:Hour></time:Hour> <time:Minute>00</time:Minute> <time:Second>00</time:Second> </BeginTime> <EndTime rdf:parseType="Resource"> <time:Year></time:Year> <time:Month></time:Month> <time:Day></time:Day> <time:DayOfWeek></time:DayOfWeek> <time:Hour>3:30-4</time:Hour> <time:Minute>30-4:30</time:Minute> <time:Second>00</time:Second> </EndTime> <Speaker rdf:parseType="Resource"> <Name>Hans L </Name> <Organization></Organization> <Email>hcycon@fhtw-berlin.de</Email> </Speaker> </Talk> </rdf:RDF>

Marked Up by our Full System <?xml version="1.0" ?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> <Talk> <Title>New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title> <X-URI></X-URI> <DAML-URI>./daml/trainfile1.daml</DAML-URI> <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs.</Abstract> <BeginTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>3</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </BeginTime> <EndTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>4</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </EndTime> <Speaker rdf:parseType="Resource"> <Name>Hans L. Cycon</Name> <Organization>Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Organization> <Email>hcycon@fhtw-berlin.de</Email> </Speaker> </Talk> </rdf:RDF>

Experimental Setup • 3 domains: UC Berkeley, UCSB, and ITTALKS • 6 systems: Basic, Min Refine, Score, Refinement Window, Wildcards, and Full • 10 partitions • 20/20/20 split into training/evaluation/testing sets • Recall = number of correctly extracted data items divided by the total number of data items in the documents
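
The recall definition translates directly into code. The sketch below assumes that "correctly extracted" means exact string equality with the human markup, which the slides do not specify. The three-item sample is a small subset of the Cycon announcement shown earlier, used only to illustrate the arithmetic, not to reproduce any reported result.

```python
def recall(extracted, gold):
    """Correctly extracted data items divided by all data items in the gold markup."""
    if not gold:
        return 0.0
    correct = sum(1 for tag, value in gold.items()
                  if extracted.get(tag) == value)
    return correct / len(gold)

gold = {
    "Talk:Speaker:Name": "Hans L. Cycon",
    "Talk:Speaker:Email": "hcycon@fhtw-berlin.de",
    "Talk:BeginTime:Hour": "3",
}
basic = {
    "Talk:Speaker:Name": "Hans L ",
    "Talk:Speaker:Email": "hcycon@fhtw-berlin.de",
    "Talk:BeginTime:Hour": "",
}
print(round(recall(basic, gold), 2))
```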

Average Recall Over All Tags (charts for the UC Berkeley Domain and the UCSB Domain)

Performance Improvements on Individual Tags UC Berkeley Domain

Performance Improvements on Individual Tags UCSB Domain

Conclusion • Our system extends the state-of-the-art algorithm STALKER • Our system performs DAML markup on talk announcements • It can trivially be extended to different markup languages and different domains • A working implementation of everything described here exists!

Future Considerations • Active Learning: select training documents that yield rules with the greatest possible coverage • Cardinality Issues: ontology elements that appear in lists • Linguistic Information: use a system like Aerotext to preprocess the documents • Google API: check to see if our tag placement “makes sense”

Acknowledgements • This work was supported in part by the Defense Advanced Research Projects Agency under contract F30602-00-2-0591 as part of the DAML program (http://daml.org/) • It was also supported by a Northrop Grumman Fellowship

References • Ciravegna, F. (2001). (LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, held in conjunction with the 17th International Joint Conference on Artificial Intelligence (IJCAI). • Cost, R. S., T. Finin, A. Joshi, Y. Peng, C. Nicholas, I. Soboroff, H. Chen, L. Kagal, F. Perich, Y. Zou, and S. Tolia. (2002). ITtalks: A Case Study in the Semantic Web and DAML+OIL. IEEE Intelligent Systems, 17(1):40-47. • Hendler, J. (2001). Agents and the Semantic Web. IEEE Intelligent Systems, 16(2):30-37. • Hendler, J., and D. L. McGuinness. (2000). The Darpa Agent Markup Language. IEEE Intelligent Systems, 15(6):67-73. • Knoblock, C. A., K. Lerman, S. Minton, and I. Muslea. Accurately and reliably extracting data from the web: A machine learning approach. Data Engineering Bulletin. • Muslea, I., S. Minton, and C. Knoblock. (2001). Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems.
