
Automatically Generated DAML Markup for Semistructured Documents

Automatically Generated DAML Markup for Semistructured Documents. William Krueger, Jonathan Nilsson, Tim Oates, Tim Finin. Supported by DARPA contract F30602-00-2-0591. DAML and the Semantic Web.



  1. Automatically Generated DAML Markup for Semistructured Documents William Krueger, Jonathan Nilsson, Tim Oates, Tim Finin Supported by DARPA contract F30602-00-2-0591

  2. DAML and the Semantic Web • The most efficient way for machines to understand the semantics of the vast amount of information on the web is to add semantic markup to the information • DAML (DARPA Agent Markup Language) is one existing semantic markup language

  3. The Problem • Semantically marking up large amounts of data by hand is far too time consuming • We use machine learning techniques to automate the task

  4. An Excerpt From a Talk Announcement The International Computer Science Institute <br> is pleased to present a talk: <p><BR> "Automatic Classification of Acoustic Signals Based on Psychoacoustic and Neurophysiological Knowledge"<BR> <P> <center>Michael Kleinschmidt</center> Medical Physics Group, University of Oldenburg, Germany<BR> Who is the speaker?

  5. An Excerpt From a Talk Announcement (Solution) The International Computer Science Institute <br> is pleased to present a talk: <p><BR> "Automatic Classification of Acoustic Signals Based on Psychoacoustic and Neurophysiological Knowledge"<BR> <P> <center><Speaker>Michael Kleinschmidt</Speaker></center> Medical Physics Group, University of Oldenburg, Germany<BR>

  6. Outline • Talk Ontology • Hierarchical Wrapper Induction • Contributions • Experimental Results • Future Considerations

  7. Our Talk Ontology • An ontology is the hierarchically organized vocabulary used to semantically mark up information sources • The root of our talk ontology is Talk • The ontological children of Talk include elements such as Talk:Title and Talk:BeginTime • The element Talk:BeginTime has its own ontological children, Talk:BeginTime:Hour and Talk:BeginTime:Minute
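
The hierarchy described on this slide can be pictured as a small tree. The following is a minimal sketch, not the authors' implementation: the `OntologyElement` class and its `paths` method are hypothetical names chosen for illustration, and only the elements named on the slide are included.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyElement:
    """One node of the ontology tree (hypothetical helper class)."""
    name: str
    children: List["OntologyElement"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """Return colon-separated paths for this element and its descendants."""
        path = f"{prefix}:{self.name}" if prefix else self.name
        result = [path]
        for child in self.children:
            result.extend(child.paths(path))
        return result


# Only the elements mentioned on the slide; the real ontology has more.
talk = OntologyElement("Talk", [
    OntologyElement("Title"),
    OntologyElement("BeginTime", [
        OntologyElement("Hour"),
        OntologyElement("Minute"),
    ]),
])

for p in talk.paths():
    print(p)
```

Walking the tree depth-first yields the same colon-separated names the slide uses, such as Talk:BeginTime:Hour.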

  8. Advantages of a Hierarchy • Using a hierarchical data model, we can break up documents into embedded segments • When learning rules for the speaker’s first name, for example, we only have to consider the speaker segment of each document

  9. Wrappers • A wrapper is the set of rules used to extract data along with the code required to perform the extraction

  10. The STALKER Algorithm • Stalker is a hierarchical wrapper induction algorithm developed at ISI • We use a modified Stalker algorithm to do information extraction on a source • The extracted information along with a DAML ontology can then be used to create markup for the source
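
To illustrate the last bullet, here is a rough sketch of turning extracted values into nested markup. It assumes the extracted fields arrive as a nested dictionary keyed by ontology element name; `build_element` is a hypothetical helper, and the RDF namespaces and attributes of real DAML output are left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_element(tag, value):
    """Recursively build an XML element from a nested dict of extracted values."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            elem.append(build_element(child_tag, child_value))
    else:
        elem.text = str(value)
    return elem


# Illustrative values taken from the talk announcement shown later in the deck.
extracted = {
    "Title": "New Developments in Still Image and Video Compression",
    "Speaker": {"Name": "Hans L. Cycon"},
}

markup = ET.tostring(build_element("Talk", extracted), encoding="unicode")
print(markup)
```

The nesting of the dictionary mirrors the nesting of the ontology, so each extracted field lands inside the element of its ontological parent.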

  11. Defining Rule, Landmark, Token, etc. • A token is an elementary piece of text, such as a lowercase word, an HTML tag, a number, an alphanumeric word, or a symbol • A landmark is a sequence of one or more consecutive tokens • A rule clause contains one landmark and is one of two types: SkipTo or SkipUntil • A rule is an ordered list of rule clauses that can be applied either forward or backward and is used to locate both the beginning and the end of an information field
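
The definitions above can be made concrete with a short sketch. It is an illustration under stated assumptions rather than the system's code: the tokenizer regex is a guess at a reasonable token split, only forward application is shown, and the clause semantics follow the usual STALKER reading (SkipTo stops just after the landmark, SkipUntil stops at its first token). The names `tokenize`, `find_landmark`, and `apply_rule` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumed token split: HTML tags, word-like runs, or single symbols.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")

Clause = Tuple[str, List[str]]  # (clause type, landmark as a token sequence)


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)


def find_landmark(tokens: List[str], landmark: List[str], start: int) -> Optional[int]:
    """Index of the first occurrence of the landmark at or after start."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return None


def apply_rule(tokens: List[str], rule: List[Clause], start: int = 0) -> Optional[int]:
    """Apply the clauses in order (forward); return a token index or None."""
    pos = start
    for kind, landmark in rule:
        i = find_landmark(tokens, landmark, pos)
        if i is None:
            return None  # the rule fails to match
        # SkipTo consumes the landmark; SkipUntil stops in front of it.
        pos = i + len(landmark) if kind == "SkipTo" else i
    return pos


tokens = tokenize("is pleased to present a talk: <center>Michael Kleinschmidt</center>")
begin = apply_rule(tokens, [("SkipTo", ["<center>"])])
end = apply_rule(tokens, [("SkipUntil", ["</center>"])])
print(tokens[begin:end])  # ['Michael', 'Kleinschmidt']
```

One rule locates where the speaker field begins and a second locates where it ends, which is how a pair of rules delimits an information field.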

  12. Rule Disjunction • Because our system is based on a sequential covering algorithm, a rule disjunction is learned for each tag • A rule disjunction is an ordered set of rules that are applied in order when placing a tag • The first of that set to match in the document is used to place the tag • Keep in mind that it is a rule disjunction of one or more rules that is learned for each tag
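
The "first rule that matches wins" behavior can be sketched as follows. This is a simplified illustration: each rule is modeled as a plain function from a token list to a position (or None when it does not match), and `skip_to` and `apply_disjunction` are hypothetical names.

```python
from typing import Callable, List, Optional

Rule = Callable[[List[str]], Optional[int]]


def skip_to(token: str) -> Rule:
    """Build a one-clause rule that stops just after the given token."""
    def rule(tokens: List[str]) -> Optional[int]:
        return tokens.index(token) + 1 if token in tokens else None
    return rule


def apply_disjunction(rules: List[Rule], tokens: List[str]) -> Optional[int]:
    """Try the rules in order and use the first one that matches."""
    for rule in rules:
        pos = rule(tokens)
        if pos is not None:
            return pos
    return None  # no rule in the disjunction matched


speaker_begin = [skip_to("<center>"), skip_to("Speaker")]
print(apply_disjunction(speaker_begin, ["<p>", "<center>", "Michael"]))  # 2
print(apply_disjunction(speaker_begin, ["Speaker", ":", "Jane"]))        # 1
```

Because the rules are ordered, a later rule only gets a chance on documents that every earlier rule failed to match.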

  13. Example of a rule matching

  14. Refining a Rule • A rule initially contains a single token • The token is taken from the tokens immediately adjacent to the target data item • Examples: SkipTo(SYMBOL) or SkipUntil(John) • Then, either a landmark is added to the rule or a token is added to one of the existing landmarks

  15. Refining a Rule • Example • SkipTo(SYMBOL) can become: • SkipTo(be SYMBOL) • SkipTo(speaker) SkipTo(SYMBOL) • etc.
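
The two kinds of refinement on these slides (growing an existing landmark, or adding a new landmark) can be enumerated mechanically. The sketch below is a simplification under assumptions: new clauses are always of type SkipTo, a token may be added at either end of a landmark, and `refine` and `format_rule` are hypothetical helper names.

```python
from typing import List, Tuple

Clause = Tuple[str, Tuple[str, ...]]  # (clause type, landmark tokens)
Rule = Tuple[Clause, ...]


def refine(rule: Rule, candidates: List[str]) -> List[Rule]:
    """Generate every one-step refinement of a rule from candidate tokens."""
    refined = []
    for tok in candidates:
        # Landmark refinement: add the token to an existing landmark.
        for i, (kind, lm) in enumerate(rule):
            refined.append(rule[:i] + ((kind, (tok,) + lm),) + rule[i + 1:])
            refined.append(rule[:i] + ((kind, lm + (tok,)),) + rule[i + 1:])
        # New landmark: insert a one-token SkipTo clause before an existing clause.
        for i in range(len(rule)):
            refined.append(rule[:i] + (("SkipTo", (tok,)),) + rule[i:])
    return refined


def format_rule(rule: Rule) -> str:
    return " ".join(f"{kind}({' '.join(lm)})" for kind, lm in rule)


initial: Rule = (("SkipTo", ("SYMBOL",)),)
for candidate in refine(initial, ["be", "speaker"]):
    print(format_rule(candidate))
```

Among the printed candidates are SkipTo(be SYMBOL) and SkipTo(speaker) SkipTo(SYMBOL), the two refinements given as examples on the slide.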

  16. Refining a Rule • After refining a rule, the best candidate rule is chosen and is determined to be either perfect or imperfect • The best candidate rule has the greatest number of matches on the remaining training documents • Early and failed matches are preferred over late matches • If the best candidate is perfect, it is returned; otherwise it is refined again

  17. Keeping a Rule • We want to keep rules that have perfect accuracy on the training documents • No negative matches where the rule being evaluated misplaces a tag in a training document • No false positive matches where the rule places a tag for a data item in some training document where that data item does not exist • When a rule continues being refined without becoming “perfect” it reaches a limit and is returned as is • The rule in this case is probably not very useful • This case is infrequent
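
One way to read the "perfect rule" test is sketched below. It assumes each training document yields a pair of the predicted tag position and the true position, with None meaning "no match" or "item absent". Treating a plain failure to match as acceptable is an interpretation based on the sequential covering setup, where another rule in the disjunction can cover that document. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Outcome = Tuple[Optional[int], Optional[int]]  # (predicted position, true position)


def classify(predicted: Optional[int], target: Optional[int]) -> str:
    """Label one rule application on one training document."""
    if target is None:
        return "absent" if predicted is None else "false_positive"
    if predicted is None:
        return "fail"
    if predicted == target:
        return "correct"
    return "early" if predicted < target else "late"


def is_perfect(outcomes: List[Outcome]) -> bool:
    """Perfect means no misplaced tag and no tag for a non-existent item."""
    bad = {"early", "late", "false_positive"}
    return all(classify(p, t) not in bad for p, t in outcomes)


print(is_perfect([(5, 5), (None, 7), (None, None)]))  # True
print(is_perfect([(5, 5), (3, None)]))                # False
```

The early/late distinction is kept separate because the candidate-selection step on the previous slide treats early and failed matches differently from late ones.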

  18. General overview of our improvements • Minimum Refinements • Rule Score • Refinement Window • Wildcards • In the upcoming examples, we often explain how each of these improvements is useful in finding a begin tag for an ontology element; the usefulness for end tags is similar

  19. Minimum Refinements • Forces rules to be refined some minimum number of times • We typically use a minimum number of 5

  20. Minimum Refinements Example • Consider the rule SkipTo(George) • Suppose this rule is perfect • In general, this rule would be very ineffective at finding the speaker’s first name • We would force this rule to be refined further so that it might ultimately have a greater coverage over all documents and reflect the structure of the domain
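
The stopping condition implied by minimum refinements, combined with the refinement limit mentioned under "Keeping a Rule", might look like the sketch below. The default minimum of 5 comes from the slides; the upper limit of 20 is a made-up placeholder, since the deck does not state the actual limit, and `should_return` is a hypothetical name.

```python
def should_return(perfect: bool, refinements: int,
                  min_refinements: int = 5, max_refinements: int = 20) -> bool:
    """Decide whether to stop refining and return the current rule.

    max_refinements is an assumed value; the slides give no number for it.
    """
    if refinements >= max_refinements:
        return True  # give up and return the rule as is
    # A perfect rule is only accepted once it has been refined enough times.
    return perfect and refinements >= min_refinements


print(should_return(perfect=True, refinements=0))   # False: SkipTo(George) is refined further
print(should_return(perfect=True, refinements=5))   # True
print(should_return(perfect=False, refinements=20)) # True: limit reached
```

With this check, a rule such as SkipTo(George) that happens to be perfect on the training set is not accepted immediately.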

  21. Rule Score • Utilizes an evaluation set of documents • Decides whether forward rules or backward rules are better for a particular tag based on their performance on the evaluation set

  22. Rule Score Example • What should we do when forward and backward rules disagree on the location of a tag? • We test the forward and backward rules on a set of evaluation documents that were not used during the training • If the forward rules have a better score on the evaluation set, they are stored as the rules for placing that tag • Requires additional marked-up documents
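
The choice between forward and backward rules can be sketched as a comparison of accuracy on held-out documents. This is an illustration only: the score here is simply the fraction of evaluation documents on which the tag is placed correctly, ties go to the backward rules by assumption, and the toy locators and documents are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

Doc = List[str]
Locator = Callable[[Doc], Optional[int]]
EvalSet = List[Tuple[Doc, int]]  # (tokens, correct tag position)


def score(locate: Locator, eval_set: EvalSet) -> float:
    """Fraction of evaluation documents where the tag is placed correctly."""
    if not eval_set:
        return 0.0
    hits = sum(1 for tokens, gold in eval_set if locate(tokens) == gold)
    return hits / len(eval_set)


def choose_rules(forward: Locator, backward: Locator, eval_set: EvalSet) -> str:
    """Keep the forward rules only if they score strictly better (assumed tie-break)."""
    return "forward" if score(forward, eval_set) > score(backward, eval_set) else "backward"


# Toy locators standing in for learned rule disjunctions.
forward = lambda t: t.index("Speaker") + 1 if "Speaker" in t else None
backward = lambda t: len(t) - 1

eval_set = [
    (["Speaker", "Jane", "Doe"], 1),
    (["Title", "X", "Speaker", "John", "Roe"], 3),
]
print(choose_rules(forward, backward, eval_set))  # forward
```

The evaluation documents must be marked up but unseen during training, which is why this improvement requires additional labeled data.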

  23. Refinement Window • Only consider the closest n tokens to a tag when refining a rule • We typically use n = 10

  24. Refinement Window Example • Consider the tag Talk:Title • Its ontological parent is Talk, the entire talk announcement • Without a refinement window, many irrelevant tokens would be considered when learning rules for the title • At worst, some irrelevant tokens would actually be used in a rule • Such a rule would not generalize well
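
A refinement window amounts to slicing the token list around the tag. The sketch below assumes a forward rule for a begin tag, so "closest" is taken to mean the n tokens immediately before the tag position; the slides do not spell this out, and `refinement_window` is a hypothetical name.

```python
from typing import List


def refinement_window(tokens: List[str], tag_pos: int, n: int = 10) -> List[str]:
    """Return the n tokens immediately before tag_pos (fewer near the start)."""
    return tokens[max(0, tag_pos - n):tag_pos]


# A stand-in for a long announcement: 30 placeholder tokens.
tokens = [f"w{i}" for i in range(30)]
print(refinement_window(tokens, tag_pos=25))  # w15 ... w24
```

Only tokens inside the window are offered as refinement candidates, so distant text in the parent segment cannot end up in a rule.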

  25. Wildcards • Both domain-dependent and domain-independent; can be used in place of tokens • Allow us to better generalize a document’s structure • Examples are: MONTH, NUMBER, HTML_TAG, etc.

  26. Wildcards Example • Consider the tags Talk:Date:Month and Talk:Date:DayOfWeek • We might start with the rule SkipUntil(INITIAL_CAP_WORD) for finding the month, but this rule would match the day of the week, as well • By virtue of the wildcard MONTH, we can use the rule SkipUntil(MONTH) to accurately locate the month
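
The month example can be reproduced with a small wildcard table. This is a sketch under assumptions: the predicates below are plausible definitions for the wildcard names on the slides, not the system's actual ones, and `matches` and `skip_until` are hypothetical helpers that handle one-token landmarks only.

```python
from typing import List, Optional

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}

# Assumed wildcard definitions, keyed by the names used on the slides.
WILDCARDS = {
    "MONTH": lambda t: t in MONTHS,
    "NUMBER": lambda t: t.isdigit(),
    "HTML_TAG": lambda t: t.startswith("<") and t.endswith(">"),
    "INITIAL_CAP_WORD": lambda t: t.isalpha() and t[:1].isupper(),
}


def matches(element: str, token: str) -> bool:
    """A landmark element is either a wildcard name or a literal token."""
    predicate = WILDCARDS.get(element)
    return predicate(token) if predicate else element == token


def skip_until(tokens: List[str], element: str) -> Optional[int]:
    """Index of the first token matching the element, or None."""
    for i, tok in enumerate(tokens):
        if matches(element, tok):
            return i
    return None


tokens = ["Thursday", ",", "March", "27"]
print(skip_until(tokens, "INITIAL_CAP_WORD"))  # 0, the day of the week
print(skip_until(tokens, "MONTH"))             # 2, the month
```

The domain-dependent MONTH wildcard is narrower than INITIAL_CAP_WORD, so it skips past the day of the week and stops at the month.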

  27. Marked Up by a Human <?xml version="1.0" ?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> <Talk> <Title>New Developments in Still Image and Video Compression</Title> <X-URI></X-URI> <DAML-URI>./daml/trainfile1.daml</DAML-URI> <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs.</Abstract> <BeginTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>3</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </BeginTime> <EndTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>4</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </EndTime> <Speaker rdf:parseType="Resource"> <Name>Hans L. Cycon</Name> <Organization>FHTW Berlin, University of Applied Sciences</Organization> <Email>hcycon@fhtw-berlin.de</Email> </Speaker> </Talk> </rdf:RDF>

  28. Marked Up by our Basic System <?xml version="1.0" ?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> <Talk> <Title>New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title> <X-URI></X-URI> <DAML-URI>./daml/trainfile1.daml</DAML-URI> <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs. </Abstract> <BeginTime rdf:parseType="Resource"> <time:Year></time:Year> <time:Month></time:Month> <time:Day></time:Day> <time:DayOfWeek></time:DayOfWeek> <time:Hour></time:Hour> <time:Minute>00</time:Minute> <time:Second>00</time:Second> </BeginTime> <EndTime rdf:parseType="Resource"> <time:Year></time:Year> <time:Month></time:Month> <time:Day></time:Day> <time:DayOfWeek></time:DayOfWeek> <time:Hour>3:30-4</time:Hour> <time:Minute>30-4:30</time:Minute> <time:Second>00</time:Second> </EndTime> <Speaker rdf:parseType="Resource"> <Name>Hans L </Name> <Organization></Organization> <Email>hcycon@fhtw-berlin.de</Email> </Speaker> </Talk> </rdf:RDF>

  29. Marked Up by our Full System <?xml version="1.0" ?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> <Talk> <Title>New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Title> <X-URI></X-URI> <DAML-URI>./daml/trainfile1.daml</DAML-URI> <Abstract>While the demand on quality of digital images and videos increases, .... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs.</Abstract> <BeginTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>3</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </BeginTime> <EndTime rdf:parseType="Resource"> <time:Year>2002</time:Year> <time:Month>March</time:Month> <time:Day>27</time:Day> <time:DayOfWeek>Thursday</time:DayOfWeek> <time:Hour>4</time:Hour> <time:Minute>30</time:Minute> <time:Second>00</time:Second> </EndTime> <Speaker rdf:parseType="Resource"> <Name>Hans L. Cycon</Name> <Organization>Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences</Organization> <Email>hcycon@fhtw-berlin.de</Email> </Speaker> </Talk> </rdf:RDF>

  30. Experimental Setup • 3 Domains • UC Berkeley, UCSB, and ITTALKS • 6 Systems • Basic, Min Refine, Score, Refinement Window, Wildcards, and Full • 10 Partitions • 20/20/20 split • Training/Evaluation/Testing Sets • recall = number of correctly extracted data items divided by the total number of data items in the documents
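
The recall definition in the last bullet translates directly into code. The sketch below assumes extracted and reference data items are compared by exact string equality per tag; the three fields are an illustrative subset taken from the marked-up example earlier in the deck, not an experimental result.

```python
from typing import Dict


def recall(extracted: Dict[str, str], reference: Dict[str, str]) -> float:
    """Correctly extracted data items divided by all data items in the reference."""
    if not reference:
        return 0.0
    correct = sum(1 for tag, value in reference.items() if extracted.get(tag) == value)
    return correct / len(reference)


# Reference values (human markup) versus an imperfect extraction.
reference = {
    "Title": "New Developments in Still Image and Video Compression",
    "Name": "Hans L. Cycon",
    "Email": "hcycon@fhtw-berlin.de",
}
extracted = {
    "Title": "New Developments in Still Image and Video Compression Prof. Hans L. Cycon "
             "FHTW Berlin, University of Applied Sciences",
    "Name": "Hans L",
    "Email": "hcycon@fhtw-berlin.de",
}
print(recall(extracted, reference))  # 1 of 3 items is exactly right
```

Under exact matching, an overlong title or a truncated name counts as a miss, so only the email address contributes to recall here.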

  31. Average Recall Over All Tags [Charts: UC Berkeley Domain, UCSB Domain]

  32. Performance Improvements on Individual Tags [Chart: UC Berkeley Domain]

  33. Performance Improvements on Individual Tags [Chart: UCSB Domain]

  34. Conclusion • Our system extends the state-of-the-art algorithm STALKER • Our system performs DAML markup on talk announcements • It can trivially be extended to different markup languages and different domains • A working implementation of everything described here exists!

  35. Future Considerations • Active Learning: select training documents that yield rules with the greatest possible coverage • Cardinality Issues: ontology elements that appear in lists • Linguistic Information: use a system like Aerotext to preprocess the documents • Google API: check to see if our tag placement “makes sense”

  36. Acknowledgements • This work was supported in part by the Defense Advanced Research Projects Agency under contract F30602-00-2-0591 as part of the DAML program (http://daml.org/) • It was also supported by a Northrop Grumman Fellowship

  37. References • Ciravegna, F. (2001). (LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, held in conjunction with the 17th International Joint Conference on Artificial Intelligence (IJCAI). • Cost, R. S., T. Finin, A. Joshi, Y. Peng, C. Nicholas, I. Soboroff, H. Chen, L. Kagal, F. Perich, Y. Zou, and S. Tolia. (2002). ITtalks: A Case Study in the Semantic Web and DAML+OIL. IEEE Intelligent Systems, 17(1):40-47. • Hendler, J. (2001). Agents and the Semantic Web. IEEE Intelligent Systems, 16(2):30-37. • Hendler, J., and D. L. McGuinness. (2000). The DARPA Agent Markup Language. IEEE Intelligent Systems, 15(6):67-73. • Knoblock, C. A., K. Lerman, S. Minton, and I. Muslea. Accurately and reliably extracting data from the web: A machine learning approach. Data Engineering Bulletin. • Muslea, I., S. Minton, and C. Knoblock. (2001). Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems.
