
Towards the Self-Annotating Web



  1. Towards the Self-Annotating Web Philipp Cimiano, Siegfried Handschuh, Steffen Staab Presenter: Hieu K Le (most slides come from Philipp Cimiano) CS598CXZ - Spring 2005 - UIUC

  2. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  3. The annotation problem in 4 cartoons

  4. The annotation problem from a scientific point of view

  5. The annotation problem in practice

  6. The vicious cycle

  7. Annotating: A Concept? A Noun?

  8. Annotating • To annotate terms in a web page: • Manual definition of annotations • Learning of extraction rules • Both require a lot of labor

  9. A small Quiz • What is “Laksa”? A. A dish B. A city C. A temple D. A mountain  The answer is: A. A dish

  12. From Google: “Laksa”

  13. From Google • “cities such as Laksa”: 0 hits • “dishes such as Laksa”: 10 hits • “mountains such as Laksa”: 0 hits • “temples such as Laksa”: 0 hits • Google knows more than all of you together! • Example of using syntactic information + statistics to derive semantic information (see the sketch below)
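
A minimal sketch of the idea on this slide, in Python: instantiate one pattern per candidate concept, count web hits, and pick the concept with the most evidence. `web_hit_count` is a hypothetical stand-in, not a real API, and the `+s` plural is naive (it would get "city" wrong, where the real query needs "cities").

```python
# Sketch only: choose the concept whose instantiated pattern gets the most
# web hits. web_hit_count is a hypothetical placeholder for a search API
# that returns a result count for an exact-phrase query.

def web_hit_count(phrase: str) -> int:
    """Hypothetical: number of search results for the exact phrase."""
    raise NotImplementedError("plug in a real search API here")

def categorize(instance: str, concepts: list[str]) -> str:
    # naive plural (+s); proper pluralization (city -> cities) is needed in practice
    hits = {c: web_hit_count(f'"{c}s such as {instance}"') for c in concepts}
    # annotation by maximal statistical evidence
    return max(hits, key=hits.get)

# categorize("Laksa", ["dish", "mountain", "temple"])  # -> "dish"
```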

  14. Self-annotating • PANKOW (Pattern-based Annotation through Knowledge On the Web) • Unsupervised • Pattern-based • Within a fixed ontology • Uses evidence from the whole Web

  15. The Self-Annotating Web • There is a huge amount of implicit knowledge in the Web • Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle: semantics ≈ syntax + statistics? • Annotation by maximal statistical evidence

  16. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  17. PANKOW Process

  18. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  19. Patterns • HEARST1: <CONCEPT>s such as <INSTANCE> • HEARST2: such <CONCEPT>s as <INSTANCE> • HEARST3: <CONCEPT>s, (especially/including) <INSTANCE> • HEARST4: <INSTANCE> (and/or) other <CONCEPT>s • Examples: • dishes such as Laksa • such dishes as Laksa • dishes, especially Laksa • dishes, including Laksa • Laksa and other dishes • Laksa or other dishes

  20. Patterns (Cont'd) • DEFINITE1: the <INSTANCE> <CONCEPT> • DEFINITE2: the <CONCEPT> <INSTANCE> • APPOSITION: <INSTANCE>, a <CONCEPT> • COPULA: <INSTANCE> is a <CONCEPT> • Examples: • the Laksa dish • the dish Laksa • Laksa, a dish • Laksa is a dish
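
A sketch (not the authors' code) that instantiates the pattern schemata from slides 19-20 for one instance/concept pair. Plurals are again formed naively by appending "s"; PANKOW's actual pluralization is not shown on these slides.

```python
# Pattern schemata from slides 19-20; {i} = instance, {c} = concept.
PATTERNS = {
    "HEARST1":    "{c}s such as {i}",
    "HEARST2":    "such {c}s as {i}",
    "HEARST3a":   "{c}s, especially {i}",
    "HEARST3b":   "{c}s, including {i}",
    "HEARST4a":   "{i} and other {c}s",
    "HEARST4b":   "{i} or other {c}s",
    "DEFINITE1":  "the {i} {c}",
    "DEFINITE2":  "the {c} {i}",
    "APPOSITION": "{i}, a {c}",
    "COPULA":     "{i} is a {c}",
}

def instantiate(instance: str, concept: str) -> dict[str, str]:
    """Map each pattern name to its instantiated query string."""
    return {name: tpl.format(i=instance, c=concept)
            for name, tpl in PATTERNS.items()}

# instantiate("Laksa", "mountain")["HEARST1"] -> "mountains such as Laksa"
```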

  21. Asking Google (more formally) • Instance i ∈ I, concept c ∈ C, pattern p ∈ {HEARST1, ..., COPULA} • count(i, c, p) returns the number of Google hits of the instantiated pattern • count(i, c) := Σp count(i, c, p), e.g. count(Laksa, dish) := count(Laksa, dish, DEFINITE1) + ... • Restrict to the best concepts beyond a threshold
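
A sketch of the aggregation just defined: count(i, c) sums the hits of all instantiated patterns for the pair, and only concepts whose total clears a threshold are proposed. `count_pattern_hits` is hypothetical, as above, and the threshold value is illustrative rather than taken from the paper.

```python
from typing import Callable

PATTERN_NAMES = ("HEARST1", "HEARST2", "HEARST3", "HEARST4",
                 "DEFINITE1", "DEFINITE2", "APPOSITION", "COPULA")

def count(instance: str, concept: str,
          count_pattern_hits: Callable[[str, str, str], int]) -> int:
    # count(i, c) := sum over all patterns p of count(i, c, p)
    return sum(count_pattern_hits(instance, concept, p) for p in PATTERN_NAMES)

def best_concepts(instance: str, concepts: list[str],
                  count_pattern_hits: Callable[[str, str, str], int],
                  threshold: int = 60) -> list[str]:
    # threshold is illustrative, not a value from the paper
    totals = {c: count(instance, c, count_pattern_hits) for c in concepts}
    return [c for c in sorted(totals, key=totals.get, reverse=True)
            if totals[c] >= threshold]
```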

  22. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  23. Evaluation Scenario • Corpus: 45 texts from http://www.lonelyplanet.com/destinations • Ontology: tourism ontology from the GETESS project • #concepts: original – 1043; pruned – 682 • Manual annotation by two subjects: • A: 436 instance/concept assignments • B: 392 instance/concept assignments • Overlap: 277 instances (gold standard) • A and B used 59 different concepts • Categorical (kappa) agreement on the 277 instances: 63.5%
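
The 63.5% agreement figure is a kappa statistic; a sketch of Cohen's kappa over the jointly annotated instances, under my assumption that each annotator assigns exactly one concept per instance (this is not the authors' evaluation code):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    # observed agreement: fraction of instances labeled identically
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independent labeling with each annotator's
    # empirical concept distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```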

  24. Examples (instance: concept, Google hits) • Atlantic: city (1520837) • Bahamas: island (649166) • USA: country (582275) • Connecticut: state (302814) • Caribbean: sea (227279) • Mediterranean: sea (212284) • Canada: country (176783) • Guatemala: city (174439) • Africa: region (131063) • Australia: country (128607) • France: country (125863) • Germany: country (124421) • Easter: island (96585) • St Lawrence: river (65095) • Commonwealth: state (49692) • New Zealand: island (40711) • Adriatic: sea (39726) • Netherlands: country (37926) • St John: church (34021) • Belgium: country (33847) • San Juan: island (31994) • Mayotte: island (31540) • EU: country (28035) • UNESCO: organization (27739) • Austria: group (24266) • Greece: island (23021) • Malawi: lake (21081) • Israel: country (19732) • Perth: street (17880) • Luxembourg: city (16393) • Nigeria: state (15650) • St Croix: river (14952) • Nakuru: lake (14840) • Kenya: country (14382) • Benin: city (14126) • Cape Town: city (13768)

  25. Results • F = 28.24% • R/Acc = 24.90%
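
If F here is the harmonic mean of precision and recall (F1), precision can be recovered from the two reported numbers. The following arithmetic is my inference; the slide does not state precision:

```python
# F1 = 2PR / (P + R), so P = F*R / (2R - F).

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

def implied_precision(f: float, r: float) -> float:
    return f * r / (2 * r - f)

print(round(implied_precision(0.2824, 0.2490), 4))  # ~0.3262, i.e. ~32.6%
```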

  26. Comparison

  27. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  28. CREAM/OntoMat Annotation Environment [architecture diagram; components: Ontology Guidance & Fact Browser, Document Editor/Viewer, Annotation Tool GUI (annotate, crawl, and load plugins), Annotation Inference Server, Document Management, Domain Ontologies, Web Pages (WWW), Annotated Web Pages, PANKOW; labeled flows: annotate, crawl, query, load, extract, Annotation by Markup]

  29. PANKOW & CREAM/OntoMat

  30. Results (Interactive Mode) • F = 51.65% • R/Acc = 49.46%

  31. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  32. Current State of the Art • Large-scale IE [SemTag&Seeker@WWW'03] • only disambiguation • Standard IE (MUC) • needs handcrafted rules • ML-based IE (e.g. Amilcare@{OntoMat,MnM}) • needs a hand-annotated training corpus • does not scale to large numbers of concepts • rule induction takes time • KnowItAll (Etzioni et al., WWW'04) • shallow (pattern-matching-based) approach

  33. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  34. Conclusion • Summary: • new paradigm to overcome the annotation problem • unsupervised instance categorization • first step towards the self-annotating Web • difficult task: open domain, many categories • decent precision, low recall • very good results for interactive mode • currently inefficient (590 Google queries per instance, i.e. 59 candidate concepts × 10 pattern instantiations) • Challenges: • contextual disambiguation • annotating relations (currently restricted to instances) • scalability (e.g. only send reasonable queries to Google) • accurate recognition of named entities (currently a POS tagger)

  35. Outline • Introduction • The Process of PANKOW • Pattern-based categorization • Evaluation • Integration to CREAM • Related work • Conclusion

  36. Thanks to… • Philipp Cimiano (cimiano@aifb.uni-karlsruhe.de) for slides • The audience for listening
