tOKo from TOKens to Ontologies

Anjo Anjewierden, Human-Computer Studies laboratory, University of Amsterdam. http://staff.science.uva.nl/~anjo http://anjo.blogs.com/metis/


Presentation Transcript


  1. tOKo: from TOKens to Ontologies
     Anjo Anjewierden, Human-Computer Studies laboratory, University of Amsterdam
     http://staff.science.uva.nl/~anjo
     http://anjo.blogs.com/metis/

  2. Overview
     tOKo for end-users (this presentation)
     • Help intelligent users develop ontologies from documents
     • Approach is to offer useful functionality (possibly smart) that applies to all kinds of documents
     • Demonstration: imagine you are an end-user who is given the task to develop an ontology for the cooking domain
     tOKo for researchers (second presentation)
     • Accessing tOKo using HTTP
     • Infrastructure
     • Information extraction and ontology-based search

  3. Demonstration

  4. Infrastructure (1)
     • Dictionaries (English, Dutch, German)
       • Used for word classes, inflections and spelling
     • Document representation (= corpus)
       • Low-level representation, highly indexed, fast access
     • Prolog primitives to access the corpus
       • corpus_pattern([word(Word), integer(Int)], Doc, From, To)
       • Searches for a Word immediately followed by an Int. For example, “room is A306” unifies Word with “A” and Int with “306”. Doc, From and To are unified with the document and the position of the match within it.
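The pattern search described on this slide can be illustrated outside Prolog. The following is a minimal Python sketch, not tOKo's implementation: the tokenizer, the function name `word_then_integer` and the use of token indices as positions are all assumptions made for illustration. It only shows why “room is A306” yields the word “A” and the integer 306.

```python
from __future__ import annotations

import re
from typing import Iterator

# Hypothetical tokenizer: splits text into runs of letters and runs of digits,
# so "A306" becomes the two tokens "A" and "306".
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)


def word_then_integer(docs: dict[str, str]) -> Iterator[tuple[str, int, str, int, int]]:
    """Yield (word, integer, doc_id, from_pos, to_pos) for every word token
    that is directly followed by an integer token. Positions are token indices."""
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for i in range(len(tokens) - 1):
            first, second = tokens[i], tokens[i + 1]
            if first.isalpha() and second.isdigit():
                yield first, int(second), doc_id, i, i + 1


corpus = {"doc1": "room is A306"}
matches = list(word_then_integer(corpus))
print(matches)
```

Unlike a Prolog primitive, which enumerates solutions on backtracking, the generator here simply yields one tuple per match.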

  5. Infrastructure (2)
     • Lots of higher-level primitives (this one is used in the HTTP demo; note that little knowledge of Prolog is required):
         word_frequencies_corpus(WFs,
             [ cases(alpha),
               case(plain),
               documents(all),
               language(Language),
               number_chars(2, infinite),
               lemmatize(delete)
             ]).
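To give a feel for what an option-driven word-frequency primitive computes, here is a small Python sketch. It is not a port of `word_frequencies_corpus`: the parameter names `min_chars` and `alpha_only` are hypothetical, and their correspondence to `number_chars(2, infinite)` and `cases(alpha)` is a guess. Lowercasing is likewise an assumption, and lemmatization is left out.

```python
import re
from collections import Counter


def word_frequencies(docs, min_chars=2, alpha_only=True):
    """Count tokens over all documents, applying simple filters.

    min_chars  -- skip tokens shorter than this (assumed analogue of number_chars)
    alpha_only -- skip tokens that are not purely alphabetic (assumed analogue of cases(alpha))
    """
    counts = Counter()
    for text in docs:
        for token in re.findall(r"\w+", text.lower()):
            if alpha_only and not token.isalpha():
                continue
            if len(token) < min_chars:
                continue
            counts[token] += 1
    return counts


docs = ["Add 6 tbsp of sugar", "Sugar and a pinch of salt"]
freqs = word_frequencies(docs)
print(freqs.most_common(3))
```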

  6. Information Extraction
     • Phrases that may be concepts or attributes
       • “6 tbsp of sugar” could be part of a recipe
       • “1089 WB” could be an instance of the concept postal code
     • Such phrases don’t follow the rules of “language”
     • See demonstration for examples
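Phrases like the two examples on this slide are often easier to capture with patterns than with grammar rules. The sketch below uses Python regular expressions to illustrate the idea. Both patterns are assumptions: the unit list is a hypothetical sample, and the postal code pattern assumes the Dutch layout of four digits followed by two capital letters. Nothing here reflects how tOKo itself extracts such phrases.

```python
import re

# Assumed shape of a recipe quantity: number, unit from a small sample list,
# the word "of", then a one-word ingredient.
INGREDIENT_RE = re.compile(
    r"\b(\d+)\s+(tbsp|tsp|cups?|g|kg|ml)\s+of\s+([a-z]+)\b", re.IGNORECASE
)

# Assumed shape of a postal code: four digits, optional space, two capitals.
POSTCODE_RE = re.compile(r"\b(\d{4})\s?([A-Z]{2})\b")

text = "Stir in 6 tbsp of sugar. Send the form to 1089 WB Amsterdam."

ingredients = INGREDIENT_RE.findall(text)  # list of (amount, unit, ingredient)
postcodes = POSTCODE_RE.findall(text)      # list of (digits, letters)
print(ingredients, postcodes)
```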

  7. Ontology-Based Corpus Searches
     • Query corpus with a combination of ontology constructs and language elements
     • Example:
       • [fruit] and [fruit]
       • Matches: “I bought some apples and pears”
       • Because [apple] is-a [fruit] (according to the ontology) and “apples” is the plural of “apple” (according to the dictionary); the same reasoning applies to “pears”
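The reasoning behind the match on this slide can be made concrete with a toy example. The Python sketch below is an illustration under stated assumptions, not tOKo's search engine: `IS_A` is a hypothetical three-entry ontology, `LEMMAS` is a hypothetical two-entry dictionary, and the query is given as a pre-split list of elements.

```python
import re

# Hypothetical miniature ontology: concept -> parent concept.
IS_A = {"apple": "fruit", "pear": "fruit", "fruit": "food"}

# Hypothetical dictionary: inflected form -> lemma.
LEMMAS = {"apples": "apple", "pears": "pear"}


def is_a(concept, ancestor):
    """True if concept equals ancestor or reaches it by following is-a links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def token_matches(token, element):
    """Match one token against one query element: [concept] or a literal word."""
    token = token.lower()
    if element.startswith("[") and element.endswith("]"):
        lemma = LEMMAS.get(token, token)  # reduce inflections via the dictionary
        return is_a(lemma, element[1:-1])
    return token == element


def search(query, text):
    """Return every run of consecutive tokens in text that matches the query."""
    tokens = re.findall(r"[A-Za-z]+", text)
    n = len(query)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if all(token_matches(t, q) for t, q in zip(tokens[i:i + n], query))
    ]


query = ["[fruit]", "and", "[fruit]"]
hits = search(query, "I bought some apples and pears")
print(hits)
```

The dictionary lookup happens before the ontology lookup, which is what lets the plural “apples” reach the concept [apple] and, through is-a, [fruit].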

  8. Ontology-Based Text QL
     • Language constructs (provisional):
       • [concept] matches a concept (and sub-concepts), including inflections, synonyms, etc., in the corpus
       • (word) matches a word (including inflections, etc.)
       • <word class> matches all members of the word class
       • @20 matches all (compound) terms that appear at least 20 times in the corpus
       • integer matches any integer
       • literal matches precisely that literal
     • Demonstration
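One way to read the constructs listed on this slide is as a small set of element types that a query processor must tell apart. The Python sketch below classifies a single query element by its surface syntax. The function name `classify`, the tag names it returns, and the treatment of the bare keyword `integer` are assumptions for illustration; the slide itself describes the syntax as provisional.

```python
import re


def classify(element):
    """Map one query element to a (kind, value) pair based on its delimiters."""
    m = re.fullmatch(r"\[(.+)\]", element)
    if m:
        return ("concept", m.group(1))       # [concept]
    m = re.fullmatch(r"\((.+)\)", element)
    if m:
        return ("word", m.group(1))          # (word)
    m = re.fullmatch(r"<(.+)>", element)
    if m:
        return ("word_class", m.group(1))    # <word class>
    m = re.fullmatch(r"@(\d+)", element)
    if m:
        return ("min_frequency", int(m.group(1)))  # @20
    if element == "integer":
        return ("integer", None)             # keyword assumed to match any integer
    return ("literal", element)              # anything else matches itself


elements = ["[fruit]", "(buy)", "<noun>", "@20", "integer", "and"]
kinds = [classify(e) for e in elements]
print(kinds)
```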

  9. Status
     • Usage
       • Ontology development (both research and contracted)
       • Document indexing (by Jan Jacobs and colleagues at Oce)
       • Finding inconsistencies in documents (has just started)
       • Research on top of tOKo (mostly using weblogs as a source, see my website for papers)
     • Caveats
       • Dictionary used is not “public” (CELEX)
       • Creating a corpus from an “arbitrary” set of documents may involve some programming (templates exist for HTML and plain text document sets)

  10. Plan
     • Open Source?
       • Perhaps it is an idea to create an Open Source version
     • To do (for Open Source version)
       • Documentation (although lack of documentation has so far not been a problem for end-users)
       • Make infrastructure / external interfaces consistent
       • Some performance issues
     • Conclusion
       • Listen to users for good ideas!
