
Indexing Knowledge

Indexing Knowledge. Daniel Vasicek, 2014 March 27. An introduction to indexing all human knowledge: who cares about it, some simple examples, and the basic ideas, including concepts and thesauri instead of keywords, recognizing emerging concepts, and classification.


Presentation Transcript


  1. Indexing Knowledge Daniel Vasicek 2014 March 27

  2. Introduction • Basic topic: All Human Knowledge • Who Cares? • Simple Examples

  3. Basic Ideas • Concepts instead of keywords • Thesauri instead of keywords • Recognize emerging concepts • Classification • Facilitate communication between environments (data translation) • Metadata for publications (XML, SQL, TXT) • Indexing information

  4. Topics to Cover • Programming language constructs needed: what functionality do we need? • What do people pay Access Innovations to do? • Typical programming problems that I encounter

  5. Input Data
     • Formats
       • XML-tagged metadata for publications
       • SQL database
       • Raw text
       • Pictures of text
     • Quantities
       • AIP
         • 304,910 authors as XML files
         • 807,005 XML files containing title, abstract, and metadata
       • NICEM (National Information Center for Educational Media)
         • 503,534 XML files describing available educational media
         • 26,144 XML files describing suppliers of educational media

  6. Programming Languages Used • Visual Basic (1990s) • C++ • Java (currently)

  7. Who Cares? • AIP – American Institute of Physics (17 journals + conference proceedings) • IEEE – Institute of Electrical and Electronics Engineers (journals, standards, patents, …) • SPIE – International Society for Optics and Photonics • ACM – Association for Computing Machinery • Wolters Kluwer • PubMed

  8. More Clients • Parliament of Victoria (5000 articles per day) • JSTOR (~10 million documents, some journals back to 1665) • PLOS (quick path to electronic publication) • DuPont • Dow • Council of Europe • Triumph Learning • ASCE, SAGE, SafetyLit, OSA, NICEM, NPR …

  9. Useful Tools • Controlled Vocabulary – an organizational tool for capturing concepts • Proximity – a tool for capturing context • Hash Table (Content Addressable Array) • Convenience • Uniqueness • Fast access • Regular Expressions

  10. What’s a taxonomy? • Knowledge organization system • Words • Controlled vocabulary for a subject area • Descriptive labels • Hierarchy • Simple hierarchical view of a thesaurus • Storage and retrieval aid

  11. Thesaurus Elements • Hierarchy • Broader and Narrower concepts • Multiply connected “treelike” structure • Nodes in the thesaurus structure contain descriptions of concepts and links to broader, narrower, related, and similar concepts • Subject specific?

  12. Structure of Controlled Vocabularies [Figure: five vocabulary types arranged in order of increasing meaning and control: Flat List, Synonym Ring, Taxonomy, Thesaurus, Ontology. The labels under the types are Ambiguity, Synonym, Hierarchy, Relationships, and Additional Types of Relationships.] After ANSI/NISO Z39.19-2005, Figure 5

  13. Thesaurus Node (Term) • Science: Broader Term • Biology: Narrower Term • Science of Life: Synonym

  14. Thesaurus Implementation • Terms (Concepts, Preferred Terms) • Broader Terms • Narrower Terms • Related Terms • Other Concepts • Synonyms • History • Responsibility • Backup • Rules to help identify the concept in text • Methods for maintaining the thesaurus
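
The elements listed on this slide can be pictured as one record per term. The following is a minimal sketch in Python (chosen for brevity; the slides name Java as the language currently in use). The class name `Term`, its field names, and the helper `link` are illustrative assumptions, not the presenter's implementation, and responsibility and backup tracking are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Term:
    """One thesaurus node. Field names are assumptions made for this sketch."""
    name: str                                          # preferred term (concept label)
    broader: Set[str] = field(default_factory=set)     # broader terms
    narrower: Set[str] = field(default_factory=set)    # narrower terms
    related: Set[str] = field(default_factory=set)     # related terms
    synonyms: Set[str] = field(default_factory=set)    # non-preferred labels
    history: List[str] = field(default_factory=list)   # free-text change notes
    rules: List[str] = field(default_factory=list)     # patterns that signal the concept in text


def link(thesaurus: Dict[str, Term], narrow: str, broad: str) -> None:
    """Record a broader/narrower pair in both directions at once."""
    thesaurus.setdefault(narrow, Term(narrow)).broader.add(broad)
    thesaurus.setdefault(broad, Term(broad)).narrower.add(narrow)


thesaurus: Dict[str, Term] = {}
link(thesaurus, "Biology", "Science")
thesaurus["Biology"].synonyms.add("Science of Life")
print(thesaurus["Science"].narrower)
```

Writing both directions of a relationship in one helper is one way to keep broader and narrower links from drifting apart during maintenance.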

  15. Thesaurus Text Representation <TermInfo> <T>Biology</T> <BT>Science</BT> <UF>Science of Life</UF> </TermInfo> <TermInfo> <T>Science</T> <NT>Biology</NT> </TermInfo> <TermInfo> <T>Science of Life</T> </TermInfo>
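
As a rough illustration of how records like these might be read, here is a sketch using Python's standard `xml.etree.ElementTree`. The wrapping `<Thesaurus>` root element and the function name `load_terms` are assumptions: the slide shows only the `<TermInfo>` fragments, and an XML parser needs a single root.

```python
import xml.etree.ElementTree as ET

# The three records from the slide, wrapped in an assumed root element.
XML = """<Thesaurus>
<TermInfo><T>Biology</T><BT>Science</BT><UF>Science of Life</UF></TermInfo>
<TermInfo><T>Science</T><NT>Biology</NT></TermInfo>
<TermInfo><T>Science of Life</T></TermInfo>
</Thesaurus>"""


def load_terms(xml_text):
    """Return {term: {"BT": [...], "NT": [...], "UF": [...]}} from TermInfo records."""
    terms = {}
    for info in ET.fromstring(xml_text).iter("TermInfo"):
        name = info.findtext("T").strip()
        # A term may carry several BT/NT/UF children, so each tag maps to a list.
        terms[name] = {tag: [e.text.strip() for e in info.findall(tag)]
                       for tag in ("BT", "NT", "UF")}
    return terms


terms = load_terms(XML)
print(terms["Biology"])
```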

  16. Thesaurus Problems • Missing Terms - pointer links to a term that is not present • Broken loops • Narrower term without matching broader term • Broader term without matching narrower term • Related term without a matching return relationship
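
Several of these problems can be detected mechanically. The sketch below, in Python, checks for links to missing terms and for relationships without a matching return link. The dictionary layout and the name `find_problems` are assumptions made for illustration, and the check does not attempt to detect loops.

```python
def find_problems(terms):
    """terms: {name: {"BT": [...], "NT": [...], "RT": [...]}}. Returns a list of messages."""
    inverse = {"BT": "NT", "NT": "BT", "RT": "RT"}   # expected return relationship
    problems = []
    for name, rels in terms.items():
        for tag, back in inverse.items():
            for target in rels.get(tag, []):
                if target not in terms:
                    problems.append(f"{name}: {tag} points to missing term {target!r}")
                elif name not in terms[target].get(back, []):
                    problems.append(f"{name}: {tag} {target!r} has no matching {back}")
    return problems


# Hypothetical sample data with three deliberate defects.
sample = {
    "Science": {"NT": ["Biology", "Physics"]},            # Physics is not defined
    "Biology": {"BT": ["Science"], "RT": ["Medicine"]},   # Medicine has no RT back
    "Medicine": {"BT": ["Science"]},                      # Science lacks NT Medicine
}
problems = find_problems(sample)
for message in problems:
    print(message)
```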

  17. Proximity of Words • Adjacent • Before • After • Same sentence • Same Paragraph • Within 50 words • Phrases (n-Grams)
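
A few of these proximity tests can be sketched in Python as follows. The tokenizer, the sentence splitter (a naive split on ".", "!" and "?"), and all function names are simplifying assumptions; the slides do not say how proximity is actually computed.

```python
import re

WORD = re.compile(r"[a-z0-9']+")


def tokens(text):
    """Lowercased word tokens; punctuation is dropped."""
    return WORD.findall(text.lower())


def within(text, a, b, max_gap=50):
    """True if single words a and b occur within max_gap words of each other."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def same_sentence(text, a, b):
    """True if both words occur in one sentence."""
    for sentence in re.split(r"[.!?]+", text):
        words = set(tokens(sentence))
        if a.lower() in words and b.lower() in words:
            return True
    return False


def ngrams(text, n):
    """All runs of n adjacent words, joined by spaces."""
    words = tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


text = "Mercury is a planet. The thermometer contains mercury and glass."
print(within(text, "mercury", "planet", max_gap=3))
print(same_sentence(text, "planet", "thermometer"))
print(ngrams(text, 2)[:3])
```

In this sample, context separates two senses of "mercury": one occurrence shares a sentence with "planet", the other with "thermometer".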

  18. Content Addressable Array T["Science"]=1; T["Biology"]=1; T["Science of Life"]=1; BT["Biology"] = "Science"; NT["Science"] = "Biology"; UF["Science of Life"] = "Biology";
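
The assignments on this slide map directly onto hash tables. A minimal Python sketch using `dict` follows; the helper functions `preferred` and `ancestors` are hypothetical additions showing the kind of lookup such tables make cheap. As on the slide, each key holds a single value, so a term with several broader terms would need lists instead.

```python
T = {"Science": 1, "Biology": 1, "Science of Life": 1}   # every known label
BT = {"Biology": "Science"}                              # term -> broader term
NT = {"Science": "Biology"}                              # term -> narrower term
UF = {"Science of Life": "Biology"}                      # non-preferred label -> preferred term


def preferred(label):
    """Map any known label to its preferred term, or None if the label is unknown."""
    if label not in T:
        return None
    return UF.get(label, label)


def ancestors(term):
    """Follow BT links upward; the seen-set guards against accidental loops."""
    chain, seen = [], {term}
    while term in BT and BT[term] not in seen:
        term = BT[term]
        chain.append(term)
        seen.add(term)
    return chain


print(preferred("Science of Life"))
print(ancestors("Biology"))
```

Keys are unique by construction, and each lookup is a single hash probe, which matches the convenience, uniqueness, and fast-access points made earlier in the talk.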

  19. Regular Expressions • /^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$/ • Email addresses? • / [A-Z][a-z]* / • Capitalized words • /[A-Z][a-zA-Z0-9,\"\- ]*\. / • Sentence? • Paragraph?
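
The question marks on this slide are deserved: each pattern matches less than its label suggests. The Python sketch below tries them with the standard `re` module. Two adaptations are assumptions of this sketch: the capitalized-word pattern uses `\b` word boundaries in place of the literal surrounding spaces, and the sentence pattern drops the trailing space so that the last sentence of a text can match.

```python
import re

EMAIL = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$"
)
CAPITALIZED = re.compile(r"\b[A-Z][a-z]*\b")
SENTENCE = re.compile(r'[A-Z][a-zA-Z0-9,"\- ]*\.')

# The email pattern accepts common addresses but rejects some valid ones.
print(bool(EMAIL.match("jane.doe@example.org")))   # ordinary address
print(bool(EMAIL.match("user@example.museum")))    # top-level domain longer than 4 letters
print(bool(EMAIL.match("a+b@example.com")))        # "+" is not in the character class

# All-caps acronyms such as AIP are skipped by the capitalized-word pattern.
print(CAPITALIZED.findall("Access Innovations indexes journals for AIP."))

# Only sentences ending in a period are found; the question is missed.
print(SENTENCE.findall("Biology is a science. Is it? Yes, it is."))
```

Abbreviations and decimal numbers would also end a "sentence" early under this pattern, which is one reason sentence and paragraph detection usually need more than a single expression.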

