Indexing Knowledge


Indexing Knowledge. Daniel Vasicek. 2014 March 27.

Presentation Transcript

Indexing Knowledge

Daniel Vasicek

2014 March 27

Introduction
  • Basic topic: All Human Knowledge
  • Who Cares?
  • Simple Examples
Basic Ideas
  • Concepts instead of key words
    • Thesauri instead of key words
    • Recognize Emerging concepts
    • Classification
  • Facilitate communication between environments (Data translation)
  • Metadata for publications (XML, SQL, TXT)
    • Indexing information
Topics to Cover
  • Programming language constructs needed. What functionality do we need?
  • What do people pay Access Innovations to do?
  • Typical programming problems that I encounter.
Input Data
  • Formats
    • XML-tagged metadata for publications
    • SQL database
    • Raw text
    • Pictures of text
  • Quantities
    • AIP
      • 304,910 authors as XML files
      • 807,005 XML files containing title, abstract, and metadata
    • NICEM (National Information Center for Educational Media)
      • 503,534 XML files describing available educational media
      • 26,144 XML files describing suppliers of educational media
Programming Languages Used
  • Visual Basic (1990s)
  • C++
  • Java (currently)
Who Cares?
  • AIP – American Institute of Physics (17 journals + conference proceedings)
  • IEEE – Institute of Electrical and Electronics Engineers (journals, standards, patents, …)
  • SPIE – International Society for Optics and Photonics
  • ACM – Association for Computing Machinery
  • Wolters Kluwer
  • PubMed
More Clients
  • Parliament of Victoria (5000 articles per day)
  • JSTOR (~10 million documents, some journals back to 1665)
  • PLOS (quick path to electronic publication)
  • DuPont
  • Dow
  • Council of Europe
  • Triumph Learning
  • ASCE, SAGE, SafetyLit, OSA, NICEM, NPR …
Useful Tools
  • Controlled Vocabulary – an organizational tool for capturing concepts
  • Proximity – a tool for capturing context
  • Hash Table (Content Addressable Array)
    • Convenience
    • Uniqueness
    • Fast access
  • Regular Expressions
What’s a taxonomy?
  • Knowledge organization system
  • Words
    • Controlled vocabulary for a subject area
  • Descriptive labels
  • Hierarchy
    • Simple hierarchical view of a thesaurus
  • Storage and retrieval aid
Thesaurus Elements
  • Hierarchy
    • Broader and Narrower concepts
    • Multiply connected “treelike” structure
  • Nodes in the thesaurus structure contain descriptions of concepts and links to broader, narrower, related, and similar concepts
  • Subject specific?
Structure of Controlled Vocabularies

Flat List, Synonym Ring, Taxonomy, Thesaurus, Ontology (in order of INCREASING MEANING and CONTROL)

[Figure: each vocabulary type is annotated with the controls it provides. Labels in the figure: Ambiguity, Synonym, Hierarchy, Relationships, Additional Types of Relationships.]

After ANSI/NISO Z39.19-2005, Figure 5


Thesaurus Node (Term)
  • Science – Broader Term
  • Biology – Narrower Term
  • Science of Life – Synonym


Thesaurus Implementation
  • Terms (Concepts, Preferred Terms)
  • Broader Terms
  • Narrower Terms
  • Related Terms
  • Other Concepts
    • Synonyms
    • History
    • Responsibility
    • Backup
  • Rules to help identify the concept in text
  • Methods for maintaining the thesaurus
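
The elements listed above can be pictured as one record per term. The following is only a minimal sketch in Python, not the implementation used at Access Innovations: the class name `Term`, its field names, and the helper `add_broader` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus node: a preferred term plus its links (hypothetical layout)."""
    name: str
    broader: set[str] = field(default_factory=set)    # Broader Terms
    narrower: set[str] = field(default_factory=set)   # Narrower Terms
    related: set[str] = field(default_factory=set)    # Related Terms
    synonyms: set[str] = field(default_factory=set)   # non-preferred labels
    history: list[str] = field(default_factory=list)  # free-text change notes


def add_broader(thesaurus: dict[str, Term], child: str, parent: str) -> None:
    """Record child -> parent and the matching parent -> child link together."""
    thesaurus.setdefault(child, Term(child)).broader.add(parent)
    thesaurus.setdefault(parent, Term(parent)).narrower.add(child)


# Build the Science / Biology example used elsewhere in these slides.
thesaurus: dict[str, Term] = {}
add_broader(thesaurus, "Biology", "Science")
thesaurus["Biology"].synonyms.add("Science of Life")
```

Using sets for the links allows a term to have several broader terms, which fits a multiply connected "treelike" structure. Writing both directions in one helper is one way to keep broader and narrower links in step.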
Thesaurus Text Representation

<TermInfo>
  <T>Biology</T>
  <BT>Science</BT>
  <UF>Science of Life</UF>
</TermInfo>
<TermInfo>
  <T>Science</T>
  <NT>Biology</NT>
</TermInfo>
<TermInfo>
  <T>Science of Life</T>
</TermInfo>

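
A text representation like this can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and rests on several assumptions: the records are wrapped in an invented `<Thesaurus>` root because the slide shows no enclosing element, `RT` is assumed as the tag for related terms (the slide shows only `T`, `BT`, `NT`, and `UF`), and the function name `parse_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<TermInfo><T>Biology</T><BT>Science</BT><UF>Science of Life</UF></TermInfo>
<TermInfo><T>Science</T><NT>Biology</NT></TermInfo>
<TermInfo><T>Science of Life</T></TermInfo>
"""


def parse_terms(text: str) -> dict[str, dict[str, list[str]]]:
    """Map each <T> name to its BT / NT / RT / UF lists."""
    # The records share no root element, so wrap them to get well-formed XML.
    root = ET.fromstring(f"<Thesaurus>{text}</Thesaurus>")
    terms = {}
    for info in root.iter("TermInfo"):
        name = info.findtext("T").strip()
        terms[name] = {
            tag: [element.text.strip() for element in info.findall(tag)]
            for tag in ("BT", "NT", "RT", "UF")  # RT is an assumed tag name
        }
    return terms


terms = parse_terms(SAMPLE)
```

Each link type is kept as a list so that a term with several broader or narrower terms needs no special handling.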

Thesaurus Problems
  • Missing Terms - pointer links to a term that is not present
  • Broken loops
    • Narrower term without matching broader term
    • Broader term without matching narrower term
    • Related term without a matching return relationship
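
Both kinds of problem can be detected mechanically. This is a rough sketch, assuming each relationship is stored as a dictionary from a term name to a set of linked names; the function name `find_problems` and the message wording are invented for illustration. The terms Physics and Chemistry are added only to make the example fail in interesting ways.

```python
def find_problems(terms, bt, nt, rt):
    """Return descriptions of dangling links and one-way links."""
    problems = []
    # Missing terms: a link points at a name that has no record.
    for label, links in (("BT", bt), ("NT", nt), ("RT", rt)):
        for source, targets in links.items():
            for target in sorted(targets):
                if target not in terms:
                    problems.append(f"missing term: {source} {label} {target}")
    # Broken loops: every link should have a matching return link.
    for label, links, back_label, back in (
        ("BT", bt, "NT", nt), ("NT", nt, "BT", bt), ("RT", rt, "RT", rt)
    ):
        for source, targets in links.items():
            for target in sorted(targets):
                if source not in back.get(target, set()):
                    problems.append(
                        f"one-way link: {source} {label} {target} "
                        f"lacks {target} {back_label} {source}"
                    )
    return problems


terms = {"Science", "Biology", "Physics"}
bt = {"Biology": {"Science"}, "Physics": {"Science"}}
nt = {"Science": {"Biology"}}           # Physics was never added here
rt = {"Biology": {"Chemistry"}}         # Chemistry has no record at all
problems = find_problems(terms, bt, nt, rt)
```

In this example the checker reports three issues: the dangling link to Chemistry, the missing narrower link from Science to Physics, and the related-term link that Chemistry does not return.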
Proximity of Words
  • Adjacent
    • Before
    • After
  • Same sentence
  • Same Paragraph
  • Within 50 words
  • Phrases (n-Grams)
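
Two of these proximity tests, "within N words" and n-gram phrases, are easy to sketch once text is split into word tokens. The code below is an illustration only: the tokenizer is a deliberately crude assumption (lowercased runs of letters, digits, and apostrophes), distance is measured as the difference between token positions, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within(tokens, first, second, window):
    """True if the two words occur at most `window` token positions apart."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    return any(
        abs(i - j) <= window for i in first_positions for j in second_positions
    )


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Biology is the science of life.")
near = within(tokens, "biology", "science", 3)   # positions 0 and 3
far = within(tokens, "biology", "life", 3)       # positions 0 and 5
phrases = ngrams(tokens, 3)
```

Adjacency is the special case `window=1`. Before and after tests would compare the signed difference of positions instead of its absolute value. Sentence and paragraph tests need the text split at those boundaries first.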
Content Addressable Array

T["Science"] = 1;
T["Biology"] = 1;
T["Science of Life"] = 1;
BT["Biology"] = "Science";
NT["Science"] = "Biology";
UF["Science of Life"] = "Biology";
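
In a language with built-in hash tables, these assignments map directly onto dictionaries. The Python sketch below keeps the slide's array names `T`, `BT`, `NT`, and `UF`, including the slide's choice to key `UF` by the synonym. The lookup helpers `preferred` and `ancestors` are hypothetical additions, not part of the original slides.

```python
# The slide's content addressable arrays as Python dictionaries.
T = {"Science": 1, "Biology": 1, "Science of Life": 1}
BT = {"Biology": "Science"}
NT = {"Science": "Biology"}
UF = {"Science of Life": "Biology"}   # keyed by the synonym, as on the slide


def preferred(label):
    """Resolve a label to its preferred term, or None if it is unknown."""
    if label in UF:
        return UF[label]
    return label if label in T else None


def ancestors(term):
    """Follow BT links upward, stopping at the top or on a repeated term."""
    chain = []
    seen = {term}
    while term in BT and BT[term] not in seen:
        term = BT[term]
        seen.add(term)
        chain.append(term)
    return chain
```

Membership tests and lookups take roughly constant time, and a key can appear only once, which matches the convenience, uniqueness, and fast access points. Storing a single value per key limits each term to one broader term, so a multiply connected structure would need a set or list as the value.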

Regular Expressions
  • /^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$/
    • Email addresses?
  • / [A-Z][a-z]* /
    • Capitalized words
  • /[A-Z][a-zA-Z0-9,\"\- ]*\. /
    • Sentence?
  • Paragraph?
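
The question marks are deserved, and trying the patterns on sample text shows why. The sketch below runs the three slide patterns through Python's `re` module. The sample strings and addresses are invented. The `\b` word-boundary variant and the blank-line paragraph split are assumptions offered as possible alternatives, not patterns from the slides.

```python
import re

# The three patterns from the slide, written as Python raw strings.
EMAIL = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+"
    r"(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$"
)
CAPITALIZED = re.compile(r" [A-Z][a-z]* ")
SENTENCE = re.compile(r'[A-Z][a-zA-Z0-9,"\- ]*\. ')

# Assumed alternatives, not from the slides.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-z]*\b")
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")

# The email pattern accepts short top-level domains but rejects longer ones.
short_tld = EMAIL.match("jane.doe@example.org") is not None
long_tld = EMAIL.match("user@example.museum") is not None

# Each match consumes its trailing space, so the next word is skipped.
text = " Indexing Knowledge at Access Innovations "
spaced = CAPITALIZED.findall(text)
bounded = CAPITALIZED_WORD.findall(text)

sentences = SENTENCE.findall("Concepts replace key words. Thesauri help too. ")
paragraphs = PARAGRAPH_BREAK.split("First paragraph.\n\nSecond paragraph.")
```

Because the space-delimited pattern consumes the space after each match, it misses a capitalized word that directly follows another one. The email pattern limits the final domain label to two to four letters. The sentence pattern stops at any character outside its class, such as a semicolon or parenthesis.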
