Explore the critical role of taxonomies in navigating the evolving landscape of technology and industry standards. Discover misconceptions and essential tools for effective information organization and retrieval.
Taxonomies: Hidden but Critical Tools • Marjorie M.K. Hlava • President, Access Innovations, Inc.
Industry in change • Technology changes • Evolving standards • Mergers • New buzzwords • Hard to tell what is real
Popular Misconceptions • Computers can do it all • No need to index • No need for thesauri or subject headings • Full text gives all we need • Automatic full text • User friendly search engines • Search engines are indexes • User profiles provide the right context • Data filters give right answers
Some of it is true • What can we use? • Automatic - semi - classification • Depends….. • Size of collection • Cost of the effort
What’s in?? • Taxonomies • thesauri • hierarchies - classification • categorization • browsing • Wellformedness • Bricks and mortar, i.e., profit
Options for Access/Control • Keep track of the input • Thesaurus • Authority file • Maximize the access • Search engine • Browse list • Power of the word • McCain
What do we need? • The basics... • Authority file • People, places, things • Taxonomy • Thesaurus* with authority file or document instance • “Automatic” Classification
Thesaurus Construction • Parts of a whole • Noun and noun phrases • People, places, things • Actions and reactions • Concepts and processes
Term Records - Thesaurus - Format • Main Entries • Top Terms - TT • Broader Terms - BT • Narrower Terms - NT • Scope Notes - SN • History - HI • Date Term - added/changed - DA
Thesaurus - Format • Related Terms - RT • See - S • See Also - SA • Use - U • Use For - UF • “Wellformedness” = W3C
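The term record fields listed on the two format slides can be sketched as a small data structure. This is only an illustration: the class name `TermRecord`, the function `format_record`, the choice of which fields hold lists, and the sample terms are all assumptions, not the layout of any particular thesaurus product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TermRecord:
    """One thesaurus entry, keyed by the field codes named on the slides."""
    term: str                                    # main entry (preferred term)
    tt: Optional[str] = None                     # Top Term
    bt: List[str] = field(default_factory=list)  # Broader Terms
    nt: List[str] = field(default_factory=list)  # Narrower Terms
    rt: List[str] = field(default_factory=list)  # Related Terms
    uf: List[str] = field(default_factory=list)  # Use For (non-preferred synonyms)
    sn: str = ""                                 # Scope Note
    hi: str = ""                                 # History
    da: str = ""                                 # Date term added/changed


def format_record(rec: TermRecord) -> str:
    """Render a record as an indented display, one relationship per line."""
    lines = [rec.term]
    if rec.tt:
        lines.append(f"  TT {rec.tt}")
    for code, values in (("BT", rec.bt), ("NT", rec.nt),
                         ("RT", rec.rt), ("UF", rec.uf)):
        for value in values:
            lines.append(f"  {code} {value}")
    for code, text in (("SN", rec.sn), ("HI", rec.hi), ("DA", rec.da)):
        if text:
            lines.append(f"  {code} {text}")
    return "\n".join(lines)


# Hypothetical sample entry, invented for illustration only.
record = TermRecord(
    term="Thesauri",
    tt="Information organization",
    bt=["Controlled vocabularies"],
    uf=["Thesauruses"],
)
print(format_record(record))
```

Keeping the Use For synonyms on the preferred term's record means the reciprocal Use reference can be generated rather than maintained by hand.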
What are the parts? • Natural Language Processing • Term forms • Term Relationships • Term Associations
Natural Language Processing • Morphological • Lexical Analysis • Syntactic • Numerical • Phraseological • Semantic Analysis • Pragmatic
Seven Major Parts of NLP 1. Morphological • plural • past tense to present
Seven Major Parts of NLP 2. Lexical Analysis • part of speech tagging 3. Syntactic analysis • noun phrase id • proper name boundary
Seven Major Parts of NLP 4. Numeric concept boundary 5. Semantic analysis • Proper name concept categorization • Numeric concept categorization • Semantic relation extraction 6. Phraseological - discourse analysis • Text structure identification
Seven Major Parts of NLP 7. Pragmatic analysis • Cause and effect relationships • Nurse and nursing • Common sense reasoning (buy implies possess) • Who has x ? • These are the people who brought you.....
Say it another way • Term standardization • Term forms • Term relationships • Term associations • Rule building / domain creation
Word Standardization • Split out chemical & drug terms • Separates chemical & drug terms for special treatment • Split out homonyms, non-English terms, and authority terms • Separates objects, proper names, place names, and dates for special treatment • Run spelling standardization program • Identifies variant spellings
Word Standardization • Run word standardization program • i.e., -ing, -ed, -s, -es, pre-, non-, and “-” • Match preferred terms and synonyms
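A minimal sketch of the word standardization and synonym matching steps, assuming the program strips the listed suffixes (-ing, -ed, -s, -es) and removes hyphens. The slides do not say how the prefixes pre- and non- are handled, so they are left alone here. The names `standardize`, `to_preferred`, and `SYNONYMS` are hypothetical, and the suffix stripping is deliberately naive (it is not a real stemmer).

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order; first match wins


def standardize(word: str) -> str:
    """Lower-case a word, drop hyphens, and strip one common suffix."""
    w = word.lower().replace("-", "")
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def to_preferred(word: str, synonyms: dict) -> str:
    """Map a word to its preferred term, or return its standardized form."""
    key = standardize(word)
    return synonyms.get(key, key)


# Hypothetical mapping from standardized forms to preferred terms.
SYNONYMS = {"index": "indexing", "categoriz": "categorization"}

print(standardize("Indexes"))                  # index
print(standardize("full-text"))                # fulltext
print(to_preferred("Categorized", SYNONYMS))   # categorization
```

A crude stripper like this will mangle some words (for example, anything that merely ends in "s"), which is one reason the matching step against preferred terms and synonyms still needs human review.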
Term Forms • Noun • Adjective • Verb, adverb • Singular, plural • Initial articles • Spelling variants
Term Forms • Punctuation • Capitalization • Abbreviations
Term Relationships • Generic • Hierarchical • Systematic • Alphabetic • Instance • Poly-hierarchical
Term Associations • Cross references • All and some rule • Associative terms • Related terms
“Rule building”* process • Put terms in context • Group like categories • Consider relationships • Standardize variants • Meld to a single concept rule • How much is really automatic???
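The idea of putting a term in context and melding variants to a single concept rule can be illustrated with a toy rule matcher. This is a sketch under stated assumptions, not the rule syntax of any product named in this deck: `ConceptRule`, `apply_rules`, and the sample rules are invented, and matching is plain substring search.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ConceptRule:
    trigger: str              # word or phrase that fires the rule
    context: Tuple[str, ...]  # one of these must also appear (empty = always)
    concept: str              # preferred term to assign


def apply_rules(text: str, rules: List[ConceptRule]) -> List[str]:
    """Return the concepts whose trigger and context both occur in the text."""
    lowered = text.lower()
    assigned: List[str] = []
    for rule in rules:
        if rule.trigger not in lowered:
            continue
        if rule.context and not any(c in lowered for c in rule.context):
            continue
        if rule.concept not in assigned:
            assigned.append(rule.concept)
    return assigned


# Hypothetical rules that separate two senses of one word by context.
RULES = [
    ConceptRule("mercury", ("planet", "orbit"), "Mercury (planet)"),
    ConceptRule("mercury", ("toxic", "thermometer"), "Mercury (element)"),
]

print(apply_rules("Mercury has the shortest orbit of any planet.", RULES))
```

Writing and maintaining the context lists is the manual part, which bears on the slide's question of how much is really automatic.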
Domains • Taxonomy • Term Record - thesaurus • Hierarchical Browse-able list • Handout in Booth 150
What else can we have? • Proximity • Stemming (lemmatization) • Truncation • Statistical clustering • Bayesian and others
Other terms and tools • Neural networks • Word normalization • Lexical (word) networks • Distance mapping • Pattern recognition
Moving toward the search engines • Term weighting • Frequency counts • Relevance • Precision • Recall
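Precision and recall have standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A short sketch follows; the function name `precision_recall` and the document IDs are illustrative assumptions.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Compute (precision, recall) for one search over sets of document IDs."""
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents returned, 3 documents are truly relevant.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p)  # 0.5
print(r)  # 2 of the 3 relevant documents were found
```

The two measures usually pull against each other: broadening a search to raise recall tends to pull in irrelevant documents and lower precision.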
Classification of “Automatic Classification Systems” • Evolving model… • Noun Extractors • Rule Based Systems • Semantic Processors • Fuzzy Search Systems • Filtering Systems
(Semi) Automatic Indexing • Basic theories • Thesaurus construction • Natural language processing • Domain specific
Noun Extractors • Use stop word list and frequency counts • Semio • Word Perfect 5.0 • Recon • Prebuilt domains • Autonomy • Net Owl • Newsindexer
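The stop word list plus frequency count approach can be sketched in a few lines. This is a generic illustration, not how any of the products listed above work: `STOP_WORDS`, `candidate_terms`, and the sample sentence are assumptions, and the code does no part-of-speech tagging, so it surfaces frequent non-stop words rather than true nouns.

```python
import re
from collections import Counter
from typing import List, Tuple

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "for"})


def candidate_terms(text: str, top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the most frequent words that are not on the stop word list."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


sample = "The thesaurus controls the terms. A thesaurus links terms to terms."
print(candidate_terms(sample, top_n=2))  # [('terms', 3), ('thesaurus', 2)]
```

The candidates still need to be checked against a thesaurus or authority file before they are used as index terms.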
Rule Based Systems • Data Harmony • API • DTIC • Mapit
Semantic Processors • Synth Bank • n-Stein - expected • Quiver - beta
Fuzzy Search Systems • Dr. Link • Sovereign Hill
Filtering Systems • Screaming Media • Data Harmony
New Directions • Topic Maps - TAO • Topic • Associations • Occurrences • Relational Indexing • Index Visualization • Based on term records • Add the search engines….
What’s a user to do? • Enjoy the presentation • What about a database producer? • Look at the options • Build from the basics • Evaluate the new tools • See it work before you buy
Thank You • Marjorie M.K. Hlava • President, Access Innovations, Inc. • www.accessinn.com • Chairman, Data Harmony • mhlava@accessinn.com • 505-998-0800 • Booth 150