Tools for Classification Integration. Brian A. Carlsen, Apelon, Inc. Networked Knowledge Organization Systems/Services Workshop, June 28, 2001. Outline: State of the UMLS Metathesaurus; Life-cycle of a Source; Tools and Processes; Challenges; Further Approaches.

Networked Knowledge Organization

Systems/Services Workshop June 28, 2001

Tools For Classification Integration

Brian A. Carlsen

Apelon, Inc.

Presentation Outline

  • State of the UMLS Metathesaurus

  • Life-cycle of a Source

  • Tools and Processes

  • Challenges

  • Further Approaches

State of the UMLS Metathesaurus

  • Concept orientation, concept persistence

  • Growth to over 800,000 concepts and over 60 vocabulary families

  • Over 1000 users worldwide

  • Uses of the Metathesaurus

    • Natural Language Processing

    • Knowledge Representation

    • Patient Record Systems

    • Linking Patient Data to Knowledge Sources

    • Automated Indexing/Retrieval

English Word, String Counts by Release Year



Life-cycle of a Source: Inversion

  • Source arrives in “machine readable” format*

    • Many formats are used, including PDF, Clipper dump files, WordPerfect files, unit-record formats, and relational flat files.

  • Source undergoes “inversion”

    • Requires a human

    • Input is this machine readable file

    • Process is source-specific

    • Output is a common relational flat-file format used internally.

Life-cycle of a Source: Insertion

  • A “Recipe” is created

  • Test insertion to validate recipe

  • Insertion and matching.

    • Load common format into database

    • Match to existing content algorithmically

      • Use string normalization

      • Determine SAFE vs. UNSAFE matches

    • Prepare data for editing

    • Process is fully undoable
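
The matching step above can be sketched roughly as follows. This is a minimal illustration, not the Metathesaurus algorithm: `normalize` is a crude stand-in for real string normalization (lowercase, drop punctuation, sort words), the concept ids are made up, and the rule "one hit is SAFE, several hits are UNSAFE" is an assumption, since the slides do not define the two categories.

```python
import re
from collections import defaultdict

def normalize(term: str) -> str:
    """Crude stand-in for string normalization: lowercase, drop punctuation, sort words."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(words))

def build_index(existing):
    """Map each normalized string to the set of concept ids that already use it."""
    index = defaultdict(set)
    for concept_id, term in existing:
        index[normalize(term)].add(concept_id)
    return index

def classify(term, index):
    """Return (status, concept_ids) for an incoming source term.

    Assumed rule: no hit is NEW, exactly one hit is SAFE, several hits are UNSAFE.
    """
    hits = index.get(normalize(term), set())
    if not hits:
        return "NEW", set()
    if len(hits) == 1:
        return "SAFE", set(hits)
    return "UNSAFE", set(hits)

# Hypothetical existing content: (concept id, term string).
existing = [
    ("C1", "Neoplasm, Malignant"),
    ("C2", "Cold"),   # the temperature sense
    ("C3", "cold"),   # the common-cold sense
]
index = build_index(existing)
results = {t: classify(t, index)
           for t in ["malignant neoplasm", "COLD", "Head of Pancreas"]}
```

Under these assumptions "malignant neoplasm" matches a single concept and is SAFE, "COLD" matches two concepts and is routed to human review as UNSAFE, and "Head of Pancreas" is NEW.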

Life-cycle of a Source: Editing

  • Predicate-based partitioning

  • Workflow management

    • Review ALL content for new sources

    • Review UNSAFE content for updates

  • Human Review

  • QA Driven Editing

    • Source-specific QA

    • Feedback QA

    • Conservation of Mass QA
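
Predicate-based partitioning might look like the sketch below: an ordered list of predicates splits the content into worklists, and a final check confirms that nothing was lost or duplicated. Reading "Conservation of Mass" as "every input item ends up in exactly one bin" is an assumption; the record fields and predicate names are hypothetical.

```python
def partition(items, predicates):
    """Assign each item to the first predicate it satisfies.

    `predicates` is an ordered list of (name, function) pairs; items that
    satisfy none of them fall into the "other" bin.
    """
    bins = {name: [] for name, _ in predicates}
    bins["other"] = []
    for item in items:
        for name, test in predicates:
            if test(item):
                bins[name].append(item)
                break
        else:
            bins["other"].append(item)
    return bins

def conserved(items, bins):
    """Check that every input id appears in exactly one bin (assumed QA rule)."""
    before = sorted(item["id"] for item in items)
    after = sorted(item["id"] for group in bins.values() for item in group)
    return before == after

# Hypothetical records awaiting review.
concepts = [
    {"id": "C1", "status": "UNSAFE", "semantic_type": "Disease"},
    {"id": "C2", "status": "SAFE", "semantic_type": "Drug"},
    {"id": "C3", "status": "UNSAFE", "semantic_type": "Drug"},
    {"id": "C4", "status": "SAFE", "semantic_type": "Anatomy"},
]
predicates = [
    ("unsafe_drugs", lambda c: c["status"] == "UNSAFE" and c["semantic_type"] == "Drug"),
    ("unsafe_other", lambda c: c["status"] == "UNSAFE"),
]
bins = partition(concepts, predicates)
ok = conserved(concepts, bins)
```

Because each item goes to the first matching predicate, the order of the list decides which editor's worklist receives an item that satisfies several predicates.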

Life-cycle of a Source: Release

  • Synchronize editing changes

    • State-based model

  • Release data in desired format

    • Full release/partial release

  • Transform base release

    • “MetamorphoSys”

    • Remove unlicensed data

    • Create “Content Views”



Tools and Processes: Overview

  • Humans vs. Computers

    • Humans are good at making content decisions

    • Computers are good at automating tasks

  • Tools vs. Processes

    • Tools enable computers to automate tasks

    • Processes keep humans productive.

Tools and Processes: Pre-Editing

  • No common data representation

  • Source-by-source conversion to common format

    • Perl, Unix tools

  • What would a common format need?

    • Represent terms and attributes

    • Represent within-source relationships

    • Represent hierarchies

    • Represent external-source relationships

    • Represent classifications (e.g. Concept)
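
One way to picture the requirements listed above is as a small set of record types. The sketch below is only an illustration of those five needs; every class, field, and identifier in it is hypothetical and does not describe the internal format used for the Metathesaurus.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Term:
    term_id: str
    string: str
    attributes: dict = field(default_factory=dict)   # term-level attributes

@dataclass
class Relationship:
    from_id: str
    kind: str                         # e.g. "child_of" for hierarchy edges
    to_id: str
    to_source: Optional[str] = None   # None means "within this source"

@dataclass
class InvertedSource:
    name: str
    terms: dict = field(default_factory=dict)            # term_id -> Term
    relationships: list = field(default_factory=list)
    classifications: dict = field(default_factory=dict)  # class id -> set of term ids

    def add_term(self, term, class_id):
        """Store a term and record which classification (e.g. concept) it belongs to."""
        self.terms[term.term_id] = term
        self.classifications.setdefault(class_id, set()).add(term.term_id)

    def parents(self, term_id):
        """Hierarchy: within-source "child_of" edges leaving this term."""
        return [r.to_id for r in self.relationships
                if r.from_id == term_id and r.kind == "child_of" and r.to_source is None]

    def external_links(self, term_id):
        """Relationships that point into another source."""
        return [(r.to_source, r.to_id) for r in self.relationships
                if r.from_id == term_id and r.to_source is not None]

# Hypothetical source "SRC_A" with a link into hypothetical source "SRC_B".
src = InvertedSource("SRC_A")
src.add_term(Term("T1", "Pancreas"), class_id="K1")
src.add_term(Term("T2", "Head of Pancreas", {"level": "2"}), class_id="K2")
src.add_term(Term("T3", "Pancreatic head"), class_id="K2")
src.relationships.append(Relationship("T2", "child_of", "T1"))
src.relationships.append(Relationship("T2", "mapped_to", "X9", to_source="SRC_B"))
```

Keeping hierarchy edges, within-source relationships, and external links in one relationship table, distinguished by `kind` and `to_source`, is a design choice made here for brevity.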

Tools and Processes: Editing

  • Workflow Management

  • Report Generation

  • State Model vs. Action Model

    • Actions represented as new states vs.

    • Single state + actions as data

  • Human Editing

    • Interface enabling “high level cognitive editing”

  • LVG: String Normalization

  • Automated Editing

    • Safe vs. Unsafe, Integrities
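
The contrast between a state model and an action model can be sketched with a single editing operation, a concept merge. This is an illustrative toy, assuming concepts are just sets of strings keyed by made-up ids; the class and method names are hypothetical.

```python
import copy

class StateModel:
    """Every action produces a new full snapshot; undo drops the latest one."""
    def __init__(self, initial):
        self.states = [copy.deepcopy(initial)]

    @property
    def current(self):
        return self.states[-1]

    def merge(self, keep, drop):
        new = copy.deepcopy(self.current)
        new[keep] = new[keep] | new.pop(drop)
        self.states.append(new)

    def undo(self):
        if len(self.states) > 1:
            self.states.pop()

class ActionModel:
    """One mutable state plus a log of actions stored as data."""
    def __init__(self, initial):
        self.state = copy.deepcopy(initial)
        self.log = []

    def merge(self, keep, drop):
        moved = self.state.pop(drop)
        added = moved - self.state[keep]   # what the merge actually adds to `keep`
        self.state[keep] |= moved
        self.log.append(("merge", keep, drop, moved, added))

    def undo(self):
        if self.log:
            _, keep, drop, moved, added = self.log.pop()
            self.state[keep] -= added      # reverse the recorded action
            self.state[drop] = moved

# Hypothetical concepts: id -> set of strings.
initial = {"C1": {"cold"}, "C2": {"common cold"}}
state_model, action_model = StateModel(initial), ActionModel(initial)
state_model.merge("C1", "C2")
action_model.merge("C1", "C2")
```

Calling `undo()` on either model restores the two original concepts. The state model pays for that with one snapshot per action, while the action model stores only the data needed to reverse each step.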

Tools and Processes: Release

  • License Agreements

  • Content Views

    • e.g. Indexing View

    • Filter by Semantic Type

    • Filter by Language

  • Alternative Release Formats

  • Updates

  • MetamorphoSys
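
A content view can be thought of as a chain of filters over the base release. The sketch below is a simplified illustration and not how MetamorphoSys works; the record layout, source names, and function name are all hypothetical.

```python
def content_view(records, licensed_sources, semantic_types=None, languages=None):
    """Return the records a user may see, optionally narrowed by type and language.

    A filter left as None is not applied.
    """
    view = []
    for rec in records:
        if rec["source"] not in licensed_sources:
            continue   # remove unlicensed data
        if semantic_types is not None and rec["semantic_type"] not in semantic_types:
            continue
        if languages is not None and rec["language"] not in languages:
            continue
        view.append(rec)
    return view

# Hypothetical base release.
records = [
    {"id": 1, "source": "SRC_A", "semantic_type": "Disease", "language": "ENG"},
    {"id": 2, "source": "SRC_B", "semantic_type": "Disease", "language": "ENG"},
    {"id": 3, "source": "SRC_A", "semantic_type": "Drug", "language": "FRE"},
    {"id": 4, "source": "SRC_A", "semantic_type": "Drug", "language": "ENG"},
]
english_drugs = content_view(records, {"SRC_A"},
                             semantic_types={"Drug"}, languages={"ENG"})
licensed_only = content_view(records, {"SRC_A"})
```

Here `english_drugs` keeps only record 4, and `licensed_only` keeps records 1, 3, and 4.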



Challenges: Ambiguity

  • Ambiguous Strings

    • e.g. “Cold”

    • Solution: Disambiguating strings, Preferred Names with “face validity”, Integrity checks when merging.

  • Not fully specified Strings

    • e.g. “Head of Pancreas” within “Malignant Neoplasm of Pancreas”

    • Solution: Fully specified preferred name.
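
Ambiguous strings can be found mechanically before a human decides how to disambiguate them. The sketch below flags any string that names more than one concept and adds a simple merge check. The rule "only merge concepts that share a semantic type" is an assumed example of an integrity check, and the concept ids and type labels are hypothetical.

```python
from collections import defaultdict

def ambiguous_strings(concepts):
    """Return {lowercased string: sorted concept ids} for strings naming several concepts."""
    owners = defaultdict(set)
    for concept_id, concept in concepts.items():
        for name in concept["names"]:
            owners[name.lower()].add(concept_id)
    return {s: sorted(ids) for s, ids in owners.items() if len(ids) > 1}

def can_merge(concepts, a, b):
    """Integrity check (assumed rule): only merge concepts sharing a semantic type."""
    return concepts[a]["semantic_type"] == concepts[b]["semantic_type"]

# Hypothetical concepts.
concepts = {
    "C1": {"names": ["Cold", "Cold temperature"], "semantic_type": "Natural Phenomenon"},
    "C2": {"names": ["Cold", "Common cold"], "semantic_type": "Disease"},
    "C3": {"names": ["Acute nasopharyngitis"], "semantic_type": "Disease"},
}
flagged = ambiguous_strings(concepts)
```

With this data the string "cold" is flagged as shared by C1 and C2, and the merge check blocks C1 with C2 while allowing C2 with C3.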

Challenges: What is a Classification?

  • A classification is any grouping of terms with consistent semantics.

  • Thesauri typically group terms by meaning into concepts (synonymy).

  • Alternatives

    • Neighborhoods (e.g. Descriptors in MeSH).

    • Near-synonymy

    • No classification (identity or term classification).

    • Lexical

  • Connecting relationships/attributes to classifiers

Challenges: Precedence

  • Concepts (or other classifications) generally have a preferred name

  • A thesaurus will have terms from different sources competing for precedence

  • Source precedence should be a user-level choice

  • Preferred name should not be used as a proxy for concept-ness

  • Every level of classification should have a preferred term

  • Preferred name exists primarily for “face validity”
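
Treating source precedence as a user-level choice means the preferred name is computed from a ranking the user supplies rather than stored once for everyone. A minimal sketch, with hypothetical source names and an assumed tie-breaking rule (unranked sources come last, ties keep input order):

```python
def preferred_name(names, precedence):
    """Pick the string whose source ranks highest in the user's precedence list.

    `names` is a list of (source, string) pairs. Sources missing from
    `precedence` rank last; ties keep the input order.
    """
    rank = {source: i for i, source in enumerate(precedence)}
    best = min(names, key=lambda pair: rank.get(pair[0], len(precedence)))
    return best[1]

# Hypothetical names for one concept, each contributed by a different source.
names = [
    ("SRC_A", "Myocardial Infarction"),
    ("SRC_B", "Heart attack"),
    ("SRC_C", "MI"),
]
clinician_view = preferred_name(names, ["SRC_A", "SRC_B"])
consumer_view = preferred_name(names, ["SRC_B", "SRC_A"])
```

The same concept is shown as "Myocardial Infarction" to one user and "Heart attack" to another, while the underlying grouping of terms is unchanged.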

Challenges: Update Model

  • Constituent sources of a thesaurus will be updated

  • Editing cycle

    • Updated sources will require editing

    • Typically overlap is > 90%

    • Overlap can safely replace the old version’s content

    • Safe replacements should not be edited

    • Ideally, source providers would indicate replacements; otherwise they must be computed

  • Release

    • Release changes
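
When a source provider does not indicate replacements, the overlap between two versions has to be computed. The sketch below assumes each version is a mapping from source code to string and treats "same code, same string" as a safe replacement; that criterion, the codes, and the function name are assumptions made for illustration.

```python
def diff_versions(old, new):
    """Compare two versions of a source, each given as {code: string}.

    Returns (safe, changed, added, retired) as sets of codes.
    Assumed rule: identical code and string means a safe replacement.
    """
    safe = {c for c in new if c in old and old[c] == new[c]}
    changed = {c for c in new if c in old and old[c] != new[c]}
    added = set(new) - set(old)
    retired = set(old) - set(new)
    return safe, changed, added, retired

# Hypothetical old and new versions of one source.
old = {"A1": "Cold", "A2": "Fever", "A3": "Cough"}
new = {"A1": "Cold", "A2": "Fever, unspecified", "A4": "Headache"}

safe, changed, added, retired = diff_versions(old, new)
overlap = len(safe) / len(new)   # fraction of the new version needing no editing
```

Only the `changed` and `added` codes would go to editors. In this toy data the overlap is one third, far below the greater-than-90% figure the slide cites as typical.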



Further Approaches: Description Logic

  • What is it?

    • Concepts (or other classifications) are axioms

    • Relationships (roles) are theorems

    • The transitive closure of the roles across the concepts is computed to ensure no violations.

    • e.g. A isa B, B isa C, C isa A (violation: the isa links form a cycle)

  • When is it useful?

    • In formalized, static domains like Anatomy

  • When is it not useful?

    • When performance matters more than formalism

    • In dynamic, loosely coupled domains like Genomics
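
The cycle violation in the example above (A isa B, B isa C, C isa A) can be detected with a depth-first search over the isa links. This sketch covers only that one check and is far simpler than what a description logic classifier does; the function name is hypothetical.

```python
def find_isa_cycle(isa_pairs):
    """Return one cycle as a list of nodes (first node repeated at the end), or None.

    `isa_pairs` is a list of (child, parent) tuples.
    """
    graph = {}
    for child, parent in isa_pairs:
        graph.setdefault(child, []).append(parent)
        graph.setdefault(parent, [])

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for parent in graph[node]:
            if color[parent] == GREY:
                # Reached a node already on the current path: that is a cycle.
                return path[path.index(parent):] + [parent]
            if color[parent] == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

violation = find_isa_cycle([("A", "B"), ("B", "C"), ("C", "A")])
clean = find_isa_cycle([("A", "B"), ("B", "C")])
```

For the slide's example, `violation` is `["A", "B", "C", "A"]`, and the acyclic hierarchy returns `None`.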

Further Approaches: Standards XML

  • Standardized Terminology/Ontology Representation

    • XML is the most likely candidate

    • Ideally would support

      • Links to external sources

      • Relationships between different levels of classification

      • Update model

      • Description Logic Metadata

  • Standardized Thesaurus Representation

  • XML Repository

  • Standard Object Representations
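
To make the idea of an XML terminology representation concrete, the sketch below serializes one concept with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`concept`, `preferredName`, `term`, `relationship`) are invented for this example and do not correspond to any published standard.

```python
import xml.etree.ElementTree as ET

def concept_to_xml(concept):
    """Build an XML element for one concept (hypothetical element names)."""
    root = ET.Element("concept", id=concept["id"])
    ET.SubElement(root, "preferredName").text = concept["preferred_name"]
    for term in concept["terms"]:
        ET.SubElement(root, "term", source=term["source"]).text = term["string"]
    for rel in concept["relationships"]:
        # A relationship may point at a concept held in an external source.
        ET.SubElement(root, "relationship", type=rel["type"], target=rel["target"])
    return root

# Hypothetical concept record.
concept = {
    "id": "C2",
    "preferred_name": "Common cold",
    "terms": [
        {"source": "SRC_A", "string": "Common cold"},
        {"source": "SRC_B", "string": "Acute nasopharyngitis"},
    ],
    "relationships": [{"type": "isa", "target": "C9"}],
}

xml_text = ET.tostring(concept_to_xml(concept), encoding="unicode")
parsed = ET.fromstring(xml_text)   # round trip to confirm the output is well-formed
```

An update model or description logic metadata, as listed on the slide, would need further elements that this sketch does not attempt.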

Conclusion: Lessons Learned

  • Use the Web

  • Use current technology

  • Use Description Logic where appropriate

  • Make editing intuitive

  • Automate tasks

    • “A well-understood, reproducible, automated process that succeeds 95% of the time is a vast improvement over a poorly-understood, labor-intensive process that is believed to succeed 100% of the time.”

    • Review UNSAFE automated tasks.

    • Stop automating when marginal utility falls below a threshold.