Developing i2b2 Ontologies for the Long Haul

Developing i2b2 Ontologies for the Long Haul. Lori Phillips, MS, Partners HealthCare Systems, Inc. April 25, 2012. National Centers for Biomedical Computing.


Developing i2b2 Ontologies for the Long Haul

Lori Phillips, MS

Partners HealthCare Systems, Inc

April 25, 2012

What is i2b2?
  • Software for explicitly organizing and transforming person-oriented clinical data in a way that is optimized for research
    • Allows integration of clinical data, trials data, and genotypic data
  • A portable and extensible application framework
    • Modular software architecture allows additions without disturbing core parts
    • Available as open source at
Where is it used?
Academic Health Centers (does not include AHCs that are part of a CTSA):

Arizona State University

City of Hope, Los Angeles

Georgia Health Sciences University, Augusta

Hartford Hospital, CT

HealthShare Montana

Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), Boston


Phoenix Children's Hospital

Regenstrief Institute

Thomas Jefferson University

University of Connecticut Health Center

University of Missouri School of Medicine

University of Tennessee Health Sciences Center

Wake Forest University Baptist Medical Center


Group Health Cooperative

Kaiser Permanente 


Georges Pompidou Hospital, Paris, France

Hospital of the Free University of Brussels, Belgium

Inserm U936, Rennes, France

Institute for Data Technology and Informatics (IDI), NTNU, Norway

Institute for Molecular Medicine Finland (FIMM)

Karolinska Institute, Sweden

Landspitali University Hospital, Reykjavik, Iceland

Tokyo Medical and Dental University, Japan

University of Bordeaux Segalen, France

University of Erlangen-Nuremberg, Germany

University of Goettingen, Goettingen, Germany

University of Leicester and Hospitals, England (Biomed. Res. Informatics Ctr. for Clin. Sci)

University of Pavia, Pavia, Italy

University of Seoul, Seoul, Korea


Johnson and Johnson (TransMART)

GE Healthcare Clinical Data Services

Where is it used?


  • Boston University
  • Case Western Reserve University (including Cleveland Clinic)
  • Children's National Medical Center (GWU), Washington D.C.
  • Duke University
  • Emory University (including Morehouse School of Medicine and Georgia Tech)
  • Harvard University (including Beth Israel Deaconess Medical Center, Brigham and Women's Hospital, Children's Hospital Boston, Dana Farber Cancer Center, Joslin Diabetes Center, Massachusetts General Hospital)
  • Medical University of South Carolina
  • Medical College of Wisconsin
  • Oregon Health & Science University
  • Penn State Milton S. Hershey Medical Center
  • Tufts University
  • University of Alabama at Birmingham
  • University of Arkansas for Medical Sciences
  • University of California Davis
  • University of California, Irvine
  • University of California, Los Angeles*
  • University of California, San Diego*
  • University of California San Francisco
  • University of Chicago
  • University of Cincinnati (including Cincinnati Children's Hospital Medical Center)
  • University of Colorado Denver (including Children's Hospital Colorado)
  • University of Florida
  • University of Kansas Medical Center
  • University of Kentucky Research Foundation
  • University of Massachusetts Medical School, Worcester
  • University of Michigan
  • University of Pennsylvania (including Children's Hospital of Philadelphia)
  • University of Pittsburgh (including their Cancer Institute)
  • University of Rochester School of Medicine and Dentistry
  • University of Texas Health Sciences Center at Houston
  • University of Texas Health Sciences Center at San Antonio
  • University of Texas Medical Branch (Galveston)
  • University of Texas Southwestern Medical Center at Dallas
  • University of Utah
  • University of Washington
  • University of Wisconsin - Madison (including Marshfield Clinic)
  • Virginia Commonwealth University
  • Weill Cornell Medical College
Why use i2b2?
  • Cohort discovery
    • Enables and simplifies research cohort discovery across an institution’s large, heterogeneous clinical datasets
  • Hypothesis generation
    • Enables and simplifies analysis of data to support a hypothesis
  • Retrospective data analysis
    • Enables the retrospective analysis of data to support/refute claims.
Data Model
    • Facts: the quantitative or factual data being queried.
    • Dimensions: groups of hierarchies and descriptors that define the facts.
    • Star schema: a single fact table surrounded by numerous dimension tables.
Observation (fact table) Primary Keys
  • Patient_num: distinct number for every patient
  • Encounter_num: distinct number for every visit
  • Concept_cd: distinct code for every concept
  • Observer_cd: distinct code for every observer
  • Start_date: date-time the observation began
  • Modifier_cd: code to modify concept_cd
  • Instance_num: mechanism to group concept modifiers

i2b2 Fact Table
  • In i2b2, an atomic fact is an observation on a patient.
  • Examples of facts
    • Diagnoses
    • Procedures
    • Lab data
    • Medications
    • Genetic data
i2b2 Dimension Tables
  • Dimension tables contain descriptive information about the facts.
  • Examples
    • Concept dimension describes the concepts stored in the concept_cd field.
    • Provider dimension contains information about the observer_cd field
    • Patient dimension contains information about the patient_num field
    • Visit dimension contains information about the encounter_num field
    • Modifier dimension contains information about the modifier_cd field
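
The fact and dimension tables described above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not the i2b2 schema itself: the column types, the `name_char` and `sex_cd` columns, the `@` modifier placeholder, and all sample values are assumptions that do not appear in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation_fact (
    patient_num INTEGER, encounter_num INTEGER, concept_cd TEXT,
    observer_cd TEXT, start_date TEXT, modifier_cd TEXT, instance_num INTEGER,
    PRIMARY KEY (patient_num, encounter_num, concept_cd, observer_cd,
                 start_date, modifier_cd, instance_num)
);
-- Dimension tables: descriptive information about the keys in the fact table.
CREATE TABLE concept_dimension (concept_cd TEXT, name_char TEXT);   -- name_char: assumed column
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, sex_cd TEXT);  -- sex_cd: assumed column
""")

# One atomic fact: an observation (a diagnosis) on a patient. Values are made up.
conn.execute("INSERT INTO concept_dimension VALUES (?, ?)",
             ("ICD9:492.8", "Other emphysema"))
conn.execute("INSERT INTO patient_dimension VALUES (?, ?)", (1, "F"))
conn.execute("INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?, ?, ?)",
             (1, 100, "ICD9:492.8", "DR001", "2012-04-25 09:30:00", "@", 1))

# The dimensions describe the fact: join on the shared key columns.
described = conn.execute("""
    SELECT f.patient_num, p.sex_cd, c.name_char, f.start_date
    FROM observation_fact f
    JOIN concept_dimension c ON c.concept_cd = f.concept_cd
    JOIN patient_dimension p ON p.patient_num = f.patient_num
""").fetchall()
print(described)
```

The fact row carries only keys and a timestamp; everything human-readable comes from the dimension tables through the joins.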
How does i2b2 use Ontologies?
  • By and large, the concepts stored in the fact table come from clinical coding systems or ontologies.
  • Largely dependent on data available to institution
    • Diagnoses: ICD9/ICD10/SNOMED
    • Procedures: CPT/ICD9
    • Medications: NDC/RxNorm
    • Lab results: LOINC
    • Molecular/genomic data
    • Custom or project specific data
  • Ontologies are used to organize query terms (and concepts) hierarchically.
Metadata table
  • Query terms are stored in a separate metadata table.
  • There is a one-to-one mapping of terms in the metadata to concepts in the dimension table.
  • The structure of the metadata table is integral to both the visualization of the query terms (tree) and the query mechanism itself.
i2b2 Metadata Root Level Categories
  • Terms with c_hlevel = 1
  • Display name is c_name
  • Icon (folder or container) is determined by c_visualattributes
  • Example c_fullname:
    • \Diagnoses\
Query terms are visualized hierarchically in a tree
  • \Diagnoses\ (level 1)
    • Respiratory system\ (level 2)
      • Chronic obstructive diseases\ (level 3)
        • Emphysema\ (level 4)
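
As a rough sketch of how the level numbers in the tree relate to the path, the helpers below derive a level and the chain of ancestor paths from a backslash-delimited c_fullname. The function names are hypothetical and the level convention (root category = 1) is taken from the slides; this is not i2b2 code.

```python
def split_fullname(c_fullname):
    """Return the non-empty segments of a backslash-delimited path."""
    return [segment for segment in c_fullname.split("\\") if segment]


def hlevel(c_fullname):
    """Level of a term, counting the root category as level 1."""
    return len(split_fullname(c_fullname))


def ancestor_paths(c_fullname):
    """Paths from the root category down to the term itself."""
    segments = split_fullname(c_fullname)
    return ["\\" + "\\".join(segments[:i]) + "\\"
            for i in range(1, len(segments) + 1)]


EMPHYSEMA = ("\\Diagnoses\\Respiratory system"
             "\\Chronic obstructive diseases\\Emphysema\\")

for path in ancestor_paths(EMPHYSEMA):
    # Prints each level with its display name (the last path segment).
    print(hlevel(path), split_fullname(path)[-1])
```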

Why are hierarchies so important for i2b2?
  • Hierarchies form the basis of both the visualization of the terms and the query mechanism itself.

select * from metadata where c_fullname like '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%' and c_hlevel = 5
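
The query above can be tried against a toy metadata table. The sketch below uses Python's sqlite3 module; the three-column table, the helper names, and the sample child terms are assumptions for illustration. It shows how a prefix match on c_fullname combined with c_hlevel returns only the immediate children of a node, which is what a tree view needs when a folder is expanded.

```python
import sqlite3


def make_path(*segments):
    """Build a backslash-delimited path such as \\Diagnoses\\."""
    return "\\" + "\\".join(segments) + "\\"


COPD = make_path("Diagnoses", "Respiratory system", "Chronic obstructive diseases")
EMPHYSEMA = COPD + "Emphysema\\"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (c_hlevel INTEGER, c_fullname TEXT, c_name TEXT)")
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
    (4, EMPHYSEMA, "Emphysema"),
    (5, EMPHYSEMA + "Emphysematous bleb\\", "Emphysematous bleb"),  # sample child
    (5, EMPHYSEMA + "Other emphysema\\", "Other emphysema"),        # sample child
    (4, COPD + "Asthma\\", "Asthma"),                               # sibling folder
])


def children(conn, parent_fullname, parent_hlevel):
    """Names of the terms exactly one level below the given parent."""
    rows = conn.execute(
        "SELECT c_name FROM metadata "
        "WHERE c_fullname LIKE ? AND c_hlevel = ? ORDER BY c_name",
        (parent_fullname + "%", parent_hlevel + 1),
    )
    return [name for (name,) in rows]


child_names = children(conn, EMPHYSEMA, 4)
print(child_names)
```

Without the c_hlevel condition the prefix match would also return the parent itself and every deeper descendant.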

Hierarchies in queries

select patient_num from observation_fact where concept_cd IN (select concept_cd from concept_dimension where concept_path LIKE '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%')
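
A minimal sketch of this query pattern, again with sqlite3 and made-up rows: selecting a folder finds patients with any concept stored beneath it. The two-column tables, the sample codes, and the added DISTINCT and ORDER BY are assumptions, not part of the original query.

```python
import sqlite3


def make_path(*segments):
    """Build a backslash-delimited path such as \\Diagnoses\\."""
    return "\\" + "\\".join(segments) + "\\"


COPD = make_path("Diagnoses", "Respiratory system", "Chronic obstructive diseases")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_dimension (concept_path TEXT, concept_cd TEXT);
CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT);
""")
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?)", [
    (COPD + "Emphysema\\Emphysematous bleb\\", "ICD9:492.0"),
    (COPD + "Emphysema\\Other emphysema\\", "ICD9:492.8"),
    (COPD + "Asthma\\", "ICD9:493"),
])
conn.executemany("INSERT INTO observation_fact VALUES (?, ?)", [
    (1, "ICD9:492.8"), (2, "ICD9:492.0"), (3, "ICD9:493"), (1, "ICD9:493"),
])


def patients_under(conn, concept_path):
    """Patients with at least one fact coded at or below concept_path."""
    rows = conn.execute(
        "SELECT DISTINCT patient_num FROM observation_fact "
        "WHERE concept_cd IN (SELECT concept_cd FROM concept_dimension "
        "                     WHERE concept_path LIKE ?) "
        "ORDER BY patient_num",
        (concept_path + "%",),
    )
    return [patient_num for (patient_num,) in rows]


emphysema_patients = patients_under(conn, COPD + "Emphysema\\")
copd_patients = patients_under(conn, COPD)
print(emphysema_patients, copd_patients)
```

Moving one folder up the hierarchy widens the cohort without changing the query text, only the path prefix.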

i2b2 Ontologies for the Long Haul
  • How do I create i2b2 metadata for a known ontology?
    • ICD-10
  • What happens to my legacy clinical data when I have to move to ICD-10?
    • Merging ICD-9 with ICD-10
  • How do I handle genomic metadata?
  • …. Custom metadata?
Building an ICD-10 Ontology with NCBO services
  • Pull data from NCBO via REST services.
  • Reorganize information into i2b2 Metadata format








<contents class="org.ncbo.stanford.bean.concept.ClassBeanResultListBean">
……
</entry> ……

Primary challenges
  • i2b2 Metadata depends upon hierarchical information
    • c_fullname, c_tooltip maintain the hierarchy from root to leaves

Diseases of the respiratory system \

Chronic lower respiratory diseases \


  • NCBO REST service that enables pull of concepts includes immediate parent/child info only
    • Hierarchy must be computed
<data>
  <classBean>
    <id>J43</id>
    <label>Emphysema</label>
    <relations>
      <entry>
        <string>SuperClass</string>
        <list>
          <classBean>
            <id>J40-J47</id>
            <label>Chronic lower respiratory diseases</label>
          </classBean>
        </list>
      </entry>
    </relations>
  </classBean>
</data>
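
Because each response carries only the immediate parent, the full path has to be assembled by walking parent links up to the root. The sketch below is one way to do that under stated assumptions: it parses a response shaped like the fragment above with Python's xml.etree.ElementTree and then builds a c_fullname from a child-to-parent map. The function names, the cycle check, and the chapter id `J00-J99` are illustrative; no NCBO service is called.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<data>
  <classBean>
    <id>J43</id>
    <label>Emphysema</label>
    <relations>
      <entry>
        <string>SuperClass</string>
        <list>
          <classBean>
            <id>J40-J47</id>
            <label>Chronic lower respiratory diseases</label>
          </classBean>
        </list>
      </entry>
    </relations>
  </classBean>
</data>
"""


def parse_class_bean(xml_text):
    """Return ((id, label), (parent_id, parent_label) or None)."""
    bean = ET.fromstring(xml_text).find("classBean")
    concept = (bean.findtext("id"), bean.findtext("label"))
    parent = None
    for entry in bean.findall("relations/entry"):
        if entry.findtext("string") == "SuperClass":
            parent_bean = entry.find("list/classBean")
            if parent_bean is not None:
                parent = (parent_bean.findtext("id"), parent_bean.findtext("label"))
    return concept, parent


def build_fullname(concept_id, parent_of, label_of):
    """Walk parent links to the root and join the labels into a path."""
    labels, seen, current = [], set(), concept_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent links at " + current)
        seen.add(current)
        labels.append(label_of[current])
        current = parent_of.get(current)
    return "\\" + "\\".join(reversed(labels)) + "\\"


concept, parent = parse_class_bean(SAMPLE)

# Links collected from repeated pulls; the J00-J99 entry is assumed here.
label_of = {concept[0]: concept[1], parent[0]: parent[1],
            "J00-J99": "Diseases of the respiratory system"}
parent_of = {concept[0]: parent[0], parent[0]: "J00-J99"}

fullname = build_fullname("J43", parent_of, label_of)
print(fullname)
```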
NCBO Extraction workflow



Request to extract ontology









Released deliverables

What about my legacy ICD-9 data?
  • Ideally we would like an i2b2 ontology that integrates ICD-9 into ICD-10.
Mapping Tool
  • Tool to verify/(re)assign ontology mappings.
Navigating the Mapping Tool Tree
  • Displays terms mapped from one ontology within hierarchy of another
  • Mapped terms are displayed adjacent to terms they are mapped to and appear in bold
Adding a new mapping
  • ICD9:269.3 (Mineral deficiency) should appear under ICD10:E63 (Other nutritional deficiencies)
  • Copy term ICD9:269.3
  • Paste onto ICD10:E63 Other nutritional deficiencies
Move a mapping
  • Ascorbic acid deficiency (ICD9:267) can be moved down one level to Ascorbic acid deficiency (ICD10:E54)
  • Drag and drop the term down one level.
Unmap a mapping

ICD9:416.8 Other chronic pulmonary heart diseases appears in two places: the one attached to ICD10:I27.2 appears incorrect and can be unmapped.

The Unmapped Terms List
  • Free form list of terms to be mapped
  • Locate term you wish to map to in the hierarchy tree. Drag from table to term in the tree.
  • If you make a mistake you can either reassign the mapped term within the tree or unmap it from tree.
  • Unmap will cause it to reappear in the unmapped terms list if the term has no other mappings.
Assigning an unmapped term
  • Drag from unmapped terms list
  • Drop onto term we are mapping to
Unmapping a term
  • Drag term from tree
  • Drop onto unmapped terms list
Merging Ontologies
  • Mapping tool provides a visualization of what the merged ontologies would look like
  • What if we could extract a single metadata table from this?
Integration tool
  • Request to integrate ICD9 into ICD-10
  • Mapped ICD-9 terms
  • For each mapped ICD-9 term, compute ICD-10 …
  • ICD-10 merged with ICD9 terms
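
One plausible reading of this workflow, offered only as a sketch: each mapped ICD-9 term is placed beneath the ICD-10 term it maps to, and the result is emitted as a single list of metadata rows. The row layout, the function name, and the ICD-10 paths below are assumptions; the two mappings reuse the examples from the mapping tool slides. The actual integration tool may compute something different.

```python
def merge_mapped_terms(icd10_rows, icd9_names, mappings):
    """Append each mapped ICD-9 term as a child of its ICD-10 target."""
    by_code = {row["code"]: row for row in icd10_rows}
    merged = [dict(row) for row in icd10_rows]
    for icd9_code, icd10_code in mappings:
        parent = by_code[icd10_code]
        name = icd9_names[icd9_code]
        merged.append({
            "code": icd9_code,
            "c_name": name,
            # The ICD-9 term inherits the ICD-10 path and sits one level deeper.
            "c_fullname": parent["c_fullname"] + name + "\\",
            "c_hlevel": parent["c_hlevel"] + 1,
        })
    # Sorting by path keeps every child directly after its parent.
    return sorted(merged, key=lambda row: row["c_fullname"])


# Illustrative ICD-10 paths; real metadata would come from the extracted ontology.
BLOCK = ("\\Endocrine, nutritional and metabolic diseases"
         "\\Other nutritional deficiencies\\")
icd10_rows = [
    {"code": "ICD10:E54", "c_name": "Ascorbic acid deficiency",
     "c_fullname": BLOCK + "Ascorbic acid deficiency\\", "c_hlevel": 3},
    {"code": "ICD10:E63", "c_name": "Other nutritional deficiencies",
     "c_fullname": BLOCK + "Other nutritional deficiencies\\", "c_hlevel": 3},
]
icd9_names = {"ICD9:267": "Ascorbic acid deficiency",
              "ICD9:269.3": "Mineral deficiency"}
mappings = [("ICD9:267", "ICD10:E54"), ("ICD9:269.3", "ICD10:E63")]

merged = merge_mapped_terms(icd10_rows, icd9_names, mappings)
for row in merged:
    print(row["c_hlevel"], row["code"], row["c_fullname"])
```

With the legacy codes living under ICD-10 paths, a hierarchical query on an ICD-10 folder would also pick up facts recorded with the mapped ICD-9 codes.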
How to handle genomic data
  • Ability to organize the variants for ease of navigation
    • Needs may differ between geneticist, physician, research scientist
  • Ability to query for the variant in the workbench
    • Genomic labs may report data differently
    • Define the variant so it may be reliably identified over time
    • Implication is that the identifier for the variant does not change over time or is maintainable.
How to (reliably) identify a genomic variant?

  • Chr location, nucleotide subst?
  • HGVS name?
  • Gene name + flanking sequences?
  • rs #?
  • All of them??

RS number
  • Uniquely identifies a variant over time, but:
    • Novel variants may not have an rs number
    • User may not want to submit to dbSNP

Gene name + flanking sequences

Not guaranteed if gene has several isoforms


HGVS name
  • Uniquely identifies a variant within a referenced and versioned accession and details the nucleotide substitution.
  • Components: RefSeq accession, coding DNA, position
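
To make the parts of an HGVS name concrete, the sketch below splits a simple substitution into accession, version, sequence type, position, and nucleotide change. The regular expression is a deliberately narrow assumption: it covers only single-nucleotide substitutions at a plain numeric position and is not a full HGVS parser. The example string is an EGFR coding-DNA substitution.

```python
import re

# Accession, '.', version, ':', sequence type (c = coding DNA, g = genomic),
# '.', position, reference base, '>', variant base.
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":(?P<sequence_type>[cg])\."
    r"(?P<position>\d+)(?P<reference>[ACGT])>(?P<variant>[ACGT])$"
)


def parse_hgvs_substitution(name):
    """Split a simple HGVS substitution into its named parts."""
    match = HGVS_SUBSTITUTION.match(name)
    if match is None:
        raise ValueError(f"not a simple HGVS substitution: {name!r}")
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    fields["position"] = int(fields["position"])
    return fields


parsed = parse_hgvs_substitution("NM_005228.3:c.2155G>T")
print(parsed)
```

Keeping the accession version as a separate field matters because a position is only meaningful relative to one specific versioned reference sequence.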

Is there a common denominator in all of this?

Yes … all ultimately describe variant location on a chromosome.

Nucleotide substitution defines the physical manifestation of the variant.


HGVS name (n/t subst, positional info)

Flanking sequences (a way to verify positional info)




Genomic MetadataXML record


Version: 1.0
ReferenceGenomeVersion: hg18

HGVSName: NM_005228.3:c.2155G>T
SystematicName: c.2155G>T
SystematicNameProtein: p.Gly719Cys
AaChange: missense
DnaChange: substitution

GeneName: EGFR

RegionType: exon
RegionName: Exon 18

Name: NM_005228, Type: mrna (NCBI)
Name: NP_005219, Type: protein (NCBI)
Name: NT_004487, Type: contig (NCBI)

Chromosome: chr7
Region: 7p12
Orientation: +
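
The record lists field names and values but not how they nest, so the following is only a guess at a serialization: a flat XML element per field, built with Python's xml.etree.ElementTree. The root element name `GenomicMetadata` and the flat layout are assumptions. Values follow the EGFR record above, using the accession given in its mRNA entry (NM_005228).

```python
import xml.etree.ElementTree as ET


def build_record(fields, root_tag="GenomicMetadata"):
    """Create one child element per field under a hypothetical root tag."""
    root = ET.Element(root_tag)
    for tag, value in fields.items():
        ET.SubElement(root, tag).text = value
    return root


record = build_record({
    "Version": "1.0",
    "ReferenceGenomeVersion": "hg18",
    "HGVSName": "NM_005228.3:c.2155G>T",
    "SystematicName": "c.2155G>T",
    "GeneName": "EGFR",
    "RegionName": "Exon 18",
    "Chromosome": "chr7",
})

# Serializing escapes the '>' in the HGVS name; parsing restores it.
xml_text = ET.tostring(record, encoding="unicode")
round_trip = ET.fromstring(xml_text)
print(round_trip.findtext("HGVSName"))
```

Storing the reference genome version alongside the HGVS name keeps the positional information interpretable if the variant is later re-examined against a newer assembly.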

Organizational challenges

By Disease?

By Gene?

How to handle custom (local) metadata
  • The Edit Tool is ideal for creating a small, non-standard ontology for a local project.
  • Consider the case of classifying patients as smokers, non-smokers, or smoking status unknown.
  • The Custom Metadata folder is designed for creating local terms.
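
For the smoking example, the local terms might end up as metadata rows like the ones produced below. This is a sketch under several assumptions: the folder name `Smoking Status`, the `CUSTOM:` codes, the `c_basecode` column, and the `FA`/`LA` values for c_visualattributes (folder and leaf) are illustrative and should be checked against the local i2b2 installation. The Edit Tool itself is a graphical tool; this only shows the shape of the data.

```python
def make_path(*segments):
    """Build a backslash-delimited path such as \\Custom Metadata\\."""
    return "\\" + "\\".join(segments) + "\\"


def custom_terms(folder, terms, root="Custom Metadata"):
    """Rows for one folder under the root category plus its leaf terms."""
    rows = [{
        "c_hlevel": 2,                      # the root category is level 1
        "c_fullname": make_path(root, folder),
        "c_name": folder,
        "c_visualattributes": "FA",         # assumed code for a folder
        "c_basecode": None,
    }]
    for name, code in terms:
        rows.append({
            "c_hlevel": 3,
            "c_fullname": make_path(root, folder, name),
            "c_name": name,
            "c_visualattributes": "LA",     # assumed code for a leaf
            "c_basecode": code,
        })
    return rows


smoking_rows = custom_terms("Smoking Status", [
    ("Smoker", "CUSTOM:SMOKER"),            # hypothetical local codes
    ("Non-smoker", "CUSTOM:NONSMOKER"),
    ("Smoking status unknown", "CUSTOM:SMOKING_UNKNOWN"),
])
for row in smoking_rows:
    print(row["c_hlevel"], row["c_fullname"], row["c_basecode"])
```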