Developing i2b2 Ontologies for the Long Haul


Developing i2b2 Ontologies for the Long Haul. Lori Phillips, MS, Partners HealthCare Systems, Inc. April 25, 2012. National Centers for Biomedical Computing.


Developing i2b2 Ontologies for the Long Haul

Lori Phillips, MS

Partners HealthCare Systems, Inc

April 25, 2012


National Centers for Biomedical Computing


What is i2b2?

  • Software for explicitly organizing and transforming person-oriented clinical data in a way that is optimized for research

    • Allows integration of clinical data, trials data, and genotypic data

  • A portable and extensible application framework

    • Modular software architecture allows additions without disturbing core parts

    • Available as open source at

Where is it used?

Academic Health Centers (does not include AHCs that are part of a CTSA):

Arizona State University

City of Hope, Los Angeles

Georgia Health Sciences University, Augusta

Hartford Hospital, CT

HealthShare Montana

Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), Boston


Phoenix Children's Hospital

Regenstrief Institute

Thomas Jefferson University

University of Connecticut Health Center

University of Missouri School of Medicine

University of Tennessee Health Sciences Center

Wake Forest University Baptist Medical Center


Group Health Cooperative

Kaiser Permanente 


Georges Pompidou Hospital, Paris, France

Hospital of the Free University of Brussels, Belgium

Inserm U936, Rennes, France

Institute for Data Technology and Informatics (IDI), NTNU, Norway

Institute for Molecular Medicine Finland (FIMM)

Karolinska Institute, Sweden

Landspitali University Hospital, Reykjavik, Iceland

Tokyo Medical and Dental University, Japan

University of Bordeaux Segalen, France

University of Erlangen-Nuremberg, Germany

University of Goettingen, Goettingen, Germany

University of Leicester and Hospitals, England (Biomed. Res. Informatics Ctr. for Clin. Sci)

University of Pavia, Pavia, Italy

University of Seoul, Seoul, Korea


Johnson and Johnson (tranSMART)

GE Healthcare Clinical Data Services

Where is it used?


  • Boston University

  • Case Western Reserve University (including Cleveland Clinic)

  • Children's National Medical Center (GWU), Washington D.C.

  • Duke University

  • Emory University (including Morehouse School of Medicine and Georgia Tech)

  • Harvard University (including Beth Israel Deaconess Medical Center, Brigham and Women's Hospital, Children's Hospital Boston, Dana Farber Cancer Center, Joslin Diabetes Center, Massachusetts General Hospital)

  • Medical University of South Carolina

  • Medical College of Wisconsin

  • Oregon Health & Science University

  • Penn State Milton S. Hershey Medical Center

  • Tufts University

  • University of Alabama at Birmingham

  • University of Arkansas for Medical Sciences

  • University of California Davis

  • University of California, Irvine

  • University of California, Los Angeles*

  • University of California, San Diego*

  • University of California San Francisco

  • University of Chicago

  • University of Cincinnati (including Cincinnati Children's Hospital Medical Center)

  • University of Colorado Denver (including Children's Hospital Colorado)

  • University of Florida

  • University of Kansas Medical Center

  • University of Kentucky Research Foundation

  • University of Massachusetts Medical School, Worcester

  • University of Michigan

  • University of Pennsylvania (including Children's Hospital of Philadelphia)

  • University of Pittsburgh (including their Cancer Institute)

  • University of Rochester School of Medicine and Dentistry

  • University of Texas Health Sciences Center at Houston

  • University of Texas Health Sciences Center at San Antonio

  • University of Texas Medical Branch (Galveston)

  • University of Texas Southwestern Medical Center at Dallas

  • University of Utah

  • University of Washington

  • University of Wisconsin - Madison (including Marshfield Clinic)

  • Virginia Commonwealth University

  • Weill Cornell Medical College


Why use i2b2?

  • Cohort discovery

    • Enables and simplifies research cohort discovery across an institution’s large, heterogeneous clinical datasets

  • Hypothesis generation

    • Enables and simplifies analysis of data to support a hypothesis

  • Retrospective data analysis

    • Enables the retrospective analysis of data to support/refute claims.


i2b2 Workbench


Data Model


  • Facts

    • The quantitative or factual data being queried

  • Dimensions

    • Groups of hierarchies and descriptors that define the facts.

  • Star schema

    • A single fact table surrounded by numerous dimension tables.


i2b2 Star Schema


Observation (fact table) Primary Keys

Patient_num: Distinct number for every patient

Encounter_num: Distinct number for every visit

Concept_cd: Distinct code for every concept

Observer_cd: Distinct code for every observer

Start_date: Date-time the observation began

Modifier_cd: Code to modify concept_cd

Instance_num: Mechanism to group concept modifiers


i2b2 Fact Table

  • In i2b2, an atomic fact is an observation on a patient.

  • Examples of facts

    • Diagnoses

    • Procedures

    • Lab data

    • Medications

    • Genetic data


i2b2 Dimension Tables

  • Dimension tables contain descriptive information about the facts.

  • Examples

    • Concept dimension describes the concepts stored in the concept_cd field.

    • Provider dimension contains information about the observer_cd field

    • Patient dimension contains information about the patient_num field

    • Visit dimension contains information about the encounter_num field

    • Modifier dimension contains information about the modifier_cd field
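The fact and dimension tables described above can be sketched with an in-memory SQLite database. This is only an illustration, not the actual i2b2 DDL: the column names follow the slides, while the column types, the `name_char` column, the `@` placeholder modifier, and the sample rows are assumptions made for the example.

```python
import sqlite3

# Toy star schema: one fact table plus one dimension table (assumed types).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation_fact (
    patient_num   INTEGER,
    encounter_num INTEGER,
    concept_cd    TEXT,
    observer_cd   TEXT,
    start_date    TEXT,
    modifier_cd   TEXT,
    instance_num  INTEGER,
    PRIMARY KEY (patient_num, encounter_num, concept_cd, observer_cd,
                 start_date, modifier_cd, instance_num)
);
CREATE TABLE concept_dimension (
    concept_path TEXT PRIMARY KEY,
    concept_cd   TEXT,
    name_char    TEXT  -- assumed column holding the display name
);
""")

# Invented sample data: one concept and one observation that uses it.
EMPHYSEMA_PATH = (
    "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\Emphysema\\"
)
conn.execute(
    "INSERT INTO concept_dimension VALUES (?, ?, ?)",
    (EMPHYSEMA_PATH, "ICD9:492", "Emphysema"),
)
conn.execute(
    "INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, 100, "ICD9:492", "DR001", "2012-04-25 09:30", "@", 1),
)

# The dimension table supplies the description for the coded fact.
fact_rows = conn.execute("""
    SELECT f.patient_num, f.start_date, c.name_char
    FROM observation_fact AS f
    JOIN concept_dimension AS c ON c.concept_cd = f.concept_cd
""").fetchall()
print(fact_rows)
```

The fact row stores only codes and keys; the join against the dimension table is what turns `ICD9:492` into a readable description.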


How does i2b2 use Ontologies?

  • By and large, the concepts stored in the fact table come from clinical coding systems or ontologies.

  • Largely dependent on data available to institution

    • Diagnoses: ICD9/ICD10/SNOMED

    • Procedures: CPT/ICD9

    • Medications: NDC/RXNORM

    • Lab results: LOINC

    • Molecular/genomic data

    • Custom or project specific data

  • Ontologies are used to organize query terms (and concepts) hierarchically.


Metadata table

  • Query terms are stored in a separate metadata table.

  • There is a one-to-one mapping of terms in the metadata to concepts in the dimension table.

  • The structure of the metadata table is integral to both the visualization of the query terms (tree) and the query mechanism itself.


Structure of Metadata Table


i2b2 Metadata Root Level Categories

  • Terms with c_hlevel = 1

  • Display name is c_name

  • Icon (folder or container) is determined by c_visualattributes

  • Example c_fullname:

    • \Diagnoses\
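As a rough sketch of how root-level categories could be selected for display, the snippet below loads a few invented metadata rows into SQLite and picks those with c_hlevel = 1. The two-letter c_visualattributes values (`CA`, `FA`) and the rule that the first letter chooses the icon are assumptions for the example, as are the sample terms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata (
        c_hlevel           INTEGER,
        c_fullname         TEXT,
        c_name             TEXT,
        c_visualattributes TEXT
    )
""")

# Invented sample terms; only the first two are root-level categories.
sample_terms = [
    (1, "\\Diagnoses\\", "Diagnoses", "CA"),
    (1, "\\Medications\\", "Medications", "CA"),
    (2, "\\Diagnoses\\Respiratory system\\", "Respiratory system", "FA"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", sample_terms)

# Assumed convention: first letter of c_visualattributes selects the icon.
ICONS = {"C": "container", "F": "folder"}

root_categories = conn.execute("""
    SELECT c_name, c_visualattributes
    FROM metadata
    WHERE c_hlevel = 1
    ORDER BY c_name
""").fetchall()

root_labels = [
    f"[{ICONS.get(attrs[0], 'unknown')}] {name}"
    for name, attrs in root_categories
]
print(root_labels)
```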


Query terms are visualized hierarchically in tree


Respiratory system\ 2

Chronic obstructive diseases\ 3


Why are hierarchies so important for i2b2?

Hierarchies form the basis of both the visualization of the terms and the query mechanism itself.

select * from metadata where c_fullname like '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%' and c_hlevel = 5
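The query above fetches the immediate children of a tree node: the LIKE prefix restricts to descendants and c_hlevel restricts to one level down. A minimal sketch of that expansion step follows, run against SQLite. The child terms and the `make_path` and `children_of` helpers are invented for the illustration; the level numbering follows the slides (root category at level 1).

```python
import sqlite3

def make_path(*segments):
    """Join segments into a path with leading and trailing backslashes."""
    return "\\" + "\\".join(segments) + "\\"

emphysema = ("Diagnoses", "Respiratory system",
             "Chronic obstructive diseases", "Emphysema")

# Invented sample tree: the ancestors, two children, and one grandchild.
sample_paths = [
    emphysema[:1], emphysema[:2], emphysema[:3], emphysema,
    emphysema + ("Emphysematous bleb",),
    emphysema + ("Other emphysema",),
    emphysema + ("Other emphysema", "Hypothetical subterm"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (c_hlevel INTEGER, c_fullname TEXT, c_name TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [(len(p), make_path(*p), p[-1]) for p in sample_paths],
)

def children_of(parent_path, parent_level):
    """Return names of terms exactly one level below the given node."""
    rows = conn.execute(
        "SELECT c_name FROM metadata "
        "WHERE c_fullname LIKE ? AND c_hlevel = ? ORDER BY c_name",
        (parent_path + "%", parent_level + 1),
    )
    return [name for (name,) in rows]

emphysema_children = children_of(make_path(*emphysema), 4)
print(emphysema_children)
```

Without the c_hlevel filter the same LIKE pattern would also return the node itself and every deeper descendant.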


Structure of Metadata Table


Hierarchies in queries

select patient_num from observation_fact where concept_cd IN (select concept_cd from concept_dimension where concept_path LIKE '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%')
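To see why the path prefix matters, here is a small sketch of the same query pattern run against SQLite. Tables are reduced to the columns the query touches, and the concept codes, paths, and patients are invented sample data; `patients_under` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_dimension (concept_path TEXT, concept_cd TEXT);
CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT);
""")

COPD = "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\"

# Invented concepts: a folder term, one of its children, and a sibling.
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?)", [
    (COPD + "Emphysema\\", "ICD9:492"),
    (COPD + "Emphysema\\Emphysematous bleb\\", "ICD9:492.0"),
    (COPD + "Asthma\\", "ICD9:493"),
])

# Invented facts: patients 1 and 3 have emphysema codes, patient 2 asthma.
conn.executemany("INSERT INTO observation_fact VALUES (?, ?)", [
    (1, "ICD9:492.0"), (2, "ICD9:493"), (3, "ICD9:492"), (3, "ICD9:492.0"),
])

def patients_under(path):
    """Patients with any fact coded at or below the given concept path."""
    rows = conn.execute("""
        SELECT DISTINCT patient_num FROM observation_fact
        WHERE concept_cd IN (
            SELECT concept_cd FROM concept_dimension
            WHERE concept_path LIKE ?
        )
        ORDER BY patient_num
    """, (path + "%",))
    return [num for (num,) in rows]

emphysema_patients = patients_under(COPD + "Emphysema\\")
print(emphysema_patients)
```

Querying the folder picks up every code beneath it, so a patient coded only with a child term is still found.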


i2b2 Ontologies for the Long Haul

  • How do I create i2b2 metadata for a known ontology?

    • ICD-10

  • What happens to my legacy clinical data when I have to move to ICD-10?

    • Merging ICD-9 with ICD-10

  • How do I handle genomic metadata?

  • How do I handle custom metadata?


NCBO BioPortal ICD-10


Building an ICD-10 Ontology with NCBO services

  • Pull data from NCBO via REST services.

  • Reorganize information into i2b2 Metadata format








<contents class="org.ncbo.stanford.bean.concept.ClassBeanResultListBean">

</entry> ……


Primary challenges

  • i2b2 Metadata depends upon hierarchical information

    • c_fullname, c_tooltip maintain the hierarchy from root to leaves

Diseases of the respiratory system \

Chronic lower respiratory diseases \




  • The NCBO REST service used to pull concepts returns immediate parent/child information only

    • Hierarchy must be computed

<data>
  <classBean>
    <id>J43</id>
    <label>Emphysema</label>
    <relations>
      <entry>
        <string>SuperClass</string>
        <list>
          <classBean>
            <id>J40-J47</id>
            <label>Chronic lower respiratory diseases</label>
          </classBean>
        </list>
      </entry>
    </relations>
  </classBean>
</data>
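Because each record carries only its immediate parent, the root-to-leaf path has to be assembled by walking the SuperClass links upward. The sketch below shows one way this could be done. It is not the tool's actual implementation: the XML is generated by a hypothetical `sample_record` helper in the shape of the record above, the `J00-J99` chapter id is assumed, and the level numbering starts at 1 for the root.

```python
import xml.etree.ElementTree as ET

def sample_record(concept_id, label, parent=None):
    """Build XML in the shape of the record above (sample data only)."""
    relations = ""
    if parent:
        relations = (
            "<entry><string>SuperClass</string><list><classBean>"
            f"<id>{parent[0]}</id><label>{parent[1]}</label>"
            "</classBean></list></entry>"
        )
    return (f"<data><classBean><id>{concept_id}</id><label>{label}</label>"
            f"<relations>{relations}</relations></classBean></data>")

def parse_class_bean(xml_text):
    """Return (id, label, parent_id or None) for one record."""
    bean = ET.fromstring(xml_text).find("classBean")
    parent_id = None
    for entry in bean.findall("relations/entry"):
        if entry.findtext("string") == "SuperClass":
            parent = entry.find("list/classBean")
            if parent is not None:
                parent_id = parent.findtext("id")
    return bean.findtext("id"), bean.findtext("label"), parent_id

chapter = ("J00-J99", "Diseases of the respiratory system")
block = ("J40-J47", "Chronic lower respiratory diseases")
records = [
    sample_record(*chapter),
    sample_record(*block, parent=chapter),
    sample_record("J43", "Emphysema", parent=block),
]

labels, parents = {}, {}
for concept_id, label, parent_id in map(parse_class_bean, records):
    labels[concept_id] = label
    parents[concept_id] = parent_id

def full_path(concept_id):
    """Walk parent links to the root; return (c_hlevel, c_fullname)."""
    parts = []
    while concept_id is not None:
        parts.append(labels[concept_id])
        concept_id = parents.get(concept_id)
    return len(parts), "\\" + "\\".join(reversed(parts)) + "\\"

metadata_rows = {cid: full_path(cid) for cid in labels}
print(metadata_rows["J43"])
```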


NCBO Extraction workflow



Request to extract ontology










Extracted ICD-10 terms


Released deliverables


What about my legacy ICD-9 data?

  • Ideally we would like an i2b2 ontology that integrates ICD-9 into ICD-10.


Mapping Tool

  • Tool to verify/(re)assign ontology mappings.


Navigating the Mapping Tool Tree

  • Displays terms mapped from one ontology within hierarchy of another

  • Mapped terms are displayed adjacent to terms they are mapped to and appear in bold


Adding a new mapping

  • ICD9:269.3, Mineral deficiency should appear for ICD10:E63 Other nutritional deficiencies

  • Copy term ICD9:269.3


Adding a new mapping

  • Paste onto ICD10:E63 Other nutritional deficiencies


Move a mapping

  • Ascorbic acid deficiency (ICD9:267) can be moved down one level to Ascorbic acid deficiency (ICD10:E54)

  • Drag and drop the term down one level.


Unmap a mapping

ICD9:416.8 Other chronic pulmonary heart diseases appears in two places: the one attached to ICD10:I27.2 appears incorrect and can be unmapped.


The Unmapped Terms List

  • Free form list of terms to be mapped

  • Locate the term you wish to map to in the hierarchy tree, then drag the unmapped term from the table onto that term in the tree.

  • If you make a mistake you can either reassign the mapped term within the tree or unmap it from the tree.

  • Unmap will cause it to reappear in the unmapped terms list if the term has no other mappings.
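The unmapped-list behaviour described above can be modelled with a small data structure. This is a hypothetical sketch of the bookkeeping only, not the mapping tool's code: the `MappingStore` class and the second target code `ICD10:I27.8` are invented for the example.

```python
class MappingStore:
    """Tracks source-to-target mappings and the list of unmapped terms."""

    def __init__(self, unmapped_terms):
        self.unmapped = set(unmapped_terms)
        self.mappings = {}  # source term -> set of target terms

    def map(self, source, target):
        """Assign a source term to a target; it leaves the unmapped list."""
        self.mappings.setdefault(source, set()).add(target)
        self.unmapped.discard(source)

    def unmap(self, source, target):
        """Remove one mapping; the term reappears only if none remain."""
        targets = self.mappings.get(source, set())
        targets.discard(target)
        if not targets:
            self.mappings.pop(source, None)
            self.unmapped.add(source)

store = MappingStore(["ICD9:269.3", "ICD9:416.8"])

# A term mapped in two places stays mapped after one mapping is removed.
store.map("ICD9:416.8", "ICD10:I27.2")
store.map("ICD9:416.8", "ICD10:I27.8")  # second target is invented
store.unmap("ICD9:416.8", "ICD10:I27.2")

# A term with a single mapping returns to the unmapped list when unmapped.
store.map("ICD9:269.3", "ICD10:E63")
store.unmap("ICD9:269.3", "ICD10:E63")

print(sorted(store.unmapped))
```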


Assigning an unmapped term

  • Drag from unmapped terms list

  • Drop onto term we are mapping to


Unmapping a term

  • Drag term from tree

  • Drop onto unmapped terms list


Search Unmapped Terms By Name


Search Unmapped Terms by Code


Mapped Terms Viewer


Search Mapped Terms By Code


Search Mapped Terms By Name


Merging Ontologies

  • Mapping tool provides a visualization of what the merged ontologies would look like

  • What if we could extract a single metadata table from this?

Integration Tool

Request to integrate ICD9 into ICD-10

For each mapped ICD-9 term, compute ICD-10

ICD-10 merged with ICD9 terms

Mapped ICD-9 terms
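One way to picture the integration step: each mapped ICD-9 term is given a path beneath the ICD-10 term it maps to, producing a single merged set of metadata rows. The sketch below is an assumption about how such a merge could work, not the integration tool's algorithm. The two mappings come from the mapping examples earlier in the talk; the ICD-10 paths, the levels, and the choice to name the new segment "name (code)" are invented.

```python
# Assumed ICD-10 metadata: code -> (c_hlevel, c_fullname). Paths are invented.
icd10_terms = {
    "ICD10:E54": (
        3, "\\Diagnoses\\Nutritional deficiencies\\Ascorbic acid deficiency\\"),
    "ICD10:E63": (
        3, "\\Diagnoses\\Nutritional deficiencies\\Other nutritional deficiencies\\"),
}

# Mapped ICD-9 terms: (ICD-9 code, name, ICD-10 code it is mapped to).
mapped_icd9 = [
    ("ICD9:267", "Ascorbic acid deficiency", "ICD10:E54"),
    ("ICD9:269.3", "Mineral deficiency", "ICD10:E63"),
]

def integrate(icd10_terms, mapped_icd9):
    """Return merged (c_hlevel, c_fullname, code) rows sorted by path."""
    rows = [(level, path, code)
            for code, (level, path) in icd10_terms.items()]
    for icd9_code, name, target in mapped_icd9:
        level, path = icd10_terms[target]
        # Place the ICD-9 term one level below its ICD-10 target; the code
        # is appended to the name to avoid clashes between equal names.
        rows.append((level + 1, f"{path}{name} ({icd9_code})\\", icd9_code))
    return sorted(rows, key=lambda row: row[1])

merged = integrate(icd10_terms, mapped_icd9)
for level, path, code in merged:
    print(level, code, path)
```

With rows like these, a query on the ICD-10 folder path would also reach legacy facts coded in ICD-9.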


How to handle genomic data

  • Ability to organize the variants for ease of navigation

    • Needs may differ between geneticist, physician, research scientist

  • Ability to query for the variant in the workbench

    • Genomic labs may report data differently

    • Define the variant so it may be reliably identified over time

    • Implication is that the identifier for the variant does not change over time or is maintainable.


How to (reliably) identify a genomic variant?

Chr location, nucleotide subst?

HGVS name?

Gene name + flanking sequences?

All of them??

rs #?


RS number

Uniquely identifies a variant over time … but …

Novel variants may not have rs number

User may not want to submit to dbSNP


Gene name + flanking sequences

Not guaranteed if gene has several isoforms


HGVS name


Uniquely identifies variant within a referenced and versioned accession and details the nucleotide substitution.




RefSeq accession Position

Coding DNA
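To illustrate the parts an HGVS name carries (versioned accession, coding-DNA position, nucleotide substitution), here is a small parsing sketch. It covers only simple coding-DNA substitutions of the form `accession.version:c.positionREF>ALT`; the regular expression and the `parse_hgvs_substitution` helper are assumptions for the example, not a full HGVS parser.

```python
import re

# Matches only simple substitutions such as NM_005228.3:c.2155G>T.
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

def parse_hgvs_substitution(name):
    """Split a simple coding-DNA substitution name into its parts."""
    match = HGVS_SUBSTITUTION.match(name)
    if match is None:
        raise ValueError(f"not a simple c. substitution: {name!r}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["position"] = int(parts["position"])
    return parts

variant = parse_hgvs_substitution("NM_005228.3:c.2155G>T")
print(variant)
```

The accession and version pin down the reference sequence, so the position and substitution stay interpretable even if a newer version of the sequence is released.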


Is there a common denominator in all of this?

Yes … all ultimately describe variant location on a chromosome.

Nucleotide substitution defines the physical manifestation of the variant.


HGVS name (n/t subst, positional info)

Flanking sequences (a way to verify positional info)





Structure of Metadata Table


Genomic MetadataXML record


Version 1.0

ReferenceGenomeVersion hg18


HGVSName NM_005228.3:c.2155G>T

SystematicName c.2155G>T

SystematicNameProtein p.Gly719Cys

AaChange missense

DnaChange substitution


GeneName EGFR



RegionType exon

RegionName Exon 18



Name NM_005228

Type mrna (NCBI)


Name NP_005219

Type protein (NCBI)


Name NT_004487

Type contig (NCBI)


Chromosome chr7

Region 7p12

Orientation +


Organizational challenges

By Disease?

By Gene?


Combining equivalent terms


How to handle custom (local) metadata

  • The Edit Tool is ideal for creating a small, non-standard ontology for a local project.

  • Consider the case of classifying patients as smokers, non-smokers, or smoking status unknown.

  • The Custom Metadata folder is designed for creating local terms.


Create a “Smoking status” folder


Populate folder with “Smoker”, “Non-smoker”, etc


Smoking status custom metadata
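As a rough picture of what the resulting metadata rows might contain, the sketch below builds a "Smoking status" folder with three leaf terms under Custom Metadata. The `c_basecode` column, the `LOCAL:SMOKING:` code prefix, and the `FA`/`LA` visual attribute values are assumptions for the example; the Edit Tool itself creates these rows through its user interface.

```python
FOLDER = ("Custom Metadata", "Smoking status")

def make_row(segments, visual, basecode=None):
    """Build one metadata row; the level equals the number of path segments."""
    return {
        "c_hlevel": len(segments),
        "c_fullname": "\\" + "\\".join(segments) + "\\",
        "c_name": segments[-1],
        "c_visualattributes": visual,  # assumed: FA = folder, LA = leaf
        "c_basecode": basecode,        # assumed local code, None for folders
    }

smoking_rows = [make_row(FOLDER, "FA")]
for name, code in [
    ("Smoker", "SMOKER"),
    ("Non-smoker", "NONSMOKER"),
    ("Smoking status unknown", "UNKNOWN"),
]:
    smoking_rows.append(
        make_row(FOLDER + (name,), "LA", f"LOCAL:SMOKING:{code}")
    )

for row in smoking_rows:
    print(row["c_hlevel"], row["c_fullname"], row["c_basecode"])
```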
