Making Sense of Non-structured Data in the Medical World

Making Sense of Non-structured Data in the Medical World Dirk Van Hyfte, Misha Bouzinier

Non-structured Data in the Medical World • The most valuable clinical data is often hidden inside doctors' notes, which are written in free form • We are developing tools that use this data to answer real-world practical questions • First of all: clinical questions • Two main topics of this presentation: • What kind of questions can be addressed with our tools • How to do it

Agenda • Examples of clinical questions • And customers that are trying to answer them with the help of our tools • What is unstructured data? • What technology do we use? • Practical recipes and solutions • A broader view: from healthcare to the whole Life Science domain

Three types of clinical questions: • Identify patients with a certain condition (Maastro Clinic, Janssen Pharmaceutica, Weill Cornell, University Hospital Brussels) • Identify indicators for a certain condition (Parnassia) • Extract indicators to feed predictive models (UZA Intensive Care INZO)

Identify patients with a certain condition • (Diagram labels: Clinical Protocol, unstructured, structured, Cloud of Patients)

Identify patients with a certain condition • Acta Oncol. 2012 Sep 5. Metformin use and improved response to therapy in esophageal adenocarcinoma. Skinner HD, McCurdy MR, Echeverria AE, Lin SH, Welsh JW, O'Reilly MS, Hofstetter WL, Ajani JA, Komaki R, Cox JD, Sandulache VC, Myers JN, Guerrero TM. Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA.

Identify patients with a certain condition • Metformin • Diabetic • Head & Neck Cancer

Identify patients with a certain condition • Metformin: 0 in metadata; forms; in notes • Diabetic: partially in metadata; forms; in notes • Head & Neck Cancer: all in metadata

Identify patients with a certain condition

Background • Competing on Unstructured Data • Hidden Structure Revealed… • …And Presented • Different types of (lack of) Structure

Competing on Unstructured Data • You already heard it: • Gartner: • 80% of business is conducted on unstructured information • Unstructured data doubles every three months • Ovum (aka Butler group): • 85% of all data stored is held in an unstructured format • Is there really such a thing as unstructured data? • The only really unstructured data is white noise.

Hidden Structure Revealed… • Data that is used to transmit or store information must have some structure • However, analytical and reporting tools might not understand it out of the box • “He studies linguistics at the university” is an example of “unstructured data” • Do you think the picture to the right is unstructured? http://en.wikipedia.org/wiki/X-bar_theory#A_full_sentence

…And Presented • The grammatical rules of a natural language are an example of hidden structure • From the DocBook: “The iKnow semantic analysis engine is used to analyze unstructured data, data that is written as text in a human language such as English or French.” • In other words, iKnow is about revealing the hidden, or maybe disguised, structure • And presenting it in a form accessible to tools • Relational tables are probably the most convenient form of output

Are all Texts Similar? • However, different domains use different structures • Grammatical rules alone are insufficient • Are all texts written by humans born equal? • Are a newspaper article and a doctor's note the same kind of text?

Spot 10 Differences

Key Features of Medical Texts • There are three special features of healthcare-related texts • The grammatical rules of the language are partially replaced with geometrical arrangements • Much of the information can be represented as simple Key-Value Pairs rather than a Parse Tree • There is UMLS, a unique toolkit • These features both require and allow for a special technique

1. Grammatical Rules Replaced • Reading a medical note requires as much image recognition as free text analysis • Relations between concepts are not recognized linguistically • In “Weight: 69.2kg”, the “:” means “is” or “has the value of” • “[X] Xerostomia” means “Xerostomia is present” • “[ ] Odynophagia” means “Odynophagia is absent”
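The reading rules on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling used in the presentation: the function name `parse_form_line` and both regular expressions are assumptions made for the example, and they cover only the two layouts shown above.

```python
import re

# "Key: value" lines such as "Weight: 69.2kg" (":" read as "has the value of").
KEY_VALUE = re.compile(
    r"^\s*(?P<key>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>\S.*?)\s*$"
)
# Checkbox items such as "[X] Xerostomia" or "[ ] Odynophagia".
CHECKBOX = re.compile(
    r"\[(?P<mark>[xX ]?)\]\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s*(?=\[|$)"
)

def parse_form_line(line):
    """Return (key, value) pairs read from one form-like line.

    Checkbox items yield a boolean value (True = ticked);
    "Key: value" lines yield the value as a string.
    """
    boxes = [(m.group("label"), m.group("mark").strip() != "")
             for m in CHECKBOX.finditer(line)]
    if boxes:
        return boxes
    m = KEY_VALUE.match(line)
    if m:
        return [(m.group("key"), m.group("value"))]
    return []  # not a form line; leave it to free-text analysis

print(parse_form_line("Weight: 69.2kg"))
print(parse_form_line("[X] Xerostomia"))
print(parse_form_line("[ ] Odynophagia"))
```

Real notes use many more layouts (columns, tabs, indentation), so patterns like these would need to be adapted per source.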

2. Key-Value Pairs • A natural language phrase produces a complex parse tree • A medical record often contains many simple key-value pairs • Such a part of a record can be represented as a table

2… Missing Values – Ambiguities • Medical forms contain a lot of questions • Not all of them are answered • Unanswered questions pollute the bag of concepts associated with a source • On a previous slide we interpreted empty checkboxes [ ] as negative responses, but are they? • Here is another line: • [ ] Odynophagia [ ] Dysphagia [ ] Diarrhoea [ ] Vomiting • Does it mean that the patient has none of the symptoms above, or that the questions were never actually asked?
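One possible way to handle this ambiguity is to keep three states instead of two. The sketch below is a hypothetical heuristic, not something the slides prescribe: if at least one box in a group is ticked, the empty boxes are read as "absent"; if no box is ticked, every item is marked "unknown". The function name and the state labels are assumptions for the example.

```python
import re

# One checkbox item: optional mark inside brackets, then a one-word label.
CHECKBOX = re.compile(r"\[([xX ]?)\]\s*([A-Za-z]+)")

def interpret_checkbox_line(line):
    """Map each checkbox label on a line to 'present', 'absent' or 'unknown'."""
    items = [(label, mark.strip() != "") for mark, label in CHECKBOX.findall(line)]
    any_ticked = any(ticked for _, ticked in items)
    result = {}
    for label, ticked in items:
        if ticked:
            result[label] = "present"
        elif any_ticked:
            # Someone filled in this group, so an empty box is read as "no".
            result[label] = "absent"
        else:
            # Nothing ticked: the questions may never have been asked.
            result[label] = "unknown"
    return result

print(interpret_checkbox_line("[ ] Odynophagia [ ] Dysphagia [ ] Diarrhoea [ ] Vomiting"))
print(interpret_checkbox_line("[X] Xerostomia [ ] Odynophagia"))
```

Whether a half-filled group really implies "absent" for the empty boxes depends on how the form was used, so this rule should be validated against the source before relying on it.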

3. Data Aggregation by Mapping to UMLS • UMLS stands for Unified Medical Language System. • Its Metathesaurus integrates many source vocabularies, including ICD-10, MeSH, SNOMED CT, DSM-IV, and LOINC. • We can map either to any UMLS concept or only to concepts present in one or more specified vocabularies. • It is a truly unique resource containing a lot of semantic information • Mapping to UMLS is like a machine translation into a medical language • See: http://www.nlm.nih.gov/research/umls/
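The choice between mapping to any concept and mapping only within specified vocabularies can be illustrated with a toy lookup table. Everything in this sketch is invented for illustration: the concept ids (`TOY-0001`, `TOY-0002`), the vocabulary memberships, and the function name `map_terms` are assumptions, not real UMLS content or a real UMLS API.

```python
# Toy lookup: surface term -> (concept id, vocabularies containing the concept).
# All ids and vocabulary memberships are made up for this example.
TOY_CONCEPTS = {
    "dry mouth": ("TOY-0001", {"SNOMED CT", "MeSH"}),
    "xerostomia": ("TOY-0001", {"SNOMED CT", "MeSH"}),
    "painful swallowing": ("TOY-0002", {"SNOMED CT"}),
    "odynophagia": ("TOY-0002", {"SNOMED CT"}),
}

def map_terms(terms, vocabularies=None):
    """Map terms to concept ids, optionally restricted to given vocabularies."""
    mapped = {}
    for term in terms:
        entry = TOY_CONCEPTS.get(term.lower())
        if entry is None:
            continue  # term not known to the lookup table
        concept_id, sources = entry
        if vocabularies is None or sources & set(vocabularies):
            mapped[term] = concept_id
    return mapped

terms = ["Dry mouth", "Xerostomia", "Odynophagia"]
print(map_terms(terms))                         # any concept
print(map_terms(terms, vocabularies=["MeSH"]))  # only concepts found in MeSH
```

Note how two different surface terms collapse onto one concept id; that is the aggregation effect the slide title refers to.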

Domain Specific Technique • Separate “forms” from “free” text (when feasible). • Resort to Advanced Regular Expression Search when forms and text are embedded too tightly • Using UMLS: • Generalization of Terms (UMLS Unique Concepts) • Analyzing relations (between UMLS High Level Concepts)

What is a Form? • Separation of “forms” and “free” text. • But what is “free text” and what is a “form”? Usually neither is clear-cut in a medical note • For our purposes we define a form as a part of the text where visual arrangement and special characters are used to convey relations between concepts • Let us call a part of a document free text if it is governed by linguistic and grammatical rules

How to Deal with Forms • Identify key-value pairs before language processing and remove them • NB: Suppress all keys with missing values! • Store the key-value pairs in a separate table • This table can be joined with a parse table from free text processing • It can be used in complex regular expression searches
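The steps on this slide can be sketched as a small pre-processing function. This is a minimal sketch under stated assumptions: only single-line “Key: value” pairs are recognized, the function name `split_forms_from_text` is hypothetical, and `source_id` is an assumed identifier that would later serve as the join key with a parse table.

```python
import re

# A whole line of the form "Key: value"; the value may be empty.
KEY_VALUE = re.compile(r"^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*(.*?)[ \t]*$")

def split_forms_from_text(note, source_id):
    """Separate 'Key: value' lines from free text.

    Returns (rows, free_text): rows are (source_id, key, value) tuples for a
    key-value table; free_text is what remains for language processing.
    """
    rows, free_lines = [], []
    for line in note.splitlines():
        m = KEY_VALUE.match(line)
        if m is None:
            free_lines.append(line)
            continue
        key, value = m.group(1), m.group(2)
        if value:  # keys with missing values are suppressed entirely
            rows.append((source_id, key, value))
    return rows, "\n".join(free_lines)

note = "Weight: 69.2kg\nHeight:\nPatient completed chemotherapy last week."
rows, free_text = split_forms_from_text(note, source_id=1)
print(rows)
print(free_text)
```

A sentence that happens to contain a colon after a few words would also be captured as a pair here, which is one reason the separation is only done "when feasible".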

Text Transformations DEMO

Generic SetBuilder Schema • (Diagram labels: SetBuilder, iKnow, Key-Value pairs, Sections, Text, Text Transformation, Structured Data)

Simple Search Example • Assume we want to search for: • patients with completed chemotherapy • We can look for complete statements that contain two entities: one similar to “chemo” and one similar to “completed” • This is like a regular expression search
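The search described above can be approximated in plain Python. This is an illustrative sketch, not the iKnow query itself: sentences stand in for “complete statements”, a case-insensitive substring test stands in for “similar to”, and the function name `find_statements` is an assumption.

```python
import re

def find_statements(text, *stems):
    """Return the sentences that contain every stem (case-insensitive)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = [s.lower() for s in stems]
    return [s for s in sentences if all(w in s.lower() for w in wanted)]

text = ("Patient completed chemotherapy in May. "
        "Chemo is planned for the spouse. "
        "Radiotherapy completed.")
print(find_statements(text, "chemo", "complet"))
```

Requiring both stems in the same sentence filters out notes that merely mention chemotherapy and completion in unrelated statements.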

Example: Mapping Clinical Text to UMLS

Advanced Regular Expressions (Including UMLS) • This more complex query searches for Therapeutic or Preventive Procedures performed on a patient

Applying Analytics: Domain Entities and Domain Coverage • In the end we want to draw some conclusions… • Free text is a very high-dimensional object • For all practical purposes it is infinite-dimensional (Hilbert Space) • To analyze it (or do something useful) we need to reduce its dimensionality • There are mathematical tricks to do it but we need some starting point • We need to select a group of the most important entities • (Slide image: David Hilbert)
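Selecting a group of important entities and checking how much of the domain they cover can be sketched as follows. The definitions used here are assumptions made for the example, not necessarily those used in the presentation: an entity's importance is the number of sources it appears in, and coverage is the fraction of sources containing at least one selected entity.

```python
from collections import Counter

def top_entities(sources, k):
    """Pick the k entities that occur in the largest number of sources."""
    doc_freq = Counter(e for entities in sources for e in set(entities))
    return [e for e, _ in doc_freq.most_common(k)]

def coverage(sources, selected):
    """Fraction of sources that contain at least one selected entity."""
    chosen = set(selected)
    hit = sum(1 for entities in sources if chosen & set(entities))
    return hit / len(sources)

# Each source is the list of entities extracted from one note (toy data).
sources = [
    ["chemotherapy", "nausea"],
    ["chemotherapy", "fatigue"],
    ["radiotherapy", "fatigue"],
    ["chemotherapy"],
]
selected = top_entities(sources, k=2)
print(selected, coverage(sources, selected))
```

Restricting the analysis to the selected entities turns each note into a short, fixed-length vector, which is the starting point the slide asks for.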

Using UMLS to Improve Domain Coverage • Mapping to UMLS generalizes terms and thus vastly improves domain coverage • For EPR:

Texts in Life Science Domain (Beyond Patient Records)

Analyzing Research Papers • PubMed comprises more than 23 million citations for biomedical literature from MEDLINE, life science journals, and online books. • Indexing PubMed abstracts with iKnow provides new insight • Our partner TM-7 is using iKnow-indexed PubMed to identify relations between concepts

UMLS and Social Media • When we tried to analyze social media blog posts on a medical forum using iKnow entities alone, we could not do it • Mapping to UMLS makes it practical:

What is coming next… • Disambiguation • based on similarity of relations: The Pair-Pattern Matrix • Knowledge Discovery • Text Categorization • Similarity and Search by Example
