
Making Sense of Non-structured Data in the Medical World

Making Sense of Non-structured Data in the Medical World. Dirk Van Hyfte, Misha Bouzinier. Non-structured Data in the Medical World. The most valuable clinical data is often hidden inside doctors' notes (written in free form).


Presentation Transcript


  1. Making Sense of Non-structured Data in the Medical World Dirk Van Hyfte, Misha Bouzinier

  2. Non-structured Data in the Medical World • The most valuable clinical data is often hidden inside doctors' notes (written in free form) • We are developing tools that use this data to answer real-world practical questions • First of all: clinical questions • Two main topics of this presentation: • What kind of questions can be addressed with our tools • How to do it

  3. Agenda • Examples of clinical questions • And customers that are trying to answer them with the help of our tools • What is unstructured data? • What technology do we use? • Practical recipes and solutions • A broader view: from healthcare to the whole Life Science domain

  4. Three types of clinical questions: • Identify patients with a certain condition • Maastro Clinic • Janssen Pharmaceutica • Weill Cornell • University Hospital Brussels • Identify indicators for a certain condition • Parnassia • Extract indicators to feed predictive models • UZA Intensive Care INZO

  5. Identify patients with a certain condition [Diagram labels: Clinical Protocol; unstructured; structured; Cloud of Patients]

  6. Identify patients with a certain condition • Acta Oncol. 2012 Sep 5. Metformin use and improved response to therapy in esophageal adenocarcinoma. Skinner HD, McCurdy MR, Echeverria AE, Lin SH, Welsh JW, O'Reilly MS, Hofstetter WL, Ajani JA, Komaki R, Cox JD, Sandulache VC, Myers JN, Guerrero TM. Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA.

  7. Identify patients with a certain condition Metformin Diabetic Head & Neck Cancer

  8. Identify patients with a certain condition Metformin Diabetic Head & Neck Cancer 0 in metadata Forms In notes Partially in metadata Forms In notes All in Metadata

  9. Identify patients with a certain condition

  10. Identify patients with a certain condition

  11. Background • Competing on Unstructured Data • Hidden Structure Revealed… • …And Presented • Different types of (lack of) Structure

  12. Competing on Unstructured Data • You have already heard it: • Gartner: • 80% of business is conducted on unstructured information • Unstructured data doubles every three months • Ovum (aka Butler Group): • 85% of all data stored is held in an unstructured format • Is there really such a thing as unstructured data? • The only really unstructured data is white noise.

  13. Hidden Structure Revealed… • Data that is used to transmit or store information must have some structure • However, analytical and reporting tools might not understand it out of the box • The sentence “He studies linguistics at the university” is an example of “unstructured data” • Do you think the picture to the right is unstructured? http://en.wikipedia.org/wiki/X-bar_theory#A_full_sentence

  14. …And Presented • The grammatical rules of a natural language are an example of a hidden structure • From the DocBook: “The iKnow semantic analysis engine is used to analyze unstructured data, data that is written as text in a human language such as English or French.” • In other words, iKnow is about revealing the hidden or maybe disguised structure • And presenting it in a form accessible to tools • Relational tables are probably the most convenient form of output

  15. Are all Texts Similar? • However, different domains use different structures • Grammatical rules alone are insufficient • Are all texts written by humans born equal? • Are a newspaper article and a doctor's note the same kind of text?

  16. Spot 10 Differences

  17. Key Features of Medical Texts • There are three special features of healthcare-related texts • Grammatical rules of the language are partially replaced with geometrical arrangements • Much of the information can be represented as simple Key-Value Pairs rather than a Parse Tree • There is the UMLS, a unique toolkit • These features both require and allow for a special technique

  18. 1. Grammatical Rules Replaced • Reading a medical note requires as much image recognition as free text analysis • Relations between concepts are not recognized linguistically • “Weight: 69.2kg”: the “:” means “is” or “has value of” • “[X] Xerostomia” means “Xerostomia is present” • “[ ] Odynophagia” means “Odynophagia is absent”
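
The two layout conventions on this slide can be read mechanically. The following is a minimal Python sketch, not the presenters' tooling: the regular expressions and the function name read_form_line are assumptions made for illustration, and real notes will need broader patterns.

```python
import re

# Hypothetical patterns for the two conventions on the slide; real notes vary.
KEY_VALUE = re.compile(r"^\s*(?P<key>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>\S.*?)\s*$")
CHECKBOX = re.compile(
    r"\[(?P<mark>[ xX]?)\]\s*(?P<label>[A-Za-z][A-Za-z ]*?)(?=\s*\[|\s*$)"
)

def read_form_line(line):
    """Return (concept, value) pairs that the line encodes through layout."""
    pairs = []
    # A checkbox stands for "concept is present" or "concept is absent".
    for m in CHECKBOX.finditer(line):
        state = "present" if m.group("mark").strip() else "absent"
        pairs.append((m.group("label"), state))
    # Without checkboxes, a colon stands for "is" or "has value of".
    if not pairs:
        m = KEY_VALUE.match(line)
        if m:
            pairs.append((m.group("key"), m.group("value")))
    return pairs

print(read_form_line("Weight: 69.2kg"))
print(read_form_line("[X] Xerostomia"))
print(read_form_line("[ ] Odynophagia"))
```

This sketch reads an empty checkbox as "absent", exactly as the slide does; whether that reading is always safe is a separate question.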

  19. 2. Key-Value Pairs • A natural language phrase produces a complex parse tree • A medical record often contains a lot of simple key-value pairs • This part of a record: • Can be represented as a table:

  20. 2… Missing Values – Ambiguities • Medical forms contain a lot of questions • Not all of them are answered • Those questions pollute the bag of concepts associated with a source • On a previous slide we interpreted empty checkboxes [ ] as negative responses, but are they? • Here is another line: • [ ] Odynophagia [ ] Dysphagia [ ] Diarrhoea [ ] Vomiting • Does it mean that the patient does not have the symptoms above, or that the questions have not actually been asked?
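
One possible way to keep this ambiguity visible is a three-valued reading of checkbox lines. The sketch below is a heuristic assumed for illustration only, not a rule stated in the presentation: if no box on a line is ticked, every item is reported as "unknown" (the questions may not have been asked); if at least one box is ticked, the empty boxes on that line are read as "absent". The function name interpret_checkbox_line is hypothetical.

```python
import re

# Matches "[X] Label" or "[ ] Label" with single-word labels (an assumption).
CHECKBOX = re.compile(r"\[([ xX]?)\]\s*([A-Za-z]+)")

def interpret_checkbox_line(line):
    """Map each checkbox label to 'present', 'absent', or 'unknown'."""
    boxes = [(label, bool(mark.strip())) for mark, label in CHECKBOX.findall(line)]
    any_checked = any(checked for _, checked in boxes)
    result = {}
    for label, checked in boxes:
        if checked:
            result[label] = "present"
        elif any_checked:
            # Someone filled in this line, so an empty box is likely a "no".
            result[label] = "absent"
        else:
            # Nothing ticked at all: the line may never have been filled in.
            result[label] = "unknown"
    return result

print(interpret_checkbox_line("[ ] Odynophagia [ ] Dysphagia [ ] Diarrhoea [ ] Vomiting"))
print(interpret_checkbox_line("[X] Xerostomia [ ] Odynophagia"))
```

Items reported as "unknown" can then be kept out of the bag of concepts for the note, so unanswered questions do not pollute it.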

  21. 3. Data Aggregation by Mapping to UMLS • UMLS stands for Unified Medical Language System. • It incorporates several vocabularies, including ICD-10, MeSH, SNOMED CT, DSM-IV, and LOINC. • We can map either to any UMLS concept or only to a concept present in a specified vocabulary (or several vocabularies). • It is a truly unique resource containing a lot of semantic information • Mapping to UMLS is like machine translation into a medical language • See: http://www.nlm.nih.gov/research/umls/
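
The idea of mapping either to any concept or only to concepts from chosen vocabularies can be illustrated with a toy lookup table. Everything in this Python sketch is made up for illustration: the identifiers ("CUI-0001", "CUI-0002") are placeholders rather than real UMLS concept identifiers, the vocabulary assignments are invented, and map_term is a hypothetical helper, not a UMLS or iKnow API.

```python
# Toy stand-in for a concept table: term -> (placeholder concept id, source vocabularies).
# The ids and vocabulary assignments below are invented for this example.
TOY_CONCEPTS = {
    "dry mouth": ("CUI-0001", {"MeSH", "SNOMED CT"}),
    "xerostomia": ("CUI-0001", {"MeSH", "SNOMED CT"}),
    "painful swallowing": ("CUI-0002", {"SNOMED CT"}),
    "odynophagia": ("CUI-0002", {"SNOMED CT"}),
}

def map_term(term, vocabularies=None):
    """Return the concept id for a term, optionally restricted to given vocabularies."""
    entry = TOY_CONCEPTS.get(term.strip().lower())
    if entry is None:
        return None
    concept_id, sources = entry
    # With a restriction, the concept must occur in at least one listed vocabulary.
    if vocabularies is not None and not sources & set(vocabularies):
        return None
    return concept_id

print(map_term("Dry mouth"))                      # any vocabulary
print(map_term("odynophagia", ["MeSH"]))          # restricted: not found in this toy table
print(map_term("odynophagia", ["SNOMED CT"]))     # restricted: found
```

Synonyms land on the same identifier, which is what makes the mapping behave like a translation into one shared medical language.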

  22. Domain Specific Technique • Separate “forms” from “free” text (when feasible). • Resort to Advanced Regular Expression Search when forms and text are embedded too tightly • Using UMLS: • Generalization of Terms (UMLS Unique Concepts) • Analyzing relations (between UMLS High Level Concepts)

  23. What is a Form? • Separation of “forms” and “free” text. • But what is “free text” and what is a “form”? Usually neither is clear and obvious in a medical note • For our purposes we define a form as a part of a text where visual arrangement and special characters are used to convey relations between concepts • Let us call a part of a document free text if it is governed by linguistic and grammatical rules

  24. How to Deal with Forms • Identify key-value pairs before language processing and remove them • NB: Suppress all keys with missing values! • Store key-value pairs in a separate table • This table can be joined with a parse table from free text processing • It can be used in complex regular expression searches
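
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenters' pipeline: the "Key: value" pattern, the function names split_note and store_pairs, and the key_value table layout are all hypothetical, and an in-memory SQLite database stands in for whatever relational store is actually used.

```python
import re
import sqlite3

# Assumed form convention: a whole line of the shape "Key: value".
KEY_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.*?)\s*$")

def split_note(note_id, text):
    """Separate 'Key: value' lines from free text; drop keys with no value."""
    pairs, free_lines = [], []
    for line in text.splitlines():
        m = KEY_VALUE.match(line)
        if m:
            key, value = m.groups()
            if value:  # suppress keys with missing values
                pairs.append((note_id, key, value))
        else:
            free_lines.append(line)  # left for language processing
    return pairs, "\n".join(free_lines)

def store_pairs(conn, pairs):
    """Store the pairs in a separate table that can later be joined on note_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS key_value (note_id INTEGER, key TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO key_value VALUES (?, ?, ?)", pairs)

note = "Weight: 69.2kg\nHeight:\nPatient completed chemotherapy last month."
pairs, free_text = split_note(1, note)
conn = sqlite3.connect(":memory:")
store_pairs(conn, pairs)
print(conn.execute("SELECT * FROM key_value").fetchall())
print(free_text)
```

A free-text sentence that happens to have the shape "Word: rest" would also be captured by this simple pattern, so a real separation step needs more context than a single line.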

  25. Text Transformations DEMO

  26. Generic SetBuilder Schema SetBuilder iKnow Key-Value pairs Sections Text Text Transformation Structured Data Text

  27. Simple Search Example • Assume we want to search for: • patients with completed chemotherapy • We can look for complete statements that contain entities similar to both “chemo” and “completed” • This is like a regular expression search
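
A rough plain-Python analogue of this search is shown below. It is only a sketch under simplifying assumptions: a "statement" is approximated by a sentence split on end punctuation, "similar to" is approximated by a shared word prefix, and find_statements is a hypothetical helper rather than an iKnow function.

```python
import re

def find_statements(text, *stems):
    """Return sentences in which every stem is the prefix of some word."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        if all(any(word.startswith(stem) for word in words) for stem in stems):
            hits.append(sentence)
    return hits

notes = "Chemotherapy was completed in May. Patient declined chemo. Surgery completed."
print(find_statements(notes, "chemo", "complet"))
```

Requiring both entities in the same statement filters out notes that merely mention chemotherapy, but this sketch does not handle negation: "chemotherapy was not completed" would still match.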

  28. Example: Mapping Clinical Text to UMLS

  29. Advanced Regular Expressions (Including UMLS) • This really complex query searches Therapeutic or Preventive Procedures performed on a patient:

  30. Applying Analytics: Domain Entities and Domain Coverage • In the end we want to draw some conclusions… • Free text is a very high-dimensional object • For all practical purposes it is infinite-dimensional (Hilbert space) • To analyze it (or do something useful) we need to reduce its dimensionality • There are mathematical tricks to do it but we need some starting point • We need to select a group of the most important entities [Image: David Hilbert]
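
One simple starting point for reducing dimensionality is to keep only the entities that occur in the most documents and to describe each document by the presence or absence of those entities. The Python sketch below assumes document frequency as the importance measure, which the slide does not specify; the function names top_entities and to_vector and the sample data are illustrative.

```python
from collections import Counter

def top_entities(docs, k):
    """Pick the k entities that occur in the most documents."""
    doc_freq = Counter()
    for entities in docs:
        doc_freq.update(set(entities))  # count each entity once per document
    # Sort by descending frequency, then by name so ties are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]

def to_vector(entities, vocabulary):
    """Represent one document as a 0/1 vector over the selected entities."""
    present = set(entities)
    return [1 if name in present else 0 for name in vocabulary]

docs = [
    ["metformin", "diabetes"],
    ["diabetes", "xerostomia"],
    ["diabetes", "metformin", "dysphagia"],
]
vocabulary = top_entities(docs, 2)
print(vocabulary)
print([to_vector(d, vocabulary) for d in docs])
```

Each document is now a point in a k-dimensional space, which is a workable input for standard analytics, at the cost of ignoring every entity outside the selected group.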

  31. Using UMLS to Improve Domain Coverage • Mapping to UMLS generalizes terms and thus vastly improves domain coverage • For EPR:
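
Why generalization helps coverage can be seen on a toy example. In this Python sketch, coverage is taken to mean the fraction of documents that contain at least one selected entity; that definition, the functions coverage and generalize, the concept label "CONCEPT_XEROSTOMIA", and the sample documents are all assumptions made for illustration.

```python
def coverage(docs, selected):
    """Fraction of documents that contain at least one selected entity."""
    if not docs:
        return 0.0
    selected = set(selected)
    covered = sum(1 for entities in docs if selected & set(entities))
    return covered / len(docs)

def generalize(docs, term_to_concept):
    """Replace each term by its concept where a mapping is known."""
    return [[term_to_concept.get(term, term) for term in entities] for entities in docs]

docs = [["dry mouth"], ["xerostomia"], ["xerostomia"], ["cough"]]
# Hypothetical mapping of two synonyms to one placeholder concept label.
mapping = {"dry mouth": "CONCEPT_XEROSTOMIA", "xerostomia": "CONCEPT_XEROSTOMIA"}

print(coverage(docs, ["xerostomia"]))                               # raw terms
print(coverage(generalize(docs, mapping), ["CONCEPT_XEROSTOMIA"]))  # after mapping
```

With raw terms, one selected entity reaches only the documents that use that exact wording; after mapping, the same single entity also reaches documents that use a synonym.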

  32. Texts in Life Science Domain (Beyond Patient Records)

  33. Analyzing Research Papers • PubMed comprises more than 23 million citations for biomedical literature from MEDLINE, life science journals, and online books. • Indexing PubMed abstracts with iKnow provides new insight • Our partner, TM-7, is using iKnow-indexed PubMed to identify relations between concepts

  34. UMLS and Social Media • When we tried to analyze social media blog posts on a medical forum with iKnow entities, we could not do it • Mapping to UMLS makes it practical:

  35. What is coming next… • Disambiguation • based on similarity of relations: The Pair-Pattern Matrix • Knowledge Discovery • Text Categorization • Similarity and Search by Example
