Introduction to MedIEQ

Quality Labelling of Medical Web content using Multilingual Information Extraction

Martin Labský

Knowledge Engineering Group (KEG)

University of Economics Prague (UEP)

WP6 – Information Extraction

Purpose of MedIEQ

  • Medical web sites are increasingly popular

  • Content strongly affects users’ decisions

  • Therefore, quality labeling is very important

  • Agencies invest large effort into labeling websites manually

  • We develop tools to minimize their effort

  • Tools will be multilingual and will support different, evolving labeling criteria

  • Partners

  • Description of relevant work packages [3]

    • Web content collection, Information Extraction, Lexical and semantic resources

    • Goals, tasks, partners

    • Existing tools (to be extended)

    • New tools (to be developed)

    • Existing resources (to be made accessible)

  • Milestones & deliverables

  • References

  • Questions

Partners

  • Agencies

    • WMA: Web Médica Acreditada (Es)

      • assigns a quality label that is shown on medical websites

      • websites apply for the label, receive suggested changes, then obtain it

    • AQuMed: Agency for Quality in Medicine (De)

      • maintains a web directory organized by topics

      • only good-quality websites are present

  • Developers

    • NCSR Demokritos and I-Sieve (spin-off) (Gr)

    • UEP: University of Economics Prague (Cz)

    • UNED: National University of Distance Education (Es)

    • HUT: Helsinki University of Technology (Fi)

Web Content Collection (WP5)

Website monitoring

  • Regular visits to labeled websites

  • Checking pages

    • for relevant changes

    • which changes are relevant?

      • manual rules, machine learning...

    • alert agency when significant changes occur

    • or, increase the website’s (web page’s) priority in a list of to-be-checked resources

    • show what has changed, suggest solution

  • Needed by WMA, AQuMed
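
The change check above could be sketched as follows. This is a minimal illustration, not the project's monitoring tool: it diffs two snapshots of a page with Python's `difflib` and applies one manual rule, keeping only changed lines that mention a term of interest. The term list `RELEVANT_TERMS` and both function names are assumptions made for the example.

```python
import difflib

# Hypothetical term list; real relevance rules would be derived from
# the agencies' labeling criteria or learned from data.
RELEVANT_TERMS = ("sponsor", "privacy", "contact", "author")


def changed_lines(old_text, new_text):
    """Return lines added to ("+ ") or removed from ("- ") a page between two visits."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def relevant_changes(old_text, new_text, terms=RELEVANT_TERMS):
    """Manual rule: a change is relevant if the changed line mentions a term of interest."""
    return [
        line
        for line in changed_lines(old_text, new_text)
        if any(term in line.lower() for term in terms)
    ]
```

A non-empty result from `relevant_changes` could either trigger an alert to the agency or raise the page's priority in the list of resources to be checked; the returned lines also show what has changed.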

Web focused crawling

  • Find new medical websites

  • Use multiple existing search engines

    • specify lists of keywords / keyphrases

    • give sample “similar” documents

    • use Google/Yahoo API and filter their results

  • NCSR already has a focused crawler

    • we should contribute to its development

  • Needed by WMA
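
Filtering search-engine results by keyphrases might look like the sketch below. It does not call any search API and is not the NCSR crawler: results are assumed to arrive as dictionaries with hypothetical `url`, `title` and `snippet` keys, and the scoring rule (fraction of keyphrases found) and the `threshold` default are illustrative choices.

```python
def keyphrase_score(text, keyphrases):
    """Fraction of keyphrases that occur in the text (case-insensitive)."""
    lowered = text.lower()
    hits = sum(1 for phrase in keyphrases if phrase.lower() in lowered)
    return hits / len(keyphrases) if keyphrases else 0.0


def filter_results(results, keyphrases, threshold=0.5):
    """Keep results whose title and snippet match enough keyphrases, best first."""
    kept = []
    for result in results:
        text = result["title"] + " " + result["snippet"]
        score = keyphrase_score(text, keyphrases)
        if score >= threshold:
            kept.append((score, result["url"]))
    # Sort by descending score; Python's sort is stable, so ties keep input order.
    return [url for score, url in sorted(kept, key=lambda pair: -pair[0])]
```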

Website spidering

  • Walk pages of a single website

  • Classify each page

    • in order to choose relevant docs for quality labeling

    • e.g. contact page, page containing treatment description, page with sponsors

    • use machine learning, e.g. based on a bag-of-words (unigram, bigram) document representation

  • Spidering strategy

    • which documents belong together (e.g. page 1/7)

    • which links to follow next

  • NCSR has a spider

    • uses classifiers from Weka for doc classification

    • we should contribute
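
To make the bag-of-words idea concrete, here is a small sketch of page classification over unigram and bigram counts. It uses a hand-written multinomial naive Bayes classifier as a stand-in; the NCSR spider uses Weka classifiers, and nothing below reflects its actual code. The class name, the feature function and the example labels are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Unigram and bigram bag-of-words counts for a page's text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


class NaiveBayesPageClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, labeled_pages):
        """labeled_pages: iterable of (page_text, label) pairs."""
        for text, label in labeled_pages:
            feats = features(text)
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocabulary.update(feats)

    def classify(self, text):
        """Return the label with the highest log-probability for the page."""
        total_pages = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(count / total_pages)
            for feat, n in features(text).items():
                if feat in self.vocabulary:  # ignore features never seen in training
                    score += n * math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a handful of labeled pages (for example `"contact"`, `"sponsors"`, `"treatment"`), the classifier lets the spider pass only the relevant page types on to extraction.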

Information Extraction (WP6)

IE introduction

  • Documents to extract from

    • pages retrieved & classified by spider

      • from known websites

      • from crawler

    • monitored labeled pages that have changed

  • Information to be extracted

    • derived from agencies’ labeling criteria

    • e.g. contact information of responsible persons, sponsor names, privacy warning texts...

  • Questions

    • how much human intervention needed?

    • complexity of label sets to be supported?

    • methodology of porting to a new language?

Example extracted information I.

  • Transparency and honesty

    • site provider (company name, contact)

    • site purpose, type of target audience

    • funding (grants, sponsors)

  • Authority

    • source citation for information provided, its type and date

    • names and credentials of all information providers

  • Privacy and data protection

    • privacy policy description

  • Timeliness of information

    • dates of publication/modification

  • Accountability

    • names (and roles) of people responsible for presented information

    • editorial policy description

Example extracted information II.

  • Content

    • medical terms, e.g. disease and drug names

    • statements recommending a certain product/method

    • advertisements

    • disallowed combinations (e.g. advertisement for X adjacent to an article related to X)

  • Formal

    • mandatory statements (e.g. importance of physical examination, privacy warnings when posting data into chats)
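
The "disallowed combinations" check could be sketched like this, assuming earlier steps have already segmented a page into ordered blocks and tagged each with the medical terms it mentions. The block representation and the function name are hypothetical, and "adjacent" is simplified to "next to each other in block order".

```python
def disallowed_adjacent_ads(blocks):
    """Find advertisements placed next to an article about the same term.

    blocks: ordered list of (kind, terms) pairs, where kind is "ad" or
    "article" and terms is a set of medical terms found in that block.
    Returns one sorted list of shared terms per violating pair.
    """
    violations = []
    for (kind_a, terms_a), (kind_b, terms_b) in zip(blocks, blocks[1:]):
        if {kind_a, kind_b} == {"ad", "article"}:
            shared = terms_a & terms_b
            if shared:
                violations.append(sorted(shared))
    return violations
```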

Sources of extraction knowledge

  • Training data

    • scarcity will be a problem for most extracted attributes

    • different types: labeled documents, sample extracted data, data previously extracted from the same website, domain dictionaries

  • Extraction patterns

    • induced (semi)automatically from scarce training data

    • or even authored manually

  • Background domain knowledge

    • relations between extracted attributes, cardinalities ...

    • e.g. typically just one company is the web site’s provider, but there are often multiple sponsors

  • Web site structure

    • exploit common formatting of a group of documents within a website

    • exploit common formatting used for a particular type of extracted data across different websites
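
Background knowledge about cardinalities can be used to sanity-check extraction output. The sketch below encodes the example from the slide (one provider, any number of sponsors) as a hypothetical constraint table and reports attributes whose number of extracted values is out of range; the attribute names and the table format are assumptions, not the representation used by any MedIEQ tool.

```python
# Hypothetical constraints: attribute -> (min, max); None means unbounded.
CARDINALITIES = {
    "provider": (1, 1),    # typically just one company is the site's provider
    "sponsor": (0, None),  # there are often multiple sponsors
}


def cardinality_violations(extracted, constraints=CARDINALITIES):
    """Return a message for each attribute with too few or too many values.

    extracted: dict mapping attribute name to a list of extracted values.
    """
    problems = []
    for attribute, (low, high) in constraints.items():
        count = len(extracted.get(attribute, []))
        if count < low:
            problems.append(f"{attribute}: found {count}, expected at least {low}")
        elif high is not None and count > high:
            problems.append(f"{attribute}: found {count}, expected at most {high}")
    return problems
```

A violation does not prove an extraction error, but it is a cheap signal that a candidate should be re-ranked or shown to a human reviewer.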

IE tools

  • Ex (UEP)

    • IE system under development using “extraction ontologies”

    • extracts instances from semi-structured documents

    • utilizes training data + manually defined patterns, includes spider

    • old version based on HMMs

  • Named entity recognizer (UNED)

    • extracts dates, person/institution names

  • 3rd party IE tools

    • wrapper management systems

    • e.g. LP2-based IE tool or annotation editor from Sheffield
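
As an illustration of what a manually defined pattern can look like, the sketch below extracts dates with a hand-written regular expression and validates them. It is not the Ex system or the UNED recognizer; the two supported formats (ISO `2006-03-15` and day.month.year `15.3.2006`) and the function name are assumptions chosen for the example.

```python
import re
from datetime import date

# Matches ISO dates (2006-03-15) and day.month.year dates (15.3.2006).
DATE_PATTERN = re.compile(
    r"\b(?:(?P<y1>\d{4})-(?P<m1>\d{2})-(?P<d1>\d{2})"
    r"|(?P<d2>\d{1,2})\.\s?(?P<m2>\d{1,2})\.\s?(?P<y2>\d{4}))\b"
)


def extract_dates(text):
    """Return all valid calendar dates matched by DATE_PATTERN, in text order."""
    dates = []
    for match in DATE_PATTERN.finditer(text):
        groups = match.groupdict()
        if groups["y1"]:
            parts = (groups["y1"], groups["m1"], groups["d1"])
        else:
            parts = (groups["y2"], groups["m2"], groups["d2"])
        try:
            dates.append(date(*map(int, parts)))
        except ValueError:
            continue  # matched the pattern but is not a real date, e.g. 31.02.2006
    return dates
```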

Website assessment

  • Check website’s technical correctness

    • SEO (findability in search engines with respect to some keyphrases)

    • accessibility (possibility of font enlargement, blind access, pages hidden deep in website structure, color schemes perceivable by anybody)

    • formal correctness (dead links, violations of HTML standards, failure to display well under at least the 3 most popular browsers)

  • Check non-technical correctness

    • e.g. typos, “clear, easy-to-understand language”

    • more: check for black-listed phrases, claims, etc.
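
Two of the checks above, dead links and black-listed phrases, could be sketched with the Python standard library as follows. This is an illustration under stated assumptions: the liveness test is passed in as a caller-supplied function `is_alive(url)` (for example an HTTP request), so the sketch itself performs no network access, and all names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(html):
    """Return the link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def dead_links(html, is_alive):
    """Return links for which the caller-supplied is_alive(url) returns False."""
    return [url for url in collect_links(html) if not is_alive(url)]


def blacklisted_phrases(text, blacklist):
    """Return the black-listed phrases that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in blacklist if phrase.lower() in lowered]
```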

Website assessment tools

  • Relaxed (UEP)

    • HTML validator based on Relax NG and Schematron patterns

    • can perform formal checks of website content beyond DTDs


  • SEO tool (UEP)

    • could Honza’s SEO tool be extended?

IE Deliverables

  • Duration: M1-M28

  • Deliverables

    • D8: Methodology & architecture of IE (M9)

    • D9.1: First version of IE toolkit (M15)

    • D9.2: Final version of IE toolkit (M24)

Lexical and semantic resources (WP7)

Lexical and semantic resources

  • Sp, De, En, Cz, Gr, Fi, Catalan (7!)

  • We are in charge of Cz, De(!)

  • Semantic

    • thesauri, ontologies (MeSH)

    • lists of cures, vaccine names, lists of medical companies, illnesses, diagnoses

    • generic ontologies and translation dictionaries (e.g. EuroWordNet)

  • Lexical

    • lemmatizers/morphology analyzers, part-of-speech taggers, chunkers, syntactic parsers

    • medical document collections (for classification)

References

  • MedIEQ:



  • Related projects:

    • WRAPIN

    • Quatro


  • Relaxed:


  • Ex:


  • Ellogon:


Questions

  • ?
