Introduction to MedIEQ

Quality Labelling of Medical Web content using Multilingual Information Extraction

http://zeus.iit.demokritos.gr/medieq

Martin Labský

Knowledge Engineering Group (KEG)

University of Economics Prague (UEP)

WP6 – Information Extraction

Purpose of MedIEQ
  • Medical web sites are increasingly popular
  • Content strongly affects users’ decisions
  • Therefore, quality labeling is very important
  • Agencies invest considerable effort in labeling websites manually
  • We develop tools to minimize their effort
  • Tools will be multilingual and will support different and evolving labeling criteria

Agenda
  • Partners
  • Description of relevant work packages (3)
    • Web content collection, Information Extraction, Lexical and semantic resources
    • Goals, tasks, partners
    • Existing tools (to be extended)
    • New tools (to be developed)
    • Existing resources (to be made accessible)
  • Milestones & deliverables
  • References
  • Questions

Partners
  • Agencies
    • WMA: Web Médica Acreditada (Es)
      • assigns a quality label that is shown on medical websites
      • websites apply for the label, receive suggested changes, then obtain it
    • AQuMed: Agency for Quality in Medicine (De)
      • maintains a web directory organized by topics
      • only good-quality websites are present
  • Developers
    • NCSR Demokritos and I-Sieve (spin-off) (Gr)
    • UEP: University of Economics Prague (Cz)
    • UNED: National University of Distance Education (Es)
    • HUT: Helsinki University of Technology (Fi)

Web Content Collection (WP5)

Website monitoring
  • Regular visits to labeled websites
  • Checking pages
    • for relevant changes
    • which changes are relevant?
      • manual rules, machine learning...
    • alert agency when significant changes occur
    • or, increase the website’s (web page’s) priority in a list of to-be-checked resources
    • show what has changed, suggest solution
  • Needed by WMA, AQuMed
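
The change check above could be sketched as follows. This is only an illustration under stated assumptions, not the project's monitoring tool: the keyword list `RELEVANT_KEYWORDS` and all function names are hypothetical, and the "manual rule" is simply that a change counts as relevant when a changed line mentions one of the keywords.

```python
import difflib
import hashlib

# Hypothetical manual rule: a change is relevant if it touches one of these topics.
RELEVANT_KEYWORDS = ("contact", "sponsor", "privacy")


def fingerprint(page_text):
    """Hash whitespace-normalized text so pure layout changes are ignored."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_lines(old_text, new_text):
    """Return the lines removed ('- ') or added ('+ ') between two versions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def is_relevant_change(old_text, new_text):
    """Cheap hash comparison first, then the keyword rule on the changed lines."""
    if fingerprint(old_text) == fingerprint(new_text):
        return False
    return any(
        keyword in line.lower()
        for line in changed_lines(old_text, new_text)
        for keyword in RELEVANT_KEYWORDS
    )
```

The output of `changed_lines` could also serve the "show what has changed" step, and a learned classifier could later replace the keyword rule.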

Web focused crawling
  • Find new medical websites
  • Use multiple existing search engines
    • specify lists of keywords / keyphrases
    • give sample “similar” documents
    • use Google/Yahoo API and filter their results
  • NCSR already has a focused crawler
    • we should contribute to its development
  • Needed by WMA
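
A minimal sketch of the "filter their results" step, assuming search results have already been fetched and are available as dictionaries with `url`, `title` and `snippet` keys. That record layout, the function names and the `min_hits` threshold are assumptions made for this example; no real search engine API is called.

```python
def keyphrase_hits(text, keyphrases):
    """Count how many of the given keyphrases occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for phrase in keyphrases if phrase.lower() in lowered)


def filter_results(results, keyphrases, min_hits=2):
    """Keep results matching at least min_hits keyphrases, best matches first."""
    scored = [
        (keyphrase_hits(r["title"] + " " + r["snippet"], keyphrases), r["url"])
        for r in results
    ]
    kept = [(hits, url) for hits, url in scored if hits >= min_hits]
    kept.sort(key=lambda pair: -pair[0])  # stable sort keeps the engine's order on ties
    return [url for _, url in kept]
```

Given sample "similar" documents instead of keyword lists, the same filter could score results by text similarity rather than by keyphrase counts.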

Website spidering
  • Walk pages of a single website
  • Classify each page
    • in order to choose relevant docs for quality labeling
    • e.g. contact page, page containing treatment description, page with sponsors
    • use machine learning, e.g. based on a bag-of-words (unigram, bigram) document representation
  • Spidering strategy
    • which documents belong together (e.g. page 1/7)
    • which links to follow next
  • NCSR has a spider
    • uses classifiers from Weka for doc classification
    • we should contribute
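
To make the bag-of-words idea concrete, here is a small sketch of page classification over unigram and bigram counts. It uses a hand-written multinomial Naive Bayes with add-one smoothing as a stand-in; it is not the NCSR spider and does not use Weka, and the class and function names are invented for this example.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Bag of words: counts of unigrams and bigrams in lowercased text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()                 # documents seen per class
        self.feature_counts = defaultdict(Counter)  # feature counts per class
        self.vocabulary = set()

    def train(self, text, label):
        counts = features(text)
        self.class_docs[label] += 1
        self.feature_counts[label].update(counts)
        self.vocabulary.update(counts)

    def classify(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for feat, n in features(text).items():
                if feat in self.vocabulary:  # ignore features never seen in training
                    score += n * math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A usage example: after calling `train` with a few labeled pages per class (contact page, sponsor page, treatment description), `classify(page_text)` returns the most likely page type, which the spider can use to pick documents for quality labeling.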

Information Extraction (WP6)

IE introduction
  • Documents to extract from
    • pages retrieved & classified by spider
      • from known websites
      • from crawler
    • monitored labeled pages that have changed
  • Information to be extracted
    • derived from agencies’ labeling criteria
    • e.g. contact information of responsible persons, sponsor names, privacy warning texts...
  • Questions
    • how much human intervention is needed?
    • how complex are the label sets to be supported?
    • what is the methodology for porting to a new language?

Example extracted information I.
  • Transparency and honesty
    • site provider (company name, contact)
    • site purpose, type of target audience
    • funding (grants, sponsors)
  • Authority
    • source citation for information provided, its type and date
    • names and credentials of all information providers
  • Privacy and data protection
    • privacy policy description
  • Timeliness of information
    • dates of publication/modification
  • Accountability
    • names (and roles) of people responsible for presented information
    • editorial policy description

Example extracted information II.
  • Content
    • medical terms, e.g. disease and drug names
    • statements recommending a certain product/method
    • advertisements
    • disallowed combinations (e.g. advertisement for X adjacent to an article related to X)
  • Formal
    • mandatory statements (e.g. importance of physical examination, privacy warnings when posting data into chats)
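
The "disallowed combinations" check could look roughly like the sketch below. It assumes an earlier extraction step has already split a page into ordered blocks, each tagged as an article or an advertisement and annotated with the medical terms found in it. The `Block` type and the function name are hypothetical.

```python
from typing import NamedTuple


class Block(NamedTuple):
    kind: str          # "article" or "advertisement"
    terms: frozenset   # medical terms extracted from the block


def disallowed_adjacent(blocks):
    """Find neighbouring article/advertisement pairs that share a medical term."""
    conflicts = []
    for i, (first, second) in enumerate(zip(blocks, blocks[1:])):
        shared = first.terms & second.terms
        if {first.kind, second.kind} == {"article", "advertisement"} and shared:
            conflicts.append((i, i + 1, sorted(shared)))
    return conflicts
```

Each reported conflict gives the positions of the two blocks and the shared terms, which is the information a human labeler would need to confirm the violation.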

Sources of extraction knowledge
  • Training data
    • scarcity will be a problem for most extracted attributes
    • different types: labeled documents, sample extracted data, data previously extracted from the same website, domain dictionaries
  • Extraction patterns
    • induced (semi)automatically from scarce training data
    • or even authored manually
  • Background domain knowledge
    • relations between extracted attributes, cardinalities ...
    • e.g. typically just one company is the web site’s provider, but there are often multiple sponsors
  • Web site structure
    • exploit common formatting of a group of documents within a website
    • exploit common formatting used for a particular type of extracted data across different websites
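
As an illustration of how manually authored patterns and background cardinality knowledge might work together, consider the sketch below. The regular expressions, the attribute names and the `MAX_CARDINALITY` table are invented for this example; they are not the patterns or the extraction ontology format used by Ex.

```python
import re

# Hypothetical hand-written patterns: a capitalized name after a trigger phrase.
PATTERNS = {
    "provider": re.compile(r"(?:provided|operated) by ([A-Z][\w&\- ]+?)(?:[.,;]|$)"),
    "sponsor": re.compile(r"sponsored by ([A-Z][\w&\- ]+?)(?:[.,;]|$)"),
}

# Background knowledge: typically one provider, any number of sponsors.
MAX_CARDINALITY = {"provider": 1, "sponsor": None}


def extract(text):
    """Apply every pattern and collect the captured names per attribute."""
    return {attr: pattern.findall(text) for attr, pattern in PATTERNS.items()}


def cardinality_violations(extracted):
    """Attributes with more extracted values than the domain knowledge allows."""
    return [
        attr
        for attr, values in extracted.items()
        if MAX_CARDINALITY[attr] is not None and len(values) > MAX_CARDINALITY[attr]
    ]
```

A violation does not have to mean the website is wrong. It can equally signal a false extraction, so such cases are good candidates for human review.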

IE tools
  • Ex (UEP)
    • IE system under development using “extraction ontologies”
    • extracts instances from semi-structured documents
    • utilizes training data + manually defined patterns, includes spider
    • old version based on HMMs – http://eso.vse.cz/~labsky/client/
  • Named entity recognizer (UNED)
    • extracts dates, person/institution names
  • 3rd party IE tools
    • wrapper management systems
    • e.g. LP2-based IE tool or annotation editor from Sheffield

Website assessment
  • Check website’s technical correctness
    • SEO (findability in search engines with respect to some keyphrases)
    • accessibility (possibility of font enlargement, blind access, pages hidden deep in website structure, color schemes perceivable by anybody)
    • formal correctness (dead links, violations of HTML standards, failure to display well under at least the 3 most popular browsers)
  • Check non-technical correctness
    • e.g. typos, “clear, easy-to-understand language”
    • more: check for black-listed phrases, claims, etc.
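
Two of the checks above can be sketched with the Python standard library: collecting a page's links (the input to a dead-link check) and scanning its visible text for black-listed phrases. The class name, the function name and the example blacklist are assumptions; actually requesting each link to see whether it is dead is left out because it needs network access.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects link targets and text content from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def assess(html, blacklist):
    """Return the page's links and any black-listed phrases found in its text."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(" ".join(scanner.text_parts).split()).lower()
    hits = [phrase for phrase in blacklist if phrase.lower() in text]
    return {"links": scanner.links, "blacklisted": hits}
```

For example, `assess(page_html, ["miracle cure", "guaranteed results"])` returns the links to verify and the offending phrases in one pass over the page.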

Website assessment tools
  • Relaxed (UEP)
    • HTML validator based on Relax NG and Schematron patterns
    • can perform formal checks of website content beyond DTDs
    • http://relaxed.sourceforge.net/
  • SEO tool (UEP)
    • could Honza’s SEO tool be extended?

IE Deliverables
  • Duration: M1-M28
  • Deliverables
    • D8: Methodology & architecture of IE (M9)
    • D9.1: First version of IE toolkit (M15)
    • D9.2: Final version of IE toolkit (M24)

Lexical and semantic resources (WP7)

Lexical and semantic resources
  • Languages: Es, De, En, Cz, Gr, Fi, Catalan (7!)
  • We are in charge of Cz, De(!)
  • Semantic
    • thesauri, ontologies (MeSH)
    • lists of cures, vaccine names, lists of medical companies, illnesses, diagnoses
    • generic ontologies and translation dictionaries (e.g. EuroWordNet)
  • Lexical
    • lemmatizers/morphological analyzers, part-of-speech taggers, chunkers, syntactic parsers
    • medical document collections (for classification)

References
  • MedIEQ:
    • http://www.iit.demokritos.gr/~vangelis/MedIEQ/
    • http://zeus.iit.demokritos.gr/medieq
  • Related projects:
    • WRAPIN http://debussy.hon.ch/cgi-bin/Wrapin/ClientWrapin.pl
    • Quatro http://www.quatro-project.org/DC2005.htm
    • CROSSMARC http://www.iit.demokritos.gr/skel/crossmarc/
  • Relaxed:
    • http://badame.vse.cz/validator/
  • Ex:
    • http://eso.vse.cz/~labsky/doc/ex.pdf
  • Ellogon:
    • http://www.ellogon.org/

Questions
  • ?
