
Focused Crawling for both Topical Relevance and Quality of Medical Information




  1. Focused Crawling for both Topical Relevance and Quality of Medical Information By Tim Tang, David Hawking, Nick Craswell, Kathy Griffiths CIKM ’05 November, 2005

  2. Outline • Problems and Motivation • The experiment • Focused crawling • Relevance and quality prediction • The three crawlers • Measures for relevance and quality • Results and findings • Future work

  3. Why Health Information on the Web? • The Internet is a free medium • Health information on the Web is of widely varying quality • Incorrect health advice may be dangerous • High user demand for Web health information

  4. Problems • Relevance (in IR): • Topical relevance based on text • Navigational and distillation relevance based on links • None of these techniques guarantees quality • Our previous study (Tang et al., JIR '05) showed that Google returns many low-quality health results -> PageRank does not guarantee quality

  5. Problems: Quality of Health Info • Quality of health information is often measured against evidence-based medicine: interventions supported as effective by a systematic review of the evidence. • Low-quality health information originates from untrusted sources: personal home pages, commercial sites, chat sites, web forums, and even some published materials, …

  6. Wrong Advice from an Article

  7. Dangerous Information from Personal Web Pages

  8. Commercial Promotion

  9. Why Domain-specific Search? • Impose domain restriction • Results from previous work (Tang et al., JIR '05): • Quality: domain-specific engines performed much better than Google • Relevance: GoogleD was best • Coverage analysis: BPS and 4sites have poor coverage

  10. The Problems of Domain-specific Engines • The current method of building domain-specific engines is very expensive: manual and rule-based. • Example: BluePages Search, a depression portal at the ANU (http://bluepages.anu.edu.au) • Domain experts spent two weeks manually judging health sites to decide what to include in the index. • Low coverage: only 207 Web sites in the index • Tedious maintenance process: Web pages change, cease to exist, new pages appear, etc. -> A quality focused crawler may be a cheaper approach, maintaining high quality while improving coverage

  11. The FC Process • Designed to selectively fetch content relevant to a specified topic of interest, using the Web's hyperlink structure. • Pipeline: {URLs, link info} are dequeued from the URL frontier; each page is downloaded and passed to the link extractor; the classifier scores the extracted links and enqueues {URLs, scores} back into the frontier. • Link info = anchor text, URL, source page's content, and so on. (A minimal sketch of this loop follows.)
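
A minimal Python sketch of the crawl loop above, under stated assumptions: fetch, extract_links, and score_link are hypothetical stand-ins for the downloader, link extractor, and classifier, and each extracted link is assumed to carry a url attribute along with its link info.

```python
import heapq

def focused_crawl(seed_urls, fetch, extract_links, score_link, max_pages=1000):
    # Frontier: a priority queue of (score, URL) pairs. heapq is a
    # min-heap, so scores are negated to pop the best-scored URL first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    crawled = []
    while frontier and len(crawled) < max_pages:
        _, url = heapq.heappop(frontier)       # dequeue from the frontier
        page = fetch(url)                      # download the page
        if page is None:
            continue
        crawled.append(url)
        for link in extract_links(page):       # link extractor
            # Link info = anchor text, URL, source page's content, etc.
            if link.url not in seen:
                seen.add(link.url)
                score = score_link(link)       # classifier assigns a score
                heapq.heappush(frontier, (-score, link.url))  # enqueue {URL, score}
    return crawled
```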

  12. Relevance Prediction • Anchor text: text appearing in a hyperlink • Text around the link: 50 bytes before and after the link • URL words: words formed by parsing the URL address

  13. Relevance Indicators • URL: http://www.depression.com/psychotherapy.html => URL words: depression, com, psychotherapy • Anchor text: psychotherapy • Text around the link: • 50 bytes before: section, learn • 50 bytes after: talk, therapy, standard, treatment
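
A Python sketch of extracting the three indicators for a single link, following the example above. The tokenisation details and the URL_STOPWORDS filter (dropping scheme and extension tokens such as http and html, so the URL above yields depression, com, psychotherapy) are assumptions.

```python
import re

URL_STOPWORDS = {"http", "https", "www", "html", "htm"}  # assumed filter

def link_features(page_text, link_start, link_end, anchor_text, url):
    # URL words: tokens obtained by splitting the URL on non-letters,
    # minus scheme/extension noise.
    url_words = [w.lower() for w in re.split(r"[^A-Za-z]+", url)
                 if w and w.lower() not in URL_STOPWORDS]
    # Text around the link: 50 bytes before and after, as described above.
    before = page_text[max(0, link_start - 50):link_start]
    after = page_text[link_end:link_end + 50]
    return {
        "anchor": anchor_text.lower().split(),
        "url_words": url_words,
        "before": re.findall(r"[a-z]+", before.lower()),
        "after": re.findall(r"[a-z]+", after.lower()),
    }
```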

  14. Methods • Machine learning approach: train and test on relevant and irrelevant Web pages using the indicators above. • Evaluated different learning algorithms: k-nearest neighbour, Naïve Bayes, C4.5, Perceptron. • Result: the C4.5 decision tree was best at predicting relevance. • A Laplace correction formula (Margineantu et al., LNS '02) was used to produce a confidence score (confidence_level) at each leaf node of the tree (sketched below). • The same method was applied to predict quality but was not successful -> link anchor context cannot predict quality
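
One common form of the Laplace correction, assuming a binary relevant/irrelevant split; the exact formula used by Margineantu et al. may differ.

```python
def laplace_confidence(n_relevant, n_total, n_classes=2):
    # Laplace-corrected estimate of P(relevant) at a decision-tree leaf.
    # Smoothing toward 1/n_classes keeps small leaves from producing
    # overconfident scores of exactly 0 or 1.
    return (n_relevant + 1) / (n_total + n_classes)
```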

  15. Quality Prediction • Using evidence-based medicine, and • Using the Relevance Feedback (RF) technique

  16. Evidence-based Medicine • Evidence-based treatments were divided into single-word and 2-word terms. • Example: cognitive behavioral therapy -> cognitive, behavioral, therapy, cognitive behavioral, behavioral therapy
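
A small Python sketch of this term splitting; treatment_terms is a hypothetical helper name.

```python
def treatment_terms(treatment):
    # Split an evidence-based treatment name into single words and
    # 2-word phrases, as in the example above.
    words = treatment.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

# treatment_terms("cognitive behavioral therapy") returns:
# ['cognitive', 'behavioral', 'therapy', 'cognitive behavioral', 'behavioral therapy']
```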

  17. Relevance Feedback • Well-known IR approach of query by example. • Basic idea: run an initial query, get feedback from users about which documents are relevant, then add words from the relevant documents to the query. • Goal: add terms to the query in order to retrieve more relevant results. • Usually, 20 terms in total are added to the query

  18. Our RF Approach • Used for quality, not relevance • Uses not only single terms but also phrases • Generate a list of single terms and 2-word phrases with their associated weights • Select the top-weighted terms and phrases • Cut off at the lowest-ranked term that appears in the evidence-based treatment list (sketched below) • 20 phrases and 29 single words formed the 'quality query'
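
A sketch of the cut-off rule in Python, assuming ranked_terms is already sorted by descending RF weight; the helper name and the empty-list fallback are assumptions.

```python
def quality_query(ranked_terms, ebm_terms):
    # ranked_terms: (term, weight) pairs sorted by descending weight.
    # ebm_terms: set of single terms and 2-word phrases derived from
    # the evidence-based treatment list.
    # Keep everything down to the lowest-ranked term that also appears
    # in the evidence-based list.
    matches = [i for i, (term, _) in enumerate(ranked_terms) if term in ebm_terms]
    return ranked_terms[:matches[-1] + 1] if matches else []
```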

  19. Predicting Quality • For downloaded pages, a quality score (QScore) is computed using a modification of the BM25 formula that takes term weights into account. • The quality of a new page is then predicted from the quality of all downloaded pages linking to it. (Assumption: quality locality holds, i.e. pages with similar content are inter-connected (Davison, SIGIR '00).) • Predicted quality score of a target page with n downloaded source pages P1, …, Pn: PScore = (QScore(P1) + … + QScore(Pn)) / n (a sketch follows)
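
A Python sketch of both scores. The paper's exact BM25 modification is not given here, so the weighted form below (RF weight in place of IDF, with standard k1 and b defaults) is an assumption.

```python
def qscore(page_tf, doc_len, avg_doc_len, quality_query_terms, k1=1.2, b=0.75):
    # BM25-style quality score over the 'quality query': each term's
    # saturated term-frequency contribution is scaled by its RF weight.
    score = 0.0
    for term, weight in quality_query_terms:
        tf = page_tf.get(term, 0)
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += weight * (tf * (k1 + 1)) / (tf + norm)
    return score

def pscore(source_qscores):
    # Predicted quality of a not-yet-downloaded page: the mean QScore
    # of the n downloaded pages that link to it (quality locality).
    return sum(source_qscores) / len(source_qscores)
```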

  20. Combining Relevance and Quality • We need to balance relevance and quality • Combining quality and relevance scores is new • Our method uses the product of the two scores: URLScore = confidence_level * PScore • Other ways to combine these scores will be explored in future work • A quality focused crawler relies on this combined score to order its crawl queue (see the sketch below)
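
The combination itself is a one-liner; in the crawler sketch earlier, this product would serve as score_link's return value for ordering the frontier.

```python
def url_score(confidence_level, p_score):
    # Frontier priority: product of the C4.5 relevance confidence and
    # the predicted quality score.
    return confidence_level * p_score
```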

  21. The Three Crawlers • The Breadth-first crawler: traverses the link graph in FIFO fashion (serves as the baseline for comparison) • The Relevance crawler: orders the crawl queue for topical relevance using the C4.5 decision tree • The Quality crawler: orders the crawl queue for both relevance and quality using the combination of the C4.5 decision tree and RF techniques

  22. Measures • Relevance: the relevance performance of the three crawlers was evaluated using a relevance classifier. • Quality: pages were judged by domain experts using the evidence-based guidelines from the Centre for Evidence Based Mental Health (CEBMH). • Overall quality: taking all pages into account • High- and low-quality categories: the top 25% and bottom 25% of results in each crawl were compared.

  23. Results

  24. Relevance

  25. Quality

  26. High Quality Pages AAQ = Above Average Quality: top 25%

  27. Low Quality Pages BAQ = Below Average Quality: bottom 25%

  28. Findings • Topical relevance can be predicted using link anchor context. • The relevance feedback technique proved useful for quality prediction. • Domain-specific search portals can be successfully built using focused crawling techniques.

  29. Future Work • We experimented with only one health topic. Our plan is to repeat the same experiments with another topic and to generalise the technique to other domains. • Other ways of combining relevance and quality should be explored. • Experiments comparing our quality crawl with other health portals are necessary.
