UIC at TREC 2006: Genomics Track

Wei Zhou, Clement T. Yu • University of Illinois at Chicago • Nov. 16, 2006

3 stages • Stage 1: Conversion - Greek letters → English words • Stage 2: Paragraph retrieval - retrieve the 2,000 most relevant paragraphs • Stage 3: Passage extraction and ranking - extract and retrieve the 1,000 most relevant passages

Stage 1: conversion • Convert the Greek letters into English words, for example, TGF β1 → TGF beta1. (In the HTML documents, β may be represented by “&#223” or “beta.gif”.)
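The conversion step can be sketched as a small text-normalization pass. This is a minimal illustration, not the authors' code: the mapping table is a partial, illustrative subset, and the two regular expressions for the HTML stand-ins of β (a numeric entity and an image tag whose source is beta.gif) are assumptions about how those forms appear in the documents.

```python
import re

# Illustrative subset only; a full table would cover the whole Greek alphabet.
GREEK_TO_ENGLISH = {
    "α": "alpha",
    "β": "beta",
    "γ": "gamma",
    "δ": "delta",
    "κ": "kappa",
}

# Assumed HTML stand-ins for beta, based on the two forms named on the slide.
HTML_BETA_PATTERNS = [
    re.compile(r"&#223;?"),                                    # numeric entity, ';' optional
    re.compile(r"<img[^>]*beta\.gif[^>]*>", re.IGNORECASE),    # inline image of the letter
]


def convert_greek(text: str) -> str:
    """Replace Greek letters and their assumed HTML stand-ins with English words."""
    for pattern in HTML_BETA_PATTERNS:
        text = pattern.sub("beta", text)
    for letter, word in GREEK_TO_ENGLISH.items():
        text = text.replace(letter, word)
    return text


print(convert_greek("TGF β1"))
```

Running the HTML patterns before the letter table keeps the two passes independent, so a document that mixes both encodings is normalized to the same English word.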

Stage 2: paragraph retrieval • The goal of this stage is to retrieve the 2,000 most relevant paragraphs. • Several techniques are utilized: • 1. conditional Porter stemming • 2. handling of lexical variants of gene symbols • 3. concept retrieval IR model • 4. query expansion • 5. abbreviation correction.

Stage 2: paragraph retrieval - conditional Porter stemming • Potential errors of the Porter stemmer • Type 1: gene symbol → non-gene word, e.g., “Pes” → “Pe”, “IDE” → “ID” • Type 2: non-gene word → gene symbol, e.g., “IDEE” → “IDE” • Solution: a table (Entrez Gene database) containing all the gene symbols is maintained.
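One way the gene symbol table could guard the stemmer is sketched below. The slide only states that the table is maintained, so the two guard rules are an assumption about how it is used. `toy_stem` is a hypothetical stand-in that strips a single trailing "s" or "e"; it is not the Porter algorithm, and a real stemmer can be passed in through the `stem` parameter.

```python
from typing import Callable, Set


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for the Porter stemmer: strips one trailing 's' or 'e'."""
    if len(word) > 2 and word[-1].lower() in ("s", "e"):
        return word[:-1]
    return word


def conditional_stem(
    word: str,
    gene_symbols: Set[str],
    stem: Callable[[str], str] = toy_stem,
) -> str:
    """Stem a word unless doing so would cause a Type 1 or Type 2 error.

    gene_symbols is assumed to hold lowercased gene symbols.
    """
    # Type 1 guard: a gene symbol is left untouched.
    if word.lower() in gene_symbols:
        return word
    stemmed = stem(word)
    # Type 2 guard: reject a stem that collides with a gene symbol.
    if stemmed.lower() in gene_symbols:
        return word
    return stemmed


genes = {"pes", "ide"}
for term in ["Pes", "IDE", "IDEE", "genes"]:
    print(term, "->", conditional_stem(term, genes))
```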

Stage 2: paragraph retrieval - handling lexical variants of gene symbols • 2 strategies: • Strategy 1: automatically generate lexical variants (Buttcher, 2004; Huang, 2005), e.g., PLA2 → PLA 2, PLAII, and PLA II • Strategy 2: retrieve additional lexical variants from a term database of MEDLINE (Zhou, 2006), e.g., PLA2 → PL-A2 • Note: PLA2 stands for Phospholipase A2.
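Strategy 1 can be illustrated with a small generator that reproduces the PLA2 example. This is a sketch under assumptions: it handles only symbols made of letters followed by a number, and the rules (insert a space, swap Arabic for Roman numerals) are inferred from the example rather than taken from the cited systems.

```python
import re
from typing import List

# Arabic-to-Roman table for small numbers; larger suffixes get no Roman variant.
ROMAN = {
    "1": "I", "2": "II", "3": "III", "4": "IV", "5": "V",
    "6": "VI", "7": "VII", "8": "VIII", "9": "IX", "10": "X",
}


def lexical_variants(symbol: str) -> List[str]:
    """Generate spacing and Roman-numeral variants of a letters+digits gene symbol."""
    match = re.fullmatch(r"([A-Za-z]+)[ -]?(\d+)", symbol)
    if match is None:
        return []  # symbol shape not covered by this sketch
    head, number = match.groups()
    variants = [f"{head}{number}", f"{head} {number}"]
    roman = ROMAN.get(number)
    if roman:
        variants += [f"{head}{roman}", f"{head} {roman}"]
    # Do not report the input itself as a variant.
    return [v for v in variants if v != symbol]


print(lexical_variants("PLA2"))
```

Variants such as PL-A2 do not follow from these rules, which is why a second, database-driven strategy is useful.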

Stage 2: paragraph retrieval - concept retrieval (IR model) • Definition: A concept is a biomedical meaning or sense. 1) A gene and its synonym set refer to the same concept; 2) a MeSH term and its synonym set refer to the same concept.

Stage 2: paragraph retrieval - concept retrieval (IR model) • Assumption: Okapi does not work well if the query contains multiple concepts. • For example, q: “role of gene PRNP in mad cow disease”, where concept 1 is PRNP and concept 2 is mad cow disease. • d1: has many occurrences of concept 2 • d2: has a small number of occurrences of both concepts • Okapi: sim(q, d1) > sim(q, d2), but intuitively d2 is more relevant than d1.

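Stage 2: paragraph retrieval - concept retrieval (IR model) • According to our model (Liu, 2004; UIC Robust track, 2005), we have sim(q, d2) > sim(q, d1), because the concept similarity of d2 is higher than that of d1, although the word similarity of d1 is higher: d2 includes both concept 1 & concept 2.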
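The PRNP example can be worked through with a toy two-part similarity. This is a sketch under assumptions, not the model from the cited papers: similarity is taken to be a pair (concept similarity, word similarity) compared lexicographically, concept similarity is the number of distinct query concepts a document matches, and word similarity is a raw occurrence count standing in for an Okapi score. The synonym lists and documents are made up for illustration.

```python
from typing import Dict, List, Tuple


def similarity(query_concepts: Dict[str, List[str]], doc: str) -> Tuple[int, int]:
    """Return (concept_sim, word_sim) for a document against the query concepts."""
    text = doc.lower()
    concept_sim = 0
    word_sim = 0
    for synonyms in query_concepts.values():
        hits = sum(text.count(s.lower()) for s in synonyms)
        if hits > 0:
            concept_sim += 1   # the concept is present at least once
        word_sim += hits       # crude stand-in for a term-weighting score
    return (concept_sim, word_sim)


query = {
    "concept 1": ["PRNP", "prion protein"],
    "concept 2": ["mad cow disease", "bovine spongiform encephalopathy"],
}
d1 = ("Mad cow disease spread quickly. Mad cow disease affects cattle. "
      "Mad cow disease is fatal.")
d2 = "PRNP mutations alter susceptibility to mad cow disease."

# Tuples compare lexicographically, so concept_sim dominates word_sim.
ranked = sorted([("d1", d1), ("d2", d2)],
                key=lambda item: similarity(query, item[1]),
                reverse=True)
print([name for name, _ in ranked])
```

With word similarity alone d1 would win (3 occurrences against 2); putting the concept count first in the tuple is what moves d2 to the top.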

Stage 2: paragraph retrieval- query expansion • Synonyms • Hyponyms (more specific terms) • Pseudo-feedback • Related terms

Stage 2: paragraph retrieval - query expansion using biomedical knowledge • Related terms (co-occur frequently & related semantically) • q: “How do interactions between HNF4 and COUP-TF1 suppress liver function?” • Query concepts: HNF4 and COUP-TF I; Liver • Related terms: Hepatocytes, Hepatoblastoma, Gluconeogenesis, Hepatitis B virus • A relationship exists in the UMLS semantic network between the semantic type of a related term and the semantic type of each query concept.

Stage 2: paragraph retrieval - avoid incorrect match of abbreviations • Gene symbols are usually abbreviations, which are very ambiguous in the biomedical literature. • Given a query containing both a gene symbol abbreviation and its full form, a document matches the term only if both the abbreviation and the full form are matched. • For example, q: “role of APC (adenomatous polyposis coli) in colon cancer?” d: “…Much work has been undertaken in recent decades with the aim of producing projections of future cancer incidence and mortality rates from observed rates by using age-period-cohort (APC) models…” Here d contains “APC” but not “adenomatous polyposis coli”, so it is not matched.
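The matching rule can be sketched as a simple conjunction of two checks. This is a minimal illustration, assuming a case-sensitive whole-word test for the abbreviation and a case-insensitive substring test for the full form; the function name and the second example sentence are made up for the demonstration.

```python
import re


def matches_gene(doc: str, abbreviation: str, full_form: str) -> bool:
    """Match only when both the abbreviation and its full form occur in the document."""
    has_abbreviation = (
        re.search(rf"\b{re.escape(abbreviation)}\b", doc) is not None
    )
    has_full_form = full_form.lower() in doc.lower()
    return has_abbreviation and has_full_form


cohort_doc = ("projections of future cancer incidence and mortality rates "
              "by using age-period-cohort (APC) models")
gene_doc = ("Mutations in the adenomatous polyposis coli (APC) gene "
            "are common in colon cancer.")

print(matches_gene(cohort_doc, "APC", "adenomatous polyposis coli"))
print(matches_gene(gene_doc, "APC", "adenomatous polyposis coli"))
```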

Stage 3: passage extraction and ranking • The goal of this stage is to take the output of stage 2 (i.e., 2,000 most relevant paragraphs) and identify the 1,000 most relevant passages (i.e., one or more consecutive sentences within paragraphs).

Stage 3: passage extraction and ranking - extraction • The criterion for the optimal passage in a paragraph is given by: “Given various windows of different sizes, choose the one which has the maximum number of query concepts and the smallest size.”
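The window criterion can be sketched as an exhaustive search over runs of consecutive sentences. This is an illustration under assumptions: window size is measured in sentences, a concept counts as present when any of its synonyms occurs as a substring, and ties are broken in favor of the earliest window. The example paragraph is made up.

```python
from typing import Dict, List, Optional, Tuple


def best_window(
    sentences: List[str],
    query_concepts: Dict[str, List[str]],
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the window with the most concepts, then the fewest sentences.

    The window covers sentences[start:end]; None is returned for an empty paragraph.
    """
    best: Optional[Tuple[int, int]] = None
    best_key: Optional[Tuple[int, int]] = None
    for start in range(len(sentences)):
        for end in range(start + 1, len(sentences) + 1):
            text = " ".join(sentences[start:end]).lower()
            covered = sum(
                any(s.lower() in text for s in synonyms)
                for synonyms in query_concepts.values()
            )
            # More concepts first; among equals, the smaller window wins.
            key = (covered, -(end - start))
            if best_key is None or key > best_key:
                best_key, best = key, (start, end)
    return best


paragraph = [
    "Cattle were examined in 1996.",
    "PRNP variants were genotyped.",
    "Several variants were linked to mad cow disease.",
    "Funding came from public sources.",
]
concepts = {"gene": ["PRNP"], "disease": ["mad cow disease"]}
start, end = best_window(paragraph, concepts)
print(paragraph[start:end])
```

The search is quadratic in the number of sentences, which is acceptable for single paragraphs.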

Stage 3: passage extraction and ranking - ranking • The ranking of passages is similar to the ranking of paragraphs. For each passage, we compute its concept similarity and word similarity with the query. The concept retrieval model is then applied for the ranking.

Experiment results • 3 runs: • UICgen1: the top 1,000 most relevant paragraphs were returned as the passages. • UICgen2: the top 1,000 optimal passages according to the criterion were returned (this run contained some bugs). • UICgen3: same as UICgen2, except that the bugs were fixed.


References • Buttcher S, Clarke CLA, Cormack GV. Domain-specific synonym expansion and validation for biomedical information retrieval (MultiText experiments for TREC 2004). The Thirteenth Text REtrieval Conference (TREC 2004) Proceedings, 2004, Gaithersburg, MD. • Huang X, Zhong M, Si L. York University at TREC 2005: Genomics Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD. • Zhou W, Torvik VI, Smalheiser NR. ADAM: Another Database of Abbreviations in MEDLINE. Bioinformatics 2006; 22(22): 2813-2818. • Liu S, Liu F, Yu C, Meng WY. An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. Proceedings of the 27th Annual International ACM SIGIR Conference, pp. 266-272, Sheffield, UK, July 2004. • Liu S, Yu C. UIC at TREC 2005: Robust Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.

Questions? Thanks!
