
Beespace Component: Filtering and Normalization for Biology Literature

Beespace Component: Filtering and Normalization for Biology Literature. Qiaozhu Mei 03.16.2005. Concept Processing Component for Beespace: A Big Picture. A list of representative terms or phrases. Filtering Module. Relevant documents. Query terms. Retrieval.


Presentation Transcript


  1. Beespace Component: Filtering and Normalization for Biology Literature • Qiaozhu Mei • 03.16.2005

  2. Concept Processing Component for Beespace: A Big Picture • Query terms • Retrieval • Relevant documents • Filtering Module • A list of representative terms or phrases • Pre-processed text collection • Entities & phrases of interest • Normalization and Clustering Module • Similarity groups of terms and phrases (concepts)

  3. Concept Processing Component for Beespace: Input and Output • Input: texts (indices) with entities and phrases tagged. • Filtering: a group of relevant documents for a query • Normalization: a list of terms, entities or phrases of interest to be normalized • Output: • Filtering: list of highly representative terms & phrases • Normalization: • hierarchical structure of concepts (compacted, loose) • Concept dictionary • texts tagged with concepts

  4. Filtering

  5. Term Filtering: Heuristics • We want to find a list of representative terms & phrases short enough to enable interactive selection and navigation. • We want terms with higher frequency in the given documents (high Term Frequency), however… • Terms that are too frequent in the whole collection are considered harmful: the, is, cell, bee, … (so we prefer low Document Frequency)

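  6. Term Filtering: TF*IDF • Adding IDF to frequency count: • Weight = tf * log((N - 1)/df), where N is the number of documents in the collection and df is the number of documents containing the term • TF-IDF formula in Okapi method: • Weight = IDF part × TF part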
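
A minimal sketch of the first weighting on this slide, tf * log((N - 1)/df), may help make the heuristic concrete. It assumes that tf is counted over the retrieved (relevant) documents and df over the whole collection, and that documents are already tokenized into lists of terms. The function names `tfidf_weights` and `top_terms` are illustrative only and are not part of Beespace. The Okapi variant is not covered here.

```python
import math
from collections import Counter

def tfidf_weights(relevant_docs, collection):
    """Score each term in the relevant documents by tf * log((N - 1) / df).

    tf: occurrences of the term in the relevant documents (assumption)
    df: number of documents in the whole collection containing the term
    N:  number of documents in the whole collection
    """
    n_docs = len(collection)
    if n_docs < 2:
        raise ValueError("need at least two documents so that log((N - 1)/df) is defined")
    df = Counter()
    for doc in collection:
        df.update(set(doc))          # count each term once per document
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc)               # count every occurrence
    weights = {}
    for term, freq in tf.items():
        if df[term] == 0:
            continue                 # term never seen in the collection: skip it
        weights[term] = freq * math.log((n_docs - 1) / df[term])
    return weights

def top_terms(weights, k):
    """Return the k highest-weighted terms."""
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

With this formula a term that occurs in N - 1 documents gets weight 0 and a term that occurs in every document gets a negative weight, which is how collection-wide words such as "the" fall to the bottom of the list.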

  7. Term Filtering (cont.) • Results 1: • Collection: honeybee.biosis 1980 • Query: “pollen-foraging” • Select top 2 documents • Results 2: • Collection: GENIA (on “human & blood cell & transcription factor”), with noun phrases of entities tagged • Query: “il-2”

  8. Normalization

  9. From Term to Concept: Normalization and Theme Clustering • Normalization: Tight concepts • Group terms/entities/phrases with similarity so that one can represent others • Forage: forager, forage-bee, foraging, foragers, pollen-foraging… • Theme clustering: Looser concepts • Group terms/entities/phrases representing the same subtopic (semantically related) • forage, pollen, food, detect, feeding, dance, … • In a hierarchical manner.

  10. Normalization • Morphological approach? (stemming) • Normalize English words of morphological variations, e.g. • forag: forage/foraging/forager/foragers • Concerns: • Too aggressive? one -> on; day -> dai; apis -> api; useful -> us • Handling biological entities? (some stemmers do nothing when they detect a hyphen, “-”) • Not sufficient to normalize phrases
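
To illustrate both the benefit and the concerns listed on this slide, here is a toy suffix stripper. It is not the Porter or Krovetz algorithm: the suffix list, the minimum stem length and the name `toy_stem` are assumptions made only for this sketch.

```python
# Hypothetical rule list for illustration; real stemmers use many more rules.
SUFFIXES = ("ers", "ing", "er", "e", "s")   # longer suffixes are tried first
MIN_STEM = 3                                # never leave a stem shorter than this

def toy_stem(word, skip_hyphenated=True):
    """Strip the first matching suffix, if the remaining stem is long enough."""
    if skip_hyphenated and "-" in word:
        return word                         # mimic stemmers that ignore hyphenated tokens
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem
    return word

variants = ["forage", "foraging", "forager", "foragers"]
stems = {toy_stem(w) for w in variants}     # all four variants share the stem "forag"
```

Even this tiny rule set reproduces one of the concerns above: `toy_stem("apis")` gives "api", while "pollen-foraging" is left untouched unless `skip_hyphenated=False` is passed.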

  11. Normalization: Stemmers • Porter Stemmer: • does not stem words beginning with an uppercase letter • Krovetz' Stemmer: • Less aggressive than Porter • Sample results: • Honeybee: • Genia:

  12. Normalization (cont.) • Semantic and Contextual Approach: • Group the terms which are considered “Replaceable” with each other in a context. E.g. • …the pollen-foraging activity of a mellifera… • …the nectar-foraging activity of a cerana… • Generally handled with clustering approaches based on statistical information in a large corpus • Usually in the form of hierarchical clusters
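
The idea of terms being "replaceable" in a context can be sketched by collecting, for every slot defined by its left and right neighbours, the set of terms that fill it. This is only an illustration of the notion on the two example fragments from this slide, not the clustering method used in Beespace; the function name, the window size and the boundary markers `<s>` and `</s>` are assumptions.

```python
from collections import defaultdict

def replaceable_groups(sentences, window=1):
    """Group terms that occur between identical left and right neighbours."""
    slots = defaultdict(set)
    for tokens in sentences:
        # Pad so that the first and last tokens also have a context.
        padded = ["<s>"] * window + list(tokens) + ["</s>"] * window
        for i in range(window, len(padded) - window):
            left = tuple(padded[i - window:i])
            right = tuple(padded[i + 1:i + 1 + window])
            slots[(left, right)].add(padded[i])
    # Keep only the contexts that were filled by more than one distinct term.
    return {ctx: terms for ctx, terms in slots.items() if len(terms) > 1}

fragments = [
    "the pollen-foraging activity of a mellifera".split(),
    "the nectar-foraging activity of a cerana".split(),
]
groups = replaceable_groups(fragments)
```

On these two fragments the sketch pairs pollen-foraging with nectar-foraging and mellifera with cerana. On a real corpus exact context matches are sparse, which is one reason statistical clustering over a large corpus is used instead.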

  13. Normalization: A clustering approach • An N-gram clustering method: • Ideally, if we consider each term in its N-gram context, the replaceable relation would be global and reliable. • Concerns: efficiency • Computing complexity is high! • For 2-gram, NV² even after optimization! (initially V⁵) • Space complexity is high!! • V³ • Compromise: use 2-grams (equivalent to computing the average mutual information of 2-grams and grouping the two terms whose merge brings the smallest loss to this avg. MI)
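
The 2-gram compromise can be sketched as a greedy loop: compute the average mutual information (MI) between adjacent cluster labels, then repeatedly merge the pair of clusters whose merge keeps that value highest, i.e. loses the least. The code below is a brute-force illustration that recomputes the MI for every candidate pair; it does not include the optimizations the slide alludes to. All function names are hypothetical, and restricting merges to a given list of terms is an assumption based on the "terms to be clustered" setup.

```python
import math
from collections import Counter
from itertools import combinations

def count_bigrams(sentences):
    """Count adjacent token pairs over tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def average_mi(bigrams, cluster_of):
    """Average mutual information between the clusters of adjacent tokens."""
    pair, left, right, total = Counter(), Counter(), Counter(), 0
    for (w1, w2), n in bigrams.items():
        c1, c2 = cluster_of[w1], cluster_of[w2]
        pair[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
        total += n
    # p(c1,c2) * log(p(c1,c2) / (p(c1) * p(c2))), written with raw counts
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items())

def greedy_merge(bigrams, terms, n_clusters):
    """Merge the given terms until n_clusters remain, losing as little MI as possible."""
    vocab = {w for bigram in bigrams for w in bigram} | set(terms)
    cluster_of = {w: w for w in vocab}       # every word starts in its own cluster
    labels = sorted(set(terms))              # only these clusters may be merged
    while len(labels) > max(n_clusters, 1):
        best = None
        for a, b in combinations(labels, 2):
            trial = {w: (a if c == b else c) for w, c in cluster_of.items()}
            score = average_mi(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, trial, b)
        _, cluster_of, absorbed = best
        labels.remove(absorbed)
    return {t: cluster_of[t] for t in terms}
```

Recording the order of the merges, instead of stopping at `n_clusters`, would give the hierarchical structure mentioned on the previous slide.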

  14. Normalization: A clustering approach (cont.) • Toy Example on honeybee: • Vocabulary size: 9100 words; • Collection size: 5505 abstracts; (honeybee.biosis1980) • Terms to be Clustered: 18 • Genia collection, 2000 abstracts • 200 noun phrases (entities) to be clustered

  15. nursing, nurseries, nursery, nectar-foraging, pollen-foraging, foraging-related, preforaging, non-foraging, forager, forage, foraging, foragers, queen, worker, queens, workers, bee, honeybee

  16. Sample clusters on Genia: human_and_mouse_gene mouse_il-2r_alpha_gene i_kappa_b_alpha nf_kappa_b transcription_factors transcription_factor saos_2_cells saos-2 human_osteosarcoma_ b_cells jurkat_t_cells hela_cells thp-1 hl60_cells k562_cells thp-1_cells epstein-barr_virus_ interleukin-2 interleukin-2_ epstein-barr_virus phorbol_myristate_acetate phorbol_12-myristate_13-acetate 2_gene_expression 2_gene u937_cells monocytic_cells jurkat_cells human_t_cells ipr_cd4-8-_t_cells j_delta_k_cells lymphoid_cells activated_t_cells hematopoietic_cells

  17. Normalization: Clustering Methods • Other Possible Clustering Approaches • Cluster terms based on features such as: • Co-occurring terms • Tends to ignore position information • Correlation of Nouns and Verbs • Dependency-based Word Similarity • Proximity-based Word Similarity • Depends on highly accurate parsing results, which may not be easy to get for biology literature.

  18. Theme Clustering • Looser Clusters • Usually in the form of partitioning clusters • K-Means, Latent Semantic Indexing, Probabilistic LSI • Compute loose clusters of terms, or clusters represented by term distributions • Example: number of clusters = 10 • Sometimes helpful to find normalizations (e.g., when the number of clusters is large; when no stemming was done) • Comparative Text Mining for concept switching
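
As a rough illustration of partitioning terms into loose themes, the sketch below runs a plain K-Means over term vectors. It assumes each term is represented by its counts in a few documents; the toy vectors, the naive initialization from the first k terms and the name `kmeans_terms` are all hypothetical. LSI and PLSI are not shown.

```python
def kmeans_terms(vectors, k, iterations=20):
    """Assign each term to one of k clusters by plain K-Means on its vector."""
    names = list(vectors)
    # Naive, deterministic initialization: the first k terms act as centroids.
    centroids = [list(vectors[name]) for name in names[:k]]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {}
        for name in names:
            distances = [
                sum((x - y) ** 2 for x, y in zip(vectors[name], centroid))
                for centroid in centroids
            ]
            new_assignment[name] = distances.index(min(distances))
        if new_assignment == assignment:
            break                            # converged: no term changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [vectors[name] for name in names if assignment[name] == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy term-by-document counts (4 documents), invented for this example.
toy_vectors = {
    "forage": [3, 2, 0, 0],
    "queen":  [0, 0, 2, 3],
    "pollen": [2, 3, 0, 0],
    "worker": [0, 1, 3, 2],
}
themes = kmeans_terms(toy_vectors, k=2)
```

Unlike the hierarchical clusters used for normalization, each term here lands in exactly one flat group, which matches the looser, partition-style themes described on this slide.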

  19. Future Plan: • Customize the stemmers • Try more morphological approaches. • e.g. pollen-foraging, nectar-foraging • Examine more clustering methods: • How to use theme clustering to help normalization • Find a way to divide the hierarchical clustering structure into concepts

  20. Thanks!
