The project aims to automate the summarization of knowledge about specific genes from scientific literature, enabling biologists to rapidly understand target genes. It addresses the challenge of manual curation by compressing information on gene products, expression patterns, mutations, and interactions into concise summaries. Utilizing a two-stage approach, the system retrieves articles by gene name and extracts relevant sentences, improving both recall and precision through advanced techniques, including machine learning. The generated summaries provide valuable insights and facilitate navigation of related literature.
Automated Gene Summary: Let the Computer Summarize the Knowledge
Xu Ling, Department of Computer Science, University of Illinois at Urbana-Champaign
The Reality of Scientific Literature • Manual curation cannot keep up with the volume of publications!
Automated Gene Summarization • A gene summary covers multiple aspects: gene product, expression, sequence, interactions, mutations, and general functions.
Goal • To retrieve and summarize all the knowledge about a particular gene from the literature • Compressing knowledge: enables biologists to quickly understand the target gene. • Automated curation: explicitly covers multiple aspects of a gene, such as sequence information, mutant phenotypes, etc.
Our Solution • Semi-structured summary on multiple aspects: gene products, expression pattern, sequence information, phenotypic information, genetic/physical interactions, … • 2-stage summarization: retrieve relevant articles by gene name search, then extract the most informative and relevant sentences for each aspect.
System Overview: 2 Stages • Gene name recognition • Sentence categorization
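As a rough illustration of how the two stages fit together, here is a minimal Python sketch. The article format, the `recognize` and `categorize` callables, and the aspect list are assumptions for illustration, not the BeeSpace implementation; the two components themselves are described on the next slides.

```python
# Minimal sketch of the 2-stage pipeline (illustrative, not the BeeSpace code).
ASPECTS = ["Gene product", "Expression", "Sequence",
           "Interactions", "Mutations", "General functions"]

def summarize_gene(gene_name, articles, recognize, categorize, top_k=3):
    """Return a semi-structured summary: the top-k sentences per aspect.

    `articles`: list of dicts with "sentences" (list of str) and an "id".
    `recognize(gene_name, sentence)` -> bool (stage 1 component).
    `categorize(sentence)` -> (aspect, score) (stage 2 component).
    """
    summary = {aspect: [] for aspect in ASPECTS}
    for article in articles:
        # Stage 1: keep only articles that mention the target gene.
        if not any(recognize(gene_name, s) for s in article["sentences"]):
            continue
        # Stage 2: assign each sentence to its best aspect with a score.
        for sentence in article["sentences"]:
            aspect, score = categorize(sentence)
            if aspect in summary:
                summary[aspect].append((score, sentence, article["id"]))
    # Keep the highest-scoring sentences for each aspect.
    return {a: sorted(hits, reverse=True)[:top_k] for a, hits in summary.items()}
```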
Gene Name Recognition • v1: Dictionary-based string match • High recall, low precision • v2: Machine learning-based gene name recognition • High precision, low recall • v3: v2 + dictionary-based synonym expansion • Improved both recall and precision
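A hedged sketch of the v1 and v3 ideas follows: a whole-word dictionary match, and a wrapper that expands the query gene name with dictionary synonyms before calling a more precise machine-learning recognizer. The synonym table and the `ml_recognizer` callable are illustrative placeholders, not the actual resources used.

```python
import re

# Toy synonym table; a real system would load this from a resource such as a
# gene synonym list (e.g. from FlyBase). Entries here are illustrative only.
SYNONYMS = {
    "per": {"per", "period"},
}

def dictionary_match(gene_name, text):
    """v1: case-insensitive whole-word match (high recall, low precision)."""
    pattern = r"\b" + re.escape(gene_name) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def recognize_v3(gene_name, text, ml_recognizer):
    """v3: run the precise ML recognizer on the gene name and all of its
    dictionary synonyms, so fewer true mentions are missed (better recall)
    while keeping the recognizer's precision."""
    candidates = SYNONYMS.get(gene_name.lower(), set()) | {gene_name}
    return any(ml_recognizer(name, text) for name in candidates)
```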
Categorization of Retrieved Sentences • Collect “example sentences” from FlyBase • v1: apply a vector space model to construct an aspect “profile”. • v2: apply probabilistic models to factor out context-specific language. • v3: v2 + biologist-labeled training examples (real sentences!). Many thanks to Susan Brown’s “Beetle group” for their help!
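The slide does not specify the probabilistic model used in v2; as an assumption-labeled illustration of "factoring out context-specific language", the sketch below scores a sentence under an aspect unigram model interpolated with a background (whole-collection) model, so that words common everywhere contribute little to the aspect decision.

```python
import math
from collections import Counter

def unigram_counts(tokenized_sentences):
    """Word counts and total count from a list of tokenized example sentences."""
    counts = Counter(w for s in tokenized_sentences for w in s)
    return counts, sum(counts.values())

def aspect_score(sentence, aspect_counts, aspect_total,
                 bg_counts, bg_total, lam=0.5):
    """Log-likelihood of a tokenized sentence under a mixture of the aspect
    model and the background model (Jelinek-Mercer style interpolation).
    This is an illustrative stand-in, not the model used in the system."""
    score = 0.0
    vocab = len(bg_counts) + 1
    for w in sentence:
        p_aspect = aspect_counts.get(w, 0) / aspect_total
        p_bg = (bg_counts.get(w, 0) + 1) / (bg_total + vocab)  # add-one smoothed
        score += math.log(lam * p_aspect + (1 - lam) * p_bg)
    return score

# A sentence is assigned to the aspect whose model gives the highest score.
```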
Gene Summary in BeeSpace v4 • To add
General Entity Summarization • General and applicable to summarizing other entities: pathways, protein families, … • General settings: • Space: A set of documents to be summarized. • Aspects: A set of aspects to define the structure of the summary. • Examples: Training sentences for each aspect.
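As a small illustration of these general settings, one possible data structure is sketched below; the class and field names are assumptions for illustration, not the BeeSpace API.

```python
from dataclasses import dataclass, field

@dataclass
class SummarizationTask:
    """The general entity-summarization setting: space, aspects, examples."""
    space: list[str]                    # documents to be summarized
    aspects: list[str]                  # aspect names defining the summary structure
    examples: dict[str, list[str]] = field(default_factory=dict)  # aspect -> training sentences

# Example: summarizing a protein family instead of a single gene
# (documents and labels below are placeholders).
task = SummarizationTask(
    space=["abstract 1 ...", "abstract 2 ..."],
    aspects=["Members", "Structure", "Function", "Disease associations"],
    examples={"Function": ["An example sentence describing a function ..."]},
)
```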
Further Generalization … • Limitations of the categorization problem with training examples • Predefined aspects may not fit the needs of a particular user • Only works for a predefined domain and topics • Training examples for each aspect are often unavailable • More Realistic New Setup • Allow a user to flexibly describe each facet with keywords (1-2): let the user determine what they want • Generate the summary in a semi-supervised way: no need for training examples
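A deliberately simple, assumption-labeled stand-in for this semi-supervised setup: each facet is seeded with one or two user-supplied keywords, and sentences go to the facet whose keywords they mention most often. A real system would expand the seed keywords (e.g. with co-occurring terms) rather than rely on exact matches.

```python
from collections import Counter

def assign_by_keywords(tokenized_sentences, facets):
    """`facets` maps a facet name to its 1-2 user-supplied keywords.
    Each tokenized sentence is assigned to the facet whose keywords it
    mentions most; sentences matching no keyword are left unassigned."""
    assignment = {name: [] for name in facets}
    for tokens in tokenized_sentences:
        counts = Counter(t.lower() for t in tokens)
        scores = {name: sum(counts[k.lower()] for k in keywords)
                  for name, keywords in facets.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            assignment[best].append(" ".join(tokens))
    return assignment

# Usage, echoing the product-review example on the next slide:
# facets = {"fuel economy": ["mpg", "mileage"], "comfort": ["seat", "ride"]}
```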
Example (1): Consumer vs. Editor Reviews of the Honda Accord 2006
Example (2): Different Aspects • What if users want an overview with different facets?
Conclusion • The generated summaries are • directly useful to biologists, • and also serve as entry points that let them quickly navigate the relevant literature, • via the BeeSpace analysis environment available at www.beespace.uiuc.edu
Start from Here … • The reverse of automated entity summarization: automated entity retrieval • Profiling of entities using entity summaries, e.g., what genes are associated with …? • Build a powerful knowledge base … • Enriched entities in a given context, e.g., what are the significantly enriched genes in …? • Entities involved in certain biomedical relations, e.g., what genes are interacting with gene X? BeeSpace v5!
Acknowledgement • Bruce Schatz, Gene Robinson, Chengxiang Zhai, Xin He, Jing Jiang, Qiaozhu Mei, Moushumi Sarma
Vector Space Model (VSM) • Construct a term vector Vc for each aspect using its training sentences • The weight of term t_i in the term vector for aspect j: w_ij = TF_ij × IDF_i, where TF_ij is the term frequency and IDF_i = 1 + log(N / n_i) is the inverse document frequency (N = total number of documents, n_i = number of documents containing term t_i) • Construct a sentence term vector Vs for each sentence, using the same IDF and TF = number of times the term occurs in the sentence • Aspect relevance score: S = cos(Vc, Vs)
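A minimal sketch of the scoring defined above, assuming already-tokenized text; the aspect profile Vc is built from the concatenated training sentences of the aspect, and both vectors share the same IDF table, as the slide describes.

```python
import math
from collections import Counter

def idf_table(documents):
    """IDF_i = 1 + log(N / n_i) over a collection of documents (lists of tokens)."""
    n = len(documents)
    df = Counter(t for doc in documents for t in set(doc))
    return {t: 1.0 + math.log(n / df_t) for t, df_t in df.items()}

def tfidf_vector(tokens, idf):
    """Term vector with weight w = TF * IDF (terms unseen in the collection get 0)."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Aspect relevance score S = cos(Vc, Vs):
# Vc from all training sentences of the aspect concatenated together,
# Vs from the candidate sentence, both weighted with the same IDF table.
```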