Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts

Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry

University of Tennessee

presented by J. Jiang


Outline

  • Brief Overview of Biomedical Literature Mining

  • The Gene Clustering Problem

  • Latent Semantic Indexing

  • Experiments

  • Conclusions and Discussions


Biomedical Literature Mining: Brief Overview

  • Goal: to find useful information from the large amount of biomedical literature

  • Tasks include:

    • Identifying relevant literature for a given gene/protein

    • Connecting genes with diseases

    • Grouping genes/proteins by functions

    • Reconstructing and predicting gene networks

      (ISMB '05 Tutorial Proposal, H. Shatkay)


Biomedical Literature Mining: Brief Overview (cont.)

  • Approaches:

    • IE & NLP: entities, relations, facts, etc. Many methods rely on co-occurrences of genes/proteins.

    • IR: text categorization and summarization, etc.

    • Hybrid: combining multiple techniques

  • Challenges include:

    • No fixed nomenclature or sentence structure

    • Indirect links

    • Etc.


The Gene Clustering Problem

  • To group genes based on their functions

  • Previous work:

    • Co-occurrence of gene symbols to extract gene relationships

    • Implicit textual relationships

    • Gene clustering using functional information in annotated indices or MEDLINE abstracts


Vector Space Model for Gene Clustering

  • Glenisson et al., 2003

  • Bag-of-words, vector space model

  • Cosine similarity

  • K-medoids algorithm

    This paper tries to improve the vector representation of documents using LSA.
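
The bag-of-words representation and cosine similarity listed above can be sketched as follows. This is a minimal illustration, not the pipeline of Glenisson et al.: the tokenizer (lower-casing and whitespace splitting), the helper names `bag_of_words` and `cosine_similarity`, and the three toy gene-documents are all assumptions made for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-case, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy gene-documents (not taken from the paper).
docs = {
    "geneA": "neuronal migration cortex signaling",
    "geneB": "cortex neuronal migration receptor",
    "geneC": "tumor suppressor cell cycle",
}
vectors = {gene: bag_of_words(text) for gene, text in docs.items()}

sim_ab = cosine_similarity(vectors["geneA"], vectors["geneB"])  # share 3 of 4 terms
sim_ac = cosine_similarity(vectors["geneA"], vectors["geneC"])  # share no terms
```

Documents that share no literal terms get a similarity of exactly zero here, which is the limitation of pure term matching that LSA is meant to soften.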


Background: LSA

  • First studied by Deerwester et al., "Indexing by Latent Semantic Analysis," J. Am. Soc. Inf. Sci., 1990

  • Motivation: inaccuracy of term matching due to polysemy and synonymy

  • Assumption: existence of latent semantic structure (“artificial concepts”)

  • Dimension reduction. Keep the most important dimensions. Similar to PCA.


Singular Value Decomposition

  • d documents, t terms (in general, t >> d)

  • t × d matrix $X = [x_{ij}]$, where $x_{ij}$ denotes the frequency of term i in document j

  • X can be decomposed as:

    $X = T_0 S_0 D_0^\top$,

    where the columns of $T_0$ are the eigenvectors of $X X^\top$, and the columns of $D_0$ are the eigenvectors of $X^\top X$. $S_0$ is diagonal, and $S_0^2$ holds the nonzero eigenvalues of $X X^\top$ (equivalently, of $X^\top X$).
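
The decomposition can be checked numerically. The sketch below uses NumPy's `numpy.linalg.svd` and assumes the term-by-document orientation of Deerwester et al. (rows are terms, columns are documents); the 5 × 4 count matrix is a made-up example, not data from the paper.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows = terms, columns = documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
    [1, 0, 0, 1],
], dtype=float)

# Thin SVD: X = T0 @ diag(s0) @ D0.T, with s0 returned in decreasing order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
D0 = D0t.T

# Multiplying the factors back together should recover X.
reconstruction = T0 @ np.diag(s0) @ D0.T

# Eigenvalues of X^T X, sorted in decreasing order to line up with s0**2.
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
```

Here `eigvals` agrees with `s0**2` up to floating-point error, which is the relation between singular values and eigenvalues stated on the slide.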


SVD (cont.)

  • The diagonal elements of $S_0$ (the singular values) are constructed to be positive and ordered in decreasing magnitude.


SVD (cont.)

  • The eigenvector with the largest eigenvalue represents the dimension along which the variance of the data is maximized.

  • Keep the k largest elements of $S_0$, set the others to zero, and remove the corresponding columns (eigenvectors) of $T_0$ and $D_0$. X can then be approximated by:

    $X \approx \hat{X} = T S D^\top$.


SVD (cont.)

  • $\hat{X}$ is the best rank-k approximation of X in the least-squares sense.
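
A minimal sketch of the rank-k truncation, assuming NumPy and a random matrix in place of real term counts. The helper name `truncated_svd` is hypothetical. The final check relies on the standard Eckart-Young result: the Frobenius-norm error of the best rank-k approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(X, k):
    """Return T (t x k), s (k,), D (d x k) for the k largest singular values."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    return T0[:, :k], s0[:k], D0t[:k, :].T

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # hypothetical 6-term x 4-document matrix
k = 2

T, s, D = truncated_svd(X, k)
X_hat = T @ np.diag(s) @ D.T  # rank-k approximation of X

# Least-squares (Frobenius) error of the approximation.
error = np.linalg.norm(X - X_hat, "fro")

# Error predicted from the discarded singular values.
s_all = np.linalg.svd(X, compute_uv=False)
expected_error = np.sqrt(np.sum(s_all[k:] ** 2))
```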


Illustration

[Figure: the first and second eigenvectors of a data set, taken from "A Tutorial on PCA" by Lindsay Smith]


LSA with SVD

  • Terms are represented by rows of $\hat{X}$ and documents are represented by columns of $\hat{X}$ in the reduced space.

  • Doc-to-doc similarity:

    $\hat{X}^\top \hat{X} = D S^2 D^\top = (DS)(DS)^\top$.

  • Query is represented as a pseudo-document:

    $D_q = X_q^\top T S^{-1}$,

    where $X_q$ is the query vector in the original term space. $D_q$ is like a row of D.

  • Query-to-doc similarity:

    $D_q S (DS)^\top$.
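
The three formulas on this slide can be sketched with NumPy as follows. The count matrix and the choice k = 2 are illustrative assumptions. The query is deliberately set to the term vector of document 0, so that folding it in should reproduce row 0 of D.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows = terms, columns = documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
    [1, 0, 0, 1],
], dtype=float)
k = 2

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Doc-to-doc similarity: rows of D @ S are the document coordinates.
doc_coords = D @ S
doc_doc = doc_coords @ doc_coords.T  # same as X_hat.T @ X_hat

# Fold a query into the reduced space: D_q = X_q^T T S^{-1}.
x_q = X[:, 0]  # query equal to document 0's term vector
D_q = x_q @ T @ np.linalg.inv(S)

# Query-to-doc similarity: D_q S (D S)^T, one score per document.
query_doc = (D_q @ S) @ doc_coords.T
```

These scores are raw dot products, as on the slide; dividing by the vector norms would turn them into cosine similarities.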


Experiments

  • 50 genes were selected from three areas: (1) development, (2) Alzheimer's disease, and (3) cancer biology

  • Gene-document: concatenation of the abstracts known to be related to the gene

  • Gene-document represented as vectors:


Experiments (cont.)

  • Keyword query and accession number query

  • Reelin signaling pathway

  • GO classification terms and human disease

  • Direct genes and indirect genes

  • Hierarchical Clustering
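
The slide does not say which linkage criterion the authors used for hierarchical clustering. As an illustration only, the sketch below implements agglomerative clustering with average linkage in plain Python; the function name `average_linkage` and the toy distance matrix (which could be, for example, 1 minus cosine similarity between gene-documents) are assumptions.

```python
def average_linkage(dist, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    dist is a symmetric list-of-lists matrix of pairwise distances.
    Returns a list of clusters, each a sorted list of item indices.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        # Average of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical distances among four genes (not from the paper).
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
groups = average_linkage(dist, 2)
```

With these distances, genes 0 and 1 end up in one cluster and genes 2 and 3 in the other.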


Results

  • Tried 5, 25, and 50 dimensions; 50 performed best.

  • Tried reducing the number of abstracts for the Reelin genes. The authors claim that average precision (AP) was not significantly reduced when 50% of the abstracts were removed.

  • The authors claim that the hierarchical clustering agrees with biological relationships.
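
Assuming AP here denotes average precision over a ranked list of genes, one common definition can be sketched as below. The paper may normalize differently; the function name and the toy ranking are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision-at-rank over the ranks where a relevant item appears.

    Normalized by the total number of relevant items, so relevant items
    missing from the ranking count as zero precision.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

# Hypothetical ranking returned for a query; g1, g2, g3 are the relevant genes.
ranked_genes = ["g1", "g7", "g2", "g9", "g3"]
relevant_genes = {"g1", "g2", "g3"}
ap = average_precision(ranked_genes, relevant_genes)
```

In this toy ranking the relevant genes appear at ranks 1, 3, and 5, so the precisions averaged are 1/1, 2/3, and 3/5.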


Discussions

  • Pros

    • Gene clustering by textual information.

    • Applied LSA to biomedical literature. Indirect linkage can be found through latent concepts.

  • Cons

    • Requires human annotation to construct gene-documents. Not applicable to new domains.

    • Genes in the experiments are carefully chosen in 3 categories. How does the method perform in general?

  • Other gene clustering methods?


References

  • S. Deerwester et al. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.

  • M.A. Girolami (2004). Latent Semantic Analysis: A General Tutorial Introduction. http://ir.dcs.gla.ac.uk/oldseminars/Girolami.ppt

  • H. Shatkay (2005). ISMB '05 Tutorial Proposal. http://www.iscb.org/ismb2005/tutorials/pm10.pdf

  • H. Shatkay & R. Feldman (2004). Mining the Biomedical Literature in the Genomic Era: An Overview. Journal of Computational Biology, 10(6), 821-855.


The End

  • Questions?

  • Thank you!

