
Clustering More than Two Million Biomedical Publications






Presentation Transcript


  1. Clustering More than Two Million Biomedical Publications Comparing the Accuracies of Nine Text-Based Similarity Approaches Boyack et al. (2011). PLoS ONE 6(3): e18029

  2. Motivation • Compare different similarity measurements • Make use of biomedical data set • Process large corpus

  3. Procedures • define a corpus of documents • extract and pre-process the relevant textual information from the corpus • calculate pairwise document-document similarities using nine different similarity approaches • create similarity matrices keeping only the top-n similarities per document • cluster the documents based on this similarity matrix • assess each cluster solution using coherence and concentration metrics
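The pipeline above can be sketched end-to-end on a toy corpus. Plain cosine similarity over raw term counts stands in for the nine similarity approaches compared in the paper, and the three-document corpus is purely illustrative:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    num = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb)

# Step 1: define a (toy) corpus of documents
corpus = {"d1": "gene expression clustering",
          "d2": "clustering gene data",
          "d3": "protein folding"}
# Step 2: extract textual features (here, raw term counts)
vecs = {d: Counter(text.split()) for d, text in corpus.items()}
# Step 3: calculate pairwise document-document similarities
sims = {(a, b): cosine(vecs[a], vecs[b])
        for a in vecs for b in vecs if a < b}
# Step 4 would keep only the top-n entries of `sims` per document
```

d1 and d2 share two of three terms, so they score 2/3, while d1 and d3 share none and score 0.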

  4. Data • Built a corpus with titles, abstracts, MeSH terms, and reference lists • Matched and combined data from the MEDLINE and Scopus (Elsevier) databases • The resulting set was then limited to documents published from 2004-2008 that contained abstracts, at least five MeSH terms, and at least five references in their bibliographies • This yielded a corpus comprising 2,153,769 unique scientific documents • Base matrix: word-document co-occurrence matrix

  5. Methods

  6. tf-idf • The tf–idf weight (term frequency–inverse document frequency) • A statistical measure of how important a word is to a document in a collection or corpus • The importance increases in proportion to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.

  7. tf-idf
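A common form of the weight is tf-idf(w, d) = tf(w, d) · log(N / df(w)). A minimal sketch of that variant (the paper's exact weighting scheme is not shown on the slide):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf weights for a tokenized corpus: tf(w, d) * log(N / df(w)).
    One common variant; illustrative only."""
    N = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

# Toy corpus: "gene" is frequent in doc 0 and rare in the corpus, so it scores highest there.
docs = [["gene", "protein", "gene"], ["protein", "cell"], ["cell", "tissue"]]
w = tfidf(docs)
```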

  8. LSA • Latent semantic analysis

  9. LSA
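LSA works by applying a truncated singular value decomposition to the term-document matrix, projecting documents into a low-dimensional "latent semantic" space. A minimal sketch on a toy matrix (the rank k and the matrix are illustrative assumptions):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); in practice the
# entries would be tf-idf weights over the full vocabulary.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Truncated SVD: keep only the k largest singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T    # k-dimensional document representations

def cos(u, v):
    """Cosine similarity between two latent-space document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

Documents 0 and 1 share all their terms and end up nearly identical in the latent space; documents 0 and 2 share none and stay orthogonal.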

  10. BM25 • Okapi BM25 • A ranking function that is widely used by search engines to rank matching documents according to their relevance to a query

  11. BM25
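The standard Okapi BM25 score sums, over query terms, an idf factor times a saturating term-frequency factor normalized by document length. A minimal sketch using textbook defaults k1 = 1.2 and b = 0.75 (the paper's parameter settings are not shown on the slide):

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of `doc` against `query` terms; textbook formulation."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter()                                 # document frequencies
    for d in docs:
        df.update(set(d))
    tf = Counter(doc)
    score = 0.0
    for q in query:
        idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
        f = tf[q]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus: only the first document mentions "biomedical".
docs = [["clustering", "biomedical", "publications"], ["clustering", "text"], ["genome"]]
```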

  12. SOM • Self-organizing map • A form of artificial neural network that generates a low-dimensional geometric model from high-dimensional data • SOM may be considered a nonlinear generalization of principal component analysis (PCA).

  13. SOM • Training algorithm: 1. Randomize the map's nodes' weight vectors 2. Grab an input vector 3. Traverse each node in the map 4. Use the Euclidean distance formula to find the similarity between the input vector and each node's weight vector 5. Track the node that produces the smallest distance (this node is the best matching unit, BMU) 6. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector: Wv(t + 1) = Wv(t) + Θ(t)α(t)(D(t) - Wv(t)) 7. Increase t and repeat from step 2 while t < λ
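The training loop above can be sketched as a minimal 1-D SOM. The node count, decay schedules for α(t) and the neighbourhood radius, and the Gaussian neighbourhood function Θ are illustrative choices, not the paper's settings:

```python
import math
import random

def train_som(data, n_nodes=4, dim=2, epochs=300, seed=0):
    """Minimal 1-D self-organizing map following the steps above (sketch)."""
    rng = random.Random(seed)
    # Step 1: randomize the map's nodes' weight vectors
    W = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(epochs):
        alpha = 0.5 * (1 - t / epochs)                       # decaying learning rate α(t)
        radius = max(0.5, (n_nodes / 2) * (1 - t / epochs))  # shrinking neighbourhood radius
        # Step 2: grab an input vector
        x = rng.choice(data)
        # Steps 3-5: traverse the nodes, find the best matching unit (smallest distance)
        bmu = min(range(n_nodes),
                  key=lambda i: sum((W[i][d] - x[d]) ** 2 for d in range(dim)))
        # Step 6: pull the BMU's neighbourhood toward the input,
        # Wv(t+1) = Wv(t) + Θ(t)·α(t)·(D(t) - Wv(t))
        for i in range(n_nodes):
            theta = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
            for d in range(dim):
                W[i][d] += theta * alpha * (x[d] - W[i][d])
    return W

# Toy data: two tight clusters near (0, 0) and (1, 1); after training,
# some node should sit near each cluster.
data = [[0.0, 0.0], [0.05, 0.05], [1.0, 1.0], [0.95, 0.95]]
W = train_som(data)
```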

  14. Topic modeling • Three separate Gibbs-sampled topic models were learned at the following topic resolutions: T = 500, T = 1000 and T = 2000 topics. • Dirichlet prior hyperparameter settings of β = 0.01 and α = 0.05N/(D·T) were used, where N is the total number of word tokens, D is the number of documents and T is the number of topics.

  15. Topic modeling
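A collapsed Gibbs sampler for LDA resamples each token's topic from p(z = k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ). A minimal sketch of that technique; the toy corpus, iteration count, and hyperparameter values here are illustrative, not the paper's (which used β = 0.01 and α = 0.05N/(D·T)):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, T, alpha, beta, iters=30, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    ndk = [[0] * T for _ in docs]                  # document-topic counts
    nkw = [defaultdict(int) for _ in range(T)]     # topic-word counts
    nk = [0] * T                                   # topic totals
    z = []                                         # topic assignment per token
    for di, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(T)                   # random initial assignment
            zd.append(k)
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                k = z[di][wi]                      # remove current assignment
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # p(z = k | rest) ∝ (n_dk + α)·(n_kw + β)/(n_k + V·β)
                weights = [(ndk[di][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(T)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        k = t
                        break
                z[di][wi] = k                      # resample and restore counts
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

docs = [["a", "a", "b"], ["b", "a"], ["c", "d", "c"], ["d", "c"]]
ndk, nkw = lda_gibbs(docs, T=2, alpha=0.1, beta=0.01)
```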

  16. PMRA • The PMRA ranking measure is used to calculate ‘Related Articles’ in the PubMed interface • Treated as the de facto standard, and used here as a proxy (benchmark) for comparison

  17. Similarity filtering • Reduce matrix size by generating a top-n similarity file from each of the larger similarity matrices • With n = 15, each document thus contributes between 5 and 15 edges to the similarity file
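The filtering step can be sketched as truncating each document's neighbour list to its n highest-scoring entries. The dict-of-dicts data structure is an assumption for illustration, not the paper's file format:

```python
def top_n(similarities, n=15):
    """Keep only the n highest-scoring neighbours per document.
    `similarities` maps each document id to a dict {neighbour_id: score}."""
    filtered = {}
    for doc, neigh in similarities.items():
        top = sorted(neigh.items(), key=lambda kv: kv[1], reverse=True)[:n]
        filtered[doc] = dict(top)
    return filtered

# Toy example with n = 2: only the two strongest neighbours survive.
sims = {"d1": {"d2": 0.9, "d3": 0.5, "d4": 0.1}}
out = top_n(sims, n=2)
```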

  18. Clustering • DrL (now called OpenOrd) • A graph layout algorithm that calculates an (x,y) position for each document in a collection using an input set of weighted edges • http://gephi.org/

  19. Evaluation • Textual coherence (Jensen-Shannon divergence)
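Jensen-Shannon divergence compares two discrete distributions (e.g. the word distribution of a document versus that of its cluster) via their midpoint. A minimal sketch using base-2 logarithms, so the value lies in [0, 1]:

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions.
    Identical distributions give 0; disjoint ones give 1 (base-2 logs)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]    # midpoint distribution
    def kl(a, b):
        """Kullback-Leibler divergence, skipping zero-probability terms."""
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```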

  20. Evaluation • Concentration: a metric based on grant acknowledgements from MEDLINE, using a grant-to-article linkage dataset from a previous study

  21. Results

  22. Results (cont.)
