
Document Classification via Term Distribution Similarity

Xin Lu, Angela Zoss. CSCI B651 Final Project Presentation, December 16, 2010.


Presentation Transcript


  1. Document Classification via Term Distribution Similarity Xin Lu, Angela Zoss CSCI B651 Final Project Presentation December 16, 2010

  2. Classification of Scientific Disciplines • Goals • Classification using Map of Science clusters • Program • Matrix population • Matrix transformation • Vector comparison • Evaluation • By subdiscipline • By discipline

  3. Classification Using Map of Science • Map of Science can be generated from documents automatically (citation analysis, LSA) • Resulting clusters are up-to-date, interlinked • The 554 clusters identified offer a more fine-grained classification system than many others (e.g., ISI uses 173 Subject Categories)

  4. UCSD Map of Science • 7.2 million papers • 16,000 (serial) publication sources • 554 clusters of sources (subdisciplines) • 13 top-level disciplines • Classification to clusters can be done using either the 16k journals or the 72k keywords that were extracted from article titles during the clustering process.

  5. Article Mapping by Journal (figure; labels: Journals, Papers)

  6. Article Mapping by Keyword (figure; labels: Articles, Cosine Similarity, Keyword Vectors)

  7. Program: Matrix Population • Obtained 200k articles from PubMed Central • Selected a subset of 8k articles that covered 214 map clusters, by journal name association • Used the ~30k keywords matched to those 214 clusters to generate a word frequency matrix for the 8k papers • Created a matrix with identical structure for the 214 clusters, populated with the match percentages assigned by the Map of Science (MoS) instead of word frequencies
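
The matrix population step above can be sketched roughly as follows. This is a minimal illustration, not the authors' program: the function name `term_frequency_matrix`, the sample keywords, and the sample texts are all hypothetical, and the sketch assumes single-word keywords matched by simple tokenization, whereas the slides indicate that the original matching used regular expressions.

```python
import re
from collections import Counter


def term_frequency_matrix(documents, keywords):
    """Return one row per document and one column per keyword (raw counts)."""
    vocabulary = set(keywords)
    matrix = []
    for text in documents:
        # Assumption: lowercase alphanumeric tokens are a good enough match unit.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts = Counter(token for token in tokens if token in vocabulary)
        # Counter returns 0 for keywords that never occur in the document.
        matrix.append([counts[keyword] for keyword in keywords])
    return matrix


# Hypothetical keywords and article texts, for illustration only.
keywords = ["gene", "protein", "network"]
documents = [
    "Gene expression and protein folding in yeast",
    "A protein interaction network; protein hubs",
]
tf = term_frequency_matrix(documents, keywords)
```

With ~30k keywords and 8k papers the real matrix is very sparse, so a sparse representation would likely be preferable to the nested lists used here.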

  8. Program: Matrix Population

  9. Program: Matrix Transformation Selecting terms based on inverse document frequency
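
One way to select terms by inverse document frequency is sketched below. The slides do not state which IDF formula or thresholds were used, so the natural-log formula `log(N / df)`, the function names, and the `low`/`high` cutoffs are assumptions made for illustration.

```python
import math


def inverse_document_frequency(matrix):
    """IDF per column: log(N / df), where df counts rows with a nonzero entry."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    idf = []
    for j in range(n_terms):
        df = sum(1 for row in matrix if row[j] > 0)
        # Terms that occur in no document get infinite IDF and are never selected
        # by a finite upper bound.
        idf.append(math.log(n_docs / df) if df else float("inf"))
    return idf


def select_terms(matrix, low, high):
    """Keep only the columns whose IDF lies in the closed range [low, high]."""
    idf = inverse_document_frequency(matrix)
    keep = [j for j, value in enumerate(idf) if low <= value <= high]
    reduced = [[row[j] for j in keep] for row in matrix]
    return keep, reduced


# Hypothetical 4-document, 4-term frequency matrix.
tf = [
    [1, 1, 0, 0],
    [1, 0, 2, 0],
    [1, 3, 0, 0],
    [1, 0, 0, 0],
]
# Hypothetical thresholds: drop terms found everywhere (IDF 0) or nowhere.
keep, reduced = select_terms(tf, low=0.5, high=2.0)
```

Here the first column appears in every document and the last in none, so only the two middle columns survive. The same column indices would need to be applied to the cluster matrix so that article and cluster vectors stay aligned.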

  10. Program: Vector Comparison • Cosine similarity using MATLAB (http://en.wikipedia.org/wiki/Cosine_similarity)
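
The comparison itself was done in MATLAB; the sketch below shows the same idea in plain Python for readers without MATLAB. The function `best_cluster`, the cluster identifiers, and the vectors are hypothetical, and assigning each article to the single most similar cluster is an assumption about how the similarities were turned into a classification.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen here: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def best_cluster(article_vector, cluster_vectors):
    """Return the id of the cluster whose vector is most similar to the article."""
    return max(
        cluster_vectors,
        key=lambda cid: cosine_similarity(article_vector, cluster_vectors[cid]),
    )


# Hypothetical cluster vectors (match percentages) and one article vector (counts).
cluster_vectors = {
    "cluster_a": [0.9, 0.1, 0.0],
    "cluster_b": [0.0, 0.2, 0.8],
}
article = [3, 1, 0]
predicted = best_cluster(article, cluster_vectors)
```

Because cosine similarity ignores vector length, raw word counts for articles can be compared directly with percentage-valued cluster vectors.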

  11. Evaluation by Subdiscipline • Overall recall: 20%

  12. Evaluation by Discipline • Overall recall: 49%
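
The slides do not spell out how overall recall was computed. A plausible reading, assumed here, is the fraction of articles whose predicted label equals the label derived from the article's journal. The sketch below computes that figure at both levels; the cluster ids, discipline names, and the `cluster_to_discipline` mapping are hypothetical.

```python
def overall_recall(true_labels, predicted_labels):
    """Fraction of items whose predicted label matches the true label."""
    if not true_labels:
        return 0.0
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


# Hypothetical mapping from subdiscipline clusters to top-level disciplines.
cluster_to_discipline = {"c1": "Biology", "c2": "Biology", "c3": "Chemistry"}

true_clusters = ["c1", "c1", "c2", "c3"]
predicted_clusters = ["c1", "c2", "c1", "c3"]

# Subdiscipline level: the exact cluster must match.
subdiscipline_recall = overall_recall(true_clusters, predicted_clusters)

# Discipline level: map both label lists up to disciplines, then compare.
discipline_recall = overall_recall(
    [cluster_to_discipline[c] for c in true_clusters],
    [cluster_to_discipline[c] for c in predicted_clusters],
)
```

Under this reading, every exact cluster match is also a discipline match, so discipline-level recall can never be lower than subdiscipline-level recall, which is consistent with the 20% and 49% figures reported.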

  13. Possible Areas for Improvement • Reduce sparseness with more flexible regular-expression matching and synonymy/hypernymy • Apply SVD to smooth vectors • Try different TF-IDF ranges
