
Machine learning techniques for detecting topics in research papers


Presentation Transcript


  1. Machine learning techniques for detecting topics in research papers Amy Dai

  2. The Goal Build a web application that allows users to easily browse and search papers

  3. Project Overview • Part I – Data Processing • Convert PDF to text • Extract information from documents • Part II – Discovering topics • Index documents • Group documents by similarity • Learn underlying topics

  4. Part I - Data Processing How do we extract information from PDF documents?

  5. PDF to Text • Research papers are distributed as PDF • A PDF is essentially an image: the computer sees colored lines and dots, not structured text • The conversion to text loses some of the formatting
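The conversion itself can be scripted around the pdftotext command listed under Useful Tools; a minimal sketch (file names are placeholders):

```python
import subprocess
from pathlib import Path

def pdf_to_text(pdf_path: str) -> str:
    """Convert a PDF to plain text with the pdftotext command-line tool.
    Much of the layout (columns, fonts, spacing) is lost in the process."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(["pdftotext", pdf_path, str(txt_path)], check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")

# text = pdf_to_text("spam_damn_spam.pdf")  # hypothetical file name
```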

  6. Getting what we need • Construct heuristic rules to extract info (see the sketch below) • Title: the first line • Authors: between the title and the abstract • Abstract: preceded by “Abstract” • Keywords: preceded by “Keywords”
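A rough sketch of such rules in Python, assuming pdftotext-style output; the exact patterns are illustrative rather than the rules actually used in the project:

```python
import re

def extract_fields(text: str) -> dict:
    """Heuristic extraction of title, authors, abstract, and keywords.
    Papers do not share one formatting standard, so these rules are rough."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    title = lines[0] if lines else ""

    # Authors: lines between the title and the word "Abstract"
    # (very rough -- affiliations get swept up too)
    authors = []
    for line in lines[1:]:
        if re.match(r"abstract", line, re.I):
            break
        authors.append(line)

    # Abstract: text after "Abstract" up to a blank line or the keywords
    abstract = ""
    m = re.search(r"Abstract[:\s]+(.*?)(?=\n\s*\n|Keywords|$)", text, re.S | re.I)
    if m:
        abstract = " ".join(m.group(1).split())

    # Keywords: comma-separated list after "Keywords"
    keywords = []
    m = re.search(r"Keywords[:\s]+(.+)", text, re.I)
    if m:
        keywords = [k.strip() for k in m.group(1).split(",")]

    return {"title": title, "authors": authors,
            "abstract": abstract, "keywords": keywords}
```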

  7. Finding Names

  8. Can we predict names? • Named Entity Tagger • by the Cognitive Computation Group at the University of Illinois Urbana-Champaign • Example paper header the extractor has to handle: “Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages. Dennis Fetterly, Mark Manasse, Marc Najork. Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA. fetterly@microsoft.com, manasse@microsoft.com, najork@microsoft.com”
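The slides use the Illinois tagger; as a stand-in illustration of the same idea, NLTK's built-in named-entity chunker (a different tool from the one in the slides) can pull PERSON spans out of header text:

```python
import nltk

# One-time resource downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

header = "Dennis Fetterly  Mark Manasse  Marc Najork  Microsoft Research"

tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(header)))

# Keep the subtrees the chunker labelled as PERSON
names = [" ".join(word for word, _ in subtree.leaves())
         for subtree in tree.subtrees()
         if subtree.label() == "PERSON"]
print(names)
```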

  9. Accuracy • To measure how well the extraction script worked • (# correct + # needing minor changes) / total # of documents • Example • 30 were correctly extracted • 10 needed minor changes • 60 total documents • (30 + 10) / 60 = 66.7%
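The same calculation as a small helper, using the counts from the slide:

```python
def extraction_accuracy(correct: int, minor_fixes: int, total: int) -> float:
    """Fraction of documents extracted correctly or needing only minor changes."""
    return (correct + minor_fixes) / total

print(f"{extraction_accuracy(30, 10, 60):.1%}")  # 66.7%
```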

  10. Accuracy and Error

  11. Part II – Learning Topics Can we use machine learning to discover underlying topics?

  12. Indexing Documents • Index documents • Remove common words, leaving better descriptors for clustering • Compare against a reference corpus • Brown Corpus: A Standard Corpus of Present-Day Edited American English • From the Natural Language Toolkit • Reduces the index from 19,100 to 12,400 words • Documents contain between 100 and 1,700 words after common-word removal (sketch below)
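A minimal sketch of the common-word filter using the Brown corpus from NLTK; the cutoff of 5,000 most frequent words is an assumed tuning choice, not the value used in the project:

```python
import nltk
from nltk.corpus import brown

# nltk.download("brown")  # one-time download of the reference corpus

# Treat the N most frequent Brown words as "common"; N = 5000 is an assumption
freq = nltk.FreqDist(w.lower() for w in brown.words())
common = {word for word, _ in freq.most_common(5000)}

def filter_common(tokens):
    """Drop corpus-common words, keeping the more distinctive descriptors."""
    return [t for t in tokens if t.isalpha() and t.lower() not in common]
```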

  13. Effect on Index Size • Changes in document index size for “Defining quality in web search results”

  14. Keeping What’s Important • Words in abstract of “Defining quality in web search results”

  15. Documents as Vectors • Represent each document as a numerical vector by weighting its words with tf-idf • Vectors are length-normalized • The vector dimension is the size of the corpus index • Vectors are mostly sparse
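A minimal tf-idf sketch in plain Python/NumPy; the smoothed idf used here is one common variant, not necessarily the project's exact weighting:

```python
import math
import numpy as np

def tfidf_vectors(docs, index):
    """docs: list of token lists; index: list of vocabulary terms.
    Returns one L2-normalised tf-idf vector per document (mostly zeros)."""
    n_docs = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in index}
    vectors = []
    for d in docs:
        counts = {t: d.count(t) for t in set(d)}
        # Smoothed idf (one common variant): log((1 + N) / (1 + df)) + 1
        vec = np.array([counts.get(t, 0) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t in index], dtype=float)
        norm = np.linalg.norm(vec)
        vectors.append(vec / norm if norm > 0 else vec)
    return vectors
```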

  16. Clustering using Machine Learning • Cluster the document vectors with unsupervised learning algorithms: • K-means • Group Average Agglomerative (GAA) clustering • Both compare documents with cosine similarity
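The Useful Tools slide notes that NLTK was used for the clustering algorithms; a sketch with its K-means and group-average agglomerative clusterers, with k = 3 taken from the results slide and toy vectors standing in for the real tf-idf vectors:

```python
import numpy as np
from nltk.cluster import KMeansClusterer, GAAClusterer, cosine_distance

# Toy stand-in for the tf-idf document vectors (one 50-dim vector per document)
rng = np.random.default_rng(0)
vectors = [rng.random(50) for _ in range(6)]

kmeans = KMeansClusterer(3, distance=cosine_distance, repeats=10,
                         avoid_empty_clusters=True)
km_groups = kmeans.cluster(vectors, assign_clusters=True)

gaa = GAAClusterer(num_clusters=3)
gaa_groups = gaa.cluster(vectors, assign_clusters=True)

print(km_groups, gaa_groups)  # one cluster id per document
```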

  17. Clustering Results • Documents • A: SpamRank – Fully Automatic Link Spam Detection • B: An Approach to Confidence Based Page Ranking for User Oriented Web Search • C: Spam, Damn Spam, and Statistics • D: Web Spam, Propaganda and Trust • E: Detecting Spam Web Pages through Content Analysis • F: A Survey of Trust and Reputation Systems for Online Service Provision • K-Means: Group 1 = {A}, Group 2 = {B, C, D, E}, Group 3 = {F} • GAA: Group 1 = {B}, Group 2 = {A, C, D, E}, Group 3 = {F}

  18. Challenges • K-Means • Finding k, the number of clusters • Group Average Agglomerative • Choosing the depth at which to cut the dendrogram

  19. Labeling Clusters • Compare term frequencies within a cluster to those in the whole collection • A word that is frequent in both the cluster and the collection is not a good discriminative label • A good label is frequent within the cluster but infrequent in the rest of the collection
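One simple way to score candidate labels along these lines; the lift-style ratio is an assumed scoring choice, not necessarily the one used in the project:

```python
from collections import Counter

def label_candidates(cluster_docs, all_docs, top_n=5):
    """Rank words that are relatively frequent inside the cluster but
    relatively infrequent in the collection. Each document is a token list."""
    cluster_freq = Counter(t for d in cluster_docs for t in d)
    collection_freq = Counter(t for d in all_docs for t in d)
    n_cluster = sum(cluster_freq.values())
    n_collection = sum(collection_freq.values())
    scores = {t: (cluster_freq[t] / n_cluster) / (collection_freq[t] / n_collection)
              for t in cluster_freq}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```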

  20. Summary • Part I – Data Processing • PDF-to-text conversion isn't perfect, and its imperfections make it harder to extract information • Documents don't follow a single formatting standard, so heuristic rules are needed to extract the fields • Part II – Discovering topics • Indexes are large; to keep only the important terms we need a good reference corpus to compare against • There are many clustering algorithms, and each has limitations • How do I choose the best label?

  21. Ongoing work • Use bigrams to capture multiword terms, e.g. keywords such as “Web search”, “adversarial information retrieval”, “web spam” • Limit the number of topic labels by ranking them • Use an algorithm that clusters based on probability distributions • Logistic normal distribution
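For the bigram idea, NLTK's collocation finder is one way to surface multiword terms; the PMI scoring measure and the toy token list below are assumptions for illustration:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token list standing in for a real document
tokens = ["adversarial", "information", "retrieval", "targets", "web",
          "spam", "in", "web", "search", "results"]

finder = BigramCollocationFinder.from_words(tokens)
# In real use, filter out rare bigrams first, e.g. finder.apply_freq_filter(3)
top_bigrams = finder.nbest(BigramAssocMeasures.pmi, 5)
print(top_bigrams)  # e.g. [("adversarial", "information"), ...]
```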

  22. Useful Tools • pdftotext – Unix command for converting PDF to text • Python libraries • Unicode handling • re – regular expressions • NLTK – Natural Language Toolkit • Software and datasets for natural language processing • Used for the clustering algorithms and the reference corpus
