
Method Seminar

Method Seminar. Tutorial: Using Stanford Topic Modeling Toolbox. Lili Lin. Contents: Introduction, Getting Started, Prerequisites, Installation, Toolbox Running, Latent Dirichlet Allocation Model (LDA Model), Labeled LDA Model.


Presentation Transcript


  1. Method Seminar Tutorial : Using Stanford Topic Modeling Toolbox Lili Lin

  2. Contents • Introduction • Getting Started • Prerequisites • Installation • Toolbox Running • Latent Dirichlet Allocation Model (LDA Model) • Labeled LDA Model

  3. Contents • Introduction • Getting Started • Prerequisites • Installation • Toolbox Running • Latent Dirichlet Allocation Model (LDA Model) • Labeled LDA Model

  4. Introduction • http://nlp.stanford.edu/software/tmt/tmt-0.4/ • The Stanford Topic Modeling Toolbox was written at the Stanford NLP group by Daniel Ramage and Evan Rosen, and was first released in September 2009 • It supports training and inference for topic models (e.g. LDA, Labeled LDA) to create summaries of text

  5. Introduction - LDA Model • The LDA model is an unsupervised topic model • Users need to define some important parameters, such as the number of topics • It is hard to choose the number of topics • Even with some top terms for each topic, it is still difficult to interpret the content of the extracted topics

  6. Introduction – Labeled LDA Model • Labeled LDA is a supervised topic model for credit attribution in multi-labeled corpora. • If one of the columns in your input text file contains labels or tags that apply to the document, you can use Labeled LDA to discover which parts of each document go with each label, and to learn accurate models of the words best associated with each label globally
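A compact way to state the difference from plain LDA, using standard notation that does not appear in the slides: if Λ_d is the set of labels attached to document d and there is one topic per label, Labeled LDA only allows the words of document d to be assigned to topics in Λ_d. The symbols K, θ, φ, z and w are the usual topic-model names and are introduced here only for illustration.

```latex
\begin{align*}
z_{d,n} &\in \Lambda_d \subseteq \{1,\dots,K\}, \\
\theta_{d,k} &= 0 \quad \text{for all } k \notin \Lambda_d, \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Categorical}\bigl(\phi_{z_{d,n}}\bigr)
\end{align*}
```

Read this way, the learned φ_k lists the words associated with label k, and the assignments z_{d,n} indicate which words in a document are credited to which of its labels.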

  7. Contents • Introduction • Getting Started • Prerequisites • Installation • Simple Testing • Toolbox Running • LDA Model • Labeled LDA Model

  8. Prerequisites • A text editor (e.g. TextWrangler) for creating TMT processing scripts. • TMT scripts are written in Scala, but no knowledge of Scala is required to get started. • An installation of Java 6 SE or greater: http://java.com/en/download/index.jsp. • Windows, Mac, and Linux are supported.

  9. Installation • Download the TMT executable (tmt-0.4.0.jar) from http://nlp.stanford.edu/software/tmt/tmt-0.4/ • Double-click the jar file to open toolbox or run the toolbox with the command line : java -jar tmt-0.4.0.jar • You should see a simple GUI

  10. Simple Testing • Example data and scripts for simple testing • Download the example data file: pubmed-oa-subset.csv • Download the first testing script: example-0-test.scala • Note: the data file and the script should be put into the same folder
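Before running the test, it can help to confirm that everything is in one folder. The sketch below is a hypothetical helper, not part of the toolbox: `missing_files` is an illustrative name, and keeping the jar in the same folder as the script and data is an assumption made for convenience.

```shell
#!/bin/sh
# Hypothetical helper (not part of TMT): list which of the named files
# are missing from a folder, one name per line. Prints nothing if all exist.
missing_files() {
    dir="$1"
    shift
    for name in "$@"; do
        [ -f "$dir/$name" ] || echo "$name"
    done
}

# Check the current folder for the jar, the test script and the data file.
missing_files . tmt-0.4.0.jar example-0-test.scala pubmed-oa-subset.csv
```

An empty result means all three files were found in the current folder.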

  11. Simple Testing - GUI • Load script: File → Open script

  12. Simple Testing - GUI • Edit script: val pubmed = CSVFile("pubmed-oa-subset.csv")

  13. Simple Testing - GUI • Run the script: click the button ‘Run’

  14. Simple Testing - Command Line

  15. Contents • Introduction • Getting Started • Prerequisites • Installation • Toolbox Running • Latent Dirichlet Allocation Model (LDA Model) • Labeled LDA Model

  16. LDA Model – Data Preparation • 173,777 astronomy papers were collected from the Web of Science (WOS), covering the period from 1992 to 2012 • In the file ‘astro_wos_lda.csv’, every record includes the paper ID (the first column), title (the second column) and publication year (the third column)

  17. LDA Training – Script Loading • File → Open script → Navigate to example-2-lda-learn.scala → Open

  18. LDA Training – Data Loading • Edit script: ‘val source = CSVFile("astro_wos_lda.csv")’ and ‘Column(2) ~>’ • Note: if your text spans two or more columns, such as the third and fourth columns, use ‘Columns(3,4) ~> Join(" ") ~>’ in place of ‘Column(2) ~>’

  19. LDA Training – Parameter Selection • Edit script: val params = LDAModelParams(numTopics = 30, dataset = dataset, topicSmoothing = 0.01, termSmoothing = 0.01)
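For orientation, the parameters can be read against the standard LDA generative model, sketched below. The symbols (α, β, θ_d, φ_k, K, D) are textbook notation rather than names from the toolbox, and mapping `topicSmoothing` to α and `termSmoothing` to β is an assumption based on the parameter names.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \quad (K = \texttt{numTopics} = 30) \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d), & w_{d,n} &\sim \mathrm{Categorical}\bigl(\phi_{z_{d,n}}\bigr)
\end{align*}
```

Under this reading, small values such as 0.01 favour sparse distributions: each document concentrates on a few topics, and each topic on relatively few terms.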

  20. LDA Training – Model Training • Run: the toolbox runs out of memory because the dataset is large

  21. LDA Training – Model Training • Change the memory size → Run

  22. LDA Training – Output Generation • Output folder: lda-b2aa1797-30-751edefe • description.txt: A description of the model saved in this folder • document-topic-distributions.csv: A csv file containing the per-document topic distribution for each document in the training dataset • 00000-01000: Snapshots of the model during training

  23. LDA Training – Output Generation • /params.txt: Model parameters used during training • /tokenizer.txt: Tokenizer used to tokenize text for use with this model • /summary.txt: Human-readable summary of the topic model, with the top 20 terms per topic and how many word instances of each have occurred • /log-probability-estimate.txt: Estimate of the log probability of the dataset at this iteration • /term-index.txt: Mapping from terms in the corpus to ID numbers • /description.txt: A description of the model saved in this iteration • /topic-term-distributions.csv.gz: For each topic, the probability of each term in that topic
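One common follow-up is to look up the heaviest topic for each document in document-topic-distributions.csv. The sketch below assumes a layout of one row per document, with the document ID in the first column followed by one weight per topic, no header row and no quoted fields; check the real file before relying on it. The sample data and the `dominant_topics` helper are made up for illustration.

```shell
#!/bin/sh
# Made-up sample standing in for document-topic-distributions.csv
# (assumed layout: document ID, then one weight per topic).
cat > sample-doc-topics.csv <<'EOF'
doc1,0.10,0.70,0.20
doc2,0.55,0.05,0.40
doc3,0.25,0.25,0.50
EOF

# Print "docID,topicIndex,weight" for the heaviest topic of each document.
# Topic indices are counted from 0 here, which is an assumption.
dominant_topics() {
    awk -F',' '{
        best = 2                              # field holding the first topic weight
        for (i = 3; i <= NF; i++) {
            if ($i + 0 > $best + 0) best = i  # compare as numbers
        }
        printf "%s,%d,%s\n", $1, best - 2, $best
    }' "$1"
}

dominant_topics sample-doc-topics.csv
```

On ties, the first of the equally weighted topics is reported.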

  24. LDA Training – Command Line • java -Xmx4G -jar tmt-0.4.0.jar example-2-lda-learn.scala
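Since the same command shape recurs for every script in this tutorial, a small wrapper can keep the heap setting in one place. This is a sketch only: `tmt_command` is a hypothetical helper name, and it assumes tmt-0.4.0.jar sits in the current folder and that file names contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the command line for running a TMT script,
# optionally with a maximum Java heap size such as 4G.
tmt_command() {
    script="$1"
    heap="${2:-}"                      # empty means: use the JVM default
    if [ -n "$heap" ]; then
        echo "java -Xmx$heap -jar tmt-0.4.0.jar $script"
    else
        echo "java -jar tmt-0.4.0.jar $script"
    fi
}

# Show the command for LDA training with a 4 GB heap.
tmt_command example-2-lda-learn.scala 4G

# To launch the run instead of printing it (requires Java and the jar):
# $(tmt_command example-2-lda-learn.scala 4G)
```

If the run still fails with an out-of-memory error, raising the value passed as the second argument is the first thing to try.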

  25. LDA Inference – Script Loading • File → Open script → Navigate to example-3-lda-infer.scala → Open

  26. LDA Inference – Trained Model Loading • Edit script: val modelPath = file("lda-b2aa1797-30-751edefe")

  27. LDA Inference – Data Loading • Edit script: ‘val source = CSVFile("astro_wos_lda.csv")’ and ‘Column(2) ~>’ • Note: here the training dataset is reused as the inference data; in practice, inference should be run on a new dataset

  28. LDA Inference – Model Inference • Change the memory size → Run

  29. LDA Inference – Output Generation • Navigate to the folder ‘lda-b2aa1797-30-751edefe’ • astro_wos_lda-document-topic-distributions.csv: A csv file containing the per-document topic distribution for each document in the inference dataset • astro_wos_lda-top-terms.csv: A csv file containing the top terms in the inference dataset for each topic • astro_wos_lda-usage.csv

  30. LDA Inference – Command Line • java -Xmx4G -jar tmt-0.4.0.jar example-3-lda-infer.scala

  31. LLDA Model – Data Preparation • 4,770 metformin papers were collected from PubMed, covering the period from 1997 to 2011 • Training data: metformin_train_data_llda.csv (2,798 papers); every record includes the paper ID (the first column), bio-term list (the second column), title (the third column) and abstract (the fourth column), and every record has at least 3 bio-terms • Inference data: metformin_infer_data_llda.csv (4,770 papers); every record includes the paper ID (the first column), title (the second column) and abstract (the third column)

  32. LLDA Training – Script Loading • File → Open script → Navigate to example-6-llda-learn.scala → Open

  33. LLDA Training – Data Loading • Edit script: ‘val source = CSVFile("metformin_train_data_llda.csv")’ • Text (title and abstract): ‘Columns(3,4) ~> Join(" ") ~>’ • Labels (bio-term list): ‘Column(2) ~>’

  34. LLDA Training – Model Training • Run

  35. LLDA Training – Output Generation • Output folder: llda-cvb0-bd54e9b6-176-1213c7f4-222a08a4 • description.txt: A description of the model saved in this folder • document-topic-distributions.csv: A csv file containing the per-document topic distribution for each document in the training dataset • 00000-01000: Snapshots of the model during training

  36. LLDA Training – Output Generation • /params.txt: Model parameters used during training • /tokenizer.txt: Tokenizer used to tokenize text for use with this model • /summary.txt: Human-readable summary of the topic model, with the top 20 terms per topic and how many word instances of each have occurred • /term-index.txt: Mapping from terms in the corpus to ID numbers • /description.txt: A description of the model saved in this iteration • /label-index.txt: Topics extracted after LLDA training • /topic-term-distributions.csv.gz: For each topic, the probability of each term in that topic

  37. LLDA Training – Command Line • java -Xmx4G -jar tmt-0.4.0.jar example-6-llda-learn.scala

  38. LLDA Inference – Jar Script • The TMT toolbox does not provide a script for LLDA inference • A Java program, packaged as ‘llda-infer.jar’, was written to perform LLDA inference

  39. LLDA Inference – Command Line • java -jar llda-infer.jar metformin_infer_data_llda.csv llda-cvb0-bd54e9b6-176-1213c7f4-222a08a4 metformin_infer_result.csv

  40. LLDA Inference – Output Generation • A file named metformin_infer_result.csv is generated after LLDA inference

  41. Thanks. Any questions?
