
How to Tag a Corpus Using Stanford Tagger


rhona

Presentation Transcript


  1. How to Tag a Corpus Using Stanford Tagger

  2. Accuracy • All tokens: 97.32% • Unknown words: 90.79%

  3. What You Need JRE: http://www.java.com/en/download/ie_manual.jsp?locale=en

  4. To make sure that Windows can find the Java compiler and interpreter:
     • Select Start -> Computer -> System Properties -> Advanced system settings -> Environment Variables -> System variables -> PATH.
     • [ In Vista, select Start -> My Computer -> Properties -> Advanced -> Environment Variables -> System variables -> PATH. ]
     • [ In Windows XP, select Start -> Control Panel -> System -> Advanced -> Environment Variables -> System variables -> PATH. ]
     • Add C:\Program Files\Java\jdk1.6.0_27\bin; to the beginning of the PATH variable. Adjust the folder name to match the Java version installed on your computer.
     • Click OK three times.

  5. Checking that Java (JRE) is installed on your computer
     • Click Start.
     • Type cmd and press Enter. This opens the Command Prompt window.
     • Type java -version and press Enter.
     • You should get a message such as: java version "1.7.0" (your version number may differ).
     If you do not get this message, Java is not installed correctly or Windows cannot find it on the PATH. Ask for help.
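
The same check can also be scripted. The following is a minimal Python sketch, not part of the original slides; the function names java_on_path and java_version_text are made up for illustration. It assumes only that the executable is called java and that the PATH has been set as described above.

```python
import shutil
import subprocess


def java_on_path():
    """Return the full path of the java executable, or None if PATH lacks it."""
    return shutil.which("java")


def java_version_text():
    """Run `java -version` and return what it prints, or None if java is missing."""
    exe = java_on_path()
    if exe is None:
        return None
    # `java -version` writes its message to stderr, so capture both streams.
    result = subprocess.run([exe, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    text = java_version_text()
    print(text if text else "Java was not found on the PATH. Ask for help.")
```

If the script prints a version message, the PATH setting worked. If it reports that Java was not found, revisit the PATH step.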

  6. Install the Stanford POS Tagger Basic English Stanford Tagger Version 3.1.3: http://nlp.stanford.edu/software/stanford-postagger-2012-07-09.tgz

  7. Installing Basic English Stanford Tagger Version 3.1.3
     • Click the link above to download the archive (a .tgz file).
     • Extract the archive to Documents using an archive manager such as WinRAR, 7-Zip, or WinZip.
     • You might want to rename the extracted folder to stanTagger. I do this because the original name is too long: stanford-postagger-2012-07-09

  8. Create a Corpus Folder
     • In the stanTagger folder, create two folders to hold your files.
     • I name them myCorpus and myTaggedCorpus.
     • Now put some text files (or your corpus) in myCorpus.
     • Make sure there are no spaces in your file names. For example, use writtenArgument.txt instead of written Argument.txt.
     • Move the stanTagger folder directly under C: so that you can find it easily.
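
If the corpus has many files, checking the names by hand gets tedious. Here is a minimal Python sketch, not part of the original slides, that lists the files whose names contain spaces and can optionally rename them in the writtenArgument.txt style. The function names and the dry_run parameter are assumptions made for this example.

```python
import os


def camel_case_name(file_name):
    """Join space-separated words, capitalising every word after the first."""
    stem, ext = os.path.splitext(file_name)
    words = stem.split()
    if not words:
        return file_name
    joined = words[0] + "".join(w[:1].upper() + w[1:] for w in words[1:])
    return joined + ext


def rename_files_with_spaces(folder, dry_run=True):
    """Return (old, new) name pairs; rename on disk only when dry_run is False."""
    changes = []
    for name in sorted(os.listdir(folder)):
        if " " not in name:
            continue
        new_name = camel_case_name(name)
        target = os.path.join(folder, new_name)
        if os.path.exists(target):
            # Never overwrite an existing file; leave this one for manual review.
            continue
        changes.append((name, new_name))
        if not dry_run:
            os.rename(os.path.join(folder, name), target)
    return changes
```

Calling rename_files_with_spaces(r"C:\stanTagger\myCorpus") only reports what would change. Pass dry_run=False once the proposed names look right.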

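  9. Tagging Files
     • Open the Command Prompt as described above.
     • Go to C: by typing the command cd.. twice. This assumes the prompt starts two levels below C:\, for example in C:\Users\yourName; typing cd \ goes to the root from any folder.
     • Enter the stanTagger folder by typing cd stanTagger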

  10. Tagging files
     • To run the Stanford Tagger on every file automatically, we need to do some programming.
     • We can do this with Perl or other programming languages, such as Java, PHP, or Python.
     • However, I found scripting the Command Prompt to be the simplest option, and I will share the code I prepared.

  11. Tagging files
     • Code to be used in Command Prompt:
       FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\left3words-wsj-0-18.tagger myCorpus\%~nxa > myTaggedCorpus\%~nxa
     • You can simply copy the above code and paste it in the Command Prompt.
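
In the FOR command, %a holds the full path of each matching file, and the modifiers in %~nxa reduce it to the file name (n) plus extension (x). That is why the same expression can be reused for both the input and the output folder. The small Python sketch below is an illustration added here, not part of the original slides, and the example path is an assumption. It uses the standard ntpath module to show what the expansion produces.

```python
import ntpath

# What %a would hold for one file matched by C:\stanTagger\myCorpus\*.txt
full_path = r"C:\stanTagger\myCorpus\writtenArgument.txt"

# %~nxa keeps only the file name and its extension.
name_and_ext = ntpath.basename(full_path)

# The two arguments the FOR command builds from it.
input_arg = "myCorpus\\" + name_and_ext
output_arg = "myTaggedCorpus\\" + name_and_ext

print(input_arg)
print(output_arg)
```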

  12. New Code! • FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\wsj-0-18-left3words.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa

  13. Newest Code! • FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\english-left3words-distsim.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa

  14. Each file may take about 2-3 seconds to tag. When the command finishes, the myTaggedCorpus folder contains the tagged files.
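
As mentioned on slide 10, the same loop can be written in other languages. The following is a minimal Python sketch of the Command Prompt loop, not the author's code. It assumes the tagger can be started with java and the class edu.stanford.nlp.tagger.maxent.MaxentTagger from stanford-postagger.jar, as described in the README shipped with the tagger; check the README of your version before relying on it. The names build_command and tag_corpus and the runner parameter are made up for this example.

```python
import subprocess
from pathlib import Path

# Model name taken from the newest command in the slides.
MODEL = "models/english-left3words-distsim.tagger"


def build_command(model, text_file):
    """Build the java call for one file (assumed from the tagger's README)."""
    return [
        "java", "-mx300m",
        "-classpath", "stanford-postagger.jar",
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", str(model),
        "-textFile", str(text_file),
    ]


def tag_corpus(tagger_dir, model=MODEL, runner=subprocess.run):
    """Tag every .txt file in myCorpus and write the result to myTaggedCorpus."""
    tagger_dir = Path(tagger_dir).resolve()
    out_dir = tagger_dir / "myTaggedCorpus"
    out_dir.mkdir(exist_ok=True)
    tagged = []
    for text_file in sorted((tagger_dir / "myCorpus").glob("*.txt")):
        target = out_dir / text_file.name
        # Redirect the tagger's standard output into the target file,
        # like the ">" in the Command Prompt version.
        with open(target, "wb") as out:
            runner(build_command(model, text_file),
                   cwd=tagger_dir, stdout=out, check=True)
        tagged.append(target)
    return tagged
```

Calling tag_corpus(r"C:\stanTagger") would process the folder layout used in these slides. The runner parameter exists only so the loop can be tried out without Java installed.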
