Mallet

MAchine Learning for LanguagE Toolkit

Outline
  • About MALLET
  • Representing Data
  • Command Line Processing
  • Simple Evaluation
  • Conclusion
About MALLET
  • “MALLET: A Machine Learning for Language Toolkit.”
    • written by Andrew McCallum
    • http://mallet.cs.umass.edu. 2002.
    • Implemented in Java, currently version 2.0.6
  • Motivation:
    • Text classification and information extraction
    • Commercial machine learning
    • Analysis and indexing of academic publications
About MALLET
  • Main idea
    • Text focus: data is discrete rather than continuous, even when values could be continuous
  • How to use MALLET
    • Command line scripts:
      • bin/mallet [command] --[option] [value] …
      • Text User Interface (“tui”) classes
    • Direct Java API
      • http://mallet.cs.umass.edu/api
Representations
  • Transform text documents into vectors x1, x2, …
  • The elements of a vector are called feature values
    • Example: “The feature at row 345 is the number of times ‘dog’ appears in the document”
  • Retain the meaning of vector indices
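
The idea of turning documents into count vectors while keeping the meaning of each index can be sketched in plain Python. This is only an illustration and does not use MALLET; the names `FeatureIndex` and `to_feature_vector` are hypothetical, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


class FeatureIndex:
    """Two-way map between feature names (words) and vector indices.

    Hypothetical helper for illustration; not a MALLET class.
    """

    def __init__(self):
        self._index_of = {}
        self._names = []

    def lookup(self, name, grow=True):
        """Return the index of `name`; add it if `grow` is true, else -1 if unknown."""
        if name not in self._index_of:
            if not grow:
                return -1
            self._index_of[name] = len(self._names)
            self._names.append(name)
        return self._index_of[name]

    def name(self, index):
        """Return the feature name stored at `index`."""
        return self._names[index]


def to_feature_vector(text, index, grow=True):
    """Count lowercased whitespace tokens; return a sparse {index: count} dict."""
    counts = Counter(text.lower().split())
    vector = {}
    for word, n in counts.items():
        i = index.lookup(word, grow)
        if i >= 0:  # unknown words are skipped when the index is frozen
            vector[i] = n
    return vector


index = FeatureIndex()
x1 = to_feature_vector("the dog chased the other dog", index)
x2 = to_feature_vector("the cat slept", index)
dog = index.lookup("dog", grow=False)
print(index.name(dog), x1[dog])  # the index still maps back to the word
```

Passing `grow=False` at test time keeps the index fixed, so test documents are represented with the same features the training data defined.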
Command Line
  • Importing Data
  • Classification
  • Sequence Tagging
  • Topic Modeling
Importing Data
  • One Instance per file
    • files in the folder:

sample-data/web/en or sample-data/web/de

    • command line:

bin/mallet import-dir --input sample-data/web/* --output web.mallet

  • One file, one instance per line
    • file format:

[URL] [language] [text of the page...]

    • command line:

bin/mallet import-file --input /data/web/data.txt --output web.mallet
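
To make the one-instance-per-line layout concrete, here is a minimal Python sketch that splits such a line into its three fields. It is not part of MALLET; `parse_instance_line` and the sample line are hypothetical, and the sketch assumes the fields are separated by whitespace.

```python
def parse_instance_line(line):
    """Split '[name] [label] [text...]' into (name, label, text).

    Only the first two whitespace-separated fields are split off;
    everything after them is kept as the instance text.
    """
    parts = line.strip().split(None, 2)
    if len(parts) < 3:
        raise ValueError("expected name, label and text: %r" % line)
    name, label, text = parts
    return name, label, text


# Hypothetical sample line: URL, language label, then the page text.
sample = "http://example.org/page1 en The quick brown fox\n"
print(parse_instance_line(sample))
```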

Classification
  • Training a classifier

bin/mallet train-classifier --input training.mallet --output-classifier my.classifier

  • Choosing an algorithm
    • MaxEnt, NaiveBayes, C45, DecisionTree and many others.

bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt

  • Evaluation
    • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.

bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
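
What a 90%/10% random split amounts to can be sketched in a few lines of Python. This illustrates the idea only and is not how MALLET implements `--training-portion`; the function `random_split` and its `seed` parameter are hypothetical.

```python
import random


def random_split(instances, training_portion=0.9, seed=0):
    """Shuffle a copy of `instances` and cut it into (training, testing) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]


train, test = random_split(range(100), training_portion=0.9)
print(len(train), len(test))
```

Fixing the seed makes results comparable between runs; varying it gives different random splits over which results can be averaged.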

Sequence Tagging
  • Sequence algorithms
    • hidden Markov models (HMMs)
    • linear chain conditional random fields (CRFs).
  • SimpleTagger
    • a command line interface to the MALLET Conditional Random Field (CRF) class
SimpleTagger
  • Input file: [feature1 feature2 ... featuren label]

Bill CAPITALIZED noun

slept non-noun

here LOWERCASE STOPWORD non-noun

  • Train a CRF
    • An input file “sample”
    • A trained CRF in the file "nouncrf"

java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
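
The training file layout shown above (features first, label last on each line) can be read with a short Python sketch. This is not MALLET code; `parse_tagger_lines` is a hypothetical helper, and treating blank lines as sequence boundaries is an assumption of the sketch.

```python
def parse_tagger_lines(lines):
    """Return a list of sequences; each sequence is a list of (features, label).

    On every non-blank line the last field is the label and the fields
    before it are features. Blank lines are assumed to end a sequence.
    """
    sequences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            if current:
                sequences.append(current)
                current = []
            continue
        current.append((fields[:-1], fields[-1]))
    if current:
        sequences.append(current)
    return sequences


sample = [
    "Bill CAPITALIZED noun",
    "slept non-noun",
    "here LOWERCASE STOPWORD non-noun",
]
for features, label in parse_tagger_lines(sample)[0]:
    print(label, features)
```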

SimpleTagger
  • A file “stest” that needs to be labeled

CAPITAL Al

slept

here

  • Label the input

java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest

  • Output

Number of predicates: 5

noun CAPITAL Al

non-noun slept

non-noun here

Topic Modeling
  • Building Topic Models

bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz

--input [FILE] 

--num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model.

--num-iterations [NUMBER] The number of sampling iterations. Choose it as a trade-off between the time taken to complete sampling and the quality of the topic model.

--output-state [FILENAME] This option outputs a compressed text file containing the words in the corpus with their topic assignments. 

Methodology
  • Focus on the sequence tagging module in MALLET
    • CRF-based implementation
      • Some scripts were written for importing data and evaluating results
    • Small corpora collected from the web
      • Divided into two parts: 80% for training, 20% for testing
    • Evaluate both POS tagging and Named Entity Recognition (NER)
      • Training performance
      • Accuracy (POS tagging) and precision, recall and F1, reported as FB1 (NER)
  • All scripts, corpora and results can be found here
    • http://mallet-eval.googlecode.com
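
The metrics named above can be written out as a small Python sketch. This is not the evaluation script from the project linked above; the function names are hypothetical, and representing each entity as a `(start, end, type)` tuple that must match exactly is an assumption of the sketch.

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def precision_recall_f1(gold_entities, predicted_entities):
    """Entity-level scores; an entity counts only on an exact match."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up toy data, for illustration only.
print(accuracy(["NN", "VBD", "RB"], ["NN", "VBD", "NN"]))
gold = [(0, 3, "Ni"), (5, 6, "Nh")]
predicted = [(0, 3, "Ni"), (7, 8, "Ns")]
print(precision_recall_f1(gold, predicted))
```

Accuracy suits POS tagging because every token carries a tag. For NER most tokens are outside any entity, so entity-level precision and recall are more informative.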
A Survey of Named Entity Corpora
  • Well known named entity corpora
    • Language-Independent Named Entity Recognition at CoNLL-2003
      • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
      • free and public, but requires the RCV1 raw texts as input
    • Message Understanding Conference (MUC) 6/7
      • not free
    • Automatic Content Extraction (ACE) Training Corpus
      • not free
  • Other special purpose corpora
    • Enron Email Dataset
      • email messages in this corpus are tagged with person names, dates and times.
    • A variety of biomedical corpora
      • some corpora in this collection are tagged with entities in the biomedical domain, such as gene name
Small Corpora
  • Two small corpora collected from the web
    • Penn Treebank Sample
      • English POS tagging corpus, a ~5% fragment of the Penn Treebank, (C) LDC 1995.
      • raw, tagged, parsed and combined data from the Wall Street Journal
      • 148,120 tokens, 36 standard Treebank POS tags
      • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
    • HIT CIR LTP Corpora Sample
      • integrated Chinese NER corpus
      • 10% of the whole corpus (open to the public)
      • 23,751 tokens, 7 kinds of named entities
      • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm
Environment
  • Hardware
    • CPU: Q8300 Quad Core 2.50 GHz
    • Memory: 3GB
  • Software
    • Fedora 13 x86_64
    • Java 1.6.0_18
    • MALLET 2.0.6
Data Format and Labels
  • Data Format
    • One token per row, one feature per column

Bill noun

slept non-noun

Here non-noun

  • Labels
    • Standard Treebank POS tags
      • CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural | … (36 tags in all)
    • HIT Named Entity
      • O: not an NE | S-: a single-token NE | B-: the beginning of an NE | I-: the middle of an NE | E-: the end of an NE
      • Nm: number | Ni: organization name | Ns: place name | Nh: person name | Nt: time | Nr: date | Nz: proper noun
      • Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (“United States”, “Los Angeles”, “Police Department”: one organization name)
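
The position prefixes in the HIT scheme (O, S-, B-, I-, E-) can be decoded back into entity spans with a short Python sketch. The function `decode_entities` is hypothetical and is not taken from the evaluation scripts; silently dropping ill-formed sequences (for example a B- that is never closed by an E-) is an assumption of the sketch.

```python
def decode_entities(tags):
    """Turn O / S-X / B-X / I-X / E-X tags into (start, end, type) spans.

    `end` is exclusive. Spans that are opened but not closed properly
    are dropped.
    """
    entities = []
    start, kind = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S" and label:
            entities.append((i, i + 1, label))  # single-token entity
            start, kind = None, None
        elif prefix == "B" and label:
            start, kind = i, label              # open a new entity
        elif prefix == "I" and start is not None and label == kind:
            continue                            # still inside the open entity
        elif prefix == "E" and start is not None and label == kind:
            entities.append((start, i + 1, kind))
            start, kind = None, None
        else:
            start, kind = None, None            # "O" or an ill-formed tag
    return entities


# The three-token organization name from the example above.
print(decode_entities(["B-Ni", "I-Ni", "E-Ni"]))
```

Spans in this form can be compared between gold and predicted taggings to count exact entity matches.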
Evaluation

Tasks

Stages