Resources for multilingual processing

Georgiana Puşcaşu

University of Wolverhampton, UK


Outline

  • Motivation and goals

  • NLP Methods, Resources and Applications

    • Text Segmentation

    • Part of Speech Tagging

    • Stemming

    • Lemmatization

    • Syntactic Parsing

    • Named Entity Recognition

    • Term Extraction and Terminology Data Management Tools

    • Text Summarization

    • Language Identification

    • Statistical Language Modeling Toolkits

    • Corpora

  • Conclusions


Motivation and goals

Motivation

  • Most NLP research and resources deal with English

  • The Web is multilingual, and ideally the current NLP state of the art should be attained for all languages

    Goals

  • To present already-available text processing methods that can support multilingual NLP

  • To offer an inventory of existing tools and resources that can be exploited to avoid reinventing the wheel


Text Segmentation

  • Electronic text is essentially just a sequence of characters

  • Before any real processing, text needs to be segmented

  • Text segmentation involves

    • Low-level text segmentation (performed at the initial stages of text processing)

      • Tokenization

      • Sentence splitting

    • High-level text segmentation

      • Intra-sentential: segmentation of linguistic groups such as Named Entities, Noun Phrases, splitting sentences into clauses

      • Inter-Sentential: grouping sentences and paragraphs into discourse topics


Tokenization

  • Tokenization is the process of segmenting text into linguistic units such as words, punctuation, numbers, alphanumerics, etc.

  • It is normally the first step in the majority of text processing applications

  • Tokenization in languages that are:

    • segmented: is considered a relatively easy and uninteresting part of text processing (words delimited by blank spaces and punctuation)

    • non-segmented: is more challenging (no explicit boundaries between words)


Tokenization in segmented languages

  • Segmented languages: all modern languages that use a Latin-, Cyrillic- or Greek-based writing system

  • Traditionally, tokenization rules are written using regular expressions

  • Problems:

    • Abbreviations: solved by lists of abbreviations (pre-compiled or automatically extracted from a corpus), guessing rules

    • Hyphenated words: “one word or two?”

    • Numerical and special expressions (Email addresses, URLs, telephone numbers, etc.) are handled by specialized tokenizers (preprocessors)

    • Apostrophe: (they’re => they + ‘re; don’t => do + n’t) solved by language-specific rules
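The rules above can be sketched as a small regular-expression tokenizer; the abbreviation list and clitic patterns below are illustrative placeholders, not taken from any of the tools listed later.

```python
import re

# Illustrative abbreviation list (pre-compiled lists or corpus extraction
# would supply this in a real system)
ABBREVIATIONS = {"Mr.", "Dr.", "etc."}

def tokenize(text):
    tokens = []
    # Language-specific clitic rules: they're -> they + 're, don't -> do + n't
    text = re.sub(r"n't\b", " n't", text)
    text = re.sub(r"'(re|ve|ll|s|d|m)\b", r" '\1", text)
    for chunk in text.split():
        # Keep known abbreviations (with their period) as single tokens
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Otherwise separate trailing punctuation from the word
        m = re.match(r"^(\w[\w-]*)([.,!?;:]*)$", chunk)
        if m:
            tokens.append(m.group(1))
            tokens.extend(m.group(2))
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("Mr. Smith doesn't like e-mail, etc."))
```

Note how the hyphenated word “e-mail” survives as one token while the clitic in “doesn't” is split, matching the problem cases above.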


Tokenization in non-segmented languages

  • Non-segmented languages: mainly East and South-East Asian languages (e.g. Chinese, Japanese, Thai)

  • Problems:

    • tokens are written directly adjacent to each other

    • almost all characters can be one-character words by themselves, but can also form multi-character words

  • Solutions:

    • Pre-existing lexico-grammatical knowledge

    • Machine learning employed to extract segmentation regularities from pre-segmented data

    • Statistical methods: character n-grams
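The dictionary-based solution can be illustrated with a greedy longest-match (maximum matching) segmenter; the toy lexicon and the 4-character window are assumptions of this sketch, and real systems combine a large lexicon with statistical models.

```python
# Toy lexicon for illustration only
LEXICON = {"北京", "大学", "北京大学", "生", "学生"}

def segment(text, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest dictionary match starting at position i;
        # fall back to a single character when nothing matches
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in LEXICON:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("北京大学生"))
```

The greedy strategy commits to the longest match, which is exactly why ambiguous strings like this one (“Beijing University student(s)”) motivate the statistical methods above.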


Tokenizers (1)

ALEMBIC

Author(s): M. Vilain, J. Aberdeen, D. Day, J. Burger, The MITRE Corporation

Purpose: Alembic is a multi-lingual text processing system. Among other tools, it incorporates tokenizers for: English, Spanish, Japanese, Chinese, French, Thai.

Access: Free by contacting [email protected]

ELLOGON

Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece

Purpose: Ellogon is a multi-lingual, cross-platform, general-purpose language engineering environment. One of the provided components that can be adapted to various languages can perform tokenization. Supported languages: Unicode.

Access: Free at http://www.ellogon.org/

GATE (General Architecture for Text Engineering)

Author(s): NLP Group, University of Sheffield, UK

Access: Free but requires registration at http://gate.ac.uk/

HEART Of GOLD

Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany

Purpose: Supported language: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.

Access: Free at http://heartofgold.dfki.de/


Tokenizers (2)

LT TTT

Author(s): Language Technology Group, University of Edinburgh, UK

Purpose: LT TTT is a text tokenization system and toolset which enables users to produce a swift and individually tailored tokenization of text.

Access: Free at http://www.ltg.ed.ac.uk/software/ttt/

MXTERMINATOR

Author(s): Adwait Ratnaparkhi

Platforms: Platform independent

Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

QTOKEN

Author(s): Oliver Mason, Birmingham University, UK

Platforms: Platform independent

Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtoken.html

SProUT

Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany

Purpose: SProUT provides tokenization for Unicode, Spanish, Japanese, German, French, English, Chinese.

Access: Not free. More information at http://sprout.dfki.de/


Tokenizers (3)

THE QUIPU GROK LIBRARY

Author(s): Gann Bierner and Jason Baldridge, University of Edinburgh, UK

Access: Free at https://sourceforge.net/project/showfiles.php?group_id=4083

TWOL

Author(s): Lingsoft

Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish.

Access: Not free. More information at http://www.lingsoft.fi/


Sentence splitting

  • Sentence splitting is the task of segmenting text into sentences

  • In the majority of cases it is a simple task: . ? ! usually signal a sentence boundary

  • However, a period does not always signal a sentence break: it may denote a decimal point or be part of an abbreviation

  • The simplest algorithm is known as ‘period-space-capital letter’ (not very good performance). It can be improved with lists of abbreviations, a lexicon of frequent sentence-initial words and/or machine learning techniques
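A minimal sketch of the ‘period-space-capital letter’ algorithm improved with an abbreviation list; the entries in the list are illustrative.

```python
# Illustrative abbreviation list
ABBREVIATIONS = {"Mr.", "Dr.", "etc.", "e.g."}

def split_sentences(text):
    sentences, current = [], []
    words = text.split()
    for i, token in enumerate(words):
        current.append(token)
        next_capital = i + 1 < len(words) and words[i + 1][0].isupper()
        # Split when the token ends with terminal punctuation, is not a known
        # abbreviation, and the next word (if any) starts with a capital letter
        if token[-1] in ".?!" and token not in ABBREVIATIONS \
                and (i + 1 == len(words) or next_capital):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He left."))
```

Without the abbreviation list, the heuristic would wrongly break after “Dr.”.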



Part of Speech (POS) Tagging

  • POS Tagging is the process of assigning a part-of-speech or lexical class marker to each word in a corpus (Jurafsky and Martin)

  • Example (words mapped to tags): The/DET couple/N spent/V the/DET honeymoon/N on/P a/DET yacht/N


POS Tagger Prerequisites

  • Lexicon of words

  • For each word in the lexicon, information about all its possible tags according to a chosen tagset

  • Different methods for choosing the correct tag for a word:

    • Rule-based methods

    • Statistical methods

    • Transformation Based Learning (TBL) methods


POS Tagger Prerequisites: Lexicon of words

  • Classes of words

    • Closed classes: a fixed set

      • Prepositions: in, by, at, of, …

      • Pronouns: I, you, he, her, them, …

      • Particles: on, off, …

      • Determiners: the, a, an, …

      • Conjunctions: or, and, but, …

      • Auxiliary verbs: can, may, should, …

      • Numerals: one, two, three, …

    • Open classes: new ones are created all the time, so it is not possible for all words from these classes to appear in the lexicon

      • Nouns

      • Verbs

      • Adjectives

      • Adverbs


POS Tagger Prerequisites Tagsets

  • To do POS tagging, one needs to choose a standard set of tags to work with

  • A tagset is normally sophisticated and linguistically well grounded

  • One could pick a very coarse tagset

    • N, V, Adj, Adv.

  • A more commonly used set is finer grained: the “UPenn TreeBank tagset”, with 48 tags

  • Even more fine-grained tagsets exist


POS Tagger Prerequisites: Tagset example – UPenn tagset


POS Tagging Rule based methods

  • Start with a dictionary

  • Assign all possible tags to words from the dictionary

  • Write rules by hand to selectively remove tags

  • Leaving the correct tag for each word


POS Tagging Statistical methods (1)

The Most Frequent Tag Algorithm

  • Training

    • Take a tagged corpus

    • Create a dictionary containing every word in the corpus together with all its possible tags

    • Count the number of times each tag occurs for a word and compute the probability P(tag|word); then save all probabilities

  • Tagging

    • Given a new sentence, for each word, pick the most frequent tag for that word from the corpus
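The training and tagging steps above can be sketched on a toy tagged corpus; the corpus, tags and the unknown-word fallback are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy tagged corpus: (word, tag) pairs
tagged_corpus = [("the", "DT"), ("race", "NN"), ("ended", "VBD"),
                 ("they", "PRP"), ("race", "VB"), ("the", "DT"),
                 ("race", "NN"), ("car", "NN")]

# Training: count how often each tag occurs for each word
counts = defaultdict(Counter)
for word, tag_ in tagged_corpus:
    counts[word][tag_] += 1

def tag(sentence):
    # Tagging: pick the most frequent tag seen for each word;
    # unknown words default to NN (a common heuristic, assumed here)
    return [(w, counts[w].most_common(1)[0][0] if counts[w] else "NN")
            for w in sentence]

print(tag(["the", "race"]))
```

Here “race” is tagged NN everywhere because NN outnumbers VB in the toy corpus, which is exactly the weakness the bigram HMM tagger addresses next.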


POS Tagging Statistical methods (2)

Bigram HMM Tagger

  • Training

    • Create a dictionary containing every word in the corpus together with all its possible tags

    • Compute the probability of each tag generating a certain word, compute the probability each tag is preceded by a specific tag (Bigram HMM Tagger => probability is dependent only on the previous tag)

  • Tagging

    • Given a new sentence, for each word, pick the most likely tag for that word using the parameters obtained after training

    • HMM Taggers choose the tag sequence that maximizes this formula: P(word|tag) * P(tag|previous tag)


Bigram HMM Tagging: Example

People/NNS are/VBP expected/VBN to/TO queue/VB at/IN the/DT registry/NN

The/DT police/NN is/VBZ to/TO blame/VB for/IN the/DT queue/NN

  • to/TO queue/???  the/DT queue/???

  • t_k = argmax_k P(t_k|t_{k-1}) * P(w_i|t_k)

    • i = index of the word in the sequence, k = index among the possible tags for the word “queue”

  • How do we compute P(t_k|t_{k-1})?

    • count(t_{k-1} t_k) / count(t_{k-1})

  • How do we compute P(w_i|t_k)?

    • count(w_i t_k) / count(t_k)

  • max[P(VB|TO)*P(queue|VB) , P(NN|TO)*P(queue|NN)]

  • From the corpus:

    • P(NN|TO) = 0.021, P(queue|NN) = 0.00041 => 0.021 * 0.00041 ≈ 0.0000086

    • P(VB|TO) = 0.34, P(queue|VB) = 0.00003 => 0.34 * 0.00003 ≈ 0.0000102 => VB wins
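The disambiguation step can be computed directly from the probabilities on this slide; the dictionary encoding is an assumption of the sketch.

```python
# Probabilities from the slide's corpus
p_tag_given_prev = {("TO", "NN"): 0.021, ("TO", "VB"): 0.34}
p_word_given_tag = {("queue", "NN"): 0.00041, ("queue", "VB"): 0.00003}

def best_tag(prev_tag, word, candidates):
    # argmax over P(tag|previous tag) * P(word|tag)
    return max(candidates,
               key=lambda t: p_tag_given_prev[(prev_tag, t)]
                             * p_word_given_tag[(word, t)])

print(best_tag("TO", "queue", ["NN", "VB"]))  # VB: 0.34*0.00003 > 0.021*0.00041
```

A full HMM tagger runs this argmax jointly over the whole sentence (Viterbi); this sketch shows only the single local decision for “to queue”.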


POS Tagging Transformation Based Tagging (1)

  • Combination of rule-based and stochastic tagging methodologies

    • Like rule-based because rule templates are used to learn transformations

    • Like the stochastic approach because machine learning is used, with a tagged corpus as input

  • Input:

    • tagged corpus

    • lexicon (with all possible tags for each word)


POS Tagging Transformation Based Tagging (2)

  • Basic Idea:

    • Set the most probable tag for each word as a start value

    • Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order

  • Training is done on tagged corpus:

    1. Write a set of rule templates

    2. Among the rules instantiated from the templates, find the one with the highest score

    3. Repeat from 2 until the best score falls below a threshold

    4. Keep the ordered set of rules

  • Rules make errors that are corrected by later rules


Transformation Based Tagging Example

  • Tagger labels every word with its most-likely tag

    • For example: race has the following probabilities in the Brown corpus:

      • P(NN|race) = 0.98

      • P(VB|race)= 0.02

  • Transformation rules make changes to tags

    • “Change NN to VB when previous tag is TO”:

      … is/VBZ expected/VBN to/TO race/NN tomorrow/NN

      becomes

      … is/VBZ expected/VBN to/TO race/VB tomorrow/NN
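Applying one learned transformation of this type can be sketched as follows; the rule encoding is an assumption of the sketch, not Brill's actual data structures.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Apply 'change from_tag to to_tag when the previous tag is prev_tag'."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, t = out[i]
        if t == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
# The rule from the slide: change NN to VB when the previous tag is TO
print(apply_rule(sentence, "NN", "VB", "TO"))
```

Only “race” is retagged: “tomorrow” keeps NN because its previous tag is not TO.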


POS Taggers (1)

ACOPOST

Author(s): Jochen Hagenstroem, Kilian Foth, Ingo Schröder, Parantu Shah

Purpose: ACOPOST is a collection of POS taggers. It implements and extends well-known machine learning techniques and provides a uniform environment for testing.

Platforms: All POSIX (Linux/BSD/UNIX-like OSes)

Access: Free at http://sourceforge.net/projects/acopost/

BRILL’S TAGGER

Author(s): Eric Brill

Purpose: Transformation Based Learning POS Tagger

Access: Free at http://www.cs.jhu.edu/~brill

fnTBL

Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA

Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection). It is currently trained for English and Swedish.

Platforms: Linux, Solaris, Windows

Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/


POS Taggers (2)

LINGSOFT

Author(s): LINGSOFT, Finland

Purpose: Among the services offered by Lingsoft one can find POS taggers for Danish, English, German, Norwegian, Swedish.

Access: Not free. Demos at http://www.lingsoft.fi/demos.html

LT POS (LT TTT)

Author(s): Language Technology Group, University of Edinburgh, UK

Purpose: The LT POS part of speech tagger uses a Hidden Markov Model disambiguation strategy. It is currently trained only for English.

Access: Free but requires registration at http://www.ltg.ed.ac.uk/software/pos/index.html

MACHINESE PHRASE TAGGER

Author(s): Connexor

Purpose: Machinese Phrase Tagger is a set of program components that perform basic linguistic analysis tasks at very high speed and provide relevant information about words and concepts to volume-intensive applications. Available for: English, French, Spanish, German, Dutch, Italian, Finnish.

Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/


POS Taggers (3)

MXPOST

Author(s): Adwait Ratnaparkhi

Purpose: MXPOST is a maximum entropy POS tagger. The downloadable version includes a Wall St. Journal tagging model for English, but can also be trained for different languages.

Platforms: Platform independent

Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

MEMORY BASED TAGGER

Author(s): ILK - Tilburg University, CNTS - University of Antwerp

Purpose: Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL.

Access: Usable by email or on the Web at http://ilk.uvt.nl/software.html#mbt

µ-TBL

Author(s): Torbjörn Lager

Purpose: The µ-TBL system is a powerful environment in which to experiment with transformation-based learning.

Platforms: Windows

Access: Free at http://www.ling.gu.se/~lager/mutbl.html


POS Taggers (4)

QTAG

Author(s): Oliver Mason, Birmingham University, UK

Purpose: QTag is a probabilistic parts-of-speech tagger. Resource files for English and German can be downloaded together with the tool.

Platforms: Platform independent

Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtag.html

STANFORD POS TAGGER

Author(s): Kristina Toutanova, Stanford University, USA

Purpose: The Stanford POS tagger is a log-linear tagger written in Java. The downloadable package includes components for command-line invocation and a Java API both for training and for running a trained tagger.

Platforms: Platform independent

Access: Free at http://nlp.stanford.edu/software/tagger.shtml

SVM TOOL

Author(s): TALP Research Center, Technical University of Catalonia (UPC), Spain

Purpose: The SVMTool is a simple and effective part-of-speech tagger based on Support Vector Machines. The SVMLight software implementation of Vapnik's Support Vector Machine by Thorsten Joachims has been used to train the models for Catalan, English and Spanish.

Access: Free. SVMTool at http://www.lsi.upc.es/~nlp/SVMTool/ and SVMLight at http://svmlight.joachims.org/


POS Taggers (5)

TnT

Author(s): Thorsten Brants, Saarland University, Germany

Purpose: TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second order Markov models. TnT comes with two language models, one for German, and one for English.

Platforms: Platform independent.

Access: Free but requires registration at http://www.coli.uni-saarland.de/~thorsten/tnt/

TREETAGGER

Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany

Purpose: The TreeTagger has been successfully used to tag German, English, French, Italian, Spanish, Greek and Old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.

Access: Free at

http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html


POS Taggers (6)

Xerox XRCE MLTT Part Of Speech Taggers

Author(s): Xerox Research Centre Europe

Purpose:Xerox has developed morphological analysers and part-of-speech disambiguators for various languages including Dutch, English, French, German, Italian, Portuguese, Spanish. More recent developments include Czech, Hungarian, Polish and Russian.

Access: Not free. Demos at

http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html

YAMCHA

Author(s): Taku Kudo

Purpose: YamCha is a generic, customizable, open-source text chunker oriented towards many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995, and is the same system that performed best in the CoNLL-2000 shared task on chunking and base NP chunking.

Platforms: Linux, Windows

Access: Free at http://www2.chasen.org/~taku/software/yamcha/


Stemming

  • Stemmers are used in IR to reduce as many related words and word forms as possible to a common canonical form – not necessarily the base form – which can then be used in the retrieval process.

  • Frequently, the performance of an IR system will be improved if term groups such as: CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS are conflated into a single term (by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT). The suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.


The Porter Stemmer

  • A conflation stemmer developed by Martin Porter at the University of Cambridge in 1980

  • Idea: the English suffixes (approximately 1200) are mostly made up of a combination of smaller and simpler suffixes

  • Can be adapted to other languages (needs a list of suffixes and context sensitive rules)
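A toy suffix-stripping stemmer in the spirit of Porter's approach can be sketched as below; the real algorithm applies context-sensitive rules in several ordered steps, so this is only an illustration of the idea, with an invented rule list and length guard.

```python
# Illustrative suffix rules, tried longest-first; Porter's real algorithm
# uses measure-based conditions instead of a simple length check
SUFFIX_RULES = [("ions", ""), ("ion", ""), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]:
    print(stem(w))
```

All five forms from the slide above conflate to the single stem “connect”, which is the behaviour the IR motivation describes.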


Stemmers (1)

ELLOGON

Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece

Access: Free at http://www.ellogon.org/

FSA

Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland

Purpose: Supported languages: German, English, French, Polish.

Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

HEART Of GOLD

Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany

Purpose: Supported language: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.

Access: Free at http://heartofgold.dfki.de/


Stemmers (2)

LANGSUITE

Author(s): PetaMem

Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech.

Access: Not free. More information at http://www.petamem.com/

SNOWBALL

Purpose: Presentation of stemming algorithms, and Snowball stemmers, for English, Russian, Romance languages (French, Spanish, Portuguese and Italian), German, Dutch, Swedish, Norwegian, Danish and Finnish.

Access: Free at http://www.snowball.tartarus.org/

SProUT

Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany

Purpose: Available for: Unicode, Spanish, Japanese, German, French, English, Chinese

Access: Not free. More information at http://sprout.dfki.de/

TWOL

Author(s): Lingsoft

Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish

Access: Not free. More information at http://www.lingsoft.fi/


Lemmatization

  • The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms COME, COMES, COMING, CAME under the base form COME

  • Dictionary based

    • Input: token + pos

    • Output: lemma

  • Note: needs POS information

  • Example:

    • left+v -> leave; left+a -> left

  • Lemmatization can also be seen as finding the transformation to apply to a word to obtain its normalized form (which suffix should be removed and/or added), so it can be modeled as a machine learning problem
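A dictionary-based lemmatizer reduces to a lookup on (token, POS) pairs, matching the left+v -> leave / left+a -> left example above; the dictionary entries are illustrative.

```python
# Illustrative lemma dictionary, keyed on (token, POS)
LEMMA_DICT = {
    ("left", "v"): "leave",
    ("left", "a"): "left",
    ("came", "v"): "come",
    ("comes", "v"): "come",
}

def lemmatize(token, pos):
    # Fall back to the lowercased token when the pair is not in the dictionary
    return LEMMA_DICT.get((token.lower(), pos), token.lower())

print(lemmatize("left", "v"))  # leave
print(lemmatize("left", "a"))  # left
```

The example makes the POS requirement concrete: without the tag, “left” cannot be lemmatized reliably.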


Lemmatizers (1)

CONNEXOR LANGUAGE ANALYSIS TOOLS

Author(s): Connexor, Finland

Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian, Finnish.

Access: Not free. Demos at http://www.conexor.fi/

ELLOGON

Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece

Access: Free at http://www.ellogon.org/

FSA

Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland

Purpose: Supported languages: German, English, French, Polish.

Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

MBLEM

Author(s): ILK Research Group, Tilburg University

Purpose: MBLEM is a lemmatizer for English, German, and Dutch.

Access: Demo at http://ilk.uvt.nl/mblem/


Lemmatizers (2)

SWESUM

Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB

Purpose: Supported languages: Swedish, Spanish, German, French, English

Access: Free at http://www.euroling.se/produkter/swesum.html

TREETAGGER

Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany

Purpose: The TreeTagger has been successfully used for German, English, French, Italian, Spanish, Greek and Old French texts and is easily adaptable to other languages if a lexicon is available.

Access: Free at

http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

TWOL

Author(s): Lingsoft

Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish

Access: Not free. More information at http://www.lingsoft.fi/


Syntactic Parsing

  • Syntax refers to the way words are arranged together and the relationships between them

  • Parsing is the process of using a grammar to assign a syntactic analysis to a string of words

  • Approaches:

    • Shallow Parsing

    • Dependency Parsing

    • Context-Free Parsing


Shallow Parsing

  • Partition the input into a sequence of non-overlapping units, or chunks: each chunk is a sequence of words labelled with a syntactic category, possibly with a marking to indicate which word is the head of the chunk

  • How?

    • Set of regular expressions over POS labels

    • Training the chunker on manually marked up text
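The regular-expression approach can be sketched as a scanner for the pattern DET? ADJ* N+ over POS labels; the tag names and the pattern itself are illustrative assumptions.

```python
def np_chunks(tagged):
    """Extract NP chunks matching DET? ADJ* N+ from (word, tag) pairs."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DET":   # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":  # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "N":    # one or more nouns
            k += 1
        if k > j:  # at least one noun: emit the chunk
            chunks.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

sent = [("the", "DET"), ("old", "ADJ"), ("car", "N"), ("left", "V"),
        ("a", "DET"), ("race", "N")]
print(np_chunks(sent))
```

Trained chunkers learn such patterns from marked-up text instead of hand-writing them.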


Dependency Parsing

  • Based on dependency grammars, where a syntactic analysis takes the form of a set of head-modifier dependency links between words, each link labelled with the grammatical function of the modifying word with respect to the head

  • Parser first labels each word with all possible function types and then applies handwritten rules to introduce links between specific types and remove other function-type readings


Context-Free (CF) Parsing

  • CF parsing algorithms form the basis for almost all approaches to parsing that build hierarchical phrase structure

  • CFG Example:

    • S -> NP VP

    • NP -> Det NOMINAL

    • NOMINAL -> Noun

    • VP -> Verb

    • Det -> a

    • Noun -> flight

    • Verb -> left

  • A derivation is a sequence of rules applied to a string that accounts for that string (derivation tree)

  • Parsing is the process of taking a string and a grammar and returning one or more parse trees for that string

  • Treebanks = Parsed corpora in the form of trees


    Probabilistic CFGs

    • Assigning probabilities to parse trees

      • Attach probabilities to grammar rules

      • The expansions for a given non-terminal sum to 1

      • A derivation (tree) consists of the set of grammar rules that are in the tree

      • The probability of a tree is just the product of the probabilities of the rules in the derivation.

    • Needed: grammar, dictionary with POS, parser

    • Task is to find the max probability tree for an input
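Computing the probability of a tree as the product of its rule probabilities can be sketched over the toy grammar above; the probability values attached to the lexical rules are invented for illustration.

```python
# Rule probabilities: expansions of each non-terminal must sum to 1
# (trivially true here; the lexical probabilities are illustrative)
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "NOMINAL")): 1.0,
    ("NOMINAL", ("Noun",)): 1.0,
    ("VP", ("Verb",)): 1.0,
    ("Det", ("a",)): 0.5,
    ("Noun", ("flight",)): 0.1,
    ("Verb", ("left",)): 0.2,
}

def tree_prob(tree):
    # tree = (label, [children]); leaves are plain strings
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, child_labels)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

t = ("S", [("NP", [("Det", ["a"]), ("NOMINAL", [("Noun", ["flight"])])]),
           ("VP", [("Verb", ["left"])])])
print(tree_prob(t))  # 0.5 * 0.1 * 0.2 = 0.01 (the unit-probability rules contribute 1)
```

A probabilistic parser then searches for the tree maximizing this product for a given input.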


    Noun Phrase (NP) Chunkers

    fnTBL

    Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA

    Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection, word sense disambiguation). It is currently trained for English and Swedish.

    Platforms: Linux, Solaris, Windows

    Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

    YAMCHA

    Author(s): Taku Kudo

    Purpose: YamCha is a generic, customizable, open-source text chunker oriented towards many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995, and is the same system that performed best in the CoNLL-2000 shared task on chunking and base NP chunking.

    Platforms: Linux, Windows

    Access: Free at http://www2.chasen.org/~taku/software/yamcha/


    Syntactic parsers

    MACHINESE PHRASE TAGGER

    Author(s): Connexor

    Purpose: Machinese Phrase Tagger is a set of program components that perform basic linguistic analysis tasks at very high speed and provide relevant information about words and concepts to volume-intensive applications. Available for: English, French, Spanish, German, Dutch, Italian, Finnish.

    Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/


    Named Entity Recognition

    • Identification of proper names in texts, and their classification into a set of predefined categories of interest:

      • entities: organizations, persons, locations

      • temporal expressions: time, date

      • quantities: monetary values, percentages, numbers

    • Two kinds of approaches

    Knowledge Engineering

    • rule based

    • developed by experienced language engineers

    • make use of human intuition

    • small amount of training data

    • very time consuming

    • some changes may be hard to accommodate

    Learning Systems

    • use statistics or other machine learning

    • developers do not need LE expertise

    • require large amounts of annotated training data

    • some changes may require re-annotation of the entire training corpus


    Named Entity Recognition: Knowledge engineering approach

    • identification of named entities in two steps:

      • recognition patterns expressed as WFSA (Weighted Finite-State Automaton) are used to identify phrases containing potential candidates for named entities (longest match strategy)

      • additional constraints (depending on the type of candidate) are used for validating the candidates

    • usage of an on-line base lexicon for geographical names and first names


    Named Entity Recognition: Problems

    • Variation of NEs, e.g. John Smith, Mr. Smith, John

    • Since named entities may appear without designators (companies, persons), a dynamic lexicon for storing such named entities is used

      Example:

      “Mars Ltd is a wholly-owned subsidiary of Food Manufacturing Ltd, a non-trading company registered in England. Mars is controlled by members of the Mars family.”

    • Resolution of type ambiguity using the dynamic lexicon:

      If an expression can be a person name or a company name (Martin Marietta Corp.), then use the type of the last entry inserted into the dynamic lexicon to make the decision.

    • Issues of style, structure, domain, genre etc.

    • Punctuation, spelling, spacing, formatting
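The dynamic-lexicon mechanism from the Mars example can be sketched as follows; the designator list and entity-type labels are assumptions of the sketch, not the actual system's rules.

```python
# Illustrative company designators
COMPANY_DESIGNATORS = {"Ltd", "Corp.", "Inc."}

def recognize(tokens):
    dynamic_lexicon = {}
    entities = []
    for i, tok in enumerate(tokens):
        if tok in COMPANY_DESIGNATORS and i > 0:
            # A name followed by a designator: store it in the dynamic lexicon
            name = tokens[i - 1]
            dynamic_lexicon[name] = "COMPANY"
            entities.append((name + " " + tok, "COMPANY"))
        elif tok in dynamic_lexicon:
            # A bare later mention is resolved via the dynamic lexicon
            entities.append((tok, dynamic_lexicon[tok]))
    return entities

print(recognize("Mars Ltd is a subsidiary . Mars is controlled".split()))
```

The first mention “Mars Ltd” seeds the lexicon, so the later bare “Mars” is typed as a company; as the slide's full example shows, “the Mars family” would still be misclassified, which is exactly the ambiguity problem listed above.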


    Named Entity Recognizers (1)

    ELLOGON

    Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece

    Purpose: Available for Unicode.

    Access: Free at http://www.ellogon.org/

    HEART Of GOLD

    Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany

    Purpose: Supported language: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.

    Access: Free at http://heartofgold.dfki.de/

    INSIGHT DISCOVERER EXTRACTOR

    Author(s): TEMIS

    Purpose: Supported language: Spanish, Russian, Portuguese, Polish, Italian, Hungarian, Greek, German, French, English, Dutch, Czech.

    Access: Not free. More information at http://www.temis-group.com/


    Named Entity Recognizers (2)

    LINGPIPE

    Author(s): Bob Carpenter, Breck Baldwin, Alias-I

    Purpose: Supported languages: Unicode, Spanish, German, French, English, Dutch.

    Access: Free at http://www.alias-i.com/lingpipe/

    YAMCHA

    Author(s): Taku Kudo

    Purpose: YamCha is a generic, customizable, open-source text chunker oriented towards many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995, and is the same system that performed best in the CoNLL-2000 shared task on chunking and base NP chunking.

    Platforms: Linux, Windows

    Access: Free at http://www2.chasen.org/~taku/software/yamcha/


    Automatic term extraction

    • Terms = linguistic labels of concepts

    • Concepts = units of thought (a vague definition): if a term represents a unit of thought, its appearance in textual data has to be statistically significant; otherwise, the “unit” nature of the concept the term represents is in question.

    • Label: different labels can be used for the same concept, and the same label can be used for different concepts.


    Automatic term extraction

    • Lexico-syntactic approaches use lexical and syntactic patterns:

      • domain-specific prefixes and suffixes (e.g. formaldehyde)

      • part-of-speech sequences (AN; NN; AAN; ANN; NAN; NNN; NPN), or more general patterns such as ((A|N)+|((A|N)*(N|P)?)(A|N)*)N

      • cue word or phrase (immediate left/right contexts)

    • Statistical approaches: different statistical measures:

      • Frequency, relative frequency, tf.idf etc. (for the whole unit)

      • Mutual information; t-score; z-score; etc. (collocation measurement)

      • C-value: combine both internal and external statistical measures.
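As an illustration, a minimal sketch combining the two families of approaches: candidate phrases are filtered by the POS patterns listed above and then ranked by raw frequency, the simplest of the statistical measures. The tagged input, the single-letter POS codes and the function names are hypothetical; a real system would consume the output of a POS tagger.

```python
import re
from collections import Counter

# Hypothetical (token, POS) pairs; a real pipeline would take tagger output.
tagged = [
    ("the", "D"), ("neural", "A"), ("network", "N"), ("model", "N"),
    ("uses", "V"), ("a", "D"), ("hidden", "A"), ("layer", "N"),
    ("and", "C"), ("the", "D"), ("neural", "A"), ("network", "N"),
    ("model", "N"), ("converges", "V"),
]

# Accept the POS sequences listed above (AN, NN, AAN, ANN, NAN, NNN, NPN).
PATTERNS = re.compile(r"^(AN|NN|AAN|ANN|NAN|NNN|NPN)$")

def term_candidates(tagged):
    """Count word sequences whose POS sequence matches a term pattern."""
    counts = Counter()
    for i in range(len(tagged)):
        for n in (2, 3):
            window = tagged[i:i + n]
            if len(window) < n:
                continue
            pos_seq = "".join(pos for _, pos in window)
            if PATTERNS.match(pos_seq):
                counts[" ".join(w for w, _ in window)] += 1
    return counts

cands = term_candidates(tagged)
# e.g. "neural network" (pattern AN) is counted twice in this toy input
```

A real extractor would replace the raw counts with tf.idf, mutual information or C-value scores over a large domain corpus.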


    Terminology extractors (1)

    CONNEXOR LANGUAGE ANALYSIS TOOLS

    Author(s): Connexor, Finland

    Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian, Finnish.

    Access: Not free. Demos at http://www.conexor.fi/

    ELLOGON

    Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece

    Purpose: Available for Unicode.

    Access: Free at http://www.ellogon.org/

    FASTR

    Author(s): Christian Jacquemin, Groupe Langage et Cognition, CNRS-LIMSI

    Purpose: Available for French and English.

    Access: Free at http://www.limsi.fr/Individu/jacquemi/FASTR/

    INTEX

    Author(s): Max Silberztein, New York University

    Purpose: Supported languages: Spanish, Portuguese, Italian, French, English.

    Access: Free at http://www.nyu.edu/pages/linguistics/intex/


    Terminology extractors (2)

    NOMINO

    Author(s): Université de Québec à Montréal

    Purpose: French and English term extractors.

    Access: Free at http://www.ling.uqam.ca/nomino/

    PROMEMORIA

    Author(s): BridgeTerm

    Purpose: Translation memory system with terminology extraction component.

    Access: Not free. More information at http://www.bridgeterm.com/en/promem.html

    PWA

    Author(s): Jörg Tiedemann, Mikael Andersson, Magnus Merkel, Lars Ahrenberg, Anna Sågvall Hein, Department of Linguistics, Uppsala University; Department of Computer and Information Science, Linköping University, Sweden

    Purpose: Language independent terminology extractor.

    Access: Free at http://stp.ling.uu.se/~corpora/plug/pwa/index.html

    TerminologyExtractor

    Author(s): Etienne Cornu, Chamblon Systems Inc., Cambridge, Ontario, Canada

    Purpose: Available for French and English.

    Access: Not free. More information at http://www.chamblon.com/terminologyextractor.htm


    Terminology extractors (3)

    Xerox TermFinder

    Author(s): Xerox Multilingual Knowledge Management Solutions

    Purpose: Supported languages: Swedish, Spanish, Russian, Portuguese, Norwegian, Hungarian, German, French, Finnish, English, Dutch, Danish.

    Access: Not free. More information at http://www.mkms.xerox.com/


    Terminology data management tools (1)

    DÉJÀ VU

    Author(s): Atril Software

    Purpose: Translation memory system with integrated terminology tool.

    Access: Not free. Trial version at: http://www.atril.com

    DICOMAKER

    Author(s): Dalix Software

    Access: http://www.dicomaker.com/

    EDITERM

    Author(s): EDIT INC.

    Access: Not free. More information at http://www.editerm.com/indexN.html

    LEXSYN

    Author(s): Babeling

    Access: Not free. Evaluation version at http://www.babeling.com/accueil.html

    LOGITERM

    Author(s): Terminotix Inc.

    Access: Not free. More information at http://www.terminotix.com/eng/index.htm


    Terminology data management tools (2)

    MULTITERM

    Author(s): TRADOS

    Purpose: Available as a stand-alone version or as part of the TRADOS TM Workbench translation memory system.

    Access: Not free. More information at http://www.trados.com/products.asp?page=22

    MULTITRANS

    Author(s): MultiCorpora R&D Inc.

    Purpose: Translation memory system with integrated terminology tool.

    Access: Not free. More information at http://www.multicorpora.ca

    SYSTEM QUIRK

    Author(s): School of ECM, University of Surrey, UK

    Access: Free at http://www.computing.surrey.ac.uk/SystemQ/

    TERMBASE

    Author(s): University of Mainz

    Access: Free at http://www.fask.uni-MAINZ.de/user/srini/srini.html

    TERMSTAR

    Author(s): STAR-USA, LLC

    Access: Not free. http://www.star-group.net/eng/software/sprachtech/termstar.html


    Text Summarization

    • Text summarization = automatic creation of summaries of one or more texts

    • Summary = a text produced from one or more texts that contains a significant portion of the information in the original text(s) and is no longer than half of the original text(s)

    • Types of summary:

      • Extracts: summaries created by reusing portions (words, sentences) of the input text(s)

      • Abstracts: summaries created by re-generating the extracted content


    Text Summarization Methodology

    • There are three stages of automated text summarization:

      • Stage 1: Topic Identification

        • Using different criteria of importance, the system identifies the most important units (words, sentences, passages). If these units are simply output as they are, the result is an extract; otherwise processing continues with Stages 2 and 3.

        • Criteria of importance:

          • Cue phrase indicator criteria

          • Word and phrase frequency criteria

          • Query and title overlap criteria

          • Combination of various criteria and scores

      • Stage 2: Interpretation or topic fusion: template representation of important topics identified at stage 1

      • Stage 3: Summary generation: the information captured in the templates is processed by NLG modules to obtain the summary (abstract)
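The extractive side of this pipeline (Stage 1 producing an extract) can be sketched with two of the criteria above, word frequency and title overlap. The scoring formula, stopword list and function names are illustrative assumptions, not a reference implementation of any of the systems listed below.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def extract_summary(sentences, title, k=1):
    """Score each sentence by average word frequency plus title overlap,
    then output the k best sentences in document order (an extract)."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    title_words = set(tokenize(title))

    def score(s):
        words = tokenize(s)
        if not words:
            return 0.0
        tf = sum(freq[w] for w in words) / len(words)      # frequency criterion
        overlap = len(set(words) & title_words)            # title-overlap criterion
        return tf + overlap

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])                            # restore document order
    return [sentences[i] for i in chosen]

sents = [
    "Language identification detects the language of a text.",
    "The weather was pleasant yesterday.",
    "Trigram models identify a language from character statistics.",
]
summary = extract_summary(sents, "Language identification", k=1)
```

An abstract, by contrast, would require Stages 2 and 3: interpreting the selected content and re-generating it with an NLG component.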


    Text summarizers (1)

    BREVITY

    Author(s): Art Pollard, Lextek International

    Access: Not free, demo available at http://www.lextek.com/brevity/

    CAST

    Author(s): Constantin Orasan, Laura Hasler, Ruslan Mitkov, University of Wolverhampton, UK

    Access: Free at http://clg.wlv.ac.uk/projects/CAST

    COPERNIC SUMMARIZER

    Author(s): Copernic Technologies

    Purpose: Supported languages: Spanish, German, French, English.

    Access: Not free, trial available at http://www.copernic.com/en/products/summarizer/index.html

    GATE

    Author(s): NLP Group, University of Sheffield, UK

    Access: Free but requires registration at http://gate.ac.uk/


    Text summarizers (2)

    LANGSUITE

    Author(s): PetaMem

    Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech.

    Access: Not free. More information at http://www.petamem.com/

    MEAD

    Author(s): The Center for Language and Speech Processing, Johns Hopkins University, USA

    Access: Free at http://www.summarization.com/mead/

    MUST

    Author(s): Chin-Yew Lin, Eduard Hovy, ISI, USA

    Purpose: MuST performs web access, text summarization and translation into English from Japanese, Arabic, Spanish, and Indonesian.

    Access: Demo at http://www.isi.edu/natural-language/projects/MuST.html

    PERTINENCE

    Author(s): A. Lehmam, P. Bouvet, Pertinence

    Purpose: Available for English, French and Spanish.

    Access: Free at http://www.pertinence.net/index.html


    Text summarizers (3)

    SUMMARIST

    Author(s): Eduard Hovy, Chin-Yew Lin, Daniel Marcu, ISI, USA

    Purpose: SUMMARIST produces extract summaries in five languages (English, Japanese, Arabic, Spanish and Indonesian)

    Access: Demo at http://www.isi.edu/natural-language/projects/SUMMARIST.html

    SWESUM

    Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB

    Purpose: Supported languages: Swedish, Spanish, German, French, English

    Access: Free at http://www.euroling.se/produkter/swesum.html

    SYSTEM QUIRK

    Author(s): School of ECM, University of Surrey, UK

    Access: Free at http://www.computing.surrey.ac.uk/SystemQ/


    Language Identification

    • The task of detecting the language a text is written in.

    • Identifying the language of a text from some of the text’s attributes is a typical classification problem.

    • Two approaches to language identification:

      • Short words (articles, prepositions, etc.)

      • N-grams (sequences of n letters). Best results are obtained for trigrams (3 letters).



    Language Identification: Trigram Method

    [Flow diagram] Source-language texts are fed to a Training Module, which produces language-specific trigram data files; these are merged into a combined data file covering all languages. An input text is then passed to the Language Detection Module, which consults the combined data file and outputs the language of the input text.


    Trigram Method: Training Module

    • Given a specific language and a text file written in this language, the training module will execute the following steps:

      • Remove characters that may reduce the probability of correct language identification (! " ( ) [ ] { } : ; ? , . & £ $ % * 0 1 2 3 4 5 6 7 8 9 - ` +)

      • Replace all white spaces with _ to mark word boundaries, then replace any sequence of __ with _ so that double spaces are treated as one

      • Store all three-character sequences within an array, with each having a counter indicating number of occurrences

      • Remove from the list of trigrams all trigrams with underscores in the middle (‘e_a’ for example) as they are considered to be invalid trigrams

      • Retain for further processing only those trigrams appearing more than x times

      • Approximate the probability of each trigram occurring in a particular language by summing the frequencies of all the retained trigrams for that language, and dividing each frequency by the total sum

    • This process is repeated for all languages the system should be trained on.

    • All language specific trigram data files are merged into one combined training file.
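The training steps above can be sketched as follows. This is a simplified illustration: the exact character set to strip, the min_count threshold (standing in for the slide's "x") and the function name are assumptions.

```python
import re
from collections import Counter

# Characters the slide lists for removal before trigram extraction.
DROP = r'[!"()\[\]{}:;?,.&£$%*0-9\-`+]'

def trigram_profile(text, min_count=1):
    """Build a per-language trigram probability table:
    clean the text, mark word boundaries with underscores,
    count trigrams, drop invalid ones, and normalize to probabilities."""
    text = re.sub(DROP, "", text.lower())
    text = re.sub(r"\s+", "_", text.strip())   # any run of spaces -> single "_"
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    # A trigram with an underscore in the middle (e.g. "e_a") straddles a
    # word boundary and is considered invalid.
    counts = {t: c for t, c in counts.items()
              if t[1] != "_" and c >= min_count}   # frequency threshold "x"
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

profile = trigram_profile("the cat and the dog")
```

Running this per language and storing the resulting tables together corresponds to merging the language-specific data files into one combined training file.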


    Trigram Method: Language Detection Module

    • Input: text written in an unknown language

    • The unknown text sample is processed in the same way as the training data (unwanted characters are removed, spaces are replaced with underscores, and the text is divided into trigrams); for each trained language, the probability of the resulting trigram sequence is then computed, with unknown trigrams assigned zero probability.

    • The text is assigned the language whose trigram data set yields the highest combined probability of occurrence.

    • The fewer characters in the source text, the less accurate the language detection is likely to be.

    • This method is successful in more than 90% of the cases when the input text contains at least 40 characters.
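A matching sketch of the detection module, self-contained with toy training samples standing in for the per-language trigram data files. One deliberate simplification, flagged here because it departs from the description above: per-trigram probabilities are summed rather than multiplied, since with unknown trigrams at zero probability a strict product would collapse to zero for almost any real input.

```python
import re
from collections import Counter

def trigrams(text):
    """Minimal cleaning: lowercase, collapse whitespace to underscores,
    drop trigrams with an underscore in the middle."""
    text = re.sub(r"\s+", "_", text.lower().strip())
    return [t for t in (text[i:i + 3] for i in range(len(text) - 2))
            if t[1] != "_"]

def profile(sample):
    counts = Counter(trigrams(sample))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Toy training samples standing in for the per-language data files.
profiles = {
    "en": profile("the cat sat on the mat with the other cats"),
    "es": profile("el gato se sienta en la alfombra con los otros gatos"),
}

def detect(text):
    """Score each trained language by summing trigram probabilities;
    unknown trigrams contribute zero, as in the method above."""
    scores = {lang: sum(p.get(t, 0.0) for t in trigrams(text))
              for lang, p in profiles.items()}
    return max(scores, key=scores.get)

lang = detect("the cat and the mat")
```

With samples this small the accuracy caveat above applies doubly: real systems train on substantial corpora and need some 40 characters of input for reliable detection.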


    Language Guessers (1)

    SWESUM

    Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB

    Purpose: Supported languages: Swedish, Spanish, German, French, English

    Access: Free at http://www.euroling.se/produkter/swesum.html

    LANGSUITE

    Author(s): PetaMem

    Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech.

    Access: Not free. More information at http://www.petamem.com/

    TED DUNNING'S LANGUAGE IDENTIFIER

    Author(s): Ted Dunning

    Access: Free at ftp://crl.nmsu.edu/pub/misc/lingdet_suite.tar.gz

    TEXTCAT

    Author(s): Gertjan van Noord

    Purpose: TextCat is an implementation of the N-Gram-Based Text Categorization algorithm and currently covers some 69 natural languages.

    Access: Free at http://grid.let.rug.nl/~vannoord/TextCat/


    Language Guessers (2)

    XEROX LANGUAGE IDENTIFIER

    Author(s): Xerox Research Centre Europe

    Purpose: Supported languages: Albanian, Arabic, Basque, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Georgian, German, Greek, Hebrew, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Slovakian, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Welsh

    Access: Not free. More information at http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser-ISO-8859-1.en.html


    Statistical language modeling toolkits

    CMU - Cambridge Statistical Language Modeling Toolkit

    Author(s): Philip Clarkson and Roni Rosenfeld, Carnegie Mellon University, USA

    Purpose: The toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.

    Access: Free at http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

    BOW - A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

    Author(s): Andrew McCallum, Carnegie Mellon University, USA

    Purpose: Bow (or LIBBOW) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (RAINBOW), document retrieval (ARROW) and document clustering (CROSSBOW).

    Access: Free at http://www-2.cs.cmu.edu/~mccallum/bow/


    CMU - Cambridge Statistical Language Modeling Toolkit

    • The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit is a set of Unix software tools designed to facilitate language modeling work in the research community.

    • Some of the tools are used to process general textual data into:

      • word frequency lists and vocabularies

      • word bigram and trigram counts

      • vocabulary-specific word bigram and trigram counts

      • bigram- and trigram-related statistics

      • various Backoff bigram and trigram language models


    CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (1)

    • text2wfreq

      • Input: Text file

      • Output: List of every word which occurred in the text, along with its number of occurrences.

    • wfreq2vocab

      • Input: A word-frequency file as produced by text2wfreq.

      • Output: A file containing a list of vocabulary words

    • text2wngram

      • Input: Text file

      • Output: List of every word n-gram (n - parameter) which occurred in the text, along with its number of occurrences

    • text2idngram

      • Input: Text file plus a vocabulary file

      • Output: List of every id n-gram (n-tuples of numbers corresponding to the mapping of the word n-grams relative to the vocabulary) which occurred in the text, along with its number of occurrences


    CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (2)

    • wngram2idngram

      • Input: Word n-gram file, plus a vocabulary file

      • Output: List of every id n-gram which occurred in the text, along with its number of occurrences

    • idngram2stats

      • Input: An id n-gram file

      • Output: A list of the frequency-of-frequencies for each of the 2-grams, …, n-grams

    • mergeidngram

      • Input: A set of id n-gram files

      • Output: One id n-gram file containing the merged id n-grams from the input files

    • idngram2lm

      • Input: An id n-gram file and a vocabulary file

      • Output: A language model in either binary format or in ARPA format


    CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (3)

    • binlm2arpa

      • Input: A binary format language model, as generated by idngram2lm

      • Output: An ARPA format language model

    • evallm

      • Input: A binary or ARPA format language model, as generated by idngram2lm.

      • Output: confirmation or denial that the probabilities of the words in the user-supplied context sum to one
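As a rough illustration of what the first tools in this chain compute, here is a Python sketch mirroring text2wfreq, wfreq2vocab and text2idngram. The data structures and the id-assignment scheme (ids from 1, with 0 for out-of-vocabulary words) are assumptions for the sketch; the real tools define their own file formats and id conventions.

```python
from collections import Counter

def text2wfreq(tokens):
    """Like text2wfreq: map each word to its number of occurrences."""
    return Counter(tokens)

def wfreq2vocab(wfreq, top=20000):
    """Like wfreq2vocab: keep the most frequent words as the vocabulary."""
    return {w for w, _ in wfreq.most_common(top)}

def text2idngram(tokens, vocab, n=3):
    """Like text2idngram: map words to vocabulary ids (0 standing in for
    out-of-vocabulary words) and count the resulting id n-grams."""
    ids = {w: i for i, w in enumerate(sorted(vocab), start=1)}
    seq = [ids.get(t, 0) for t in tokens]
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

tokens = "the cat sat on the mat".split()
wfreq = text2wfreq(tokens)
vocab = wfreq2vocab(wfreq)
idngrams = text2idngram(tokens, vocab, n=3)
```

The remaining tools (idngram2lm, binlm2arpa, evallm) would then turn such counts into backoff language models and evaluate them.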


    Corpora - Large collections aimed at the NLP community

    LDC (Linguistic Data Consortium)

    Access: http://www.ldc.upenn.edu/

    ELDA (European Language Resources Association)

    Access: http://www.elra.info/

    TRACTOR (TELRI Research Archive of Computational Tools and Resources)

    Access: http://www.tractor.de/

    CLR (Consortium for Lexical Research)

    Access: http://crl.nmsu.edu/Tools/CLR/

    European Corpus Initiative Multilingual Corpus I (ECI/MCI)

    Access: http://www.elsnet.org/resources/eciCorpus.html

    MULTEXT: Multilingual Text Tools and Corpora

    Access: http://www.lpl.univ-aix.fr/projects/multext/

    Electronic Text Collections in Western European Literature

    Purpose: Pointers to internet sources for literary texts in the western European languages other than English: Catalan, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Old Norse, Portuguese, Provençal, Spanish, Swedish.

    Access: Free at http://www.lib.virginia.edu/wess/etexts.html


    Other multilingual corpora

    CRATER Multilingual Aligned Annotated Corpus

    Purpose: Aligned corpus in English, French and Spanish.

    Access: http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html

    EMILLE/CIIL

    Purpose: Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words.

    Access: Free at http://bowland-files.lancs.ac.uk/corplang/emille/

    OPUS

    Purpose: An open-source, aligned parallel corpus in many languages, built from free software documentation (Linux manuals, etc.).

    Access: http://logos.uio.no/opus/

    Searchable Canadian Hansard French-English parallel texts (1986-1993)

    Access: http://rali.iro.umontreal.ca/

    European Union Web Server

    Access: http://europa.eu.int/


    Online multilingual dictionaries

    ECTACO

    Access: www.ectaco.com

    YOURDICTIONARY

    Purpose: A comprehensive index of dictionaries available on the web.

    Access: http://www.yourdictionary.com/


    Lexical resources (wordnets)

    Balkanet

    Purpose: The Balkanet project aimed to develop a multilingual lexical database comprising individual wordnets for the Balkan languages (Bulgarian, Czech, Greek, Romanian, Serbian and Turkish).

    Access: http://www.ceid.upatras.gr/Balkanet/

    EuroWordnet

    Purpose: EuroWordNet is a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian).

    Access: http://www.illc.uva.nl/EuroWordNet/

    WordNet

    Purpose: WordNet is an online lexical reference system. The wordnets developed as a result of the Balkanet and EuroWordnet projects are linked to the original Princeton WordNet to ensure conceptual equivalence.

    Access: http://wordnet.princeton.edu/


    Treebanks (1)

    Penn Treebank

    Language: US-English

    Size: 2 million + words

    Access:

    BLLIP WSJ corpus

    Language: US-English

    Size: 30 million words

    Access:

    ICE-GB

    Language: UK-English

    Size: 1 million words

    Access:

    NEGRA Corpus

    Language: German

    Size: 20000 sentences

    Access:


    Treebanks (2)

    TIGER Corpus

    Language: German

    Size: 700000 words

    Access:

    Alpino Dependency Treebank

    Language: Dutch

    Size: 150000 words

    Access:

    The Prague Dependency Treebank 1.0

    Language: Czech

    Size: 500000 words

    Access:

    Bulgarian Treebank

    Language: Bulgarian

    Size: n/a

    Access:


    Treebanks (3)

    Penn Chinese Treebank

    Language: Chinese

    Size: 100000 words

    Access:

    Danish Dependency Treebank 1.0

    Language: Danish

    Size: 100000 words

    Access:

    Syntactic Spanish Database

    Language: Spanish

    Size: 1.5 million words

    Access:

    LDC Korean Treebank

    Language: Korean

    Size: n/a

    Access:


    Methods and applications that did not make it into this presentation

    • Word Sense Disambiguation

      • Nancy Ide and Dan Tufis

    • Anaphora Resolution

      • Dan Cristea, Constantin Orasan and Oana Postolache

    • Machine Translation

      • Daniel Marcu and Dragos Stefan Munteanu

    • Question Answering

      • Bernardo Magnini and Marius Pasca


    Conclusions

    • Many resources for textual NLP already exist on the Web and can be exploited and adapted to new languages

    • All methods presented today can be adapted to a new language

    • Hopefully the present inventory will be of help in your future NLP activity



    Thank you!

