Probabilistic Tagging of a Corpus of Mennonite Low German: A Case Study Using Qtag

Christopher Cox, University of Alberta

christopher.cox@ualberta.ca

AACL - March 15, 2008

Introduction
  • Presentation considers the application of probabilistic part-of-speech (POS) tagging methods to minority language data
    • Methods applied profitably in large-scale corpus construction
      • Time, linguistic data, technical expertise, and financial resources often comparatively abundant
    • Perhaps less documented: challenges of applying similar techniques when resources are limited
Introduction
  • Both linguistic and technical-financial challenges in minority language corpus development:
    • Existing computational techniques not amenable to linguistic structure of given language
    • Lack of standardization (e.g. in POS tags, spelling)
    • Limited resources (cf. McEnery & Ostler 2000)
  • Important to understand which factors produce acceptable results and minimize investment of effort, given set goals and resources for tagging
Introduction
  • Presentation offers case study in probabilistic tagging of minority language data:
    • Applies Qtag to a small (~120,000-token) corpus of written Mennonite Low German (Plautdietsch)
    • Opportunity to consider problems faced in tagging minority language data in concrete detail
    • Chance to evaluate the tagging procedure adopted and to consider alternatives which might have produced results of comparable quality
“Qtag?”
  • Qtag: language-independent, “pure” probabilistic tagger designed by Oliver Mason
    • Freely available for non-commercial use
    • Well-documented Java API
    • Unicode support
  • Not the only pure probabilistic tagger available, but arguably a reasonable point of departure into probabilistic tagging
“Plautdietsch?”
  • Plautdietsch: Mennonite Low German
    • Variety of Eastern Low German, once spoken near Gdansk, Poland
    • Approx. 400,000 speakers, predominantly descendants of Dutch-Russian Mennonites (Anabaptist Christians)
    • Sizeable Plautdietsch speech communities on four continents and in no fewer than a dozen countries (cf. Epp 1993: 103-4)
A Corpus of Plautdietsch
  • Corpus intended primarily for research into syntax of verbal complementation in Plautdietsch:
    • Adequate tagging for verbal-inflectional features important (e.g. tense, person, number, etc.)
    • Dialectal variation potentially relevant in analysis
    • Technical resources furnished largely by the Text Analysis Portal for Research (TAPoR) at University of Alberta; time expenditure should be minimized
Challenges
  • Plautdietsch poses challenges common in minority language corpus construction:
    • No single orthographic standard. Systems vary between authors and individual published works
    • No corpora published to date. No tagsets proposed; little consensus on POS classes
    • Dialectal variation. Substantial variation between and within national varieties.
Corpus Construction
  • Three-stage corpus construction procedure:
    • Spelling normalization. Created versions of all corpus source texts normalized according to a published orthographic standard (Epp 1996)
    • Tagset selection. Adapted a tagset proposed for Standard German (Münster Tagset for German, MT/D; Steiner 2003) to Plautdietsch
Example: Corpus Preparation

<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31">Goon</word>
  <word wd_id="32">dach</word>
  <word wd_id="32">,</word>
  <word wd_id="33">kompt</word>
  <word wd_id="34">ennen</word>
  <word wd_id="34">,</word>
  <word wd_id="35">sat</word>
  <word wd_id="36">junt</word>
  <word wd_id="37">dol</word>
  <word wd_id="37">.</word>
  . . .
</document>

Example: Corpus Preparation

<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31">Go’n</word>
  <word wd_id="32">Dag</word>
  <word wd_id="32">,</word>
  <word wd_id="33">komt</word>
  <word wd_id="34">’enenn</word>
  <word wd_id="34">,</word>
  <word wd_id="35">sat</word>
  <word wd_id="36">Junt</word>
  <word wd_id="37">dol</word>
  <word wd_id="37">.</word>
  . . .
</document>

Corpus Construction
  • Three-stage corpus construction procedure:
    • Tagging. Normalized texts then tagged gradually with the adopted tagset, in an iterative, interactive process:
Iterative Interactive Tagging

[Diagram: the document divided into a sequence of chunks, labelled n, n+1, n+2, ... up to c]

  • Segment the document into c “chunks” of n tokens.
  • Manually assign tags to the first chunk.
  • Train Qtag on all correct tags and have it tag the next chunk.
  • Manually correct the tags assigned to that chunk and add them to the training data; repeat the previous step and this one for each remaining chunk.
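
The steps above can be sketched as a simple loop. This is a minimal Python sketch under stated assumptions, not Qtag itself: `train_baseline` and `tag_baseline` are hypothetical stand-ins (a most-frequent-tag lookup) for training and running the tagger, while `annotate` and `correct` stand in for the human annotator. The toy tokens and tags are taken from the corpus example slides; the chunk size of 3 is arbitrary.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_tokens):
    """Hypothetical stand-in for tagger training: most frequent tag per token."""
    counts = defaultdict(Counter)
    for token, tag in tagged_tokens:
        counts[token][tag] += 1
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    default_tag = Counter(tag for _, tag in tagged_tokens).most_common(1)[0][0]
    return lexicon, default_tag


def tag_baseline(model, tokens):
    """Hypothetical stand-in for running the tagger on untagged tokens."""
    lexicon, default_tag = model
    return [(tok, lexicon.get(tok, default_tag)) for tok in tokens]


def iterative_tagging(tokens, chunk_size, annotate, correct):
    """Tag tokens chunk by chunk, retraining on all corrected data each round."""
    # Step 1: segment the document into chunks of chunk_size tokens.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if not chunks:
        return []
    # Step 2: the first chunk is tagged entirely by hand.
    training = list(annotate(chunks[0]))
    for chunk in chunks[1:]:
        # Step 3: train on all correct tags so far, then tag the next chunk.
        proposed = tag_baseline(train_baseline(training), chunk)
        # Step 4: correct the proposed tags and add them to the training data.
        training.extend(correct(proposed))
    return training


# Toy "annotator" that looks up the corrected tags shown on the example slides.
gold = {"komt": "Vfvca2p", "’enenn": "Qv", ",": "Fi", "sat": "Vfvca2p",
        "Junt": "Rs", "dol": "Qv", ".": "Fs"}
tokens = ["komt", "’enenn", ",", "sat", "Junt", "dol", "."]


def annotate(chunk):
    return [(tok, gold[tok]) for tok in chunk]


def correct(proposed):
    return [(tok, gold[tok]) for tok, _ in proposed]


result = iterative_tagging(tokens, 3, annotate, correct)
```

In practice the correction step is where the annotator's time goes, so the quality of the proposals produced in step 3 determines how much work each round requires.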
The Road(s) Not Taken
  • Iterative, interactive process successful, albeit time-consuming and labour-intensive
  • What could have been done to reduce the burden of corpus construction without lessening the quality of the resulting data?
    • Was it necessary to normalize spelling in advance?
    • Should greater numbers of tokens have been tagged at each stage?
    • Should the tagset have been less elaborate?
Simulating Iterative Tagging
  • Simulations of different models of iterative, interactive tagging conducted using the corrected data
  • Parameters of the tagging models simulated:
    • Normalization. Normalized, unnormalized data
    • Chunk size. 100, 200, 300, 400, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 7500, 10000 tokens tagged per round
    • Tagset selection. 99 tags, 50 tags, 13 tags
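
For a sense of scale, the parameter space above can be enumerated as follows. This sketch assumes that every combination of the three parameters was simulated, which the slides do not state explicitly, and it uses the approximate corpus size of 120,000 tokens given earlier; `rounds_for` is a hypothetical helper name.

```python
from itertools import product
from math import ceil

NORMALIZATION = ("normalized", "unnormalized")
CHUNK_SIZES = (100, 200, 300, 400, 500, 750, 1000, 1500,
               2000, 3000, 4000, 5000, 7500, 10000)
TAGSET_SIZES = (99, 50, 13)
CORPUS_TOKENS = 120_000  # approximate figure from the introduction


def rounds_for(chunk_size, corpus_tokens=CORPUS_TOKENS):
    """Number of tagging rounds needed to cover the corpus at a given chunk size."""
    return ceil(corpus_tokens / chunk_size)


# Every combination of normalization setting, chunk size, and tagset size.
models = list(product(NORMALIZATION, CHUNK_SIZES, TAGSET_SIZES))

print(len(models))        # 2 * 14 * 3 combinations
print(rounds_for(100))    # rounds at the smallest chunk size
print(rounds_for(10000))  # rounds at the largest chunk size
```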
Evaluating Tagging Models
  • Evaluation of each model by rate of accuracy development and estimated time requirement
    • Estimated time requirement: a function of the time needed for initial manual tagging and for subsequent tag correction (at various error rates) of c chunks of n tokens using tagset t
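
One way such an estimate might be computed is sketched below. The functional form and every timing value here are assumptions made for illustration, not the model or figures used in the study: the first chunk is tagged manually at a fixed cost per token, and each later chunk costs a fixed review time per token plus a fix time for every wrongly tagged token. The dependence on tagset t enters through the per-token timings passed in.

```python
def estimated_hours(chunk_size, error_rates, manual_sec_per_token,
                    review_sec_per_token, fix_sec_per_error):
    """Hypothetical time estimate for iterative tagging, in hours.

    chunk_size  -- n, tokens per chunk
    error_rates -- tagger error rate for each chunk after the first (c - 1 values)
    The three timing parameters are assumed to vary with the tagset used.
    """
    # First chunk: every token is tagged by hand.
    first_chunk = chunk_size * manual_sec_per_token
    # Later chunks: every token is reviewed, and wrong tags are also fixed.
    later_chunks = sum(
        chunk_size * (review_sec_per_token + rate * fix_sec_per_error)
        for rate in error_rates
    )
    return (first_chunk + later_chunks) / 3600


# Illustrative call: 4 chunks of 1000 tokens, with error rates falling each round.
hours = estimated_hours(
    chunk_size=1000,
    error_rates=[0.5, 0.3, 0.2],
    manual_sec_per_token=6.0,   # assumed value
    review_sec_per_token=1.0,   # assumed value
    fix_sec_per_error=5.0,      # assumed value
)
print(round(hours, 2))
```

Under a model of this shape, smaller chunks reduce the amount of fully manual tagging in the first round, at the cost of more retraining rounds.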
Evaluating Normalization
  • Does orthographic normalization matter, either for the rate of accuracy development or estimated overall time expenditure?
    • Holding tagset and chunk size constant, compare simulations of tagging normalized and unnormalized data
Evaluating Normalization
  • Does orthographic normalization matter, either for the rate of accuracy development or estimated overall time requirement?
    • Rate of accuracy development: on average 20% lower for unnormalized data over all tagsets
    • Estimated time requirement: on average 26 hours longer for POS-99, 15 hours longer for POS-50, and 11 hours longer for POS-13 with unnormalized data
Evaluating Training Data
  • Does chunk size matter, either for the rate of development of accuracy or estimated overall time expenditure?
    • Holding tagset and normalization constant, compare simulations of tagging for different chunk sizes
Evaluating Training Data
  • Does chunk size matter, either for the rate of development of accuracy or estimated overall time expenditure?
    • Rate of accuracy development: no substantial difference in accuracy development for chunk sizes <= 5000
    • Estimated time requirement: considerable differences, with smaller chunk sizes (< 2000) taking less time
      • Smaller chunks minimize the time spent tagging the first chunk manually, without the aid of automatically assigned tags
Evaluating Tagsets
  • Does tagset detail matter, either to the rate of accuracy development or estimated overall time requirement?
    • Holding chunk size and normalization constant, compare all three tagsets
Evaluating Tagsets
  • Does tagset detail matter, either for the rate of development of accuracy or estimated overall time requirement?
    • Rate of accuracy development: average 15% increase of mean accuracy for minimal tagset over full tagset, regardless of normalization
    • Estimated time requirement: time requirement for full tagset (80.5 hours) more than double that of minimal tagset (36.5 hours)
Evaluation: Summary
  • In the present case, the following guidelines would appear relevant to ‘successful’ tagging:
    • Normalization: Accuracy gains (here, 20%) may be substantial; however, gains must be weighed against cost of normalization itself
    • Chunk size: Favour smaller chunk sizes; choose tag correction over manual tag assignment
    • Tagset: Minimize tagset complexity (wherever corpus goals permit)
Evaluation and Planning
  • Determining interaction of all such factors in their relation to accuracy likely impossible during corpus planning
  • Nevertheless, planning and evaluation might profitably enter into corpus construction:
    • Consideration of general guidelines, such as those proposed in this case study, during corpus design
    • Periodic evaluation as additional part of iterative tagging process
Tagging and Minority Language Data
  • Such suggestions must be measured against requirements, resources, and stated goals of the corpus project:
    • In present case, detailed verbal coding needed; cost of tagset mitigated through normalization
    • Sociolinguistic situation may require preservation (in some form) of original orthographies or other “distinctive” features of source data
Tagging and Minority Language Data
  • Selection of pure probabilistic methods over others in part determined by typological features and available sources of data:
    • Highly fusional or polysynthetic languages may benefit from morphological parsing, rather than probabilistic POS assignment alone
    • Integration of tagged documents with other linguistic data (e.g. dictionaries, word lists) may encourage use of hybrid tools permitting concurrent lemmatization
Conclusion
  • Computer-assisted part-of-speech assignment a complex problem, one profitably viewed in the larger context of minority language corpus construction:
    • Computational methods, probabilistic or otherwise, of clear importance, but not sole object of inquiry
    • Rather, consideration required of resources, requirements, and (socio-)linguistic conditions which bear upon minority language corpus construction as a whole
Conclusion
  • Case studies of minority language corpus design might contribute to an understanding of such problems in context:
    • Present direction for further quantitative study of corpus and tagset design
    • Offer assessment of the challenges facing corpus-based language documentation, providing guidelines from which similar projects might benefit
Acknowledgements
  • Text Analysis Portal for Research (TAPoR), University of Alberta
  • Social Sciences and Humanities Research Council of Canada (SSHRC)
  • Members of the Department of Linguistics, University of Alberta
  • Oliver Mason (for Qtag)
References
  • Epp, Reuben. 1993. The History of Low German and Plautdietsch: Tracing a language across the globe. Hillsboro, Kansas: The Reader’s Press.
  • Epp, Reuben. 1996. The Spelling of Low German and Plautdietsch. Hillsboro, Kansas: The Reader’s Press.
  • McEnery, Tony and Nick Ostler. 2000. A New Agenda for Corpus Linguistics - Working with all of the World’s Languages. Literary and Linguistic Computing 15.403-49.
References
  • Tufis, Dan and Oliver Mason. 1998. “Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger.” Proceedings of the First International Conference on Language Resources & Evaluation (LREC), Granada (Spain), 28-30 May 1998, 589-596.
  • Steiner, Petra. 2003. Das revidierte Münsteraner Tagset Deutsch (MT/D). Beschreibung, Anwendung, Beispiele und Problemfälle [The revised Münster Tagset for German (MT/D). Description, Application, Examples and Problematic Cases]. Online: http://xlex.uni-muenster.de/Portal/MTPD/tagsetDescriptionDE.ps
Qtag Algorithm
  • Read in the next token.
  • Retrieve all tags observed for this token (if none available, guess possible tags)
  • For each possible tag:
    • Calculate Pw = P(tag | token) = P(token has tag)
    • Calculate Pc = P(tag | t1, t2) = P(tag follows t1, t2)
    • Calculate Pw,c = Pw * Pc
  • Repeat this calculation for the other two tags in the window, t1 and t2 (using Pc = P(t1 precedes t2, tag) and Pc = P(t2 occurs between t1 and tag), respectively)
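
The core of the calculation above (Pw * Pc for each candidate tag of one token) can be illustrated with a short Python sketch. This is not Qtag's implementation: the function names are hypothetical, probabilities are plain relative frequencies with no smoothing, guessing tags for unknown tokens is left out, and the re-scoring of the other two tags in the window is omitted. The tag strings come from the example slides, but the three training sentences are toy data.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Collect lexical (token -> tag) counts and tag trigram counts."""
    lexical = defaultdict(Counter)   # token -> Counter of observed tags
    trigrams = Counter()             # (t1, t2, tag) -> count
    contexts = Counter()             # (t1, t2) -> count
    for sentence in tagged_sentences:
        # Pad with two start markers so the first token has a full context.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for token, tag in sentence:
            lexical[token][tag] += 1
        for t1, t2, tag in zip(tags, tags[1:], tags[2:]):
            trigrams[(t1, t2, tag)] += 1
            contexts[(t1, t2)] += 1
    return lexical, trigrams, contexts


def score_tags(token, t1, t2, model):
    """Return {tag: Pw * Pc} for every tag observed with this token."""
    lexical, trigrams, contexts = model
    seen = lexical.get(token)
    if not seen:
        return {}  # unknown token: tag guessing is not covered by this sketch
    total = sum(seen.values())
    context_count = contexts[(t1, t2)]
    scores = {}
    for tag, count in seen.items():
        p_w = count / total                                   # P(tag | token)
        p_c = (trigrams[(t1, t2, tag)] / context_count
               if context_count else 0.0)                     # P(tag | t1, t2)
        scores[tag] = p_w * p_c
    return scores


# Toy training data (hypothetical sentences, tags borrowed from the slides).
sentences = [
    [("sat", "Vfvca2p"), ("Junt", "Rs"), ("dol", "Qv"), (".", "Fs")],
    [("sat", "Vfvca2p"), ("Junt", "Rs"), ("dol", "Qv"), (".", "Fs")],
    [("Junt", "Rs"), ("dol", "Bg"), (".", "Fs")],
]
model = train(sentences)
scores = score_tags("dol", "Vfvca2p", "Rs", model)
best_tag = max(scores, key=scores.get)
```

Here "dol" has been seen with two tags, and the preceding tag context is what separates them.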
“Qtag”

http://www.english.bham.ac.uk/staff/omason/software/qtag.html

Example: Corpus Preparation

<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31" pos99a="Aa">Go’n</word>
  <word wd_id="32" pos99a="Ngns">Dag</word>
  <word wd_id="32" pos99a="Fi">,</word>
  <word wd_id="33" pos99a="Vfvca2p">komt</word>
  <word wd_id="34" pos99a="Qv">’enenn</word>
  <word wd_id="34" pos99a="Fi">,</word>
  <word wd_id="35" pos99a="Vfvca2p">sat</word>
  <word wd_id="36" pos99a="Rs">Junt</word>
  <word wd_id="37" pos99a="Bg">dol</word>
  <word wd_id="37" pos99a="Fs">.</word>
  . . .
</document>

Example: Corpus Preparation

<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31" pos99a="Aa" pos99c="Aa">Go’n</word>
  <word wd_id="32" pos99a="Ngns" pos99c="Ngas">Dag</word>
  <word wd_id="32" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="33" pos99a="Vfvca2p" pos99c="Vfvca2p">komt</word>
  <word wd_id="34" pos99a="Qv" pos99c="Qv">’enenn</word>
  <word wd_id="34" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="35" pos99a="Vfvca2p" pos99c="Vfvca2p">sat</word>
  <word wd_id="36" pos99a="Rs" pos99c="Rs">Junt</word>
  <word wd_id="37" pos99a="Bg" pos99c="Qv">dol</word>
  <word wd_id="37" pos99a="Fs" pos99c="Fs">.</word>
  . . .
</document>
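
A file in this format lends itself to a simple accuracy check. The sketch below assumes that `pos99a` holds the automatically assigned tag and `pos99c` the manually corrected one; the attribute names come from the slide, but that reading of them is an assumption. It uses only Python's standard `xml.etree.ElementTree` module, with the ten tokens above embedded as sample data.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<document doc_id="1">
  <word wd_id="31" pos99a="Aa" pos99c="Aa">Go’n</word>
  <word wd_id="32" pos99a="Ngns" pos99c="Ngas">Dag</word>
  <word wd_id="32" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="33" pos99a="Vfvca2p" pos99c="Vfvca2p">komt</word>
  <word wd_id="34" pos99a="Qv" pos99c="Qv">’enenn</word>
  <word wd_id="34" pos99a="Fi" pos99c="Fi">,</word>
  <word wd_id="35" pos99a="Vfvca2p" pos99c="Vfvca2p">sat</word>
  <word wd_id="36" pos99a="Rs" pos99c="Rs">Junt</word>
  <word wd_id="37" pos99a="Bg" pos99c="Qv">dol</word>
  <word wd_id="37" pos99a="Fs" pos99c="Fs">.</word>
</document>"""

# Parse from bytes so the encoding declaration in the XML header is honoured.
root = ET.fromstring(SAMPLE.encode("utf-8"))
words = root.findall("word")

# Tokens whose automatic tag differs from the corrected tag.
mismatches = [
    (w.text, w.get("pos99a"), w.get("pos99c"))
    for w in words
    if w.get("pos99a") != w.get("pos99c")
]
accuracy = (len(words) - len(mismatches)) / len(words)

print(f"{accuracy:.0%} of {len(words)} tokens tagged correctly")
for token, auto, corrected in mismatches:
    print(f"{token}: {auto} -> {corrected}")
```

Run after each round of correction, a check like this would give the accuracy figures needed for the periodic evaluation suggested earlier.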