
Using the WWW to resolve PP attachment ambiguities in Dutch

Vincent Vandeghinste, Centrum voor Computerlinguïstiek, K.U.Leuven, Belgium



  1. Using the WWW to resolve PP attachment ambiguities in Dutch • Vincent Vandeghinste • Centrum voor Computerlinguïstiek, K.U.Leuven, Belgium

  2. Introduction • Finding the correct attachment site for PPs is one of the problems in parsing natural languages • Volk (2000; 2001) presented an approach for German that uses cooccurrence frequencies from the WWW

  3. Introduction (2) • We present a replication of Volk's approach, applied to Dutch • We present a number of changes made to the initial formula and their effect on the results

  4. Cooccurrence values • On the one hand, the cooccurrence strength between nouns and prepositions is measured • On the other hand, the cooccurrence strength between verbs and prepositions is measured • The competing values of N+P vs. V+P are used to decide whether to attach the PP to the noun or to the verb

  5. Experiment 1 • Method • AltaVista search engine • noun NEAR preposition vs. verb NEAR preposition • restricted to Dutch documents • lemmata are used for lookup • minimal cooccurrence threshold

  6. Experiment 1 • Evaluation • 500 PPs were selected that immediately follow a noun or a pronoun functioning as a noun. • Each PP was manually judged to be attached either to the verb or to the noun.

  7. Experiment 1 • Algorithm • if cooc(N+P) and cooc(V+P) are available, the higher value decides • if one is not available (2% of test cases), the other value is compared to a threshold • if both are unavailable, no decision can be made
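
The decision rule on this slide can be sketched as a small function. This is an illustration only: the function name, the use of `None` for an unavailable value, the tie-breaking towards the noun, and leaving a case undecided when the single available value falls below the threshold are all assumptions, since the slide does not spell them out.

```python
from typing import Optional


def decide(cooc_n: Optional[float],
           cooc_v: Optional[float],
           threshold: float) -> Optional[str]:
    """Return "N", "V", or None (no decision) for one test case."""
    # Both values available: the higher value decides.
    # Ties go to the noun here, which is an assumption.
    if cooc_n is not None and cooc_v is not None:
        return "N" if cooc_n >= cooc_v else "V"
    # Only one value available: compare it to the threshold.
    # Below the threshold the case is left undecided (assumption).
    if cooc_n is not None:
        return "N" if cooc_n >= threshold else None
    if cooc_v is not None:
        return "V" if cooc_v >= threshold else None
    # Neither value available: no decision can be made.
    return None
```

With a threshold of 0 every case that has at least one value gets a decision, so raising the threshold trades coverage for accuracy.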

  8. Experiment 1 • Results • at 100% coverage: 58.4% correct attachment • max. accuracy 59%, at 98% coverage • Conclusion • better than pure guessing (50%) • much lower than Volk's results for German • lower than always defaulting to noun attachment (68%)
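
The two figures reported throughout the talk, coverage and accuracy, can be computed as below. This is a sketch under the assumption that coverage means the share of test cases for which a decision was made and accuracy means the share of those decided cases that match the manual annotation; the function name and the `None` convention for undecided cases are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def coverage_and_accuracy(predictions: Sequence[Optional[str]],
                          gold: Sequence[str]) -> Tuple[float, float]:
    """Coverage over all cases; accuracy over decided cases only."""
    # Keep only the cases where the algorithm made a decision.
    decided = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(decided) / len(gold)
    # Accuracy is undefined with no decisions; 0.0 is returned by convention.
    accuracy = (sum(p == g for p, g in decided) / len(decided)
                if decided else 0.0)
    return coverage, accuracy


# Toy data (not from the talk): 3 of 4 cases decided, 2 of those correct.
cov, acc = coverage_and_accuracy(["N", None, "V", "N"], ["N", "V", "V", "V"])
```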

  9. Experiment 2 • Method • Full forms, not lemmata • Results • we want to compare at a rate of 75% correct attachments • if we set the threshold so that 75% of attachments are correct: coverage = 21.6% • Conclusion: Results are much better than with lemmata, but still low

  10. Experiment 3 • Method • Full forms • Minimal distance threshold • Results • 75% correct attachment: coverage = 27% • Conclusion: Still a lot lower than Volk (58%), but improving
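
Reporting "coverage at 75% correct attachment" implies a search over threshold values. One way to do that search is sketched below. It assumes each test case has a single decision score (for instance the distance between the two cooccurrence values) and a flag saying whether the resulting attachment was correct; the function name, the data layout, and the toy numbers are hypothetical.

```python
from typing import List, Optional, Tuple


def best_coverage_at(cases: List[Tuple[float, bool]],
                     total: int,
                     target: float = 0.75) -> Optional[Tuple[float, float]]:
    """Find the threshold with the largest coverage whose accuracy >= target.

    cases: (score, is_correct) for every case that has a score.
    total: size of the whole test set, including cases without a score.
    Returns (threshold, coverage), or None if the target is unreachable.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    best = None
    correct = 0
    for k, (score, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        # Only cut between distinct scores so the threshold is well defined.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        if correct / k >= target:
            best = (score, k / total)
    return best


# Toy data (not from the talk).
toy = [(0.9, True), (0.8, True), (0.5, False), (0.4, True), (0.1, False)]
result = best_coverage_at(toy, total=5)
```

A return value of `None` corresponds to the situation where no threshold reaches the target accuracy.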

  11. Experiment 4 • Method • We include the head noun of the PP in the queries • cooc(X,P,N2) = freq(X,P,N2) / freq(X) • without thresholds • defaulting to N-attachment if the cooc values are not available • Results • General accuracy = 68% with coverage = 100% • Conclusion: Results are only as accurate as defaulting to N-attachment

  12. Experiment 5 • Method • a minimal cooc threshold is applied when the triple cooc is available for only one of the two candidates • when both are unavailable: no decision • Results • setting the threshold to reach an accuracy of 75% is impossible

  13. Experiment 6 • Method • full forms + lemmata • Results: • maximum accuracy is 68.77% • Conclusions: • Under the conditions just described, Volk obtains a coverage of 63% with an accuracy of 75% • We get only 27% coverage at the same accuracy
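
The triple formula and the default rule of Experiment 4 can be written out as follows. This is a sketch: the function names are hypothetical, and treating a zero freq(X) as "unavailable" and defaulting to the noun whenever either value is missing are assumptions about details the slide leaves open.

```python
from typing import Optional


def triple_cooc(freq_triple: int, freq_x: int) -> Optional[float]:
    """cooc(X, P, N2) = freq(X, P, N2) / freq(X).

    Returns None when freq(X) is zero (treated as unavailable, an assumption).
    """
    if freq_x == 0:
        return None
    return freq_triple / freq_x


def attach(cooc_n: Optional[float], cooc_v: Optional[float]) -> str:
    """Pick "N" or "V" without thresholds, defaulting to "N"."""
    # Default to noun attachment when a value is missing (assumption:
    # the default applies when either value is missing, not only both).
    if cooc_n is None or cooc_v is None:
        return "N"
    return "N" if cooc_n >= cooc_v else "V"
```

Because every case receives a decision, coverage is 100% by construction.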

  14. Experiment 7 • Method • combining doubles and triples into one algorithm • minimal distance with 2 different thresholds • when the minimal distance for triples is below its threshold, the minimal distance for doubles is used instead • Results: • coverage of 48.8% with an accuracy of 75% • coverage of 50% with an accuracy of 74.4%
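
The back-off from triples to doubles can be sketched as below. The sketch assumes that "minimal distance" means the absolute difference between the noun and verb cooccurrence values, and that a missing value at one level also triggers the back-off; neither detail is stated on the slide, and all names are hypothetical.

```python
from typing import Optional


def decide_combined(triple_n: Optional[float], triple_v: Optional[float],
                    pair_n: Optional[float], pair_v: Optional[float],
                    triple_threshold: float,
                    pair_threshold: float) -> Optional[str]:
    """Try the triple values first, then fall back to the pair values."""
    levels = (
        (triple_n, triple_v, triple_threshold),
        (pair_n, pair_v, pair_threshold),
    )
    for n, v, threshold in levels:
        # A missing value means this level cannot decide (assumption).
        if n is None or v is None:
            continue
        # Decide only when the two values are far enough apart.
        if abs(n - v) >= threshold:
            return "N" if n > v else "V"
    # Neither level was decisive: the case stays undecided.
    return None
```

Using separate thresholds lets the sparser triple counts be held to a different standard than the more frequent double counts.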

  15. Experiment 8 • Method • accuracy with preprocessed triples • test cases where N1 is not a real noun are removed from the test set (492 cases remaining) • unlexicalized compounds are reduced to the heads of the compounds (e.g. krijtstreepjeskostuum => kostuum) • Results • coverage of 60.4% with an accuracy of 75% • coverage of 50% with an accuracy of 76.8%
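
The compound reduction step could be approximated with a longest-suffix lookup, since the head of a Dutch compound is its rightmost part. The slide does not say how the reduction was actually performed, so the sketch below is a stand-in technique; the function name, the lexicon, and the minimum head length are hypothetical.

```python
from typing import Set


def reduce_compound(noun: str, lexicon: Set[str], min_head_len: int = 3) -> str:
    """Reduce an unlexicalized compound to its longest known suffix."""
    # Lexicalized nouns are kept as they are.
    if noun in lexicon:
        return noun
    # Try suffixes from longest to shortest; the first hit is the head.
    for start in range(1, len(noun) - min_head_len + 1):
        head = noun[start:]
        if head in lexicon:
            return head
    # No known head found: leave the noun unchanged.
    return noun


# Toy lexicon (hypothetical); the compound is the example from the slide.
toy_lexicon = {"kostuum", "streep"}
reduced = reduce_compound("krijtstreepjeskostuum", toy_lexicon)
```

A plain suffix match can pick a wrong head when a short lexicon entry happens to end the word, which is why a minimum head length is used here.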

  16. Experiment 8 • Results: • combining the two minimal-distance algorithms (for doubles and triples) gives a big rise in coverage at the same accuracy • preprocessing the nouns and leaving out pronouns gives a second big rise in coverage at the same accuracy • after defaulting the remaining cases to N-attachment we end up with an accuracy of 70.33%

  17. General Conclusions • using the WWW helps to get a more accurate estimate of PP-attachment • difference between our results and the German results: the number of decidable cases is higher for German, since the number of WWW documents is higher for German • Querying cooccurrence frequencies with WWW search engines using the NEAR operator allows only very rough queries

  18. Future improvements • Using cooccurrence frequencies from a controlled corpus might improve results: • more exact queries are possible than with AltaVista • less noise in the corpus

  19. References • Volk, M. (2000). Scaling up. Using the WWW to resolve PP attachment ambiguities. In Proceedings of Konvens, Ilmenau. • Volk, M. (2001). Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In Proceedings of Corpus Linguistics, Lancaster.
