
A Suite to Compile and Analyze an LSP Corpus

6th International Conference on Language Resources and Evaluation (LREC 2008). A Suite to Compile and Analyze an LSP Corpus. Rogelio Nazar – Jorge Vivaldi – M. Teresa Cabré {rogelio.nazar; jorge.vivaldi; teresa.cabre}@upf.edu

Presentation Transcript


  1. 6th International Conference on Language Resources and Evaluation (LREC 2008). A Suite to Compile and Analyze an LSP Corpus. Rogelio Nazar – Jorge Vivaldi – M. Teresa Cabré {rogelio.nazar; jorge.vivaldi; teresa.cabre}@upf.edu

  2. Introduction
     This system (JAGUAR) is a set of tools for compiling and exploring an LSP corpus from the web: http://jaguar.iula.upf.edu
     Usage examples:
     • Terminology extraction
     • Bilingual lexicon extraction
     • Neologism extraction
     Architecture: a system divided into two main modules:
     • Compilation of an LSP corpus from the web
     • Analysis of the corpus with statistical techniques

  3. Module 1: Compilation of an LSP corpus from the web
     • Document retrieval by querying search engines
     • Classification of the collection on the basis of two axes:
       • Degree of relevance to the topic
         • Possibility of corpus tuning with user feedback
       • Degree of specialization of the document
         • Structure of the document (abstract, introduction, etc.)
         • System for bibliographical references, etc.
     The final classification is the result of the combination of these factors.

  4. Module 1: Compilation of an LSP corpus from the web Classification by degree of relevance to the topic:

  5. Module 1: Compilation of an LSP corpus from the web Classification by degree of relevance to the topic: co-occurrence graphs
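The slide does not say how JAGUAR builds its co-occurrence graphs, so the following is only a minimal sketch of the general idea, under the assumption that two words are linked when they appear in the same document and that the edge weight is the number of such documents. The function names `cooccurrence_graph` and `neighbours` are hypothetical, not part of JAGUAR.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents, min_count=1):
    """Build an undirected word graph (assumed scheme, not JAGUAR's own).

    Nodes are lower-cased whitespace tokens; the weight of an edge is the
    number of documents in which both words occur.
    """
    edges = Counter()
    for text in documents:
        vocabulary = sorted(set(text.lower().split()))
        for a, b in combinations(vocabulary, 2):
            edges[(a, b)] += 1

    graph = {}
    for (a, b), count in edges.items():
        if count >= min_count:
            graph.setdefault(a, {})[b] = count
            graph.setdefault(b, {})[a] = count
    return graph


def neighbours(graph, term, k=5):
    """Return the k strongest neighbours of a term, ties broken alphabetically."""
    linked = graph.get(term, {})
    return sorted(linked.items(), key=lambda item: (-item[1], item[0]))[:k]


docs = [
    "spastic diplegia cerebral palsy",
    "spastic diplegia gait",
    "football results",
]
graph = cooccurrence_graph(docs)
print(neighbours(graph, "spastic", k=3))
```

A document could then be scored by how many of the query term's strongest neighbours it contains; that scoring step is likewise an assumption rather than something the slides describe.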

  6. Module 1: Compilation of an LSP corpus from the web Evaluation of the document classification: cumulative precision in the ranking of documents with the term "spastic diplegia".
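Cumulative precision over a ranking is usually read as precision at each rank k: the share of relevant documents among the top k. A minimal sketch under that reading follows; the relevance judgments in the example are invented for illustration and are not the experiment's data.

```python
def cumulative_precision(ranked_relevance):
    """Precision at every rank of a ranked list.

    ranked_relevance: iterable of booleans, best-ranked document first,
    True when the document was judged relevant.
    """
    hits = 0
    curve = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += bool(is_relevant)
        curve.append(hits / rank)
    return curve


# Hypothetical judgments for the first four ranked documents.
print(cumulative_precision([True, True, False, True]))
```

A good ranking keeps this curve close to 1.0 in the first positions and lets it fall only further down the list.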

  7. Module 1: Compilation of an LSP corpus from the web Evaluation of the document classification: precision and recall for the experiments.
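For reference, precision and recall can be computed from the set of documents the classifier accepted and the set judged relevant. This is the standard definition, sketched with made-up document identifiers; it is not taken from the JAGUAR code.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected documents.

    precision = |selected ∩ relevant| / |selected|
    recall    = |selected ∩ relevant| / |relevant|
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: four documents accepted, three truly relevant.
print(precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"}))
```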

  8. Module 1: Compilation of an LSP corpus from the web Evaluation of the document classification: probability distribution of precision as a random variable (performance of 10,000 random classifiers).
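One way to obtain such a baseline distribution is by simulation: let each random classifier pick a fixed number of documents at random and record its precision. The sketch below assumes that setup; the collection size, the share of relevant documents, and the names `random_classifier_precisions` and `empirical_p_value` are illustrative assumptions, not details from the experiment.

```python
import random


def random_classifier_precisions(labels, n_selected, n_trials=10_000, seed=0):
    """Simulate random classifiers and return the precision of each one.

    labels: list of 0/1 relevance labels for the whole collection.
    Each trial selects n_selected documents uniformly without replacement.
    """
    rng = random.Random(seed)
    precisions = []
    for _ in range(n_trials):
        chosen = rng.sample(labels, n_selected)
        precisions.append(sum(chosen) / n_selected)
    return precisions


def empirical_p_value(precisions, observed):
    """Share of random classifiers that match or beat the observed precision."""
    return sum(p >= observed for p in precisions) / len(precisions)


# Hypothetical collection: 100 documents, 20 of them relevant.
labels = [1] * 20 + [0] * 80
baseline = random_classifier_precisions(labels, n_selected=10)
print(sum(baseline) / len(baseline))        # close to the 0.2 base rate
print(empirical_p_value(baseline, 0.9))     # how often chance reaches 0.9
```

If almost no random classifier reaches the precision of the real one, the observed result is unlikely to be due to chance.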

  9. Module 2: Analysis of the corpus with statistical techniques
     1. Input: from module 1 or from a user-compiled corpus
     2. Main functions:
     • Measures of vocabulary richness
     • Analysis of sample representativeness
     • Automatic language recognition
     • KWIC search
     • N-gram extraction and sorting
     • Collocation extraction
     • Measures of association
     • Models of term distribution
     • Coefficients for vector comparison
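To make a few of these functions concrete, here is a small sketch of four of them: a vocabulary richness measure (type-token ratio), n-gram counting, KWIC search, and one association measure (pointwise mutual information over adjacent words). These are textbook formulations chosen for illustration; the slides do not specify which measures JAGUAR implements, and all function names below are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def type_token_ratio(tokens):
    """Vocabulary richness as distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ngrams(tokens, n):
    """Count all contiguous n-grams; use .most_common() to sort them."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def kwic(tokens, keyword, width=3):
    """Keyword in context: (left context, keyword, right context) triples."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def pmi(tokens, w1, w2):
    """Pointwise mutual information (base 2) of the adjacent pair w1 w2."""
    joint = ngrams(tokens, 2)[(w1, w2)]
    if joint == 0:
        return float("-inf")
    unigrams = Counter(tokens)
    n = len(tokens)
    p_joint = joint / (n - 1)
    return math.log2(p_joint / ((unigrams[w1] / n) * (unigrams[w2] / n)))


tokens = tokenize("the corpus is a corpus of the web")
print(type_token_ratio(tokens))
print(ngrams(tokens, 2).most_common(3))
print(kwic(tokens, "corpus", width=1))
print(pmi(tokens, "the", "corpus"))
```

On a real corpus the tokenizer would need to handle punctuation, and low-frequency pairs would normally be filtered before ranking by PMI, since the measure overrates rare combinations.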

  10. http://rc16.upf.es/jaguar

  11. Conclusions
     • We have presented the system JAGUAR, a set of tools for compiling and exploring an LSP corpus from the web
     • The main characteristics of this suite are the following:
       • It is able to collect an LSP corpus from the web, ensuring thematic adequacy and the degree of specialization for a given domain
       • It offers tools to statistically explore such a collection in a friendly interface
       • It has also been conceived as a library
     • The original algorithms have been successfully evaluated
     • Its use saves time and effort in the analysis of a corpus and also offers new insights, a perspective on the data that is invisible to the naked eye.

  12. Future Work
     • The project is now growing in different directions:
       • Progressive enhancement with new functions and algorithms
       • Turning it into a desktop application

  13. Thanks!
