Towards a Quantitative Characterization of Corpora at the Morphological Level: Morphological Profiles to Measure Diachro

Towards a Quantitative Characterization of Corpora at the Morphological Level: Morphological Profiles to Measure Diachronic Change. Alfonso Medina IINGEN-UNAM AACL 2008, BYU, March 15th, 2008 . Outline. Introduction Characterization of Corpora at Morphological Level ( Affixality Measurement)

Presentation Transcript


  1. Towards a Quantitative Characterization of Corpora at the Morphological Level: Morphological Profiles to Measure Diachronic Change. Alfonso Medina, IINGEN-UNAM. AACL 2008, BYU, March 15th, 2008

  2. Outline • Introduction • Characterization of Corpora at Morphological Level (Affixality Measurement) • Distances between Morphological Profiles (sets of affixes) • Experiment with Spanish XVI, XVIII & XX; Comparison to data from the Castilian dialect and the Portuguese Language • Concluding Remarks

  3. Some Guiding Questions • When did Mexican Spanish appear as a distinct dialectal system of the Spanish language? • How much has it changed since then? • How different is it from other major dialects of Spanish? • How can variation, either diachronic or synchronic, be measured? • What linguistic level(s) should be minimally taken into account?

  4. Glottochronology (Swadesh) • Inspired by the carbon dating of organic material • Linguistic change index in millennia • Lexical base of at least 100 items: body parts, heavenly phenomena, numerals (1, 2), personal pronouns, basic action verbs, etc. • Given a pair of genetically related languages and their lexical bases, measuring the similarity among the items in those lexical bases permits, among other things, an estimate of how far back in time two or more languages were in fact the same language: • The greater the number of shared cognates, the more recent their separation as independent languages. • Great controversy

  5. Cognates • cognates: words of a common origin • between languages: ‘posible’ - ‘possible’; ‘starve’ - ‘sterben’; etc. • within one language: ‘delicado’ - ‘delgado’; ‘shirt’ - ‘skirt’ • Morphological cognates • Verbal clitics (Spanish, Portuguese): me, te, se, nos, lo, los, la, las, etc.

  6. Unsupervised Word Segmentation Techniques • Z Harris (50s-70s): variety of phonemes before and after a segmentation (importance of uncertainty) • N Andreev (60s): string frequencies • J De Kock (70s): economy principle • Entropy based approaches (70s - 90s): information retrieval • Goldsmith (2000): best fit of a model • Creutz and Lagus: Bayesian statistics
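As a rough illustration of the uncertainty idea behind the Harris-style and entropy-based approaches listed above, the following sketch computes the entropy (in bits) of the letter that follows a given word-initial string across the word types of a toy list. The function name `successor_entropy` and the toy word list are assumptions made for this example; they are not taken from the presentation, and the actual systems cited differ in many details.

```python
from collections import Counter
from math import log2

def successor_entropy(words, prefix):
    """Entropy (bits) of the letter following `prefix` across word types.

    A high value means many different continuations are about equally
    likely, which is the kind of uncertainty that suggests a boundary.
    """
    # Count the next letter over distinct word types that extend the prefix.
    next_letters = Counter(
        w[len(prefix)]
        for w in set(words)
        if w.startswith(prefix) and len(w) > len(prefix)
    )
    total = sum(next_letters.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in next_letters.values())

# Toy word list (hypothetical, for illustration only).
toy_words = ["cantar", "cantas", "canto", "cantamos", "casa"]
for candidate in ("ca", "cant", "canta"):
    print(candidate, round(successor_entropy(toy_words, candidate), 3))
```

The same measure can be applied to reversed words to look at what precedes a candidate suffix.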

  7. Extracting Morphemes through Affixality Measurement • Affixality can be seen as a sort of glutinosity between affixes and bases; it can be conceived of as a force that glues affixes to bases • Affixality/glutinosity can be measured in terms of the bits carried by some kind of structure • Thus, for the following experiments, it will be estimated as the average of the normalized entropy, economy and square indexes calculated for each affix candidate

  8. A Catalog of Czech Prefixes • Affixality = average of normalized economy and entropy indexes • Items are ranked according to affixality • More affixality is not clearly related to the frequency of the string
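A minimal sketch of how such a ranked catalog could be assembled, assuming the raw index values for each affix candidate have already been computed. The slides do not say how the indexes are normalized; dividing each index by its maximum over all candidates is an assumption made here, and the function name `affixality_ranking` and the sample values are hypothetical.

```python
def affixality_ranking(raw_indexes):
    """Rank affix candidates by the average of their normalized indexes.

    raw_indexes: {affix: {index_name: raw_value}}
    Returns a list of (affix, affixality) pairs, most affixal first.
    Normalization by the per-index maximum is an assumption of this sketch.
    """
    index_names = sorted({n for vals in raw_indexes.values() for n in vals})
    if not index_names:
        return []
    maxima = {
        n: max(vals.get(n, 0.0) for vals in raw_indexes.values())
        for n in index_names
    }
    scores = {}
    for affix, vals in raw_indexes.items():
        normalized = [
            vals.get(n, 0.0) / maxima[n] if maxima[n] > 0 else 0.0
            for n in index_names
        ]
        scores[affix] = sum(normalized) / len(normalized)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up raw values, for illustration only.
sample = {
    "-a": {"entropy": 2.0, "economy": 1.0},
    "-b": {"entropy": 1.0, "economy": 4.0},
}
for rank, (affix, score) in enumerate(affixality_ranking(sample), start=1):
    print(rank, affix, score)
```

Rank 1 is the most affixal item, which matches the convention that smaller ranks imply greater affixality.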

  9. Sets/Tables of Affixes: Morphological Profiles • This kind of table (an affix catalog) appears to be a sort of fingerprint of the language portrayed in the corpus from which it is extracted • In fact, since such tables reflect morphosyntactic and lexical structure, they are more intimately related to the patterns of the language than any set of items in a lexical base (devised for glottochronology) • It is worthwhile to compare these morphological fingerprints obtained from different diachronic states

  10. Experiment • Objective: compare three language states: • XVI and XVIII Centuries (from our CHEM) and • XX Century from the Corpus del Español Mexicano Contemporáneo, CEMC (El Colegio de México) • Thus, the most prominent suffixes were extracted from each corpus and stored in catalogs (morphological profiles) for later comparison • Again: these profiles are tables of morphological items ordered from most affixal (exhibiting more glutinosity) to least affixal (exhibiting less); smaller ranks imply greater affixality/glutinosity

  11. Some Words of Caution • The set of Spanish phonemes varies across diachronic and geographic dialects. • There are great orthographic irregularities in old documents (especially XVI), which make automatic phonological transcription difficult; so this is basically an exercise in written language • Short and frequent affixes tend to be very polysemous or to represent several homographs/homophones (-a, -o, -e, -as, -es, -os, -an, -en, etc.). • Results of the experiment will depend on the representativeness of the samples

  12. Most prominent suffixes and suffixal groups from the XVIth century (CHEM): 151,966 word tokens; 17,608 word types

  13. Most prominent suffixes and suffixal groups from the XVIIIth century (CHEM): 151,966 word tokens; 15,916 word types

  14. Most prominent suffixes and suffixal groups from the XXth century (CEMC): around 2,000,000 word tokens; 69,000 word types

  15. Measuring Distances among Morphological Profiles • Some techniques applicable • Word Count (plain number of items shared) • Word Count and Bonus (weight added) • Cosine Similarity • Euclidean Distance • Etc.
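Two of the techniques listed above can be sketched briefly, assuming each morphological profile is stored as a mapping from affix to affixality value. The representation, the function names, and the sample values below are assumptions made for illustration. Treating an affix that is missing from one profile as having affixality 0 in the cosine computation is also a choice of this sketch, not something stated in the slides.

```python
from math import sqrt

def shared_count(profile_a, profile_b):
    """Plain count of affixes that appear in both profiles."""
    return len(set(profile_a) & set(profile_b))

def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two profiles seen as sparse vectors.

    An affix absent from a profile contributes 0 (assumption of this sketch).
    """
    affixes = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(a, 0.0) * profile_b.get(a, 0.0) for a in affixes)
    norm_a = sqrt(sum(v * v for v in profile_a.values()))
    norm_b = sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up profiles, for illustration only.
older = {"-ito": 0.51, "-ería": 0.37, "-ades": 0.40}
newer = {"-ito": 0.55, "-ería": 0.48, "-ción": 0.60}
print(shared_count(older, newer))
print(cosine_similarity(older, newer))
```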

  16. Euclidean Distance • $D(\mathrm{XVI},\mathrm{XVIII}) = \left(\sum_i \left(\mathrm{aff}_{\mathrm{XVI}}(i) - \mathrm{aff}_{\mathrm{XVIII}}(i)\right)^2 / n\right)^{1/2}$ • To compare XVI and XVIII: the square root of the average of the squared differences: • Affixality difference of -ito between XVI and XVIII: $(0.5108 - 0.5496)^2 = 0.0388^2$, • Affixality difference of -ería between XVI and XVIII: $(0.3662 - 0.4776)^2 = 0.1114^2$, etc.
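The distance on this slide (the square root of the mean squared affixality difference) can be sketched as follows, using only the two affixality pairs quoted on the slide. Restricting the sum to affixes present in both profiles is an assumption of this sketch, since the slides do not say how affixes missing from one profile are handled; the function names are hypothetical. The similarity helper follows the "1 - distance" convention used later in the presentation.

```python
from math import sqrt

def profile_distance(profile_a, profile_b):
    """Root of the mean squared affixality difference over shared affixes.

    Only affixes present in both profiles are compared (assumption).
    """
    shared = sorted(set(profile_a) & set(profile_b))
    if not shared:
        raise ValueError("profiles share no affixes")
    mean_sq = sum((profile_a[a] - profile_b[a]) ** 2 for a in shared) / len(shared)
    return sqrt(mean_sq)

def profile_similarity(profile_a, profile_b):
    """Similarity expressed as 1 - distance."""
    return 1.0 - profile_distance(profile_a, profile_b)

# The two value pairs quoted on the slide; the real profiles contain
# many more items, so this is not the distance reported in the talk.
xvi = {"-ito": 0.5108, "-ería": 0.3662}
xviii = {"-ito": 0.5496, "-ería": 0.4776}
print(profile_distance(xvi, xviii))
print(profile_similarity(xvi, xviii))
```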

  19. Morphological Distance among Centuries • Since we are dealing with a “conservative” language and normalized values of affixality, it is not surprising that the largest distance is very small. • XX and XVIII most similar (smallest distance of 0.0715), and XVI and XVIII more than XX and XVIII • How significant is this tendency?

  20. Distances to other dialects • One way to put these distances in context is to include other dialects • Thus, samples were improvised for Peninsular Spanish (Castilian) from the CORDE (XVIII) and the CREA (XX) • Pick content words randomly (“elefante”) or select peninsularismos (“bañador”), and use concordances as sample • Searches in all documents from Spain, one or two concordances per document

  21. Most prominent suffixes and suffix groups of Castilian (XVIIIth) extracted from a small selection of the CORDE 96,877 word tokens 13,882 word types

  22. Most prominent suffixes and suffix groups of Castilian (XXth) extracted from a small selection of the CREA 125,969 word tokens 17,509 word types

  23. Similarity (1 - distance) among the Centuries • XVIII - XX Mexican Spanish and XVIII-XX Castilian appear to be most similar, particularly in the XVIII Century (in XVI there is only one) • Also, XX Castilian appears more similar to XVIII Mexican Spanish than to XX Mexican Spanish • XVIII Mexican Spanish is closer to XX Mexican Spanish, than to XVI Spanish • Obviously these tendencies cannot be definitive, but they do corroborate some philologists’ intuitions (at lexical level).

  24. Distances to genetically related languages • To give more context to these data, other genetically related languages may be included • Thus a set of smaller samples was improvised for the XVI and XX Centuries of Portuguese from the Mark Davies Corpus (BYU). • Words of caution: morphemes known to be cognates but with different spellings were not made to correspond (again, an exercise in written language): -ou vs -ó, -ava vs -aba, -idade vs -idad

  25. Most prominent suffixes and suffix groups of XVIth Century Portuguese, extracted from a small selection of the Mark Davies Corpus (BYU)

  26. Most prominent suffixes and suffix groups of XXth Century Portuguese (Brazil and Portugal), extracted from a small selection of the Mark Davies Corpus (BYU)

  27. Portuguese Morphological Profiles (XVIth and XXth)

  28. Measuring morphological distances between Spanish and Portuguese

  29. Measuring morphological similarity between Spanish and Portuguese • XVI and XX Portuguese Stages are more similar to each other than to any dialect of Spanish.

  30. Summary • At least for the dialects examined in this experiment, the sets of most prominent affixes and sequences of them seem to be intimate enough to constitute a sort of fingerprint • Measuring distances/similarities among these sets allows for comparison of a language’s diachronic stages within relatively short periods of time.

  31. Summary • We have seen some quantitative data for three centuries of the Spanish language used in Mexico (XVI, XVIII and XX Centuries). These data were contextualized with improvised samples of Castilian and Portuguese. • Although the samples used for these experiments are small and definitely not representative (except for the CEMC), some intuitions can be corroborated (or not) at the morphological level:

  32. Concluding Remarks • Mexican Spanish seems to have emerged as a dialectal system sometime between the XVI and XVIII Centuries • Nowadays, Castilian seems less similar to XX Mexican Spanish than to XVIII dialects (both Mexican and Peninsular); which one is more conservative?

  33. Concluding Remarks This approach can be improved in several ways: • Applying alternative methods to measure distance/similarity between sets of items • Taking into account glutinosity values for clitics and other modifiers • Comparing more dialects (social and geographic); taking into account more corpora, dialects, and registers • Applying the method to other languages in synchrony (Chuj and Tojolabal, Mayan languages) • Applying the method to diachronic dialects of other languages (Purépecha, XVII and XIX)

  34. THE END. Towards a Characterization of Corpora at the Morphological Level. Alfonso Medina, UNAM, amedinau@ii.unam.mx. AACL 2008, March 15th, 2008

  35. Some items across the samples • By the XX Century, suffixes -áis, -éis occur in both dialects with different meanings: • In Spain, very productive verbal inflection (2nd person plural); • In Mexico, a pragmatic marker of very solemn or irreverent speech (very unproductive)
