Laboratory for the Analysis of Linguistic Data

Master's degree (Laurea specialistica) in Theoretical and Applied Linguistics, Università di Pavia

Andrea Sansò

Academic year 2005-2006

Advanced course

10 CFU

Laboratory for the Analysis of Linguistic Resources

Part Four

Lexicons

Resources for linguistic typology

Tools and technologies for creating linguistic resources



A definition:

“A computational lexicon is a very complex – and expensive – component to be built adequately. It must contain, in an explicit and formalised way, all the information which a native speaker uses in everyday situations, from the simpler orthographic, phonetic, morphologic information, to the more complex syntactic, semantic, pragmatic, logical, ontological, multilingual information. A ‘complete’ lexicon should practically incorporate our ‘knowledge of the world’, and represent it in an explicit and formal way”

N. Calzolari, “Computational lexicons and corpora. Complementary components in human language technology”, in P. van Sterkenburg (ed.), Linguistics Today – Facing a Greater Challenge, 89-107. Amsterdam-Philadelphia: J. Benjamins, 2004.



  • Lexicons ↔ corpora

  • This knowledge of the world is a changing, continuously growing object, impossible to “freeze” in a static lexicon.

  • The only way of reflecting and capturing all the potentialities of a language relies on trying to extract the linguistic and lexical information not only from ‘experts’, i.e. native speakers or linguists, but from the texts themselves in which the language is actually used, with a continuous process of enrichment. From these considerations the importance of corpora obviously emerges.

    N. Calzolari, ibidem



Lexicons ↔ corpora (L = lexicon, C = corpus)

L → C: POS tagging / lemmatisation

C → L: frequencies of different linguistic objects

C → L: proper nouns / named entity recognition

L → C: syntactic parsing

C → L: updating / tuning a lexicon

C → L: collocational data

C → L: semantic clustering and ‘nuances’ of meaning

L → C: semantic mark-up

C → L: lexical knowledge acquisition

L → C: word sense disambiguation

C → L: validation of lexical models

C → L: corpus-based computational lexicography
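The two directions in the list above can be tried out on a toy example. The sketch below is not part of the original slides: the mini-lexicon, the sample sentence and all function names are invented for illustration. A lexicon is first used to tag and lemmatise a tokenised text (L → C); the tagged text is then used to derive lemma frequencies and adjacent-lemma pairs as crude collocation candidates (C → L).

```python
from collections import Counter

# Hypothetical mini-lexicon: word form -> (lemma, part of speech).
LEXICON = {
    "the": ("the", "DET"),
    "linguist": ("linguist", "NOUN"),
    "linguists": ("linguist", "NOUN"),
    "annotates": ("annotate", "VERB"),
    "annotate": ("annotate", "VERB"),
    "corpus": ("corpus", "NOUN"),
    "corpora": ("corpus", "NOUN"),
}

def tag(tokens):
    """L -> C: look each token up in the lexicon; unknown forms get the tag 'UNK'."""
    return [(tok, *LEXICON.get(tok.lower(), (tok.lower(), "UNK"))) for tok in tokens]

def lemma_frequencies(tagged):
    """C -> L: count how often each lemma occurs in the tagged text."""
    return Counter(lemma for _, lemma, _ in tagged)

def lemma_bigrams(tagged):
    """C -> L: count adjacent lemma pairs, a crude source of collocation candidates."""
    lemmas = [lemma for _, lemma, _ in tagged]
    return Counter(zip(lemmas, lemmas[1:]))

# Invented sample sentence, already tokenised by whitespace.
tokens = "The linguist annotates the corpus and linguists annotate corpora".split()
tagged = tag(tokens)
freqs = lemma_frequencies(tagged)
bigrams = lemma_bigrams(tagged)

# Forms missing from the lexicon are candidates for updating it (C -> L).
unknown = [tok for tok, _, pos in tagged if pos == "UNK"]
```

In this toy run the form "and" is not in the lexicon, so it ends up in `unknown`. That is the point at which a corpus would feed back into updating or tuning the lexicon.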



  • Lexicons ↔ corpora

  • Example: Italian chiedere vs. domandare

  • From a theoretical (introspective) point of view they are synonyms; print dictionaries give them the same definition

  • But:

  • domandare is almost always used in the interrogative sense (ask to know); chiedere is often used in the imperative (request) sense (ask to have);

  • chiedere is much more frequent than domandare.



FrameNet (FN) is a corpus-based lexicon-building project that documents the links between lexical items and the semantic frame(s) they evoke; it accomplishes this by annotating sets of sentences that exemplify the items being described, and performing various operations on the resulting annotations. The basic units in FN descriptions are the frame and the lexical unit (LU), the latter understood as the pairing of a “word” with just one of its meanings; thus, a word with four meanings is treated as four lexical units. In most cases, for a word to have more than one meaning implies that it belongs to more than one frame.

 Charles J. Fillmore, Collin F. Baker, and Hiroaki Sato, “FrameNet as a ‘Net’”, in Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon 2004, pp. 1091-1094.



  • Main components of the FrameNet database

  • the frame ontology,

  • the set of annotated sentences, and

  • the set of lexical entries.

The basis of the ontology is the set of frames, each of which consists of an informal characterization of a situation type (the frame definition), together with a collection of frame elements (FEs). The FEs are the semantic roles of the entities involved in each frame. FE names are used as labels for the words or phrases that are in grammatical construction with the L(exical) U(nit)s that evoke that particular frame. For example, the frame that includes the English verb inform has as its core FEs SPEAKER, ADDRESSEE and MESSAGE.
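As a rough illustration of how frames, frame elements and lexical units relate, here is a minimal sketch of a possible data model. It is not FrameNet's actual schema: the class names, the frame identifier and the frame definition are placeholders, and only the three core FEs of the frame evoked by inform come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    name: str             # frame identifier (placeholder below)
    definition: str       # informal characterization of the situation type
    core_fes: frozenset   # names of the core frame elements

@dataclass(frozen=True)
class LexicalUnit:
    lemma: str
    pos: str
    frame: Frame          # one LU = one word paired with exactly one meaning/frame

# Placeholder name and definition; the core FEs are the ones given in the text.
frame_of_inform = Frame(
    name="frame_of_inform",
    definition="A speaker makes an addressee aware of a message.",
    core_fes=frozenset({"SPEAKER", "ADDRESSEE", "MESSAGE"}),
)

inform_v = LexicalUnit(lemma="inform", pos="v", frame=frame_of_inform)

def can_label(lu, fe_name):
    """True if fe_name is a core FE of the frame evoked by this lexical unit."""
    return fe_name in lu.frame.core_fes
```

Under this model a word with four meanings would simply be stored as four `LexicalUnit` objects that share a lemma and point to different frames.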



  • The example sentences are selected by FrameNet annotators as representing the typical uses of the LUs belonging to individual frames. Each set of annotations is centered around a particular LU; the sentence’s constituents are labeled (with FE names) according to the ways in which they fill in information about the frame. For example, sentences (1) and (2) have SPEAKER appearing as subject, and ADDRESSEE as object; the MESSAGE FE appears as a that clause in sentence (1), and as an event-naming nominalization introduced by of in sentence (2).

  • (1) [SPEAKER We] informed [ADDRESSEE the press] [MESSAGE that the prime minister has resigned]

  • (2) [SPEAKER We] informed [ADDRESSEE the press] [MESSAGE of the prime minister’s resignation]
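The bracket notation used in the two example sentences is easy to process automatically. The following sketch assumes that every labelled constituent has the form `[FE_NAME text]` with an upper-case FE name; the function name and the regular expression are illustrative and are not part of any FrameNet tool.

```python
import re

# "[NAME some text]" -> group 1 is the FE name, group 2 the labelled text.
FE_SPAN = re.compile(r"\[([A-Z_]+) ([^\]]+)\]")

def parse_annotation(annotated):
    """Return (plain sentence, {FE name: labelled text}) for a bracketed sentence.

    If an FE name occurred twice, the later span would overwrite the earlier one.
    """
    fes = {name: text for name, text in FE_SPAN.findall(annotated)}
    plain = FE_SPAN.sub(lambda m: m.group(2), annotated)
    return plain, fes

s1 = "[SPEAKER We] informed [ADDRESSEE the press] [MESSAGE that the prime minister has resigned]"
s2 = "[SPEAKER We] informed [ADDRESSEE the press] [MESSAGE of the prime minister's resignation]"

plain1, fes1 = parse_annotation(s1)
plain2, fes2 = parse_annotation(s2)
```

After parsing, `fes1["MESSAGE"]` holds the that-clause and `fes2["MESSAGE"]` the of-phrase, so the two realizations of the same frame element can be compared directly.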



  • The lexical entry for each LU is a summary of what has been recorded in its annotations, presented as valence descriptions, showing all the ways in which its frame elements can be realized, such as the alternative syntactic realizations of the MESSAGE just shown for the verb inform. The collection of annotated sentences is made available in the database as evidence for the analysis.

  • The first and most obvious way in which LUs are related to each other is through membership in the same frame. Thus inform shares a frame with the verbs notify and announce, and also with the nouns notification and announcement, and the verb resign shares frame membership with its nominal partner resignation, and with verbal expressions like abdicate, step down and stand down. But LUs can also be related to each other in other ways, either because their frames are related to other frames, or through semantic properties (called semantic types in the FN database) assigned to LUs individually rather than through their frames.
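A valence description of the kind mentioned above can be thought of as an aggregation over the annotated sentences of one LU. The sketch below is an assumption-laden simplification: each annotated sentence is reduced to a mapping from FE name to a realization label, and the labels are informal paraphrases of what the text says about the two inform sentences, not FrameNet's own grammatical-function and phrase-type codes.

```python
from collections import defaultdict

# One dict per annotated sentence: FE name -> how that FE is realized.
# The labels follow the prose description of the two "inform" sentences.
ANNOTATIONS = {
    "inform": [
        {"SPEAKER": "subject", "ADDRESSEE": "object", "MESSAGE": "that-clause"},
        {"SPEAKER": "subject", "ADDRESSEE": "object", "MESSAGE": "of + nominalization"},
    ],
}

def valence_description(annotated_sentences):
    """Collect, for each FE, every realization attested in the annotations."""
    realizations = defaultdict(set)
    for sentence in annotated_sentences:
        for fe, realization in sentence.items():
            realizations[fe].add(realization)
    return {fe: sorted(forms) for fe, forms in realizations.items()}

entry = valence_description(ANNOTATIONS["inform"])
```

Here `entry` lists a single realization for SPEAKER and ADDRESSEE and two alternative realizations for MESSAGE, which mirrors the summary role of the lexical entry.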



Semantic types:

  • The FrameNet database allows the assignment of semantic types to LUs, FEs and frames. The perception verbs hear vs. listen are distinguished as passive versus active perception verbs, and so, respectively, are see vs. look. Hearing and seeing are things that happen to you, listening and looking are things that you do, and this difference is considered important enough to merit entry into separate frames. In the FN database, hear and see and the passive perception uses of other sensory words, such as feel, taste and smell, belong to the Perception experience frame; the verbs look and listen belong to the Perception active frame, along with the corresponding active uses of feel, taste and smell.




Subframes are used for representing subevents; frames that represent complex processes have subframes representing their subparts. To take a simple example, the Motion scenario frame has three subframes, Departing, Motion, and Arriving. In this case, the subframes are temporally ordered, but in general, subframes need not be completely ordered with respect to each other. For example, the Commercial transaction frame has two subframes Commerce goods-transfer and Commerce money-transfer, but these are not ordered with respect to each other. In some commercial transactions, you pay in advance, in others, only after receiving the goods or services.
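The difference between fully ordered and unordered subframes can be made concrete with a small sketch. The representation below (a list of parts plus a set of "precedes" pairs) is an assumption made for illustration, not the way the FrameNet database stores these relations; the frame and subframe names are the ones used in the text.

```python
# Each complex frame lists its subframes and the known precedence pairs.
SUBFRAMES = {
    "Motion scenario": {
        "parts": ["Departing", "Motion", "Arriving"],
        "precedes": {("Departing", "Motion"), ("Motion", "Arriving")},
    },
    "Commercial transaction": {
        "parts": ["Commerce goods-transfer", "Commerce money-transfer"],
        "precedes": set(),  # payment and delivery may come in either order
    },
}

def is_before(frame, a, b):
    """True if subframe a must precede subframe b (precedence is followed transitively)."""
    edges = SUBFRAMES[frame]["precedes"]
    frontier, seen = [a], set()
    while frontier:
        current = frontier.pop()
        for earlier, later in edges:
            if earlier == current and later not in seen:
                if later == b:
                    return True
                seen.add(later)
                frontier.append(later)
    return False

def is_totally_ordered(frame):
    """True if every pair of subframes is ordered one way or the other."""
    parts = SUBFRAMES[frame]["parts"]
    return all(
        is_before(frame, a, b) or is_before(frame, b, a)
        for i, a in enumerate(parts)
        for b in parts[i + 1:]
    )
```

With this data, the motion scenario comes out as totally ordered, while the commercial transaction does not, since neither transfer is required to precede the other.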

FrameNet in action

All the documentation is collected in a manual:

FrameNet in other languages

Salsa Project – FrameNet in German

Spanish FrameNet



A lexical reference system available online:

Word meanings are represented by sets of synonyms (synsets). Relations such as meronymy, hypernymy, antonymy, etc. are also represented.
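To make the synset idea concrete, here is a toy sketch of a synset-based lexicon. The identifiers and the handful of entries are invented for illustration and do not reproduce the contents or the API of any real resource. Each synset groups synonymous lemmas and may point to a more general synset (its hypernym).

```python
# Toy data: synset id -> lemmas in the synset and the id of its hypernym synset.
SYNSETS = {
    "animal.n": {"lemmas": ["animal", "beast"], "hypernym": None},
    "canine.n": {"lemmas": ["canine", "canid"], "hypernym": "animal.n"},
    "dog.n": {"lemmas": ["dog", "domestic dog"], "hypernym": "canine.n"},
}

def synonyms(word):
    """All lemmas that share at least one synset with the given word."""
    found = {
        lemma
        for synset in SYNSETS.values()
        if word in synset["lemmas"]
        for lemma in synset["lemmas"]
    }
    return sorted(found - {word})

def hypernym_path(synset_id):
    """Follow hypernym links upwards, from the most specific to the most general."""
    path = []
    current = SYNSETS[synset_id]["hypernym"]
    while current is not None:
        path.append(current)
        current = SYNSETS[current]["hypernym"]
    return path
```

Other relations, such as meronymy or antonymy, could be stored in the same way as further keys pointing to synset identifiers.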

Up-to-date bibliography:

Other multilingual lexicons


A multilingual lexicon based on WordNet and on dictionaries freely available on the web.


A multilingual lexicon (Italian, Spanish, Hebrew, Romanian) in which synsets are aligned, wherever possible, with the synsets of Princeton WordNet. Developed at IRST-ITC in Povo (Trento).



A similar project for European languages: a demo can be downloaded

The various wordnets are linked to an Interlingual Index, which is based on the American WordNet and makes it possible to go from a word in one language to a corresponding word in another. The index also gives access to a shared ontology of 63 semantic distinctions, which provides a common semantic basis for the various languages.
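The role of the interlingual index can be sketched as a shared key that the language-specific wordnets point to. Everything in the example below is invented for illustration: the index identifiers, the tiny word lists and the ontology label are placeholders, not data from the actual project.

```python
# Toy wordnets: language -> {word: interlingual index record}.
WORDNETS = {
    "it": {"cane": "ili-0001", "gatto": "ili-0002"},
    "es": {"perro": "ili-0001", "gato": "ili-0002"},
    "en": {"dog": "ili-0001", "cat": "ili-0002"},
}

# Toy link from index records to a shared ontology label (placeholder label).
ILI_TO_ONTOLOGY = {"ili-0001": "Animal", "ili-0002": "Animal"}

def translate(word, source, target):
    """Go from a word in one language to the words linked to the same index record."""
    record = WORDNETS[source].get(word)
    if record is None:
        return []
    return sorted(w for w, r in WORDNETS[target].items() if r == record)

def ontology_class(word, language):
    """Look up the shared ontology label through the index, if the word is known."""
    record = WORDNETS[language].get(word)
    return ILI_TO_ONTOLOGY.get(record)
```

Because every language links only to the index, adding a new language requires no pairwise mapping to the languages already present.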

Other initiatives

  • EAGLES project (Expert Advisory Group for Language Engineering Standards):

  • development of standards in morphosyntax, syntax and semantics

  • awareness of the interdependence between lexical specifications and corpus tagsets / syntactic annotations

  • the standards developed were used to build resources (both corpora and lexicons) within the European projects PAROLE and SIMPLE

  • ISLE project (International Standards for Language Engineering), Computational Lexicon Working Group:

  • a continuation of the EAGLES project

  • development of a general schema for encoding multilingual lexical information (MILE, Multilingual ISLE Lexical Entry)

  • commitment to reaching consensus on de facto standards through a bottom-up procedure

  • commitment to maximising interaction and synergy with those working on the semantic web

  • PAROLE project:

  • goal: to produce in Europe an initial core of harmonised corpora and lexicons (Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)

  • Information encoded:

    • Morphology: written forms, including stems and variants; morphosyntactic category; inflected forms; morphological features; derivation; abridged forms

    • Syntax: subcategorization patterns; grammatical relations of subcategorised complements; control; diathesis and lexical alternations; pronominalization; linear order constraints; constraints on the syntactic context where the lexical entry is inserted; idioms and collocations

  • SIMPLE project:

  • Addition of a semantic layer to PAROLE

    • “The first attempt to tackle harmonised encoding of semantic types and semantic (subcategorisation) frames on a large scale, i.e. for so many languages and with wide coverage”

  • Semantic information: semantic type; domain information; lexicographic gloss; argument structure for predicative semantic units; event type, to characterise the aspectual properties of verbal predicates; links of the arguments to the syntactic subcategorization frames; ‘qualia’ structure, represented by a very large and granular set of semantic relations and features; regular polysemous alternations (e.g. container for content); hyponymy, synonymy, etc.
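The layered organisation described for PAROLE and SIMPLE (morphology, syntax, and a semantic layer added on top) can be pictured with a small sketch. This is not the actual PAROLE/SIMPLE encoding format: the class and field names are hypothetical, only a few of the listed information types are modelled, and the sample values for the Italian verb chiedere are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Morphology:
    written_forms: list                      # citation form plus variants
    category: str                            # morphosyntactic category
    inflected_forms: list = field(default_factory=list)

@dataclass
class Syntax:
    subcat_patterns: list                    # subcategorization patterns

@dataclass
class Semantics:
    semantic_type: str
    gloss: str                               # lexicographic gloss
    arguments: list                          # argument structure of the predicate

@dataclass
class LexicalEntry:
    morphology: Morphology
    syntax: Syntax
    semantics: Optional[Semantics] = None    # semantic layer added on top of the other two

# Illustrative values only, not taken from the PAROLE or SIMPLE lexicons.
chiedere = LexicalEntry(
    morphology=Morphology(["chiedere"], "verb", ["chiedo", "chiedi", "chiede"]),
    syntax=Syntax(["subject V object a-PP"]),
)
chiedere.semantics = Semantics(
    semantic_type="request",
    gloss="to ask someone for something",
    arguments=["asker", "thing asked for", "person asked"],
)

def lookup(lexicon, form):
    """Return the entries whose written or inflected forms include the given form."""
    return [
        entry
        for entry in lexicon
        if form in entry.morphology.written_forms
        or form in entry.morphology.inflected_forms
    ]
```

Keeping the semantic layer optional mirrors the way it was added to an existing morphological and syntactic description rather than built together with it.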

Two types of typological databases

Databases that collect and document primary language data

e.g. Agreement database


Reflexives and intensifiers database


Databases documenting secondary language data

e.g. Noun Phrase Universals Database (Edinburgh)

The Universals Archive (Konstanz)

Das grammatikalische Raritätenkabinett (Konstanz)


Typological databases

A list of the typological databases developed within the LTRC project (Utrecht) is available.

Particularly user-friendly:

Typological Database of Intensifiers and Reflexives (TDIR):

Reduplication database:

The SMG databases:

World Atlas of Language Structures

The World Atlas of Language Structures consists of 142 maps with accompanying texts on diverse features (such as vowel inventory size, noun-genitive order, passive constructions, and 'hand'/'arm' polysemy), each of which is the responsibility of a single author (or team of authors). Each map shows between 120 (35) and 1110 languages, each language being represented by a dot, and different dot colors showing different values of the features. Altogether 2,650 languages are shown on the maps, and more than 58,000 dots give information on features in particular languages.

Tools for typological research:

Tools and technologies for creating resources

Specialised tools



Fieldworks Data Notebook: source)

Speech analysis:



(version 2.1 is not free; version 1.5 is free)

Annotation tools:


Other tools can be found on the LARL web page, under Links (categories: concordancing tools and other linguistic resources)

Morphological taggers:

Morph-it – morphological tagger for Italian; an online demo is available at:

POS taggers:


TREE tagger:

Text encoding

DBT (DataBase Testuale): software for text analysis and full-text querying developed by E. Picchi (ILC, CNR, Pisa)

LARL holds a corpus of L2 Italian and the LIP corpus (Lessico di frequenza dell’italiano parlato, a frequency lexicon of spoken Italian), both searchable through DBT
