Corpora and Statistical Methods

Corpora and Statistical Methods Albert Gatt

Part 1 Semantic similarity and the vector space model

Synonymy • Different phonological/orthographic words • highly related meanings: • sofa / couch • boy / lad • Traditional definition: • w1 is synonymous with w2 if w1 can replace w2 in a sentence, salvaveritate • Is this ever the case? Can we replace one word for another and keep our sentence identical?

The importance of text genre & register • With near-synonyms, there are often register-governed conditions of use. • E.g. naive vs gullible vs ingenuous • You're so bloody gullible […] • […] outside on the pavement trying to entice gullible idiots in […] • You're so ingenuous . You tackle things the wrong way. • The commentator's ingenuous query could just as well have been prompted […] • However, it is ingenuous to suppose that peace process […] (source: BNC)

Synonymy vs. Similarity • The contextual theory of synonymy: • based on the work of Wittgenstein (1953), and Firth (1957) • You shall know a word by the company it keeps (Firth 1957) • Under this view, perfect synonyms might not exist. • But words can be judged as highly similar if people put them into the same linguistic contexts, and judge the change to be slight.

Synonymy vs. similarity: example • Miller & Charles 1991: • Weak contextual hypothesis:The similarity of the context in which 2 words appear contributes to the semantic similarity of those words. • E.g. snake is similar to [resp. synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts. • It is much more likely that snake/serpent will occur in similar contexts than snake/toad • NB: this is not a discrete notion of synonymy, but a continuous definition of similarity

The Miller/Charles experiment • Subjects were given sentences with missing words; asked to place words they felt were OK in each context. • Method to compare words A and B: • find sentences containing A • find sentences containing B • delete A and B from sentences and shuffle them • ask people to choose which sentences to place A and B in. • Results: • People tend toput similar words in the same context, and this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity • “Similar” is a much broader concept than “synonymous”: • “Contextually related, though differing in meaning”: • man / woman • boy / girl • master / pupil • “Contextually related, but with opposite meanings”: • big / small • clever / stupid

Uses of similarity • Assumption: semantically similar words behave in similar ways • Information retrieval: query expansion with related terms • K nearest neighbours, e.g.: • given: a set of elements, each assigned to some topic • task: classify an unknown w by topic • method: find the topic that is most prevalent among w’s semantic neighbours

Common approaches • Vector-space approaches: • represent word w as a vector containing the words (or other features) in the context of w • compare the vectors of w1, w2 • various vector-distance measures available • Information-theoretic measures: • w1 is similar to w2 to the extent that knowing about w1 increases my knowledge (decreases my uncertainty) about w2

Part 2 Vector-space models

Basic data structure • Matrix M • Mij = no. of times wi co-occurs with wj (in some window). • Can also have Document * word matrix • We can treat matrix cells as boolean: if Mij> 0, then wi co-occurs with wj, else it does not.

Distance measures • Many measures take a set-theoretic perspective: • vectors can be: • binary (indicate co-occurrence or not) • real-valued (indicate frequency, or probability) • similarity is a function of what two vectors have in common

Classic similarity/distance measures Boolean vector (sets) Real-valued vector Dice coefficient Jaccard Coefficient Dice coefficient Jaccard Coefficient

Dice vs. Jaccard • Dice (car, truck) • On the boolean matrix: (2 * 2)/(4+2) = 0.66 • Jaccard • On the boolean matrix: 2/4 = 0.5 • Dice is more “generous”; Jaccard penalises lack of overlap more.

Classic similarity/distance measures Boolean vector (sets) Real-valued vector Cosine similarity Cosine similarity (= angle between 2 vectors)

Part 3 probabilistic approaches

Turning counts to probabilities • P(spacewalking|cosmonaut) = ½ = 0.5 • P(red|car) = ¼ = 0.25 • NB: this transforms each row into a probability distribution corresponding to a word

Probabilistic measures of distance • KL-Divergence: • treat W1 as an approximation of W2 • Problems: • asymmetric: D(p||q) ≠ D(q||p) • not so useful for word-word similarity • if denominator= 0, then D(v||w) is undefined

Probabilistic measures of distance • Information radius (aka Jenson-Shannon Divergence) • compares total divergence between p and q to the average of p and q • symmetric! • Dagan et al (1997) showed this measure to be superior to KL-Divergence, when applied to a word sense disambiguation task.

Some characteristics of vector-space measures • Very simple conceptually; • Flexible: can represent similarity based on document co-occurrence, word co-occurrence etc; • Vectors can be arbitrarily large, representing wide context windows; • Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc).

Grammar-informed methods: Lin (1998) • Intuition: • The similarity of any two things (words, documents, people, plants) is a function of the information gained by having: • a joint description of a and b in terms of what they have in common compared to • describing a and b separately • E.g. do we gain more by a joint description of: • apple and chair (both THINGS…) • apple and banana (both FRUIT: more specific)

Lin’s definition cont/d • Essentially, we compare the info content of the “common” definition to the info content of the “separate” definition • NB: essentially mutual information!

An application to corpora • From a corpus-based point of view, what do words have in common? • context, obviously • How to define context? • just “bag-of-words” (typical of vector-space models) • more grammatically sophisticated

Kilgarriff’s (2003) application • Definition of the notion of context, following Lin: • define F(w) as the set of grammatical contexts in which w occurs • a context is a triple <rel,w,w’>: • rel is a grammatical relation • w is the word of interest • w’ is the other word in rel • Grammatical relations can be obtained using a dependency parser.

Grammatical co-occurrence matrix for cell Source: Jurafsky & Martin (2009), after Lin (1998)

Example with w = cell • Example triples: • <subject-of, cell, absorb> • <object-of, cell, attack> • <nmod-of,cell, architecture> • Observe that each triple f consists of the relation r, the second word in the relation w’, ..and the word of interest w • We can now compute the level of association between the word w and each of its triples f: • An information-theoretic measure that was proposed as a generalisation of the idea of pointwisemutual information.

Calculating similarity • Given that we have grammatical triples for our words of interest, similarity of w1 and w2 is a function of: • the triples they have in common • the triples that are unique to each • I.e.: mutual info of what the two words have in common, divided by sum of mutual info of what each word has

Sample results: master & pupil • common: • Subject-of: read, sit, know • Modifier: good, form • Possession: interest • master only: • Subject-of: ask • Modifier: past (cf. past master) • pupil only: • Subject-of: make, find • PP_at-p: school

Concrete implementation • The online SketchEngine gives grammatical relations of words, plus thesaurus which rates words by similarity to a head word. • This is based on the Lin 1998 model.

Limitations (or characteristics) • Only applicable as a measure of similarity between words of the same category • makes no sense to compare grammatical relations of different category words • Does not distinguish between near-synonyms and “similar” words • student ~ pupil • master ~ pupil • MI is sensitive to low-frequency: a relation which occurs only once in the corpus can come out as highly significant.

Part 4 Applications of vector-space models to information retrieval

Information retrieval • Basic problem definition: • Store a (very large) collection of documents • Document = Newspaper articles, encyclopedia entries, medline abstracts, html pages... • Given a user query (some set of words), retrieve the documents that are most relevant to that query. • Most IR systems take a bag of words approach: • Document = the words it contains • No syntactic information or “higher order” semantic information

IR architecture

Basic representation • Same as for semantic similarity, except that we use a document by term (=word) matrix • A document d is represented as a vector whose cells contain term weights.

Example document representation Fried eggplant recipe Document representation Place the flour, egg, and bread crumbs each in 3 small bowls. Add the 1/2 teaspoon of salt to the egg and whisk to combine. Season the bread crumbs with the tablespoon of Essence and stir with a fork or your hands to thoroughly combine. Dredge each piece of eggplant in the flour, coating thoroughly and then shaking to remove any excess flour. Coat each piece with the egg, then dredge in the bread crumb mixture, pressing to make the bread crumbs adhere. Transfer the eggplant pieces to a rack or to paper towels to let them dry slightly before frying. In a deep, heavy skillet heat 1/2-inch of vegetable oil to 375 degrees F. Fry the eggplant pieces, in batches if necessary, for about 1 minute on each side, or until golden brown. Transfer with tongs to paper towels to drain. Sprinkle lightly with salt before serving. Serve with marinara sauce, or as desired. The term weights are just simple document frequencies (for now)

Example query representation User query Document representation Suppose user types: egg and breadcrumb Query rep could be: The term weights are just simple document frequencies (for now)

More generally • Let d1 be the eggplant recipe, and d2 be a fried chicken recipe. • User query: • Note: intuitively, this query should match both docs (both contain egg and breadcrumb) • Which doc would the query fried chicken match?

Query processing • We can use the same model as we used for computing word similarity, to compute the degree of match between a query and a doc. • E.g. Compute the cosine similarity between the query and the document vector. • Documents can then be ranked by their similarity to the query.

Term weighting • So far, the intuition has been that: frequent terms in a document capture the basic meaning of the document. • Another intuition: terms that crop up in a few documents are more discriminatory.

Inverse document frequency (IDF) • A way of giving a higher weight to more discriminative words. • N = no. of docs in the collection • ni = number of documents containing term i • We combine IDF with TF (the term frequency)

TF/IDF tf idf

TF-IDF weighting • Properties: • Weights terms higher if they are: • frequent in a document AND • rare in the document collection as a whole. • Modified similarity for query/document retrieval: • We only take into account the words actually in the query

Part 5 Evaluation of IR

Evaluation of IR systems • As with most NLP systems, we require some function that our system should maximise. • A lot of NLP evaluation rely on precision and recall.

Basic rationale • For a given classification problem, we have: • a “gold standard” against which to compare • our system’s results, compared to the target gold standard: • false positives (fp) • false negatives (fn) • true positives (tp) • true negatives (tn) • Performance typically measured in terms of precision and recall.

Precision • Definition: • proportion of items that are correctly classified • i.e. proportion of true positives out of all the system’s classifications

Recall • Definition: • proportion of the actual target (“gold standard”) items that our system classifies correctly total no. of items that should be correctly classified, including those the system doesn’t get

Combining precision and recall • Typically use the F-measureas a global estimate of performance against gold standard • We need some factor (alpha) to weight precision and recall; 0.5 gives them equal weighting

Precision/Recall in IR • We assume that the results returned by the IR system can be divided into: • Relevant docs (tp) • Non-relevant docs (fp) • Precision = the fraction of docs that are relevant out of the set of returned docs • Recall: fraction of docs that are relevant out of the whole set of relevant docs • Problem: IR systems tend to rank documents by relevance.

Corpora and Statistical Methods