Distributional models

Katrin Erk

Representing meaning through collections of words

Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday

Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild

Doc 3: applications documents engines information iterated library metadata precision query statistical web

Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you


Representing meaning through collections of words

Wikipedia (version Oct 24, 2009) on the movie “Where the Wild Things Are”

Washington Post Oct 24, 2009 on elections in Afghanistan

Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday

Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild

Doc 3: applications documents engines information iterated library metadata precision query statistical web

Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you

Wikipedia (version Oct 24, 2009) on Information Retrieval

garden.org: Planning a Vegetable Garden


Representing meaning through a collection of words

  • What parts of the meaning of a document can you capture through an unordered collection of words?

  • How can you make use of such collections?


Representing meaning through a collection of words

  • What parts of the meaning of a document can you capture through an unordered collection of words?

    • General topic information: What is the document about?

    • More specifically: things mentioned in the document

  • How can you make use of such collections?

    • Documents on similar topics contain similar words

    • Use in Information Retrieval (search)


Representing collections of words through tables of counts

Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild


Representing collections of words through tables of counts

We can now compare documents by comparing tables of counts.

What can you tell about the second document below?
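The count tables themselves were images on the original slides. As a minimal sketch of how such a table is built (assuming plain whitespace tokenization; `Counter` is Python's standard counting dictionary):

```python
from collections import Counter

def count_table(document: str) -> Counter:
    """Turn a document into a bag-of-words table of counts."""
    return Counter(document.lower().split())

doc2 = "animatronics are children's igloo intimidating Jonze kingdom smashing Wild"
print(count_table(doc2))
# Counter({'animatronics': 1, 'are': 1, ...}) -- each word occurs once here
```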


The “second document”: a more extensive list of words

the 167

and 58

of 58

to 56

a 49

in 37

as 36

is 33

victor 30

* 27

with 26

by 23

her 18

film 17

for 16

emily 15

was 15

corpse 14

bride 13

victoria 13

his 13

on 13

from 11

What movie is this?


From tables to vectors

  • Interpret table as a vector:

    • Each entry is a dimension:

      • “film” is a dimension. Document’s coordinate: 24

      • “wild” is a dimension. Document’s coordinate: 18

  • Then this document is a point in 10-dimensional space
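A hedged sketch of the table-to-vector step: fix an order on the dimensions (the vocabulary), then read off one coordinate per dimension. The dimension names and counts below are illustrative, not the slide’s actual table:

```python
from collections import Counter

def to_vector(counts: Counter, dimensions: list[str]) -> list[int]:
    """Read the count table off in a fixed dimension order; missing words get 0."""
    return [counts.get(dim, 0) for dim in dimensions]

dimensions = ["film", "wild", "bride"]          # a fixed, shared vocabulary
counts = Counter({"film": 24, "wild": 18})      # illustrative counts
print(to_vector(counts, dimensions))            # [24, 18, 0]
```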


Documents as points in vector space

  • Viewing “Wild Things” and “Corpse Bride” as vectors/points in vector space: Similarity between them as proximity in space

[Figure: “Where the Wild Things Are” and “Corpse Bride” as two points in a shared vector space]

“Distributional model”, “vector space model”, and “semantic space model” are used interchangeably here.


What have we gained?

  • The vector space representation of a document can be computed completely automatically: just count words

  • Similarity in vector space is a good predictor for similarity in topic

    • Documents that contain similar words tend to be about similar things


What do we mean by “similarity” of vectors?

Euclidean distance (a dissimilarity measure!):

[Figure: the document vectors for “Where the Wild Things Are” and “Corpse Bride”, with the Euclidean distance between them]
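The formula on the original slide was an image; the standard definition it points to, for vectors $\vec{x}$ and $\vec{y}$ in $n$ dimensions, is

$$ d(\vec{x}, \vec{y}) \;=\; \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$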


What do we mean by “similarity” of vectors?

Cosine similarity:

[Figure: the same two document vectors, with the angle between them]
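Again the formula itself was a slide image; the standard cosine of the angle between $\vec{x}$ and $\vec{y}$ is

$$ \cos(\vec{x}, \vec{y}) \;=\; \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|} \;=\; \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} $$

Unlike Euclidean distance, this is a similarity measure: vectors pointing in the same direction get 1, orthogonal vectors get 0.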


What have we gained?

  • We can compute the similarity of documents

    • through their Euclidean distance

    • or through their cosine

  • We can also represent a query as a vector:

    • Just count the words in the query

  • Now we can search for documents similar to the query
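Putting these pieces together, a minimal retrieval sketch (the toy corpus and function names are mine, not from the slides): count the words in the query exactly as in the documents, then rank documents by cosine similarity to the query.

```python
import math
from collections import Counter

def cosine(x: Counter, y: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(x[w] * y[w] for w in x if w in y)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

docs = {
    "doc1": Counter("abdullah election karzai runoff election".split()),
    "doc4": Counter("cucumbers crops garden vermont garden".split()),
}
query = Counter("election runoff".split())   # the query is just another vector
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)   # ['doc1', 'doc4'] -- doc1 shares the query's words
```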


From documents to words

  • The same holds for words as for documents: context words are a good indicator of meaning

    • Similar words tend to occur in similar contexts

  • What is a context? How do we count here?

    • Take all the occurrences of our target word in a large text

    • Take a context window, e.g. 10 words either side

    • Count everything that occurs there
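A sketch of this window-based counting (tokenization is simplified; ±10 words as on the slide):

```python
from collections import Counter

def context_counts(tokens: list[str], target: str, window: int = 10) -> Counter:
    """Sum context-word counts over all occurrences of the target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

text = "emerging from the earth is emily the corpse bride a beautiful undead girl".split()
print(context_counts(text, "emily"))
```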


Representing the meaning of a word through a collection of context words

Emerging from the earth is Emily, the "Corpse Bride," a beautiful undead girl in a moldy bridal gown who declares Victor her husband.

Counts for target “Emily”, 10 words context either side.


Representing the meaning of a word through a collection of context words

  • Go through all occurrences of “Emily” in a large corpus

    • Count words in 10-word window for each occurrence, sum up


Some co-occurrences: “letter” in “Pride and Prejudice”

  • jane : 12

  • when : 14

  • by : 15

  • which : 16

  • him : 16

  • with : 16

  • elizabeth : 17

  • but : 17

  • he : 17

  • be : 18

  • s : 20

  • on : 20

  • not : 21

  • for : 21

  • mr : 22

  • this : 23

  • as : 23

  • you : 25

  • from : 28

  • i : 28

  • had : 32

  • that : 33

  • in : 34

  • was : 34

  • it : 35

  • his : 36

  • she : 41

  • her : 50

  • a : 52

  • and : 56

  • of : 72

  • to : 75

  • the : 102

This is not a large text! Large = something like 100 million words at least.


From tables to vectors

Counts for “letter” and “surprise” from Pride and Prejudice

  • Interpret table as a vector:

    • Each entry is a dimension:

      • “admirer” is a dimension. Coordinate of “letter”: 1. Coordinate of “surprise”: 0

      • “all” is a dimension. Coordinate of “letter”: 8. Coordinate of “surprise”: 7

  • Then each word is a point in n-dimensional space


What have we gained?

  • The vector space representation of a word can be computed completely automatically: just count co-occurring words in all contexts

  • Similarity in vector space is a good predictor for meaning similarity

    • Words that occur in similar contexts tend to be similar in meaning

    • Synonyms are close together in vector space

    • Antonyms too


Parameters of vector space models

  • W. Lowe (2001): “Towards a theory of semantic space”

  • A semantic space is defined as a tuple (A, B, S, M)

  • B: base elements.

  • A: mapping from raw co-occurrence counts to something else, to correct for frequency effects

  • S: similarity measure.

  • M: transformation of the whole space to different dimensions


B: base elements

  • We have seen: context words as base elements

  • Term x document matrix:

    • Represent document as vector of weighted terms

    • Represent term as vector of weighted documents


B: base elements

  • Dimensions: not words in a context window, but dependency paths starting from the target word (Pado & Lapata 2007)


A: transforming raw counts

  • Problem with vectors of raw counts: distortion through the frequency of the target word

  • Weigh counts:

    • The count on dimension “and” will not be as informative as that on the dimension “angry”

  • For example, using Pointwise Mutual Information between target a and context word b
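The PMI formula was an image on the original slide; the standard definition, given here for completeness, is

$$ \mathrm{PMI}(a, b) \;=\; \log \frac{P(a, b)}{P(a)\,P(b)} $$

where P(a, b) is the probability of seeing context word b near target a. For an uninformative dimension like “and”, b occurs near a about as often as it occurs anywhere, so P(a, b) ≈ P(a) P(b) and the weight is near zero; for “angry” near “letter”, the ratio (and hence the weight) is high.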


M: transforming the whole space

  • Dimensionality reduction:

    • Principal Component Analysis (PCA)

    • Singular Value Decomposition (SVD)

  • Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI): do SVD on the term x document representation to induce “latent” dimensions that correspond to topics that a document can be about (Landauer & Dumais 1997)
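A minimal LSA-style sketch, assuming NumPy and using a random stand-in for a real term x document count matrix (the slide itself gives no code):

```python
import numpy as np

# Stand-in term x document count matrix: 1000 terms, 50 documents.
X = np.random.rand(1000, 50)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 10                                          # number of latent "topic" dimensions
docs_latent = (np.diag(s[:k]) @ Vt[:k, :]).T    # each document as a k-dim vector
terms_latent = U[:, :k] * s[:k]                 # each term as a k-dim vector
print(docs_latent.shape, terms_latent.shape)    # (50, 10) (1000, 10)
```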


Using similarity in vector spaces

  • Search/information retrieval: Given query and document collection,

    • Use the term x document representation: each document is a vector of weighted terms

    • Also represent query as vector of weighted terms

    • Retrieve the documents that are most similar to the query


Using similarity in vector spaces

  • To find synonyms:

    • Synonyms tend to have more similar vectors than non-synonyms: synonyms occur in the same contexts

    • But the same holds for antonyms: in vector spaces, “good” and “evil” are the same (more or less)

  • So: vector spaces can be used to build a thesaurus automatically


Using similarity in vector spaces

  • In cognitive science, to predict

    • human judgments on how similar pairs of words are (on a scale of 1-10)

    • “priming”


An automatically extracted thesaurus

  • Dekang Lin 1998:

    • For each word, automatically extract similar words

    • vector space representation based on syntactic context of target (dependency parses)

    • similarity measure: based on mutual information (“Lin’s measure”)

  • Large thesaurus, used often in NLP applications


Vectors for word senses

  • Up to now: one vector per word

  • Vector for “bank” conflates

    • financial contexts

    • fishing contexts

  • How to get to vectors for word senses?


Automatically inducing word senses

  • Schütze 1998: one vector per sentence, or per occurrence (token) of “letter”

    • She wrote an angry letter to her niece.

    • He sprayed the word in big letters.

    • The newspaper gets 100 letters from readers every day.

  • Make a token vector by adding up the vectors of all other (content) words in the sentence

  • Cluster token vectors

  • Clusters = induced word senses
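A hedged sketch of this pipeline, assuming precomputed context-word vectors (random stand-ins here) and using k-means as one possible clustering method; the three example occurrences correspond to the sentences above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume word_vectors maps each content word to its (precomputed) context vector.
word_vectors = {w: np.random.rand(100) for w in
                ["wrote", "angry", "niece", "sprayed", "word", "big",
                 "newspaper", "readers", "day"]}

def token_vector(context_words: list[str]) -> np.ndarray:
    """Token vector = sum of the vectors of the other (content) words."""
    return sum(word_vectors[w] for w in context_words)

occurrences = [
    ["wrote", "angry", "niece"],        # she wrote an angry letter to her niece
    ["sprayed", "word", "big"],         # he sprayed the word in big letters
    ["newspaper", "readers", "day"],    # the newspaper gets 100 letters ...
]
tokens = np.array([token_vector(ctx) for ctx in occurrences])
senses = KMeans(n_clusters=2, n_init=10).fit_predict(tokens)
print(senses)   # cluster labels = induced senses of "letter"
```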


A vector for an individual occurrence of a word

  • Avoid having to define word senses

    • Sometimes hard to divide uses into senses:

    • words like “leave” or “paint”

  • Erk & Pado 2008: modify the vector of “bank” using its syntactic context:

[Figure: two dependency contexts for “bank”: “bank” as the object of “break”, and “bank” in “fish ... on ... bank”]


Summary: vector space models

  • Representing meaning through counts

    • Represent document through content words

    • Represent word meaning through context words / parse tree snippets / documents

  • Context items as dimensions, target as vector/point in semantic space

  • Proximity in semantic space ~ similarity between words


Summary: vector space models

  • Uses:

    • Search

    • Inducing ontologies

    • Modeling human judgments of word similarity

    • Represent word senses

      • Cluster sentence vectors

      • Compute vectors for individual occurrences

