
Natural Language Processing


Presentation Transcript


  1. Natural Language Processing Vasile Rus http://www.cs.memphis.edu/~vrus/nlp

  2. Outline • Announcements • Latent Semantic Analysis/Indexing • Word embeddings

  3. Announcements • 8000-level presentations • Projects due 4/24/18

  4. Word Vectors • Goal: represent a word by an m-dimensional vector (for medium-sized m, say, m=300) • Have “similar” words be represented by “nearby” vectors in this m-dimensional space • Words in a particular domain (economics, science, sports) could be closer to one another than words in other domains • Could help with synonymy • e.g. “big” and “large” have nearby vectors • Could help with polysemy • “Java” and “Indonesia” could be close in some dimensions • “Java” and “Python” are close in other dimensions

  5. Word Vectors • These vectors are short and information-dense, rather than very long and information-sparse • They require fewer weights and parameters • Fortunately, there are existing pretrained mappings that can be downloaded and used • These were trained on large corpora for a long time • Let’s understand how they were developed and trained
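As a concrete illustration, here is a minimal sketch of loading one such downloadable mapping and checking that “similar” words get nearby vectors. It assumes the gensim library is installed; "glove-wiki-gigaword-300" is one of the pretrained vector sets its downloader can fetch, used here purely as an example.

```python
# Sketch: load pretrained 300-dimensional word vectors and inspect neighbors.
# Assumes gensim is installed; "glove-wiki-gigaword-300" is one of the sets
# its downloader can fetch.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")    # ~300-dimensional KeyedVectors

print(vectors.most_similar("big", topn=3))       # expect "large", "huge", ...
print(vectors.similarity("java", "indonesia"))   # "place" sense of Java
print(vectors.similarity("java", "python"))      # "language" sense of Java
```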

  6. What makes two words similar? • Idea: similar words occur in similar contexts • For a given word, look at the words in a “window” around it. • Count-based methods • Latent Semantic Analysis • Prediction-based methods • Word2vec (neural networks)

  7. What makes two words similar? • Idea: similar words occur in similar contexts • For a given word, look at the words in a “window” around it • LSA • If A occurs with B in similar contexts, they should be close to each other in the vector space • Quantify/count how many times words co-occur in the same “window” • Second-order relations: • If A co-occurs with B a lot and B co-occurs with C a lot, then A and C are somehow similar (through B), even though they may not co-occur in the same window very often
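A minimal sketch of the counting step, using a toy two-sentence corpus and a window of ±2 words (both are placeholders):

```python
# Sketch: count how often word pairs co-occur within a +/- 2 word window.
from collections import Counter

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

cooc = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[(word, sentence[j])] += 1

print(cooc[("cat", "sat")])   # first-order: "cat" and "sat" co-occur directly
# "cat" and "dog" never co-occur here, but both co-occur with "sat" and "the":
# a second-order relation that LSA can pick up.
```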

  8. What makes two words similar? • Idea: similar words occur in similar contexts • For a given word, look at the words in a “window” around it • Consider trying to predict a word given the context • This is exactly the CBOW (continuous bag of words) model • Example: “We hold these truths to be self-evident, that all men are created equal”, window size = 3 → ([‘truths’, ‘to’, ‘be’, ‘that’, ‘all’, ‘men’], ‘self-evident’), i.e. (context words, target word)
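A minimal sketch of how such (context, target) pairs can be generated from the sentence above (punctuation and casing are ignored for simplicity):

```python
# Sketch: extract (context, target) training pairs with a window of 3.
sentence = ("We hold these truths to be self-evident "
            "that all men are created equal").split()
window = 3

pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[6])
# (['truths', 'to', 'be', 'that', 'all', 'men'], 'self-evident')
```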

  9. LSA • The relationship between concepts and words is many-to-many • Solve problems of synonymy and ambiguity by representing words as vectors of ideas or concepts • Could be used in information retrieval • Documents and queries are represented as vectors, and relevance is scored by the cosine similarity between them
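A minimal sketch of the retrieval step, with random placeholder vectors standing in for real document and query representations:

```python
# Sketch: rank documents against a query by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query = rng.random(300)                 # placeholder query vector
docs = rng.random((5, 300))             # one placeholder row per document

scores = [cosine(query, d) for d in docs]
print(np.argsort(scores)[::-1])         # document indices, most relevant first
```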

  10. LSA • Find the latent semantic space that underlies the words/documents • Find the basic (coarse-grained) ideas, regardless of the words used to express them • A kind of co-occurrence analysis, with co-occurring words acting as “bridges” between non-co-occurring words • The latent semantic space has many fewer dimensions than the term space • The space depends on the documents from which it is derived • The components have no names and cannot be directly interpreted

  11. Technical Memo Example: Titles
  c1: Human machine interface for Lab ABC computer applications
  c2: A survey of user opinion of computer system response time
  c3: The EPS user interface management system
  c4: System and human system engineering testing of EPS
  c5: Relation of user-perceived response time to error measurement
  m1: The generation of random, binary, unordered trees
  m2: The intersection graph of paths in trees
  m3: Graph minors IV: Widths of trees and well-quasi-ordering
  m4: Graph minors: A survey

  12. Technical Memo Example: Terms and Documents (rows = terms, columns = documents)
              c1  c2  c3  c4  c5  m1  m2  m3  m4
  human        1   0   0   1   0   0   0   0   0
  interface    1   0   1   0   0   0   0   0   0
  computer     1   1   0   0   0   0   0   0   0
  user         0   1   1   0   1   0   0   0   0
  system       0   1   1   2   0   0   0   0   0
  response     0   1   0   0   1   0   0   0   0
  time         0   1   0   0   1   0   0   0   0
  EPS          0   0   1   1   0   0   0   0   0
  survey       0   1   0   0   0   0   0   0   1
  trees        0   0   0   0   0   1   1   1   0
  graph        0   0   0   0   0   0   1   1   1
  minors       0   0   0   0   0   0   0   1   1

  13. Technical Memo Example: Query • Query: find documents relevant to "human computer interaction" • Simple term matching: matches c1, c2, and c4 • Misses c3 and c5, which are relevant but share no terms with the query

  14. Mathematical concepts • Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents) • Singular Value Decomposition: for any matrix X with t rows and d columns, there exist matrices T0, S0, and D0' such that X = T0S0D0' • T0 and D0 are the matrices of left and right singular vectors • S0 is the diagonal matrix of singular values
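A minimal numerical sketch of this decomposition with NumPy, on a small placeholder term-document matrix:

```python
# Sketch: compute X = T0 S0 D0' for a toy term-document matrix (t=3, d=5).
import numpy as np

X = np.array([[1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1]], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)                        # diagonal matrix of singular values

print(np.allclose(X, T0 @ S0 @ D0t))    # True: X = T0 S0 D0'
```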

  15. More Linear Algebra • A non-negative real number σ is a singular value of X if and only if there exist unit-length vectors u in K^t and v in K^d such that Xv = σu and X'u = σv • The vectors u are the left singular vectors and the vectors v are the right singular vectors • K is a field, such as the field of real numbers

  16. Eigenvectors vs. Singular vectors • Eigenvector: Mv = λv, where v is an eigenvector and λ is a scalar (real number) called the eigenvalue • In matrix form, MV = VD, where D is the diagonal matrix of eigenvalues and the columns of V are the eigenvectors • M = VDV⁻¹ if V is invertible (which is guaranteed when all eigenvalues are distinct)

  17. Eigenvectors vs. Singular vectors • M = VDV' if the eigenvectors are orthonormal • Apply this to the term-term matrix: XX' = (TSD')(TSD')' = TSD'DS'T' = TSS'T' • Comparing with M = VDV', the eigenvectors of XX' are the columns of T and its eigenvalue matrix is D = SS' (the squared singular values)

  18. Linear Algebra • X = T0S0D0' • T and D are column-orthonormal • Their columns are orthonormal vectors that can form a basis for a space • When square, they are unitary, which means T' and D' are also column-orthonormal

  19. More Linear Algebra • Unitary matrices have the following properties • UU' = U'U = In (the n×n identity) • If U has all real entries, it is orthogonal • Orthogonal matrices preserve the inner product of two real vectors • <Ux, Uy> = <x, y> • U is an isometry, i.e. it preserves distances
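A quick numerical check of these properties, using the singular vectors of a random placeholder matrix:

```python
# Sketch: the singular vectors are column-orthonormal, and multiplying by an
# orthogonal matrix preserves inner products (and hence distances).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

print(np.allclose(T0.T @ T0, np.eye(4)))         # T0'T0 = I

U = D0t.T                                        # 4x4 orthogonal matrix
x, y = rng.random(4), rng.random(4)
print(np.allclose((U @ x) @ (U @ y), x @ y))     # <Ux, Uy> = <x, y>
```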

  20. LSA Properties • The projection into the latent concept space preserves topological properties of the original space • Close vectors stay close • Among all spaces of the same dimensionality, the reduced latent concept space is the best approximation of the original space in terms of distance preservation • Both terms and documents are mapped into a new space in which they can be compared directly

  21. Dimensions of matrices • X (t × d) = T0 (t × m) S0 (m × m) D0' (m × d) • m is the rank of X, with m ≤ min(t, d)

  22. Reduced Rank • S0 can be chosen so that the diagonal elements are positive and decreasing in magnitude • Keep the first k and set the others to zero • Delete the zero rows and columns of S0 and the corresponding columns of T0 and rows of D0'; this gives X ≈ X̂ = TSD' • Interpretation: if the value of k is selected well, the expectation is that X̂ retains the semantic information from X but eliminates noise from synonymy and recognizes dependence

  23. Dimensionality Reduction • X̂ (t × d) = T (t × k) S (k × k) D' (k × d) • k is the number of latent concepts (typically 300–500) • X ≈ X̂ = TSD'
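A minimal sketch of the rank-k reduction with NumPy (the matrix and k are placeholders):

```python
# Sketch: keep only the top-k singular values/vectors to form X_hat = T S D'.
import numpy as np

def reduce_rank(X, k):
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    return T0[:, :k], np.diag(s0[:k]), D0t[:k, :]    # T, S, D'

X = np.random.default_rng(0).random((12, 9))         # e.g. 12 terms x 9 documents
T, S, Dt = reduce_rank(X, k=2)
X_hat = T @ S @ Dt                                   # rank-2 approximation of X

print(X_hat.shape)                                   # (12, 9), same shape as X
print(np.linalg.matrix_rank(X_hat))                  # 2
```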

  24. Animation of SVD • [Animation from Wikipedia; visible when run as a slideshow] • M is an m×m square matrix with positive determinant whose entries are plain real numbers

  25. Approximation Intuition • [Image: a 200 × 300 pixel picture reconstructed from only its top 10 and top 50 singular dimensions]

  26. Projected Terms • XX' = (TSD')(TSD')' = TSD'DS'T' = TSS'T' = (TS)(TS)' • Term-term similarities can therefore be computed from the rows of TS
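A minimal sketch of this on a placeholder matrix: since XX' = (TS)(TS)', inner products of the rows of TS give the (approximate, after truncation) term-term similarities.

```python
# Sketch: term-term similarities in the reduced space via the rows of TS.
import numpy as np

X = np.random.default_rng(0).random((12, 9))     # terms x documents
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
TS = T0[:, :k] * s0[:k]                          # row i = projected term i

term_sims = TS @ TS.T                            # approximate term-term matrix
print(term_sims.shape)                           # (12, 12)
```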

  27. LSA Summary • Strong formal framework • Completely automatic; no stemming required; allows misspellings • Computation is expensive

  28. word2vec: CBOW Model • Train a neural network with a single hidden layer (of dimension m) on a large corpus • Input: the context words w(t-2), w(t-1), w(t+1), w(t+2), one-hot encoded • Output: the target word w(t), one-hot encoded

  29. word2vec: CBOW Model • Once the network is trained, the hidden-layer weights are taken as the word vectors (one m-dimensional row per vocabulary word)


  31. word2vec: Skip-gram Model • Same idea, except we predict the context from the target • Input: the target word w(t), one-hot encoded • Output: the context words w(t-2), w(t-1), w(t+1), w(t+2), one-hot encoded • Still a single-hidden-layer neural network
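A minimal sketch of the CBOW architecture just described, written here with PyTorch as an assumed framework; vocabulary size, dimension m, and window size are placeholders, and skip-gram simply reverses the input and output.

```python
# Sketch of CBOW: average the context word embeddings in a single hidden layer
# of size m, then predict the target word over the vocabulary. Skip-gram
# reverses this: embed the target word and predict each context word.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, m=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, m)   # input weights -> word vectors
        self.out = nn.Linear(m, vocab_size)        # scores over the vocabulary

    def forward(self, context_ids):                # (batch, 2 * window) word ids
        hidden = self.embed(context_ids).mean(dim=1)   # single hidden layer
        return self.out(hidden)                    # feed into a cross-entropy loss

model = CBOW(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (8, 6)))   # batch of 8, window of 3
print(logits.shape)                                # torch.Size([8, 10000])
word_vectors = model.embed.weight.detach()         # trained weights = word vectors
```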

  32. word2vec • Distributed Representations of Words and Phrases and Their Compositionality – Mikolov and colleagues • Uses a skip-gram model trained on a large corpus • Lots of details to make it work better • Aggregation of multi-word phrases (e.g. Boston Globe) • Subsampling of frequent words (so less common words get relatively more weight) • Negative sampling (give the network examples of wrong words)
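A minimal training sketch using the gensim implementation of word2vec; the toy corpus is a placeholder and the parameter names follow recent gensim releases (an assumption worth checking against the installed version):

```python
# Sketch: train skip-gram word2vec with negative sampling via gensim.
from gensim.models import Word2Vec

sentences = [["we", "hold", "these", "truths", "to", "be", "self-evident"],
             ["that", "all", "men", "are", "created", "equal"]]

model = Word2Vec(sentences,
                 vector_size=300,   # dimension m of the word vectors
                 window=5,          # context window size
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 negative=5,        # 5 "wrong" words sampled per training example
                 min_count=1)       # keep even rare words in this tiny corpus

print(model.wv["truths"].shape)     # (300,)
```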

  33. Summary • Latent Semantic Analysis • Word2vec

  34. Next Time • Logic Forms • Information Extraction
