
Advanced Multimedia


Presentation Transcript


  1. Advanced Multimedia Text Retrieval/Classification Tamara Berg

  2. Announcements • Matlab basics lab – Feb 7 • Matlab string processing lab – Feb 12 If you are unfamiliar with Matlab, attendance at labs is crucial!

  3. Slide from Dan Klein

  4. Slide from Dan Klein

  5. Today! Slide from Dan Klein

  6. What does categorization/classification mean?

  7. Slide from Dan Klein

  8. Slide from Dan Klein

  9. Slide from Dan Klein

  10. Slide from Dan Klein

  11. Slide from Min-Yen Kan

  12. http://yann.lecun.com/exdb/mnist/index.html Slide from Dan Klein

  13. Slide from Dan Klein

  14. Slide from Min-Yen Kan

  15. Slide from Min-Yen Kan

  16. Slide from Min-Yen Kan

  17. Machine Learning – how to select a model on the basis of data/experience: • Learning parameters (e.g. probabilities) • Learning structure (e.g. dependencies) • Learning hidden concepts (e.g. clustering) Slide from Min-Yen Kan

  18. Representing Documents

  19. Document Vectors

  20. Document Vectors • Represent document as a “bag of words”

  21. Example • Doc1 = “the quick brown fox jumped” • Doc2 = “brown quick jumped fox the”

  22. Example • Doc1 = “the quick brown fox jumped” • Doc2 = “brown quick jumped fox the” Would a bag of words model represent these two documents differently?
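
A quick MATLAB check makes the answer concrete: a bag of words keeps only which words occur (and how often), not their order, so sorting the tokens of each document exposes whether the bags match. (A minimal sketch; strsplit and isequal are standard MATLAB functions, and the example strings are from the slide.)

    % Tokenize both documents; sorting discards word order.
    doc1 = 'the quick brown fox jumped';
    doc2 = 'brown quick jumped fox the';
    tokens1 = sort(strsplit(doc1));
    tokens2 = sort(strsplit(doc2));
    isequal(tokens1, tokens2)   % returns 1 (true): identical bags of words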

  23. Document Vectors • Documents are represented as “bags of words” • Represented as vectors when used computationally • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse Slide from Mitch Marcus

  24. Document Vectors • Documents are represented as “bags of words” • Represented as vectors when used computationally • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse Lexicon – the vocabulary set that you consider to be valid words in your documents. Usually stemmed (e.g. running->run) Slide from Mitch Marcus
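
To make the representation concrete, here is a minimal MATLAB sketch that builds a count vector for one document over a fixed lexicon (the lexicon and document are illustrative; a real pipeline would also stem the tokens):

    % Count how often each lexicon word occurs in the document.
    lexicon = {'brown', 'fox', 'jumped', 'quick', 'the'};
    doc     = 'the quick brown fox jumped the fence';
    tokens  = strsplit(lower(doc));
    vec = zeros(1, numel(lexicon));
    for i = 1:numel(lexicon)
        vec(i) = sum(strcmp(lexicon{i}, tokens));
    end
    disp(vec)   % [1 1 1 1 2]: 'the' occurs twice; 'fence' is not in the lexicon

Most entries of such a vector are zero once the lexicon covers a whole collection, which is why the slide calls the vectors sparse.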

  25. Document Vectors: One location for each word. [Table: rows are the terms nova, galaxy, heat, h’wood, film, role, diet, fur; columns are the documents A–I; each cell is the term’s count in that document, blank meaning 0 occurrences.] “Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A. Slide from Mitch Marcus

  27. Document Vectors [Table: document ids A–I as columns, the terms nova, galaxy, heat, h’wood, film, role, diet, fur as rows; cells hold the term counts.] Slide from Mitch Marcus

  28. Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • Documents represented by vectors of terms • A vector distance measures similarity between documents • Document similarity is based on length and direction of their vectors • Terms in a vector can be “weighted” in many ways Slide from Mitch Marcus
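
The slide leaves the weighting schemes unspecified; one standard choice (an assumption here, not named on the slide) is tf-idf, which downweights terms that appear in many documents. A MATLAB sketch on a toy count matrix:

    % tf-idf weighting on a toy term-document count matrix.
    tf  = [10 5 0; 5 10 0; 3 0 7];    % rows = terms, columns = documents
    df  = sum(tf > 0, 2);             % number of documents containing each term
    idf = log(size(tf, 2) ./ df);     % rare terms get larger weights
    tfidf = tf .* repmat(idf, 1, size(tf, 2));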

  30. Comparing Documents

  31. Similarity between documents A = [10 5 3 0 0 0 0 0]; G = [5 0 7 0 0 9 0 0]; E = [0 0 0 0 0 10 10 0];

  32. Similarity between documents A = [10 5 3 0 0 0 0 0]; G = [ 5 0 7 0 0 9 0 0]; E = [ 0 0 0 0 0 10 10 0]; Treat the vectors as binary (1 if the word occurs in the document, 0 otherwise); the similarity Sb is then the number of words two documents have in common. Sb(A,G) = ? Sb(A,E) = ? Sb(G,E) = ? Which pair of documents is the most similar?
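
In MATLAB the binary-overlap similarity is one line, and it answers the questions on the slide:

    A = [10 5 3 0 0 0 0 0];
    G = [ 5 0 7 0 0 9 0 0];
    E = [ 0 0 0 0 0 10 10 0];
    Sb = @(x, y) sum(x > 0 & y > 0);   % words occurring in both documents
    Sb(A, G)   % 2 (nova, heat)
    Sb(A, E)   % 0
    Sb(G, E)   % 1 (role)

So A and G are the most similar pair under this binary measure.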

  33. Similarity between documents A = [10 5 3 0 0 0 0 0]; G = [5 0 7 0 0 9 0 0]; E = [0 0 0 0 0 10 10 0]; Sum of Squared Distances: $\mathrm{SSD}(x, y) = \sum_i (x_i - y_i)^2$ SSD(A,G) = ? SSD(A,E) = ? SSD(G,E) = ?
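
Continuing in MATLAB (note that for SSD smaller means more similar, the opposite of the similarity scores above):

    SSD = @(x, y) sum((x - y).^2);
    SSD(A, G)   % 147
    SSD(A, E)   % 334
    SSD(G, E)   % 175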

  34. Similarity between documents A = [10 5 3 0 0 0 0 0]; G = [5 0 7 0 0 9 0 0]; E = [0 0 0 0 0 10 10 0]; Angle between vectors: $\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}$ Dot product: $x \cdot y = \sum_i x_i y_i$ Length (Euclidean norm): $\|x\| = \sqrt{\sum_i x_i^2}$
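
The same formulas in MATLAB, using the built-in dot and norm functions:

    cosim = @(x, y) dot(x, y) / (norm(x) * norm(y));
    cosim(A, G)   % approx. 0.49
    cosim(A, E)   % 0 -- no terms in common
    cosim(G, E)   % approx. 0.51

Cosine ignores document length: scaling a vector by a constant leaves the angle, and hence the similarity, unchanged.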

  35. Some words give more information than others • Does the fact that two documents both contain the word “the” tell us anything? How about “and”? Stop words (noise words): words that are probably not useful for processing, filtered out before natural language processing is applied. • Other words can be more or less informative. There is no definitive stop list, but one commonly used list is at: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
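
A minimal MATLAB sketch of stop-word filtering (the stop list below is a tiny illustrative sample, not the full list at the URL above):

    stopwords = {'the', 'and', 'a', 'of', 'to', 'in'};
    tokens = strsplit('the quick brown fox jumped over the lazy dog and ran');
    tokens = tokens(~ismember(tokens, stopwords))   % stop words removed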

  37. Classifying Documents

  38. Here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in? Slide from Min-Yen Kan

  39. Query document – which class should you label it with? Slide from Min-Yen Kan

  40. Classification by Nearest Neighbor Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find most similar doc) Slide from Min-Yen Kan

  41. Classification by kNN Classify the test document as the majority class of the k documents “nearest” to the query document. Slide from Min-Yen Kan
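
A sketch of the kNN rule in MATLAB, using the cosine similarity defined earlier (our own illustration, not the lecture's code; labels are assumed to be numeric class ids):

    % X: one training document vector per row; labels: numeric class ids;
    % q: query document vector; k: number of neighbors.
    function label = knn_classify(X, labels, q, k)
        sims = (X * q') ./ (sqrt(sum(X.^2, 2)) * norm(q));  % cosine similarity to each doc
        [~, idx] = sort(sims, 'descend');                   % most similar first
        label = mode(labels(idx(1:k)));                     % majority vote among k nearest
    end

With k = 1 this reduces to the nearest-neighbor classifier from the previous slide.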

  42. Slide from Min-Yen Kan

  43. Slide from Min-Yen Kan

  44. Slide from Min-Yen Kan

  45. Slide from Min-Yen Kan

  46. Slide from Min-Yen Kan

  47. Classification by kNN What are the features? What’s the training data? Testing data? Parameters? Slide from Min-Yen Kan
