
Introduction & Information Theory

Presentation Transcript


  1. Introduction & Information Theory Ling572 Advanced Statistical Methods in NLP January 3, 2012

  2. Roadmap • Course Overview • Information theory

  3. Course Overview

  4. Course Information • Course web page: • http://courses.washington.edu/ling572

  5. Course Information • Course web page: • http://courses.washington.edu/ling572 • Syllabus: • Schedule and readings • Links to other readings, slides, links to class recordings • Slides posted before class, but may be revised

  6. Course Information • Course web page: • http://courses.washington.edu/ling572 • Syllabus: • Schedule and readings • Links to other readings, slides, links to class recordings • Slides posted before class, but may be revised • Catalyst tools: • GoPost discussion board for class issues • CollectIt Dropbox for homework submission and TA comments • Gradebook for viewing all grades

  7. GoPost Discussion Board • Main venue for course-related questions, discussion • What not to post: • Personal, confidential questions; Homework solutions

  8. GoPost Discussion Board • Main venue for course-related questions, discussion • What not to post: • Personal, confidential questions; Homework solutions • What to post: • Almost anything else course-related • Can someone explain…? • Is this really supposed to take this long to run?

  9. GoPost Discussion Board • Main venue for course-related questions, discussion • What not to post: • Personal, confidential questions; Homework solutions • What to post: • Almost anything else course-related • Can someone explain…? • Is this really supposed to take this long to run? • Key location for class participation • Post questions or answers • Your discussion space: Michael & I will not jump in often

  10. GoPost • Emily’s 5-minute rule: • If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost!

  11. GoPost • Emily’s 5-minute rule: • If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost! • Mechanics: • Please use your UW NetID as your user id • Please post early and often! • Don’t wait until the last minute • Keep up with the GoPost; it’s hard to use retrospectively • Notifications: • Decide how you want to receive GoPost postings

  12. Email • Should be used only for personal or confidential issues • Grading issues, extended absences, other problems • General questions/comments go on GoPost

  13. Email • Should be used only for personal or confidential issues • Grading issues, extended absences, other problems • General questions/comments go on GoPost • Please send email from your UW account • Include Ling572 in the subject • If you don’t receive a reply in 24 hours (48 on weekends), please follow up

  14. Homework Submission • All homework should be submitted through CollectIt • tar cvf hw1.tar hw1_dir • Homework due 11:45 Thursdays • Late homework receives 10%/day penalty (incremental) • Most major programming languages accepted • C/C++/C#, Java, Python, Perl, Ruby • If you want to use something else, please check first • Please follow naming, organization guidelines in HW • All programming assignments should run on the CL cluster under Condor
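
A minimal sketch (not part of the original slides) of the same packaging step using Python's standard tarfile module, in case you prefer to script it; the tar command shown above is the intended method, and the hw1.tar / hw1_dir names simply follow the slide.

    # Build hw1.tar from the hw1_dir directory, equivalent to:
    #   tar cvf hw1.tar hw1_dir
    import tarfile

    with tarfile.open("hw1.tar", "w") as archive:   # "w" = plain (uncompressed) tar
        archive.add("hw1_dir")                      # adds the directory recursively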

  15. Homework Assignments • (Mostly) Implementation tasks designed to get hands-on understanding of ML approaches • Focus on core concepts, not minute optimizations • If the gold standard achieves 90.7%, 89.8% is okay • Not scored directly on efficiency, but… • If it’s too slow, it’s hard to debug, test, etc. • Not scored on optimal software design either • Try to avoid hardcoding, but you don’t need a complex design

  16. Grading • Homework assignments: 80% • Reading assignments: 10% • Class participation: 10% • No midterm or final exams • One homework assignment may be dropped

  17. Grades • Grades in Catalyst Gradebook • TA feedback returned through CollectIt

  18. Grades • Grades in Catalyst Gradebook • TA feedback returned through CollectIt • Extensions: only for extreme circumstances • Illness, family emergencies • Incomplete: only if all work completed up to the last two weeks • UW policy

  19. Workload • CLMS courses carry a heavy workload • Ling572 is no exception

  20. Workload • CLMS courses carry a heavy workload • Ling572 is no exception • Estimates (per week): • ~3 hours: Lecture • 10-12 hours: Homework assignments • Highly variable, depending on prior programming exp. • 1-3 hours: Reading + reading assignments

  21. Workload • CLMS courses carry a heavy workload • Ling572 is no exception • Estimates (per week): • ~3 hours: Lecture • 10-12 hours: Homework assignments • Highly variable, depending on prior programming exp. • 1-3 hours: Reading + reading assignments • Tracking: • GoPost thread for each assignment: please post • Consider an automatic time tracker (e.g. ‘hamster’ for Linux)

  22. Recordings • All classes will be recorded • Links to recordings appear in syllabus • Available to all students, DL and in class

  23. Recordings • All classes will be recorded • Links to recordings appear in syllabus • Available to all students, DL and in class • Please remind me to: • Record the meeting (look for the red dot) • Repeat in-class questions

  24. Recordings • All classes will be recorded • Links to recordings appear in syllabus • Available to all students, DL and in class • Please remind me to: • Record the meeting (look for the red dot) • Repeat in-class questions • Note: Instructor’s screen is projected in class • Assume that chat window is always public

  25. Contact Info • Gina: Email: levow@uw.edu • Office hour: • Fridays: 12:30-1:30 (after Treehouse meeting) • Location: Padelford B-201 • Or by arrangement • Available by Skype or Adobe Connect

  26. Contact Info • Gina: Email: levow@uw.edu • Office hour: • Fridays: 12:30-1:30 (after Treehouse meeting) • Location: Padelford B-201 • Or by arrangement • Available by Skype or Adobe Connect • TA: Michael Wayne Goodman: • Email: goodmami@uw.edu • Office hour: Time: TBD, see GoPost • Location: Treehouse

  27. Online Option • Please check you are registered for correct section • CLMS in-class: Section A • State-funded: Section B • CLMS online: Section C

  28. Online Option • Please check you are registered for correct section • CLMS in-class: Section A • State-funded: Section B • CLMS online: Section C • Online attendance for in-class students • Not more than 2 times per term (e.g. missed bus, ice)

  29. Online Option • Please check you are registered for correct section • CLMS in-class: Section A • State-funded: Section B • CLMS online: Section C • Online attendance for in-class students • Not more than 2 times per term (e.g. missed bus, ice) • Please enter the meeting room 5-10 minutes before the start of class • Try to stay online throughout class

  30. Online Tip • If you see: • “You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc. You do not have a Connect account but need to have one. For UWEO students: if you have just created your UW NetID or just enrolled in a course …” • Clear your cache, close and restart your browser

  31. Course Description

  32. Course Prerequisites • Programming Languages: • Java/C++/Python/Perl/… • Operating Systems: Basic Unix/Linux • CS 326 (Data structures) or equivalent • Lists, trees, queues, stacks, hash tables, … • Sorting, searching, dynamic programming, … • Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, … • Ling 570 (or similar)

  33. Course Prerequisites • Programming Languages: • Java/C++/Python/Perl/… • Operating Systems: Basic Unix/Linux • CS 326 (Data structures) or equivalent • Lists, trees, queues, stacks, hash tables, … • Sorting, searching, dynamic programming, … • Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, … • Ling 570 (or similar) • If you haven’t taken Ling570 or Ling472, please email me.
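
As a quick refresher on the probability prerequisite, here is an illustrative Python snippet (not from the slides) applying Bayes’ rule; the event names and numbers are purely hypothetical.

    # Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
    def bayes(p_b_given_a, p_a, p_b):
        """Return P(A|B) given P(B|A), P(A), and P(B)."""
        return p_b_given_a * p_a / p_b

    # Hypothetical example: P(spam) = 0.2, P("free" | spam) = 0.5, P("free") = 0.15
    print(bayes(0.5, 0.2, 0.15))   # P(spam | "free") ≈ 0.667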

  34. Textbook • No textbook • Online readings

  35. Textbook • No textbook • Online readings • Reference / Background: • Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008 • Available from UW Bookstore, Amazon, etc. • Manning and Schütze, Foundations of Statistical Natural Language Processing • Early edition available online through UW library

  36. Course Goals • Understand the basis of machine learning algorithms that achieve state-of-the-art results

  37. Course Goals • Understand the basis of machine learning algorithms that achieve state-of-the-art results • Focus on classification and sequence labeling

  38. Course Goals • Understand the basis of machine learning algorithms that achieve state-of-the-art results • Focus on classification and sequence labeling • Concentrate on basic concepts of machine learning techniques and application to NLP tasks • Not a computational learning theory class • Won’t focus on proofs

  39. Model Questions • Machine learning algorithms • Decision trees and Naïve Bayes • MaxEnt and Support Vector Machines • …

  40. Model Questions • Machine learning algorithms • Decision trees and Naïve Bayes • MaxEnt and Support Vector Machines • … • Key questions • What is the model? • What assumptions does the model make? • How many parameters does the model have?

  41. Model Questions • Training: How are the parameters learned? • Decoding: How does the model assign values?

  42. Model Questions • Training: How are the parameters learned? • Decoding: How does the model assign values? • Pros and Cons: • How does the model handle… • outliers? missing data? noisy data? • Is it scalable? • How long does it take to train? decode? • How much training data is needed? Labeled? Unlabeled?
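
To make the training/decoding distinction above concrete, here is a minimal sketch (not from the slides) of a toy Naïve Bayes text classifier, one of the models listed earlier: training estimates the parameters (class priors and per-class word probabilities) from counts, and decoding assigns the highest-scoring class. The function names and toy data are hypothetical.

    import math
    from collections import Counter, defaultdict

    def train(examples):
        """examples: list of (label, token_list). Returns (priors, cond_probs, vocab)."""
        label_counts = Counter(label for label, _ in examples)
        word_counts = defaultdict(Counter)
        vocab = set()
        for label, tokens in examples:
            word_counts[label].update(tokens)
            vocab.update(tokens)
        priors = {c: n / len(examples) for c, n in label_counts.items()}
        cond_probs = {}
        for c in label_counts:
            total = sum(word_counts[c].values())
            # add-one smoothing over the vocabulary
            cond_probs[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
        return priors, cond_probs, vocab

    def decode(tokens, priors, cond_probs, vocab):
        """Return the label with the highest log posterior score."""
        best, best_score = None, float("-inf")
        for c in priors:
            score = math.log(priors[c])
            for w in tokens:
                if w in vocab:
                    score += math.log(cond_probs[c][w])
            if score > best_score:
                best, best_score = c, score
        return best

    # Toy usage with hypothetical data:
    data = [("pos", ["good", "fun"]), ("neg", ["bad", "boring"])]
    priors, cond_probs, vocab = train(data)
    print(decode(["good"], priors, cond_probs, vocab))   # -> "pos"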

  43. Tentative Outline for Ling572 • Unit #0 (0.5 weeks): Basics • Introduction • Information theory • Classification review

  44. Outline for Ling572 • Unit #0 (0.5 weeks): Basics • Introduction • Information Theory • Classification review • Unit #1 (3 weeks): Classic Machine Learning • K Nearest Neighbors • Decision Trees • Naïve Bayes • Perceptrons (?)

  45. Outline for Ling572 • Unit #3 (4 weeks): Discriminative Classifiers • Feature Selection • Maximum Entropy Models • Support Vector Machines

  46. Outline for Ling572 • Unit #3 (4 weeks): Discriminative Classifiers • Feature Selection • Maximum Entropy Models • Support Vector Machines • Unit #4 (1.5 weeks): Sequence Learning • Conditional Random Fields • Transformation Based Learning

  47. Outline for Ling572 • Unit #3 (4 weeks): Discriminative Classifiers • Feature Selection • Maximum Entropy Models • Support Vector Machines • Unit #4 (1.5 weeks): Sequence Learning • Conditional Random Fields • Transformation Based Learning • Unit #5 (1 week): Other Topics • Semi-supervised learning, …

  48. Outline for Ling572 • Topics: • Feature selection approaches • Beam search • Toolkits: • Mallet, libSVM • Using binary classifiers for multiclass classification

  49. Early NLP • Early approaches to Natural Language Processing • Similar to classic approaches to Artificial Intelligence

  50. Early NLP • Early approaches to Natural Language Processing • Similar to classic approaches to Artificial Intelligence • Reasoning, knowledge-intensive approaches
