Machine Learning in Practice Lecture 12

Carolyn Penstein Rosé

Language Technologies Institute / Human-Computer Interaction Institute

Plan for the Day
  • Announcements
    • Assignment 5 handed out – due next Thursday
    • Note: Readings for next two lectures on Blackboard in Readings folder
      • See syllabus for specifics
    • Feedback on Quiz 4
    • Homework 4 Issues
    • Midterm assigned Thursday, Oct 21!!!
  • More about Text
    • Term Weights
    • Start Linguistic Tools
Assignment 5

* 2 examples, but there are many more

TA Office Hours
  • Possibly moving to Wednesdays at 3
  • Note that there will be a special TA session before the midterm for you to ask questions
How are we doing on pace and level of detail?

Feedback on Quiz 4
  • Nice job overall!!!
    • I could tell you read carefully!
  • Note that part-of-speech refers to grammatical categories like noun, verb, etc.
  • Named entity extractors locate noun phrases that refer to people, organizations, countries, etc.
  • Some people skipped the "why" and "how" parts of questions
  • Some people overestimated the contribution of POS tagging
Error Analysis
  • If I sort by different features, I can see whether rows of a particular color end up in a specific region
  • If I want to know which features to do this with, I can start with the most predictive features
  • Another option would be to use machine learning to predict which cell an instance would end up in within the confusion matrix
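
As a sketch of that last idea (hypothetical toy data, not part of the original slides), each instance can be labeled with its confusion-matrix cell, and sorting by a predictive feature shows whether particular cells cluster in one region of that feature's range:

```python
from collections import Counter

# Hypothetical instances: (value of the most predictive feature, actual, predicted)
instances = [(0.9, "pos", "pos"), (0.8, "pos", "pos"), (0.7, "neg", "pos"),
             (0.4, "neg", "neg"), (0.3, "pos", "neg"), (0.2, "neg", "neg")]

# Sort by the feature and print each instance's confusion-matrix cell
for value, actual, predicted in sorted(instances):
    print(f"{value:.1f}  actual={actual}  predicted={predicted}")

# Tally which cells fall in the upper half of the feature's range
print(Counter(f"{actual}->{predicted}" for value, actual, predicted in instances
              if value >= 0.5))
```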
Other Suggestions
  • Use RemoveMisclassified
    • Unsupervised instance filter
    • Separates correctly classified instances from incorrectly classified instances
    • Works in a similar way to the remove folds filter
      • Only need to use it twice rather than 20 times for 10-fold cross validation
      • Doesn’t give you as much information
Computing Confidence Intervals
  • 90% confidence interval corresponds to z=1.65
    • 5% chance that a data point will occur to the right of the rightmost edge of the interval
  • f = percentage of successes
  • N = number of trials
  • Example: f = 75%, N = 1000, c = 90% -> [0.727, 0.773]
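
As a quick check of the example above (a minimal sketch using the simple normal approximation f ± z·sqrt(f(1 − f)/N); the formula used in the course text may include a further correction), the quoted interval can be reproduced in a few lines:

```python
import math

def confidence_interval(f, n, z=1.65):
    """Normal-approximation confidence interval for a success rate f over n trials.

    z = 1.65 corresponds roughly to a 90% (two-sided) confidence level.
    """
    margin = z * math.sqrt(f * (1 - f) / n)
    return (f - margin, f + margin)

# f = 75%, N = 1000, c = 90%  ->  approximately (0.727, 0.773)
print(confidence_interval(0.75, 1000))
```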
Document Retrieval / Inverted Index

* Index structure shown on the slide: stemmed tokens, each with a document frequency, a total frequency, and a postings list (Doc#, Freq, Word Positions)

** Easy to find all documents that have terms in common with your query.

** Stemming allows you to retrieve morphological variants (run, runs, running, runner).

** Word positions allow you to specify that you want two terms to appear within N words of each other.

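To make the picture concrete (a minimal sketch, not the indexing software behind a real search engine), a toy inverted index with word positions can be built as follows; the stem function here is a crude stand-in for a real stemmer such as Porter's:

```python
from collections import defaultdict

def stem(token):
    """Crude stand-in for a real stemmer: chop a few common endings off."""
    for suffix in ("ning", "ner", "s"):
        if token.endswith(suffix):
            return token[:-len(suffix)]
    return token

def build_index(docs):
    """Map each stemmed token to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[stem(token)][doc_id].append(position)
    return index

docs = {1: "the runner runs fast", 2: "running is fun"}
index = build_index(docs)
print(sorted(index["run"].items()))   # postings for "run": (doc id, [positions])
print(len(index["run"]))              # document frequency of "run"
```
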
Evaluating Document Retrieval
  • Use standard measures: precision, recall, f-measure
  • Retrieving all documents that share words with your query will both over-generate and under-generate
    • If you have the whole web to select from, then under-generating is much less of a concern than over-generating
    • Does this apply to your task?
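
For reference (a small sketch, not code from the lecture), the measures listed above can be computed from the sets of retrieved and relevant documents:

```python
def evaluate_retrieval(retrieved, relevant):
    """Precision, recall, and F-measure for a set-based retrieval run."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Retrieving everything that shares a word with the query tends to over-generate:
# recall comes out higher than precision.
print(evaluate_retrieval(retrieved={1, 2, 3, 4, 5, 6}, relevant={2, 3, 7}))
```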
Common Vocabulary is Not Enough
  • You’ll get documents that mention other senses of the term you mean
    • River bank versus financial institution
    • Word sense disambiguation is an active area of computational linguistics!
  • You won’t get documents that discuss other related terms
Common Vocabulary is Not Enough
  • You’ll get documents that mention a term but are not about that term
    • Partly get around this by sorting by relevance
    • Term weights approximate a measure of relevance
    • Cosine similarity between Query vector and Document vector computes relevance – then sort documents by relevance
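
As a sketch of that last bullet (hypothetical term-weight vectors, not data from the lecture), documents can be sorted by their cosine similarity to the query vector:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

query = [1.0, 1.0, 0.0]                        # weights over a 3-term vocabulary
documents = {"doc1": [0.9, 0.1, 0.0],
             "doc2": [0.5, 0.6, 0.2],
             "doc3": [0.0, 0.0, 1.0]}

# Rank documents by relevance to the query, most similar first
ranked = sorted(documents,
                key=lambda d: cosine_similarity(query, documents[d]),
                reverse=True)
print(ranked)   # ['doc2', 'doc1', 'doc3']
```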
Computing Term Weights
  • A common vector representation for text is to have one attribute per word in your vocabulary
  • Notice that Weka gives you other options
Why is it important to think about term weights?
  • If term frequency or salience matters for your task, you might lose too much information if you just consider whether a term ever occurred or not
  • On the other hand, if term frequency doesn’t matter for your task, a simpler representation will most likely work better
  • Term weights are important for information retrieval because in large documents, just knowing a term occurs at least once does not tell you whether that document is “about” that term
Basics of Computing Term Weights
  • Assume occurrence of each term is independent so that attributes are orthogonal
    • Obviously this isn’t true! But it’s a useful simplifying assumption
  • Term weight functions have two basic components
    • Term frequency: How many times did that term occur in the current document
    • Document frequency: How many times did that term occur across documents (or how many documents did that term occur in)
Basics of Computing Term Weights
  • Inverse document frequency: a measure of the rarity of a term
    • idf_t = log(N / n_t), where t is the term, N is the number of documents, and n_t is the number of documents in which the term occurs at least once
    • Note that inverse document frequency is 0 in the case that a term occurs in all documents
    • It is largest for very rare terms (those that occur in only one or a few documents)
TF.IDF – Term Frequency/Inverse Document Frequency
  • A scheme that combines term frequency with inverse document frequency
    • w_{t,d} = tf_{t,d} × idf_t
  • Weka also gives you the option for normalizing for document length
    • Since terms are more likely to occur in longer documents just by chance
  • You can then compute the cosine similarity between the vector representation for a text and that of a query
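
To make the scheme concrete (a minimal sketch over a toy corpus, not the course's Weka setup, and without the document-length normalization mentioned above), the weights can be computed directly from the definitions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute w_{t,d} = tf_{t,d} * idf_t for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))          # count each term once per document
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors

docs = [["the", "bank", "of", "the", "river"],
        ["the", "bank", "approved", "the", "loan"],
        ["the", "river", "ran", "dry"]]
for vector in tfidf_vectors(docs):
    print(vector)   # "the" gets weight 0 because it occurs in every document
```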
Computing Term Weights
  • Notice how to set options for different types of term weights
Trying Different Term Weights
  • Predicting Class1, 72 instances
  • Note that the number of levels for each separate feature is identical for Word Count, Term Frequency, and TF.IDF.
Trying Different Term Weights
  • What is different is the relative weight of the different features.
  • Whether this matters depends on the learning method
Basic Anatomy: Layers of Linguistic Analysis
  • Phonology: The sound structure of language
    • Basic sounds, syllables, rhythm, intonation
  • Morphology: The building blocks of words
    • Inflection: tense, number, gender
    • Derivation: building words from other words, transforming part of speech
  • Syntax: Structural and functional relationships between spans of text within a sentence
    • Phrase and clause structure
  • Semantics: Literal meaning, propositional content
  • Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness)
  • Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis
Sentence Segmentation
  • Breaking a text into sentences is a first step for processing
  • Why is this not trivial?
    • In speech there is no punctuation
    • In text, punctuation may be missing
    • Punctuation may be ambiguous (e.g., periods in abbreviations)
  • Alternative approaches
    • Rule based/regular expressions
    • Statistical models
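
As an illustration of the rule-based option (a deliberately naive sketch, not a production segmenter), a regular expression that splits after sentence-final punctuation followed by whitespace and a capital letter already shows why abbreviations make this non-trivial:

```python
import re

# Split after ., !, or ? when followed by whitespace and a capital letter
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return SENTENCE_BOUNDARY.split(text)

print(split_sentences("Dr. Smith arrived. She gave the lecture. Then we left."))
# The period in the abbreviation "Dr." is wrongly treated as a sentence
# boundary, which is one reason statistical models are used instead.
```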
Tokenization
  • Segment a data stream into meaningful units
    • Each unit is called a token
  • Simple rule: a token is any sequence of characters separated by white space
    • Leaves punctuation attached to words
    • But stripping out punctuation would break up large numbers like 5,235,064
    • What about multi-word units like “school bus”?
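
A small sketch contrasting the simple whitespace rule with a regular-expression tokenizer (illustrative only, not the tokenizer used in the assignments):

```python
import re

text = "The school bus cost $5,235,064, believe it or not."

# Simple rule: split on whitespace; punctuation stays attached to words
print(text.split())

# Regex alternative: keep large numbers with internal commas together and split
# other punctuation off as separate tokens. Multi-word units like "school bus"
# still end up as two tokens either way.
TOKEN = re.compile(r"\d+(?:,\d{3})*|\w+|[^\w\s]")
print(TOKEN.findall(text))
```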
Automatic Segmentation
  • Run a sliding window 3 symbols wide across the text
  • Some features from outside the window are also used for prediction
  • Each position classified as a boundary or not
  • Boundary between 1st and 2nd position
Automatic Segmentation
  • Features used for prediction:
    • 3 symbols
    • Punctuation
    • Whether there have been at least two capitalized words since the last boundary
    • Whether there have been at least three non-capitalized words since the last boundary
    • Whether we have seen fewer than half as many symbols as the average segment length since the last boundary
    • Whether we have seen fewer than half the average number of symbols between punctuation marks since the last punctuation mark
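
A rough sketch of this kind of windowed feature extraction (hypothetical feature names and data; the lecture's actual feature set is the list above), where the candidate boundary falls between the first and second symbols of the window:

```python
def window_features(symbols, i, last_boundary, avg_segment_len):
    """Features for deciding whether a boundary falls right after symbols[i]."""
    window = symbols[i:i + 3]                          # 3-symbol sliding window
    since_boundary = symbols[last_boundary:i + 1]
    capitalized = sum(1 for s in since_boundary if s[:1].isupper())
    return {
        "sym_1": window[0], "sym_2": window[1], "sym_3": window[2],
        "window_has_punctuation": any(s in ".!?,;" for s in window),
        "two_capitalized_since_boundary": capitalized >= 2,
        "three_uncapitalized_since_boundary": len(since_boundary) - capitalized >= 3,
        "short_segment_so_far": len(since_boundary) < avg_segment_len / 2,
    }

symbols = "Dr. Smith arrived early . She spoke .".split()
# Position 4 is the period after "early", i.e. a true sentence boundary
print(window_features(symbols, 4, last_boundary=0, avg_segment_len=6))
```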
Automatic Segmentation
  • Model trained with decision tree learning algorithm
  • Percent accuracy: 96%
  • Agreement: .44 Kappa
  • Precision: .59
  • Recall: .37
  • We assign 66% as many boundaries as the gold standard
Stemmers and Taggers
  • Stemmers are simple morphological analyzers
    • Strip a word down to the root
    • Run, runner, running, runs: all the same root
    • Next week we will use the Porter stemmer, which just chops endings off
  • Taggers assign syntactic categories to tokens
    • Words assigned potential POS tags in the lexicon
    • Context also plays a role
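
For a concrete feel (an illustrative sketch using NLTK; the course assignments may rely on a different toolkit), Porter stemming and part-of-speech tagging look like this:

```python
# Requires NLTK plus its tokenizer and tagger models, e.g. nltk.download("punkt")
# and nltk.download("averaged_perceptron_tagger"); resource names vary by version.
from nltk import pos_tag, word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["run", "runner", "running", "runs"]])
# The Porter stemmer just chops common endings off (running -> run, runs -> run)

tokens = word_tokenize("The runner runs along the river bank.")
print(pos_tag(tokens))   # each token paired with a Penn Treebank tag (DT, NN, VBZ, ...)
```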
Wrap-Up
  • Feature space design affects classification accuracy
  • We examined two main ways to manipulate the feature space representation of text
    • One is using alternative types of term weights
    • The other is using linguistic tools to identify features of texts beyond just the words that make them up
  • Part-of-speech taggers can be customized with different tag sets
    • Next time we’ll talk about the tag set you will use in the assignment
  • We will also talk about parsers
    • You can use parsers to create features for classification