
Identifying Sets of Related Words from the World Wide Web Thesis Defense 06/09/2005





Identifying Sets of Related Words from the World Wide Web Thesis Defense 06/09/2005

Pratheepan (Prath) Raveendranathan

Advisor: Ted Pedersen



Outline

  • Introduction & Objective

  • Methodology

  • Experimental Results

  • Conclusion

  • Future Work

  • Demo



Introduction

  • The goal of my thesis research is to use the World Wide Web as a source of information to identify sets of words that are related in meaning.

    • Example, given two words - {gun, pistol}

      a possible set of related words would be

      {handgun, holster, shotgun, machine-gun, weapon, ammunition, bullet, magazine}

    • Example, given three words - {toyota, nissan, ford}

      a possible set of related words would be

      {honda, gmc, chevy, mitsubishi}



Examples Cont…

  • Example, given two words - {red,yellow}

    a possible set of related words would be

    {white,black,blue, colors, green}

  • Example, given two words - {George Bush,Bill Clinton}

    a possible set of related words would be

    {Ronald Reagan, Jimmy Carter, White House, Presidents, USA, etc }



Application

  • Use sets of related words to classify Semantic Orientation of reviews.

    (Peter Turney)

  • Use sets of related words to find the sentiment associated with a particular product.

    (Rajiv Vaidyanathan and Praveen Agarwal).



Pros and Cons of using the Web

  • Pros

    • Huge amounts of text

    • Diverse text

      • Encyclopedias, Publications, Commercial Web Pages

    • Dynamic (ever-changing state)

  • Cons

    • The Web creates a unique set of challenges:

    • Dynamic (ever-changing state)

      • News websites, Blogs

    • Presence of repetitive, noisy, or low-quality data.

      • HTML tags, web lingo (home page, information etc)



Contributions

  • Developed an Algorithm that predicts sets of related words by using pattern matching techniques and frequency counts.

  • Developed an Algorithm that predicts sets of related words by using a relatedness measure.

  • Developed an Algorithm that predicts sets of related words by using a relatedness measure and an extension of the Log Likelihood score.

  • Applied sets of related words to problem of Sentiment Classification.



Outline

  • Introduction & Objective

  • Methodology

  • Experimental Results

  • Conclusion

  • Future Work

  • Demo



Interface to Web - Google

  • Reasons for using Google

    • Research is very much dependent on both the quantity and quality of the Web content.

    • Google has a very effective ranking algorithm called PageRank which attempts to give more important or higher quality web pages a higher ranking.

    • Google API – An interface which allows programmers to query more than 8 billion web pages using the Google search engine. (http://www.google.com/apis/).



Problems with Google API

  • Restricted to 1000 queries a day

  • 10 Results for each query

  • No “near” operator (Proximity based search)

  • Maximum 1000 results.

  • Alternative

    • Yahoo API – 5000 Queries a day (Released very recently)

      • No “near” operator either.

      • Cannot retrieve number of hits.

        Note: Google was used only as a means of retrieving information.



Key Idea behind Algorithms

  • Words that are related in meaning often tend to occur together.

    • Example,

      A Springfield, MA , Chevrolet, Ford, Honda, Lexus, Mazda, Nissan, Saturn, Toyota automotive dealer with new and pre-owned vehicle sales and leasing



Algorithm 1

  • Features

    • Based on frequency

    • Takes only single words as input

    • Initial set 2 words

    • Frequency cutoff

    • Ranked by frequency

    • Smart stop list -

      • The, if, me, why, you etc (non-content words)

    • Web stop list

      • Web page, WWW, home,page, personal, url, information, link, text , decoration, verdana, script, javascript



Algorithm 1 – High level Description

  1. Create queries to Google based on the input terms.

  2. Retrieve the top N web pages for each query, and parse the retrieved web page content.

  3. Tokenize the web page content into a list of words and their frequencies. Discard words that occur fewer than C times.

  4. Find the words common to at least two of the sets of words. These intersecting words form the set of words related to the input terms.

  5. Repeat the process for I iterations, using the set of related words from the previous iteration as input.
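The frequency cutoff and the "common to at least two sets" step can be sketched as follows. This is only an illustration under stated assumptions, not the Google-Hack implementation: the function names and the toy counts are hypothetical, and each query is assumed to have already produced a word-to-frequency table.

```python
from collections import Counter
from itertools import combinations


def apply_cutoff(counts, cutoff):
    """Keep only the words that occur at least `cutoff` times."""
    return {word for word, n in counts.items() if n >= cutoff}


def related_words(counts_per_query, cutoff):
    """Return the words that survive the cutoff in at least two query sets."""
    kept = [apply_cutoff(counts, cutoff) for counts in counts_per_query]
    related = set()
    # A word is related if it appears in the intersection of any two sets.
    for first, second in combinations(kept, 2):
        related |= first & second
    return related


# Hypothetical per-query word counts, for illustration only.
toy_counts = [
    Counter({"rifle": 20, "holster": 3, "ammo": 16}),
    Counter({"rifle": 18, "airsoft": 15}),
    Counter({"ammo": 2, "airsoft": 30}),
]
print(sorted(related_words(toy_counts, cutoff=15)))
```

With a cutoff of 15, "holster" and the low-count "ammo" entry are dropped before the sets are compared, so only words that are frequent in two or more query results remain.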



Algorithm 1 Trace 1

  • Search Terms : S1={pistol, gun}

  • Frequency Cutoff – 15

  • Num Results (Web Pages) – 10

  • Iterations - 2



Algorithm 1 –Step 1

  • Create queries to Google based on permutations of the Input Terms,

    • gun

    • gun AND pistol

    • pistol

    • pistol AND gun
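A minimal sketch of how such a query list could be generated, assuming each term is issued on its own and every ordered pair is joined with AND. The function name `build_queries` is hypothetical and is not taken from Google-Hack.

```python
from itertools import permutations


def build_queries(terms):
    """Return each single term plus every ordered pair joined with AND."""
    singles = list(terms)
    pairs = [f"{first} AND {second}" for first, second in permutations(terms, 2)]
    return singles + pairs


print(build_queries(["gun", "pistol"]))
```

For n input terms this produces n + n(n - 1) = n² queries, which is why an 11-term input set leads to 121 queries in the second iteration of the trace.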



Algorithm 1 – Step 2

  • Issue query to Google,

    • Retrieve the top 10 URLs for the query,

      • For each URL, retrieve the web page content, and parse the web page for more links.

      • Traverse these links and retrieve the content of those web pages as well.

        Repeat this process for each query.
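The "parse the web page for more links" part of this step could look roughly like the sketch below, which uses only Python's standard library. It assumes the page HTML has already been downloaded; `LinkCollector` and `extract_links` are hypothetical names, and the actual retrieval through the Google API is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the list of links found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = '<a href="/guns.html">Guns</a> <a href="http://example.org/p">Pistols</a>'
print(extract_links(page, "http://example.com/index.html"))
```

Each returned URL would then be fetched in turn, which is also why the number of pages traversed grows quickly.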



Trace 1 Cont…

  • Web pages for the query gun



Trace 1 Cont…

  • Web pages for pistol



Trace 1 Cont…

  • Web pages for gun AND pistol



Trace 1 Cont…

  • Web pages for pistol AND gun



Algorithm 1 – Step 3

3. Next, for the total web page content retrieved for each query,

  • Remove HTML Tags etc and retrieve text.

  • Remove stop words.

  • Tokenize the web page content into lists of words and frequency.

    Note: This would result in the following 4 sets of words,

    each set representing the words retrieved for each

    query.
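Step 3 can be illustrated with a short sketch. The stop lists below are small subsets of the words named on the earlier slides, the regular-expression tag removal is deliberately crude (it does not drop script or style contents), and `word_frequencies` is a hypothetical name rather than part of Google-Hack.

```python
import re
from collections import Counter

# Small illustrative subsets of the "smart" and Web stop lists.
SMART_STOP = {"the", "if", "me", "why", "you"}
WEB_STOP = {"www", "home", "page", "url", "information", "link", "text", "javascript"}
STOP_WORDS = SMART_STOP | WEB_STOP


def word_frequencies(html):
    """Strip tags, lowercase, drop stop words, and count the remaining words."""
    text = re.sub(r"<[^>]+>", " ", html)          # crude HTML tag removal
    tokens = re.findall(r"[a-z]+", text.lower())  # alphabetic tokens only
    return Counter(token for token in tokens if token not in STOP_WORDS)


sample = "<p>The pistol, the <b>gun</b>, the home page: gun</p>"
print(word_frequencies(sample))
```

Running this once per query would give the four word-frequency sets that step 4 compares.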



Words from Web pages after removing stop words



Algorithm 1 – Step 4

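4. Find the words that are common to at least 2 sets.

Let the four sets be the words retrieved for the queries

  • gun AND pistol

  • pistol AND gun

  • gun

  • pistol

    Related Set = all words that appear in at least two of these four sets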



Related Set 1 – Iteration 1



Trace 1 Cont… Iteration 2

  • 11 input terms –

    • Search terms created –

      • Rifle

      • Shooting

      • Guns

      • Cases

      • Airsoft

      • Shooting AND Guns

      • Guns AND Shooting

      • Guns AND Cases

        etc etc.

        Results in 11² = 121 queries to Google!

        Note: As you can see, the number of queries to Google increases drastically.



Result Set 2 – {gun, pistol}



Algorithm 1 – {red, yellow}

Number of Results – 10

Frequency Cutoff - 15

Iterations - 1

Related Words



Problems with Algorithm 1

  • Frequency based ranking,

  • Number of input terms restricted to 2,

  • Input and output restricted to single words



Algorithm 2

  • Features

    • Based on frequency & relatedness score

    • Can take single words or 2-word collocations as input

    • Relatedness measure based on Jiang and Conrath

    • Frequency cutoff and relatedness score cutoff

    • Ranked by score

    • Initial set can be more than 2 words

    • Bi-grams as output

    • Smart stop list

      • The, if, me, why, you etc

    • Web stop words + phrases

      • Web page, WWW, home page, personal, url, information, link, text , decoration, verdana, script, javascript



Algorithm 2 – High level Description

  • Repeat same steps as in Algorithm 1 to retrieve initial set of related words (Add bigrams to results as well).

  • For each word returned by Algorithm 1 as a related word,

    • Calculate Relatedness of word to input terms.

    • Discard any word or bigram with a relatedness score greater than the score cutoff.

    • Sort remaining terms from most relevant to irrelevant.

  • Repeat Steps 1 – 2 for each iteration, using the set of words from the previous iteration as input.
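The filtering and ranking in step 2 can be sketched as below. Because the measure is a distance, lower scores mean stronger relatedness, so terms above the cutoff are dropped and the rest are sorted in ascending order. The function name and the scores are hypothetical; how a single score is derived from several input terms is not specified on the slides, so the scores are simply taken as input here.

```python
def rank_by_relatedness(scores, score_cutoff):
    """Drop terms scoring above the cutoff and sort the rest, most related first."""
    kept = [(term, score) for term, score in scores.items() if score <= score_cutoff]
    return sorted(kept, key=lambda pair: pair[1])


# Hypothetical relatedness scores for candidate terms.
candidate_scores = {"honda": 28.1, "dealer": 33.0, "mazda": 29.5}
print(rank_by_relatedness(candidate_scores, score_cutoff=30))
```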



Relatedness Measure (Distance Measure)

  • Relatedness (Word1, Word2) =

    log (hits(Word1)) + log (hits(Word2)) – 2 * log (hits(Word1 AND Word2))

    (Based on measure by Jiang and Conrath)

  • Example 1,

    hits(toyota) = 12,500,000

    hits(ford) = 22,900,000

    hits(toyota AND ford) = 50,000

    Relatedness (toyota, ford) = 32.41

  • Example 2,

    hits(toyota) = 12,500,000

    hits(ford) = 22,900,000

    hits(toyota AND ford) = 150,000

    Relatedness (toyota, ford) = 30.82



Relatedness Measure Cont…

  • Example 3,

    hits(toyota) = 1000

    hits(ford) = 1000

    hits(toyota AND ford) = 1000

    Relatedness (toyota,ford) = 0

    As the measure approaches zero, the relatedness

    between the two terms increases.
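The measure is easy to check numerically. The sketch below assumes base-2 logarithms, which the slides do not state, and adds a hypothetical `joint_weight` parameter for the factor in front of the joint-hit term. With the written factor of 2, Example 3 evaluates to 0 as stated. The scores quoted in Examples 1 and 2, however, appear to correspond to a factor of 1: that gives roughly 32.41 and 30.83, whereas a factor of 2 gives roughly 16.8 and 13.6. This is an observation from the arithmetic only, not something the slides confirm.

```python
import math


def relatedness(hits_w1, hits_w2, hits_both, joint_weight=2):
    """Distance-style score from hit counts; lower means more related.

    Base-2 logs and the `joint_weight` parameter are assumptions of this sketch.
    """
    return (math.log2(hits_w1) + math.log2(hits_w2)
            - joint_weight * math.log2(hits_both))


# Example 3 with the formula as written (factor 2).
print(relatedness(1000, 1000, 1000))

# Examples 1 and 2 with a factor of 1 on the joint term.
print(round(relatedness(12_500_000, 22_900_000, 50_000, joint_weight=1), 2))
print(round(relatedness(12_500_000, 22_900_000, 150_000, joint_weight=1), 2))
```

In both variants a larger joint hit count lowers the score, which matches the slides' reading that smaller values indicate stronger relatedness.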



Input Set – {gun, pistol}



Algorithm 2 – {red, yellow}

Number of Results – 10

Frequency Cutoff - 10

Score Cutoff - 30

Iterations - 1



Problems with Algorithm 2

  • Certain bigrams are not good collocations,

    • For example,

      {sunny, cloudy}

      Number of Results - 10

      Frequency Cutoff - 15

      Bigram Cutoff - 4

      Score Cutoff - 30



Algorithm 3 – High Level Description

  1. Repeat the same steps as in Algorithm 1 to retrieve the initial set of related words (add bigrams to the results as well).

  2. For each term returned by Algorithm 1 as a related word,

    2.1 If the term is a bigram, check whether it is a valid collocation.

      • If the bigram is a valid collocation, continue with step 2.2;

        otherwise, remove the term from the set of related words.

    2.2 Calculate the relatedness of the term to the input terms.

    2.3 Discard any word or collocation with a relatedness score greater than the score cutoff.

    2.4 Sort the remaining terms from most relevant to least relevant.



Verifying Bigrams

  • Adapt Log Likelihood (G2) Score to web hit counts

    • Example, “New York”

    • 4 Queries to Google

“New *”

“New York”

“* York”

“of the”



Expected Values

(621 * 3560) / 5670

(5049 * 3560) / 5670

(621 * 2110) / 5670

(5049 * 2110) / 5670
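The four expressions above follow the usual pattern for a 2x2 contingency table: each expected value is the product of two marginal totals divided by the grand total. A small sketch, assuming 621 and 5049 are one pair of marginal totals and 3560 and 2110 the other (both pairs sum to 5670); which pair belongs to rows and which to columns is not recoverable from the transcript, and `expected_values` is a hypothetical name.

```python
def expected_values(first_marginals, second_marginals):
    """Expected cell counts under independence: (marginal * marginal) / total."""
    total = sum(first_marginals)
    return [[a * b / total for b in second_marginals] for a in first_marginals]


table = expected_values([621, 5049], [3560, 2110])
for row in table:
    print([round(cell, 2) for cell in row])
```

The four expected values add back up to the grand total of 5670, which is a quick sanity check on the marginals.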



Identifying a “bad” collocation

  • Bigram is discarded if,

    • Observed value for bigram is 0 (eg, “New York”)

    • Observed value for bigram is less than the expected value.
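The discard rule, together with the standard log-likelihood ratio G² = 2 Σ observed × ln(observed / expected), can be sketched as follows. The function names are hypothetical and the observed counts in the usage lines are invented for illustration; the slides' own observed values are not available in this transcript.

```python
import math


def g2(observed, expected):
    """Log-likelihood ratio over paired observed and expected cell counts."""
    total = 0.0
    for obs, exp in zip(observed, expected):
        if obs > 0:  # a cell with zero observations contributes nothing
            total += obs * math.log(obs / exp)
    return 2 * total


def keep_bigram(observed_bigram, expected_bigram):
    """Apply the discard rule: drop if observed is 0 or below the expected value."""
    return observed_bigram > 0 and observed_bigram >= expected_bigram


# Hypothetical observed counts compared against an expected value of about 389.9.
print(keep_bigram(500, 389.9))
print(keep_bigram(100, 389.9))
print(keep_bigram(0, 389.9))
```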



Example Bigrams



Methodology

  • Introduction & Objective

  • Methodology

  • Experimental Results & Evaluation

  • Conclusion

  • Future Work

  • Demo



Evaluating Results

  • Compare with Google Sets

    • http://labs.google.com/sets

  • Human Subject Experiments

    • Around 20 people expanded 2-word sets into what they felt was a set of related words



F-measure, Precision and Recall
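The formulas on this slide did not survive in the transcript. The sketch below uses the standard definitions with the balanced F-measure, 2PR / (P + R), which reproduces the figures reported on the result slides (for example 11/20 and 11/43 give an F-measure of about 0.35). The function and parameter names are hypothetical.

```python
def precision_recall_f(num_correct, num_predicted, num_gold):
    """Return (precision, recall, balanced F-measure) from raw counts."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from the {armani, versace} result: 11 correct of 20 predicted, 43 in the gold set.
p, r, f = precision_recall_f(11, 20, 43)
print(round(p, 2), round(r, 2), round(f, 2))
```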



Comparison of Algorithm 1 & 2



Algorithm 1

{jordan,chicago}

Number of Results – 10

Frequency Cutoff - 15

Iterations - 1

Precision = 0,Recall = 0

F-measure = 0



Algorithm 2

{toyota,ford, nissan}

Number of Results – 10

Frequency Cutoff - 10

Score Cutoff - 30

Iterations - 1

Precision = 6/11 = 0.54,Recall = 6/11 = 0.54

F-measure = 0.54



Algorithm 2

{january, february, may}

Number of Results – 10

Frequency Cutoff - 10

Score Cutoff - 30

Iterations - 1

Precision = 9/9 = 1,Recall = 9/9 = 1

F-measure = 1



Algorithm 2

{armani, versace}

Number of Results – 10

Frequency Cutoff - 10

Bigram Cutoff - 4

Score Cutoff - 30

Iterations - 1

Precision = 11/20 = 0.55,

Recall = 11/43 = .25

F-measure = 0.35

Not Entire Set



Algorithm 2

{artificial intelligence, machine learning}

Number of Results – 10

Frequency Cutoff - 10

Bigram Cutoff - 4

Score Cutoff - 32

Iterations - 1

Precision = 9/23 = 0.39,

Recall = 9/48 = 0.1875

F-measure = 0.25



Comparison of Algorithm 2 & 3

  • {sunny, cloudy}

Number of Results – 10

Frequency Cutoff - 10

Bigram Cutoff - 4

Score Cutoff - 30

Iterations - 1



Algorithm 3 - Bigrams

{artificial intelligence, machine learning}



Performance of Algorithms

  • F-measure increases, from Algorithm 1 to 3



Sentiment Classification

  • Pointwise Mutual Information - Information Retrieval (PMI-IR) algorithm - Peter Turney

    • Used to classify reviews as being positive or negative in orientation

      • Part-of-speech tag the review

      • Extract 2-word phrases from text

        • Adjective followed by a Noun

        • Noun followed by a Noun etc.

      • Use a positive connotation such as “excellent” and negative connotation such as “poor”, and calculate the Semantic Orientation (SO) for each 2-word phrase,



Example,

  • Let, the phrase be “incredible cast”

    • SO(“incredible cast”)

      = log2( (hits(“incredible cast” NEAR “excellent”) * hits(“poor”)) / (hits(“incredible cast” NEAR “poor”) * hits(“excellent”)) )
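The Semantic Orientation score is a base-2 log of a ratio of hit counts, so it can be checked with a few lines of code. The sketch takes the four hit counts as plain numbers; the function and parameter names are hypothetical, the counts in the usage lines are invented, and no smoothing is applied, so all counts must be non-zero.

```python
import math


def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg):
    """SO = log2((near_pos * hits_neg) / (near_neg * hits_pos)).

    near_pos / near_neg: hits for the phrase near the positive / negative word.
    hits_pos / hits_neg: hits for the positive / negative word alone.
    """
    return math.log2((near_pos * hits_neg) / (near_neg * hits_pos))


# Invented counts: the phrase co-occurs twice as often with "excellent" as with "poor".
print(semantic_orientation(near_pos=200, near_neg=100, hits_pos=1000, hits_neg=1000))
```

A positive score suggests a positive orientation for the phrase, a negative score a negative one, and zero means the phrase is no more associated with one word than the other.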



Problem with Current Algorithm

  • Words such as “poor” have at least two senses

    • “poor” as in poverty

    • “poor” as in not good



Extended PMI-IR

  • Used Google instead of AltaVista

  • Used AND instead of NEAR

  • Extended SO formula

    • Use multiple pairs of positive and negative connotations

      • {excellent, poor}, {good, bad}, {great, mediocre}
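One possible way to use several positive/negative pairs is sketched below. The slides do not say how the per-pair scores are combined, so the simple average used here is purely an assumption, as are the function name `extended_so` and the `hits` callable, which stands in for a search-engine hit-count lookup. The queries use AND in place of NEAR, as described above.

```python
import math

PAIRS = [("excellent", "poor"), ("good", "bad"), ("great", "mediocre")]


def extended_so(phrase, hits, pairs=PAIRS):
    """Average the SO score of `phrase` over several (positive, negative) pairs.

    `hits` is a hypothetical callable mapping a query string to a non-zero hit count.
    """
    scores = []
    for pos, neg in pairs:
        ratio = (hits(f'"{phrase}" AND "{pos}"') * hits(f'"{neg}"')) / (
            hits(f'"{phrase}" AND "{neg}"') * hits(f'"{pos}"'))
        scores.append(math.log2(ratio))
    return sum(scores) / len(scores)


def fake_hits(query):
    """Stand-in hit counter with invented numbers, for illustration only."""
    positives = ("excellent", "good", "great")
    is_joint_positive = " AND " in query and any(f'"{p}"' in query for p in positives)
    return 200 if is_joint_positive else 100


print(extended_so("incredible cast", fake_hits))
```

Using several pairs is one way to dilute the effect of an ambiguous seed word such as "poor".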



A Negative Review for the movie “Planet of the Apes”

Classified by our Algorithm as being Negative



Positive Review for an Audi

Classified by our Algorithm as being Positive



Negative Movie Review

Classified by our Algorithm as being Negative



Performance of Extended PMI-IR

  • Algorithm run on 20 reviews (movies and automobiles)

  • Overall Accuracy – 75%



End Result:

  • All of this is available freely on CPAN and Sourceforge

    Google-Hack



Conclusions & Contribution

  • Developed 3 Algorithms that try to predict sets of related words

    • Algorithm 1 was based on frequency

    • Algorithm 2 was based on a relatedness measure

    • Algorithm 3 was based on a relatedness measure and the Log Likelihood score

  • Applied sets of related words to Sentiment Classification



Conclusions & Contribution

  • Released the free Perl package Google-Hack on CPAN and Sourceforge.

  • Developed a web interface.



Future Work

  • Addition of proximity operator

  • Restrict # of web pages traversed

  • Find intersection of words through different search engines - Yahoo API

  • Use anchor text



Related URLs

  • Research Page

    • http://www.d.umn.edu/~rave0029/research

  • Google-Hack

    • http://google-hack.sf.net

  • CPAN Release

    • http://search.cpan.org/~prath/WebService-GoogleHack0.15/GoogleHack/GoogleHack.pm

  • Web Interface

    • http://marimba.d.umn.edu/cgi-bin/googlehack/index.cgi

