Improving Classification Accuracy Using Automatically Extracted Training Data

Ariel Fuxman

A. Kannan, A. Goldberg, R. Agrawal,

P. Tsaparas, J. Shafer

Search Labs

Microsoft Research – Silicon Valley

Mountain View, CA

Web as a Source of Training Data
  • For classification tasks, large amounts of training data can significantly improve accuracy
  • How do we create large training sets?
    • Conventional methods of using human labelers are expensive and do not scale
  • Thesis: The Web can be used to automatically create labeled data
In this talk
  • Validate the thesis on a task of practical importance: retail intent identification in Web search

  • Present desirable properties of sources of labeled data
  • Show how to extract labeled data from the sources
Importance of Retail Intent Queries

[Chart: Share of Searches (% of total search queries) vs. Share of Paid Clicks (% of queries leading to paid clicks) for retail intent queries]

Source: Just Behave: A Look at Searcher Behavior, Total U.S. Market, comScore, Feb 2009

Application of Retail Intent
  • Provide enhanced user experience around Commerce Search
Retail intent identification

Definition:

A query posed to a search engine has retail intent if most users who type the query have the intent to buy a tangible product

Examples:

Data Sources for Retail Intent
  • Sources
    • Web sites of retailers (e.g., Amazon, Walmart, Buy.com)
  • Training Data
    • Queries typed directly on search box of retailers
  • Extraction from toolbar logs

[Figure: example of a retailer URL in the toolbar log]

Desirable Properties of Web Data Sources
  • Popularity
    • Sources should yield large amounts of data
  • Orthogonality
    • Sources should provide training data about different regions of the training space
  • Separation
    • Sources should provide either positive or negative examples of the target class, but not both
Popularity
  • Sources should yield large amounts of data
  • For retail intent identification
    • Web site traffic is a proxy for popularity
    • More traffic means more queries
    • Choose Web sites of retailers based on publicly available traffic report (Hitwise)
Orthogonality
  • Sources should provide training data about different regions of the training space
  • For retail intent identification
    • Positive examples: top sites from “Departmental Stores” and “Classified Ads” (Amazon and Craigslist)
    • Negative examples: top site from “Reference” (Wikipedia)
Separation
  • Training examples must unambiguously reflect the intended meaning of most users
    • Example: there is a book called “World War I”, but the intent of the query is mostly non-commercial
  • Can be enforced by removing groups of confusable queries from the sources
Method to Enforce Separation
  • Create “groups” of positive queries
  • Compare the word frequency distribution of each group against the negative class using Jensen-Shannon divergence
  • Remove groups with low divergence
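The separation check described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it compares two sets of queries by the word distributions they induce, using natural-log Jensen-Shannon divergence (0 for identical distributions, ln 2 for disjoint ones); the paper does not specify the log base or tokenization.

```python
import math
from collections import Counter

def js_divergence(queries_a, queries_b):
    """Jensen-Shannon divergence between the word-frequency
    distributions of two query sets (0 = identical, ln 2 = disjoint)."""
    freq_a = Counter(w for q in queries_a for w in q.split())
    freq_b = Counter(w for q in queries_b for w in q.split())
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    divergence = 0.0
    for word in set(freq_a) | set(freq_b):
        p = freq_a[word] / total_a
        q = freq_b[word] / total_b
        m = (p + q) / 2  # mixture distribution
        if p:
            divergence += 0.5 * p * math.log(p / m)
        if q:
            divergence += 0.5 * q * math.log(q / m)
    return divergence
```

A group of positive queries whose divergence from the negative class falls below a chosen threshold would be removed as confusable.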
Groups for Retail Intent
  • Extracting groups from the toolbar log

[Figure: example of a retailer URL in the toolbar log]

Enforcing separation property
  • JS Divergence of Amazon and Craigslist with respect to Wikipedia

See paper for experimental validation

Experiments
  • Setup
    • Built multiple classifiers using manual and automatically extracted labels in the training sets
    • Classification method: logistic regression, using unigrams and bigrams as features
    • Test set: 5K queries randomly sampled from a query log and labeled using Mechanical Turk
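The setup above can be sketched with scikit-learn; the four training queries and their labels below are toy stand-ins, not the paper's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real training sets are queries labeled by
# source (retailer sites = positive, Wikipedia = negative).
train_queries = ["canon digital camera", "buy running shoes",
                 "world war i history", "french revolution causes"]
train_labels = [1, 1, 0, 0]  # 1 = retail intent, 0 = no retail intent

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigram + bigram features
    LogisticRegression(),
)
clf.fit(train_queries, train_labels)
prediction = clf.predict(["cheap digital camera"])
```

The pipeline maps each query to its unigram and bigram counts and fits a logistic regression on top, matching the classification method named above.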
Automatic vs. Manual

The classifier trained on automatically extracted labels is on par in accuracy with the classifier trained on manual labels

Combining Manual and Automatically Extracted

Results are only marginally different from using automatically extracted labels alone

Using Unlabeled Data

Performance of the automatic labels classifier is still on par with classifiers that start with manual labels and exploit unlabeled data using self-training
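One round of the self-training baseline mentioned above can be sketched as follows. The seed queries, unlabeled pool, and 0.8 confidence threshold are all arbitrary stand-ins, not the paper's configuration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical manually labeled seed set and unlabeled query pool.
seed_queries = ["buy laptop", "nikon camera price",
                "roman empire", "photosynthesis"]
seed_labels = np.array([1, 1, 0, 0])  # 1 = retail intent
unlabeled = ["cheap laptop deals", "history of rome"]

vec = CountVectorizer(ngram_range=(1, 2))
X_all = vec.fit_transform(seed_queries + unlabeled)
X_seed, X_unlab = X_all[:len(seed_queries)], X_all[len(seed_queries):]

clf = LogisticRegression().fit(X_seed, seed_labels)

# One self-training round: pseudo-label the unlabeled queries the model
# is confident about, then retrain on the enlarged training set.
probs = clf.predict_proba(X_unlab)
confident = probs.max(axis=1) >= 0.8  # arbitrary confidence threshold
pseudo = clf.classes_[probs.argmax(axis=1)]

X_train = sp.vstack([X_seed, X_unlab[confident]])
y_train = np.concatenate([seed_labels, pseudo[confident]])
clf = LogisticRegression().fit(X_train, y_train)
```

With this toy pool the round may add few or no pseudo-labels; in practice the loop repeats until no confident predictions remain.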

Conclusions
  • By carefully choosing the data sources, we can extract valuable training data
  • Using large amounts of automatically extracted training data, we can get classifiers that are on par with those trained with manual labels
  • As future work, we would like to apply this experience to other classification tasks