Implementing Query Classification

Presentation Transcript


  1. Implementing Query Classification HYP: End of Semester Update, prepared by Minh

  2. Previously… • Web search queries: • Understand the user's goal • Broder et al. (2002): • Queries are classified into 3 categories: • Informational • Navigational • Transactional

  3. Previously… • Functional Faceted Web Query Classification • Ambiguity: Polysemous, General, Specific • Authority Sensitivity: Yes - No • Spatial Sensitivity: Yes - No • Temporal Sensitivity: Yes - No • Query’s 4-Tuple: <Am, Au, S, T> • 3 * 2 * 2 * 2 = 24 different combinations.
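
The count of 24 combinations can be checked by enumerating every 4-tuple. This is a minimal sketch; the lowercase value labels and the names `AMBIGUITY`, `YES_NO` and `facet_tuples` are illustrative and do not come from the slides.

```python
from itertools import product

# Facet values as listed on the slide; the string labels are illustrative.
AMBIGUITY = ("polysemous", "general", "specific")
YES_NO = ("yes", "no")

# Every <Am, Au, S, T> tuple: ambiguity x authority x spatial x temporal.
facet_tuples = list(product(AMBIGUITY, YES_NO, YES_NO, YES_NO))

print(len(facet_tuples))  # 3 * 2 * 2 * 2 = 24
```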

  4. Temporal Sensitivity • Definition: • A keyword is temporally sensitive if the results a web search engine returns for it tend to change over time. • Example: • Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc. • Not temporally sensitive: video, buying car, etc.

  5. Up-to-date Project Scope • Objective: to analyze the temporal sensitivity facet of web search queries. • Problem: find the temporal correlation between web queries

  6. Web Query Histogram • Periodic queries: Champions League Final • Non-periodic queries: Liverpool

  7. Queries Correlation • Correlation • Observation: 2 keywords are temporally related to each other

  8. Proposed System Framework • Ask Google Trends for the query’s histogram • Use a histogram digitizer program (Plotparser by WeiHua) to get the numerical data • Query Correlation: • Calculate the correlation coefficient between queries • Query classification

  9. Google Trends

  10. Histogram Digitizer

  11. Queries Correlation: 1st attempt • Calculate Correlation coefficient: • Using data of 45 months: Jan 2004 until September 2007 • Calculate coefficient based on the entire histograms
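
The slides do not say which correlation coefficient was used. A minimal sketch, assuming the Pearson coefficient computed over two whole histograms of equal length, could look like this. The function name `pearson` and the sample volumes are hypothetical, not real Google Trends data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        return 0.0  # a flat series carries no correlation signal
    return sum(a * b for a, b in zip(dx, dy)) / denom


# Hypothetical monthly volumes for one year (made up for illustration).
valentine = [5, 90, 8, 4, 3, 3, 3, 3, 3, 3, 4, 6]
chocolate = [30, 70, 35, 40, 28, 25, 24, 25, 27, 35, 40, 60]
print(round(pearson(valentine, chocolate), 3))
```

With 45 monthly points per keyword, the same call would take two 45-element lists.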

  12. Result classification: 1st attempt • Data of 15 different popular keywords, of which: • Periodic keywords: • Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Chrismas(!). • Related keywords: • PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami • All keywords are compared with each other based on the correlation coefficient of their histograms. • (15*14)/2 = 105 instances

  13. Result classification: 1st attempt • Classification based on threshold method: • Statistical result: • Threshold value: 0.25
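
A threshold classifier of this kind can be sketched as below. The slide gives only the threshold value, so the comparison direction (`>=`), the label strings and the function name `classify_pairs` are assumptions. The placeholder keyword names only serve to show the pair count.

```python
from itertools import combinations

THRESHOLD = 0.25  # value reported on the slide


def classify_pairs(coefs, threshold=THRESHOLD):
    """Label a keyword pair 'related' when its coefficient reaches the threshold."""
    return {
        pair: ("related" if coef >= threshold else "unrelated")
        for pair, coef in coefs.items()
    }


# 15 keywords give (15 * 14) / 2 = 105 unordered pairs.
keywords = [f"kw{i}" for i in range(15)]  # placeholder names
pairs = list(combinations(keywords, 2))
print(len(pairs))  # 105

# Made-up coefficients for two pairs.
labels = classify_pairs({("valentine", "chocolate"): 0.31,
                         ("valentine", "xbox"): 0.05})
```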

  14. 1st attempt Problems: • Very low threshold value • Only one feature used. • Using entire histogram, while some keywords are only temporally related to each other at some periods of time. • Example: Valentine – Chocolate (Correlation appears during February)

  15. Queries Correlation: 2nd attempt • Interesting period: • Period in which two queries are highly related to each other • -> Segmentation (Clustering) problem

  16. Clustering Using Simple K means • Algorithm to predict no. of clusters • Use WEKA to cluster the histogram
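
The project used WEKA for this step, and the slide does not say which attributes were clustered. As a stand-in, here is a plain Python sketch of K-means (Lloyd's algorithm) on the volume values alone. The function `kmeans_1d`, its deterministic initialisation and the sample histogram are assumptions for illustration.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    def assign(cs):
        # Index of the nearest centroid for every value.
        return [min(range(k), key=lambda c: abs(v - cs[c])) for v in values]

    for _ in range(max_iter):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(centroids)


# Made-up histogram: low baseline with one high-volume stretch.
histogram = [1, 2, 1, 50, 52, 51, 2, 1]
centroids, labels = kmeans_1d(histogram, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1, 0, 0]
```

The caller has to supply `k`, which is the weakness the later slides point out.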

  17. Query Correlation: 2nd attempt • Periodic keyword detection: • Identify repeated patterns using correlation • A periodic query tends to have a high correlation coefficient between its repeated parts.
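
One way to read this is to cut the histogram into consecutive windows of one period each and correlate neighbouring windows. The sketch below assumes that reading, with Pearson correlation and a known period length. `periodicity_score` is a hypothetical name and the series is made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    denom = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    return 0.0 if denom == 0 else sum(a * b for a, b in zip(dx, dy)) / denom


def periodicity_score(series, period=12):
    """Mean correlation between consecutive windows of `period` points."""
    windows = [series[i:i + period]
               for i in range(0, len(series) - period + 1, period)]
    if len(windows) < 2:
        raise ValueError("need at least two full periods")
    coefs = [pearson(a, b) for a, b in zip(windows, windows[1:])]
    return sum(coefs) / len(coefs)


# Made-up series with a spike at the same position in every 4-point period.
series = [0, 10, 0, 0, 0, 9, 1, 0, 0, 11, 0, 1]
print(periodicity_score(series, period=4) > 0.9)  # True
```

A score close to 1 suggests the pattern repeats. Where to set the cut-off is not stated in the slides.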

  18. Interesting Periods Projection • Interesting periods from the related keyword’s histogram are projected onto the periodic keyword’s histogram

  19. Result Classification: 2nd Attempt • Using the previous dataset • Related keywords are compared with each of the periodic keywords for correlation • Result: • Managed to increase the threshold value to 0.5

  20. 2nd attempt problems • K-means clustering does not guarantee correct detection of interesting periods: • K-means requires the number of clusters as input • -> the algorithm implemented to determine the number of clusters failed to provide the correct value • Small training data set. • The threshold detector is too simple a method.

  21. Queries Correlation: 3rd attempt • Need to find another way to identify interesting periods. • Peak period: • Period in which there is a high peak in query volume • Peak detection problem: • Mapping and smoothing using convolution

  22. Clustering using peak detection • Mapping:

  23. Clustering using peak detection • Smoothing using convolution:
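
The kernel used for smoothing is not given in the transcript. A minimal sketch, assuming a three-point box (moving average) kernel with renormalisation at the edges, is shown below. The function name `smooth` and the sample data are illustrative.

```python
def smooth(signal, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Convolve `signal` with `kernel`; output has the same length.

    At the edges only the kernel weights that overlap the signal are used,
    and the result is divided by their sum so the level is not pulled down.
    Assumes a kernel with positive weights and odd length.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        weight = 0.0
        for j, w in enumerate(kernel):
            idx = i + half - j  # convolution index: signal[i - (j - half)]
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
                weight += w
        out.append(acc / weight)
    return out


noisy = [0, 0, 9, 0, 0]
print([round(v, 2) for v in smooth(noisy)])  # [0.0, 3.0, 3.0, 3.0, 0.0]
```

A wider kernel smooths more but also flattens narrow peaks.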

  24. Clustering using peak detection • Peak Detection: using simple slope-change algorithm to determine peaks and valleys • (with threshold value: mean)
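
A slope-change detector can be sketched as follows. How the mean threshold was applied is not spelled out, so this version assumes a point counts as a peak only if its value exceeds the series mean. The function name and sample series are hypothetical.

```python
def find_peaks_and_valleys(series):
    """Return (peaks, valleys) as lists of indices.

    A peak is a point where the slope turns from rising to falling and the
    value lies above the series mean (the threshold). A valley is a point
    where the slope turns from falling to rising.
    """
    threshold = sum(series) / len(series)
    peaks, valleys = [], []
    for i in range(1, len(series) - 1):
        left = series[i] - series[i - 1]    # slope into the point
        right = series[i + 1] - series[i]   # slope out of the point
        if left > 0 and right <= 0 and series[i] > threshold:
            peaks.append(i)
        elif left < 0 and right >= 0:
            valleys.append(i)
    return peaks, valleys


volume = [1, 2, 8, 3, 1, 2, 1, 9, 2]
peaks, valleys = find_peaks_and_valleys(volume)
print(peaks, valleys)  # [2, 7] [4, 6]
```

The small bump at index 5 is a local maximum but stays below the mean, so it is not reported as a peak.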

  25. Interesting Periods Projection • Interesting periods from the related keyword’s histogram are projected onto the periodic keyword’s histogram, and vice versa
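
The slides do not define projection precisely. One plausible reading is that the index ranges of the interesting periods found on one histogram are applied to the other, and the two histograms are correlated over each range. The sketch below assumes that reading. `projected_coefs`, the half-open `(start, end)` ranges and the data are all illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    denom = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    return 0.0 if denom == 0 else sum(a * b for a, b in zip(dx, dy)) / denom


def projected_coefs(periods, hist_a, hist_b):
    """Correlate both histograms over each (start, end) period; end is exclusive."""
    return [pearson(hist_a[s:e], hist_b[s:e]) for s, e in periods]


# Made-up data: the histograms agree only during the first three points.
related = [1, 2, 3, 9, 9, 9]
periodic = [2, 4, 6, 1, 5, 3]
coefs = projected_coefs([(0, 3)], related, periodic)
print([round(c, 3) for c in coefs])  # [1.0]
```

Running it once with periods from the related keyword and once with periods from the periodic keyword gives the two coefficient lists for the "vice versa" case.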

  26. Result Classification: 3rd attempt • Use larger training data: • 47 popular keywords, of which: • 15 periodic keywords and 32 related keywords • Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef). • Data size: 15 * 32 = 480 instances

  27. Result Classification: 3rd attempt • Apply Naïve Bayes Classifier (WEKA): • 6 features: • Average Coef from related keyword projection (AveRCoef) • Average Coef from periodic keyword projection (AvePCoef) • Overall Average Coef [= (AveRCoef+AvePCoef)/2] • Max Coef from related keyword projection (MaxRCoef) • Max Coef from periodic keyword projection (MaxPCoef) • Average Max Coef [= (MaxRCoef+MaxPCoef)/2 ]
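
The six features follow directly from the two lists of per-period coefficients. In this sketch the names `AveRCoef`, `AvePCoef`, `MaxRCoef` and `MaxPCoef` come from the slide, while `OverallAveCoef`, `AveMaxCoef` and the function `build_features` are labels chosen here for illustration. The input numbers are made up.

```python
def build_features(r_coefs, p_coefs):
    """Six features for one (related keyword, periodic keyword) pair.

    r_coefs: coefficients from the related keyword projection
    p_coefs: coefficients from the periodic keyword projection
    """
    if not r_coefs or not p_coefs:
        raise ValueError("both coefficient lists must be non-empty")
    ave_r = sum(r_coefs) / len(r_coefs)
    ave_p = sum(p_coefs) / len(p_coefs)
    max_r = max(r_coefs)
    max_p = max(p_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,
    }


features = build_features([0.2, 0.6], [0.4, 0.8])
for name, value in features.items():
    print(name, round(value, 2))
```

Each of the 480 instances would yield one such feature vector plus a class label for the classifier.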

  28. Result Classification: 3rd attempt • Statistical Result: • Confusion Matrix

  29. Future attempt: Query Normalization • Search volume tends to increase as the Internet becomes more popular • Histogram for the top 20 most popular keywords of all time:

  30. Future attempt: Normalization • Histograms need to be normalized to remove the effect of this trend. • Proposed action: • Subtract the time effect • Current problem: scaling adds further distortion. • -> Histograms from Google have already been scaled, and we have no access to the raw data.
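
"Subtract time effect" is not specified further. One simple interpretation is to fit a straight line to the histogram by least squares and subtract it. The sketch below assumes that interpretation. `detrend` is a hypothetical name and the data is made up.

```python
def detrend(series):
    """Subtract the least-squares straight line from a series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_t
    return [y - (slope * t + intercept) for t, y in enumerate(series)]


# A steadily rising series with a spike at index 3.
rising = [1, 2, 3, 14, 5, 6, 7]
print([round(v, 2) for v in detrend(rising)])
```

On histograms that have already been rescaled, the fitted line mixes the growth trend with the unknown scale factor, which is the distortion the slide mentions.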

  31. Future attempt: From Periodic to Non-periodic • Find the correlation between two non-periodic queries. • Proposed problem: some keywords are heavily searched shortly after other keywords • Example: “tsunami” is usually searched after “earthquake” has been searched.
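
A leader/follower relation like this can be probed by shifting one histogram against the other and keeping the lag with the highest correlation. This is only a sketch of that idea, assuming Pearson correlation. `best_lag` and the sample series are hypothetical, and the slides do not describe a concrete method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    denom = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    return 0.0 if denom == 0 else sum(a * b for a, b in zip(dx, dy)) / denom


def best_lag(leader, follower, max_lag):
    """Return (lag, coef) maximising corr(leader[t], follower[t + lag])."""
    if max_lag > len(leader) - 2:
        raise ValueError("max_lag leaves fewer than two overlapping points")
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        coef = pearson(leader[:-lag], follower[lag:])
        if coef > best[1]:
            best = (lag, coef)
    return best


# Made-up volumes: the follower repeats the leader's spikes two steps later.
earthquake = [0, 0, 10, 0, 0, 0, 8, 0, 0, 0]
tsunami = [0, 0, 0, 0, 10, 0, 0, 0, 8, 0]
lag, coef = best_lag(earthquake, tsunami, max_lag=4)
print(lag, round(coef, 3))  # 2 1.0
```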

  32. Future attempt: From Periodic to Non-Periodic • Example histograms: Earthquake, Tsunami

  33. Potential Applications • Results re-ranking: • Move more up-to-date results higher in the result list • Example: when a user searches for Beyonce at the time of the Grammy Awards -> results related to Grammy get a higher rank • Server Buffering: • When a user queries Beyonce, web pages related to Grammy are buffered on the local server, in the expectation that the user will search for Grammy next.

  34. Question?

  35. The End
