
Mining Rich Session Context to Improve Web Search

This presentation proposes a framework for mining user behavior data from web sessions to improve web search ranking and the search experience. It covers session context models, session clustering, and the ClickRank algorithm.


Presentation Transcript


  1. Mining Rich Session Context to Improve Web Search • Guangyu Zhu, University of Maryland • Gilad Mishne, Yahoo! Labs

  2. Motivations • To propose an efficient and scalable framework for mining general web user behavior data • Query/click logs are useful, but limited (< 5% of traffic) • All user actions count • The web and web user behaviors both constantly evolve • Focus on sessions of general web browsing activities • A logical unit that is general across all categories • To learn the preferences, intents, and judgment of users from rich contextual information • To learn session context models to improve core web search ranking and other aspects of the web search experience

  3. Roadmap • Motivations • Mining web sessions • ClickRank • Applications to web search • Site ranking • Page ranking • Mining dynamic quicklinks

  4. Session identification • We define a session as an active trail of user clicks represented by the URL referral structure • A new session starts • After 30 minutes of inactivity • When a URL occurs without a referrer URL • We used aggregate, anonymous general user behavior data collected by Yahoo! Toolbar • 30 billion events over a 6-month period in 2008 • {cookie, timestamp, URL, referral URL, event attributes} • No personal information in source data
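The two session-boundary rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Event` record and the `split_sessions` helper are hypothetical names, the events are assumed to belong to a single cookie, and the sketch only checks whether a referrer is present rather than following the full referral trail.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

SESSION_TIMEOUT_SECONDS = 30 * 60  # 30 minutes of inactivity, as stated on the slide


@dataclass
class Event:
    """Hypothetical record mirroring {cookie, timestamp, URL, referral URL}."""
    cookie: str
    timestamp: float          # seconds
    url: str
    referrer: Optional[str]   # None or "" when there is no referrer URL


def split_sessions(events: Iterable[Event]) -> List[List[Event]]:
    """Split one cookie's events into sessions using the two slide rules."""
    sessions: List[List[Event]] = []
    current: List[Event] = []
    last_time: Optional[float] = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        # Rule 1: at least 30 minutes have passed since the previous event.
        timed_out = (last_time is not None
                     and ev.timestamp - last_time >= SESSION_TIMEOUT_SECONDS)
        # Rule 2: the URL arrives without a referrer URL.
        no_referrer = not ev.referrer
        if current and (timed_out or no_referrer):
            sessions.append(current)
            current = []
        current.append(ev)
        last_time = ev.timestamp
    if current:
        sessions.append(current)
    return sessions


example = [
    Event("c1", 0, "a.com", None),
    Event("c1", 60, "a.com/x", "a.com"),
    Event("c1", 4000, "a.com/y", "a.com/x"),   # gap above 30 minutes
    Event("c1", 4100, "b.com", None),          # no referrer
]
session_lengths = [len(s) for s in split_sessions(example)]
```

In the example, the third event opens a new session because of the inactivity gap, and the fourth opens another because it has no referrer.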

  5. Session characteristics • Search sessions account for less than 5% of users' online activities • A web session contains significantly richer activity context and diversity than a search session

  6. Session characteristics • The events per session and session duration exhibit power law behaviors in web-scale general user behavior data sources

  7. Histogram session representation • We compute a distribution of activities over structured intents, given a list of URLs and their intent interpretations • 7-dimensional feature vector for each session • Histogram representation of the session • Total number of events in the session • Session duration • Sessions are highly diverse • Use PCA to reduce dimensions • The first 6 eigenvalues are significant
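A session feature vector of this kind might be built as follows. This is a sketch under assumptions: the five intent names are taken from the attribute rows of the cluster table on the next slide, the URL-to-intent lookup `intent_of` is a hypothetical input, and the histogram is expressed in percent of labeled events. The PCA step is not shown.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Intent categories assumed from the attribute rows of the cluster table.
INTENTS = ("search", "mail", "information", "rich_content", "shopping")


def session_features(visits: Sequence[Tuple[float, str]],
                     intent_of: Dict[str, str]) -> List[float]:
    """Return [5 intent percentages, total events, duration in seconds].

    visits: (timestamp_seconds, url) pairs for one session.
    intent_of: hypothetical lookup from URL to intent label.
    """
    counts = Counter(intent_of[url] for _, url in visits if url in intent_of)
    labeled = sum(counts.values())
    histogram = [100.0 * counts[name] / labeled if labeled else 0.0
                 for name in INTENTS]
    times = [t for t, _ in visits]
    duration = max(times) - min(times) if times else 0.0
    return histogram + [float(len(visits)), duration]


lookup = {"s1": "search", "s2": "search", "m1": "mail", "shop": "shopping"}
features = session_features(
    [(0.0, "s1"), (20.0, "s2"), (45.0, "m1"), (120.0, "shop")], lookup)
```

Here `features` has seven entries: five intent shares, the event count, and the session duration.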

  8. Session categorization • Cluster centroids

     Attribute      Full Data         1         2         3         4         5         6         7         8         9        10
     Cluster size        100%     29.8%     16.6%     14.3%     11.9%     11.0%      4.7%      4.6%      3.5%      2.1%      1.5%
     ============================================================================================================================
     Search            23.630     0.340    98.430     1.190     2.350     2.350    56.180    41.520    52.230     6.460     0.090
     Mail              16.810     0.070     0.660    97.250     0.390     0.400     1.290    51.790     0.710     9.790     0.080
     Information       12.260     0.040     0.270     0.390     1.030    96.500    24.580     2.650     0.500     5.970     0.020
     Rich content      34.320    99.420     0.370     0.650     0.450     0.360     0.640     0.950    45.250    60.510    99.540
     Shopping          12.850     0.080     0.240     0.410    95.670     0.290    16.920     2.600     0.860    16.840     0.060
     Total events       9.040    11.140     2.890     5.660     6.250     5.330     4.240     5.380     4.260     7.850   151.680
     Total time       420.300   532.490   261.370   303.850   235.780   298.910   228.400   455.580   218.010   439.780  4237.650

     Cluster annotations on the slide: Addiction to content rich websites; Collecting info during shopping; Browsing content rich websites; Reformulating search queries; Reading email; Informational queries; Navigational queries

     • Intent-driven web browsing patterns emerge from session clusters • K-means clustering is sufficient to reveal meaningful intent patterns, such as long sessions of content browsing and query reformulation • Simple and effective
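For readers unfamiliar with k-means, the clustering step can be illustrated with a small pure-Python version of Lloyd's algorithm. This is a toy sketch, not the tooling used for the slide: the `kmeans` function, its fixed iteration count, and the two-dimensional example points (search share, rich-content share) are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], init_centroids: List[Vector],
           iterations: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: _squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Toy sessions described by (search share, rich-content share) in percent.
toy_points = [[98.0, 1.0], [97.0, 2.0], [1.0, 99.0], [2.0, 97.0]]
toy_centroids, toy_assignment = kmeans(toy_points, [[100.0, 0.0], [0.0, 100.0]])
```

Each resulting centroid is the per-attribute mean of its cluster, which is how the rows of the table above can be read.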

  9. Roadmap • Motivations • Mining web sessions • ClickRank • Applications to web search • Site ranking • Page ranking • Mining dynamic quicklinks

  10. ClickRank Overview • ClickRank is derived from contextual indicators of user preferences and judgment in general web sessions • Dwell time on the page • Click order in the session • Page load time • Frequency of occurrence in the session • We compute a local ClickRank function for each visited page in a session by incorporating session context models, and then aggregate these values to obtain the global ClickRank

  11. Local ClickRank • We define the local ClickRank function as • The weight function is computed from the rank of the page visit event in session • The weight function is computed from temporal information associated with browsing of the page • is the indicator function

  12. ClickRank incorporates click order • We define the weight function for an event in rank of a session with a total of events as where • Motivated by experiments on implicit user preference judgments in Joachims et al., SIGIR 2005 • is a monotonically decreasing function w.r.t. the rank of the event within a session • and the mean and variance of the local ClickRank function are finite

  13. ClickRank incorporates temporal signals • We define another weight function to incorporate more temporal information where and are normalized dwell time on the page and page load time w.r.t. the entire session • The indicator function above defines a filter that factors in the time range of interest
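The formulas for the weight functions did not survive in this transcript, so the following Python sketch only illustrates the structure described in slides 11 to 13: a per-session sum over visit events of a rank weight times a temporal weight. Both weight functions below are placeholders chosen for illustration and are assumptions, not the definitions from the talk. `rank_weight` is simply one monotonically decreasing function of rank, and `temporal_weight` uses only the share of session dwell time with a minimum-dwell filter, omitting page load time.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Visit:
    url: str
    dwell: float  # seconds spent on the page


def rank_weight(i: int, n: int) -> float:
    """Placeholder weight: decreases linearly in the 1-based rank i and
    sums to 1 over a session of n events. Not the slide's formula."""
    return 2.0 * (n + 1 - i) / (n * (n + 1))


def temporal_weight(visit: Visit, session: List[Visit],
                    min_dwell: float = 1.0) -> float:
    """Placeholder weight: the visit's share of total session dwell time,
    set to zero by an indicator-style filter when dwell is below min_dwell."""
    total = sum(v.dwell for v in session)
    if total <= 0 or visit.dwell < min_dwell:
        return 0.0
    return visit.dwell / total


def local_clickrank(session: List[Visit]) -> Dict[str, float]:
    """Sum rank weight times temporal weight over every visit to each page."""
    n = len(session)
    scores: Dict[str, float] = {}
    for i, visit in enumerate(session, start=1):
        contribution = rank_weight(i, n) * temporal_weight(visit, session)
        scores[visit.url] = scores.get(visit.url, 0.0) + contribution
    return scores


local_scores = local_clickrank(
    [Visit("a", 30.0), Visit("b", 10.0), Visit("a", 0.0)])
```

In the example, the second visit to page "a" contributes nothing because its dwell time falls below the filter threshold, while the first visit benefits from both an early rank and a long dwell.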

  14. Global ClickRank • Given a set of web sessions , the global ClickRank is computed from local ClickRank functions by an aggregation function • Aggregation operators to compute global ClickRank are more general • Sum, average, and filter, e.g. by criteria such as time and demographics • Filtering sessions is much more flexible than filtering links
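The aggregation step can be sketched as a single pass over sessions that accumulates local values per page. This is a minimal illustration under assumptions: `SessionScores` is a hypothetical record holding already-computed local ClickRank values, the `keep` predicate stands in for the session filter, and the average is taken over the sessions in which a page appears, which the slide does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class SessionScores:
    """Hypothetical record: one session's date and its local ClickRank values."""
    date: str
    local: Dict[str, float]


def global_clickrank(sessions: Iterable[SessionScores],
                     keep: Optional[Callable[[SessionScores], bool]] = None,
                     average: bool = False) -> Dict[str, float]:
    """Aggregate local values per page in one pass, with an optional filter."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for session in sessions:
        if keep is not None and not keep(session):
            continue  # session-level filter, e.g. by time range
        for url, value in session.local.items():
            totals[url] += value
            counts[url] += 1
    if average:
        # Assumption: average over the sessions that contain the page.
        return {url: totals[url] / counts[url] for url in totals}
    return dict(totals)


data = [
    SessionScores("2008-08-10", {"a": 0.5, "b": 0.25}),
    SessionScores("2008-08-16", {"a": 0.25}),
]
summed = global_clickrank(data)
averaged = global_clickrank(data, average=True)
recent = global_clickrank(data, keep=lambda s: s.date >= "2008-08-16")
```

Because only running totals and counts are kept, new sessions can be folded in without revisiting earlier data.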

  15. Theoretical framework of ClickRank • The local ClickRank function defines a random variable a associated with the web page , given an observed session • and • Convergence Property: As converges to by the strong law of large numbers

  16. Relation to graph-based models • ClickRank is based on an intentional surfer model • ClickRank is data driven • ClickRank does not embed rigid assumptions on the traversing scheme over the web • Better reflects users’ information need and adapts faster to constantly changing user behaviors • Significantly more efficient and scalable compared to approaches based on explicit graph formulations • The ClickRank computational framework is well suited for distributed computing • ClickRank can be computed incrementally • One pass over entire data and memory friendly

  17. Roadmap • Motivations • Mining web sessions • ClickRank • Applications to web search • Site ranking • Page ranking • Mining dynamic quicklinks

  18. Applications to web search • Datasets • 3.3 billion web sessions extracted from Yahoo! Toolbar data over 6 months in 2008 • Site ranking • Compute ClickRank of 16.3 million websites in 56 minutes • Page ranking • Compute ClickRank of 3.1 billion web pages in 1 hour and 32 minutes

  19. Site ranking • ClickRank is more reliable and richer than results computed using only static link structure * The BrowseRank results are cited from Liu et al., SIGIR’08, which used MSN Toolbar data

  20. Page ranking methodology • We evaluated ClickRank with a state-of-the-art search engine with hundreds of ranking signals • We learn the ranking model using gradient boosted decision trees (GBDT) • Quantify the variable importance of individual features

  21. Page ranking • We used a set of 9,000+ randomly sampled queries from search logs • We computed ClickRank feature only for documents that are visited by more than 5 users over time Summary of the page ranking experiment

  22. Page ranking • The ClickRank value is quantized within the range of [0, 255], to mirror the setting in a production system • We used DCG and NDCG to quantitatively evaluate ranking performance
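The quantization and evaluation metrics mentioned on this slide can be sketched as follows. The slides do not give the quantization scheme or the exact DCG variant, so both are assumptions here: `quantize` uses linear min-max scaling, and `dcg` uses the common gain `2^rel - 1` with a `log2(position + 1)` discount.

```python
import math
from typing import List, Sequence


def quantize(values: Sequence[float], levels: int = 256) -> List[int]:
    """Map raw scores to integers in [0, levels - 1] by linear min-max
    scaling. The scaling rule is an assumption, not the production one."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    return [round((v - lo) / (hi - lo) * (levels - 1)) for v in values]


def dcg(relevances: Sequence[float], k: int) -> float:
    """DCG at k with gain 2^rel - 1 and discount log2(position + 1)."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG at k divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


quantized = quantize([0.0, 0.2, 1.0])
perfect = ndcg([1, 0], k=2)
swapped = ndcg([0, 1], k=2)
```

A ranking that places the only relevant document first scores an NDCG of 1, while swapping it to second place lowers the score to `1 / log2(3)`.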

  23. Page ranking • The ClickRank feature brings 1.02%, 0.97%, 1.11%, and 1.331% improvements in DCG(1), DCG(5), DCG(10), and NDCG, respectively • A 1% gain over a production system is very significant • ClickRank affects 81.2% of over 9,000 queries and covers 62.5% of documents

  24. Competitive insights of ClickRank • ClickRank brings higher improvements to long queries • Ranked 25th in variable importance among several hundred ranking signals • For comparison, the highest-ranking feature derived from page visit count ranked 56th, and a feature based on propagation of authority through the web link graph ranked 108th

  25. Mining dynamic quicklinks • Many commercial search engines provide quick access links to popular destinations within the site • These links are traditionally mined from search engine query logs • Query or search session logs are limited in scope and coverage • Query logs favor old, navigational links

  26. Mining dynamic quicklinks • We demonstrate ClickRank for discovering recent, dynamic content • We adapt the time range in the temporal weight function w.r.t. the content refresh rate found by the crawler • Use the indicator function as a term that specifies recency of the content

  27. Mining dynamic quicklinks Search results with quicklinks mined by ClickRank for August 10, 2008

  28. Mining dynamic quicklinks Search results with quicklinks mined by ClickRank for August 10, 2008

  29. Mining dynamic quicklinks Search results with quicklinks mined by ClickRank for August 16, 2008

  30. Mining dynamic quicklinks Search results with quicklinks mined by ClickRank for August 16, 2008

  31. Conclusion • We expand the use of general user behavior data for web search ranking and other applications • We introduce ClickRank, an efficient, scalable algorithm for estimating web page importance by incorporating rich contextual information • ClickRank is shown to be a novel and effective query-independent ranking signal, especially on long queries • Our results highlight the potential of data-driven user behavior modeling at the web scale

  32. Thank You! • Guangyu Zhu • zhugy@umiacs.umd.edu
