
Query Processing



  1. Query Processing. Part 1: Ways to Speed Up Accessing the Index. Part 2: Ways that Google Uses to Improve Results

  2. Cosine Similarity is Only a Proxy • The user has a task and formulates it as a query • The search engine’s task is to • (1) Minimally, return documents that contain the query terms • (2) Satisfy what the user is actually trying to accomplish, even though the query may be (at best) vaguely stated • E.g., use a knowledge graph to provide associations • In the next few slides we focus on step 1 • There may be many documents that include the query terms; how do we choose the best? • Cosine similarity is a way to find the documents whose terms are closest to the query terms • Thus cosine similarity is a proxy for satisfying a user’s query, namely for satisfying step 1 • We will see several techniques for making this process efficient

  3. Recap: cosine(query, document) • cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d • It is the dot product of the unit vectors of q and d:
$\cos(q,d) = \dfrac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \dfrac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$
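To make the formula concrete, here is a minimal Python sketch of cosine similarity over sparse term-weight vectors; the example vectors and weights are made up for illustration.

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors,
    represented as {term: weight} dictionaries."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Example: a two-term query against a hypothetical document vector
print(cosine({"catcher": 1.0, "rye": 1.0},
             {"catcher": 0.7, "rye": 0.5, "baseball": 0.3}))
```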

  4. Efficient Cosine Ranking • Challenge 1: find the K docs in the collection “nearest” to the query → initially find documents with matching query terms • And then • Challenge 2: find the K docs efficiently: • Compute the cosine similarity for each matching document efficiently • Among the initially matching documents, find those with the largest cosine values efficiently

  5. The More General Problem • What we’re doing, in effect, is solving the K-nearest neighbor (K-NN) problem in a large n-dimensional space for a query vector • In general, we do not know how to do this efficiently for high-dimensional spaces • But it is solvable for short queries, and standard inverted indexes support this well • [Figure] Example of k-NN classification: the test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid-line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed-line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
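The figure’s k-NN decision can be sketched with a brute-force nearest-neighbor search; the points, labels, and helper name below are hypothetical, and this approach is only practical for small, low-dimensional data.

```python
from collections import Counter
import math

def knn_classify(test_point, labeled_points, k):
    """Classify test_point by majority vote among its k nearest neighbors.
    labeled_points is a list of (vector, label) pairs."""
    by_distance = sorted(labeled_points,
                         key=lambda pl: math.dist(test_point, pl[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [((1, 1), "square"), ((2, 1), "triangle"), ((1, 2), "triangle"),
          ((4, 4), "square"), ((3, 3), "square")]
print(knn_classify((2, 2), points, k=3))  # majority vote among the 3 nearest
```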

  6. Computing Cosine Scores (see our textbook, Figure 6.14) • Scores holds the score of each document; Length holds the lengths of each of the N documents (for normalization) • In step 5 we calculate the weight in the query vector for term t • Steps 6-8 update the score of each document by adding in the contribution from term t; this is referred to as term-at-a-time scoring • Final scores are determined in step 9 • Step 10 uses a heap to determine the top K results; see the next slide
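A minimal Python sketch of term-at-a-time scoring in the spirit of the textbook’s Figure 6.14; the index layout (a dict from term to (docID, weight) postings), the query-weight dict, and the function name are simplifying assumptions, not the textbook’s exact pseudocode.

```python
import heapq

def cosine_score(query_weights, index, lengths, k):
    """Term-at-a-time cosine scoring.
    query_weights: {term: w_{t,q}}
    index:         {term: [(docID, w_{t,d}), ...]}   (postings lists)
    lengths:       {docID: document vector length}   (for normalization)
    """
    scores = {}
    for term, wq in query_weights.items():        # process one query term at a time
        for doc, wd in index.get(term, []):       # accumulate each doc's contribution
            scores[doc] = scores.get(doc, 0.0) + wq * wd
    for doc in scores:                            # normalize by document length
        scores[doc] /= lengths[doc]
    # select the top K scores with a heap (see the next slide)
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```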

  7. Determining the K Documents with Largest Cosine Scores • This is how to implement step 10 on the previous slide • After computing the cosine scores for the set of documents with matching terms, we want to retrieve the top K docs by cosine ranking • We do not need to totally order all docs in the collection • We can pick off the docs with the K highest cosines by using a heap data structure • Let J = number of docs with nonzero cosines; we seek the K best of these J • Use a binary tree (a max-heap) in which each node’s value is greater than (>) the values of its children • Takes 2J operations to construct, then each of the K “winners” is read off in 2 log J steps • For J = 1M and K = 100, this is about 10% of the cost of sorting • [Figure: example heap over the cosine values 1, .9, .8, .3, .3, .1, .1]
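A small sketch of the heap-based top-K selection using Python’s heapq, with the cosine values from the slide’s figure and hypothetical docIDs.

```python
import heapq

# J = 7 docs with nonzero cosines (values from the figure; docIDs are made up)
scores = {"d1": 1.0, "d2": 0.9, "d3": 0.8, "d4": 0.3,
          "d5": 0.3, "d6": 0.1, "d7": 0.1}

# heapq.nlargest builds a heap over the J candidates and pops the K winners,
# avoiding a full sort of all J scores.
top_k = heapq.nlargest(3, scores.items(), key=lambda item: item[1])
print(top_k)  # the K = 3 best of the J = 7 contenders
```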

  8. Five Refinements for Choosing Documents Matching the Query • The basic cosine computation algorithm only considers docs containing at least one query term • We take this further with two heuristics, which lead to five specific refinements: • Only consider high-idf query terms • (the inverse document frequency of a rare term is high) • Only consider docs containing many (or all) of the query terms

  9. Strategy 1: Consider Only Terms with High idf Scores • For a query such as catcher in the rye • Only accumulate scores from catcher and rye • Intuition: in and the contribute little to the scores and so don’t alter the rank-ordering much • Benefit: postings of low-idf terms contain many docs → these (many) docs get eliminated from the set A of contenders
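A sketch of the high-idf filter; the collection size, document frequencies, and idf cutoff below are hypothetical.

```python
import math

N = 1_000_000                       # hypothetical collection size
df = {"catcher": 2_000, "in": 900_000, "the": 990_000, "rye": 5_000}

def high_idf_terms(query_terms, idf_threshold=2.0):
    """Keep only query terms whose idf = log10(N/df) reaches the threshold."""
    return [t for t in query_terms
            if math.log10(N / df.get(t, 1)) >= idf_threshold]

print(high_idf_terms(["catcher", "in", "the", "rye"]))  # ['catcher', 'rye']
```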

  10. Strategy 2: Consider Only Docs Containing Many Query Terms • Any doc with at least one query term is a candidate for the top K output list • For multi-term queries, only compute cosine scores for docs containing several of the query terms • Say, at least 3 out of 4 • This imposes a “soft conjunction” on queries, as seen on web search engines (early Google) • Easy to implement in postings traversal, as in the example on the next slide

  11. Example: 3 of 4 query terms
Antony:    3  4  8  16  32  64  128
Brutus:    2  4  8  16  32  64  128
Caesar:    1  2  3  5  8  13  21  34
Calpurnia: 13 16 32
If the query is "Antony Brutus Caesar Calpurnia" and we require at least 3 of the 4 query terms, then we need only compute scores for docs 8, 16 and 32.
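A sketch of this soft-conjunction filter over the postings lists above; the dictionary layout and function name are illustrative assumptions.

```python
from collections import Counter

postings = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}

def docs_with_at_least(query_terms, min_terms):
    """Return docIDs that appear in the postings of at least min_terms query terms."""
    counts = Counter(doc for t in query_terms for doc in postings.get(t, []))
    return sorted(doc for doc, c in counts.items() if c >= min_terms)

print(docs_with_at_least(["antony", "brutus", "caesar", "calpurnia"], 3))  # [8, 16, 32]
```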

  12. Strategy 3: Introduce the Champion Lists Heuristic • Pre-compute, for each dictionary term t, the r docs of highest weight (tf-idf) in t’s postings • Call this the champion list for t • (aka fancy list or top docs for t) • Note that r has to be chosen at index build time • Thus, it’s possible that r < K • At query time, only compute scores for docs appearing in the champion list of some query term, i.e., in the champion lists of the query’s terms • Pick the K top-scoring docs from amongst these
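A sketch of how champion lists might be built at index time and used at query time, assuming postings that already carry tf-idf weights; the dict layout and function names are assumptions for illustration.

```python
import heapq

def build_champion_lists(index, r):
    """At index build time: for each term, keep only the r postings with the
    highest tf-idf weight.  index: {term: [(docID, tf_idf_weight), ...]}"""
    return {term: heapq.nlargest(r, postings, key=lambda p: p[1])
            for term, postings in index.items()}

def candidate_docs(query_terms, champion_lists):
    """At query time: score only docs that appear in some query term's champion list."""
    return {doc for t in query_terms for doc, _ in champion_lists.get(t, [])}
```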

  13. Static Quality Scores Heuristic • We want top-ranking documents to be both relevant and authoritative • Relevance is being modeled by cosine scores • Authority is typically a query-independent property of a document • Examples of authority signals • Wikipedia among websites • Articles in certain newspapers • A paper with many citations • High Pagerank

  14. Strategy 4: Introduce an Authority Measure • Assign to each document d a query-independent quality score in [0,1] • Denote this by g(d); g stands for goodness • Thus, a quantity like the number of citations is scaled into [0,1]

  15. Net Score • Consider a simple total score combining cosine relevance and authority • net-score(q,d) = g(d) + cosine(q,d) • Can use some other linear combination, assigning different weights to the two factors; for now we use weights of 1 • Indeed, any function of the two “signals” of user happiness could be used • Now we seek the top K docs by net score

  16. Strategy 5: Reorganize the Inverted List • So far we assumed that all postings were ordered by docID • Instead, order all postings by g(d) • This does not change the earlier computation times for merging • The most authoritative documents will appear early in the postings list • Thus, we can concurrently traverse the query terms’ postings for • Postings intersection • Cosine score computation

  17. Champion Lists in g(d)-Ordering • Can combine champion lists with g(d)-ordering • Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d} (authority score + cosine score) • Seek top-K results from only the docs in these champion lists

  18. High and Low Lists Heuristic • For each term, we maintain two postings lists called high and low • Think of high as the champion list • When traversing postings for a query, first traverse only the high lists • If we get more than K docs, select the top K and stop • Else proceed to get docs from the low lists • Can be used even for simple cosine scores, without a global quality score g(d) • A means for segmenting the index into two tiers
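A sketch of the two-tier traversal, assuming per-term high and low postings lists are available; the fallback condition on fewer than K candidates follows the slide, while the names and dict layout are assumptions.

```python
def tiered_candidates(query_terms, high, low, k):
    """Collect candidate docs from the high tier first; fall back to the
    low tier only if the high tier yields fewer than k candidates.
    high, low: {term: [docID, ...]}"""
    candidates = {doc for t in query_terms for doc in high.get(t, [])}
    if len(candidates) < k:                  # not enough docs from the high lists
        candidates |= {doc for t in query_terms for doc in low.get(t, [])}
    return candidates
```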

  19. Impact-Ordered Postings Heuristic • We only want to compute scores for docs whose term frequency weight wf_{t,d} is high enough • So we sort each postings list by wf_{t,d} in decreasing order • Now the postings lists are no longer all in a common order: a document may appear at a different position in each postings list

  20. Early Termination • When traversing t’s postings, stop early after either • a fixed number r of docs, or • wf_{t,d} drops below some threshold • Take the union of the resulting sets of docs (one set from the postings of each query term) • Compute scores only for docs in this union
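A sketch of early termination over impact-ordered postings, assuming each list is already sorted by wf_{t,d} in decreasing order; r and the threshold are hypothetical tuning parameters.

```python
def early_terminated_candidates(query_terms, impact_ordered, r=1000, threshold=0.1):
    """impact_ordered: {term: [(docID, wf_td), ...]} sorted by wf_td descending.
    Stop scanning each list after r postings, or once wf_td drops below the
    threshold, then return the union of the surviving docs."""
    union = set()
    for t in query_terms:
        for i, (doc, wf) in enumerate(impact_ordered.get(t, [])):
            if i >= r or wf < threshold:     # early termination for this term
                break
            union.add(doc)
    return union
```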

  21. SEO Companies Carefully Track Google Ranking Factors • Download at: http://www-scf.usc.edu/~csci572/papers/Searchmetrics.pdf • Ranking factors are organized into categories: • Content factors: content relevance, word count • User signals: click-through rate, bounce rate • Technical factors: presence of H1/H2, use of HTTPS • User experience: number of internal/external links • Social signals: Facebook total, tweets

  22. Top Two Factors in Each Category • [Chart: the top two factors per category include relevant terms, keyword in internal links, click-through rate, time on site, domain visibility, and search volume of the domain name] • Source: http://www.searchmetrics.com/wp-content/wp-content/uploads/Ranking_Correlations_EN_print.pdf

  23. Signals • Google claims to use over 200 signals; below, the signals are divided into categories • The list was created by Brian Dean (not Google); for the actual list see http://backlinko.com/google-ranking-factors • Categories: Domain Factors • Page-Level Factors • Site-Level Factors • Backlink Factors • User Interaction • Special Algorithm Rules • Social Signals • Brand Signals • On-Site Webspam Factors • Off-Page Webspam Factors • Here are the top 10: keyword at the beginning of your title tag • longer content is better • page loading speed • keyword prominence and positioning • page authority/PageRank • domain authority • link relevancy • dwell time • responsive design • thin or duplicate content

  24. Rank Correlation (https://en.wikipedia.org/wiki/Rank_correlation) • A rank correlation is a statistic that measures the relationship between rankings of different variables, or different rankings of the same variable • A "ranking" is the assignment of the labels "first", "second", "third", etc. to different observations of a particular variable • A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them • Some of the more popular rank correlation statistics include • Spearman's rho (ρ) • Kendall's tau (τ) • An increasing rank correlation coefficient implies increasing agreement between rankings • The coefficient lies in the interval [−1, 1] and assumes the value: • 1 if the agreement between the two rankings is perfect; the two rankings are the same • 0 if the rankings are completely independent • −1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the other
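A small example of computing Spearman's rho and Kendall's tau for two rankings with SciPy; the rankings below are made-up data.

```python
from scipy.stats import spearmanr, kendalltau

# Two hypothetical rankings of the same six items (1 = first, 6 = last)
ranking_a = [1, 2, 3, 4, 5, 6]
ranking_b = [2, 1, 3, 4, 6, 5]

rho, _ = spearmanr(ranking_a, ranking_b)   # Spearman's rho
tau, _ = kendalltau(ranking_a, ranking_b)  # Kendall's tau
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```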

  25. Four Sample Correlation Graphs • [Figure legend] Factor A: zero correlation (linear curve, horizontal, high average) • Factor B: positive correlation, highest (exponential function, falling) • Factor C: negative correlation, lowest (linear curve, rising) • Factor D: positive correlation (irregular curve, falling) • In the graph, four example correlations and their respective curves are shown • Correlation can be detected and shown on a graph as a curve of average values for each item; the y-axis indicates the average value over all 10,000 URLs studied at position X (x-axis) • The higher the value of a correlation, the greater, and more regular, the differences between positions • Negative correlations are best understood by restating the factor in its opposite form, which then correlates positively

  26. Correlation is Not Necessarily Causation • Correlations are not synonymous with causal relationships, and hence there is no guarantee that the respective factors actually have any impact on the ranking, or are ever used by Google as a signal • There are many examples of illusory correlations, a form of logical fallacy; here are two examples: • the co-occurrence of a larger number of storks and a higher birthrate in certain areas • the relationship between sales of ice cream and the increased incidence of sunburn in the summer • These examples show an (illusory) correlation, not a causal relationship • For many more amusing ones, go to: http://tylervigen.com/spurious-correlations

  27. What is a Ranking Factor? (http://www.searchmetrics.com/what-is-a-ranking-factor/) • Searchmetrics is one SEO company that carefully monitors changes in the SERPs (search engine result pages) for various industries and clients • The evaluation of a website’s relevance is now based on the complex interplay of hundreds of factors, each of which is assigned its own flexible weighting, and this all happens in real time • Every year they analyze the top 30 search results for 10,000 keywords and compare the matching websites against a variety of factors: number of backlinks, text length, occurrences of keywords, age of the text, etc. • Using the above data they calculate the rank-correlation coefficient • See also: http://www-scf.usc.edu/~csci572/papers/searchqualityevaluatorguidelines.pdf

  28. Content Factors

  29. Word Count The word count of a landing page ranked amongst the top positions has been rising for years – this shows that the content on URLs near the top of the search results is becoming more detailed, more holistic and therefore better able to answer more user questions. Pages rank well under the condition that the content is not simply long, but also relevant, which usually also means that they rank comparably well for several keywords related to the same topic.

  30. Click-Through Rate The Click-Through Rate measures the average percentage of users who click on the result at each position on the SERP. The CTR calculated for our dataset is much higher than that of our previous analysis – and in sum is far higher than 100%. This is because users often click through several URLs in the search results to find information or to research a product. Our data shows that keywords in position 1 have an average CTR of 44%, the rate dropping to 30% for position 3. This year, we again see that the click rate for landing pages at the top of the second results page is higher than for results at the bottom of page 1.

  31. Bounce Rate The Bounce Rate measures the percentage of users who only click on the URL from Google’s search results, without visiting any other URLs on the domain, and then return back to the SERP. These are single-page sessions where the user leaves the site without interacting with the page. On its own, this is insufficient for drawing conclusions about the quality of the content, because users can also “bounce” once their intention has been fulfilled. That said, combined with the other KPIs and/or information regarding the content category or the purpose and type of page (e.g. glossary entry or product page), the Bounce Rate can help in measuring a URL’s relevance.

  32. Technical Factors Page encryption using HTTPS is currently on the march. Last year, only 12% of pages relied on encrypted data transfer via HTTPS. Today, this has more than tripled, with over a third of websites encrypting the data traffic on their pages. Furthermore, Google has announced that pages that have not switched to HTTPS by 2017 will be marked as “unsafe” in its Chrome browser. This will continue to elevate the status of HTTPS as a ranking factor.

  33. Social Signals The correlation between social signals and ranking position is extremely high, and the number of social signals per landing page has remained constant when compared with the values from last year’s whitepaper. This is true of all social networks analyzed. Facebook remains the social network with by far the highest level of user interactions. Furthermore, Facebook, compared with the other social networks, shows relatively high signals across the first search results page.
