Ch19. Web search basics

Ch19. Web search basics. 2010. 3. 22. 유 보 림. contents. 19.1 Background and history 19.2 Web characteristics 19.2.1 The web graph 19.2.2 Spam 19.3 Advertising as the economic model 19.4 User query needs 19.5 Index size and estimation 19.6 Near-duplicates and shingling.

Presentation Transcript


  1. Ch19. Web search basics

    2010. 3. 22. 유 보 림
  2. Contents: 19.1 Background and history; 19.2 Web characteristics; 19.2.1 The web graph; 19.2.2 Spam; 19.3 Advertising as the economic model; 19.4 User query needs; 19.5 Index size and estimation; 19.6 Near-duplicates and shingling
  3. 19.1 Background and history. The Web: unprecedented in many ways. The World Wide Web in the 1990s: a simple, open client-server design. Server: communicates over the http protocol and serves HTML-encoded content. Client: browser.
  4. [Figure: The coarse-level dynamics. Content creators, content aggregators, and content consumers; labels: crawls, feeds, editorial, subscription, transaction, advertisement.]
  5. 19.1 Background and history. The basic operation: the browser (client) sends an http request to the web server, specifying a URL (for Uniform Resource Locator) such as http://www.stanford.edu/home/atoz/contact.html. Here http is the protocol, www.stanford.edu is the domain, and /home/atoz/contact.html is the hierarchical path with file “contact.html”.
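The URL anatomy on this slide can be checked with a short sketch. It uses Python's standard `urllib.parse.urlsplit`; the variable names are illustrative only and do not come from the slides.

```python
from urllib.parse import urlsplit

# The example URL from the slide.
url = "http://www.stanford.edu/home/atoz/contact.html"
parts = urlsplit(url)

protocol = parts.scheme              # the protocol used to transmit the data
domain = parts.netloc                # the root of the hierarchy of pages
path = parts.path                    # hierarchical path, including the file
filename = path.rsplit("/", 1)[-1]   # last path component

print(protocol, domain, path, filename)
```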
  6. 19.1 Background and history. The HTML-encoded file contact.html holds the hyperlinks and the content. The “http request” is also the basis of crawling and indexing (Chapter 20). Simple convenience → rapid proliferation. Incompatible dialects of HTML: markup that a browser does not understand is ignored rather than treated as a “bring the system down” error.
  7. 19.1 Background and history. Early attempts at making web information “discoverable”: (1) full-text index search engines (AltaVista); (2) taxonomies populated with web pages in categories (Yahoo!). 2 drawbacks.
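  8. 19.2 Web characteristics. Some characteristics of web search: 1. Web search is link-based. 2. Web search queries are numerous and varied. 3. The user base is diverse. 4. The collection holds many kinds of documents. 5. Context information matters more than in other information retrieval applications. 6. There is “spam”.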
  9. 19.2 Web characteristics. The essential feature that led to the explosive growth of the web. The substance of the text. New level of granularity in opinion on virtually any subject. Which web pages does one trust? User-independent notion.
  10. 19.2 Web characteristics. How big is the web? = How many web pages are in a search engine’s index? Static web pages: content does not vary from one request for that page to the next. Dynamic web pages: one sign is that the URL has the character “?” in it.
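A minimal sketch of the “?” heuristic from this slide. The function name `looks_dynamic` and the example.com URL are hypothetical; the heuristic is only a sign, not a reliable test, since dynamic pages can be served from URLs without a query string.

```python
def looks_dynamic(url: str) -> bool:
    """Heuristic from the slide: a "?" in the URL is one sign of a dynamic page."""
    return "?" in url

# Hypothetical URLs used only for illustration.
print(looks_dynamic("http://www.example.com/flight?number=AA129"))
print(looks_dynamic("http://www.stanford.edu/home/atoz/contact.html"))
```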
  11. [Figure: The Web: dynamic content. A dynamically generated web page: the browser sends a request (example: AA129) to an application server, which draws on back-end databases.]
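  12. 19.2.1 The Web Graph. Pages are nodes and hyperlinks are directed edges. The text of a hyperlink (anchor text) is generally encapsulated in the <a> (for anchor) tag, whose href attribute holds the target URL. The directed graph is not strongly connected.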
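As an illustration of how hyperlinks and their text sit in HTML, here is a minimal sketch that collects (target URL, anchor text) pairs with Python's standard `html.parser.HTMLParser`. The class name `LinkExtractor`, the helper `extract_links`, and the sample page are assumptions made for this example; nested or malformed anchors are not handled.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects (href, anchor_text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links

# Sample page invented for illustration.
page = '<p>See the <a href="http://www.stanford.edu/home/atoz/contact.html">contact page</a>.</p>'
links = extract_links(page)
print(links)
```

Each extracted pair corresponds to one directed edge of the web graph, from the page being parsed to the target URL.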
  13. 19.2.1 The Web Graph. In-links and out-links.
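In-links and out-links can be counted directly from an edge list. The sketch below uses a tiny made-up graph; the page names A to D are placeholders.

```python
from collections import Counter

# Each pair (src, dst) is a hyperlink from page src to page dst (made-up data).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]

out_degree = Counter(src for src, _ in edges)   # number of out-links per page
in_degree = Counter(dst for _, dst in edges)    # number of in-links per page

print(in_degree["C"], out_degree["A"])
```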
  14. 19.2.1 The Web Graph. Links are not randomly distributed: the distribution of the number of links into a web page does not follow the Poisson distribution. Rather, this distribution is widely reported to be a power law: the total number of web pages with in-degree i is proportional to 1/i^α. The value of α typically reported by studies is 2.1.
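To get a feel for what the power law implies, the sketch below compares the relative number of pages at two in-degrees under the 1/i^α form with α = 2.1. The function name `relative_count` is illustrative, and the proportionality constant is left out because it cancels in the ratio.

```python
ALPHA = 2.1  # value typically reported by studies, per the slide

def relative_count(i, alpha=ALPHA):
    """Unnormalized number of pages with in-degree i under the power law."""
    return 1.0 / i ** alpha

# How many times more pages have in-degree 1 than in-degree 10?
ratio = relative_count(1) / relative_count(10)
print(round(ratio, 1))
```

Under these assumptions the ratio is 10^2.1, so pages with a single in-link outnumber pages with ten in-links by more than a hundredfold.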
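  15. Query distribution. Power law: few popular broad queries, many rare specific queries. A system can transmit the maximum amount of information most efficiently when its dynamic behavior follows a power-law distribution. A power law also implies that a system can be run efficiently because of the inequality among its entities.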
  16. 19.2.1 The Web Graph Three major categories of web pages that are sometimes referred to as IN, OUT and SCC
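A minimal sketch of how the three categories can be separated on a small directed graph. It assumes a seed page already known to lie in SCC: SCC is then everything that both reaches and is reached from the seed, IN reaches SCC without being reachable from it, and OUT is the reverse. The function names and the toy graph are invented for illustration, and pages outside these three sets are ignored.

```python
from collections import defaultdict

def reachable(start, adj):
    """All nodes reachable from start by following edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bowtie(edges, seed):
    """Split pages into IN, SCC and OUT relative to a seed assumed to be in SCC."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        fwd[src].append(dst)
        rev[dst].append(src)
    forward = reachable(seed, fwd)    # pages the seed can reach
    backward = reachable(seed, rev)   # pages that can reach the seed
    scc = forward & backward
    return {"IN": backward - scc, "SCC": scc, "OUT": forward - scc}

# Toy graph: i1 links into the core {s1, s2}, which links out to o1.
edges = [("i1", "s1"), ("s1", "s2"), ("s2", "s1"), ("s2", "o1")]
parts = bowtie(edges, seed="s1")
print(parts)
```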
  17. 19.2.2 Spam. Motives: commercial, political, religious, lobbies. Promotion funded by advertising budget. Spam: manipulation of web page content for the purpose of appearing high up in search results. Example: a user searching for “Maui golf real estate” is not seeking news or entertainment information, but is likely to be seeking to purchase some property.
  18. 19.2.2 Spam. Cloaking: serve fake content to the search engine spider. Paid inclusion: a search engine marketing product.
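One way to picture cloaking is to compare what a server returns to a crawler with what it returns to a browser. The sketch below is a toy simulation under stated assumptions: the two server functions, the user-agent strings, and the example.com URL are all invented, and no network request is made. Real pages often differ between requests for legitimate reasons, so exact equality would be far too strict in practice.

```python
def looks_cloaked(fetch, url, crawler_agent="ExampleBot/1.0", browser_agent="Mozilla/5.0"):
    """True if the page served to the crawler differs from the one served to a browser."""
    return fetch(url, crawler_agent) != fetch(url, browser_agent)

def cloaking_server(url, user_agent):
    # Simulated spammer: fake content for spiders, other content for people.
    if "Bot" in user_agent:
        return "maui golf real estate listings"
    return "unrelated commercial content"

def honest_server(url, user_agent):
    # Simulated honest site: the same page for every requester.
    return "same page for everyone"

print(looks_cloaked(cloaking_server, "http://example.com/"))
print(looks_cloaked(honest_server, "http://example.com/"))
```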
  19. 19.2.2 Spam The spam industry
  20. 19.2.2 Spam. More spam techniques. Doorway pages: when a browser requests the doorway page, it is redirected to a page containing content of a more commercial nature. Search Engine Optimizers (SEO) vs. web search engines: adversarial information retrieval. Link spamming: the exploitation of the link structure of the Web; spammers invest considerable effort in subverting link analysis.
  21. 19.3 Advertising as the economic model. Branding: to convey to the viewer a positive feeling about the brand of the company placing the advertisement. Cost per mil (CPM): priced per 1,000 impressions. Cost per click (CPC): priced per click. Goto’s model: sponsored search (search advertising). Algorithmic search results and sponsored results are displayed separately and distinctively.
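The two pricing models reduce to simple arithmetic. In the sketch below the function names, prices, and traffic figures are made up for illustration.

```python
def cpm_cost(impressions, price_per_thousand):
    """Cost per mil: the advertiser pays per 1,000 impressions."""
    return impressions / 1000 * price_per_thousand

def cpc_cost(clicks, price_per_click):
    """Cost per click: the advertiser pays only when the ad is clicked."""
    return clicks * price_per_click

# Made-up campaign: 50,000 impressions, of which 400 lead to a click.
print(cpm_cost(50_000, 2.0))
print(cpc_cost(400, 0.25))
```

With these invented numbers both models charge the same amount; which one is cheaper in practice depends on the click-through rate.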
  22. 19.3 Advertising as the economic model. Search engine marketing (SEM). Click spam.
  23. 19.4 The search user experience. How do search engines differentiate themselves and grow their traffic? Google identified two principles: (1) focus on relevance; (2) a lightweight user experience: entirely textual, with very few graphical elements.
  24. 19.4 The search user experience. 19.4.1 User query needs. 1. Informational queries: seek general information on a broad topic. 2. Navigational queries: seek the web site or home page of a single entity that the user has in mind. 3. Transactional queries: a prelude to the user performing a transaction on the Web.
  25. [Figure: The various components of a web search engine: the user, search, the web spider crawling the Web, the indexer, indexes, and ad indexes.]
  26. 19.5 Index size and estimation. Estimating web size and search engine index size. What is the size of the web? The web is really infinite: dynamic content, e.g., a calendar; soft 404 errors: www.yahoo.com/<anything>. A more tractable question: given two search engines, what are the relative sizes of their indexes?
  27. 19.5 Index size and estimation. Capture-recapture method: pick a random page from the index of E1 and test whether it is in E2’s index, then test whether a random page from E2 is in E1. If a fraction x of the pages in E1 are in E2, while a fraction y of the pages in E2 are in E1, then x|E1| ≈ y|E2|, so |E1|/|E2| ≈ y/x, where |Ei| = the size of the index of search engine Ei.
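The capture-recapture estimate can be sketched on two synthetic indexes. This is an illustration under strong assumptions: the indexes are plain Python sets of made-up page IDs, and pages can be drawn uniformly at random from each, which real search engines do not allow. The function name `estimate_size_ratio` is hypothetical, and the sketch divides by x without guarding against a sample in which no E1 page is found in E2.

```python
import random

def estimate_size_ratio(e1, e2, n_samples, rng):
    """Estimate |E1| / |E2| as y / x from membership tests on random samples."""
    pages1, pages2 = sorted(e1), sorted(e2)
    # x: fraction of sampled E1 pages that are also in E2.
    x = sum(p in e2 for p in rng.choices(pages1, k=n_samples)) / n_samples
    # y: fraction of sampled E2 pages that are also in E1.
    y = sum(p in e1 for p in rng.choices(pages2, k=n_samples)) / n_samples
    return y / x

# Synthetic indexes: |E1| = 1000, |E2| = 2000, overlap of 500 pages.
e1 = set(range(0, 1000))
e2 = set(range(500, 2500))
ratio = estimate_size_ratio(e1, e2, n_samples=5000, rng=random.Random(0))
print(ratio, len(e1) / len(e2))
```

The estimate should land close to the true ratio of 0.5; the gap shrinks as the number of samples grows.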
  28. 19.5 Index size and estimation. Statistical methods: random searches; random IP addresses; random walks; random queries.
  29. 19.5 Index size and estimation. No sampling solution is perfect. Lots of new ideas, but the problem is getting harder. Quantitative studies are fascinating and a good research problem.
  30. 19.6 Near-duplicates and shingling. Detecting duplication and similarity between documents. The web is full of duplicated content: as many as 40% of the pages on the web. Duplication: the documents match exactly; this can be detected with fingerprints. Near-duplication: the documents almost match (are similar). Similarity > 80% → documents are “near duplicates”.
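Exact duplicates can be caught by comparing fingerprints instead of full texts. The sketch below uses SHA-256 from Python's standard `hashlib` as the fingerprint function; that choice is an assumption for illustration, since the slides do not name a hash. Any change to the text, however small, produces a different fingerprint, which is why this approach misses near-duplicates.

```python
import hashlib

def fingerprint(text: str) -> str:
    """A succinct digest of the document text (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

original = "a rose is a rose is a rose"
exact_copy = "a rose is a rose is a rose"
near_copy = "a rose is a rose is a rose."

print(fingerprint(original) == fingerprint(exact_copy))
print(fingerprint(original) == fingerprint(near_copy))
```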
  31. 19.6 Near-duplicates and shingling. When two documents are not identical but nearly the same: a shingle is a sequence of k consecutive words. k-shingling (here k = 4): “a rose is a rose is a rose” → a_rose_is_a, rose_is_a_rose, is_a_rose_is. Similarity measure between two docs (= sets of shingles): the Jaccard coefficient, which measures the degree of overlap between the sets S(d1) and S(d2).
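A minimal sketch of k-shingling that reproduces the example on this slide. The function name `shingles` and the choice to split on whitespace are assumptions; real systems would normalize the text first.

```python
def shingles(text, k=4):
    """Set of k-shingles: every run of k consecutive words, joined by "_"."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

result = shingles("a rose is a rose is a rose", k=4)
print(sorted(result))
```

The sentence contains four runs of four consecutive words, but the first and last are the same, so the set holds three distinct shingles.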
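  32. 19.6 Near-duplicates and shingling. Jaccard coefficient: the similarity of two documents = (size of the intersection of the two shingle sets) / (size of their union), i.e. J(S(d1), S(d2)) = |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|. S(dj): the set of shingles in the j-th document. If the similarity between S(d1) and S(d2) exceeds a preset threshold, one of the two is removed from indexing. Because the Jaccard coefficient would otherwise have to be computed pairwise, it is estimated with a form of “hashing”.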
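The Jaccard coefficient is a one-line set computation. In the sketch below the function name `jaccard` and the two shingle sets are invented for illustration, and returning 1.0 for two empty sets is an assumption, since the slides do not cover that case.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

s1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
s2 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_was"}
print(jaccard(s1, s2))
```

Here two shingles are shared and four are distinct overall, so the coefficient is 2/4.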
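  33. 19.6 Near-duplicates and shingling. First, treat every shingle as a hash value in a large 64-bit space. H(dj): the set of 64-bit hash values derived from S(dj). π: a random permutation of the 64-bit integers. Π(dj): the set of permuted hash values of H(dj); for each element h of H(dj), if h ∈ H(dj) then π(h) ∈ Π(dj).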
  34. 19.6 Near-duplicates and shingling. The four steps of shingle sketch computation.
  35. 19.6 Near-duplicates and shingling. [Figure: Document 1.] Start with 64-bit f(shingles). Permute on the number line with π. Pick the min value.
  36. 19.6 Near-duplicates and shingling. [Figure: Document 1 and Document 2, each mapped onto number lines of size 2^64; the resulting minimum values A and B are compared for equality.] Repeat for 200 random permutations: π1, π2, …, π200.
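  37. 19.6 Near-duplicates and shingling. [Figure: Document 1 and Document 2 on number lines of size 2^64, with minimum values A and B.] At the final step, if the two ● values are equal, the two documents are treated as duplicates. This happens with probability Size_of_intersection / Size_of_union.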
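Written out in symbols, the probability statement on this slide is the following identity. Here x_j^π denotes the smallest value in Π(dj), a notation assumed for this block, and the probability is taken over the random choice of the permutation π.

```latex
J\bigl(S(d_1), S(d_2)\bigr)
  = \frac{|S(d_1) \cap S(d_2)|}{|S(d_1) \cup S(d_2)|}
  = P\bigl(x_1^{\pi} = x_2^{\pi}\bigr),
\qquad x_j^{\pi} = \min \Pi(d_j)
```

Intuitively, the two minima agree exactly when the shingle with the smallest permuted value in the union of the two sets belongs to both sets, that is, lies in the intersection.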
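  38. 19.6 Near-duplicates and shingling. Sketch ψ(di) of document di: the set of the 200 resulting minimum values, one per permutation. If the estimated Jaccard coefficient of di and dj exceeds a preset threshold, the two documents are judged to be similar: |ψi ∩ ψj| / 200 > threshold.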
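The whole sketch pipeline can be put together in a few lines. This is a sketch under stated assumptions, not the method as the slides implement it: true random permutations of the 64-bit integers are impractical, so each permutation is replaced by a keyed 64-bit BLAKE2b hash from Python's `hashlib`, with the key index playing the role of π1 … π200. The function names and the two example documents are invented, and sketches are compared position by position, which corresponds to |ψi ∩ ψj| / 200 when values from different permutations do not collide. The code assumes each document has at least one shingle.

```python
import hashlib

NUM_PERMUTATIONS = 200

def shingles(text, k=4):
    """Set of k-shingles: every run of k consecutive words, joined by "_"."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def hash64(shingle, seed):
    """64-bit keyed hash; each seed stands in for one random permutation."""
    key = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big")

def sketch(shingle_set, n=NUM_PERMUTATIONS):
    """For each of the n stand-in permutations, keep the minimum hash value."""
    return [min(hash64(s, seed) for s in shingle_set) for seed in range(n)]

def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of permutations on which the two minimum values agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

d1 = ("the web is full of duplicated content and search engines try to avoid "
      "indexing multiple copies of the same page")
d2 = ("the web is full of duplicated content and search engines try to avoid "
      "indexing multiple copies of the same document")

s1, s2 = shingles(d1), shingles(d2)
true_j = len(s1 & s2) / len(s1 | s2)
est_j = estimated_jaccard(sketch(s1), sketch(s2))
print(true_j, est_j)
```

The estimate is compared against a preset threshold, such as the 80% mentioned earlier in the transcript, to decide whether the two documents are near duplicates.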