
Introduction to Search Engines



  1. Introduction to Search Engines — 吳昇, Associate Professor, Institute of Computer Science and Information Engineering, National Chung Cheng University (sw@cs.ccu.edu.tw)

  2. What is a search engine? • A web service site for Internet users to find information in cyberspace • The software that provides the web search service

  3. Use of search engines? • Search for the URL of a company/website • Look for contact information about a person or an organization • Search for information related to a term, e.g., to collect information about 櫻花鉤吻鮭 (the Formosan landlocked salmon) • Look for news regarding XXX • Treat the search engine as a big dictionary • …

  4. Types of search engines • Directory browse/search • Web-page search • USENET news search • FTP search • People/organization search • Daily-life information search • Library search • Commercial product search

  5. Example search engines • Yahoo, Google, AltaVista, MSN, Excite, Lycos, ... • YAM, Kimo, PCHome • GAIS, Openfind, ... • DejaNews, Archie, ...

  6. Portal Services • Directory / Search • Daily information: weather, maps, TV, ... • Free email, free homepages, calendar • Personalized services, channel subscription • Web chat • E-commerce • Content aggregation • …

  7. Directory implementation • Each URL entry is a record • The URL data is managed by a database system • A search function is supported for searching the data in the directory tree

  8. Directory implementation • The search is in general for locating a website or a category of websites • The data is input through manual registration by the website owner or the surfer • Managing the directory tree requires intensive labor by people familiar with the relevant domain knowledge

  9. The Advantages/Disadvantages of Directory search engines • Advantages: • The data is manually maintained, so it contains less noise and is more precise • The search output can be categorized and is therefore better organized • Can support search within a category

  10. The Advantages/Disadvantages of Directory search engines • Disadvantages: • The data coverage is limited, so sometimes what is wanted cannot be found • Does not support relevance ranking • Labor intensive

  11. Implementation of a web-page search engine 1. Feature consideration 2. Data gathering 3. Data preprocessing 4. Data indexing 5. Query processing 6. Interaction 7. Service tools 8. Personalization

  12. Requirements for web-page search engines 0. The quality of the search result basically depends on • a. the quality of the underlying data • b. the search techniques, such as ranking techniques 1. Data coverage should be large enough 2. Data needs to be filtered, e.g., by removing redundant pages

  13. Requirements for web-page search engines 3. Full-text search capability should be provided 4. A relevance-ranking mechanism should be provided 5. Search speed should be fast enough 6. Rich search features; i.e., the evaluation points are quality, speed, scale, robustness, and features

  14. Data Gatherer • Also known as a spider, crawler, robot, ... • Periodically traverses the web space to collect web pages • Needs list management to decide what to collect and when • Needs a link analyzer to generate new URL lists • Needs to decide what to collect and what not to

  15. Data Gatherer • A get-file function over the HTTP protocol is the basic building block • A webpage parser module extracts link information from each retrieved page • A URL bank manager module manages the URLs to be fetched • A robot-controller module manages data collection using multiple clients
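The modules above can be sketched as a toy breadth-first gatherer. This is only an illustration of the control flow, not the actual implementation: `fetch` and the in-memory `WEB` link graph stand in for the real HTTP get-file function and webpage parser.

```python
from collections import deque

# Toy "web": URL -> outgoing links (stands in for real HTTP fetching + parsing).
WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": [],
}

def fetch(url):
    """Stand-in for the get-file function; returns the links found on the page."""
    return WEB.get(url, [])

def crawl(seed, limit=100):
    """BFS crawl: the URL bank is a FIFO queue, so pages are visited
    breadth-first, matching the retrieval schedule recommended later."""
    seen = {seed}
    frontier = deque([seed])   # URL bank: URLs waiting to be fetched
    order = []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)
        for link in fetch(url):        # parser extracts link info
            if link not in seen:       # enqueue only previously unseen URLs
                seen.add(link)
                frontier.append(link)
    return order
```

A real robot-controller would run many such loops concurrently against the shared URL bank, and consult robots.txt before fetching.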

  16. Issues of Robot • Site-based vs. URL-based • Site-based robots are popular, e.g., wget, Teleport • robots.txt is easier to support in a site-based robot • URL-based robots are more appropriate for large-scale search engines • Retrieval schedule: BFS is better • Incremental retrieval

  17. Robot Issues • What to gather and what not to? • Hidden-web data collection • Focused crawling • targets specialized web-page content • suitable for specialized search engines • evaluated by precision and recall

  18. Data Preprocessing • Remove redundant pages • Transform pages into an internal data format • Perform web cross-link analysis to generate a URL databank • Filter out pages that are better left unindexed • Partition the data space*

  19. Redundancy removal • 15% to 20% of web pages are replicated on different websites, e.g., tutorials for Java, Perl, Python, ... • Can be implemented by partitioned hashing or external sorting
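A minimal sketch of the hashing idea: fingerprint each page body and keep only the first copy of each fingerprint. At web scale the fingerprints would be partitioned across machines or externally sorted, as the slide notes; the in-memory set here just shows the principle.

```python
import hashlib

def dedupe(pages):
    """pages: list of (url, body). Drop exact-duplicate bodies by
    hashing the content and keeping the first URL seen per digest."""
    seen = set()
    unique = []
    for url, body in pages:
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, body))
    return unique
```

Exact hashing only catches byte-identical replicas; near-duplicate detection needs fuzzier fingerprints.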

  20. Ranking the URLs • Link analysis counts the mutual references between web pages • A URL receiving a higher number of references gets a higher score • weighted links • discount internal links (e.g., links back to the home page) • Order the web pages by score so that a page with higher rank gets a lower ID
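The steps above can be sketched as a weighted in-link count. The 0.1 weight for same-site (internal) links is an illustrative choice, not a value given in the slides:

```python
from urllib.parse import urlparse

def rank_urls(links):
    """links: iterable of (src, dst) hyperlinks.
    Count inbound references per URL, discounting links between pages
    on the same site (e.g., 'back to home' navigation), then assign
    document IDs so that higher-scored pages get lower IDs."""
    score = {}
    for src, dst in links:
        internal = urlparse(src).netloc == urlparse(dst).netloc
        score[dst] = score.get(dst, 0.0) + (0.1 if internal else 1.0)
    ranked = sorted(score, key=lambda u: -score[u])
    return {url: doc_id for doc_id, url in enumerate(ranked)}
```

Assigning low IDs to high-scoring pages lets the index store postings in rank order, so good results surface early.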

  21. Data Partition • The data is partitioned by language type • The language partition can be done as follows: • for each known language, collect a certain amount of web pages in that language • build a high-frequency term set for each language from analysis of the sample data • determine a page's language type by term analysis
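The procedure above can be sketched in a few lines: build a high-frequency term profile per language from sample pages, then label a page with the language whose profile it overlaps most. Whitespace tokenization is a simplifying assumption (Chinese text would need segmentation or character n-grams):

```python
from collections import Counter

def build_profile(sample_docs, top=3):
    """High-frequency term set for one language, from sample pages."""
    counts = Counter(w for doc in sample_docs for w in doc.split())
    return {w for w, _ in counts.most_common(top)}

def detect_language(text, profiles):
    """Pick the language whose high-frequency term set overlaps
    the page's term set the most."""
    words = set(text.split())
    return max(profiles, key=lambda lang: len(words & profiles[lang]))
```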

  22. Indexer • In general, an inverted file is used as the index • Needs a large data space for the indexing task • For each indexed term, an index list records in which files/locations the term appears • Needs about the same space as the original data, or more
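A minimal in-memory sketch of the inverted file: for every term, a postings list of (document ID, position) pairs. A production indexer would build this out of core with hashing and sorting, as the next slides discuss.

```python
from collections import defaultdict

def build_index(docs):
    """docs: list of document texts; the list position is the doc ID.
    Returns term -> list of (doc_id, word_position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index
```

Storing positions (not just doc IDs) roughly matches the "same or more space than the original data" estimate, since every word occurrence produces one posting.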

  23. Indexer - implementation issues • A data-filter module copes with different data sources • The inversion module is the kernel module • Needs to scale to a continuously growing data size: • hundreds of gigabytes • terabytes • Distributed/concurrent indexing

  24. Indexer - implementation issues • Temporary space minimization • Indexing speed is crucial • Memory can be utilized to improve indexing performance • Hashing and sorting are the key!

  25. Query Processing • Use a dictionary/stop-list to preprocess the query string • Parse the query into expressions of tokens • Use the index structure to locate the matched documents • Use a TF*IDF-type technique to score the matched documents • Combine URL scores to rank the result
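A sketch of the TF*IDF scoring step, using the classic tf = in-document term count and idf = log(N / document frequency); the slides do not fix an exact weighting formula, so this variant is an assumption:

```python
import math
from collections import Counter

def score(query_terms, docs):
    """Rank docs for a query: each document's score is the sum over
    query terms of tf * idf, with idf = log(N / df)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per term
    for words in tokenized:
        for t in set(words):
            df[t] += 1
    results = []
    for doc_id, words in enumerate(tokenized):
        tf = Counter(words)              # term frequency in this doc
        s = sum(tf[t] * math.log(n / df[t])
                for t in query_terms if df[t])
        results.append((doc_id, s))
    return sorted(results, key=lambda r: -r[1])
```

In the full pipeline, these content scores would then be combined with the precomputed URL (link-analysis) scores before output.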

  26. Search CGI programs • Search-agent CGI: • parses the query and forks a searcher process to do the search (or uses IPC to query the searcher) • when the searcher returns, analyzes and processes the result for formatted output • processes the result and stores it in a temporary result store • logs the query and some status information • CGI for view-next-page • showmatch CGI

  27. Output control • Site grouping: group pages from the same website together • Title grouping: group pages with similar titles • Sort the output according to certain criteria

  28. Interaction • Term Suggestion: • Related terms • thesaurus • term-expansion • error correction • phonetic • spelling

  29. Personalization • Keep track of a user's interests so that search results can be tuned to improve user satisfaction • Query tracking and classification

  30. Service tools • A query cache improves search performance for queries that have already been served • Use a memory-cached file system to reduce the disk-access overhead • Mechanisms for special-case handling • Log analyzer
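The query cache can be sketched as an LRU map in front of the searcher. `search_fn` is a hypothetical stand-in for the real searcher process; the eviction policy and capacity are illustrative assumptions:

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache of already-served queries: the backend searcher
    runs only on a cache miss."""
    def __init__(self, search_fn, capacity=1000):
        self.search_fn = search_fn
        self.capacity = capacity
        self.store = OrderedDict()

    def search(self, query):
        if query in self.store:
            self.store.move_to_end(query)   # hit: mark as recently used
            return self.store[query]
        result = self.search_fn(query)      # miss: run the real searcher
        self.store[query] = result
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least-recently-used entry
        return result
```

Because query distributions are highly skewed, even a small cache of popular queries can absorb a large share of the search load.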

  31. Research Issues • Hidden-web data collection • Distributed index/search • Index minimization, incremental indexing • Smart robots • Intelligent retrieval • Automatic classification/clustering of output results • Data-source clustering/classification • classifying/clustering the whole web

  32. Conclusion • Size does matter • We are still searching for a better engine!
