
CP3024 Lecture 12

This lecture discusses the main problem of the World Wide Web - the difficulty of finding the desired web pages. It explains what search engines are and their types, components, and query interface. It also addresses the problems with search engines and provides tips for improving query results. The lecture covers the operation of search engines, how pages get into search engines, and strategies for getting pages to the top of search results. It concludes with an introduction to the WWLib-TNG search engine and the use of automatic classifiers for generating metadata.

mmorabito

Presentation Transcript


  1. CP3024 Lecture 12 Search Engines

  2. What is the main WWW problem? • With an estimated 800 million web pages, finding the one you want is difficult!

  3. What is a Search Engine? • A page on the web connected to a backend program • Allows a user to enter words which characterise a required page • Returns links to pages which match the query

  4. A Typical Search Engine

  5. Types of Search Engine • Automatic search engine e.g. Altavista, Lycos • Classified Directory e.g. Yahoo! • Meta-Search Engine e.g. Dogpile

  6. Components of a Search Engine • Robot (or Worm or Spider) • collects pages • checks for page changes • Indexer • constructs a sophisticated file structure to enable fast page retrieval • Searcher • satisfies user queries
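As a rough illustration of how the indexer and searcher fit together, here is a minimal Python sketch. It assumes the robot has already collected the pages; the `PAGES` dictionary, its URLs and the function names are all hypothetical, and a set-based inverted index stands in for the "sophisticated file structure" a real engine would use.

```python
import re
from collections import defaultdict

# Hypothetical pages, standing in for what a robot would have collected.
PAGES = {
    "http://example.org/a": "Fred Smith writes about search engines",
    "http://example.org/b": "Jean and Fred discuss web robots",
    "http://example.org/c": "Bill and Sam review classified directories",
}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, word):
    """Searcher: return the sorted URLs that contain a single word."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(PAGES)
print(search(INDEX, "fred"))
```

Looking a word up in the index avoids scanning every page at query time, which is the point of building the index ahead of the search.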

  7. Query Interface • Usually a boolean interface • (Fred and Jean) or (Bill and Sam) • Normally allows phrase searches • "Fred Smith" • Also proximity searches • Not generally understood by users • May have extra 'friendlier' features?
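The boolean and phrase queries on this slide can be sketched in Python as set operations over a positional index. This is only one possible approach, not how any particular engine works; the documents in `DOCS` and all function names are made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical documents, keyed by a made-up document id.
DOCS = {
    1: "Fred Smith met Jean",
    2: "Bill and Sam met Fred",
    3: "Smith Fred and Jean",
    4: "Sam wrote to Jean",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, word):
    """Set of document ids that contain the word."""
    return set(index.get(word.lower(), {}))

def phrase(index, text):
    """Set of document ids in which the words of `text` occur consecutively."""
    words = tokenize(text)
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words[1:], 1)):
                hits.add(doc_id)
                break
    return hits

INDEX = build_positional_index(DOCS)

# (Fred and Jean) or (Bill and Sam): "and" is intersection, "or" is union.
boolean_hits = (docs_with(INDEX, "Fred") & docs_with(INDEX, "Jean")) | (
    docs_with(INDEX, "Bill") & docs_with(INDEX, "Sam"))

# "Fred Smith" as a phrase: the words must be adjacent and in order.
phrase_hits = phrase(INDEX, "Fred Smith")
print(sorted(boolean_hits), sorted(phrase_hits))
```

Document 3 contains both "Fred" and "Smith" but not as a phrase, which is why phrase queries tend to return fewer, more precise results.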

  8. Search Results • Presented as links • Supposedly ordered in terms of relevancy to the query • Some Search Engines score results • Normally organised in groups of ten per page

  9. Problems • Links are often out of date • Usually too many links are returned • Returned links are not very relevant • The Engines don't know about enough pages • Different engines return different results • U.S. bias

  10. Improving query results • To look for a particular page use an unusual phrase you know is on that page • Use phrase queries where possible • Check your spelling! • Progressively use more terms • If you don't find what you want, use another Search Engine!

  11. Who operates Search Engines? • People who can get money from venture capitalists! • Many search engines originate from U.S. universities • Often paid for by advertisements • Engines monitor carefully what else interests you (paid by the click)

  12. How do pages get into a Search Engine? • Robot discovery (following links) • Self submission • Payments

  13. Robot Discovery • Robots visit sites while following links • The more links the more visits • Make sure you don't exclude Robots from visiting public pages
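Robots are usually excluded through a robots.txt file, so it is worth checking that yours does not block public pages. A small sketch using Python's standard `urllib.robotparser` follows; the robots.txt content, the site URLs and the robot name "ExampleBot" are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is kept out of /private/ only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Would a well-behaved robot be allowed to fetch these pages?
public_ok = parser.can_fetch("ExampleBot", "http://www.example.org/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://www.example.org/private/notes.html")
print(public_ok, private_ok)
```

A rule such as `Disallow: /` would make `can_fetch` return `False` for every page, which is the mistake the slide warns against.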

  14. Payments • Some search engines only index paying customers • The more you pay the higher you appear in answers to queries

  15. Self submission • Register your page with a search engine • Pay for a company to register you with many search engines • Get registration with many search engines for free!

  16. Getting to the top • Only relevant pages should be ranked highly • Search engines only look at text • Search engine operators try to stop "search engine spamming" • Some queries are pre-answered

  17. Get where you should be! • Put more than graphics on a page • Don't use frames • Use the ALT attribute on <IMG> tags • Make good use of <TITLE> and <H1> • Consider using the <META> tag • Get people to link to your page
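To see why these tips matter, it helps to look at a page the way a text-only indexer might. The sketch below uses Python's standard `html.parser` to pull out the title, headings, ALT text and META content. The class name `IndexableText` and the sample page are invented for illustration, and real engines weigh these fields in ways this sketch does not attempt to model.

```python
from html.parser import HTMLParser

class IndexableText(HTMLParser):
    """Collect the text a text-only indexer could see in a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.alt_texts = []
        self.meta = {}
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "title":
            self.title += text
        elif self._current == "h1":
            self.headings.append(text)

# Hypothetical page: one image has ALT text, the other does not.
PAGE = """<html><head><title>Fred's Bike Shop</title>
<meta name="keywords" content="bikes, repairs">
</head><body><h1>Bike repairs</h1>
<img src="logo.gif" alt="Fred's Bike Shop logo">
<img src="banner.gif"></body></html>"""

seen = IndexableText()
seen.feed(PAGE)
print(seen.title, seen.headings, seen.alt_texts, seen.meta)
```

The second image contributes nothing to the output: without ALT text, a graphic is invisible to an engine that only looks at text.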

  18. Summary • Search Engines are vital to the Web user • Search Engines are not perfect by a long way • There are tactics for better searching • Page design can bring more visitors via Search Engines • The more links the better!

  19. WWLib-TNG A Next Generation Search Engine

  20. In the beginning • WWLib-TOS • Manually constructed directory • Classified on Dewey Decimal • Simple data structure • Proof of concept

  21. The New Architecture

  22. The Classifier

  23. Motive - Why Generate Metadata Automatically? • Meta tags are not compulsory • Old pages are less likely to have meta tags • Available data can be unreliable • The Web of Trust requires comprehensive resource description • An essential prerequisite for widespread deployment of RDF applications

  24. Method - How can Metadata be Generated Automatically? • Using an automatic classifier • The classifier classifies Web Pages according to Dewey Decimal Classification • Other useful metadata can be extracted during the process of automatic classification

  25. Automatic Classification • Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines • DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature

  26. Automatic Classifier - How does it work? Firstly, the page is retrieved from a URL or local file and parsed to produce a document object

  27. Automatic Classifier - How does it work? The document object is then compared with DDC objects representing the ten top-level DDC classes

  28. Automatic Classifier - How does it work? • Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score • A measure of similarity is then calculated using a similarity coefficient
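The scoring step can be sketched as follows. The slides do not say which similarity coefficient WWLib-TNG uses, so this example assumes a Dice-style ratio (the matched weight divided by the total weight of both objects); the word weights and function names are likewise made up.

```python
def match_score(doc_weights, class_weights):
    """Add the document weight and the class weight for every shared word."""
    return sum(doc_weights[word] + class_weights[word]
               for word in doc_weights if word in class_weights)

def similarity(doc_weights, class_weights):
    """Assumed Dice-style coefficient: matched weight over total weight.

    Returns a value between 0.0 (no shared words) and 1.0 (identical vocabularies).
    """
    total = sum(doc_weights.values()) + sum(class_weights.values())
    if total == 0:
        return 0.0
    return match_score(doc_weights, class_weights) / total

# Hypothetical word weights for a document and for one DDC class.
document = {"engine": 2, "search": 3, "web": 1}
ddc_class = {"search": 2, "library": 4, "web": 1}

score = match_score(document, ddc_class)   # (3 + 2) + (1 + 1)
coefficient = similarity(document, ddc_class)
print(score, round(coefficient, 3))
```

Normalising by the total weight keeps long documents, or classes with large vocabularies, from scoring highly simply because they contain more words.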

  29. Automatic Classifier - How does it work? • If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class • If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark • If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
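The descent through the class hierarchy might look like the sketch below. The `DDCNode` type, the word weights, the Dice-style `similarity` function and the 0.2 threshold for a "significant" match are all assumptions for illustration; the slides do not give the real values.

```python
from dataclasses import dataclass, field

@dataclass
class DDCNode:
    classmark: str
    weights: dict                      # word -> weight (made-up values below)
    subclasses: list = field(default_factory=list)

def similarity(doc_weights, class_weights):
    """Assumed Dice-style coefficient: matched weight over total weight."""
    matched = sum(doc_weights[w] + class_weights[w]
                  for w in doc_weights if w in class_weights)
    total = sum(doc_weights.values()) + sum(class_weights.values())
    return matched / total if total else 0.0

def classify(doc_weights, nodes, threshold=0.2):
    """Return the classmarks of every leaf class reached by significant matches."""
    classmarks = []
    for node in nodes:
        if similarity(doc_weights, node.weights) < threshold:
            continue  # not significant: go no further down this branch
        if node.subclasses:
            classmarks.extend(classify(doc_weights, node.subclasses, threshold))
        else:
            classmarks.append(node.classmark)  # leaf node: assign the classmark
    return classmarks

# A tiny, hypothetical fragment of the hierarchy.
TOP_LEVEL = [
    DDCNode("000", {"information": 1, "computer": 1, "library": 1}, [
        DDCNode("004", {"computer": 2, "software": 1}),
        DDCNode("020", {"library": 2, "catalogue": 1}),
    ]),
    DDCNode("500", {"science": 1, "mathematics": 1, "physics": 1}),
]

document = {"library": 2, "catalogue": 1, "information": 1}
result = classify(document, TOP_LEVEL)
print(result)
```

Pruning a branch as soon as the match is not significant means that most of the hierarchy is never compared with the document. In this sketch, a document that matches a class but none of its subclasses receives no classmark from that branch; the slides do not say how the real classifier handles that case.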

  30. Metadata elements • The automatic classification process can be used to extract useful metadata elements in addition to the classmarks • Keywords • Classmarks • Word count • Title • URL • Abstract • A unique accession number and associated dates can be obtained and supplied by the system

  31. Metadata elements - Wolverhampton Core

  32. RDF Data Model

  33. RDF Schema • There is a significant overlap with the Dublin Core element set • Requirement for implementation clarity • Those that have Dublin Core equivalents are declared as sub-properties • Maintain interoperability with Dublin Core applications

  34. RDF Schema
      <rdf:Description ID="Keyword">
        <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
        <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
        <rdfs:label>Keyword</rdfs:label>
      </rdf:Description>
      <rdf:Description ID="Classmark">
        <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
        <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
        <rdfs:label>Classmark</rdfs:label>
      </rdf:Description>

  35. Classifier Evaluation • Automatic metadata generation will become important for the widespread deployment of RDF based applications • Documents created before the invention of RDF generating authoring tools also need to be described • RDF utilised in this manner may encourage interoperability between search engines • More info: http://www.scit.wlv.ac.uk/~ex1253/

  36. Current Status of WWLib-TNG • New results interface proposed • R-wheel (CirSA) • Builder and searcher constructed, now being tested • Classifier constructed • Test Dispatcher/Analyser/Archiver in place
