CP3024 Lecture 12

CP3024 Lecture 12 Search Engines

What is the main WWW problem? • With an estimated 800 million web pages, finding the one you want is difficult!

What is a Search Engine? • A page on the web connected to a backend program • Allows a user to enter words which characterise a required page • Returns links to pages which match the query

A Typical Search Engine

Types of Search Engine • Automatic search engine e.g. Altavista, Lycos • Classified Directory e.g. Yahoo! • Meta-Search Engine e.g. Dogpile

Components of a Search Engine • Robot (or Worm or Spider) • collects pages • checks for page changes • Indexer • constructs a sophisticated file structure to enable fast page retrieval • Searcher • satisfies user queries
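As a rough illustration of how the indexer and searcher fit together, the sketch below builds an inverted index (word to set of URLs) and answers an all-words query from it. The slides do not say which file structure a real engine uses; the inverted index, the function names and the two example pages are all assumptions made for this sketch, and the dictionary of pages stands in for whatever a robot would have collected.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Searcher: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Stand-in for pages a robot has already collected (hypothetical content).
pages = {
    "http://example.org/a": "Search engines index web pages",
    "http://example.org/b": "Robots collect pages by following links",
}
index = build_index(pages)
print(search(index, "robots pages"))
```

Looking a word up in the index costs one dictionary access regardless of how many pages were collected, which is the point of building the structure ahead of query time.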

Query Interface • Usually a boolean interface • (Fred and Jean) or (Bill and Sam) • Normally allows phrase searches • "Fred Smith" • Also proximity searches • Not generally understood by users • May have extra 'friendlier' features
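The three query styles on this slide can be sketched with ordinary set operations. This is only an illustration under assumed data: the three pages, the helper names `containing`, `phrase` and `near`, and the idea of measuring proximity as a distance in words are all hypothetical, and a real engine would work from its index rather than scanning every page.

```python
# Hypothetical pages, each stored as its words in order.
docs = {
    "page1": "fred smith met jean at the conference".split(),
    "page2": "bill and sam wrote the report".split(),
    "page3": "fred wrote to bill smith".split(),
}

def containing(word):
    """Set of pages that contain the word."""
    return {name for name, words in docs.items() if word in words}

def phrase(*terms):
    """Phrase search: pages where the terms appear consecutively, in order."""
    terms = list(terms)
    n = len(terms)
    return {
        name for name, words in docs.items()
        if any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    }

def near(a, b, distance):
    """Proximity search: pages where a and b occur within `distance` words."""
    hits = set()
    for name, words in docs.items():
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if any(abs(i - j) <= distance for i in pos_a for j in pos_b):
            hits.add(name)
    return hits

# (Fred and Jean) or (Bill and Sam): AND is intersection, OR is union.
boolean_hits = (containing("fred") & containing("jean")) | (
    containing("bill") & containing("sam"))
print(boolean_hits, phrase("fred", "smith"), near("fred", "smith", 4))
```

The phrase query is stricter than the boolean one: page3 contains both "fred" and "smith" but not next to each other, so it matches the proximity search and not the phrase search.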

Search Results • Presented as links • Supposedly ordered in terms of relevancy to the query • Some Search Engines score results • Normally organised in groups of ten per page

Problems • Links are often out of date • Usually too many links are returned • Returned links are not very relevant • The Engines don't know about enough pages • Different engines return different results • U.S. bias

Improving query results • To look for a particular page use an unusual phrase you know is on that page • Use phrase queries where possible • Check your spelling! • Progressively use more terms • If you don't find what you want, use another Search Engine!

Who operates Search Engines? • People who can get money from venture capitalists! • Many search engines originate from U.S. universities • Often paid for by advertisements • Engines monitor carefully what else interests you (paid by the click)

How do pages get into a Search Engine? • Robot discovery (following links) • Self submission • Payments

Robot Discovery • Robots visit sites while following links • The more links the more visits • Make sure you don't exclude Robots from visiting public pages
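Robots are normally excluded through a robots.txt file, so one way to check the last point is to test your own rules before publishing them. The sketch below uses Python's standard `urllib.robotparser` module on an in-memory file. The robots.txt content, the `ExampleBot` agent name and the example.org URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: keep all robots out of /private/ only.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def robot_may_visit(url, agent="ExampleBot"):
    """True if the rules above allow the named robot to fetch the URL."""
    return parser.can_fetch(agent, url)

print(robot_may_visit("http://example.org/index.html"))
print(robot_may_visit("http://example.org/private/notes.html"))
```

A rule such as `Disallow: /` under `User-agent: *` would make every call return False, which is exactly the accidental exclusion the slide warns about.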

Payments • Some search engines only index paying customers • The more you pay the higher you appear on answers to queries

Self submission • Register your page with a search engine • Pay for a company to register you with many search engines • Get registration with many search engines for free!

Getting to the top • Only relevant pages should be ranked highly • Search engines only look at text • Search engine operators try to stop "search engine spamming" • Some queries are pre-answered

Get where you should be! • Put more than graphics on a page • Don't use frames • Use the ALT attribute on <IMG> tags • Make good use of <TITLE> and <H1> • Consider using the <META> tag • Get people to link to your page
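To see why these elements matter, it can help to look at a page the way a text-only indexer might. The sketch below uses Python's standard `html.parser` module to pull out the title, top-level headings, META keywords and description, and image ALT text. Which fields a given engine actually reads is not stated in the slides, so the choice of fields, the class name `IndexableText` and the sample page are assumptions.

```python
from html.parser import HTMLParser

class IndexableText(HTMLParser):
    """Collect the text a text-only indexer could see in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "h1": [], "meta": [], "alt": []}
        self._open = None  # name of the TITLE/H1 element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lower-cased
        if tag in ("title", "h1"):
            self._open = tag
        elif tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.fields["meta"].append(attrs.get("content") or "")
        elif tag == "img" and attrs.get("alt"):
            self.fields["alt"].append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.fields[self._open].append(data.strip())

def indexable_text(html):
    """Return the text fields found in an HTML string."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return parser.fields

# A hypothetical page that follows the advice on this slide.
sample = """<HTML><HEAD><TITLE>Widgets Ltd</TITLE>
<META NAME="keywords" CONTENT="widgets, gadgets">
</HEAD><BODY><H1>Our widgets</H1>
<IMG SRC="logo.gif" ALT="Widgets Ltd logo"></BODY></HTML>"""
fields = indexable_text(sample)
print(fields)
```

A page made only of images with no ALT text would come back with every list empty, leaving a text-based engine nothing to match queries against.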

Summary • Search Engines are vital to the Web user • Search Engines are not perfect by a long way • There are tactics for better searching • Page design can bring more visitors via Search Engines • The more links the better!

WWLib-TNG A Next Generation Search Engine

In the beginning • WWLib-TOS • Manually constructed directory • Classified on Dewey Decimal • Simple data structure • Proof of concept

The New Architecture

The Classifier

Motive - Why Generate Metadata Automatically? • Meta tags are not compulsory • Old pages are less likely to have meta tags • Available data can be unreliable • The Web of Trust requires comprehensive resource description • An essential prerequisite for widespread deployment of RDF applications

Method - How can Metadata be Generated Automatically? • Using an automatic classifier • The classifier classifies Web Pages according to Dewey Decimal Classification • Other useful metadata can be extracted during the process of automatic classification

Automatic Classification • Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines • DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature

Automatic Classifier - How does it work? Firstly, the page is retrieved from a URL or local file and parsed to produce a document object

Automatic Classifier - How does it work? The document object is then compared with DDC objects representing the ten top-level DDC classes

Automatic Classifier - How does it work? • Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score • A measure of similarity is then calculated using a similarity coefficient
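The scoring step can be sketched as follows. The slide does not name the similarity coefficient, so this example assumes a Dice-style normalisation (matched weight divided by the total weight of both objects); the word weights for the page and for the DDC class are invented for illustration and are not taken from WWLib-TNG.

```python
def match_score(doc_weights, class_weights):
    """Add both weights for every word the document and the class share."""
    shared = doc_weights.keys() & class_weights.keys()
    return sum(doc_weights[w] + class_weights[w] for w in shared)

def similarity(doc_weights, class_weights):
    """Dice-style coefficient (an assumption): matched weight over total weight."""
    total = sum(doc_weights.values()) + sum(class_weights.values())
    if total == 0:
        return 0.0
    return match_score(doc_weights, class_weights) / total

# Hypothetical word weights for a page and for one DDC class.
document = {"computer": 2, "network": 1, "protocol": 1}
ddc_class = {"computer": 3, "software": 2, "network": 1}

print(match_score(document, ddc_class), similarity(document, ddc_class))
```

With these weights the shared words are "computer" and "network", so the score is (2 + 3) + (1 + 1) = 7 out of a possible 10. Normalising keeps the result between 0 and 1, so long pages and short pages can be compared against the same significance threshold.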

Automatic Classifier - How does it work? • If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class • If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark • If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
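The descent through the hierarchy described above can be sketched as a short recursive function. Everything concrete here is an assumption made for the example: the `DDCNode` structure, the 0.3 significance threshold, the Dice-style `similarity` function, and the tiny four-class tree with invented word weights (the classmarks are real DDC numbers, but the tree is heavily simplified).

```python
from dataclasses import dataclass, field

@dataclass
class DDCNode:
    classmark: str
    weights: dict                      # word -> weight for this class
    subclasses: list = field(default_factory=list)

def similarity(doc, weights):
    """Dice-style coefficient (assumed): matched weight over total weight."""
    total = sum(doc.values()) + sum(weights.values())
    if total == 0:
        return 0.0
    shared = doc.keys() & weights.keys()
    return sum(doc[w] + weights[w] for w in shared) / total

def classify(doc, nodes, threshold=0.3):
    """Return classmarks of leaf classes reached through significant matches."""
    classmarks = []
    for node in nodes:
        if similarity(doc, node.weights) < threshold:
            continue  # not significant: go no further down this branch
        if node.subclasses:
            classmarks.extend(classify(doc, node.subclasses, threshold))
        else:
            classmarks.append(node.classmark)  # leaf node: assign classmark
    return classmarks

# A hypothetical fragment of the DDC hierarchy with made-up weights.
tree = [
    DDCNode("000", {"computer": 2, "information": 1}, [
        DDCNode("004", {"computer": 2, "network": 1}),
        DDCNode("020", {"library": 2, "catalogue": 1}),
    ]),
    DDCNode("500", {"science": 2, "mathematics": 1}, [
        DDCNode("510", {"mathematics": 2, "algebra": 1}),
    ]),
]
doc = {"computer": 2, "network": 1}
print(classify(doc, tree))
```

Pruning insignificant branches early means most of the hierarchy is never compared with the document, and a page that matches several branches can still receive more than one classmark.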

Metadata elements • Automatic classification can yield more metadata than the classmarks alone • Generated during classification: Keywords, Classmarks, Word count • Also extracted: Title, URL, Abstract • A unique accession number and associated dates can be obtained and supplied by the system
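A minimal sketch of gathering some of these elements from a page's text is shown below. The slides do not say how WWLib-TNG chooses keywords or builds the abstract, so this example assumes the most frequent non-stop-words serve as keywords and the opening words serve as the abstract. The stop-word list, the function name and its parameters are all hypothetical.

```python
import re
from collections import Counter

# A short assumed stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def extract_metadata(url, title, text, max_keywords=3, abstract_words=10):
    """Build a metadata record from a page's URL, title and body text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {
        "url": url,
        "title": title,
        "word_count": len(words),
        "keywords": [w for w, _ in counts.most_common(max_keywords)],
        "abstract": " ".join(text.split()[:abstract_words]),
    }

record = extract_metadata(
    "http://example.org/engines.html",  # hypothetical page
    "About search engines",
    "Search engines index pages. Search engines rank pages. Robots collect pages.",
)
print(record)
```

The accession number and dates are left out because, as the slide notes, the system supplies them rather than the page.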

Metadata elements - Wolverhampton Core

RDF Data Model

RDF Schema • There is a significant overlap with the Dublin Core element set • Requirement for implementation clarity • Those that have Dublin Core equivalents are declared as sub-properties • Maintain interoperability with Dublin Core applications

RDF Schema
<rdf:Description ID="Keyword">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Keyword</rdfs:label>
</rdf:Description>
<rdf:Description ID="Classmark">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Classmark</rdfs:label>
</rdf:Description>

Classifier Evaluation • Automatic metadata generation will become important for the widespread deployment of RDF based applications • Documents created before the invention of RDF generating authoring tools also need to be described • RDF utilised in this manner may encourage interoperability between search engines • More info: http://www.scit.wlv.ac.uk/~ex1253/

Current Status of WWLib-TNG • New results interface proposed • R-wheel (CirSA) • Builder and searcher constructed, now being tested • Classifier constructed • Test Dispatcher/Analyser/Archiver in place
