The WebWatch Project
About WebWatch
The WebWatch project is funded by BLRIC (British Library Research and Innovation Centre).
Analysis and statistical programs produce reports
Report for UK Universities
A WebWatch Trawl
A simple model of how the WebWatch robot trawls communities is shown below
Resources A, B, etc. could be individual pages or entire websites
Input file of URLs
WebWatch robot reads input file and retrieves resources
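The model above (input file of URLs, robot retrieval, analysis programs producing a report) could be sketched roughly as follows. This is an illustrative outline in Python, not the WebWatch robot itself; the function names, the comment convention for the input file and the shape of the result records are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen


def read_url_list(lines: Iterable[str]) -> List[str]:
    """Return the URLs from an input file, skipping blank lines and
    '#' comments (the comment convention is an assumption)."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def fetch(url: str) -> Dict[str, object]:
    """Retrieve one resource over the network."""
    with urlopen(url, timeout=30) as response:
        return {
            "url": url,
            "status": response.status,
            "headers": dict(response.headers.items()),
            "body": response.read(),
        }


def trawl(urls: Iterable[str],
          fetcher: Callable[[str], Dict[str, object]] = fetch) -> List[Dict[str, object]]:
    """Retrieve every URL, recording failures instead of stopping."""
    results = []
    for url in urls:
        try:
            results.append(fetcher(url))
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            results.append({"url": url, "error": str(exc)})
    return results


def report(results: Iterable[Dict[str, object]]) -> Counter:
    """Stand-in for the analysis step: count successes and failures."""
    return Counter("failed" if "error" in r else "retrieved" for r in results)
```

Passing the fetcher as a parameter keeps retrieval separate from analysis, so the analysis step can be rerun on stored results without trawling again.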
UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC’s Electronic Libraries Programme and the European Union. UKOLN also receives support from the University of Bath where it is based.
The WebWatch project carried out a trawl of UK University entry points on 24 October 1997.
The trawl was repeated on 31 July 1998.
The most popular web server was Apache. This has grown in popularity, with a decline in the CERN, NCSA and other smaller servers.
Microsoft's IIS server has also grown in popularity, perhaps indicating growth in use of Windows NT.
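Server popularity figures of this kind are typically derived from the HTTP Server response header. The sketch below shows one way such a tally might be made; it assumes the header values have already been collected, and the function names are hypothetical rather than part of the WebWatch software.

```python
from collections import Counter
from typing import Iterable, Optional


def server_product(server_header: Optional[str]) -> str:
    """Reduce a Server header such as 'Apache/1.3.0 (Unix)' to 'Apache'."""
    if not server_header or not server_header.strip():
        return "unknown"
    first_token = server_header.split()[0]   # e.g. 'Apache/1.3.0'
    return first_token.split("/")[0]         # e.g. 'Apache'


def tally_servers(server_headers: Iterable[Optional[str]]) -> Counter:
    """Count entry points per server product name."""
    return Counter(server_product(header) for header in server_headers)
```

Comparing the tallies from two trawls then shows which servers have gained or lost ground between them.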
Size of Entry Points
The file sizes of HTML resources (including frame sets) and images (but excluding background images) were analysed.
Four pages were smaller than 5 Kb.
The largest page was 193 Kb.
The largest pages contained animated GIF images.
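A size measurement along these lines could be sketched as below. The example assumes the sizes of embedded resources have already been obtained (for instance from Content-Length headers) and are supplied as a mapping; counting each image once and skipping the BODY BACKGROUND attribute are choices made for the sketch, not a description of how WebWatch measured sizes.

```python
from html.parser import HTMLParser
from typing import Dict, List


class EmbeddedResourceParser(HTMLParser):
    """Collect the SRC values of IMG and FRAME tags."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[str] = []
        self.frames: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        if tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])
        elif tag == "frame" and attributes.get("src"):
            self.frames.append(attributes["src"])
        # <body background="..."> is deliberately ignored (background images excluded)


def entry_point_size(html: str, sizes: Dict[str, int]) -> int:
    """Size in bytes of the HTML plus its frames and inline images.

    `sizes` maps each embedded URL to its size in bytes; resources
    missing from the mapping count as zero. Images inside framed
    documents are not followed in this sketch.
    """
    parser = EmbeddedResourceParser()
    parser.feed(html)
    total = len(html.encode("utf-8"))
    for url in set(parser.images) | set(parser.frames):  # count each resource once
        total += sizes.get(url, 0)
    return total
```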
An analysis of some of the technologies used in UK University entry points is given below.
Liverpool University probably has the only university entry page using Java
None of the institutions trawled made use of Java.
Subsequently it was found that one institution used Java. This institution used the Robot Exclusion Protocol to stop robots from trawling the site.
Java provides this scrolling news facility
In October 1997, 54 institutions used "Alta Vista" type metadata on their main entry point. By July 1998 this metadata was used on 74 entry points.
In contrast Dublin Core metadata was used on only 2 pages on both occasions.
<META NAME="description" CONTENT="Mailbase is a national mailing list centre for UK HE">
<META NAME="keywords" CONTENT="mail, listserve">
<META NAME="DC.Title" CONTENT="The Mailbase Home Page">
<META NAME="DC.Creator" CONTENT="John Smith">
Possible Use of Alta Vista and Dublin Core Metadata
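Detecting the two kinds of metadata illustrated above could look something like the following. It is a minimal sketch: it treats META tags named "description" or "keywords" as "Alta Vista" type metadata and names beginning with "DC." as Dublin Core, which is an assumption about how the categories were defined in the survey.

```python
from html.parser import HTMLParser
from typing import Dict, List


class MetaNameParser(HTMLParser):
    """Collect the NAME attribute of every META tag, lower-cased."""

    def __init__(self) -> None:
        super().__init__()
        self.names: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name.lower())


def classify_metadata(html: str) -> Dict[str, bool]:
    """Report which metadata conventions appear in a page."""
    parser = MetaNameParser()
    parser.feed(html)
    return {
        # Assumed definition: description/keywords tags used by search engines
        "alta_vista": any(n in ("description", "keywords") for n in parser.names),
        # Assumed definition: any element name with a "DC." prefix
        "dublin_core": any(n.startswith("dc.") for n in parser.names),
    }
```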
Interest in cache-friendly web resources has grown since the introduction of network charging on 1 August 1998.
Over 50% of institutional HTML resources were found to be cachable, with only 1% not cachable. Further analysis is needed for the other resources.
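The criteria used to decide whether a resource is cachable are not stated here. As an illustration only, the sketch below applies a simple heuristic to HTTP response headers: explicit no-cache style directives mean "not cachable", validators or expiry information mean "cachable", and anything else is left undetermined. The rules and the function name are assumptions, not the WebWatch method.

```python
from typing import Dict


def classify_cacheability(headers: Dict[str, str]) -> str:
    """Classify a response as 'cachable', 'not cachable' or 'undetermined'."""
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # Split 'Cache-Control: private, max-age=60' into {'private', 'max-age'}
    cache_control = lowered.get("cache-control", "")
    directives = {
        part.strip().split("=")[0]
        for part in cache_control.split(",")
        if part.strip()
    }

    if directives & {"no-cache", "no-store", "private"}:
        return "not cachable"
    if "no-cache" in lowered.get("pragma", ""):
        return "not cachable"
    if "last-modified" in lowered or "expires" in lowered or "max-age" in directives:
        return "cachable"
    return "undetermined"
```

The "undetermined" outcome corresponds to resources for which the headers alone do not settle the question.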
% telnet www.ukoln.ac.uk 80
GET / HTTP/1.0
HTTP/1.1 200 OK
Date: Fri, 28 Aug 1998 16:22:51 GMT
Telnet can be used to analyse HTTP headers, including caching information
A WebWatch service is being developed to provide a web-interface to the telnet command, to give more helpful information.
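The telnet session above can also be reproduced in a few lines of code. The following is a sketch of the general technique, not the WebWatch service: it sends an HTTP/1.0 GET request over a socket and splits the response into a status line and headers. The Host header is an addition not present in the telnet example, included because many servers need it to pick the right site.

```python
import socket
from typing import Dict, Tuple


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def parse_response_head(raw: bytes) -> Tuple[str, Dict[str, str]]:
    """Split a raw HTTP response into its status line and headers.

    If a header is repeated, the last value wins in this sketch.
    """
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    status_line = lines[0]
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, separator, value = line.partition(":")
        if separator:
            headers[name.strip()] = value.strip()
    return status_line, headers


def fetch_head(host: str, port: int = 80, path: str = "/") -> Tuple[str, Dict[str, str]]:
    """Connect, send the request and return the parsed response head."""
    with socket.create_connection((host, port), timeout=30) as connection:
        connection.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = connection.recv(4096)
            if not chunk:  # HTTP/1.0 servers close the connection when done
                break
            chunks.append(chunk)
    return parse_response_head(b"".join(chunks))
```

Calling fetch_head("www.ukoln.ac.uk") would perform the same exchange as the telnet example, returning the status line and a dictionary of headers such as Date.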
UMIST is an example of a framed website
Liverpool University also uses frames but this was not detected by the robot due to their use of the Robot Exclusion Protocol.
In July 1998, 5 sites used client-side requests to provide redirects or "splash screens".
"Splash screens" are created with a tag of the form <META HTTP-EQUIV="refresh" CONTENT="n; URL=xxx.html">, where n is the delay in seconds before xxx.html is loaded.
De Montfort University displays a screen with a yellow background. After 8 seconds a new screen is displayed.
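A robot could detect redirects and splash screens of this kind by looking for the refresh tag and reading the delay and target from its CONTENT attribute. The sketch below shows one possible approach; the class and function names are hypothetical, and the file name used in the usage note is invented for illustration.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Matches 'n', 'n; URL=target' and the looser 'n; target' form
REFRESH_PATTERN = re.compile(
    r"\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$", re.IGNORECASE
)


class RefreshParser(HTMLParser):
    """Collect (delay_seconds, target_url) pairs from META refresh tags."""

    def __init__(self) -> None:
        super().__init__()
        self.refreshes: List[Tuple[int, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        match = REFRESH_PATTERN.match(attributes.get("content") or "")
        if match:
            delay = int(match.group(1))
            target = (match.group(2) or "").strip() or None
            self.refreshes.append((delay, target))


def find_refreshes(html: str) -> List[Tuple[int, Optional[str]]]:
    """Return every client-side refresh found in the page."""
    parser = RefreshParser()
    parser.feed(html)
    return parser.refreshes
```

For a page containing <META HTTP-EQUIV="refresh" CONTENT="8; URL=next.html">, find_refreshes returns a single pair with a delay of 8 and the target next.html. A delay of zero suggests a plain redirect, while a longer delay suggests a splash screen.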
The WebWatch trawls revealed some interesting hyperlinking issues, which are described below.
Numbers of Hyperlinks
The histogram of the numbers of hyperlinks from institutional entry points shows an approximately normal distribution.
Six sites were found to have fewer than 5 links.
One site contained over 75 links.
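Link counts and a histogram like the one described could be produced along the following lines. This is a sketch under stated assumptions: it counts A and AREA tags that carry an HREF attribute, and groups the counts into bins of five; neither choice is confirmed as the method used in the WebWatch analysis.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, Iterable


class LinkCounter(HTMLParser):
    """Count hyperlinks (A and AREA tags with an HREF attribute)."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area") and dict(attrs).get("href"):
            self.count += 1


def count_links(html: str) -> int:
    """Return the number of hyperlinks in one page."""
    parser = LinkCounter()
    parser.feed(html)
    return parser.count


def histogram(link_counts: Iterable[int], bin_width: int = 5) -> Dict[int, int]:
    """Map the lower edge of each bin to the number of pages in it."""
    bins = Counter((count // bin_width) * bin_width for count in link_counts)
    return dict(sorted(bins.items()))
```

Named anchors such as <a name="top"> are not counted, since they have no HREF and are link targets rather than links.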
Trawls of UK University Entry Points
The WebWatch project has surveyed UK University web site entry points on three occasions: 24 October 1997, 31 July 1998 and 25 November 1998.
A summary of significant trends is given below.
The Apache and Microsoft web servers are both growing in popularity, at the expense of the CERN and Netscape servers, and a number of more specialist servers.
The number of entry points using "splash screens" has increased from 5 (Oct 97) to 7 (Jul 98) to 10 (Nov 98).
Use of Dublin Core (DC) metadata grew during the summer of 1998 from 2 sites to 11. DC metadata is still dwarfed by "Alta Vista" style metadata.
Size of Entry Points
Trends in the sizes of entry points (HTML plus embedded images) have been analysed. The majority of entry points have not changed significantly in size, although one or two have grown (by about 100 Kb) or shrunk (by about 50 Kb) substantially.
WebWatch provides access to various tools and utilities which have been developed to support its work. These services can be accessed using a Web browser at the address <URL: http://www.ukoln.ac.uk/web-focus/webwatch/services/ >.
A web form is available which can be used to obtain the HTTP headers sent when a resource is accessed.
This service can be useful for getting information, such as the name of the server software, HTTP version information, etc.
A web form is available which can be used to obtain information on web resources.
The Doc-info service is integrated with the HTTP-info service, enabling the HTTP headers for all objects contained in a resource to be analysed.
The Harvest software was used originally.
Harvest is widely used within the research community for indexing resources. For example, the ACDC project uses Harvest to provide a distributed index of UK.AC web resources.
Unfortunately, as Harvest was designed for indexing, it is limited in its ability to audit and monitor web technologies.
The current version of the WebWatch robot uses Perl.
ACDC uses Harvest. See <URL: http://acdc.hensa.ac.uk/>
Typical robots.txt File
WebWatch Hosts A robots.txt Checker Service
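How the checker service works is not described here. As a general illustration of the Robot Exclusion Protocol, the sketch below uses Python's standard urllib.robotparser module to test whether a robot may retrieve a URL. The robots.txt content, the robot name and the example.ac.uk host are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_trawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the example rules, a robot calling itself "WebWatch" may fetch http://www.example.ac.uk/index.html but not http://www.example.ac.uk/private/page.html. A site that disallows "/" for all robots excludes its entry point from a trawl altogether.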
The final WebWatch report makes a number of recommendations, based on its trawls, including advice for Information Providers, Web Administrators and Robot Software Developers.
Further recommendations are included in the final WebWatch report.
The report is available at <URL: http://www.ukoln.ac.uk/web-focus/webwatch/reports/final/ >.
The WebWatch Officer is Ian Peacock (email I.Peacock@ukoln.ac.uk).
Ian's responsibilities include software development, running the robot trawls, analysing the data and producing reports.
The WebWatch project is managed by Brian Kelly (email B.Kelly@ukoln.ac.uk).
The final WebWatch report can be obtained from <URL:http://www.ukoln.ac.uk/web-focus/webwatch/reports/final/>