Clustering of search engine results by Google
Wouter.Mettrop@cwi.nl, CWI, Amsterdam, The Netherlands
Paul.Nieuwenhuysen@vub.ac.be, Vrije Universiteit Brussel, and Universiteit Antwerpen, Belgium
Hanneke Smulders, Infomare Consultancy, The Netherlands
Presented at Internet Librarian International 2004 in London, England, October 2004
Our experimental, quantitative investigation sheds some light on how the Google search engine omits WWW documents from the ranked list of search results when it considers them "very similar". On the last page of search results, Google offers the option to "repeat the search with the omitted results included". This can be considered an additional service that the system offers to its users.
However, our investigation revealed that pages are also clustered, omitted, and thus partly hidden even when they differ substantially in meaning for a human reader. The system does not distinguish authentic pages from copies or, more importantly, from copies that were modified on purpose.
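How two pages can be "very similar" to a machine yet differ in meaning for a reader can be illustrated with a minimal sketch. It does not reproduce Google's method, which this abstract does not describe; it assumes a common textbook measure instead, Jaccard similarity over three-word shingles. The function names and the two sample texts are hypothetical and are not taken from the test documents.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Hypothetical page texts: the copy was modified on purpose by adding
# a single word ("not"), which reverses the meaning of the sentence.
original = ("the committee reviewed the evidence and concluded that the "
            "new treatment is safe for all adult patients in the trial")
modified = ("the committee reviewed the evidence and concluded that the "
            "new treatment is not safe for all adult patients in the trial")

similarity = jaccard(original, modified)
print(f"similarity: {similarity:.2f}")
```

Under this assumed measure the two texts share most of their shingles, so a system that clusters on word overlap alone could treat them as copies and show only one of them, although they state opposite conclusions.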
Furthermore, Google selects different WWW documents over time to represent the cluster of very similar documents.
A practical consequence is that a user searching for information may come to rely on a WWW document that represents a cluster of documents but is not necessarily the most appropriate or authentic one.
This is analogous to the problem of how to rank entries in the presentation of search results.
All test documents: 18
Test documents found by searching for content (not URL): 15
Test documents used as representative: 9
Test documents found by searching for URL: 15
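The counts above are easier to compare as shares of the full test set. The short sketch below does that arithmetic; it assumes each count refers to the same 18 test documents, and the dictionary and variable names are illustrative only.

```python
# Counts as reported above; the share is computed against all test documents.
counts = {
    "All test documents": 18,
    "Found by searching for content (not URL)": 15,
    "Used as representative": 9,
    "Found by searching for URL": 15,
}

total = counts["All test documents"]
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} of {total} ({shares[label]:.0%})")
```

Read this way, half of the test documents served as the representative of a cluster at some point, and 15 of the 18 could be found either by content or by URL.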