A library historically indexed its collection in a very space- and time-consuming manner. The index consisted of a physical card catalogue organized by three key fields (subject, author, title) and a few cross-references, such that every book had three separate cards, each meticulously filed in its own catalogue.
The system was difficult to manage even with a modest collection of books, numbering in the tens of thousands, and would certainly not “scale up” well to the volume of information being published today.
Furthermore, the old catalogue system was relatively shallow in its information content; if a patron wanted to find books that featured stories about a specific topic, such as aardvarks, and the word aardvark did not occur in the title or subject fields, then the catalogue would be unlikely to yield the full range of materials even if they existed in the collection.
It would be nice, from the point of view of the aardvark enthusiast, to be able to get a list of all books that mention aardvarks, along with the relative frequency of the word in each, so as to determine whether a book is likely to be a significant source of information or merely makes a passing reference.
The preparation of such a textual analysis is based upon the generation of a “concordance” of the texts, wherein each word of the text is indexed, and the result is a list of words along with their frequency of use. The production of a concordance is a long and painstaking process when working from printed texts; it becomes quite easy with an electronic form of that same text and the power of a modern computer.
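The idea of a word list with frequencies can be sketched in a few lines of Python. This is only a minimal illustration: the function name `build_concordance`, the sample sentence, and the rule for what counts as a word (runs of letters, lowercased) are all assumptions made for the example, not part of any particular concordance tool.

```python
import re
from collections import Counter


def build_concordance(text: str) -> Counter:
    """Return a mapping from each word to the number of times it occurs."""
    # Assumption: a "word" is a run of letters, optionally joined by
    # apostrophes (e.g. "doesn't"), compared without regard to case.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


# Hypothetical sample text, used only for illustration.
sample = "The aardvark dug. The aardvark slept, and the ants fled."
concordance = build_concordance(sample)

# Print the word list alphabetically, with each word's frequency.
for word, count in sorted(concordance.items()):
    print(f"{word}: {count}")
```

A real concordance would also record where each word occurs, not just how often; the counting step shown here is the simplest part of the job.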
A concordance is an interesting tool, with a well-established history in the field of textual analysis. The first known example of a concordance was created in the 13th century, using the books of the Bible. Another well-known example is the concordance of the works of William Shakespeare.
In both of these cases, the concordance data was used to facilitate the cross-referencing of people, places and events, and to help investigate the use of particular phrases or literary allusions.
A concordance can also be used as a means to authenticate texts or to ascertain the authorship of a particular text. This kind of use makes the newspapers every so often, when a researcher claims that an old work was actually written by Shakespeare, or that Shakespeare’s works were really written by someone else, or when someone tries to unmask the identity of a criminal (Jack the Ripper, for example) based upon notes left behind at the scenes of the crimes.
The output of a concordance program, in digital form, could easily be scanned for key words (or, in more advanced forms, for phrases and words in close proximity as well).
The ready access to the original document and its entire vocabulary list would provide a richer and deeper capacity to help determine that, in this example, the (fictional) book “The Life and Times of Arnold” is a likely source of information about aardvarks, in spite of the fact that the word “aardvark” doesn’t appear in the title, and the subject might be listed as “biography”.
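The aardvark search described above might look something like the following sketch. The three book titles and their one-line "texts" are invented for the example, as are the function names; the relative frequency is expressed as occurrences per 1,000 words, which is just one plausible way to separate a significant source from a passing reference.

```python
import re
from collections import Counter


def build_concordance(text: str) -> Counter:
    """Count how often each word (a run of letters, lowercased) occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def rank_sources(documents: dict[str, str], keyword: str) -> list[tuple[str, float]]:
    """Rank titles by occurrences of `keyword` per 1,000 words of text."""
    keyword = keyword.lower()
    results = []
    for title, text in documents.items():
        concordance = build_concordance(text)
        total_words = sum(concordance.values())
        hits = concordance[keyword]
        if hits:  # skip books that never mention the keyword
            results.append((title, 1000 * hits / total_words))
    # Highest relative frequency first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical collection; real texts would be far longer.
library = {
    "The Life and Times of Arnold": "Arnold was an aardvark. Every aardvark loves ants.",
    "Garden Pests": "Ants, beetles and slugs trouble most gardens, though an aardvark would help.",
    "Knitting Basics": "Cast on twenty stitches and knit every row.",
}

ranked = rank_sources(library, "aardvark")
for title, per_thousand in ranked:
    print(f"{title}: {per_thousand:.1f} per 1,000 words")
```

Note that this sketch matches exact words only, so "aardvarks" would not count as a hit for "aardvark"; handling plurals and other word forms is a separate problem.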
One of the most famous examples of a concordance making a big splash in the news occurred in 1991, with the “unapproved” publication of the Dead Sea Scrolls.
The Dead Sea Scrolls were discovered in 1947, in the Middle East (near the Dead Sea of course). There were approximately 800 scrolls, representing most of the books of the Old Testament. They were the oldest and thus most “authentic” of any known examples of biblical texts.
They soon became the center of controversy. The scrolls and their contents were kept out of general circulation, seen and studied only by a select group of scholars. The limited distribution was purportedly set up out of concern that the translation and interpretation of the scrolls were too important to be done in a careless or insensitive manner.
But as decades passed, and few of the scrolls had been published, there was growing impatience amongst others in the field who were upset that some of the most important documents in history were being deliberately withheld, with no definitive plan for their ultimate publication.
The situation changed dramatically when the text of the Dead Sea Scrolls was released by a graduate student from Hebrew Union College in Cincinnati, Ohio. A full concordance of the scrolls had previously been prepared, complete with listings of every word (in Hebrew and Aramaic) and where it occurred in each of the documents, and the school possessed a printed copy.
The significance of the concordance was not lost on the graduate student, who transcribed the information into digital form and used a desktop computer to re-assemble the original text from the concordance data.
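The reason this works is that a concordance which records where every word occurs contains the same information as the text itself, only sorted differently. The sketch below shows the principle under a simplifying assumption: each word's location is a single integer position in one document, whereas the actual scroll concordance would have used its own reference scheme. The function names and the sample phrase are hypothetical.

```python
def build_positional_concordance(words: list[str]) -> dict[str, list[int]]:
    """Map each word to the list of positions at which it occurs."""
    positions: dict[str, list[int]] = {}
    for index, word in enumerate(words):
        positions.setdefault(word, []).append(index)
    return positions


def reconstruct(positions: dict[str, list[int]]) -> list[str]:
    """Rebuild the original word sequence from a positional concordance."""
    # Every position holds exactly one word, so the total number of
    # recorded positions equals the length of the text.
    length = sum(len(indices) for indices in positions.values())
    slots = [""] * length
    for word, indices in positions.items():
        for index in indices:
            slots[index] = word
    return slots


# Hypothetical sample text.
original = "in the beginning the word".split()
concordance = build_positional_concordance(original)
rebuilt = reconstruct(concordance)

print(concordance)
print(" ".join(rebuilt))
```

A concordance that gave only word frequencies, without locations, could not be inverted this way.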
A complete concordance would index every word in a document. This includes all parts of speech, such as definite and indefinite articles, pronouns, and conjunctions. While such words are important in a full concordance, they are of little use for indexing files or for distinguishing amongst documents in a search.
The implementation of a web indexing strategy would include a mechanism by which specific words and/or entire parts of speech would be excluded from the index. This could be done using an explicit list or table of excluded words, or more ambitiously by using structural analysis to identify the parts of speech and exclude entire categories of words.
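The simpler of the two approaches, an explicit list of excluded words, can be sketched as follows. The short `STOP_WORDS` set and the function name `build_index` are assumptions chosen for the example; a practical exclusion list would be much longer and tuned to the collection being indexed.

```python
import re
from collections import Counter

# Hypothetical exclusion list: a few articles, conjunctions, pronouns
# and prepositions. A real list would be considerably longer.
STOP_WORDS = frozenset({
    "a", "an", "the", "and", "or", "but",
    "he", "she", "it", "they", "of", "in", "to",
})


def build_index(text: str, stop_words: frozenset = STOP_WORDS) -> Counter:
    """Count word occurrences, leaving out any word in `stop_words`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


# Hypothetical sample text.
index = build_index("The aardvark and the ant met in a burrow.")

for word, count in sorted(index.items()):
    print(f"{word}: {count}")
```

The more ambitious approach, excluding whole parts of speech by structural analysis, would need a part-of-speech tagger and is not attempted here.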
The aardvark case is but a single small example; consider the potential if every document were available in digital form and indexed in its entirety (not just a few key words). The comprehensiveness of a search for topical documents would be significantly improved, though some caution would be needed to filter out extraneous information and somehow prioritize the results.