
The Ferret Copy Detector: Finding short passages of similar texts in large document collections


Presentation Transcript


  1. The Ferret Copy Detector: Finding short passages of similar texts in large document collections Relevance to natural computing: • The system is based on processing short sequences. • It exploits sequencing characteristics of natural language. • It has a natural analogue in human sequencing processors. Caroline Lyon, University of Hertfordshire, c.m.lyon@herts.ac.uk

  2. Human sequencing functions Primitive sequencing processors in the sub-cortical basal ganglia part of the brain control motor functions, e.g. walking. These sub-cortical sequencing processors also contribute to cognitive processing, e.g. language, complementing cortical functions. Reference: Human Language and our Reptilian Brain, P. Lieberman, 2000

  3. Sequencing in human speech and language Sequential processing is necessary at many levels: • Phonetic • Syllabic • Lexical • Syntactic Phonetics: speakers must control a sequence of independent motor acts to produce speech sounds.

  4. Sequencing in speech and language (2) • Phonetic segments can only be combined in certain ways to produce phonemes and then syllables. • Different languages have different phonemic systems, but all have sequential constraints. • Syllables combine to make words, which combine to make phrases, which combine to make sentences. All have constraints.

  5. The need for sequential processing • Many of our most frequently used words are homophones: <to, too, two> <for, four> <here, hear> • True of other languages too. • This does not seem to impede communication. • Our primary method of disambiguation is through sequential processing of short strings of words: e.g. < I want to/too/two eggs > only has one interpretation.

  6. Alternative method of avoiding word ambiguity A recent mathematical model of human language asserts that there are unique mappings from sounds to meanings, and that absence of word ambiguity is a mark of evolutionary fitness. [Computational and evolutionary aspects of language, M. A. Nowak et al., Nature, June 2002, vol. 417, pp. 611-617; and other references] This is a logical suggestion, but it is not how human language works.

  7. Language models • Language can be modelled by a regular grammar – a linear sequence of symbols. • Chomsky showed that this is inadequate. • However, it has produced effective practical applications. Speech recognition systems are typically based on Markov models. • The Ferret is based on a model of simple linear sequences.

  8. Concepts underlying the Ferret (1) A text can be converted into a set of short sequences of adjacent words – bigrams, trigrams, etc. Example with trigrams: “A storm was forecast for today” becomes (a storm was) (storm was forecast) (was forecast for) (forecast for today)
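
A minimal sketch (in Python, not the Ferret's own code) of the conversion just described: the text is lower-cased, tokenised by a simple word pattern, and turned into a set of adjacent-word trigrams. The tokenisation rule is an illustrative assumption.

```python
import re

def trigrams(text):
    """Return the set of adjacent-word trigrams in a text.

    Tokenisation (lower-case words matched by a simple pattern) is an
    illustrative assumption, not the Ferret's own rule.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return set(zip(words, words[1:], words[2:]))

print(trigrams("A storm was forecast for today"))
# e.g. {('a', 'storm', 'was'), ('storm', 'was', 'forecast'),
#       ('was', 'forecast', 'for'), ('forecast', 'for', 'today')}
```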

  9. Concepts underlying the Ferret (2) To find similar passages in two documents, both texts are converted to sets of trigrams. Then the sets are compared for matches. Independently written texts have a sprinkling of matches. But copied passages (not necessarily identical) have a significant number of matches, above a threshold.
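
Under the same simplified tokenisation, the matching step is just set intersection: a sprinkling of shared trigrams is expected between independent texts, while copied passages push the count well above that background level.

```python
import re

def trigrams(text):
    # Same illustrative tokenisation as the previous sketch.
    words = re.findall(r"[a-z']+", text.lower())
    return set(zip(words, words[1:], words[2:]))

a = trigrams("A storm was forecast for today, with heavy rain expected.")
b = trigrams("Forecasters said a storm was forecast for today in the north.")

matches = a & b                      # trigrams that appear in both texts
print(len(matches), sorted(matches))
```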

  10. Zipfian distribution of words (or why this method works) A small number of words occur frequently, but most words occur rarely. This phenomenon is even more pronounced for bigrams and trigrams: the trigram profile of a text has a few frequent trigrams, but most are rare. References: Prediction and Entropy of Printed English, C. Shannon, 1951; many publications in the speech recognition literature.
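
The skew can be checked on any sizeable plain-text file by counting how often each trigram occurs; typically the bulk of distinct trigrams occur exactly once. The filename below is a placeholder.

```python
import re
from collections import Counter

# "sample.txt" is a placeholder; any reasonably long plain-text file will do.
with open("sample.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(zip(words, words[1:], words[2:]))
once = sum(1 for c in counts.values() if c == 1)
print(f"{len(counts)} distinct trigrams, "
      f"{once / len(counts):.0%} of them occur only once")
print("most frequent:", counts.most_common(5))
```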

  11. Statistics from the Wall Street Journal corpus (1) Source: Handbook of Standards and Resources for Spoken Language Systems, Gibbon et al., 1997

  12. Statistics from Wall Street Journal (WSJ) corpus (2) • WSJ is a narrow domain. • Topics are revisited. • On close dates, subjects may be very similar. Yet after over 38 million words have been analyzed, a new article will on average have 77% new trigrams.
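
The measurement behind a figure of this kind is straightforward; here is a sketch using placeholder filenames and counting distinct trigrams (the original study may count running trigrams, so the exact percentage would differ).

```python
import re

def trigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return set(zip(words, words[1:], words[2:]))

# Placeholder filenames: the text analysed so far, and a new article.
seen = trigrams(open("corpus_so_far.txt", encoding="utf-8").read())
new = trigrams(open("new_article.txt", encoding="utf-8").read())

fresh = new - seen      # trigrams never seen in the earlier material
print(f"{len(fresh) / len(new):.0%} of the new article's trigrams are new")
```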

  13. The Ferret and speech recognition systems “Sparse data” is a key problem in speech recognition: new input to a system typically contains a number of previously unseen trigrams. The Ferret exploits this problem: sparse data means that a text has characteristic features that do not appear in other texts unless passages are copied.

  14. Comparison metrics in the Ferret Set-theoretic measures are used to compare documents. Two texts of comparable length have a Resemblance, R. If NA and NB are the sets of trigrams in texts A and B, then R = |NA ∩ NB| / |NA ∪ NB|. There is a threshold for R, found empirically, above which texts are suspiciously similar.
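
A direct sketch of the measure as defined above; the threshold value here is only an illustrative placeholder, since the Ferret's own threshold is determined empirically.

```python
def resemblance(na, nb):
    """R = |NA ∩ NB| / |NA ∪ NB| for two sets of trigrams."""
    if not na and not nb:
        return 0.0
    return len(na & nb) / len(na | nb)

THRESHOLD = 0.04   # illustrative placeholder, not the Ferret's empirical value

def suspiciously_similar(na, nb):
    return resemblance(na, nb) > THRESHOLD
```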

  15. Benchmarking the Resemblance threshold Experiments were conducted on The Federalist Papers, a very well known set of essays written in support of the American Constitution. • 81 of the papers were used. • Two authors. • All are on related topics. The maximum Resemblance found between any two of these essays suggests an upper limit on similarity between independently written texts.

  16. The Ferret process To find similar passages in large document collections: 1. Documents are converted to .txt from Word (and, shortly, from .pdf). 2. Each text is converted to a set of trigrams; in this form, each text is compared with every other. 3. A table showing the Resemblance between each pair of texts is displayed in ranked order. The user can select any pair, display the two texts side by side, see the matching sections highlighted, and save the results if wanted.
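
A sketch of the overall process, under the same simplifying assumptions as the earlier snippets: read plain-text files from a (hypothetical) folder, convert each to a trigram set, compare every pair, and print a table ranked by Resemblance. Side-by-side display and highlighting are omitted.

```python
import re
from itertools import combinations
from pathlib import Path

def trigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return set(zip(words, words[1:], words[2:]))

def resemblance(na, nb):
    return len(na & nb) / len(na | nb) if (na or nb) else 0.0

# "texts" is a placeholder folder of already-converted .txt files.
docs = {p.name: trigrams(p.read_text(encoding="utf-8", errors="ignore"))
        for p in Path("texts").glob("*.txt")}

ranked = sorted(((resemblance(docs[a], docs[b]), a, b)
                 for a, b in combinations(docs, 2)), reverse=True)

for r, a, b in ranked[:20]:          # top of the ranked table
    print(f"{r:.3f}  {a}  {b}")
```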

  17. The Ferret as plagiarism detector for students’ work • Detects plagiarism or collusion in work from large cohorts of students. • Short sections of similar text can be identified, even with some insertions and deletions. • Documents from the web can be included in a semi-automatic process: the top 50 hits from a search are converted to .txt and added to the other texts. Reference: Experiments in Electronic Plagiarism Detection, C. Lyon et al., TR 388, Computer Science Dept., University of Hertfordshire, 2003

  18. Ferret demonstration Aim: to find whether there are similar passages in any two documents Data: • 320 texts of 10,000 words, taken from the Project Gutenberg site. Copying was simulated by pasting passages of 100 to 400 words from one text into another. • 100 texts of student work, 2000 – 5000 words each. • 34 documents from Dutch students. • Please bring other data to try.
