
The Case for Corpus Profiling




  1. The Case for Corpus Profiling Anne De Roeck (Udo Kruschwitz, Nick Webb, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite, Dawei Song) Centre for Research in Computing The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.

  2. Fact or Factoid: Hyperlinks • Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999, Hawking et al 1999).

  3. Fact or Factoid: Hyperlinks • Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999, Hawking et al 1999). • Hyperlinks do significantly improve recall and precision in narrow domains and Intranets (Chen et al 1999, Kruschwitz 2001).

  4. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991)

  5. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991) • Stemming improves performance for morphologically complex languages (Popovitch and Willett 1992)

  6. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991) • Stemming improves performance for morphologically complex languages (Popovitch and Willett 1992) • Stemming improves performance on short documents (Krovetz 1993)

  7. Fact or Factoid: Long or Short • Stemming improves performance on short documents (Krovetz 1993) • Short keyword-based queries behave differently from long structured queries (Fujii and Croft 1999) • Keyword-based retrieval works better on long texts (Jurafsky and Martin 2000)

  8. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset.

  9. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset. • Performance will vary with task, technique and language.

  10. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset. • Performance will vary with task, technique and language. • Datasets really are significantly different.

  11. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset. • Performance will vary with task, technique and language. • Datasets really are significantly different. • Vital Statistics • Sparseness

  12. Description

  13. Vital Stats

  14. Type to Token Ratios

  15. Type to Token Ratios

  16. Assumption • Successful (statistical?) techniques can be ported to other languages. • Western European languages • Japanese, Chinese, Malay, … • WordSmith: effective use requires a 5M-word corpus (Garside 2000)

  17. Type to Token ratio
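The type-to-token ratio used in these vital statistics can be computed in a few lines. Below is a minimal sketch, assuming lower-cased whitespace tokenisation; the tokeniser and the sample text are illustrative choices, not details taken from the slides.

```python
def type_token_ratio(text: str) -> float:
    """Number of distinct word types divided by the total number of tokens.

    Assumes simple lower-cased, whitespace tokenisation. A low ratio points
    to heavy repetition; a high ratio points to a varied (and possibly
    sparse) vocabulary. Because the ratio falls as a text grows, collections
    should be compared on equal-sized samples.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the rug"
print(type_token_ratio(sample))  # 8 types / 13 tokens ≈ 0.62
```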

  18. Cargo Cult Science? • Richard Feynman (1974)

  19. Cargo Cult Science? • Richard Feynman (1974) “It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.”

  20. Cargo Cult Science? • Richard Feynman (1974) “Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it.” “In summary, the idea is to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another.”

  21. Cargo Cult Science? • The role of data in the outcome of experiments should be clarified • Why? • How?

  22. Why explore role of data? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation

  23. Why explore role of data? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation • Epistemological: Theory induction • What is the relationship between data properties and technique performance?

  24. Why explore role of data? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation • Epistemological: Theory induction • What is the relationship between data properties and technique performance? • Practical: Application • What is relationship between two sets of data? • What is this dataset (language?) like?

  25. How explore role of data? • One way: Profiling for Bias • Assumption: Collection will be biased w.r.t. technique & task • Find measures that reflect bias • Verify effects experimentally

  26. How explore role of data? • Profile standard collections • Adds to past experiments • Profile new data • Gauge distance to known collections • Estimate effectiveness of techniques

  27. Why Profile for Bias? • And by the way, the others think it is vital. (Machine Learning, Data Mining, Pattern Matching etc.)

  28. Why Profile for Bias? • And by the way, the others think it is vital. (Machine Learning, Data Mining, Pattern Matching etc.) • And so did we! (or do we?)

  29. Profiling: An Abandoned Agenda? • Sparck-Jones (1973) “Collection properties influencing automatic term classification performance.” Information Storage and Retrieval, Vol. 9 • Sparck-Jones (1975) “A Performance Yardstick for Test Collections.” Journal of Documentation, 31:4

  30. What has changed? • Sparck-Jones (1973) • Is a collection useably classifiable? • Number of query terms which can be used for matching. • Is a collection usefully classifiable? • Number of useful, linked terms in document or collection • Is a collection classifiable? • Size of vocabulary and rate of incidence

  31. Profiling: An Abandoned Agenda • Term weighting formula tailored to query • Salton 1972 • Stop word identification relative to collection/query • Wilbur & Sirotkin 1992; Yang & Wilbur 1996 • Effect of collection homogeneity on language model quality • Rose & Haddock 1997

  32. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need

  33. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need • Sparseness is only one kind of bias

  34. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need • Sparseness is only one kind of bias • Better (ways of computing) measures?

  35. Profiling Measures • Requirements: measures should be • relevant to NLP techniques given task • fine grained • cheap to implement

  36. Profiling Measures • Requirements: measures should be • relevant to NLP techniques given task • fine grained • cheap to implement • Need to agree a framework • Fixed points: • Collections? • Properties? • Measures?

  37. Profiling Measures • Simple starting point: • Vital Statistics • Zipf (sparseness; idiosyncrasy) • Type to token ratio (sparseness, specialisation) • Manual sampling (quality; content) • Refine? • Homogeneity? • Burstiness? • (Words and Genre?)
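Of the measures on this slide, the Zipf profile is the simplest to automate. The sketch below fits a straight line to the log-rank/log-frequency curve; the least-squares fit over all ranks is an illustrative choice, not a method prescribed by the slides.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of the log-rank vs. log-frequency curve.

    Under Zipf's law the slope is close to -1; marked deviation is one
    cheap signal of sparseness or an idiosyncratic vocabulary. Assumes the
    token list contains at least two distinct word types.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```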

  38. Profiling Measures • Homogeneity (or how strong is evidence defeating homogeneity assumption) • Term Distribution Models (Words!) • Frequentist vs non-frequentist • Very frequent terms (!!)

  39. Very Frequent Terms • Lots of them • Reputedly “noise-like” (random? homogeneous?) • Present in most datasets (comparison) • Stop word identification relative to collection/query is independently relevant • Wilbur & Sirotkin 1992; Yang & Wilbur 1996

  40. Homogeneity • Homogeneity Assumption • Bag of Words • Function word distribution • Content word distribution • Measure of Heterogeneity as dataset profile • Kilgarriff & others 1992 onwards • Measure distance between corpora • Identify genre

  41. Heterogeneity Measures • χ² (Kilgarriff 1997; Rose & Haddock 1997) • G² (Rose & Haddock 1997; Rayson & Garside 2000) • Correlation, Mann-Whitney (Kilgarriff 1996) • Log-likelihood (Rayson & Garside 2000) • Spearman’s S (Rose & Haddock 1997) • Kullback-Leibler divergence (Cavaglia 2002)
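As a concrete instance of one of the measures listed above, here is a sketch of the Kullback-Leibler divergence between the unigram distributions of two corpus halves. The add-one smoothing over the joint vocabulary is an assumption made here to keep the divergence finite, not a detail from the slides.

```python
import math
from collections import Counter

def kl_divergence(tokens_a, tokens_b):
    """D(P_a || P_b) between the unigram distributions of two token lists.

    Uses add-one smoothing over the joint vocabulary (an assumption) so that
    terms missing from one half do not produce an infinite divergence.
    Higher values mean the two halves look less alike.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (counts_a[term] + 1) / total_a
        q = (counts_b[term] + 1) / total_b
        divergence += p * math.log(p / q)
    return divergence
```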

  42. Measuring Heterogeneity • Divide the corpus into random halves using 5000-word chunks • Build a frequency list for each half • Calculate χ² for the term frequency differences between the halves • Normalise for corpus length • Iterate over successive random halves
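A sketch of this chunk-and-split procedure is given below. The slides fix only the 5000-word chunk size; the number of random splits, the restriction to the most frequent terms and the exact normalisation are illustrative assumptions.

```python
import random
from collections import Counter

def chi_squared_heterogeneity(tokens, chunk_size=5000, splits=10, top_n=500):
    """Split the corpus into chunks, deal the chunks into two random halves,
    and score the term frequency differences between the halves with a
    chi-squared statistic, averaged over several random splits.

    Assumes the corpus yields at least two chunks. chunk_size comes from the
    slides; splits and top_n are illustrative parameters.
    """
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    scores = []
    for _ in range(splits):
        random.shuffle(chunks)
        mid = len(chunks) // 2
        half_a = Counter(t for c in chunks[:mid] for t in c)
        half_b = Counter(t for c in chunks[mid:] for t in c)
        n_a, n_b = sum(half_a.values()), sum(half_b.values())
        chi2 = 0.0
        # Focus on the most frequent terms overall, as the later slides suggest.
        for term, _ in (half_a + half_b).most_common(top_n):
            o_a, o_b = half_a[term], half_b[term]
            e_a = (o_a + o_b) * n_a / (n_a + n_b)  # expected count in half A
            e_b = (o_a + o_b) * n_b / (n_a + n_b)  # expected count in half B
            chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
        scores.append(chi2 / (n_a + n_b))          # normalise for corpus length
    return sum(scores) / len(scores)
```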

  43. Measuring Heterogeneity • Kilgarriff reports values of the χ² statistic • A high value indicates high heterogeneity • Finds high heterogeneity in all texts

  44. Defeating the Homogeneity Assumption • Assume word distribution is homogeneous (bag of words) • Explore chunk sizes • Chunk size 1 -> homogeneous (random) • Chunk size 5000 -> heterogeneous (Kilgarriff 1997) • χ² test (statistic + p-value) • Defeat the assumption at a level of statistical significance • Register differences between datasets • Focus on frequent terms (!)

  45. Homogeneity detection at a level of statistical significance • p-value: evidence for/against the homogeneity hypothesis • < 0.1 -- weak evidence against • < 0.05 -- significant (moderate evidence against) • < 0.01 -- strong evidence against • < 0.001 -- very strong evidence against • Indication of statistically significant non-homogeneity
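To attach a p-value like those above to the χ² statistic, the statistic is compared against a chi-squared distribution. A minimal sketch follows, assuming SciPy and a two-halves-by-N-terms table, i.e. N - 1 degrees of freedom.

```python
from scipy.stats import chi2

def homogeneity_p_value(chi2_stat: float, n_terms: int) -> float:
    """p-value for the null hypothesis that the two halves are drawn from
    the same term distribution.

    Uses the survival function of a chi-squared distribution with
    n_terms - 1 degrees of freedom (a 2 x n_terms contingency table).
    Small values (< 0.05, < 0.01, < 0.001) give increasingly strong
    evidence against homogeneity.
    """
    return chi2.sf(chi2_stat, n_terms - 1)
```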

  46. Dividing a Corpus • docDiv: place documents in random halves • term distribution across documents • halfdocDiv: place half documents in random halves • term distribution within the same document • chunkDiv: place chunks (between 1 and 5000 words) in random halves • term distribution between text chunks (genre?)
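A sketch of the three division strategies, assuming each document is a list of tokens. How halfdocDiv deals the document halves is not spelled out on the slide, so the random dealing used here is an assumption.

```python
import random

def doc_div(docs):
    """docDiv: deal whole documents into two random halves
    (probes term distribution across documents)."""
    docs = list(docs)
    random.shuffle(docs)
    mid = len(docs) // 2
    return docs[:mid], docs[mid:]

def half_doc_div(docs):
    """halfdocDiv: split every document in the middle, then deal the
    document halves into two random corpus halves (probes term
    distribution within the same document)."""
    halves = []
    for doc in docs:
        mid = len(doc) // 2
        halves.extend([doc[:mid], doc[mid:]])
    random.shuffle(halves)
    cut = len(halves) // 2
    return halves[:cut], halves[cut:]

def chunk_div(tokens, chunk_size):
    """chunkDiv: cut the token stream into fixed-size chunks (1 to 5000
    words on the slide) and deal the chunks into two random halves
    (probes term distribution between text chunks / genre)."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    random.shuffle(chunks)
    mid = len(chunks) // 2
    return chunks[:mid], chunks[mid:]
```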

  47. Results DocDiv

  48. Results HalfDocDiv

  49. Results ChunkDiv (5)
