
The Case for Corpus Profiling




  1. The Case for Corpus Profiling Anne De Roeck (Udo Kruschwitz, Nick Webb, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite, Dawei Song) Centre for Research in Computing The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.

  2. Fact or Factoid: Hyperlinks • Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999, Hawking et al 1999).

  3. Fact or Factoid: Hyperlinks • Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999, Hawking et al 1999). • Hyperlinks do significantly improve recall and precision in narrow domains and Intranets (Chen et al 1999, Kruschwitz 2001).

  4. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991)

  5. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991) • Stemming improves performance for morphologically complex languages (Popovitch and Willett 1992)

  6. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991) • Stemming improves performance for morphologically complex languages (Popovitch and Willett 1992) • Stemming improves performance on short documents (Krovetz 1993)

  7. Fact or Factoid: Long or Short • Stemming improves performance on short documents (Krovetz 1993) • Short keyword-based queries behave differently from long structured queries (Fujii and Croft 1999) • Keyword-based retrieval works better on long texts (Jurafsky and Martin 2000)

  8. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset.

  9. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset. • Performance will vary with task, technique and language.

  10. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset. • Performance will vary with task, technique and language. • Datasets really are significantly different.

  11. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset. • Performance will vary with task, technique and language. • Datasets really are significantly different. • Vital Statistics • Sparseness

  12. Description

  13. Vital Stats

  14. Type to Token Ratios

  15. Type to Token Ratios

  16. Assumption • Successful (statistical?) techniques can be ported to other languages. • Western European languages • Japanese, Chinese, Malay, … • WordSmith: effective use requires a 5M-word corpus (Garside 2000)

  17. Type to Token ratio
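The type-to-token ratio used in these vital statistics can be computed in a few lines. Below is a minimal sketch, assuming lower-cased whitespace tokenisation; the tokeniser and the sample text are illustrative choices, not details taken from the slides.

```python
def type_token_ratio(text: str) -> float:
    """Number of distinct word types divided by the total number of tokens.

    Assumes simple lower-cased, whitespace tokenisation. A low ratio points
    to heavy repetition; a high ratio points to a varied (and possibly
    sparse) vocabulary. Because the ratio falls as a text grows, collections
    should be compared on equal-sized samples.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the rug"
print(type_token_ratio(sample))  # 8 types / 13 tokens ≈ 0.62
```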

  18. Cargo Cult Science? • Richard Feynman (1974)

  19. Cargo Cult Science? • Richard Feynman (1974) “It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.”

  20. Cargo Cult Science? • Richard Feynman (1974) “Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it.” “In summary, the idea is to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another.”

  21. Cargo Cult Science? • The role of data in the outcome of experiments should be clarified • Why? • How?

  22. Why explore role of data? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation

  23. Why explore role of data? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation • Epistemological: Theory induction • What is the relationship between data properties and technique performance?

  24. Why explore role of data? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation • Epistemological: Theory induction • What is the relationship between data properties and technique performance? • Practical: Application • What is relationship between two sets of data? • What is this dataset (language?) like?

  25. How explore role of data? • One way: Profiling for Bias • Assumption: Collection will be biased w.r.t. technique & task • Find measures that reflect bias • Verify effects experimentally

  26. How explore role of data? • Profile standard collections • Adds to past experiments • Profile new data • Gauge distance to known collections • Estimate effectiveness of techniques

  27. Why Profile for Bias? • And by the way, the others think it is vital. (Machine Learning, Data Mining, Pattern Matching etc.)

  28. Why Profile for Bias? • And by the way, the others think it is vital. (Machine Learning, Data Mining, Pattern Matching etc.) • And so did we! (or do we?)

  29. Profiling: An Abandoned Agenda? • Sparck-Jones (1973) “Collection properties influencing automatic term classification performance.” Information Storage and Retrieval, Vol. 9 • Sparck-Jones (1975) “A Performance Yardstick for Test Collections.” Journal of Documentation, 31:4

  30. What has changed? • Sparck-Jones (1973) • Is a collection useably classifiable? • Number of query terms which can be used for matching. • Is a collection usefully classifiable? • Number of useful, linked terms in document or collection • Is a collection classifiable? • Size of vocabulary and rate of incidence

  31. Profiling: An Abandoned Agenda • Term weighting formula tailored to query • Salton 1972 • Stop word identification relative to collection/query • Wilbur & Sirotkin 1992; Yang & Wilbur 1996 • Effect of collection homogeneity on language model quality • Rose & Haddock 1997

  32. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need

  33. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need • Sparseness is only one kind of bias

  34. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need • Sparseness is only one kind of bias • Better (ways of computing) measures?

  35. Profiling Measures • Requirements: measures should be • relevant to NLP techniques given task • fine grained • cheap to implement

  36. Profiling Measures • Requirements: measures should be • relevant to NLP techniques given task • fine grained • cheap to implement • Need to agree a framework • Fixed points: • Collections? • Properties? • Measures?

  37. Profiling Measures • Simple starting point: • Vital Statistics • Zipf (sparseness; idiosyncrasy) • Type to token ratio (sparseness, specialisation) • Manual sampling (quality; content) • Refine? • Homogeneity? • Burstiness? • (Words and Genre?)
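Of the measures on this slide, the Zipf profile is the simplest to automate. The sketch below fits a straight line to the log-rank/log-frequency curve; the least-squares fit over all ranks is an illustrative choice, not a method prescribed by the slides.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of the log-rank vs. log-frequency curve.

    Under Zipf's law the slope is close to -1; marked deviation is one
    cheap signal of sparseness or an idiosyncratic vocabulary. Assumes the
    token list contains at least two distinct word types.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```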

  38. Profiling Measures • Homogeneity (or how strong is evidence defeating homogeneity assumption) • Term Distribution Models (Words!) • Frequentist vs non-frequentist • Very frequent terms (!!)

  39. Very Frequent Terms • Lots of them • Reputedly “noise-like” (random? homogeneous?) • Present in most datasets (comparison) • Stop word identification relative to collection/query is independently relevant • Wilbur & Sirotkin 1992; Yang & Wilbur 1996

  40. Homogeneity • Homogeneity Assumption • Bag of Words • Function word distribution • Content word distribution • Measure of Heterogeneity as dataset profile • Kilgarriff & others 1992 onwards • Measure distance between corpora • Identify genre

  41. Heterogeneity Measures • χ² (Kilgarriff 1997; Rose & Haddock 1997) • G² (Rose & Haddock 1997; Rayson & Garside 2000) • Correlation, Mann-Whitney (Kilgarriff 1996) • Log-likelihood (Rayson & Garside 2000) • Spearman’s S (Rose & Haddock 1997) • Kullback-Leibler divergence (Cavaglia 2002)
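As a concrete instance of one of the measures listed above, here is a sketch of the Kullback-Leibler divergence between the unigram distributions of two corpus halves. The add-one smoothing over the joint vocabulary is an assumption made here to keep the divergence finite, not a detail from the slides.

```python
import math
from collections import Counter

def kl_divergence(tokens_a, tokens_b):
    """D(P_a || P_b) between the unigram distributions of two token lists.

    Uses add-one smoothing over the joint vocabulary (an assumption) so that
    terms missing from one half do not produce an infinite divergence.
    Higher values mean the two halves look less alike.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (counts_a[term] + 1) / total_a
        q = (counts_b[term] + 1) / total_b
        divergence += p * math.log(p / q)
    return divergence
```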

  42. Measuring Heterogeneity • Divide the corpus into random halves using 5000-word chunks • Build a frequency list for each half • Calculate χ² for the term frequency differences between the halves • Normalise for corpus length • Iterate over successive random halves
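A sketch of this chunk-and-split procedure is given below. The slides fix only the 5000-word chunk size; the number of random splits, the restriction to the most frequent terms and the exact normalisation are illustrative assumptions.

```python
import random
from collections import Counter

def chi_squared_heterogeneity(tokens, chunk_size=5000, splits=10, top_n=500):
    """Split the corpus into chunks, deal the chunks into two random halves,
    and score the term frequency differences between the halves with a
    chi-squared statistic, averaged over several random splits.

    Assumes the corpus yields at least two chunks. chunk_size comes from the
    slides; splits and top_n are illustrative parameters.
    """
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    scores = []
    for _ in range(splits):
        random.shuffle(chunks)
        mid = len(chunks) // 2
        half_a = Counter(t for c in chunks[:mid] for t in c)
        half_b = Counter(t for c in chunks[mid:] for t in c)
        n_a, n_b = sum(half_a.values()), sum(half_b.values())
        chi2 = 0.0
        # Focus on the most frequent terms overall, as the later slides suggest.
        for term, _ in (half_a + half_b).most_common(top_n):
            o_a, o_b = half_a[term], half_b[term]
            e_a = (o_a + o_b) * n_a / (n_a + n_b)  # expected count in half A
            e_b = (o_a + o_b) * n_b / (n_a + n_b)  # expected count in half B
            chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
        scores.append(chi2 / (n_a + n_b))          # normalise for corpus length
    return sum(scores) / len(scores)
```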

  43. Measuring Heterogeneity • Kilgarriff reports values of the χ² statistic • A high value indicates high heterogeneity • Finds high heterogeneity in all texts

  44. Defeating the Homogeneity Assumption • Assume word distribution is homogeneous (bag of words) • Explore chunk sizes • Chunk size 1 -> homogeneous (random) • Chunk size 5000 -> heterogeneous (Kilgarriff 1997) • χ² test (statistic + p-value) • Defeat the assumption at a level of statistical significance • Register differences between datasets • Focus on frequent terms (!)

  45. Homogeneity detection at a level of statistical significance • p-value: evidence for/against the homogeneity hypothesis • < 0.1 -- weak evidence against • < 0.05 -- significant (moderate evidence against) • < 0.01 -- strong evidence against • < 0.001 -- very strong evidence against • Indication of statistically significant non-homogeneity
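To attach a p-value like those above to the χ² statistic, the statistic is compared against a chi-squared distribution. A minimal sketch follows, assuming SciPy and a two-halves-by-N-terms table, i.e. N - 1 degrees of freedom.

```python
from scipy.stats import chi2

def homogeneity_p_value(chi2_stat: float, n_terms: int) -> float:
    """p-value for the null hypothesis that the two halves are drawn from
    the same term distribution.

    Uses the survival function of a chi-squared distribution with
    n_terms - 1 degrees of freedom (a 2 x n_terms contingency table).
    Small values (< 0.05, < 0.01, < 0.001) give increasingly strong
    evidence against homogeneity.
    """
    return chi2.sf(chi2_stat, n_terms - 1)
```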

  46. Dividing a Corpus • docDiv: place documents in random halves • term distribution across documents • halfdocDiv: place half documents in random halves • term distribution within the same document • chunkDiv: place chunks (between 1 and 5000 words) in random halves • term distribution between text chunks (genre?)
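A sketch of the three division strategies, assuming each document is a list of tokens. How halfdocDiv deals the document halves is not spelled out on the slide, so the random dealing used here is an assumption.

```python
import random

def doc_div(docs):
    """docDiv: deal whole documents into two random halves
    (probes term distribution across documents)."""
    docs = list(docs)
    random.shuffle(docs)
    mid = len(docs) // 2
    return docs[:mid], docs[mid:]

def half_doc_div(docs):
    """halfdocDiv: split every document in the middle, then deal the
    document halves into two random corpus halves (probes term
    distribution within the same document)."""
    halves = []
    for doc in docs:
        mid = len(doc) // 2
        halves.extend([doc[:mid], doc[mid:]])
    random.shuffle(halves)
    cut = len(halves) // 2
    return halves[:cut], halves[cut:]

def chunk_div(tokens, chunk_size):
    """chunkDiv: cut the token stream into fixed-size chunks (1 to 5000
    words on the slide) and deal the chunks into two random halves
    (probes term distribution between text chunks / genre)."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    random.shuffle(chunks)
    mid = len(chunks) // 2
    return chunks[:mid], chunks[mid:]
```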

  47. Results DocDiv

  48. Results HalfDocDiv

  49. Results ChunkDiv (5)
