
Dataset Profiling


Presentation Transcript


  1. Dataset Profiling Anne De Roeck, Udo Kruschwitz, Nick Webb, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite Centre for Research in Computing The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.

  2. Fact or Factoid: Hyperlinks • Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999, Hawking et al 1999).

  3. Fact or Factoid: Hyperlinks • Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999, Hawking et al 1999). • Hyperlinks do significantly improve recall and precision in narrow domains and Intranets (Chen et al 1999, Kruschwitz 2001).

  4. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991)

  5. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991) • Stemming improves performance for morphologically complex languages (Popovic and Willett 1992)

  6. Fact or Factoid: Stemming • Stemming does not improve effectiveness of retrieval (Harman 1991) • Stemming improves performance for morphologically complex languages (Popovic and Willett 1992) • Stemming improves performance on short documents (Krovetz 1993)

  7. Fact or Factoid: Long or Short. • Stemming improves performance on short documents (Krovetz 1993) • Short keyword-based queries behave differently from long structured queries (Fujii and Croft 1999) • Keyword-based retrieval works better on long texts (Jurafsky and Martin 2000)

  8. Assumption • Successful (statistical?) techniques can be successfully ported to other languages. • Western European languages • Japanese, Chinese, Malay, …

  9. Assumption • Successful (statistical?) techniques can be successfully ported to other languages. • Western European languages • Japanese, Chinese, Malay, … • WordSmith: Effective use requires 5M word corpus (Garside 2000)

  10. Type to Token Ratio

  11. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset.

  12. Fact • Performance of IR and NLP techniques depends on the characteristics of the dataset. • Performance will vary with task, technique and language

  13. Cargo Cult Science? • Richard Feynman (1974)

  14. Cargo Cult Science? • Richard Feynman (1974) “It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.”

  15. Cargo Cult Science? • Richard Feynman (1974) “Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it.” “In summary, the idea is to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another.”

  16. Cargo Cult Science? • The role of data in the outcome of experiments must be clarified • Why? • How?

  17. Why Profile Datasets? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation

  18. Why Profile Datasets? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation • Epistemological: Theory induction • What is the relationship between dataset properties and technique performance?

  19. Why Profile Datasets? • Methodological: Replicability • Barbu and Mitkov (2001) – Anaphora resolution • Donaway et al (2000) – Automatic Summarisation • Epistemological: Theory induction • What is the relationship between dataset properties and application performance? • Practical: Application • What is relationship between two datasets? • What is this dataset (language?) like?

  20. Why Profile Datasets? • And by the way, the others think it is vital. (Machine Learning, Data Mining, Pattern Matching etc.)

  21. Why Profile Datasets? • And by the way, the others think it is vital. (Machine Learning, Data Mining, Pattern Matching etc.) • And so did we! (or do we?)

  22. Profiling: An Abandoned Agenda? • Sparck-Jones (1973) “Collection properties influencing automatic term classification performance.” Information Storage and Retrieval. Vol 9 • Sparck-Jones (1975) “A Performance Yardstick for Test Collections.” Journal of Documentation. 31:4

  23. Profiling: An Abandoned Agenda • Term weighting formula tailored to query • Salton 1972 • Stop word identification relative to collection/query • Wilbur & Sirotkin 1992; Yang & Wilbur 1996 • Effect of collection homogeneity on language model quality • Rose & Haddock 1997

  24. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need

  25. What has changed? • Proliferation of (test) collections • More data per collection • Increased application need • Better (ways of computing) measures?

  26. What has changed? • Sparck-Jones (1973) • Is a collection useably classifiable? • Number of query terms which can be used for matching. • Is a collection usefully classifiable? • Number of useful, linked terms in document or collection • Is a collection classifiable? • Size of vocabulary and rate of incidence

  27. Profiling Measures • Requirements: measures should be • relevant to NLP techniques • fine grained • cheap to implement

  28. Profiling Measures • Requirements: measures should be • relevant to NLP techniques • fine grained • cheap to implement(!) • Simple starting point: • Vital Statistics

  29. Description

  30. Vital Stats

  31. Profiling Measures • Requirements: measures should be • relevant to NLP techniques • fine grained • cheap to implement(!) • Simple starting point: • Vital Statistics • Zipf (sparseness; idiosyncrasy)

  32. Zipf Curve - Bengali CIIL corpus
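The Zipf curve on the slide above plots log rank against log frequency for the terms of a corpus. A minimal sketch of how such a curve can be computed, assuming a plain-text corpus and a simple regular-expression tokeniser (both are illustrative choices, not the authors' exact procedure):

```python
from collections import Counter
import math
import re

def zipf_points(text):
    """Rank/frequency pairs on a log-log scale for a word-tokenised text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return [(math.log(rank), math.log(freq))
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]

# Illustrative usage (the filename is a placeholder):
# with open("corpus.txt", encoding="utf-8") as handle:
#     points = zipf_points(handle.read())
```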

  33. Profiling Measures • Requirements: measures should be • relevant to NLP techniques • fine grained • cheap to implement(!) • Simple starting point: • Vital Statistics • Zipf (sparseness; idiosyncrasy) • Type to token ratio (sparseness, specialisation)

  34. Type to Token Ratios

  35. Type to Token Ratios
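The two slides above report type-to-token ratios for the corpora under study. A minimal sketch of the underlying computation; the tokeniser and the fixed-window variant are illustrative assumptions rather than the authors' exact procedure:

```python
import re

def type_token_ratio(text):
    """Distinct word forms (types) divided by running words (tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_ttr_over_windows(text, window=1000):
    """Average TTR over fixed-size windows, which makes corpora of different
    lengths easier to compare (the window size is an arbitrary choice)."""
    tokens = re.findall(r"\w+", text.lower())
    windows = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    ratios = [len(set(w)) / len(w) for w in windows if w]
    return sum(ratios) / len(ratios) if ratios else 0.0
```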

  36. Profiling Measures • Requirements: measures should be • relevant to NLP techniques • fine grained • cheap to implement(!) • Simple starting point: • Vital Statistics • Zipf (sparseness; idiosyncrasy) • Type to token ratio (sparseness, specialisation) • Manual sampling (quality; content)

  37. Profiling by Measuring Heterogeneity • Homogeneity Assumption • Bag of Words • Function word distribution • Content word distribution • Measure of Heterogeneity as dataset profile • Measure distance between corpora • Identify genre

  38. Heterogeneity Measures • χ² (Kilgarriff 1997; Rose & Haddock 1997) • G² (Rose & Haddock 1997; Rayson & Garside 2000) • Correlation, Mann-Whitney (Kilgarriff 1996) • Log-likelihood (Rayson & Garside 2000) • Spearman’s S (Rose & Haddock 1997) • Kullback-Leibler divergence (Cavaglia 2002)
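Two of the measures listed above, the χ² statistic and Kullback-Leibler divergence, can be computed directly from a pair of word-frequency lists. A minimal sketch, assuming each corpus half has already been reduced to a {term: count} dictionary and both halves are non-empty (the smoothing constant in the KL function is an illustrative choice):

```python
import math

def chi_squared(freq_a, freq_b):
    """Pearson chi-squared over the combined vocabulary of two frequency lists."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for term in set(freq_a) | set(freq_b):
        observed_a = freq_a.get(term, 0)
        observed_b = freq_b.get(term, 0)
        expected_a = (observed_a + observed_b) * total_a / grand_total
        expected_b = (observed_a + observed_b) * total_b / grand_total
        statistic += (observed_a - expected_a) ** 2 / expected_a
        statistic += (observed_b - expected_b) ** 2 / expected_b
    return statistic

def kl_divergence(freq_a, freq_b, smoothing=1e-9):
    """D(A || B) between the relative-frequency distributions of the two halves."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    divergence = 0.0
    for term, count in freq_a.items():
        p = count / total_a
        q = freq_b.get(term, 0) / total_b + smoothing
        divergence += p * math.log(p / q)
    return divergence
```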

  39. Kilgarriff’s Methodology • Divide the corpus into 5000-word chunks and place them in random halves • Frequency list for each half • Calculate χ² for term frequency distribution differences between the halves • Normalise for corpus length • Iterate over successive random halves
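A minimal sketch of the procedure described on this slide, reusing the chi_squared function from the previous sketch and assuming the corpus is already a flat list of word tokens; the chunk size, iteration count and length normalisation shown here are illustrative settings rather than Kilgarriff's exact ones:

```python
import random
from collections import Counter

def heterogeneity_score(tokens, chunk_size=5000, iterations=10):
    """Mean chi-squared between random halves, normalised by corpus length."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    scores = []
    for _ in range(iterations):
        shuffled = chunks[:]
        random.shuffle(shuffled)
        midpoint = len(shuffled) // 2
        freq_a = Counter(token for chunk in shuffled[:midpoint] for token in chunk)
        freq_b = Counter(token for chunk in shuffled[midpoint:] for token in chunk)
        scores.append(chi_squared(freq_a, freq_b) / len(tokens))  # see sketch above
    return sum(scores) / len(scores)
```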

  40. Kilgarriff’s Findings • Registers values of the χ² statistic • High value indicates high heterogeneity • Finds high heterogeneity in all texts

  41. Defeating the Homogeneity Assumption • Assume word distribution is homogeneous (random) • Kilgarriff’s methodology • Explore chunk sizes • Chunk size 1 -> homogeneous (random) • Chunk size 5000 -> heterogeneous (Kilgarriff 1997) • χ² test (statistic + p-value) • Defeat the assumption at a level of statistical significance • Focus on frequent terms (!)

  42. Homogeneity detection at a level of statistical significance • p-value: evidence for/against the hypothesis • < 0.1 -- weak evidence against • < 0.05 -- significant (moderate evidence against the hypothesis) • < 0.01 -- strong evidence against • < 0.001 -- very strong evidence against • Indication of statistically significant non-homogeneity
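To attach a p-value to the χ² statistic, it is compared against a χ² distribution; for a 2 x V table of term counts in the two halves, the degrees of freedom are V - 1. A minimal sketch using scipy (an assumption: the slides do not say which software was used):

```python
from scipy.stats import chi2

def homogeneity_p_value(statistic, vocabulary_size):
    """Probability of a statistic this large if the two halves were homogeneous."""
    degrees_of_freedom = vocabulary_size - 1
    return chi2.sf(statistic, degrees_of_freedom)

# Interpretation as on the slide: p < 0.05 is moderate evidence against
# homogeneity, p < 0.001 is very strong evidence against it.
```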

  43. Frequent Term Distribution • Lots of them • Reputedly “noise-like” (random?) • Present in most datasets (comparison) • Cheap to model

  44. Dividing a Corpus • docDiv: place whole documents in random halves • term distribution across documents • halfdocDiv: place half-documents in random halves • term distribution within the same document • chunkDiv: place chunks (between 1 and 5000 words) in random halves • term distribution between text chunks (genre?)
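A minimal sketch of the three division strategies, assuming the corpus is a list of documents and each document is a list of tokens. The function names mirror the slide labels, but the splitting details are one reading of the slide, not the authors' exact code:

```python
import random

def doc_div(documents):
    """docDiv: whole documents assigned at random to two halves."""
    shuffled = documents[:]
    random.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

def halfdoc_div(documents):
    """halfdocDiv: each document is cut in two, then the half-documents
    are assigned at random to two halves."""
    half_documents = []
    for doc in documents:
        midpoint = len(doc) // 2
        half_documents.extend([doc[:midpoint], doc[midpoint:]])
    return doc_div(half_documents)

def chunk_div(documents, chunk_size=100):
    """chunkDiv: fixed-size chunks from the concatenated corpus, randomly split."""
    tokens = [token for doc in documents for token in doc]
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return doc_div(chunks)
```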

  45. Results DocDiv

  46. Results HalfDocDiv

  47. Results ChunkDiv (5)

  48. Results: ChunkDiv (100)
