
InfoMagnets : Making Sense of Corpus Data


Presentation Transcript


  1. InfoMagnets: Making Sense of Corpus Data Jaime Arguello Language Technologies Institute

  2. Outline • InfoMagnets • Applications • Topic Segmentation • Conclusions • Q/A

  4. Defining Exploratory Corpus Analysis • Getting a “sense” of your data • How does it relate to: • Information retrieval • Need to understand the whole corpus • Data mining • Need rich interface to support serendipitous search • Text classification • Need to find the “interesting” classes

  5. InfoMagnets

  6. InfoMagnets Applications • Behavioral Research • 2 publishable results (submitted to CHI) • CycleTalk Project, LTI • New findings on mechanisms at work in guided exploratory learning • Robert Kraut’s Netscan Group, HCII • Conversational Interfaces • Corpus organization makes authoring conversational agents less intimidating. Rose, Pai, & Arguello (2005); Gweon et al. (2005)

  8. Authoring Conversational Interfaces • Goal: Make authoring CIs easier • Solution: Guide development with pre-processed sample human-human conversations • Addresses several issues: • Accessible to non-computational linguists • Developers ≠ domain experts • Consistent with user-centered design: “The user is not like me!”

  9. Authoring Conversational Interfaces • Constructing a master template [Figure: transcribed human-human conversations are topic-segmented into units A, B, C, which are then assembled into the master template]

  10. Topic Segmentation • A preprocessing step for InfoMagnets • But an important computational linguistics problem in its own right! • Previous work: • Marti Hearst’s TextTiling (1994) • Beeferman, Berger, and Lafferty (1997) • Barzilay and Lee (2004), NAACL best paper award! • … • But should it all fall under “topic segmentation”?

  11. Topic Segmentation of Dialogue • Dialogue is Different: • Very little training data • Linguistic Phenomena • Ellipsis • Telegraphic Content • Coherence is organized around a shared task, not primarily around a single flow of information

  12. Coherence Defined Over Shared Task • Lots of places where there is no overlap in “meaningful” content

  13. Coherence Defined Over Shared Task • Multiple topic shifts in regions w/ zero lexical cohesion

  14. Experimental Condition • 22 student-tutor pairs • Conversation captured through a mainstream chat client • Thermodynamics domain • Training and test data coded by one coder • Results reported in terms of P_k (Beeferman, Berger, & Lafferty, 1999) • Significance tests: two-tailed t-tests
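
The P_k metric used here can be sketched as follows. This is a generic implementation of the standard definition, not the authors' evaluation code; the representation (one segment label per utterance) and the choice of k (half the mean reference segment length, as is conventional) are assumptions.

```python
def p_k(reference, hypothesis, k=None):
    """P_k segmentation error: the probability that two positions k
    apart are wrongly judged to be in the same / different segments.
    reference, hypothesis: lists of per-utterance segment labels."""
    n = len(reference)
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, round(n / (len(set(reference)) * 2)))
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_ref != same_hyp
    return errors / (n - k)
```

Lower is better: a perfect segmentation scores 0, and the degenerate baselines mentioned on later slides (e.g. "no boundaries" or "boundary everywhere") set the bar a sensible system must beat.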

  15. 1st Attempt: TextTiling • TextTiling (Hearst, 1997) • Slide two adjacent “windows” w1 and w2 down the text • At each step, calculate the cosine correlation between the windows • Use the correlation values to calculate a “depth” score • Depth values higher than a threshold correspond to topic shifts
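
A minimal sketch of the TextTiling idea described above. This is simplified relative to Hearst's algorithm (raw unigram counts instead of token-sequence blocks, no smoothing of the similarity curve); the window size and depth threshold are illustrative.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def text_tiling(sentences, w=2, threshold=0.1):
    """Slide two adjacent w-sentence windows over the text, score each
    gap by cosine similarity, and place boundaries at deep valleys."""
    tokens = [s.lower().split() for s in sentences]
    sims = []
    for gap in range(w, len(sentences) - w + 1):
        left = Counter(t for s in tokens[gap - w:gap] for t in s)
        right = Counter(t for s in tokens[gap:gap + w] for t in s)
        sims.append(cosine(left, right))
    # depth score: how far a valley sits below its neighboring peaks
    boundaries = []
    for i in range(1, len(sims) - 1):
        lpeak = max(sims[:i + 1])
        rpeak = max(sims[i:])
        depth = (lpeak - sims[i]) + (rpeak - sims[i])
        if depth > threshold:
            boundaries.append(i + w)  # index of first sentence after the gap
    return boundaries
```

The next slides explain why this lexical-cohesion assumption breaks down on tutoring dialogue: many gaps score exactly zero, so the depth curve carries no signal where the boundaries actually are.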

  16. TextTiling Results • Trend for TextTiling to perform worse than degenerate baselines • Difference not statistically significant • Why doesn’t it work?

  17. TextTiling Results • Lots of gaps where the correlation = 0 • Boundaries must be selected heuristically • Even then, only a heuristic improvement on the original

  18. TextTiling Results • But, topic shifts tend NOT to occur where corr > 0.

  19. 2nd Attempt: Barzilay and Lee (2004) • Cluster utterances • Treat each cluster as a “state” • Construct an HMM • Emission probabilities: state-specific language models • Transition probabilities: based on the location and cluster membership of the utterances • Viterbi re-estimation until convergence
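
The decoding step of this approach can be sketched as below. This is a much-simplified single pass, assuming the utterance clusters are already given: one unigram language model per state with add-alpha smoothing, and a uniform stay/switch transition model. The real method estimates position-dependent transitions and iterates re-clustering and Viterbi re-estimation to convergence; all names and parameters here are illustrative.

```python
import math
from collections import Counter

def viterbi_segment(utterances, clusters, alpha=0.1, stay=0.9):
    """One HMM decoding pass: states = clusters, emissions = smoothed
    state-specific unigram LMs. Assumes at least two states."""
    states = sorted(set(clusters))
    vocab = {t for u in utterances for t in u.split()}
    counts = {s: Counter() for s in states}
    for u, c in zip(utterances, clusters):
        counts[c].update(u.split())

    def log_emit(s, u):
        total = sum(counts[s].values()) + alpha * len(vocab)
        return sum(math.log((counts[s][t] + alpha) / total) for t in u.split())

    def log_trans(a, b):
        return math.log(stay if a == b else (1 - stay) / (len(states) - 1))

    # Viterbi: best log-probability of each state at each utterance
    V = [{s: log_emit(s, utterances[0]) for s in states}]
    back = []
    for u in utterances[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda a: V[-1][a] + log_trans(a, s))
            row[s] = V[-1][prev] + log_trans(prev, s) + log_emit(s, u)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # backtrace the best state sequence
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Topic boundaries then fall wherever the decoded state sequence changes label; the high `stay` probability is what smooths over the too-fine-grained boundaries the next slide complains about.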

  20. B&L Results • B&L statistically better than TT, but not better than degenerate algorithms

  21. B&L Results • Topic boundaries too fine-grained • Most clusters based on “fixed expressions” (e.g., “ok”, “yeah”, “sure”) • Remember: cohesion is based on the shared task • Are the state-based language models sufficiently different?

  22. Incorporating Dialogue Dynamics • Dialogue act coding scheme • Not originally developed for segmentation, but for discourse analysis of human-tutor dialogues • 4 main dimensions: • Action: open question, closed question, negation, etc. • Depth: (yes/no) is the utterance accompanied by an explanation or elaboration? • Focus: (binary) is the focus on the speaker or the other agent? • Control: Initiation, Response, Feedback • Dialogue exchange (Sinclair and Coulthard, 1975)

  23. 3rd Attempt: Cross-Dimensional Learning • (Donmez, 2004) • Use estimated labels on some dimensions to learn the other dimensions • 3 types of features: • Text (discourse cues) • Lexical coherence (binary) • Dialogue act labels • 10-fold cross-validation • Topic boundaries learned from estimated labels, not hand-coded ones!
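
The per-utterance feature construction might look something like this sketch. The cue-word list, feature names, and dialogue-act label values are hypothetical stand-ins for the actual coding scheme, not the original implementation; the point is only how the three feature types combine into one vector fed to the boundary classifier.

```python
def boundary_features(utterance, prev_utterance, dialogue_acts):
    """Build one feature dict for topic-boundary classification:
    discourse-cue text features, a binary lexical-coherence flag,
    and (estimated) dialogue-act labels from the four dimensions.
    Cue words and keys are illustrative assumptions."""
    cues = {"ok", "so", "now", "well", "alright"}  # hypothetical cue list
    tokens = set(utterance.lower().split())
    prev = set(prev_utterance.lower().split())
    return {
        # text feature: does the utterance open with a discourse cue?
        "has_cue": bool(tokens & cues),
        # lexical coherence: any content shared with the previous utterance?
        "coherent": bool(tokens & prev),
        # estimated labels from the other coding dimensions
        "action": dialogue_acts.get("action"),
        "depth": dialogue_acts.get("depth"),
        "focus": dialogue_acts.get("focus"),
        "control": dialogue_acts.get("control"),
    }
```

Crucially, as the slide notes, the dialogue-act values plugged in here are themselves classifier estimates rather than hand-coded gold labels, which is what makes the pipeline practical on new data.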

  24. X-Dimensional Learning Results • X-DIM statistically better than TT and degenerate algorithms!

  25. Statistically Significant Improvement

  26. Future Directions • Merge cross-dimensional learning (with dialogue act features) with the B&L content-modeling HMM approach • Explore other work in topic segmentation of dialogue

  27. Recap • InfoMagnets and applications • Corpus exploration and authoring of CIs • Challenges of topic segmentation of dialogue • TextTiling, Barzilay & Lee, and X-DIM compared against degenerate methods and each other

  28. Q/A Thank you!
