Explore strategies for web search summarization using annotated data and pie chart visualizations. Develop skills in summarization systems with SIDE environment. Learn about web organization and information foraging behavior.
Summarization and Personal Information Management Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
Announcements • Questions? • Homework 2 due! • Only 2 people turned it in so far… • Homework 3 assigned! • Due before Spring Break • Plan for Today • Summarizing Web Searches • Pirolli chapter
Homework 3 • Start with AnnotatedData.csv • Load it into SIDE • Train a model to predict Summary class • Create a summary to select the top 5 instances of the POS class and return that as the summary • Take a screen shot of the generated summary • Create a Pie chart of the Summary class • Take a screen shot • Turn in your two screen shots
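The selection step in Homework 3 can be sketched outside of SIDE as follows. This is only an illustration, not SIDE's API: the `Scored` record, its `confidence` field, and the function names are hypothetical, and the sketch assumes the trained model gives each instance a predicted Summary class plus a confidence score.

```python
from collections import Counter
from typing import List, NamedTuple


class Scored(NamedTuple):
    """One classified instance (hypothetical structure, not SIDE's)."""
    text: str
    label: str         # predicted Summary class, e.g. "POS" or "NEG"
    confidence: float  # assumed: model confidence in the predicted label


def top_k_summary(instances: List[Scored], target: str = "POS", k: int = 5) -> List[str]:
    """Return the texts of the k most confident instances of the target class."""
    positives = [inst for inst in instances if inst.label == target]
    ranked = sorted(positives, key=lambda inst: inst.confidence, reverse=True)
    return [inst.text for inst in ranked[:k]]


def class_counts(instances: List[Scored]) -> Counter:
    """Count instances per predicted class (the numbers behind a pie chart)."""
    return Counter(inst.label for inst in instances)


if __name__ == "__main__":
    # Made-up rows for illustration; real rows would come from AnnotatedData.csv.
    demo = [
        Scored("sentence one", "POS", 0.91),
        Scored("sentence two", "NEG", 0.88),
        Scored("sentence three", "POS", 0.64),
    ]
    print(top_k_summary(demo))
    print(class_counts(demo))
```

Ranking by confidence is one possible choice of ranking metric; SIDE lets you plug in your own.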
SIDE • Download from http://www.cs.cmu.edu/~cprose/SIDE.html • Download the code and the documentation • Note they are not in the same file! • Video lecture: • http://fatman-vm.isri.cmu.edu/CourseCast/Viewer/Default.aspx?id=add722f7-b566-479a-a564-495f56474747 • Email elijah@cmu.edu if you have trouble • Data file is at http://www.cs.cmu.edu/~cprose/AnnotatedData.csv
Goals • SIDE is a development environment for building summarization systems • Design inspired by Teufel and Moens • Architecture allows you to add your own plugins: • Feature extractors • Classification algorithms • Ranking metrics • Presentation algorithms • Visualizations
Important note!! • Documentation gives instructions for using the error analysis interface • You’re not required to use it, but it would be helpful!!
Summarizing Web Search or Summarization in the Process of Web Search
Data Set Description • 2,000 users, each assigned 1 of 10 search tasks • We have a log of all of their click behavior • URLs for every page they looked at • A write-up summary of what they found • 20 “gold standard” users • 3 different people did each of the 10 tasks • Gold standard click behavior • Gold standard write-ups • Recently annotated pages as relevant or not relevant – ask Naman Gupta for this (nkgupta@andrew.cmu.edu)
Information Foraging • Note that you can find a more extensive journal article linked from Peter’s homepage! • Also a psychological approach • Examines both the landscape (the WWW) and a cognitive model of human foraging behavior • Lab studies with tasks that are like real information seeking tasks • Based on “critical incident analysis”
“The results illustrate how the structure of the Web environment and the goals and heuristics of human information foragers mutually shape foraging behavior.” • But what accounts for the Web’s structure? • What are “decentralized social evolutionary processes”?
Web Organization • A hierarchy of patches • Link structure mirrors directory structure as well as similarity structure • Probably also mirrors organizational structure • Also reflects the role in scientific discourse • Hubs have many outbound links (e.g., review articles) • Authorities have many inbound links (e.g., important, seminal work) • Related to “page rank” measure used by search engines
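The hub and authority idea can be made concrete with a small iterative computation. The slides do not name an algorithm; the sketch below assumes the hub/authority iteration usually called HITS, and the tiny link graph and page names are invented for illustration.

```python
from typing import Dict, List, Tuple


def hits(links: Dict[str, List[str]], iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Compute hub and authority scores for a directed link graph.

    links maps each page to the pages it links to (outbound links).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # A page's hub score is the sum of the authority scores of pages it links to.
        new_hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Hypothetical graph: a review article citing three papers,
    # two of which also cite the seminal one.
    graph = {
        "review": ["seminal", "paper_b", "paper_c"],
        "paper_b": ["seminal"],
        "paper_c": ["seminal"],
    }
    hub, auth = hits(graph)
    print("top hub:", max(hub, key=hub.get))
    print("top authority:", max(auth, key=auth.get))
```

In this toy graph the review article, with the most outbound links to well-cited pages, should come out as the strongest hub, and the paper with the most inbound links as the strongest authority.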
Critical Incident Technique ** How do you think this approach might bias the results?
Surprise? • “What is surprising about these results is that the Web is mainly aimed at helping users find specific pieces of information (e.g., through search engines), and this suggests a latent demand for tools to support these broader sense-making activities.” ** Do you agree with this statement? Why or why not?
Student Quote • Exposure to extremely similar information is something that may be indicative of convergence on an optimal solution for comparative searching, but is likely something that will create frustration or falsely signal the terminus of useful information availability on the topic.
Is there such a thing as a successful search strategy in the abstract? • The point I'm trying to make is that search activities #1 and #2 are very different, and the strategies applied to one are likely to generate unsatisfactory outcomes in the other.
Student Quote • The www is indeed structural in the sense as discussed in the chapter. It has 'information patches' and 'hubs' and 'authorities' which are exploited by search engines to refine their search results. This structure of the web can help in summarization tasks since foraging is one of the most important aspects of summarization. The more relevant results one can get, the more relevant will be the summarization.
Student Quote • The correlation between the link structure and the topical similarity of the web pages have also been discussed which is quite fascinating.
Student Quote • The results from information scent …I think thats an actual problem with user face in using search engines where a person just drills down into a website and is presented with information patches with low information scent. This problem can be averted by giving the user a glimpse of information patches (summary snippet) along with the main page result which is presented to the user. This would help alleviate the problem of going through all the low information bearing pages in the same website.