SIMS 296a-3: Current Topics in Information Access Marti Hearst Fall ’98
Today • Introductions • Goals and Course Requirements • Administrivia • Topics • What is Information Access • Current Topics (an outline) • Intro to IA
Goals • Become an expert on the state of the art in timely topics related to information access • Begin getting research results.
Course Requirements • To get S/U credit for the class • Lead two discussions • Do the readings • Attend the meetings
Course Requirements • To get a grade in the class • Do the above • Do one of the following (optionally with the help of a faculty member and/or another student): • Write a publishable survey paper on an emerging area of information access. • Do research that should lead to a publishable research paper on a new idea, method, analysis, or vision statement for an emerging area of information access. • Implement and/or evaluate code to further an information access research project.
Administrivia • Sign-up sheet • Readings • Other questions?
Outline • What is Information Access? • Goals, Tasks, Types of data • Standard Information Retrieval • Assumptions, Techniques, Evaluation • Current Topics • Candidate topics
What is Information Access? • Information Access: • The process by which users use information technology to seek, organize, and understand information. • Focus: information expressed as text.
Information Retrieval • Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries. • This set of assumptions underlies the field of Information Retrieval.
Information Retrieval Assumptions • The system has available only pre-existing, “canned” text passages. • Its response is limited to selecting from these passages and presenting them to the user. • It must select, say, 10 or 20 passages out of millions or billions!
Top 10 Research Issues for IR: What do people want from IR? • By Bruce Croft, D-Lib Magazine, Nov 95 • Based on observations from work on public-domain systems, including: • THOMAS • American Memory Project (Library of Congress) • The order of importance does not correspond to many IR researchers’ priorities. • The same can be said for AI researchers.
Top 10 Research Issues for IR • Bruce Croft, D-Lib Magazine, Nov 95. In descending order of importance. • Integrated Solutions • Distributed IR • Efficient, Flexible Indexing and Retrieval • “Magic” (Effective Vocabulary Expansion) • Interfaces and Browsing • Routing and Filtering • Effective Retrieval • Multimedia Retrieval • Information Extraction • Relevance Feedback
Other Issues • Mundane issues are important • Spelling Correction • Fast display of initial results • Less important but more interesting from many researchers’ points of view: (Bruce Croft, D-Lib Magazine, Nov 95) • Multilingual IR • Data Mining (in text databases) • Text Categorization
Matching Tasks, Collections, and Search Systems • Typical WWW search is not the whole picture. • Different information needs require: • different collections • different search systems and strategies • Compare: • general WWW • newswire and magazines • medical journal articles
Match Task and Search Type • WWW Tasks: (from www.cnet.com/Content/Reviews/Compare/Seach/ss1a.html) • Find how-to pages for Doom. • Purchase plane tickets and hotel for a trip to Java. • Find the top five all-time scoring leaders in the National Hockey League. • Find a recipe for potato latkes. • Find the tide tables for Maui. • Characteristics: • Timely and specific; before the WWW, found with help from human agents and in well-known resources.
Match Task and Search Type • Newswire & Magazine Tasks: (from the TREC collection) • Find articles on research into cures for osteoporosis. • Find articles on the effects of recycling of tires on the environment. • Find information on jail and prison overcrowding and how inmates are forced to cope with those conditions. • Find discussion of an existing or proposed insurance plan (governmental, commercial or individual) and the coverage it provides for long term care confinements in an institution. • Characteristics: • Complex combinations of topics. • Research-oriented • Either timely or retrospective
Match Task and Search Type • MEDLINE Tasks: (from OHSUMED, medir.ohsu.edu/pub/ohsumed) • Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy? • Pathophysiology and treatment of disseminated intravascular coagulation. • Reviews on subdurals in the elderly. • Effectiveness of etidronate in treating hypercalcemia of malignancy. • Characteristics: • Research-oriented • Technical • Cause and Effect, Implications
The Problem of Information Access • Main problem: • Computers can’t understand natural language. • Therefore: • Information access systems must guide users to information of interest by approximate methods. • General common methods: • word match • topic directories
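As a rough illustration of the "word match" method named on this slide, the Python sketch below ranks a few passages by how many distinct query words they contain and keeps the best ones. The tokenizer, the `overlap_score` and `top_k` helpers, and the sample passages are all assumptions made for illustration; they are not taken from the course material, and real systems use weighting schemes well beyond raw overlap.

```python
import heapq
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(query, passage):
    """Count the distinct query words that also occur in the passage."""
    return len(set(tokenize(query)) & set(tokenize(passage)))


def top_k(query, passages, k=2):
    """Return up to k passages with a nonzero score, best first."""
    scored = [(overlap_score(query, p), p) for p in passages]
    matches = [(s, p) for s, p in scored if s > 0]
    best = heapq.nlargest(k, matches, key=lambda pair: pair[0])
    return [p for _, p in best]


# Hypothetical passages, loosely based on examples elsewhere in these slides.
PASSAGES = [
    "Pathfinder photographed Mars.",
    "A recipe for potato latkes.",
    "Tide tables for Maui.",
    "The Pathfinder forded the river.",
]

if __name__ == "__main__":
    for passage in top_k("pathfinder mars photograph", PASSAGES):
        print(passage)
```

Note that the query word "photograph" does not match "photographed" under exact word matching, which is one small example of why such methods are only approximate.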
Why Text is Tough • Abstract concepts are difficult to represent (AI-Complete) • “Countless” combinations of subtle, abstract relationships among concepts • Many ways to represent similar concepts: space ship, flying saucer, UFO, figment of imagination • Concepts are difficult to visualize • High dimensionality: tens or hundreds of thousands of features
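Two of the points on this slide can be made concrete with a small Python sketch: phrases for similar concepts may share no words at all, and every distinct word becomes its own feature. The bag-of-words representation used here is an assumption chosen for illustration (the slide does not name a representation), and the helper names are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word in the text to its number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def shared_words(a, b):
    """Return the set of words that two texts have in common."""
    return set(bag_of_words(a)) & set(bag_of_words(b))


def vocabulary(texts):
    """Collect the distinct words across texts; each one is a feature."""
    vocab = set()
    for text in texts:
        vocab.update(bag_of_words(text))
    return vocab


# Synonym examples from the slide; treating each phrase as a tiny
# document is an assumption made for this sketch.
PHRASES = ["space ship", "flying saucer", "UFO"]

if __name__ == "__main__":
    print(shared_words("space ship", "flying saucer"))
    print(len(vocabulary(PHRASES)))
```

Three short phrases already produce five features with no overlap between them; a real collection with a full vocabulary grows toward the tens or hundreds of thousands of features the slide mentions.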
Why Text is Tough • I saw Pathfinder on Mars with a telescope. • Pathfinder photographed Mars. • The Pathfinder photograph mars our perception of a lifeless planet. • The Pathfinder photograph from Ford has arrived. • The Pathfinder forded the river without marring its paint job.
Outline • What is Information Access? • Goals, Tasks, Types of data • Standard Information Retrieval • Assumptions, Techniques, Evaluation • Current Topics • Candidate topics • User Interfaces • Quality Assessment • Text Data Mining • Student suggestions
Tools for Information Access • User Interfaces (information visualization) • Information Access (information retrieval) • Language and Task Analysis • Content Analysis
Current Topics • User Interfaces • Incorporating “personal” information • Automated “Agents” vs. User Initiated Steps • Support for the dynamic process of information access • How to organize large search results • Categories, clusters, combinations of these • Question Answering • Others?
Current Topics • Quality Assessment • Issues: • How to define quality • Rating methods • Different fields (medicine, business) • Techniques • Visitation patterns and times • “Social” techniques • Link structure (co-citation patterns) • Link structure + content
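To make the "link structure (co-citation patterns)" technique a little more concrete, here is a minimal Python sketch that counts how often two pages are linked to by the same source page. Reading co-citation as "number of sources that link to both targets" is an assumption about what the slide intends, and the link graph and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(links):
    """Count, for each pair of targets, how many sources link to both."""
    counts = Counter()
    for targets in links.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts


# Hypothetical link graph: each key is a page, each value lists the
# pages it links to.
LINKS = {
    "p1": ["a", "b"],
    "p2": ["a", "b", "c"],
    "p3": ["b", "c"],
}

if __name__ == "__main__":
    for pair, n in cocitation_counts(LINKS).most_common():
        print(pair, n)
```

Pairs with high counts are candidates for being related or of comparable standing; combining such counts with page content corresponds to the "link structure + content" item on the slide.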
Current Topics • Text Data Mining • Visualizing the contents of large text collections • Automatically discovering associations within text collections • Discovering useful patterns • Spotting anomalies • *Finding chains of associated information • *I have a proposal for this
Current Topics • Cognitive modeling/AI techniques • Your idea goes here:
For Next Time • Do background reading • Think about which topics to pursue • I will present more background information