Next generation search

Next generation search • Marc Krellenstein, VP, Search and Discovery, Elsevier • August 23, 2004 • m.krellenstein@elsevier.com

Basic search is pretty good • Modern search engines are fast and scalable • Having the data (usually lots) is still key • Can interpret keyword, Boolean and pseudo-natural language queries • Ex: “how to make an international call” • Spell checking, thesauri and stemming to improve recall (and sometimes precision) • Recall = % of relevant documents found • Precision = % of returned documents that are relevant • Get lots of hits, but that’s usually OK if there are good ones on top
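The recall and precision definitions above can be made concrete with a short sketch. The function name and the document IDs are made up for illustration; this is not code from the talk.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: IDs the search engine gave back
    relevant: IDs a judge considers relevant (hypothetical ground truth)
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant ones exist, two overlap.
p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here half of what was returned is relevant (precision 0.50) and two of the three relevant documents were found (recall about 0.67).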

Basic search is pretty good • Best practice relevancy ranking is good: • Term frequency (TF): more hits count more • Inverse document frequency (IDF): hits of rarer search terms count more • Ex: diabetes diagnosis and treatment • Hits of search terms near each other count more • Ex: penicillin allergy vs. “penicillin allergy” • Hits on metadata (title, subject, etc.) count more • Use anchor text – referring text – as metadata • Items with more links/references to them count more • Authoritative links/referrers count yet more • Many other factors: length, date, etc.
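A minimal sketch of the TF and IDF factors only, assuming whitespace tokenization and the plain log(N/df) form of IDF. Real engines combine this with the proximity, metadata and link factors listed above; the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of TF * IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:
                idf = math.log(n_docs / df[term])  # rarer term -> larger IDF
                score += tf[term] * idf            # more hits -> larger TF
        scores.append(score)
    return scores


docs = [
    "diagnosis and treatment of diabetes",
    "history and treatment of fractures",
    "travel and leisure",
]
scores = tf_idf_scores("diabetes diagnosis and treatment", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

Because "and" occurs in every document, its IDF is zero and it contributes nothing, while the rarer terms "diabetes" and "diagnosis" dominate the ranking.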

Basic search is pretty good • Using these techniques search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics • But challenges remain…

Current challenges • Integrated search: Content still in silos • Silos getting bigger but there are still dozens • Finding the best (not just good) documents • Answering hard questions • Hard to match multiple criteria • Find an experimental method like this one • Hard to get answers to complex questions • Patient X with pre-existing conditions Y presents with Z…what information is relevant? • Summary, discovery and analysis • Summarize, uncover relationships, analyze • Long-term: understand any question…

The integration challenge • Two approaches: • Build bigger databases • Sometimes the easiest way… • …but can be difficult or impossible to secure appropriate rights and consolidate data • Distributed search: Search separately managed (or owned) large databases as if they are one • Technically more challenging, but a scalable and maintainable architecture

Distributed search • Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search • Use common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database • Search engine provides parallel search, integrated ranking and integrated results • The separate databases can be maintained and updated separately • Elsevier is currently unifying its own sources in such a model with a ‘web service’ architecture • Such services can also be offered externally

Distributed search • Simplifies some business issues, but still requires common technology platform • Where common platform not possible, can use federated search (i.e., metasearch) • Translate queries • Access and perform parallel search of multiple search engines (vs. multiple databases) • Integrate results as best as possible • Use standards to approximate distributed search • Uniform access, one query language (Z39.50, updated) • Add standards for relevancy ranking and results return? • NISO and its members are working on standards
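The "integrate results as best as possible" step is hard because each engine scores on its own scale. The sketch below shows one simple heuristic (min-max normalization per engine, then merge and de-duplicate). It is an assumption for illustration, not a method named in the talk or in any standard, and the engine outputs are invented.

```python
def normalize(results):
    """Min-max scale one engine's raw scores into [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat them all as top-ranked.
    return [(doc, (score - lo) / span if span else 1.0) for doc, score in results]


def merge(result_lists):
    """Merge per-engine result lists into one ranked list.

    A document returned by several engines keeps its best normalized score.
    Ties are broken by document ID so the order is deterministic.
    """
    best = {}
    for results in result_lists:
        for doc, score in normalize(results):
            best[doc] = max(score, best.get(doc, 0.0))
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical output of two engines with incompatible score scales.
engine_a = [("a1", 12.0), ("shared", 9.0), ("a2", 3.0)]
engine_b = [("shared", 0.9), ("b1", 0.5), ("b2", 0.1)]

merged = merge([engine_a, engine_b])
for doc, score in merged:
    print(f"{score:.2f}  {doc}")
```

Normalization of this kind only approximates a true integrated ranking, which is why a single distributed engine, or a standard for returning relevancy information, is preferable where possible.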

Finding the best • More data can also make finding the best documents harder • For searches on rare items, more data is a win • For all other searches, it’s more likely your answer is in there…but can be a problem too • Why? Relevancy is good but… • Relevancy has its limits • “I need information on depression” • “Ok…here are 2,352 articles and 87 books” • Need a dialog… “what kind of depression?” … “psychological” … “what about it?” • Underlying problem: most searches are under-specified

One solution: clustering documents • Group results around common themes: same author, web site, journal, subject… • Blurt out largest/most interesting categories: the inarticulate librarian model • Depression → psychology, economics, meteorology, antiques… • Psychology → treatment of depression, depression symptoms, seasonal affective… • Psychology → Kocsis, J. (10), Berg, R. (8), … • Themes could come from static metadata or dynamically by analysis of results text • Static: fixed, clear categories and assignments • Dynamic: doesn’t require metadata/taxonomy
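The static-metadata variant of clustering amounts to counting result hits per metadata value and letting the user drill down. A minimal sketch, assuming each result carries hypothetical "subject" and "author" fields; the records are placeholders, not real search results.

```python
from collections import Counter


def facet_counts(results, field):
    """Count hits per value of one metadata field, most frequent first."""
    counts = Counter(r[field] for r in results if field in r)
    return counts.most_common()


# Placeholder result list for a search on "depression".
results = [
    {"title": "t1", "subject": "psychology", "author": "Author A"},
    {"title": "t2", "subject": "psychology", "author": "Author B"},
    {"title": "t3", "subject": "economics", "author": "Author C"},
    {"title": "t4", "subject": "psychology", "author": "Author A"},
    {"title": "t5", "subject": "meteorology", "author": "Author D"},
]

# First level: blurt out the largest categories.
by_subject = facet_counts(results, "subject")
print(by_subject)

# Second level: the user picks "psychology", then sees authors within it.
psychology = [r for r in results if r["subject"] == "psychology"]
by_author = facet_counts(psychology, "author")
print(by_author)
```

Dynamic clustering would replace the metadata lookup with themes derived from the result text itself, which this sketch does not attempt.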

Clustering benefits • Disambiguates and refines search results to get to documents of interest quickly • Can navigate long result lists hierarchically • Would never offer thousands of choices to choose from as input… • Access to bottom of list…maybe just less common • Discovery – new aspects or sources • Can narrow results *after* search • Start with the broadest area search – don’t narrow by subject or other categories first • Easier, plus can’t guess wrong, miss useful, or pick unneeded, categories…results-driven • Knee surgery → cartilage replacement, plastics, …

Answering hard questions • Main problem is still short searches/under-specification • One solution: Relevance feedback – marking good and bad results • A long-standing and proven search refinement technique • More information is better than less • Pseudo-relevancy feedback is a research standard • Most commercial forms not widely used… • …but PubMed is an exception • A catch: Must first find a good document to be similar to…may be hard or impossible
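One classic way to turn "good" and "bad" marks into a refined query is Rocchio-style reweighting, sketched below. The talk does not name a specific algorithm, so Rocchio is swapped in here as an illustration; the weights alpha, beta and gamma are conventional textbook values, and the sample texts are invented.

```python
from collections import Counter


def centroid(docs):
    """Average term counts over a set of documents."""
    if not docs:
        return {}
    total = Counter()
    for doc in docs:
        total.update(doc.lower().split())
    return {term: count / len(docs) for term, count in total.items()}


def rocchio(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query: pull toward good docs, push away from bad."""
    q = Counter(query.lower().split())
    good, bad = centroid(good_docs), centroid(bad_docs)
    new_query = {}
    for term in set(q) | set(good) | set(bad):
        weight = (alpha * q.get(term, 0)
                  + beta * good.get(term, 0.0)
                  - gamma * bad.get(term, 0.0))
        if weight > 0:  # drop terms that end up with no positive weight
            new_query[term] = weight
    return new_query


new_q = rocchio(
    query="depression",
    good_docs=["seasonal depression treatment", "depression treatment outcomes"],
    bad_docs=["economic depression history"],
)
for term, weight in sorted(new_q.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{weight:.3f}  {term}")
```

The short query "depression" gains "treatment", "seasonal" and "outcomes" from the documents marked good, while "economic" and "history" are excluded. Pseudo-relevance feedback applies the same update but simply assumes the top-ranked results are good.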

One solution: descriptive search • Let the user or situation provide the ideal “document” – a full problem description – as input in the first place • Can enter free text or specific documents describing the need, e.g., an article, grant proposal or experiment description • Might draw on user or query context – user characteristics (MD or nurse), patient record,… • Use thesauri, domain knowledge and limited natural language processing to identify must-haves • Main focus, pre-existing conditions, etc. • Should provide the best possible search short of real language understanding

Summarize, discover & analyze • How do you summarize a corpus? • May want to report on what’s present, numbers of occurrences, trends, etc. • Ex: What diseases are studied the most? • Must know all diseases and look one by one • How do you find a relationship if you don’t know what relationships exist? • Ex: Does gene p53 relate to any disease? • Must check for each possible relationship • Ad hoc analysis • How do all genes relate to this one disease? Over time? What organisms has the gene been studied in? Show me the document evidence

One solution: text mining • Identify entities (things) in a text corpus • Examples: authors, universities… diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants… • Use lexicons, patterns, NLP for finding any or all instances of the entity (including new ones) • Identify relationships: • Through co-occurrence • Relationship presumed from proximity • Example: author-university affiliation • Through limited natural language processing • Semantic relations – causes, is-part-of, etc. • Examples: drug-causes-disease, drug-treats-disease • Identify appropriate verbs, recognize active vs. passive voice, resolve anaphora (…it causes…)
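The lexicon plus co-occurrence approach can be sketched as follows: find lexicon entries in each sentence and presume a relationship whenever two entity types share a sentence. The entity names and abstracts are placeholders, not data or output from the Elsevier pilot, and a real system would add patterns and NLP to catch new entities and semantic relations.

```python
import re
from collections import Counter
from itertools import product


def find_entities(sentence, lexicon):
    """Return the lexicon entries that occur as whole words in the sentence."""
    text = sentence.lower()
    return {
        name for name in lexicon
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", text)
    }


def cooccurrences(abstracts, lexicon_a, lexicon_b):
    """Count (a, b) pairs that appear together in the same sentence."""
    counts = Counter()
    for abstract in abstracts:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            found_a = find_entities(sentence, lexicon_a)
            found_b = find_entities(sentence, lexicon_b)
            counts.update(product(sorted(found_a), sorted(found_b)))
    return counts


# Placeholder lexicons and abstracts (hypothetical names throughout).
genes = ["GENE1", "GENE2"]
diseases = ["disease alpha", "disease beta"]
abstracts = [
    "GENE1 is overexpressed in disease alpha. GENE2 was not measured.",
    "We studied disease alpha and disease beta. "
    "GENE1 expression in disease alpha was elevated.",
]

counts = cooccurrences(abstracts, genes, diseases)
print(counts.most_common())
```

Co-occurrence only presumes a relationship from proximity; it cannot tell "causes" from "treats", which is where the limited natural language processing listed above comes in.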

Elsevier pilot project • Goal: Demonstrate real value to a working expert in 90 days • Chose biomedical domain • Hired expert to help define entities and relationships • Used 25,000 abstracts from 23 Elsevier journals • Worked with text mining vendor to define and revise extraction of entities and relationships

Pilot scenarios • Answered real questions using real data – not a demo or mock-up • The user: • anyone involved in genomic academic research: a primary researcher, graduate student or post-doc • Scenario 1: Research about gene p53 • What journals should I publish in? • Who’s an expert I can ask for advice? • What connections have been made to my gene? • What organisms have my gene?

What journals should I publish in?

Who’s an expert?

Connections to p53?

To organisms?

Pilot scenarios • Scenario 2: Disease research • What diseases are most researched? • What’s the time trend in HIV research? • What are the centers of HIV research? • Who are the author teams in HIV? • What gene-disease relationships are there? What were they at the start, in 1996? Through 1997? • (Note: Cannot practically answer the above with search alone)

What diseases are most researched?

Time trend in HIV research?

Centers of HIV research?

Author teams in HIV research?

Gene-disease relationships?

To start, in 1996?

Through 1997?

Pilot scenarios • Scenario 3: Connections between leukemia and Alzheimer’s • Are there direct connections between leukemia and Alzheimer’s? • What enzymatic activity is associated with leukemia? • Are there indirect connections between leukemia and Alzheimer’s mediated by enzymatic activity?
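Once disease-enzyme relationships have been extracted, the indirect-connection question reduces to a set intersection: which enzymes are linked to both diseases? A minimal sketch with placeholder enzyme names; the pairs below are invented and do not reflect the pilot's findings or any biomedical fact.

```python
from collections import defaultdict


def indirect_links(pairs, source, target):
    """Return the intermediates linked to both source and target."""
    neighbors = defaultdict(set)
    for disease, enzyme in pairs:
        neighbors[disease].add(enzyme)
    return sorted(neighbors[source] & neighbors[target])


# Hypothetical extracted (disease, enzyme) relationships.
pairs = [
    ("leukemia", "enzyme-1"),
    ("leukemia", "enzyme-2"),
    ("Alzheimer's", "enzyme-2"),
    ("Alzheimer's", "enzyme-3"),
]

shared = indirect_links(pairs, "leukemia", "Alzheimer's")
print(shared)
```

A direct connection would be a pair linking the two diseases themselves; the indirect ones surface only after relationships have been extracted across the whole corpus.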

Direct connections between leukemia and Alzheimer’s?

Enzymes associated with leukemia?

Indirect links from leukemia to Alzheimer’s via enzymes

Legend: Red – Product • Pink – Reactant • Green – Reagent • Brown – Solvent …

The power of text mining • Almost impossible to determine manually • Can provide completely unexpected relationships between source and target • Catch: must do the work domain by domain • Silver lining: can build on preceding work

Long-term: answer any question • Must recognize multiple (any) entities and relationships • Must recognize all forms of linguistic relationship • Must have background of common sense information (or enough entities/relations?) • Information on donors (to political parties) • For now, building text miners, domain by domain, is perhaps the best we can do • Can build on preceding pieces…e.g., if you know drugs, diseases and drug-disease causation, can try to recognize ‘advancements in drug therapy’

Summary • Need to search more broadly, more easily • Larger databases • Distributed search • Need to locate best documents in even larger (distributed) databases • Clustering to find documents of real interest • Need to answer complex questions • Descriptive search • Need to go beyond search for overviews, relationship discovery and analysis • Text-based data mining • Through text mining (perhaps), approach full natural language understanding
