INFM 700: Session 9 Search (Part II) Search Engines in Information Architecture

Paul Jacobs
The iSchool, University of Maryland
Wednesday, Apr. 18, 2012
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Today’s Topics
• Very short recap
  • Fundamentals of information retrieval
  • Search engines in practice (web search and web sites)
• Issues and tricks
  • Stemming/word issues
  • Query formulation/expansion/assistance
  • Tagging/structuring
  • Others
• Deploying search – what we get to do, and how

Vector Space Model
[Figure: document vectors d1 to d5 in a space with term axes t1, t2, t3; θ and φ mark angles between vectors]
Assumption: Documents that are “close together” in vector space “talk about” the same things
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
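
One common way to measure “closeness” in the vector space is the cosine of the angle between two vectors. The sketch below assumes cosine similarity is the measure (the slide only says “closeness”), and the weights for the query and for d1 to d3 are made-up values on three term axes.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights on the term axes (t1, t2, t3)
query = [1.0, 0.0, 1.0]
docs = {
    "d1": [2.0, 0.0, 2.0],  # same direction as the query
    "d2": [0.0, 3.0, 0.0],  # shares no terms with the query
    "d3": [1.0, 1.0, 0.0],  # shares one term
}

# Rank documents from closest to farthest
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

Note that d1 ranks first even though it is "longer" than the query: cosine compares direction, not length, which is why it is a popular choice when documents vary in size.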

Term Weighting
• Term weights consist of two components
  • Local: how important is the term in this doc?
  • Global: how important is the term in the collection?
• Here’s the intuition:
  • Terms that appear often in a document should get high weights
  • Terms that appear in many documents should get low weights
• How do we capture this mathematically?
  • Term frequency (local)
  • Inverse document frequency (global)

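TF.IDF Term Weighting
$$ w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i} $$
$w_{i,j}$: weight assigned to term i in document j
$\mathrm{tf}_{i,j}$: number of occurrences of term i in document j
$N$: number of documents in the entire collection
$n_i$: number of documents with term i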
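
A minimal sketch of TF.IDF weighting, taken here in its usual textbook form: term frequency multiplied by the log of (collection size / number of documents containing the term). The natural log and the three toy documents are assumptions for illustration; real engines use many variants of this formula.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # local component: occurrences in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical, already-tokenized collection
docs = [
    ["search", "engine", "search"],
    ["search", "log"],
    ["facet", "browsing"],
]
weights = tf_idf_weights(docs)
print(weights[0])
```

In the first document, "search" occurs twice but also appears in two of the three documents, so it ends up with a lower weight than "engine", which occurs once but nowhere else.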

Summary thus far…
• Represent documents (and queries) as “bags of words” (terms)
• Derive term weights based on frequency
• Use weighted term vectors for each document, query
• Compute a vector-based similarity score
• Display sorted, ranked results

Issues and Tricks
• What’s a word/term?
  • We can ignore words (“stop words”), combine (phrases), split up (“stem”) words
  • Other special treatment (e.g. names, categories)
• Query formulation/suggestion
• Type of information need
• Popularity
  • Based on link analysis/PageRank
  • Based on click through, other
• Structuring and tagging (e.g., “best bets”)
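
The “what’s a word/term?” question can be made concrete with a small sketch that drops stop words and combines a known phrase into a single index term. The stop-word list, the phrase table, and the function name `index_terms` are all hypothetical; production engines ship much larger, configurable lists.

```python
import re

# Hypothetical, deliberately tiny lists for illustration
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}
PHRASES = {("information", "architecture"): "information_architecture"}

def index_terms(text):
    """Lowercase, tokenize, merge known two-word phrases, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            terms.append(PHRASES[pair])  # combine: treat the phrase as one term
            i += 2
        elif tokens[i] in STOP_WORDS:
            i += 1  # ignore: stop word
        else:
            terms.append(tokens[i])
            i += 1
    return terms

terms = index_terms("The role of search in Information Architecture")
print(terms)
```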

Issues and Tricks (cont’d)
• Thesaurus/query expansion
  • Based on meaning, conceptual relationships
  • Based on decomposition/type
• User feedback/“More like this”
• Clustering/grouping of results
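
Thesaurus-based query expansion can be sketched as a simple lookup that adds related terms to the user’s query. The thesaurus entries and the `expand_query` helper are hypothetical; a real deployment would draw on a curated or domain-specific thesaurus and would usually weight the added terms lower than the originals.

```python
# Hypothetical thesaurus: term -> related terms
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "laptop": ["notebook"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus thesaurus entries, in order, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query("used car prices")
print(expanded)
```

Expansion tends to raise recall at some cost in precision, so it is worth checking against real queries before switching it on.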

Morphological Variation
• Handling morphology: related concepts have different forms
  • Inflectional morphology: same part of speech
      dogs = dog + PLURAL
      broke = break + PAST
  • Derivational morphology: different parts of speech
      destruction = destroy + ion
      researcher = research + er
• Different morphological processes:
  • Prefixing
  • Suffixing
  • Infixing
  • Reduplication

Stemming
• Dealing with morphological variation: index stems instead of words
  • Stem: a word equivalence class that preserves the central concept
• How much to stem?
  • organization → organize → organ?
  • resubmission → resubmit/submission → submit?
  • reconstructionism?
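
The “how much to stem?” trade-off shows up even in a toy suffix stripper. The sketch below is not the Porter stemmer or any published algorithm; the suffix list and the `min_stem_len` parameter are assumptions chosen only to illustrate over- and under-stemming.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first
SUFFIXES = ["ization", "ation", "ions", "ion", "ing", "ers", "er", "es", "s"]

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix that leaves at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["dogs", "researcher", "organization", "organs", "broke"]}
print(stems)
```

With these settings "organization" and "organs" collapse to the same stem (over-stemming), while the irregular form "broke" is never linked to "break" (under-stemming). Raising `min_stem_len` makes the stemmer more conservative.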

Does Stemming Work?
• Generally, yes! (in English)
  • Helps more for longer queries, fewer results
  • Lots of work done in this area
• But used very sparingly in web search – why?
Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
And others…

Beyond Words…
• Stemming/tokenization = specific instance of a general problem: what is it?
• Other units of indexing
  • Concepts (e.g., from WordNet)
  • Named entities
  • Relations
  • …

Some Observations
• Search engine fundamentals are very similar
• There are many tricks, differences beyond the basic model
• Differences appear differently, and are magnified as we get to sites, specific applications
• So, as we get to deployment …
  • Be skeptical
  • Test rigorously
  • Some small things can make a big difference

Deployment - Overview
• What we can control
• Basic process of setting up/using search in IA
• Key parameters/issues
  • What to search/organization of content
  • Testing and improving results
  • Presentation/interfaces

What do we control (the IA part)?
• Requirements and search engine selection
  • Developing search requirements
  • Build vs. buy
  • Vendor evaluation/selection
  • Consultants?
• Content selection
  • What to search/zones/etc.
  • Tags
• Search engine configuration
  • Zones, what gets indexed, sometimes how
  • Number of results, sometimes recall vs. precision
  • Others (very often interface-related)
• Interfaces

Search Engine Selection
• Commercial examples
  • Autonomy (including the former Verity, Ultraseek, . . .)
  • Google (site search, search appliance)
  • Thunderstone
• Build your own, open source?
  • Lucene
• Defining requirements
  • Basic search – how big, type of documents, what sort of interface, metadata, parametric?
  • Advanced requirements – automatic tagging, alerts, “more like this”
  • Customization and improvement using logs
  • Keep it focused?

Search Engine Selection (cont’d)
• Pitfalls to avoid
  • “Getting a bargain”
  • Getting it “free”
  • Great sales reps
• Good ideas
  • Get case studies, talk to references
  • Get a “proof of concept period”

Simple Requirements Matrix

Content Selection (What to Search)
• Generally, search everything but …
  • Be leery about providing “search the web” option
  • Use zones or separate text databases for frequent/infrequent information needs
  • Be careful about outdated/deleted content
  • Make sure “best bets” come to the top
• Use logs, test & improve

Testing and Improvement
• Keep track of queries (and results, if possible) using logs
  • If logs are not available, try user experiments
  • If results are not available, get them
• Relevance/correctness judgments are important; quantitative (e.g., recall/precision) scores are, too
• How to improve
  • Focus on most frequent (important?) requests (90-10 or 80-20)
  • “Best bets”
  • Content manipulation (e.g., adding tags)
  • Thesaurus
• Keep testing
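
Two of the measurements mentioned here are easy to sketch: precision/recall for a single query given relevance judgments, and picking out the small set of frequent queries that accounts for most of the log (the 80-20 idea). The function names, the sample log, and the page IDs are hypothetical; a real log would need more careful normalization than lowercasing.

```python
from collections import Counter

def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given judged-relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def top_queries(query_log, share=0.8):
    """Most frequent queries that together cover `share` of the log volume."""
    counts = Counter(q.strip().lower() for q in query_log)
    total = sum(counts.values())
    covered, selected = 0, []
    for query, count in counts.most_common():
        if covered >= share * total:
            break
        selected.append(query)
        covered += count
    return selected

# Hypothetical data
log = ["parking", "Parking", "library hours", "parking", "tuition",
       "library hours", "parking", "map", "parking", "library hours"]
focus = top_queries(log)
p, r = precision_recall(["p1", "p2", "p3", "p4"], ["p1", "p3", "p9"])
print(focus, p, r)
```

Here two distinct queries cover 80% of the log, so those are the ones to test and tune first.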

“Best Bets” – How to Implement
• Identify desired result page
• Determine possible query strings (from logs)
• Tag meta-data in documents with query string
• Configure search interface (e.g., to show Best Bet first, what to do about multiple Best Bets)
• This is a special case of using a tag field (e.g., keywords, categories, description)
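
The steps above can be sketched as a metadata lookup layered on top of the engine’s normal ranking. The `best_bet_for` field name, the page URLs, and both helper functions are assumptions for illustration; commercial engines expose this feature through their own configuration, which will look different.

```python
# Hypothetical pages; "best_bet_for" is an assumed metadata field holding
# the query strings (taken from logs) that each page should answer.
PAGES = [
    {"url": "/visitors/parking.html", "best_bet_for": ["parking", "parking permit"]},
    {"url": "/news/parking-changes.html", "best_bet_for": []},
    {"url": "/library/hours.html", "best_bet_for": ["library hours"]},
]

def normalize(query):
    """Lowercase and collapse whitespace so log variants match."""
    return " ".join(query.lower().split())

def best_bets_for(query, pages):
    """All pages tagged with this exact (normalized) query string."""
    q = normalize(query)
    return [p["url"] for p in pages if q in p["best_bet_for"]]

def results_with_best_bets(query, ranked_urls, pages=PAGES):
    """Show best bets first, then the ranked hits without duplicates."""
    bets = best_bets_for(query, pages)
    return bets + [u for u in ranked_urls if u not in bets]

ranked = ["/news/parking-changes.html", "/visitors/parking.html"]
results = results_with_best_bets("  Parking ", ranked)
print(results)
```

If several pages are tagged with the same query string, this sketch simply lists them all first in page order; deciding what to do with multiple Best Bets is a configuration choice.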

Designing a Search Interface
• The Box (size, position, labels)
• Content selection (defaults, radio buttons or pull-down selection)
• Parameters or advanced search (Booleans, separate zones, other possibilities)

Designing a Search Interface - Results
• Number of results to display
  • Recall/precision tradeoff?
• Snippet/summary information for each hit
• Layout of best bets/other hits
• Repetition of the query
• “No results” – other possible tips
• Iteration and refinement
• Other (e.g., scores, clusters, …)
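
As one illustration of snippet information for a hit, the sketch below returns a short window of words around the first query term found in the page text. The `make_snippet` helper and its `window` parameter are hypothetical; search engines generally build snippets with more elaborate, engine-specific logic.

```python
def make_snippet(text, query, window=5):
    """Words around the first query-term match, or the opening words if none match."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing to query terms
        if word.lower().strip(".,;:!?\"'()") in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            snippet = " ".join(words[start:end])
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + snippet + suffix
    return " ".join(words[: 2 * window + 1])

text = "The campus library is open late during finals week and closes early on holidays."
snippet = make_snippet(text, "finals", window=2)
print(snippet)
```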

Some example sites
www.hp.com
www.dell.com
www.ecoearth.info
www.washingtonpost.com
www.dailygazette.com
www.friendsofrockcreek.org
www.cbf.org
www.umd.edu

Integrating Search and Browsing
• Provide more navigation for common needs
  …based on search logs, other info
• Redirect from search results to navigation
• Faceted browsing
• . . .


Faceted Browsing Example
Demo: http://flamenco.berkeley.edu/demos.html

Advantages of Facets
• Integrates searching and browsing
• Easy to build complex queries
• Easy to narrow, broaden, shift focus
• Helps users avoid getting lost
• Helps to prevent “categorization wars”
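
Narrowing and broadening with facets comes down to filtering on metadata values and showing counts for the remaining choices. The catalog, field names, and helper functions below are hypothetical and are not taken from the Flamenco demo.

```python
from collections import Counter

# Hypothetical catalog; each key other than "title" acts as a facet
ITEMS = [
    {"title": "Laptop A", "brand": "HP", "type": "laptop"},
    {"title": "Laptop B", "brand": "Dell", "type": "laptop"},
    {"title": "Printer C", "brand": "HP", "type": "printer"},
]

def filter_items(items, selections):
    """Keep items matching every selected facet value (narrowing)."""
    return [it for it in items if all(it.get(f) == v for f, v in selections.items())]

def facet_counts(items, facet):
    """How many of the current items fall under each value of a facet."""
    return Counter(it[facet] for it in items if facet in it)

narrowed = filter_items(ITEMS, {"brand": "HP"})
counts = facet_counts(narrowed, "type")
print([it["title"] for it in narrowed], dict(counts))
```

Removing a key from `selections` broadens the result set again, and the counts tell users in advance how many items each further choice would leave, which helps them avoid dead ends.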

Recap
• Search is an IA issue!
• Quality of search results/user experience depends on:
  • Understanding how search engines work
  • Choosing and deploying carefully
  • Constant testing and improvement
  • Time
• Tremendous range of parameters/interface choices
• Integrating search and browsing/navigation is a very good idea
