
going further together

Information Search & Retrieval: Problems, solutions, trends… Tony Rose, PhD MBCS CEng, Vice-Chair, BCS IRSG.


Presentation Transcript


  1. Information Search & Retrieval: Problems, solutions, trends… Tony Rose, PhD MBCS CEng Vice-Chair, BCS IRSG going further together

  2. Contents • The BCS Information Retrieval SG • What is IR anyway? • How search engines work • Why search is hard • Where’s it all going?

  3. Information Retrieval SG • Growing rapidly • 750+ members • Annual conference (ECIR) • FDIA • Various 1-day events • Search Solutions • Informer • Discounts for various events, e.g. SIGIR • … is free to join!

  4. Information Retrieval SG • Traditional focus on search (text retrieval) • Knowledge management, Multimedia retrieval, User experience, Information visualisation, extraction, summarisation, etc. • Latest issue of Informer: • “Searching for the Music You Like” • “Exploring Maps through Geo-referenced Images and RDF Shared Metadata” • “Using Semantic Relations to improve Question Answering” • “Modeling & Annotation of Dance Media Semantics”

  5. What is IR? • “Science of searching for: • information in documents • documents themselves • metadata which describe documents, • within databases • …whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web”

  6. The Need for IR • In a word… Infoglut • 800 MB of recorded information is produced per person per year [Computing magazine] • Up to 80% of corporate information is unstructured • Documents, emails, images, voicemail, etc. • So… can’t we just use Google?

  7. How do Search Engines Work? • On the surface: • Understand what the user wants • Find documents about that topic • In reality: • Count words • Apply a simple equation

  8. How do Search Engines Work? • Measure the conceptual distance between your query and each document in the DB • Return the best matches [Source: Maristella Agosti, University of Padova]

  9. The Central Problem in IR • Information seeker: concepts → query terms • Author: concepts → document terms • Do these represent the same concepts? [Source: Jimmy Lin, University of Maryland]

  10. The Central Problem in IR • How do you represent the concepts? • Documents and queries = “bag of words” • Unordered set of terms + numeric weights • How do you calculate similarity? • Set theory (e.g. Boolean) • Algebraic (e.g. vector space) • Probabilistic
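The bag-of-words representation and the algebraic (vector space) model above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine works: the helper names (`bag_of_words`, `tfidf_vectors`, `cosine`, `rank`) are hypothetical, whitespace tokenisation is a simplification, and the weighting `tf * (log(N / df) + 1)` is just one common tf-idf variant chosen for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms (word order is discarded)."""
    return Counter(text.lower().split())


def tfidf_vectors(texts):
    """Turn each text into a sparse vector of tf * idf weights.

    Assumed weighting: idf = log(N / df) + 1, one common variant among many.
    """
    bags = [bag_of_words(t) for t in texts]
    n = len(bags)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for bag in bags for term in bag)
    return [
        {term: tf * (math.log(n / df[term]) + 1.0) for term, tf in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query; best match first."""
    vectors = tfidf_vectors(docs + [query])  # treat the query as one more short document
    query_vector = vectors[-1]
    scored = [(cosine(query_vector, d), i) for i, d in enumerate(vectors[:-1])]
    return sorted(scored, reverse=True)
```

Calling `rank("cat mat", ["the cat sat on the mat", "dogs chase cats", "stock market report"])` puts the first document on top. Note that "cats" in the second document does not match "cat", which is the morphology problem raised later in the deck.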

  11. IR models [Source: Wikipedia]

  12. How do we Evaluate Search? • Assume that results are either relevant or non-relevant • Precision: • Proportion of retrieved documents that are relevant • Recall: • Proportion of known-relevant documents that were actually retrieved • But what about: indexing / retrieval speed, query language, user experience, etc.? [Figure: overlapping sets of relevant and retrieved documents]
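As a small worked example of the two measures, assuming binary relevance judgements as the slide does. The function name `precision_recall` and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall under binary relevance.

    retrieved: IDs the system returned
    relevant:  IDs judged relevant (the "known-relevant" set)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents returned, 5 known to be relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7", "d8", "d9"])
print(p, r)  # 0.5 0.4
```

Half of what was returned is relevant (precision 2/4), but only two of the five relevant documents were found (recall 2/5).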

  13. Why Search is Hard • Document representation • Keywords are not enough • Blind Venetian = Venetian Blind • Terms are not independent • Structural & discourse dependencies, co-references, etc. • Imperfect “stop lists” • the, and, of…
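A short sketch of two of the points above: keyword indexing ignores word order, and a stop list can throw away terms that matter. The stop list here is a deliberately tiny, hypothetical one, and `index_terms` is an illustrative helper rather than any real indexer's API.

```python
# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"the", "and", "of", "a", "in", "to"}


def index_terms(text, stop_words=STOP_WORDS):
    """Return the sorted set of terms left after lowercasing and stop-word removal."""
    return sorted({w for w in text.lower().split() if w not in stop_words})


# Word order is lost, so the two phrases become indistinguishable.
print(index_terms("Blind Venetian") == index_terms("Venetian blind"))  # True

# The stop list removes "the", so the band name collapses to a bare "who".
print(index_terms("The Who"))  # ['who']
```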

  14. Why Search is Hard • Morphological relationships • Computer, computing, compute, computed… • Index documents using word stems • False positives: • organization, organ → organ • police, policy → polic • arm, army → arm • False negatives: • cylinder, cylindrical • create, creation • Europe, European • Prefixes are particularly difficult • Un*, dis* • Delegate = de-leg-ate • Ratify = rat-ify
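The false positives and false negatives listed above can be reproduced with a deliberately crude suffix stripper. This is not the Porter stemmer or any other published algorithm. The suffix list and the minimum stem length of three letters are assumptions picked only to make the failure modes visible.

```python
# Hypothetical suffix list, tried longest first.
SUFFIXES = sorted(
    ["ization", "ation", "ical", "ing", "er", "ed", "e", "y"],
    key=len,
    reverse=True,
)


def crude_stem(word):
    """Strip the longest matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Intended behaviour: related forms share one stem.
print({crude_stem(w) for w in ["computer", "computing", "compute", "computed"]})  # {'comput'}

# False positives: unrelated words are conflated.
print(crude_stem("organization"), crude_stem("organ"))  # organ organ
print(crude_stem("police"), crude_stem("policy"))       # polic polic
print(crude_stem("arm"), crude_stem("army"))            # arm arm

# False negatives: related words end up with different stems.
print(crude_stem("cylinder"), crude_stem("cylindrical"))  # cylind cylindr
print(crude_stem("europe"), crude_stem("european"))       # europ european
```

Adding more suffixes fixes some false negatives but tends to create new false positives, which is why stemming remains a trade-off rather than a solved problem.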

  15. Why Search is Hard • Named entity recognition • Companies in New York • New companies in York • NEs are highly discriminatory • People • Places • Organisations • Many vertical applications • e.g. bioscience

  16. Why Search is Hard • Semantic relationships • Car = automobile • Buy = purchase • Sick = ill • Synonym rings • Car, automobile, truck, bus, taxi... • Appropriate level of abstraction depends on user & task • Development of subject-specific taxonomies • “concept matching”
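One simple way to use synonym rings is query expansion: each query term is replaced by the set of terms in its ring, and a document matches if it contains at least one term from every set. The sketch below assumes that approach. The rings are taken from the slide's examples, while `expand_query` is a hypothetical helper.

```python
# Rings taken from the slide's examples; a real system would load a taxonomy.
SYNONYM_RINGS = [
    {"car", "automobile"},
    {"buy", "purchase"},
    {"sick", "ill"},
]


def expand_query(query, rings=SYNONYM_RINGS):
    """Expand each query term into a sorted list of equivalent terms.

    The result reads as an AND of OR-groups, e.g.
    (buy OR purchase) AND (automobile OR car).
    """
    expanded = []
    for term in query.lower().split():
        group = {term}
        for ring in rings:
            if term in ring:
                group |= ring
        expanded.append(sorted(group))
    return expanded


print(expand_query("buy car"))  # [['buy', 'purchase'], ['automobile', 'car']]
```

Widening a ring to "car, automobile, truck, bus, taxi" raises recall at the cost of precision, which is the level-of-abstraction trade-off the slide mentions.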

  17. Why Search is Hard • Word sense disambiguation • “Bank” • Financial institution? • Part of a river? • An aerial manoeuvre? • Active research area • Categorisation & clustering of results

  18. Google’s Insight • Exploit the link structure inherent in the web • calculate measure of document’s value • Independent of any query • “PageRank” • Overall relevance based on 100+ parameters • Constant battle with SEOs • Enterprise search is a different proposition… • As is desktop search
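The idea of a query-independent score derived from link structure can be illustrated with a textbook power-iteration version of PageRank. This is a sketch of the published basic algorithm, not Google's production ranking. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages with no out-links, and the four-page example graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> score; scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
scores = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
print(max(scores, key=scores.get))  # c
```

Because the score depends only on who links to whom, it can be computed once at indexing time and then combined with query-dependent signals.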

  19. Where’s it all going? • Vertical search • Jobs, travel, health, people, etc. • Rich media search • Audio, video, TV, images • Specialised content search • blogs, news, classifieds • Social search • Personalisation

  20. Where’s it all going? • Mobile search • Answer engines • Active research community in Question Answering • Multi / cross-lingual search • Search agents • Human UI

  21. Further Information • www.irsg.bcs.org • Informer • ECIR (March 2008, Glasgow) • Search Solutions 2008 (Sept 2008, London)
