
CS276A Text Information Retrieval, Mining, and Exploitation

Lecture 15, 26 Nov 2002.


Presentation Transcript


  1. CS276A Text Information Retrieval, Mining, and Exploitation • Lecture 15, 26 Nov 2002

  2. Recap: Web Anatomy [figure labels: …/~newbie/, www.ibm.com, /…/…/leaf.htm]

  3. Recap: Size of the Web • Capture-recapture technique • Assumes engines get independent random subsets of the Web • E2 contains x% of E1; assume E2 contains x% of the Web as well • Knowing the size of E2, compute the size of the Web • Size of the Web = 100 * E2 / x • Bharat & Broder: 200 M (Nov 97), 275 M (Mar 98) • Lawrence & Giles: 320 M (Dec 97) [figure: engines E1 and E2 as overlapping subsets of WEB]
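The estimate on this slide can be sketched in a few lines. This is a minimal illustration, assuming the overlap x is measured by checking a sample of E1's pages against E2; the function names and the numbers in the usage note are hypothetical, not taken from the cited studies.

```python
def overlap_percent(sample_of_e1, e2_index):
    """Percentage x of sampled E1 pages that E2 also indexes."""
    sample = list(sample_of_e1)
    if not sample:
        raise ValueError("sample must not be empty")
    hits = sum(1 for url in sample if url in e2_index)
    return 100.0 * hits / len(sample)


def estimate_web_size(size_e2, x_percent):
    """Slide formula: Size of the Web = 100 * E2 / x.

    Relies on the slide's assumption that both engines index
    independent random subsets of the Web.
    """
    if not 0 < x_percent <= 100:
        raise ValueError("x must be in (0, 100]")
    return 100.0 * size_e2 / x_percent
```

For instance, if E2 indexed 100 million pages and half of the sampled E1 pages were found in E2, `estimate_web_size(100e6, 50)` would give 200 million. The estimate is only as good as the independence assumption.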

  4. Recent Measurements • Source: http://www.searchengineshowdown.com/stats/change.shtml

  5. Today’s Topics • Web IR infrastructure • Search deployment • XML intro • XML indexing and search

  6. Web IR Infrastructure • Connectivity Server • Fast access to links to support link analysis • Term Vector Database • Fast access to document vectors to augment link analysis

  7. Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01] • Fast web graph access to support connectivity analysis • Stores mappings in memory from • URL to outlinks, URL to inlinks • Applications • HITS, PageRank computations • Crawl simulation • Graph algorithms: web connectivity, diameter etc. • Visualizations

  8. Usage • Input: graph algorithm + URLs + values • URLs to FPs to IDs • Execution: graph algorithm runs in memory • IDs to URLs • Output: URLs + values • Translation tables on disk • URL text: 9 bytes/URL (compressed from ~80 bytes) • FP (64b) -> ID (32b): 5 bytes • ID (32b) -> FP (64b): 8 bytes • ID (32b) -> URLs: 0.5 bytes

  9. ID assignment • Partition URLs into 3 sets, sorted lexicographically • High: Max degree > 254 • Medium: 254 > Max degree > 24 • Low: remaining (75%) • IDs assigned in sequence (densely) • E.g., HIGH IDs (Max(indegree, outdegree) > 254), listed as ID, URL: … 9891 www.amazon.com/, 9912 www.amazon.com/jobs/, … 9821878 www.geocities.com/, … 40930030 www.google.com/, … 85903590 www.yahoo.com/ • Adjacency lists • In-memory tables for Outlinks, Inlinks • List index maps from an ID to start of adjacency list

  10. Adjacency List Compression - I • Adjacency list: • Smaller delta values are exponentially more frequent (80% to same host) • Compress deltas with variable length encoding (e.g., Huffman) • List Index pointers: 32b for high, Base+16b for med, Base+8b for low • Avg = 12b per pointer [figure: list index entries for IDs 104, 105, 106 pointing into a sequence of adjacency lists (values 98, 132, 153, 147, …) and into the delta-encoded adjacency lists (values -6, 34, 21, -8, 49, 6, …)]
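A rough sketch of delta encoding an adjacency list follows. It assumes the first delta is taken relative to the source ID and later deltas relative to the previous neighbor, which is consistent with the values visible in the slide figure but is still an assumption about its layout. The slide suggests Huffman coding for the deltas; the byte-oriented varint with zigzag mapping used here is a simpler stand-in, not the Connectivity Server's actual code.

```python
def delta_encode(source_id, neighbors):
    """First delta is relative to the source ID, the rest to the previous neighbor."""
    deltas, prev = [], source_id
    for n in sorted(neighbors):
        deltas.append(n - prev)
        prev = n
    return deltas


def delta_decode(source_id, deltas):
    neighbors, prev = [], source_id
    for d in deltas:
        prev += d
        neighbors.append(prev)
    return neighbors


def zigzag(n):
    """Map signed to unsigned so small negative deltas stay small."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def unzigzag(u):
    return (u >> 1) if u % 2 == 0 else -((u + 1) >> 1)


def encode_varint(u):
    """7 payload bits per byte; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def compress(source_id, neighbors):
    deltas = delta_encode(source_id, neighbors)
    return b"".join(encode_varint(zigzag(d)) for d in deltas)


def decompress(source_id, data):
    return delta_decode(source_id, [unzigzag(u) for u in decode_varints(data)])
```

With these definitions, `delta_encode(104, [98, 132, 153])` yields `[-6, 34, 21]`, and the whole list packs into three bytes because every delta is small. That is the effect the slide relies on: most links point to nearby IDs, so most deltas are short.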

  11. ID to adjacency list lookup [figure: list index pointers made of a base (4 bytes) plus offsets for 16 IDs map an ID to an offset into the adjacency lists; other labels in the figure: URL Info, LC:TID, FRQ:RL]

  12. Adjacency List Compression - II • Inter List Compression • Basis: Similar URLs may share links • Close in ID space => adjacency lists may overlap • Approach • Define a representative adjacency list for a block of IDs: either the adjacency list of a reference ID or the union of adjacency lists in the block • Represent an adjacency list in terms of deletions and additions when it is cheaper to do so • Measurements • Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM) • Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM)
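The deletions-and-additions idea can be sketched as follows. This is an illustrative version only: it uses the adjacency list of a reference ID as the representative list, and it measures "cheaper" by counting entries, which is an assumed cost model rather than the bit-level one a real implementation would use.

```python
def encode_against_reference(adjacency, reference):
    """Return a diff against the reference list when it has fewer entries,
    otherwise the plain list."""
    adj_set, ref_set = set(adjacency), set(reference)
    deletions = sorted(ref_set - adj_set)   # in the reference, not in this list
    additions = sorted(adj_set - ref_set)   # in this list, not in the reference
    if len(deletions) + len(additions) < len(adj_set):
        return ("diff", deletions, additions)
    return ("plain", sorted(adj_set))


def decode_against_reference(encoded, reference):
    if encoded[0] == "plain":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))
```

For example, encoding `[98, 147, 153]` against the reference `[98, 132, 153]` stores one deletion (132) and one addition (147) instead of three links. A list with no overlap falls back to the plain form.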

  13. Term Vector Database [Stat00] • Fast access to 50-word term vectors for web pages • Term Selection: • Restricted to middle 1/3 of lexicon by document frequency • Top 50 words in document by TF.IDF • Term Weighting: • Deferred till run-time (can be based on term freq, doc freq, doc length) • Applications • Content + Connectivity analysis (e.g., Topic Distillation) • Topic specific crawls • Document classification • Performance • Storage: 33GB for 272M term vectors • Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
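The term selection rule might look like the sketch below. The slide does not give the exact TF.IDF formula, so `tf * log(N / df)` is an assumption, as are the function names. Following the slide, the stored vector keeps raw term frequencies, because weighting is deferred to run-time.

```python
import math
from collections import Counter


def middle_third(doc_freq):
    """Terms in the middle third of the lexicon when ranked by document frequency."""
    ranked = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(ranked)
    return set(ranked[n // 3 : 2 * n // 3])


def term_vector(tokens, doc_freq, num_docs, k=50):
    """Keep the top-k allowed terms by TF.IDF; store raw term frequencies."""
    allowed = middle_third(doc_freq)
    tf = Counter(t for t in tokens if t in allowed)

    def score(term):
        # Assumed weighting: tf * log(N / df).
        return tf[term] * math.log(num_docs / doc_freq[term])

    top = sorted(tf, key=lambda t: (-score(t), t))[:k]
    return {t: tf[t] for t in top}
```

Dropping the most and least frequent thirds of the lexicon removes stop-word-like terms and rare noise before the top 50 are chosen, which keeps each vector small and fixed in size.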

  14. Architecture: URLid to Term Vector Lookup [figure: URLid * 64 / 480 gives the offset of an entry holding a base (4 bytes) and a bit vector for 480 URLids; each 128-byte TV record holds URL Info, Terms (LC:TID entries), and Freq (FRQ:RL entries)]

  15. Search Deployment • Web IR is just one (very specific) type of IR • Commercially most important IR application: • Enterprise search (large corporations) • Problem different from Web IR • Peer-2-Peer (P2P) search • Another search deployment strategy

  16. Enterprise Search Deployment [figure labels] • Markets: Search Boxes, Web Portals, E-Commerce, Enterprises • Content: Proprietary content, Public content • Location: Database, Company Web Site, Corporate Network, World Wide Web • Sources: Content Management, Groupware

  17. Evolution of Enterprise Search • 1st Generation (1985 - 1993): Classic Information Retrieval • User: Trained specialist • Scope: Small, closed collections • Technology: Pattern/string matching • 2nd Generation (1994 - 1999): Driven by WWW • User: Everyone • Scope: Intranet/Extranet • Technology: Pattern/string matching and external factors for relevance ranking + categorization • 3rd Generation (2000+): Discovery (Text Mining) • User: Everyone and software agents • Scope: Structured, semi-structured and unstructured information • Technology: Introduction of linguistic and semantic processing

  18. Enterprise IR is a lot more than search … • Security: cannot search what you should not read • Content organization & creation: automatic classification, taxonomy generation • Support for multiple languages, multiple formats • Conduits into databases and other content management (homes for “valuable” content) • Information processing tools: annotation, range searches, custom ranking criteria, cross-lingual tools, … • Individual preferences: personalization, notification, …

  19. Peer-To-Peer (P2P) Search • No central index • Each node in a network builds and maintains own index • Each node has “servent” software • On booting, servent pings ~4 other hosts • Connects to those that respond • Initiates, propagates and serves requests

  20. Which hosts to connect to? • The ones you connected to last time • Random hosts you know of • Request suggestions from central (or hierarchical) nameservers • All govern system’s shape and efficiency

  21. Serving P2P search requests • Send your request to your neighbors • They send it to their neighbors • Decrement “time to live” (TTL) for query • Query dies when TTL = 0 • Send search matches back along requesting path
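The flooding scheme above can be simulated centrally to see how the TTL bounds a query's reach. This is a sketch under assumptions: the network is a plain adjacency dict, each node's index is a set of keywords, and duplicate suppression via a `seen` set is added for the simulation (the slide does not mention it). All names are hypothetical.

```python
from collections import deque


def flood_search(neighbors, local_index, origin, query, ttl):
    """Flood `query` from `origin`; each hop uses up one unit of TTL.

    Returns (matching_node, return_path) pairs, where return_path runs
    from the matching node back to the origin along the requesting path.
    """
    hits = []
    seen = {origin}
    queue = deque([(origin, ttl, [origin])])
    while queue:
        node, remaining, path = queue.popleft()
        if node != origin and query in local_index.get(node, ()):
            hits.append((node, list(reversed(path))))
        if remaining == 0:
            continue  # the query dies here
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, remaining - 1, path + [nxt]))
    return hits
```

With a chain-like network `a-b-d-e` plus `a-c`, a query from `a` with TTL 2 reaches `b`, `c`, and `d` but never `e`, so matches stored only on distant nodes are missed. This is one reason result quality in P2P search depends on network shape and TTL.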

  22. Some P2P Networks • Gnutella • Kazaa • Bearshare • Aimster • Grokster • Morpheus

  23. P2P: Information Retrieval Issues • Why is this more difficult than centralized IR?

  24. P2P: Information Retrieval Issues • Selection of nodes to query • Merging of results • Spam

  25. What is XML? • eXtensible Markup Language • A framework for defining markup languages • No fixed collection of markup tags • Each XML language targeted for application • All XML languages share features • Enables building of generic tools

  26. Basic Structure • An XML document is an ordered, labeled tree • Character data leaf nodes contain the actual data (text strings) • Data nodes must be non-empty and non-adjacent to other character data nodes • Element nodes are each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value • Element nodes can have child nodes

  27. XML Example

  28. XML Example <chapter id="cmds"> <chaptitle>FileCab</chaptitle> <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para> </chapter>

  29. Elements • Elements are denoted by markup tags • <foo attr1="value" … > thetext </foo> • Element start tag: <foo attr1="value" … > (element name: foo) • Attribute: attr1 • The character data: thetext • Matching element end tag: </foo>

  30. XML vs HTML • Relationship?

  31. XML vs HTML • HTML is a markup language for a specific purpose (display in browsers) • XML is a framework for defining markup languages • HTML can be formalized as an XML language (XHTML) • XML defines logical structure only • HTML: same intention, but has evolved into a presentation language

  32. XML: Design Goals • Separate syntax from semantics to provide a common framework for structuring information • Allow tailor-made markup for any imaginable application domain • Support internationalization (Unicode) and platform independence • Be the future of (semi)structured information (do some of the work now done by databases)

  33. Why Use XML? • Represent semi-structured data (data that are structured, but don’t fit relational model) • XML is more flexible than DBs • XML is more structured than simple IR • You get a massive infrastructure for free

  34. Applications of XML • XHTML • CML – chemical markup language • WML – wireless markup language • ThML – theological markup language • <h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3> <p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge <note place="foot" id="One.2.p0.4"> <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p> </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9"> <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p> </note></added> </p>

  35. XML Schemas • Schema = syntax definition of XML language • Schema language = formal language for expressing XML schemas • Examples • DTD • XML Schema (W3C) • Relevance for XML IR • Our job is much easier if we have a (one) schema

  36. XML Tutorial • http://www.brics.dk/~amoeller/XML/index.html • (Anders Møller and Michael Schwartzbach) • Previous (and some following) slides are based on their tutorial

  37. XML Indexing and Search

  38. Native XML Database • Uses XML document as logical unit • Should support • Elements • Attributes • PCDATA (parsed character data) • Document order • Contrast with • DB modified for XML • Generic IR system modified for XML

  39. XML Indexing and Search • Most native XML databases have taken a DB approach • Exact match • Evaluate path expressions • No IR-type relevance ranking • Only a few focus on relevance ranking

  40. Timber: XML as DB extension • DB: search tuples • Timber: search trees • Main focus • Complex and variable structure of trees (vs. tuples) • Ordering • XML query optimization vs relational optimization

  41. ToXin • Native XML database • Exploits overall path structure • Supports any general path query • Query evaluation in three stages • Preselection stage • Selection stage • Postselection stage

  42. ToXin: Motivation • Strawman: • Index all paths occurring in database • Does not allow backward navigation • Example query: • find all the titles of articles from 1990

  43. Query Evaluation Stages • Pre-selection • First navigation down the tree • Selection • Value selection according to filter • Post-selection • Navigation up and down again
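The three stages can be walked through on the example query "find all the titles of articles from 1990". The sketch below uses `xml.etree.ElementTree` on a tiny hypothetical document and builds a parent map for the upward step. It illustrates the navigation pattern only and says nothing about ToXin's actual index structures.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration.
DOC = """
<bib>
  <article><title>A</title><year>1990</year></article>
  <article><title>B</title><year>1995</year></article>
  <book><title>C</title><year>1990</year></book>
</bib>
"""


def titles_of_articles_from(root, year):
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    # Pre-selection: first navigation down the tree.
    year_nodes = root.findall("./article/year")

    # Selection: value selection according to the filter.
    selected = [y for y in year_nodes if (y.text or "").strip() == year]

    # Post-selection: navigate up to the article, then down to its title.
    titles = []
    for y in selected:
        article = parent[y]
        titles.extend(t.text for t in article.findall("title"))
    return titles


root = ET.fromstring(DOC)
```

Calling `titles_of_articles_from(root, "1990")` returns only the article's title and skips the book from the same year. The upward step in post-selection is the backward navigation that a forward-only path index cannot provide.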

  44. ToXin

  45. Factors Impacting Performance • Data source specific • Document size • Number of XML nodes and values • Path complexity (degree of nesting) • Average value size • Query specific • Selectiveness of path constraint • Size of query answer • Number of elements selected by filter

  46. Benchmark Parameters

  47. Query Classification

  48. Evaluation

  49. ToXin: Summary • Efficient native XML database • All paths are indexed (not just from root) • Path index linear in corpus size • Shortcomings • Order of nodes ignored • Semantics of IDRefs ignored • What is missing?

  50. IR/Relevance Ranking for XML • Why is this difficult?
