
Search



Presentation Transcript


  1. INTERNET ENGINEERING, MOHAMMAD BORUJERDI. Search, CHAPTER 12.
     OUTLINE:
     • Why do we need search?
     • Problems with search using SQL and RDBMS.
     • Full Text Indexed Search System.
     • Arguments against the Split-System.
     • Oracle-Text: Full text indexed search in DB.

  2. Why do we need search?
     • Search accommodates new users by leading them to existing content relevant to their needs.
     • A community's first line of defense is high-quality information architecture and navigation. Users are better at browsing than at formulating search queries.
     • A community's second line of defense, however, is a superb full-text search facility.
     • On a large site a user might wish to restrict the search in some way.

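  3. Problems with search using SQL and RDBMS
     1. Quality
        1.1. Single-word queries: matching is sensitive to capitalization.
        1.2. Multi-word queries: AND-ing the words means that the more words a user types, the fewer documents are found. There is also no stemming, so related forms and related terms are missed (the slide's example: "running" and "marathon").
     2. Performance
        2.1. A lookup through a B-tree index takes O[log N] time, where N is the number of rows.
        2.2. An index on the first words of a column does not help much: a word in the middle of the text still forces a scan of every row.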
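The quality problems on this slide can be reproduced with a few lines of Python. This is only a sketch: `ROWS` is made-up data standing in for a table of documents, and `like_search` is a hypothetical helper that imitates a case-sensitive `LIKE '%word%'` condition per word, AND-ed together, by scanning every row.

```python
# Made-up rows standing in for a documents table (id -> body).
ROWS = {
    1: "Running shoes for the Boston marathon",
    2: "A review of trail running gear",
    3: "running a database server",
}


def like_search(rows, *words):
    """Imitate body LIKE '%w1%' AND body LIKE '%w2%' ... with a full scan.

    Matching is a plain case-sensitive substring test, so every row is
    examined and capitalization differences cause misses.
    """
    return [doc_id for doc_id, body in rows.items()
            if all(word in body for word in words)]


if __name__ == "__main__":
    # Row 1 is missed because it starts with a capital "R".
    print(like_search(ROWS, "running"))
    # AND-ing a second word shrinks the result set further.
    print(like_search(ROWS, "running", "gear"))
    # A search for "marathon" finds row 1 but none of the "running" rows.
    print(like_search(ROWS, "marathon"))
```

Note that row 3 matches "running" even though it has nothing to do with the sport, which is a reminder that substring matching says nothing about relevance.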

  4. Full Text Indexed Search System
     • Extra work at insertion time is traded for less work at query time.
     • Query time approaches O[1]; it does not vary with the size of the corpus indexed.
     • The index is a table of every word next to the database keys of the documents containing the word:

       Word            Document IDs
       absquatulate    612
       bedizen         36, 9211
       cryptogenic     9
       dactylioglyph   7214
       exheredate      57, 812, 4010
       feuilleton      87, 349, 1203
       genetotrophic   5000
       hartebeest      710
       inspissate      549, 21, 3987
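A minimal sketch of such a word-to-document-IDs table, using a Python dictionary as the hash table. The function names `build_index` and `lookup` and the sample texts in `DOCS` are illustrative assumptions, and splitting on whitespace is a deliberate simplification of real tokenization.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document IDs containing it.

    All the work happens here, at insertion time.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, word):
    """Return the sorted IDs of documents containing the word.

    This is a single hash-table access, independent of corpus size.
    """
    return sorted(index.get(word.lower(), set()))


# Made-up document bodies keyed by database ID.
DOCS = {
    36: "bedizen the hall",
    612: "they absquatulate at dawn",
    9211: "Bedizen everything",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(lookup(INDEX, "bedizen"))
    print(lookup(INDEX, "absquatulate"))
```

Lowercasing at both index time and query time removes the capitalization problem that plain SQL matching has.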

  5. Full Text Indexed Search System (cont'd)
     • With a hash table, access to a row in the word table is O[1].
     • If the rows are sorted instead, access to any row is O[log W], where W is the number of words in our vocabulary.
     • Either way, performance does not vary with the number of documents in the collection.
     * Stop words, which are words too common to be worth indexing, must be eliminated. For standard English, the stop word list includes such words as "a", "and", "as", "at", "for", "or", "the", etc.
     * For relevancy we need a new data structure: the word-frequency histogram. It tells us which words occur in a document and how frequently they occur, in a way that is easily adjusted for the total length of the document.
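The two starred points can be sketched together. The `STOP_WORDS` set below holds only the examples named on the slide, not a complete list, and `histogram` is a hypothetical helper. Dividing each count by the number of words kept is one simple way to adjust for document length; the slide does not prescribe a particular method.

```python
from collections import Counter

# Only the examples given on the slide; a real list would be longer.
STOP_WORDS = {"a", "and", "as", "at", "for", "or", "the"}


def histogram(text):
    """Return {word: relative frequency} for the non-stop words in text.

    Counts are divided by the number of words kept, so a long document
    and a short one with the same mix of words get similar histograms.
    """
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


if __name__ == "__main__":
    print(histogram("the cat and the dog and the cat"))
```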

  6. Full Text Indexed Search System (cont'd)
     • The crude histogram is adjusted for the prevalence of words in standard English. An appearance of "resemble" is more interesting than one of "happy" because "resemble" occurs less frequently in standard English. Stop words such as "is" are thrown away altogether.
     • Stemming: in the index and in queries we convert all words to their stems. The stem of "families", for example, is "family". With stemming, a query for "families" matches a document containing "family" and vice versa.
     • Now it is possible to answer queries such as "Show me documents that are similar to this one" or "Show me documents whose histogram is closest to a user-entered string."
     • Similarity measures need to be defined.
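As one possible reading of this slide, the sketch below stems words with two crude suffix rules and compares histograms by cosine similarity. Both choices are assumptions made for illustration: the slide names no stemming algorithm and leaves the similarity measure open, and the adjustment for word prevalence in English is left out to keep the example short.

```python
import math
from collections import Counter


def stem(word):
    """Very crude stemmer: 'families' -> 'family', 'shoes' -> 'shoe'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stemmed_histogram(text):
    """Count occurrences of each stem in the text."""
    return Counter(stem(w) for w in text.lower().split())


def cosine_similarity(h1, h2):
    """Cosine of the angle between two histograms (0.0 when unrelated)."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    query = stemmed_histogram("families")
    doc = stemmed_histogram("family")
    print(cosine_similarity(query, doc))
```

Ranking every document by its similarity to the histogram of a user-entered string gives the "closest histogram" query mentioned on the slide.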

  7. Full Text Indexed Search System (cont'd)
     • Why not chuck the RDBMS altogether? We can't adopt the full-text search system as our primary database management system unless it handles the concurrency problem as well as the RDBMS does.
     • A pragmatic approach would seem to be to start by keeping all the documents in the RDBMS: articles, user comments, discussion forum postings, etc.
     • Then, either once per night or every time a new document is added, update the full-text search system's collection.
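The update step could look roughly like the following, with an in-memory SQLite database standing in for the RDBMS and a dictionary standing in for the separate full-text search system. The table layout, the `indexed` flag column and the function names are all assumptions made for this sketch.

```python
import sqlite3
from collections import defaultdict


def open_db():
    """Create the stand-in RDBMS with a single documents table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " indexed INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def add_document(db, body):
    """Store a document in the RDBMS, which remains the primary copy."""
    cursor = db.execute("INSERT INTO documents (body) VALUES (?)", (body,))
    db.commit()
    return cursor.lastrowid


def sync_index(db, index):
    """Copy not-yet-indexed documents into the search index.

    Can be run nightly or right after each insert. Returns the number
    of documents that were added to the index.
    """
    rows = db.execute(
        "SELECT id, body FROM documents WHERE indexed = 0"
    ).fetchall()
    for doc_id, body in rows:
        for word in body.lower().split():
            index[word].add(doc_id)
        db.execute("UPDATE documents SET indexed = 1 WHERE id = ?", (doc_id,))
    db.commit()
    return len(rows)


if __name__ == "__main__":
    db = open_db()
    index = defaultdict(set)
    add_document(db, "Studio photography tips")
    print(sync_index(db, index))  # one new document indexed
    print(sync_index(db, index))  # nothing left to do
```

Between an insert and the next run of `sync_index` the two copies disagree, which is exactly the out-of-sync risk raised against the split system.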

  8. Arguments against the Split-System
     1. Two copies of the document collection are being kept.
        In an age of $200 disk drives of high capacity, this isn't a powerful argument.
     2. The collections will get out of sync.
        This can be avoided by hiring sufficiently careful programmers and sufficiently dedicated system administrators.
     3. Disparity of interfaces.
        The cost of bringing in a new programmer grows if you have to teach that person not only about an RDBMS, but also about specialized tools, each with its own library of interfaces.
     4. (The best argument) The split system does not naturally support some necessary kinds of queries, namely those drawing on features outside the document:
        - Documents matching "best restaurants" written by users whose address is within 10 miles of zip code 02138.
        - Documents matching "studio photography" written by users whose contributions have been rated above average by other users.
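To make the last argument concrete, here is a sketch of the second example query in a split system. The full-text system is assumed to have already returned the IDs of documents matching "studio photography"; the application must then carry those IDs over to the RDBMS to apply the rating condition. The schema, the sample rows and the function names are hypothetical, with SQLite again standing in for the RDBMS.

```python
import sqlite3


def make_demo_db():
    """Build a tiny stand-in RDBMS with users and their documents."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, rating REAL)")
    db.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, author_id INTEGER)"
    )
    db.executemany("INSERT INTO users VALUES (?, ?)", [(1, 4.5), (2, 2.0)])
    db.executemany(
        "INSERT INTO documents VALUES (?, ?)", [(10, 1), (11, 2), (12, 1)]
    )
    db.commit()
    return db


def above_average_docs(db, matching_ids):
    """Keep only the full-text hits whose author is rated above average.

    matching_ids comes from the separate search system, so the two
    result sets have to be combined here, in application code.
    """
    if not matching_ids:
        return []
    placeholders = ",".join("?" * len(matching_ids))
    sql = (
        "SELECT d.id FROM documents d"
        " JOIN users u ON u.id = d.author_id"
        f" WHERE d.id IN ({placeholders})"
        " AND u.rating > (SELECT AVG(rating) FROM users)"
        " ORDER BY d.id"
    )
    return [row[0] for row in db.execute(sql, list(matching_ids))]


if __name__ == "__main__":
    db = make_demo_db()
    # Pretend the search system matched documents 10 and 11.
    print(above_average_docs(db, [10, 11]))
```

Passing a long list of IDs from one system to the other is awkward and gets slower as the number of text matches grows, whereas a text index inside the RDBMS lets a single query express both conditions.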

  9. Oracle-Text: Full text indexed search in DB
     • Build a full-text search indexer inside the RDBMS. The relevant Oracle product is called "Oracle Text".
     • The Oracle query processor is smart enough to know how to use the Text index to answer queries without doing a sequential table scan.
     • It is possible to build a multi-column index.
     • Oracle Text also has the property that its default search mode is exact phrase matching.
     • Oracle Text, via the "INSO filters" option, has the capability to index a remarkable variety of documents in a BLOB column. For example, the software can recognize a Microsoft Excel spreadsheet, pull the text out and add it to the index.
