
Indexing

Indexing. CSCI 572: Information Retrieval and Search Engines Summer 2010. Outline. Building your search corpus Differences from RDBMS The Document/Field Model The Flattening Process Understanding Field Types Challenges. Building an index.


Presentation Transcript


  1. Indexing CSCI 572: Information Retrieval and Search Engines Summer 2010

  2. Outline • Building your search corpus • Differences from RDBMS • The Document/Field Model • The Flattening Process • Understanding Field Types • Challenges

  3. Building an index • Once you have content in the form of metadata and extracted text, you need to persist that content • For querying • For retrieval and display • How should we persist the content?

  4. Some considerations • Extracted metadata is typically unstructured • It’s not something that necessarily maps to a set of Entities (Tables) with rows and consistent columns • Documents have different, sometimes non-overlapping, metadata models • Dublin Core • Word • Climate Forecast • The write/access patterns are a bit different • Think crawling strategies…

  5. Databases versus Search Indices • Databases are optimized for • Frequent writes • Frequent reads • Transactional (ACID) properties in the face of the above • Atomic – each transaction either completes in full or is rolled back • Consistent – writes are propagated consistently, so every transaction leaves the data in a valid state • Isolated – concurrent transactions do not interfere with one another; a transaction's changes are limited to the entities it modifies • Durable – committed changes survive failures, so the system stays resilient even in the face of a catastrophic crash

  6. Databases versus Search Indices • Search Indices are optimized for • Infrequent writes • Very frequent reads • Based on a loose, unstructured Document model • Limited transactional properties • ACID is not necessary • The onus is on producing results quickly • Rollback is usually not supported • Subject to corruption • Extremely efficient query/read times, achieved by exploiting the above

  7. The Document Field Model • A method of dealing with unstructured data and its persistence to an index • Treat each indexable content item as a “Document” • Each Document has 1…N named Fields • Each Field has 1…N values • Values can be: • Text • Numerical • Hierarchical (made up of other fields) • Complex (Geospatial, etc.) • Field1: v…vn • Field2: v…vn

  8. Example: two web pages • Document 1 • Field [title], Value(s): “Chris Mattmann’s Web Page” • Type: string (text) • Field [length], Value(s): 3026 • Type: int (assumed to be bytes) • Field [author], Value(s): Chris Mattmann • Document 2 • Field [title], Value(s): “CS572 Web Page” • Type: string (text) • Field [length], Value(s): 10000 • Type: int (assumed to be bytes) • Field [author], Value(s): Chris Mattmann, Univ. of Southern California

  9. Example: a word document • Document 3 • Field [title], Value(s): “My CS572 Final Project” • Type: string (text) • Field [length], Value(s): 30012 • Type: int (assumed to be bytes) • Field [wordcount], Value(s): 2912 • Type: int • Field [mswordversion], Value(s): 2008, Mac • Type: string (text)
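
The three example documents above can be sketched as a small data structure. This is only an illustrative model, not the API of any particular search library: the `Document` class and its `add`/`get` methods are hypothetical names, and reading Document 2's author as two separate values (and `mswordversion` as a single string) is an assumption about how the slide meant those values to be split.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Value = Union[str, int, float]


@dataclass
class Document:
    """One indexable content item: a set of named, multi-valued fields."""
    fields: Dict[str, List[Value]] = field(default_factory=dict)

    def add(self, name: str, *values: Value) -> "Document":
        # A field may hold 1..N values, so values accumulate in a list.
        self.fields.setdefault(name, []).extend(values)
        return self

    def get(self, name: str) -> List[Value]:
        # Documents need not share fields; a missing field is just empty.
        return self.fields.get(name, [])


doc1 = (Document()
        .add("title", "Chris Mattmann's Web Page")
        .add("length", 3026)
        .add("author", "Chris Mattmann"))

# Assumption: the author field of Document 2 carries two values.
doc2 = (Document()
        .add("title", "CS572 Web Page")
        .add("length", 10000)
        .add("author", "Chris Mattmann", "Univ. of Southern California"))

# Assumption: "2008, Mac" is kept as one string value.
doc3 = (Document()
        .add("title", "My CS572 Final Project")
        .add("length", 30012)
        .add("wordcount", 2912)
        .add("mswordversion", "2008, Mac"))
```

Note that `doc3` has fields the web pages lack (`wordcount`, `mswordversion`) and no `author`, which a fixed table schema would not accommodate without null columns.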

  10. Apples to Oranges • Whether it’s an HTML page, a Word document, a PDF file, etc. • We can still use the Document/Field model to represent the content as it is indexed • The Document Field model works for Metadata, but also for extracted text • Define a custom text field containing all extracted, searchable text
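
A minimal sketch of the custom text field idea follows. The helper names (`tokenize`, `with_text_field`, `search`) and the field name `text` are hypothetical, and the documents are plain dictionaries built from the slide examples. Because the slides do not show any extracted body text, the sketch fills the field from the string-valued metadata and accepts extracted text as an optional argument; the linear scan in `search` is a stand-in for a real index lookup.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def with_text_field(doc, extracted_text=""):
    """Return a copy of doc with a catch-all 'text' field added."""
    strings = [v for values in doc.values() for v in values
               if isinstance(v, str)]
    if extracted_text:
        strings.append(extracted_text)
    enriched = dict(doc)
    enriched["text"] = [" ".join(strings)]
    return enriched


def search(docs, term):
    """Return ids of documents whose 'text' field contains the term."""
    term = term.lower()
    return [doc_id for doc_id, doc in docs.items()
            if any(term in tokenize(v) for v in doc["text"])]


raw_docs = {
    1: {"title": ["Chris Mattmann's Web Page"], "length": [3026],
        "author": ["Chris Mattmann"]},
    2: {"title": ["CS572 Web Page"], "length": [10000],
        "author": ["Chris Mattmann", "Univ. of Southern California"]},
    3: {"title": ["My CS572 Final Project"], "length": [30012],
        "wordcount": [2912]},
}
docs = {doc_id: with_text_field(doc) for doc_id, doc in raw_docs.items()}

hits = search(docs, "CS572")
```

A single query term now reaches an HTML page and a Word document alike, without the caller needing to know which metadata fields each one has.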

  11. What about structure? • For example, let’s say we are extracting Person records from an RDBMS to index • We’ve got 2 tables • Person • Attribute: id, int PK UNIQUE AUTO INCREMENT • Attribute: first_name, VARCHAR(255) • Attribute: last_name, VARCHAR(255) • PersonAddress • Attribute: person_id, FK to Person.id • Attribute: address_txt, VARCHAR(255) • Attribute: zipcode, int

  12. What about structure? • Example records • Person (id | first_name | last_name) • 1 | Chris | Mattmann • 2 | Homer | Simpson • PersonAddress (person_id | address_txt | zipcode) • 1 | 1234 Joe Lane | 91354 • 2 | 6344 Evergreen Terrace, Springfield, IL | 60999

  13. What about structure? • How to get the aforementioned rows into the Document Field model? • Flatten the structure • Document 1 • Field [first_name], Value(s): Chris • Type: string (text) • Field [last_name], Value(s): Mattmann • Type: string (text) • Field [id], Value(s): 1 • Type: int • Field [address_txt], Value(s): 1234 Joe Lane • Type: string (text) • Field [zipcode], Value(s): 91354 • Type: int • Document 2 • Field [first_name], Value(s): Homer • Type: string (text) • Field [last_name], Value(s): Simpson • Type: string (text) • Field [id], Value(s): 2 • Type: int • Field [address_txt], Value(s): 6344 Evergreen Terrace, Springfield, IL • Type: string (text) • Field [zipcode], Value(s): 60999 • Type: int
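
The flattening step can be sketched in a few lines. This is one possible approach under stated assumptions, not the method used by any specific tool: the tables are represented as lists of dictionaries rather than real database rows, and `flatten` is a hypothetical helper that joins on `person_id` and copies each column into a multi-valued field.

```python
from collections import defaultdict

# The two tables from the example, as in-memory rows.
persons = [
    {"id": 1, "first_name": "Chris", "last_name": "Mattmann"},
    {"id": 2, "first_name": "Homer", "last_name": "Simpson"},
]
addresses = [
    {"person_id": 1, "address_txt": "1234 Joe Lane", "zipcode": 91354},
    {"person_id": 2,
     "address_txt": "6344 Evergreen Terrace, Springfield, IL",
     "zipcode": 60999},
]


def flatten(persons, addresses):
    """Join PersonAddress onto Person and emit one flat document per person."""
    by_person = defaultdict(list)
    for address in addresses:
        by_person[address["person_id"]].append(address)

    docs = []
    for person in persons:
        # Every column becomes a field holding a list of values.
        doc = {name: [value] for name, value in person.items()}
        for address in by_person[person["id"]]:
            doc.setdefault("address_txt", []).append(address["address_txt"])
            doc.setdefault("zipcode", []).append(address["zipcode"])
        docs.append(doc)
    return docs


flat_docs = flatten(persons, addresses)
```

If a person had several addresses, the `address_txt` and `zipcode` lists would only stay paired by position, which is exactly the kind of ordering concern that makes flattening harder on real data.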

  14. Benefits of the Document Field model • Documents are independent, wholly contained entities • Reduces ACID dependencies • Increases the ability to become eventually consistent • Fields can be indexed and stored in different ways • Reformatted on entry into the index, and reformatted on the way out • Geohash is a great example of this • Analyzers – implications for the query model • Tokenizers – implications for the query model
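
To make the geohash remark concrete, here is a sketch of the publicly documented geohash encoding: a latitude/longitude pair is reformatted into a single base-32 string on its way into the index. The function name `geohash_encode`, the default precision, and the sample coordinates are illustrative choices, not something taken from the slides.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0        # number of bits collected for the current character
    value = 0       # the bits themselves, most significant first
    use_lon = True  # bits alternate between longitude and latitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


# Illustrative coordinates; a shorter hash is a prefix of a longer one.
coarse = geohash_encode(57.64911, 10.40744, precision=3)
fine = geohash_encode(57.64911, 10.40744, precision=8)
```

Because a shorter hash is always a prefix of a longer one for the same point, a two-dimensional "nearby" question can be approximated with an ordinary string prefix match on a single field. On the way out, the stored string would be decoded back into coordinates for display.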

  15. Challenges • Reducing structured data to unstructured, flattened data isn’t exactly as easy as the cooked-up example • Imagine having to encode values to preserve ordering in some fashion • Requires a deep understanding of the data, and methodologies for naming fields and ordering values • Loss of ACID properties makes it difficult to leverage the index structure for search directly in transactional systems • Have to stand up search as a separate service outside of the data management system • Determining the right tuning parameters for indexing • Max Buffer Size, When to Optimize, When to Merge, etc.
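
One simple instance of "encoding values to preserve ordering" is zero-padding integers so that string comparison agrees with numeric comparison. The sketch below assumes the index compares field values as text; the `encode_int`/`decode_int` helpers and the fixed width of 10 digits are hypothetical, and the sample values are the `length` fields from the earlier example documents.

```python
def encode_int(value, width=10):
    """Encode a non-negative int so that string order equals numeric order."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    encoded = f"{value:0{width}d}"
    if len(encoded) > width:
        raise ValueError("value does not fit in the chosen width")
    return encoded


def decode_int(encoded):
    """Reverse encode_int on the way out of the index."""
    return int(encoded)


lengths = [3026, 10000, 30012]

# Sorting the raw decimal strings compares character by character.
naive_order = sorted(str(n) for n in lengths)

# Sorting the padded strings matches the numeric order.
padded_order = [decode_int(s) for s in sorted(encode_int(n) for n in lengths)]
```

Here `naive_order` puts `"3026"` after `"30012"`, while `padded_order` matches the numeric order. Negative numbers, floats, and dates each need their own encoding scheme, which is part of why this takes a real understanding of the data.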

  16. Wrapup • Introduction to the Document Field indexing model • Differences between traditional RDBMS models and Search indices • Know when and where to use each • Search is optimized for frequent reads and infrequent writes
