
EMu Searching Explained (What’s going on under the hood!)

Bernard Marshall, Chief Technical Officer, KE Software





Presentation Transcript


  1. EMu Searching Explained (What’s going on under the hood!) • Bernard Marshall, Chief Technical Officer, KE Software

  2. Overview • The basic theory • Tools and tuning • Searching issues

  3. EMu search mechanism • Two level superimposed coding scheme for partial match retrieval • Developed from research at the University of Melbourne (early 1980s) • Designed to provide very high speed retrieval from very large datasets • The more search terms provided, the faster the search time • One set of indexes for all searching (except key searches)

  4. Record Descriptor • Encodes the contents of one record into a single bit string • Descriptors are stored sequentially in the rec file • Each record descriptor has the record's offset in the data file appended • [Diagram: rec file and data file]

  5. [Diagram labels: term, column no, k, b; pseudo random number generator; bit numbers]

  6. Record descriptor (searching) • Generate record descriptor for search term(s) • AND with all record descriptors to find matching record(s)
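The encoding and matching steps on slides 4 to 6 can be illustrated with a minimal Python sketch. It is not EMu's implementation: the function names, the use of SHA-256 to seed Python's `random.Random`, and the way the column number is mixed into the seed are all assumptions made for illustration. Only the overall idea (a pseudo random number generator turns a term into k bit numbers in a bit string of length b, and searching ANDs a query descriptor against record descriptors) comes from the slides.

```python
import hashlib
import random


def term_bits(term, k, b, column=0):
    """Pick k distinct bit numbers in [0, b) for a (column, term) pair.

    Assumption: the generator is seeded from the column number and the
    lower-cased term, so the same term always yields the same bits.
    """
    seed = hashlib.sha256(f"{column}:{term.lower()}".encode()).digest()
    rng = random.Random(seed)  # deterministic pseudo random number generator
    return set(rng.sample(range(b), k))


def descriptor(terms, k, b, column=0):
    """Superimpose (OR together) the bits of every term into one bit string."""
    bits = 0
    for term in terms:
        for pos in term_bits(term, k, b, column):
            bits |= 1 << pos
    return bits


def may_match(query, record):
    """A record is a candidate when every query bit is also set in it."""
    return (query & record) == query


record = descriptor(["red", "kangaroo", "skull"], k=5, b=256)
query = descriptor(["kangaroo"], k=5, b=256)
print(may_match(query, record))  # a record always matches its own terms
```

A record that contains the term always passes `may_match`; a record that does not contain it can still pass by coincidence, which is the false match described on the next slide.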

  7. False matches • A false match occurs when the query descriptor matches the descriptor of a record that does not contain the search term

  8. False matches • The chance of a false match is related to bit density • The lower the bit density, the lower the probability of a false match • EMu uses a bit density of < 25%; that is, less than 25% of bits are one • With k = 5, the probability of a false match for a single term query is 1 in 1,024 record descriptors checked • For a two term query the probability is 1 in 1,048,576 • Lower bit density requires more disk space and produces longer record descriptors
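The figures on slide 8 follow from a simple model, sketched here in Python. The assumption (not stated on the slide) is that each bit of a record descriptor is set independently with probability equal to the bit density, so a term that sets k bits matches by chance with probability density^k. The function name is hypothetical.

```python
from fractions import Fraction


def false_match_probability(density, k, terms=1):
    """Chance that all k bits of every query term are set by coincidence.

    Assumes bits are set independently with probability `density`.
    """
    return density ** (k * terms)


quarter = Fraction(1, 4)  # the 25% upper bound on bit density
print(false_match_probability(quarter, k=5))           # 1/1024
print(false_match_probability(quarter, k=5, terms=2))  # 1/1048576
```

Because the actual density is below 25%, these values are upper bounds under this model.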

  9. Segment descriptor • Encodes the contents of multiple records into a bit string • Descriptors stored sequentially in the seg file (bitsliced)

  10. Segment descriptor • For each group of Nr records, a single descriptor is calculated in the same way as a record descriptor • The segment level has its own values for k (number of bits to set) and b (length of bit string)

  11. Segment descriptor (searching) • Segment searching checks Nr records per descriptor • For efficient disk access for searching, “flip” seg file (bitslicing) • Penalty is slower record insertions / updates (use oflow file)

  12. Segment descriptor (bitsliced) • The bit slices for the bits set in the query descriptor are ANDed together to determine matching segments • Matching segments are given by the bit positions with a value of one in the result
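A small Python sketch of the bitslicing idea on slides 11 and 12, under stated assumptions: segment descriptors are held as Python integers, and the helper names `bitslice` and `matching_segments` are hypothetical. "Flipping" the file is modelled as a transpose, so that slice j holds bit j of every segment descriptor and a query only needs to read the slices for its own bits.

```python
def bitslice(segment_descriptors, b):
    """Transpose: slice j holds bit j of every segment descriptor."""
    slices = [0] * b
    for seg_no, desc in enumerate(segment_descriptors):
        for j in range(b):
            if (desc >> j) & 1:
                slices[j] |= 1 << seg_no
    return slices


def matching_segments(slices, query_bits, n_segments):
    """AND the slices for the query's bits; one bits mark matching segments."""
    result = (1 << n_segments) - 1  # start with every segment as a candidate
    for j in query_bits:
        result &= slices[j]
    return [s for s in range(n_segments) if (result >> s) & 1]


segments = [0b0110, 0b1010, 0b0011]  # three toy segment descriptors, b = 4
slices = bitslice(segments, b=4)
print(matching_segments(slices, [1, 2], n_segments=3))  # [0]
```

Only as many slices are read as there are bits in the query descriptor, regardless of how many segments exist. The cost is that inserting one record touches many slices, which is the insertion penalty noted on slide 11.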

  13. Complete search sequence • Build segment query descriptor for query terms • Search bitslice segment file for list of matching segments • Build record query descriptor for query terms • Search record descriptors in matching segments for matching records • Check each candidate record for an exact match before showing it to the user
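The five steps above can be sketched end to end in Python. This is an illustration under several assumptions, not EMu's code: records are plain strings split on whitespace, the hash-seeded generator stands in for the real one, the parameter defaults (`nr`, `kr`, `br`, `ks`, `bs`) are arbitrary, the index is rebuilt on every call, and the segment level is scanned directly rather than bitsliced to keep the example short.

```python
import hashlib
import random


def descriptor(terms, k, b):
    """OR together k pseudo random bit numbers per term (illustrative only)."""
    bits = 0
    for term in terms:
        seed = hashlib.sha256(term.lower().encode()).digest()
        for pos in random.Random(seed).sample(range(b), k):
            bits |= 1 << pos
    return bits


def search(records, query_terms, nr=4, kr=5, br=256, ks=3, bs=1024):
    # Index build (a real system does this at insert time, not per query).
    rec_desc = [descriptor(r.split(), kr, br) for r in records]
    groups = [records[i:i + nr] for i in range(0, len(records), nr)]
    seg_desc = [descriptor(" ".join(g).split(), ks, bs) for g in groups]

    # Steps 1-2: segment query descriptor, then the list of matching segments.
    seg_q = descriptor(query_terms, ks, bs)
    hit_segments = [s for s, d in enumerate(seg_desc) if (d & seg_q) == seg_q]

    # Steps 3-4: record query descriptor, checked only inside those segments.
    rec_q = descriptor(query_terms, kr, br)
    candidates = [
        n
        for s in hit_segments
        for n in range(s * nr, min((s + 1) * nr, len(records)))
        if (rec_desc[n] & rec_q) == rec_q
    ]

    # Step 5: exact match against the data to discard false matches.
    wanted = {t.lower() for t in query_terms}
    return [n for n in candidates if wanted <= set(records[n].lower().split())]


data = ["red kangaroo skull", "grey kangaroo skin", "red fox skull",
        "emu egg", "emu feather"]
print(search(data, ["red", "skull"], nr=2))  # [0, 2]
```

The final exact-match step is what makes the result correct even when the descriptors produce false matches; the two descriptor levels only narrow down how many records need that check.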

  14. Number of disk accesses (logical) • For a single search term with one matching record: ks reads (ks is the number of bits set per term at the segment level) plus 1 read of the segment to match the record descriptor • The number of logical reads is independent of the table size • The number of physical reads increases as the table grows (but disk read ahead helps here)

  15. Client query evaluation • Attachment searches performed and matching IRNs on reference column added to query statement • Reverse attachment searches performed and matching reference values added to query statement • Local search terms added to query statement • Also search columns added to query statement • Search performed

  16. What is a term? • A term is the basic index component

  17. Term modifiers • Modifiers alter how the term is indexed

  18. Indexing tools • texdensity: prints out the bit density for segment and record descriptors • texanalyse: prints the number of terms per record • texconf: calculates a suitable index configuration • Adjust configuration parameters manually

  19. Configuration parameters • params file in table directory (XML based file) • Overrides default configuration parameters: bit density (rec/seg); file system block size; false match probability (rec/seg); minimum number of records per segment

  20. Searching Issues – false matches • Issue: some queries are slow but disk activity is high • Diagnose: texadmin database usage shows a high number of index false matches; texdensity shows high density, or a large standard deviation with a high maximum density (check seg and rec); texanalyse shows a large standard deviation for the number of index terms (check seg and rec) • Fix: reconfigure the table; set configuration parameters manually

  21. Searching Issues – common terms • Issue: some queries containing common terms are slow (“false” segment matches) • Diagnose: querying on each term individually returns a large number of matches (and is quick); querying on the combination of terms becomes slow • Fix: cluster the table on a common term; sort data before indexing

  22. Searching Issues – block size mismatch • Issue: overall searching is slow but disk activity is high; using zfs with a large record size • Diagnose: determine the block size of the file system used to hold index files; use texconf to determine the block size used for indexing • Fix: set the blocksize configuration parameter manually; adjust the zfs record size to 16K
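One way to carry out the first diagnosis step on a Unix-like system is to ask the operating system for the block size of the file system holding the index files. The sketch below uses Python's standard `os.statvfs`; the function name and the example path are hypothetical, and what `f_bsize` reports for zfs can vary by platform, so treat the value as a starting point to compare with the block size texconf reports.

```python
import os


def filesystem_block_size(path):
    """Return the preferred I/O block size (bytes) of the file system at path."""
    return os.statvfs(path).f_bsize


# Hypothetical usage: point this at the directory holding the table's index files.
print(filesystem_block_size("."))
```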

  23. Searching Issues – RAID configuration • Issue: record updates are very slow; disks are fast but performance is less than optimal • Diagnose: disk controller or driver is configured to use RAID 5 or 6 • Fix: use RAID 1+0 (RAID 10, stripe/mirror) for optimal performance in a RAID environment; ensure striping agrees with the block size of the file system; enable striping where possible

  24. Searching Issues – Unindexed fields • Issue: wildcard / stem / phonetic based queries are extremely slow • Diagnose: use emuindexing to check indexing of the fields being queried • Fix: add Registry entries to enable the required indexing: • System|Setting|Table|table|Stem Index|colname;colname;... • System|Setting|Table|table|Phonetic Index|colname;colname;... • System|Setting|Table|table|Null Index|colname;colname;... • System|Setting|Table|table|Partial Index|colname=parts;...

  25. Searching Issues – Range queries slow • Issue: queries containing ranges are slow • Diagnose: use emuindexing to check if range indexing is enabled • Fix: use emurangeupdate to optimise range based searching; add Registry entries to enable the required indexing: • System|Setting|Table|table|Range Buckets|colname|bucket;...

  26. Searching Issues – Large attachment queries • Issue: a query containing attachments and other terms is very slow • Diagnose: “Optimising query” status is displayed for a long time • Cause: the search engine is re-organising the query; (a AND b) AND (c OR d OR e OR f OR g) becomes (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g) • Fix: rewrite the query optimiser
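The re-organisation described under "Cause" is a distribution of AND over OR. A short Python sketch shows why it gets expensive; the representation (a query as a list of OR-groups that are ANDed together) and the function name are assumptions for illustration, not the search engine's internal form.

```python
from itertools import product


def distribute(and_of_ors):
    """Expand an AND of OR-groups into an OR of AND-groups."""
    return [list(combo) for combo in product(*and_of_ors)]


# (a AND b) AND (c OR d OR e OR f OR g)
query = [["a"], ["b"], ["c", "d", "e", "f", "g"]]
for conjunct in distribute(query):
    print(" AND ".join(conjunct))
```

The number of resulting AND-groups is the product of the OR-group sizes. One group of five alternatives gives five conjuncts, but an attachment search that returns thousands of matching IRNs, or two large OR-groups in the same query, multiplies the work accordingly.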

  27. References • EMu 4.0.01 Release Notes • System Tuning • Configuration • Range Indexing • www.kesoftware.com/downloads/EMu/documents/configuration.pdf • www.kesoftware.com/downloads/EMu/documents/Range Indexing/rangeindexing.pdf
