EMu Searching Explained (What’s going on under the hood!) Bernard Marshall, Chief Technical Officer, KE Software
Overview • The basic theory • Tools and tuning • Searching issues
EMu search mechanism • Two level superimposed coding scheme for partial match retrieval • Developed from research at the University of Melbourne (early 1980s) • Designed to provide very high speed retrieval from very large datasets • The more search terms provided, the faster the search time • One set of indexes for all searching (except key searches)
Record Descriptor • Encodes the contents of one record into a single bit string • Descriptors stored sequentially in the rec file • Each record descriptor has the data offset (from the data file) appended [Figure: rec file and data file]
[Figure: term and column number fed to a pseudo-random number generator to produce bit numbers; parameters k and b]
Record descriptor (searching) • Generate record descriptor for search term(s) • AND with all record descriptors to find matching record(s)
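The descriptor generation and AND test described above can be sketched roughly as follows. This is a minimal illustration, not EMu's actual code: the SHA-256 seeding, the function names `term_bits`, `descriptor` and `may_match`, and the values k = 5 and b = 256 are all assumptions chosen for the example.

```python
import hashlib
import random

def term_bits(term: str, column: int, k: int, b: int) -> list[int]:
    """Pick k distinct bit positions in [0, b) for a (column, term) pair."""
    # Seeding the PRNG from the column number and term means the same
    # input always yields the same positions (hypothetical seeding scheme).
    seed = hashlib.sha256(f"{column}:{term}".encode()).digest()
    return random.Random(seed).sample(range(b), k)

def descriptor(terms, k: int, b: int) -> int:
    """Superimpose (OR together) the bits of every (column, term) pair."""
    bits = 0
    for column, term in terms:
        for pos in term_bits(term, column, k, b):
            bits |= 1 << pos
    return bits

def may_match(query_desc: int, record_desc: int) -> bool:
    """True if every bit set in the query is also set in the record."""
    return (record_desc & query_desc) == query_desc

records = [
    [(1, "fossil"), (2, "melbourne")],
    [(1, "mineral"), (2, "sydney")],
]
descs = [descriptor(r, k=5, b=256) for r in records]
query = descriptor([(1, "fossil")], k=5, b=256)
candidates = [i for i, d in enumerate(descs) if may_match(query, d)]
print(candidates)  # always contains 0; any other entry is a false match
```

A record that contains the term always passes the test, so the scheme never misses a match; it can only report extra candidates.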
False matches • Query descriptor matches a record descriptor that does not contain the search term
False matches • Chance of a false match is related to bit density • The lower the bit density, the lower the probability of a false match • EMu uses a bit density of < 25%; that is, less than 25% of bits are one • Probability of a false match with k = 5 is 1 in 1,024 record descriptors checked for a single term query • Probability for a two term query is 1 in 1,048,576 • Lower bit density requires more disk space and produces longer record descriptors
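The figures above follow from a simple model, sketched here under the assumptions that bits are set independently, that the density is exactly the 25% upper bound, and that each term sets k distinct bits. The function name is made up for this example.

```python
from fractions import Fraction

def false_match_probability(bit_density, k, terms=1):
    """Chance that all k bits of every query term happen to be set
    in a record descriptor that does not contain those terms."""
    return bit_density ** (k * terms)

p1 = false_match_probability(Fraction(1, 4), k=5)
p2 = false_match_probability(Fraction(1, 4), k=5, terms=2)
print(p1, p2)  # 1/1024 1/1048576
```

Each extra term multiplies the exponent, which is one way to see why adding search terms makes false matches rarer.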
Segment descriptor • Encodes the contents of multiple records into a bit string • Descriptors stored sequentially in the seg file (bitsliced)
Segment descriptor • For each group of records (Nr) a single descriptor is calculated as for a record descriptor • Segment level has its own values for k (number of bits to set) and b (length of bit string)
Segment descriptor (searching) • Segment searching checks Nr records per descriptor • For efficient disk access when searching, “flip” the seg file (bitslicing) • Penalty is slower record insertions / updates (use oflow file)
Segment descriptor (bitsliced) • Each bit slice is ANDed to determine matching segments • Matching segments are given by bit positions with a value of one [Figure: bit slices combined with AND]
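A small sketch of the bitsliced layout may help. Here each slice is held as a Python integer in which bit s is one when segment s has that descriptor bit set. The in-memory representation and the names `build_slices` and `matching_segments` are assumptions for illustration; the real seg file is an on-disk structure.

```python
def build_slices(segment_descs: list[int], b: int) -> list[int]:
    """Transpose segment descriptors: slice[pos] has bit s set when
    segment s has bit pos set in its descriptor."""
    slices = [0] * b
    for seg_no, desc in enumerate(segment_descs):
        for pos in range(b):
            if (desc >> pos) & 1:
                slices[pos] |= 1 << seg_no
    return slices

def matching_segments(slices: list[int], query_bits: list[int], n_segments: int) -> list[int]:
    """AND together only the slices named by the query's set bits."""
    result = (1 << n_segments) - 1  # start with every segment as a candidate
    for pos in query_bits:
        result &= slices[pos]
    return [s for s in range(n_segments) if (result >> s) & 1]

segment_descs = [0b10110010, 0b00110100, 0b11110000]
slices = build_slices(segment_descs, b=8)
print(matching_segments(slices, [4, 7], n_segments=3))  # [0, 2]
```

Only the slices for the query's set bits need to be read, however many segments there are.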
Complete search sequence • Build segment query descriptor for query terms • Search bitslice segment file for list of matching segments • Build record query descriptor for query terms • Search record descriptors in matching segments for matching records • Exact match record only before showing to user
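The five steps above can be put together in one simplified sketch. The assumptions are: terms are plain strings with no column number, the indexes are built in memory on each call, the segment level is not bitsliced, and the parameter values (nr, ks, bs, kr, br) are arbitrary. None of the names come from EMu.

```python
import hashlib
import random

def descriptor(terms, k, b):
    """OR together k pseudo-random bit positions for each term."""
    d = 0
    for term in terms:
        rng = random.Random(hashlib.sha256(term.encode()).digest())
        for pos in rng.sample(range(b), k):
            d |= 1 << pos
    return d

def search(records, query_terms, nr=4, ks=3, bs=512, kr=5, br=256):
    # Index build (normally done ahead of time and stored on disk).
    segments = [records[i:i + nr] for i in range(0, len(records), nr)]
    seg_descs = [descriptor({t for r in seg for t in r}, ks, bs) for seg in segments]
    rec_descs = [descriptor(r, kr, br) for r in records]

    # Steps 1-2: segment query descriptor, then find matching segments.
    seg_q = descriptor(query_terms, ks, bs)
    hit_segments = [s for s, d in enumerate(seg_descs) if (d & seg_q) == seg_q]

    # Steps 3-4: record query descriptor, checked only inside those segments.
    rec_q = descriptor(query_terms, kr, br)
    candidates = [
        i
        for s in hit_segments
        for i in range(s * nr, min((s + 1) * nr, len(records)))
        if (rec_descs[i] & rec_q) == rec_q
    ]

    # Step 5: exact match against the record data removes false matches.
    return [i for i in candidates if set(query_terms) <= set(records[i])]

records = [
    {"fossil", "melbourne"}, {"mineral", "sydney"},
    {"fossil", "sydney"}, {"fossil", "melbourne", "trilobite"},
    {"bird", "melbourne"}, {"fossil", "melbourne"},
]
print(search(records, ["fossil", "melbourne"], nr=2))  # [0, 3, 5]
```

Because the final exact match filters out false matches and the descriptors never produce false negatives, the result is the same whatever index parameters are chosen; the parameters only affect how much work the first four steps save.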
Number of disk accesses (logical) • For a single search term with one matching record: • ks – bits set per term (segment level) • 1 – disk read to fetch the matching segment's record descriptors • Number of logical reads is independent of the table size • Number of physical reads increases as table grows (but disk read ahead helps here)
Client query evaluation • Attachment searches performed and matching IRNs on reference column added to query statement • Reverse attachment searches performed and matching reference values added to query statement • Local search terms added to query statement • Also search columns added to query statement • Search performed
What is a term? • A term is the basic index component
Term modifiers • Modifiers alter how the term is indexed
Indexing tools • texdensity • Prints out the bit density for segment and record descriptors • texanalyse • Prints the number of terms per record • texconf • Calculate a suitable index configuration • Adjust configuration parameters manually
Configuration parameters • params file in table directory • Override default configuration parameters • Bit density (rec/seg) • File system block size • False match probability (rec/seg) • Minimum number of records per segment • XML based file
Searching Issues – false matches • Issue • Some queries are slow but disk activity is high • Diagnose • texadmin database usage shows a high number of index false matches • texdensity shows high density or large standard deviation with high maximum density (check seg and rec) • texanalyse shows a large standard deviation for the number of index terms (check seg and rec) • Fix • Reconfigure table • Set configuration parameters manually
Searching Issues – common terms • Issue • Some queries containing common terms are slow • “false” segment matches • Diagnose • Querying on each term individually results in a large number of matches (query is quick) • Querying on the combination of terms becomes slow • Fix • Cluster table on a common term • Sort data before indexing
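To illustrate why clustering or sorting helps, the sketch below counts how many segments contain a common value when records are in random order versus sorted order. The data, the segment size of 10 and the function name are invented for the example.

```python
import random

def segments_containing(values, target, nr):
    """Count segments (groups of nr records) holding at least one target."""
    return sum(1 for i in range(0, len(values), nr) if target in values[i:i + nr])

rng = random.Random(0)
values = ["common"] * 200 + [f"rare{i}" for i in range(800)]
rng.shuffle(values)

unsorted_hits = segments_containing(values, "common", nr=10)
sorted_hits = segments_containing(sorted(values), "common", nr=10)
print(unsorted_hits, sorted_hits)  # sorted_hits is 20, the minimum possible
```

When a common term is scattered across most segments, the segment level filters out very little, and a combined query has to check record descriptors in many segments that contain both terms but in different records. Sorting concentrates the term into as few segments as possible.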
Searching Issues – block size mismatch • Issue • Overall searching is slow but disk activity is high • Using zfs with large record size • Diagnose • Determine the block size of the file system used to hold index files • Use texconf to determine the block size used for indexing • Fix • Set blocksize configuration parameter manually • Adjust zfs record size to 16K
Searching Issues – RAID configuration • Issue • Record updates are very slow • Fast disks but performance less than optimal • Diagnose • Disk controller or driver is configured to use RAID 5 or 6 • Fix • Use RAID 1+0 (RAID 10, stripe/mirror) for optimal performance in a RAID environment • Ensure striping agrees with block size of file system • Enable striping where possible
Searching Issues – Unindexed fields • Issue • Wildcard / stem / phonetic based queries are extremely slow • Diagnose • Use emuindexing to check indexing of fields being queried • Fix • Add Registry entries to enable the required indexing: • System|Setting|Table|table|Stem Index|colname;colname;... • System|Setting|Table|table|Phonetic Index|colname;colname;... • System|Setting|Table|table|Null Index|colname;colname;... • System|Setting|Table|table|Partial Index|colname=parts;...
Searching Issues – Range queries slow • Issue • Queries containing ranges are slow • Diagnose • Use emuindexing to check if range indexing is enabled • Fix • Use emurangeupdate to optimise range based searching • Add Registry entries to enable the required indexing: • System|Setting|Table|table|Range Buckets|colname|bucket;...
Searching Issues – Large attachment queries • Issue • A query containing attachments and other terms is very slow • Diagnose • “Optimising query” status is displayed for a long time • Cause • The search engine is re-organising the query: (a AND b) AND (c OR d OR e OR f OR g) becomes (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g) • Fix • Rewrite the query optimiser
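The re-organisation above is a distribution of AND over OR. A rough sketch of that expansion, using a hypothetical `distribute` helper rather than anything from the EMu optimiser, shows how quickly the number of clauses grows:

```python
from itertools import product

def distribute(and_terms, or_groups):
    """Expand and_terms AND (group 1) AND (group 2) ... into a list of
    AND clauses that are ORed together, one per combination of choices."""
    return [list(and_terms) + list(choice) for choice in product(*or_groups)]

clauses = distribute(["a", "b"], [["c", "d", "e", "f", "g"]])
for clause in clauses:
    print(" AND ".join(clause))

# Two OR groups of 100 alternatives each already give 10,000 clauses.
big = distribute(["a"], [range(100), range(100)])
print(len(big))  # 10000
```

An attachment search can contribute one alternative per matching IRN, so a query with a large attachment result set expands into a correspondingly large number of clauses.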
References • EMu 4.0.01 Release Notes • System Tuning • Configuration • Range Indexing • www.kesoftware.com/downloads/EMu/documents/configuration.pdf • www.kesoftware.com/downloads/EMu/documents/Range Indexing/rangeindexing.pdf