EMu Searching Explained (What’s going on under the hood!)

Bernard Marshall, Chief Technical Officer, KE Software

Overview • The basic theory • Tools and tuning • Searching issues

EMu search mechanism • Two level superimposed coding scheme for partial match retrieval • Developed from research at the University of Melbourne (early 1980s) • Designed to provide very high speed retrieval from very large datasets • The more search terms provided, the faster the search time • One set of indexes for all searching (except key searches)

Record Descriptor • Encodes the contents of one record into a single bit string • Descriptors stored sequentially in the rec file • Each record descriptor has the data offset (from the data file) appended

[Diagram: a term and its column number are passed to a pseudo-random number generator, which produces the k bit numbers to set in a descriptor of b bits]

Record descriptor (searching) • Generate record descriptor for search term(s) • AND with all record descriptors to find matching record(s)
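The encoding and the AND test above can be sketched as follows. This is a minimal illustration of superimposed coding, not EMu's implementation: the function names are hypothetical, and a SHA-256 hash stands in for the pseudo-random number generator, whose actual algorithm the slides do not describe.

```python
import hashlib

def term_bits(term: str, column: int, k: int, b: int) -> list[int]:
    """Return k distinct bit positions in [0, b) for a (column, term) pair.

    Assumption: SHA-256 stands in for the pseudo-random number generator.
    Requires k <= b.
    """
    positions: list[int] = []
    counter = 0
    while len(positions) < k:
        digest = hashlib.sha256(f"{column}:{term}:{counter}".encode()).digest()
        pos = int.from_bytes(digest[:8], "big") % b
        if pos not in positions:
            positions.append(pos)
        counter += 1
    return positions

def make_descriptor(terms: list[tuple[int, str]], k: int, b: int) -> int:
    """Superimpose (OR together) the bits of every (column, term) pair."""
    desc = 0
    for column, term in terms:
        for pos in term_bits(term, column, k, b):
            desc |= 1 << pos
    return desc

def may_match(record_desc: int, query_desc: int) -> bool:
    """True when every bit set in the query is also set in the record.

    A True result can still be a false match; a False result is definite.
    """
    return record_desc & query_desc == query_desc

# Usage: one record with two indexed terms, queried for one of them.
record = make_descriptor([(1, "vase"), (2, "bronze")], k=5, b=256)
query = make_descriptor([(1, "vase")], k=5, b=256)
print(may_match(record, query))
```

A record that contains the query term always passes the test, because its descriptor includes every bit the query sets.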

False matches • Query descriptor matches a record descriptor that does not contain the search term

False matches • Chance of a false match is related to bit density • The lower the bit density, the lower the probability of a false match • EMu uses a bit density of < 25%; that is, less than 25% of bits are one • Probability of a false match with k = 5 is 1 in 1,024 record descriptors checked for a single term query • Probability for a two term query is 1 in 1,048,576 • Lower bit density requires more disk space and produces longer record descriptors
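The quoted figures can be reproduced with a short calculation. It assumes that bits are set independently and that the bit density d sits at its upper bound of 25%. A record that does not contain the term is a false match only if all k query bits happen to be one, and a two term query sets 2k bits (assuming the two terms share no bits):

```latex
\[
P_{\text{false}}(\text{one term}) = d^{k} = \left(\tfrac{1}{4}\right)^{5} = \frac{1}{1{,}024},
\qquad
P_{\text{false}}(\text{two terms}) = d^{2k} = \left(\tfrac{1}{4}\right)^{10} = \frac{1}{1{,}048{,}576}.
\]
```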

Segment descriptor • Encodes the contents of multiple records into a bit string • Descriptors stored sequentially in the seg file (bitsliced)

Segment descriptor • For each group of records (Nr) a single descriptor is calculated as for a record descriptor • Segment level has its own values for k (number of bits to set) and b (length of bit string)

Segment descriptor (searching) • Segment searching checks Nr records per descriptor • For efficient disk access when searching, the seg file is “flipped” (bitslicing) • Penalty is slower record insertions / updates (an oflow file is used)

Segment descriptor (bitsliced) • Each bit slice is ANDed to determine matching segments • Matching segments are given by bit positions with a value of one
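A small sketch of the bit-sliced layout may help. It assumes that slice j holds bit j of every segment descriptor, so a query only has to read the slices for the bits it sets. The function names are hypothetical, and Python integers stand in for the on-disk slices.

```python
def build_bit_slices(segment_descs: list[int], b: int) -> list[int]:
    """Transpose segment descriptors into b bit slices.

    In slice j, bit i is one when segment i has bit j set.
    """
    slices = [0] * b
    for seg_no, desc in enumerate(segment_descs):
        for j in range(b):
            if (desc >> j) & 1:
                slices[j] |= 1 << seg_no
    return slices

def matching_segments(slices: list[int], query_bits: list[int], n_segments: int) -> list[int]:
    """AND the slices for the query's bits; surviving one bits are matches."""
    result = (1 << n_segments) - 1  # start with every segment as a candidate
    for j in query_bits:
        result &= slices[j]
    return [i for i in range(n_segments) if (result >> i) & 1]

# Usage: three segments with 4-bit descriptors; the query sets bits 0 and 1.
descs = [0b1011, 0b0110, 0b1111]
slices = build_bit_slices(descs, b=4)
print(matching_segments(slices, [0, 1], n_segments=3))  # → [0, 2]
```

Only as many slices are read as the query has bits set, regardless of how many segments exist, which is why the layout suits searching at the cost of slower updates.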

Complete search sequence • Build segment query descriptor for query terms • Search bitslice segment file for list of matching segments • Build record query descriptor for query terms • Search record descriptors in matching segments for matching records • Check each candidate record for an exact match before showing it to the user
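The sequence above can be sketched end to end. This is an in-memory illustration under stated assumptions, not EMu's code: the class and function names are hypothetical, SHA-256 stands in for the pseudo-random number generator, the parameter defaults are arbitrary, and the segment level is scanned directly rather than bit-sliced to keep the example short.

```python
import hashlib

def bits_for(term: str, k: int, b: int) -> set[int]:
    """k distinct pseudo-random bit positions in [0, b) (hypothetical hash)."""
    out: set[int] = set()
    n = 0
    while len(out) < k:
        digest = hashlib.sha256(f"{term}:{n}".encode()).digest()
        out.add(int.from_bytes(digest[:8], "big") % b)
        n += 1
    return out

def descriptor(terms, k: int, b: int) -> int:
    """OR together the bits of every term."""
    desc = 0
    for term in terms:
        for pos in bits_for(term, k, b):
            desc |= 1 << pos
    return desc

class TwoLevelIndex:
    """Segment descriptors over groups of nr records, plus record descriptors."""

    def __init__(self, records: list[set[str]], nr=4, ks=3, bs=512, kr=5, br=256):
        self.records, self.nr = records, nr
        self.ks, self.bs, self.kr, self.br = ks, bs, kr, br
        self.rec_descs = [descriptor(r, kr, br) for r in records]
        self.seg_descs = [
            descriptor(set().union(*records[i:i + nr]), ks, bs)
            for i in range(0, len(records), nr)
        ]

    def search(self, terms: list[str]) -> list[int]:
        seg_q = descriptor(terms, self.ks, self.bs)   # segment query descriptor
        rec_q = descriptor(terms, self.kr, self.br)   # record query descriptor
        hits = []
        for seg_no, seg_d in enumerate(self.seg_descs):
            if seg_d & seg_q != seg_q:
                continue                              # segment cannot match
            start = seg_no * self.nr
            for i in range(start, min(start + self.nr, len(self.records))):
                if self.rec_descs[i] & rec_q != rec_q:
                    continue                          # record cannot match
                if set(terms) <= self.records[i]:     # exact check drops false matches
                    hits.append(i)
        return hits

# Usage: five small records, two records per segment.
records = [{"bronze", "vase"}, {"iron", "sword"}, {"bronze", "sword"},
           {"clay", "vase"}, {"bronze", "vase", "greek"}]
index = TwoLevelIndex(records, nr=2)
print(index.search(["bronze", "vase"]))  # → [0, 4]
```

Because the descriptors can only produce false matches, never missed matches, the final exact check makes the result precise whatever hash is used.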

Number of disk accesses (logical) • For a single search term with one matching record: • ks reads, one per bit set per term (segment level) • 1 read of the matching segment to check its record descriptors • Number of logical reads is independent of the table size • Number of physical reads increases as the table grows (but disk read ahead helps here)

Client query evaluation • Attachment searches performed and matching IRNs on reference column added to query statement • Reverse attachment searches performed and matching reference values added to query statement • Local search terms added to query statement • Also search columns added to query statement • Search performed

What is a term? • A term is the basic index component

Term modifiers • Modifiers alter how the term is indexed

Indexing tools • texdensity • Prints out the bit density for segment and record descriptors • texanalyse • Prints the number of terms per record • texconf • Calculates a suitable index configuration • Adjust configuration parameters manually

Configuration parameters • params file in table directory • Override default configuration parameters • Bit density (rec/seg) • File system block size • False match probability (rec/seg) • Minimum number of records per segment • XML based file

Searching Issues – false matches • Issue • Some queries are slow but disk activity is high • Diagnose • texadmin database usage shows a high number of index false matches • texdensity shows high density or large standard deviation with high maximum density (check seg and rec) • texanalyse shows a large standard deviation for the number of index terms (check seg and rec) • Fix • Reconfigure table • Set configuration parameters manually

Searching Issues – common terms • Issue • Some queries containing common terms are slow • “false” segment matches • Diagnose • Querying on each term individually results in a large number of matches (query is quick) • Querying on the combination of terms becomes slow • Fix • Cluster table on a common term • Sort data before indexing

Searching Issues – block size mismatch • Issue • Overall searching is slow but disk activity is high • Using ZFS with a large record size • Diagnose • Determine the block size of the file system used to hold index files • Use texconf to determine the block size used for indexing • Fix • Set blocksize configuration parameter manually • Adjust ZFS record size to 16K

Searching Issues – RAID configuration • Issue • Record updates are very slow • Fast disks but performance less than optimal • Diagnose • Disk controller or driver is configured to use RAID 5 or 6 • Fix • Optimal performance in a RAID environment is achieved with RAID 1+0 (RAID 10) (stripe/mirror) • Ensure striping agrees with block size of file system • Enable striping where possible

Searching Issues – Range queries slow • Issue • Queries containing ranges are slow • Diagnose • Use emuindexing to check if range indexing is enabled • Fix • Use emurangeupdate to optimise range based searching • Add Registry entries to enable the indexing required: • System|Setting|Table|table|Range Buckets|colname|bucket;...

Searching Issues – Large attachment queries • Issue • Query is very slow when it contains attachments and other terms • Diagnose • “Optimising query” status is displayed for a long time • Cause • The search engine is reorganising the query: (a AND b) AND (c OR d OR e OR f OR g) becomes (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g) • Fix • Rewrite the query optimiser
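The reorganisation described above is the distribution of AND over OR. The sketch below only illustrates why the rewritten query grows with the number of OR'd terms; it is not the EMu query optimiser, and the function names are hypothetical.

```python
def distribute(conjuncts: list[str], disjuncts: list[str]) -> list[list[str]]:
    """Rewrite (c1 AND c2 ...) AND (d1 OR d2 ...) as an OR of AND clauses."""
    return [conjuncts + [d] for d in disjuncts]

def render(clauses: list[list[str]]) -> str:
    """Format the clauses as a query string."""
    return " OR ".join("(" + " AND ".join(clause) + ")" for clause in clauses)

# Usage: the example from the slide.
clauses = distribute(["a", "b"], ["c", "d", "e", "f", "g"])
print(len(clauses))  # → 5
print(render(clauses))
```

Each OR'd term produces its own clause that repeats every AND'd term, so a disjunction with thousands of entries yields thousands of clauses for the engine to build and evaluate.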

References • EMu 4.0.01 Release Notes • System Tuning • Configuration • Range Indexing • www.kesoftware.com/downloads/EMu/documents/configuration.pdf • www.kesoftware.com/downloads/EMu/documents/Range Indexing/rangeindexing.pdf
