Oracle Text Search saves your time

Oracle Text Search saves your time Anna Suwalska European Organization for Nuclear Research - Geneva OracleWorld Paris 2003

Content CERN Engineering Data Management System at CERN Oracle Text How we profit from this technology Conclusion

Content CERN

CERN - European Organization for Nuclear Research • The world’s largest particle physics research laboratory • Founded in 1954, CERN has today 20 member states • 2400 staff • Over 6500 scientists come here to use research facilities • 500 universities, over 80 nationalities • CERN explores what matter is made of, and what forces hold it together • WWW was born here

LHC - The Large Hadron Collider Project

LHC - Cryodipole

Content EDMS Engineering Data Management System

EDMS - Engineering Data Management System EDMS Portal EDMS Common layer Axalant MP5 Other DB’s Design Data Documents and Drawings Asset tracking Work management

Structures EDMS - Engineering Data Management System Managing

EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents.

EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning

EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector)

EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies

EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies Equipment workflow, data

EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies Equipment workflow, data Installation (jobs, locations, etc..)

Provide an efficient search tool to support with requirements above - our choice Oracle Text EDMS mandate Manage a full description of the LHC project’s engineering data over it’s lifetime (>25 years) Design Installation Operation Dismantling A full description of the machine and its components through their lifecycle must be constantly available for all concerned parties Support and coordinate engineering work / information / data workflow Help tracing solutions to all problems occurring in the machine Establish a knowledge transfer: evolving staff, many short time visitors Operation

Oracle Text – our choice Our needs

Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential

Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both

Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both Performance: Response time is very important

Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both Performance: Response time is very important Simplicity: Simple for users Simple to develop Simple to maintain

Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential  Oracle Text supports most of the document formats Bi-lingual : Official CERN languages are English and French. We have to support both  Oracle text supports 39 languages Performance: Response time is very important  Very efficient for searches within big collection of data Simplicity: Simple for users Simple to develop Simple to maintain Results with scoring methodology to help navigate through a result Standard SQL statements Easy to maintain with ALTER INDEX or CTX_DDL packages 

Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential  Oracle Text supports most of the document formats Bi-lingual : Official CERN languages are English and French. We have to support both  Oracle text supports 39 languages Performance: Response time is very important  Very efficient for searches within big collection of data Simplicity: Simple for users Simple to develop Simple to maintain Results with scoring methodology to help navigate through a result Standard SQL statements Easy to maintain with ALTER INDEX or CTX_DDL packages   Oracle text comes as an option in RDBMS - no additional costs

Content Oracle Text

Oracle Text Oracle text Uses standard SQL Enables the building of a Text Query Application and a Document Classification Application Takes care of: indexing searching: word and theme viewing text

CONTEXT Index Creation CREATE INDEX index_name ON table_name(column_name) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS(‘parameters string’); [datastore datastore_pref] [filter filter_pref] [charset column charset_column_name] [format column format_column_name] [lexer lexer_pref] [language column language_column_name] [wordlist wordlist_pref] [storage storage_pref] [stoplist stoplist] [section group section_group] [memory memsize] [populate | nopopulate]

Types of indexes Index Query Operator Characteristics Used for CONTEXT CONTAINS CLOB, BLOB, BFILE, CHAR, VARCHAR2, XML On text column Most complete of all 3 types. Large coherent documents CTXCAT CATSEARCH CHAR, VARCHAR2 Combined index on a text column and one or more other columns. Transactional – no need for synchronizing when DML. Creating can be longer because of the sub-indexes. Supports: INDEX SET, LEXER*, STOPLIST, STORAGE, WORDLIST* Has it’s own query language. For indexing small text fragments and related information. To improve mixed query performance CTXRULE MATCHES VARCHAR2, CLOB On column containing a set of queries. Supports: LEXER (only BASIC) Does not support number of operators. Building a document classification application

Index Maintenance & Optimization ALTER INDEX index_name REBUILD [ONLINE][PARAMETERS(parameters string)]; ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize fast’); ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize full maxtime10’); ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize full’);

Index Maintenance & Optimization ALTER INDEX index_name REBUILD [ONLINE][PARAMETERS(parameters string)]; CTX_DDL package CTX_DDL.OPTIMIZE_INDEX CTX_DDL.SYNC_INDEX

DML processing INSERT A new row inserted in DR$PENDING queue, not available for query before synchronization UPDATE Existing ROWID is placed in DR$PENDING, neither new nor old content is available for query before synchronization DELETE The row is immediately unavailable for query(marked as invalid), but only removed when optimization complete CTX_USER_PENDING (CTX_PENDING) view To check records waiting for synchronization

Scoring “To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.” Oracle Text Reference, Release 9.0.1 Example In data set: M number of occurrences of TERM1, N number of occurrences of TERM2 M >> N Document having equal (n-occurrences) of TERM1 and TERM2 Result SCORE for querying TERM1 < SCORE for querying TERM2

Query Operators Boolean OR NOT MINUS AND lhc AND magnet AND NOT cryogenic Linguistics SYNonym ABOUT STEM Translation Term Broader, Narrower, Preferred, Related Term SYN (science) ABOUT (particle) Others FUZZY NEAR SOUNDEX WITHIN SQE begin ctx_query.store_sqe ( ‘particle‘ , ’atom, molecule proton’ ); end; ‘SQE (particle)’

CTX packages CTX_ADMIN Administer servers and the data dictionary (only ctxsys user) Create and manage the preferences, section groups, stoplists, manage indexes CTX_DDL CTX_DOC Document presentation features (only for CONTAINS indexes) CTX_OUTPUT Manage logs for the indexes Generating query feedback, counting hits, and creating SQE (stored query expressions) CTX_QUERY CTX_THES Manage and browse thesaurus

Content How we profit from this technology

EDMS metadata index preferences

EDMS search for both languages accelerateur lhc méthode Version 1.5

EDMS metadata index preferences

Escaping characters to query them To be able to query on reserved words or symbols such as “minus”, “-” , “near” they must be escaped. There are 2 methods to escape the character, using “{}” or “\” . When using: We had to hardcode it for each symbol and word. A standard “dictionary table” with the reserved characters would be useful.

It is important to know how users will search the data and what kind of data you are going to index before you actually do it. EDMS metadata index preferences

Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Hardware & system: Two node cluster based on two Sun SPARC 450, running Solaris 2.6 + Sun Cluster 2.1 RDBMS: 8.1.7.4 ~500 MB SGA size 60-80 concurrent users (during working hours)

Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Index synchronization: every 10 min, takes a few seconds PROCEDURE rebuild_metedata_ctx IS BEGIN EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(' ' sync ' ')'); END; Index optimization: every weekend, takes ~30 min PROCEDURE optimize_metedata_ctx IS BEGIN EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(' ' optimize full' ')'); END;

Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Synchronize every 24h ? Optimize (fast, full) every month?

Scoring SQL> SELECT c_id,score(10) FROM compound_doc_info WHERE CONTAINS(c_text,’lhc’,10)>0 AND c_id = 1738594907; C_ID SCORE(10) ------------------ ---------------- 1738594907 9 SQL> SELECT c_id,score(10) FROM compound_doc_info WHERE CONTAINS(c_text,’evolution’,10)>0 AND c_id = 1738594907; C_ID SCORE(10) ------------------ ---------------- 1738594907 15

Using the thesaurus DECLARE xtab ctx_thes.exp_tab; …. BEGIN ctxsys.ctx_thes.rt(xtab,p_term,’edms_thes’); FOR i IN 1..xtab.COUNT LOOP IF xtab(i).xrel = C_RELETED_TERM THEN htp.anchor ( L_DOC_SEARCH ||'?cookie=' ||cookie ||'&p_search_type=' ||p_search_type ||'&p_free_text=' ||LOWER(xtab(i).xphrase) ,LOWER(xtab(i).xphrase) ); END IF; END LOOP; END; Propose the RT (Related Term) if nothing found with the original term(s). Would be nice to have a spell checker corrector, using existing tokens.

Using the thesaurus - example

Querying with Oracle Text versus standard SQL …WHERE CONTAINS (c_text, p_free_text) > 0; Total 83 ms

Querying with Oracle Text versus standard SQL … WHERE UPPER(c_text) LIKE '%’||UPER(p_free_text)||’%’ Total 03.98s

Querying with Oracle Text versus standard SQL p_free_text is a single word or an exact sentence Tool Oracle Text Standard SQL Characteristics Fast. Underperforming. Statement WHERE CONTAINS (c_text,p_free_text) > 0 WHERE UPPER(c_text) LIKE '%’||UPER(p_free_text)||’%’ Time * 83 ms (821ms) 03.98s (39.14s) * Tests done with TOra 1.3.8 (in parentheses repeated 10x)

Querying with Oracle Text versus standard SQL p_free_text is an expression with OR operator Tool Oracle Text Standard SQL Characteristics Fast. Underperforming. Statement WHERE CONTAINS (c_text,p_free_text) > 0 WHERE ( UPPER(c_text) LIKE '%’||UPPER(p_text_1)||’%’ OR UPPER(c_text) LIKE '%’||UPER(p_text_2)||’%’ ) Time * 103ms (01:03 ) 09:09 (1:22.09) * Tests done with TOra 1.3.8 (in parentheses repeated 10x)

Oracle Text Search saves your time

Oracle Text Search saves your time

Presentation Transcript

Studying Online -Saves Time And Money

Time Text Set

Your text Your text Your text Your text

Saves

ASCP Membership Saves time Saves money Eliminates paperwork

Oracle Text Operations

Full Text Search

Search Text-Field

Oracle Text

The sleek time saves nine

Liquid Funds in India - An Investment In Time, Saves Your Time

How Cloud Computing Saves Your Time & Money

How POS Software business saves your time?

Barcode Verifier Saves Time and Money

3D Rendering Saves Time and Money

The ashing system which saves time

Best Courier Service That Saves Your Time And Money!

Saves your Time with Book Maid Online

Oracle Text Search saves your time

Samsung Washing Machine Repair Saves Time!

Salesforce Lead Routing saves time