1 / 59

Oracle Text Search saves your time

Oracle Text Search saves your time. Anna Suwalska European Organization for Nuclear Research - Geneva. OracleWorld Paris 2003. Content. CERN Engineering Data Management System at CERN Oracle Text How we profit from this technology Conclusion. Content. CERN.

meredith
Download Presentation

Oracle Text Search saves your time

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Oracle Text Search saves your time Anna Suwalska European Organization for Nuclear Research - Geneva OracleWorld Paris 2003

  2. Content CERN Engineering Data Management System at CERN Oracle Text How we profit from this technology Conclusion

  3. Content CERN

  4. CERN - European Organization for Nuclear Research • The world’s largest particle physics research laboratory • Founded in 1954, CERN has today 20 member states • 2400 staff • Over 6500 scientists come here to use research facilities • 500 universities, over 80 nationalities • CERN explores what matter is made of, and what forces hold it together • WWW was born here

  5. LHC - The Large Hadron Collider Project

  6. LHC - Cryodipole

  7. Content EDMS Engineering Data Management System

  8. EDMS - Engineering Data Management System EDMS Portal EDMS Common layer Axalant MP5 Other DB’s Design Data Documents and Drawings Asset tracking Work management

  9. Structures EDMS - Engineering Data Management System Managing

  10. EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents.

  11. EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning

  12. EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector)

  13. EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies

  14. EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies Equipment workflow, data

  15. EDMS - Engineering Data Management System Managing Structures Complete life-cycle for a single/compound documents. Document versioning Document approval processes (comments collector) Assemblies Equipment workflow, data Installation (jobs, locations, etc..)

  16. Provide an efficient search tool to support with requirements above - our choice Oracle Text EDMS mandate Manage a full description of the LHC project’s engineering data over it’s lifetime (>25 years) Design Installation Operation Dismantling A full description of the machine and its components through their lifecycle must be constantly available for all concerned parties Support and coordinate engineering work / information / data workflow Help tracing solutions to all problems occurring in the machine Establish a knowledge transfer: evolving staff, many short time visitors Operation

  17. Oracle Text – our choice Our needs

  18. Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential

  19. Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both

  20. Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both Performance: Response time is very important

  21. Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential Bi-lingual : Official CERN languages are English and French. We have to support both Performance: Response time is very important Simplicity: Simple for users Simple to develop Simple to maintain

  22. Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential  Oracle Text supports most of the document formats Bi-lingual : Official CERN languages are English and French. We have to support both  Oracle text supports 39 languages Performance: Response time is very important  Very efficient for searches within big collection of data Simplicity: Simple for users Simple to develop Simple to maintain Results with scoring methodology to help navigate through a result Standard SQL statements Easy to maintain with ALTER INDEX or CTX_DDL packages 

  23. Oracle Text – our choice Our needs Index metadata & files : First line search is done on meta data, however the possibility to index files is essential  Oracle Text supports most of the document formats Bi-lingual : Official CERN languages are English and French. We have to support both  Oracle text supports 39 languages Performance: Response time is very important  Very efficient for searches within big collection of data Simplicity: Simple for users Simple to develop Simple to maintain Results with scoring methodology to help navigate through a result Standard SQL statements Easy to maintain with ALTER INDEX or CTX_DDL packages   Oracle text comes as an option in RDBMS - no additional costs

  24. Content Oracle Text

  25. Oracle Text Oracle text Uses standard SQL Enables the building of a Text Query Application and a Document Classification Application Takes care of: indexing searching: word and theme viewing text

  26. CONTEXT Index Creation CREATE INDEX index_name ON table_name(column_name) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS(‘parameters string’); [datastore datastore_pref] [filter filter_pref] [charset column charset_column_name] [format column format_column_name] [lexer lexer_pref] [language column language_column_name] [wordlist wordlist_pref] [storage storage_pref] [stoplist stoplist] [section group section_group] [memory memsize] [populate | nopopulate]

  27. Types of indexes Index Query Operator Characteristics Used for CONTEXT CONTAINS CLOB, BLOB, BFILE, CHAR, VARCHAR2, XML On text column Most complete of all 3 types. Large coherent documents CTXCAT CATSEARCH CHAR, VARCHAR2 Combined index on a text column and one or more other columns. Transactional – no need for synchronizing when DML. Creating can be longer because of the sub-indexes. Supports: INDEX SET, LEXER*, STOPLIST, STORAGE, WORDLIST* Has it’s own query language. For indexing small text fragments and related information. To improve mixed query performance CTXRULE MATCHES VARCHAR2, CLOB On column containing a set of queries. Supports: LEXER (only BASIC) Does not support number of operators. Building a document classification application

  28. Index Maintenance & Optimization ALTER INDEX index_name REBUILD [ONLINE][PARAMETERS(parameters string)]; ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize fast’); ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize full maxtime10’); ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS(‘optimize full’);

  29. Index Maintenance & Optimization ALTER INDEX index_name REBUILD [ONLINE][PARAMETERS(parameters string)]; CTX_DDL package CTX_DDL.OPTIMIZE_INDEX CTX_DDL.SYNC_INDEX

  30. DML processing INSERT A new row inserted in DR$PENDING queue, not available for query before synchronization UPDATE Existing ROWID is placed in DR$PENDING, neither new nor old content is available for query before synchronization DELETE The row is immediately unavailable for query(marked as invalid), but only removed when optimization complete CTX_USER_PENDING (CTX_PENDING) view To check records waiting for synchronization

  31. Scoring “To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.” Oracle Text Reference, Release 9.0.1 Example In data set: M number of occurrences of TERM1, N number of occurrences of TERM2 M >> N Document having equal (n-occurrences) of TERM1 and TERM2 Result SCORE for querying TERM1 < SCORE for querying TERM2

  32. Query Operators Boolean OR NOT MINUS AND lhc AND magnet AND NOT cryogenic Linguistics SYNonym ABOUT STEM Translation Term Broader, Narrower, Preferred, Related Term SYN (science) ABOUT (particle) Others FUZZY NEAR SOUNDEX WITHIN SQE begin ctx_query.store_sqe ( ‘particle‘ , ’atom, molecule proton’ ); end; ‘SQE (particle)’

  33. CTX packages CTX_ADMIN Administer servers and the data dictionary (only ctxsys user) Create and manage the preferences, section groups, stoplists, manage indexes CTX_DDL CTX_DOC Document presentation features (only for CONTAINS indexes) CTX_OUTPUT Manage logs for the indexes Generating query feedback, counting hits, and creating SQE (stored query expressions) CTX_QUERY CTX_THES Manage and browse thesaurus

  34. Content How we profit from this technology

  35. EDMS metadata index preferences

  36. EDMS search for both languages accelerateur lhc méthode Version 1.5

  37. EDMS metadata index preferences

  38. Escaping characters to query them To be able to query on reserved words or symbols such as “minus”, “-” , “near” they must be escaped. There are 2 methods to escape the character, using “{}” or “\” . When using: We had to hardcode it for each symbol and word. A standard “dictionary table” with the reserved characters would be useful.

  39. It is important to know how users will search the data and what kind of data you are going to index before you actually do it. EDMS metadata index preferences

  40. Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Hardware & system: Two node cluster based on two Sun SPARC 450, running Solaris 2.6 + Sun Cluster 2.1 RDBMS: 8.1.7.4 ~500 MB SGA size 60-80 concurrent users (during working hours)

  41. Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Index synchronization: every 10 min, takes a few seconds PROCEDURE rebuild_metedata_ctx IS BEGIN EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(' ' sync ' ')'); END; Index optimization: every weekend, takes ~30 min PROCEDURE optimize_metedata_ctx IS BEGIN EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(' ' optimize full' ')'); END;

  42. Test documents 74 000 Drawings 266 500 New documents (monthly) 4 000 Document updates (monthly) 4 000 Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total 88 800 / 148 000 Files (GB) / total 45 / 78 EDMS Index Maintenance & Optimization Environment Meta data Files Synchronize every 24h ? Optimize (fast, full) every month?

  43. Scoring SQL> SELECT c_id,score(10) FROM compound_doc_info WHERE CONTAINS(c_text,’lhc’,10)>0 AND c_id = 1738594907; C_ID SCORE(10) ------------------ ---------------- 1738594907 9 SQL> SELECT c_id,score(10) FROM compound_doc_info WHERE CONTAINS(c_text,’evolution’,10)>0 AND c_id = 1738594907; C_ID SCORE(10) ------------------ ---------------- 1738594907 15

  44. Using the thesaurus DECLARE xtab ctx_thes.exp_tab; …. BEGIN ctxsys.ctx_thes.rt(xtab,p_term,’edms_thes’); FOR i IN 1..xtab.COUNT LOOP IF xtab(i).xrel = C_RELETED_TERM THEN htp.anchor ( L_DOC_SEARCH ||'?cookie=' ||cookie ||'&p_search_type=' ||p_search_type ||'&p_free_text=' ||LOWER(xtab(i).xphrase) ,LOWER(xtab(i).xphrase) ); END IF; END LOOP; END; Propose the RT (Related Term) if nothing found with the original term(s). Would be nice to have a spell checker corrector, using existing tokens.

  45. Using the thesaurus - example

  46. Using the thesaurus - example

  47. Querying with Oracle Text versus standard SQL …WHERE CONTAINS (c_text, p_free_text) > 0; Total 83 ms

  48. Querying with Oracle Text versus standard SQL … WHERE UPPER(c_text) LIKE '%’||UPER(p_free_text)||’%’ Total 03.98s

  49. Querying with Oracle Text versus standard SQL p_free_text is a single word or an exact sentence Tool Oracle Text Standard SQL Characteristics Fast. Underperforming. Statement WHERE CONTAINS (c_text,p_free_text) > 0 WHERE UPPER(c_text) LIKE '%’||UPER(p_free_text)||’%’ Time * 83 ms (821ms) 03.98s (39.14s) * Tests done with TOra 1.3.8 (in parentheses repeated 10x)

  50. Querying with Oracle Text versus standard SQL p_free_text is an expression with OR operator Tool Oracle Text Standard SQL Characteristics Fast. Underperforming. Statement WHERE CONTAINS (c_text,p_free_text) > 0 WHERE ( UPPER(c_text) LIKE '%’||UPPER(p_text_1)||’%’ OR UPPER(c_text) LIKE '%’||UPER(p_text_2)||’%’ ) Time * 103ms (01:03 ) 09:09 (1:22.09) * Tests done with TOra 1.3.8 (in parentheses repeated 10x)

More Related