
Oracle Database 11g New Search Features and Roadmap

Roger Ford

Senior Principal Product Manager

Contents

Oracle’s Search Products

Oracle Text 11g New Features

Oracle Text 11.2.0.2 New Features

Entity Extraction

Name Search

Result Set Interface

Search Product Roadmap

Oracle Text

Secure Enterprise Search


Oracle’s Search Products

Oracle Text

A SQL and PL/SQL based toolkit for creating full-text search applications

Free with all database versions

Previously known as Context Option, interMedia Text

Secure Enterprise Search

A complete search based on Oracle Text capabilities

Crawlers for datasources such as web, email, document repositories, databases

End-user query application and APIs for embedding

Oracle Text 11g New Features

Composite Domain Indexes and SDATA sections

Allows storage of structured information (e.g. numbers, dates) within the text index

Makes for much faster “mixed” queries

Auto Lexer

Automatic Language Recognition

Segmentation and Stemming for 32 languages

Context-sensitive stemming for 23 of these languages

Off-line and time-limited index creation

Enables rebuild of indexes offline in quiet periods for true 24x7 operation

11.2.0.2 New Features - Summary
  • Entity Extraction
    • Find “entities” such as people, countries, cities, states, zip codes, phone numbers etc from the text
    • Use default dictionary and rules or define your own dictionary and rules based on regular expressions
  • Name Search (NDATA sections)
    • Inexact searches, copes with mis-spellings, segmentation errors, contractions and word reversal
    • Useful for many searches, but particularly good for names
  • ResultSet Interface
    • Query request in XML and results returned as XML
    • Avoids SQL layer and requirement to work within “SELECT” semantics
Entity Extraction
  • Identify names, places, dates, times, etc
  • Tag each occurrence with type and subtype
  • Entities are defined by DICTIONARY and RULES
  • Implemented by CTX_ENTITY package
    • create_extract_policy – create a policy to which you can add extract rules
      • Choose to use/not use built in rules and dictionary
    • add_extract_rule – create an XML-based rule to define an entity
    • add_stop_entity – prevent defined entities from being used
    • compile – build the policy with its rules
    • extract – get an XML-based list of entities for a doc
  • Can also use ctxload to load a user dictionary
Entities: built-in types
  • building
  • city
  • company
  • country
  • currency
  • date
  • day
  • email_address
  • geo_political
  • holiday
  • location_other
  • month
  • non_profit
  • organization_other
  • percent
  • person_jobtitle
  • person_name
  • person_other
  • phone_number
  • postal_address
  • product
  • region
  • ssn
  • state
  • time_duration
  • tod
  • url
  • zip_code
Entity Extraction – Example 1: Defaults

ctx_entity.create_extract_policy('mypolicy');

ctx_entity.compile('mypolicy');

ctx_entity.extract('mypolicy', mydoc, mylang, myresults);

Output in "myresults":

<entities>

<entity id="0" offset="75" length="8" source="SuppliedDictionary">

<text>New York</text>

<type>city</type>

</entity>

<entity id="1" offset="55" length="16" source="SuppliedRule">

<text>Hupplewhite Inc.</text>

<type>company</type>

</entity>

</entities>
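The XML returned in "myresults" can be unpacked on the client side. The following is a minimal sketch in Python, assuming the result CLOB has already been fetched into a string. The sample text simply reproduces the output above, and the helper name parse_entities is illustrative, not part of Oracle Text.

```python
import xml.etree.ElementTree as ET

# Sample copied from the slide output; in practice this would be the CLOB text.
SAMPLE = """<entities>
  <entity id="0" offset="75" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="55" length="16" source="SuppliedRule">
    <text>Hupplewhite Inc.</text>
    <type>company</type>
  </entity>
</entities>"""


def parse_entities(xml_text):
    """Return one dict per <entity>, with id, offset and length as integers."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.findall("entity"):
        found.append({
            "id": int(node.get("id")),
            "offset": int(node.get("offset")),
            "length": int(node.get("length")),
            "source": node.get("source"),
            "text": node.findtext("text"),
            "type": node.findtext("type"),
        })
    return found


entities = parse_entities(SAMPLE)
# Sort by offset so entities are listed in document order.
for ent in sorted(entities, key=lambda e: e["offset"]):
    print(ent["offset"], ent["type"], ent["text"])
```

Note that the entities in the sample are not returned in document order (offset 75 precedes offset 55), which is why the sketch sorts before printing.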

Entity Extraction – Example 2: User rule

ctx_entity.create_extract_policy('mypolicy');

ctx_entity.add_extract_rule('mypolicy', 5, '<rule> <expression>((North|South)? America)</expression>

  <type refid="1">xContinent</type>

</rule>');

ctx_entity.compile('mypolicy');

ctx_entity.extract('mypolicy', mydoc, mylang, myresults);

Note the parentheses around the whole expression. refid="1" means take the first parenthesized group, so the entity text is "North America", "South America" or just "America".

User-defined types must be prefixed with an "x", hence "xContinent".

<entities>

<entity id="0" offset="75" length="13" source="UserRule">

<text>North America</text>

<type>xContinent</type>

</entity>

</entities>
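To see what refid="1" selects, the same expression can be tried in an ordinary regular expression engine. This sketch uses Python's re module as a stand-in, so Oracle's own matcher may differ in details; the helper name first_match and the sample sentences are made up for illustration.

```python
import re

# Same expression as in the rule; group 1 is the outer pair of parentheses.
RULE = re.compile(r"((North|South)? America)")


def first_match(text):
    """Return (matched_text, offset, length) for group 1, or None if no match."""
    m = RULE.search(text)
    if m is None:
        return None
    return m.group(1), m.start(1), len(m.group(1))


print(first_match("Sales grew in North America last year"))
print(first_match("Exports to America rose"))
print(first_match("Exports to Europe rose"))
```

In this engine the bare form matches with its leading space (" America"), because the optional prefix is skipped while the space before "America" is still required by the expression.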

Ent Ext: Adding a user dictionary
  • Create file ud.xml:

<dictionary> <entities>

<entity> <value>Dow Jones Industrial Average</value> <type>xIndex</type> </entity>

<entity> <value>S&amp;P 500</value> <type>xIndex</type> </entity>

</entities> </dictionary>

  • Create the policy with CTXLOAD (can add rules later)

ctxload -user scott/tiger -extract -name pol1 -file ud.xml

  • Compile the policy

ctx_entity.compile('pol1');

  • Results

<entity id="69" offset="1010" length="7" source="UserDictionary">

<text>S&amp;P 500</text>

<type>xIndex</type>

</entity>
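A dictionary file such as ud.xml is easy to get wrong by hand (unclosed tags, an unescaped "&"). One option is to generate it. This is a minimal sketch in Python that assumes only the element layout shown above; build_dictionary is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_dictionary(entries):
    """Build the dictionary XML from (value, type) pairs and return it as a string."""
    root = ET.Element("dictionary")
    entities = ET.SubElement(root, "entities")
    for value, type_name in entries:
        entity = ET.SubElement(entities, "entity")
        ET.SubElement(entity, "value").text = value
        ET.SubElement(entity, "type").text = type_name
    # The serializer escapes "&" as "&amp;" automatically.
    return ET.tostring(root, encoding="unicode")


xml_text = build_dictionary([
    ("Dow Jones Industrial Average", "xIndex"),
    ("S&P 500", "xIndex"),
])
print(xml_text)
```

The returned string can then be written to ud.xml before running ctxload.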

Entity Extraction – other stuff
  • Extracting only certain entity types:
    • ctx_entity.extract('p1', mydoc, null, myresults, 'city,company,xContinent');
Name Search
  • Searching names has many difficulties
    • Spelling (steven = stephen)
    • Alternate Names (fred = alfred, chuck = charles)
    • Transcription (copying from spoken to written form)
    • Transliteration (copying from one writing system to another)
    • Segmentation (Mary Jane, Maryjane)
    • First, Middle, and Last Name Classification
  • Name search does intelligent matching across all these issues
NDATA section type
  • Basic implementation for name search
  • Limitations
    • 511 characters
    • 255 whitespace-delimited terms
    • No offset information, therefore no:
      • Highlighting / Markup
      • NEAR or phrase search with NDATA
  • Uses WORDLIST preference attributes:
    • NDATA_ALTERNATE_SPELLING
    • NDATA_BASE_LETTER
    • NDATA_THESAURUS (for alternate names – default thesaurus provided)
    • NDATA_JOIN_PARTICLES (list such as 'de:du:mc:mac')
  • Query Syntax
    • NDATA(fieldname, search terms [, order [, proximity ] ] )
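To make the matching ideas concrete, here is a toy sketch in Python of three of the issues listed: base-letter folding, joining particles, and tolerance of segmentation and word reversal. It is not Oracle's algorithm. The functions base_letters, name_tokens and same_name are invented for illustration, and only the particle list format ('de:du:mc:mac') comes from the slide.

```python
import unicodedata

# Particle list in the colon-separated form shown for NDATA_JOIN_PARTICLES.
JOIN_PARTICLES = set("de:du:mc:mac".split(":"))


def base_letters(text):
    """Strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_tokens(name):
    """Lower-case, fold to base letters, and glue particles onto the next term."""
    tokens, pending = [], ""
    for tok in base_letters(name).lower().split():
        if tok in JOIN_PARTICLES:
            pending += tok              # hold the particle until the next term
        else:
            tokens.append(pending + tok)
            pending = ""
    if pending:                         # trailing particle with nothing to join
        tokens.append(pending)
    return tokens


def same_name(a, b):
    """True if the names match ignoring word order or ignoring segmentation."""
    ta, tb = name_tokens(a), name_tokens(b)
    return sorted(ta) == sorted(tb) or "".join(ta) == "".join(tb)


print(same_name("Mary Jane", "Maryjane"))
print(same_name("José Du Pont", "dupont jose"))
print(same_name("Steven", "Stephen"))
```

The last comparison fails in this toy version: alternate spellings and alternate names need a spelling model or thesaurus, which is the role the slide assigns to NDATA_ALTERNATE_SPELLING and NDATA_THESAURUS.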
Result Set Interface
  • Some queries are difficult to express in SQL:
    • eg "Give me the top 5 hits in each category"
  • Result set interface uses a simple text query and an XML result set descriptor
  • Hitlist is returned in XML according to result set descriptor
  • Uses SDATA sections for
    • Grouping
    • Counting
Result Set Example Query

ctx_query.result_set('docidx', 'oracle',

'<ctx_result_set_descriptor>

<count/>

<hitlist start_hit_num="1" end_hit_num="2" order="pubDate desc, score desc">

<score/> <rowid/>

<sdata name="author"/>

<sdata name="pubDate"/>

</hitlist>

<group sdata="pubDate">

<count/>

</group>

<group sdata="author">

<count/>

</group>

</ctx_result_set_descriptor> ', rs);
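When the hit range, sort order or grouping columns vary per request, the descriptor can be assembled programmatically rather than by string concatenation. This is a sketch in Python that reproduces the element and attribute names used above; build_descriptor and its parameters are illustrative assumptions, not an Oracle API.

```python
import xml.etree.ElementTree as ET


def build_descriptor(start, end, order, sdata_fields, group_fields):
    """Return a result set descriptor like the one in the example, as a string."""
    root = ET.Element("ctx_result_set_descriptor")
    ET.SubElement(root, "count")                      # total hit count
    hitlist = ET.SubElement(root, "hitlist", {
        "start_hit_num": str(start),
        "end_hit_num": str(end),
        "order": order,
    })
    ET.SubElement(hitlist, "score")
    ET.SubElement(hitlist, "rowid")
    for name in sdata_fields:                         # SDATA values per hit
        ET.SubElement(hitlist, "sdata", {"name": name})
    for name in group_fields:                         # one count per group value
        group = ET.SubElement(root, "group", {"sdata": name})
        ET.SubElement(group, "count")
    return ET.tostring(root, encoding="unicode")


descriptor = build_descriptor(
    1, 2, "pubDate desc, score desc",
    sdata_fields=["author", "pubDate"],
    group_fields=["pubDate", "author"],
)
print(descriptor)
```

The resulting string would be passed as the third argument of ctx_query.result_set, in place of the literal shown above.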

Result Set Output

<ctx_result_set>

<hitlist>

<hit>

<score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>

<sdata name="AUTHOR">John</sdata>

<sdata name="PUBDATE">2001-01-03 00:00:00</sdata>

</hit>

<hit>

<score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>

<sdata name="AUTHOR">John</sdata>

<sdata name="PUBDATE">2001-01-03 00:00:00</sdata>

</hit>

</hitlist>

<count>100</count>

Result Set Output - Continued

<groups sdata="PUBDATE">

<group value="2001-01-01 00:00:00"><count>25</count></group>

<group value="2001-01-02 00:00:00"><count>50</count></group>

<group value="2001-01-03 00:00:00"><count>25</count></group>

</groups>

<groups sdata="AUTHOR">

<group value="John"><count>50</count></group>

<group value="Mike"><count>25</count></group>

<group value="Steve"><count>25</count></group>

</groups>

</ctx_result_set>
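The result set is plain XML, so a client can turn it into native structures with any XML parser. This is a minimal sketch in Python, assuming the CLOB has been read into a string; the sample joins the two output slides above, and parse_result_set is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ctx_result_set>
<hitlist>
<hit><score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
<sdata name="AUTHOR">John</sdata>
<sdata name="PUBDATE">2001-01-03 00:00:00</sdata></hit>
<hit><score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
<sdata name="AUTHOR">John</sdata>
<sdata name="PUBDATE">2001-01-03 00:00:00</sdata></hit>
</hitlist>
<count>100</count>
<groups sdata="PUBDATE">
<group value="2001-01-01 00:00:00"><count>25</count></group>
<group value="2001-01-02 00:00:00"><count>50</count></group>
<group value="2001-01-03 00:00:00"><count>25</count></group>
</groups>
<groups sdata="AUTHOR">
<group value="John"><count>50</count></group>
<group value="Mike"><count>25</count></group>
<group value="Steve"><count>25</count></group>
</groups>
</ctx_result_set>"""


def parse_result_set(xml_text):
    """Split a result set into total count, hit rows and per-group counts."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.findall("hitlist/hit"):
        row = {"score": int(hit.findtext("score")),
               "rowid": hit.findtext("rowid")}
        for sd in hit.findall("sdata"):
            row[sd.get("name")] = sd.text
        hits.append(row)
    groups = {}
    for grp in root.findall("groups"):
        groups[grp.get("sdata")] = {
            g.get("value"): int(g.findtext("count"))
            for g in grp.findall("group")
        }
    # findtext("count") only looks at direct children, i.e. the total count.
    return {"count": int(root.findtext("count")), "hits": hits, "groups": groups}


result = parse_result_set(SAMPLE)
print(result["count"], len(result["hits"]), result["groups"]["AUTHOR"])
```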

Roadmap – merging Text and SES

  • Oracle Text: Full Control
    • Fine-grained Index Options
    • Data Storage Options
    • Lexer Options
    • Stoplists
    • Use existing database
    • RAC, Exadata
  • Secure Enterprise Search: Full Featured
    • Built in database and mid-tier
    • Crawlers for many sources
    • Simple Query Interface
    • End user GUI / API
  • Embedded security
Coming Search Features
  • Natural Language Processing enhancements
    • Ontology based classification
    • Question answering
  • Automatic Partitioning
    • Query load balancing
  • Full support for faceted navigation (MVDATA sections)
  • Functional completeness for Result Set Interface
    • Result Iterator – streaming support
    • Parallel Query
  • Replication Support
    • Golden Gate / Logical Standby / Streams
  • Operator improvements
    • NEAR2 – best query in one operator
    • MNOT – mild not, eg YORK mnot NEW YORK
    • Nested near
  • Substring index and query performance improvements
Coming Search Features - Continued
  • Multiple enhancements to query performance
    • BIGIO leverages Secure Files CLOBs
    • Automatic optimization of indexes with “stage index”
    • Two level index – keep common search terms in memory
  • Partition maintenance without reindexing
  • Off-load filtering from database server
  • Section specific index options
    • Choose different options, eg language, stopwords, PRINTJOINS for each section
  • Regular expression based stopwords
  • Forward Index
    • Hugely improved performance for highlighting, snippets
  • PDF “Native” Highlighting
  • Unlimited SDATA, MDATA and Field Sections

The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.