EP118 What are you Searching For?

Dmitry Chernizer, Enterprise Systems Architect, Dcherniz@sybase.com. Agenda: In My Generation…, Concept Search Module, Content Search Module, Sample Configurations, Summary. Back in my day, people used to walk to their information. Uphill… both ways!

Presentation Transcript


  1. EP118 What are you Searching For? • Dmitry Chernizer • Enterprise Systems Architect • Dcherniz@sybase.com

  2. Agenda • In My Generation… • Concept Search Module • Content Search Module • Sample Configurations • Summary

  3. The e-Volution of Unstructured Data • Back in my day, people used to walk to their information. Uphill… both ways! • Now everything is at your fingertips: www.Research.com, www.StockAdvice.com, www.NewJob.com • Billions of pages of text, HTML, XML • Most of it useless and outdated information • It’s not about the 20 docs you have; it’s about the 5 pages you need!

  4. The e-Volution of Unstructured Data • How did we get here? • (Diagram labels: Hierarchical, Relational, Text, Word, PDF, HTML, Crash!)

  5. The e-Volution of Unstructured Data: What is knowledge management? • A way to store non-relational (maybe hierarchical) data • A standard way to express complex relationships • Gather data • Process & store it • Query & display it • Ability to assign a life-cycle to a piece of information • Why, you ask? Because your brain works that way

  6. What you need to know… • Less than Einstein • More than this guy… • Two kinds of Search Engines • Concept Based Search • Content Based Search

  7. What you need to know… • Concept Based Search: deals with processing unstructured requests • Content Based Search: deals with processing structured requests

  8. Concept Based Search • Why and where does it fit in? • Personalization • Content Management • Continuous Availability • Integration • Security

  9. Concept Search Engine Basics • The Purpose • The Process • Bayesian Inference • Shannon’s Information Theory • Adaptive Probabilistic Concept Modeling • Dynamic Reasoning Engine • Examples • Note: The Sybase Concept Search Engine uses embedded technology from Autonomy

  10. Concept Search Engine Basics: The Purpose… • Automate the process of getting the right information to the right person • Improve the efficiency of information retrieval • Enable the dynamic personalization of digital content • Natural language content search and retrieval • Automatic categorization by an agent • Automatic content personalization

  11. Concept Search Engine Basics: The Process… • Advanced concept matching techniques • High-performance pattern matching algorithms • Can analyze a text and identify the key concepts within the document • Based on frequency and relationships of terms correlated with meaning • Language independent

  12. Concept Search Engine Basics: Limitations of other approaches… • Keyword, Boolean and proximity searches: • Exacerbate information overload • Can’t tell how relevant a document is to the subject being researched • Only track simple occurrence of keywords, e.g. "CD AND (NOT (financial OR money OR invest*)) AND music" • May track proximity of content but not relevant content • Lack of localization (English Wizard… hey, I’m NOT!)

  13. Bayesian Inference • Developed by Thomas Bayes, 18th century cleric and mathematician • Central tenet of modern statistical probability modeling • Calculates the probabilistic relationship between multiple variables and the extent to which each impacts the other • Used in pattern and fingerprint recognition • (Image caption: Okay, maybe not this guy)
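
A minimal sketch of the Bayesian update this slide describes, in Python. The function name `posterior` and all the probabilities are illustrative assumptions, not values from the Sybase or Autonomy engines. It shows how observing a term in a document shifts the probability that the document is about a given concept.

```python
def posterior(prior, p_term_given_concept, p_term_given_other):
    """Bayes' rule: probability of the concept given that the term was observed."""
    # Total probability of seeing the term at all (the "evidence").
    evidence = (p_term_given_concept * prior
                + p_term_given_other * (1.0 - prior))
    if evidence == 0.0:
        raise ValueError("term has zero probability under both hypotheses")
    return p_term_given_concept * prior / evidence


# Hypothetical numbers: 10% of documents are about the concept; the term
# appears in 60% of those documents and in 5% of all the others.
p = posterior(prior=0.10, p_term_given_concept=0.60, p_term_given_other=0.05)
print(f"P(concept | term) = {p:.3f}")
```

A term that is equally likely with or without the concept leaves the prior unchanged, which is why common words carry little weight.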

  14. Shannon’s Information Theory • Developed by Claude Shannon in 1949 • Words which are less frequent across all documents, but appear in a cluster of documents, are more distinguishing and tend to convey more information • Ideas can be inferred from related content • An inference engine may be used to parse and build content
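
The idea that rarer words convey more information can be sketched with a simple information-style weight, log2(N / document frequency). This is a stand-in for illustration only; the slides do not specify the weighting the engine actually uses, and the function name and sample documents are assumptions.

```python
import math


def information_weights(documents):
    """Return {term: log2(N / df)}: terms found in fewer documents weigh more."""
    n_docs = len(documents)
    doc_freq = {}
    for text in documents:
        # Count each term once per document (document frequency).
        for term in set(text.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical four-document collection.
docs = [
    "the portal indexes the documents",
    "the spider fetches the pages",
    "the portal ranks the pages",
    "the bayesian engine weights the terms",
]
weights = information_weights(docs)
for term in ("the", "portal", "bayesian"):
    print(term, weights[term])
```

A word that occurs in every document gets weight zero, while a word confined to a single document gets the highest weight.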

  15. Adaptive, Probabilistic Concept Modeling • Bayesian Inference + Shannon’s Information Theory • Dynamic Reasoning Engine (DRE) generates networks of concepts • Terms are weighted; relationships are established • The unstructured content “portal” metaphor

  16. Concept Search Inference Engine: Also known as Dynamic Reasoning Engine… • Core engine of the Concept Search logic • Uses the APCM algorithms to extract content • Generates relative weight of document relevance, base summary and/or result set (non-tabular) • Generates query plans for unstructured data • May be stored as Templates for reusable queries • May be used by agent processes for aggregation • Accessed through the Enterprise Portal Search API

  17. Auto Indexer • Automatically gathers text content from local file systems and imports external files into an index • Can gather document sets in a local file system • Can spider mapped drives • Can load a single document as discrete sets • Uses Verity, Keyview & Adobe filters to work on ASCII text • Will continually check for new content

  18. Agent Process • Automated Content Categorizer • Stores categories or reusable queries known as “Agents” • Agents can be shared or used to find people with similar interests

  19. Knowledge Fetch Process • Allows ‘auto-spidering’ of web sites to gather data • Converts web content to an indexable format • May be used to fetch content from many sites simultaneously • Can return meta-data and conventional text content • Obtains web pages behind firewalls and through proxy servers • Obtains web pages protected by a login • Obtains web pages using cookies

  20. The Knowledge Management Server: A portal service… • (Diagram labels: Sybase Enterprise Portal, Application Service Engine, Inference Engine, Auto Indexer, HTML, E-Mail, News, PDF, Word, Open Client, IBM DRDA, SQL*Net, ODBC/JDBC, File I/O, POP3, Exchange, Lotus Notes, HTTP, HTTPS)

  21. Enterprise Portal Search Services • Encapsulate the Search API into a set of EP components • Components can be accessed by other EP services, such as security servlets, messaging or other EJBs • Allows load balancing across server clusters • Secure Search and Profile Locking • Allows extending the Dynamic Reasoning Engine via ANY component model (Java, C, ActiveX, Server Side JavaScript, etc.)

  22. Sample Architectures • (Diagram labels: Client, Firewall, Load Balancing Hardware, Web Server, Presentation Layer, Application Engine, Concept Search Inference Engine, Knowledge Server, Knowledge Server Agents, External Spider Agent, Internal Spider Agents, Fetch Agent, EP Data Store, Unstructured Data Repositories, Data repository, Intranet, DMZ Ring)

  23. Storage Overhead • No content stored, just terms & weights: ~30-50% of original document size • Content stored, plus terms & weights: ~150% of original document size • Content, proximity & phrase matching, and terms & weights: ~250% of original document size
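
The percentages above translate directly into a sizing estimate. The sketch below is plain arithmetic on the slide's figures; the mode names and the function `index_size_mb` are made up for illustration.

```python
# Overhead ranges from the slide, as (low %, high %) of original document size.
OVERHEAD_PERCENT = {
    "terms_only": (30, 50),
    "content_plus_terms": (150, 150),
    "content_proximity_terms": (250, 250),
}


def index_size_mb(source_mb, mode):
    """Estimated (low, high) index size in MB for a given amount of source text."""
    low, high = OVERHEAD_PERCENT[mode]
    return source_mb * low / 100, source_mb * high / 100


# Example: 1,000 MB of source documents under each configuration.
for mode in OVERHEAD_PERCENT:
    low, high = index_size_mb(1000, mode)
    print(f"{mode}: {low:.0f} to {high:.0f} MB")
```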

  24. Content Based Search • Why and where does it fit in? • Personalization • Content Management • Continuous Availability • Integration • Security

  25. Content Full Text Search Engine Basics • The Purpose • The Process • Content Search Basics • Full Text Search Specialty Data Store • Sample Architectures • Note: The Sybase Content Full Text Search Engine uses embedded technology from Verity

  26. Content Search Engine Basics: The Purpose… • Structured (SQL) access to unstructured data • Adaptive Server (or EP) indexes documents stored in external data stores • Indexes are maintained within a collection • It understands words and language constructs • It understands many document types, e.g. MS Word, HTML, SGML, PDF, etc.

  27. Content Search Engine Basics: The Process… • (Diagram labels: Sybase Enterprise Portal, Application Service Engine, SQL Query, EP Data Store, Specialty Data Store, HTML, PDF, Text, Word)

  28. Content Search Engine Basics: The Process… • Queries are issued against a collection • Results include a document identifier and a score • Score indicates how well a document matched the query • Can understand and index many foreign languages • Includes rules for understanding words and constructs of the specified language

  29. Content Search Engine Basics: The Process… • Queries are issued against a collection • Results include a document identifier and a score • Score indicates how well a document matched the query • Example against Collection A: find documents where “blue” is near “red” • Results: ID = 68, score = 98; ID = 17, score = 71
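
A rough sketch of how a "near" query can return document IDs with scores. The scoring formula, the five-word window and the sample texts are all assumptions for illustration; the slides do not say how the Verity engine computes its scores, so the numbers here will not match the slide's 98 and 71.

```python
def near_score(text, term_a, term_b, window=5):
    """Score 0-100: 100 when the terms are adjacent, 0 when missing or too far apart."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return 0
    # Smallest distance, in words, between any occurrence of the two terms.
    gap = min(abs(i - j) for i in pos_a for j in pos_b)
    if gap > window:
        return 0
    return min(100, round(100 * (window - gap + 1) / window))


def search(collection, term_a, term_b, window=5):
    """Return (doc_id, score) pairs for matching documents, best score first."""
    hits = [(doc_id, near_score(text, term_a, term_b, window))
            for doc_id, text in collection.items()]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: h[1], reverse=True)


# Hypothetical collection keyed by document ID.
collection_a = {
    68: "a blue and red flag",
    17: "blue sky over a distant red barn",
    5: "nothing relevant here",
}
print(search(collection_a, "blue", "red"))
```

As on the slide, the result set is a list of identifiers and scores rather than the documents themselves; the caller joins back to the source to fetch content.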

  30. Content Search Engine Basics: The Process… • Can understand and index many foreign languages • Includes rules for understanding words and constructs of the specified language • Hola! Buon giorno! Mahalo! Kem-Cho!

  31. Content Search Engine Basics: The Process… • Indexed data and index in two separate data stores • (Diagram labels: Specialty Data Store, EP Data Store, Indexed Data) • Updates, synchronization, backup, recovery?

  32. Full Text Search is a Specialty Data Store: Yes, but… • Data Store propagates source changes to the collection • An events table (text_events) is used to log changes to the source tables • Data Store must be notified that changes exist • Backups of both data stores must be synchronized
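
The change-propagation flow can be sketched with in-memory stand-ins. Only the name `text_events` comes from the slide; the record layout, the functions `log_change` and `propagate`, and the dictionary-based "collection" are hypothetical and do not reflect the real table schema.

```python
def log_change(text_events, doc_id, operation):
    """Record a change to a source row ("insert", "update" or "delete")."""
    text_events.append({"id": doc_id, "op": operation, "processed": False})


def propagate(source, text_events, collection):
    """Apply unprocessed events to the collection; return how many were applied."""
    applied = 0
    for event in text_events:
        if event["processed"]:
            continue
        text = source.get(event["id"])
        if event["op"] == "delete" or text is None:
            # Row is gone from the source, so drop it from the index too.
            collection.pop(event["id"], None)
        else:
            # Insert or update: re-index the current source text.
            collection[event["id"]] = set(text.lower().split())
        event["processed"] = True
        applied += 1
    return applied


# Hypothetical source table, event log and text index.
source, text_events, collection = {}, [], {}
source[1] = "Digital and Compaq merge"
log_change(text_events, 1, "insert")
print(propagate(source, text_events, collection), collection)
```

Until `propagate` runs, the index lags the source, which is why the slide stresses notification and synchronized backups of both stores.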

  33. Full Text Search is a Specialty Data Store: Sybase Provides… • Integrated backup and restore facility • Backup / recover database and text indexes • Online configuration • Configure Full Text Search at runtime • (Diagram labels: Specialty Data Store, EP Data Store, dump database...)

  34. Enhanced Full Text Search Features • Clustering: a feature for grouping similar documents • Clusters are inherently fuzzy; the algorithm merely attempts to group similar documents • Query By Example: provides the ability to search for documents that are similar to one or more segments of text • Example: select summary, score, copy from t1 t, vt1 v where t.id = v.id and index_any = '<like> ("Space the final frontier")'
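
Query By Example can be approximated by ranking documents on their similarity to a sample passage. The sketch below uses cosine similarity over word counts as a stand-in; it is not the algorithm behind the `<like>` operator, which the slides do not describe, and the function names and sample documents are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def query_by_example(example, documents, min_score=0.0):
    """Return (doc_id, similarity) pairs scoring above min_score, best first."""
    scored = [(doc_id, cosine_similarity(example, text))
              for doc_id, text in documents.items()]
    return sorted((s for s in scored if s[1] > min_score),
                  key=lambda s: s[1], reverse=True)


# Hypothetical documents keyed by ID.
stories = {
    1: "space the final frontier",
    2: "the final exam schedule",
    3: "quarterly revenue report",
}
for doc_id, score in query_by_example("Space the final frontier", stories):
    print(doc_id, round(score, 2))
```

The same pairwise similarity could feed a clustering step, which fits the slide's remark that clusters are inherently fuzzy: membership depends on a similarity threshold, not an exact match.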

  35. Enhanced Full Text Search Features • Custom Thesaurus allows users to build a thesaurus specific to their application • Synonym maps for proximity search control: 1synonyms:(list: “red, ruby, scarlet, fuchsia, magenta” list: “blue <or> azure ”)$$ • A text index is used by joining the source table and the index table: select score, copy from story_index i, stories s where i.id = s.id and i.score > 70 and i.index_any = "Digital <near> Compaq"

  36. Summary • Sybase provides 2 types of Knowledge Management: Concept Search and Content Search • Technology futures include an unstructured data server, XML search and indexing, XSL translation and other ways of managing hybrid data

  37. Summary • Yes, it can be done • Content, Concept: we have it all
