Discover how text analytics tools extract latent value from unstructured data beyond words, enabling semantic analysis and scalable indexing. Learn about data processing techniques for deriving value, and explore the architecture of a scalable analytics index system.
Who Needs All Those Indexes? One is Enough
Bruce Lindsay
IBM Almaden Research Center
bgl@almaden.ibm.com
Stored Data is Heterogeneous
• Most stored data is NOT well structured
  • Text & semi-structured
  • Sparse, multi-valued, & multi-occurrence attributes
• Much value latent in un-structured data
• Text analytic tools can extract value
  • Beyond the words: names, roles, concepts, …
• Text analytics: searching for meaning in the content
  • Semantic & knowledge-driven analysis
  • Expensive: big dictionaries, byte-by-byte, big inputs and outputs
  • Stateless: easy scale-out
Text Analytics
[Diagram labels: Object, analytic1, analytic2, to Index, Data Source, Dictionary, Attributes/Values]
• Derive {<attribute, value, position>} from inputs
  • Language, words (stems, part-of-speech, …)
  • Context (title, bold, anchor text, …)
  • Concepts (person, organization, role, product, …)
  • Classification (complaint, fraud, spam, xxx, …)
  • Meta-data (to/from, subject, date, title, abstract, reference, …)
• Domain and customer specific analysis offer most value
• Analytics-produced attributes induce index schema
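To make the "derive {<attribute, value, position>}" step concrete, here is a minimal Python sketch of two chained analytics. It is an illustration only, not the system described in the talk: the names `Posting`, `word_analytic`, `concept_analytic`, `analyze`, and the toy `ORG_DICTIONARY` are all hypothetical, and a real analytic would rely on large domain dictionaries rather than a two-entry lookup.

```python
import re
from typing import NamedTuple


class Posting(NamedTuple):
    attribute: str
    value: str
    position: int  # token offset within the analyzed object


# Hypothetical toy dictionary (assumption); stands in for a large domain dictionary.
ORG_DICTIONARY = {"ibm": "organization", "almaden": "organization"}


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def word_analytic(text: str) -> list[Posting]:
    """Emit one <'word', token, position> posting per token."""
    return [Posting("word", tok, pos) for pos, tok in enumerate(tokenize(text))]


def concept_analytic(text: str) -> list[Posting]:
    """Emit <concept, token, position> for tokens found in the dictionary."""
    return [
        Posting(ORG_DICTIONARY[tok], tok, pos)
        for pos, tok in enumerate(tokenize(text))
        if tok in ORG_DICTIONARY
    ]


def analyze(text: str, analytics=(word_analytic, concept_analytic)) -> list[Posting]:
    """Run every analytic over the object and collect all postings."""
    postings: list[Posting] = []
    for analytic in analytics:
        postings.extend(analytic(text))
    return postings
```

The set of attribute names the chosen analytics emit (here `word` and `organization`) is what determines the index schema; adding an analytic adds attributes without any schema declaration.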
Text Indexing
• Logical index over <attribute, value, object, position>
  • MANY entries per object
  • Large index, even with aggressive compression
  • Non-transactional
• Scale-out needed
  • Capacity: single index too big for one (commodity) node
  • Ingest throughput: concurrent insert to index fragments
  • Query response: fan-out / fan-in for query parallelism
• Query
  • Predicates over <attribute, value> matches
  • Match scoring: magic weighting of predicate importance & position
  • Query planning & optimization probably needed
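The fragment and fan-out / fan-in ideas above can be sketched in a few lines of Python. This is a single-process toy under stated assumptions: the classes `IndexFragment` and `ScaledOutIndex` are hypothetical names, partitioning by object id is an assumed choice (the slides do not say how fragments are partitioned), and the score is a simple weight times occurrence count standing in for the "magic weighting".

```python
import zlib
from collections import defaultdict


class IndexFragment:
    """One fragment of the logical <attribute, value, object, position> index."""

    def __init__(self):
        # (attribute, value) -> {object_id: [positions]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def insert(self, attribute, value, obj, position):
        self.postings[(attribute, value)][obj].append(position)

    def match(self, predicates):
        """predicates: {(attribute, value): weight}. Returns {obj: score}
        for objects in this fragment that match ALL predicates."""
        scores = None
        for key, weight in predicates.items():
            hits = self.postings.get(key, {})
            found = {obj: weight * len(pos) for obj, pos in hits.items()}
            if scores is None:
                scores = found
            else:
                scores = {o: scores[o] + found[o] for o in scores.keys() & found.keys()}
        return scores or {}


class ScaledOutIndex:
    def __init__(self, n_fragments):
        self.fragments = [IndexFragment() for _ in range(n_fragments)]

    def insert(self, attribute, value, obj, position):
        # Assumption: partition by object id, so every posting of an object
        # lands in one fragment and conjunctions can be evaluated locally.
        shard = zlib.crc32(obj.encode()) % len(self.fragments)
        self.fragments[shard].insert(attribute, value, obj, position)

    def query(self, predicates, top_k=10):
        merged = {}
        for fragment in self.fragments:  # fan-out (sequential here, parallel in practice)
            merged.update(fragment.match(predicates))  # fan-in
        return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Because each object lives in exactly one fragment under this partitioning, the fan-in step is a plain merge of disjoint result sets followed by a global sort on score.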
What about Data Processing?
select / project / join / aggregate
• Add “value” postings to index for keys and measures: <‘attrVal’, attribute, object, value>
• Select: {<attr1, val1>} → {obj1}
• Project: {<‘attrVal’, keyAttr, obj1>} → {val2}
• Join: {<keyAttr, val2>} → {obj2}
• Project: {<‘attrVal’, measAttr, obj2>} → {measVal}
• Aggregation: sum({measVal})
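The select / project / join / aggregate chain can be traced with a small in-memory sketch in Python. Everything here is illustrative: `OneIndex` and `sum_measure_for` are hypothetical names, and the customer/order data in the usage note is invented purely to exercise the steps.

```python
from collections import defaultdict


class OneIndex:
    """A single index holding both match postings and 'attrVal' value postings."""

    def __init__(self):
        self.match = defaultdict(set)     # (attribute, value) -> {object}
        self.attr_val = defaultdict(set)  # ('attrVal', attribute, object) -> {value}

    def add(self, obj, attribute, value):
        self.match[(attribute, value)].add(obj)
        self.attr_val[("attrVal", attribute, obj)].add(value)

    def select(self, attribute, values):
        """{<attribute, value>} -> {object}"""
        objs = set()
        for v in values:
            objs |= self.match.get((attribute, v), set())
        return objs

    def project(self, attribute, objs):
        """{<'attrVal', attribute, object>} -> values (a list, so equal
        values coming from different objects are all kept)."""
        vals = []
        for o in sorted(objs):
            vals.extend(self.attr_val.get(("attrVal", attribute, o), ()))
        return vals


def sum_measure_for(index, sel_attr, sel_val, key_attr, meas_attr):
    obj1 = index.select(sel_attr, [sel_val])   # Select
    val2 = set(index.project(key_attr, obj1))  # Project keys
    obj2 = index.select(key_attr, val2)        # Join on key values
    meas = index.project(meas_attr, obj2)      # Project measures
    return sum(meas)                           # Aggregate
```

For example, with made-up customers and orders sharing a `custId` key:

```python
idx = OneIndex()
idx.add("cust1", "region", "west"); idx.add("cust1", "custId", 1)
idx.add("cust2", "region", "east"); idx.add("cust2", "custId", 2)
idx.add("order1", "custId", 1); idx.add("order1", "amount", 10)
idx.add("order2", "custId", 1); idx.add("order2", "amount", 10)
idx.add("order3", "custId", 2); idx.add("order3", "amount", 99)
total_west = sum_measure_for(idx, "region", "west", "custId", "amount")
```

The join step also returns `cust1` itself, since it carries the same key value; it simply contributes nothing to the sum because it has no `amount` posting.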
Architecture
[Architecture diagram; component labels: file (multiple), Obj Queue, ObjStore, Obj storeMgr, Analytics (scaled out), Indexer, IndexFragment (multiple), Query, queryPlanner, queryDriver, ranked results]
Conclusions
• Derived value from un-structured objects
  • Much value latent in un-structured data
  • Value extracted via analytic tools
  • Value captured in scalable index
  • Value exploited via query and data processing
• Architecture
  • Index independent object store schema
  • Application choice of object analytics induces index schema
  • Scaled-out analytics and index