Unleashing Value in Unstructured Data through Text Analytics and Indexing

Who Needs All Those Indexes? One is Enough
Bruce Lindsay, IBM Almaden Research Center, bgl@almaden.ibm.com

Stored Data is Heterogeneous
• Most stored data is NOT well structured
  • Text & semi-structured
  • Sparse, multi-valued, & multi-occurrence attributes
• Much value latent in un-structured data
• Text analytic tools can extract value
  • Beyond the words: names, roles, concepts, …
• Text analytics: searching for meaning in the content
  • Semantic & knowledge driven analysis
  • Expensive: big dictionaries, byte-by-byte, big inputs and outputs
  • Stateless → easy scale-out

Text Analytics
[Figure labels: Object, analytic1, analytic2, to Index, Data Source, Dictionary, Attributes/Values]
• Derive {<attribute, value, position>} from inputs
  • Language, words (stems, part-of-speech, …)
  • Context (title, bold, anchor text, …)
  • Concepts (person, organization, role, product, …)
  • Classification (complaint, fraud, spam, xxx, …)
  • Meta-data (to/from, subject, date, title, abstract, reference, …)
• Domain and customer specific analysis offer most value
• Analytics-produced attributes induce index schema
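As a rough illustration of the "derive {<attribute, value, position>}" bullet, here is a minimal Python sketch. The two analytics, the `CONCEPTS` dictionary, and the use of token offsets as positions are all assumptions made for the example; the slides do not specify how the analytics are implemented.

```python
import re
from typing import Dict, Iterator, List, Tuple

Posting = Tuple[str, str, int]  # (attribute, value, position)

# Hypothetical dictionary mapping surface phrases to concept types.
CONCEPTS: Dict[str, str] = {"ibm": "organization", "bruce lindsay": "person"}


def word_analytic(text: str) -> Iterator[Posting]:
    """Emit one ('word', token, token_index) tuple per lowercased token."""
    for pos, match in enumerate(re.finditer(r"\w+", text.lower())):
        yield ("word", match.group(), pos)


def concept_analytic(text: str) -> Iterator[Posting]:
    """Emit (concept_type, phrase, token_index) for each dictionary hit."""
    tokens = re.findall(r"\w+", text.lower())
    for phrase, concept in CONCEPTS.items():
        parts = phrase.split()
        for pos in range(len(tokens) - len(parts) + 1):
            if tokens[pos:pos + len(parts)] == parts:
                yield (concept, phrase, pos)


def analyze(text: str) -> List[Posting]:
    """Run every analytic over one object and collect the derived tuples."""
    return list(word_analytic(text)) + list(concept_analytic(text))


postings = analyze("Bruce Lindsay works at IBM")
# The set of attributes the analytics emit is what "induces" the index schema.
schema = sorted({attr for attr, _, _ in postings})
```

Swapping in a different set of analytics changes `schema` without any change to the code that stores or indexes the tuples, which is the point of the last bullet.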

Text Indexing
• Logical index over <attribute, value, object, position>
  • MANY entries per object
  • Large index – even with aggressive compression
  • Non-transactional
• Scale-out needed
  • Capacity – single index too big for one (commodity) node
  • Ingest throughput – concurrent insert to index fragments
  • Query response – fan-out / fan-in for query parallelism
• Query
  • Predicates over <attribute, value> matches
  • Match scoring – magic weighting of predicate importance & position
  • Query planning & optimization probably needed
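The logical index and the query bullets can be sketched on a single node as follows. This is only a toy under stated assumptions: the `TextIndex` class is hypothetical, queries are treated as a conjunction of predicates, and the scoring formula (predicate weight divided by one plus position) is an arbitrary stand-in for the "magic weighting" the slide mentions. Compression, partitioning, and query planning are left out.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]  # (attribute, value)


class TextIndex:
    """Toy logical index over <attribute, value, object, position>."""

    def __init__(self) -> None:
        # (attribute, value) -> list of (object id, position)
        self.postings: Dict[Key, List[Tuple[str, int]]] = defaultdict(list)

    def add(self, obj: str, tuples: List[Tuple[str, str, int]]) -> None:
        """Insert every derived tuple for one object (many entries per object)."""
        for attr, value, pos in tuples:
            self.postings[(attr, value)].append((obj, pos))

    def query(self, predicates: Dict[Key, float]) -> List[Tuple[str, float]]:
        """Return objects matching ALL predicates, best score first.

        Assumed scoring: each match adds weight / (1 + position), so
        heavier predicates and earlier positions count for more.
        """
        scores: Dict[str, float] = defaultdict(float)
        matched: Dict[str, Set[Key]] = defaultdict(set)
        for key, weight in predicates.items():
            for obj, pos in self.postings.get(key, []):
                scores[obj] += weight / (1 + pos)
                matched[obj].add(key)
        hits = [(o, s) for o, s in scores.items()
                if len(matched[o]) == len(predicates)]
        return sorted(hits, key=lambda h: (-h[1], h[0]))


index = TextIndex()
index.add("doc1", [("word", "index", 0), ("word", "scale", 3)])
index.add("doc2", [("word", "index", 4), ("word", "scale", 1)])
index.add("doc3", [("word", "index", 0)])
results = index.query({("word", "index"): 2.0, ("word", "scale"): 1.0})
```

Here `doc3` is dropped because it matches only one of the two predicates, and `doc1` ranks above `doc2` because its match on the more heavily weighted predicate occurs at an earlier position.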

What about Data Processing? select / project / join / aggregate
• Add “value” postings to index for keys and measures: <‘attrVal’, attribute, object, value>
• Select: {<attr1, val1>} → {obj1}
• Project: {<‘attrVal’, keyAttr, obj1>} → {val2}
• Join: {<keyAttr, val2>} → {obj2}
• Project: {<‘attrVal’, measAttr, obj2>} → {measVal}
• Aggregation: sum({measVal})
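The select / project / join / aggregate chain above can be traced with two small in-memory maps: one for ordinary match postings and one for the extra "value" postings. The customer and order objects, the attribute names, and the `index_object` helper are invented for illustration; only the shape of the postings follows the slide.

```python
from collections import defaultdict

match = defaultdict(set)    # (attribute, value) -> {object}
values = defaultdict(list)  # ('attrVal', attribute, object) -> [value]


def index_object(obj, attrs, value_attrs=()):
    """Add match postings for all attributes, plus 'value' postings
    for the attributes that act as keys or measures."""
    for attr, val in attrs.items():
        match[(attr, val)].add(obj)
        if attr in value_attrs:
            values[("attrVal", attr, obj)].append(val)


# Hypothetical data: two customers and three orders.
index_object("cust1", {"region": "west", "custId": "c1"}, {"custId"})
index_object("cust2", {"region": "east", "custId": "c2"}, {"custId"})
index_object("order1", {"custId": "c1", "amount": 30}, {"custId", "amount"})
index_object("order2", {"custId": "c1", "amount": 12}, {"custId", "amount"})
index_object("order3", {"custId": "c2", "amount": 99}, {"custId", "amount"})

# Select: {<attr1, val1>} -> {obj1}
obj1 = match.get(("region", "west"), set())
# Project: {<'attrVal', keyAttr, obj1>} -> {val2}
val2 = {v for o in obj1 for v in values.get(("attrVal", "custId", o), [])}
# Join: {<keyAttr, val2>} -> {obj2}
obj2 = set().union(*(match.get(("custId", v), set()) for v in val2))
# Project: {<'attrVal', measAttr, obj2>} -> {measVal}
meas_vals = [v for o in sorted(obj2)
             for v in values.get(("attrVal", "amount", o), [])]
# Aggregation: sum({measVal})
total = sum(meas_vals)
```

Note that the join step also returns `cust1` itself, since it carries the same key; it contributes nothing to the aggregate because it has no `amount` value posting.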

Architecture
[Figure labels: Obj → storeMgr; ObjStore (file, file, file); Obj Queue; Analytics; Indexer; IndexFragment; …scale-out…; Query → queryPlanner, queryDriver → ranked results]
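One way to read the architecture figure is sketched below: objects are stored, run through a stateless analytic, and inserted into one of several index fragments, and a query fans out to every fragment and merges the partial results. Everything here is an assumption layered on the figure labels: the `Cluster` and `IndexFragment` classes, hash partitioning by object id with `zlib.crc32`, occurrence counts as the ranking score, and threads standing in for separate nodes.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def analytics(text):
    """Stateless stand-in analytic: one ('word', token) tuple per token."""
    return [("word", tok) for tok in text.lower().split()]


class IndexFragment:
    """One partition of the index; in a real system, one per node."""

    def __init__(self):
        # (attribute, value) -> {object id: occurrence count}
        self.postings = defaultdict(lambda: defaultdict(int))

    def insert(self, obj_id, tuples):
        for key in tuples:
            self.postings[key][obj_id] += 1

    def search(self, key):
        """Return (count, object id) pairs for one <attribute, value> key."""
        found = self.postings.get(key, {})
        return [(count, obj_id) for obj_id, count in found.items()]


class Cluster:
    def __init__(self, n_fragments):
        self.obj_store = {}  # stand-in for storeMgr / ObjStore
        self.fragments = [IndexFragment() for _ in range(n_fragments)]

    def ingest(self, obj_id, text):
        """Store the object, analyze it, and index it in exactly one fragment."""
        self.obj_store[obj_id] = text
        shard = zlib.crc32(obj_id.encode()) % len(self.fragments)
        self.fragments[shard].insert(obj_id, analytics(text))

    def query(self, key, k=10):
        """Fan out to all fragments, then fan in and rank the merged hits."""
        with ThreadPoolExecutor(max_workers=len(self.fragments)) as pool:
            partials = pool.map(lambda frag: frag.search(key), self.fragments)
            hits = [hit for part in partials for hit in part]
        return sorted(hits, key=lambda h: (-h[0], h[1]))[:k]


cluster = Cluster(n_fragments=3)
cluster.ingest("a", "index index scale")
cluster.ingest("b", "index")
cluster.ingest("c", "scale out")
ranked = cluster.query(("word", "index"))
```

Because each object lives in exactly one fragment, the merged result does not depend on how the hash happens to distribute the objects.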

Conclusions
• Derived value from un-structured objects
  • Much value latent in un-structured data
  • Value extracted via analytic tools
  • Value captured in scalable index
  • Value exploited via query and data processing
• Architecture
  • Index-independent object store schema
  • Application choice of object analytics induces index schema
  • Scaled-out analytics and index
