Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)

Samrat Sen

Goals
• Issues in text mining with unstructured data
• Analysis of data mining products
• Study of a real-life classification problem
• Strategy for solving the problem

UB - CS 711, Data Mining with Unstructured Data

Issues in Text Mining
• Text mining differs from KDD and DM techniques for structured databases
Problems with those techniques:
1. They are concerned with predefined fields
2. They are based on learning from an attribute-value database (an example follows)

Issues in Text Mining

Potential Customer Table
Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes

Married To Table
Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules
If Married(Person, Spouse) and Income(Person) >= 25,000 Then Potential-Customer(Spouse)
If Married(Person, Spouse) and Potential-Customer(Person) Then Potential-Customer(Spouse)

Issues in Text Mining
• Algorithmic techniques include association extraction from indexed data and prototypical document extraction from full text
• Industry-standard data mining tools cannot be used directly; a typical process also needs a text transformer, a text analyzer, and a summary generator

Issues in Text Mining
• Input and output interfaces and file formats may cost time and money.
• Exhaustive domains have to be set up for classification.
• Costs and benefits have to be weighed before model selection:
  1. Gain from a correct positive prediction
  2. Loss from an incorrect positive prediction (false positive)
  3. Benefit from a correct negative prediction
  4. Cost of an incorrect negative prediction (false negative)
  5. Cost of project time (a better product/algorithm may come up)
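The five cost and benefit items above can be combined into a single figure for comparing candidate models. The following is a minimal sketch, not a method taken from any of the products discussed; the class names, the payoff values, and the two sets of outcome counts are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Counts of each prediction outcome on a test set (hypothetical numbers)."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


@dataclass
class Payoffs:
    """Money gained or lost per outcome (hypothetical values)."""
    gain_tp: float      # 1. gain from a correct positive prediction
    loss_fp: float      # 2. loss from a false positive
    benefit_tn: float   # 3. benefit from a correct negative prediction
    cost_fn: float      # 4. cost of a false negative


def net_benefit(o: Outcomes, p: Payoffs, project_cost: float = 0.0) -> float:
    """Total gains minus total losses, minus 5. the cost of project time."""
    return (o.true_pos * p.gain_tp
            - o.false_pos * p.loss_fp
            + o.true_neg * p.benefit_tn
            - o.false_neg * p.cost_fn
            - project_cost)


# Two hypothetical models evaluated on the same 1,000 test records.
model_a = Outcomes(true_pos=80, false_pos=30, true_neg=870, false_neg=20)
model_b = Outcomes(true_pos=60, false_pos=5, true_neg=895, false_neg=40)
payoffs = Payoffs(gain_tp=50.0, loss_fp=10.0, benefit_tn=0.0, cost_fn=50.0)

print(net_benefit(model_a, payoffs))  # 2700.0
print(net_benefit(model_b, payoffs))  # 950.0
```

With these made-up payoffs, the model with more false positives still comes out ahead because missed positives are expensive; different payoffs can reverse the ranking.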

Data Mining Products/Tools
• DARWIN, from Oracle
• Intelligent Data Miner, from IBM
• Intermedia Text with the Oracle database, with a context query feature (theme-based document retrieval)
For more info:
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/

Data Mining Products/Tools
• New specification being proposed by Sun for a Data Mining API *
• SQL Server 2000: data mining and English query writing features
• Verity Knowledge Organizer
For more info:
* http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html#3
Additional text mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/~kuikka/systems.html

DARWIN
Functions
• Prediction (from known values)
• Classification (into categories)
• Forecasting (future predictions)
Approach
• Plan
• Prepare dataset
• Build and use models

DARWIN
• The problem is defined in terms of data fields and data records
• The fields are classified as follows:
  - Categorical and ordered fields
  - Predictive fields
  - Target fields
• A DARWIN dataset file containing all the records in the problem domain has to be created (using a descriptor file)

DARWIN - Models
• Tree model: based on the classification and regression tree algorithm
• Net model: a feed-forward multilayer neural network
• Match model: a memory-based reasoning model, using a k-nearest neighbor algorithm
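To make the idea behind the Match model concrete, here is a minimal sketch of k-nearest neighbor classification. It is a generic illustration and not DARWIN's implementation; the function name, the optional per-field weights, and the toy records are all assumptions.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3, weights=None):
    """Classify `query` by majority vote among its k nearest training records.

    training: list of (feature_tuple, label) pairs
    weights:  optional per-field weights for the distance (hypothetical knob)
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance between two feature tuples.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(training, key=lambda rec: distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (made-up values): two well-separated groups.
training = [
    ((1.0, 1.0), "yes"), ((1.2, 0.9), "yes"), ((0.9, 1.1), "yes"),
    ((5.0, 5.0), "no"), ((5.2, 4.8), "no"), ((4.9, 5.1), "no"),
]

print(knn_predict(training, (1.1, 1.0)))  # yes
print(knn_predict(training, (5.0, 4.9)))  # no
```

The model is "memory based" in the sense that nothing is fitted in advance: the training records themselves are kept and consulted at prediction time.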

DARWIN – Tree Model
[Figure: tree model workflow. Training Data → Create Tree → Test/Evaluate Tree (information on error rates of pruned sub-trees) → Predict with Tree on the I/P Prediction Dataset (using the selected sub-tree) → Merged I/P & O/P prediction dataset → Analyze Results]

DARWIN – Net Model
[Figure: net model workflow. Labels: Training Dataset; Create Net; Neural Network Model; Train Net (information on error rates of pruned sub-trees); Trained Neural Network; I/P Prediction Dataset; Prediction Dataset; Merged I/P & O/P prediction dataset; Analyze Results]

DARWIN – Match Model
[Figure: match model workflow. Training Data → Create Match Model → Optimize match weights → Predict with Match on the I/P Prediction Dataset → Merged I/P & O/P prediction dataset → Analyze Results]

DARWIN – Analyzing
• Evaluate: evaluates the performance of a given model on a given dataset when working on known data for test or evaluation purposes.
• Summarize Data: provides a statistical summary of the values taken by the data in the specified fields of a dataset.
• Frequency Count: provides information on the frequency with which particular data values appear in a dataset.

DARWIN – Analyzing
• Performance Matrix: can be used to compare simple fields or simple functions of fields.
• Sensitivity: provides a model showing the relative importance of the attributes used in building a model.

DARWIN – Code Generation
• Darwin can generate C, C++, or Java code for a Tree or Net model so that a prediction function can be called from an application program
• Java code can also be generated to embed a model in a web applet
For more info:
http://technet.oracle.com/docs/products/datamining/doc_index.htm

DARWIN
For more info:
• http://technet.oracle.com/software/products/intermedia/software_index.html
  1. Oracle Data Mining data sheet
  2. Oracle Data Mining Solutions
• http://www.oracle.com/ip/analyze/warehouse/datamining/
• http://www.oracle.com/oramag/oracle/98-Jan/fast.html
  1. Managing Unstructured Data with Oracle8
• http://technet.oracle.com/products/datamining/
  1. Product manuals

Oracle – Intermedia Text
• A ranking technique called theme proving is used
• Documents are grouped into categories and subcategories
• Integrated with the Oracle8 database
• No training or tuning required

Oracle – Intermedia Text
• Lexical knowledge base
  - 200,000 concepts from very broad domains
  - 2,000 major categories
  - Concepts are mapped into one or more words/phrases in canonical form
  - Each of these has alternate inflectional variations, acronyms, and synonyms stored
  - Total vocabulary of 450,000 terms
  - Each entry has other parameters, such as part of speech

Oracle – Intermedia Text
Theme Extraction
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme
- All the ancestor themes are also included in the result
- Theme proving is done before final ranking
Queries
Direct match, phrase search ('contains'), case-sensitive query, misspellings and fuzzy match, inflections ('about'), compound queries, Boolean operators, natural language query

Oracle – Intermedia Text
• Oracle at TREC 8 (Eighth Text Retrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)
  Recall at 1000: 71.57% (3384/4728)
  Average precision: 41.30%
  Initial precision (at recall 0.0): 92.79%
  Final precision (at recall 1.0): 07.91%

Intermedia Text - Model

Interface Options

Language Selection
• Java for the robot
• PL/SQL for data retrieval

Code Execution

Overview of the System
[Figure: system architecture. Components: Customer Browser; Client Browser; Web Server; Server process (listening at port 80); Tag stripper; JDBC; Oracle 8i; Intermedia Text]

Intermedia Text
Steps for building an application
• Load the documents
• Index the documents
• Issue queries
• Present the documents that satisfy the query

Loading Methods
• Insert statements
• SQL*Loader
• Ctxsrv: a server daemon process which builds the index at regular intervals
• Ctxload utility, used for:
  - Thesaurus import/export
  - Text loading
  - Document updating/exporting

Create and Populate a Simple Table

    CREATE TABLE quick (
      quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
      text     VARCHAR2(80)
    );
    INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
    INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
    INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );
    COMMIT;

Run a Text Query

    SELECT text FROM quick
      WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

    DRG-10599: column is not indexed

• You must have a Text index on a column before you can do a "contains" query on it

Create the Text Index

    CREATE INDEX quick_text ON quick ( text )
      INDEXTYPE IS CTXSYS.CONTEXT;

• CTXSYS is the system user for interMedia Text
• The INDEXTYPE keyword is a feature of the Extensible Indexing Framework

Run a Text Query

    SELECT text FROM quick
      WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

    TEXT
    -----------------------
    The cat sat on the mat

• You should regard the CONTAINS function as boolean in meaning
• It is implemented as a number since SQL does not have a boolean datatype
• The only sensible way to use it is with > 0

Run a Text Query

    SELECT SCORE(42) s, text FROM quick
      WHERE CONTAINS ( text, 'dog', 42 ) >= 0  /* just for teaching purposes! */
      ORDER BY s;

     S TEXT
    -- ---------------------------
     7 The dog barked like a dog
     4 The fox jumped over the dog

• The better the match, the higher the score
• The value can be used in ORDER BY but has no absolute significance
• The score is zero when the query is not matched

Intermedia Text - Indexing Pipeline
[Figure: indexing pipeline. Stages: Database → Datastore → Filter → Sectioner → Lexer → Engine. Data labels: Column data; Doc Data; Filtered Doc text; Plain text; Section Offsets; Tokens; Index Data]
• The first step is creating an index
Datastore
• Reads the data out of the table (for the URL datastore, it performs a 'GET')

Intermedia Text - Indexing Pipeline
• Filter: transforms the data to a text type; this is needed because some stored formats (doc, pdf, HTML types) may be binary.
• Sectioner: converts the filtered text to plain text, removing tags and invisible information.
• Lexer: splits the text into discrete tokens.
• Engine: takes the tokens from the lexer, the offsets from the sectioner, and a stoplist of words to build an index.
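The sectioner, lexer, and stoplist stages can be imitated in a few lines to show what each one contributes. This is a rough sketch using only the Python standard library, not the product's code: the function names, the tokenizing rule, and the tiny stoplist are assumptions, and the input is taken to be already filtered to HTML text.

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def sectioner(filtered_text):
    """Sectioner stand-in: turn tagged text into plain text."""
    parser = _TagStripper()
    parser.feed(filtered_text)
    parser.close()
    return " ".join(parser.chunks)


def lexer(plain_text):
    """Lexer stand-in: split plain text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", plain_text.lower())


# Hypothetical stoplist; a real one would be much longer.
STOPLIST = {"the", "a", "on", "of"}


def engine_tokens(tokens, stoplist=STOPLIST):
    """Engine stand-in: keep (position, token) pairs for non-stoplist words."""
    return [(pos, tok) for pos, tok in enumerate(tokens, start=1)
            if tok not in stoplist]


tokens = lexer(sectioner("<p>The cat <b>sat</b> on the mat</p>"))
print(tokens)                 # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(engine_tokens(tokens))  # [(2, 'cat'), (3, 'sat'), (6, 'mat')]
```

Positions are counted before stoplist words are dropped, so that phrase queries can still tell how far apart the remaining words were in the original text.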

Intermedia Text - Indexing Pipeline
Example of index creation
Statements:
• INSERT INTO docs VALUES (1, 'first document');
• INSERT INTO docs VALUES (2, 'second document');
Produces an index:
DOCUMENT → doc 1 position 2, doc 2 position 2
FIRST → doc 1 position 1
SECOND → doc 2 position 1
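The index in this example is an inverted index: each token maps to the documents and word positions where it occurs. The sketch below rebuilds the same three entries in plain Python; the function name and the whitespace tokenizer are illustrative assumptions, not the product's internals.

```python
from collections import defaultdict


def build_index(docs):
    """Map each uppercase token to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs:
        # Positions are 1-based, matching the slide's example.
        for position, token in enumerate(text.upper().split(), start=1):
            index[token].append((doc_id, position))
    return dict(index)


docs = [(1, "first document"), (2, "second document")]
index = build_index(docs)

for token in sorted(index):
    print(token, index[token])
# DOCUMENT [(1, 2), (2, 2)]
# FIRST [(1, 1)]
# SECOND [(2, 1)]
```

Looking up a query word is then a single dictionary access, which is why the index has to exist before a text query can run.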

Testing procedure
• Document set from newsgroups
  - 122 documents from a text mining site
  - Loaded using insert statements
  - File datastore used
• Documents (HTML) from browsing
  - 20 documents
  - Loaded from the server process
  - URL datastore used

Newsgroup Results
1. Religion, Atheism: 15, on the Bible, Islam, religious beliefs
2. Comp-os-ms-windows-misc: 17, about operating systems, protocols, installation
3. Comp.graphics: 27, on hardware and software for computer graphics
4. Ice Hockey: 18
5. Computer hardware: 12, on installation of different peripheral devices
6. Mideast.politics: 14, on political developments in the Mideast
7. Science.space: 19, on various space programs, devices, theories

Newsgroup Results
Group                       Retrieved  Wrong  Not Retrieved  Recall  Precision
Science and technology      120        16     1              99%     78%
Computer Hardware Industry  12         0      5              71%     100%
Government                  103        26     8              90%     74%
Politics                    17         3      0              100%    82%
Military                    5          1      0              80%     80%
Social Environment          48         2      14             77%     96%
Religion                    22         3      2              90%     86%
Islam                       4          0      0              100%    100%
Leisure recreation          22         4      5              78%     82%
Sports                      21         1      0              90%     90%
Hockey                      18         0      0              100%    100%

Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
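The two definitions can be applied directly to the table's count columns. The sketch below assumes that "Retrieved" counts every document returned for a group, that "Wrong" counts the returned documents that do not belong, and that "Not Retrieved" counts the documents that belong but were missed; that reading of the columns is an assumption, and the function name is illustrative.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Return (recall, precision) as fractions from the three table counts."""
    correct = retrieved - wrong              # correct positive predictions
    positives = correct + not_retrieved      # all documents truly in the group
    recall = correct / positives if positives else 0.0
    precision = correct / retrieved if retrieved else 0.0
    return recall, precision


# "Computer Hardware Industry" row: 12 retrieved, 0 wrong, 5 not retrieved.
recall, precision = recall_precision(12, 0, 5)
print(round(100 * recall), round(100 * precision))  # 71 100
```

For this row the result agrees with the 71% recall and 100% precision listed in the table.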

Syntax: Binary Operators
Operator  Symbol  Example
AND       &       cat & dog
OR        |       cat | dog
EQUIV     =       cat = dog
MINUS     -       cat - dog
NOT       ~       cat ~ dog
ACCUM     ,       cat , dog

Semantics: Binary Operators
• The semantics of all the binary operators are defined in terms of SCORE
• However, the score for even the simplest query expression, a single word, is calculated by a subtle rule:
  - the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently
  - but when "word1" occurs N times in document D, its score is lower than when "word2" occurs N times in document D if "word1" occurs more often in the whole document set than "word2"
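The slides do not spell out how each operator combines the scores of its operands. As a rough illustration only, the sketch below assumes that AND takes the lower of the two scores, OR takes the higher, and MINUS subtracts the right-hand score from the left; this mapping is an assumption to be checked against the product reference, and the function names are hypothetical.

```python
def score_and(left, right):
    """Assumed AND rule: a document is only as good as its weaker operand."""
    return min(left, right)


def score_or(left, right):
    """Assumed OR rule: the stronger operand decides the score."""
    return max(left, right)


def score_minus(left, right):
    """Assumed MINUS rule: subtract the right score, never going below zero."""
    return max(0, left - right)


# Hypothetical per-word scores of one document for 'cat' and 'dog'.
cat, dog = 7, 4
print(score_and(cat, dog))    # 4
print(score_or(cat, dog))     # 7
print(score_minus(cat, dog))  # 3
```

Under these assumed rules, a document missing one word (score 0) fails an AND query but can still match an OR query.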

The Salton Algorithm
• interMedia Text uses an algorithm similar to the Salton algorithm, which is widely used in text retrieval products
• The score for a word is proportional to f (1 + log(N/n)), where
  - f is the frequency of the search term in the document
  - N is the total number of documents
  - n is the number of documents which contain the search term
• The score is converted into an integer in the range 0 - 100.

The Salton Algorithm
Assumption
• Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower.
• For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
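The inverse frequency rule can be tried out numerically. The sketch below evaluates f (1 + log(N/n)) and then clamps a scaled value into the 0 - 100 range. The base-10 logarithm, the `scale` constant, and the function names are assumptions made for illustration, since the slides only say the score is "proportional to" the expression.

```python
import math


def salton_weight(f, N, n):
    """Unscaled weight f * (1 + log10(N / n)); the log base is an assumption."""
    if f <= 0 or n <= 0:
        return 0.0
    return f * (1 + math.log10(N / n))


def to_score(weight, scale=1.0):
    """Convert a weight to an integer score in 0..100 (scale is hypothetical)."""
    return max(0, min(100, round(scale * weight)))


# Same term frequency, but the term is rare in one set and common in another.
rare = salton_weight(f=2, N=100, n=10)     # term appears in 10 of 100 documents
common = salton_weight(f=2, N=100, n=100)  # term appears in every document
print(rare, common)     # 4.0 2.0
print(to_score(250.0))  # 100
```

With equal frequency in the document, the term that is rarer across the whole set gets the higher weight, which is the behaviour the assumption describes.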

The Salton Algorithm
This table assumes that only one document in the set contains the query term.
# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4

Summary of operators
• Binary operators: & | = - ~ ,
• Built-in expansion: ? $ !
• Thesaurus: BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR, TRSYN, TT
• Stored query expression: SQE
• Grouping and escaping: () {} \
• Special: NEAR, WITHIN, ABOUT
