
Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)

Samrat Sen


Goals

  • Issues in Text Mining with Unstructured Data

  • Analysis of Data Mining products

  • Study of a Real Life Classification Problem

  • Strategy for solving the problem

UB - CS 711, Data Mining with Unstructured Data


Issues in Text Mining

  • Different from KDD and DM techniques in structured databases

    Problems:

    1. Concerned with predefined fields

    2. Based on learning from an attribute-value database

    e.g., the Potential Customer and Married-To tables that follow

Issues in Text Mining

Potential Customer Table

Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes

Married-To Table

Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules

If Married(Person, Spouse) and Income(Person) >= 25,000 Then Potential-Customer(Spouse)

If Married(Person, Spouse) and Potential-Customer(Person) Then Potential-Customer(Spouse)
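
As a rough illustration of how the two induced rules operate on the tables above, the following Python sketch applies them to the slide's rows. The dictionary layout and the function name `induced_customers` are assumptions made for this example. Married is treated as symmetric, and names are matched exactly as written on the slide, so "Sri H" and "Jane" do not join to "Sri S" and "Jane G".

```python
# Potential Customer table from the slide, keyed by person name.
PEOPLE = {
    "Ann S": {"age": 32, "sex": "F", "income": 10_000, "customer": True},
    "Jane G": {"age": 53, "sex": "F", "income": 20_000, "customer": False},
    "Sri S": {"age": 35, "sex": "M", "income": 65_000, "customer": True},
    "Egor": {"age": 25, "sex": "M", "income": 10_000, "customer": True},
}

# Married-To table from the slide as (husband, wife) pairs.
MARRIED = [("Egor", "Ann S"), ("Sri H", "Jane")]


def induced_customers(people, married, income_threshold=25_000):
    """Return the spouses predicted to be potential customers by the two rules."""
    # Assumption: Married(Person, Spouse) holds in both directions.
    pairs = married + [(wife, husband) for husband, wife in married]
    predicted = set()
    for person, spouse in pairs:
        row = people.get(person)
        if row is None:
            continue  # no attribute row for this name, so neither rule can fire
        rule_income = row["income"] >= income_threshold  # first induced rule
        rule_customer = row["customer"]                  # second induced rule
        if rule_income or rule_customer:
            predicted.add(spouse)
    return predicted


print(sorted(induced_customers(PEOPLE, MARRIED)))
```

Both rules are relational: they join the attribute table with the Married-To table, which is exactly the kind of predefined-field learning that does not carry over directly to free text.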

Issues in Text Mining

  • Algorithmic techniques such as

    Association Extraction from indexed data,

    Prototypical Document Extraction from full text

  • Industry-standard data mining tools cannot be used directly

    e.g., a typical process needs a text transformer, a text analyzer, and a summary generator

Issues in Text Mining

  • The input and output interfaces and the file formats may cost time and money.

  • Exhaustive domains have to be set up for classification.

  • Costs and benefits have to be weighed before model selection:

    1. Gain from a correct positive prediction

    2. Loss from an incorrect positive prediction (false positive)

    3. Benefit from a correct negative prediction

    4. Cost of an incorrect negative prediction (false negative)

    5. Cost of project time (a better product/algorithm may come up)
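
One way to weigh the five items above is to turn them into a single net-benefit figure per candidate model. The sketch below is only an illustration: the function name `net_benefit`, its parameters, and all the numbers are hypothetical and are not taken from any product described here.

```python
def net_benefit(tp, fp, tn, fn, gain_tp, loss_fp, benefit_tn, cost_fn, project_cost=0.0):
    """Net value of a model from its confusion counts and per-outcome values.

    tp, fp, tn, fn are counts of true/false positives and negatives;
    the remaining arguments are the value or cost of one such outcome.
    """
    return (
        tp * gain_tp          # 1. gain from correct positive predictions
        - fp * loss_fp        # 2. loss from false positives
        + tn * benefit_tn     # 3. benefit from correct negative predictions
        - fn * cost_fn        # 4. cost of false negatives
        - project_cost        # 5. cost of project time
    )


# Hypothetical comparison of two candidate models on 1,000 test records.
model_a = net_benefit(tp=80, fp=20, tn=890, fn=10,
                      gain_tp=50.0, loss_fp=10.0, benefit_tn=0.0, cost_fn=100.0,
                      project_cost=500.0)
model_b = net_benefit(tp=60, fp=5, tn=905, fn=30,
                      gain_tp=50.0, loss_fp=10.0, benefit_tn=0.0, cost_fn=100.0,
                      project_cost=100.0)
print(model_a, model_b)
```

With these made-up values the more expensive model still comes out ahead, because missed positives are assumed to be costly. Different cost assumptions can reverse the ranking.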

Data Mining Products/Tools

  • DARWIN – from Oracle

  • Intelligent Data Miner – from IBM

  • Intermedia Text with Oracle Database, with a context query feature (theme-based document retrieval)

FOR MORE INFO...

http://www.oracle.com/ip/analyze/warehouse/datamining/

http://www-4.ibm.com/software/data/iminer/

Data Mining Products/Tools

  • New Specification being proposed by SUN for a Data Mining API *

  • SQLServer 2000 – Data mining and English query writing features

  • Verity Knowledge Organizer

FOR MORE INFO...

* http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html#3

Additional Text Mining sites:

1. http://textmining.krdl.org.sg/resourves.html

2. www.intext.de/TEXTANAE.htm

3. www.cs.uku.fi/~kuikka/systems.html

DARWIN

Functions

  • Prediction (from known values)

  • Classification (into categories)

  • Forecasting (future predictions)

Approach

  • Plan

  • Prepare Dataset

  • Build and Use models

DARWIN

  • The problem is defined in terms of data fields and data records

  • The fields are classified as follows:

    - Categorical and Ordered Fields

    - Predictive Fields

    - Target Fields

  • A DARWIN dataset file has to be created (using a descriptor file) containing all the records in the problem domain

DARWIN - Models

  • Tree model – Based on the classification and regression tree (CART) algorithm

  • Net model – A feed-forward multilayer neural network

  • Match model – A memory-based reasoning model, using a k-nearest-neighbor algorithm
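
To make the memory-based reasoning idea behind the Match model concrete, here is a minimal, generic k-nearest-neighbor classifier in Python. It is not DARWIN's implementation: the function name `knn_predict`, the weighted Euclidean distance, and the optional per-field weights are assumptions for illustration. The records reuse the age and income values (income in thousands) from the Potential Customer table earlier in the deck.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3, weights=None):
    """Predict a label by majority vote among the k nearest training records."""
    if weights is None:
        weights = [1.0] * len(query)  # assumption: every field counts equally

    def distance(a, b):
        # Weighted Euclidean distance over numeric fields.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(training, key=lambda rec: distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# (age, income in thousands) -> customer? Values from the Potential Customer table.
TRAINING = [
    ((32, 10), "yes"),
    ((53, 20), "no"),
    ((35, 65), "yes"),
    ((25, 10), "yes"),
]

print(knn_predict(TRAINING, (30, 12), k=3))
```

Nothing is learned ahead of time here: the "model" is the stored training set plus the distance function, which is why tuning the field weights matters for this family of methods.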

DARWIN – Tree Model

Training Data → Create Tree → Test/Evaluate Tree (information on error rates of pruned sub-trees)

Input Prediction Dataset → Predict with Tree (using the selected sub-tree) → Merged Input & Output Prediction Dataset → Analyze Results

DARWIN – Net Model

Training Dataset → Create Net → Neural Network Model → Train Net (information on error rates of pruned sub-trees) → Trained Neural Network

Input Prediction Dataset → Trained Neural Network → Prediction Dataset → Merged Input & Output Prediction Dataset → Analyze Results

DARWIN – Match Model

Training Data → Create Match Model → Optimize Match Weights

Input Prediction Dataset → Predict with Match → Merged Input & Output Prediction Dataset → Analyze Results

DARWIN – Analyzing

  • Evaluate: Evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.

  • Summarize Data: Provides a statistical summary of the values taken by the data in the specified fields of a dataset.

  • Frequency Count: Provides information on the frequency with which particular data values appear in a dataset.
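
The Summarize Data and Frequency Count functions can be pictured with a few lines of Python. This is only a sketch of the idea, not DARWIN's interface: the function names `summarize` and `frequency_count` and the record layout are assumptions, and the rows are borrowed from the Potential Customer table earlier in the deck.

```python
from collections import Counter
from statistics import mean

# Records borrowed from the Potential Customer table.
RECORDS = [
    {"person": "Ann S", "sex": "F", "income": 10_000},
    {"person": "Jane G", "sex": "F", "income": 20_000},
    {"person": "Sri S", "sex": "M", "income": 65_000},
    {"person": "Egor", "sex": "M", "income": 10_000},
]


def summarize(records, field):
    """Statistical summary of a numeric field."""
    values = [rec[field] for rec in records]
    return {"count": len(values), "min": min(values),
            "max": max(values), "mean": mean(values)}


def frequency_count(records, field):
    """How often each distinct value of a field appears."""
    return Counter(rec[field] for rec in records)


print(summarize(RECORDS, "income"))
print(frequency_count(RECORDS, "sex"))
```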

DARWIN – Analyzing

  • Performance Matrix: Can be used to compare simple fields or simple functions of fields.

  • Sensitivity: Provides a model showing the relative importance of the attributes used in building a model.

DARWIN – Code Generation

  • Darwin can generate C, C++, and Java code for a Tree or Net model so that a prediction function can be called from an application program

  • Java code can also be generated to embed a model in a Web applet

FOR MORE INFO...

http://technet.oracle.com/docs/products/datamining/doc_index.htm

DARWIN

  • For more info

  • http://technet.oracle.com/software/products/intermedia/software_index.html

    1. Oracle Data Mining Data sheet

    2. Oracle Data Mining Solutions

  • http://www.oracle.com/ip/analyze/warehouse/datamining/

  • http://www.oracle.com/oramag/oracle/98-Jan/fast.html

    1. Managing Unstructured Data with Oracle8

  • http://technet.oracle.com/products/datamining/

    1. Product manuals



Oracle – Intermedia Text

  • A ranking technique called theme proving is used

    Documents are grouped into categories and subcategories

  • Integrated with the Oracle8 database

  • Absolutely no training or tuning required

Oracle – Intermedia Text

  • Lexical Knowledge Base

    - 200,000 concepts from very broad domains

    - 2000 major categories

    - Concepts mapped into one or more words/phrases in canonical form

    - Each of these has alternate inflectional variations, acronyms, and synonyms stored

    - Total vocabulary of 450,000 terms

    - Each entry has other parameters, like parts of speech

Oracle – Intermedia Text

Theme Extraction

- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme.

- All the ancestor themes are also included in the result

- Theme proving is done before final ranking

Queries

Direct match, phrase search (‘contains’), case-sensitive query, misspellings and fuzzy match, inflections (‘about’), compound queries, Boolean operators, Natural language query

Oracle – Intermedia Text

  • Oracle at TREC 8

    (Eighth Text Retrieval Conference - http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)

    Recall at 1000                     71.57% (3384/4728)
    Average precision                  41.30%
    Initial precision (at recall 0.0)  92.79%
    Final precision (at recall 1.0)    07.91%

Intermedia Text-Model

Interface Options

Language Selection

  • Java for robot

  • PL/SQL for data retrieval

Code Execution

Overview of the System

[System diagram. Components shown: Customer Browser, Client Browser, Web Server, server process (listening at port 80), tag stripper, JDBC, Oracle 8i with Intermedia Text]

Intermedia Text

Steps for Building an application

  • Load the documents

  • Index the documents

  • Issue Queries

  • Present the documents that satisfy the query

Loading Methods

  • Insert statements

  • SQL*Loader

  • Ctxsrv – a server daemon process which builds the index at regular intervals

  • Ctxload utility, used for:

    - Thesaurus import/export

    - Text loading

    - Document updating/exporting

Create and Populate a Simple Table

CREATE TABLE quick (
  quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text     VARCHAR2(80) );

INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );

COMMIT;

Run a Text Query

SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

DRG-10599: column is not indexed

You must have a Text index on a column before you can do a "contains" query on it

Create the Text Index

CREATE INDEX quick_text
  ON quick ( text )
  INDEXTYPE IS CTXSYS.CONTEXT;

CTXSYS is the system user for interMedia Text

The INDEXTYPE keyword is a feature of the Extensible Indexing Framework

Run a Text Query

SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

TEXT
-----------------------
The cat sat on the mat

You should regard the CONTAINS function as boolean in meaning

It is implemented as a number since SQL does not have a boolean datatype

The only sensible way to use it is with > 0

Run a Text Query

SELECT SCORE(42) s, text FROM quick
  WHERE CONTAINS ( text, 'dog', 42 ) >= 0 /* just for teaching purposes! */
  ORDER BY s;

 S TEXT
-- ---------------------------
 7 The dog barked like a dog
 4 The fox jumped over the dog

The better the match, the higher the score

The value can be used in ORDER BY but has no absolute significance

The score is zero when the query is not matched

Intermedia Text - Indexing Pipeline

[Pipeline diagram] Database → (column data) → Datastore → (doc data) → Filter → (filtered doc text) → Sectioner → (plain text) → Lexer → (tokens) → Engine → (index data)

The Sectioner also passes section offsets to the Engine.

  • First step is creating an index

  • Datastore: Reads the data out of the table (for a URL datastore, performs a 'GET')

Intermedia Text - Indexing Pipeline

  • Filter: Transforms the data into a text format; this is needed because some formats may be binary, as when storing doc, pdf, or HTML types.

  • Sectioner: Converts the filtered text to plain text, removing tags and invisible information.

  • Lexer: Splits the text into discrete tokens.

  • Engine: Takes the tokens from the lexer, the offsets from the sectioner, and a stoplist (a list of stop words) to build an index.
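
The sectioner, lexer, and stoplist stages described above can be approximated in a few lines of Python. This is a toy sketch under several assumptions: the function names, the regular-expression tag stripping, and the four-word stoplist are all invented for illustration, the filter stage is skipped (the input is assumed to be HTML text already), and section offsets are not modeled.

```python
import re

STOPLIST = {"the", "a", "on", "of"}  # hypothetical stop words


def sectioner(html):
    """Naively strip tags so that only plain text remains."""
    return re.sub(r"<[^>]+>", " ", html)


def lexer(text):
    """Split plain text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def engine_tokens(html, stoplist=STOPLIST):
    """Return (position, token) pairs that would be handed to the index builder."""
    tokens = lexer(sectioner(html))
    return [(pos, tok) for pos, tok in enumerate(tokens, start=1)
            if tok not in stoplist]


print(engine_tokens("<p>The <b>cat</b> sat on the mat</p>"))
```

Positions are counted before stop words are dropped, so the surviving tokens keep their original word offsets.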

Intermedia Text - Indexing Pipeline

Example of index creation

Statements

  • INSERT INTO docs VALUES (1, 'first document');

  • INSERT INTO docs VALUES (2, 'second document');

    Produces an index

    DOCUMENT  doc 1 position 2, doc 2 position 2

    FIRST     doc 1 position 1

    SECOND    doc 2 position 1
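
The index shown above is an inverted index with word positions. A minimal Python sketch that reproduces it from the two inserted rows might look like this; `build_index` is a hypothetical name, and whitespace splitting stands in for the real lexer.

```python
from collections import defaultdict


def build_index(docs):
    """Map each uppercased token to a list of (doc id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs:
        for position, token in enumerate(text.upper().split(), start=1):
            index[token].append((doc_id, position))
    return dict(index)


index = build_index([(1, "first document"), (2, "second document")])
for token in sorted(index):
    print(token, index[token])
```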

Testing procedure

  • Document set from newsgroups

    122 documents from a text mining site

    Loaded using insert statements

    File datastore used

  • Documents (HTML) from browsing

    20 documents

    Loaded from server process

    URL datastore used

Newsgroup Results

1. Religion, Atheism – 15

on Bible, Islam, religious beliefs

2. Comp-os-ms-windows-misc – 17

about operating systems, protocols, installation

3. Comp.graphics – 27

on hardware and software for computer graphics

4. Ice Hockey – 18

5. Computer hardware – 12

on installation of different peripheral devices

6. Mideast.politics – 14

on political development in the Mideast

7. Science.space – 19

on various space programs, devices, theories

Newsgroup Results

Group                       Retrieved  Wrong  Not Retrieved  Recall  Precision
Science and technology      120        16     1              99%     78%
Computer Hardware Industry  12         0      5              71%     100%
Government                  103        26     8              90%     74%
Politics                    17         3      0              100%    82%
Military                    5          1      0              80%     80%
Social Environment          48         2      14             77%     96%
Religion                    22         3      2              90%     86%
Islam                       4          0      0              100%    100%
Leisure recreation          22         4      5              78%     82%
Sports                      21         1      0              90%     90%
Hockey                      18         0      0              100%    100%

Recall = (# of correct positive predictions) / (# of positive examples)

Precision = (# of correct positive predictions) / (# of positive predictions)
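
The two definitions can be applied to the columns of the results table with a small helper. The sketch below assumes one particular reading of those columns, which the slides do not spell out: "Retrieved" is the number of positive predictions, "Wrong" is the number of incorrect ones among them, and "Not Retrieved" is the number of positive examples that were missed. The function name `recall_precision` is hypothetical.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Recall and precision from the three count columns of the results table."""
    correct = retrieved - wrong                   # correct positive predictions
    recall = correct / (correct + not_retrieved)  # divided by all positive examples
    precision = correct / retrieved               # divided by all positive predictions
    return recall, precision


# Government row of the table: Retrieved 103, Wrong 26, Not Retrieved 8.
recall, precision = recall_precision(103, 26, 8)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

Under this reading the Government row works out to a recall of about 0.906 and a precision of about 0.748, close to the 90% and 74% reported.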

Query

Syntax: Binary Operators

  • AND    &    cat & dog

  • OR     |    cat | dog

  • EQUIV  =    cat = dog

  • MINUS  -    cat - dog

  • NOT    ~    cat ~ dog

  • ACCUM  ,    cat , dog
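
An application that assembles query expressions programmatically could map operator names to the symbols listed above. The helper below is a hypothetical sketch, not part of any Oracle API: `combine` only joins two terms with the chosen symbol and does no escaping of reserved characters.

```python
# Operator names and symbols as listed on the slide.
OPERATORS = {
    "AND": "&",
    "OR": "|",
    "EQUIV": "=",
    "MINUS": "-",
    "NOT": "~",
    "ACCUM": ",",
}


def combine(left, operator, right):
    """Join two query terms with the symbol for the named binary operator."""
    symbol = OPERATORS[operator.upper()]
    return f"{left} {symbol} {right}"


print(combine("cat", "AND", "dog"))
print(combine("cat", "accum", "dog"))
```

The resulting string would be passed as the second argument of CONTAINS, as in the earlier 'sat on the mat' query.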

Semantics: Binary Operators

  • The semantics of all the binary operators is defined in terms of SCORE

  • However, the score for even the simplest query expression - a single word - is calculated by a subtle rule

    • the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently

    • but when “word1” occurs N times in document D, its score is lower than when “word2” occurs N times in document D if “word1” occurs more often in the whole document set than “word2”

The Salton Algorithm

  • interMedia Text uses an algorithm similar to the Salton algorithm, which is widely used in text retrieval products

  • The score for a word is proportional to f (1 + log(N/n)), where

    • f is the frequency of the search term in the document

    • N is the total number of documents

    • n is the number of documents which contain the search term

  • The score is converted into an integer in the range 0 - 100.
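
The proportionality above can be explored numerically. In the sketch below the base-10 logarithm and the scale factor of 3 are assumptions (the slide only states proportionality), and the function names are hypothetical. With these assumed values a term must occur 34 times to reach a score of 100 in a one-document set.

```python
import math


def relative_score(f, N, n):
    """f * (1 + log(N/n)); the base-10 logarithm is an assumption."""
    return f * (1 + math.log10(N / n))


def scaled_score(f, N, n, scale=3.0):
    """Integer score capped at 100; the scale factor is an assumption."""
    return min(100, round(scale * relative_score(f, N, n)))


# More occurrences in the document raise the score...
print(relative_score(1, 100, 10), relative_score(2, 100, 10))
# ...and a term found in fewer documents of the set scores higher.
print(relative_score(3, 100, 50), relative_score(3, 100, 5))
print(scaled_score(33, 1, 1), scaled_score(34, 1, 1))
```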

The Salton Algorithm

Assumption

Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

The Salton Algorithm

This table assumes that only one document in the set contains the query term.

# of Documents in Document Set    Occurrences of Term in Document Needed to Score 100
1                                 34
5                                 20
10                                17
50                                13
100                               12
500                               10
1,000                             9
10,000                            7
100,000                           5
1,000,000                         4

Summary of operators

  • Binary operators...

& | = - ~ ,

  • Built-in expansion...

? $ !

  • Thesaurus...

BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR, TRSYN, TT

Summary of operators

  • Stored query expression...

SQE

  • Grouping and escaping...

() {} \

  • Special...

NEAR WITHIN ABOUT

Application Details- Customer profile Analyzer

The HTTP server (for user web page caching) is started. The Oracle web server is also started.

Log In Screen- Customer & User

The log-in screen is used by both the customer and the users. The Oracle web server takes care of the secure connections, while for the HTTP server the user ID is common for the session: no user can invoke a document from the server without a user ID.

Customer Interface – Http Server

The user uses the interface provided by the custom HTTP server.

Main User Screen

The user can choose the type of data to be analyzed. Two types of data exist:

1. Newsgroups

2. User-browsed URLs

Selection of Category and options

The user chooses a category and other options, such as:

  • Generating theme

  • Generating gist

  • Generating marked-up text

  • Date range

Results Page – Gist Generation

This page can be used for drilling down to the actual document, which opens in the browser (generated by the filter option). Theme and gist can also be generated from this screen.

Search Screen

The search screen has advanced options such as fuzzy search and about search. A chain of expressions can be joined using conjunctions (such as 'not', 'or', 'and').

Conclusion

  • New estimation methods are trying to extract more meaning from text.

  • Industry has capable text mining products and is constantly improving the technology.

  • Unstructured Data Mining – a long way to go.
