CES 514 – Data Mining Spring 2010 Sonoma State University

1 / 33

# CES 514 – Data Mining Spring 2010 Sonoma State University - PowerPoint PPT Presentation

CES 514 – Data Mining Spring 2010 Sonoma State University. Course Details:. Instructor: Bala Ravikumar (Ravi) Email: [email protected] , [email protected] Tel: (707) 664 3335 Office: Darwin Hall 116 I Course Web Page http://ravi.cs.sonoma.edu/~ravi/ces514sp10 Lecture time:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' CES 514 – Data Mining Spring 2010 Sonoma State University' - cosima

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### CES 514 – Data MiningSpring 2010Sonoma State University

Course Details:

Instructor: Bala Ravikumar (Ravi)

Tel: (707) 664 3335

Office: Darwin Hall 116 I

Course Web Page

http://ravi.cs.sonoma.edu/~ravi/ces514sp10

Lecture time:

6 to 8:45 PM, Wednesdays

Room: Salazar Hall 2003

Office hours: M 9 – 10, T 11 – 12, W 5 – 6

Prerequisites

basic probability and statistics (probability distribution, random variable, conditional probability etc.)

algorithms and data structures (sorting, hashing, binary trees, algorithm design techniques)

Programming in high-level language (Java, Python, Matlab, c#, …)

Linear algebra (vectors, linear independence, matrix rank, Gaussian elimination etc.)

These topics will be reviewed. However, it will be helpful to spend some time on your own to familiarize yourself.

Text book

Christopher D. Manning,  Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

This book’s focus is on WEB DATA MINING

Web site for the text:

http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Mining the Web, S.Chakrabarti, MKP.

Data Mining, Witten and Frank, MKP.

The elements of statistical learning, Hastie, Tibshirani, and Friedman, Springer-Verlag.

Web Data Mining: Exploring Hyperlinks, Contents and Usage data, Bing Liu, Springer-Verlag.

Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley.

Overlapping fields

• Statistics
• Artificial intelligence (machine learning)
• Data base and Information retrieval
• Natural language processing
• Algorithm design and analysis

Quiz: 10%

Home Work: 25 %

Midterm: 15%

One mid-term, in-class, open book/notes?

Final Exam: 25%

In-class or take-home?

Project: 25%

Individual, design and implementation

Example Projects from Fall 2005 and 2007

• Strategy for predicting the winner in a game similar to Jai Alai.
• Hand-written character recognition
• classify the type of disease based on some test results
• classification of e-mail (junk vs. useful, personal vs. business vs. family etc.)
• classification of questions in a multiple choice test based on the responses of students
• identifying the author from a sample text
• implement an association rule mining algorithm
• implement a visualization algorithm that provides various options for viewing the data
• classifying mushroom into edible and poisonous based on a number of attributes – such as color, length of the stem, width etc.
• classifying web site based on content

Project is done individually, and is semester long - implement, test, write a paper, present in class.

Today’s lecture

• Overview of the course
• Chapter 1 of the text
Overview of Topics

Web data organization

Web search

Classification (supervised learning)

Clustering (unsupervised learning)

Association rule mining

Language models for information retrieval

Vector space models

SVM and other tools

LSI and tools from linear algebra

Other applications – e.g. bioinformatics

What is data mining?

Data mining is also called knowledge discovery

Data mining is

extraction of useful patterns from data sources, e.g., databases, texts, web, images, etc.

Patterns must be:

valid, novel, potentially useful, understandable

Our focus will be on text data (in particular web)

Some sample problems in Data Mining

• Extract useful knowledge from the vast data and information available on the web. (e.g. tagging of web sites, labeling images, predict the needs of a web surfer from pattern of clicks.)
• Using the financial record of a person, determine the risk involved in giving a loan. (decision could be yes or no. more generally, it could be the type of loan – interest rate, duration etc.)
• movie (book etc.) recommendation based on prior choices.
• prediction of weather, traffic pattern, outcome of an event etc.
• From the items recorded in the check-out counter of a super market, determine any correlation between items being sold. (used to decide which ones to put on sale.)
• study and understand of social networks.
• rank web page according to significance.

Classification

mining patterns that can classify future (new) data into known classes.

Association rule mining

mining any rule of the form X Y, where X and Y are sets of data items.

Clustering

identifying similar groups in the data

Regression analysis

Sequential pattern mining:

A sequential rule: A B, says that event A will be immediately followed by event B with a certain confidence

Deviation detection:

discovering the most significant changes in data

Data visualization: using graphical methods to show patterns in data.

Why is data mining important?

Computerization of businesses produce huge amount of data

How to make best use of data?

Knowledge discovered from data can be used for competitive advantage.

Online businesses generate even larger data sets

Online retailers (e.g., amazon.com) are largely driven by data mining.

Web search engines are information retrieval and data mining companies

Why is data mining necessary?

Make use of your data assets

There is a big gap from stored data to knowledge; and the transition won’t occur automatically.

Many interesting things you want to find cannot be found using database queries

“find me people likely to buy my products”

“Who are likely to respond to my promotion?”

“Which movies should be recommended to each customer?”

Why data mining now?

The data is abundant.

The computing power is not an issue.

Data mining tools are available

The competitive pressure is very strong.

Almost every company is doing (or has to do) it

Socio-political exigencies

Detecting terrorism activities

New technologies

Streaming data, mobile computing, wireless networks

Related fields

Data mining is an multi-disciplinary field:

Machine learning/artificial intelligence

Statistics

Databases

Information retrieval

Visualization

Natural language processing

Game theory

etc.

Data mining applications

Marketing:customer profiling and retention, identifying potential customers, market segmentation.

Engineering: identify causes of problems in products.

Scientific data analysis: weather prediction, financial data analysis, image analysis etc.

Fraud detection: identifying credit card fraud, intrusion detection.

Text and web: a huge number of applications …

Bioinformatics: structure prediction, classification, microarray analysis etc.

Any application that involves a large amount of data …

Structural descriptions

Example: if-then rules

If tear production rate = reducedthen recommendation = none

Otherwise, if age = young and astigmatic = no then recommendation = soft

Classification vs. association rules

Classification rule:predicts value of a given attribute (the classification of an example)

Association rule:predicts value of arbitrary attribute (or combination)

If outlook = sunny and humidity = highthen play = no

If temperature = cool then humidity = normal

If humidity = normal and windy = falsethen play = yes

If outlook = sunny and play = no then humidity = high

If windy = false and play = no then outlook = sunny and humidity = high

Spam filter software

Given below are the % of occurrences of a few select words in spam and genuine e-mail messages:

A decision list may be used to identify spam.

Text mining

Data mining on text

Due to a huge amount of online texts on the Web and other sources

Text contains a huge amount of information of any imaginable type!

A major direction and tremendous opportunity!

Main topics

Text classification and clustering

Information retrieval

Information extraction

Opinion mining and summarization

Example: Opinion Mining

The Web has dramatically changed the way that people express their opinions.

One can post their opinions on almost anything at review sites, Internet forums, discussion groups, blogs, etc.

Product reviews

Benefits of Review Analysis

Potential Customer: No need to read many reviews

Product manufacturer: market intelligence, product benchmarking

Feature Based Analysis & Summarization

Extracting product features (called Opinion Features) that have been commented on by customers.

Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.

Summarizing and comparing results.

An example

GREAT Camera., Jun 3, 2004

Reviewer: jprice174 from Atlanta, Ga.

I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital.

The pictures coming out of this camera are amazing. The \'auto\' feature takes great pictures most of the time. And with digital, you\'re not wasting film if the picture doesn\'t come out. …

….

Summary:

Feature1: picture

Positive:12

The pictures coming out of this camera are amazing.

Overall this is a good camera with a really good picture clarity.

. . . .

Negative: 2

The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.

Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.

Feature2: battery life

Visual Comparison

+

• Summary of reviews of Digitalcamera 1

_

Picture

Battery

Zoom

Size

Weight

+

• Comparison of reviews of

Digitalcamera 1

Digital camera 2

_

Information retrieval – Ch 1 Boolean query

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Sec. 1.1

Unstructured data in 1680
• Which plays of Shakespeare contain the words BrutusANDCaesar but NOTCalpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
• Slow (for large corpora)
• NOTCalpurnia is non-trivial
• Other operations (e.g., find the word Romans nearcountrymen) not feasible
• Ranked retrieval (best documents to return)
• Later lectures

Sec. 1.1

Term-document incidence

1 if play contains word, 0 otherwise

Brutus AND Caesar BUT NOT Calpurnia

Sec. 1.1

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND.
• 110100 AND 110111 AND 101111 = 100100.