Instructor: Bala Ravikumar (Ravi)
Tel: (707) 664 3335
Office: Darwin Hall 116 I
Course Web Page
6 to 8:45 PM, Wednesdays
Room: Salazar Hall 2003
Office hours: M 9 – 10, T 11 – 12, W 5 – 6
basic probability and statistics (probability distribution, random variable, conditional probability etc.)
algorithms and data structures (sorting, hashing, binary trees, algorithm design techniques)
Programming in a high-level language (Java, Python, MATLAB, C#, …)
Linear algebra (vectors, linear independence, matrix rank, Gaussian elimination etc.)
These topics will be reviewed in class, but it will help to spend some time familiarizing yourself with them on your own.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
This book’s focus is on WEB DATA MINING
Web site for the text:
Mining the Web, S.Chakrabarti, MKP.
Data Mining, Witten and Frank, MKP.
The elements of statistical learning, Hastie, Tibshirani, and Friedman, Springer-Verlag.
Web Data Mining: Exploring Hyperlinks, Contents and Usage data, Bing Liu, Springer-Verlag.
Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley.
Homework: 25%
One mid-term, in-class, open book/notes?
Final Exam: 25%
In-class or take-home?
Individual, design and implementation
The project is done individually and is semester-long: implement, test, write a paper, and present in class.
Web data organization
Classification (supervised learning)
Clustering (unsupervised learning)
Association rule mining
Language models for information retrieval
Vector space models
SVM and other tools
LSI and tools from linear algebra
Other applications – e.g. bioinformatics
Data mining is also called knowledge discovery
Data mining is
extraction of useful patterns from data sources, e.g., databases, texts, web, images, etc.
Patterns must be:
valid, novel, potentially useful, understandable
Our focus will be on text data (in particular web)
mining patterns that can classify future (new) data into known classes.
Association rule mining
mining any rule of the form X → Y, where X and Y are sets of data items.
identifying similar groups in the data
Sequential pattern mining:
A sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
discovering the most significant changes in data
Data visualization: using graphical methods to show patterns in data.
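The confidence of a sequential rule A → B, as described in the task list above, can be estimated as the fraction of occurrences of A that are immediately followed by B. The following is a minimal sketch under that reading; the function name `sequential_confidence` and the session data are hypothetical and made up for illustration.

```python
def sequential_confidence(sequences, a, b):
    """Fraction of occurrences of event `a` that are immediately followed by `b`."""
    a_count = 0
    ab_count = 0
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == a:
                a_count += 1
                # Count the rule as satisfied only if b is the very next event.
                if i + 1 < len(seq) and seq[i + 1] == b:
                    ab_count += 1
    return ab_count / a_count if a_count else 0.0


# Hypothetical click-stream sessions (invented for this example).
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "search", "search", "product"],
    ["search", "home"],
]

print(sequential_confidence(sessions, "search", "product"))  # 0.5
```

Here "search" occurs four times and is immediately followed by "product" twice, so the rule search → product holds with confidence 0.5 on this toy data.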
Computerization of businesses produces a huge amount of data
How to make best use of data?
Knowledge discovered from data can be used for competitive advantage.
Online businesses generate even larger data sets
Online retailers (e.g., amazon.com) are largely driven by data mining.
Web search engines are information retrieval and data mining companies
Make use of your data assets
There is a big gap between stored data and knowledge, and the transition won’t occur automatically.
Many interesting things you want to find cannot be found using database queries
“Find me people likely to buy my products”
“Who is likely to respond to my promotion?”
“Which movies should be recommended to each customer?”
The data is abundant.
The computing power is not an issue.
Data mining tools are available
The competitive pressure is very strong.
Almost every company is doing (or has to do) it
Detecting terrorist activities
Streaming data, mobile computing, wireless networks
Data mining is a multidisciplinary field:
Machine learning/artificial intelligence
Natural language processing
Marketing: customer profiling and retention, identifying potential customers, market segmentation.
Engineering: identify causes of problems in products.
Scientific data analysis: weather prediction, financial data analysis, image analysis etc.
Fraud detection: identifying credit card fraud, intrusion detection.
Text and web: a huge number of applications …
Bioinformatics: structure prediction, classification, microarray analysis etc.
Any application that involves a large amount of data …
Example: if-then rules
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
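The two rules above form a decision list: they are tried in order and the first rule whose condition holds decides the recommendation. A minimal sketch follows; the function name `recommend_lens` and the dictionary keys are assumptions made for this example, and because only two rules are shown here, any other case returns `None` rather than a guessed value.

```python
def recommend_lens(patient):
    """Apply the two if-then rules in order; the first matching rule wins."""
    if patient["tear_production_rate"] == "reduced":
        return "none"
    if patient["age"] == "young" and patient["astigmatic"] == "no":
        return "soft"
    # The remaining rules are not given in these notes, so nothing is inferred.
    return None


print(recommend_lens({"tear_production_rate": "reduced",
                      "age": "young", "astigmatic": "no"}))   # none
print(recommend_lens({"tear_production_rate": "normal",
                      "age": "young", "astigmatic": "no"}))   # soft
```

Note that the order matters: a young, non-astigmatic patient with reduced tear production gets "none", because the first rule fires before the second is considered.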
Classification rule: predicts the value of a given attribute (the classification of an example)
Association rule: predicts the value of an arbitrary attribute (or combination)
If outlook = sunny and humidity = high then play = no
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
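Rules like the ones above are usually judged by their support (how many records satisfy both sides) and confidence (how often the right-hand side holds when the left-hand side does). The sketch below computes both for an attribute-value rule. The helper name `rule_stats` and the four weather records are hypothetical and invented for illustration; they are not the data set behind these rules.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    def holds(record, conditions):
        return all(record.get(attr) == value for attr, value in conditions.items())

    matches_if = [r for r in records if holds(r, antecedent)]
    matches_both = [r for r in matches_if if holds(r, consequent)]
    support = len(matches_both) / len(records) if records else 0.0
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence


# Hypothetical records (made up for this example).
records = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high",
     "windy": "false", "play": "no"},
    {"outlook": "sunny", "temperature": "cool", "humidity": "normal",
     "windy": "false", "play": "yes"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal",
     "windy": "true", "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high",
     "windy": "false", "play": "yes"},
]

# If temperature = cool then humidity = normal
print(rule_stats(records, {"temperature": "cool"}, {"humidity": "normal"}))  # (0.5, 1.0)
```

Because the consequent is passed as a dictionary, the same function handles rules that predict a combination of attributes, such as outlook = sunny and humidity = high.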
Linear regression function: predicting CPU performance
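The regression function itself is not reproduced in these notes. As a rough illustration of the idea only, the sketch below fits a one-variable least-squares line y = slope · x + intercept. The function name `fit_line` and the data points are hypothetical; a real CPU-performance model would use several machine attributes rather than a single predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1 (made up for this example).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```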
Given below are the % of occurrences of a few select words in spam and genuine e-mail messages:
A decision list may be used to identify spam.
Data mining on text
Motivated by the huge amount of online text on the Web and other sources
Text contains a huge amount of information of any imaginable type!
A major direction and tremendous opportunity!
Text classification and clustering
Opinion mining and summarization
The Web has dramatically changed the way that people express their opinions.
People can post their opinions on almost anything at review sites, Internet forums, discussion groups, blogs, etc.
Benefits of Review Analysis
Potential Customer: No need to read many reviews
Product manufacturer: market intelligence, product benchmarking
Extracting product features (called Opinion Features) that have been commented on by customers.
Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.
Summarizing and comparing results.
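The three steps above (find the sentences that mention a feature, decide whether each is positive or negative, then summarize) can be sketched very roughly with a word-list approach. This is a naive illustration, not the method used in the review-analysis work these slides describe: the `POSITIVE` and `NEGATIVE` word sets, the substring match on the feature name, and the function names are all assumptions made for this example. The sentences are taken from the camera reviews quoted in these notes.

```python
import re

# Hypothetical, hand-picked opinion words (an assumption, not a real lexicon).
POSITIVE = {"amazing", "good", "great"}
NEGATIVE = {"hazy", "blurry"}


def sentence_polarity(sentence):
    """Label a sentence by counting positive and negative opinion words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize(feature, sentences):
    """Count polarity labels over the sentences that mention the feature."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for s in sentences:
        if feature in s.lower():  # crude substring match on the feature name
            counts[sentence_polarity(s)] += 1
    return counts


sentences = [
    "The pictures coming out of this camera are amazing.",
    "Overall this is a good camera with a really good picture clarity.",
    "The pictures come out hazy if your hands shake even for a moment "
    "during the entire process of taking a picture.",
    "Focusing on a display rack about 20 feet away in a brightly lit room "
    "during day time, pictures produced by this camera were blurry and in "
    "a shade of orange.",
]

print(summarize("picture", sentences))
# {'positive': 2, 'negative': 2, 'neutral': 0}
```

A word list this small misses negation ("not good") and domain-specific wording, which is part of why opinion mining is treated as a research problem rather than a lookup.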
GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.
I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital.
The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …
The pictures coming out of this camera are amazing.
Overall this is a good camera with a really good picture clarity.
. . . .
The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life
Digital camera 2
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
1 if play contains word, 0 otherwise
Brutus AND Caesar BUT NOT Calpurnia
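A query such as the one above can be answered from a term-document incidence matrix, where each entry is 1 if the play contains the word and 0 otherwise. The sketch below is one simple way to do it. The 0/1 values are illustrative, modeled on the Shakespeare example in the course text, and should be treated as assumptions rather than verified counts; the function name `boolean_query` is hypothetical.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# incidence[term][i] is 1 if plays[i] contains the term, 0 otherwise.
# Illustrative values (an assumption for this example).
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}


def boolean_query(incidence, plays, must, must_not):
    """Return plays containing every term in `must` and none in `must_not`."""
    hits = []
    for i, play in enumerate(plays):
        has_all = all(incidence[term][i] for term in must)
        has_excluded = any(incidence[term][i] for term in must_not)
        if has_all and not has_excluded:
            hits.append(play)
    return hits


# Brutus AND Caesar BUT NOT Calpurnia
print(boolean_query(incidence, plays, ["Brutus", "Caesar"], ["Calpurnia"]))
# ['Antony and Cleopatra', 'Hamlet']
```

Scanning a dense matrix like this is only practical for small collections; for large ones the matrix is mostly zeros, which motivates storing only the documents in which each term occurs.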