DATA MINING Prof. Navneet Goyal BITS, Pilani

Evolution of Database Technology
  • 1960s & Earlier: Data Collection & Database Creation
    • Primitive File Processing
  • 1970s-early 1980s: DBMSs
    • Hierarchical & Network DBS
    • RDBMS
    • Data Modeling Tools (ER Model)
    • Indexing Techniques
    • Query languages: SQL
    • User Interfaces: Forms & Reports
    • Query Processing & Optimization
    • Transaction Management: Concurrency & Recovery
  • Mid 1980s-present: Advanced DBS
    • Advanced Data Models: Extended Relational
    • Application Oriented
  • Late 1980s-present: Data Warehousing & Data Mining
    • DW & OLAP Technology
  • 1990s-present: Web-based DBS
    • XML based Databases
    • Web Mining
  • 2000-...: New Generation of Integrated Information Systems
  • Why study Data Mining?
Tsunami of Data

“There is a tsunami of data that is crashing onto the beaches of the civilized world. This is a tidal wave of unrelated, growing data formed in bits and bytes, coming in an unorganized, uncontrolled, incoherent cacophony of foam. It's filled with flotsam and jetsam. It's filled with the sticks and bones and shells of inanimate and animate life. None of it is easily related, none of it comes with any organizational methodology. ...The tsunami is a wall of data -- data produced at greater and greater speed, greater and greater amounts to store in memory, amounts that double, it seems, with each sunset. On tape, on disks, on paper, sent by streams of light. Faster and faster, more and more and more.”

Richard Saul Wurman, Information Architects
Tsunami of Data

In 2005, mankind created 150 exabytes of data

In 2010, it will create 1200 exabytes*

* 2008 study by International Data Corp. (IDC)
Tsunamis of Data
  • 30 TB/night: Large Synoptic Survey (LSS) Telescope (2014)
  • 15 PB/year: CERN’s LHC (May 2008)
  • 1 PB over 3 years: EOS (Earth Observing System) data (2001)
  • Global Cloud Resolving Model (GCRM) @CSU
    • 2 km, 100 levels, hourly data: ~4 TB / simulated hour, ~100 TB / simulated day, ~35 PB / simulated year
    • 4 km, 100 levels, hourly data: ~1 TB / simulated hour, ~24 TB / simulated day, ~9 PB / simulated year
Tsunami of Data

Telecom data (4.6 bn mobile subscribers)

There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS, IMs.

IP network traffic: up to 1 billion packets per hour per router. Each ISP has many (hundreds of) routers!

Weblog data (160 mn websites)
Tsunami of Data
  • No. of pics on Facebook
    • 15 bn unique photos
    • 60 bn photos stored (4 sizes)
  • Imageshack (20 bn)
  • Photobucket (7.2 bn)
  • Flickr (3.4 bn)
  • Multiply (3 bn)
Recent Articles
  • The Data Deluge (25th Feb. 2010, The Economist)
  • The Data Singularity is here! (08th Mar. 2010, Dataspora Blog)
  • The Data Singularity Part II: Human-sizing big data (27th May 2010, Dataspora Blog)
Data Mining

My definition of Data Mining

“Data Mining is a family of techniques that transforms raw data into actionable information/knowledge”
Data Mining has two perspectives:

Data (algorithms)

Domain (applications)

One person having both these perspectives: very unlikely!

Domain experts should know what is possible with Data Mining

Data miners seek problems from domain experts

Modeling perspective: requires involvement of both data mining & domain experts

Intrusion Detection Systems

Spam mail filtering

Data Recovery

Web personalization

Adaptive Websites

Information Retrieval

Data Cleaning

Information Retrieval


Precision farming

Predicting crop yield

Terrorism Prevention



Fraud detection

Tax cheats

Credit card abuse

Predicting TRPs


Health Care

Civil Engineering*

What is NOT Data Mining?
  • “Data mining” was originally a statistician's term for overusing data to draw invalid inferences
  • Bonferroni's theorem warns us that if there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical validity.
  • Famous example: Joseph B. Rhine, a “parapsychologist” at Duke in the 1950s, tested students for “extrasensory perception” by asking them to guess 10 cards, each red or black. He found that about 1 in 1,000 of them guessed all 10 correctly and, instead of realizing that this is what you'd expect from random guessing, declared them to have ESP. When he retested them, he found they did no better than average.

His conclusion:

“telling people they have ESP causes them to lose it”
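
The arithmetic behind this example can be checked directly. The following is a minimal sketch, assuming each guess is an independent, fair red/black choice; the group size of 1,000 students is a hypothetical figure used only for illustration, since the slide does not say how many were tested.

```python
from fractions import Fraction

# Probability of guessing a single red/black card correctly by pure chance.
p_single = Fraction(1, 2)

# Probability of guessing all 10 cards correctly (guesses assumed independent).
p_all_ten = p_single ** 10

# Hypothetical group size, for illustration only.
n_students = 1000

# Expected number of students who get all 10 right by luck alone.
expected_lucky = n_students * p_all_ten

print(p_all_ten)              # 1/1024
print(float(expected_lucky))  # 0.9765625
```

So roughly one perfect score per thousand students is what chance alone predicts, which is the point of the Bonferroni warning.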

What is Data Mining?
  • Discovery of useful summaries of data - Ullman
  • Extracting or “mining” knowledge from large amounts of data
  • The efficient discovery of previously unknown patterns in large databases
  • Technology which predicts future trends based on historical data
  • It helps businesses make proactive and knowledge-driven decisions
  • Data Mining vs. KDD
  • The name “Data Mining” a misnomer?
Data Mining

Data mining is ready for application in the business & scientific community because it is supported by three technologies that are now sufficiently mature:

  • Massive data collection
  • Powerful multiprocessor computers
  • Data mining algorithms
Data Mining Applications

Some examples of “successes”:

1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.

2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.

3. “Diapers and beer.” The observation that customers who buy diapers are more likely than average to buy beer allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.

4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.

5. Comparison of the genotype of people with/without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining has become much more important now that the human genome has been fully decoded.

Data Mining Communities

Several different communities have laid claim to DM

1. Statistics.

2. AI, where it is called “machine learning.”

3. Researchers in clustering algorithms.

4. Visualization researchers.

5. Databases. We'll be taking this approach, of course, concentrating on the challenges that appear when the data is large and the computations complex. In a sense, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.

Stages of Data Mining Process

1. Data gathering, e.g., data warehousing.

2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.

3. Feature extraction: obtaining only the interesting attributes of the data, e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat.

4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is where we shall concentrate our effort.

5. Visualization of the data.

6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.

Data Mining
  • Many different algorithms for performing many different tasks
  • DM algorithms can be characterized as consisting of 3 parts:
    • Model
    • Preference
    • Search
  • Model could be
    • Predictive
    • Descriptive
Predictive Model
  • Making prediction about values of data using known results from different data
  • Example: Credit Card Company
  • Every purchase is placed in 1 of 4 classes
      • Authorize
      • Ask for further identification before authorizing
      • Do not authorize
      • Do not authorize but contact police

Two functions of Data Mining

      • Examine historical data to determine how the data fit into 4 classes
      • Apply the model to each new purchase
Descriptive Model

Identifies patterns or relationships in data

Example: Later

Two Important Terms
  • Supervised Learning
    • Training Data Set
    • Model is told to which class each training example belongs
    • Learning by example
    • Similar to Discriminant Analysis in Statistics
  • Unsupervised Learning
    • Class label of each training example is not known
    • No. of classes also may not be known
    • Learning by observation
    • Example: clustering
Examples of Discovered Patterns
  • Association rules
    • 98% of people who purchase diapers also buy beer
  • Classification
    • People with age less than 25 and salary > 40k drive sports cars
  • Similar time sequences
    • Stocks of companies A and B perform similarly
  • Outlier Detection
    • Residential customers for telecom company with businesses at home
Association Rules & Frequent Itemsets
  • Market-Basket Analysis
  • Grocery Store: Large no. of ITEMS
  • Customers fill their market baskets with subset of items
  • 98% of people who purchase diapers also buy beer
  • Used for shelf management
  • Used for deciding whether an item should be put on sale
  • Other interesting applications
    • Basket=documents, Items=words

Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.

    • Basket=sentences, Items=documents

Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.
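
The "basket = documents, items = words" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the three toy documents and the function name `frequent_word_pairs` are hypothetical, and words are split on whitespace with no further cleaning.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy documents; each document plays the role of one basket.
documents = [
    "data mining finds patterns in data",
    "data mining and machine learning",
    "machine learning finds patterns",
]

def frequent_word_pairs(docs, min_count):
    """Return word pairs that co-occur in at least min_count documents."""
    pair_counts = Counter()
    for doc in docs:
        # Items are the distinct words of the basket, sorted so that each
        # pair is always counted under the same (word1, word2) key.
        words = sorted(set(doc.split()))
        pair_counts.update(combinations(words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

pairs = frequent_word_pairs(documents, min_count=2)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Pairs such as ("data", "mining") that recur across documents are the candidates for phrases or linked concepts mentioned above.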

Classification
  • Customer’s name, age, income_level and credit_rating known
  • Training Set
  • Use classification algorithm to come up with classification rules
  • If age between 31 & 40 and income_level = ‘High’, then credit_rating = ‘Excellent’
  • New Data (customer): Sachin, age=31, income_level=‘High’ implies credit_rating = ‘Excellent’


  • Classifier Accuracy?
  • Hold-out, k-fold cross validation
  • Prediction vs Classification
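
The single rule quoted above, and the hold-out idea for judging it, can be sketched as follows. Assumptions: "between 31 & 40" is read as inclusive, the function name `classify` is illustrative, and the hold-out records are hypothetical data invented only to show how accuracy would be computed.

```python
def classify(age, income_level):
    """Apply the one rule from the slide; return None when no rule fires."""
    if 31 <= age <= 40 and income_level == "High":
        return "Excellent"
    return None  # a real rule set would cover the remaining cases

# The new customer from the slide.
print(classify(age=31, income_level="High"))  # Excellent

# Hypothetical hold-out records: (age, income_level, true credit_rating).
holdout = [
    (35, "High", "Excellent"),
    (38, "High", "Excellent"),
    (33, "High", "Fair"),
    (52, "Low", "Fair"),
]

# Hold-out accuracy, measured over the records the rule actually labels.
covered = [(a, i, y) for a, i, y in holdout if classify(a, i) is not None]
correct = sum(1 for a, i, y in covered if classify(a, i) == y)
accuracy = correct / len(covered)
print(accuracy)
```

In k-fold cross validation the same measurement would be repeated k times, each time holding out a different fold of the labelled data.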
Clustering
  • Given points in some space, often a high-dimensional space
  • Group the points into a small number of clusters
  • Each cluster consisting of points that are “near” in some sense
  • Points in the same cluster are “similar” and are “dissimilar” to points in other clusters
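
One common way to group points this way is k-means. The slides do not name a specific clustering algorithm, so the sketch below is an assumed illustration: "near" is taken to mean small Euclidean distance, and the points and starting centroids are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid closest to p (Euclidean distance).
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; an empty cluster
        # keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[k]
            for k, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
print(centroids)
```

The number of clusters is fixed in advance here by the number of starting centroids, which is a limitation when the number of classes is not known.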
Clustering: Examples
  • Cholera outbreak in London
  • Skycat clustered 2x10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum.
  • The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe
Association Rules
  • Purchasing of one product when another product is purchased represents an AR
  • Used mainly in retail stores to
      • Assist in marketing
      • Shelf management
      • Inventory control
  • Faults in Telecommunication Networks
  • Transaction Database
  • Item-sets, Frequent or large item-sets
  • Support & Confidence of AR
Association Rules
  • A rule must have some minimum user-specified confidence

1 & 2 => 3 has 90% confidence if, in 90% of the cases where a customer bought 1 and 2, the customer also bought 3.

  • A rule must have some minimum user-specified support

1 & 2 => 3 should hold in some minimum percentage of transactions to have business value

  • AR X => Y holds with confidence T, if T% of transactions in DB that support X also support Y
Types of Association Rules
  • Boolean/Quantitative ARs

Based on type of values handled

Bread => Butter

age(X, “30…39”) & income(X, “42K…48K”) => buys(X, Projection TV)

  • Single/Multi-Dimensional ARs

Based on dimensions of data involved

buys(X, Bread) => buys(X, Butter)

  • Single/Multi-Level ARs

Based on levels of abstraction involved

age(X, “30…39”) => buys(X, laptop)

age(X, “30…39”) => buys(X, computer)

  • Transaction Database
  • For minimum support = 50%, minimum confidence = 50%, we have the following rules

1 => 3 with 50% support and 66% confidence

3 => 1 with 50% support and 100% confidence

Support & Confidence

I=Set of all items

D=Transaction Database

AR A=>B has support s if s is the percentage of transactions in D that contain A U B (that is, every item of both A and B)

s(A=>B) = P(A U B)

AR A=>B has confidence c in D if c is the percentage of transactions in D containing A that also contain B

c(A=>B) = P(B|A) = P(A U B)/P(A)
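
These two definitions translate directly into code. The sketch below is illustrative: the transaction table from the slides did not survive, so the database here is a hypothetical one, chosen so that it reproduces the figures quoted earlier for the rules 1 => 3 and 3 => 1. The function names are likewise assumptions.

```python
# Hypothetical transaction database (not the slide's original table).
transactions = [
    {1, 2, 3},
    {1, 3},
    {1, 4},
    {2, 5, 6},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """support(A U B) / support(A) for the rule A => B."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

print(support({1, 3}, transactions))       # 0.5
print(confidence({1}, {3}, transactions))  # about 0.667
print(confidence({3}, {1}, transactions))  # 1.0
```

Support is symmetric in A and B, while confidence is not, which is why 1 => 3 and 3 => 1 share a support value but differ in confidence.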

Mining Association Rules

2 Step Process

  • Find all frequent itemsets, i.e., all itemsets satisfying min_sup
  • Generate strong ARs from the frequent itemsets, i.e., ARs satisfying min_sup & min_conf
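
The two steps can be sketched end to end on a small example. This is a brute-force illustration, not an efficient algorithm: step 1 simply enumerates every possible itemset. The database and the function names `frequent_itemsets` and `strong_rules` are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Step 1 (brute force): every itemset with support >= min_sup."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            sup = sum(1 for t in db if set(candidate) <= t) / len(db)
            if sup >= min_sup:
                result[frozenset(candidate)] = sup
    return result

def strong_rules(frequent, min_conf):
    """Step 2: split each frequent itemset into A => B, keep confident rules."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the table.
                conf = sup / frequent[a]
                if conf >= min_conf:
                    rules.append((set(a), set(itemset - a), sup, conf))
    return rules

db = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]  # hypothetical database
for a, b, sup, conf in strong_rules(frequent_itemsets(db, 0.5), 0.5):
    print(a, "=>", b, f"support={sup:.2f} confidence={conf:.2f}")
```

Enumerating all itemsets grows exponentially with the number of items, which is why step 1 is where practical algorithms concentrate their effort.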
Frequent Itemsets (FIs)

Algorithms for finding FIs

  • Apriori
  • Sampling
  • Partitioning
Apriori Algorithm (Boolean ARs)

Candidate Generation

  • Level-wise search

Frequent 1-itemset (L1) is found

Frequent 2-itemset (L2) is found & so on…

Until no more Frequent k-itemsets (Lk) can be found

Finding each Lk requires one pass

  • Apriori Property

“All nonempty subsets of a FI must also be frequent”

P(I) < min_sup => P(I U A) < min_sup, where A is any item

“Any subset of a FI must be frequent”

  • Anti-Monotone Property

“If a set cannot pass a test, all its supersets will fail the test as well”

Property is monotonic in the context of failing a test
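
The level-wise search and the Apriori property can be sketched as below. This is a simplified illustration rather than a tuned implementation: the join step forms candidates from all pairs of frequent (k-1)-itemsets, support is counted by scanning the database once per level, and the database and function name are hypothetical.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise search for frequent itemsets; returns {itemset: support}."""
    n = len(db)

    def count_frequent(candidates):
        # Count each candidate against the database and keep the frequent ones.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_sup}

    # L1: frequent 1-itemsets.
    level = count_frequent({frozenset([i]) for t in db for i in t})
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        prev = list(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): a candidate with any infrequent
        # (k-1)-subset cannot be frequent, so it is dropped before counting.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = count_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent

db = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]  # hypothetical database
result = apriori(db, 0.5)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The loop stops as soon as a level produces no frequent itemsets, matching the "until no more frequent k-itemsets can be found" condition above.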