Data mining prof navneet goyal bits pilani

DATA MINING

Prof. Navneet Goyal

BITS, Pilani


Evolution of database technology

1960s & Earlier

Data Collection & Database Creation

Primitive File Processing

1970s-early 1980s

DBMSs

Hierarchical & Network DBS

RDBMS

Data Modeling Tools (ER Model)

Indexing Techniques

Query languages: SQL

User Interfaces: Forms & Reports

Query Processing & Optimization

Transaction Management: Concurrency & Recovery

OLTP

Mid 1980s-present

Advanced DBS

Advanced Data Models

Extended Relational

Object-oriented

Object-relational

Deductive

Application Oriented

Spatial

Temporal

Multimedia

Late 1980s-present

Data Warehousing &

Data Mining

DW & OLAP Technology

DM & KDD

1990s – present

Web-based DBS

XML based Databases

Web Mining

2000- ……….

New Generation of Integrated Information Systems

Evolution of Database Technology


Motivation

Motivation

  • Why study Data Mining?


  • Tsunami of Data


Tsunami of data

“There is a tsunami of data that is crashing onto the beaches of the civilized world. This is a tidal wave of unrelated, growing data formed in bits and bytes, coming in an unorganized, uncontrolled, incoherent cacophony of foam. It's filled with flotsam and jetsam. It's filled with the sticks and bones and shells of inanimate and animate life. None of it is easily related, none of it comes with any organizational methodology. ...The tsunami is a wall of data -- data produced at greater and greater speed, greater and greater amounts to store in memory, amounts that double, it seems, with each sunset. On tape, on disks, on paper, sent by streams of light. Faster and faster, more and more and more.”

Richard Saul Wurman, Information Architects

Tsunami of Data


Tsunami of data1

In 2005, mankind created 150 exabytes of data

In 2010, it will create 1200 exabytes*

* 2008 study by International Data Corp. (IDC)

Tsunami of Data


Tsunamis of data

Tsunamis of Data

30 TB/night: Large Synoptic Survey Telescope (LSST) (2014)

15 PB/year: CERN’s LHC (May 2008)

1 PB over 3 years: EOS (Earth Observing System) data (2001)

Global Cloud Resolving Model (GCRM) @CSU

  • 2 km, 100 levels, hourly data

    • ~4 TB / simulated hour

    • ~100 TB / simulated day

    • ~35 PB / simulated year

  • 4 km, 100 levels, hourly data

    • ~1 TB / simulated hour

    • ~24 TB / simulated day

    • ~9 PB / simulated year


Tsunami of data2

Telecom data ( 4.6 bn mobile subscribers)

There are 3 Billion Telephone Calls in US each day, 30 Billion emails daily, 1 Billion SMS, IMs.

IP Network Traffic: up to 1 Billion packets per hour per router. Each ISP has many (hundreds) routers!

Weblog data (160 mn websites)

Tsunami of Data


Tsunami of data3

No. of pics on Facebook

15 bn unique photos

60 bn photos stored (4 sizes)

Imageshack (20 bn)

Photobucket (7.2 bn)

Flickr (3.4 bn)

Multiply (3 bn)

Tsunami of Data


Recent articles

The Data Deluge

25th Feb. 2010, The Economist

The Data Singularity is here!

08th Mar. 2010, Dataspora Blog

The Data Singularity Part II: Human-sizing big data

27th May 2010, Dataspora Blog

Recent Articles


Data mining

My definition of Data Mining

“Data Mining is a family of techniques that transforms raw data into actionable information/knowledge”

Data Mining


Data mining1

Data Mining


Motivation1

Data Mining has two perspectives:

Data (algorithms)

Domain (applications)

One person having both these perspectives: Very unlikely!!

Domain experts should know what is possible with Data Mining

Data miners seek problems from domain experts

Modeling perspective: requires involvement of both data mining & domain experts

Motivation


Possibilities

Intrusion Detection Systems

Spam mail filtering

Data Recovery

Web personalization

Adaptive Websites

Information Retrieval

Data Cleaning

Possibilities


Possibilities1

Agriculture

Precision farming

Predicting crop yield

Terrorism Prevention

Retail

CRM

Fraud detection

Tax cheats

Credit card abuse

Predicting TRPs

Bioinformatics

Health Care

Civil Engineering*

Possibilities


What is not data mining

What is NOT Data Mining?

  • Originally a “statistician” term

    Overuse of data to draw invalid inferences

  • Bonferroni's theorem warns us that if there are too many possible conclusions to draw, some will be true for purely statistical reasons, with no physical validity.

  • Famous example: Joseph B. Rhine, a “parapsychologist” at Duke in the 1950s, tested students for “extrasensory perception” by asking them to guess 10 cards, each red or black. He found that about 1/1000 of them guessed all 10 and, instead of realizing that this is what you'd expect from random guessing, declared them to have ESP. When he retested them, he found they did no better than average.

His conclusion:

“telling people they have ESP causes them to lose it”
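
The arithmetic behind the Rhine example can be checked directly. The following is a minimal sketch, assuming each red/black guess is an independent 50/50 event; the figure of 1000 students is a hypothetical number used only for illustration and is not taken from the slides.

```python
from fractions import Fraction

# Probability of guessing all 10 red/black cards by pure chance,
# assuming each guess is an independent 50/50 event.
p_all_correct = Fraction(1, 2) ** 10

# Hypothetical number of students tested (not given in the slides).
n_students = 1000

# Expected number of "perfect" guessers if nobody has ESP.
expected_perfect = n_students * p_all_correct

# Probability that at least one student guesses all 10 by luck alone.
p_at_least_one = 1 - (1 - float(p_all_correct)) ** n_students

print(p_all_correct)                      # 1/1024
print(round(float(expected_perfect), 2))  # 0.98
print(round(p_at_least_one, 2))           # 0.62
```

Under these assumptions, roughly one "perfect" guesser per thousand students is exactly what chance predicts, which is the point of the Bonferroni warning.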


What is data mining

What is Data Mining?

  • Discovery of useful summaries of data - Ullman

  • Extracting or “Mining” knowledge from large amounts of data

  • The efficient discovery of previously unknown patterns in large databases

  • Technology which predicts future trends based on historical data

  • It helps businesses make proactive and knowledge-driven decisions

  • Data Mining vs. KDD

  • The name “Data Mining” a misnomer?


Data mining2

Data Mining

Data mining is ready for application in the business & scientific community because it is supported by three technologies that are now sufficiently mature:

  • Massive data collection

  • Powerful multiprocessor computers

  • Data mining algorithms


Data mining applications

Data Mining Applications

Some examples of “successes”:

1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.

2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.

3. “Diapers and beer.” The observation that customers who buy diapers are more likely than average to buy beer allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.

4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.

5. Comparison of the genotypes of people with and without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining has become much more important now that the human genome has been fully decoded.


Data mining communities

Data Mining Communities

Several different communities have laid claim to DM

1. Statistics.

2. AI, where it is called “machine learning.”

3. Researchers in clustering algorithms.

4. Visualization researchers.

5. Databases. We'll be taking this approach, of course, concentrating on the challenges that appear when the data is large and the computations complex. In a sense, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.


Data mining3

Data Mining


Stages of data mining process

Stages of Data Mining Process

1. Data gathering, e.g., data warehousing.

2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.

3. Feature extraction: obtaining only the interesting attributes of the data, e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat.

4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is where we shall concentrate our effort.

5. Visualization of the data.

6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
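
The data cleansing stage (stage 2) can be sketched as a simple range check. This is only an illustration: the record layout, the field names, and the plausible temperature range of 90 to 110 degrees Fahrenheit are all assumptions, not part of the slides.

```python
# Hypothetical patient records; field names are illustrative only.
records = [
    {"patient": "p1", "fever_f": 98.6},
    {"patient": "p2", "fever_f": 125.0},  # bogus reading, as in the slide
    {"patient": "p3", "fever_f": 101.2},
]

# Assumed plausible range for body temperature in degrees Fahrenheit.
MIN_TEMP_F, MAX_TEMP_F = 90.0, 110.0

def is_plausible(record):
    """Return True if the temperature lies inside the assumed valid range."""
    return MIN_TEMP_F <= record["fever_f"] <= MAX_TEMP_F

clean = [r for r in records if is_plausible(r)]
rejected = [r for r in records if not is_plausible(r)]

print([r["patient"] for r in clean])     # ['p1', 'p3']
print([r["patient"] for r in rejected])  # ['p2']
```

Keeping the rejected records in a separate list, rather than silently dropping them, leaves room for the judgment called for in stage 6.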


Data mining4

Data Mining

  • Many different algorithms for performing many different tasks

  • DM algorithms can be characterized as consisting of 3 parts:

    • Model

    • Preference

    • Search

  • Model could be

    • Predictive

    • Descriptive


Data mining5

Data Mining


Predictive model

Predictive Model

  • Making predictions about values of data using known results from different data

  • Example: Credit Card Company

  • Every purchase is placed in 1 of 4 classes

    • Authorize

    • Ask for further identification before authorizing

    • Do not authorize

    • Do not authorize but contact police

      Two functions of Data Mining

    • Examine historical data to determine how the data fit into 4 classes

    • Apply the model to each new purchase


Descriptive model

Descriptive Model

Identifies patterns or relationships in data

Example: Later


Two important terms

Two Important Terms

  • Supervised Learning

    • Training Data Set

    • Model is told to which class each training data belongs

    • Learning by example

    • Example CLASSIFICATION

    • Similar to Discriminant Analysis in Statistics

  • Unsupervised Learning

    • Class-label of training set is not known

    • No. of classes also may not be known

    • Learning by observation

    • Example CLUSTERING


Examples of Discovered Patterns

  • Association rules

    • 98% of people who purchase diapers also buy beer

  • Classification

    • People with age less than 25 and salary > 40k drive sports cars

  • Similar time sequences

    • Stocks of companies A and B perform similarly

  • Outlier Detection

    • Residential customers for telecom company with businesses at home


Association rules frequent itemsets

Association Rules & Frequent Itemsets

  • Market-Basket Analysis

  • Grocery Store: Large no. of ITEMS

  • Customers fill their market baskets with subset of items

  • 98% of people who purchase diapers also buy beer

  • Used for shelf management

  • Used for deciding whether an item should be put on sale

  • Other interesting applications

    • Basket=documents, Items=words

      Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.

    • Basket=sentences, Items=documents

      Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.


Classification

Classification

  • Customer’s name, age, income_level and credit_rating known

  • Training Set

  • Use classification algorithm to come up with classification rules

  • If age between 31 & 40 and income_level= ‘High’, then credit_rating = ‘Excellent’

  • New Data(customer): Sachin, age=31, income_level=‘High’ implies

    credit_rating=‘Excellent’

  • Classifier Accuracy?

  • Hold-out, k-fold cross validation

  • Prediction vs Classification
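
The rule and the accuracy estimates above can be sketched in a few lines. This is a minimal illustration, not a real classification algorithm: the rule is hard-coded from the slide, the "Unknown" fallback, the held-out rows, and the function names are all hypothetical.

```python
def classify(age, income_level):
    """Apply the single rule from the slide. 'Unknown' is an assumed
    placeholder for customers the rule does not cover."""
    if 31 <= age <= 40 and income_level == "High":
        return "Excellent"
    return "Unknown"

def accuracy(labelled_rows):
    """Fraction of (age, income_level, true_rating) rows classified correctly."""
    correct = sum(classify(age, inc) == rating for age, inc, rating in labelled_rows)
    return correct / len(labelled_rows)

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Hypothetical hold-out set, kept apart from the training data.
held_out = [
    (31, "High", "Excellent"),
    (45, "High", "Excellent"),
    (35, "Low", "Fair"),
    (38, "High", "Excellent"),
]

print(classify(31, "High"))            # Excellent (the new customer, Sachin)
print(accuracy(held_out))              # 0.5
print(list(k_fold_indices(6, 3))[0])   # ([1, 4, 2, 5], [0, 3])
```

In hold-out evaluation the accuracy is measured once on the reserved rows; in k-fold cross validation the model would be retrained on each `train` split and scored on the matching `test` split, and the k scores averaged.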


Clustering

Clustering

  • Given points in some space, often a high-dimensional space

  • Group the points into a small number of clusters

  • Each cluster consisting of points that are “near” in some sense

  • Points in the same cluster are “similar” and are “dissimilar” to points in other clusters
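
One common way to form such clusters is k-means, which the slides do not name; it is used here purely as an illustration of grouping "near" points. The sample points, the starting centroids, and the choice of Euclidean distance are assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])

print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

Here "near" means small Euclidean distance; a different distance measure would give a different notion of similarity and possibly different clusters.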


Clustering examples

Clustering: Examples

  • Cholera outbreak in London

  • Skycat clustered 2 × 10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum.

  • The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe


Association rules

Association Rules

  • Purchasing of one product when another product is purchased represents an AR

  • Used mainly in retail stores to

    • Assist in marketing

    • Shelf management

    • Inventory control

  • Faults in Telecommunication Networks

  • Transaction Database

  • Item-sets, Frequent or large item-sets

  • Support & Confidence of AR


  • Association rules1

    Association Rules

    • A rule must have some minimum user-specified confidence

      1 & 2 => 3 has 90% confidence if when a customer bought 1 and 2, in 90% of cases, the customer also bought 3.

    • A rule must have some minimum user-specified support

      1 & 2 => 3 should hold in some minimum percentage of transactions to have business value

    • AR X => Y holds with confidence T, if T% of transactions in DB that support X also support Y


    Types of association rules

    Types of Association Rules

    • Boolean/Quantitative ARs

      Based on type of values handled

      Bread => Butter

      age(X, “30….39”) & income(X, “42K…48K”) => buys(X, Projection TV)

    • Single/Multi-Dimensional ARs

      Based on dimensions of data involved

      buys(X,Bread) => buys(X,Butter)

    • Single/Multi-Level ARs

      Based on levels of Abstractions involved

      age(X, “30….39”) => buys(X, laptop)

      age(X, “30….39”) => buys(X, computer)


    Example

    Example

    • Transaction Database

    • For minimum support = 50%, minimum confidence = 50%, we have the following rules

      1 => 3 with 50% support and 66% confidence

      3 => 1 with 50% support and 100% confidence


    Support confidence

    Support & Confidence

    I=Set of all items

    D=Transaction Database

    AR A => B has support s if s is the percentage of transactions in D that contain A ∪ B (i.e., every item of A and every item of B)

    s(A => B) = P(A ∪ B)

    AR A => B has confidence c in D if c is the percentage of transactions in D containing A that also contain B

    c(A => B) = P(B | A) = P(A ∪ B) / P(A)
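
The two definitions translate directly into code. The sketch below is illustrative only: the four transactions are a hypothetical database, chosen so that the results agree with the rules quoted in the Example slide (1 => 3 and 3 => 1), and the function names are assumptions.

```python
# Hypothetical transaction database D (each transaction is a set of item ids).
transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(A ∪ B) / support(A) for the rule A => B."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

print(support({1, 3}, transactions))                   # 0.5
print(round(confidence({1}, {3}, transactions), 2))    # 0.67
print(confidence({3}, {1}, transactions))              # 1.0
```

Support is symmetric in A and B, while confidence is not, which is why 1 => 3 and 3 => 1 share a support value but differ in confidence.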


    Mining association rules

    Mining Association Rules

    2 Step Process

    • Find all frequent itemsets, i.e., all itemsets satisfying min_sup

    • Generate strong ARs from the frequent itemsets, i.e., ARs satisfying min_sup & min_conf
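
The second step can be sketched as follows, assuming the first step has already produced a table of frequent itemsets with their supports. The `supports` table and the function name `generate_rules` are hypothetical; the values were picked to be consistent with a 50% minimum support.

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """Derive strong rules A => B from frequent itemsets.

    supports maps each frequent itemset (a frozenset) to its support.
    Every subset of a frequent itemset is itself frequent, so the
    antecedent's support can always be looked up in the same table."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                conf = sup / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules

# Hypothetical output of step 1 (all itemsets already satisfy min_sup = 50%).
supports = {
    frozenset({1}): 0.75,
    frozenset({2}): 0.5,
    frozenset({3}): 0.5,
    frozenset({1, 3}): 0.5,
}

for a, b, sup, conf in generate_rules(supports, min_conf=0.5):
    print(f"{sorted(a)} => {sorted(b)} support={sup:.0%} confidence={conf:.0%}")
```

Because every candidate rule comes from an itemset that already meets min_sup, only the confidence threshold has to be checked in this step.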


    Frequent itemsets fis

    Frequent Itemsets (FIs)

    Algorithms for finding FIs

    • Apriori

    • Sampling

    • Partitioning


    Apriori algorithm boolean ars

    Apriori Algorithm (Boolean ARs)

    Candidate Generation

    • Level-wise search

      Frequent 1-itemset (L1) is found

      Frequent 2-itemset (L2) is found & so on…

      Until no more Frequent k-itemsets (Lk) can be found

      Finding each Lk requires one pass

    • Apriori Property

      “All nonempty subsets of a FI must also be frequent”

      P(I) < min_sup => P(I ∪ A) < min_sup, where A is any item

      “Any subset of a FI must be frequent”

    • Anti-Monotone Property

      “If a set cannot pass a test, all its supersets will fail the test as well”

      Property is monotonic in the context of failing a test
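
The level-wise search and the Apriori property can be put together in a compact sketch. This is a simplified illustration rather than an optimized implementation: the join step forms unions of pairs of frequent (k-1)-itemsets instead of the usual prefix-based join, and the four-transaction database is hypothetical.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup."""
    n = len(db)

    def frequent(candidates):
        # One pass over the database per level to count the candidates.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}

    level = frequent({frozenset([item]) for t in db for item in t})  # L1
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune (Apriori property): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)  # Lk
        result.update(level)
        k += 1
    return result

# Hypothetical transaction database.
db = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
result = apriori(db, min_sup=0.5)

for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
# [1] 0.75
# [2] 0.5
# [3] 0.5
# [1, 3] 0.5
```

The prune step is where the anti-monotone property pays off: a candidate with any infrequent subset is discarded without ever being counted against the database.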

