DATA MINING
Prof. Navneet Goyal
BITS, Pilani
1960s & Earlier
Data Collection & Database Creation
Primitive File Processing
1970s-early 1980s
DBMSs
Hierarchical & Network DBS
RDBMS
Data Modeling Tools (ER Model)
Indexing Techniques
Query languages: SQL
User Interfaces: Forms & Reports
Query Processing & Optimization
Transaction Management: Concurrency & Recovery
Advanced Data Models
Data Warehousing &
DW & OLAP Technology
DM & KDD
1990s – present
XML based Databases
New Generation of Integrated Information Systems
“There is a tsunami of data that is crashing onto the beaches of the civilized world. This is a tidal wave of unrelated, growing data formed in bits and bytes, coming in an unorganized, uncontrolled, incoherent cacophony of foam. It's filled with flotsam and jetsam. It's filled with the sticks and bones and shells of inanimate and animate life. None of it is easily related, none of it comes with any organizational methodology. ...The tsunami is a wall of data -- data produced at greater and greater speed, greater and greater amounts to store in memory, amounts that double, it seems, with each sunset. On tape, on disks, on paper, sent by streams of light. Faster and faster, more and more and more.”
Richard Saul Wurman, Information Architects
In 2005, mankind created 150 exabytes of data
In 2010, it will create 1200 exabytes*
* 2008 study by International Data Corp. (IDC)
Global Cloud Resolving Model (GCRM) @CSU
30 TB/night: Large Synoptic Survey Telescope (LSST) (2014)
15 PB/year: CERN’s LHC (May 2008)
1 PB over 3 years: EOS (Earth Observing System) data (2001)
2 km, 100 levels, hourly data
~4 TB / simulated hour
~100 TB / simulated day
~35 PB / simulated year
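The three GCRM volumes above are consistent with one another, as a quick check shows. This is only a sketch: it assumes decimal units (1 PB = 1000 TB) and a 365-day simulated year, neither of which is stated on the slide.

```python
# Scale the per-hour GCRM output up to a simulated day and year.
# Assumptions (not stated on the slide): decimal units, 365-day year.
TB_PER_SIM_HOUR = 4

tb_per_sim_day = TB_PER_SIM_HOUR * 24
pb_per_sim_year = tb_per_sim_day * 365 / 1000

print(f"{tb_per_sim_day} TB per simulated day")        # roughly the ~100 TB quoted
print(f"{pb_per_sim_year:.1f} PB per simulated year")  # roughly the ~35 PB quoted
```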
Telecom data (4.6 bn mobile subscribers)
There are 3 billion telephone calls in the US each day, 30 billion emails daily, and 1 billion SMS and IMs.
IP network traffic: up to 1 billion packets per hour per router. Each ISP has many (hundreds of) routers!
Weblog data (160 mn websites)
No. of pics on Facebook
15 bn unique photos
60 bn photos stored (4 sizes)
Imageshack (20 bn)
Photobucket (7.2 bn)
Flickr (3.4 bn)
Multiply (3 bn)
The Data Deluge
25th Feb. 2010, The Economist
The Data Singularity is here!
08th Mar. 2010, Dataspora Blog
The Data Singularity Part II: Human-sizing big data
27th May 2010, Dataspora Blog
My definition of Data Mining
“Data Mining is a family of techniques that transforms raw data into actionable information/knowledge”
Data Mining has two perspectives:
One person having both these perspectives: Very unlikely!!
Domain experts should know what is possible with Data Mining
Data miners seek problems from domain experts
Modeling perspective: requires involvement of both data mining & domain experts
Intrusion Detection Systems
Spam mail filtering
Predicting crop yield
Credit card abuse
Overuse of data to draw invalid inferences
“telling people they have ESP causes them to lose it”
Data mining is ready for application in the business & scientific community because it is supported by three technologies that are now sufficiently mature:
Some examples of “successes”:
1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. “Diapers and beer.” The observation that customers who buy diapers are more likely than average to buy beer allowed supermarkets to place beer and diapers near each other, knowing that many customers would walk between them. Placing potato chips between the two increased sales of all three items.
4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotype of people with/without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining has become much more important now that the human genome has been fully decoded.
Several different communities have laid claim to DM
2. AI, where it is called “machine learning.”
3. Researchers in clustering algorithms.
4. Visualization researchers.
5. Databases. We'll be taking this approach, of course, concentrating on the challenges that appear when the data is large and the computations complex. In a sense, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.
1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
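Step 2 can be illustrated with a minimal cleansing sketch. The plausibility bounds (read here as degrees Fahrenheit), the record layout, and the function name are all assumptions made for illustration; in practice the bounds would come from domain experts.

```python
# Hypothetical plausibility bounds for body temperature, in degrees Fahrenheit.
MIN_TEMP_F = 90.0
MAX_TEMP_F = 110.0

def split_by_plausibility(records):
    """Separate records with a plausible temperature from suspect ones."""
    clean, suspect = [], []
    for record in records:
        if MIN_TEMP_F <= record["fever"] <= MAX_TEMP_F:
            clean.append(record)
        else:
            suspect.append(record)
    return clean, suspect

# Made-up sample records.
patients = [
    {"id": 1, "fever": 98.6},
    {"id": 2, "fever": 125.0},  # the bogus value from the cleansing example
    {"id": 3, "fever": 101.2},
]
clean, suspect = split_by_plausibility(patients)
print([r["id"] for r in clean])    # records kept for mining
print([r["id"] for r in suspect])  # records set aside for review
```

Setting suspect records aside instead of deleting them leaves the final call to a person, in the spirit of step 6.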
Two functions of Data Mining
Identifies patterns or relationships in data
Examples of Discovered Patterns
Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.
Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.
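The second pattern can be sketched as a set-overlap computation. The slide does not name a similarity measure; Jaccard similarity over sentence sets is used here as one plausible choice, and the naive sentence splitter and sample texts are illustrative assumptions.

```python
import re

def sentences(text):
    """Split text into a set of normalized sentences (naive splitter)."""
    parts = re.split(r"[.!?]+", text)
    return {" ".join(p.lower().split()) for p in parts if p.strip()}

def sentence_overlap(doc_a, doc_b):
    """Jaccard similarity of the two documents' sentence sets."""
    a, b = sentences(doc_a), sentences(doc_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up documents sharing two of their three sentences.
doc1 = "Data is growing fast. Mining finds patterns. Storage is cheap."
doc2 = "Mining finds patterns. Data is growing fast. Networks are quick."
print(sentence_overlap(doc1, doc2))  # 2 shared sentences out of 4 distinct ones
```

A score near 1 would suggest a mirror or copy; what threshold counts as "many of the same sentences" is a judgment call.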
1 & 2 => 3 has 90% confidence if, in 90% of the cases where a customer bought 1 and 2, the customer also bought 3.
1 & 2 => 3 should hold in some minimum percentage of transactions to have business value
Based on type of values handled
age(X, “30…39”) & income(X, “42K…48K”) => buys(X, Projection TV)
Based on dimensions of data involved
Based on levels of Abstractions involved
age(X, “30…39”) => buys(X, laptop)
age(X, “30…39”) => buys(X, computer)
1 => 3 with 50% support and 66% confidence
3 => 1 with 50% support and 100% confidence
I = set of all items
AR A => B has support s in D if s is the %age of Txs in D that contain A ∪ B (i.e., every item of both A and B)
AR A => B has confidence c in D if c is the %age of Txs in D containing A that also contain B
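The two definitions can be sketched directly in code. The four-transaction database D below is hypothetical, chosen so that the rules 1 => 3 and 3 => 1 reproduce the support and confidence figures quoted earlier; the function names are illustrative too.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical database D (one set of item ids per transaction).
D = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

print(support(D, {1, 3}))       # support of both 1 => 3 and 3 => 1
print(confidence(D, {1}, {3}))  # confidence of 1 => 3
print(confidence(D, {3}, {1}))  # confidence of 3 => 1
```

Support is symmetric in A and B while confidence is not, which is why the two rules share 50% support but differ in confidence. The sketch assumes the antecedent occurs at least once; otherwise the division fails.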
2 Step Process
Algorithms for finding FIs
Frequent 1-itemset (L1) is found
Frequent 2-itemset (L2) is found & so on…
Until no more Frequent k-itemsets (Lk) can be found
Finding each Lk requires one pass
“All nonempty subsets of a FI must also be frequent”
P(I) < min_sup => P(I ∪ A) < min_sup, where A is any item
“Any subset of a FI must be frequent”
“If a set cannot pass a test, all its supersets will fail the test as well”
Property is monotonic in the context of failing a test
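The level-wise search and the pruning property above can be combined into a short Apriori sketch. This is an illustrative implementation, not an optimized one: min_sup is taken here as an absolute transaction count, not a percentage, and the sample database is made up.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all itemsets with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})  # L1
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = count(candidates)  # Lk
        frequent.update(level)
        k += 1
    return frequent

# Made-up database; min_sup=2 means "appears in at least 2 transactions".
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_sup=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this run the candidates {1, 2, 3} and {1, 3, 5} are discarded by the prune step without being counted, because {1, 2} and {1, 5} are not frequent.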