DATA MINING
Prof. Navneet Goyal, BITS, Pilani

Evolution of Database Technology

1960s & earlier:
Data Collection & Database Creation
Primitive File Processing

1970s – early 1980s: DBMSs
Hierarchical & Network DBS, RDBMS
Data Modeling Tools (ER Model)
Indexing Techniques
Query Languages: SQL
User Interfaces: Forms & Reports
Query Processing & Optimization
Transaction Management: Concurrency & Recovery

Advanced Data Models
Data Warehousing & Data Mining
DW & OLAP Technology
DM & KDD

1990s – present:
XML-based Databases
New Generation of Integrated Information Systems
Tsunami of Data (Richard Saul Wurman, Information Architects)
30 TB/night: Large Synoptic Survey Telescope (LSST) (2014)
15 PB/year: CERN’s LHC (May 2008)
1 PB over 3 years: EOS (Earth Observing System) data (2001)
2 km, 100 levels, hourly data
~4 TB / simulated hour
~100 TB / simulated day
~35 PB / simulated year
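The per-hour figure above scales to the daily and yearly totals quoted on the slide; a quick sanity check (assuming decimal units, i.e. 1 PB = 1000 TB):

```python
# Back-of-the-envelope check of the simulation data volumes quoted above.
TB_PER_HOUR = 4  # ~4 TB per simulated hour (from the slide)

tb_per_day = TB_PER_HOUR * 24           # 96 TB, i.e. "~100 TB / simulated day"
pb_per_year = tb_per_day * 365 / 1000   # ~35 PB, i.e. "~35 PB / simulated year"

print(f"{tb_per_day} TB/day, {pb_per_year:.0f} PB/year")
```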
Motivation
One person having both these perspectives: very unlikely!
Domain experts should know what is possible with Data Mining
Data miners seek problems from domain experts
Modeling perspective: requires involvement of both data miners & domain experts
Overuse of data to draw invalid inferences
“telling people they have ESP causes them to lose it”
Data mining is ready for application in the business & scientific community because it is supported by three technologies that are now sufficiently mature: massive data collection, powerful multiprocessor computers, and data mining algorithms.
Some examples of “successes”:
1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. “Diapers and beer.” Observation that customers who buy diapers are more likely than average to buy beer allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.
4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotypes of people with/without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining has become much more important now that the human genome has been fully decoded.
Several different communities have laid claim to DM:
1. Statistics.
2. AI, where it is called “machine learning."
3. Researchers in clustering algorithms.
4. Visualization researchers.
5. Databases. We'll be taking this approach, of course, concentrating on the challenges that appear when the data is large and the computations complex. In a sense, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.
1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
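Stage 2 above (cleansing) can be sketched as a simple range filter; the record fields and the valid temperature range below are illustrative assumptions, not from the source:

```python
# Minimal sketch of a data-cleansing pass: drop records whose values fall
# outside a plausible range (e.g., a patient fever of 125 is bogus data).
def clean(records, field, lo, hi):
    """Keep only records whose `field` lies in the closed range [lo, hi]."""
    return [r for r in records if lo <= r[field] <= hi]

# Hypothetical patient records.
patients = [
    {"id": 1, "fever": 98.6},
    {"id": 2, "fever": 125.0},   # impossible temperature -> removed
    {"id": 3, "fever": 101.2},
]
valid = clean(patients, "fever", 90.0, 110.0)
print([r["id"] for r in valid])  # -> [1, 3]
```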
Two functions of Data Mining
Identifies patterns or relationships in data
Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.
Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.
1 & 2 ⇒ 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3.
1 & 2 ⇒ 3 should also hold in some minimum percentage of transactions (minimum support) to have business value.
Based on type of values handled
age(X, “30…39”) & income(X, “42K…48K”) ⇒ buys(X, “projection TV”)
Based on dimensions of data involved
Based on levels of abstraction involved
age(X, “30…39”) ⇒ buys(X, laptop)
age(X, “30…39”) ⇒ buys(X, computer)
1 => 3 with 50% support and 66% confidence
3 => 1 with 50% support and 100% confidence
I = set of all items
An association rule A ⇒ B has support s if s is the percentage of transactions in D that contain A ∪ B.
An association rule A ⇒ B has confidence c in D if c is the percentage of transactions in D containing A that also contain B.
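Under these definitions, support and confidence can be computed directly. The four-transaction database below is a hypothetical example, chosen so that 1 ⇒ 3 has 50% support and ~66% confidence while 3 ⇒ 1 has 100% confidence, matching the figures quoted earlier:

```python
# Support and confidence of an association rule A => B over transactions D.
def support(D, itemset):
    """Fraction of transactions in D that contain every item in `itemset`."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(D, A, B):
    """Fraction of transactions containing A that also contain B."""
    return support(D, A | B) / support(D, A)

# Hypothetical transaction database (not from the source).
D = [{1, 3}, {1, 3}, {1, 2}, {2}]

print(support(D, {1, 3}))       # 0.5    -> 50% support for 1 => 3 and 3 => 1
print(confidence(D, {1}, {3}))  # 0.666… -> ~66% confidence for 1 => 3
print(confidence(D, {3}, {1}))  # 1.0    -> 100% confidence for 3 => 1
```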
A 2-step process:
1. Find all frequent itemsets (those meeting minimum support)
2. Generate strong association rules from the frequent itemsets
Algorithms for finding FIs
Frequent 1-itemset (L1) is found
Frequent 2-itemset (L2) is found & so on…
Until no more Frequent k-itemsets (Lk) can be found
Finding each Lk requires one pass
“All nonempty subsets of a FI must also be frequent”
If P(I) < min_sup, then P(I ∪ A) < min_sup, where A is any item
“Any subset of a FI must be frequent”
“If a set cannot pass a test, all its supersets will fail the test as well”
The property is called antimonotone because it is monotonic in the context of failing a test
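The level-wise search (L1, L2, …) and the antimonotone pruning property can be combined into a minimal Apriori sketch. The transaction database and the fractional min_sup threshold below are illustrative assumptions:

```python
from itertools import combinations

def apriori(D, min_sup):
    """Level-wise frequent-itemset mining with subset-based candidate pruning."""
    n = len(D)

    def is_frequent(c):
        # Support of candidate c as a fraction of all transactions.
        return sum(c <= t for t in D) / n >= min_sup

    # L1: frequent 1-itemsets.
    items = sorted({i for t in D for i in t})
    Lk = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(Lk)
    k = 2
    while Lk:
        prev = set(Lk)  # L(k-1)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): drop any candidate with an
        # infrequent (k-1)-subset -- all its supersets would fail anyway.
        candidates = [c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))]
        Lk = [c for c in candidates if is_frequent(c)]  # one pass per level
        frequent.extend(Lk)
        k += 1
    return frequent

# Hypothetical transaction database (not from the source).
D = [{1, 2, 3}, {1, 3}, {1, 2}, {2, 3}]
result = apriori(D, min_sup=0.5)
```

With this data every 1- and 2-itemset is frequent, but {1, 2, 3} appears in only one of four transactions and is rejected at level 3, ending the search.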