
Data Mining: Current Status and Directions


Presentation Transcript

What is Data Mining?

  • Data mining (also called knowledge discovery in databases)

    • Extraction of interesting (non-trivial,implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories

  • The goal is to understand and use data, to make data itself something of value and strategic importance


Data is everywhere!

  • Relational databases—A commodity of every enterprise

  • POS (Point of Sales): Transactional DBs are often terabytes in size

  • Legacy databases

  • Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases

  • Time-series data (e.g., stock trading) and temporal data

  • Text (documents, emails) and multimedia databases

  • WWW: A huge, hyper-linked, dynamic, global information system


The Potential for Data Mining Is Everywhere, Too!

  • Knowledge to be mined

    • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

  • Techniques utilized

    • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc.

  • Applications adapted

    • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.


Data Mining: A Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning (AI), Visualization, Information Science, and Other Disciplines]


Multi-Dimensional Data Analysis

  • Data warehousing: integration from heterogeneous or semi-structured databases

  • Multi-dimensional modeling of data: star & snowflake schemas (in Relational DBMS)

  • Efficient and scalable computation of data cubes or iceberg cubes (in MDDB)

  • OLAP (on-line analytical processing): drilling, dicing, slicing, etc.

  • Discovery-driven (data driven) exploration of data cubes
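The OLAP operations named above can be illustrated on a tiny in-memory fact table. This is only a sketch: the `FACTS` rows, the dimension names, and the helper functions `roll_up`, `slice_`, and `dice` are hypothetical and stand in for what an OLAP engine would do over a real data cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales)
FACTS = [
    ("East", "Computer", "Q1", 100),
    ("East", "Software", "Q1", 40),
    ("East", "Computer", "Q2", 120),
    ("West", "Computer", "Q1", 80),
    ("West", "Software", "Q2", 60),
]
DIMS = ("region", "product", "quarter")

def roll_up(facts, keep):
    """Aggregate sales over every dimension not named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]

def dice(facts, **allowed):
    """Dice: keep rows whose value on each named dimension is in the allowed set."""
    return [
        row for row in facts
        if all(row[DIMS.index(d)] in vals for d, vals in allowed.items())
    ]

by_region = roll_up(FACTS, ["region"])
q1_by_product = roll_up(slice_(FACTS, "quarter", "Q1"), ["product"])
east_computers = dice(FACTS, region={"East"}, product={"Computer"})
```

Drilling down is the reverse of `roll_up`: calling it with more dimensions in `keep` gives a finer-grained view of the same facts.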




Start with standard normalized relational database tables.

Creating Multi-dimensional data warehouses


Data Warehouse ‘Star’ Schema

In order to reduce the number of joins that must be performed, data is reformatted into ‘fact’ tables. Fact tables typically consist of many foreign keys, one per dimension table, alongside the measures being analyzed.
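A minimal star schema can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_store`, `dim_date`), columns, and sample rows below are assumptions chosen for illustration; they are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product, store, or date
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, quarter TEXT);

-- Fact table: foreign keys into each dimension plus one measure
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'Computer'), (2, 'Software');
INSERT INTO dim_store   VALUES (1, 'Boston'), (2, 'Denver');
INSERT INTO dim_date    VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales  VALUES (1, 1, 1, 900.0), (2, 1, 1, 100.0),
                               (1, 2, 2, 950.0), (2, 2, 2, 150.0);
""")

# Each dimension is exactly one join away from the fact table.
rows = conn.execute("""
    SELECT p.name, d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id    = f.date_id
    GROUP BY p.name, d.quarter
    ORDER BY p.name, d.quarter
""").fetchall()
```

Because every dimension hangs directly off the fact table, a query never needs more than one join per dimension it touches.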


Data Warehouse ‘Snowflake’ Schema

Very similar to the star schema. Can you tell what this schema lets us see that the star schema did not?


Making optimal use of storage space

  • Many cuboids can be materialized by analyzing another cuboid as opposed to the entire data set

Example: Consider analyzing sales based on the dimensions of Route, Source, and Time. The number of rows in each view is given in millions (M).

  • Route, Source, Time: 6 M

  • Route, Time: 6 M

  • Route, Source: 0.8 M

  • Source, Time: 6 M

  • Time: 0.1 M

  • Route: 0.2 M

  • Source: 0.01 M

  • None (all dimensions aggregated away)

Materialization of all views would require roughly 19.1 million rows.
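The arithmetic behind this example can be checked with a short script. The row counts come from the example above; the rule used to pick views to skip (a view that is no smaller than its smallest ancestor gains nothing from being stored) is a simplifying assumption for illustration, not necessarily the selection method the slides had in mind. The "None" view is left out because no size is given for it.

```python
# Row counts in millions, taken from the example above.
SIZES = {
    ("Route", "Source", "Time"): 6.0,
    ("Route", "Time"): 6.0,
    ("Route", "Source"): 0.8,
    ("Source", "Time"): 6.0,
    ("Time",): 0.1,
    ("Route",): 0.2,
    ("Source",): 0.01,
}
BASE = ("Route", "Source", "Time")

def ancestors(view):
    """Views whose dimensions strictly contain those of `view`."""
    return [v for v in SIZES if set(view) < set(v)]

# Storing every view costs the sum of all sizes.
total_rows = sum(SIZES.values())

# Assumed rule: a view as large as its smallest ancestor can be
# answered from that ancestor at the same cost, so skip storing it.
redundant = [
    v for v in SIZES
    if v != BASE and SIZES[v] >= min(SIZES[a] for a in ancestors(v))
]
saved_rows = sum(SIZES[v] for v in redundant)
```

With these numbers, `redundant` holds the two 6 M views (Route, Time) and (Source, Time), since each can be computed from the 6 M base view at no extra scanning cost.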


Dependent Cuboids

Selective materialization in this case can reduce the number of stored rows by 12 Million

Assume that ‘Part’ can be further partitioned into ‘Size’ and ‘Color’, and that ‘Customer’ can be partitioned into ‘Individual’, ‘State’, and ‘Country’.

  • Part, Supplier, Customer: 6 M

  • Part, Supplier: 0.8 M

  • Supplier, Customer: 6 M

  • Part, Customer: 6 M

  • Part (Color), Customer (State)

  • Part (Size), Customer (State)

  • Part (Color), Customer (Country)

  • Part (Size), Customer (Country)

  • Part (Color), Customer (Individual)

  • Part (Size), Customer (Individual)

  • Part

  • Customer


Association and Frequent Pattern Analysis

  • Objective is to find patterns in the tendency of items to be found together.

  • A typical 2-item association rule output will generally look something like this:

  • Computer → Software (7%, 72%)

  • This is telling you that 7% (a.k.a. the support level) of your sales transactions involved computers AND software, and that 72% (a.k.a. the confidence level) of all computer sales involved the sale of software.
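The two percentages attached to a rule such as Computer → Software can be computed directly from transaction data: support is the share of all transactions that contain every item in the rule, and confidence is the share of transactions containing the left-hand side that also contain the right-hand side. The transactions below are made up for illustration, and `support` and `confidence` are hypothetical helper names.

```python
# Hypothetical market baskets, one set of items per sales transaction.
transactions = [
    {"computer", "software"},
    {"computer", "software", "mouse pad"},
    {"computer"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
```

Here two of the five baskets hold both items, and two of the three baskets with a computer also hold software.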


Association and Frequent Pattern Analysis

  • Associations can also be found among 3, 4, or more item sets, for example:

  • (Computers, Software) → Mouse Pad (8%, 65%). This tells you that 8% of transactions involved computers, software, and mouse pads, and that 65% of transactions involving computers and software also involved the purchase of a mouse pad.


Association and Frequent Pattern Analysis

  • The problem with unguided associative analysis is that the number of associations can be enormous.

  • Consider a store like L.L. Bean trying to identify meaningful associations. The output could number in the millions.

  • In order to “filter” the output, users will frequently set parameters for confidence and support thresholds.
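The effect of such thresholds can be seen in a brute-force sketch that enumerates every candidate rule and keeps only those that clear both bars. The baskets, the `rules` function, and the threshold values are all assumptions for illustration; production tools use far more efficient algorithms (Apriori and FP-growth are well-known examples) rather than exhaustive enumeration.

```python
from itertools import combinations

# Hypothetical market baskets.
TRANSACTIONS = [
    {"computer", "software", "mouse pad"},
    {"computer", "software"},
    {"computer", "software", "mouse pad"},
    {"computer", "printer"},
    {"software"},
    {"printer", "paper"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t) / len(TRANSACTIONS)

def rules(min_support, min_confidence, max_size=3):
    """Enumerate rules with a single right-hand item that meet both thresholds."""
    items = sorted(set().union(*TRANSACTIONS))
    found = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s == 0 or s < min_support:
                continue  # never observed, or too rare to be interesting
            for consequent in combo:
                antecedent = itemset - {consequent}
                conf = s / support(antecedent)
                if conf >= min_confidence:
                    found.append((tuple(sorted(antecedent)), consequent, s, conf))
    return found

all_rules = rules(min_support=0.0, min_confidence=0.0)
strong_rules = rules(min_support=0.3, min_confidence=0.7)
```

Even on six baskets the thresholds cut the rule list roughly in half; on a real catalog the unfiltered list grows combinatorially, which is why the filter matters.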



Clustering and Outlier Analysis

  • The attributes of interest are plotted on a graph whose axes represent the dimensions of interest. Cluster analysis is frequently two-dimensional, but does not have to be.

  • The objective of the data mining algorithm is to find cluster centers that maximize the distance between cluster centers while minimizing the distance between the points in a cluster and the center of that cluster.

  • The center of the cluster typically defines the cluster (e.g. males between 30 and 35 years old with incomes between 50K and 75K) and axes are usually parametric rather than continuous
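One common way to find cluster centers is k-means (Lloyd's algorithm), sketched below in plain Python. It is offered as an example of the idea, not as the algorithm the slides refer to: k-means directly minimizes only the distance from points to their own center, and separation between centers follows as a side effect. The customer data and starting centers are invented.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centers[i]  # keep an empty cluster's center in place
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customers as (age, income in $K).
points = [(31, 55), (33, 60), (34, 70), (52, 120), (55, 130), (58, 125)]
centers, clusters = kmeans(points, centers=[(30, 50), (60, 130)])
```

Each resulting center can then be read as a description of its cluster, for example customers in their early thirties with incomes around 60K.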


Clustering Analysis

  • Can include user-specified constraints (e.g., no cluster has fewer than 1,000 customers)


Sequential Patterns and Time-Series Analysis

  • Trend analysis

    • Trend movement vs. cyclic variations, seasonal variations and random fluctuations

  • Similarity search in time-series database

    • Handling gaps, scaling, etc.

    • Indexing methods and query languages for time-series

  • Sequential pattern mining

    • Various kinds of sequences, various methods

  • Periodicity analysis

    • Full periodicity, partial periodicity, cyclic association rules
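Separating trend movement from seasonal variation can be sketched with a centered moving average: averaging over exactly one full season cancels the seasonal swing and leaves the trend. The series below is synthetic (a straight-line trend plus a repeating three-step pattern), and `moving_average` is a hypothetical helper, so this shows the principle rather than a full decomposition method.

```python
def moving_average(series, window):
    """Centered moving average; `window` should be odd."""
    half = window // 2
    return [
        sum(series[i - half:i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]

# Synthetic series: trend t plus a seasonal pattern of period 3.
SEASONAL = [2, 0, -2]
series = [t + SEASONAL[t % 3] for t in range(12)]

# A window equal to the period averages the seasonal pattern away.
trend = moving_average(series, window=3)

# What remains after removing the trend is the seasonal component.
seasonal = [series[i + 1] - trend[i] for i in range(len(trend))]
```

With real data a random component would remain as well, and the season length would have to be known or estimated first.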


Data Mining Industry and Applications

  • Industry has grown rapidly over the past few years

    • From research prototypes to data mining products, languages, and standards

    • IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, etc.

    • A few data mining languages and standards (esp. MS OLEDB for Data Mining).

  • Application achievements in many domains

    • Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.


The Data Mining Industry

  • Data mining is growing rapidly

    • R & D has seen huge increases

    • Applications have been broadened substantially

  • But not as rapidly as some may have hoped. Why not?

    • Value is easy to objectively measure

      • It is difficult to sell on hype alone, although vendors try!

    • Not off-the-shelf in nature

      • Need training, understanding, and customization

      • Definite learning curve associated with effective use

      • Benefit of effective use not seen immediately


Trends in Data Mining

  • Web mining (and incorporating data from outside the organization into the analysis of internal data)

  • Towards integrated data mining environments and tools

    • “Vertical” (or application-specific) data mining

    • Invisible data mining

  • Towards intelligent, efficient, and scalable data mining methods


Web Mining: A Rapidly Expanding Area in Data Mining

  • Mine what the Web search engine finds

  • Automatic classification of Web documents

  • Discovery of authoritative Web pages, Web structures and Web communities

  • Meta-Web Warehousing: Web yellow page service

  • Web usage mining


Mining the Results of Web Search Engine Finds

  • Current Web search engines:

    • keyword-based; return too many answers, often of low quality; still miss a lot; not customized; etc.

  • Data mining will help:

    • coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies

    • better search primitives: user preferences/hints

    • linkage analysis: authoritative pages and clusters

    • customization: home page + Weblog + user profiles

    • Identification of “hub” pages
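Linkage analysis for "hub" and authoritative pages is often illustrated with the HITS algorithm, sketched here as one possible approach rather than the method the slides necessarily intend. A page earns authority from the hubs that link to it and hub score from the authorities it links to. The link graph and page names are invented.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to.
    Returns (hub scores, authority scores), each normalized to unit length."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score of p: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph.
links = {
    "directory": ["vendor_a", "vendor_b", "vendor_c"],
    "blog": ["vendor_a", "vendor_b"],
    "vendor_a": [],
    "vendor_b": [],
    "vendor_c": ["vendor_a"],
}
hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_auth = max(auth, key=auth.get)
```

In this toy graph the directory page, which links to every vendor, comes out as the strongest hub, and the vendor with the most inbound links comes out as the strongest authority.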


A Layered Meta-Web Architecture

[Figure: stacked layers, from top to bottom]

  • Layer n: More Generalized Descriptions

  • ...

  • Layer 1: Generalized Descriptions

  • Layer 0


Importance of Constructing Multi-Layer Meta Web

  • Benefits of Multi-Layer Meta-Web:

    • Multi-dimensional Web info summary analysis

    • Approximate and intelligent query answering

    • Web high-level query answering (WebSQL, WebML)

    • Web content and structure mining

    • Observing the dynamics/evolution of the Web

  • Is it realistic to construct such a meta-Web?

    • It is beneficial even if it is only partially constructed

    • The benefit may justify the cost of tool development, standardization, and partial restructuring


Web Usage (Click-Stream) Mining

  • Web logs provide rich information about Web dynamics

  • Multidimensional Web-log analysis:

    • disclose potential customers, users, markets, etc.

  • Plan mining (mining general Web accessing regularities):

    • Web linkage adjustment, performance improvements

  • Trend analysis:

    • Dynamics of the Web: what has been changing?

  • Customized to individual users
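A first step in click-stream mining is to group log records into per-visitor sessions and count which page-to-page moves occur most often. The sketch below assumes a simplified log of (session id, timestamp, page) records; real Web server logs need parsing and session reconstruction first, and the names `LOG`, `sessions`, and `transition_counts` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (session id, timestamp, page)
LOG = [
    ("s1", 1, "/home"), ("s1", 2, "/products"), ("s1", 3, "/cart"),
    ("s2", 1, "/home"), ("s2", 2, "/products"),
    ("s3", 1, "/home"), ("s3", 2, "/search"),
    ("s3", 3, "/products"), ("s3", 4, "/cart"),
]

def sessions(log):
    """Group pages by session, ordered by timestamp within each session."""
    paths = defaultdict(list)
    for session_id, _, page in sorted(log):
        paths[session_id].append(page)
    return dict(paths)

def transition_counts(paths):
    """Count how often each page is followed directly by another page."""
    counts = Counter()
    for pages in paths.values():
        counts.update(zip(pages, pages[1:]))
    return counts

paths = sessions(LOG)
transitions = transition_counts(paths)
most_common_move = transitions.most_common(1)[0][0]
```

Frequent transitions like these are the raw material for the linkage adjustments and performance improvements mentioned above, such as adding a direct link along a path many visitors follow.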


Intelligent Tools for Data Mining

  • Integration of users and mining algorithms paves the way to intelligent mining

  • Smart interface brings intelligence

    • Easy to use, understand and manipulate

  • One picture may be worth 1,000 words

    • Visual and audio data mining

  • Towards self-tuning, self-managing, self-triggering data mining

