Three Challenges in Data Mining

Anne Denton

Department of Computer Science NDSU

Why Data Mining?
  • Parkinson’s Law of Data
    • Data expands to fill the space available for storage
  • Disk-storage version of Moore’s law
    • $\mathrm{Capacity} \propto 2^{t / 18\,\mathrm{months}}$
  • Available data grows exponentially!
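As a rough worked example of the doubling rule above, the following Python sketch computes the relative capacity after a given time. The function name and the default 18-month period are illustrative; it assumes exactly one doubling per period.

```python
def capacity_growth_factor(months: float, doubling_period_months: float = 18.0) -> float:
    """Relative capacity after `months`, assuming one doubling per period."""
    return 2.0 ** (months / doubling_period_months)


# Growth factors after 3, 6 and 12 years under the 18-month assumption.
for years in (3, 6, 12):
    print(years, "years:", capacity_growth_factor(12 * years))
```

Under this assumption, capacity grows by a factor of 4 in three years and 256 in twelve, which is why the number of records to mine cannot be treated as fixed.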
Outline
  • Motivation of 3 challenges
    • More records (rows)
    • More attributes (columns)
    • More subject domains
  • Some answers to the challenges
    • Thesis work
      • Generalized P-Tree structure
      • Kernel-based semi-naïve Bayes classification
    • KDD-cup 02/03 and projects with CSci 366 students
      • Data with graph relationship
      • Outlook: Data with time dependence
  • More records
    • Many stores save each transaction
    • Data warehouses keep historic data
    • Monitoring network traffic
    • Micro sensors / sensor networks
  • More attributes
    • Items in a shopping cart
    • Keywords in text
    • Properties of a protein (multi-valued categorical)
  • More subject domains
    • Data mining hype increases audience
Algorithmic Perspective
  • More records
    • Standard scaling problem
  • More attributes
    • Different algorithms needed for 1000 vs. 10 attributes
  • More subject domains
    • New techniques needed
    • Joining of separate fields
  • Algorithms should be domain-independent
    • Need for experts does not scale well
      • Twice as many data sets
      • Twice as many domain experts??
    • Ignore domain knowledge?
      • No! Formulate it systematically
Some Answers to Challenges
  • Large data quantity (Thesis)
    • Many records
      • P-Tree concept and its generalization to non-spatial data
    • Many attributes
      • Algorithm that defies curse of dimensionality
  • New techniques / Joining separate fields
    • Mining data on a graph
    • Outlook: Mining data with time dependence
Challenge 1: Many Records
  • Typical question
    • How many records satisfy given conditions on attributes?
  • Typical answer
    • In record-oriented database systems
      • Database scan: O(N)
    • Sorting / indexes?
      • Unsuitable for most problems
  • P-Trees
    • Compressed bit-column-wise storage
    • Bit-wise AND replaces database scan
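The idea of answering a counting query with bit-wise ANDs can be sketched as follows. This is only the uncompressed, bit-column part of the idea: a real P-Tree additionally compresses each bit column into a tree, which is not shown here. The function names and the sample data are made up for illustration, and each bit column is held in a plain Python integer.

```python
from typing import List


def build_bit_columns(values: List[int], n_bits: int) -> List[int]:
    """Return one integer per bit position; bit r of column b is set
    when record r has bit b set in its attribute value."""
    columns = [0] * n_bits
    for row, value in enumerate(values):
        for b in range(n_bits):
            if (value >> b) & 1:
                columns[b] |= 1 << row
    return columns


def count_equal(columns: List[int], n_rows: int, target: int) -> int:
    """Count records whose attribute equals `target` using only bit-wise operations."""
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for b, column in enumerate(columns):
        if (target >> b) & 1:
            mask &= column                # bit must be 1
        else:
            mask &= ~column & all_rows    # bit must be 0
    return bin(mask).count("1")


ages = [5, 3, 5, 7, 5, 0]                 # hypothetical 3-bit attribute
cols = build_bit_columns(ages, n_bits=3)
print(count_equal(cols, len(ages), 5))    # → 3
```

Conditions on several attributes would combine in the same way, by ANDing the masks of the individual conditions before counting.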
P-Trees: Ordering Aspect
  • Compression relies on long sequences of 0 or 1
  • Images
    • Neighboring pixels are probably similar
    • Peano-ordering
  • Other data?
    • Peano-ordering can be generalized
    • Peano-order sorting
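One way to picture Peano-order sorting of non-spatial records is to interleave the bits of the attribute values and sort by the resulting key. The sketch below assumes that "Peano order" refers to the quadrant-recursive (Z-order or Morton) ordering; that reading, the function names, and the sample records are assumptions, not details taken from the slides.

```python
from typing import List, Sequence, Tuple


def interleave_bits(coords: Sequence[int], n_bits: int) -> int:
    """Build a sort key by taking bits from most to least significant,
    cycling through the attributes at each bit position."""
    key = 0
    for b in range(n_bits - 1, -1, -1):
        for c in coords:
            key = (key << 1) | ((c >> b) & 1)
    return key


def peano_order_sort(records: List[Tuple[int, ...]], n_bits: int) -> List[Tuple[int, ...]]:
    """Sort records so that records close in attribute space tend to be adjacent."""
    return sorted(records, key=lambda r: interleave_bits(r, n_bits))


records = [(3, 0), (0, 1), (1, 1), (0, 0), (2, 3)]   # hypothetical 2-bit attributes
print(peano_order_sort(records, n_bits=2))
# → [(0, 0), (0, 1), (1, 1), (3, 0), (2, 3)]
```

After such a sort, similar records sit next to each other, so the bit columns are more likely to contain the long runs of 0 or 1 that compression relies on.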
Impact of Peano-Order Sorting
  • Speed improvement especially for large data sets
  • Less than O(N) scaling for all algorithms
So Far
  • Answer to challenge 1: Many records
    • P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan)
    • Introduced effective generalization to non-spatial data (thesis)
  • Challenge 2: Many attributes
    • Focus: Classification
    • Curse of dimensionality
    • Some algorithms suffer more than others
Curse of Dimensionality
  • Many standard classification algorithms
    • E.g., decision trees, rule-based classification
    • Each attribute splits the data into 2 halves: relevant and irrelevant
    • How often can we divide by 2 before small size of “relevant” part makes results insignificant?
  • Inverse of
    • Double number of rice grains for each square of the chess board
  • Many domains have hundreds of attributes
    • Occurrence of terms in text mining
    • Properties of genes
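The halving argument above can be made concrete with a few lines of Python. The sketch assumes every attribute test keeps exactly half of the remaining records, and the threshold for a "significant" partition is a made-up parameter.

```python
def max_useful_splits(n_records: int, min_records: int) -> int:
    """Number of times the data can be halved while the remaining
    'relevant' part still holds at least `min_records` records."""
    splits = 0
    remaining = n_records
    while remaining // 2 >= min_records:
        remaining //= 2
        splits += 1
    return splits


# A million records, requiring at least 30 records per partition.
print(max_useful_splits(1_000_000, 30))   # → 15
```

With hundreds of attributes available, only about 15 of them can be used along any one path before the partitions become too small, even with a million records.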
Possible Solution
  • Additive models
    • Each attribute contributes to a sum
    • Techniques exist (statistics)
      • Computationally intensive
  • Simplest: Naïve Bayes
    • x(k) is the value of the kth attribute
    • Considered additive model
      • Logarithm of probability additive
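The formula on this slide did not survive extraction. For reference, the standard naïve Bayes model is written out below, using x^{(k)} for the value of the kth attribute as above. The symbols C for the class label and M for the number of attributes are notation introduced here, not taken from the slides.

```latex
P(C \mid x) \;\propto\; P(C) \prod_{k=1}^{M} P\bigl(x^{(k)} \mid C\bigr)
\qquad\Longrightarrow\qquad
\log P(C \mid x) \;=\; \log P(C) + \sum_{k=1}^{M} \log P\bigl(x^{(k)} \mid C\bigr) + \text{const}
```

Taking the logarithm turns the product into a sum with one term per attribute, which is the sense in which naïve Bayes counts as an additive model.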
Semi-Naïve Bayes Classifier
  • Correlated attributes are joined
    • Has been done for categorical data
      • Kononenko ’91, Pazzani ’96
      • Previously: Continuous data discretized
  • New (thesis)
    • Kernel-based evaluation of correlation

  • Error decrease in units of standard deviation for different parameter sets
  • Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)
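A minimal sketch of the semi-naïve idea for continuous data follows: attributes that are strongly correlated are joined and modelled together with a product Gaussian kernel, while the remaining attributes contribute independently. This is not the thesis algorithm. In particular, it swaps in plain Pearson correlation for the kernel-based correlation measure, joins at most two attributes, and uses a fixed bandwidth `h`; all names, defaults, and data are hypothetical.

```python
import math
from collections import defaultdict


def pearson(a, b):
    """Pearson correlation of two equally long value lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def group_attributes(X, threshold):
    """Greedily pair attributes whose |correlation| reaches the threshold."""
    n_attr = len(X[0])
    cols = [[row[k] for row in X] for k in range(n_attr)]
    used, groups = set(), []
    for i in range(n_attr):
        if i in used:
            continue
        partner = next((j for j in range(i + 1, n_attr)
                        if j not in used
                        and abs(pearson(cols[i], cols[j])) >= threshold), None)
        group = (i,) if partner is None else (i, partner)
        groups.append(group)
        used.update(group)
    return groups


def kernel_density(x, points, group, h):
    """Average of product Gaussian kernels over the attributes in `group`."""
    total = 0.0
    for p in points:
        sq = sum((x[k] - p[k]) ** 2 for k in group)
        total += math.exp(-sq / (2 * h * h))
    norm = (h * math.sqrt(2 * math.pi)) ** len(group)
    return total / (len(points) * norm)


def classify(x, X, y, threshold=0.5, h=1.0):
    """Return the class with the largest sum of log prior and log group densities."""
    groups = group_attributes(X, threshold)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    best_label, best_score = None, -math.inf
    for label, points in by_class.items():
        score = math.log(len(points) / len(X))
        for g in groups:
            # Tiny constant guards against log(0) far from all training points.
            score += math.log(kernel_density(x, points, g, h) + 1e-300)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


X = [(0.0, 0.1, 5.0), (0.2, 0.1, 1.0), (1.0, 1.1, 4.0),
     (3.0, 3.1, 2.0), (3.2, 2.9, 5.0), (4.0, 4.1, 1.0)]
y = ["a", "a", "a", "b", "b", "b"]
print(group_attributes(X, 0.5))
print(classify((0.1, 0.2, 3.0), X, y), classify((3.5, 3.4, 3.0), X, y))
```

The score stays additive in the logarithm, so the classifier keeps the structure of naïve Bayes; only the units being summed change from single attributes to groups of correlated attributes.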
So Far
  • Answer to challenge 1: More records
    • Generalized P-tree structure
  • Answer to challenge 2: More attributes
    • Additive algorithms
    • Example: Kernel-based semi-naïve Bayes
  • Challenge 3: More subject domains
    • Data on a graph
    • Outlook: Data with time dependence
Standard Approach to Data Mining
  • Conversion to a relation (table)
    • Domain knowledge goes into table creation
    • Standard table can be mined with standard tools
  • Does that solve the problem?
    • To some degree, yes
    • But we can do better
Claim: Representation as single relation is not rich enough
  • Example: Contribution of a graph structure to standard mining problems
    • Genomics
      • Protein-protein interactions
    • WWW
      • Link structure
    • Scientific publications
      • Citations

Scientific American 05/03

Data on a Graph: Old Hat?
  • Common Topics
    • Analyze edge structure
      • Google
      • Biological Networks
    • Sub-graph matching
      • Chemistry
    • Visualization
      • Focus on graph structure
  • Our work
    • Focus on mining node data
    • Graph structure provides connectivity
Protein-Protein Interactions
  • Protein data
    • From Munich Information Center for Protein Sequences (also KDD-cup 02)
    • Hierarchical attributes
      • Function
      • Localization
      • Pathways
    • Gene-related


  • Interactions
    • From experiments
    • Undirected graph
  • Prediction of a property (KDD-cup 02: AHR*)
    • Which properties in neighbors are relevant?
    • How should we integrate neighbor knowledge?
  • What are interesting patterns?
    • Which properties say more about neighboring nodes than about the node itself?


*AHR: Aryl Hydrocarbon Receptor Signaling Pathway

Possible Representations
  • OR-based
    • At least one neighbor has property
    • Example: Neighbor essential true
  • AND-based
    • All neighbors have property
    • Example: Neighbor essential false
  • Path-based (depends on maximum hops)
    • One record for each path
    • Classification: weighting?
    • Association Rule Mining: record base changes
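The three representations can be sketched in Python for a single Boolean property such as "essential". The function names, the tiny example graph, and the choice to treat a node without neighbors as failing the AND condition are all assumptions made for illustration.

```python
def build_neighbors(edges, nodes):
    """Adjacency sets for an undirected graph."""
    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def or_and_features(neighbors, has_property):
    """OR-based: at least one neighbor has the property.
    AND-based: all neighbors have it (False here if there are no neighbors)."""
    features = {}
    for node, nbrs in neighbors.items():
        values = [has_property[n] for n in nbrs]
        features[node] = {"or": any(values), "and": bool(values) and all(values)}
    return features


def path_records(neighbors, has_property, start, max_hops):
    """Path-based: one record per simple path of 1..max_hops edges from `start`."""
    records = []

    def extend(path):
        if len(path) > 1:
            records.append(tuple(has_property[n] for n in path))
        if len(path) - 1 < max_hops:
            for nxt in sorted(neighbors[path[-1]]):
                if nxt not in path:
                    extend(path + [nxt])

    extend([start])
    return records


essential = {"A": True, "B": False, "C": True}        # hypothetical node data
nbrs = build_neighbors([("A", "B"), ("B", "C")], essential)
print(or_and_features(nbrs, essential)["B"])          # → {'or': True, 'and': True}
print(path_records(nbrs, essential, "A", max_hops=2))
# → [(True, False), (True, False, True)]
```

The OR- and AND-based forms give exactly one record per node, whereas the path-based form produces a varying number of records per node, which is why the record base changes for association rule mining.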







Association Rule Mining
  • OR-based representation
  • Conditions
    • Association rule involves AHR
    • Support across a link greater than within a node
    • Conditions on minimum confidence and support
    • Top 3 with respect to support:

(Results by Christopher Besemann, project CSci 366)
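The conditions listed above can be checked with standard support and confidence computations. In the sketch below each record is one node in the OR-based representation, items prefixed with `nbr:` describe a property found in at least one neighbor, and the records themselves are invented for illustration; they are not the project's data or results.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for r in records if items <= r) / len(records)


def confidence(records, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    s_a = support(records, antecedent)
    both = support(records, set(antecedent) | set(consequent))
    return both / s_a if s_a else 0.0


# Hypothetical node records: own properties plus OR-based neighbor properties.
records = [
    {"AHR", "nbr:essential"},
    {"AHR", "nbr:essential", "essential"},
    {"nbr:essential"},
    {"essential"},
]

across_link = support(records, {"nbr:essential", "AHR"})   # 0.5
within_node = support(records, {"essential", "AHR"})       # 0.25
print(across_link > within_node)                           # → True
print(confidence(records, {"nbr:essential"}, {"AHR"}))
```

A rule passes the second condition when its support across a link exceeds the support of the corresponding rule within a node, as in this toy example.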

Classification Results
  • Problem (especially path-based representation)
    • Varying amount of information per record
    • Many algorithms unsuitable in principle
      • E.g., algorithms that divide domain space
  • KDD-cup 02
    • Very simple additive model
    • Based on visually identifying relationship
    • Number of interacting essential genes adds to probability of predicting protein as AHR
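A very simple additive model of this kind might look like the sketch below, in which each interacting essential gene adds a fixed amount to a score that is then squashed into the range 0 to 1. The logistic form, the `bias` and `weight` values, and the example graph are assumptions for illustration; the slides do not state the model actually submitted.

```python
import math


def essential_neighbor_count(node, edges, essential):
    """Number of interaction partners of `node` that are marked essential."""
    count = 0
    for u, v in edges:
        if u == node and essential.get(v, False):
            count += 1
        elif v == node and essential.get(u, False):
            count += 1
    return count


def ahr_score(node, edges, essential, bias=-2.0, weight=1.0):
    """Additive score: every essential neighbor adds `weight` before the logistic."""
    z = bias + weight * essential_neighbor_count(node, edges, essential)
    return 1.0 / (1.0 + math.exp(-z))


edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]     # hypothetical interactions
essential = {"P2": True, "P3": True, "P4": False}
print(essential_neighbor_count("P1", edges, essential))  # → 2
print(ahr_score("P1", edges, essential) > ahr_score("P2", edges, essential))  # → True
```

Because the score depends only on a count, it copes with nodes that have different numbers of neighbors, which addresses the varying amount of information per record noted above.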
Outlook: Time-Dependent Data
  • KDD-cup 03
    • Prediction of citations of scientific papers
    • Old: Time-series prediction
    • New: Combination with similarity-based prediction
Conclusions and Outlook
  • Many exciting problems in data mining
  • Various challenges
    • Scaling of existing algorithms (more records)
    • Different properties in algorithms become relevant (more attributes)
    • Identifying and solving new domain-independent challenges (more subject areas)
  • Examples of general structural components that apply to many domains
    • Graph-structure
    • Time-dependence
    • Relationships between attributes