Lecture 5 TIES445 Data mining Nov-Dec 2007 Sami Äyrämö

A data mining algorithm
  • "A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models and patterns"
    • "Well-defined" indicates that the procedure can be precisely encoded as a finite set of rules
    • "Algorithm": a procedure that always terminates after a finite number of steps and produces an output
    • "Computational method": has all the properties of an algorithm except the guarantee that the procedure terminates in a finite number of steps
      • (A computational method is usually described more abstractly than an algorithm, e.g., steepest descent is a computational method)
Data mining tasks
  • Explorative (visualization)
  • Descriptive (clustering, rule finding, …)
  • Predictive (classification, regression, …)
Elements of data mining algorithms
  • Data mining task
  • Structure of the model or pattern
  • Score function
  • Search/optimization method
  • Data management technique
Structure of the model or pattern
  • Structure (functional form) of the model or pattern that will be fitted to the data
  • Defines the boundaries of what can be approximated or learned
  • Within these boundaries, the data guide us to a particular model or pattern
  • E.g., hierarchical clustering model, linear regression model, mixture model
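As a small illustration of how the chosen structure bounds what can be learned, the sketch below fits two structures to the same data by least squares: a constant model and a linear model. The data values and function names are made up for this example and do not come from the lecture.

```python
def fit_constant(xs, ys):
    """Constant structure y = c: least squares gives the mean of ys."""
    c = sum(ys) / len(ys)
    return lambda x: c


def fit_linear(xs, ys):
    """Linear structure y = a + b*x, fitted by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def squared_error(model, xs, ys):
    """Sum of squared differences between model output and observations."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

constant_model = fit_constant(xs, ys)
linear_model = fit_linear(xs, ys)

# The constant structure cannot follow the trend; the linear one can.
print(squared_error(constant_model, xs, ys))
print(squared_error(linear_model, xs, ys))
```

Within each structure the data pick the parameters, but no amount of data lets the constant model represent the slope.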
Structure: decision tree

Figure from the book: Tan, Steinbach, Kumar, "Introduction to Data Mining", Addison-Wesley, 2006.

Structure: MLP

Figures by Tommi Kärkkäinen

Score function
  • Judge the quality of the fitted models or patterns based on observed data
  • Minimized/maximized when fitting parameters to our models and patterns
  • Critical for learning and generalization
    • Goodness-of-fit vs. generalization
      • e.g., the number of neurons in a neural network
  • E.g., misclassification error, squared error, support/accuracy
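The score functions named above can be sketched in a few lines. The definitions of rule support and accuracy used here (support as the fraction of records where both sides of the rule hold, accuracy as the fraction of antecedent-matching records where the consequent also holds) are a common convention assumed for this example, and the data are invented.

```python
def misclassification_error(predicted, actual):
    """Fraction of items whose predicted label differs from the true label."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def sum_of_squared_errors(predicted, actual):
    """Sum of squared differences between predictions and observed values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


def rule_support_and_accuracy(transactions, antecedent, consequent):
    """Support and accuracy of the rule antecedent -> consequent (assumed definitions)."""
    matches_antecedent = [t for t in transactions if antecedent <= t]
    matches_both = [t for t in matches_antecedent if consequent <= t]
    support = len(matches_both) / len(transactions)
    accuracy = len(matches_both) / len(matches_antecedent) if matches_antecedent else 0.0
    return support, accuracy


# Hypothetical labels, values and market-basket records.
error_rate = misclassification_error(["a", "b", "a", "a"], ["a", "b", "b", "a"])
sse = sum_of_squared_errors([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
baskets = [{"bread", "butter"}, {"bread"}, {"butter"}, {"bread", "butter", "milk"}]
support, accuracy = rule_support_and_accuracy(baskets, {"bread"}, {"butter"})
```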
Score functions: Prototype-based clustering

α = 2, q = 2 → K-means

α = 1, q = 2 → K-spatialmedians

α = 1, q = 1 → K-coord. medians

  • Different statistical properties of the cluster models
  • Different algorithms and computational methods for solving the resulting optimization problems
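The formula on this slide did not survive in the transcript. One form consistent with the three parameter settings listed above, assumed here rather than taken from the slide, sums over all data points the q-norm distance to the assigned prototype, raised to the power α. A minimal sketch under that assumption, with hypothetical names and data:

```python
def clustering_score(points, prototypes, assignment, alpha, q):
    """Assumed score: sum over points of (q-norm distance to assigned prototype) ** alpha."""
    total = 0.0
    for point, k in zip(points, assignment):
        prototype = prototypes[k]
        distance = sum(abs(x - c) ** q for x, c in zip(point, prototype)) ** (1.0 / q)
        total += distance ** alpha
    return total


# Hypothetical data: two points, one prototype at the origin.
points = [(0.0, 0.0), (3.0, 4.0)]
prototypes = [(0.0, 0.0)]
assignment = [0, 0]

k_means_score = clustering_score(points, prototypes, assignment, alpha=2, q=2)
k_spatialmedians_score = clustering_score(points, prototypes, assignment, alpha=1, q=2)
k_coord_medians_score = clustering_score(points, prototypes, assignment, alpha=1, q=1)
```

The same far-away point contributes its squared Euclidean distance to the K-means score but only its plain distance to the other two, which is one way to see why the cluster models differ in their sensitivity to outliers.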
Search/optimization method
  • Used to search over parameters and structures
  • Computational procedures and algorithms used to find the maximum/minimum of the score function for particular models or patterns
    • Includes:
      • Computational methods used to optimize the score function, e.g., steepest descent
      • Search-related parameters, e.g., the maximum number of iterations or convergence specification for an iterative algorithm
  • Single fixed structure (e.g., kth-order polynomial function of the inputs) or a family of different structures (i.e., search over both the structures and their associated parameter spaces)
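To make the role of the search-related parameters concrete, here is a minimal sketch of steepest descent on a one-variable score function. The function name, the step size and the quadratic example are assumptions chosen for illustration. Capping the iterations with max_iter is what turns the open-ended computational method into a procedure that is guaranteed to terminate.

```python
def steepest_descent(gradient, start, step_size=0.1, max_iter=1000, tol=1e-8):
    """Minimise a one-variable function given its gradient.

    max_iter and tol are the search-related parameters: they bound the number
    of steps and define when the iteration counts as converged.
    """
    x = start
    for iteration in range(1, max_iter + 1):
        step = step_size * gradient(x)
        x -= step
        if abs(step) < tol:
            return x, iteration, True
    return x, max_iter, False


# Hypothetical score function f(x) = (x - 3)^2 with gradient 2(x - 3).
def example_gradient(x):
    return 2.0 * (x - 3.0)


x_min, iterations, converged = steepest_descent(example_gradient, start=0.0)

# With a very small iteration budget the search stops before converging.
x_early, iterations_early, converged_early = steepest_descent(
    example_gradient, start=0.0, max_iter=5
)
```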
Search/optimization: K-means-like clustering
  • Initialize the cluster prototypes
  • Assign each data point to the closest cluster prototype
  • Compute the new estimates (may require another iterative algorithm) for the cluster prototypes
  • Termination: stop if the termination criteria are satisfied (usually no changes in the assignment I of data points to clusters)
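The four steps above can be sketched for the K-means case, where the new prototype estimate is simply the cluster mean. The random initialisation, the handling of empty clusters and the max_iter cap are choices made for this sketch, not details given in the lecture.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise the prototypes with k randomly chosen data points.
    prototypes = [tuple(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest prototype.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, prototypes[j]))
            for p in points
        ]
        # Step 4: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each prototype as the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old prototype if a cluster is empty
                prototypes[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return prototypes, assignment


# Hypothetical data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
prototypes, assignment = k_means(points, k=2)
```

For the mean, step 3 has a closed form. For a spatial median it does not, which is where another iterative algorithm would be needed inside the loop.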
Data management technique
  • Storing, indexing, and retrieving data
  • Not usually specified by statistical or machine learning algorithms
    • A common assumption is that the data set is small enough to reside in main memory, so that random access to any data point is free relative to the actual computational costs
  • Massive data sets may exceed the capacity of available main memory
    • The physical location of the data and the manner in which it is accessed can be critically important in terms of algorithm efficiency
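One way to respect these constraints is to arrange the computation as a single sequential pass over the data instead of relying on random access. The sketch below computes a running mean that way. The function name is hypothetical, and io.StringIO stands in for a file too large to load at once.

```python
import io


def streaming_mean(lines):
    """Compute the mean of one numeric value per line in a single sequential pass."""
    count = 0
    mean = 0.0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        value = float(line)
        count += 1
        # Incremental update: only the running mean and count are kept in memory.
        mean += (value - mean) / count
    return mean, count


# Stand-in for a large file; a real file object can be iterated the same way.
data = io.StringIO("1.0\n2.0\n3.0\n4.0\n")
mean, count = streaming_mean(data)
```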
Data management technique: memory
  • A general categorization of different memory structures
    • Registers of processors: direct access, no slowdown
    • On-processor or on-board cache: fast semiconductor memory on the same chip as the processor
    • Main memory: normal semiconductor memory (up to several gigabytes)
    • Disk cache: intermediate storage between main memory and disks
    • Disk memory: terabytes; access time in milliseconds
    • Magnetic tape: access time can be minutes
Data management: index structures
  • B-trees
  • Hash indices
  • Kd-trees
  • Multidimensional indexing
  • Relational data tables
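As a minimal illustration of one of these structures, the sketch below builds a hash index over one column of a small table, so that a lookup by value does not have to scan every row. The table contents and the function name are invented for this example.

```python
from collections import defaultdict


def build_hash_index(rows, column):
    """Map each value of `column` to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


# Hypothetical table stored as a list of records.
rows = [
    {"id": 1, "city": "Helsinki"},
    {"id": 2, "city": "Oulu"},
    {"id": 3, "city": "Helsinki"},
]

city_index = build_hash_index(rows, "city")

# Lookup goes straight to the matching row positions instead of scanning all rows.
matches = [rows[i] for i in city_index.get("Helsinki", [])]
```

A hash index answers exact-match queries only. Range queries or nearest-neighbour queries in several dimensions call for tree-based structures such as B-trees or kd-trees.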