Data Mining and Bioinformatics

April 30, 2004

PowerPoint Slideshow about 'Data Mining and Bioinformatics' - mayda


What is Data Mining?
  • Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS Institute)
  • Example: detecting suspicious transactions with credit cards
A Newer Definition
  • Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
The “Beers and Diapers” Story
  • Analyze sales records
  • Beers & diapers frequently occur together in customer orders
  • Put beers next to diapers
  • Sales volume increases dramatically
  • Explanation?
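The first two steps of the story (analyzing sales records and spotting items that occur together) can be sketched as a simple pair count. This is a minimal illustration, not the analysis behind the story: the `orders` list is hypothetical toy data, and `pair_support` is a helper name introduced here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy orders, invented for illustration only.
orders = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

def pair_support(orders):
    """Return, for each item pair, the fraction of orders containing both items."""
    counts = Counter()
    for order in orders:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(order), 2):
            counts[pair] += 1
    return {pair: n / len(orders) for pair, n in counts.items()}

support = pair_support(orders)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```

A high co-occurrence fraction only flags a candidate relationship; whether moving the products together raises sales is a separate question that the data owner still has to test.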
Why Do Data Mining
  • Do you know the differences between the following concepts?
    • Data
    • Information
    • Knowledge
  • Difference between data mining and data analysis
    • The latter is more specific
What do We Aim to Mine?
  • Relationships and summaries
    • Models (global summary of a data set)
      • Linear equations, clusters, graphs, tree structures
      • Prediction, classification, interpretation
    • Patterns (local, restricted regions)
      • Recurrent patterns, rules
      • Unusualness - Anomaly detection
    • Analogy to data compression
The Whole KDD Process
  • KDD: Knowledge Discovery in Databases
    • Selecting the target data
    • Preprocessing the data
    • Transforming them if necessary
    • Performing data mining to extract patterns and relationships
    • Interpreting and assessing the discovered structures
Data Mining Techniques
  • Many of them originate from statistics, machine learning, or pattern recognition
  • General steps
    • Determine the nature and structure of the representation to be used
    • Decide how to quantify and compare how well different representations fit the data (score function)
    • Choose an algorithm to optimize the score function
    • Decide what principles of data management are required to implement the algorithm efficiently
  • Example: Regression analysis X = aY + b
    • Credit card spending vs Annual income
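The regression example maps onto the general steps as follows: the representation is the line X = aY + b, the score function is the sum of squared residuals, and the optimization has a closed-form solution. The sketch below assumes X is credit card spending and Y is annual income; the figures are hypothetical and `fit_line` is a name introduced here.

```python
def fit_line(ys, xs):
    """Least-squares fit of x = a*y + b; returns (a, b).

    Minimizes the sum of squared residuals (the score function)
    using the closed-form solution for a single predictor.
    """
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    cov = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    var = sum((y - mean_y) ** 2 for y in ys)
    a = cov / var
    b = mean_x - a * mean_y
    return a, b

# Hypothetical figures in thousands of dollars, invented for illustration.
income = [30.0, 45.0, 60.0, 80.0, 100.0]    # Y: annual income
spending = [4.0, 5.5, 7.0, 9.0, 11.0]       # X: credit card spending

a, b = fit_line(income, spending)
print(a, b)
```

With a single predictor no data-management concerns arise; for data sets too large for memory, the same sums can be accumulated in one pass over the records.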
Techniques
  • Regression/Fitting
  • Clustering
  • Neural networks
  • Bayesian networks
  • Hidden Markov models
Naïve Bayesian - Continued
  • 9 yes samples (out of 14):
    • 2 sunny, 3 cool, 3 high, 3 true
    • Prob of yes: 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053
  • 5 no samples (out of 14):
    • 3 sunny, 1 cool, 4 high, 3 true
    • Prob of no: 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206
  • Yes / No = 20.5% / 79.5%
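The arithmetic on this slide can be checked with a short sketch. It assumes 3 "true" samples among the yes cases, which is the count that reproduces the stated 0.0053, and it treats the two products as unnormalized scores that are then normalized to sum to 1. The function name `naive_bayes_score` is introduced here for illustration.

```python
def naive_bayes_score(class_count, total, feature_counts):
    """Unnormalized naive Bayes score for one class.

    Multiplies the class prior (class_count / total) by the conditional
    frequency of each observed feature value within that class.
    """
    score = class_count / total
    for count in feature_counts:
        score *= count / class_count
    return score

# Counts per class for the observed values: sunny, cool, high, true.
yes_score = naive_bayes_score(9, 14, [2, 3, 3, 3])
no_score = naive_bayes_score(5, 14, [3, 1, 4, 3])

# Normalize so the two class probabilities sum to 1.
p_yes = yes_score / (yes_score + no_score)
p_no = no_score / (yes_score + no_score)

print(f"yes score = {yes_score:.4f}, no score = {no_score:.4f}")
print(f"Yes / No = {p_yes:.1%} / {p_no:.1%}")
```

The "naïve" part is the multiplication itself: it assumes the feature values are independent of each other within each class.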
Clustering
  • Iterative clustering
    • K-means
  • Hierarchical clustering
    • Agglomerative method
  • Probabilistic model-based clustering
    • EM (Expectation Maximization)
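As an illustration of the iterative approach, here is a minimal k-means sketch for 2-D points. It is a teaching example under simplifying assumptions: the starting centroids are supplied by the caller, the iteration count is fixed rather than convergence-tested, and the sample points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster is empty.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so in practice the procedure is usually run from several different starts and the best grouping is kept.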
Data Mining Applications
  • Interdisciplinary
    • statistics, databases, machine learning, pattern recognition, AI, visualization, etc
  • Applications:
    • Marketing – sales model, Finance – loan decision
    • Insurance – risk analysis, Telecom – load prediction
    • Web/text mining, Surveillance – security
    • Bioinformatics …
In Bioinformatics
  • Analysis of Microarray Data
  • Mining free text
  • Structural genomics – protein crystallization
  • Predicting structure from sequence
  • Common theme: complex data, fast growing (outgrowing our processing power)
Data Collection and Preprocessing
  • Microarray Expression Data
    • Fluorescence level
    • Noisy
Machine Learning Tasks
  • Design of Microarrays
    • Probes (67 features) with fluorescence value → learn to choose the best probes for a new gene
  • Biological Applications of Microarrays
    • Classify new examples
    • Predict the functional category of genes
    • Cluster genes based on similarity
    • Cluster experimental conditions
    • Learn a Bayesian network (that captures the joint prob distribution over the expression levels of genes)
Machine Learning Tasks (cont’d)
  • Medical Applications of Microarrays
    • Cell disease classification
    • Predicting existing disease classes
    • Predicting the prognosis
    • Predicting the drug response of different patients
Wrap It Up
  • Data mining has great potential
  • Danger: don’t over-predict
    • S&P index = function of the previous year’s butter production, cheese production, and sheep population in Bangladesh and the US?
  • Finally - don’t expect it to answer all questions
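The over-prediction danger can be demonstrated with pure noise. The sketch below is an illustration with invented random data, not the S&P analysis mentioned above: it searches 1000 random series for the one that correlates best with a random 10-point "target" series. All names and sizes are assumptions chosen for the demonstration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
years = 10

# A random "index" and 1000 unrelated random candidate predictors.
target = [rng.gauss(0, 1) for _ in range(years)]
candidates = [[rng.gauss(0, 1) for _ in range(years)] for _ in range(1000)]

# Pick the candidate with the strongest correlation, positive or negative.
best_r = max(abs(pearson(c, target)) for c in candidates)
print(round(best_r, 2))
```

Because so many candidates are tried against so few data points, the strongest correlation found is high even though every series is noise, which is why a discovered pattern should be checked on data that was not used to find it.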