
Computational Biology




Presentation Transcript


  1. Computational Biology Classification (some parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar) Lecture Slides Week 10

  2. MBG404 Overview Processing Pipelining Generation Data Storage Mining

  3. Data Mining • Data mining • noun, Digital Technology. The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships; e.g., the use of data mining to detect fraud. • Machine learning: a branch of artificial intelligence concerning the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.
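The spam example above can be sketched as a toy learner. This is an illustrative, made-up scheme (not the course's tool): count word frequencies per class in the training messages, then classify a new message by which class its words favor.

```python
# Toy "learning from data" sketch: count how often each word appears in
# spam vs. non-spam (ham) training messages, then classify a new message
# by summed word counts per class. All messages here are made up.
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    # Score each class by summed training-word frequencies; pick the larger.
    scores = {label: sum(c[w] for w in text.lower().split())
              for label, c in counts.items()}
    return max(scores, key=scores.get)

training = [("win money now", "spam"),
            ("cheap money offer", "spam"),
            ("meeting at noon", "ham"),
            ("project meeting notes", "ham")]
model = train(training)
print(classify(model, "free money"))
```

A real spam filter would use probabilities (e.g. naive Bayes) rather than raw counts, but the train-then-classify structure is the same.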

  4. Application in Biology • Data exploration • Microarrays • Next generation sequencing • Prediction • microRNAs • Protein secondary structure

  5. Parametrization

  6. Types of Attributes • There are different types of attributes • Nominal • Examples: ID numbers, eye color, zip codes • Ordinal • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} • Interval • Examples: calendar dates, temperatures in Celsius or Fahrenheit. • Ratio • Examples: temperature in Kelvin, length, time, counts

  7. Data Type • Distinctness (nominal) • Equal or unequal • Order (ordinal) • >,<,>=,<= • Addition (interval) • +,- • Multiplication (ratio) • *,/
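The hierarchy above can be made concrete with a short sketch; the example values are ours, not from the slides. Each scale supports the operations of the scales below it plus one more.

```python
# Illustrative sketch of the four measurement scales and the operations
# each one supports (example values are made up).
eye_color_a, eye_color_b = "brown", "blue"   # nominal: only ==, !=
print(eye_color_a == eye_color_b)            # distinctness

rank_a, rank_b = 2, 5                        # ordinal: also <, >, <=, >=
print(rank_a < rank_b)                       # order

temp_c_a, temp_c_b = 10.0, 25.0              # interval: also +, -
print(temp_c_b - temp_c_a)                   # differences are meaningful

length_a, length_b = 2.0, 8.0                # ratio: also *, /
print(length_b / length_a)                   # "4x as long" is meaningful

# Note: 20 deg C is NOT "twice as hot" as 10 deg C (interval scale, no true
# zero), but 20 K IS twice 10 K (ratio scale with a true zero).
```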

  8. Data Quality • Missing data • Noise • False measurements • Outliers • Duplicate data • Precision • Bias • Accuracy

  9. Data Preprocessing • Aggregation • Sampling • Dimensionality Reduction • Feature subset selection • Feature creation • Discretization and Binarization • Attribute Transformation
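Two of the preprocessing steps above can be sketched in a few lines: equal-width discretization of a numeric attribute, and binarization (one-hot encoding) of a nominal attribute. Function names and data values are ours, for illustration.

```python
# Minimal sketches of discretization and binarization.

def discretize(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def binarize(values):
    """One-hot encode a nominal attribute into 0/1 indicator columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

petal_lengths = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]   # made-up iris-like values
print(discretize(petal_lengths, 3))               # bin indices per value
print(binarize(["setosa", "versicolor", "setosa"]))
```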

  10. Aggregation (Figure: Variation of Precipitation in Australia; standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation)

  11. Dimensionality Reduction: PCA

  12. Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.

  13. Similarity • Euclidean distance • Simple matching coefficient • Jaccard coefficient • Correlation • Cosine similarity • ...
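The measures listed above can be sketched for plain Python lists, where p and q are the attribute vectors of two data objects (as defined on the previous slide):

```python
# Minimal sketches of common similarity/dissimilarity measures.
import math

def euclidean(p, q):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simple_matching(p, q):
    """SMC: fraction of positions where binary vectors agree (0-0 and 1-1)."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

def jaccard(p, q):
    """Like SMC but ignores 0-0 matches (useful for sparse binary data)."""
    m11 = sum(a == b == 1 for a, b in zip(p, q))
    non00 = sum(not (a == b == 0) for a, b in zip(p, q))
    return m11 / non00 if non00 else 0.0

def cosine(p, q):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norms

print(euclidean([0, 0], [3, 4]))                # 5.0
print(simple_matching([1, 0, 0, 1], [1, 0, 1, 1]))  # 0.75
print(jaccard([1, 0, 0, 1], [1, 0, 1, 1]))      # 2 matches of 3 non-zero pairs
```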

  14. Dimensionality Reduction • Curse of dimensionality • Feature selection • Principal component analysis • Aggregation • Mapping of data to a different space

  15. Sampling • Dividing samples into • Training set • Test set • Using not all samples from both sets
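The holdout split described above can be sketched as follows; the 70/30 ratio and the fixed seed are our choices for illustration, not mandated by the slides.

```python
# Minimal holdout sampling sketch: shuffle the labeled examples,
# reserve a fraction for testing, train on the rest.
import random

def train_test_split(samples, test_fraction=0.3, seed=42):
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]              # copy, so the input stays intact
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

data = [(i, "class_a" if i % 2 else "class_b") for i in range(10)]
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))   # 7 3
```

Evaluating on records the model never saw during training is what makes the test-set error an honest estimate of generalization.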

  16. Classification • Examples with known classes (labels) • Learn rules of how the attributes define the classes • Classify unknown samples into the appropriate class

  17. Classification Workflow

  18. End Theory I • 5 min Mindmapping • 10 min Break

  19. Practice I

  20. Exploring Data (Irises) • Download the file Iris.txt • Follow along

  21. Exploring Data • Frequencies • Percentiles • Mean, Median • Visualizations
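The summaries above can be computed with the standard library alone; since iris.txt is the course file, a few made-up iris-like values are inlined here instead.

```python
# Sketch of basic data exploration: frequencies, mean, median, quartiles.
import statistics
from collections import Counter

sepal_length = [5.1, 4.9, 6.3, 5.8, 6.7, 5.1, 7.0]   # made-up values
species = ["setosa", "setosa", "versicolor", "versicolor",
           "virginica", "setosa", "versicolor"]

print(Counter(species))                          # frequencies per class
print(statistics.mean(sepal_length))
print(statistics.median(sepal_length))
print(statistics.quantiles(sepal_length, n=4))   # quartiles (25/50/75th pct.)
```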

  22. Data Selection • Selecting columns • Filtering rows

  23. Data Transformation • Discretize • Continuize • Feature construction

  24. Visualizations

  25. End Practice I • Break 15 min

  26. Theory II

  27. Classification Workflow

  28. Illustrating Classification Task

  29. Example of a Decision Tree (Model: decision tree learned from the training data. Attributes: Refund and MarSt are categorical, TaxInc is continuous; Cheat is the class label.) • Refund = Yes -> NO • Refund = No -> test MarSt • MarSt = Married -> NO • MarSt = Single or Divorced -> test TaxInc • TaxInc < 80K -> NO • TaxInc > 80K -> YES

  30. General Structure of Hunt’s Algorithm • Let Dt be the set of training records that reach a node t • General procedure: • If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt • If Dt is an empty set, then t is a leaf node labeled by the default class, yd • If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
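The three cases above map directly onto a compact recursive sketch. Records are (attribute-dict, label) pairs; for brevity this sketch splits on attribute equality and picks attributes in a fixed order, whereas real implementations choose the split that maximizes an impurity gain (see the "Best Split" slide below).

```python
# A compact sketch of Hunt's algorithm.
from collections import Counter

def hunt(records, attributes, default="no"):
    if not records:                          # empty D_t -> default class y_d
        return default
    labels = [lbl for _, lbl in records]
    if len(set(labels)) == 1:                # pure D_t -> leaf labeled y_t
        return labels[0]
    if not attributes:                       # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    tree = {}                                # mixed D_t -> split and recurse
    for value in {rec[attr] for rec, _ in records}:
        subset = [(rec, lbl) for rec, lbl in records if rec[attr] == value]
        tree[(attr, value)] = hunt(subset, rest, default)
    return tree

records = [({"refund": "yes", "marital": "single"}, "no"),
           ({"refund": "no",  "marital": "married"}, "no"),
           ({"refund": "no",  "marital": "single"}, "yes")]
tree = hunt(records, ["refund", "marital"])
print(tree)
```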

  31. Hunt’s Algorithm (figure: the tree is grown step by step on the cheat data) • Start: a single leaf predicting the majority class, Don’t Cheat • Split on Refund: Yes -> Don’t Cheat; No -> still mixed • Split the Refund = No branch on Marital Status: Married -> Don’t Cheat; Single, Divorced -> still mixed • Split that branch on Taxable Income: < 80K -> Don’t Cheat; >= 80K -> Cheat

  32. How to Find the Best Split • Before splitting, the node has impurity M0 • Candidate split A? yields nodes N1 and N2 with impurities M1 and M2, which combine (weighted) into M12 • Candidate split B? yields nodes N3 and N4 with impurities M3 and M4, which combine into M34 • Compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain
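The gain comparison above can be computed with any impurity measure M; the sketch below uses the Gini index, one common choice, with made-up class counts.

```python
# Gain = M0 - M12 vs. M0 - M34, with Gini index as the impurity measure.

def gini(class_counts):
    """Gini impurity of a node given its per-class record counts."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def split_impurity(children):
    """Weighted impurity of child nodes, e.g. M12 from N1 and N2."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

parent = [6, 6]                # M0: 6 records of each class before splitting
split_a = [[5, 1], [1, 5]]     # A? -> nodes N1, N2 (nearly pure)
split_b = [[3, 3], [3, 3]]     # B? -> nodes N3, N4 (still mixed)

gain_a = gini(parent) - split_impurity(split_a)   # M0 - M12
gain_b = gini(parent) - split_impurity(split_b)   # M0 - M34
print(gain_a, gain_b)          # split A separates the classes better
```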

  33. Underfitting and Overfitting • Underfitting: the model is too simple, so both training and test errors are large • Overfitting: the model fits the training data too closely, so training error keeps shrinking while test error grows

  34. Overfitting due to Noise Decision boundary is distorted by noise point

  35. Overfitting due to Insufficient Examples • Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region • With too few training records there, the decision tree predicts the test examples using other training records that are irrelevant to the classification task

  36. Cost Matrix C(i|j): Cost of misclassifying class j example as class i
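Evaluating a classifier against a cost matrix C(i|j) amounts to weighting each confusion-matrix cell by its cost; the cost and count values below are made up for illustration (missing a positive is deliberately expensive).

```python
# Total cost = sum over cells of (number of records) * C(predicted|actual).

# cost[(predicted, actual)] = C(i|j); made-up values
cost = {("yes", "yes"): -1, ("yes", "no"): 1,
        ("no", "yes"): 100, ("no", "no"): 0}

# confusion[(predicted, actual)] = records in that cell; made-up values
confusion = {("yes", "yes"): 40, ("yes", "no"): 10,
             ("no", "yes"): 5, ("no", "no"): 45}

total_cost = sum(confusion[cell] * cost[cell] for cell in confusion)
print(total_cost)   # 40*(-1) + 10*1 + 5*100 + 45*0 = 470
```

Two models with the same accuracy can have very different total costs, which is why cost-sensitive measures (next slide) matter.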

  37. Cost-Sensitive Measures • Precision is biased towards C(Yes|Yes) & C(Yes|No) • Recall is biased towards C(Yes|Yes) & C(No|Yes) • F-measure is biased towards all except C(No|No)
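The biases above follow from which confusion-matrix cells each measure touches; in count terms, tp = count(Yes|Yes), fp = count(Yes|No), fn = count(No|Yes), tn = count(No|No). A small sketch with made-up counts:

```python
# Precision, recall, and F-measure from confusion-matrix counts.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)        # harmonic mean of precision and recall

tp, fp, fn, tn = 40, 10, 5, 45        # made-up counts; note that tn appears
p = precision(tp, fp)                 # in none of the three measures, which
r = recall(tp, fn)                    # is exactly the bias the slide notes
print(p, r, f_measure(tp, fp, fn))
```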

  38. Receiver Operating Characteristic (ROC) Curve (TP,FP): • (0,0): declare everything to be negative class • (1,1): declare everything to be positive class • (1,0): ideal • Diagonal line: • Random guessing • Below diagonal line: • prediction is opposite of the true class

  39. Using ROC for Model Comparison • Neither model consistently outperforms the other • M1 is better for small FPR • M2 is better for large FPR • Area Under the ROC curve (AUC) • Ideal: Area = 1 • Random guess: Area = 0.5
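The area under a ROC curve can be sketched with the trapezoidal rule over (FPR, TPR) points sorted by FPR; the three-point model curve below is made up for illustration, but the diagonal and ideal curves reproduce the reference values from the slides.

```python
# AUC by the trapezoidal rule over (FPR, TPR) points sorted by FPR.

def auc(points):
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0   # trapezoid between two points
    return area

diagonal = [(0.0, 0.0), (1.0, 1.0)]               # random guessing
ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]      # hits the (TP, FP) = (1, 0) corner
model = [(0.0, 0.0), (0.2, 0.7), (1.0, 1.0)]      # made-up classifier

print(auc(diagonal))   # 0.5
print(auc(ideal))      # 1.0
print(auc(model))
```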

  40. End Theory II • 5 min Mindmapping • 10 min Break

  41. Practice II

  42. Learning • Supervised (Classification) • Classification • Decision tree • SVM

  43. Classification • Use the iris.txt file for classification • Follow along as we classify

  44. Classification • Use the orangeexample file for classification • We are interested in whether we can distinguish between miRNAs and random sequences with the selected features • Try yourself
