
COT6930 Course Project


mruddy

Presentation Transcript


  1. COT6930 Course Project

  2. Outline • Gene Selection • Sequence Alignment

  3. Why Gene Selection • Identify marker genes that characterize different tumor statuses. • Many genes are redundant and introduce noise that lowers classification performance. • Can eventually lead to a diagnostic chip (“breast cancer chip”, “liver cancer chip”).

  4. Why Gene Selection

  5. Gene Selection • Methods fall into three categories: • Filter methods • Wrapper methods • Embedded methods • Filter methods are the simplest and the most frequently used in the literature. • Wrapper methods are likely the most accurate.

  6. Filter Method • Features (genes) are scored according to the evidence of their predictive power and then ranked. • The top s genes with the highest scores are selected and used by the classifier. • Scores: t-statistics, F-statistics, signal-to-noise ratio, … • The number of features selected, s, is then determined by cross-validation. • Advantage: fast and easy to interpret.
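The scoring-and-ranking step above can be sketched as follows. This is a minimal illustration, assuming a two-class problem and a Welch t-statistic as the score; the function names `t_score` and `rank_genes` and the toy data are hypothetical, and choosing s by cross-validation is not shown.

```python
import math

def t_score(values_a, values_b):
    """Welch two-sample t-statistic for one gene across two classes."""
    na, nb = len(values_a), len(values_b)
    mean_a = sum(values_a) / na
    mean_b = sum(values_b) / nb
    var_a = sum((v - mean_a) ** 2 for v in values_a) / (na - 1)
    var_b = sum((v - mean_b) ** 2 for v in values_b) / (nb - 1)
    denom = math.sqrt(var_a / na + var_b / nb)
    return (mean_a - mean_b) / denom if denom > 0 else 0.0

def rank_genes(X, y, s):
    """Return the indices of the s genes with the largest |t|.

    X is a list of samples (rows) by genes (columns); y holds 0/1 labels.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(abs(t_score(a, b)))
    order = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return order[:s]

# Toy data (made up): gene 0 separates the classes, genes 1 and 2 do not.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.1],
    [3.0, 5.0, 0.8],
    [3.1, 5.2, 0.3],
    [2.9, 4.8, 0.7],
]
y = [0, 0, 0, 1, 1, 1]
top = rank_genes(X, y, s=1)
```

Each gene is scored on its own, which is what makes the method fast and also what causes the redundancy problem discussed on the next slides.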

  7. Good versus bad features

  8. Filter Method: Problem • Genes are considered independently. • Redundant genes may be included. • Genes that are individually weak but jointly have strong discriminant power will be ignored. • Good single features do not necessarily form a good feature set. • The filtering procedure is independent of the classification method. • The selected features can be applied to all types of classification methods.

  9. Wrapper Method • Iterative search: many “feature subsets” are scored based on classification performance, and the best one is used. • Goal: select a good subset of features. • Subset selection: forward selection, backward selection, and their combinations. • Exhaustive search is infeasible (n genes yield 2^n subsets). • Greedy algorithms are used instead.
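Greedy forward selection, one of the subset-search strategies listed above, might look like the sketch below. The classifier (nearest centroid) and the score (leave-one-out accuracy) are assumptions chosen to keep the example short; the slides do not prescribe either, and the names `loo_accuracy` and `forward_selection` are hypothetical.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        # Build per-class centroids from every sample except sample i.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            vec = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                vec[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            return sum((X[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        predicted = min(sums, key=dist)
        correct += predicted == y[i]
    return correct / len(X)

def forward_selection(X, y, max_features):
    """Add one feature at a time, keeping the one that improves accuracy most."""
    selected, best_score = [], 0.0
    remaining = set(range(len(X[0])))
    while remaining and len(selected) < max_features:
        candidate, candidate_score = None, best_score
        for f in sorted(remaining):
            score = loo_accuracy(X, y, selected + [f])
            if score > candidate_score:
                candidate, candidate_score = f, score
        if candidate is None:  # no remaining feature improves accuracy
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Toy data (made up): only feature 1 separates the two classes.
X = [
    [0.5, 1.0, 3.0],
    [0.1, 1.2, 1.0],
    [0.9, 0.8, 2.0],
    [0.4, 3.0, 2.5],
    [0.6, 3.2, 1.5],
    [0.2, 2.9, 3.0],
]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y, max_features=2)
```

Note that the classifier is rebuilt for every candidate subset, which is the source of the computational cost discussed on the next slide.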

  10. Wrapper Method: Problem • Computationally expensive: for each feature subset considered, the classifier is built and evaluated. • Exhaustive search is infeasible, so only greedy search is practical. • Easy to overfit.

  11. Embedded Method • Attempts to train a classifier and select a feature subset jointly. • Often optimizes an objective function that rewards classification accuracy and penalizes the use of more features. • Intuitively appealing.
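One common way to write such an objective is a loss term plus a sparsity penalty. The form below is an illustrative assumption (an L1 penalty, as in lasso-style models); the slides do not name a specific embedded method, and the symbols $w$, $b$, $\ell$, and $\lambda$ are introduced here only for the example.

```latex
\min_{w,\,b} \;
\underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i,\; w^\top x_i + b\bigr)}_{\text{classification loss}}
\;+\;
\underbrace{\lambda \,\lVert w \rVert_1}_{\text{penalty on features used}}
```

Here $x_i$ is the expression profile of sample $i$, $y_i$ its label, and $\lambda \ge 0$ trades accuracy against the number of genes with nonzero weight; genes whose weight is driven to zero are effectively deselected.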

  12. Relief-F • Relief-F is a filter approach for feature selection • Relief

  13. Relief-F • The original Relief can only handle binary classification problems. • Relief-F extends it to handle multi-class problems.

  14. Relief-F • Categorical attributes • Numerical attributes
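The slide content for the attribute-difference functions did not survive in this transcript, so the sketch below uses the standard definitions from the Relief literature: a categorical difference is 0 or 1, and a numerical difference is the absolute difference divided by the attribute's range. It implements the basic two-class Relief update with a single nearest hit and nearest miss; the names `diff` and `relief` and the toy data are hypothetical.

```python
import random

def diff(a, b, is_categorical, value_range):
    """Difference between two values of one attribute, in [0, 1]."""
    if is_categorical:
        return 0.0 if a == b else 1.0
    return abs(a - b) / value_range if value_range > 0 else 0.0

def relief(X, y, categorical, m, seed=0):
    """Basic two-class Relief: m random instances, one nearest hit and miss each."""
    rng = random.Random(seed)
    n_attrs = len(X[0])
    ranges = [
        0.0 if categorical[a] else max(r[a] for r in X) - min(r[a] for r in X)
        for a in range(n_attrs)
    ]

    def distance(i, j):
        return sum(diff(X[i][a], X[j][a], categorical[a], ranges[a])
                   for a in range(n_attrs))

    weights = [0.0] * n_attrs
    for _ in range(m):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(i, j))
        miss = min(misses, key=lambda j: distance(i, j))
        for a in range(n_attrs):
            # Penalize attributes that differ on the nearest same-class instance,
            # reward attributes that differ on the nearest other-class instance.
            weights[a] -= diff(X[i][a], X[hit][a], categorical[a], ranges[a]) / m
            weights[a] += diff(X[i][a], X[miss][a], categorical[a], ranges[a]) / m
    return weights

# Toy data (made up): attribute 0 is numerical and informative,
# attribute 1 is categorical noise.
X = [[0.1, "a"], [0.2, "b"], [0.15, "a"], [0.9, "b"], [0.8, "a"], [0.95, "b"]]
y = [0, 0, 0, 1, 1, 1]
w = relief(X, y, categorical=[False, True], m=20)
```

Relief-F differs from this sketch by averaging over k nearest hits and, for each other class, k nearest misses weighted by class prior.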

  15. Relief-F Problem • Time complexity: $m \times (m \times a + c \times m \times a + a) = O(c\,m^2 a)$ • Assume $m = 100$, $c = 3$, $a = 10{,}000$ • Time complexity: $300 \times 10^6$ • Scores each attribute individually, so it cannot select a subset of “good” genes
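As a worked check of the figure above, the expression can be expanded and evaluated directly. The slide does not define the symbols; the usual reading, assumed here, is $m$ instances, $c$ classes, and $a$ attributes.

```latex
\begin{aligned}
m\,(m a + c\,m a + a) &= (1 + c)\,m^2 a + m a \\
                      &= 4 \cdot 100^2 \cdot 10^4 + 100 \cdot 10^4 \\
                      &= 4 \times 10^8 + 10^6 \approx 4.01 \times 10^8, \\
c\,m^2 a              &= 3 \cdot 100^2 \cdot 10^4 = 3 \times 10^8 = 300 \times 10^6 .
\end{aligned}
```

The quoted $300 \times 10^6$ is therefore the dominant term $c\,m^2 a$; the exact count is slightly higher but of the same order.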

  16. Solution: Parallel Relief-F • Version 1: • Clusters run ReliefF in parallel, and the updated weight values are collected at the master. • Theoretical time complexity: $O(c\,m^2 a / p)$ • p is the number of clusters
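A rough sketch of the Version 1 idea follows. The slides do not say how work is partitioned, so this assumes the sampled instances are split into p chunks, each worker computes partial weight updates for its chunk, and the master sums them. Threads stand in for clusters, attributes are assumed numerical, the problem is assumed two-class, and `partial_weights` and `parallel_relief` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_weights(X, y, ranges, sample_indices, m):
    """Relief weight updates contributed by one worker's share of the instances."""
    n_attrs = len(X[0])

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / ranges[a] if ranges[a] > 0 else 0.0

    def distance(i, j):
        return sum(diff(a, i, j) for a in range(n_attrs))

    w = [0.0] * n_attrs
    for i in sample_indices:
        hit = min((j for j in range(len(X)) if j != i and y[j] == y[i]),
                  key=lambda j: distance(i, j))
        miss = min((j for j in range(len(X)) if y[j] != y[i]),
                   key=lambda j: distance(i, j))
        for a in range(n_attrs):
            w[a] += (diff(a, i, miss) - diff(a, i, hit)) / m
    return w

def parallel_relief(X, y, p):
    """Split the instances over p workers and sum their partial weights."""
    n_attrs = len(X[0])
    ranges = [max(r[a] for r in X) - min(r[a] for r in X) for a in range(n_attrs)]
    indices = list(range(len(X)))            # every instance is used once (m = len(X))
    chunks = [indices[k::p] for k in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(
            lambda chunk: partial_weights(X, y, ranges, chunk, len(indices)),
            chunks))
    # Master step: aggregate the partial weight vectors.
    return [sum(part[a] for part in partials) for a in range(n_attrs)]

# Toy data (made up): attribute 0 is informative, attribute 1 is noise.
X = [[0.1, 0.5], [0.2, 0.1], [0.15, 0.9], [0.9, 0.4], [0.8, 0.6], [0.95, 0.2]]
y = [0, 0, 0, 1, 1, 1]
w_serial = parallel_relief(X, y, p=1)
w_parallel = parallel_relief(X, y, p=3)
```

Because each worker reads the data but writes only its own partial vector, the result matches the serial run up to floating-point summation order, which is consistent with the $1/p$ speed-up claimed above.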

  17. Parallel Relief-F • Version 2: • Clusters run ReliefF in parallel, and each cluster directly updates the global weight values. • Each cluster also uses the current weight values when selecting nearest-neighbour instances. • Theoretical time complexity: $O(c\,m^2 a / p)$ • p is the number of clusters

  18. Parallel Relief-F • Version 3: • Consider selecting a subset of important features. • Compare classification with and without a specific feature to understand the importance of a gene with respect to an existing subset of features. • Discussion in private!

  19. Outline • Gene Selection • Sequence Alignment • Given a dataset D with N=1000 sequences (e.g., 1000 each) • Given an input x: • Do pair-wise global sequence alignment between x and all sequences in D • Dispatch jobs to clusters • Aggregate the results
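The dispatch-and-aggregate pattern for the alignment task can be sketched as below. It assumes Needleman-Wunsch scoring with a linear gap penalty and made-up score values (`MATCH`, `MISMATCH`, `GAP`); threads stand in for clusters, and the function names are hypothetical. Only the alignment score is computed, not the alignment itself.

```python
from concurrent.futures import ThreadPoolExecutor

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed scoring scheme, not from the slides

def global_alignment_score(x, s):
    """Needleman-Wunsch global alignment score, keeping one DP row at a time."""
    prev = [j * GAP for j in range(len(s) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * GAP]
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (MATCH if x[i - 1] == s[j - 1] else MISMATCH)
            curr.append(max(diag, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return prev[-1]

def align_against_dataset(x, D, workers=4):
    """Dispatch one alignment job per sequence in D, then aggregate the scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda s: global_alignment_score(x, s), D))
    # Aggregation step: rank the sequences from best to worst score.
    return sorted(zip(scores, D), reverse=True)

ranked = align_against_dataset("ACGT", ["AGT", "ACGT", "TTTT"])
best_score, best_sequence = ranked[0]
```

Each pairwise alignment is independent of the others, so the jobs can be distributed without any communication until the final aggregation.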
