ECML PKDD Workshop Challenge : Mining and exploiting interpretable local patterns

Natalja Friesen

Motivation for the challenge dataset
Gene expression analysis:
• Finding a set of genes associated with a certain disease
• Description of the discovered genes according to their functional role
Question: How to translate gene names into understandable biological knowledge automatically?
• Not a pure data mining problem: no prediction model is required
• Understandability of the results is the key success factor
• Goal of the challenge: investigation of typical key requirements for the usage of local pattern mining

Dataset Description
• The original dataset comes from a study of the responses of cancer cells to ionizing radiation, Amundson et al. (2008)*
• A major determinant of gene expression responses to ionizing radiation is the p53 status
• p53 is a well-known tumor suppressor protein involved in the prevention of cancer
• 60 cell lines representing nine tumor types: breast, central nervous system, colon, leukemia, lung, melanoma, ovarian, prostate, and renal
• We employed the Student t-test to identify the genes that are differentially expressed (p-value < 0.05) according to the p53 status
* Sally A. Amundson, Khanh T. Do, Lisa C. Vinikoor, R. Anthony Lee, Christine A. Koch-Paiz, Jaeyong Ahn, Mark Reimers, Yidong Chen, Dominic A. Scudiero, John N. Weinstein, Jeffrey M. Trent, Michael L. Bittner, Paul S. Meltzer, and Albert J. Fornace. Integrating Global Gene Expression and Radiation Survival Parameters across the 60 Cell Lines of the National Cancer Institute Anticancer Drug Screen. Cancer Res, 68(2):415–424, January 2008.
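The labelling step above can be sketched as follows. This is a minimal illustration, not the challenge organizers' code: it assumes the pooled-variance (equal variance) form of the two-sample Student t-test, and the function names and expression values are hypothetical. The p-value is obtained by numerically integrating the t density, so only the Python standard library is needed.

```python
import math
from statistics import mean, variance


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for a t statistic, by Simpson integration of the t density."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    # Log of the normalising constant of Student's t density with df degrees of freedom.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # approximates P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def student_t_test(group_a, group_b):
    """Pooled-variance two-sample Student t-test; returns (t, two-sided p)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled * (1 / na + 1 / nb))
    t = (mean(group_a) - mean(group_b)) / std_err
    return t, t_two_sided_p(t, df)


def is_differentially_expressed(expr_wildtype, expr_mutant, alpha=0.05):
    """Label a gene True when its expression differs by p53 status at level alpha."""
    _, p = student_t_test(expr_wildtype, expr_mutant)
    return p < alpha


# Hypothetical expression values of one gene in p53 wild-type and p53 mutant cell lines.
wildtype = [5.1, 5.3, 4.9, 5.2]
mutant = [3.0, 3.2, 2.9, 3.1]
label = is_differentially_expressed(wildtype, mutant)
```

Applying `is_differentially_expressed` gene by gene would give one binary label per gene. No multiple-testing correction is sketched here, since the slides mention only the p-value threshold.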

Enrichment by Gene Ontology terms
The set of genes was enriched using Gene Ontology (GO) terms. Gene Ontology includes 38137 terms.
The three categories of the GO hierarchy are:
• biological processes (23928 terms)
• molecular functions (9467 terms)
• cellular component (3050 terms)
The resulting dataset consists of 6172 genes that are described by 9027 GO terms.
The label indicates whether genes are statistically associated with the p53 status.
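To make the shape of this dataset concrete, the sketch below represents each gene as a set of GO terms plus a binary label, and scores a candidate subgroup (a conjunction of GO terms). The slides do not say which quality function the participants used. Weighted relative accuracy (WRAcc), a common choice in subgroup discovery, is assumed here purely for illustration, and the gene names and term names are made up.

```python
def covered_genes(dataset, description):
    """Genes annotated with every GO term in the subgroup description."""
    return [gene for gene, (terms, _) in dataset.items() if description <= terms]


def wracc(dataset, description):
    """Weighted relative accuracy: coverage * (positive rate in subgroup - overall positive rate)."""
    total = len(dataset)
    covered = covered_genes(dataset, description)
    if total == 0 or not covered:
        return 0.0
    overall_rate = sum(label for _, label in dataset.values()) / total
    subgroup_rate = sum(dataset[gene][1] for gene in covered) / len(covered)
    return (len(covered) / total) * (subgroup_rate - overall_rate)


# Hypothetical toy data: gene -> (set of GO terms, associated with p53 status?)
toy = {
    "gene1": ({"term_A", "term_B"}, True),
    "gene2": ({"term_A"}, True),
    "gene3": ({"term_B"}, False),
    "gene4": ({"term_A", "term_C"}, False),
}

score = wracc(toy, {"term_A"})
```

In the real dataset the dictionary would hold 6172 genes and the descriptions would be drawn from the 9027 GO terms.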

Evaluation
Validation of local patterns by domain experts from the Biological Research Foundation (BRF)
Questionnaire according to the following criteria:
• Novelty: whether the subgroup comprises new knowledge
• Usability: whether the knowledge in the subgroup is useful for the researcher
• Generality: whether the subgroup contains very general terms and is therefore not interesting for the user

Results: Novelty
• None of the discovered subgroups was considered novel
Definition of novelty: the GO terms represented in the subgroup are not known from the literature
Expert feedback:
• “The disease is well known - cancer has been studied very thoroughly”.
• “There always seem to be a PubMed mention of those terms relative to radiation and cancer disease”
• “It is a good output for a general overview of the dataset, and often biologists need to have this outcome to focus on particular biological processes”

Challenge Results: Generality
Small subgroups are likely to be more interesting and useful
• General subgroups are mostly not interesting
Expert feedback:
• “the terms were very general”

Challenge Results: Usability
Usability is the main criterion for evaluating the results
• Correlation between usability and generality

Results

Best rules

Conclusion
• Presentation of results is important to the user
• Very large descriptions are hard to understand
• Very general subgroups are unlikely to be useful
• Expert knowledge plays an important role
• Algorithm parameters should be optimized according to the expert feedback
• Generality is not always a property of the data alone; it also involves domain knowledge, so general attributes should be removed
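The last point, removing general attributes, could be approximated in a preprocessing step as sketched below. This is only one possible reading: it assumes that a GO term counts as "general" when it annotates more than a chosen fraction of all genes. The slides stress that generality also depends on domain knowledge, so the threshold `max_fraction` and the function name are hypothetical, and an expert-curated list of terms could replace the frequency rule.

```python
from collections import Counter


def drop_general_terms(gene_terms, max_fraction=0.5):
    """Remove GO terms that annotate more than max_fraction of the genes.

    gene_terms maps a gene name to its set of GO terms.
    Returns the filtered mapping and the set of removed terms.
    """
    n_genes = len(gene_terms)
    counts = Counter(term for terms in gene_terms.values() for term in terms)
    too_general = {term for term, c in counts.items() if c / n_genes > max_fraction}
    filtered = {gene: terms - too_general for gene, terms in gene_terms.items()}
    return filtered, too_general


# Hypothetical toy annotation: "term_root" is attached to every gene.
annotations = {
    "gene1": {"term_root", "term_x"},
    "gene2": {"term_root", "term_y"},
    "gene3": {"term_root"},
    "gene4": {"term_root", "term_y"},
}

filtered, removed = drop_general_terms(annotations, max_fraction=0.5)
```

With the toy data, only `term_root` exceeds the threshold. `term_y` annotates exactly half of the genes and is kept.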
