G53BIO – Bioinformatics Biological Data Mining

G53BIO – Bioinformatics Biological Data Mining • Jaume Bacardit (jaume.bacardit@nottingham.ac.uk) • Some slides provided by Prof. Natalio Krasnogor

Outline • Motivation • Data mining primer • Generating biological data • Mining biological data

Motivation • Recent developments in biotechnology have enabled high-throughput data generation from biological samples • We have lots and lots of data about all aspects of biology (although still mostly about humans) • How can we make sense of all this data? • Visualise the data so we can extract the big picture • Analyse the data to extract new knowledge about the biology → Data Mining

DATA MINING PRIMER

What is Data Mining? • “The extraction of knowledge from large amounts of data” (Han and Kamber, 2006) • “Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities” (Witten and Frank, 2005)

Dataset structure • In most cases we will treat a dataset as a table with rows (instances) and columns (attributes)

Supervised learning • Datasets often have a special attribute, the class or output • If they do, the task of the data mining process is to generate a model that can predict the class/output of a new instance from the values of the remaining attributes • To generate this model, we use a corpus of data for which we already know the answer, the training set

Process of supervised learning • [Figure: Training Set → Learning Algorithm → Models (Knowledge); New Instance → Inference Engine → Annotated Instance]

Types of supervised learning • If the special attribute is discrete • We call it class • The dataset is a classification problem • If the special attribute is continuous • We call it output • The dataset is a regression problem • Also called modelling or function approximation
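
The train-then-predict loop described on the supervised learning slides can be sketched with a 1-nearest-neighbour classifier. This is only an illustration: the attribute values, the class labels and the choice of nearest-neighbour as the learning algorithm are assumptions made for the example, not part of the lecture.

```python
import math

# Toy training set: each instance is (attribute values, class label).
# All values and labels are invented for illustration.
training_set = [
    ((1.0, 1.2), "healthy"),
    ((0.9, 1.0), "healthy"),
    ((3.1, 2.9), "disease"),
    ((3.0, 3.3), "disease"),
]

def predict(instance, training):
    """Return the class of the training instance closest to `instance`."""
    nearest = min(training, key=lambda row: math.dist(instance, row[0]))
    return nearest[1]

# A new, unlabelled instance is annotated with the class of its nearest neighbour.
print(predict((2.8, 3.0), training_set))  # prints "disease"
```

Replacing the discrete labels with numbers (and, for instance, averaging the outputs of several neighbours) would turn the same scheme into a regression method.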

A rule-based example Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)

Unsupervised learning • When we do not have, or do not take into account, the class/output attribute • If the goal is to identify aggregations of instances • Clustering problem • If the goal is to detect strong patterns in the data • Association rules/Itemset mining

Clustering Partitional clustering Hierarchical clustering http://www.mathworks.com/matlabcentral/fx_files/19344/1/k_means.jpg http://www.scsb.utmb.edu/faculty/luxon.htm
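
As an illustration of partitional clustering, here is a minimal k-means sketch. The points, the starting centroids and the fixed number of iterations are assumptions chosen for the example; real implementations usually pick starting centroids at random and stop when the assignment no longer changes.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate an assignment step and a centroid update step."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Invented 2D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(clusters)
```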

Association rules mining Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
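
Association rules are usually scored by their support and confidence. The sketch below computes both for one candidate rule; the transactions and item names are hypothetical.

```python
# Hypothetical transactions: each row is the set of items observed in one sample.
transactions = [
    {"geneA_high", "geneB_high", "disease"},
    {"geneA_high", "geneB_high", "disease"},
    {"geneA_high", "disease"},
    {"geneB_high"},
    {"geneA_high", "geneB_high"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    return sum(itemset <= row for row in rows) / len(rows)

def confidence(antecedent, consequent, rows):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)

# Candidate rule: geneA_high and geneB_high => disease
antecedent = {"geneA_high", "geneB_high"}
consequent = {"disease"}
print(support(antecedent, transactions))
print(confidence(antecedent, consequent, transactions))
```

Here the antecedent appears in 3 of 5 rows and the full rule in 2 of 5, so the rule holds in two thirds of the rows where it applies.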

Generation of data from biological samples

So what data can we generate? • Biological data can be generated at many different levels • Genomics (DNA) • Transcriptomics (RNA) • Proteomics (proteins) • Metabolomics (small compounds) • Lipidomics (lipids) • Hundreds of –omics have been catalogued

What does an –omics dataset look like? • In most cases datasets present a similar structure • Each sample is characterised by a large number of variables (RNA, proteins, lipids, etc.) • Each variable indicates (usually quantitatively) the presence of that element in the sample • Due to the high cost of most –omics technologies, variables >> samples • Problems of over-fitting

Data generation at various levels • Genomics • SNPs • DNA Methylation • Transcriptomics • Proteomics

Single Nucleotide Polymorphisms (SNPs) • One base-pair variation in DNA • In most cases in non-coding regions of DNA, but not always • When frequent enough in a population they can be linked to specific traits, e.g. a disease • SNP microarrays can be used to probe hundreds of thousands of SNPs in parallel • In reality few SNPs act on their own • Genome-Wide Association Studies identify groups of SNPs linked to a certain condition

Methylation • It is a chemical reaction that can block a certain region of a chromosome, preventing its transcription • The process can be reversed, so essentially it is an on/off switch of the affected gene • Specialised microarrays exist for the high-throughput detection of methylated genes • Afterwards, data analysis can take place

RNA expression • Not all genes are transcribed/translated into proteins all the time • The expression of genes is highly sophisticated and depends on many factors • Identifying the genes being expressed at a given point in time in a specific tissue provides crucial information about the roles and interactions of such genes • Compare the genes expressed between different groups of samples to identify those that are differentially expressed • Identify co-expressed genes, i.e. genes whose expression levels are correlated

Measuring RNA expression • RT-PCR (real-time reverse transcription polymerase chain reaction) • Accurately measures the expression of a pre-determined gene • RNA microarrays • Measure, in parallel, the expression of tens of thousands of genes, but with a considerable level of noise • RNA-Seq • The next-generation sequencing variant for measuring gene expression

Proteomics • Same question, measuring the amount of proteins in a sample • Protein gels • Easy to generate, but provide only very approximate information: they detect molecular weights, and they are coarse-grained

Proteomics • Western blots • Used when we know in advance which proteins to study • Requires designing an antibody that will only bind to the target protein • Result is still gel-like, but only (in theory) reporting data for the specific protein being studied

Proteomics • Mass-spectrometry-based proteomics • Truly quantitative data generation technique • Proteins in a sample are broken down into fragments (peptides) • Each peptide generally has a unique molecular weight signature • MS can detect with quite fine-grained resolution the amount of all peptides in a sample • A big problem, not yet fully solved, is how to go back from the detected peptides to the corresponding proteins

Normalisation and preprocessing • Whatever the -omics technology being used, a lot of preprocessing is always required • Each technology will have its own normalisation procedures • Make comparable samples generated in different labs • Missing data (remove or impute?) • Remove (if necessary) outliers • Lots of details in this book: Bioinformatics for –omics data. Methods and Protocols. Bernd Mayer (ed). Springer
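
Two of the generic steps listed above, imputing missing values and normalising a variable, can be sketched as follows. Mean imputation and z-score scaling are assumptions picked for simplicity; as the slide notes, each technology has its own normalisation procedures, and these two are not claimed to be the right choice for any particular platform.

```python
import statistics

def impute_mean(values):
    """Replace missing values (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def z_score(values):
    """Centre to mean 0 and scale to unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical measurements of one variable across four samples, one missing.
expression = [2.0, None, 4.0, 6.0]
filled = impute_mean(expression)  # [2.0, 4.0, 4.0, 6.0]
normalised = z_score(filled)
print(normalised)
```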

Mining biological data

What can we do with the dataset? • In most cases, samples are annotated with a qualitative label • Cancer/Non-cancer patients • Samples of seed tissue for which it is known if the seed germinated or not • Age of the sample • Therefore, we can treat these datasets as classification problems, and generate prediction models from the data • Not just as classification problems • Clustering/Biclustering • Association Rule Mining • Regression

But in most cases, domain experts are not (only) interested in predictions • Biomarker identification • Identify the key variables • Most strongly associated to each outcome • Using e.g. t-tests to identify those • Presenting higher prediction capacity • As identified by ML methods • Identify interactions between variables • By presenting very high (anti)correlation between them • By acting together to generate predictions
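
A t-test-based ranking of candidate biomarkers, as mentioned on the slide, can be sketched like this. The example uses Welch's t-statistic (an assumption; the slide does not say which t-test variant is meant) and invented expression values. With thousands of variables, the resulting p-values would also need a multiple-testing correction, which is not shown here.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent groups of measurements."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))

# Hypothetical data: variable -> (values in outcome 1, values in outcome 2).
data = {
    "gene1": ([5.0, 5.5, 6.0], [1.0, 1.5, 2.0]),
    "gene2": ([3.0, 4.0, 5.0], [3.5, 4.5, 4.0]),
}

# Rank variables by the absolute value of their t-statistic, strongest first.
ranking = sorted(data, key=lambda g: abs(welch_t(*data[g])), reverse=True)
print(ranking)  # prints ['gene1', 'gene2']
```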

Case Study: Functional Network Reconstruction for seed germination • Microarray data obtained from seed tissue of Arabidopsis thaliana • 122 samples represented by the expression level of almost 14000 genes • It had been experimentally determined whether each of the seeds had germinated or not • Can we learn to predict germination/dormancy from the microarray data? • Bassel et al., Plant Cell 23(9):3101-3116, 2011

Generating rule sets • BioHEL (Bacardit et al., 2009) was able to predict the outcome of the samples with 93.5% accuracy (10 x 10-fold cross-validation) • Learning from a scrambled dataset (labels randomly assigned to samples) produced ~50% accuracy • Example rule set:
If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination
If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination
If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination
If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination
Everything else → Predict dormancy
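
The rule set on this slide works as a decision list: rules are tried in order, and the default rule fires when none matches. The sketch below shows how such a list could be applied to a sample. It is not BioHEL's implementation; the dictionary encoding, the `predict` function and the choice to treat a gene missing from the sample as having expression 0 are all assumptions made for illustration. The gene identifiers and thresholds are the ones shown on the slide.

```python
# Each rule maps gene identifiers to the expression threshold they must exceed.
RULES = [
    {"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96},
    {"At4g34710": 349.67, "At4g37760": 150.75, "At1g30135": 17.66},
    {"At3g03050": 37.90, "At2g20630": 96.01, "At3g02885": 9.66},
    {"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76, "At1g48320": 56.80},
]

def predict(sample):
    """Apply the rules in order; fall back to the default rule (dormancy)."""
    for rule in RULES:
        # Assumption: a gene absent from the sample counts as expression 0.0.
        if all(sample.get(gene, 0.0) > threshold for gene, threshold in rule.items()):
            return "germination"
    return "dormancy"

# Hypothetical sample that satisfies the first rule.
sample = {"At1g27595": 120.0, "At3g49000": 70.0, "At2g40475": 60.0}
print(predict(sample))  # prints "germination"
```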

Identifying regulators • Rule building process is stochastic • Generates different rule sets each time the system is run • But if we run the system many times, we can see some patterns in the rule sets • Some genes appear considerably more frequently than the rest • Some associated to dormancy • Some associated to germination • We generated 10K rule sets for each outcome • Rules predicted one of the two outcomes • Default rule captured the other
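
Counting how often each gene appears across many rule sets can be sketched as below. The three tiny rule sets and the gene names are hypothetical stand-ins for the 10K rule sets mentioned on the slide, and counting every occurrence of a gene in a rule is just one possible way of scoring.

```python
from collections import Counter

# Hypothetical output of three runs: each run yields a rule set,
# and each rule is listed as the genes it tests.
rule_sets = [
    [["geneA", "geneB"], ["geneC"]],
    [["geneA", "geneD"]],
    [["geneA", "geneC"], ["geneB", "geneC"]],
]

# Count every appearance of a gene in any rule of any rule set.
counts = Counter(
    gene for rule_set in rule_sets for rule in rule_set for gene in rule
)

# Genes sorted from most to least frequent: candidates for further study.
ranked = counts.most_common()
print(ranked)
```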

Known regulators appear with high frequency in the rules

Experimental validation • We have experimentally verified this analysis • By ordering and planting knockouts for the highly ranked genes • We have been able to identify four new regulators of germination, with phenotypes different from the wild type

Combining data analysis with literature/public databases • Lots of information is publicly available that can validate/complement the data mining process • PubMed: Database of medical publications of the US National Institutes of Health • KEGG/BioCarta: Databases of known protein interactions • DAVID - http://david.abcc.ncifcrf.gov/ • Given a set of target proteins, it will mine the public databases to determine what they have in common based on e.g. gene ontology, etc.

Literature verification • BioHEL was applied to three cancer microarray datasets from the literature (E. Glaab et al., PLoS ONE (2012) 7(7):e39932) • We checked PubMed to see if the genes linked together in BioHEL’s rules appeared together in the literature • We used Pointwise Mutual Information (PMI) to check that the genes are not linked together in the literature merely by chance • Compared the PMI scores of the highly ranked pairs of genes with random pairs
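
Pointwise mutual information compares how often two genes are mentioned together with how often that would happen if their mentions were independent: PMI(a, b) = log( p(a, b) / (p(a) p(b)) ). The sketch below computes it from document counts. The counts are invented, and the base-2 logarithm is an assumption; the exact PMI variant used in the cited study is not specified on the slide.

```python
import math

def pmi(n_a, n_b, n_ab, n_total):
    """Pointwise mutual information of two genes from document counts.

    n_a, n_b : number of documents mentioning gene a / gene b
    n_ab     : number of documents mentioning both (must be > 0)
    n_total  : total number of documents
    """
    p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical counts: the pair co-occurs 4 times more often than chance.
print(pmi(n_a=100, n_b=50, n_ab=20, n_total=1000))
# Same genes, but co-occurring exactly as often as expected by chance.
print(pmi(n_a=100, n_b=50, n_ab=5, n_total=1000))
```

A score well above 0 suggests the pair is mentioned together more often than chance would explain; a score near 0 is what random pairs are expected to show.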

BioHEL’s scores were much better than random

Visualising the results of the data mining process • Data is huge, and sometimes the results of the data mining are huge as well • Can we visualise in some way the “big picture” of the results? • Heatmaps • Networks

Heatmaps • An –omics dataset is essentially a matrix. We could plot it, but given the number of samples and, especially, of variables it would be impossible to see anything. • However, if what we plot is not the raw value of each cell, but a colour representing that value, things become easier → Heatmap • And if we re-order the rows and columns in smart ways (e.g. using hierarchical clustering), then we can start to observe interesting patterns.

Networks • A network/graph consists of a series of nodes (e.g. genes, proteins) connected by edges. • In a biological sense, an edge suggests that two nodes interact with each other. • This interaction can mean many different things • They bind to each other • One is a promoter/repressor for the other • Sometimes their relationship is not direct, but through a third element that we cannot measure/observe • Good book about biological networks: Analysis of Biological Networks (Wiley Series in Bioinformatics) by Björn H. Junker and Falk Schreiber (14 Apr 2008)

Properties of networks • Some nodes are more important than others. Many different metrics can be used to characterise each node in a network • Node degree (number of neighbours) • Betweenness centrality (number of shortest paths that pass through the node) • The network can be split into several sub-networks (community detection) • High intra-connection, low inter-connection • Some patterns are recurrently spotted across the network → motifs
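
The two node metrics named above can be computed on a small graph as follows. The betweenness function is a sketch of Brandes' algorithm for unweighted, undirected graphs, returning unnormalised scores; the adjacency-list encoding and the toy network are assumptions made for the example.

```python
from collections import deque

def degree(graph):
    """Number of neighbours of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}

def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Toy network: B is a hub connecting A, C and D.
graph = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
print(degree(graph))
print(betweenness(graph))
```

In this toy network every shortest path between two of A, C and D passes through B, so B has both the highest degree and the highest betweenness.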

Network construction • Two methodologies for network construction • Co-expression: connected variables have similar patterns of values across samples • Co-prediction: connected variables were used together by the data mining process to perform predictions • Genome-wide network model capturing seed germination reveals coordinated regulation of plant cellular phase transitions. PNAS, 108(23):9709-9714, 2011 • Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011
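
A co-expression network can be sketched by connecting every pair of variables whose values are strongly correlated across samples. Pearson correlation and the 0.9 threshold are assumptions chosen for illustration, as are the gene names and expression values; the cited papers may use different measures and cut-offs.

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def co_expression_edges(expr, threshold=0.9):
    """Connect pairs of genes whose absolute correlation reaches the threshold."""
    return [
        (g1, g2)
        for g1, g2 in itertools.combinations(expr, 2)
        if abs(pearson(expr[g1], expr[g2])) >= threshold
    ]

# Hypothetical expression levels of three genes across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = co_expression_edges(expression)
print(edges)  # prints [('geneA', 'geneB')]
```

Using the absolute value keeps strongly anti-correlated pairs as edges too; dropping `abs` would restrict the network to positively co-expressed genes.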

Network visualisation • A network is an abstract construct (nodes/edges), with an optional size property associated to nodes and edges • To visualise it we need a layout algorithm that can assign 2D coordinates to each node and connect the edges • Force-directed layout is a typical algorithm for network visualisation. • It can be costly when the network grows • Lots of software packages exist for network visualisation (search for network layout/visualization)

Network visualisation and interactive exploration

Network refinement at UoN • TopoGSA • PathExpand • EnrichNet

TopoGSA: Network topological analysis of gene sets • What is TopoGSA? TopoGSA is a web application mapping gene sets onto a comprehensive human protein interaction network and analysing their network topological properties. • Two types of analysis: • 1. Compare genes within a gene set, e.g. up- vs. down-regulated genes • 2. Compare a gene set against a database of known gene sets (e.g. KEGG, BioCarta, GO) • E. Glaab et al., Bioinformatics, 26(9):1271-1272, May 2010

PathExpand: Expanding pathways and cellular processes • Enlarge pathways by adding genes that are “strongly connected” to the pathway nodes or that increase the pathway “compactness” • Uses the same PPI network, derived from MIPS, DIP, MINT, HPRD and IntAct • Only experimental evidence of binary PPIs • Final protein interaction network contained 9392 proteins (nodes) and 38857 interactions (edges) • Process mapping: • KEGG, BioCarta and Reactome were mapped • Only 60% of pathway members existed in the PPI network • E. Glaab et al., BMC Bioinformatics, 11(1):597, 2010

Network-based Functional Association Ranking • Describes network strategies to identify and prioritise functional associations (arising in high-throughput experiments) between a gene/protein set of interest (target set) and annotated gene/protein sets (reference sets). • From networks to pathways • E. Glaab et al., Bioinformatics (2012) 28 (18): i451-i457.

References
• G.W. Bassel, H. Lan, E. Glaab, D.J. Gibbs, T. Gerjets, N. Krasnogor, A.J. Bonner, M.J. Holdsworth, N.J. Provart. Genome-wide network model capturing seed germination reveals coordinated regulation of plant cellular phase transitions. Proceedings of the National Academy of Sciences, 108(23):9709-9714, June 2011
• G.W. Bassel, E. Glaab, J. Marquez, M.J. Holdsworth and J. Bacardit. Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011
• E. Glaab, J. Bacardit, J.M. Garibaldi and N. Krasnogor. Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data. PLoS ONE, 7(7):e39932, 2012. doi:10.1371/journal.pone.0039932
• H.P. Fainberg, K. Bodley, J. Bacardit, D. Li, F. Wessely, N.P. Mongan, M.E. Symonds, L. Clarke and A. Mostyn. Reduced neonatal mortality in Meishan piglets: a role for hepatic fatty acids? PLoS ONE, 7(11):e49101, 2012
• E. Glaab, A. Baudot, N. Krasnogor, R. Schneider and A. Valencia. EnrichNet: network-based gene set enrichment analysis. Bioinformatics, 28(18):i451-i457, 2012
• E. Glaab, A. Baudot, N. Krasnogor, A. Valencia. Extending pathways and processes using molecular interaction networks to analyse cancer genome data. BMC Bioinformatics, 11(1):597, 2010
• E. Glaab, A. Baudot, N. Krasnogor, A. Valencia. TopoGSA: network topological gene set analysis. Bioinformatics, 26(9):1271-1272, May 2010
• E. Glaab, J.M. Garibaldi, N. Krasnogor. ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization. BMC Bioinformatics, 10(1):358, 2009
• http://icos.cs.nott.ac.uk/resources.html
