Advances in Genome Analysis and Data Mining for Bioinformatics Applications

Genome Informatics 2005 • ~ 220 participants • 1 keynote speaker: David Haussler • 47 talks • 121 posters

Rodger Voelker: Two classes of splice junctions • Search for 5-7 base motifs in exonic and intronic flanking sequences of known splice junctions • Computational analysis of collocations between different motifs • Many collocations between exonic and intronic sequences • Known exonic splicing enhancers (ESEs) display collocations with intronic sequences, including intronic splicing enhancers (ISEs) • Nearly all introns (89%) can be classified into 2 classes
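The motif collocation idea can be illustrated with a minimal sketch. It is not the speaker's method: the function names, the toy flanking sequences and the choice of k = 5 are assumptions made for illustration, and a real analysis would also test the counts against a background expectation.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def collocation_counts(junctions, k=5):
    """Count junctions in which an exonic k-mer and an intronic k-mer co-occur.

    junctions: iterable of (exonic_flank, intronic_flank) sequence pairs.
    """
    pair_counts = Counter()
    for exon_flank, intron_flank in junctions:
        for pair in product(kmers(exon_flank, k), kmers(intron_flank, k)):
            pair_counts[pair] += 1
    return pair_counts


# Toy flanking sequences (hypothetical, not real splice junctions).
junctions = [
    ("GAAGAAG", "GTAAGTA"),
    ("GAAGAAC", "GTAAGTC"),
    ("CCTTCCT", "GTGAGTG"),
]
counts = collocation_counts(junctions, k=5)
top_count = counts.most_common(1)[0][1]
```

Each junction contributes at most one count per motif pair, so a high count means the two motifs appear together at many junctions rather than many times at one junction.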

Chip Lawrence: Futility of optima in inferences • The strong focus in bioinformatics on optimal solutions is fundamentally flawed, because the asymptotic underpinnings of these solutions, such as consistency, do not apply • The curse of dimensionality can render optimal solutions very unlikely and misleading • Example: minimum free energy predictions of RNA structures • Reason: the energy function is incomplete; only secondary structure is considered, not tertiary structure

Minimum free energy (MFE) predictions of RNA structures • Assumption: • molecule folds into lowest energy state • unique solution to folding problem (optimum) • Many programs (e.g. Zuker's Mfold) use the Boltzmann probability function • Most include calculations of suboptimal structures • but not all structures are computed • Positive predictive value (PPV) of MFE: 48%
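For reference, the Boltzmann probability of a structure is its weight exp(-E/RT) divided by the sum of the weights of all structures. The sketch below applies this to three toy free energies; the energies are invented for illustration and the "ensemble" is only these three structures, whereas a real ensemble is vastly larger.

```python
import math

# Gas constant in kcal/(mol*K) times temperature in K (37 C).
RT = 0.0019872 * 310.15


def boltzmann_probabilities(energies, rt=RT):
    """Return the Boltzmann probability of each structure given its free energy (kcal/mol)."""
    weights = [math.exp(-e / rt) for e in energies]
    partition = sum(weights)  # partition function over the listed structures only
    return [w / partition for w in weights]


# Hypothetical free energies; the first entry plays the role of the MFE structure.
energies = [-10.0, -9.5, -9.0]
probs = boltzmann_probabilities(energies)
```

Even in this tiny example the lowest-energy structure holds only part of the probability mass, which is the kind of situation the sampling approach is meant to address.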

Alternative prediction of RNA structures • Sample the ensemble of secondary structures in proportion to their Boltzmann weights • Cluster the structures • Use centroid structure in predictions • Improved PPV compared to MFE • Srna module of Sfold (http://sfold.wadsworth.org/ )
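A centroid of a set of sampled structures can be sketched as follows, assuming each structure is represented as a set of base pairs (i, j) and the distance between two structures is the number of base pairs they do not share. Under that assumption, the set of base pairs present in more than half of the sample minimizes the total distance to the sample. The sampling and clustering steps are omitted, the sample is invented, and this is not the Sfold implementation.

```python
from collections import Counter


def base_pair_distance(a, b):
    """Number of base pairs found in exactly one of the two structures."""
    return len(a ^ b)


def centroid(structures):
    """Return the base pairs that occur in more than half of the structures."""
    counts = Counter(bp for s in structures for bp in s)
    half = len(structures) / 2
    return frozenset(bp for bp, n in counts.items() if n > half)


def total_distance(candidate, structures):
    """Sum of base-pair distances from candidate to every structure."""
    return sum(base_pair_distance(candidate, s) for s in structures)


# Hypothetical sample of four structures, each a set of (i, j) base pairs.
sample = [
    frozenset({(1, 20), (2, 19), (3, 18)}),
    frozenset({(1, 20), (2, 19), (5, 15)}),
    frozenset({(1, 20), (2, 19), (3, 18), (6, 14)}),
    frozenset({(4, 12), (5, 11)}),
]
ensemble_centroid = centroid(sample)
```

Applying `centroid` to one cluster instead of the whole sample would give that cluster's centroid in the same way.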

A.tumefaciens 5S rRNA energy landscape

Alternative prediction of RNA structures • Improved PPV compared to MFE: • Ensemble centroid: +30% • Largest cluster centroid: +18% • Best centroid: +47%
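PPV here is the fraction of predicted base pairs that also occur in the reference structure. A minimal sketch, with invented base pairs and the assumption that a base pair counts as correct only on an exact (i, j) match:

```python
def ppv(predicted, reference):
    """Positive predictive value: correctly predicted base pairs / all predicted base pairs."""
    if not predicted:
        return 0.0
    return len(predicted & reference) / len(predicted)


# Hypothetical predicted and reference structures as sets of (i, j) base pairs.
predicted = {(1, 20), (2, 19), (3, 18), (7, 12)}
reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
score = ppv(predicted, reference)
```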

Data mining • Geneseer – searchable name-translation database (http://geneseer.cshl.org/ ) • Access to genomic information through gene names • Mapping sequences to gene names • Identification of homologs across several species for a given gene • Used in RNAi Codex (http://codex.cshl.edu )

Data mining • Ulysses – annotate human genes based on gene interactions in model organisms (http://www.cisreg.ca:8080/ulysses/ ) • Interologs: conserved protein-protein interactions • Regulogs: conserved protein-DNA interactions • Almost no overlap between data in interaction databases • BIND ∩ DIP: 984 refs; BIND ∩ 5 DBs: 3 refs

Data mining • Integrated Genome Browser (IGB) – visualize: • Genomic annotations from multiple data resources • Experimental data from Affymetrix arrays (http://www.affymetrix.com/support/developer/tools/download_igb.affx )

Gene expression and pathways • Skypainter tool in Reactome database: • allows overlay of gene expression data on pathway graphs • allows generation of a "movie" of a time series • (http://www.reactome.org/ )

Gene expression • ArrayBlast: • Compares gene expression signatures generated on different platforms • Uses public microarray data sets (GEO) • Used to create conserved cancer-related expression signature • (http://seq.mc.vanderbilt.edu/arrayBlast/ )

Gene expression • C. elegans Gene Expression Consortium: • SAGE data from specific stages, tissues and cell types • Database of gene expression data/pictures/movies of transgenic worms with promoter::GFP fusions for 2000 genes with human orthologs (http://elegans.bcgsc.ca/home/ge_consortium.html )

Michael Caudy: Whole genome analysis of combinatorial and architectural transcription codes • Search for TFBS in known neural pathway genes • Determine architecture: number, type, order, orientation and spacing of TFBS • Compare architecture of activated and repressed genes • Determine activity of promoters with TFBS mutations • Architecture is critical for differential response to Notch signalling
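The architecture features listed above (number, type, order, orientation and spacing of TFBS) can be captured in a small data structure. This is a sketch under assumptions: the `Site` class, the factor names `TF_A` and `TF_B` and the coordinates are hypothetical, and positions are taken as 0-based with an exclusive end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    factor: str   # name of the transcription factor (hypothetical here)
    start: int    # 0-based start position in the promoter
    end: int      # exclusive end position
    strand: str   # "+" or "-"


def architecture(sites):
    """Summarize number, order, orientation and spacing of binding sites."""
    ordered = sorted(sites, key=lambda s: s.start)
    return {
        "number": len(ordered),
        "order": [s.factor for s in ordered],
        "orientation": [s.strand for s in ordered],
        # Gap in bases between the end of one site and the start of the next.
        "spacing": [b.start - a.end for a, b in zip(ordered, ordered[1:])],
    }


# Hypothetical promoter with three sites, listed out of order on purpose.
promoter_sites = [
    Site("TF_B", 30, 36, "-"),
    Site("TF_A", 10, 18, "+"),
    Site("TF_A", 48, 56, "-"),
]
summary = architecture(promoter_sites)
```

Two promoters can then be compared by checking whether their summaries agree, for example between activated and repressed genes.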

Regulatory sequence identification • Evoprinter: • highlights multi-species conserved sequences within orthologous DNAs in the context of a single species of interest • (http://evoprinter.ninds.nih.gov/ )

Regulatory sequence identification • NestedMICA: • method for discovering many over-represented short motifs in large sets of strings in a single run • candidate transcription factor binding sites • (http://www.sanger.ac.uk/Software/analysis/nmica/ )
