
Overview

Paracel GeneMatcher2. The GeneMatcher system comprises hardware and software components that significantly accelerate a number of computationally intensive sequence similarity search algorithms.


Presentation Transcript


  1. Paracel GeneMatcher2 Overview

  2. GeneMatcher2
     • The GeneMatcher system comprises hardware and software components that significantly accelerate a number of computationally intensive sequence similarity search algorithms.
     • There are two hardware components:
        • GeneMatcher accelerator
        • Post-Processor (Blastmachine)
     • Two client interfaces:
        • Unix command line
        • Web-based GUI (BioView Workbench)

  3. GeneMatcher2 Architecture
     [Figure: architecture diagram. Labels: Web interface, Blast machine, Switch, GeneMatcher2, CPU 1, CPU 2, ... CPU 6912, Query #1 (agaggt..) ... Query #n]

  4. GeneMatcher2 System
     • Massively parallel bioinformatics supercomputer
     • Array of ASIC (Application Specific Integrated Circuit) chips combined with state-of-the-art Linux cluster technology
     • Accelerates dynamic programming search algorithms
     • 3,000 to 220,000 processors
     • Thousands of times faster than general purpose computers

  5. GeneMatcher2 Components
     • 3 processor units (6,142 processors per unit)
     • UltraSPARC computer
     • Up to 4 disk drives for database storage

  6. GeneMatcher2 Algorithms
     • HMM and HMM-Frame
        • Searches protein or DNA sequence data with domain models
        • HMM-Frame aligns protein models to DNA with frame shift and optional intron tolerance
     • Profile and Profile-Frame
        • Position-specific scoring with profile models
        • Frame shift tolerant protein profile searches against DNA sequence data
     • GeneWise
        • Aligns protein sequences or HMM against genomic data
        • Tolerates introns and frame shifts

  7. GeneMatcher2 Algorithms (cont.)
     • Smith-Waterman
        • Comparison of DNA-DNA, Protein-Protein, Protein-DNA or DNA-DNA through protein
        • Frame algorithms tolerate frame shifts, unlike BLAST counterparts
        • Optional intron tolerance for searches of genomic data
        • Highly sensitive search capacity finds hits BLAST potentially misses
     • NCBI Blast
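The Smith-Waterman search named on this slide is a dynamic programming algorithm for local alignment. The sketch below is a minimal software illustration of the scoring recurrence only, not the GeneMatcher2 implementation; the function name `smith_waterman_score` and the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not GeneMatcher2 defaults.
    Uses a linear gap penalty and keeps only two rows of the DP matrix.
    """
    prev = [0] * (len(b) + 1)  # row i-1 of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # column 0 is always 0 for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the matrix is computed, which is what makes the method more exhaustive and slower than heuristic searches, and why the slides describe hardware acceleration for it.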

  8. What about Blast?
     • Blast is an approximation of Smith-Waterman
        • So is FastA, but it's better and has protein fragment searches
     • The approximation may not yield correct results in some situations:
        • Data with many ambiguities or frameshifts, such as raw ESTs and unfinished genomic sequence
        • Distantly related sequences
        • When global alignments are desired
        • Protein alignment of sequences with introns (not penalized on GeneMatcher)

  9. Why GeneMatcher2
     • Comparison of sensitivity and selectivity of various sequence search methods
     • Sensitivity: What proportion of the real hits are reported? (More sensitive means more real hits)
     • Selectivity: What proportion of the reported hits are real? (More selective means fewer false positives)
     [Figure axis labels: Fewer false positives; More true positives]
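The two definitions on this slide can be written out as simple ratios. The following is a small sketch under the assumption that hits are represented as sets of identifiers; the function names and the example identifiers are hypothetical and do not come from the presentation.

```python
def sensitivity(reported, real):
    """Proportion of the real hits that are reported."""
    real = set(real)
    if not real:
        raise ValueError("no real hits to measure against")
    return len(set(reported) & real) / len(real)


def selectivity(reported, real):
    """Proportion of the reported hits that are real."""
    reported = set(reported)
    if not reported:
        raise ValueError("no reported hits to measure")
    return len(reported & set(real)) / len(reported)


if __name__ == "__main__":
    # Hypothetical example: 4 real hits, 3 reported, 2 of them real.
    real_hits = {"seqA", "seqB", "seqC", "seqD"}
    reported_hits = {"seqA", "seqB", "seqX"}
    print(sensitivity(reported_hits, real_hits))  # 2 of 4 real hits found
    print(selectivity(reported_hits, real_hits))  # 2 of 3 reported hits real
```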

  10. GeneMatcher2 Performance
     • Time-to-completion comparison of original methods and methods on GeneMatcher2
     • TBLASTX improvement is 20-fold
     • Other methods at least 100-fold improvement
     [Bar chart: runtime for an average query, in seconds (axis 0 to 1000), by method. Methods: NCBI TBLASTX, EBI GeneWise, Paracel TBLASTX, Decypher HMM, Paracel GeneWise, Decypher TBLASTX, WUSTL HMM cluster, GeneMatcher2 SW, FASTA Smith-Waterman. Bar value labels: 1000, 376, 270, 16, 13, 16, 4, 1, 0.1; the pairing of values with methods is not preserved.]
     Source: Genome Canada Bioinformatics Platform Project

  11. Running a search
     • Load a sequence (or set of sequences) as a query set if it will be used several times
     • Select the appropriate search depending on the query type and database type (only suitable candidates will be displayed on the search forms)
     • Check your form options!
     • Watch the search queue (can raise priority of small jobs if machine is busy)
     • Select a result format

  12. Databases
     • While you can load your own databases, disk space on the post-processor is not infinite! Ask us about maintaining public databases that are not currently available.
     • If you upload a private database, special files need to be created to use translated database searches such as rframe.
     • You can create private data sets to search against (e.g. Unigene-mouse and Unigene-rat in a data set called Unigene-rodent). These don't take up any space.

  13. Hidden Markov Models (GeneMatcher2 HMM Build)
     • Positive examples:
        • Seq 1: THE LAST FAT CAT
        • Seq 2: THE FAST CAT
        • Seq 3: THE VERY FAST CAT
        • Seq 4: THE FAT CAT
     • Multiple sequence alignment (Clustalw or T-coffee) of the positive examples, then Hidden Markov Model
     • Position specific: only nothing (gap), "LAST" or "VERY" in that position
     • Queries: THE VAST VERY FAST CAT; THE VAST FAST CAT
     • Match annotations: all matches; "AST" from LAST; "V" from VERY
     [Figure: aligned positive examples with gaps, the resulting position-specific model, and the two queries matched against it]
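The position-specific idea on this slide can be illustrated with a much simpler stand-in for an HMM: record which characters were seen in each column of the aligned positive examples, then check a query column by column. This is only a sketch. The gap placement in `ALIGNMENT` is an assumed reconstruction of the slide's figure, the function names are hypothetical, and a real HMM uses emission and transition probabilities (and can handle insertions) rather than plain sets.

```python
# Assumed reconstruction of the slide's alignment; '-' marks a gap.
ALIGNMENT = [
    "THE LAST FA-T CAT",
    "THE ---- FAST CAT",
    "THE VERY FAST CAT",
    "THE ---- FA-T CAT",
]


def build_column_sets(alignment):
    """Collect the set of characters observed in each alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [set(column) for column in zip(*alignment)]


def mismatched_columns(query, column_sets):
    """Return the column indices where the aligned query has an unseen character."""
    if len(query) != len(column_sets):
        raise ValueError("query must already be aligned to the model length")
    return [
        i for i, (char, allowed) in enumerate(zip(query, column_sets))
        if char not in allowed
    ]


if __name__ == "__main__":
    model = build_column_sets(ALIGNMENT)
    # "V" was seen in VERY and "AST" in LAST, so every column matches.
    print(mismatched_columns("THE VAST FAST CAT", model))
    # "B" was never seen in column 4, so that column is reported.
    print(mismatched_columns("THE BEST FAST CAT", model))
```

Because each column is scored independently, a query such as "THE VAST FAST CAT" is accepted even though "VAST" never appears in the positive examples, which is the behaviour the slide's annotations point out.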

  14. GeneWise
     • Predict introns and exons based on conserved protein domains (e.g. Pfam database)
     • Uses HMMs; reverse query/data set relationship holds
     • Unlike genscan or fgenes, you can believe these hits, though they may not be complete where exons don't contain conserved domains.
