Computational Biology

Computational Biology Dr. Jens Allmer Lecture Slides Week 3

Sequence Alignment • Exact • Simple • Approximate • More difficult • [Figures: pattern aligned to target]

Sequence Alignment • Exact pattern matching • Naive method aligns pattern with each location of the target • Boyer-Moore indexes the pattern to skip some alignments • Wu-Manber indexes many patterns and skips some alignments • Indexing • Suffix tree indexes target and then quickly finds each pattern • Many other methods
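
The naive method from the slide above can be sketched in a few lines of Python. This is only an illustration of "align the pattern with each location of the target"; the function name `naive_find_all` is made up for this example.

```python
def naive_find_all(pattern: str, target: str) -> list[int]:
    """Return every 0-based start index where pattern occurs in target."""
    hits = []
    if not pattern:
        return hits
    # Try every possible alignment of the pattern against the target.
    for start in range(len(target) - len(pattern) + 1):
        # Compare character by character and stop at the first mismatch.
        offset = 0
        while offset < len(pattern) and target[start + offset] == pattern[offset]:
            offset += 1
        if offset == len(pattern):
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(naive_find_all("ACA", "GACACATACA"))  # prints [1, 3, 7]
```

In the worst case every alignment is compared over the full pattern length, which is the cost that Boyer-Moore and index-based methods try to avoid.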

Sequence Alignment • Approximate pattern matching • Pairwise • Local • Smith-Waterman • BLAST • FASTA • Global • Needleman-Wunsch • Multiple • T-Coffee • ClustalW • ...
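
As a rough illustration of global pairwise alignment, the sketch below computes only the Needleman-Wunsch score (no traceback). The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example, not values from the lecture.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap       # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap   # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GCATGCG"))
```

A local (Smith-Waterman style) variant would additionally clamp every cell at zero and report the maximum cell instead of the last one.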

Basic Local Alignment Search Tool (BLAST) • Input • Pattern • Target • Search parameters and settings • Output • Alignments in various formats • XML • Help • http://www.ncbi.nlm.nih.gov/books/NBK1763/

BLAST • Target • Needs to be indexed • A plain FASTA file cannot be searched directly • Must match the pattern type and the BLAST variant • A protein target and a protein pattern can be searched using blastp • Target indexing • makeblastdb, in the BLAST package, can index FASTA files • Needs sequence input (e.g. FASTA, ASN.1) • Needs the sequence type to be provided, e.g. protein

BLAST • blastp • Needs indexed database • Needs query sequence (can be unindexed FASTA) • Produces alignments
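
One possible way to assemble a blastp call from Python is sketched below. The options `-query`, `-db` and `-out` are standard BLAST+ options; the file names and the helper `blastp_command` are hypothetical, and the actual call is left commented out because it requires blastp to be installed.

```python
import subprocess


def blastp_command(query_fasta: str, db_name: str, out_path: str) -> list[str]:
    """Build the argument list for a basic blastp search."""
    return [
        "blastp",
        "-query", query_fasta,  # unindexed FASTA file with the query sequence(s)
        "-db", db_name,         # name of the database indexed with makeblastdb
        "-out", out_path,       # file that receives the alignments
    ]


if __name__ == "__main__":
    cmd = blastp_command("query.fasta", "seqBl", "alignments.txt")
    print(" ".join(cmd))
    # Uncomment once blastp is on the PATH and the files exist:
    # subprocess.run(cmd, check=True)
```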

BLAST flavors • [Diagram: query (DNA or protein) versus database (DNA or protein)] • BlastN - nt versus nt database • BlastP - protein versus protein database • BlastX - translated nt (6 frames) versus protein database • tBlastN - protein versus translated nt database (6 frames) • tBlastX - translated nt versus translated nt database (both 6 frames)
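
The "6 frames" mentioned for the translated flavors are the three reading frames of the sequence plus the three of its reverse complement. The sketch below shows the idea using the standard genetic code; it is not how BLAST itself is implemented, and the frame labels (+1 to -3) are a convention chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict[str, str]:
    """Return the translations of all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCC").items():
        print(name, protein)
```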

BLAST Output • XML • -outfmt 5 • This switch leads to XML output
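
XML output is convenient because it can be read with any XML parser. The sketch below pulls query, hit description, e-value and bit score out of BLAST XML using the Python standard library. The embedded sample is a heavily abbreviated, hand-written stand-in for real `-outfmt 5` output (real files contain many more elements), and its values are invented.

```python
import xml.etree.ElementTree as ET

# Abbreviated, made-up sample in the style of BLAST XML output.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>hypothetical protein A</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>42.0</Hsp_bit-score>
              <Hsp_evalue>1e-05</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def list_hsps(xml_text: str) -> list[tuple[str, str, float, float]]:
    """Return (query, hit description, e-value, bit score) for every HSP."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            description = hit.findtext("Hit_def")
            for hsp in hit.iter("Hsp"):
                evalue = float(hsp.findtext("Hsp_evalue"))
                bit_score = float(hsp.findtext("Hsp_bit-score"))
                rows.append((query, description, evalue, bit_score))
    return rows


if __name__ == "__main__":
    for row in list_hsps(SAMPLE):
        print(row)
```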

End Theory I • 5 min mindmapping • 10 min break

Download BLAST • http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download • Get blastp and makeblastdb from mbg404 since you are not allowed to install anything • Download a FASTA file (protein, genome, collection of sequences in FASTA format) • The database must consist of amino acid sequences since we only have access to blastp today • Use makeblastdb from the BLAST package to index the file • Several files will be created when you do it right

MakeDB • Example • makeblastdb -in seq.fasta -dbtype prot -out seqBl -title seqBlastDB • More information? • Go to the doc folder of BLAST • Documentation is there • http://www.ncbi.nlm.nih.gov/books/NBK1763/
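
The same indexing step could be scripted, for instance when several FASTA files need to be indexed. The sketch below only assembles the command from the example above; the helper name `makeblastdb_command` is hypothetical and the call itself is commented out because it requires the BLAST package.

```python
import subprocess


def makeblastdb_command(fasta: str, dbtype: str, out: str, title: str) -> list[str]:
    """Build the makeblastdb argument list used in the slide example."""
    return [
        "makeblastdb",
        "-in", fasta,       # sequences to index
        "-dbtype", dbtype,  # "prot" for protein, "nucl" for nucleotide
        "-out", out,        # base name of the index files
        "-title", title,    # human-readable database title
    ]


if __name__ == "__main__":
    cmd = makeblastdb_command("seq.fasta", "prot", "seqBl", "seqBlastDB")
    print(" ".join(cmd))
    # Uncomment once makeblastdb is on the PATH and seq.fasta exists:
    # subprocess.run(cmd, check=True)
```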

BLAST • Now that we have an indexed database try to run BLAST • Read documentation and try to solve the simplest case • You will need the indexed database and you will need a FASTA file as query • You could create queries from the database and slightly change them • Good luck
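
Creating a query by slightly changing a database sequence can be done by hand or with a small script such as the sketch below. The example sequence, the header and the helper names are hypothetical; the fixed random seed only makes the result reproducible.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence: str, n_substitutions: int, seed: int = 0) -> str:
    """Replace n_substitutions distinct positions with a different amino acid.

    Raises ValueError if n_substitutions exceeds the sequence length.
    """
    rng = random.Random(seed)
    residues = list(sequence)
    for position in rng.sample(range(len(residues)), n_substitutions):
        choices = [aa for aa in AMINO_ACIDS if aa != residues[position]]
        residues[position] = rng.choice(choices)
    return "".join(residues)


def fasta_record(header: str, sequence: str, width: int = 60) -> str:
    """Format one FASTA record with wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical database entry
    query = mutate(original, n_substitutions=2)
    print(fasta_record("mutated_query", query), end="")
```

Saving the printed record to a file gives a query that should still hit its source entry, but no longer as a perfect match.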

End Practice I • 15 min break

Theory II

Mass Spectra Recording (e.g. Triple Play)

Fragmentation Spectrum

MS/MS spectra • MS/MS spectra can be assigned a peptide sequence (peptide-spectrum match, PSM) • Database search • De novo sequencing

PepNovo • Performs de novo sequencing of MS/MS spectra • Takes a single spectrum as input • Needs a mathematical model for its evaluation • Will display the results in the console • You will therefore need to redirect the output • Example • ?>PepNovo.exe -dta MSMSSpectrum.dta -model tryp_model.txt

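De Novo Sequencing • Sequence labels on the spectrum: LYDEELQAIAK (read in the opposite direction: KAIAQLEEDYL) • Mass differences between neighboring peaks identify residues • 1016.4 - 901.6 = 114.8 ~ D (115.02) • 901.6 - 772.4 = 129.2 ~ E (129.1)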
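
The core idea of the slide above, reading residues from the mass differences between neighboring peaks, can be sketched as follows. This is a toy illustration and not PepNovo's algorithm. The table holds standard monoisotopic residue masses; the 0.4 Da tolerance and the function names are assumptions made for the example.

```python
# Monoisotopic residue masses in Da (leucine and isoleucine are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_for_gap(delta: float, tolerance: float = 0.4) -> str:
    """Return the residue whose mass is closest to delta, or '?' if none fits."""
    name, mass = min(RESIDUE_MASS.items(), key=lambda item: abs(item[1] - delta))
    return name if abs(mass - delta) <= tolerance else "?"


def read_ladder(peaks: list[float], tolerance: float = 0.4) -> list[str]:
    """Interpret the gaps between consecutive peaks (sorted by m/z) as residues."""
    ordered = sorted(peaks)
    return [
        residue_for_gap(high - low, tolerance)
        for low, high in zip(ordered, ordered[1:])
    ]


if __name__ == "__main__":
    # Peak values taken from the slide.
    print(read_ladder([1016.4, 901.6, 772.4]))  # prints ['E', 'D']
```

Real spectra contain missing and spurious peaks, which is why de novo tools score many alternative paths instead of trusting a single ladder.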


Correlation [plot: axis ticks 0 to 6 and -0.10 to 0.10] • Database selection
>1080ZR IAAYPGVSPGLMIHYNIGR
>1137RZ AAYPGATQPGATELARRLGK
>1152RZ GSGDAAYPGGPFFNLFNLGK
>1152ZR GSGDAAYPGGPFFNLFNLGK
>2360RZ VDSGWGGVVVVALAPYNLGR
>240RZ HPGVVCRPGRGGGCSRHIGK
HPGVVCCSRHRRSHTIGK
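
To give a feeling for how a database search ranks candidate peptides, the sketch below counts how many theoretical b- and y-ion masses of each candidate are found in the observed peak list. This shared-peak count is a deliberately simple stand-in for the correlation score and is not the scoring function of X!Tandem or of any other search engine. The 0.4 Da tolerance and all function names are assumptions; the candidates are taken from the slide.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def fragment_mz(peptide: str) -> list[float]:
    """Singly charged b- and y-ion m/z values for every backbone cleavage."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    for cut in range(1, len(peptide)):
        ions.append(sum(masses[:cut]) + PROTON)          # b ion (N-terminal part)
        ions.append(sum(masses[cut:]) + WATER + PROTON)  # y ion (C-terminal part)
    return ions


def shared_peaks(observed: list[float], peptide: str, tolerance: float = 0.4) -> int:
    """Count theoretical fragments that have an observed peak within tolerance."""
    return sum(
        any(abs(mz - peak) <= tolerance for peak in observed)
        for mz in fragment_mz(peptide)
    )


def rank(observed: list[float], candidates: list[str], tolerance: float = 0.4) -> list[str]:
    """Sort candidates from best to worst shared-peak count."""
    return sorted(
        candidates,
        key=lambda peptide: shared_peaks(observed, peptide, tolerance),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = ["IAAYPGVSPGLMIHYNIGR", "AAYPGATQPGATELARRLGK", "GSGDAAYPGGPFFNLFNLGK"]
    # Simulated spectrum: the ideal fragments of the second candidate.
    observed = fragment_mz(candidates[1])
    print(rank(observed, candidates)[0])
```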

Initialization Files • X!Tandem • Taxonomy.xml • Default_Input.xml • Input.xml • Running X!Tandem • ?>tandem.exe input.xml • That was easy • But behold, what about the input?

Taxonomy XML
<?xml version="1.0" ?>
<bioml label="x! taxon-to-file matching list">
  <taxon label="chlamy">
    <file format="peptide" URL="test_chlre2.fasta.pro" />
  </taxon>
</bioml>

Input.xml
<?xml version="1.0" ?>
<bioml>
  <note>Each one of the parameters for x! tandem is entered as a labeled note node. Any of the entries in the default_input.xml file can be over-ridden by adding a corresponding entry to this file. This file represents a minimum input file, with only entries for the default settings, the output file and the input spectra file name. See the taxonomy.xml file for a description of how FASTA sequence list files are linked to a taxon name.</note>
  <note type="input" label="list path, default parameters">default_input.xml</note>
  <note type="input" label="list path, taxonomy information">taxonomy.xml</note>
  <note type="input" label="protein, taxon">chlamy</note>
  <note type="input" label="spectrum, path">test_spectra.mgf</note>
  <note type="input" label="output, path">output.xml</note>
</bioml>
Another input file
Personally, I don’t approve of the XML used here
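
A minimal input file like the one above could also be generated by a script, which helps when many spectrum files have to be searched. The sketch below uses the labels and file names shown on the slide; the helper `build_input_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_input_xml(defaults: str, taxonomy: str, taxon: str,
                    spectra: str, output: str) -> str:
    """Return the <bioml> element of a minimal X!Tandem input file as text."""
    root = ET.Element("bioml")
    entries = [
        ("list path, default parameters", defaults),
        ("list path, taxonomy information", taxonomy),
        ("protein, taxon", taxon),
        ("spectrum, path", spectra),
        ("output, path", output),
    ]
    for label, value in entries:
        # Every parameter is a <note type="input" label="..."> element.
        note = ET.SubElement(root, "note", {"type": "input", "label": label})
        note.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    body = build_input_xml("default_input.xml", "taxonomy.xml", "chlamy",
                           "test_spectra.mgf", "output.xml")
    print('<?xml version="1.0"?>')
    print(body)
```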

Default-input XML
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="tandem-input-style.xsl"?>
<bioml>
  <note>list path parameters</note>
  <note type="input" label="list path, default parameters">default_input.xml</note>
  <note>This value is ignored when it is present in the default parameter list path.</note>
  <note type="input" label="list path, taxonomy information">taxonomy.xml</note>
  <note>spectrum parameters</note>
  <note type="input" label="spectrum, fragment monoisotopic mass error">0.4</note>
  <note type="input" label="spectrum, parent monoisotopic mass error plus">100</note>
  <note type="input" label="spectrum, parent monoisotopic mass error minus">100</note>
  <note type="input" label="spectrum, parent monoisotopic mass isotope error">yes</note>
  <note type="input" label="spectrum, fragment monoisotopic mass error units">Daltons</note>
  <note>The value for this parameter may be 'Daltons' or 'ppm': all other values are ignored</note>
  <note type="input" label="spectrum, parent monoisotopic mass error units">ppm</note>
  <note>The value for this parameter may be 'Daltons' or 'ppm': all other values are ignored</note>
  <note type="input" label="spectrum, fragment mass type">monoisotopic</note>
  <note>values are monoisotopic|average </note>
  <note>spectrum conditioning parameters</note>
  <note type="input" label="spectrum, dynamic range">100.0</note>
  <note>The peaks read in are normalized so that the most intense peak is set to the dynamic range value. All peaks with values of less that 1, using this normalization, are not used. This normalization has the overall effect of setting a threshold value for peak intensities.</note>
  <note type="input" label="spectrum, total peaks">50</note>
  <note>If this value is 0, it is ignored. If it is greater than zero (lets say 50), then the number of peaks in the spectrum with be limited to the 50 most intense peaks in the spectrum. X! tandem does not do any peak finding: it only limits the peaks used by this parameter, and the dynamic range parameter.</note>
  <note type="input" label="spectrum, maximum parent charge">4</note>
  <note type="input" label="spectrum, use noise suppression">yes</note>
  <note type="input" label="spectrum, minimum parent m+h">500.0</note>
  <note type="input" label="spectrum, minimum fragment mz">150.0</note>
  <note type="input" label="spectrum, minimum peaks">15</note>
  <note type="input" label="spectrum, threads">1</note>
  <note type="input" label="spectrum, sequence batch size">1000</note>
  <note>residue modification parameters</note>
  ........
</bioml>
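
The note in Input.xml says that entries in default_input.xml can be overridden by a corresponding entry in input.xml. The sketch below mimics that rule for illustration: it reads the labeled `note` elements of both files into dictionaries and lets the input values win. It is not X!Tandem's own parser, and the two embedded snippets are shortened examples in which the override value (100 peaks) is invented.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for default_input.xml and input.xml.
DEFAULTS = """<bioml>
  <note>spectrum parameters</note>
  <note type="input" label="spectrum, fragment monoisotopic mass error">0.4</note>
  <note type="input" label="spectrum, total peaks">50</note>
</bioml>"""

OVERRIDES = """<bioml>
  <note type="input" label="spectrum, total peaks">100</note>
  <note type="input" label="output, path">output.xml</note>
</bioml>"""


def read_parameters(xml_text: str) -> dict[str, str]:
    """Collect label -> value for every <note type="input"> element."""
    root = ET.fromstring(xml_text)
    return {
        note.get("label"): (note.text or "").strip()
        for note in root.iter("note")
        if note.get("type") == "input"  # plain comment notes carry no type
    }


def effective_parameters(default_xml: str, input_xml: str) -> dict[str, str]:
    """Start from the defaults and let entries from the input file override them."""
    parameters = read_parameters(default_xml)
    parameters.update(read_parameters(input_xml))
    return parameters


if __name__ == "__main__":
    for label, value in effective_parameters(DEFAULTS, OVERRIDES).items():
        print(f"{label} = {value}")
```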

Beautifying XML • XML • Only describes data • Formatting of XML • Additional files can be linked to beautify the display • Transformation (XSLT) • Translates XML into HTML • XML Styling (CSS) • Assigns formatting to the elements and attributes used in the XML file • Both files need to be linked at the beginning of the XML file

XML • What is an element? • What is an attribute? • Design a Person • What are attributes of a person? • Use elements for logical grouping • Use attributes for specific information
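
One possible answer to the "Design a Person" exercise is sketched below: elements group related information and attributes hold the specific values. The element names, attribute names and example values are all made up; other designs are equally valid.

```python
import xml.etree.ElementTree as ET

# Elements group information; attributes carry the specific values.
person = ET.Element("person", {"id": "p001"})
ET.SubElement(person, "name", {"first": "Ada", "last": "Lovelace"})
contact = ET.SubElement(person, "contact")
ET.SubElement(contact, "email", {"address": "ada@example.org"})
ET.SubElement(contact, "phone", {"number": "+00 000 0000"})

xml_text = ET.tostring(person, encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```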

Styling • Connect the example style • Nothing will be styled ;) • Examine the CSS file and rename the styles such that your person XML will be somewhat styled

End Theory II • 5 min mindmapping • 10 min break

Practice II

View Spectra and Sequence • To view matching peaks of the PepNovo prediction and the spectrum at the same time • Use the DtaViewer from http://www.biolnk.com

Download • Download PepNovo • http://www-cse.ucsd.edu/groups/bioinformatics/software.html#pepnovo • http://bioinformatics.allmer.de/tools • Download test file • http://bioinformatics.allmer.de/tools

Try PepNovo • Try to run PepNovo • Use the given input • Use the help information • Use the lecture slides • Use the lecture notes • Aim • Store the result in a text file

PepNovo • Results are displayed in the console • We need to redirect the output into a file • ?>PepNovo.exe -dta MSMSSpectrum.dta -model tryp_model.txt > result.txt
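
The same redirection can be done from a script, which is handy when many spectra have to be processed one after another. In the sketch below the helper `run_and_save` is hypothetical, and the demo uses the Python interpreter as a stand-in command so that it runs without PepNovo installed.

```python
import subprocess
import sys


def run_and_save(command: list[str], out_path: str) -> int:
    """Run a console program and write its standard output to out_path."""
    with open(out_path, "w") as out_file:
        completed = subprocess.run(command, stdout=out_file, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command that prints one line to the console.
    demo = [sys.executable, "-c", "print('console output')"]
    run_and_save(demo, "result.txt")
    # With PepNovo available, the command from the slide would be:
    # ["PepNovo.exe", "-dta", "MSMSSpectrum.dta", "-model", "tryp_model.txt"]
```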

X!Tandem • Unzip the folder and check for • MGF-formatted spectra (file) • Database file (FASTA) • tandem-win32-10-12-01-1 folder • Used .xml configuration files (default_input.xml, input.xml and taxonomy.xml) • To reproduce the output provided in the zip folder: • Replace the configuration files in the «tandem-win\bin» folder with the ones in the «used» folder • Also copy the database file to the «fasta» folder and the .mgf file to «bin» in «tandem-win»

X!Tandem Console Application

X!Tandem Default Input • Parameters such as mass tolerances, enzyme type, and the maximum charge considered in the search can be changed in default_input.xml

X!Tandem Input.xml • In the input.xml file, you should specify the paths of: • taxonomy.xml • default_input.xml • Spectra file name • Output file name • NOTE: Here input.xml and all files above are in the same folder (directory)

X!Tandem Taxonomy • In the taxonomy file, you should specify the «database file path» • In this example, the database file is in the «fasta» folder inside the «Xtandem\tandem-win32-10-12-01-1» folder

X!Tandem Output
