De novo sequencing

De novo sequencing: practical session
Salvador Martínez de Bartolomé, Bioinformatics Support, ProteoRed
smartinez@proteored.org

Why de novo sequencing is difficult
• Leucine and isoleucine have the same mass
• Glutamine and lysine differ in mass by 0.036 Da
• Phenylalanine and oxidized methionine differ in mass by 0.033 Da
• Cleavage does not occur at every peptide bond (or cannot be observed in the MS/MS spectrum)
• Poor-quality spectra (some fragment ions are below the noise level)
• The C-terminal side of proline is often resistant to cleavage
• Absence of mobile protons
• Peptides with free N-termini often lack fragmentation between the first and second amino acids

Why de novo sequencing is difficult (II)
• Certain amino acids have the same, or nearly the same, mass as pairs of other amino acids (monoisotopic residue masses in Da):
  • Gly + Gly (114.0429) vs. Asn (114.0429)
  • Ala + Gly (128.0586) vs. Gln (128.0586)
  • Ala + Gly (128.0586) vs. Lys (128.0950)
  • Gly + Val (156.0899) vs. Arg (156.1011)
  • Ala + Asp (186.0641) vs. Trp (186.0793)
  • Ser + Val (186.1004) vs. Trp (186.0793)
• The directionality of an ion series is not always known (are they b- or y-ions?)
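The coincidences listed above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the residue masses are standard monoisotopic values rounded to five decimals, and the function and variable names are chosen here, not taken from any of the tools discussed.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "R": 156.10111, "W": 186.07931,
}

# (pair of residues, single residue) combinations from the slide.
COINCIDENCES = [("GG", "N"), ("AG", "Q"), ("AG", "K"),
                ("GV", "R"), ("AD", "W"), ("SV", "W")]


def pair_vs_single(pair, single):
    """Return (pair mass, single-residue mass, absolute difference) in Da."""
    pair_mass = sum(RESIDUE_MASS[residue] for residue in pair)
    single_mass = RESIDUE_MASS[single]
    return pair_mass, single_mass, abs(pair_mass - single_mass)


if __name__ == "__main__":
    for pair, single in COINCIDENCES:
        pair_mass, single_mass, diff = pair_vs_single(pair, single)
        print(f"{pair[0]}+{pair[1]} {pair_mass:.4f}  vs  "
              f"{single} {single_mass:.4f}  (diff {diff:.4f} Da)")
```

Any pair whose difference is smaller than the fragment mass tolerance of the instrument cannot be told apart from the single residue by mass alone.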

De novo sequencing
• Two bioinformatic approaches
• Global approach: all possible peptides Pi with mass Mi are computed. A theoretical fragmentation spectrum, T(Pi), is then generated for each candidate amino acid sequence, and each T(Pi) is compared with the experimental spectrum S. The solution is the sequence of the peptide Pi whose theoretical spectrum T(Pi) is most similar to S.
• Local approach: here the peaks of the spectrum are used to reduce the number of candidates generated. Sequencing starts at the N-terminus and checks for a peak whose mass difference from the terminus matches the mass of some amino acid. The process then continues along the spectrum, checking for further peaks in the same way.
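The local approach can be sketched as a greedy walk along a b-ion ladder. This is a minimal illustration under strong assumptions, not the algorithm of any particular tool: all peaks are treated as singly charged b-ions, the tolerance is arbitrary, "L" stands for both Leu and Ile, and the helper names (`b_ions`, `match_residue`, `local_walk`) are invented for this example.

```python
# Monoisotopic residue masses in Da. "L" stands for both Leu and Ile (same mass).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # mass of a proton in Da


def b_ions(sequence):
    """Singly charged b-ion m/z values of a sequence (used to build toy data)."""
    total, ions = PROTON, []
    for residue in sequence:
        total += RESIDUE_MASS[residue]
        ions.append(total)
    return ions


def match_residue(delta, tolerance):
    """Return the first residue whose mass matches delta, or None."""
    for residue, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tolerance:
            return residue
    return None


def local_walk(peaks, tolerance=0.02):
    """Extend the sequence from the N-terminus one matching peak at a time."""
    current = PROTON  # m/z of the "empty" b-ion at the N-terminus
    sequence = []
    extended = True
    while extended:
        extended = False
        for mz in sorted(peaks):
            residue = match_residue(mz - current, tolerance)
            if residue is not None:
                sequence.append(residue)
                current = mz
                extended = True
                break
    return "".join(sequence)


if __name__ == "__main__":
    peaks = b_ions("PEPTLDE")
    print(local_walk(peaks))                   # complete ladder
    print(local_walk(peaks[:2] + peaks[3:]))   # one peak missing: walk stops early
```

The second call shows the weakness of a purely local search: a single missing peak leaves a gap that no single residue mass can bridge, so the walk stops.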

De novo sequencing
• Improved approach: local or global?
Global algorithms have two main problems:
• The number of candidates grows exponentially, and comparing the theoretical spectra generated from the candidates with the experimental spectrum takes considerable computing time. Both problems depend on the mass to be determined, although a solution, if one exists, is always found.
• Solution: the region of the spectrum resolved with the global approach is kept to a minimum so that these drawbacks do not become a problem.
• Local algorithms, which are faster and more directed, do not produce good solutions in poorly populated regions of the spectrum (few peaks).
• Solution: in this system, a local search is therefore run only in regions with a considerable abundance of peaks.
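Restricting the global approach to a small region can be pictured as exhaustively enumerating every residue combination that fills one short mass gap. The sketch below is an assumption-laden illustration, not the system described in the slides: the function name `fill_gap`, the tolerance, and the length limit are all chosen for this example.

```python
# Monoisotopic residue masses in Da. "L" stands for both Leu and Ile (same mass).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def fill_gap(gap, tolerance=0.02, max_len=3):
    """Enumerate all residue strings (up to max_len) whose mass matches gap."""
    results = []

    def extend(prefix, remaining):
        if prefix and abs(remaining) <= tolerance:
            results.append("".join(prefix))
            return
        if len(prefix) == max_len:
            return
        for residue, mass in RESIDUE_MASS.items():
            # Prune branches that would overshoot the gap.
            if mass <= remaining + tolerance:
                extend(prefix + [residue], remaining - mass)

    extend([], gap)
    return sorted(results)


if __name__ == "__main__":
    print(fill_gap(198.10044))  # a gap left by one missing peak
    print(fill_gap(186.07931))  # the mass of Trp
    # Without pruning, the search space grows as 19**n for n residues.
    print([len(RESIDUE_MASS) ** n for n in range(1, 6)])
```

Keeping the gap small keeps `max_len` small, which is what holds the exponential growth in check.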

Summary of de novo sequencing tools
Software                                Source website
PEAKS*                                  www.bioinformaticssolutions.com
SeqMS (download)                        www.protein.osaka-u.ac.jp/rcsfp/profiling/SeqMS.html
Sherenga (included in SpectrumMill)*    N/A
Lutefisk (download)                     www.hairyfatguy.com/Lutefisk
DeNovoX*                                www.thermo.com
PepNovo                                 peptide.ucsd.edu/pepnovo.py
SpectrumMill*                           www.home.agilent.com
* Commercial software

De novo practical session

Lutefisk

Lutefisk is software for the de novo interpretation of peptide CID spectra
• http://www.hairyfatguy.com/lutefisk/
• To run Lutefisk, you need four files in the same directory or folder:
  • CID data file (data files can be specified with a full or partial pathname)
  • Lutefisk.details
  • Lutefisk.params
  • Lutefisk.residues
• One additional file is optional:
  • Database.sequence
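Before launching a run, it can help to confirm that the three fixed-name files are in place. The following is a small convenience sketch, not part of Lutefisk; the function name is hypothetical, and the CID data file is not checked because its name varies.

```python
from pathlib import Path

# File names as listed on the slide.
REQUIRED = ("Lutefisk.details", "Lutefisk.params", "Lutefisk.residues")
OPTIONAL = ("Database.sequence",)


def missing_lutefisk_files(folder):
    """Return the required Lutefisk files that are absent from folder."""
    folder = Path(folder)
    return [name for name in REQUIRED if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_lutefisk_files(".")
    if missing:
        print("Missing required files:", ", ".join(missing))
    else:
        print("All required Lutefisk files are present.")
```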

‘Lutefisk.details’ file
• The Lutefisk.details file contains the so-called "ion probabilities" for each type of ion.
• Each column in the file contains the "ion probabilities" for a different fragmentation pattern.
• Only two fragmentation patterns are currently coded, both for low-energy CID of tryptic peptides: one for triple quadrupole (or Qtof) instruments and one for ion traps. Their ion probabilities are listed in the second and third columns. The first column is not used.

‘Lutefisk.residues’ file
• The Lutefisk.residues file contains the single-letter code, monoisotopic mass, average mass, and nominal mass of each amino acid.
• To add a residue to the list, replace the 0's in one of the rows with the corresponding monoisotopic, average, and nominal masses.
• Up to five additional non-standard residues can be entered this way; they are given the single-letter codes J, O, U, X, or Z.

‘Database.sequence’ file
• The Database.sequence file is a text file containing a sequence, or a list of sequences, that might have been derived from a sequence database search.
• In the final steps, where it scores the candidate sequences, Lutefisk adds these database-derived sequences to the de novo candidates to determine whether they are as good as or better than the de novo sequences. If so, this is evidence that the database-derived sequences might actually be correct.

‘Lutefisk.params’ file
• Example CID data file: 241103plata_bernabe.369.369.2.dta

Lutefisk help
> lutefisk.exe -h

Run Lutefisk.exe
• Once all files are configured correctly, open a command prompt in the “C:\Documents and Settings\Bioworks32\Desktop\denovo\LUTEFISK” folder and type:
> Lutefisk.exe

Output from Lutefisk: the lut file
• The candidate sequences are ranked according to Pr(C), the estimated probability of being correct.
• The output also gives four scores:
  • Pevzscr is an adaptation of the ideas presented by Dancik et al. (J. of Comput. Biol. (1999) Vol 6, 327); it penalizes the absence of expected ions and accounts for the possibility of random matches.
  • Quality is the percentage of the peptide mass that can be accounted for by a contiguous ion series.
  • Intscr is the percentage of the fragment ion intensity that can be accounted for as b, y, internal fragment, etc., ions.
  • X-corr is the cross-correlation score normalized by its auto-correlation score.
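To make the idea behind an intensity score such as Intscr concrete, the sketch below computes the percentage of total peak intensity that lies within a tolerance of a set of expected fragment m/z values. It is only an illustration of the concept: it is not Lutefisk's actual calculation, and the function name, tolerance, and toy numbers are assumptions made here.

```python
def explained_intensity_percent(peaks, theoretical_mz, tolerance=0.5):
    """Percentage of total intensity in peaks that match an expected m/z.

    peaks          -- list of (m/z, intensity) tuples from the observed spectrum
    theoretical_mz -- expected fragment m/z values for one candidate sequence
    tolerance      -- maximum allowed m/z deviation for a match
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    explained = sum(
        intensity
        for mz, intensity in peaks
        if any(abs(mz - expected) <= tolerance for expected in theoretical_mz)
    )
    return 100.0 * explained / total


if __name__ == "__main__":
    # Toy values, made up for illustration only.
    observed = [(100.0, 50.0), (200.0, 30.0), (300.0, 20.0)]
    expected = [100.01, 300.0]
    print(explained_intensity_percent(observed, expected, tolerance=0.02))
```

A candidate that leaves most of the intensity unexplained is unlikely to be the right sequence, however well its few matched peaks line up.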

PepNovo

PepNovo
• The scoring method uses a probabilistic network whose structure reflects the chemical and physical rules that govern peptide fragmentation
• Specific to ion trap data

PepNovo
• PepNovo was developed at the University of California, San Diego.
• PepNovo uses a probabilistic network to model the peptide fragmentation events in a mass spectrometer.
• It is available online at http://bix.ucsd.edu/MassSpec/ and also as an in-house installation.

PepNovo
• PepNovo runs via command-line arguments:
  • -file <full path to input file> to specify a single input file (mgf, dta, mzxml), or
  • -list <full path to txt file> to give a list of input files (the preferred method for large numbers of files, since the models are not reread for each input file).
  • -model <model name> (currently only CID_IT_TRYP is available)

PepNovo
• Optional PepNovo arguments:
  • -prm : only prints spectrum graph nodes with scores.
  • -prm_norm : prints spectrum graph scores after normalization and removal of negative scores.
  • -correct_pm : finds optimal precursor mass and charge values.
  • -use_spectrum_charge : does not correct the charge.
  • -use_spectrum_mz : does not correct the precursor m/z value that appears in the file.
  • -no_quality_filter : does not remove low-quality spectra.
  • -fragment_tolerance < 0-0.75 > : the fragment tolerance (each model has a default setting).
  • -pm_tolerance < 0-5.0 > : the precursor mass tolerance (each model has a default setting).
  • -PTMs <PTM string> : PTMs separated by colons (no spaces), e.g., M+16:S+80:N+1
  • -digest <NON_SPECIFIC,TRYPSIN> : default TRYPSIN
  • -num_solutions < number > : default 20
  • -tag_length < 3-6 > : returns peptide sequences of the specified length (only lengths 3-6 are allowed).
  • -model_dir < path > : directory where model files are kept (default ./Models)

PepNovo
• Using PepNovo:
> PepNovo.exe -list paths_of_lots_of_spectra.txt -model CID_IT_TRYP -PTMs C+57:M+16 -digest TRYPSIN
This command runs PepNovo on all the spectra files listed in “paths_of_lots_of_spectra.txt”. It assumes that the peptides were digested with trypsin, that the cysteines are carbamidomethylated (C+57), and that the methionines can be oxidized (M+16). The output is the default of 20 sequences.
> PepNovo.exe -file my_great_spectra.mgf -model CID_IT_TRYP -PTMs C+57:M+16 -digest NON_SPECIFIC -tag_length 3 -num_solutions 50
This command runs PepNovo on a single mgf file and generates 50 tags of length 3 for each spectrum (it assumes that the digest was not tryptic).

PepNovo
• Using PepNovo (in the “C:\Documents and Settings\Bioworks32\Desktop\denovo\pepNovo” folder):
> PepNovo.exe -file 241103plata_bernabe.369.369.2.dta -model CID_IT_TRYP -digest TRYPSIN
> PepNovo.exe -file FQSEEQQQTEDELQDK.dta -model CID_IT_TRYP -digest TRYPSIN
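When many runs are needed, the command lines above can be assembled programmatically. The sketch below builds the argument list using only the flags described in these slides; the function names are hypothetical, and `run_pepnovo` assumes that PepNovo.exe is reachable from the working directory and writes its results to standard output.

```python
import subprocess

ALLOWED_DIGESTS = ("TRYPSIN", "NON_SPECIFIC")


def pepnovo_command(spectra, model="CID_IT_TRYP", is_list=False, ptms=(),
                    digest="TRYPSIN", num_solutions=None, tag_length=None,
                    executable="PepNovo.exe"):
    """Build a PepNovo argument list from the options described on the slides."""
    if digest not in ALLOWED_DIGESTS:
        raise ValueError(f"digest must be one of {ALLOWED_DIGESTS}")
    if tag_length is not None and not 3 <= tag_length <= 6:
        raise ValueError("tag_length must be between 3 and 6")
    command = [executable, "-list" if is_list else "-file", spectra,
               "-model", model]
    if ptms:
        # PTMs are joined with colons and no spaces, e.g. C+57:M+16
        command += ["-PTMs", ":".join(ptms)]
    command += ["-digest", digest]
    if tag_length is not None:
        command += ["-tag_length", str(tag_length)]
    if num_solutions is not None:
        command += ["-num_solutions", str(num_solutions)]
    return command


def run_pepnovo(command):
    """Run PepNovo and return its standard output (assumes it is installed)."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    command = pepnovo_command("my_great_spectra.mgf", ptms=("C+57", "M+16"),
                              digest="NON_SPECIFIC", tag_length=3,
                              num_solutions=50)
    print(" ".join(command))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as the practice folder used here.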

PepNovo
• PepNovo output:
• The output gives the following tab-delimited fields for each MS/MS spectrum:
  • Idx : the sequence/tag rank (starts at 0)
  • RnkScr : the ranking score (the main score used)
  • PnvScr : the PepNovo score of the sequence (see Anal Chem 2005 and JPR 2006 for more details on the score)
  • N-Gap : the mass gap from the N-terminus to the start of the de novo sequence
  • C-Gap : the mass gap from the C-terminus to the end of the de novo sequence
  • Sequence : the predicted amino acid sequence
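The tab-delimited rows described above are easy to load for further filtering. The parser below is a sketch under assumptions: it takes the first five fields in the order listed on the slide, treats the last field of each row as the sequence, and skips any line whose first field is not an integer (headers, spectrum titles). The sample rows are made-up values for illustration, not real PepNovo output.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    idx: int
    rnk_scr: float
    pnv_scr: float
    n_gap: float
    c_gap: float
    sequence: str


def parse_candidates(text: str) -> List[Candidate]:
    """Parse tab-delimited candidate rows; non-candidate lines are ignored."""
    candidates = []
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 6 or not fields[0].isdigit():
            continue  # header, spectrum title, or blank line
        candidates.append(Candidate(
            idx=int(fields[0]),
            rnk_scr=float(fields[1]),
            pnv_scr=float(fields[2]),
            n_gap=float(fields[3]),
            c_gap=float(fields[4]),
            sequence=fields[-1],  # assumed to be the last column
        ))
    return candidates


if __name__ == "__main__":
    # Hypothetical rows, invented for this example.
    sample = (
        "#Idx\tRnkScr\tPnvScr\tN-Gap\tC-Gap\tSequence\n"
        "0\t9.5\t80.1\t0.000\t0.000\tPEPTLDEK\n"
        "1\t-1.2\t61.3\t97.053\t0.000\tEPTLDEK\n"
    )
    for candidate in parse_candidates(sample):
        print(candidate.idx, candidate.rnk_scr, candidate.sequence)
```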

PepNovo
• http://bix.ucsd.edu/MassSpec/
