L o o p p

LOOPP

T. Galor



INDEX

The database of the program

The database of LOOPP

Inserting a new driver to LOOPP

Inserting a new model to LOOPP

The MAIN module

The OPTION module

The loopp_interf module

The ALIGN module

The THREAD module

The SEQ module

The PDB module

The MPS module

Developing a new potential (SVM,BPMPD,PCX)


Index (continued)

TE13 module

The global variables

The parameter file

Installing LOOPP

Running LOOPP

Interpretation of loopp results

Reference



The database

In this chapter I describe some of the main data structures defined in Loopp. The definitions are given in the file db.h, and the allocation and de-allocation of these structures is done in the file db.c.



The protein



Coordinates

The options geometric_chain, C_alpha, and C_beta define which coordinate sets are loaded into memory. The vectors are allocated in the file db.c: the yellow vector is allocated by zalloc_coord(), and the red vectors are allocated by the routine init_coord().

The coordinates are read into memory by read_xyz_loopp_format(), defined in the file loop_interf.c.

The default in LOOPP is to read the geometric side chain coordinates.


Contact Map CM

[Figure: the contact map. A red per-site vector points to yellow vectors, each of size MAX_CONTACT; the slide labels Site 1, Site 4, NULL, First shell neighbor l, and First shell neighbor g.]

The contact map vector (the red vector in the picture) is generated during the allocation of the protein in alloc_info() if the option compute_CM is set on. The size of the red vector equals the number of residues in the protein.

The set of yellow vectors is allocated by Get_CM_for_a_prot(), defined in the file cm.c. This routine reads the CM if the file exists; otherwise it generates the file and loads the CM into memory.

The first shell neighbor g/l vectors contain, for each site, the number of contacts with an index greater/less than the site index, respectively.


Count_2nd_shell_contact and Id_2nd_shell_contact

Id_2nd_shell_contact: each cell contains the type of a structural site in contact with site 1. For THOM2, for example, there are 16 different types of structural sites, numbered from 0 to 15.

Count_2nd_shell_contact: each cell contains the multiplicity of the corresponding structural site in the vector Id_2nd_shell_contact. In the example, there are 2 contacts of type 2 and 1 contact of type 3 in contact with site 1.

The red vectors are allocated during alloc_info if the option read_CM is set on. The yellow vectors are generated with get_thom2_env_per_site(), defined in the file env.c. One can also imagine a structural site definition different from that of THOM2.


The model is a set of information that describes the rules for calculating the protein structural environment site, the cost of an alignment, and the constraint.

Energy model

Alphabet_HP[2]={HYD,POL}

Model HP_M

M_env_HP[2]={15,15}

Base_score[2]={0,15}

db.c:alphabet={ALA,ARG,ASN,ASP,CYS,GLN,GLU,GLY,HIS,ILE,LEU,LYS,MET,PHE,PRO,SER,THR,TRP,TYR,VAL,GAP,GINS,GDEL,HYD,POL,GLX,ASX,CHG,CHN,CST,HST,USR1,ACE,MSE,UNK}

There may be more than one energy model per model. In this case we have a mix model.



The cost matrix

The program stores the values of the potential in the energy model during the call of the routine set_****_attributes(…).

The potential values are read by the routine read_scoring_matrixes(…).

Matrix is a 2 by 2 array that contains the potential of the current energy model, with the attributes:

dim1=2; dim2=2; dim3=0; symmetric=NO; with_gap_score=NO

A value in the matrix is accessed using the macro INDEX_POTEN, defined in the file db.h:

index = INDEX_POTEN(res, env_x, base_score) = base_score[res] + env_x
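A minimal sketch of this lookup, using the HP numbers shown on the model slide (Base_score[2]={0,15}, 15 environments per residue type). The macro below is a hypothetical stand-in written from the formula above, not the actual definition in db.h, and lookup_cost() is an illustrative helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for INDEX_POTEN: offset of the residue type
 * plus the environment index within that residue's block. */
#define INDEX_POTEN_SKETCH(res, env_x, base_score) ((base_score)[(res)] + (env_x))

enum { HYD_POS = 0, POL_POS = 1, N_ALPHABET_HP = 2, N_ENV_HP = 15 };

/* Values taken from the HP_M example: HYD starts at 0, POL at 15. */
static const int base_score_hp[N_ALPHABET_HP] = {0, 15};

/* Illustrative helper: bounds-checked read of one potential value. */
float lookup_cost(const float *cost, size_t n_cost, int res, int env_x)
{
    int idx;

    assert(cost != NULL);
    assert(res >= 0 && res < N_ALPHABET_HP);
    assert(env_x >= 0 && env_x < N_ENV_HP);
    idx = INDEX_POTEN_SKETCH(res, env_x, base_score_hp);
    assert((size_t)idx < n_cost);
    return cost[idx];
}
```

With this layout the potential is one flat vector of 30 values: entries 0 to 14 belong to HYD and entries 15 to 29 to POL.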


The model (continued)

More than one model may be trained simultaneously.

TRAIN_INFO

include_model : A vector of flags; include a model for training

to_train : Indicates which alphabet is trained

get_constraints_coef_of_site : Calculates the constraint coefficients per site

define_ineq : Calculates a constraint

dig2env : Converts a feature to an environment name

cost : Score matrix



The alignment

[Figure: the ALIGNMENT data structure. The labels recoverable from the slide are listed below.]

ALIGNMENT : mvs, input, asses, ALIGN TRACE

ALIGN TRACE : align_len (align length), begin1, begin2, and the trace itself, where M = match, D = delete, I = insert. A local alignment starts at a different location for each of the two proteins.

input (alignment input) : prot_col (protein column), prot_row (protein row), alignment type (global/local), alignment id (thread/seq/struc/), compute Zscore, use_loop_indx (use loopp index), pdb2loopp index 1, pdb2loopp index 2

assessment (alignment assessment) : ene (energy), score (average energy), Zscore, post ene, post score, post zscore, #ins, #del, #match, identity, hydrophobic, polarity, charge, num of gap segment, num of mismatch, rms
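As a small illustration of how the #ins, #del, and #match counters relate to a trace written with the letters M, D, and I, here is a sketch. The struct and function names are hypothetical, and the trace is assumed to be a plain character string; the real ALIGN TRACE layout in db.h may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the three trace counters. */
typedef struct {
    int n_match;
    int n_del;
    int n_ins;
} TraceCountsSketch;

/* Count matches, deletions, and insertions in a trace string.
 * Upper and lower case letters are both accepted; any other
 * character (for example a blank separator) is skipped. */
TraceCountsSketch count_trace_ops(const char *trace)
{
    TraceCountsSketch c = {0, 0, 0};
    size_t i;

    assert(trace != NULL);
    for (i = 0; trace[i] != '\0'; i++) {
        switch (trace[i]) {
        case 'M': case 'm': c.n_match++; break;
        case 'D': case 'd': c.n_del++;   break;
        case 'I': case 'i': c.n_ins++;   break;
        default: break;
        }
    }
    return c;
}
```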



The database

The database is used when all proteins are stored in memory.

The database list is used when only one protein at a time is stored in memory. F_xxxx stands for the pointer to a file and f_xxxx stands for the file name.

The database list is initialized with Init_read_db(), and each new protein is read into memory with read_nxt_prot(). After all proteins are processed, the list is cleaned up with the routine finish_reading_db().

The database is allocated in the file db.c with zalloc_db() and is read into memory with the routine Build_protein_db_from_file(), defined in loop_interf.c. The proteins are read from a file containing a list of PDB names, including chains.



The decoy

A decoy is a set of two proteins and their alignment method.

The alignment can be an identity alignment of SN into XN, a threading alignment of SN into XD, or a sequence alignment.

The alignment energy

Given an initial guess for the score, we calculate the total LHS energy, the RHS energy, and the coefficient vector. The coefficient vector C counts how many times an amino acid ai is assigned to a structural site xj.



The constraint to train

A pseudo protein is defined by a decoy, where a decoy is a set of two proteins and their alignment.

The equation is defined as the information of decoy 1 subtracted from that of decoy 2.

Loopp outputs three files for training: the RHS file, the LHS file, and the Log file.

In the Log file we save the norm of the two coefficient vectors, the distance, and the energy.

In the LHS file we save the left-hand side of a constraint.

In the RHS file we save the right-hand side value.



The Database of LOOPP.

Loopp has a set of about 3888 proteins that span the known folds of the PDB. The folds are 6 Å apart, as found by LOOPP v1 structural alignment, and are updated from time to time using the CE program. The database is stored at H:\\CBSU\LOOPP\DB\DB_jm on the theory center cluster. The list of the proteins, jm_list, is given in H:\\CBSU\LOOPP\LIST\jm_list.

The database so far holds four types of data. Each file starts with a header containing the name and the chain of the protein, accompanied by the number of residues.

The file ****.seq contains a list of the amino acids.

The file ****.xyz contains the coordinates in 9 columns. The first three columns are the (x,y,z) of the geometric side chain, the next triplet is the C alpha coordinates, and the last triplet is the C beta coordinates. Missing coordinates are designated by 999.9.

The file ****.2nd contains the secondary structure, as produced by the DSSP program, in 5 columns. The first column has the name of the amino acid. The second column contains the secondary structure: A for alpha helix, B for beta sheets, and X for the others. The last three columns are the dihedral angles. The number 3600 is used for an unknown angle.

The file ****.surf contains the surface exposure.
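The 9-column layout of a ****.xyz data line can be read along these lines. This is a sketch only: it assumes whitespace-separated numbers, it does not handle the file header, and the type and function names are hypothetical rather than those of read_xyz_loop_format(). Treating a triplet as missing when any of its three values is 999.9 is also an assumption.

```c
#include <assert.h>
#include <stdio.h>

#define MISSING_COORD_SKETCH 999.9   /* marker for a missing coordinate */

/* Hypothetical record for one residue line of a ****.xyz file. */
typedef struct {
    double side_chain[3];   /* columns 1-3: geometric side chain */
    double c_alpha[3];      /* columns 4-6: C alpha */
    double c_beta[3];       /* columns 7-9: C beta */
} XyzRecordSketch;

/* Parse one data line; returns 1 when all nine numbers were read. */
int parse_xyz_line(const char *line, XyzRecordSketch *rec)
{
    int n;

    assert(line != NULL && rec != NULL);
    n = sscanf(line, "%lf %lf %lf %lf %lf %lf %lf %lf %lf",
               &rec->side_chain[0], &rec->side_chain[1], &rec->side_chain[2],
               &rec->c_alpha[0], &rec->c_alpha[1], &rec->c_alpha[2],
               &rec->c_beta[0], &rec->c_beta[1], &rec->c_beta[2]);
    return n == 9;
}

/* Returns 1 when any value of the triplet carries the missing marker. */
int triplet_is_missing(const double xyz[3])
{
    int i;

    for (i = 0; i < 3; i++) {
        double d = xyz[i] - MISSING_COORD_SKETCH;
        if (d < 0.0) d = -d;
        if (d < 1e-6) return 1;
    }
    return 0;
}
```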

Updating the database

The database is updated using the Perl script DB.pl, found in H:\users\galor\loopp\perl. Before running the script, the user has to set some of the parameters inside it.



Inserting a new driver to LOOPP


In this section we explain how to insert a new driver in LOOPP. A driver is a function that a user can choose from the startup menu; an example of a driver is threading a list of sequences against the database. As one can see from the above figure (), a driver consists of several components. We start with the first component, set option.

Default options are set at the beginning of the LOOPP program in main.c. Some of these options are set according to the choice of the user of the program, and the programmer sets the rest. Some of the options are driver dependent and are set by the programmer in the driver. Let's return to our example of threading:

These options mean that the alignment type is threading and that we do not want to compute the contact map (CM) of TE13.

The next step is to set the model, which defines the energy function for LOOPP. In the first example the model is set according to the user's wish, and in the second example:


The programmer can set the model type, the potential type, and the alphabet of the model. The programmer can also decide whether the model is to be trained; when the variable train is set to YES, space is allocated for the training information.

Next, we read the database of structures into the memory of the program. To this end, we allocate the space with alloc_db(). We attach the model to the database and set options for the program to read all files connected to structures with set_struc_option (). Next we prepare to load only a portion of the database in case Loopp is run on several processors: a subset list of structures is created with take_portion_of_db_based_on_number_of_processes (op, io_in). Finally, we build the database of structures with the routine

build_protein_db_from_pdb_list (db_structure, io_in, io_out, op);

The database can be divided among several processors.

Next, we load the list of sequences into memory in the same manner.


We again allocate memory for the database and assign it to the variable db_sequence. We set the appropriate model for db_sequence, then tell the program to load only the information relevant to sequences with set_option_seq (op, model). Next, we copy the list of sequences defined by the user in op->list_pdb_file into the variable io_in->f_current_list. Finally, the sequences are read in LOOPP format or FASTA format, according to the user setting.



Inserting a new model to LOOPP

We start with the smallest component of a MODEL, the ENERGY_MODEL_TEMPLET. The energy template contains the definition of the protein and its operations. In addition, it contains the cost function and its parameters for calculating an alignment of two proteins belonging to the same model.

The name of the model is stored in the variable model_type.

The definition of a protein is given by its list of residues, the ALPHABET, and its structural sites, *env. An example of a valid alphabet is alphabet_20_ins_del, which has the twenty usual amino acid types and two gaps, namely insertion and deletion. The size of the alphabet is stored in n_alphabet.

The *env counts the number of different environments per site. In the example below, each amino acid (ac) has SEQZ sites and each gap has THOM2Z1 sites. In this particular model gaps are treated differently than amino acids: gaps are assigned to THOM2 structural sites.


  • Next we list the operations that can be used on a protein. These are routines that must be programmed for every new model in order for it to function smoothly in LOOPP:

  • Get_gap_per_site (): If, in a particular model, the gap cost depends on the structural site, then one can compute the total gap cost for each site a priori. This means that when the protein features are loaded into memory, the gap costs are computed automatically as well.

  • Copy_struc_feature(): Every new protein feature besides the protein coordinates or residues has to have a copy routine for that feature. Examples are secondary structure, surface exposure, or any new feature that will be added in the future.

  • Res2pos (): Converts a residue number to its position in the ALPHABET vector.

  • Get_contact_types_for_multiple_env_per_site(): Computes the number of contacts in the case of multiple environments per site. So far it has been used only for the THOM2 structural site.

  • Std_residue(): Checks whether an amino acid name is standard for that model.

  • Recall that ENERGY_MODEL_TEMPLET also contains the cost function and its parameters. The parameters are stored in the variable cost and are loaded from the file whose name is stored in f_pot at the time set_option_align () is called. The cost function is divided into two parts, one for the amino acids and one for the gaps. These routines must also be written for any new model introduced to LOOPP.

  • Get_energy_cost (): Calculates the energy for assigning an ac to a structural site.

  • Get_gap_cost (): Calculates the energy for assigning a gap to a structural site.


As an example of a new model, here is a sequence model whose gap depends on the THOM2 structural site.


Some of the models in the file model.c are experimental and should be used with caution. Below is the list of available models for LOOPP:

TE13: Set_model_te13_regular20 ();

PDB: Set_model_clean_pdb ()

SEQ: Set_model_seq_alignment ();

THOM2: Set_model_thom2_regular_20_gap ();

Secondary structure: Set_model_2nd_struc ();

Surface exposure: Set_model_surf_regular_20();

A model can be a mix of several models. As an example we will use the mix model of OT.

This section is plugged into set_model () in the file model.c.



The main module

The main routine has the following functions:

Decipher the command line for loopp:

Loopp.exe : Interactive mode

Loopp.exe x.x loopp.par : Batch mode

Loopp.exe x.x loopp.par #proc proc_Id proc_Id : Batch mode, multiple processors

Set the options chosen by the user with the function set_option().

Print interactively the available command options with a short explanation.

Call the driver selected by the command option.

Print the end message of LOOPP.



The option module

Set_option() : Reads loopp.par and sets the values of the structure OPTION. Sets the pointer F_stdout (a global variable) to redirect the output to the screen or to an output file.

Set_option_seq() : Sets options before reading sequence information.

Set_option_struc() : Sets options before reading structural information of a protein.

Set_option():

Parses the parameter file loopp.par. Every line in loopp.par starts either with a pound sign (#) for a comment or with an at sign (@) for an option definition:

#comment (a comment line)

@USR_PARAMETER value (an option definition)

The same option definition can appear several times in the file loopp.par with different values; only the last definition counts.
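The parsing rules above (comment lines, option lines, last definition wins) can be sketched as follows. This is an illustration, not the code of set_option(): the function names are hypothetical, the lines are passed in as an array of strings instead of being read from loopp.par, and the option name is assumed to follow the '@' with or without intervening blanks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 when `line` defines the real-valued
 * option `name`, storing its value in *value. Lines starting with '#'
 * (comments) and anything else that is not an '@' line are ignored. */
int parse_option_line(const char *line, const char *name, float *value)
{
    char key[64];
    float v;

    assert(line != NULL && name != NULL && value != NULL);
    if (line[0] != '@') return 0;
    if (sscanf(line + 1, "%63s %f", key, &v) != 2) return 0;
    if (strcmp(key, name) != 0) return 0;
    *value = v;
    return 1;
}

/* Scan all lines in order; a later definition overwrites an earlier one.
 * Returns 1 when the option was defined at least once. */
int read_float_option(const char *const *lines, int n_lines,
                      const char *name, float *value)
{
    int i, found = 0;

    for (i = 0; i < n_lines; i++) {
        if (parse_option_line(lines[i], name, value)) found = 1;
    }
    return found;
}
```

Because every matching line simply overwrites the stored value, the "last definition counts" rule needs no extra bookkeeping.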

Adding a new option to LOOPP:

Add the appropriate new option field to the structure OPTION in the file db.h.

Add to set_option() the lines that parse the new option.

As an example, we add a new option field called parameter, which accepts a real-valued number:

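if (strcmp("USR_PARAMETER",operator) == EQ){
    /* read the leading marker, the option name, and the real value */
    sscanf(line,"%s%s%f",crd_opening,operator,&fval);
    /* echo the option to the output stream */
    fprintf(F_stdout,"%s\t\t\t%f\n",operator,fval);
    /* store the value in the new OPTION field */
    op->parameter = fval;
}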



The module loop_interf

The major task of this module is to load protein information into the memory of the program.

build_protein_db_from_file() : Builds the protein database from the old loopp format.

read_a_pdb_in_loop_format() : Stores protein information given in the new loopp format.

get_prot_name(), read_header_loop_format(), read_log_loop_format(), read_seq_loop_format(), read_xyz_loop_format(), read_surf_loop_format(), read_2ndstruc_loop_format() : Read the different files of loopp.

build_protein_db_from_pdb_list()

rm_path(), get_prot_len(), read_a_pdb_in_loop_format(), get_db_TE13_CM_from_pdb_list(), get_gap_per_site_for_db(), get_list_env_per_site_for_db()



check_if_missing_coord() :

Computes the percentage of missing coordinates. If the percentage is greater than the threshold set by op->check_percent_missing, the protein is diagnosed as corrupted and is not loaded into memory.

Computes the size as well as the edge indices of the reliable chunk of a protein. Usually both edges of the protein contain many missing coordinates. These edges are trimmed and not used for training a new potential.

Routines for converting the old loopp format to the new loopp format, and for printing:

prn_db_in_loop_format(), prn_db_in_old_loop_format(), drv_transform_nloopp_to_oloopp_format(), drv_get_list_from_old_loop_format().

define_a_sublist():

Used when loopp is run on several processors. This routine calculates the portion of the database list to load into memory for a specific processor.

Fasta format

read_seq_list_fasta2loop_format(), read_seq_prot_fasta2loop_format(),

Load one protein at a time into memory when there is insufficient memory for the whole loopp database:

init_read_db(), read_nxt_prot(), finish_reading_db(), LM_read_a_pdb_in_loop_format(), LM_read_xyz_loop_format().



The align module

How to use the align module

Set the alignment attributes (*** can be seq/thread/..):

len_list=0;

set_***_attributes(&align,model,op,io_in,mvs,indx1,indx2);

clean_align_list(alignment_list, len_list); len_list=0;



The dynamic matrix

The dynamic matrix is allocated dynamically. Its size depends on the sizes of the query and the structure. LOOPP has local and global algorithms implemented in align.c.

The dynamic matrix is computed with the following routines: scoring_energy and gap_energy.

Below one can see that the cost is the sum over all existing energy models that are not post-energy models. (TE13 is considered a post-energy model.)

float scoring_energy(OPTION *op, MODEL *model, PROTEIN *prot_col, PROTEIN *prot_row, int pos1, int pos2 ){
    int k;
    float ret_score = 0.0;
    ENERGY_MODEL *m;
    for (k=0; k<model->n_ene_models;k++){
        m = &model->ene_models[k];
        if (m->get_energy_cost != NULL && !m->post_ene){
            ret_score += m->unit_conversion * m->get_energy_cost(op,m,prot_col,prot_row,pos1,pos2);
        }
    }
    return(ret_score);
}
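To show how a per-position scoring routine and a gap cost can fill a global dynamic matrix, here is a self-contained sketch. It is not the algorithm of align.c: it assumes a constant gap penalty, it assumes that lower energy is better, and every name in it (including the toy energy function) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical callback: energy of matching column position pos_col
 * with row position pos_row. */
typedef float (*pair_energy_fn)(int pos_col, int pos_row, void *ctx);

typedef struct {
    const char *col;
    const char *row;
} SeqPairSketch;

/* Toy energy used for illustration: -1 for identical letters, +1 otherwise. */
float toy_pair_energy(int pos_col, int pos_row, void *ctx)
{
    const SeqPairSketch *p = (const SeqPairSketch *)ctx;
    return p->col[pos_col] == p->row[pos_row] ? -1.0f : 1.0f;
}

/* Fill the (n_col+1) x (n_row+1) matrix T and return the minimal
 * total energy of a global alignment. */
float global_align_energy(int n_col, int n_row, pair_energy_fn score,
                          void *ctx, float gap)
{
    int i, j;
    int w = n_row + 1;
    float best;
    float *T;

    assert(n_col >= 0 && n_row >= 0 && score != NULL);
    T = malloc((size_t)(n_col + 1) * (size_t)w * sizeof *T);
    assert(T != NULL);

    T[0] = 0.0f;
    for (i = 1; i <= n_col; i++) T[i * w] = T[(i - 1) * w] + gap;
    for (j = 1; j <= n_row; j++) T[j] = T[j - 1] + gap;

    for (i = 1; i <= n_col; i++) {
        for (j = 1; j <= n_row; j++) {
            float m = T[(i - 1) * w + (j - 1)] + score(i - 1, j - 1, ctx);
            float d = T[(i - 1) * w + j] + gap;      /* gap in the row protein */
            float s = T[i * w + (j - 1)] + gap;      /* gap in the column protein */
            best = m;
            if (d < best) best = d;
            if (s < best) best = s;
            T[i * w + j] = best;
        }
    }
    best = T[n_col * w + n_row];
    free(T);
    return best;
}
```

In LOOPP the role of toy_pair_energy() would be played by a routine such as scoring_energy, and the gap term by gap_energy; the sketch only fixes the shape of the recurrence.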

[Figure: the matrix T for the alignment Prot_col ------- Prot_row; Prot_row and Prot_col label its two axes.]

There are two routines for debugging the dynamic matrix. The first one prints the dynamic table to the screen; the size of the window is given by the last four parameters. It must be inserted in local_align or global_align before these routines exit.

The second routine can be called after align(..) has been called, to see the energy of the alignment path.

DEBUG(1, dbg_align_window(seq1_length,seq2_length,S,T,align_info->trace ,0,prot1,prot2,0,20,0,20));

DEBUG(1, dbg_align(seq1_length, seq2_length, S,T,align_info->trace,align_info->align_len));

[Figure: the matrix S.]



Ene.dbg output file example

index, align=M/D/I, Native, Structure, Ene, cost, count structural site, structural site

Align protein1.seq.1 ---> seq.2

0 M TYR GLU: ene = -1.112 score = -1.112 ( 4 4)
1 M PHE GLU: ene = -1.112 score = 0.000
2 M GLN ASP: ene = -1.376 score = -0.264 ( 1 0) ( 1 1)
3 M GLY GLU: ene = -1.264 score = 0.111 ( 3 4)
4 M HIS GLU: ene = -1.163 score = 0.102 ( 1 0) ( 1 1)
5 M MET GLU: ene = -1.456 score = -0.294 ( 2 1)
6 M ASN PHE: ene = -0.866 score = 0.591 ( 1 6) ( 4 7) ( 1 8)

An example for the dynamic matrix

align: 8fab_B---->8atc_A total_ene=405.881042

align_length=310 prot2=224 prot1=310

index of window printing prot2=[214 224] prot1=[300 310]

TRACE ALIGN

D D D D D D D D D D D D D D D D D D D D D D D D D D m m m m m D D D D D D m m D D m m m m m m m m m D m m m m m D m m m m m m D m m m m D m m m m D D m m m D m m m m m D m D m m D m m m m m m D m m m m m m m m m m D m m m m m m m m D m m m D m m m m m m m m m m m m m m m m m m m m m m m m m m m m D m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m D m m D m D m D m m D m m D D m D m D m m D m m D m D m m m D m m m m m m m m m m m D D m m D D D m D D m m D m D m m m m m D m m D m m D D m m m m m m m m m m D m m m m D D m m D m D m m m m D D m m D m m m D m m m m m m m m m m m m D m D m

DYNAMIC MATRIX FOR GLOBAL ALIGNMENT

300 301 302 303 304 305 306 307 308 309 310

LEU ALA LEU VAL LEU ASN ARG ASP LEU VAL LEU

LYS 411.4 421.6 435.3 444.7 445.9 459.5 470.0 471.3 479.1 480.3 490.0

VAL 398.5 411.1 426.0 433.2 444.2 445.7 459.4 460.7 470.3 471.6 479.7

ASP 392.1 400.6 418.4 426.9 435.6 444.2 445.8 447.1 461.5 462.7 472.3

LYS 386.6 395.5 409.1 420.0 430.5 436.0 443.8 445.1 447.8 449.0 463.5

LYS 383.2 389.9 403.9 410.7 423.5 430.8 435.6 436.8 445.8 447.1 449.8

VAL 368.6 382.8 394.3 401.8 410.2 423.4 430.7 431.9 435.8 437.1 446.5

GLU 356.5 370.9 389.9 395.5 404.3 410.6 423.5 424.8 432.6 433.8 437.9

PRO 348.7 358.3 377.7 391.0 397.6 404.7 411.0 412.3 425.8 427.0 434.4

LYS 351.4 352.0 366.7 379.3 394.6 397.9 404.2 405.5 413.0 414.2 427.8

SER 340.3 352.7 358.3 367.6 381.0 394.8 398.1 399.4 406.5 407.8 415.6

CYS 324.1 338.7 355.9 355.6 366.1 379.4 393.9 395.1 396.9 398.2 405.9



Computing the Z score

The Z score measures the homology of prot2 to prot1 with respect to random noise.

The Z score is computed in the routine align(….) in the file align.c:

if (input->compute_zscore == YES){
    srand(RANDOM_SEED);
    shuffled_prot = alloc_prot(op,MAX_SEQ);
    for ( k=0; k<n_rnd_alignments; k++ ){
        /* Shuffle the sequence residues of prot2 */
        shuffle_sequence(prot2,shuffled_prot);
        /* Add the proteins to the align.input structure */
        rnd_input->prot_row = prot1;
        rnd_input->prot_col = shuffled_prot ;
        /* Compute the random energy of aligning the random sequence into prot1 */
        if (align_data->input.alignment_type == GLOBAL){
            if (op->strucAlignment) rnd_ene = struc_global_align(&rnd_align,do_trace_back,op,model);
            else rnd_ene = global_align(&rnd_align,do_trace_back,op,model);
        }
        else if (align_data->input.alignment_type == LOCAL)
            rnd_ene= local_align(&rnd_align,do_trace_back,op,model);
        sumT += rnd_ene;
        sumT2 += rnd_ene*rnd_ene;
    }
    /* Compute the average energy of aligning the random sequences into prot1 */
    avT = sumT/n_rnd_alignments;
    avT2 = sumT2/n_rnd_alignments;
    norm = fabs(avT2 - avT*avT);   /* variance of the random energies */
    align_data->assess.score = avT;
    if (norm == 0) align_data->assess.zscore = -999.9;
    else align_data->assess.zscore = -(align_data->assess.ene - avT)/sqrt(norm);
    if (align_data->assess.zscore < -999.9) align_data->assess.zscore = -999.9;
}



Printing the statistic of aligning the query to LOOPP database.

#Mon Jul 07 10:56:28 2003

#LOOPP V2: ALIGNMENT INFORMATION

#======================================================

#This file contains statistics of sequence to sequence alignment

#with constant gap penalty 8.000000

#and the potential is multiplied with the factor scale 1.000000

#Alignment type : GLOBAL

#The following models were used:

#Potential : NHseq_gte_thom2.pot

#Model : SEQ_M with mixing parameter: 1.000000

#The model produced the alignment : YES

#

#Data Base : H:\users\galor\LISTS\test

#The difference in length between the query sequence and the data base sequence is less then 30.00 percent

#

#The number of random sequence to compute zscore was set to 100

#Only prints zscore above threshold 0.00

# ========================================================

# 1 matches to 1dbt_A zscore ene identity te_ene te_zscore length align_len

7tim_A 0.11 -89.00 5.40 999.00 999.90 247 278



The Threading module

What is threading

S1: AWGHKI

X2: s1s0s2s3s0s3

Sequence information is used for the probe protein, and structural information for the target.

[Figure: residues of the sequence S1 (G, K, I, H are labelled) placed on the structural sites of X2.]



LIST OF FUNCTIONS:

drv_threading_a_list_of_seq_against_the_db() : Thread a list of sequences against the database

drv_threading_a_seq_against_the_db() : Thread one sequence against the database.

LM_drv_threading_a_list_against_the_db() : Thread a list of sequences against the database (Low memory)

drv_threading_a_seq_against_a_struc() : Thread a sequence against one structure

drv_threading_a_db_against_itself() : Thread the database against itself used for recognizing native

set_threading_attributes() : Set attributes for alignment of sequence to structure.

thom2_gapless_threading_gap_penalty() : Compute gaps for Thom2 model REJM model

thom2_threading_scoring_energy() : Compute scoring energy for Thom2.



The seq module

drv_seq_alignment_of_a_list_of_seq_against_the_db(): Align a list of sequences against the database

LM_drv_seq_alignment_a_list_against_the_db(): Align a sequence against the database (Low memory)

drv_seq_alignment_of_db_against_the_db(): Align the database against itself (for recognizing the native)

drv_seq_alignment_of_seq_against_seq(): Align one sequence against one sequence.

set_seq_attributes(): Set attributes for sequence alignment.

seq_alignment_gap_penalty(): Compute structural gap dependent penalty.

seq_alignment_constant_gap_penalty(): Compute constant gap penalty

constant_seq_alignment_gap_penalty_for_pdb_seq_to_atom(): Compute gap penalty for aligning SEQRES to ATOM section for a PDB file.

seq_alignment_scoring_energy(): Compute scoring energy using BLOSUM 50



The PDB module:

The main task of this module is to create the database for Loopp. The database is created in two stages: in the first step the PDB files are cleaned, and in the second step the LOOPP files are created.

The first step:

A PDB file pdb****.ent is converted into 2 files: pdb****.ent.log and pdb****.ent.new. The file pdb****.ent.new holds a clean PDB derived from the original PDB. The file pdb****.ent.log is a log file that contains information on the clean PDB. The log file contains lines of the form "tag resName resSeqNum atomCounter gapIndicator CA-distance", which describe how the file *.new was derived from the original PDB file *.ent.

<tag> is one of the characters +, -, =, or *:

+ stands for adding NTER and CTER cards in *.new as chain designators

- deleted residue in *.new

= copied residue in *.new

* copied residue, but some of the atoms are missing in *.new

<atomCounter> : Displays the number of atoms found for the current residue.

<gapIndicator> : Displays the index in a chain. A chain starts with index 1, and terminates with index 0 if no CA is found at the current residue.

<CA-distance> : Displays the C-alpha distance between the previous and the current residue.
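A line of the *.log file in the form described above could be read as follows. The field types (integers for the counters, a float for the distance), the assumption that all six fields are present on every line, and the struct and function names are illustrative guesses; they are not taken from PDBparser.c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line of pdb****.ent.log. */
typedef struct {
    char tag;              /* one of '+', '-', '=', '*' */
    char res_name[8];      /* residue name, e.g. "ALA" */
    int res_seq_num;
    int atom_counter;
    int gap_indicator;
    float ca_distance;
} CleanLogLineSketch;

/* Parse "tag resName resSeqNum atomCounter gapIndicator CA-distance".
 * Returns 1 on success, 0 when the line is malformed or the tag is unknown. */
int parse_clean_log_line(const char *line, CleanLogLineSketch *out)
{
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line, " %c %7s %d %d %d %f",
               &out->tag, out->res_name, &out->res_seq_num,
               &out->atom_counter, &out->gap_indicator, &out->ca_distance);
    if (n != 6) return 0;
    return out->tag == '+' || out->tag == '-' ||
           out->tag == '=' || out->tag == '*';
}
```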

The location of the newly created files *.new and *.log is defined by the option USR_PDB_PATH in the parameter file loopp.par.



The routine which is responsible for cleaning the pdb is drv_clean_pdb_from_a_list_of_pdb_names(). It calls the interface routine openInterfaceToCleanPDB() in PDBparser.c file.

Step 2: Generating loopp database

The routine structure for parsing a pdb file containing all sections as defined by RCSB database:



The main database in the file pdb.c : PDB_INFO:

code_name : PDB acronym

Chain_code_file : Chain identity (extract from the file name)

Res_atom : A protein whose sequence is taken from SEQRES section

Res_atom : A protein whose sequence and coordinates is taken from ATOM section.

N_card : Number of cards in atom section

N_atom_list[] : Display the number of atoms for the current residue

Trace_pass2 : Pass 2 alignment trace

Trace_original : Pass 1 alignment trace

Trace_final : Final alignment trace

gapMarkbond : Save gapIndicator from *.log file

JumpMarkBond : Gap marker according to C-alpha bond length

jumpMarkPdb : Gap marker according to bond residue index

Align_info : Alignment of SEQRES section onto ATOM section

Current_chain_id : The current chain in case there are several chains in the pdb

DiscrepancyInJump : JumpMarkBond and JumpMarkPdb disagree flag

MatchNotIdetical : In the alignment of SEQRES onto ATOM there is a match but the residues are not identical (error in the pdb file)



Algorithm outline:

The PDB files from the RCSB database are full of discrepancies. One way to fish out these problems is to align the residues of the SEQRES section to those of the ATOM section. The program uses a sequence alignment with a constant gap penalty and a constant match score. After the first pass, the alignment trace (alignment path) must be checked to see whether the alignment makes sense: that gaps are concentrated in distinct areas, that gaps according to the C-alpha distance correspond to jumps in the PDB index, and that within match segments the residue in the SEQRES section coincides with that in the ATOM section.

The program tries to correct some of the errors in pass 2 by shifting gaps or by using a different alignment that is not based on dynamic programming. The user can choose which alignment path makes more sense based on reading the comments in the PDB file, or manually make their own version if both alignments fail.

When the program fails, usually only a small portion of the alignment needs to be corrected. In this case the program prints one section of the alignment at a time and waits for approval or correction. Unfortunately, on rare occasions the algorithm may fail completely.



MPS module

Convert loopp format output to MPS format:

The design of a new potential eventually leads to solving a set of linear inequalities. Loopp generates LHS and RHS files, which contain the coefficients of the inequalities. One option for solving this set is the BPMPD software, which requires input in MPS format.

MPS format

Here is a simple example of mps file:

NAME example2.mps

ROWS

N obj

L c1

L c2

COLUMNS

x1 obj -1 c1 -1

x1 c2 1

x2 obj -2 c1 1

x2 c2 -3

x3 obj -3 c1 1

x3 c2 1

RHS

rhs c1 20 c2 30

BOUNDS

UP BOUND x1 40

ENDATA
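For orientation, the sections of an MPS file such as the one above can be produced from a dense coefficient matrix along these lines. This sketch is not loopp2mps_nobj(): the function name is hypothetical, it writes whitespace-separated (free-format) fields, it uses "G" (greater than or equal) rows as an assumed stand-in for the LOOPP inequalities, and it leaves the objective row empty. Whether a given solver accepts this exact layout should be checked against its documentation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch: write m inequalities over n variables in MPS form.
 * a is an m x n row-major matrix of LHS coefficients and rhs holds the m
 * right-hand sides. Returns the number of nonzero coefficients written. */
int write_mps_sketch(FILE *out, const char *name, int m, int n,
                     const double *a, const double *rhs)
{
    int i, j, nnz = 0;

    assert(out != NULL && name != NULL && a != NULL && rhs != NULL);
    assert(m > 0 && n > 0);

    fprintf(out, "NAME %s\nROWS\n N obj\n", name);
    for (i = 0; i < m; i++) fprintf(out, " G c%d\n", i + 1);

    /* COLUMNS lists the nonzero entries variable by variable */
    fprintf(out, "COLUMNS\n");
    for (j = 0; j < n; j++) {
        for (i = 0; i < m; i++) {
            if (a[i * n + j] != 0.0) {
                fprintf(out, " x%d c%d %.10g\n", j + 1, i + 1, a[i * n + j]);
                nnz++;
            }
        }
    }

    fprintf(out, "RHS\n");
    for (i = 0; i < m; i++) fprintf(out, " rhs c%d %.10g\n", i + 1, rhs[i]);
    fprintf(out, "ENDATA\n");
    return nnz;
}
```

Calling write_mps_sketch(stdout, "demo", m, n, a, rhs) prints the file to the screen, which is a convenient way to compare it with the example above.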

[Figure: annotated code listing with the steps: read the loopp LHS and RHS files into memory; write the bilinear objective; write the linear objective; define the space of a field; define the variable name.]


The driver that converts loopp to MPS format is drv_loopp2mps(). This routine calls different routines depending on the *.par file parameters. The most common case is solving the linear set of inequalities without an objective: loopp2mps_nobj().



Design new potential

Gapless threading

Create equations of the type a_j[0]x[0]+...+a_j[n]x[n] = r_j, where the coefficients a_j[0],...,a_j[n] (j=1...m) are stored in the file op->train_lhs_file_nobj and the right-hand sides r_j (j=1...m) are stored in op->train_rhs_file_nobj.

The equations are generated by the gapless threading method, which assigns a sequence to a structure without gaps. The pair (seq_i,struc_j) constructs a pseudo protein denoted as a decoy. The equation is defined as the energy difference between assigning a native sequence to a decoy structure (a non-native structure) and assigning the native sequence to its native structure:

E(N->D)-E(N->N)=A*X > 0.

The energy definition depends on the model chosen by the user in USR_MODEL_TYPE. The sequence should be shorter than the structure. The sequence is slid along the structure (N_struc - n_seq + 1) times, or fewer depending on USR_GAPLESS_THREADING_WINDOW.

There are two main routines for gapless threading:

drv_compute_fix_threading_constrains() : Generates the LHS, RHS, and LOG files from a LOOPP database.

drv_compute_fix_threading_constraints_where_the_db_is_based_on_abintio_decoys() : Generates the LHS, RHS, and LOG files for an ab initio database. The difference is that the native is gapless threaded onto its own family of decoys. Examples of decoy sets are the Skolnick set, the TB set, and the Baker set. Each decoy has the same length as its native.

