Allele Mining: with respect to Comparative Protein Structure Modelling and Docking study

Allele Mining: with respect to Comparative Protein Structure Modelling and Docking study Sunil Kumar Institute of Life Sciences Bhubaneswar E-mail: skybiotech@gmail.com

Allele Mining: an Introduction • Enormous sequence information is available in public databases as a result of sequencing of diverse crop genomes. • It is important to use this genomic information for the identification and isolation of novel and superior alleles of agronomically important genes from crop gene pools to suitably deploy for the development of improved cultivars. • Allele mining is a promising approach to dissect naturally occuring allelic variation at candidate genes controlling key agronomic traits which has potential applications in crop improvement programs. • It helps in tracing the evolution of allels, identification of new haplotypes and development of allele specific markers for use in marker-assisted selection.

Allele Mining…..cont • Initial studies of allele mining have focused only on the identification of SNP/InDels at coding sequences or exons of the gene. • Since these variations were expected to affect the encoded protein structure and/or function • However, recent reports indicate that the nucleotide changes in non-coding regions (5’UTR) including promoter, introns and 3’ UTR) also have significant effect on transcript synthesis and accumulation which in turn alter the trait expression

Information Transfer pathway within the cell ……ATGCATGCATGCATGCATGC.. DNA ………CGUACGUACGUACGU………… RNA ………CGUACGUACGUACGU………… DECODING MECHANISM PROTEIN Sequence PROTEIN Structure Biological function

Proteins Proteins are the building blocks of life. In a cell, 70% is water and 15%-20% are proteins. Examples: hormones – regulate metabolism structural – hair, wool, muscle,… antibodies – immune response enzymes – chemical reactions

Amino Acids A protein is composed of a central backbone and a collection of (typically) 50-2000 amino acids There are 20 different kinds of amino acids Name 3-letter code1-letter code Leucine Leu L Alanine Ala A Serine Ser S Glycine Gly G Valine Val V Glutamic acid Glu E Threonine Thr T

Amino Acids Side chain Each amino acid is identified by its side chain, which determines the properties of this amino acid.

Side Chain Properties • Hydrophobic stays inside, while hydrophilic stay close to water • Oppositely charged amino acids can form salt bridge. • Polar amino acids can participate hydrogen bonding The amino acids names are colored according to their type: positively charged, negatively charged, polar but not charged, aliphatic (nonpolar), and aromatic. Amino acids that are essential to mammals are marked with an asterisk (*).

Protein Folding • Proteins must fold to function • Some diseases are caused by misfolding e.g., mad cow disease

Three Structure Levels • Primary structure: sequence of amino acids • e.g., DRVYIHPF • Secondary structure: local folding patterns • e.g., alpha-helix, beta-sheet, loop • Tertiary structure: complete 3D fold Helix Beta Sheet Loop

Beta Sheet Examples Parallel beta sheet Anti-parallel beta sheet

Helix Examples

Domain, Fold, Motif • A protein chain could have several domains • A domain is a discrete portion of a protein, can fold independently, possess its own function • The overall shape of a domain is called a fold. There are only a few thousand possible folds. • Sequence motif: highly conserved protein subsequence • Structure motif: highly conserved substructure

Protein Data Bank About 50,000 protein structures, solved using experimental techniques ~800 are unique structural folds Same structural folds Different structural folds

The Problem protein structure • Protein functions determined by 3D structures • ~ 50,000 protein structures in PDB (Protein Data Bank) • Experimental determination of protein structures time-consuming and expensive • Many protein sequences available medicine sequence function

Why Protein 3D Structures? 3D Structures of Proteins Better Understanding of Protein Functions “Three-dimensional protein structures are important in understanding the mechanisms of human genetic diseases, predicting the effect of non-synonymous single nucleotide polymorphisms and developing new personalized medicines” Xie and Bourne (2005) PLoS Compt.Biol. 1:e31

What is Homology Modeling? An approach to predict a model of the three-dimensional structure of a given protein sequence (TARGET) based on an alignment to one or more known protein structures (TEMPLATES) The homology modeling method is based on the assumption that the structure of an unknown protein is similar to known structures of reference proteins

Why a Model? A model is desirable when either X-ray crystallography or NMR spectroscopy can not determine the structure of a protein in time or at all. While the 3-D structure of proteins can be determined by x-ray crystallography and NMR spectroscopy. These experimental techniques are time consuming and not possible if a sufficient quantity and quality of a proteins is not available. The built model provides a wealth of information of how the protein functions with information at residue property level. This information can than be used for mutational studies or for drug design..

Protein Structure Determination • High-resolution structure determination • X-ray crystallography (~1Å) • Nuclear magnetic resonance (NMR) (~1-2.5Å) • Low-resolution structure determination • Cryo-EM (electron-microscropy) ~10-15Å

most accurate An extremely pure protein sample is needed. The protein sample must form crystals that are relatively large without flaws. Generally the biggest problem. Many proteins aren’t amenable to crystallization at all (i.e., proteins that do their work inside of a cell membrane). ~$100K per structure X-ray crystallography

Nuclear Magnetic Resonance • Fairly accurate • No need for crystals • limited to small, soluble proteins only.

Steps in homology modelling Target’s sequence 1. Identification of structures that will form the template for modelling 2. Sequence Alignment of the target with template 3. Transfer of the coordinates from the template(s) to the target of structurally conserved regions (SCR’s) 4. Modelling the missing regions 5. Refinement and validation of the model Target’s structure

Template search • Homology modeling is based on using similar structures • i.e. no Similar structures = No Model • 40% amino acid identity or higher is best; below that is not advisable but examples of success do exist • Need sequence similarity across the whole sequence, • not just in one part

Searching Databases Query BLASTING…. FASTING…. Database

Key Step: Sequence alignment of the target with the basis structures Good Alignment Good Model

Sequence alignment is a basic technique in homology modeling. • It is used to establish a one-to-one correspondence between the amino acids of the reference protein (template) and those of the unknown protein (target) in the structurally conserved regions. • The correspondence is the basis for transferring coordinates from the reference to the model protein

Sequence A GGTGGAC GGTGGAC Sequence B AAAGGTGAC AAAGGTG - AC A Sample alignment of two DNA sequences (a) Un-gapped alignment (b) Gapped alignment. The “I” indicates matching nucleotides

Global Alignment Local Alignment Sequence Alignment Applications: Global alignment : essential for comparative modeling. Local alignment : sufficient for functional domains. N.B:Global alignment is computationally more time consuming than the local alignment.

Sequence Homology Vs Sequence Similarity

Dotplot: A dotplot gives an overview of all possible alignments A   T   T   C  A   C  A   T   A    T A C A T T A C G T A C Sequence 2 Sequence 1

Dynamic Programming • Dynamic programming is a computational method used for aligning two protein or nucleotide sequences. The method compares every pair of residues/nucleotides in the two sequences and generates an alignment. • In the alignment matches, mismatches and gaps in the two sequences are positioned in such a way that the number of matches between identical or similar residues is maximum possible. • Needleman and Wunsch Algorithm - Global Alignment - • Smith and Waterman Algorithm - Local Alignment -

F(i, j) =F(i-1, j-1) + s(xi ,yj) F(i, j) = max F(i, j) = F(i-1, j) - d F(i, j) = F(i, j-1) - d F(i-1, j-1)F(i, j-1) F(i-1,j)F(i, j) s(xi ,yj) -d -d

Steps Initialization:- 1st Row and 1st Column- Filled with Multiple of Gap Penalty Rest of the cells: Filled with Vmax Value Generation of Optimal path: Through back tracking Generation of optimal alignment: For the optimal path (No. of optimal path = No. of optimal alignment Scoring Scheme :- Given an alignment between two sequences, we can compute its similarity by :- 1) Rewarding for a match Match => +1 2) Penalizing for a mismatch Mismatch => -1 3) Penalizing for a gap Gap or Indel => -2

Smith and Waterman(local alignment) Two differences: 1. 2. An alignment can now end anywhere in the matrix 0 F(i, j) = F(i-1, j-1) + s(xi ,yj) F(i, j) = F(i-1, j) - d F(i, j) = F(i, j-1) - d F(i, j) = max Example: Sequence 1 H E A G A W G H E E Sequence 2 P A W H E A E Scoring parameters: BLOSUMGap penalty: Linear gap penalty of 8

Comparative Modelling Methods • Restrained based methods • MODELLER • (Sali and Blundell, 1993)

MODELLER • MODELLER is a computer program that models three-dimensional structures of proteins and their assemblies by satisfaction of spatial restraints. • MODELLER is most frequently used for homology or comparative protein structure modeling. • The user provides an alignment of a sequence to be modeled with known related structures and MODELLER will automatically calculate a model with all non-hydrogen atoms. • A 3D model is obtained by optimization of a molecular probability density function (pdf).

Format for Modeller: INCLUDE SET ATOM_FILES_DIRECTORY = './:../‘ SET PDB_EXT = '.atm‘ SET STARTING_MODEL = 1 SET ENDING_MODEL = 20 SET MD_LEVEL = 'refine1‘ SET DEVIATION = 4.0 SET KNOWNS ='1JKE‘ SET HETATM_IO = off SET WATER_IO = off SET ALIGNMENT_FORMAT = 'PIR‘ SET SEQUENCE = 'target1‘ SET ALNFILE = 'multiple1.ali CALL ROUTINE = 'model'

Loop Modelling Calculate distances between the anchor residues. FRAGMENT DATABASE Loop region Loop Generation Process: 1. Select a loop for each region 2. Fixing of the loop

Loop Library • Loops extracted from PDB using high resolution (<2 Å) X-ray structures • Typically thousands of loops in DB • Includes loop coordinates, sequence, # residues in loop, Ca-Ca distance, preceding 2o structure and following 2o structure (or their Ca coordinates)

Structure Validation • Stereochemical Quality Check • (b) Residue Environment Check

Stereochemical Quality Check PROCHECK (Thornton and Co-workers) Following properties are calculated and analysed in comparison with those of highly refined structures solved at varying resolutions. Torsional angles: - (f,y) combination - c1-c2 combination - c1 torsion for those residues without c2 - combined c3 and c4 angles - w angles Covelent geometry: - main-chain bond lengths - main-chain bond angles

Profiles-3D • Amino acid residues in proteins can be classified according to their local environments: • solvent accessibility • secondary structure • polarity of other protein chemical groups in contact with them

Refining the Model -Energy minimize N- and C-termini. - Repair spliced peptide bonds. - Minimize loop regions - Energy minimize mutated side chains in SCRs. - Minimize segments together.

Energy Minimization • Energy minimization adjusts the structure of the molecule in order to lower the energy of the system. • For small molecules, a global minimum energy configuration can often be found. • for large macromolecular systems, energy minimization allows one to examine the local minimum around a particular conformation.

Modelling on the Web • Prior to 1998 homology modelling could only be done with commercial software or command-line freeware • The process was time-consuming and labor-intensive • The past few years has seen an explosion in automated web-based homology modelling servers • Now anyone can homology model!

Application of Comparative Modeling - Comparative modeling is an efficient way to obtain useful information about the proteins of interest. For example – comparative modeling can be helpful in - Designing mutants to text hypothesis about the proteins function. - Identifying active and binding sites. - Searching for designing and improving. - Modeling substrate specificity. - predicting antigenic epitopes. - Simulating protein – protein docking. - Confirming a remote structural relationship.

Allele Mining: with respect to Comparative Protein Structure Modelling and Docking study