
Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins

Shaobing Su. Supervisor: Dr. Lawrence B. Holder. Committee: Dr. Diane J. Cook, Dr. Edward Bellion.


Presentation Transcript


  1. Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins • Shaobing Su • Supervisor: Dr. Lawrence B. Holder • Committee: Dr. Diane J. Cook, Dr. Edward Bellion

  2. Outline • Motivation and goal of the research • SUBDUE knowledge discovery system • Proteins and PDB • Methods and results • Discussion and conclusion • Future research

  3. Motivation and Goal • An explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules. • Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns

  4. SUBDUE knowledge discovery system • SUBDUE discovers patterns (substructures) in structural data sets • SUBDUE represents data as a labeled graph • Inputs: vertices and edges • Outputs: discovered patterns and instances

  5. Example • Vertices: objects or attributes • Edges: relationships • [Figure: example graph with vertices “object”, “triangle”, and “square” and edges “shape” and “on”; 4 instances of the pattern]

  6. SUBDUE’s search algorithm • Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set • DL of the graph: the number of bits necessary to completely describe the graph • Search for the substructure that results in the maximum compression
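The compression criterion on this slide can be illustrated with a rough sketch. The bit-count formula below is a crude stand-in chosen for illustration, not SUBDUE's actual graph encoding, and the names `description_length` and `compression_ratio` are hypothetical. It also assumes that instances do not overlap and that each instance collapses to a single new vertex.

```python
import math


def description_length(num_vertices, num_edges, num_labels):
    """Crude bit-count proxy for a labeled graph (an assumption, not SUBDUE's encoding).

    Each vertex costs one label; each edge costs two vertex indices plus one label.
    """
    if num_vertices == 0:
        return 0.0
    label_bits = math.log2(max(num_labels, 2))
    index_bits = math.log2(max(num_vertices, 2))
    return num_vertices * label_bits + num_edges * (2 * index_bits + label_bits)


def compression_ratio(graph, sub, instances, num_labels):
    """Return (DL(S) + DL(G|S)) / DL(G); values below 1 mean the substructure compresses G.

    graph and sub are (vertex_count, edge_count) pairs.
    """
    gv, ge = graph
    sv, se = sub
    # Every instance is replaced by one vertex, so its internal edges vanish.
    compressed_vertices = gv - instances * (sv - 1)
    compressed_edges = ge - instances * se
    dl_graph = description_length(gv, ge, num_labels)
    dl_sub = description_length(sv, se, num_labels)
    # The compressed graph needs one extra label for the new "substructure" vertex.
    dl_compressed = description_length(compressed_vertices, compressed_edges, num_labels + 1)
    return (dl_sub + dl_compressed) / dl_graph


# Hypothetical sizes: 16 vertices, 15 edges, a 4-vertex/3-edge pattern occurring 4 times.
print(compression_ratio((16, 15), (4, 3), instances=4, num_labels=6))
```

A search would evaluate this ratio for each candidate substructure and keep the one with the lowest value, which corresponds to the maximum compression.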

  7. Inexact graph match approach • Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices. • Threshold parameter: specifies the amount of distortion allowed.
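For the linear chains used later in this deck, the idea of a bounded distortion can be sketched with ordinary Levenshtein edit distance over vertex labels. This is a simplified stand-in, not SUBDUE's graph-matching routine, and interpreting the threshold as a fraction of the larger chain's size is an assumption. The two label sequences are the hemoglobin patterns reported in the results slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def matches_within_threshold(a, b, threshold):
    """Accept a match when the distortion is at most threshold * size of the larger chain."""
    size = max(len(a), len(b))
    if size == 0:
        return True
    return edit_distance(a, b) <= threshold * size


hemo_bd = ["h_1_14", "h_1_15", "h_1_6", "h_1_6", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
hemo_ac = ["h_1_15", "h_1_15", "h_1_6", "h_1_1", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]

print(edit_distance(hemo_bd, hemo_ac))                  # two substitutions
print(matches_within_threshold(hemo_bd, hemo_ac, 0.3))  # within the allowed distortion
```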

  8. Overview of proteins • among the most important biomolecules • built from 20 kinds of amino acids • structural hierarchy • very diverse structure and function

  9. Structural hierarchy in proteins • Primary structure (sequence of protein) • Secondary structure (helix, sheet, random) • Tertiary structure (3-D)

  10. Primary Structure of proteins • Average 100-150 residues (a.a.) linked head to tail • N-terminus and C-terminus • Peptide bond, alpha-carbon • [Diagram: dipeptide backbone from N-terminus to C-terminus, +H3N-C1(R1)-C(=O)-N(H)-C2(R2)-C(=O)-O−, where C1 and C2 are the alpha-carbons of the first and second a.a. and the C-N link between them is the peptide bond]

  11. Secondary structure elements • Ordered backbone arrangement: helix and sheet • Helix (0 % to 90 %; average 11 a.a; several types) • Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)

  12. Tertiary Structure of protein • Highly complicated 3-D arrangement • Folding of its secondary structure elements

  13. Brookhaven Protein Data Bank (PDB) • Brookhaven National Laboratory • Over 6000 experimentally determined 3-D structures of biomolecules • Majority: protein structures

  14. Contents of PDB • SEQRES: sequence of a.a. (three-letter code) • HELIX: starting, ending, and type • SHEET: starts, ends, sense • ATOM: (x, y, z) coordinates for each atom in the protein

  15. Applications of SUBDUE to PDB - Methods and Results • July 1997 PDBTM release (6000 PDB) • Global data set (4000 PDB) • Category data sets: hemoglobin, myoglobin, ribonuclease A

  16. Flowchart of Research • [Flowchart; recoverable box labels: Brookhaven PDB, Preprocessing, Graphic representation, Inputs to SUBDUE, Application, Patterns in Category, Instance mapping, Patterns in Global, others]

  17. Preprocessing • compile PDB list for each category • model.c: extract first model • seq.c: extract sequence info and convert to graphic format • secondary.c: extract secondary structure info and convert to graphic format • coor.c: extract 3D coordinates and convert to graphic format

  18. Primary structure and its representation • Sample PDB lines: SEQRES 1 150 ALA ASN LYS THR 1ASH 139; SEQRES 2 150 LYS SER LEU GLU 1ASH 140 • Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU • SUBDUE graphic input (ALA ASN): v 1 ALA (ALA residue); v 2 ASN (ASN residue); e 1 2 bond (a peptide bond between ALA and ASN)
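A minimal sketch of this conversion follows. It assumes whitespace-separated SEQRES lines in the simplified form printed on the slide, whereas real PDB files use fixed columns. The function name `seqres_to_subdue` is hypothetical and is not the author's `seq.c`.

```python
def seqres_to_subdue(lines):
    """Turn simplified SEQRES lines into SUBDUE-style vertex and edge lines.

    Assumes each line looks like 'SEQRES <serial> <count> RES RES ... <id> <lineno>'.
    """
    residues = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        # Keep only three-letter uppercase residue codes; this skips the
        # trailing entry ID and line number.
        residues.extend(t for t in tokens[3:]
                        if len(t) == 3 and t.isalpha() and t.isupper())
    vertices = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    # One "bond" edge per consecutive residue pair, N-terminus to C-terminus.
    edges = [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return vertices + edges


sample = [
    "SEQRES 1 150 ALA ASN LYS THR 1ASH 139",
    "SEQRES 2 150 LYS SER LEU GLU 1ASH 140",
]
for row in seqres_to_subdue(sample):
    print(row)
```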

  19. Secondary structure and its representation - HELIX • Sample PDB lines (starting, ending, type): HELIX 1 ASN 1 HIS 13 1; HELIX 2 ASN 20 ASN 36 1 • vertex: h_type_length • Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.) • SUBDUE graphic input: v 1 h_1_12 (helix 1, type 1, length 12); v 2 h_1_16 (helix 2, type 1, length 16)

  20. Secondary structure and its representation - SHEET • Sample PDB lines (sense, length): SHEET 1 TYR 284 ILE 286 0; SHEET 2 HIS 292 THR 294 -1 • vertex: s_sense_length • SUBDUE graphic input: v 1 s_0_2 (strand 1, sense 0, length 2); v 2 s_-1_2 (strand 2, sense -1, length 2)

  21. Overall secondary structure representation • PDB lines: HELIX 1 THR 3 MET 13 1; HELIX 2 ASN 24 ASN 34 1; HELIX 3 SER 50 GLN 60 1; SHEET 1 LYS 41 HIS 48 0; SHEET 2 MET 79 THR 87 -1 • SUBDUE graphic input (elements ordered by starting residue): v 1 h_1_10; v 2 h_1_10; e 1 2 sh; v 3 s_0_7; e 2 3 sh; v 4 h_1_10; e 3 4 sh; v 5 s_-1_8; e 4 5 sh • sequential relationship is represented as edge “sh” • [Figure: the five elements drawn in order from N-terminus to C-terminus]
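The ordering step on this slide can be sketched as follows. The token layout is an assumption taken from the simplified sample lines (record, serial, start residue, start number, end residue, end number, type or sense), since real PDB records are fixed-column. The helper names are hypothetical and are not the author's `secondary.c`.

```python
def parse_element(line):
    """Return (start_number, vertex_label) for a simplified HELIX or SHEET line."""
    tokens = line.split()
    record, start, end, kind = tokens[0], int(tokens[3]), int(tokens[5]), tokens[6]
    prefix = {"HELIX": "h", "SHEET": "s"}[record]
    # Length follows the slide's rule: last sequence number minus first.
    return start, f"{prefix}_{kind}_{end - start}"


def secondary_to_subdue(lines):
    """Order elements along the chain and link neighbours with 'sh' edges."""
    wanted = [l for l in lines if l.split() and l.split()[0] in ("HELIX", "SHEET")]
    elements = sorted(parse_element(l) for l in wanted)  # sorts by start number
    vertices = [f"v {i} {label}" for i, (_, label) in enumerate(elements, start=1)]
    edges = [f"e {i} {i + 1} sh" for i in range(1, len(elements))]
    return vertices + edges


pdb_lines = [
    "HELIX 1 THR 3 MET 13 1",
    "HELIX 2 ASN 24 ASN 34 1",
    "HELIX 3 SER 50 GLN 60 1",
    "SHEET 1 LYS 41 HIS 48 0",
    "SHEET 2 MET 79 THR 87 -1",
]
for row in secondary_to_subdue(pdb_lines):
    print(row)
```

Sorting by starting residue is what places strand 1 (residues 41 to 48) between helix 2 and helix 3, even though the SHEET records come after the HELIX records in the file.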

  22. Tertiary structure and its representation • Sample PDB lines (last three columns are X, Y, Z): ATOM CA ALA 1 10.369 0.997 10.519; ATOM CA ASN 2 6.691 0.239 9.830 • vertex: backbone carbon; edge: distance (vs, s) • Distance (Å): $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$ • SUBDUE graphic input: v 1 CA_ALA; v 2 CA_ASN; e 1 2 vs (very short distance)

  23. Rationale for representation choice - Criteria • Patterns identified by SUBDUE must be representative of each category • Patterns discovered by SUBDUE should discriminate one category from others

  24. Primary sequence • vertex - a.a. residue name • edge - peptide bond • Example: the chain ARG -bond- GLU -bond- ALA is written as v 1 ARG; v 2 GLU; v 3 ALA; e 1 2 bond; e 2 3 bond

  25. Secondary structure elements • Type of the helix • starting and ending points (a.a. name and seq number) • [Figure: Helix 1 with attributes type = 1 and length = 12, starting at ASN and ending at HIS (ASN … HIS), drawn from N-terminus to C-terminus]

  26. Other ways of representing helix • Separate type and length • Combine type and length • [Figure: separate form, a vertex “Helix 1” with attributes type = 1 and length = 12; combined form, a single vertex “Helix_1_12”]

  27. Tertiary structure • (x, y, z) coordinates vary with the choice of origin • avoid raw numeric values; use distance labels instead: vs (≤ 4 Å), s (4 Å < dist ≤ 6 Å) • [Figure: alpha-carbons C1 at (10.4, 1.0, 10.5) and C2 at (6.7, 0.2, 9.8), joined by a “vs” edge]
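A short sketch of the distance labeling, using the two alpha-carbon positions from the sample ATOM lines earlier in the deck. Treating "vs" as any distance up to and including 4 Å, and emitting no edge beyond 6 Å, are assumptions, since the slide defines only the two labels. The function name `distance_label` is hypothetical.

```python
import math


def distance_label(p, q):
    """Map the Euclidean distance between two (x, y, z) points to an edge label.

    Returns 'vs' for d <= 4 Å, 's' for 4 Å < d <= 6 Å, and None otherwise
    (no edge; this last case is an assumption).
    """
    d = math.dist(p, q)
    if d <= 4.0:
        return "vs"
    if d <= 6.0:
        return "s"
    return None


ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)

print(round(math.dist(ca_ala, ca_asn), 3))  # roughly 3.8 Å
print(distance_label(ca_ala, ca_asn))       # "vs", as on the slide
```

Because the label depends only on the distance between atoms, it stays the same whichever origin the coordinates were measured from.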

  28. Results:Primary structure patterns Hemo_seq (63/65) Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SET ALA LEU SER THR LEU ALA ALA HIS LEU PRO LAL GLU PHE THR PRO ALA VAL HIS ALA SET LEU ASP LYS PHE LEU ALA SET VAL SER THR VAL LEU THR SER LYS TYR Myo_seq (67/103) Myoglo_sequence: VAL LSU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG Ribo_A (59/68) Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL

  29. Primary structure patterns • Unique to each sample category • hemoglobin and myoglobin proteins share little sequence similarity

  30. Results:Hemo secondary structure patterns 1: h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20 7: h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

  31. Results:Myo secondary structure patterns 1: h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25

  32. Results:Ribo_A secondary structure patterns 1: h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3 10: h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6

  33. Results:Tertiary structural patterns • SUBDUE finds small patterns (2 or 3 a.a.) • not unique for each category of proteins • not biologically meaningful

  34. Visualization of secondary structure patterns - hemoglobin • [Figure: complete hemoglobin structure and the pattern structure (2 instances of the pattern), drawn from N-terminus to C-terminus]

  35. Visualization of secondary structure patterns - myoglobin • [Figure: complete myoglobin structure and the pattern structure (1 instance of the pattern), drawn from N-terminus to C-terminus]

  36. Visualization of secondary structure patterns - ribonuclease_A • [Figure: complete ribonuclease_A structure and the pattern structure (1 instance of the pattern), drawn from N-terminus to C-terminus]

  37. Discussion - Hemoglobin • Hemoglobin: A, B, C, D chains • Two types of patterns identified by SUBDUE: one for A, C chains, the other for B, D chains • Patterns exist in a majority of hemoglobin proteins • No instances of the best hemoglobin pattern found in other proteins in the global data set

  38. Occurrence of hemo patterns

  39. Occurrence of hemo patterns -continued

  40. Discussion - Myoglobin • Myoglobin: one chain • One dominant pattern identified by SUBDUE • Patterns exist in most myoglobin proteins • No instances of the best myoglobin pattern found in other proteins in the global data set

  41. Discussion: Hemoglobin and Myoglobin • Similar secondary structure patterns • Hemoglobin B, D chains (from N- to C-terminus): h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20 • Myoglobin chain (from N- to C-terminus): h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25 • Hemoglobin A, C chains (from N- to C-terminus): h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

  42. Discussion: Hemoglobin and Myoglobin • Consistent with genetic studies • Hemoglobin and myoglobin share one ancestral gene • Divergence occurred in the course of evolution: one copy of the gene for myoglobin, four copies for hemoglobin. • The last helix of hemoglobin is shorter; one of the helices in the hemoglobin A, C chains almost disappears, allowing conformational change

  43. Discussion: ribonuclease A proteins • All patterns have three helices of the same size • Several strands appear twice, indicating participation in the formation of two sheets. • Ribonuclease S protein (S-protein fragment) also has the pattern.

  44. Conclusion of the results • Secondary structure patterns discovered by SUBDUE are representative of each category • Secondary structure patterns discovered by SUBDUE are distinct for each category • SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar molecular biology (MB) databases

  45. Comparison with other related studies • Different graphic representation • Predefined patterns with exact or inexact graph match • Not applied systematically to PDB or other DB • SUBDUE could perform a similar task if the inexact graph match routine is incorporated

  46. Conclusions of the study • Abstraction over 3D structure to its secondary structural elements is suitable for discovery • The secondary structure patterns SUBDUE discovered for each category can be used as a signature for its class • Inexact graph match is useful for finding similar patterns • SUBDUE is suitable for knowledge discovery in MB structural DB

  47. Future Research • More consistent and detailed description of secondary structure • Add relative positions of the secondary structural elements to represent spatial relationship • Investigate alternative representation: more suitable 3D coordinates representation; weighting on different edges • Inexact graph match in predefined substructure • More collaboration with domain scientists
