
Conceptual basis for critical thinking, data analysis and problem solving


shanna

Presentation Transcript


  1. STRATEGY: Conceptual basis for critical thinking, data analysis and problem solving (and I don’t know what this is either!)

  2. Challenges for bioinformatics. With the sequence/structure deficit, the challenges are to • rationalise the mass of sequence data • derive more efficient means of data storage • design more reliable analysis tools • Imperative: to convert sequence information into biochemical & biophysical knowledge

  3. What we cannot do well: “Give us sequence, we do rest”

  4. What is the function of this structure? What is the function of this sequence? What is the function of this motif? • the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions - knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level

  5. Complication – Multiprotein Complexes

  6. ATPase. 1H8E: (ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL THREE CATALYTIC SITES OCCUPIED). MENZ, R.I., WALKER, J.E., LESLIE, A.G.W.

  7. Multiprotein transcription complexes: RNA Polymerase. Science 288, 640 (2000), P. Cramer et al. 1NT9: COMPLETE 12-SUBUNIT RNA POLYMERASE II. ARMACHE, K.-J., KETTENBERGER, H., CRAMER, P.

  8. STRING: a database of predicted functional associations between proteins. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B • http://string.embl.de/ • Prolinks: a database of protein functional linkages derived from coevolution. P.M. Bowers, M. Pellegrini, M.J. Thompson, J. Fierro, T.O. Yeates, D. Eisenberg • http://dip.doe-mbi.ucla.edu/pronav (?)

  9. Ground rules for bioinformatics • Don't always believe what programs tell you: they're often misleading & sometimes wrong! • Don't always believe what databases tell you: they're often misleading & sometimes wrong! • Don't always believe what lecturers tell you: they're often misleading & sometimes wrong! • In short, don't be a naive user • When computers are applied to biology, it is vital to understand the difference between mathematical & biological significance • Computers don’t do biology: they do sums quickly!

  10. General Evaluation Criteria: be sceptical and cynical! • When you are searching for information you need to judge its quality and suitability. • Think critically about each piece of information you find and how you found it. • Relevance: • Does the information you have found adequately support your research? • Does it answer the question, or support one of your arguments? • How general or specific is the information about the topic?

  11. Building a search protocol • The usual starting point: searching the primary data sources (NRDB, SPTR, etc.) • Pattern recognition methods: searching the secondary sources (patterns, profiles, blocks, fingerprints & HMMs) • Estimating significance: when do we believe a result?

  12. A central goal is to predict protein function from sequence. Given a sequence, we want to know: what is my protein? To what family does it belong? What is its function? How can we explain its function in structural terms? By searching pattern dbs & fold libraries, we may recognise patterns that allow us to infer relationships with previously-characterised families & folds. Given the variety of dbs to search, how do we use them to build a sensible search protocol?

  13. Planning a database search • To find various aspects of your query sequence, you may have to search a number of databases. • Identify the sequence: search for a matching or similar sequence using a 'BLAST' program. • Find related sequences: (a) For a protein sequence, find the mRNA sequence that produces the protein, and the DNA sequence that codes for the mRNA. (b) For an mRNA sequence, find the protein it produces, and the DNA sequence that codes for the mRNA. (c) For a DNA sequence, find the mRNA it is transcribed into, and the protein that the mRNA produces.
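
The DNA, mRNA and protein relationships in step (c) can be sketched in a few lines of Python. This is only an illustration, assuming the input is the coding (sense) strand, the standard genetic code, and reading frame 1; the function names `transcribe` and `translate` are made up for this example and do not come from any database tool.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame 1, stopping at the first stop codon.

    Assumption: the sequence contains only A, C, G, U; ambiguous bases
    such as N would raise a KeyError in this simple sketch.
    """
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")
print(mrna, translate(mrna))
```

Going the other way, from protein back to DNA, is not unique because most amino acids have several codons, which is one reason the database cross-links matter.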

  14. If the sequence is from a protein, find a structural image. • Research the functionality of the sequence: (a) What is its function in different tissues (homology)? (b) What is its function in different organisms (phylogeny)? (c) Are there any mutations, and what are their consequences? (d) What is the role of the protein in cell function?

  15. • Protein sequence database identity search, e.g., for short fragments: pinpoints identical matches to probe; may identify correct reading frame • Protein sequence database similarity search, e.g., nrdb, OWL, SP+SPTrEMBL: identifies homologues to probe • Protein pattern database search, e.g., PROSITE, profiles, PRINTS, BLOCKS, Pfam: identifies family relationships or pinpoints key structural or functional sites • Known structure: structure classification database query, e.g., scop, CATH, FSSP: provides details of structural class, ligand-binding, etc. • Unknown structure: protein fold pattern library search, e.g., threading: identifies compatible folds

  16. http://eol.sdsc.edu iGAP

  17. [Flow diagram: iGAP pipeline] Inputs: protein sequences; structure info (SCOP, PDB); sequence info (NR, PFAM). Stages in the order shown, labelled Step 1 to Step 6 in the diagram: • Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG) • Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30) • Structural assignment of domains by WU-BLAST • Structural assignment of domains by PSI-BLAST profiles on FOLDLIB • Structural assignment of domains by 123D on FOLDLIB • Functional assignment by PFAM, NR assignments • Domain location prediction by sequence • Data Warehouse

  18. http://harvester.embl.de/ “Harvester” collects information from selected public databases

  19. Similarity searching. Whether or not an identity search finds a match, the next step is to look for similar sequences, e.g., you may wish to know if a wider family exists. The most rapid option is to use BLAST & variants and look for high scores with low P-values (unlikely to be random); clusters of high scores at the top of the hitlist (a family?); and trends in the type of sequences matched. Use a composite database, e.g., UNIPROT.
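
To make "high scores with low P-values" concrete, here is a rough sketch assuming the standard Karlin-Altschul relation for normalised bit scores, E = m·n·2^(-S'), and P = 1 - e^(-E). The function names are illustrative only, and real BLAST programs use effective (edge-corrected) lengths, so reported values will differ somewhat.

```python
import math


def expected_chance_hits(bit_score, query_len, db_len):
    """E-value: expected number of chance hits scoring at least bit_score.

    Assumes the Karlin-Altschul form E = m * n * 2**(-S') with raw
    (uncorrected) query and database lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of seeing at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Illustrative numbers only: a 300-residue query against 10**9 residues.
for score in (30, 50, 80):
    e = expected_chance_hits(score, 300, 10**9)
    print(score, e, p_value(e))
```

A low E or P only says that the match is unlikely to be random; it does not by itself say that the match is biologically meaningful.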

  20. Structural & functional interpretation. Db searches often do little more than identify a protein family. This only scratches the surface: we still want to know what our protein does & what it might look like. The first step is to examine the detailed family entry in InterPro, which may help to elucidate function. The next step is to examine the fold classification & structure summary resources, e.g., SCOP, CATH.

  21. Gene prediction, structure & function prediction are non-trivial: structure & function prediction tools are, at best, 70% accurate. What are the lessons for sequence analysis? • When searching for distant homologues, several dbs should be searched • Different methods provide different perspectives • Dbs aren’t complete & their contents don’t fully overlap • The more dbs searched, the more difficult it can be to interpret results
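
One way to keep multi-database results interpretable is to tally which families are reported by more than one source. The sketch below assumes the hits have already been mapped to a shared family vocabulary (in practice that mapping is itself non-trivial); the hit lists and the names `tally_hits` and `consensus` are invented for illustration.

```python
from collections import Counter


def tally_hits(hits_by_db):
    """Count how many databases reported each family."""
    counts = Counter()
    for hits in hits_by_db.values():
        counts.update(set(hits))  # set(): each database counts once per family
    return counts


def consensus(hits_by_db, min_dbs=2):
    """Families supported by at least min_dbs independent databases."""
    counts = tally_hits(hits_by_db)
    return sorted(fam for fam, n in counts.items() if n >= min_dbs)


# Hypothetical search results for one query sequence.
hits_by_db = {
    "PROSITE": ["kinase"],
    "PRINTS": ["kinase", "SH2"],
    "Pfam": ["kinase", "SH2", "SH3"],
}
print(consensus(hits_by_db))             # families seen in two or more dbs
print(consensus(hits_by_db, min_dbs=3))  # families seen in all three
```

A family found by only one database is not necessarily wrong, since the databases do not fully overlap; it simply deserves a closer look.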

  22. Thinking about your Topic. Can you identify what you already know about the topic, and what you do not know? Can you create questions based on these knowledge 'gaps', that is, can you identify your information needs? What do you need to know about your protein sequence? Develop a concept map to organise your ideas and structure your approach to the topic. Discuss your topic with others.

  23. Identifying the Type of Information you need • As well as thinking about your topic, you need to consider the type of information you will need. • Which information tools are best suited to your inquiry? • How much information do you need - to what degree of detail?

  24. Appreciate how difficult it is to draw a complex 3-D object and appreciate the complexity of the requirements for storing sequence and structural information of molecules in a database. • There are a lot of interrelated pieces of information about a biomolecule, such as: • sequence similarities • genome location • protein structure • expression • chemistry

  25. Not all the information on a molecule or sequence will be found in one record, nor even in one database. Be prepared to search several databases for information on your query sequence. As different organisations create databases to suit their own purposes, there will not be a great deal of similarity between these databases.

  26. Some of the obstacles to searching databases are: • Data formats are not standard. • The nomenclature is not standard. • There is more than one database offering the same information (data redundancy). • Links between databases may not be easy to follow. • The number of databases available makes it confusing to choose among them.

  27. Once you have found some information on your query sequence, you will find a new focus for your research by exploring any linked text in the databases:

  28. What function does the protein/mRNA/DNA have? • Do mutations occur and what are their effects? • Does it play a role in disease? • Homologies: Does it have the same function in different tissues? • Phylogenies: Does it have the same function in different organisms? • What role does structure play in the protein's function? • Does it have a similar function to other molecules with similar structure?

  29. Pitfalls of searching databases. Remember that you are looking for information about a molecule, not database records. • Duplication of information (even within the same database) • Links that are not always intuitive (or self-explanatory) • Nomenclature that is not always standard

  30. Accuracy or Validity: you need to determine whether the information is reliable or not

  31. Quality Control Issues. The quality of archived data is no better than the data determined in the contributing laboratories. Curation of the data can help to identify errors. Disagreement between duplicate determinations is a clear warning of an error in one or the other. Similarly, results that disagree with established principles may contain errors. It is useful, for instance, to flag deviations from expected stereochemistry in protein structures, but such “outliers” are not necessarily wrong.

  32. Data quality • Data consistency • Data models • Reliability • Evidence? • Level of confidence? • Assignment of function by similarity: recursive process → propagation of errors
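
The effect of recursive annotation can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that each transfer of an annotation by similarity is correct with the same probability and that errors are independent; neither assumption comes from the slides, and the 0.9 figure is an arbitrary example value.

```python
def chained_accuracy(per_transfer_accuracy, transfers):
    """Probability that an annotation is still correct after a chain of
    similarity-based transfers, assuming independent, equally reliable steps."""
    if not 0.0 <= per_transfer_accuracy <= 1.0:
        raise ValueError("per_transfer_accuracy must lie between 0 and 1")
    if transfers < 0:
        raise ValueError("transfers must be non-negative")
    return per_transfer_accuracy ** transfers


# Hypothetical example: 90% reliable transfers, copied along a chain.
for n in range(6):
    print(n, round(chained_accuracy(0.9, n), 3))
```

Under these assumptions, confidence decays geometrically with every copy, which is why the evidence behind an annotation matters as much as the annotation itself.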

  33. Data quality. It’s hard to judge whether something “makes sense”. The lack of labeling on many web pages makes it hard to know the source. Calculations based on databases are even harder to deal with. Logical deductions may be worse: • “tacR gene regulates the human nervous system” • “tacQ gene is similar to tacR but is found in E. coli” • “so tacQ gene regulates the E. coli nervous system”

  34. Who spotted it? E. coli nervous system

  35. Evaluating database records. In order for your research to be reliable, you must use reliable sources of information. It is important to evaluate the information you find in databases as you would any other type of information. In the case of sequencing research, however, peer review does not necessarily happen prior to publication.

  36. Significance. Appreciating that mathematical & biological significance are different is crucial. It is important in understanding the limitations of database search algorithms, multiple sequence alignment algorithms, pattern recognition techniques, and functional site & structure prediction tools. Contrary to popular opinion, there is currently still • no biologically-reliable automatic multiple alignment algorithm • no infallible pattern-recognition technique • no reliable gene, function or structure prediction algorithm

  37. Summary • Difficult questions on big data • Data and Information • Database and Databanks • Organise the data to provide a service • Visualization and Rendering • Keep it up-to-date • Provide a means to ask questions • Provide a useful service to a large and diverse scientific field

  38. Data & Information • Data: a collection of facts, e.g., X-coordinate, B-value, sequence • Information: acquired knowledge • Data within a scientific “context” • Meaning of the data • Sequence/structure alignment

  39. Databases & Databanks • Databank: a (usually large) collection of data • Database: a (usually large) set of data organized to allow rapid retrieval of information • Organized for a reason • Rapid retrieval: human short-term memory holds information for ~5 seconds

  40. WHAT IS THE PDB?

  41. Databanks and Databases • The PDB Archive is a “databank”: a series of flat files that have a format originally designed for Fortran card readers • The MSD, RCSB, and PDBj provide “databases”: collections of data (1000s of attributes) organized into relational tables and held in an RDBMS

  42. Data & information
     ATOM   2567  N   PHE B 175       7.821 -25.530 -22.848  1.00  8.71
     ATOM   2568  CA  PHE B 175       8.845 -25.172 -21.877  1.00  9.41
     ATOM   2569  C   PHE B 175       9.449 -23.798 -22.169  1.00 10.02
     ATOM   2570  O   PHE B 175      10.664 -23.613 -22.103  1.00 10.37
     ATOM   2571  CB  PHE B 175       9.928 -26.251 -21.848  1.00  9.53
     ATOM   2572  CG  PHE B 175      10.969 -26.137 -22.982  1.00 10.03
     ATOM   2573  CD1 PHE B 175      12.356 -25.819 -22.988  1.00 10.51
     ATOM   2574  CD2 PHE B 175      11.725 -27.211 -23.402  1.00 10.25
     ATOM   2575  CE1 PHE B 175      11.821 -27.095 -22.869  1.00 11.17
     ATOM   2576  CE2 PHE B 175      12.282 -26.086 -24.008  1.00 10.95
     ATOM   2577  CZ  PHE B 175      10.953 -26.335 -23.622  1.00 11.38
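
Turning such flat-file data into information starts with knowing what each column means. The following is a minimal sketch of reading one ATOM record, assuming the fixed-column layout of the legacy PDB format (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, B-value 61-66). The `Atom` type and `parse_atom_line` function are illustrative names, not part of any PDB software.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom_line(line):
    """Parse one fixed-column ATOM record (legacy PDB layout assumed)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )


# First record from the slide, laid out in fixed columns.
EXAMPLE = "ATOM   2567  N   PHE B 175       7.821 -25.530 -22.848  1.00  8.71"
atom = parse_atom_line(EXAMPLE)
print(atom.name, atom.res_name, atom.chain, atom.res_seq, atom.b_factor)
```

Slicing by column rather than splitting on whitespace matters because neighbouring fields in real files can run together with no space between them.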

  43. http://www.rcsb.org/pdb/ http://www.ebi.ac.uk/msd/ http://www.pdbj.org/ http://oca.ebi.ac.uk/oca-docs/oca-home.html http://srs.ebi.ac.uk/

  44. wwPDB are service providers • We provide a service to the scientific community 24/7 (almost): parallel DB with fail-over, etc. • Service “ping” baseline check several times/day • Data is incremented with new data weekly • Systems are extensible

  45. Query capabilities • Browsing (click and read) • Simple search: select records with some constraints • More elaborate search: select specific fields of some records with constraints on some fields • Complex querying: ability to return an answer that results from a "live" computation, and was not part of any record of the database
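
The three search levels can be illustrated against a toy relational table. This sketch uses Python's built-in `sqlite3` module with an in-memory database; the `atom` table and its column names are hypothetical and are not the schema of any wwPDB database. The four rows reuse values from the ATOM records shown earlier in this deck.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table holding a few atoms (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE atom (serial INTEGER PRIMARY KEY, name TEXT, "
        "res_name TEXT, chain TEXT, res_seq INTEGER, b_factor REAL)"
    )
    rows = [
        (2567, "N", "PHE", "B", 175, 8.71),
        (2568, "CA", "PHE", "B", 175, 9.41),
        (2569, "C", "PHE", "B", 175, 10.02),
        (2570, "O", "PHE", "B", 175, 10.37),
    ]
    conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


conn = build_demo_db()

# Simple search: whole records matching a constraint.
simple = conn.execute("SELECT * FROM atom WHERE name = ?", ("CA",)).fetchall()

# More elaborate search: specific fields, with a constraint on another field.
elaborate = conn.execute(
    "SELECT serial, b_factor FROM atom WHERE b_factor > ? ORDER BY serial",
    (10.0,),
).fetchall()

# Complex query: a value computed "live" that is stored in no record.
(mean_b,) = conn.execute(
    "SELECT AVG(b_factor) FROM atom WHERE res_seq = ?", (175,)
).fetchone()

print(simple, elaborate, mean_b)
```

The mean B-value in the last query appears in no stored record; it exists only as the result of the computation.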

  46. Interfaces • User interfaces: user-friendly, convenient browsing, intuitive query forms, visualization (graphical output) • Programmatic interfaces: communication with external programs, i.e., other databases (concept of distributed database) and analysis tools
