Visualization & Data mining

feryal

  1. Visualization & Data mining

  2. Visualisation
     The process of representing abstract data to aid in understanding its meaning. Not to be confused with rendering data (drawing pictures). Typically, though, we render data in such a way as to visualise the information within it.

  3. Introduction
     Biological data comes from, and is of interest to:
     • Chemists: reaction mechanism, drug design
     • Biologists: sequence, expression, homology, function
     • Structural biologists: atomic structure, fold, classification, function
     • Medicine: clinical effect
     • Education
     • Media
     Presentation of diverse information to a diverse audience. Each has their own point of view (context):
     • Expert = scientist working within their own field of expertise
     • Non-expert = scientist using data/information outside their field
     • Novice = non-scientist

  4. Web pages
     These are notoriously badly designed, often leaving the information on the site unusable.
     • The front page should load quickly
     • The main point should appear on the first full screen
     • Clutter: not logically laid out
     • Too busy: cannot find the salient point
     • 8% of men and 0.5% of women are colour blind
     • Bad text/fonts
     • Too often it doesn’t work
     • The user will go somewhere else
     • The latest whizz-bang stuff only works on the latest browsers
     • Only works in one browser: they only tested on one
     • Does not conform to standard HTML
     • Not just presentation of results
     • Google is a good design

  5. Asking questions
     • Biological data is very complex
     • Chemistry, biology, physics, statistics, medicine…
     • Most users will be from a different field
     • Asking the right question is difficult
     • The user cannot use the correct terminology
     • Too many things to query (2000 attributes in MSD)
     • SQL: not suitable for most users
     • Interface too complex
     • Too many check boxes, widgets etc.
     • Trying to be too clever
     • The “Go” button is buried somewhere

  6. Result presentation
     • Biological data is complex
     • Chemistry, physics, biology, statistics, medicine…
     • Expert users want all the detail
     • I.e. they want to use a specific method
     • They want all the details
     • They want (I hope) the statistical validity of the results
     • The non-expert wants the best-practice answer returned within their own context
     • They want comparative analysis with other fields
     • They want to know the results are valid

  7. Query design
     • The simple text box design is very common
     • Suitable for text queries
     • Only one logic: AND or OR
     • Predefined
     • Easy to use
     • Limited scope
     • 2000 attributes -> 2000 check-boxes!

  8. Query design
     • Graphical interface
     • Multiple logic: AND/OR/NOT
     • Under the user’s control
     • Slower
     • Steep learning curve
     • Some users just cannot get it
     • Intuitive once mastered
     • Pretty

  9. Query design
     • Figurative
     • 2D sketch for 3D query (active sites)
     • Informative: presents meaning for the question
     • Slower
     • Less error prone
     • HIS|SER:S/H>C2.0
     • HIS.ne2:S/S>C2.0
     • HIS.[n]/T>C2.0

  10. YAMGP (yet another molecular graphics program)
      • Many different programs are available:
        VMD, AstexViewer@MSD-EBI, LigPlot, Quanta, InsightII, Bobscript, WebMol, Frodo, iMol, Chime, Grasp, Pymol, POVRay, Spock, Rasmol, Mage, Raster3D, Yasara, Molscript, Chimera, O, MolMol, Whatif, XtalView, WebLab-viewer, Swiss-PDBviewer

  11. Result visualisation
      Multiple types of biological data:
      • Textual data
      • 3D structure
      • 2D chemical sketches
      • 1D sequence
      • Node linked
      • General/derived data
      • Web pages
      • Time
      • Errors/variance
      • Patented!

  12. Visualisation: AstexViewer@MSD-EBI
      • Lensing
      • Linked views
      • Brushing
      • Picking
      • Flying views
      • Hyperbolic distortion
      • Animation
      • Solid rendering
      • Depth cues
      • Colour, lighting
      • Highlighting
      • Etc…
      • Structure/sequence/data

  13. Visualisation: comparative analysis
      • Similarity/difference
      • Data superposition
      • Attribute display
      • Colour, size…
      • Correlation
      • Attribute mapping
      • Sequence colour by structure alignment
      • Analysis example

  14. Animation
      • Time-dependent display
      • Reaction chemistry
      • Visual clues
      • Expression data
      • Shown as…
      • Rotation
      • Flash on/off
      • Object synchronization
      • Size, colour…
      • Sound: NO, incredibly annoying
      • Animation example

  15. Multidimensional analysis
      • Comparative analysis on multiple data
      • E.g. Phi, Psi, B-value, Omega
      • 1D & 2D are easy
      • 3D graphs are difficult to see
      • 4D requires 3D + iso-surfaces
      • Higher: too busy
      • Use 2D + multiple properties
      • SPOTFIRE is the best known
      • Use: X/Y/colour/size/shape…
      • Interactive bracketing
      • Example

  16. Visualization: Summary
      • Rendering data is not visualization
      • Not just the display of results
      • Huge array of non-specific techniques, and an entire scientific field!

  17. Data mining
      • “Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” (Hyperdictionary)
      • “True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.” (IBM)

  18. Data mining & Data analysis
      Traditional analysis is “verification-driven analysis”
      • Requires a hypothesis of the desired information (target)
      • Requires correct interpretation of the proposed query
      Discovery-driven data mining
      • Finds data with common characteristics
      • Results are ideal solutions to discovery
      • Finds results without a previous hypothesis
      • Results have unbiased mean and variance

  19. So what is hypothesis-driven data analysis?
      • Define a target = hypothesis
      • Search for the target
      • There are/are not “hits”
      • Verify/negate the hypothesis
      • Distribution is centred on the target
      Ways to define and search for a target:
      • “catalytic triad”: text string matching
      • Atomic coordinates: coordinate superposition
      • Mathematical graph: graph matching
      • HIS, ASP, SER: data hierarchy knowledge

  20. Four types of data mining
      • Creation of predictive models: future data expectation
      • Link analysis: connections between data objects
      • Database segmentation: classification
      • Deviation detection: finding outliers
      Source: IBM white papers
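As an illustration of the fourth type, deviation detection, here is a minimal sketch that flags values lying far from the mean in units of the sample standard deviation. The function name `find_outliers`, the threshold and the sample numbers are assumptions made for this example; they do not come from the slides or from IBM's material.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=3.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values identical: nothing can deviate.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one obvious anomaly.
sample = [20.1, 22.4, 19.8, 21.0, 23.3, 20.7, 95.0, 21.9]
print(find_outliers(sample, threshold=2.0))
```

With a small sample a single extreme value inflates the standard deviation, so a lower threshold (or a median-based measure) is often needed in practice.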

  21. So what is this data mining?
      Given multiple sets of primary data (dependent variables)
      • Characters, numbers, function(numbers), …
      Find anomalies
      • Too many: numerical occurrence
      • Data variation: derivatives
      • Singularities
      • …
      Correlations and clusters
      • Within primary data
      • With other data (independent variables)
      Finds new things! But not what it means!

  22. E.g.
      • Retail and financial industries are heavily into DM
      • A well-known US food supermarket chain found a correlation:
      • Babies’ nappies
      • Beer
      • 5pm on Friday
      Wife rings husband: “get some nappies for the weekend”. Husband takes the opportunity to buy some beer!
      You won’t get grant funding to test this hypothesis!

  23. Self/Cross data mining
      • Most mining software looks for correlations between dependent variables
      • Rainfall, temperature, cloud cover
      • It rains when it is cloudy
      • Free: http://www.cs.waikato.ac.nz/~ml/
      • Bioinformatics usually involves anomalies within data objects
      • Sequence clusters (sequence fingerprints)
      • Local coordinate clusters (active sites)
      • Global coordinate clusters (folds)
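The rainfall/cloud-cover case can be sketched as a pairwise correlation scan over named columns. This is a minimal illustration using the Pearson coefficient; the function names and the weather numbers are hypothetical, and real mining tools use a much wider range of measures.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def correlation_table(columns):
    """Correlation for every pair of named columns, keyed by (name_a, name_b)."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Hypothetical daily observations.
weather = {
    "cloud_cover": [0.1, 0.8, 0.9, 0.3, 0.7],
    "rainfall": [0.0, 5.0, 7.5, 0.5, 4.0],
    "temperature": [24.0, 16.0, 15.0, 22.0, 18.0],
}
for pair, r in correlation_table(weather).items():
    print(pair, round(r, 2))
```

The scan reports that cloud cover and rainfall move together, but, as the slides stress, it says nothing about why.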

  24. Data mining: not idiot proof
      • Date of birth and age will give 100% correlation
      • Authors of a structure submission will be correlated with authors on the primary citation
      • “Lysozyme” is the most common fold pattern
      • 36 spellings of E. coli will mask results
      • Requires representative sets
      • Statistically valid ones too!
      • Signal/noise ratio is a problem
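The spelling problem is usually handled by normalising free-text fields before mining. The sketch below is one simple way to do that; the alias table and function names are hypothetical and cover only a couple of variants, not the 36 spellings mentioned above.

```python
import re
from collections import Counter

# Hypothetical alias table: normalised key -> canonical organism name.
ALIASES = {
    "ecoli": "Escherichia coli",
    "escherichiacoli": "Escherichia coli",
}


def normalise_key(name):
    """Lower-case the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def canonical(name):
    """Map a free-text organism name to its canonical form when an alias is known."""
    return ALIASES.get(normalise_key(name), name.strip())


raw = ["E.Coli", "E. coli", "e coli", "ESCHERICHIA COLI", "Homo sapiens"]
print(Counter(canonical(n) for n in raw))
```

Without this step the four E. coli variants would be counted as four separate organisms and any signal tied to them would be split four ways.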

  25. Discovery-driven data mining of the PDB
      Analysis of 3-dimensional coordinates
      Defined common patterns of atomic interactions locally
      • DB segmentation: active sites & common packing features
      • Link analysis: similarity between different functional groups
      Defined globally
      • DB segmentation: common patterns of super-secondary structure
      • Link analysis: common folds in diverse protein families
      • Outlier detection: unique folds

  26. Issues
      Systematic “error” propagates as solution
      • 300 lysozyme structures return as a strong solution
      Results cannot be found below the noise level
      • Need to characterise the noise level
      • Need to improve signal/noise ratio (S/N) to see information
      Target is not biologically defined
      • It does not give you the biological answer
      • Results should reproduce known biology
      • Can give you new results not previously observed

  27. Data selection
      Cannot leave in 300 lysozyme structures!
      • Select by sequence similarity at 70% exact alignment
      Different “phase space” to select data
      • Remove structures with resolution < 2.5 Å
      • Remove NMR (different statistics)
      • Remove pre-1982 etc.
      Geometrical analysis criteria to check for outliers
      • Using properties NOT target parameters of structure solution
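The metadata filters above can be sketched as a simple selection pass. Everything here is illustrative: the `Entry` record, its field names and the sample entries are hypothetical, and the code assumes that "resolution < 2.5 Å" means discarding structures poorer than 2.5 Å (numerically larger values). The 70% sequence-similarity step needs an alignment tool and is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical summary record for one deposited structure."""
    entry_id: str
    method: str                  # e.g. "X-RAY" or "NMR"
    resolution: Optional[float]  # in angstroms; None when not applicable
    year: int


def select_entries(entries, max_resolution=2.5, min_year=1982):
    """Keep crystal structures at or better than max_resolution, deposited in min_year or later."""
    kept = []
    for e in entries:
        if e.method == "NMR":
            continue  # different statistics
        if e.resolution is None or e.resolution > max_resolution:
            continue  # assumption: larger value = poorer resolution
        if e.year < min_year:
            continue  # pre-1982 entries are dropped
        kept.append(e)
    return kept


# Made-up identifiers and values, for illustration only.
entries = [
    Entry("aaa1", "X-RAY", 1.8, 1995),
    Entry("aaa2", "X-RAY", 3.1, 1998),
    Entry("aaa3", "NMR", None, 2000),
    Entry("aaa4", "X-RAY", 2.0, 1979),
]
print([e.entry_id for e in select_entries(entries)])
```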

  28. Local atomic interactions
      Data
      • Function(3D coordinates) = distance
      • Atom names (independent variable)
      • Residue names (independent variable)
      Create a 3D hash table of triplets of distances(*) between “points”
      • This is the dependent variable
      • Order = 3
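A rough sketch of the triplet hash table follows. It is not the author's implementation: the bin width, the distance cutoff, the sorting of the three distances and the function names are all assumptions chosen to keep the example short. Each triplet of points is keyed by its three binned inter-point distances, so geometrically similar triplets land in the same bucket regardless of where they sit in space.

```python
from collections import defaultdict
from itertools import combinations
from math import dist  # Python 3.8+


def triplet_key(p, q, r, bin_width=0.5):
    """Hash key: the three pairwise distances, sorted and binned."""
    distances = sorted((dist(p, q), dist(q, r), dist(p, r)))
    return tuple(int(d // bin_width) for d in distances)


def build_triplet_table(points, max_dist=6.0, bin_width=0.5):
    """Map each binned distance triplet to the index triples that produce it."""
    table = defaultdict(list)
    for (i, p), (j, q), (k, r) in combinations(enumerate(points), 3):
        if max(dist(p, q), dist(q, r), dist(p, r)) > max_dist:
            continue  # only local interactions are of interest
        table[triplet_key(p, q, r, bin_width)].append((i, j, k))
    return table


# Two congruent 3-4-5 triangles placed far apart (made-up coordinates).
points = [
    (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0),
    (100.0, 100.0, 100.0), (103.0, 100.0, 100.0), (100.0, 104.0, 100.0),
]
for key, members in build_triplet_table(points).items():
    print(key, members)
```

Sorting the distances discards which atom is which; a fuller version would carry the atom and residue names alongside each triple as the independent variables. Fixed-width binning also splits triplets that fall either side of a bin edge, which is one source of the noise discussed earlier.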

  29. Local atomic interactions
      Merge triplets
      • Any pair of N-fold interactions forms an (N+1)-fold interaction if they have (N-1) equivalences
      • Order = N
      • Just keep going until no more (N+1)-fold interactions are found
      • Time = 8 seconds to find ~2000 interactions (Digital Alpha ES40)
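The merge rule can be sketched as repeated set unions. This is a minimal reading of the slide, assuming that "equivalence" means two interactions share a member point; the function names are hypothetical and the real method presumably also checks geometry and atom types.

```python
from itertools import combinations


def merge_once(interactions):
    """From N-member sets, build every (N+1)-member set formed by two sets sharing N-1 members."""
    merged = set()
    for a, b in combinations(interactions, 2):
        if len(a) == len(b) and len(a & b) == len(a) - 1:
            merged.add(a | b)
    return merged


def grow(triplets):
    """Return the interactions found at each order, starting from order 3."""
    levels = []
    current = {frozenset(t) for t in triplets}
    while current:
        levels.append(current)
        current = merge_once(current)  # empty once nothing more can be merged
    return levels


# Point indices are made up for illustration.
levels = grow([(1, 2, 3), (1, 2, 4), (7, 8, 9)])
for order, found in enumerate(levels, start=3):
    print(order, sorted(sorted(s) for s in found))
```

The loop ends when a pass produces no new sets, which mirrors "keep going until no more (N+1)-fold interactions are found".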

  30. Catalytic quartet

  31. Electrostatic interaction
      Ligands are found close by rather than associated with the residues

  32. Iron binding site

  33. Double disulphide

  34. N-linked glycosylation binding site + spot the non-sugar
      This glycosylation site is the same as the active site found in “1a53”: indole-3-glycerol phosphate synthase

  35. Summary
      • Nearly all bioinformatics is based on hypothesis-driven data analysis
      • Data mining has lost its meaning within bioinformatics
      Discovery-driven data analysis (true data mining):
      • Can find unknown dependencies, clusters, outliers
      • Is based on statistical probability
      • Returns distributions unbiased by previous ideas
      Information technology may be better for genomes (1D)
      • “A numerical measure of the uncertainty of an outcome”
      • Information content of gene sequences can be defined by the normalized probability of finding “words” within that sequence
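One possible reading of the final point is Shannon entropy over fixed-length words (k-mers). The sketch below takes that reading as an assumption; the slides do not give a formula, and the function name and word length are chosen only for illustration.

```python
from collections import Counter
from math import log2


def word_entropy(sequence, k=3):
    """Shannon entropy, in bits, of the overlapping k-letter words in a sequence."""
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not words:
        raise ValueError("sequence is shorter than the word length")
    total = len(words)
    counts = Counter(words)
    # Normalised probability of each word, then -sum(p * log2(p)).
    return -sum((c / total) * log2(c / total) for c in counts.values())


print(word_entropy("AAAAAAAA", k=2))  # a single repeated word carries no uncertainty
print(word_entropy("ACGT", k=1))      # four equally likely letters
```

Low entropy marks repetitive regions, while values near the maximum for a given k indicate sequence that is hard to distinguish from random at that word length.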
