
Patrick Glenisson

What's in a word? Term-based approaches across bioinformatics, scientometrics and knowledge management. Patrick Glenisson. Bio-informatics group, Dept Electrical Engineering, K.U.Leuven, Belgium. Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium. Introduction. Text mining.


Presentation Transcript


  1. What's in a word? Term-based approaches across bioinformatics, scientometrics and knowledge management. Patrick Glenisson. Bio-informatics group, Dept Electrical Engineering, K.U.Leuven, Belgium. Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium

  2. Introduction

  3. Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering • Bio-informatics research • Research on algorithms and software development for: • clinical bioinformatics • gene regulation bioinformatics • Text mining • Gibbs sampling • Graphical models • Classification & clustering

  4. Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering • Bio-informatics research • Text mining research • Combine statistical approaches with domain-specific requirements • Knowledge discovery through literature analysis in various domains: • Bio-informatics • Sciento- & Technometrics • Knowledge management

  5. Overview • Bio-informatics: • gene profiling • multi-view learning • Scientific trend mapping • clustering and bibliometric indicators • Innovation & Spillovers • Tracing of persons in science & technology spaces 25’ 5-10’

  6. Overview • Text mining goals: Information Retrieval, Document analysis & Extraction of tokens, Information Extraction • Text mining methodology: Shallow Statistics, Shallow Parsing, Full NLP parsing • Overall approach: Domain-specific, Problem-specific, Generic

  7. Case 1: Literature & biological data

  8. protein

  9. ‘Post-genome’ biology • focus shift: • from single gene to gene groups • complex interactions within cellular environment • microarrays measure the simultaneous activity: [Figure: gene expression measurement matrix of genes G1, G2, G3, .. by conditions C1, C2, C3, .., with gene annotations and sample annotations]

  10. [Figure: Expression data (genes × conditions) → Clustering → Interpretation]

  11. [Figure: Expression data (genes × conditions); gene annotations and relations encoded as free text; gene expression databases; PRIOR INFORMATION → Integrated analysis]

  12. Hence, 2 views: • Text analysis for interpretation (supportive role) • Text analytics for ‘inference’ (active role)

  13. GO • cell proliferation • heparin binding • growth factor activity GeneRIF • 12133521: VEGF is associated with the development and prognosis of colorectal cancer. • 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression. • 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex. A ‘historical’ quote: ‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)

  14. Increased awareness • Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems. • Structured vocabularies are on the rise • GO • MeSH • eVOC • Standards are systematically being adopted to store biological concepts or annotations: • HUGO for gene names • GOA • …

  15. Vector space model [Figure: a gene as a vector over vocabulary terms T1, T2, T3; vocabulary (GOF)] • Document processing • Remove punctuation & grammatical structure (‘bag of words’) • Define a vocabulary • Identify multi-word terms (e.g., tumor suppressor) (phrases) • Eliminate low-content words (e.g., and, gene, ...) (stopwords) • Map words with the same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighting scheme and/or transformations (tf-idf, SVD, ..) • index
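The document-processing steps on this slide can be sketched as a small tf-idf index. This is a minimal illustration, not the pipeline used in the talk: the stopword list, the crude plural stripping (standing in for real stemming), and the toy gene texts are all assumptions, and phrase detection and synonym mapping are left out.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (assumption); a real one would be domain-specific.
STOPWORDS = {"and", "the", "of", "in", "is", "a", "by", "with", "gene"}

def tokenize(text):
    """Bag of words: lowercase, drop punctuation and stopwords, strip plural 's'."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOPWORDS:
            continue
        # Very crude stand-in for stemming: remove a trailing plural 's'.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        tokens.append(word)
    return tokens

def tfidf_index(docs):
    """docs: {gene: text}. Returns {gene: {term: weight}} with unit-length vectors."""
    counts = {gene: Counter(tokenize(text)) for gene, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    index = {}
    for gene, term_counts in counts.items():
        vec = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        index[gene] = {t: w / norm for t, w in vec.items()} if norm else vec
    return index

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy texts (made up for illustration).
docs = {
    "geneA": "tumor suppressor regulates cell proliferation",
    "geneB": "cell proliferation and tumor growth factors",
    "geneC": "heparin binding protein in plasma",
}
index = tfidf_index(docs)
```

With this toy index, `cosine(index["geneA"], index["geneB"])` is positive because the two texts share terms, while `geneA` and `geneC` share none.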

  16. Validity of gene index • Genes that are functionally related should be close in text space: • Text-based coherence score • Modeled wrt a background distribution of • through random and permuted gene groups

  17. Validity of gene index • Genes that are functionally related should be close in text space:

  18. Validity of gene index • Genes that are functionally related should be close in text space:
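One way to make "close in text space" testable is a coherence score compared against random gene groups. The sketch below assumes the score is the mean pairwise cosine similarity of a group and that the background comes from randomly drawn groups of the same size; the slides do not give the exact definition, and the toy index is made up.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def coherence(group, index):
    """Mean pairwise cosine similarity of the genes in a group (assumed score)."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def empirical_p_value(group, index, n_random=1000, seed=0):
    """Compare the group's coherence with that of random groups of equal size."""
    rng = random.Random(seed)
    genes = sorted(index)
    observed = coherence(group, index)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(genes, len(group))
        if coherence(sample, index) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_random + 1)

# Toy gene index (hypothetical weights): g1..g3 share vocabulary, g4..g6 do not.
index = {
    "g1": {"kinase": 1.0, "signal": 0.5},
    "g2": {"kinase": 0.9, "signal": 0.4},
    "g3": {"kinase": 0.8, "membrane": 0.2},
    "g4": {"ribosome": 1.0},
    "g5": {"heparin": 1.0},
    "g6": {"histone": 1.0},
}
observed, p_value = empirical_p_value(["g1", "g2", "g3"], index)
```

A functionally related group should score well above most random groups, which shows up as a small empirical p-value.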

  19. Optimal number of clusters? Define ‘optimal’? Text-based scoring • Data-centered statistical scores • Coherence vs separation of clusters • Stability of a cluster solution when leaving out data [Figure: clusters C1, C2, C3]

  20. Optimal number of clusters? Define ‘optimal’? • Data-centered statistical scores • Knowledge-based scores • Enrichment of GO annotations in clusters • Literature-based scoring
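Enrichment of an annotation in a cluster is commonly assessed with a one-sided hypergeometric test. The slides do not say which test was used, so the following is a sketch under that assumption; the function and parameter names are illustrative.

```python
from math import comb

def hypergeom_enrichment_p(n_genes, n_annotated, cluster_size, n_hits):
    """P(X >= n_hits) for X ~ Hypergeometric.

    n_genes:      total number of genes in the experiment
    n_annotated:  genes carrying the annotation (e.g. one GO term)
    cluster_size: number of genes in the cluster
    n_hits:       annotated genes observed inside the cluster
    """
    upper = min(n_annotated, cluster_size)
    total = comb(n_genes, cluster_size)
    tail = sum(
        comb(n_annotated, i) * comb(n_genes - n_annotated, cluster_size - i)
        for i in range(n_hits, upper + 1)
    )
    return tail / total

# Example: 1000 genes, 40 annotated, a cluster of 50 containing 8 annotated genes.
p = hypergeom_enrichment_p(1000, 40, 50, 8)
```

A small p-value means the cluster contains more annotated genes than expected by chance; with many GO terms tested per cluster, a multiple-testing correction would normally follow.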

  21. Collaborative gene filtering

  22. TXTGate • a platform that offers multiple ‘views’ on the vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications • incorporates term-based indices .. • .. and uses them as a starting point • to explore the text through the eyes of different domain vocabularies • to link out to other resources by query building, or • to sub-cluster genes based on text.

  23. Term-centric Gene-centric Domain vocabularies as ‘views’

  24. Query building to external DB

  25. Features of the approach • Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies • … that allow some level of interoperability with external annotation databases • Sub-clustering gene groups is useful to detect biological sub-patterns • Reasonably robust to corrupted groups • Gene index normalizes for unbalanced references

  26. Text analysis for interpretation (supportive role) • Text analytics for ‘inference’ (active role)

  27. Meta-clustering text & data • As multiple information sources are available when analyzing gene expression data, we pose the question: “How can we analyze data in an integrated fashion to extract more information than from the expression data alone?” ..

  28. Mathematical integration

  29. Integration of text & data • In each information space • Appropriate preprocessing • Choice of distance measures

  30. Combine data: • confidence attributed to either of the two data types • in case of distance, we can see it as a scaling constant between the norms of the data- and text representations.
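The combination formula itself did not survive in the transcript. A common form that matches the description (a confidence weight that also acts as a scaling constant between the two representations) is a convex combination of the two distance matrices; the sketch below assumes that form, and the weight name `lam` is hypothetical.

```python
def combine_distances(d_expr, d_text, lam=0.5):
    """Weighted sum of two distance matrices over the same genes (assumed form).

    lam is the confidence placed in the expression data; (1 - lam) goes to text.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [lam * e + (1.0 - lam) * t for e, t in zip(row_expr, row_text)]
        for row_expr, row_text in zip(d_expr, d_text)
    ]

# Toy 2x2 distance matrices on different scales.
d_expr = [[0.0, 2.0], [2.0, 0.0]]
d_text = [[0.0, 0.5], [0.5, 0.0]]
combined = combine_distances(d_expr, d_text, lam=0.25)
```

Because the two matrices live on different scales, the effective influence of each source depends on both `lam` and the typical magnitude of its distances.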

  31. However, the distributions of distances introduce a bias → scaling problem • Therefore, use a technique from statistical meta-analysis (so-called omnibus procedure) [Figure: expression distance histogram; text distance histogram]
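The slide names only "a so-called omnibus procedure". One classical example is Fisher's method for combining p-values, used here as an assumption: each distance is first converted to an empirical left-tail p-value within its own space, which removes the scale difference, and the two p-values are then combined.

```python
import math
from bisect import bisect_right

def empirical_p(value, sorted_background):
    """Fraction of background distances <= value (small distance -> small p)."""
    rank = bisect_right(sorted_background, value)
    return max(rank, 1) / len(sorted_background)  # never return 0, so log() is safe

def fisher_combine(p1, p2):
    """Fisher's omnibus statistic for two p-values, mapped back to a p-value.

    X = -2 (ln p1 + ln p2) follows a chi-square law with 4 degrees of freedom
    under the null; its survival function is exp(-X/2) * (1 + X/2).
    """
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

def integrated_distance(d_expr, d_text, expr_background, text_background):
    """Scale-free combined dissimilarity between two genes."""
    return fisher_combine(
        empirical_p(d_expr, expr_background),
        empirical_p(d_text, text_background),
    )

# Toy backgrounds on very different scales (made-up values, already sorted).
expr_bg = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
text_bg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
near = integrated_distance(10.0, 0.1, expr_bg, text_bg)
far = integrated_distance(100.0, 1.0, expr_bg, text_bg)
```

Gene pairs that are close in both spaces get a small combined value regardless of the raw units of either distance.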

  32. Optimal k? Various cutoffs k of the cluster tree [Figure: M-score for integrated clustering vs M-score for expression data only]

  33. A peek inside

  34. A peek inside [Figure: Text Profile and Expression Profile; strong reinforcement]

  35. Case 2: Sciento- & technometrics

  36. Mapping of Science • Journal ‘Scientometrics’ • Full-text articles • Document cluster analysis • Co-word mapping • Temporal dimension: clusters over time
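Co-word mapping starts from counts of how often two terms appear in the same document. The sketch below shows only that counting step, on made-up term lists; how the talk turned such counts into a map is not stated in the transcript.

```python
from collections import Counter
from itertools import combinations

def coword_counts(docs_terms):
    """Count, for every unordered term pair, the documents containing both terms."""
    pair_counts = Counter()
    for terms in docs_terms:
        # Use each term once per document and a fixed order for the pair key.
        for a, b in combinations(sorted(set(terms)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Toy documents, already reduced to index terms (hypothetical).
docs_terms = [
    ["citation", "impact", "journal"],
    ["citation", "journal"],
    ["patent", "citation"],
]
counts = coword_counts(docs_terms)
```

Computing these counts per publication year would give one co-word matrix per period, which is one simple way to follow clusters over time.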

  37. Mapping of Science • Coupling with bibliometric indicators; • Based on reference (hyperlink) information • Mean reference Age • Nr Serials

  38. Domain studies in Patent space Similarities ‘Seed’ patent 30 technology classes

  39. User profiling & Author-Inventor linkage • Name resolution • Same persons (variants, mistakes) • Different persons (similar initials, or even full name) Van Veldhoven Veldhoven, Van Van Veldhoven Vanveldhoven Wim Van Veldhoven Walter Van Veldhoven Wim Van Veldhoven Wim Van Veldhoven
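The "same person, different spelling" half of name resolution can be approximated by grouping names on a normalized key. This is a minimal sketch with an assumed normalization (undo "Last, First" inversion, strip accents, case and spacing); it cannot separate different people who share a name, which needs additional evidence such as content.

```python
import re
import unicodedata
from collections import defaultdict

def name_key(name):
    """Normalized key: reorder 'Last, First', strip accents, case and spacing."""
    parts = [part.strip() for part in name.split(",")]
    reordered = " ".join(reversed(parts))
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", reordered).lower()
    return re.sub(r"[^a-z]", "", decomposed)

def group_variants(names):
    """Group raw name strings that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)

names = ["Van Veldhoven", "Veldhoven, Van", "Vanveldhoven",
         "Wim Van Veldhoven", "Walter Van Veldhoven"]
groups = group_variants(names)
```

Here the three surname-only variants collapse into one group, while the entries with different first names stay apart.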

  40. Content-based name matching • Detect spillovers and entrepreneurial activities at (e.g.) university-level • Matching of ‘inventors’ & ‘authors’ is time-consuming → semi-automated approach: [Figure: relevance ranking between Patent DB and Publication DB]

  41. Acknowledgements • Steunpunt O&O Statistieken: Debackere K, Glänzel W • ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D • ESAT / BioI: Moreau Y, De Moor B

  42. Thanks! Questions? Contact info: Patrick.glenisson@econ.kuleuven.be
