

  1. Semantic Web Data Interoperation in Support of Enhanced Information Retrieval and Datamining in Proteomics
Andrew Smith, Thesis Defense, 8/25/2006
Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner (UC Berkeley)

  2. Outline
• Problem Description
• Enhanced information retrieval, datamining
• LinkHub – supporting system
• Biological identifiers and their relationships
• Semantic web RDF graph, RDF query languages
• Cross-database queries
• Combined relational / keyword-based search
• Enhanced automated information retrieval
• The web
• PubMed (biomedical scientific literature)
• Empirical performance evaluation for yeast proteins
• Related Work and Conclusions

  3. Web Information Management and Access – Opposing Paradigms
• Search engines over the web
  • Automated
  • People can publish in natural languages → flexible
  • Vast coverage (almost the whole web)
  • Currently the preeminent paradigm because of the vast size of the web and its unstructured heterogeneity
  • Unfortunately, gives only coarse-grained topical access; no real cross-site interoperation / analysis
• Semantic Web
  • Very fine-grained data modeling and connection
  • Supports very precise cross-resource query / question answering
  • Unfortunately, requires much more manual intervention; people must change how they publish (RDF, OWL, etc.) → thus, limited acceptance and size

  4. Combining Search and Semantic Web Paradigms • The two paradigms are largely independent • They have complementary strengths and weaknesses • Key idea: these two approaches to web information management and retrieval can work together; there are interesting, practical, and useful ways for each to leverage and enhance the other

  5. Combining Relational and Keyword-based Search Access to Free Text Documents • Consider the query “all documents containing information for proteins which are members of the Pfam Adenylate Kinase (ADK) family” • Standard keyword-based search engines cannot support this • Relational information about the documents is required, i.e. that they are related to particular proteins in the Pfam ADK family

  6. Using Semantic Web for Enhanced Automated Information Retrieval • Basic idea: the semantic web provides detailed information about terms and their interrelationships, which can be used to improve web searches for those terms (and related terms) • As proof of concept, the particular, practical problem we address is finding additional relevant documents for proteomics identifiers on the web or in the scientific literature

  7. Finding Additional Relevant Documents for Proteomics Identifiers • Why not just do a web search directly for the identifier, e.g. “P26364”? → likely not a good result • Conflated senses of the identifier text, e.g. product catalog codes, etc. • There might be synonyms of the identifier → these should be searched too (the semantic web can provide them) • Many important, relevant documents might not directly mention the identifier • E.g. pages about “cancer pathways” that never mention a specific cancer-pathway-related protein identifier • Should search for important related concepts • Potentially much extra information is available from the semantic web → use it!

  8. Example Cross-database Queries • For yeast protein interactions, find corresponding, evolutionarily related (homologous) protein interactions in Worm • Requires 4 databases: UniProt, SGD, PFAM, WormBase • Explore pseudogenes for yeast essential genes; Explore pseudogenes of human homologues of yeast essential genes • Requires 4 databases: PseudoGene, SGD, UniProt, PFAM

  9. Characteristics of Biological Data • Biological data, especially proteomics data, is the motivation and domain of focus. • Vast quantities of data from high throughput experiments (genome sequencing, structural genomics, microarray, etc.) • Huge and growing number of biological data resources: distributed, heterogeneous, large size variance. • Practical domain to work in - need for better interoperation • Challenging but not overly complex. • Rich semantic relationships among data resources, but often not made explicit.

  10. Problem Description • Integration and interoperation of structured, relational data (particularly proteomics data) that is: • Large-scale • Widely distributed • Independently maintained • In support of important applications: • Enhanced information retrieval / web search • Cross-database queries for datamining

  11. Data Heterogeneity • Lack of standard detailed description of resources • Data are exposed in different ways • Programmatic interfaces • Web forms or pages • FTP directory structures • Data are presented in different ways • Structured text (e.g., tab delimited format and XML format) • Free text • Binary (e.g., images)

  12. Classical Approaches to Interoperation: Data Warehousing and Federation • Data Warehousing • Focuses on data translation • Translate data to a common format under a unified schema; cross-reference, store, and query in a single machine/system • Federation • Focuses on query translation • Translate and distribute the parts of a query across multiple distinct, distributed databases and collate their results into a single result

  13. General Strategy for Interoperation • Data is vast, distributed, independently maintained • Complete centralized data-warehousing integration is impractical • Must rely on federated, cooperative, loosely coupled solutions which allow partial and incremental progress • Widely used and supported standards necessary • Semantic Web is excellent fit to these needs

  14. Semantic Web Resource Description Framework (RDF) Models Data as a Directed Graph with Objects and Relationships Named by URI
[RDF graph figure: the resource http://www.ncbi.nlm.nih.gov/SNP has a dc:creator edge to http://www.ncbi.nlm.nih.gov and a dc:language edge to the literal "en".]
The above RDF graph in XML (a truncated dc:date element in the transcript is omitted):
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"/>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
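To illustrate how such RDF is consumed programmatically, here is a minimal sketch using Python with the rdflib library (the tooling choice is an assumption for illustration; the slides do not prescribe it):

```python
# Minimal sketch (assumption: Python + rdflib; not from the original slides).
# Parses the RDF/XML above and iterates over the resulting triples.
from rdflib import Graph

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"/>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""

g = Graph()
g.parse(data=RDF_XML, format="xml")
for subj, pred, obj in g:           # each statement is a (s, p, o) triple
    print(subj, pred, obj)
```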

  15. YeastHub – Lightweight RDF Data Warehouse
[Architecture diagram: Resource1 … Resourcen expose data as XML or RDF; converters (DOM/SAX, D2RQ, XSLT) produce RDF1 … RDF3, which are loaded into an RDF database (Sesame) and queried with RDQL by users/agents.]

  16. Name / ID Proliferation Problem • Identifiers for biological entities are a simple but key way to identify and interrelate entities; an important “scaffold” for biological data • But there are often many synonyms for the same entity, e.g. strange legacy names such as the fly gene called “sonic hedgehog” • Even simple syntactic variants can be cumbersome, e.g. GO:0008150 vs. GO0008150 vs. GO-8150 (see the normalization sketch below) • Not just synonyms: there are many kinds of relationships and one-to-many mappings • Known relationships among entities are not always stored, or are stored in non-standard ways • The implicit overall data structure is an enormous, elaborate graph of relationships among biological entities
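The syntactic-variant problem lends itself to simple normalization. A hypothetical sketch (the regex and the canonical zero-padded form are assumptions, not LinkHub's actual code):

```python
# Hypothetical sketch: normalize syntactic variants of GO identifiers
# (GO:0008150, GO0008150, GO-8150, ...) to one canonical form.
import re

def normalize_go_id(raw: str) -> str:
    m = re.fullmatch(r"GO[:\-]?0*(\d+)", raw.strip(), flags=re.IGNORECASE)
    if not m:
        raise ValueError(f"unrecognized GO identifier: {raw!r}")
    return f"GO:{int(m.group(1)):07d}"   # GO IDs are zero-padded to 7 digits

assert normalize_go_id("GO-8150") == "GO:0008150"
assert normalize_go_id("go0008150") == "GO:0008150"
```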

  17. Complex Biological Relationships

  18. LinkHub • The major proteomics hub UniProt performs centralized identifier mapping for large, well-known databases • Large staff, resource intensive, manual curation → a centralization bottleneck; not viable as a complete solution • Just simple mappings, no relationship types • Many data resources are not covered (e.g., small, transient, local, lab-specific, “boutique” ones) • Need for a system / toolkit enabling local, collaborative integration of data → let people create “mini UniProts” and connect them → LinkHub • Practically, LinkHub provides a common “links portal” into a lab’s resources, also connecting them to larger resources • LinkHub is used this way for the Gerstein Lab and NESG, connecting them to the major proteomics hub UniProt

  19. Major / Minor Hubs and Spokes Federated Model • LinkHub instances act as local minor hubs that connect groups of common resources, with a single common connection to major hubs; a more efficient organization of biological data

  20. LinkHub Web GUI – Links Portal

  21. Example YeastHub / LinkHub Queries
• Query 1: Finding Worm “Interologs” of Yeast Protein Interactions
  • For each yeast gene in an interacting pair, find the corresponding WormBase genes: yeast gene name → UniProt accession → Pfam accession → UniProt accession → WormBase ID
• Query 2: Exploring Pseudogene Content versus Gene Essentiality in Yeast and Humans
  • yeast gene name → UniProt accession → yeast pseudogene
  • yeast gene name → UniProt accession → Pfam accession → human UniProt ID → UniProt accession → pseudogene LSID

  22. Combining Relational and Keyword-based Search Access to Free Text Documents • Consider again the query “all documents containing information for proteins which are members of the Pfam Adenylate Kinase (ADK) family” • Standard keyword-based search engines cannot support this • Relational information about the documents is required, i.e. that they are related to proteins in the Pfam ADK family • LinkHub attaches documents to identifier nodes and supports such relational query access to them

  23. LinkHub Path Type Queries • View all paths in the LinkHub graph matching specific relationship types, e.g. “family views”: • PDB ID → UniProt ID → Pfam family → UniProt ID → PDB ID → MolMovDB motion • NESG ID → UniProt ID → Pfam family → UniProt ID → NESG ID • Practically used as a secondary, orthogonal interface to other databases • MolMovDB and NESG’s SPINE both use LinkHub for such “family views” (a path-matching sketch follows below)
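A minimal sketch of such typed-path matching over an identifier-mapping graph (the data structures, function names, and example identifiers are illustrative assumptions, not LinkHub's implementation):

```python
# Hypothetical sketch: enumerate paths whose node types follow a pattern,
# e.g. PDB -> UniProt -> Pfam -> UniProt (a "family view" prefix).
from collections import defaultdict

edges = defaultdict(set)   # node -> neighbors
node_type = {}             # node -> identifier namespace

def add_mapping(a, type_a, b, type_b):
    node_type[a], node_type[b] = type_a, type_b
    edges[a].add(b)
    edges[b].add(a)        # identifier mappings are navigable both ways

def match_paths(start, type_pattern):
    """Yield every path from `start` whose node types follow `type_pattern`."""
    if node_type.get(start) != type_pattern[0]:
        return
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) == len(type_pattern):
            yield path
            continue
        wanted = type_pattern[len(path)]
        for nxt in edges[path[-1]]:
            if node_type.get(nxt) == wanted and nxt not in path:
                stack.append(path + [nxt])

# Illustrative adenylate-kinase mappings (for demonstration only).
add_mapping("1ZIN", "PDB", "P69441", "UniProt")
add_mapping("P69441", "UniProt", "PF00406", "Pfam")
add_mapping("P26364", "UniProt", "PF00406", "Pfam")
print(list(match_paths("1ZIN", ["PDB", "UniProt", "Pfam", "UniProt"])))
# -> [['1ZIN', 'P69441', 'PF00406', 'P26364']]
```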

  24. LinkHub Subgraphs as Gold-standard Training Sets for Enhanced Automated Information Retrieval • The LinkHub subgraph emanating from a given identifier, together with the web pages (hyperlinks) attached to the identifiers in that subgraph, is concrete, accurate extra information about the identifier that can be used to improve document retrieval for it • These subgraphs and their associated documents are used as training sets to build classifiers that rank documents obtained from the web or the scientific literature

  25. [Figure: a LinkHub subgraph around a central identifier; edges carry weights (e.g. 1.0, 0.75, 0.5) and nodes A–G accumulate scores (e.g. 0.85, 0.78, 0.4) that decay along the paths.] Training-set docs are scaled down based on distance and link types from the central identifier
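A minimal sketch of this downweighting, assuming a node's weight is the product of the edge weights along its best path from the central identifier (the decay rule is an assumption consistent with the figure; the identifiers and weights are illustrative):

```python
# Hypothetical sketch: weight each subgraph node (and its attached training
# documents) by the product of edge weights on the best path from the
# central identifier.
def node_weights(edges, central):
    """edges: {node: [(neighbor, edge_weight), ...]} -> {node: weight}."""
    weights = {central: 1.0}
    frontier = [central]
    while frontier:
        node = frontier.pop()
        for neighbor, w in edges.get(node, []):
            candidate = weights[node] * w
            if candidate > weights.get(neighbor, 0.0):  # keep best path only
                weights[neighbor] = candidate
                frontier.append(neighbor)
    return weights

# Illustrative identifiers and weights (not from the actual figure).
edges = {"P26364": [("PF00406", 0.5), ("GO:0004017", 0.75)],
         "PF00406": [("P69441", 1.0)]}
print(node_weights(edges, "P26364"))
# {'P26364': 1.0, 'PF00406': 0.5, 'GO:0004017': 0.75, 'P69441': 0.5}
```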

  26. Term Frequency-Inverse Document Frequency (TF-IDF) Word Weighting • Vector space model: documents are modeled as vectors of word weights, where the weights come from TF-IDF • TF: frequently occurring words in a query are more likely semantically meaningful • IDF: words occurring in fewer corpus documents are more discriminating • Document frequency D: the number of docs in the corpus (of N total docs) containing a term • IDF = log(N / D)
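A self-contained sketch of this weighting (a generic TF-IDF formulation; the thesis's exact variant may differ):

```python
# Generic TF-IDF sketch; the thesis's exact formulation may differ.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> one {term: weight} vector per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # doc frequency D
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["kinase", "adenylate", "kinase"], ["kinase", "yeast"], ["yeast"]]
print(tfidf_vectors(docs)[0])   # rare 'adenylate' outweighs common 'kinase'
```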

  27. Classifier for Document Relevance Reranking • Use standard information retrieval techniques: tokenization, stopword filtering, word stemming, TF-IDF term weighting, and cosine similarity • Classifier model: add the weighted subgraph documents’ word vectors, TF-IDF weight the result, and keep the top-weighted 20% of terms • The standard cosine similarity value scores documents against the classifier (sketched below)
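A sketch of this classifier pipeline: summing weighted vectors, keeping the top 20% of terms, and cosine scoring. The input vectors are assumed to be TF-IDF weighted already, and the helper names are illustrative:

```python
# Sketch of the slide's classifier: sum the (down)weighted document vectors,
# keep the top 20% of terms, score candidates with cosine similarity.
import math

def build_classifier(weighted_vectors, keep=0.2):
    """weighted_vectors: [(doc_weight, {term: tfidf_weight}), ...]."""
    profile = {}
    for doc_weight, vec in weighted_vectors:
        for term, value in vec.items():
            profile[term] = profile.get(term, 0.0) + doc_weight * value
    top = sorted(profile, key=profile.get, reverse=True)
    top = top[: max(1, int(len(top) * keep))]   # keep top-weighted terms
    return {t: profile[t] for t in top}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

clf = build_classifier([(1.0, {"kinase": 1.1, "adenylate": 0.8}),
                        (0.5, {"kinase": 0.9, "yeast": 0.4})])
print(cosine(clf, {"adenylate": 1.0, "kinase": 0.5}))
```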

  28. Obtaining Documents to Rank • Use major web search engines via their web APIs; for demo purposes we used Yahoo • Perform individual base searches on: • the top 40 training-set feature words • the identifiers in the subgraph • Combine all results into one large result set • Rerank the combined result set using the constructed classifier (see the sketch below) • Essentially, this systematically explores the “concept space” around the identifier • The searches returning the most relevant docs on average could be called semantic signatures • Key concepts related to the given identifier • Succinct snippets of what the identifier is “about”
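As a plumbing sketch, the spawn-merge-rerank loop might look like the following; `web_search` is a stand-in for a real search-engine API, all names here are hypothetical, and the `cosine` helper is reused from the previous sketch:

```python
# Hypothetical plumbing for slide 28: spawn base searches, merge, rerank.
# `web_search` is a stand-in for a real search-engine API.
def web_search(query, limit=20):
    """Stand-in: should return [(url, token_list), ...] from a search API."""
    raise NotImplementedError("wire a real search API in here")

def gather_and_rerank(classifier, top_terms, subgraph_ids, to_vector):
    queries = top_terms[:40] + subgraph_ids        # base searches (slide 28)
    seen, pool = set(), []
    for q in queries:
        for url, tokens in web_search(q):
            if url not in seen:                    # combine into one result set
                seen.add(url)
                pool.append((url, tokens))
    scored = [(cosine(classifier, to_vector(tokens)), url)
              for url, tokens in pool]
    return sorted(scored, reverse=True)            # rerank by classifier score
```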

  29. Example: UniProt P26364 • Query: P26364 → spawned searches → results • Note: a direct Yahoo search for P26364 returned very poor results; on manual inspection, 17 of 40 results clearly had nothing to do with the UniProt protein, and many of the others did not seem very useful (large tabular dumps of identifiers, etc.) • The first clearly unrelated result in LinkHub’s results was at position 72; LinkHub’s results are arguably better

  30. PubMed Application • PubMed is a database of biomedical scientific literature citations covering roughly the last 50-100 years • Currently there is no automated information retrieval over PubMed for biological identifier-related citations • Built an app to search for related PubMed abstracts, using Swish-e to index PubMed and provide base search access to it

  31. PubMed search for UniProt P26364 • Manual annotations exist (above); only citations 3 and 4 are directly related • The LinkHub-based automated method ranked these at positions 13 and 7 and returned many more relevant docs

  32. Empirical Performance Tests • The preceding results seemed reasonably good, but can we empirically measure performance? • Use a gold-standard set of documents (a curated bibliography) known to be related to particular identifiers • gene_literature.tab from the Saccharomyces Genome Database (SGD)

  33. Goals of Performance Tests • Quantify the performance level of the procedure • Is performance close to optimal, or is there lots of room for improvement? • How can we know that adding in (downweighted) documents for related identifiers actually helps? • Proof of concept: PFAM and GO are key related concepts for proteins; let’s objectively see if they help • Quantify the performance of a particular enhancement: the pre-IDF step

  34. Pre-Inverse Document Frequency (pre-IDF) Step • Idea: maximally separate all proteomics identifiers’ classifiers while making them specifically relevant and discriminating • Determine document frequencies for all pages of a type, e.g. all (or a sample of) UniProt pages • Perform IDF against the type’s document frequencies and then again against the corpus being searched, e.g. first UniProt, then PubMed (sketched below)
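A sketch of the two-pass weighting; exactly how the two IDF passes compose is not specified on the slide, so multiplying them is an assumption:

```python
# Sketch of the pre-IDF step. Composing the two IDF passes by multiplication
# is an assumption; the slide only says IDF is applied twice.
import math
from collections import Counter

def doc_freqs(docs):
    """docs: list of token lists -> (corpus size N, {term: doc frequency D})."""
    return len(docs), Counter(t for d in docs for t in set(d))

def pre_idf_weights(doc, type_docs, corpus_docs):
    """TF * IDF(page type, e.g. UniProt) * IDF(search corpus, e.g. PubMed)."""
    n_type, df_type = doc_freqs(type_docs)
    n_corp, df_corp = doc_freqs(corpus_docs)
    tf = Counter(doc)
    # Terms appearing in every page of the type get IDF 0 and drop out,
    # which is exactly the separation the pre-IDF step is after.
    return {t: tf[t]
               * math.log(n_type / df_type.get(t, 1))
               * math.log(n_corp / df_corp.get(t, 1))
            for t in tf}
```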

  35. Pre-IDF Step is Generally Useful • For example, imagine wanting to find web pages highly relevant to a particular digital camera or MP3 player • CNET has many pages about different digital cameras and MP3 players → document frequencies for these • Build a classifier for a particular digital camera by first doing the pre-IDF step against the document frequencies of all CNET digital camera pages

  36. Experimental Protocol • Pick a few hundred random yeast proteins from TrEMBL and Swiss-Prot separately • Each with at least 20 citations and both GO and PFAM relations • A protein’s citations are its “in” group • Other proteins’ citations are the “mid” group • Randomly selected PubMed citations (not in gene_literature.tab or UniProt) are the “out” group • The classifier should rank in the order “in” – “mid” – “out” • But focus on “in”, the most important • The degree of deviation from this order is the objective test

  37. Performance Measure: ROC Curves • Sensitivity / specificity tradeoff: shows true positive rate vs. false positive rate • Depicts classifier performance without regard to class distribution or error costs • Area under the curve (AUC) is a single summary measure: 1 is max, 0 is min • AUC is the probability that a randomly chosen positive is ranked higher than a randomly chosen negative
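A minimal sketch computing AUC directly from that probabilistic definition (generic code, not the thesis's evaluation harness):

```python
# AUC from the slide's definition: the probability that a randomly chosen
# positive outranks a randomly chosen negative (ties count half).
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3]))   # 5 of 6 pairs won -> 0.8333...
```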

  38. Measuring Performance • Full AUC and top-5% AUC (.05 AUC) • It is how well you do at the top of the rankings that really matters • UniProt page weight set to 1.0; optimize over the other parameters at 0.1 granularity: • PFAM and GO weights • percentage of features kept • use / don’t use the pre-IDF step • Compare average AUC values for the various parameter values • Determine statistical significance with paired t-tests (see the sketch below)
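The significance test might look like this sketch, with per-protein AUC pairs fed to a paired t-test (the scipy tooling and the numbers are illustrative assumptions):

```python
# Hypothetical sketch: paired t-test over per-protein AUCs from two settings
# (e.g. with vs. without pre-IDF). Numbers are illustrative, not thesis data.
from scipy.stats import ttest_rel

auc_with_pre_idf = [0.95, 0.91, 0.88, 0.97, 0.90]
auc_without_pre_idf = [0.90, 0.87, 0.86, 0.93, 0.85]

t_stat, p_value = ttest_rel(auc_with_pre_idf, auc_without_pre_idf)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # small p -> significant
```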

  39. Results • Percentage of features kept didn’t really matter • Use 0.4 or 0.5 for computational efficiency • Pre-IDF gave the largest performance boost • Any trial with pre-IDF gave a better result than any without, regardless of the other parameters • Pre-IDF plus PFAM at a small weight increased average AUC across all trials, for both TrEMBL and Swiss-Prot • GO did not help (PFAM is more information-rich) • TrEMBL was helped more than Swiss-Prot • Swiss-Prot is curated, high quality, and complete, whereas TrEMBL is automated and lower quality → this makes sense • All comparisons for TrEMBL were statistically significant by t-tests; only pre-IDF was for Swiss-Prot

  40. Mean AUC Results [table of mean AUC values for the TrEMBL and Swiss-Prot trials; the values are not recoverable from the transcript]

  41. Related Work: Search Engines / Google • A small number of search terms is entered • So little information to go by → maybe millions of result documents, hard to rank well • Good ranking was a big problem for early search engines • Google (Brin and Page 1998) provided a popular solution • Hyperlinks act as “votes” for importance • Rank the web pages with the most and best votes highest

  42. Alternative to Millions of Result Documents: Increased Search Precision • Increase search precision so that a smaller, more manageable number of documents is returned • More search terms • Problem: users are lazy and won’t enter many terms • Solution: the semantic web provides the increased precision → users just select semantic web nodes for concepts, and the nodes are automatically expanded to increase search precision • This is what LinkHub does

  43. Very Recent Related Work: Aphinyanaphongs et al. 2006 • Argues for specialized, automated filters for finding relevant documents in the huge and ever-expanding scientific literature • Constructed classifiers for predicting the relevance of PubMed documents to various clinical medicine themes • State-of-the-art SVM classifiers • Used large, manually curated, respected bibliographies for training • Used text from the article title, abstract, journal name, and MeSH terms as features

  44. Aphinyanaphongs et al. 2006 cont. • LinkHub-based search, by contrast: • Fairly basic classifier model: word weight vectors compared with cosine similarity • Small training sets (UniProt, GO, PFAM pages) • Also fairly noisy: web pages vs. focused text • Only abstract text used as features • Some gene_literature.tab citations are only generally relevant → true performance is understated • Classifiers are built automatically and easily at very large scale as a natural byproduct of LinkHub • Yet LinkHub’s 0.927 and 0.951 AUCs are better than, or only negligibly smaller than, the 0.893, 0.932, and 0.966 AUCs of Aphinyanaphongs et al. 2006

  45. Aphinyanaphongs et al. 2006 and Citation Metrics • Also compared to citation-based metrics • Citation count • Journal impact factor • Indirectly, Google PageRank • SVM classifiers outperformed these, and adding citation metrics as features gave marginal improvement at best • Surprising, given Google’s success with PageRank

  46. Aphinyanaphongs et al. 2006 and Citation Metrics cont. More generally stated, the conceivable reasons for citation are so numerous that it is unrealistic to believe that citation conveys just one semantic interpretation. Instead, citation metrics are a superimposition of a vast array of semantically distinct reasons to acknowledge an existing article. It follows that any specific set of criteria cannot be captured by a few general citation metrics; only focused filtering mechanisms, if attainable, would be able to identify articles satisfying the specific criteria in question. Conclusion: Aphinyanaphongs et al. 2006 is consistent with, and supports, LinkHub’s general approach of creating specialized filters (in the form of word weight vectors) for retrieving documents specific to particular proteomics identifiers. Their approach is arguably state of the art for focused tasks, superior to the most commonly used search technology (Google); by extension, so is LinkHub-based search.

  47. Publications
• LinkHub: a Semantic Web System for Efficiently Handling Complex Graphs of Proteomics Identifier Relationships that Facilitates Cross-database Queries and Information Retrieval. Andrew K. Smith, Kei-Hoi Cheung, Kevin Y. Yip, Martin Schultz, Mark B. Gerstein. International Workshop on Semantic e-Science (SeS2006), September 3, 2006, Beijing, China; co-located with ASWC2006. To be published in a special proceedings of BMC Bioinformatics.
• YeastHub: a semantic web use case for integrating data in the life sciences domain. KH Cheung, KY Yip, A Smith, R Deknikker, A Masiar, M Gerstein (2005) Bioinformatics 21 Suppl 1: i85-96.
• An XML-Based Approach to Integrating Heterogeneous Yeast Genome Data. KH Cheung, D Pan, A Smith, M Seringhaus, SM Douglas, M Gerstein. 2004 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS); pp 236-242.
• Network security and data integrity in academia: an assessment and a proposal for large-scale archiving. A Smith, D Greenbaum, SM Douglas, M Long, M Gerstein (2005) Genome Biol 6: 119.
• Computer security in academia - a potential roadblock to distributed annotation of the human genome. D Greenbaum, SM Douglas, A Smith, J Lim, M Fischer, M Schultz, M Gerstein (2004) Nat Biotechnol 22: 771-2.
• Impediments to database interoperation: legal issues and security concerns. D Greenbaum, A Smith, M Gerstein (2005) Nucleic Acids Res 33: D3-4.
• Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis. CS Goh, N Lan, SM Douglas, B Wu, N Echols, A Smith, D Milburn, GT Montelione, H Zhao, M Gerstein (2004) J Mol Biol 336: 115-30.

  48. Acknowledgements • Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner • Kei Cheung • Michael Krauthammer • Kevin Yip • Yale Semantic Web Interest Group • National Library of Medicine (NLM)

  49. (Very) Brief Proteomics Overview The Central Dogma of Biology states that the coded genetic information hard-wired into DNA (i.e. the genome) is transcribed into individual transportable cassettes composed of messenger RNA (mRNA); each mRNA cassette contains the program for synthesis of a particular protein (or a small number of proteins). Proteins are the key agents in the cell. Proteomics is the large-scale study of proteins, particularly their structures and functions. While the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment. One organism has radically different protein expression in different parts of its body, in different stages of its life cycle, and under different environmental conditions.

  50. Proteins are modular, composed of domains that are grouped into families (based on evolution)
