Linking Open Drug Data Susie Stephens, Principal Research Scientist, Eli Lilly

The Linked Data Cloud
Source: Chris Bizer

Linking Open Drug Data
• HCLSIG task started October 1, 2008
• Primary Objectives
  • Survey publicly available data sets about drugs
  • Publish and interlink these data sets on the Web
  • Explore interesting questions in competitive intelligence that could be answered if the data sets are linked
• Participants: Bosse Andersson, Chris Bizer, Kei Cheung, Don Doherty, Oktie Hassanzadeh, Anja Jentzsch, Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Susie Stephens, Jun Zhao

Assessment of Data Sources
Mark Sharp et al. A Framework for Characterizing Drug Information Sources. AMIA 2008

Published Data Sets
• LinkedCT (http://linkedct.org)
  • Online registry of more than 60,000 clinical trials
  • Published in XML
  • 7,011,000 triples (290,000 interlinking)
• DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank)
  • A repository of almost 5,000 FDA-approved drugs
  • Published as DrugBank DrugCards
  • 1,153,000 triples (23,000 interlinking)
• DailyMed (http://www4.wiwiss.fu-berlin.de/dailymed/)
  • High quality information about marketed drugs
  • Flat file representation
  • 124,000 triples (29,600 interlinking)
• Diseasome (http://www4.wiwiss.fu-berlin.de/diseasome)
  • Information about 4,300 disorders and disease genes linked by known disorder-gene associations
  • Published in XML
  • 88,000 triples (23,000 interlinking)
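
The triple counts for the published data sets can be compared by the share of each data set that consists of interlinking triples. The following is a minimal sketch of that arithmetic, using only the figures listed for the four data sets; the names `DATASETS` and `interlink_share` are illustrative and not part of the LODD work.

```python
# Triple counts as listed for each published data set:
# (total triples, interlinking triples)
DATASETS = {
    "LinkedCT": (7_011_000, 290_000),
    "DrugBank": (1_153_000, 23_000),
    "DailyMed": (124_000, 29_600),
    "Diseasome": (88_000, 23_000),
}


def interlink_share(total: int, interlinking: int) -> float:
    """Return the fraction of a data set's triples that are interlinks."""
    return interlinking / total


def summarize(datasets: dict) -> dict:
    """Map each data set name to its interlink share as a formatted percentage."""
    return {
        name: f"{interlink_share(total, links):.1%}"
        for name, (total, links) in datasets.items()
    }


if __name__ == "__main__":
    for name, share in summarize(DATASETS).items():
        print(f"{name:10s} {share}")
    total_triples = sum(total for total, _ in DATASETS.values())
    total_links = sum(links for _, links in DATASETS.values())
    print(f"All four   {total_triples:,} triples, {total_links:,} interlinking")
```

On these figures the two smaller data sets, DailyMed and Diseasome, are proportionally far more densely interlinked than LinkedCT or DrugBank, even though LinkedCT contributes the largest absolute number of links.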

Classes of Links
• Based on common identifiers
  • Links present in the source data sets
• Based on link discovery and record linkage techniques
  • String matching
    • E.g., “Alzheimer’s disease” in LinkedCT was matched with “Alzheimer_disease” in Diseasome
  • Semantic matching
    • E.g., “Varenicline” has the synonym “Varenicline Tartrate” and the brand names “Champix” and “Chantix”
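
The two link discovery approaches can be illustrated with a small sketch. This is not the matching code used by the LODD task (the slides do not describe it); it is a simplified illustration that assumes string matching works by normalizing labels before comparison and that semantic matching consults a synonym table. The function names and the `SYNONYMS` table are hypothetical, and the only data used are the two examples from the slide.

```python
import re


def normalize(label: str) -> str:
    """Lowercase a label, drop possessives, and turn separators into spaces."""
    text = label.lower().replace("\u2019", "'")  # curly apostrophe to straight
    text = re.sub(r"'s\b", "", text)             # "alzheimer's" -> "alzheimer"
    text = re.sub(r"[^a-z0-9]+", " ", text)      # underscores, punctuation -> space
    return text.strip()


def string_match(a: str, b: str) -> bool:
    """String matching: two labels match if their normalized forms are equal."""
    return normalize(a) == normalize(b)


# Hypothetical synonym table holding only the example from the slide.
SYNONYMS = {
    "Varenicline": {"Varenicline Tartrate", "Champix", "Chantix"},
}


def semantic_match(a: str, b: str, synonyms: dict = SYNONYMS) -> bool:
    """Semantic matching: two labels match if both name the same known entity."""
    na, nb = normalize(a), normalize(b)
    for canonical, names in synonyms.items():
        group = {normalize(canonical)} | {normalize(n) for n in names}
        if na in group and nb in group:
            return True
    return na == nb  # fall back to plain string matching


if __name__ == "__main__":
    print(string_match("Alzheimer\u2019s disease", "Alzheimer_disease"))
    print(semantic_match("Chantix", "Varenicline Tartrate"))
```

Exact comparison after normalization is the simplest possible choice; a real record linkage pipeline would more likely use approximate string similarity and curated terminology sources, at the cost of having to tune thresholds and review false matches.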

Business Use Case
• A neuroscience-focused business manager wants an update on new clinical trials by competitors on Alzheimer’s Disease (AD)
• A phase III trial by Pfizer for a drug called Varenicline has just been listed in LinkedCT
• More information of interest is found in DBpedia, DailyMed, and DrugBank
• DailyMed indicates the drug is already on the market for nicotine addiction and has minimal side effects
• DrugBank allows the manager to see the targets for Varenicline
• Diseasome, however, indicates that the corresponding genes are only implicated in nicotine addiction, rather than AD
• This suggests a more complex relationship between the diseases than just the drug target
• Extending the browsing to the SWAN Knowledgebase shows that there are hypotheses relating AD to nicotine receptors through amyloid beta

Technical Challenges
• Life sciences data is difficult to connect due to inconsistent terminology and the prevalence of synonyms and homonyms
• Refinement of tools and techniques to enable more automatic linking of entities across data sets
• Selection of ontologies to enable consistent mappings
• Development of a platform sufficiently robust to enable inferencing
• Provision of a user interface that supports browsing, querying, and filtering data
• Persuading data providers to publish in RDF would alleviate the need for us to update data, and would provide some of the interlinking

Next Steps
• Ensure that existing data are accurately and comprehensively linked
• Incorporate additional data sources into the LODD cloud that are of interest to competitive intelligence (e.g. Traditional Chinese Medicine)
• Use novel link discovery tools and frameworks including Silk and LinQuer
• Explore using SIOC to aggregate information about what patients are saying about drugs
• Submit paper to the iTriplify Challenge

Task Alignment
• LODD is looking to use Pharma Ontology’s work to help inform the mappings
• Data converted to RDF is also loaded into BioRDF’s HCLS KB

Conclusions
• Added 4 drug-related data sets into the cloud for competitive intelligence
• Will add further data sources to the LODD cloud to enable more insights to be gleaned
• Will continue to explore and test tools that are being developed for LOD
