Linking Open Drug Data
Susie Stephens, Principal Research Scientist, Eli Lilly
The Linked Data Cloud
Source: Chris Bizer
Linking Open Drug Data
• HCLSIG task started October 1, 2008
• Primary Objectives
  • Survey publicly available data sets about drugs
  • Publish and interlink these data sets on the Web
  • Explore interesting questions in competitive intelligence that could be answered if the data sets are linked
• Participants: Bosse Andersson, Chris Bizer, Kei Cheung, Don Doherty, Oktie Hassanzadeh, Anja Jentzsch, Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Susie Stephens, Jun Zhao
Assessment of Data Sources
Mark Sharp et al. A Framework for Characterizing Drug Information Sources. AMIA 2008
Published Data Sets
• LinkedCT (http://linkedct.org)
  • Online registry of more than 60,000 clinical trials
  • Published in XML
  • 7,011,000 triples (290,000 interlinking)
• DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank)
  • A repository of almost 5,000 FDA-approved drugs
  • Published as DrugBank DrugCards
  • 1,153,000 triples (23,000 interlinking)
• DailyMed (http://www4.wiwiss.fu-berlin.de/dailymed/)
  • High quality information about marketed drugs
  • Flat file representation
  • 124,000 triples (29,600 interlinking)
• Diseasome (http://www4.wiwiss.fu-berlin.de/diseasome)
  • Information about 4,300 disorders and disease genes linked by known disorder-gene associations
  • Published in XML
  • 88,000 triples (23,000 interlinking)
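As a rough way to compare how densely each data set is linked, the counts above can be turned into a link share. This is a minimal sketch that assumes the "interlinking" figure is a subset of the total triple count for each data set; the function names `link_share` and `summarize` are made up for illustration.

```python
# Triple counts as quoted on the slide: (total triples, interlinking triples).
# Assumption: the interlinking triples are counted within the total.
DATASETS = {
    "LinkedCT": (7_011_000, 290_000),
    "DrugBank": (1_153_000, 23_000),
    "DailyMed": (124_000, 29_600),
    "Diseasome": (88_000, 23_000),
}


def link_share(total, links):
    """Return interlinking triples as a percentage of all triples."""
    return 100.0 * links / total


def summarize(datasets):
    """Map each data set name to its link share, rounded to one decimal."""
    return {
        name: round(link_share(total, links), 1)
        for name, (total, links) in datasets.items()
    }


if __name__ == "__main__":
    for name, share in summarize(DATASETS).items():
        print(f"{name}: {share}% of triples are links")
    all_triples = sum(total for total, _ in DATASETS.values())
    all_links = sum(links for _, links in DATASETS.values())
    print(f"All four: {all_triples:,} triples, {all_links:,} interlinking")
```

Under that assumption, the two smaller data sets (DailyMed and Diseasome) come out far more densely linked than LinkedCT or DrugBank, even though LinkedCT contributes the most links in absolute terms.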
Classes of Links
• Based on common identifiers
  • Links present in the source data sets
• Based on link discovery and record linkage techniques
  • String matching
    • E.g., “Alzheimer’s disease” in LinkedCT was matched with “Alzheimer_disease” in Diseasome
  • Semantic matching
    • E.g., “Varenicline” has the synonym “Varenicline Tartrate” and the brand names “Champix” and “Chantix”
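The two matching styles above can be sketched in a few lines. This is only an illustration, not the method LODD used: the normalization rules and the `SYNONYMS` table are assumptions, seeded with the examples from the slide, and real link discovery tools apply far richer similarity measures.

```python
import re


def normalize(label):
    """Lowercase, drop a possessive 's, and turn punctuation into spaces."""
    text = label.lower().replace("\u2019", "'")  # curly to straight apostrophe
    text = re.sub(r"'s\b", "", text)             # "alzheimer's" -> "alzheimer"
    text = re.sub(r"[^a-z0-9]+", " ", text)      # "_" and similar -> space
    return text.strip()


def string_match(a, b):
    """String matching: labels are equal after normalization."""
    return normalize(a) == normalize(b)


# Hypothetical synonym table, filled with the slide's own example.
SYNONYMS = {
    "Varenicline": {"Varenicline Tartrate", "Champix", "Chantix"},
}


def semantic_match(a, b, synonyms=SYNONYMS):
    """Synonym-based matching: both labels name the same known drug."""
    if string_match(a, b):
        return True
    for canonical, names in synonyms.items():
        group = {normalize(canonical)} | {normalize(n) for n in names}
        if normalize(a) in group and normalize(b) in group:
            return True
    return False


if __name__ == "__main__":
    print(string_match("Alzheimer\u2019s disease", "Alzheimer_disease"))
    print(semantic_match("Chantix", "Varenicline Tartrate"))
```

String matching alone would miss the link between "Chantix" and "Varenicline", which is why the synonym lookup is kept as a separate step.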
Business Use Case
• A neuroscience-focused business manager wants an update on new clinical trials by competitors on Alzheimer’s Disease (AD)
• A phase III trial by Pfizer for a drug called Varenicline has just been listed in LinkedCT
• More information of interest is found in DBpedia, DailyMed, and DrugBank
  • DailyMed indicates the drug is already on the market for nicotine addiction and has minimal side effects
  • DrugBank allows the manager to see the targets for Varenicline
• Diseasome, however, indicates that the corresponding genes are only implicated in nicotine addiction, rather than AD
  • This suggests a more complex relationship between the diseases than just the drug target
• Extending the browsing to the SWAN Knowledgebase shows that there are hypotheses relating AD to nicotine receptors through amyloid beta
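The browsing path in this use case amounts to following links from a trial to its drug, from the drug to its targets, and from the targets to associated diseases. The sketch below mimics that walk over a toy in-memory triple list. Every identifier and predicate name in it is a placeholder invented for illustration, not a real LODD URI, and the single target is a stand-in rather than an actual Varenicline target.

```python
# Toy triples (subject, predicate, object). All names are hypothetical.
TRIPLES = [
    ("linkedct:trial1", "condition", "disease:alzheimer"),
    ("linkedct:trial1", "intervention", "drugbank:varenicline"),
    ("drugbank:varenicline", "target", "gene:placeholder1"),
    ("gene:placeholder1", "associatedDisease", "disease:nicotine_addiction"),
]


def objects(triples, subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def diseases_via_targets(triples, trial):
    """Follow trial -> drug -> target -> disease links."""
    found = set()
    for drug in objects(triples, trial, "intervention"):
        for target in objects(triples, drug, "target"):
            found |= objects(triples, target, "associatedDisease")
    return found


def unexplained_conditions(triples, trial):
    """Trial conditions that the drug's target genes do not account for."""
    conditions = objects(triples, trial, "condition")
    return conditions - diseases_via_targets(triples, trial)


if __name__ == "__main__":
    print(diseases_via_targets(TRIPLES, "linkedct:trial1"))
    print(unexplained_conditions(TRIPLES, "linkedct:trial1"))
```

In this toy graph the trial's condition is left unexplained by the target links, which mirrors the point in the use case where the manager has to look further afield (for example, in the SWAN Knowledgebase) for a connection.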
Technical Challenges
• Life sciences data is difficult to connect due to inconsistent terminology and the prevalence of synonyms and homonyms
• Refinement of tools and techniques for enabling more automatic linking of entities across data sets
• Selection of ontologies to enable consistent mappings
• Development of a sufficiently robust platform to enable inferencing
• Provision of a user interface that supports browsing, querying, and filtering data
• Persuading data providers to publish in RDF, which would alleviate the need for us to update data and would provide some of the interlinking
Next Steps
• Ensure that existing data are accurately and comprehensively linked
• Incorporate additional data sources into the LODD cloud that are of interest to competitive intelligence (e.g., Traditional Chinese Medicine)
• Use novel link discovery tools and frameworks including Silk and LinQuer
• Explore using SIOC to aggregate information about what patients are saying about drugs
• Submit paper to the iTriplify Challenge
Task Alignment
• LODD is looking to use Pharma Ontology’s work to help inform the mappings
• Data converted to RDF is also loaded into BioRDF’s HCLS KB
Conclusions
• Added 4 drug-related data sets into the cloud for competitive intelligence
• Will add further data sources to the LODD cloud to enable more insights to be gleaned
• Will continue to explore and test tools that are being developed for LOD