
EVS Data Curation



Presentation Transcript


  1. EVS Data Curation: The processing and publication of data for web browsing and programmatic access

  2. Data Curation Flowchart

  3. Gene Ontology and Zebrafish
     • Downloaded as OBO from web sites
     • Processed with C++ program into Ontylog XML – OBO2TDE.exe
     • Processed with C++ program into OWL – ontyxToOWL.exe
     • Loaded using LoadNCIThesOWL.sh
     • Metadata loaded using LoadMetadata
     • Hierarchy and Sources manually edited

  4. HL7 and VA_NDFRT
     • Retrieved from sources
     • Processed by Apelon into Ontylog XML
     • Loaded into LexBIG using LoadNCIThesOwl and manifest
     • Metadata loaded using LoadMetadata

  5. MGED
     • OWL file downloaded from source web site
     • Loaded into Protégé
     • Classified
     • Inferred version exported as OWL file
     • Loaded into LexBIG using LoadNCIThesOwl
     • Metadata loaded using LoadMetadata
     • Hierarchy and Sources manually edited

  6. Snomed, MedDRA and LOINC
     • Extracted from the UMLS into RRF files
     • Loaded into LexBIG using LoadUMLSFiles
     • Metadata loaded using LoadMetadata

  7. UMLS Semnet
     • Downloaded from UMLS Semnet web site
     • Loaded using LoadUMLSSemnet
     • Metadata loaded using LoadMetadata

  8. Metathesaurus
     • Loaded from UMLS into MEME
     • NCI Thesaurus imported monthly
     • Other vocabularies added or removed
     • NCI-specific edits made to data and relations
     • Exported as RRF
     • Imported to LexBIG using LoadNCIMeta
     • Metadata loaded using LoadMetadata

  9. Preparing TDE Thesaurus for MEME
     • Thesaurus Ontylog XML baseline is processed through C++ app publishMEME.exe
     • Current baseline is compared to the previous one to get a summary of new properties or roles
     • Summary is used to create the import configuration file
     • Baseline is imported into MEME

  10. Preparing Thesaurus for MEME

  11. NCI Thesaurus from TDE
      • Edited in TDE and exported to Ontylog XML by name
      • Run through publishTDE to remove unpublishable properties
      • Run through OntyxToOwl.exe to create OWL file by code
      • Loaded into LexBIG using LoadNCIThesOWL
      • Metadata loaded using LoadMetadata
      • History generated from TDE baseline
      • History loaded using LoadNCIHistory

  12. NCI Thesaurus from TDE

  13. NCI Thesaurus from Protégé
      • Run OWL through application to get Ontylog XML by name
      • Run Ontylog XML through publishTDE to remove unpublishable properties
      • Run through OntylogtoOWL to get OWL by code
      • Generate history from the Ontylog XML

  14. NCI Thesaurus History Processing
      • evs_history records concept modifications made in the editor
      • These records are extracted monthly to consolidate them and to remove identifying information
      • Cleaned records are loaded into concept_history
      • Full concept_history is loaded into LexBIG for NCI Thesaurus
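The "remove identifying information" step could look roughly like the sketch below. It is not the EVS implementation. It assumes the nine-field, pipe-delimited record layout that appears in the log.out sample later in this transcript (id, code, name, action, timestamp, modeler, host, reference, and a trailing flag); the field names, the `HistoryRecord` type, and the function name are hypothetical.

```python
from typing import NamedTuple


class HistoryRecord(NamedTuple):
    """Cleaned record: no modeler name and no host name (hypothetical type)."""
    record_id: str
    code: str
    action: str
    edit_time: str
    reference: str


def strip_identifying_fields(raw_line: str) -> HistoryRecord:
    # Assumed layout: id|code|name|action|timestamp|modeler|host|reference|flag
    fields = raw_line.rstrip("\n").split("|")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    record_id, code, _name, action, edit_time, _modeler, _host, reference, _flag = fields
    # Keep only the fields that do not identify a person or a machine.
    return HistoryRecord(record_id, code, action.lower(), edit_time, reference)
```

How the real process consolidates multiple edits to the same concept is not described on the slide, so the sketch leaves that out.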

  15. History

  16. TDE to DTS

  17. log.out
      New concepts created through Create or Split actions:
      C72675|Feet_First
      .
      Concepts merged into other concepts:
      C17841|Oncologic_Surgeon
      .
      Retired concepts (including merged):
      C17841|Oncologic_Surgeon
      .
      New concepts not found in BSLN2:
      C73140|Ethaverine_
      .
      Retired concepts not found in BSLN2
      C73401|Maqui_Berry_Flavor
      .
      Modify records correponding to Retired_Kind are discarded:
      667487|C62920|Medical_Device_Unsafe_to_Use|Modify|2008-03-05 …
      .
      Modify records correponding to new codes are discarded:
      666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 …
      .
      Modify records correponding to merged codes are discarded:
      668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0
      .
      Records correponding to codes not found in BSLN2 are discarded:
      671933|C73140|Ethaverine_|New|2008-03-19 12:03:01.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
      .
      WARNING: New codes created, then retired, but still found in BSLN2: (to be edited manually)
      C72675|Feet_First
      .
      List of all remaining records
      .
      List of all discarded records:
      666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 09:02:56.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
      .
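The discard rules named in this log can be read as a simple filter. The sketch below is one interpretation, not the actual history tool: it assumes records have already been reduced to (code, action) pairs, and that the sets of new, retired, and merged codes and the BSLN2 code list are available as inputs. All names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, str]  # (concept code, action), a simplified assumed shape


def split_records(
    records: Iterable[Record],
    new_codes: Set[str],
    retired_codes: Set[str],
    merged_codes: Set[str],
    baseline_codes: Set[str],
) -> Tuple[List[Record], List[Record]]:
    """Return (remaining, discarded) following the rules listed in log.out."""
    remaining: List[Record] = []
    discarded: List[Record] = []
    no_modify = new_codes | retired_codes | merged_codes
    for code, action in records:
        if code not in baseline_codes:
            # Records for codes not found in the baseline are discarded.
            discarded.append((code, action))
        elif action.lower() == "modify" and code in no_modify:
            # Modify records for new, retired, or merged codes are discarded.
            discarded.append((code, action))
        else:
            remaining.append((code, action))
    return remaining, discarded
```

The WARNING case in the log (codes created, then retired, but still present in BSLN2) is flagged for manual editing on the slide, so it is deliberately not automated here.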

  18. tde_history_report.txt
      Spilanthes_oleracea (Code: C72446)
      Number of modelers: 3
      Modeler: shaiu
      Modeler: thomas
      Modeler: creech
      Modeler: shaiu
      Action: modify time: 2008-03-05 05:03:58.0
      Modeler: thomas
      Action: modify time: 2008-03-06 02:03:05.0
      Action: modify time: 2008-03-14 10:03:06.0
      Modeler: creech
      Action: modify time: 2008-03-06 02:03:06.0
      ------------------------------------------------------------------
      .
      Edited actions for the following concepts are discarded:
      Concept codes requiring manual review:

  19. DTS_history
      • DTS_history_script.sql
        insert into concept_history(concept, editaction, editdate, reference) values ('C72675', 'create', '28-MAR-08', null);
        insert into concept_history(concept, editaction, editdate, reference) values ('C72676', 'create', '28-MAR-08', null);
        .
        .
      • DTS_history_out.txt
        666540|C72675|create|28-MAR-08|(null)
        666541|C72676|create|28-MAR-08|(null)
        666542|C62171|modify|28-MAR-08|(null)
        .
        .
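The two files on this slide carry the same information in different shapes. As a rough illustration, a row of DTS_history_out.txt can be turned into the matching statement in DTS_history_script.sql as follows. The five-field layout (id, concept, action, date, reference) is inferred from the sample rows, and the function name is hypothetical; how the real script is generated is not stated.

```python
def to_insert_statement(row: str) -> str:
    """Build one concept_history insert from an 'id|concept|action|date|reference' row."""
    _record_id, concept, action, edit_date, reference = row.strip().split("|")

    def quote(value: str) -> str:
        # Double any single quote so the literal stays valid SQL.
        return "'" + value.replace("'", "''") + "'"

    # The sample output writes a missing reference as the text "(null)".
    reference_sql = "null" if reference == "(null)" else quote(reference)
    return (
        "insert into concept_history(concept, editaction, editdate, reference) "
        f"values ({quote(concept)}, {quote(action)}, {quote(edit_date)}, {reference_sql});"
    )
```

When loading through a database driver rather than a generated script, bound parameters would normally be preferable to building SQL text.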

  20. DTS_history_out.out
      Lists complete contents of both baselines
      .
      Number of codes in {baseline A} : 65265
      Number of codes in {baseline B} : 66022
      Concepts found in {baseline B}: but not in {baseline A}
      C72675
      C72676
      .
      Concepts found in {baseline A}: but not in {baseline B} (should be empty)
      .
      Verify DTS_history_out.txt against baseline data.
      New Concepts: 757
      (1) C72675
      (2) C72676
      .
      Concepts created through Split: 0
      Split Concepts: 0
      Retired Concepts: 4
      (1) C20920
      (2) C62920
      Concepts retired through Merge: 5
      (1) C14142
      Merge Concepts: 5
      (1) C1363
      Modified Concepts: 1364
      Invalid actions: 0
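The baseline comparison reported here amounts to two set differences. A minimal sketch, assuming each baseline has already been reduced to a list of concept codes (the function name and the result keys are hypothetical):

```python
from typing import Dict, Iterable, List, Union


def compare_baselines(
    codes_a: Iterable[str], codes_b: Iterable[str]
) -> Dict[str, Union[int, List[str]]]:
    """Compare an older baseline (A) with a newer one (B) by concept code."""
    a, b = set(codes_a), set(codes_b)
    return {
        "count_a": len(a),
        "count_b": len(b),
        "only_in_b": sorted(b - a),  # candidates for new concepts
        "only_in_a": sorted(a - b),  # expected to be empty, per the report
    }
```

In the sample output the two code counts differ by 66022 - 65265 = 757, which matches the reported number of new concepts and is consistent with an empty "only in A" list.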

  21. Tiered Deployments
      • NCICB uses 4-tiered deployments
      • Dev tier – used internally by EVS team to test software and data
      • QA tier – used by QA and other software teams to test against new EVS software or data
      • Stage tier – used to test software deployments in a near-production environment
      • Production – available to outside users
