Transforming Data Integration: Overcoming Challenges in Linked Data Practices
In the 21st century, modeling complex systems presents a significant research challenge. This document discusses the limitations of current data integration practices, highlighting issues such as heterogeneous tools, fragmented data formats, and inadequate stewardship funding. It emphasizes the importance of collaboration and the need for global practices that enable seamless data linking across disciplines. By adopting open models, explicit semantics, and community-driven resources, researchers can foster a more interconnected and efficient data environment, ensuring greater accessibility and usability of global data.
Presentation Transcript
Linked Data: Principles and Practice • Joe Futrelle, Woods Hole Oceanographic Institution, jfutrelle@whoi.edu • WHOI / BCO-DMO, July 11, 2011
Grand challenge: whole systems • Observation and modelling of multiple systems at multiple scales • Linking data from different disciplines • to get useful global results! “... modelling complex systems will be a major research challenge for the 21st century” - National Science Foundation
Building current practices up isn't working • Heterogeneous tools, data formats • Can’t get everyone in one workgroup • Funding goes to science, not stewardship M.C. Escher, “Tower of Babel” (1928)
Proposed solutions aren't working • e-Journals – not machine-interpretable • Collaboration tools • everyone falls back on email & other p2p • Portals and repositories – typically: • centralized • domain-specific • “The Grid” – can orchestrate complex processing jobs, but that's not science
Only networks work at scale • Desktop (single researcher): ad hoc data management, single-user apps • Workgroup (community): community tools, resources, control • Network (global): no global practice, tools, control
Or to put it another way … Ted Nelson, Computer Lib / Dream Machines (1974)
Data is the network linkeddata.org (2009) There is no boundary, center, or locus of control, … so it scales
“If you can’t tweet your dataset, it doesn’t exist” • Links are the global currency of the internet • The more people link to you, the more you matter (e.g., PageRank) • If nobody can link to your data, they will choose data they can link to instead • If someone links to your data, someone will link to them, and thus to you • The lowest entry barrier wins
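The point that incoming links determine how much a dataset "matters" can be illustrated with a simplified PageRank-style iteration. This is only a sketch of the general idea, not the algorithm any search engine actually runs; the dataset names `a` to `d`, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each node to the list of nodes it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        # Every node passes an equal share of its rank along each outgoing link.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical datasets: "d" is published but nobody links to it and it links to nothing.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(links)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

In this toy graph the dataset with two incoming links ranks first, the one it links to ranks second, and the unlinked dataset sits at the bottom, which is the slide's argument in miniature.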
Don’t drink the Kool-Aid • Semantic web “layer cake” • Where do we do actual work? • User interface? • Applications? • “Semantic Grid” (D. De Roure, C. Goble) (source: World Wide Web Consortium)
Semantics = what they hear • Shared semantics are minimal • Maximal semantics emerge when multiple nodes act on partial information • Validating each exchange doesn’t scale Gary Larson (1983)
Design data for network effects • Global, persistent identification • Open models (tolerate incompleteness) • Transparent protocols (pass-through) • “Graceful degradation” (cf. Dublin Core) • Data outlives code, so data should control code, not the other way around • Semantics matter, so they must be explicit and machine-readable (not a side effect of running code)
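A minimal sketch of what "open models", "pass-through" and "graceful degradation" might look like for a data consumer: terms it understands are used, refined Dublin Core terms are degraded to their broader term, and anything unknown is carried along rather than rejected. The `BROADER` table is a small hand-written subset of Dublin Core refinements, and the `http://example.org/vocab/salinityMethod` property and the record contents are hypothetical.

```python
DCT = "http://purl.org/dc/terms/"

# Hand-written subset of Dublin Core refinements (narrower term -> broader term).
BROADER = {
    DCT + "created": DCT + "date",
    DCT + "abstract": DCT + "description",
    DCT + "alternative": DCT + "title",
}

# The only terms this (hypothetical) consumer knows how to act on.
UNDERSTOOD = {DCT + "title", DCT + "date", DCT + "description"}


def interpret(record):
    """Split a record into what the consumer understands and what it only carries."""
    understood, passed_through = {}, {}
    for prop, value in record.items():
        # Use the term directly if known, otherwise try its broader term.
        key = prop if prop in UNDERSTOOD else BROADER.get(prop)
        if key in UNDERSTOOD:
            understood.setdefault(key, []).append(value)
        else:
            # Unknown property: keep it intact for downstream consumers.
            passed_through[prop] = value
    return understood, passed_through


record = {
    DCT + "title": "CTD casts",
    DCT + "created": "2011-07-11",
    "http://example.org/vocab/salinityMethod": "conductivity",  # hypothetical term
}
understood, passed_through = interpret(record)
print(understood)
print(passed_through)
```

The record is never validated against a closed schema: an incomplete or partly unfamiliar description still yields whatever meaning the consumer can extract, and the rest survives for someone who knows more.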
Practices that grow the network • Give everything a portable identifier • Link entities via properties = network • Reuse existing ontologies and only build the partial ontologies that fill in the gaps (e.g., don’t re-develop Dublin Core terms) • Emit metadata early and often; don’t assume curators will do it later (who? $?) • “Not building a wall; building a brick” (Oblique Strategies, 1970)
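The first three practices can be sketched with plain subject/property/value triples. This is an illustrative sketch in standard-library Python rather than any particular RDF toolkit; the `example.org` identifiers and the literal values are placeholders, while the Dublin Core terms and FOAF namespaces are existing vocabularies reused as the slide recommends.

```python
from collections import defaultdict

# Existing vocabularies, reused rather than re-developed.
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical portable identifiers for two datasets and one person.
DATASET_1 = "http://example.org/dataset/1"
DATASET_2 = "http://example.org/dataset/2"
PERSON_7 = "http://example.org/person/7"

# Each statement links an entity to a value or to another entity via a property.
TRIPLES = [
    (DATASET_1, DCT + "title", "CTD casts, cruise A"),
    (DATASET_1, DCT + "creator", PERSON_7),
    (DATASET_2, DCT + "title", "Nutrient samples, cruise A"),
    (DATASET_2, DCT + "creator", PERSON_7),
    (PERSON_7, FOAF + "name", "A. Researcher"),
]


def describe(subject, triples=TRIPLES):
    """Collect every property/value pair stated about one identifier."""
    description = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            description[p].append(o)
    return dict(description)


def linked_from(target, triples=TRIPLES):
    """Return the identifiers that point at `target` through any property."""
    return sorted({s for s, p, o in triples if o == target})


print(describe(DATASET_1))
print(linked_from(PERSON_7))
```

Because the person is identified by a URI rather than a name string, the two datasets become connected to each other through it without either one having to know about the other.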