Curated databases

Curated Databases. Peter Buneman, School of Informatics, University of Edinburgh. With thanks to James Cheney, Heiko Müller, Wang Chiew Tan, Stijn Vansummeren, and many others. The Population of Corfu.



Curated Databases

Peter Buneman

School of Informatics

University of Edinburgh

With thanks to James Cheney, Heiko Müller, Wang Chiew Tan, Stijn Vansummeren, and many others


The Population of Corfu

***The only site to give attribution: http://www.statistics.gr/portal/page/portal/ESYE


These are both curated databases


What is a curated database?

  • A curated database is one that is maintained with a lot of human effort

  • Curare: Latin “to care for”

  • Prime concern is quality of data


What is a database? (for the purposes of this talk)

  • Any structured collection of data that is subject to change/revision

    • Ontologies

    • XML and other structured text files

    • Structured wikis

    • Standard relational and object-oriented databases


Curated databases have interesting properties…

  • A digital reference work. Traditional dictionaries, gazetteers and encyclopedias have been replaced by curated databases.

  • Value lies in the organization and annotation of data

  • Commonly constructed by copying parts of other (curated) databases.

  • Fundamental to “citizen science”

  • Rapidly increasing in scientific research. (> 1000 in molecular biology)

  • Constantly checked/verified. Data quality and timeliness are important.

  • Often group efforts. Produced by a dedicated organization or collaboration.

  • Increasingly seen as “publications” by scientists. (You get kudos if someone uses your database – like a citation.)

  • They are not data warehouses!!!


... and they are very expensive

In $/€/£ per byte:

  • “Reliable” code / Curated data: 10

  • “Production” code / Curated data: 1

  • Book: 10^-1

  • [Movie]: 10^-3

  • Big physics (LHC) data: 10^-7


A change for the better?

Storage:

  • Redundant

  • Persistent

  • Distributed

  • Readable by people

    Clear standards for citation

    Historical record (old data is useful)

    Well understood ownership/IP

Storage:

  • Single-source

  • Volatile

  • Centralised

  • Internal DBMS format

    No standards for citation

    No historical record

    Mind-boggling legal issues

20th century libraries did some things better!


Some computer science issues

  • Archiving (CS usage)

  • Provenance

  • Annotation/citation

  • Data cleaning

All of these are intimately connected.

For example, if you cite some part of a curated database, the version you cited should be available (archiving)


Some well-known curated databases

ID 11SB_CUCMA STANDARD; PRT; 480 AA.

AC P13744;

DT 01-JAN-1990 (REL. 13, CREATED)

DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)

DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)

DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.

OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).

OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;

OC VIOLALES; CUCURBITACEAE.

RN [1]

RP SEQUENCE FROM N.A.

RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;

RX MEDLINE; 88166744.

RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;

RL EUR. J. BIOCHEM. 172:627-632(1988).

RN [2]

RP SEQUENCE OF 22-30 AND 297-302.

RA OHMIYA M., HARA I., MASTUBARA H.;

RL PLANT CELL PHYSIOL. 21:157-167(1980).

CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.

CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A

CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A

CC DISULFIDE BOND.

CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).

DR EMBL; M36407; G167492; -.

DR PIR; S00366; FWPU1B.

DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.

KW SEED STORAGE PROTEIN; SIGNAL.

FT SIGNAL 1 21

FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.

FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).

FT CHAIN 297 480 DELTA CHAIN (BASIC).

FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.

FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).

FT CONFLICT 27 27 S -> E (IN REF. 2).

FT CONFLICT 30 30 E -> S (IN REF. 2).

SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;

MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR

RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA

IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV

FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE

EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE

TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY

TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF

KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE

//

CIA World Factbook

UniProt


Archiving / Database Preservation

  • How do we preserve something that evolves (both in content and structure)

  • Keep snapshots?

    • frequent: space consuming

    • infrequent: lose “history”

Most curated databases have a hierarchical structure that we can exploit…
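One way to exploit the hierarchy can be sketched as follows. This is a minimal illustration, not the archiver's actual implementation: it assumes each snapshot is a nested dict whose keys identify nodes, and it stores for every leaf value the set of versions in which that value held. The function names `merge_snapshot` and `snapshot_at` and the `__leaf__` marker are hypothetical; the Austria figures are the ones quoted later in this talk.

```python
from typing import Any, Dict, Set

Archive = Dict[str, Any]


def merge_snapshot(archive: Archive, snapshot: Dict[str, Any], version: int) -> None:
    """Merge one keyed snapshot into the archive, in place."""
    for key, value in snapshot.items():
        if isinstance(value, dict):
            # Internal node: descend, creating the archive node if needed.
            merge_snapshot(archive.setdefault(key, {}), value, version)
        else:
            # Leaf: map each distinct value to the versions in which it held.
            leaf = archive.setdefault(key, {"__leaf__": {}})["__leaf__"]
            versions: Set[int] = leaf.setdefault(value, set())
            versions.add(version)


def snapshot_at(archive: Archive, version: int) -> Dict[str, Any]:
    """Reconstruct the snapshot that was merged as `version`."""
    out: Dict[str, Any] = {}
    for key, node in archive.items():
        if "__leaf__" in node:
            for value, versions in node["__leaf__"].items():
                if version in versions:
                    out[key] = value
        else:
            sub = snapshot_at(node, version)
            if sub:
                out[key] = sub
    return out


archive: Archive = {}
for year in range(2002, 2008):
    area = "82,738 sq km" if year <= 2003 else "82,444 sq km"
    merge_snapshot(archive, {"Austria": {"Geography": {"Area": {"land": area}}}}, year)

print(snapshot_at(archive, 2003))
```

Six yearly snapshots are stored as one tree with two leaf values, which is the space saving that frequent whole-database snapshots would not give.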


A Working System

  • Implemented by Heiko Müller

  • For scale, we require external sorting of large XML files

    • Designed and implemented by Ioannis Koltsidas, Heiko Müller and Stratis Viglas

  • Has a simple temporal query language

  • Experimented with recent (HTML) versions of the CIA World Factbook


How did the population of China change from 2002-2007?

<T t="2002-2007">

<FACTBOOK>

<COUNTRY>

<CATEGORY>

<PROPERTY>

<NAME>Population</NAME>

<TEXT>

<T t="2002">1,284,303,705 (July 2002 est.)</T>

<T t="2003">1,286,975,468 (July 2003 est.)</T>

<T t="2004">1,298,847,624 (July 2004 est.)</T>

<T t="2005">1,306,313,812 (July 2005 est.)</T>

<T t="2006">1,313,973,713 (July 2006 est.)</T>

<T t="2007">1,321,851,888 (July 2007 est.)</T>

</TEXT>

</PROPERTY>

</CATEGORY>

</COUNTRY>

</FACTBOOK>

</T>
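The answer above can be post-processed to give year-on-year changes. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the way the text values are parsed (taking the first token and dropping the thousands separators) is an assumption about the format shown on this slide and is not part of the temporal query language.

```python
import xml.etree.ElementTree as ET

# The query answer from the slide, reproduced as a string.
ANSWER = """
<T t="2002-2007">
<FACTBOOK><COUNTRY><CATEGORY><PROPERTY>
<NAME>Population</NAME>
<TEXT>
<T t="2002">1,284,303,705 (July 2002 est.)</T>
<T t="2003">1,286,975,468 (July 2003 est.)</T>
<T t="2004">1,298,847,624 (July 2004 est.)</T>
<T t="2005">1,306,313,812 (July 2005 est.)</T>
<T t="2006">1,313,973,713 (July 2006 est.)</T>
<T t="2007">1,321,851,888 (July 2007 est.)</T>
</TEXT>
</PROPERTY></CATEGORY></COUNTRY></FACTBOOK>
</T>
"""


def population_by_year(xml_text):
    """Map each timestamped <T> under <TEXT> to an integer population."""
    root = ET.fromstring(xml_text)
    result = {}
    for t in root.findall(".//TEXT/T"):
        year = int(t.get("t"))
        # "1,284,303,705 (July 2002 est.)" -> 1284303705
        result[year] = int(t.text.split()[0].replace(",", ""))
    return result


pops = population_by_year(ANSWER)
years = sorted(pops)
changes = {y: pops[y] - pops[prev] for prev, y in zip(years, years[1:])}

for year, delta in changes.items():
    print(year, f"{delta:+,}")
```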


How did land area of countries change in 2002-2007?

<T t="2002-2007">

<FACTBOOK KEY="">

<COUNTRY KEY="NAME Austria">

<CATEGORY KEY="NAME Geography">

<PROPERTY KEY="NAME Area">

<SUBPROP>

<NAME>land</NAME>

<TEXT>

<T t="2004-2007">82,444 sq km</T>

<T t="2002-2003">82,738 sq km</T>

</TEXT>

</SUBPROP>

</PROPERTY>

</CATEGORY>

</COUNTRY>

<COUNTRY KEY="NAME France">

<CATEGORY KEY="NAME Geography">

<PROPERTY KEY="NAME Area">

<SUBPROP>

<NAME>land</NAME>

<TEXT>

<T t="2002-2006">545,630 sq km</T>

<T t="2007">640,053 sq km; 545,630 sq km (metropolitan France)</T>

</TEXT>


What are the differences between the factbooks on 21/08/2007 and 10/09/2007?

<T t="21/08/2007-10/09/2007">

<CIAWFB KEY="">

<COUNTRY KEY="NAME Afghanistan">

<CATEGORY KEY="NAME Communications">

<PROPERTY KEY="NAME Internet users">

<T t="21/08/2007">

<TEXT>30,000 (2005)</TEXT>

</T>

<T t="10/09/2007">

<TEXT>535,000 (2006)</TEXT>

</T>

</PROPERTY>

<PROPERTY KEY="NAME Telephones - mobile cellular">

<T t="21/08/2007">

<TEXT>1.4 million (2005)</TEXT>

</T>

<T t="10/09/2007">

<TEXT>2.52 million (2006)</TEXT>

</T>


http://homepages.inf.ed.ac.uk/hmueller/xarch/download.html

  • Heiko Müller’s Xarch

  • Examples of use with

    • Ontologies

    • XML files

    • Relational databases (even atmospheric data!)

  • Automatically converts RDBs into XML

  • Efficiently extracts snapshots

  • Simple temporal query language


Provenance – a huge issue

  • Where did this data come from?

  • How did it get here?

  • How was it constructed?

  • . . .

  • Two schools of research:

  • Workflow (coarse-grain) provenance – a complete record of how some large scientific analysis/simulation was performed.

  • Data (fine-grain) provenance – a record of how some small piece of data (in a larger database) was produced.


Data provenance: an example

Copy-paste, or

<cntrl>C <cntrl>V


“Where provenance”

Possible explanations of how something was copied:

This data item was extracted from location L1 in document D1 and placed in location L2 in document D2

or

This data item was extracted from database D1 by query Q1 and placed in database D2 by update U2

(or some combination of the two)


ID 11SB_CUCMA STANDARD; PRT; 480 AA.

AC P13744;

DT 01-JAN-1990 (REL. 13, CREATED)

DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)

DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)

DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.

OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).

OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;

OC VIOLALES; CUCURBITACEAE.

RN [1]

RP SEQUENCE FROM N.A.

RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;

RX MEDLINE; 88166744.

RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;

RL EUR. J. BIOCHEM. 172:627-632(1988).

RN [2]

RP SEQUENCE OF 22-30 AND 297-302.

RA OHMIYA M., HARA I., MASTUBARA H.;

RL PLANT CELL PHYSIOL. 21:157-167(1980).

CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.

CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A

CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A

CC DISULFIDE BOND.

CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).

DR EMBL; M36407; G167492; -.

DR PIR; S00366; FWPU1B.

DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.

KW SEED STORAGE PROTEIN; SIGNAL.

FT SIGNAL 1 21

FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.

FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).

FT CHAIN 297 480 DELTA CHAIN (BASIC).

FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.

FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).

FT CONFLICT 27 27 S -> E (IN REF. 2).

FT CONFLICT 30 30 E -> S (IN REF. 2).

SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;

MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR

RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA

IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV

FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE

EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE

TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY

TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF

KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE

//

Where Provenance

DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.

OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).

OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;

OC VIOLALES; CUCURBITACEAE.

CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.

CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A

CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A

CC DISULFIDE BOND.

CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).

FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.

FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).

FT CHAIN 297 480 DELTA CHAIN (BASIC).

FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.

FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).

Where does this information come from? Which curator? Or was it the cited papers?

Was it copied from some other DB?


Copy-paste model of curated DBs

Curated databases are not views!!

(a) A biologist copies some UniProt records into her DB.

(b) She fixes entries so that UniProt PTMs are not confused with hers.

(c) She copies in some publication details from OMIM.

(d) She corrects a mistake in a PubMed publication number.

[B., Chapman, Cheney, SIGMOD ’06]


A very simple copy-paste language (uses a “deterministic” tree model)

(1) delete c5 from T;

(2) copy S1/a1/y into T/c1/y;

(3) insert {c2 : {}} into T;

(4) copy S1/a2 into T/c2;

(5) insert {y : {}} into T/c2;

(6) copy S2/b3/y into T/c2/y;

(7) copy S1/a3 into T/c3;

(8) insert {c4 : {}} into T;

(9) copy S2/b2 into T/c4;

(10) insert {y : 12} into T/c4;

How costly is it to record all this?
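To make the question concrete, the ten updates above can be replayed while logging one provenance record per step. This is a minimal sketch under stated assumptions: trees are nested Python dicts, paths are tuples of labels, and the initial contents of S1, S2 and T are invented for illustration (the slide does not give them). The helper names `get`, `insert`, `delete` and `copy_into` are hypothetical.

```python
import copy


def get(db, path):
    """Follow a path such as ("T", "c2") from the root and return the node."""
    node = db
    for label in path:
        node = node[label]
    return node


def insert(db, log, step, label, value, target):
    """insert {label : value} into target"""
    get(db, target)[label] = copy.deepcopy(value)
    log.append((step, "insert", target + (label,), None))


def delete(db, log, step, label, target):
    """delete label from target"""
    del get(db, target)[label]
    log.append((step, "delete", target + (label,), None))


def copy_into(db, log, step, source, target):
    """copy source into target (the target's old contents are overwritten)"""
    get(db, target[:-1])[target[-1]] = copy.deepcopy(get(db, source))
    log.append((step, "copy", target, source))


# Hypothetical starting state; only the node names come from the slide.
db = {
    "S1": {"a1": {"x": 1, "y": 2}, "a2": {"x": 3}, "a3": {"x": 7, "y": 6}},
    "S2": {"b2": {"x": 4}, "b3": {"y": 5}},
    "T": {"c1": {"x": 1, "y": 3}, "c3": {"x": 0}, "c5": {"x": 9, "y": 9}},
}
log = []
T = ("T",)

delete(db, log, 1, "c5", T)
copy_into(db, log, 2, ("S1", "a1", "y"), ("T", "c1", "y"))
insert(db, log, 3, "c2", {}, T)
copy_into(db, log, 4, ("S1", "a2"), ("T", "c2"))
insert(db, log, 5, "y", {}, ("T", "c2"))
copy_into(db, log, 6, ("S2", "b3", "y"), ("T", "c2", "y"))
copy_into(db, log, 7, ("S1", "a3"), ("T", "c3"))
insert(db, log, 8, "c4", {}, T)
copy_into(db, log, 9, ("S2", "b2"), ("T", "c4"))
insert(db, log, 10, "y", 12, ("T", "c4"))

for record in log:
    print(record)
```

Each log entry is (step, operation, target path, source path). Logging every update in this way is the naive scheme whose cost the slide asks about.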


How to reduce space

  • Complete provenance: Record every update.

  • Transactional provenance: Record the links at the end of some user-defined transaction (sequence of updates)

  • Hierarchical (inferred) provenance. Only record a link if it cannot be inferred from the provenance of a higher node

Taken together, these provide a substantial saving on storage: in some realistic simulations the overhead is comparable with the size of the DB.
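The hierarchical idea can be sketched as follows: a link from a target node to a source node is stored only when it differs from what the nearest recorded ancestor link already implies. This is an illustrative sketch, not the authors' implementation; paths are tuples of labels, and the names `infer_from_ancestors`, `record` and `provenance` are hypothetical.

```python
def infer_from_ancestors(links, path):
    """Source of `path` implied by the nearest proper ancestor with a link."""
    for cut in range(len(path) - 1, 0, -1):
        prefix, suffix = path[:cut], path[cut:]
        if prefix in links:
            return links[prefix] + suffix
    return None


def record(links, target, source):
    """Store target <- source only if it cannot be inferred from above."""
    if infer_from_ancestors(links, target) == source:
        links.pop(target, None)  # redundant: the parent link already says so
    else:
        links[target] = source


def provenance(links, path):
    """Look up the source of a node: explicit link first, then inference."""
    if path in links:
        return links[path]
    return infer_from_ancestors(links, path)


links = {}
# copy S1/a2 into T/c2: one link for the subtree root ...
record(links, ("T", "c2"), ("S1", "a2"))
# ... and the link for its child x is inferable, so it is not stored.
record(links, ("T", "c2", "x"), ("S1", "a2", "x"))
# copy S2/b3/y into T/c2/y: differs from the inherited source, so it is stored.
record(links, ("T", "c2", "y"), ("S2", "b3", "y"))

print(len(links))
print(provenance(links, ("T", "c2", "x")))
print(provenance(links, ("T", "c2", "y")))
```

Three provenance facts are answerable from two stored links; on large copied subtrees the ratio is correspondingly larger.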


Query languages and where provenance

(select A, 5 as B from R where A = 1)
union
(select * from R where A <> 1)

delete from R where A = 1;
insert into R values (1,5)

update R set B = 5 where A = 1

[Figure: tables with columns A and B illustrating the three statements above; the cell values are not recoverable from this transcript]

[B., Cheney, Vansummeren, TODS 33,4, 2008]
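The three statements can leave R with the same contents while differing in where each piece of the result came from. The sketch below illustrates this with a simplified model of my own, not the formal semantics of the cited paper: every row and cell carries an origin label, a constant or a newly constructed row has origin `None`, and an in-place update keeps the row's origin. The contents of R are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cell:
    value: int
    origin: Optional[str]  # where the value was copied from, None if new


@dataclass(frozen=True)
class Row:
    a: Cell
    b: Cell
    origin: Optional[str]  # None if the row was newly constructed


def load(rows) -> List[Row]:
    """Label every row and cell of the input table with its own location."""
    return [
        Row(Cell(a, f"R[{i}].A"), Cell(b, f"R[{i}].B"), f"R[{i}]")
        for i, (a, b) in enumerate(rows)
    ]


def as_query(r):
    # (select A, 5 as B ...) union (select * ...): A is copied, B and the row are new.
    selected = [Row(row.a, Cell(5, None), None) for row in r if row.a.value == 1]
    rest = [row for row in r if row.a.value != 1]
    return selected + rest


def as_delete_insert(r):
    # delete then insert values (1,5): everything in the new row is new.
    kept = [row for row in r if row.a.value != 1]
    return kept + [Row(Cell(1, None), Cell(5, None), None)]


def as_update(r):
    # update ... set B = 5: the row and A stay in place, only B is new.
    return [
        Row(row.a, Cell(5, None), row.origin) if row.a.value == 1 else row
        for row in r
    ]


def values(table):
    return sorted((row.a.value, row.b.value) for row in table)


def row_with_a(table, a):
    return next(row for row in table if row.a.value == a)


R = load([(1, 6), (3, 7)])  # hypothetical contents
q, di, u = as_query(R), as_delete_insert(R), as_update(R)

print(values(q) == values(di) == values(u))
for name, table in (("query", q), ("delete+insert", di), ("update", u)):
    row = row_with_a(table, 1)
    print(name, row.origin, row.a.origin, row.b.origin)
```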


Other forms of provenance in query languages

  • Why-provenance: why is a tuple in the output, or what parts of the input “contributed to” the tuple? [Widom et al]

  • How-provenance: how (by what process) was this tuple constructed? [Tannen et al]

[Figure: a complex program or process takes a large, heterogeneous source to a database; a simpler program/process takes a small part of the source to a “piece” of data (data value, tuple, etc.)]

Taken together, these are the “explanation”.
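As an illustration of the "small part of the source" idea, the sketch below computes, for a join followed by a projection, the sets of input tuples that suffice to produce each output tuple. It is a simplified reading of why-provenance with invented relations and tuple identifiers, not the definition used in the cited work.

```python
from collections import defaultdict


def why_provenance(r, s):
    """Join r(a, b) with s(b, c) on b and project onto (a, c).

    Each output tuple maps to its set of witnesses; a witness is the set of
    input tuple ids that together are enough to produce that output tuple.
    """
    out = defaultdict(set)
    for rid, (a, b1) in r.items():
        for sid, (b2, c) in s.items():
            if b1 == b2:
                out[(a, c)].add(frozenset({rid, sid}))
    return dict(out)


# Hypothetical input relations, keyed by tuple id.
r = {"r1": ("x", 1), "r2": ("x", 2)}
s = {"s1": (1, "z"), "s2": (2, "z"), "s3": (3, "w")}

why = why_provenance(r, s)
for tup, witnesses in why.items():
    print(tup, sorted(sorted(w) for w in witnesses))
```

Here the output tuple ("x", "z") has two independent witnesses, and s3 contributes to nothing, so it is not part of any explanation.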


Workflow provenance

  • Taken from [Davidson & Freire, Sigmod 08]

  • Each step S1. . . S4 is itself a workflow.

  • How does one record an “enactment” of the workflow?

  • How much “context” does one record?

    • from people

    • from databases that change

  • Recent attempts to produce a general model

    • Open Provenance Model [Moreau et al. 2007]

    • Petri Net + Complex Object [Hidders et al., Inf. Syst. 2008]


Provenance is a very general issue

  • Intrinsic to data quality.

  • It is starting to be used in several areas of CS:

    • Semantics of update languages.

    • Probabilistic databases

    • Data integration

    • Debugging schema transformations

    • File/data synchronization

    • Program debugging (program slicing)

    • Security

  • The fundamental problem is finding the right model/models

  • Can we combine data and workflow models?

    • OPM + complex objects (B. et al. 2010)


How do you cite something in a database?

Many scientific databases ask you to cite them, but...

  • they don’t tell you how, or

  • they tell you to give the URL, or

  • they tell you to cite a paper about the database.

Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of Illinois Cooperative Extension Service, Illinet Department; [updated 2000 Nov 28; cited 2001 Apr 25]. Diabetes mellitus lesson; [about 1 screen]. Available from http://www.aces.uiuc.edu/~necd/inter2_search.cgi?ind=854148396

NLM Recommended Formats for Bibliographic Citation.

Internet Supplement. NLM Technical report Bethesda, MD 20894, July 2001.


What is a citation?

Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov; 17(11):999-1001.

[Location and descriptive information]

Ann. Phys., Lpz 18 639-641

Nature, 171, 737-738

(We often want more than location)


Automatically generating citations

A rule:

{ DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i }

/Root[ ]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d] /Data[ ] /Family[FamilyName=$’f] /Contributor-list/Contributor=$+a] /Receptor[ReceptorName=$’r]

What gets generated (example):

{ DB=IUPHAR, Version=11, Family=Calcitonin,

Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner},

Editor=Tony Harmar, Date=Jan 2006, DOI=10.1234 }
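The effect of such a rule can be sketched in ordinary code. The example below is not an implementation of the rule language above; it assumes a hypothetical nested-dict layout that mirrors the path in the rule (Version, Data, Family, Receptor) and simply collects the fields that the rule binds. The function name `cite` and the `Name` field are assumptions, and the data values are those of the generated example.

```python
def cite(db, family_name, receptor_name):
    """Build a citation record for one receptor in one family."""
    version = db["Version"]
    for family in version["Data"]["Family"]:
        if family["FamilyName"] != family_name:
            continue
        for receptor in family["Receptor"]:
            if receptor["ReceptorName"] == receptor_name:
                return {
                    "DB": db["Name"],
                    "Version": version["Number"],
                    "Family": family["FamilyName"],
                    "Receptor": receptor["ReceptorName"],
                    # All contributors of the enclosing family are listed.
                    "Contributors": list(family["Contributor-list"]),
                    "Editor": version["Editor"],
                    "Date": version["Date"],
                    "DOI": version["DOI"],
                }
    raise KeyError(f"{family_name}/{receptor_name} not found")


# Hypothetical database fragment shaped like the path in the rule.
iuphar = {
    "Name": "IUPHAR",
    "Version": {
        "Number": 11,
        "Editor": "Tony Harmar",
        "DOI": "10.1234",
        "Date": "Jan 2006",
        "Data": {
            "Family": [
                {
                    "FamilyName": "Calcitonin",
                    "Contributor-list": ["Debbie Hay", "David R. Poyner"],
                    "Receptor": [{"ReceptorName": "CALCR"}],
                }
            ]
        },
    },
}

citation = cite(iuphar, "Calcitonin", "CALCR")
print(citation)
```

The point of generating the citation from the data is that the version, contributors and editor are taken from the part of the hierarchy being cited rather than typed by hand.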


Other topics: Data quality and data cleaning

  • Published data often looks clean but is intrinsically messy

    • “Dead” fields in the underlying data

    • Multiple syntactic conventions

    • Abuse of / confusion over formats & schema

  • Human errors require human correction

    • Automate error detection rather than error correction

  • Cleaning is an essential prerequisite in any integration or preservation task.


Other topics: Evolution of Structure

  • Curated DBs evolve from humble origins. Schemas are often wrong; they are

    • designed by people who don’t understand schemas

    • designed before the domain is fully understood

  • Do ontologies help (you can build an ontology without worrying much about the schema) or do they defer the problem and make it worse?


The larger (economic and social) issues

  • Who will archive/curate curated databases?

  • Should they be open-access?

    • who pays for their maintenance?

  • What are the legal/IP issues?


A case study: IUPHAR database (curated by Tony Harmar and team)

  • “Standard” curated database

  • Labour-intensive (hundreds of contributors)

  • Valuable (supported by drug companies)

  • Simple, clean structure – as seen by users

IUPHAR

DCC

50m


We wanted to use our archiver

  • Our first task was to convert the database into a hierarchical structure (following the web presentation) so that we could archive it.

  • We used the Prata XML (Fan et al) publishing software

  • This had some unexpected benefits…


  • We can preserve all versions of the data (as intended)

  • We can generate static web pages (less software, more efficient)

  • We can make the database citable

  • Tony can trace the history of entries

  • Tony can generate an old-fashioned book (yes, he wants to do this!)

  • We have a “community model” for data exchange

  • The data got cleaned up in the process

  • The representation information (required by archivists) is greatly simplified


Selected pages from the book – generated by a 100-line style sheet


Our library will “host” the book, but not the database!


Centralized vs. distributed publishing

20th century libraries provided robust, distributed dissemination and preservation of reference material

Is this still happening?

Valuable information was lost in earlier “data centers”.

Replication and distribution have always been the best guarantee of preservation. We should do the same for curated databases – a database LOCKSS?


Many of the issues are non “technical”

  • A good economic model for sustainability

    • Open access works for journal papers

    • Can it work for curated DBs? They require long-term support. And people who write reference manuals sometimes expect to make money out of them.

  • Intellectual property in curated databases is a nightmare

    • legislation still largely based on the notion of copying.

  • We can still help by providing good models of the processes in curating and publishing databases


Summary

  • Study of database curation and preservation is producing new problems for databases and digital libraries

  • We need to bring curated databases into the scope of libraries and other archival institutions and organisations.

  • Is this Data Intensive Research?

