e-Science and the Humanities

Researching e-Science Analysis of Census Holdings, www.ucl.ac.uk/reach/. Dr Melissa Terras, School of Library, Archive and Information Studies, University College London, m.terras@ucl.ac.uk.

Presentation Transcript


  1. Researching e-Science Analysis of Census Holdings • www.ucl.ac.uk/reach/ • Dr Melissa Terras • School of Library, Archive and Information Studies, University College London • m.terras@ucl.ac.uk

  2. e-Science and the Humanities • Little use has been made of the computational grid in humanities research • The aims of the ReACH project were • To establish the potential of applying grid technologies to analyse a complex and rich humanities dataset • Pre-digitised • Historical census data • Of interest to academic researchers and the general public • To investigate how e-Science technologies may be appropriated in the arts and humanities • Academic, technical, legal and managerial aspects of analysing large scale pre-digitised datasets using e-Science technologies • Understand the characteristics and features of large scale humanities datasets which differentiate them from scientific datasets • How does this affect the application of e-Science for research in the arts and humanities?

  3. Partners • UCL SLAIS • Digital humanities, informatics, archives and digital preservation • UCL Research Computing • World leading expertise in High Performance, Grid and e-Science computing • “Research Computing” • High Levels of SRIF funding • The National Archives • who select, preserve and provide access to, and advice on, historical records, • e.g. the censuses of England and Wales 1841-1901 (and also the Isle of Man, Channel Islands and Royal Navy censuses) • Ancestry.co.uk • who own a massive dataset of census holdings worldwide, and who have digitized the censuses of England and Wales under license from The National Archives

  4. Historical Census Data • England and Wales Census Data • 1841-1901 • 7 different censuses taken at 10-year intervals • 20 GB, 200 million records • Complex data set • Fields vary between each census year • Errors • from those supplying the data • from those writing down those answers • from those transcribing those answers into the enumerator returns • from those entering the data into the digital version of the records
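As a rough sense of scale for the figures on this slide, the following back-of-envelope sketch works out the average record size and the average number of records per census. It assumes decimal gigabytes and an even spread of records across the seven censuses; neither assumption comes from the presentation, and the real distribution is likely uneven.

```python
# Back-of-envelope arithmetic for the dataset size quoted on the slide.
# Assumptions (not from the source): 1 GB = 10**9 bytes, and records are
# spread evenly across census years.

TOTAL_BYTES = 20 * 10**9        # 20 GB
TOTAL_RECORDS = 200 * 10**6     # 200 million records

# Census years 1841, 1851, ..., 1901 (10-year intervals).
CENSUS_YEARS = list(range(1841, 1902, 10))

bytes_per_record = TOTAL_BYTES / TOTAL_RECORDS
avg_records_per_census = TOTAL_RECORDS / len(CENSUS_YEARS)

if __name__ == "__main__":
    print(f"Censuses: {len(CENSUS_YEARS)}")
    print(f"Average bytes per record: {bytes_per_record:.0f}")
    print(f"Average records per census: {avg_records_per_census:,.0f}")
```

On these assumptions each record averages about 100 bytes, which is consistent with short transcribed fields rather than page images.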

  5. Overview of aims • Ascertain whether it would be technically possible • Ascertain whether access to the data would be feasible • Ascertain whether it would be useful to historians • Ascertain whether the results from the project would be worthy of the intellectual and financial investment • And what financial investment would be required to undertake the project

  6. Data • How do humanities datasets differ from scientific datasets? • Does this preclude them from utilising e-Science technologies in research? • Understand issues pertaining to the historical census • Quality of data • Importance of data to historians and researchers • What can be done to process the data to improve and facilitate research • How feasible, or useful, will that processing be • Understanding legal and managerial aspects of licensing pre-digitized datasets for analysis using grid technologies • Security • Who owns the research outcomes?

  7. Methodology – ReACH Workshop Series • Series of 3 AHRC-funded Workshops • at UCL from June – August 2006 • All Hands Workshop – June 2006 • Featuring input from Historians, Archivists, Digital Librarians, Computing Scientists, Physicists, and Humanities Computing Experts • What is the research question? • It may be technically feasible – but will outcomes be useful? • Technical Workshop – June 2006 • Computing scientists, physicists, archivists • Determining input, output, processing techniques, workflow, and costings of potential project • Managerial Workshop – July 2006 • Legal, security, and managerial aspects of using pre-digitized commercially sensitive data for research purposes

  8. Historical issues – will it be useful? • If data quality/ computational complexity is not an issue: • Longitudinal dataset • Dictionaries of variants • Probability modelling of variants • Log analysis of how people are using census material • Checking and cleansing of census data • Generation of simple statistics • Calculating and identifying individuals who have been missed out in various censuses. • Reconstitution of missing data in the records through contextual information • Develop OCR techniques which can be used on copperplate • Techniques for social computing and family histories • Geographically normalised dataset • Mapping of geography to names • Assign grid references to historical data • Adding current geographical data to the census • Visualisation techniques
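To illustrate what a "dictionary of variants" on this slide might look like in practice, here is a minimal sketch that maps transcribed spellings of a forename to one canonical form. The variant lists and the function names (`build_lookup`, `normalise`) are hypothetical examples, not data or tooling from the ReACH project.

```python
# A minimal sketch of a dictionary of name variants.
# The variants below are illustrative only, not taken from census data.

VARIANTS = {
    "william": {"william", "wm", "willm", "will"},
    "elizabeth": {"elizabeth", "eliz", "elizth", "betsy"},
}


def build_lookup(variants):
    """Invert {canonical: spellings} into {spelling: canonical}."""
    lookup = {}
    for canonical, spellings in variants.items():
        for spelling in spellings:
            lookup[spelling] = canonical
    return lookup


def normalise(name, lookup):
    """Return the canonical form of a name, or the cleaned name if unknown."""
    key = name.strip().lower().rstrip(".")
    return lookup.get(key, key)


LOOKUP = build_lookup(VARIANTS)

if __name__ == "__main__":
    for raw in ["Wm.", " Eliz ", "Zebedee"]:
        print(repr(raw), "->", normalise(raw, LOOKUP))
```

A lookup table like this only handles variants someone has already listed; the "probability modelling of variants" bullet points towards scoring unseen spellings instead.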

  9. Is it technically possible? • Implementing a project would be relatively straightforward • Mount it on UCL Research Computing facilities • SGI Altix Facility: 135 GFlops • Access to data relatively straightforward • Output to XML database • 20 GB of data warrants use of grid computing for searching and analysis • Computational Grid techniques (and CS algorithms) • No real understanding of tools to benchmark cross dataset record matching • Of great interest to physicists, astronomers, astrophysicists, computing scientists… • Further research could investigate how automated record linking could be initiated, using probability modelling of variants
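The record linking mentioned on this slide can be pictured with a small sketch that scores candidate matches for one person across two censuses taken ten years apart. This is a simple weighted similarity score built on Python's standard `difflib`, not the probability model the presentation has in mind; the field names (`name`, `age`, `birthplace`), the weights, and the threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(earlier, later, years_apart=10):
    """Score how plausibly two records describe the same person.

    Weights are arbitrary illustrative choices, not calibrated values.
    """
    name = similarity(earlier["name"], later["name"])
    place = similarity(earlier["birthplace"], later["birthplace"])
    # Ages should differ by roughly the gap between censuses;
    # allow for error by tapering the score to zero over 5 years.
    age_error = abs((later["age"] - earlier["age"]) - years_apart)
    age = max(0.0, 1.0 - age_error / 5.0)
    return 0.5 * name + 0.3 * age + 0.2 * place


def best_match(record, candidates, threshold=0.8):
    """Return the highest-scoring candidate, or None if none is good enough."""
    scored = [(match_score(record, c), c) for c in candidates]
    score, candidate = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return candidate if score >= threshold else None


if __name__ == "__main__":
    person_1881 = {"name": "John Smyth", "age": 24, "birthplace": "Leeds"}
    census_1891 = [
        {"name": "John Smith", "age": 34, "birthplace": "Leeds"},
        {"name": "Jane Smith", "age": 12, "birthplace": "York"},
    ]
    print(best_match(person_1881, census_1891))
```

Comparing every record against every candidate scales quadratically, which is one reason 200 million records would call for grid-scale computing or for first grouping records by place or surname.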

  10. Is it feasible? Managerial Issues • Send in the lawyers… • Major legal issues in gaining access to commercially sensitive digitized data sets • Need for consortium agreements • Need to safeguard intellectual property rights • Need to ascertain who owns research outcomes • Datasets created in the process of analysing other datasets • Arts and Humanities need institutional backing in this area • Access to small subset of data in first instance to establish proof of concept • Need to set up secure systems and data management to ensure limited access to commercial datasets • Following lead of medical sciences

  11. But is this possible with the information available? • Historical census material • Complex and flawed dataset • For historical reasons • The very fact it is complex provides interesting opportunities to investigate record matching techniques • Also, access to other datasets needed • “triangulation” • Births, marriages and deaths • Burials • Parish registers • In England and Wales, this data is not in the public domain (yet), and not available in digital form • In order to undertake this project successfully, a massive digitisation project would have to be undertaken first • Or wait a few years until others undertake the digitisation project.

  12. Findings: e-Science and the Census • There has been much financial, industrial and academic investment in the creation of digital records from the English and Welsh historical census data • BUT there is neither the quantity nor the quality of information currently available to allow useful and usable results to be generated, checked, and assessed • this will change as more data is digitised and becomes public • The potential for high performance processing of large scale census data is large • may result in useful techniques and datasets (for historian, genealogist and beyond) • but only when adequate historical data becomes available • This should be revisited in the future

  13. Findings – e-Science and the A + H • High performance computing and e-Science community were very welcoming to researchers in the Arts and Humanities • Often the problems facing e-Science research in the arts and humanities are not technical • Nature of humanities data means that novel computational techniques need to be developed to analyse and process them • fuzzy, small scale, heterogeneous, of varying quality, and transcribed by human researchers • as opposed to scientific datasets • large scale, homogeneous, numeric, and generated (or collected/sampled) automatically • Arts and Humanities projects need to engage with the legal issues in using and creating commercially sensitive datasets • Sensitive data sets and security: Arts and Humanities researchers should look towards Medical Sciences for their methodologies in data security and management • in particular utilising ISO 17799 to maintain data integrity and security

  14. Conclusion • Aimed to deliver a full project proposal for future funding rounds • Had to decide not to take this forward • Undertaking this pilot project prevented long term funding being wasted on a project which would have failed • Highlighted issues, problems, solutions, and barriers to any humanities project that may wish to use the computational grid to do complex record analysis • Report available from www.ucl.ac.uk/reach/
