Project Data Archiving - Lessons from a Case Study
Release date: March 1998
This is one of a series of guides for research and support staff involved in natural resources projects. The subject matter here is project data archiving. Other guides give information on allied topics. Your comments on any aspect of the guides would be welcomed.
1. Introduction
An integral part of many research projects is the collection of survey or experimental data at considerable cost in money, time and effort. Great care may be taken to produce good quality data at both the data collection and computerisation stages, but there is usually little emphasis on ensuring that the data are available to other users in a form that will allow them to be readily understood and correctly used in subsequent studies. Ideally the creation of an archive should be integrated with the ongoing work of a project, rather than being left as an afterthought for which time may run out once the team has dispersed.
This brief note is intended to raise awareness of the importance of preserving project data, to discuss characteristics that make for a good data archive, and to provide an example of a successful archive.
2. Why preserve project data?
In the past, the generally-available records of project data have often been only in publications where limited space was available to summarise key features relevant to the specific slants of the papers concerned. Modern computing and data storage facilities mean there is now no technical reason why much more detailed data should not be preserved and readily reproduced, in a form in which they can be accessed and used by others.
As part of the case for support, proposers of some projects argue that the quantitative information produced will be of relatively long-term or wide-ranging interest; certainly, producing results of only ephemeral value should not recommend a project. This implies a duty on the project team to document and archive the data collected in the course of the work.
Given a worthwhile project, such a record is potentially valuable to secondary users and later workers, if they are given the opportunity to extract information in a form where it will make their own work more effective. Guaranteeing to add this value strengthens the case for funding the initial project.
3. What should be archived?
There are three main types of information which need to be accurately recorded:
* the project main data themselves, not just summary tables;
* the record of how and why data were acquired, and what they represent; and
* documentation about computer files which will allow later data retrieval.
To be useful beyond the project lifespan, archives need to be in an organised form, in almost all cases computerised.
Files should be backed up, with a master version stored securely, and a system should be set up during the project to make full or partial copies accessible to legitimate users thereafter.
General principles of data quality control apply at all stages, e.g. during the definition and development of the data to be collected, data acquisition, and the creation of final computerised datafiles.
4. What is a good data archive?
Many characteristics determine the production of a good data archive. In brief:
* Accessibility, so that users can reach the stored information via widely-available software.
* Ease of use, by ensuring that (i) the data archiving structure is simple so that the relationship between the forms used in the field and the computerised information is evident; (ii) there are clear definitions of variables stored in the archive (e.g. units of measurement) and codes used (labels for categorical variates, etc.); and (iii) there is consistency in names, codes, units of measurement, and abbreviations throughout the archive.
* Reliability, with the archive as free of errors as can be managed within the timescale and budget of the project.
* Documentation viz. (i) procedures used for data collection including sampling methodology and sampling units used, (ii) the structure of the archive, e.g. how different files link together, (iii) a list of computer files comprising the archive, (iv) a full list of all variables including notes on how missing values are treated, (v) summary statistics that allow the user to cross-check if the information retrieved corresponds to that required, and (vi) relevant warnings and comments relating to any part of the database.
* Preservation of anonymity or any conditions of confidentiality with which the data sets were made available by the sources.
* Completeness as far as that is possible and useful. The archive should include a computer file copy of (i) the field forms; (ii) the data management log-book; (iii) descriptions of derived variables, and (iv) special comments and observations.
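The "list of computer files comprising the archive" mentioned under Documentation can be produced mechanically. The following is a minimal sketch only, in Python, and is not part of the procedure described in this guide: the function names are hypothetical, and the use of SHA-256 checksums to detect later corruption of a copy is our assumption.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Return one record per file under `root`: relative path, size, checksum."""
    root = Path(root)
    manifest = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            manifest.append({
                "file": path.relative_to(root).as_posix(),
                "bytes": len(data),
                # SHA-256 is an assumption; any stable checksum would do.
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    return manifest


def verify_manifest(root, manifest):
    """Return the listed files that are missing or whose checksum has changed."""
    current = {entry["file"]: entry["sha256"] for entry in build_manifest(root)}
    return [entry["file"] for entry in manifest
            if current.get(entry["file"]) != entry["sha256"]]
```

A manifest built from the master version could be stored with the documentation, so that anyone holding a copy can run the verification step and see whether any file has been lost or altered.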
5. Medium of dissemination
The medium for dissemination of project data has to be considered in planning any form of archiving. So does the choice of items to be disseminated, which may need to be selective, e.g. because some of the data are confidential.
The argument made at the present time is that:
* it is easier and cheaper to duplicate floppy disks than to photocopy lots of reports, and
* it is easier to re-use numerical information if it is disseminated in computer-readable form.
In the near future it will be possible to disseminate data on CDs. These could include GIS data and also large images with built-in software to let the user view them. At present the hardware required to write CDs is rather expensive, but prices are dropping and may soon be within even small project budgets.
For the moment it seems that items like aerial photographs should be lodged in a place where they can be preserved safely for a reasonably long time, and which has the capacity to copy negatives and positives for legitimate users. Of course details of how to obtain copies should be part of the archive information.
6. Example of a successful data archiving exercise
The Statistical Services Centre (SSC) at the University of Reading was closely involved with statistical aspects of the Estate Land Utilisation Study (ELUS) in Malawi. As part of this involvement, a proposal was made for archiving the large volume of information that had been collected. The proposal was supported by the NR Adviser in Malawi and the ELUS project team.
A member of SSC staff, who had considerable experience of all the computer packages concerned, visited Malawi in February 1997 for three weeks to carry out the archiving exercise. An additional week was needed after the visit to complete the archive and its documentation. The time needed for such work depends on the length of the questionnaires concerned and the quality of the data available to the data archiving consultant. While the ELUS team had paid close attention to ensuring that their main datasets were as free as possible from errors, we note on the basis of other experiences that data-cleaning can be immensely time-consuming.
The archiving exercise involved a one-day workshop for a few identified users of the ELUS database. The aim was to familiarise the participants with the archive structure and organisation and to get their views on ways to improve the archiving procedure.
Archiving the ELUS database successfully in the time span described was possible because the work was approved, funded and completed within the life span of the project, because SSC staff were familiar with the ELUS data structure, and because the ELUS team cooperated in making all relevant information available on disk at the appropriate time.
The Estate Land Utilisation Study was a large nationwide survey carried out in Malawi over the period from mid-1995 to mid-1997, with funding from ODA (now DFID).
Detailed information was collected about the socio-economic structure and utilisation of land within the estate sector. The main survey involved a 125- and a 411-response questionnaire, while three subsequent and more detailed sub-sample studies used longer questionnaires. Two additional but smaller surveys were also a part of the project.
7. What does the archive comprise?
In the case of ELUS, a large A4 ringbinder contains three write-protected floppy disks of zipped (compressed) files, and software to allow these to be automatically restored into 15 MB of hard disk space. The data files are all included in duplicate, in two common formats (SPSS portable files and dBase IV files), at least one of which should be accessible to users for many years to come. Word 6 was used for text files, including the full description of the sampling schemes. On paper, there is a ten-page introduction and a summary sampling report.
There are then several hundred pages giving details, for each questionnaire used, of every file and every variable. For each file the description includes the number of cases, the number of variables, the full list of variables declared, and of variable labels and value labels. For each variable the description includes the variable name and label, the minimum and maximum values and the number of valid (non-missing) cases stored. The volume weighs 1.65 kg.
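The per-file and per-variable descriptions listed above lend themselves to automatic generation. The sketch below, in Python, is illustrative only and is not the software used for ELUS: the function names, the variable names and the example values are all made up, and we assume that a data file has been read into a list of records with missing values stored as None or as an empty string.

```python
def describe_variable(name, label, values, missing=(None, "")):
    """Summarise one variable: name, label, minimum, maximum, valid-case count."""
    valid = [v for v in values if v not in missing]
    return {
        "name": name,
        "label": label,
        "min": min(valid) if valid else None,
        "max": max(valid) if valid else None,
        "valid": len(valid),
    }


def describe_file(records, labels):
    """Summarise a data file held as a list of dicts, one dict per case."""
    variables = list(labels)
    return {
        "cases": len(records),
        "variables": len(variables),
        "codebook": [
            describe_variable(name, labels[name], [r.get(name) for r in records])
            for name in variables
        ],
    }


# Made-up example values, not ELUS data.
records = [
    {"estate_id": 1, "area_ha": 20.0},
    {"estate_id": 2, "area_ha": None},
    {"estate_id": 3, "area_ha": 12.5},
]
labels = {"estate_id": "Estate identifier", "area_ha": "Area (hectares)"}
summary = describe_file(records, labels)
```

A user who retrieves a file can then compare the number of cases, the minimum and maximum, and the number of valid cases against the printed description to confirm that the data have been read correctly.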
Thirty copies were prepared, five including some information rated as confidential for commercial reasons or to protect the anonymity of respondents. Both types of copy were appropriately distributed, e.g. amongst offices of the Government of Malawi, academic institutions, and DFID. Legitimate users and authorised researchers should be able to find a copy of the data in a form in which they can, for example, (a) perform further analyses; (b) with appropriate access, use the information in the archive as an extremely detailed sampling frame, through which to revisit sub-samples of the ELUS estates; (c) integrate ELUS data with their own later findings for longitudinal analysis.
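Preparing two types of copy, one with and one without the confidential items, amounts to deriving a restricted version from the full data. As a minimal sketch in Python, assuming the confidential variables have already been identified by name (the function and the variable names below are hypothetical, and this is not a description of how the ELUS copies were made):

```python
def public_copy(records, confidential):
    """Return a copy of `records` with the confidential variables removed."""
    confidential = set(confidential)
    return [
        {name: value for name, value in record.items() if name not in confidential}
        for record in records
    ]


# Made-up example: "owner_name" stands in for any item rated confidential.
full = [{"estate_id": 1, "owner_name": "A. N. Other", "area_ha": 20.0}]
restricted = public_copy(full, ["owner_name"])
```

The full records are left unchanged, so both versions can be written out from the same master files. Removing named variables may not be enough on its own to protect anonymity, since combinations of the remaining items can sometimes identify a respondent, so the restricted copy should still be reviewed before release.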
8. Further Information
We hope that this document provides some initial ideas on issues of importance in data archiving. The Statistical Services Centre is preparing more detailed guidelines on archiving procedures for potential researchers initiating projects, for their appraisers, and perhaps for government agencies in countries where projects may be done. We would be pleased to hear from researchers with examples or experience of work similar to that described here, from successful - or frustrated - users of data archives, or anyone with ideas to share on the issues involved.
Last updated 23/04/03