Chapter 21
Managing research data
Mary Anne Kennan
Charles Sturt University, Australia
This chapter describes the various forms and sources of research data and the importance of planning to appropriately manage data throughout their life cycle. The many reasons that data should be managed within research projects and programs (and beyond to enable future use) are discussed. Legal, ethical and policy reasons for planning are introduced, as are practical and pragmatic reasons, along with the role of researchers in data management processes. Ten important components of a data management plan are addressed and a checklist for researchers in the early stages of constructing a data management plan is provided. The chapter concludes by providing references to useful data management tools and resources.
Research Methods: Information, Systems, and Contexts. DOI: http://dx.doi.org/10.1016/B978-0-08-102220-7.00021-2 Copyright © 2018 Mary Anne Kennan. Published by Elsevier Ltd. All rights reserved.
Introduction
Researchers, research funders and research institutions such as universities place increasing importance on data management planning as a way of improving the access to, and the longevity, sharing and reuse of, research data. Formal research data management is a developing area and best practice is still emerging. Data management has traditionally been an ‘orphan’ activity expected to be carried out, but not explicitly taught, supported or funded (Donnelly, 2012, p. 94). As Donnelly pointed out, neglecting data management is risky for institutions and funders, because activities that do not attract attention, such as funding, reward or recognition, are the first to suffer when time or money gets tight. Thus it is important for data management plans to have associated policies, infrastructures, staffing and systems to support researchers in their implementation.
Research data are heterogeneous and can take many forms depending on the research problem being addressed and the discipline of the researcher. Different types of data in different disciplines therefore require different treatment in the data management process. In disciplines where data may be impossible to reproduce and remain important over time (such as the data in longitudinal studies in the social sciences or climate data in the environmental sciences), stress may be placed on preservation, but in fields where fast and easy sharing of data is desirable (such as medicine and the biological sciences) the planning and management emphasis may be on discovery and reuse. A plan must be fit for purpose for the particular types of research and data. Once a plan is written, it must be treated as a living document: the plan and its operationalisation move forward together, and both need to stay in the minds of researchers and data managers/curators across the life cycle of a research project, as well as the generally longer life of the data.
Research data: Forms and sources
There is a wealth of information on the complexities of research data, defined as data “collected, observed or created for the purposes of analysing to produce original research results” (Rice, 2009, p. 16). Data sources are varied. In the life and physical sciences, data are generally gathered or produced by researchers through observations, experiments or computer modelling. Borgman (2007, p. 182) provides some examples from different scientific disciplines: X-rays in medicine, protein structures in chemistry, spectral surveys in astronomy, specimens in biology, events and objects in physics. In the social sciences researchers may gather or produce their own data from, e.g., interviews, surveys, and observations, or obtain them from third parties such as the Australian Bureau of Statistics or the Organisation for Economic Co-operation and Development. Humanities data often come from cultural records, archives and objects, both published and unpublished (Borgman, 2007).
Digital data
Researchers have always collected data as evidence for their findings and claims. These data are now increasingly digital. For example, scientific research used to be based in the laboratory. In the past, scientists kept meticulous notebooks, recording and detailing data and all stages of research. Now much research, even research conducted in a laboratory, uses computers in some way and produces digital data. Similarly, much humanities and social science research data, which may once have been textual data recorded in notebooks, are now in digital formats such as word processing documents, databases and spreadsheets. Digital research can often be complex and heterogeneous, meaning that the issues in data management and curation are similarly complex. Issues include: technology obsolescence; technology fragility (e.g., corruption of files); lack of guidelines on good practice; inadequate financial and human resources to manage data well; and lack of evidence about best infrastructures (Oliver & Harvey, 2016).
Why research data need to be managed
Until recently data have rarely been seen by people beyond the initial research team. Data are usually analysed, summarised and published as a summary, often theorised, in articles, books and other publications. Analysed and published data are generally selective representations of a small amount of the raw data originally collected by the researchers (Latour, 1987). The analysis and summary of data in publications inevitably incorporate methodological and pragmatic choices made by the researchers at different stages of the research and limit subsequent interrogation of the data. Data that are abstracted and prepared for publishing in these ways do not necessarily provide sufficient information for future users (Markauskaite, 2010). These considerations have led to the rise of the open data movement. ‘Open data’ is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. It is related, but not completely analogous, to open access to publications (Murray-Rust, 2008). For discussion of ‘open access’ issues, see Chapter 22: Research dissemination.
There are several reasons for managing data, whether or not they are made open, within research programs, to enable sharing and future use. The characteristics of data that are particularly relevant here include that data:
1. are expensive to collect and therefore publicly funded research should be publicly available (Murray-Rust, 2008);
2. may be unique, e.g., represent a snapshot in time or space (Henty, Weaver, Bradbury & Porter, 2008);
3. can be re-used to reproduce and validate original findings, to advance the original research or to open another line of enquiry (Witt, 2009);
4. can also contribute to answering questions which may require interdisciplinary problem solving (Cragin, Palmer, Carlson & Witt, 2010);
5. may be used to examine a phenomenon from different epistemic or social perspectives (Markauskaite, 2010); and
6. may need to be collected from a variety of sources, beyond the scope of one research team, time or location (Borgman, 2007).
Several of the points above imply the need for data sharing, which is a key element of collaboration (Borgman, 2006; Williamson, Kennan, Johanson, & Weckert, 2016). Altruism, citation and the potential for new collaboration opportunities may motivate some researchers to share their data, but currently there are no explicit and tangible rewards for doing so and researchers report it as low on their list of priorities (Henty, Weaver, Bradbury & Porter, 2008; Markauskaite, Kennan, Richardson, Aditomo & Hellmers, 2012).
Data come in sets: collations of many individual data points. In most research, researchers create or use a number of datasets or databases. These sets of data are referred to as data collections. Data collections are defined by the infrastructure, organisations, and individuals needed to provide persistent access to them. Research data collections can be the output of a single researcher, a collaborative team or a laboratory. They can be produced during the course of a specific research project and can last beyond it if they are stored and preserved into the future.
The useful life of data and data collections is referred to as the data life cycle. Decisions made at each stage of the data life cycle determine what data are available at the next stage, how they are handled, and the purposes for which they are useful (Wallis, Borgman, Mayernik, & Pepe, 2008). The life cycle of data will differ between disciplines and between the research traditions within them. At different stages in the life cycle, different people will be responsible for the data. There are many different data life cycle models, including that of the Digital Curation Centre (DCC) in the United Kingdom, which focusses on data curation and preservation issues (DCC, n.d.; Higgins, 2008; Higgins, 2012), and that of the Data Documentation Initiative (DDI) (2012). Both elaborate researcher and curator roles in data conceptualisation, collection, processing, distribution, discovery, analysis, repurposing, and archiving.
Data management planning and processes
Reasons for the need to plan for data management were mooted in the introduction to this chapter. This section examines legal, ethical and policy requirements for planning, as well as discussing pragmatic and practical reasons to plan. The role of researchers in the planning process and the influences of organisations and associated actors on planning are also discussed, along with the processes involved in data management.
Legal, ethical and policy requirements for planning
Decisions made in relation to the management of research data should be informed by relevant legislation and codes; national and institutional policy; and the procedures and guidelines that the research project must adhere to. These will vary in different countries and states. The examples given here are largely Australian, but similar ones will inform research practice in many other countries.
In Australia, the Australian Code for the Responsible Conduct of Research guides institutions and researchers in responsible research practice. The Code has been jointly developed by the National Health and Medical Research Council (NHMRC), the Australian Research Council (ARC) and Universities Australia. Compliance with the Code is a pre-requisite for NHMRC and ARC funding (NHMRC/ARC/Universities Australia, 2007). Other related national codes exist for the ethical conduct of research (e.g., NHMRC/ARC/AVCC, 2007). A central aim of the Code is that sufficient data and materials are retained to justify the outcomes of research and to defend such outcomes should they be challenged. The Code also makes recommendations about what research data and associated information should be kept (NHMRC/ARC/Universities Australia, 2007). Similarly, national ethical codes make recommendations about the ethical matters that researchers should consider regarding their data. (See Chapter 20: Ethical research practices.) A data management plan enables processes and checks to be put in place to ensure these requirements are met.
In addition there are legislative requirements in different jurisdictions that need to be taken into consideration by researchers. A plan enables compliance issues to be visible and achievable.
For example, in New South Wales, institutions and researchers will be guided by the provisions of the State Records Act 1998 regarding the retention and disposal of research data. Various copyright acts may also apply to data, including digital forms of text and images, packaged variously as videos or DVDs, for example. The code of computer programs (both the human readable source code and the machine readable object code) is protected by copyright as a literary work. Data compilations such as datasets and databases can also be protected by copyright. Although, as a general rule of copyright, ownership resides with the author or creator, copyright can belong to a person’s employer or to others, depending on conditions of employment or on contracts in multi-institutional collaborations. If researchers wish to share their data online, deposit them into a database, repository or archive, or allow others to access and use their data in any way, they will need to build appropriate permissions and licenses into their ethical approval processes and their data management plan and metadata.
Finally, there will be an array of institutional policies that will have a bearing on any research data management plan. Taking my own institution as an example, these may include policies on intellectual property, research data management, authorship, and ethics.
Pragmatic and practical reasons to plan
Good research data management practices ensure that researchers and institutions are able to meet their obligations to funders, improve the efficiency of research, and make data available for sharing, validation and re-use. Without proper management, data, or the media on which they are stored, may quickly become outdated or obsolete. Data may be misinterpreted through poor or inadequate description, or can sometimes go missing entirely, especially if files are not maintained and backed up on a regular basis. Without a data management strategy and skills, data can be lost, rendered useless, or left unable to deliver their full usefulness. Many research funding bodies around the world recognise this and now require data management plans to be attached to applications for funding. The plans go through peer review, along with the rest of the application.
The role of researchers
When many researchers think about data management, they may be thinking about common research processes involving data in an active research project or program rather than the management of data through their entire life cycle. Quantitative researchers, for example, may be thinking of processes they already undertake in preparing their data for statistical analysis, such as cleansing, coding, documenting, variable creation, checking and digitising (e.g., the processes described in Chapter 18: Quantitative data analysis). Qualitative researchers might consider the work they do with their data, such as documenting, coding, describing, sorting, summarising, and iterative analysing (e.g., the processes described in Chapter 19: Qualitative data analysis). Data management planning includes these processes within the research process, but can also include longer-term concerns such as storage, curation, preservation, and description. These longer-term processes are better performed by, or in conjunction with, data librarians, curators and managers within researchers’ institutions.
Unmanaged data represent a financial and opportunity loss, and yet the description, long-term storage and preservation of data are not necessarily key concerns of the researchers who create and use data. Nevertheless, effective data management must begin with the researcher or research team (Pryor, 2012). Because data management is a part of digital preservation, which is a part of data curation, a data management plan will generally fit into a suite of policies and procedures. Good data management requires input from a team of stakeholders: from the research team, to university policy makers, information technology and information management staff, and librarians or information managers.
Many researchers will regard the full cycle of data management planning as an additional and unwelcome burden. Some researchers question why data need to be kept after the results of the research have been published and statutory requirements met. However, there are a range of reasons that data need to be maintained before, during and after the research results have been published. Key issues to be considered are:
• Risk minimisation. Many researchers have had an experience of data loss or corruption. Many research data are kept on degradable, portable (and therefore easily lost) media such as external disk drives, CD-ROMs, DVDs, USB flash drives and memory sticks.
• Funder and institutional policies and requirements, legislation (such as the State Records Acts). These require data management planning.
• Not all data management plans are the same. Some will be complex and require that data are preserved and archived. Others may simply require that data are stored for a period of time and then destroyed. Researchers need to know into which category their data fall.
• Contribution to future research. Some data will contribute to future research and there are good arguments for their retention with open access. For other data there are strong ethical reasons which constrain sharing and reuse.
• Planning and curation from the creation stage. It is far cheaper and less time consuming to plan and appropriately curate data from the creation stage than it is to correct mistakes or attempt to recover data later from obsolete storage media or after corruption has occurred (Donnelly, 2012).
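One practical way to act on the risk of silent data loss or corruption described above is to record a checksum for every data file and re-check it later. The following is a minimal sketch in Python, not a method prescribed by the chapter or by any of the cited sources; the function names (sha256_of, build_manifest, changed_files) and the choice of SHA-256 are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or no longer match."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

A research team might build the manifest when data collection ends, store it alongside each backup copy, and run changed_files before relying on a copy. An empty list indicates that every recorded file is still present and unchanged; it does not detect files added after the manifest was built.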
Influences of organisations and other actors
Every data management plan will need to be different, reflecting the particular research involved; the disciplinary backgrounds of the researchers and the influences on them; national, funder and institutional policies; whether the data are ethically sharable; and whether they are unique or reproducible. Donnelly (2012, p. 88) lists the influences and actors as: funder requirements; deposit requirements, e.g., data and metadata formats; organisational and project requirements; research group and laboratory requirements and constraints; disciplinary, social and personal influences; publisher requirements; and legal requirements.
A number of people and groups may need to be involved in the planning and implementation stages, from the researchers, to data librarians, to repository or archive managers. Other people who may need to be involved include research policy staff and grants officers and, where appropriate, technical and laboratory staff. It is not realistic to expect all stakeholders to become experts in all the aspects of data management, but the planning process and policy should enable clear identification of roles and responsibilities at each stage of the process.
The processes involved
The Digital Curation Centre (DCC, n.d.) in the UK recommends three stages in data management planning. The first is at the grant application or project conceptualisation stage. This plan is relatively short and promissory. Once funding is secured and/or the project is confirmed, the core plan should be developed. Towards the end of the project, the full plan is finalised and includes long-term management and preservation plans, often developed in consultation with people with relevant expertise such as archivists and information managers, curation and preservation specialists, and repository managers (Kennan, 2016).
There are no hard and fast rules about what should be included in a data management plan, as each plan depends on the nature of the research and data. However, some guidance is necessary. Donnelly (2012) lists ten important components of a data management plan:
1. Introduction and context: addresses the aims of the project and the data management plan, the intended audience of each, and relates the plan to relevant policies and procedures.
2. Data types, formats, standards and capture methods: describes the data and the metadata which will make the data (re)usable.
3. Ethics and intellectual property: includes legal obligations, licensing, sharing, embargo periods and release schedules.
4. Access, sharing and re-use: expands on the above, by addressing who are the likely re-users, and when and how the data might be accessed.
5. Short-term storage and data management: provides specifics during the research project lifetime, e.g., when data are likely to be only available to team members, or if there are likely to be large resource requirements, and specifying data volumes and formats.
6. Deposit and long-term preservation: need to be addressed where the usefulness of a dataset is likely to outlive the interest or career of the original researcher/s. Aspects to consider include selection, appraisal, and future discovery, access and preservation.
7. Resourcing: the human, financial and infrastructure resources that are required for the short- and long-term management of the data.
8. Adherence and review: how the plan will be adhered to, how often it will be reviewed, and who will be responsible for these activities.
9. Statement of agreement: formalises the plan, especially in the case of multi-partner projects.
10. Annexes: may be included, e.g., contact details and expertise of nominated data managers, glossary of technical terms, references to previous versions of the plan (if appropriate) (Donnelly, 2012, pp. 93–94).
Each of the components above has multiple parts.
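For researchers who draft their plans alongside their code or data, the ten components above can be held in a simple structure that reports which sections are still unwritten. The following Python sketch is illustrative only: the class DataManagementPlan, its methods and the decision to treat the annexes as optional (the list says they "may be included") are assumptions of this example, not part of Donnelly's (2012) guidance or of any planning tool.

```python
from dataclasses import dataclass, field

# The ten component headings listed by Donnelly (2012), in order.
COMPONENTS = (
    "Introduction and context",
    "Data types, formats, standards and capture methods",
    "Ethics and intellectual property",
    "Access, sharing and re-use",
    "Short-term storage and data management",
    "Deposit and long-term preservation",
    "Resourcing",
    "Adherence and review",
    "Statement of agreement",
    "Annexes",
)

# Assumption for this sketch: annexes are optional.
OPTIONAL = frozenset({"Annexes"})


@dataclass
class DataManagementPlan:
    """Hypothetical container holding one free-text section per component."""

    project: str
    sections: dict = field(default_factory=dict)

    def write(self, component: str, text: str) -> None:
        """Store the text for one component, rejecting unknown headings."""
        if component not in COMPONENTS:
            raise ValueError(f"Unknown component: {component}")
        self.sections[component] = text.strip()

    def missing(self) -> list:
        """Required components that have no text yet, in list order."""
        return [
            c for c in COMPONENTS
            if c not in OPTIONAL and not self.sections.get(c)
        ]
```

Calling missing() on a new plan returns the nine required headings; as sections are written the list shrinks, which gives a research team a quick check at each review point of the plan.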
Data management planning tools and resources
There are many tools and resources created to assist researchers with data management planning. Researchers beginning data management planning need to first consult with research collaborators (or supervisors in the case of research higher degree students), consider disciplinary norms, and check the data management and reporting requirements of their institutions, collaborators and research fund granting bodies. Many of these people and bodies may have quite explicit requirements and guidelines. If not, Box 21.1 contains examples of some useful tools and resources for beginning research data planners.
Box 21.1 Data management planning tools and resources
Tools for reference. Bookmark for referral in your research practice.
Australian National Data Service (ANDS) data management planning resources: http://ands.org.au/datamanagement/index.html
Australian National Data Service (ANDS) data management plans: http://ands.org.au/guides/data-management-planning-awareness.html
Digital Curation Centre checklist, online planning tool and other resources: http://www.dcc.ac.uk/resources/data-management-plans
Purdue University data management plan (DMP) self-assessment tool: http://purr.purdue.edu/resources/14/download/DMP_Self_Assessment.pdf
Queensland University of Technology data management planning tools: https://dmp.qut.edu.au/
UK Data Archive. Create and Manage Data: http://www.data-archive.ac.uk/create-manage
University of California, DMP Tool: https://dmp.cdlib.org/
Conclusion
This chapter has explored the reasons researchers, institutions and information managers need to plan for the management of research data. It has also examined the processes involved and the basic components of a data management plan. Different management processes and plans are required for different types of research and data. For example, when small amounts of data are collected or the data are simple transcriptions of interviews or small surveys, data management plans may only need to be minimal and adaptable to the immediate research. As the scale of the project or the data collection increases, where the number of people involved is large, or where the data may have a use beyond their original purpose (i.e., be suitable for re-use or sharing), data management plans need to become more explicit and complete (Borgman, 2007). This chapter provides the foundational concepts and reasons for researchers to plan and manage their data and concludes with links to some tools and resources useful for data management planning.
References
Borgman, C. L. (2006). What can studies of e-learning teach us about collaboration in e-research? Some findings from digital library studies. Computer Supported Cooperative Work, 15(4), 359–383.
Borgman, C. L. (2007). Scholarship in the digital age: Information, infrastructure, and the Internet. Boston: MIT Press.
Cragin, M. H., Palmer, C. L., Carlson, J. R., & Witt, M. (2010). Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 368(1926), 4023–4038.
Data Documentation Initiative (DDI). (2012). Why use DDI? Retrieved from http://www.ddialliance.org/training/why-use-ddi.
Digital Curation Centre (DCC). (n.d.). DMP Online. Retrieved from https://dmponline.dcc.ac.uk/.
Donnelly, M. (2012). Data management plans and planning. In G. Pryor (Ed.), Managing research data. London: Facet.
Henty, M., Weaver, B., Bradbury, S. J., & Porter, S. (2008). Investigating data management practices in Australian universities. Canberra: Australian Partnership for Sustainable Repositories. Retrieved from http://apsr.anu.edu.au/orca/investigating_data_management.pdf
Higgins, S. (2008). The DCC curation lifecycle model. International Journal of Digital Curation, 3(1), 134–145.
Higgins, S. (2012). The lifecycle of data management. In G. Pryor (Ed.), Managing research data (pp. 17–45). London: Facet.
Kennan, M. A. (2016). Data management: Knowledge and skills required in research, scientific and technical organisations. Paper presented at IFLA World Library and Information Congress, 82nd IFLA General Conference and Assembly, 13–19 August, Columbus, OH. Retrieved from http://library.ifla.org/1466/1/221kennan-en.pdf.
Latour, B. (1987). Science in action. Cambridge, MA: Harvard University Press.
Markauskaite, L. (2010). Digital media, technologies and scholarship: Some shapes of eResearch in educational inquiry. Australian Educational Researcher, 37(4), 79–101.
Markauskaite, L., Kennan, M. A., Richardson, J., Aditomo, A., & Hellmers, L. (2012). Investigating eResearch: Collaboration practices and future challenges. In A. Juan, T. Daradoumis, M. Roca, S. Grasman, & J. Faulin (Eds.), Collaborative and distributed e-research: Innovations in technologies, strategies and applications (pp. 1–33). Hershey, PA: IGI Books.
Murray-Rust, P. (2008). Open data in science. Serials Review, 34(1), 52–64.
NHMRC/ARC/Universities Australia. (2007). Australian code for the responsible conduct of research. Canberra: Australian Government. Retrieved from https://www.nhmrc.gov.au/guidelines-publications/r39
NHMRC/ARC/AVCC. (2007, updated 2015). National statement on ethical conduct in human research. Canberra: Australian Government. Retrieved from https://www.nhmrc.gov.au/guidelines-publications/e72.
Oliver, G., & Harvey, R. (2016). Digital curation (2nd ed.). London: Facet.
Pryor, G. (Ed.) (2012). Managing research data. London: Facet.
Rice, R. (2009). DISC-UK datashare project: Final report. Bristol: JISC. Retrieved from http://repository.jisc.ac.uk/336/1/DataSharefinalreport.pdf.
Wallis, J., Borgman, C., Mayernik, M., & Pepe, A. (2008). Moving archival practices upstream: An exploration of the life cycle of ecological sensing data in collaborative field research. International Journal of Digital Curation, 1(3), 114–126.
Williamson, K., Kennan, M. A., Johanson, G., & Weckert, J. (2016). Data sharing for the advancement of science: Overcoming barriers for citizen scientists. Journal of the Association for Information Science and Technology, 67(10), 2392–2403.
Witt, M. (2009). Institutional repositories and research data curation in a distributed environment. Library Trends, 57(2), 191–201.