Identifying how your data files will be described during all phases of your research is an important component of planning for data management.
Why provide a description of your research data?
Describing research data with documentation helps ensure the meaning or significance of your data remains clear over time.
- Help yourself understand your dataset: you may be very familiar with your dataset in the collection stage, but the chances are slim that over time you will still remember that the variable "sglmemgp" means "single member group," for example, or what the exact algorithm was that you used to transform a variable or create a derived one.
- Help others understand your dataset: other scholars may want to examine or use your data to understand your findings, verify your findings, review your submitted publication, replicate your results, or design a similar study. Documentation makes the significance of research data clear for individuals not involved in the research project themselves.
More details are available in our recorded workshop "Organize Your Research Data Part 1: Documentation".
Reproducibility
Reproducibility in a research context refers to the ability of independent researchers to obtain the same result as a given study using that study's original data. This differs slightly from "replicability," where independent researchers test the original study using the same methodology and criteria but with newly collected data.
Promoting reproducibility helps ensure that current and future work is grounded in solid research whose results have been independently verified. In addition, journal publishers and research funding agencies increasingly require that research data be structured and shared openly in a way that facilitates subsequent reproducibility.
Types of documentation
Metadata
Metadata is "data about data." It is the information necessary to make your data findable and understandable in an online archive or data repository.
Describing your data well helps ensure that it remains usable over time. Using established metadata standards will help preserve your data and make it discoverable, citable, and ready for use by others.
Metadata can be descriptive, administrative, or structural in nature:
- Descriptive metadata provides basic details about an object or data, including key elements such as object title, creator, keywords, and abstract or summary.
- Administrative metadata supplies information about an object’s management history, including its creation date, copyright permissions, and provenance or history.
- Structural metadata describes how objects are related to one another, helping users navigate research data by making the sequence, format, and structure of research objects clear.
Metadata can be stored simply as a spreadsheet or a plain text document containing the elements chosen to describe your data. You may also choose to embed your metadata into the data file itself.
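As a rough illustration of storing metadata in a plain text document, the sketch below groups a few elements by the three metadata types and serializes them as JSON. Every field name, value, and file name here is a hypothetical placeholder rather than part of any metadata standard; a real project would take its elements from the standard used in its discipline.

```python
import json

# Hypothetical metadata record; all field names and values are illustrative.
metadata = {
    "descriptive": {
        "title": "Example survey of group membership",
        "creator": "A. Researcher",
        "keywords": ["survey", "membership"],
        "abstract": "Placeholder summary of the dataset.",
    },
    "administrative": {
        "created": "2024-01-15",
        "license": "CC-BY-4.0",
        "provenance": "Collected by the project team; cleaned once.",
    },
    "structural": {
        "files": ["responses.csv", "data_dictionary.csv"],
        "format": "CSV, UTF-8",
    },
}


def metadata_to_text(record):
    """Serialize the record as indented JSON, which is plain text."""
    return json.dumps(record, indent=2, sort_keys=True)


def write_metadata(record, path):
    """Save the record alongside the data files it describes."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(metadata_to_text(record))
```

Calling `write_metadata(metadata, "metadata.json")` would place the description next to the data. JSON is only one possible choice; the same elements could equally be kept as rows in a spreadsheet.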
Metadata standards
Standards vary by discipline and reflect the requirements necessary for accurately describing the data. These standards govern qualities like required fields and level of detail provided in the metadata. Choose a standard that is commonly used in your discipline or for the type of data you will be creating.
The Digital Curation Centre website has a searchable list of metadata standards organized by discipline as well as metadata recommendations for certain use cases and links to tools for creating and organizing metadata.
Data dictionaries
A data dictionary is used to define variable names, allowed values, units, and abbreviations found in your data. This information can be stored as a spreadsheet with columns for the variable name as given in the data, a description of what it is, the type of value (e.g., numeric, date, text), and allowed values (e.g., a particular value or date range). The data dictionary should also state how missing values are represented (e.g., NA or blank). This can help future researchers better understand your data, and it can help you when cleaning and analyzing the data.
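The columns described above can be sketched as a small CSV, together with a check that uses the dictionary while cleaning data. This is a minimal sketch under stated assumptions: the column names, the `|` and `low-high` conventions for allowed values, and the variables other than "sglmemgp" are all hypothetical choices made for illustration.

```python
import csv
import io

# Hypothetical data dictionary; in practice this would be a spreadsheet or CSV file.
DICTIONARY_CSV = """variable,description,type,allowed_values,missing_code
sglmemgp,Single member group indicator,numeric,0|1,NA
age,Age of respondent in years,numeric,18-99,NA
region,Region of residence,text,north|south|east|west,NA
"""


def load_dictionary(text):
    """Return a mapping of variable name -> data dictionary row."""
    return {row["variable"]: row for row in csv.DictReader(io.StringIO(text))}


def is_allowed(value, entry):
    """Check one raw value against its data dictionary entry."""
    if value == entry["missing_code"]:
        return True  # a documented missing value is acceptable
    allowed = entry["allowed_values"]
    if "-" in allowed and "|" not in allowed:  # numeric range such as 18-99
        low, high = (float(part) for part in allowed.split("-"))
        try:
            return low <= float(value) <= high
        except ValueError:
            return False
    return value in allowed.split("|")


def undocumented_or_invalid(record, dictionary):
    """List the variables in a record that are undefined or out of range."""
    problems = []
    for name, value in record.items():
        if name not in dictionary:
            problems.append(name)
        elif not is_allowed(value, dictionary[name]):
            problems.append(name)
    return problems
```

For example, `undocumented_or_invalid({"sglmemgp": "2", "height": "170"}, load_dictionary(DICTIONARY_CSV))` flags both variables: the first holds a value outside the allowed set, and the second has no entry in the dictionary.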
For more information, see the Open Science Framework's guide to creating a data dictionary and find more examples from the U.S. Geological Survey (USGS).
Codebooks
A codebook describes how data were collected for a study and documents the data resulting from any survey instruments. It includes information about the study, such as the methodology, time frame, scope, and population size. It also includes a description of each variable, the exact wording of the corresponding survey question, and the levels of each response. Summary statistics may also be included.
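A single variable's codebook entry might be assembled as in the sketch below, which combines the question wording, the response levels, and simple frequency counts as summary statistics. The question text, response codes, and sample responses are invented for illustration, and the layout is an assumption rather than a DDI or ICPSR format.

```python
from collections import Counter

# Hypothetical survey variable; question wording and codes are illustrative.
VARIABLE = {
    "name": "sglmemgp",
    "question": "Does the group have only one member?",
    "levels": {"1": "Yes", "0": "No"},
}
RESPONSES = ["1", "0", "0", "1", "0", "NA"]


def codebook_entry(variable, responses, missing_code="NA"):
    """Format one codebook entry with response levels and frequencies."""
    counts = Counter(responses)
    lines = [
        f"Variable: {variable['name']}",
        f"Question: {variable['question']}",
        "Levels:",
    ]
    for code, label in variable["levels"].items():
        lines.append(f"  {code} = {label} (n = {counts[code]})")
    lines.append(f"Missing ({missing_code}): {counts[missing_code]}")
    return "\n".join(lines)
```

Printing `codebook_entry(VARIABLE, RESPONSES)` gives a short text block per variable; study-level details such as methodology and time frame would sit above these entries in a full codebook.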
For more information, see the DDI Alliance's Marked-up Codebook examples and "What is a Codebook?" from ICPSR.
ReadMe text files
A ReadMe is typically a plain text file which includes information about the dataset such as the title, date(s) and location of collection, license, version, and variable definitions. It should also include the name and contact information of the principal investigator or other contact person. A ReadMe can be created for a single data file or for multiple data files related to the same research project.
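The elements listed above could be filled into a plain text template along the following lines. This is only a sketch: the template layout, the key names, and the example values (including the contact details) are hypothetical and do not reproduce any institution's ReadMe template.

```python
# Hypothetical ReadMe layout; field labels follow the elements named in the text.
README_TEMPLATE = """\
Title: {title}
Collection date(s): {dates}
Collection location: {location}
License: {license}
Version: {version}
Contact: {contact_name} <{contact_email}>

Variable definitions:
{variables}
"""

# Illustrative values only.
EXAMPLE_INFO = {
    "title": "Example dataset",
    "dates": "2024-01-15 to 2024-03-01",
    "location": "Example City",
    "license": "CC-BY-4.0",
    "version": "1.0",
    "contact_name": "A. Researcher",
    "contact_email": "researcher@example.org",
}


def build_readme(info, variables):
    """Fill the template; `variables` maps variable name -> definition."""
    variable_lines = "\n".join(
        f"  {name}: {definition}" for name, definition in variables.items()
    )
    return README_TEMPLATE.format(variables=variable_lines, **info)
```

The result of `build_readme(EXAMPLE_INFO, {"sglmemgp": "single member group"})` can be saved as a text file next to the data. One ReadMe built this way can cover a single data file or several files from the same project.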
For more information or to download a ReadMe template, see Cornell University's guide to writing readmes.