A Brief Introduction to Data Management

Research data management concerns the organization, storage, preservation, and sharing of data that is collected or analysed during a research project. Proper planning and management of research data make a project easier and more efficient to run while it is under way. They also facilitate sharing and allow others to validate and reuse the data. In addition, funding agencies increasingly recognize the importance of research data management, and some now request a Data Management Plan (DMP) as part of the grant application process. The sections below give guidance on these matters. Some of the topics are also covered as recorded presentations in the SciLifeLab Data Management playlist.

Data Management Plan

To make your research and data sharing more efficient, it is recommended, and sometimes required by funders, to create a Data Management Plan (DMP) early in your project. A DMP is a revisable document explaining how you intend to handle new and existing data, both during and after your research project. You can write it using either a tool provided by your host institution or SciLifeLab DS Wizard. Ethical and legal considerations regarding the data depend on where the research is conducted; this is especially true for projects that include sensitive human data. For more information about the Swedish context, please review this page on Working with human data.

Data storage

Backing up and archiving your data is essential. The 3-2-1 rule of thumb says that you should have 3 copies of the data, on 2 different types of media, with 1 of the copies at a different physical location. Consider uploading the raw data to a repository as soon as you receive them, under an embargo if that is important to you. This way you always have an off-site backup, with the added benefit of making the data sharing phase more efficient. Identifying a suitable repository early on will allow you to conform to its standards and metadata requirements from the start.
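
Copies are only useful as backups if they are identical to the original. As a minimal sketch, assuming the copies are reachable as ordinary file paths, the following Python code compares SHA-256 checksums of an original file and its copies. The function names sha256_of and copies_match are hypothetical and chosen for illustration; they are not part of any NBIS or SciLifeLab tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large raw data files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, copies: list[Path]) -> dict[str, bool]:
    """Map each copy's path to True if its checksum equals the original's.

    A copy that does not exist is reported as False.
    """
    expected = sha256_of(original)
    return {
        str(copy): copy.exists() and sha256_of(copy) == expected
        for copy in copies
    }
```

Calling copies_match(Path("raw/run1.fastq.gz"), [Path("/mnt/backup/run1.fastq.gz")]) with your own paths (the ones shown here are placeholders) returns a dictionary that flags any copy that is missing or differs from the original.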

Archiving is often the responsibility of your host institution; contact them for more details.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) data and open science, datasets should be made available to the public, for example by submitting them to a public repository.

Choosing a repository

It is recommended to choose a domain-specific repository when possible. It is also important to consider the sustainability of the repository, to ensure that the data will remain public. Please see SciLifeLab’s RDM Guidelines about Sharing or the EBI archive wizard for suggestions depending on data type. You can also refer to the ELIXIR Deposition Databases and Scientific Data’s Recommended Data Repositories to find suitable repositories.

For datasets that do not fit into domain-specific repositories, you can use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Preparing for submission

Describing and organizing your data

Metadata should be provided to help others discover, identify and interpret the data. Researchers are strongly encouraged to use community metadata standards and ontologies where these are in place; consult e.g. FAIRsharing.org. Data repositories may also provide guidance about metadata standards and requirements. Capture any additional documentation needed to enable reuse of the data in Readme text files and in Data Dictionaries that describe what the variable names and values in your data mean. Identifiers that refer to e.g. ontology terms can be designed for computers or for people; in a FAIR data context it is recommended to supply both a human-readable label and a machine-resolvable Persistent Identifier (PID) for each concept used in the data.
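
To illustrate what a Data Dictionary might look like, here is a minimal sketch in Python. The column names, descriptions and helper functions are hypothetical and only serve as an example; the layout is one possible convention, not a required standard. Each entry pairs a human-readable label with a machine-resolvable PID where an ontology term applies, here the NCBI Taxonomy term for Homo sapiens.

```python
import csv
import io

FIELDS = ["variable", "description", "unit", "term_label", "term_pid"]

# Hypothetical data dictionary: one entry per column in the dataset.
DATA_DICTIONARY = [
    {"variable": "sample_id", "description": "Unique sample identifier",
     "unit": "", "term_label": "", "term_pid": ""},
    {"variable": "species", "description": "Species the sample was taken from",
     "unit": "", "term_label": "Homo sapiens",
     "term_pid": "http://purl.obolibrary.org/obo/NCBITaxon_9606"},
    {"variable": "weight_g", "description": "Sample weight at collection",
     "unit": "g", "term_label": "", "term_pid": ""},
]


def undocumented_columns(data_header: list[str], dictionary: list[dict]) -> list[str]:
    """Return the data columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in dictionary}
    return [column for column in data_header if column not in documented]


def dictionary_as_csv(dictionary: list[dict]) -> str:
    """Serialize the data dictionary as CSV text, ready to save next to the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(dictionary)
    return buffer.getvalue()
```

Running undocumented_columns on the header row of a data file before submission gives a quick check that every variable is described; the CSV text from dictionary_as_csv can be stored alongside the Readme file.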

Choosing a licence

To ensure reusability, data should be released with a clear and accessible data usage licence. We suggest making your data available under licences that permit free reuse of data, e.g. a Creative Commons licence such as CC0 or CC-BY. The EUDAT licence selector wizard can help you select suitable licences for your data. Note that sequence data submitted to ENA (or GenBank) are implicitly free to reuse by others, as specified in the INSDC Standards and policies.

Further questions

If you have further questions regarding data management, you are welcome to contact the NBIS data management team (data@nbis.se).