Computer and Network Resource Group

Best Practices for Data Management in Biological Research

This page provides guidelines for effective and efficient management of research data in biological studies. Proper data management ensures data integrity, accessibility, reproducibility, and compliance with funding and publication requirements. 

General Best Practices


  • Naming Conventions: Use clear, descriptive, and consistent filenames. Avoid spaces and special characters. Use underscores (_) or dashes (-).
  • Script Organization: Number or prefix scripts to indicate the order of execution (e.g., 01_cleaning, 02_analysis).
  • Versioning: Track versions of cleaned data and scripts. Use version control systems like Git.
  • Documentation: Add comments to scripts and include a README file in each folder.
  • Reproducibility: Share software environments using tools like Docker or Conda.
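The naming and script-ordering conventions above can be checked automatically. The following is a minimal Python sketch, not an official CNRG tool; the function names `sanitize_filename` and `execution_order` are hypothetical and chosen only for illustration.

```python
import re
import unicodedata


def sanitize_filename(name: str) -> str:
    """Replace spaces and special characters in a filename with underscores."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Strip accents so that "é" becomes "e" instead of being replaced.
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of characters other than letters, digits, "_" or "-".
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return f"{stem}.{ext.lower()}" if ext else stem


def execution_order(script_names):
    """Sort scripts by numeric prefix; names without a prefix go last."""
    def prefix(name):
        match = re.match(r"(\d+)", name)
        return int(match.group(1)) if match else float("inf")
    return sorted(script_names, key=prefix)
```

For example, `sanitize_filename("Sample KO (rep #2).CSV")` yields `Sample_KO_rep_2.csv`. Sorting on the parsed number rather than on the raw string keeps `10_plots.py` after `02_analysis.py` even if someone forgets the leading zero.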

Data Management Plan (DMP)


Before starting a study, create a Data Management Plan (DMP), a document that outlines how data will be handled during and after the project. Major funding agencies (NIH, NSF, DOE, etc.) require a DMP as part of grant proposals. CNRG offers support for writing DMPs; contact data@igb.illinois.edu for more information.

Key Components of a DMP

  • Data Types: Describe the data you will collect, generate, or re-use.
  • Metadata: Outline standards for documentation and annotation.
  • Storage: Specify where and how the data will be stored and backed up.
  • Access and Sharing: Explain how data will be shared (repositories, embargoes, restrictions).
  • Retention and Preservation: Define how long data will be retained and where it will be archived.
  • Responsibility: Assign roles for data management tasks.

Example Tool: DMPTool helps create DMPs that comply with funding agency requirements. NIH has also provided sample Data Management and Sharing (DMS) plans to download.

Data Collection


Standard Operating Procedures (SOPs)

Develop SOPs to ensure consistency during data collection and processing.

Clearly define protocols for sample collection, labeling, storage, and analysis.

Include steps for quality control to minimize errors.

Data Formats

Use standardized, non-proprietary file formats for better accessibility and preservation.

Data Type             Preferred Formats
Text and Logs         .csv, .txt
Images                .tiff, .png
Microscope Files      .tiff, .ome
Sequencing Data       .fastq, .bam, .vcf
Statistical Analysis  .R, .py, .do
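One way to apply the table above is to flag files whose extension is not on the preferred list. This is a minimal sketch: the extensions come from the table, while the dictionary keys and the function name `non_preferred_files` are illustrative assumptions.

```python
from pathlib import Path

# Extensions copied from the table above; the grouping keys are illustrative.
PREFERRED_FORMATS = {
    "text_and_logs": {".csv", ".txt"},
    "images": {".tiff", ".png"},
    "microscope_files": {".tiff", ".ome"},
    "sequencing_data": {".fastq", ".bam", ".vcf"},
    "statistical_analysis": {".r", ".py", ".do"},
}


def non_preferred_files(filenames):
    """Return the filenames whose final extension is not in the preferred list."""
    allowed = set().union(*PREFERRED_FORMATS.values())
    # Compare in lowercase so that "figure.PNG" and "model.R" are accepted.
    return [name for name in filenames if Path(name).suffix.lower() not in allowed]
```

Only the final extension is inspected, so a compressed file such as `reads.fastq.gz` would be flagged; a project that stores compressed data would need to extend the check.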

 

Data Documentation and Metadata


Metadata 

Metadata provides essential information to understand and use the data. Adopt community-specific metadata standards, such as:

  • MIAME (for microarray experiments)
  • MINSEQE (for sequencing data)
  • ISA-Tab (generalized standard for life sciences) 

Documentation 

  • Use clear and descriptive filenames (avoid generic names like data1.xlsx).
    • Example filename: ProjectA_SampleKO_MP_Timepoint1_2024-07-12.csv
  • Include:
    • Experiment details (e.g., study design, replicates, conditions).
    • File contents (variable names, units, abbreviations).
    • Software used (version and settings). 
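The documentation items above can be captured in a small machine-readable record stored next to each data file. The sketch below assumes a JSON layout and two hypothetical helpers, `build_filename` and `describe_dataset`; neither is a community standard, and the field names are illustrative.

```python
import json
from datetime import date


def build_filename(parts, collected, extension="csv"):
    """Join descriptive parts with underscores and append an ISO 8601 date."""
    return "_".join(parts) + f"_{collected.isoformat()}.{extension}"


def describe_dataset(filename, experiment, variables, software):
    """Return a JSON string documenting one data file."""
    record = {
        "file": filename,
        "experiment": experiment,  # study design, replicates, conditions
        "variables": variables,    # variable name -> unit or description
        "software": software,      # tool name -> version and settings
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

Calling `build_filename(["ProjectA", "SampleKO", "MP", "Timepoint1"], date(2024, 7, 12))` reproduces the example filename above. Writing the date in ISO 8601 form (YYYY-MM-DD) makes files sort chronologically by name.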

Data Storage and Backup


Data Storage 

  • Use centralized storage systems (e.g., institutional servers, cloud storage).
  • Avoid storing critical data on local machines or USB drives. 
Data Storage Resources

IGB resources:

Campus resources:

Backup Strategy

Adopt the 3-2-1 rule:

  • 3 copies of your data.
  • 2 different storage media (e.g., server, external hard drive).
  • 1 copy stored off-site (e.g., cloud-based backups). 

Automate backups to ensure regular updates. 
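An automated backup is only useful if the copies are intact. A common approach is to compare checksums of the original and each copy; the sketch below uses SHA-256 from the Python standard library. The function names are hypothetical, and the sketch verifies existing copies rather than creating them.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copies):
    """Map each copy path to True if its checksum equals the original's."""
    expected = sha256_of(original)
    return {str(copy): sha256_of(copy) == expected for copy in copies}
```

Reading in chunks keeps memory use constant, which matters for large sequencing or imaging files. Any copy reported as `False` should be replaced from a known-good copy.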

Data Security 

  • Control access to sensitive data (e.g., human/animal data).
  • Use encryption for data transfers and secure access protocols. 

Data Analysis and Processing


Version Control

Use version control systems (e.g., Git) to track changes to scripts, analyses, and data files. 

Code and Workflow Management

  • Document all steps of analysis workflows.
  • Use tools like Jupyter Notebooks (Python) or R Markdown for reproducibility. 

Quality Assurance

  • Validate raw and processed data for consistency.
  • Record decisions made during data cleaning and preprocessing. 
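As an illustration of validating data for consistency, the sketch below checks a CSV table for missing columns, empty required values, and non-numeric entries. The function name `validate_rows` and the column names in the usage note are assumptions made for this example; real projects will need checks specific to their data.

```python
import csv
import io


def validate_rows(csv_text, required_columns, numeric_columns):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in required_columns:
            if not (row[column] or "").strip():
                problems.append(f"line {line_number}: empty value in '{column}'")
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            if value:  # empty values are reported by the required-column check
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"line {line_number}: '{column}' is not numeric ({value!r})"
                    )
    return problems
```

For the text `"sample_id,od600\nS1,0.42\nS2,\nS3,abc\n"` with both columns required and `od600` numeric, the function reports an empty value on line 3 and a non-numeric value on line 4. Saving the returned list alongside the cleaned data is one simple way to record why rows were corrected or dropped.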

Data Sharing and Publication


Choosing a Repository

Deposit data in discipline-specific or general repositories:

Repository                     Data Type
NCBI SRA                       Genomic sequencing data
Dryad                          General research data
FigShare                       General data and supplementary files
Gene Expression Omnibus (GEO)  Transcriptomics, arrays
Zenodo                         Any research data (with DOI)

FAIR Principles 

Ensure your data is: 

  • Findable: Use DOIs, rich metadata, and search keywords.
  • Accessible: Share openly unless ethical or privacy concerns exist.
  • Interoperable: Use standardized file formats and metadata.
  • Reusable: Provide sufficient context for re-analysis (clear documentation).

Citing Data

Use persistent identifiers (e.g., DOIs) for datasets and cite them in publications. 

Example: Author(s), Year, Title, Repository, DOI
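The citation pattern above can be generated consistently from structured fields. This is a minimal sketch; the punctuation is one possible style, not a required one, and the function name `format_data_citation` is hypothetical. Journals and repositories often prescribe their own format.

```python
def format_data_citation(authors, year, title, repository, doi):
    """Build a citation in the order: authors, year, title, repository, DOI."""
    author_text = "; ".join(authors)
    # Writing the DOI as a resolvable https://doi.org/ link is a widely used convention.
    return f"{author_text} ({year}). {title}. {repository}. https://doi.org/{doi}"
```

With placeholder values, `format_data_citation(["Doe, J.", "Roe, R."], 2024, "Example dataset", "Zenodo", "10.1234/example")` returns `Doe, J.; Roe, R. (2024). Example dataset. Zenodo. https://doi.org/10.1234/example`. The names and DOI here are invented for illustration and do not refer to a real dataset.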

Long-Term Data Preservation 

  • Follow institutional and funding agency policies for data retention (often 5–10 years).
  • Archive data in trusted repositories for long-term preservation. 

Compliance and Ethics


Ethical Guidelines 

  • Obtain appropriate permissions for human or animal data (e.g., IRB approval).
  • Follow ethical guidelines such as the Declaration of Helsinki for research involving human subjects. 

Data Privacy 

Anonymize personal or sensitive data before sharing to comply with privacy laws (e.g., GDPR, HIPAA). 
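One building block is to drop direct identifiers and replace participant IDs with keyed hashes. The sketch below uses HMAC-SHA256 from the Python standard library; the function and field names are hypothetical. Note that this is pseudonymization rather than full anonymization: anyone holding the secret key can re-link records, and indirect identifiers (dates, locations, rare traits) may still allow re-identification, so it does not by itself establish GDPR or HIPAA compliance.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Return a stable keyed hash of an identifier, truncated to `length` hex characters."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def strip_direct_identifiers(record, id_field, drop_fields, secret_key):
    """Remove the listed fields and replace the ID field with its pseudonym."""
    cleaned = {key: value for key, value in record.items() if key not in drop_fields}
    cleaned[id_field] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned
```

A keyed hash is used instead of a plain hash so that short or guessable IDs cannot be recovered by hashing every candidate value. The key should be stored separately from the shared data and never published.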

Training and Roles 

Ensure all team members are trained in data management practices. Assign roles for: 

  • Data collection
  • Metadata annotation
  • Storage and backup management
  • Analysis and reproducibility testing 

Checklist for Researchers


The following example checklist summarizes these data management best practices:

  • Create a Data Management Plan
  • Use standardized file formats
  • Document and annotate metadata
  • Implement a backup strategy
  • Use version control for analysis code
  • Share data in appropriate repositories
  • Ensure compliance with ethical policies

By following these best practices, biological researchers can ensure their data remains high-quality, reproducible, and reusable, fostering scientific progress and collaboration.