Data sharing¶

Why should I share my data?¶

There are many reasons to share one’s data:

Collaboration and connection. Sharing data allows one to become a visible member of the open science community, and can lead to new collaborations and recognition of one’s contributions by the broader community.
Improving reproducibility. Without access to the raw data, it is impossible to fully reproduce the results from a published study. Data sharing allows researchers to confirm published findings and test their generalizability (for example, using different analysis methods). In addition, by allowing researchers to combine shared datasets, data sharing improves the statistical power of research, which is directly related to its reproducibility.
Receiving credit for data generation. It is increasingly common for the generation and sharing of high-value datasets to be viewed as an important scientific contribution in its own right. This is particularly evident in the advent of “data papers”, which are publications that describe a shared dataset and can provide citation credit for shared data.
Responsibility to research funders. Most research is funded by taxpayers or private foundations, who expect researchers to maximize the potential benefit of that investment. When researchers fail to share data effectively, they are limiting the potential impact of the funding, by preventing others from using those data to generate additional knowledge or test new hypotheses. In addition, some funding agencies (such as the National Institute of Mental Health and Wellcome Trust) require the sharing of data from research that they fund.
Improving power through data aggregation. There are many cases in which an individual researcher cannot feasibly obtain enough data to robustly test a particular scientific hypothesis; this is particularly the case for rare diseases, for high-energy physics, and for effects such as genetic associations that require very large samples. In such cases, sharing data across sites allows a larger consortium of researchers to combine data and address particular scientific questions more robustly.
[Human participant research] Responsibility to research participants. One of the fundamental principles of human subjects research outlined in the Belmont Report is the principle of beneficence: “maximize possible benefits and minimize possible harms”. Because failing to share data (while minimizing risks to the participant) will necessarily reduce the possible benefits of the data, there is an argument that researchers are ethically obligated to share data unless the risks to confidentiality and privacy cannot be sufficiently reduced.

What are “metadata” and why are they important?¶

Metadata are information that describes the content or structure of a dataset, or related information that is needed to interpret the data properly. This can include information about the provenance of the data (e.g. who created the data, when they were created, and how they were created), the structure of the data (for example, the units in which a particular variable is specified), and other annotations (for example, ratings of the quality of the data). Metadata are important because they allow us to interpret the data properly, and because they help others find the dataset, as outlined by the FAIR principles. Metadata should therefore always be included along with the data.
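As a rough illustration of the three kinds of metadata described above, the sketch below stores provenance, variable descriptions (including units), and annotations in a single dictionary that can be saved as a JSON file next to the data. The field names, the example values, and the helper `undocumented_columns` are all assumptions made for this example, not part of any particular standard.

```python
import json

# Hypothetical metadata for an illustrative reaction-time dataset.
# All keys and values are placeholders chosen for this sketch.
metadata = {
    "provenance": {
        "created_by": "Example Lab",
        "created_on": "2024-01-15",
        "method": "Computerized Stroop task",
    },
    "variables": {
        "reaction_time": {
            "description": "Time from stimulus onset to response",
            "units": "ms",
        },
        "accuracy": {
            "description": "1 if the response was correct, else 0",
            "units": "n/a",
        },
    },
    "annotations": {"quality_rating": "good"},
}


def undocumented_columns(columns, metadata):
    """Return the data columns that have no entry under 'variables'."""
    documented = metadata.get("variables", {})
    return [column for column in columns if column not in documented]


# Serialize the metadata so it can be stored alongside the data file.
sidecar_text = json.dumps(metadata, indent=2)

# Check that every column in the (hypothetical) data file is described.
print(undocumented_columns(["reaction_time", "accuracy", "session"], metadata))
```

A check like this can be run before upload to catch variables that were added to the data but never described.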

The FAIR Principles for open data¶

The FAIR principles describe a set of features that shared data should have in order to be maximally useful. Shared data should be:

Findable: discoverable with metadata, identifiable and locatable by means of a standard identification mechanism
Accessible: always available and obtainable; even if the data is restricted, the metadata is open
Interoperable: both parseable and understandable, allowing data exchange and reuse between researchers, institutions, organisations or countries
Reusable: sufficiently described and shared with the least restrictive licences, allowing the widest reuse possible and the least cumbersome integration with other data sources.

See the FAIR entries in the Resources section below for more on how to make one’s data FAIR.

Frequently Asked Questions¶

How should my files be named?¶

NEVER use any identifying information in file names. See section on deidentification below for more on the particular details that should be excluded.
Data should be named consistently across a project
File names should be as concise as possible, while avoiding any potential naming collisions
- Additional metadata can be stored separately, in addition to source files
Consider using a key-value scheme for file names, like that used in the BIDS project
- Each key and value are connected by a dash
- Key-value pairs are separated by underscores
  - E.g. study-1_sub-005_task-stroop_data.tsv would contain data for subject 5 in study 1 on the stroop task.
  - This allows the file name to be automatically parsed
Always zero-pad numerical values
- This ensures that the alphabetical and numerical listing orders are identical
- Pad to one order of magnitude more than you expect
  - E.g. if you expect to collect data from 125 subjects or samples, then use four digits (e.g. sub-0125).
Avoid using spaces in file names
- These can make parsing of file names more difficult on some systems.
Stick with lower-case letters
- Computer systems differ in whether they are case-sensitive or case-insensitive (even within the same operating system; for example, there are both case-sensitive and case-insensitive versions of the Mac OS file system).
- Using upper case letters can cause confusion, e.g. whereby one system would treat “Data” and “data” as the same while another would not.
- For these reasons, snake case (e.g. “my_large_data_file”) is preferable to camel case (“myLargeDataFile”).
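The naming guidelines above can be sketched in a few lines of Python. This is a minimal illustration rather than the BIDS implementation: the function names `build_filename`, `parse_filename`, and `zero_pad` are made up for this example, and the validation rule (lower-case alphanumeric keys and values only) is an assumption that follows from reserving dashes and underscores as separators.

```python
def zero_pad(number: int, expected_max: int) -> str:
    """Pad to one more digit than the largest number you expect."""
    width = len(str(expected_max)) + 1
    return str(number).zfill(width)


def build_filename(pairs: dict, suffix: str, extension: str) -> str:
    """Join key-value pairs: dash within a pair, underscore between pairs."""
    for key, value in pairs.items():
        for part in (key, value):
            # Dashes and underscores are reserved as separators; spaces and
            # upper-case letters are rejected, following the guidelines above.
            if not part.isalnum() or part != part.lower():
                raise ValueError(f"invalid key or value: {part!r}")
    stem = "_".join(f"{key}-{value}" for key, value in pairs.items())
    return f"{stem}_{suffix}.{extension}"


def parse_filename(filename: str) -> dict:
    """Recover the key-value pairs (plus the trailing suffix) from a name."""
    stem = filename.split(".", 1)[0]
    *pair_chunks, suffix = stem.split("_")
    pairs = dict(chunk.split("-", 1) for chunk in pair_chunks)
    pairs["suffix"] = suffix
    return pairs


# Expecting about 125 subjects, so subject numbers are padded to four digits.
name = build_filename(
    {"study": "1", "sub": zero_pad(5, 125), "task": "stroop"}, "data", "tsv"
)
print(name)
print(parse_filename(name))
```

Because the separators never appear inside keys or values, the same name can be built and parsed without any lookup table.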

[Human participant research] What does “deidentification” mean and how can I do it?¶

Deidentification refers to the removal of any information by which the identity of a human subject could potentially be recovered from the data.
The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule specifies that a dataset can be considered “deidentified” (and thus no longer treated as Protected Health Information) if it meets the following criteria:
- The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
  1. Names
  2. All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
    - The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
    - The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
  3. All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
  4. Telephone numbers
  5. Fax numbers
  6. Email addresses
  7. Social security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license plate numbers
  13. Device identifiers and serial numbers
  14. Web Universal Resource Locators (URLs)
  15. Internet Protocol (IP) addresses
  16. Biometric identifiers, including finger and voice prints
  17. Full-face photographs and any comparable images
  18. Any other unique identifying number, characteristic, or code
- The researcher does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
Some datasets may be identifiable even if they do not contain any of the 18 identifiers. For example, a dataset that contains only families with 4 or more children within a particular age range from a particular state could allow those individuals to be re-identified. In this case, one would need additional protections for data sharing.
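To make a few of the rules above concrete, the following sketch applies three of them to a single record: dropping direct identifiers, truncating ZIP codes to three digits, and aggregating ages of 90 or older. It is only an illustration under stated assumptions and not a compliance tool: the field names are hypothetical, the set of low-population ZIP prefixes is a placeholder that would have to be built from current Census data, and most of the 18 identifier types are not handled at all.

```python
from datetime import date

# Placeholder values only: the real set of three-digit prefixes covering
# 20,000 or fewer people must be derived from current Census data.
LOW_POPULATION_ZIP3 = {"036", "059"}

# Hypothetical field names for direct identifiers that are simply removed.
DROP_FIELDS = {"name", "phone", "email"}


def generalize_zip(zip_code: str, low_population=LOW_POPULATION_ZIP3) -> str:
    """Keep the first three digits, or '000' for low-population areas."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in low_population else prefix


def generalize_age(age: int) -> str:
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age >= 90 else str(age)


def deidentify(record: dict) -> dict:
    """Return a copy of the record with the three rules above applied."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field == "zip":
            cleaned["zip3"] = generalize_zip(value)
        elif field == "age":
            cleaned["age"] = generalize_age(value)
        elif field == "admission_date":
            # Only the year of a date directly related to the individual is kept.
            cleaned["admission_year"] = value.year
        else:
            cleaned[field] = value
    return cleaned


record = {
    "name": "Jane Doe",
    "email": "jane@example.org",
    "zip": "94305",
    "age": 93,
    "admission_date": date(2020, 3, 14),
    "score": 17,
}
print(deidentify(record))
```

Even after such processing, the combination of the remaining variables still needs to be reviewed for re-identification risk, as noted above.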

How can I maximize the impact of my shared dataset?¶

Publish a data descriptor (see Step 6 above)
Provide a top level description of the dataset, covering information such as:
- Subject that the data were acquired from (e.g. humans, particles)
- Number of participants / samples
- Data acquisition method (e.g. MRI, camera, surveys)
- Participant / sample summary information
Provide a README file
- High level description of dataset tree
- Brief description of the data
Provide contact information
- Allows data users to ask questions and report discrepancies
Structure your metadata so that it can be indexed by Google Dataset Search
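Google Dataset Search reads schema.org `Dataset` markup embedded in a dataset's landing page, typically as JSON-LD inside a `<script type="application/ld+json">` element. The sketch below builds such a description in Python. The helper name `dataset_jsonld` and every value passed to it are placeholders invented for this example; the property names come from the schema.org vocabulary, but which properties a given repository exposes is something to check with that repository.

```python
import json


def dataset_jsonld(name, description, url, license_url, creators, keywords):
    """Build a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "creator": [{"@type": "Person", "name": person} for person in creators],
        "keywords": keywords,
    }


# All values below are hypothetical placeholders.
description = dataset_jsonld(
    name="Example Stroop task dataset",
    description="Reaction times and accuracy from a computerized Stroop task.",
    url="https://example.org/datasets/stroop",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    creators=["A. Researcher"],
    keywords=["stroop", "reaction time"],
)

# The resulting text is what would be placed inside the script element.
print(json.dumps(description, indent=2))
```

Many repositories generate this markup automatically from the metadata supplied at upload, in which case filling in those fields completely is the main task.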

What if I discover an error in my shared dataset?¶

The need to fix errors highlights the importance of using a repository that allows versioning of the data.
Some repositories allow one to simply upload a revised version of the data
- Consider adding a note to the README file describing the error and how it was resolved.
- Alternatively, include a CHANGES file that details all changes made to the data for each version.
For repositories that involve manual submission, immediately reach out to the repository hosting the data, describing what the error is and how you would like to proceed with remedying it.
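One way to keep a CHANGES file consistent is to generate each entry from the same small template. The sketch below is an assumption about layout rather than a required format: each entry has a version number and date on the first line, followed by indented bullet points, with the newest entry placed first. The function names and the example entries are hypothetical.

```python
from datetime import date


def format_change_entry(version: str, changes: list, when: date) -> str:
    """Format one entry: 'version date' followed by indented bullets."""
    lines = [f"{version} {when.isoformat()}"]
    lines += [f"  - {change}" for change in changes]
    return "\n".join(lines) + "\n"


def prepend_entry(existing: str, entry: str) -> str:
    """Place the newest entry at the top of the existing CHANGES text."""
    return entry + ("\n" + existing if existing else "")


# Hypothetical history: an initial release followed by a correction.
old_text = format_change_entry("1.0.0", ["Initial release."], date(2023, 12, 1))
new_entry = format_change_entry(
    "1.0.1",
    ["Corrected reaction-time units for one subject (s to ms)."],
    date(2024, 1, 15),
)
print(prepend_entry(old_text, new_entry))
```

Stating both what was wrong and what was changed in each bullet lets users of an earlier version judge whether their analyses are affected.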

Resources¶

General¶

http://opendatahandbook.org/guide/en/

FAIR¶

Intro: https://www.openaire.eu/how-to-make-your-data-fair
Paper: https://www.nature.com/articles/sdata201618
FAIR principles: https://www.force11.org/group/fairgroup/fairprinciples

Data organization¶

Metadata¶

Center for Expanded Data Annotation and Retrieval

Standards¶

Psychology and Brain sciences¶

Psych-DS is an emerging standard for behavioral datasets
Brain Imaging Data Structure is a widely adopted community standard for neuroimaging data (including MRI, EEG, and other modalities)
- BIDS can also be used for some simpler behavioral datasets
Neurodata Without Borders is an emerging standard for electrophysiology data
Examples
- BIDS Dataset
- OpenNeuro - datasets are BIDS validated prior to upload
- Neurodata Without Borders tutorial

Open By Design at Stanford
