diff --git a/mkdocs.yml b/mkdocs.yml index c81a99bc26..ef8b344d4d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -48,6 +48,7 @@ nav: - Quantitative MRI: appendices/qmri.md - Arterial Spin Labeling: appendices/arterial-spin-labeling.md - Cross modality correspondence: appendices/cross-modality-correspondence.md + - Phenotypic data guidelines: appendices/phenotype.md - Changelog: CHANGES.md - The BIDS Website: - Getting started: https://bids.neuroimaging.io/getting_started/ diff --git a/src/appendices/phenotype.md b/src/appendices/phenotype.md new file mode 100644 index 0000000000..a6edfdfc70 --- /dev/null +++ b/src/appendices/phenotype.md @@ -0,0 +1,412 @@ +# Tabular phenotypic data guidelines + +This appendix is a collection of guidelines and examples +for curating well-organized tabular phenotypic data. + +## Guidelines + +These guidelines are intended to improve the organization and clarity of +tabular phenotypic data like the participants file, sessions file, +and phenotypic and assessment data. + +They are recommendations and are by default ignored during validation. +You can make them mandatory during validation by setting the +[`AdditionalValidation` key](../modality-agnostic-files/dataset-description.md#additional-validation) +to contain `"Phenotype"` in the `dataset_description.json`. + +### 1. Aggregate data across sessions using `"IndexColumns"` + +In multi-session datasets, +aggregate phenotypic and assessment data across all sessions +into one tabular tab-separated value (TSV) file per measurement tool. +In order to aggregate, you MUST use the `"IndexColumns"` list/array field +in the corresponding JavaScript Object Notation (JSON) sidecar file. +There are two examples of this usage [below in this appendix](#examples). +Store each of the TSV and JSON files in the `/phenotype` directory +using the file-naming template `/phenotype/tool-_phenotype.tsv`. 
+Read the [phenotypic and assessment data section](../modality-agnostic-files/phenotypic-and-assessment-data.md) +for further explanation of how to use `"IndexColumns"` +to aggregate longitudinal or multi-session tabular phenotypic data. + +### 2. Always pair tabular data with data dictionaries + +Tabular phenotypic data MUST be prepared as one pair of a tabular file +in TSV format and a corresponding data dictionary in JSON format. +See the [Tabular files section](../common-principles.md#tabular-files) for more information. + +### 3. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool + +Whenever possible, it is RECOMMENDED to add `MeasurementToolMetadata` to +each `phenotype/.json` data dictionary. +This improves reusability and provides clarity about the measurement tool. +See [`MeasurementToolMetadata` in the glossary](../glossary.md#measurementtoolmetadata-metadata) for more. + +### 4. Ensure minimal annotation for phenotypic and assessment data + +In multi-session phenotypic and assessment data, +each measurement tool SHOULD have an independent +aggregated data TSV file in which the user collects all subjects, sessions, +and/or runs of data as one entry per row +(with a row defined by the smallest unit of acquisition). +This also means the user MUST use the `"IndexColumns"` field +in each JSON sidecar for multi-session data. +Some common index columns are: +`participant_id`, `session_id`, `run_id`, and `acq_time`. + + +{{ MACROS___make_columns_table("modality_agnostic.Phenotypes") }} + +Furthermore, if you add a `session_id` index column to any tabular phenotypic data, +you MUST introduce a session directory to the imaging data, +even if only one imaging session has been created. +And vice versa, if imaging data has session directories, +all imaging data and tabular phenotypic data MUST have sessions. 
+ +This produces files in which same-participant entries can take up as many rows as needed +according to the smallest unit of acquisition. + +### 5. Use a demographics file for multi-session data + +If there is more than one session for any one participant, then +it is REQUIRED to provide a demographics file in the `/phenotype` directory +named as `/phenotype/tool-Demographics_phenotype.tsv` +using the `"IndexColumns"` JSON sidecar field. +It is RECOMMENDED to store the `age` column for multi-session datasets +in this demographics file to record participant age for every session +on its own row. + +### 6. Record acquisition time of all sessions with `acq_time` + +It is RECOMMENDED to store acquisition time[2](#footnotes) +for tabular phenotypic data and store the time of acquisition of each row +inside a column named `acq_time` in the demographics file. +This is consistent with how acquisition time is recorded for MRI data +and other time-sensitive measurements (for example systolic blood pressure). + +## Summary + +This appendix describes guidelines for curating well-organized tabular phenotypic data. +In summary, it is RECOMMENDED to always use the participants file +and separate files by measurement instrument in +the phenotypic and assessment data directory, +since they each collect different information. +If you have multi-session data, then follow the aggregation guidelines above. + +## Examples + +What follows are a few common use case examples for tabular phenotypic files. 
+ +### 1 participant session with both non-tabular and tabular phenotypic data + +File tree + + +{{ MACROS___make_filetree_example( + { + "phenotype": { + "tool-Measurements_phenotype.json": "", + "tool-Measurements_phenotype.tsv": "", + }, + "sub-01": { + "anat": { + "sub-01_T1w.json": "", + "sub-01_T1w.nii.gz": "", + } + } + } +) }} + +Contents of `phenotype/tool-Measurements_phenotype.tsv` + +```tsv +participant_id measurement_1 measurement_2 +sub-01 value1 value2 +``` + +### 1 participant with 2 sessions, where 1 session is only tabular phenotype and the other is only imaging + +With only one imaging and one phenotypic session each in this example you might want +to merge both imaging and phenotypic data under one session. But it is more correct to +have separate sessions for the imaging and phenotypic data, especially if +the sessions were collected days, weeks, or months apart. You can denote all of +`participant_id`, `session_id`, and `acq_time` in the `tool-Measurements_phenotype.tsv` file +and note `session_id` `Levels` in the `tool-Measurements_phenotype.json` sidecar. +Below are a CORRECT and an INCORRECT example of prepared data following these guidelines. 
+ +#### CORRECT + +File tree + + +{{ MACROS___make_filetree_example( + { + "phenotype": { + "tool-Measurements_phenotype.json": "", + "tool-Measurements_phenotype.tsv": "", + }, + "sub-01": { + "ses-MRI": { + "anat": { + "sub-01_ses-MRI_T1w.json": "", + "sub-01_ses-MRI_T1w.nii.gz": "", + } + } + } + } +) }} + +Contents of `phenotype/tool-Measurements_phenotype.tsv` + +```tsv +participant_id session_id acq_time measurement_1 measurement_2 +sub-01 ses-pheno 2001-01-01T12:05:00 value1 value2 +sub-01 ses-MRI 2001-03-01T13:14:00 n/a n/a +``` + +Contents of `phenotype/tool-Measurements_phenotype.json` + +```json +{ + "IndexColumns": [ + "participant_id", + "session_id" + ], + "participant_id": { + "Description": "Participant identifier" + }, + "session_id": { + "Description": "Session identifier", + "Levels": { + "ses-pheno": "Phenotype-only session", + "ses-MRI": "MRI-only session" + } + }, + "acq_time": { + "Description": "When the data acquisition started" + }, + "measurement_1": { + "Description": "A first measurement taken at a phenotypic session" + }, + "measurement_2": { + "Description": "A second measurement taken at a phenotypic session" + } +} +``` + +#### INCORRECT + +File tree + + +{{ MACROS___make_filetree_example( + { + "phenotype": { + "tool-Measurements_phenotype.json": "", + "tool-Measurements_phenotype.tsv": "", + }, + "sub-01": { + "anat": { + "sub-01_T1w.json": "", + "sub-01_T1w.nii.gz": "", + } + } + } +) }} + +Contents of `phenotype/tool-Measurements_phenotype.tsv` + +```tsv +participant_id measurement_1 measurement_2 +sub-01 value1 value2 +``` + +A session directory MUST be present in the participant's directory and +the `session_id` column MUST be present +in `phenotype/tool-Measurements_phenotype.tsv`, +and the `"IndexColumns"` of `participant_id` and `session_id` MUST be present +in `phenotype/tool-Measurements_phenotype.json`. +Sessions must be used consistently for the combination of tabular and +non-tabular phenotypic data. 
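As an illustration of the session-consistency rule in guideline 4, the following Python sketch flags a mismatch between a phenotype TSV file and the imaging directories. It is not part of the BIDS validator; the function name `check_session_consistency` and its in-memory inputs are assumptions made for this example.

```python
import csv
import io


def check_session_consistency(tsv_text, sidecar, imaging_sessions):
    """Return a list of problems; an empty list means the inputs look consistent.

    tsv_text: contents of a phenotype TSV file (tab-separated).
    sidecar: the parsed JSON data dictionary for that TSV file.
    imaging_sessions: hypothetical mapping of participant ID to the set of
        session directory names found for that participant (empty set if none).
    """
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = reader.fieldnames or []
    index_columns = sidecar.get("IndexColumns", [])

    tsv_has_sessions = "session_id" in columns
    imaging_has_sessions = any(imaging_sessions.values())

    # Sessions must be used in both places or in neither.
    if imaging_has_sessions and not tsv_has_sessions:
        problems.append("imaging data has session directories, TSV has no session_id column")
    if tsv_has_sessions and not imaging_has_sessions:
        problems.append("TSV has a session_id column, imaging data has no session directories")

    # Multi-session tables need both identifiers listed in "IndexColumns".
    if tsv_has_sessions:
        for name in ("participant_id", "session_id"):
            if name not in index_columns:
                problems.append(f'"IndexColumns" is missing {name}')
    return problems
```

The check only compares what is present in the files. It cannot tell whether a dataset without any sessions should have had them, as in the INCORRECT example above, because that depends on how the data were collected.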
+ +### 2 participants with a mix of tabular phenotypic data and imaging sessions + +In this example, both participants completed +a phenotypic measurement tool and an MRI scan during `ses-MRI1`. +`sub-01` has a `ses-MRI2` with no phenotypic measurement tool acquired +and `sub-02` has a `ses-pheno` where no MRI was acquired. + +File tree + + +{{ MACROS___make_filetree_example( + { + "phenotype": { + "tool-Measurements_phenotype.json": "", + "tool-Measurements_phenotype.tsv": "", + }, + "sub-01": { + "ses-MRI1": { + "anat": { + "sub-01_ses-MRI1_T1w.json": "", + "sub-01_ses-MRI1_T1w.nii.gz": "", + } + }, + "ses-MRI2": { + "anat": { + "sub-01_ses-MRI2_T1w.json": "", + "sub-01_ses-MRI2_T1w.nii.gz": "", + } + } + }, + "sub-02": { + "ses-MRI1": { + "anat": { + "sub-02_ses-MRI1_T1w.json": "", + "sub-02_ses-MRI1_T1w.nii.gz": "", + } + } + } + } +) }} + +Contents of `phenotype/tool-Measurements_phenotype.tsv` + +```tsv +participant_id session_id acq_time measurement_1 measurement_2 +sub-01 ses-MRI1 2001-01-01T11:12:00 value1 value2 +sub-01 ses-MRI2 2001-07-01T13:14:00 n/a n/a +sub-02 ses-MRI1 2001-01-18T15:16:00 value3 value4 +sub-02 ses-pheno 2001-02-20T12:05:00 value5 value6 +``` + +### 3 participants with 3 different kinds of sessions among them + +The `ses-baseline` session collects an MRI and tabular phenotypic data. + +File tree + + +{{ MACROS___make_filetree_example( + { + "participants.json": "", + "participants.tsv": "", + "phenotype": { + "tool-Demographics_phenotype.json": "", + "tool-Demographics_phenotype.tsv": "", + "tool-Survey_phenotype.json": "", + "tool-Survey_phenotype.tsv": "", + }, + "sub-01": { + "ses-baseline/": "", + "ses-followupMRI/": "", + }, + "sub-02": { + "ses-baseline/": "", + }, + "sub-03": { + "ses-baseline/": "", + "ses-followupMRI/": "", + } + } +) }} + +Contents of `participants.tsv`. +Unchanging participant properties belong here. 
+ +```tsv +participant_id sex +sub-01 M +sub-02 F +sub-03 F +``` + +Contents of `phenotype/tool-Demographics_phenotype.tsv`. +Participant properties that can change +from session to session belong here especially. + +```tsv +participant_id session_id acq_time age gender race household_income +sub-01 ses-baseline 2001-01-01T12:05:00 10 3 4 5 +sub-01 ses-followupMRI 2001-07-01T13:33:00 10 3 4 5 +sub-01 ses-interview 2002-01-01T11:21:00 11 4 4 6 +sub-02 ses-baseline 2001-04-01T11:01:00 9 1 3 3 +sub-02 ses-interview 2002-04-01T14:08:00 10 1 7 3 +sub-03 ses-baseline 2001-09-01T11:45:00 11 2 10 4 +sub-03 ses-followupMRI 2002-03-01T12:17:00 12 5 10 4 +``` + +Partial contents of `phenotype/tool-Demographics_phenotype.json`. +Note how the `session_id` `Levels` are clearly described +and how `"IndexColumns"` is present. + +```json +{ + "IndexColumns": [ + "participant_id", + "session_id" + ], + "participant_id": { + "Description": "Participant identifier" + }, + "session_id": { + "Description": "Session identifier", + "Levels": { + "ses-baseline": "Baseline visit for MRI and assessments", + "ses-followupMRI": "6-months after baseline MRI follow-up", + "ses-interview": "1-year after baseline in-person follow-up" + } + }, + "acq_time": { + "Description": "When the data acquisition started" + } +} +``` + +Contents of `phenotype/tool-Survey_phenotype.tsv`. +Note how `sub-03` does not have a row for `ses-interview` +because that session was not collected and is absent above +in the `phenotype/tool-Demographics_phenotype.tsv` file as well. + +```tsv +participant_id session_id question_1 question_2 question_3 +sub-01 ses-baseline A 2 no +sub-01 ses-interview A 3 yes +sub-02 ses-baseline A 2 no +sub-02 ses-interview B 1 unsure +sub-03 ses-baseline B 3 no +``` + +For more complete examples, see the `pheno00*` +[bids-examples on GitHub](https://github.com/bids-standard/bids-examples/). 
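To illustrate why `"IndexColumns"` matters, the following Python sketch joins two aggregated tables on `participant_id` and `session_id`. It is a minimal sketch rather than an official tool; the helper names `index_rows` and `join_on_index` are hypothetical, and it assumes the listed index columns uniquely identify each row.

```python
import csv
import io


def index_rows(tsv_text, index_columns):
    """Map each row's index-column values to the row, rejecting duplicates."""
    rows = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = tuple(row[name] for name in index_columns)
        if key in rows:
            raise ValueError(f"duplicate index {key}")
        rows[key] = row
    return rows


def join_on_index(left_text, right_text, index_columns):
    """Inner-join two TSV tables on their shared index columns."""
    left = index_rows(left_text, index_columns)
    right = index_rows(right_text, index_columns)
    # Keep only the keys present in both tables, in the left table's order.
    return [{**left[key], **right[key]} for key in left if key in right]


# Abbreviated versions of the demographics and survey tables shown above.
demographics = (
    "participant_id\tsession_id\tage\n"
    "sub-01\tses-baseline\t10\n"
    "sub-01\tses-followupMRI\t10\n"
    "sub-03\tses-baseline\t11\n"
)
survey = (
    "participant_id\tsession_id\tquestion_1\n"
    "sub-01\tses-baseline\tA\n"
    "sub-03\tses-baseline\tB\n"
)
joined = join_on_index(demographics, survey, ["participant_id", "session_id"])
```

Rows that exist in only one table, such as the `ses-followupMRI` demographics row here, are dropped by the inner join. A tool that needs to keep them could emit `n/a` for the missing columns instead.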
+ +## Footnotes + +1 A session is any logical grouping of imaging and behavioral data consistent +across participants. Session can (but doesn't have to) be synonymous to a visit +in a longitudinal study. In situations where different data types are obtained over +several visits (for example fMRI on one day followed by DWI the day after) +those can still be grouped in one session. Refer to the +[definition of session](../glossary.md#session-entities) for more details. + +2 Datetime format and the anonymization procedure are +described in [Units](../common-principles.md#units). diff --git a/src/common-principles.md b/src/common-principles.md index 403c7f702b..e421cfce60 100644 --- a/src/common-principles.md +++ b/src/common-principles.md @@ -510,7 +510,7 @@ NIfTI header. ### Tabular files -Tabular data MUST be saved as plain-text, tab-delimited values (TSV) files +Tabular data MUST be saved as plain-text, tab-separated values (TSV) files (with [extension `.tsv`](glossary.md#tsv-extensions)), that is, [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values) where commas are replaced by tab characters. Tabs MUST be true tab characters and MUST NOT be a series of space characters. @@ -561,7 +561,7 @@ onset duration response_time trial_type trial_extra are not part of the tabular data file's content. Tabular files MAY be optionally accompanied by a simple data dictionary -in the form of a JSON [object](https://www.json.org/json-en.html) +in the form of a [JSON object](https://www.json.org/json-en.html) within a JSON file. The JSON files containing the data dictionaries MUST have the same name as their corresponding tabular files but with `.json` extensions. diff --git a/src/longitudinal-and-multi-site-studies.md b/src/longitudinal-and-multi-site-studies.md index 5623bec305..5c79c6b0ae 100644 --- a/src/longitudinal-and-multi-site-studies.md +++ b/src/longitudinal-and-multi-site-studies.md @@ -109,3 +109,10 @@ underscores are not allowed in subject labels. 
In case of studies such as "Traveling Human Phantom" it is possible to incorporate site within session label. For example `sub-human1/ses-NUY`, `sub-human1/ses-MIT`, `sub-phantom1/ses-NUY`, `sub-phantom1/ses-MIT` and so on. + +## On tabular phenotypic data across sessions + +Read the [tabular phenotypic data guidelines appendix](appendices/phenotype.md) +or the [phenotypic and assessment data section](modality-agnostic-files/phenotypic-and-assessment-data.md) +for an explanation of how to use `"IndexColumns"` +to aggregate longitudinal or multi-session tabular phenotypic data. diff --git a/src/metaschema.json b/src/metaschema.json index 7d5b83ebf5..4a843788d4 100644 --- a/src/metaschema.json +++ b/src/metaschema.json @@ -326,8 +326,11 @@ "type": "object", "properties": { "issue": { - "allOf": [{ "$ref": "#/definitions/ruleTypes/issue" }], - "required": ["level"] + "allOf": [ + { "$ref": "#/definitions/ruleTypes/issue" }, + { "required": ["level"] } + ], + "unevaluatedProperties": false }, "selectors": { "$ref": "#/definitions/ruleTypes/expressionList" @@ -473,17 +476,18 @@ "type": "object", "patternProperties": { "^[a-zA-Z0-9_]+$": { - "type": "object", - "properties": { - "code": { "type": "string" }, - "message": { "type": "string" }, - "level": { "enum": ["error", "warning"] }, - "selectors": { - "$ref": "#/definitions/ruleTypes/expressionList" + "allOf": [ + { "$ref": "#/definitions/ruleTypes/issue" }, + { + "properties": { + "selectors": { + "$ref": "#/definitions/ruleTypes/expressionList" + } + }, + "required": ["level"] } - }, - "required": ["message", "level"], - "additionalProperties": false + ], + "unevaluatedProperties": false } }, "additionalProperties": false @@ -595,7 +599,10 @@ "level": { "$ref": "#/definitions/enums/requirement_level" }, "level_addendum": { "type": "string" }, "description_addendum": { "type": "string" }, - "issue": { "$ref": "#/definitions/ruleTypes/issue" } + "issue": { + "$ref": "#/definitions/ruleTypes/issue", + 
"unevaluatedProperties": false + } }, "required": ["level"], "additionalProperties": false @@ -606,11 +613,13 @@ "type": "object", "properties": { "code": { "type": "string" }, + "subCode": { "type": "string" }, + "level": { "enum": ["error", "warning"] }, + "location": { "type": "string" }, "message": { "type": "string" }, - "level": { "enum": ["error", "warning"] } + "suggestion": { "type": "string" } }, - "required": ["code", "message"], - "additionalProperties": false + "required": ["code", "message"] }, "expressionList": { "type": "array", diff --git a/src/modality-agnostic-files/data-summary-files.md b/src/modality-agnostic-files/data-summary-files.md index c615196b5c..cf4884b879 100644 --- a/src/modality-agnostic-files/data-summary-files.md +++ b/src/modality-agnostic-files/data-summary-files.md @@ -53,6 +53,9 @@ to date of birth. ```JSON { + "participant_id": { + "Description": "participant identifier" + }, "age": { "Description": "age of the participant", "Units": "year" @@ -81,6 +84,24 @@ to date of birth. } ``` +It is RECOMMENDED to use the `age` column to record participant age. +This reduces data duplication across tabular data files. The `Units` of `age` +do not have to be years so long as the units of the age +are written in `participants.json`. +Consider participant privacy or study objectives when selecting +the `Units` of `age` or the accuracy of `age` data. + +There are two methods to record `age` in longitudinal or multi-session data sets. +Choose what makes the most sense for your dataset's expected users. +The first method is to aggregate into a single file in the phenotypic and assessment data folder +using the `"IndexColumns"` field in the sidecar JSON. +Read the [tabular phenotypic data guidelines appendix](../appendices/phenotype.md) +or the [phenotypic and assessment data section](phenotypic-and-assessment-data.md) +for an explanation of how to use `"IndexColumns"` +to aggregate longitudinal or multi-session tabular phenotypic data. 
+The second method is to record `age` separately in each participant's sessions file. +Read the [sessions file section](#sessions-file) below for further explanation. + ## Samples file Template: @@ -201,19 +222,36 @@ meg/sub-control01_task-rest_split-02_meg.nii.gz 1877-06-15T12:15:27 Template: -```Text -sub-