Add workflow Metadata and Sequences from BioProjectIDs#1177
Add workflow Metadata and Sequences from BioProjectIDs#1177gdefazio wants to merge 19 commits intogalaxyproject:mainfrom
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Test Results (powered by Planemo)Test Summary
Failed Tests
Passed Tests
|
Test Results (powered by Planemo)Test Summary
Failed Tests
Passed Tests
|
There was a problem hiding this comment.
Pull request overview
Adds a new Galaxy workflow that downloads SRA metadata tables and FASTQ collections starting from a list of BioProject IDs, along with packaging/registry metadata and workflow tests.
Changes:
- Added the
metadata-and-sequences-from-BioProjectIDsGalaxy workflow usingpysradb search+fastq-dl. - Added Dockstore/WorkflowHub configuration plus README and changelog for the new workflow.
- Added workflow tests and associated test-data fixtures (BioProject ID lists, expected metadata TSVs, and FASTQ snippets).
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/metadata-and-sequences-from-BioProjectIDs.ga | New Galaxy workflow definition (inputs/steps/outputs). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/metadata-and-sequences-from-BioProjectIDs-tests.yml | New workflow test cases validating metadata + FASTQ outputs. |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/README.md | Workflow documentation (purpose, inputs, outputs). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/CHANGELOG.md | Initial changelog entry for the workflow. |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/.workflowhub.yml | WorkflowHub publishing configuration for the workflow. |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/.dockstore.yml | Dockstore descriptor pointing to the workflow and test definitions. |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test1_single_prj_pe.txt | Test input BioProject ID list (single project). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test1_metadata_file_split_file_000000.txt.tsv | Expected metadata TSV for test 1. |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test1_paired_end_collection_forward.fastq | Expected FASTQ snippet for test 1 (forward). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test1_paired_end_collection_reverse.fastq | Expected FASTQ snippet for test 1 (reverse). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_multiple_prj_mixed.txt | Test input BioProject ID list (multiple projects). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_metadata_file_split_file_000000.txt.tsv | Expected metadata TSV for test 2 (project 1). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_metadata_file_split_file_000001.txt.tsv | Expected metadata TSV for test 2 (project 2). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_SRR37273407_forward.fastq | Expected FASTQ snippet for test 2 (single-end/forward-only run). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_SRR37273408_forward.fastq | Expected FASTQ snippet for test 2 (paired-end forward). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_SRR37273408_reverse.fastq | Expected FASTQ snippet for test 2 (paired-end reverse). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_SRR37073390_forward.fastq | Expected FASTQ snippet for test 2 (paired-end forward). |
| workflows/data-fetching/metadata-and-sequences-from-BioProjectIDs/test-data/test2_SRR37073390_reverse.fastq | Expected FASTQ snippet for test 2 (paired-end reverse). |
| .idea/.gitignore | Adds JetBrains IDE ignore rules within a committed .idea/ directory. |
Files not reviewed (1)
- .idea/.gitignore: Language not supported
| primaryDescriptorPath: /metadata-and-sequences-from-BioProjectIDs.ga | ||
| testParameterFiles: | ||
| - /metadata-and-sequences-from-BioProjectIDs-tests.yml |
There was a problem hiding this comment.
The referenced workflow/test filenames in primaryDescriptorPath / testParameterFiles include uppercase letters and “BioProjectIDs” without a space. For new IWC workflow additions, filenames/folder names are expected to be lowercase with dashes and human-readable wording (e.g., ...-bioproject-ids...). Consider renaming the workflow and test files (and updating these paths) to match that convention.
| @@ -0,0 +1,517 @@ | |||
| { | |||
| "a_galaxy_workflow": "true", | |||
| "annotation": "This workflow takes BioProject IDs as input and is able to retrieve SRA tables and FASTQ files from them.", | |||
There was a problem hiding this comment.
The workflow annotation does not follow the repository’s required phrasing (it should start with “This workflow does/runs/performs …” and ideally mention the main outputs). Please reword this annotation to match the expected format.
| "annotation": "This workflow takes BioProject IDs as input and is able to retrieve SRA tables and FASTQ files from them.", | |
| "annotation": "This workflow performs retrieval of SRA metadata tables and FASTQ sequence files from input BioProject IDs.", |
| { | ||
| "description": "Enable it to display detailed metadata table", | ||
| "name": "--detailed (metadata download)" | ||
| } | ||
| ], | ||
| "label": "--detailed (metadata download)", | ||
| "name": "Input parameter", |
There was a problem hiding this comment.
This workflow input label/name (--detailed (metadata download)) is CLI-flag styled and not human-readable. Please rename it to a descriptive label and update the corresponding key in *-tests.yml accordingly.
| BioProject IDs: | ||
| class: File | ||
| path: test-data/test1_single_prj_pe.txt | ||
| filetype: tabular |
There was a problem hiding this comment.
Test 1 declares the BioProject ID list input as filetype: tabular, while Test 2 uses filetype: txt. Given this input is a plain list of IDs (no header), consider using txt consistently across tests (and align with the workflow input datatype).
| filetype: tabular | |
| filetype: txt |
| registries: | ||
| - url: https://workflowhub.eu | ||
| project: iwc | ||
| workflow: metadata-and-sequences-from-BioProjectsIDs/main |
There was a problem hiding this comment.
The WorkflowHub workflow: path has a typo/inconsistency (metadata-and-sequences-from-BioProjectsIDs/main): it doesn’t match this directory name (metadata-and-sequences-from-BioProjectIDs) and will likely break publication/registration. Please update it to the correct (and ideally lowercase) workflow slug.
| workflow: metadata-and-sequences-from-BioProjectsIDs/main | |
| workflow: metadata-and-sequences-from-bioprojectids/main |
| # Default ignored files | ||
| /shelf/ | ||
| /workspace.xml | ||
| # Editor-based HTTP Client requests | ||
| /httpRequests/ | ||
| # Datasource local storage ignored files | ||
| /dataSources/ | ||
| /dataSources.local.xml |
There was a problem hiding this comment.
The repository-level .gitignore does not currently ignore JetBrains .idea/, and this PR adds .idea/.gitignore. Project-specific IDE metadata generally shouldn’t be committed; instead, remove the .idea/ directory from the repo and add .idea/ to the top-level .gitignore if needed.
| # Default ignored files | |
| /shelf/ | |
| /workspace.xml | |
| # Editor-based HTTP Client requests | |
| /httpRequests/ | |
| # Datasource local storage ignored files | |
| /dataSources/ | |
| /dataSources.local.xml | |
| # Ignore all JetBrains IDE project files in this directory | |
| /* | |
| !.gitignore |
| "name": "Metadata and Sequences from BioProjectIDs", | ||
| "readme": "# Metadata and Sequences from BioProjectIDs\n\n## Rationale\nIn order to promote re-analysis of publicly available sequences from INSDC databases, we propose *Metadata and Sequences from BioProjectIDs* a Galaxy workflow that starting by a list of valid BioProject IDs (e.g. PRJNA....., PRJEB.....) is able to manage data and metadata download.\n\n## Usage\nUpload a text file in which there is a BioProject ID for each row and run the workflow.\n\nThere is also the possibility to set optional options to regulate behaviour of metadata and data download.\n", |
There was a problem hiding this comment.
The workflow name uses “BioProjectIDs” without a space, which is not human-readable and is inconsistent with the “BioProject IDs” wording used elsewhere. Consider renaming to use spaces (e.g., “BioProject IDs”) and ensure related filenames/folder name follow the same convention.
| { | ||
| "description": "Enable it to include assay type in output", | ||
| "name": "--assay (metadata download)" | ||
| } | ||
| ], | ||
| "label": "--assay (metadata download)", | ||
| "name": "Input parameter", |
There was a problem hiding this comment.
This workflow input label/name (--assay (metadata download)) is CLI-flag styled and not human-readable. Please rename it to a descriptive label (e.g., “Include assay type in metadata”) and update the corresponding key in *-tests.yml accordingly.
| "workflow_outputs": [ | ||
| { | ||
| "label": "metadata_file", | ||
| "output_name": "metadata_file", | ||
| "uuid": "936ed258-6b4a-412c-9c61-b475d5da1251" | ||
| } |
There was a problem hiding this comment.
The workflow output label metadata_file is not human-readable (underscores) and won’t render nicely in Galaxy or in test definitions. Please change workflow_outputs[].label to a human-readable phrase and then update the matching output name used in metadata-and-sequences-from-BioProjectIDs-tests.yml.
| "workflow_outputs": [ | ||
| { | ||
| "label": "paired_end_collection", | ||
| "output_name": "paired_end_collection", | ||
| "uuid": "d08a8412-887b-4800-a200-916745f7c65e" | ||
| }, | ||
| { | ||
| "label": "single_end_collection", | ||
| "output_name": "single_end_collection", | ||
| "uuid": "9deedcc0-92ff-4845-8cfc-a9a1e70d9375" | ||
| } |
There was a problem hiding this comment.
The workflow output labels paired_end_collection / single_end_collection are not human-readable (underscores). Please rename these workflow_outputs[].label values to human-readable phrases and update the corresponding output keys in metadata-and-sequences-from-BioProjectIDs-tests.yml to match exactly.
|
Thanks @gdefazio, this seems quite useful. I note that there is some overlap with https://iwc.galaxyproject.org/workflow/sra-manifest-to-concatenated-fastqs-main/ and https://iwc.galaxyproject.org/workflow/parallel-accession-download-main/. Could you use either of these workflows as a subworkflow to handle the downloads based on the manifest you're generating ? |
Hi @mvdbeek and thank you for your revision and comment. I'm new on galaxy wf implementation. For your question I need some days to better understand how your tips may be useful for this wf. Probably parallelization may be useful but please let me have an in deep look. |
|
Hi @mvdbeek, I tried to integrate |
Test Results (powered by Planemo)Test Summary
Failed Tests
|
Test Results (powered by Planemo)Test Summary
Failed Tests
|
Test Results (powered by Planemo)Test Summary
Failed Tests
Passed Tests
|
|
Hi @gdefazio , |
|
Hi @gdefazio, I think your workflow is a great idea but I have few problems with the current workflow:
I have noticed something that could be improved:
So I propose here a new version: https://usegalaxy.eu/u/delislel/w/metadata-and-sequences-from-bioproject-ids-final-ld What I did compared to your workflow:
I let you try it and tell me what you think. @mvdbeek is there a policy in case a workflow uses another workflow as subworkflow? Do they need to be in the same directory? Don't hesitate if you have questions or remarks, I would be happy to answer. |
|
Subworkflows are just embedded inside the parent. We're working on making those references too (galaxyproject/galaxy#21887) and use symlinks in the IWC but for now this is fine as is and will just work. |
|
Hi @lldelisle and thank you for the valuable work on this wf and for positive feedback.
After Marcus comment I tried to integrate "Parallel accession download" in this WF but I had some problems with the "apply rules" step because of issue with nested structure non matching what expected. Thank you for solving that.
I was not aware about that, thanks for solving it.
Do you suggest to have one fastq collection for each BioProject ID?
I tried it and I think is so much better than my version. Thank you.
Can I add you as WF author? Thanks again for your effort. |
|
Hi @lldelisle please give me some feedbacks to the previous message when you have time. Thanks in advance |
|
Hi, |
FOR CONTRIBUTOR:
FOR REVIEWERS:
This workflow does/runs/performs … xyz … to generate/analyze/etc …namefield should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id