Add RNA-seq Variant Discovery (RNAvar) workflow #1188
emregulben wants to merge 3 commits into galaxyproject:main
Conversation
Pull request overview
This PR adds a new Galaxy workflow under workflows/variant-calling/rna-variant-discovery/ implementing an RNA-seq variant discovery pipeline (STAR → GATK preprocessing/calling → filtering → SnpEff annotation) along with Dockstore metadata, documentation, and a Planemo test suite with bundled test data.
Changes:
- Added the `RNAvar.ga` Galaxy workflow for RNA-seq variant calling and annotation, with MultiQC reporting.
- Added a Planemo test definition (`RNAvar-tests.yml`) and corresponding small test datasets.
- Added workflow packaging/documentation files (`.dockstore.yml`, `README.md`, `CHANGELOG.md`).
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| workflows/variant-calling/rna-variant-discovery/RNAvar.ga | New Galaxy workflow definition for RNA-seq variant discovery and reporting |
| workflows/variant-calling/rna-variant-discovery/RNAvar-tests.yml | Planemo test job + output assertions for key workflow outputs |
| workflows/variant-calling/rna-variant-discovery/.dockstore.yml | Dockstore registration metadata pointing to the workflow and tests |
| workflows/variant-calling/rna-variant-discovery/README.md | High-level explanation of workflow steps and expected behavior |
| workflows/variant-calling/rna-variant-discovery/CHANGELOG.md | Initial release entry |
| workflows/variant-calling/rna-variant-discovery/test-data/test_rnaseq_1.fastq.gz | Test input R1 reads for Planemo validation |
| workflows/variant-calling/rna-variant-discovery/test-data/test_rnaseq_2.fastq.gz | Test input R2 reads for Planemo validation |
| workflows/variant-calling/rna-variant-discovery/test-data/genome.fasta | Small reference FASTA used by tests |
| workflows/variant-calling/rna-variant-discovery/test-data/genome.gff3 | Small annotation file used by tests |
| workflows/variant-calling/rna-variant-discovery/test-data/dbsnp_146.hg38.vcf | Known variants VCF used for BQSR / known-sites |
| workflows/variant-calling/rna-variant-discovery/test-data/mills_and_1000G.indels.vcf | Known indels VCF used for BQSR / known-sites |
```json
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"annotations\": null, \"chr\": null, \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"csvStats\": true, \"filter\": {\"specificEffects\": \"no\", \"__current_case__\": 0}, \"filterOut\": null, \"generate_gene_stats\": false, \"generate_stats\": true, \"input\": {\"__class__\": \"ConnectedValue\"}, \"inputFormat\": \"vcf\", \"intervals\": null, \"noLog\": true, \"outputFormat\": \"vcf\", \"snpDb\": {\"genomeSrc\": \"named\", \"__current_case__\": 2, \"genome_version\": \"hg38\"}, \"spliceRegion\": {\"setSpliceRegions\": \"no\", \"__current_case__\": 0}, \"spliceSiteSize\": \"2\", \"transcripts\": null, \"udLength\": \"0\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
```
SnpEff is configured with a hardcoded named genome database (hg38). Since the workflow also accepts a user-provided reference FASTA and annotation, this can silently produce incorrect annotations (or fail) when the inputs are not hg38. Consider either (a) building a SnpEff database from the provided reference+annotation (as other variant-calling workflows here do) or (b) exposing the SnpEff genome version as an explicit workflow parameter and documenting that the workflow is hg38-specific.
If I may add to the Copilot review, I think it would be good to describe in the README each of the expected inputs and the expected outputs.
3. **Recalibration:** **GATK4 BaseRecalibrator** (BQSR) using known polymorphic sites to adjust base quality scores.
4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases).
5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives.
6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database.
It makes limited sense to hard-code the SnpEff database and, thereby, restrict analysis to just one organism and genome version. The database choice should instead be turned into a workflow parameter.
I would populate the user choices from the "Locally installed SnpEff databases" instead of letting the user download databases on demand.
Please let us know if you need help with this.
# RNA-seq Variant Discovery (RNAvar)

This workflow is an implementation of a standard RNA-seq variant discovery pipeline. It handles the transition from raw reads to annotated variants using industry-standard tools.
Here, the README should clearly state that the aim is to provide a faithful reimplementation of https://nf-co.re/rnavar and that currently there are still limitations.
4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases).
5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives.
6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database.
After presenting the current workflow logic, it would be good to have a "Limitations" section, where you discuss which features of the nf-core pipeline are currently not yet implemented.
Things that I can immediately think of here are
- Sample Sheet support
- support for SnpEff alternatives for variant annotation
- optional UMI-Tools-based preprocessing
- no parallelization of the GATK4 SplitNCigarReads step (is this something that should be implemented at the tool level, similar to what's done in the freebayes wrapper?)
- base recalibration and variant filtering are currently non-optional when they are optional in the nf-core pipeline
Basically, compare https://nf-co.re/rnavar/1.2.3/docs/usage/ to what's implemented here.
Of course, if you still want to fix any of these limitations, you're welcome to do so.
6. **Annotation:** Functional annotation of variants using **SnpEff** against the hg38 database.

## Testing and Parity
The workflow is validated against a targeted subset of human RNA-seq data on chromosome 22. It successfully identifies the 54 high-confidence variants required to match the output of the original discovery engine.
Be more precise here and state the origin of that test data.
```json
"9": {
    "annotation": "",
    "content_id": "toolshed.g2.bx.psu.edu/repos/devteam/samtools_sort/samtools_sort/2.0.8",
    "errors": null,
    "id": 9,
    "input_connections": {
        "input1": {
            "id": 7,
            "output_name": "mapped_reads"
        }
    },
    "inputs": [],
    "label": null,
    "name": "Samtools sort",
    "outputs": [
        {
            "name": "output1",
            "type": "bam"
        }
    ],
    "position": {
        "left": 1275.194213718916,
        "top": 1154.8021613220824
    },
    "post_job_actions": {},
    "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/samtools_sort/samtools_sort/2.0.8",
    "tool_shed_repository": {
        "changeset_revision": "69dca890852c",
        "name": "samtools_sort",
        "owner": "devteam",
        "tool_shed": "toolshed.g2.bx.psu.edu"
    },
    "tool_state": "{\"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"input1\": {\"__class__\": \"ConnectedValue\"}, \"minhash\": false, \"prim_key_cond\": {\"prim_key_select\": \"\", \"__current_case__\": 0}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
    "tool_uuid": null,
    "tool_version": "2.0.8",
    "type": "tool",
    "uuid": "27f1d0d4-b06c-4e39-a76b-a705964f4a90",
    "when": null,
    "workflow_outputs": []
},
```
The RNAStar Galaxy wrapper should already generate coordinate-sorted BAM output, so is this step necessary?
Please check and/or clarify.
I verified STAR already handles the sorting. Removed the step: 5146320
```json
"10": {
    "annotation": "",
    "content_id": "toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_AddOrReplaceReadGroups/3.1.1.0",
    "errors": null,
    "id": 10,
    "input_connections": {
        "inputFile": {
            "id": 9,
            "output_name": "output1"
        }
    },
    "inputs": [],
    "label": null,
    "name": "AddOrReplaceReadGroups",
    "outputs": [
        {
            "name": "outFile",
            "type": "bam"
        }
    ],
    "position": {
        "left": 1550.8885393603414,
        "top": 1254.1892548481562
    },
    "post_job_actions": {},
    "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_AddOrReplaceReadGroups/3.1.1.0",
    "tool_shed_repository": {
        "changeset_revision": "3f254c5ced1d",
        "name": "picard",
        "owner": "devteam",
        "tool_shed": "toolshed.g2.bx.psu.edu"
    },
    "tool_state": "{\"CN\": null, \"DS\": null, \"DT\": null, \"PI\": null, \"PL\": \"ILLUMINA\", \"PU\": \"run\", \"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"inputFile\": {\"__class__\": \"ConnectedValue\"}, \"read_group_id_conditional\": {\"do_auto_name\": true, \"__current_case__\": 0}, \"read_group_lb_conditional\": {\"do_auto_name\": true, \"__current_case__\": 0}, \"read_group_sm_conditional\": {\"do_auto_name\": true, \"__current_case__\": 0}, \"validation_stringency\": \"LENIENT\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
    "tool_uuid": null,
    "tool_version": "3.1.1.0",
    "type": "tool",
    "uuid": "cf07f182-d8ab-4c43-8df3-9c263101cbe3",
    "when": null,
    "workflow_outputs": []
},
```
Try to understand if and how this step changes the data. Is there a way to use it in a meaningful way, either now, without having implemented sample sheet functionality, or maybe in the future?
This step is required for the GATK4 tools used later in the workflow. Tools like HaplotypeCaller and BaseRecalibrator are designed to look for 'Read Group' metadata in the BAM header. Without these tags, the tools throw an error and won't process the file (as noted in the GATK documentation on Read Groups).
By including this step now with default values, we ensure the workflow is robust and GATK-compliant. It also sets us up nicely for the future. If we implement sample sheet support later, this tool is exactly where that metadata would be injected into the pipeline.
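To make the explanation above concrete, here is a minimal, illustrative Python sketch (not part of the workflow) of the `@RG` header line that AddOrReplaceReadGroups writes and that GATK tools check for. Field tags follow the SAM specification; the function and default values are assumptions for illustration only.

```python
# Illustrative sketch: the read-group metadata GATK expects in the BAM header.
# Tag names (ID, SM, LB, PL, PU) come from the SAM specification; the helper
# names and defaults here are hypothetical, chosen to mirror the tool_state
# above (PL=ILLUMINA, PU=run).
def make_rg_header_line(rg_id, sample, library, platform="ILLUMINA", unit="run"):
    """Build a SAM @RG header line with the fields GATK requires (ID, SM)
    plus the ones BQSR uses for grouping reads (LB, PL, PU)."""
    fields = {"ID": rg_id, "SM": sample, "LB": library, "PL": platform, "PU": unit}
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())

def has_read_group(header_lines):
    """GATK tools such as HaplotypeCaller and BaseRecalibrator refuse BAMs
    whose header lacks any @RG line; this check captures that requirement."""
    return any(line.startswith("@RG") for line in header_lines)
```

If sample sheet support is added later, the sample-specific values (SM, LB) would be injected exactly where the hardcoded defaults sit now.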
```json
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"comment\": \"\", \"export\": false, \"flat\": false, \"image_content_input\": null, \"png_plots\": false, \"results\": [{\"__index__\": 0, \"software_cond\": {\"software\": \"star\", \"__current_case__\": 28, \"output\": [{\"__index__\": 0, \"type\": {\"type\": \"log\", \"__current_case__\": 0, \"input\": {\"__class__\": \"ConnectedValue\"}}}]}}, {\"__index__\": 1, \"software_cond\": {\"software\": \"fastqc\", \"__current_case__\": 8, \"output\": [{\"__index__\": 0, \"type\": \"data\", \"input\": {\"__class__\": \"ConnectedValue\"}}]}}, {\"__index__\": 2, \"software_cond\": {\"software\": \"picard\", \"__current_case__\": 17, \"output\": [{\"__index__\": 0, \"type\": \"markdups\", \"input\": {\"__class__\": \"ConnectedValue\"}}]}}, {\"__index__\": 3, \"software_cond\": {\"software\": \"snpeff\", \"__current_case__\": 26, \"input\": {\"__class__\": \"ConnectedValue\"}}}, {\"__index__\": 4, \"software_cond\": {\"software\": \"samtools\", \"__current_case__\": 24, \"output\": [{\"__index__\": 0, \"type\": {\"type\": \"flagstat\", \"__current_case__\": 1, \"input\": {\"__class__\": \"ConnectedValue\"}}}]}}, {\"__index__\": 5, \"software_cond\": {\"software\": \"gatk\", \"__current_case__\": 11, \"output\": [{\"__index__\": 0, \"type\": \"base_recalibrator\", \"input\": {\"__class__\": \"ConnectedValue\"}}]}}], \"title\": \"\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
```
Please order the different MultiQC input reports in a logical way proceeding from raw read metrics through alignment, variant calling and annotation.
```json
"owner": "iuc",
"tool_shed": "toolshed.g2.bx.psu.edu"
},
"tool_state": "{\"__input_ext\": \"input\", \"algo\": {\"params\": {\"settingsType\": \"default\", \"__current_case__\": 0}}, \"chimOutType\": \"\", \"chromInfo\": \"/opt/galaxy/tool-data/shared/ucsc/chrom/?.len\", \"filter\": {\"basic_filters\": null, \"output_params2\": {\"output_select2\": \"no\", \"__current_case__\": 1}}, \"oformat\": {\"outSAMattributes\": [\"NH\", \"HI\", \"AS\", \"nM\", \"ch\"], \"HI_offset\": \"1\", \"outSAMprimaryFlag\": \"OneBestScore\", \"outSAMmapqUnique\": \"60\", \"wasp_conditional\": {\"waspOutputMode\": \"\", \"__current_case__\": 1}}, \"outWig\": {\"outWigType\": \"None\", \"__current_case__\": 0, \"outWigStrand\": \"false\"}, \"perf\": {\"outBAMsortingBinsN\": \"50\", \"winAnchorMultimapNmax\": \"50\"}, \"refGenomeSource\": {\"geneSource\": \"history\", \"__current_case__\": 1, \"genomeFastaFiles\": {\"__class__\": \"ConnectedValue\"}, \"genomeSAindexNbases\": \"14\", \"GTFconditional\": {\"GTFselect\": \"with-gtf\", \"__current_case__\": 0, \"sjdbGTFfile\": {\"__class__\": \"ConnectedValue\"}, \"sjdbGTFfeatureExon\": \"exon\", \"sjdbOverhang\": \"14\", \"quantmode_output\": {\"quantMode\": \"-\", \"__current_case__\": 0}}, \"diploidconditional\": {\"diploid\": \"No\", \"__current_case__\": 1}}, \"singlePaired\": {\"sPaired\": \"paired\", \"__current_case__\": 1, \"input1\": {\"__class__\": \"ConnectedValue\"}, \"input2\": {\"__class__\": \"ConnectedValue\"}}, \"twopass\": {\"twopassMode\": \"Basic\", \"__current_case__\": 1, \"twopass_read_subset\": \"\", \"sj_precalculated\": \"\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
```
The primary advantage of building the STAR index on the fly (which otherwise is just very wasteful) is that you can control the --sjdbOverhang parameter (see https://nf-co.re/rnavar/1.2.3/docs/usage/#alignment-options).
This means that you need to expose the sjdbOverhang wrapper parameter as a workflow parameter with a helpful annotation.
For me, sjdbOverhang is always exposed if you use a prebuilt genome reference without a prebuilt gene model.
That's also a good reason, yes.
Do you think we should go this way @lldelisle ? Let the user choose from cached, prebuilt references, but bring their own gene model? Could be a very good compromise.
This is always the way I do it; this is also the way it is done in the RNA-seq IWC workflow.
I think it should be cached prebuilt reference + own gtf.
Therefore I would add a step toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/5.4+galaxy0 to build the snpEff database matching the same reference fasta and the same gtf.
Using the cache option (reuse jobs with identical parameters) will allow to not really rebuild the database if it was already built with the same gtf.
> This is always the way I do it; this is also the way it is done in the RNA-seq IWC workflow.
> I think it should be cached prebuilt reference + own gtf.
Cool, sounds excellent to me, thanks!
In that case, however, how can we make sure that the reference fasta matches the one used for building the STAR index?
Also,
> Therefore I would add a step toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/5.4+galaxy0 to build the snpEff database matching the same reference fasta and the same gtf.
I'm not so convinced this is a good idea, building SnpEff annotation files from gff is a fragile art for eukaryotic genomes and the pre-built cached ones might contain fewer errors.
So, essentially, the question to answer is: what's the least error-prone way to have matching genomes at the STAR, GATK and SnpEff steps.
Potential differences between SnpEff and custom gff annotations are another thing, but wouldn't be catastrophic for the workflow. Users who have a carefully crafted SnpEff db built specifically for their gff could also ask for server-side installation?
For exotic cases where there's no reasonably close existing SnpEff db, we could offer an optional branch of the WF to build the db from the reference and the gff, I just wouldn't make this the default.
Of course, I'm only making all of this up while typing so feel free to disagree with me :-)
I’ve exposed sjdbOverhang as a mandatory workflow parameter. Regarding the SnpEff database discussion, I’ve kept the database selection as a string input to use pre-built/cached local versions, avoiding the potential fragility of building custom databases from GFFs during the run. Let me know if the current version is fine. 5146320
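As background for the new mandatory `sjdbOverhang` parameter, here is a small illustrative sketch (not workflow code) of the rule of thumb from the STAR manual: the ideal value is max(read length) − 1, with 100 as a generic default that works well for most libraries. The function name is hypothetical.

```python
# Sketch of the guidance behind the sjdbOverhang workflow parameter.
# Per the STAR manual, the ideal value is (maximum read length - 1);
# 100 is STAR's generic default, adequate in most cases.
def recommended_sjdb_overhang(max_read_length=None):
    """Return a reasonable sjdbOverhang for a given maximum read length.

    With no read length given, fall back to STAR's generic default of 100.
    """
    if max_read_length is None:
        return 100
    return max_read_length - 1
```

For example, a library of 2x151 bp reads would use an sjdbOverhang of 150; a helpful parameter annotation in the workflow could state exactly this rule.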
2. **Preprocessing:** **MarkDuplicates** and **GATK4 SplitNCigarReads** to prepare the BAM for variant calling by splitting reads into exon segments.
3. **Recalibration:** **GATK4 BaseRecalibrator** (BQSR) using known polymorphic sites to adjust base quality scores.
4. **Variant Calling:** **GATK4 HaplotypeCaller** with parameters specifically tuned for RNA-seq (e.g., ignoring soft-clipped bases).
5. **Filtering:** **bcftools filter** applying hard filters (`FS > 30.0`, `QD < 2.0`) to reduce false positives.
These filters are user-configurable in the nf-core pipeline so should become workflow parameters, too.
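To make the proposed parameterization concrete, here is a minimal Python sketch of the hard-filter logic, with the thresholds lifted out as arguments the way they would become workflow parameters. The defaults mirror the README (`FS > 30.0`, `QD < 2.0`); the function and parameter names are illustrative, not part of the workflow or of bcftools.

```python
# Sketch of the hard-filter decision applied by the bcftools filter step,
# with the thresholds exposed as parameters (fs_max, qd_min) as the review
# suggests. Names are hypothetical; defaults match the README.
def passes_hard_filters(info, fs_max=30.0, qd_min=2.0):
    """Return True if a variant record survives the FS/QD hard filters.

    `info` is a dict of VCF INFO-field values, e.g. {"FS": 1.2, "QD": 20.0}.
    FS (FisherStrand) measures strand bias; QD (QualByDepth) normalizes
    variant quality by depth. Missing fields are treated as passing.
    """
    if info.get("FS", 0.0) > fs_max:
        return False
    if info.get("QD", float("inf")) < qd_min:
        return False
    return True
```

Exposing `fs_max` and `qd_min` as numeric workflow parameters (defaulting to 30.0 and 2.0) would restore parity with the nf-core pipeline's configurable filtering.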
```json
"top": 209.96850976711244
},
"tool_id": null,
"tool_state": "{\"multiple\": false, \"validators\": [], \"parameter_type\": \"text\", \"optional\": false}"
```
Here is my proposed variation: https://usegalaxy.eu/u/delislel/w/rna-seq-variant-discovery-ld-version
Tell me what you think and feel free to change it.
Summary
Adds the RNA-seq Variant Discovery (RNAvar) workflow.
Technical Details
Implementation based on `nf-core/rnavar` logic: `MarkDuplicates`, `SplitNCigarReads`, and `BaseRecalibrator`; `HaplotypeCaller` using RNA-seq parameters.
Validation
FOR CONTRIBUTOR:
FOR REVIEWERS:
- This workflow does/runs/performs … xyz … to generate/analyze/etc …
- `name` field should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood
- prefer dash (-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id